22.8. Processing Files Larger Than Available Memory
22.8.1. Problem
You want to work with a large XML
file, but you can't read it into memory to form a DOM or other kind
of tree because it's too big.
22.8.2. Solution
Use SAX (as described in Recipe 22.3) to
process events instead of building a tree.

Alternatively, use XML::Twig to build
trees only for the parts of the document you want to work with (as
specified by XPath expressions):
use XML::Twig;
my $twig = XML::Twig->new( twig_handlers => {
                             $XPATH_EXPRESSION => \&HANDLER,
                             # ...
                           });
$twig->parsefile($FILENAME);
$twig->flush( );
You can call a lot of DOM-like functions from within a handler, but
only the elements identified by the XPath expression (and whatever
those elements enclose) go into a tree.
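As a rough illustration of this pattern, the sketch below edits each book element inside a handler and then calls flush, which prints everything parsed so far and releases it from memory. The sample document, the upcase_titles name, and the choice to uppercase titles are assumptions made for this example only.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: uppercase every book title, streaming the result
# to $out (STDOUT by default) one book at a time.
sub upcase_titles {
    my ($xml, $out) = @_;
    $out ||= \*STDOUT;
    my $twig = XML::Twig->new(
        twig_handlers => {
            # Runs once each book element has been completely parsed.
            'book' => sub {
                my ($t, $book) = @_;
                my $title = $book->first_child('title');
                $title->set_text(uc $title->text) if $title;
                $t->flush($out);    # print what has been parsed, then free it
            },
        },
    );
    $twig->parse($xml);             # use parsefile($filename) for a real file
    $twig->flush($out);             # print whatever is left (closing tags)
    return;
}

# Made-up sample standing in for a document too large to load whole.
upcase_titles(<<'XML');
<books>
  <book><title>Programming Perl</title></book>
  <book><title>Perl Cookbook</title></book>
</books>
XML
```

If the output is not needed, purge can be called in place of flush to free the memory without printing anything.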
22.8.3. Discussion
DOM modules turn the entire document into a tree, regardless of
whether you use all of it. SAX modules build no tree at all, so
if your task depends on document structure, you must keep
track of that structure yourself. A happy middle ground is XML::Twig,
which creates DOM trees only for the bits of the file that you're
interested in. Because you work with the file a piece at a time, you can
cope with very large files by processing pieces that fit in memory.

For example, to print the titles of books in
books.xml (Example 22-1), you
could write:
use XML::Twig;
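my $twig = XML::Twig->new( twig_roots => { '/books/book' => \&do_book });
$twig->parsefile("books.xml");
$twig->purge( );
sub do_book {
  my($twig, $book) = @_;        # handlers receive the twig and the element
  my($title) = $book->find_nodes("title");
  print $title->text, "\n";
  $twig->purge( );              # discard the book just processed
}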
For each book element, XML::Twig calls
do_book on its contents. That subroutine finds the
title node and prints its text. Rather than having
the entire file parsed into a DOM structure, we keep only one
book element at a time.

Consult the XML::Twig manpages for details on how much DOM and XPath
the module supports; it's not complete, but it's growing all the
time. XML::Twig uses XML::Parser for its XML parsing, and as a result
the functions available on nodes are slightly different from those
provided by XML::LibXML's DOM parsing.
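To make the SAX side of that trade-off concrete, here is a minimal sketch of tracking document structure yourself: the handler keeps a stack of open element names and collects text only while the stack reads books/book/title. The TitleHandler package name, the _in_title helper, and the inline sample document are hypothetical, and the element names assume a layout like that of books.xml.

```perl
use strict;
use warnings;

package TitleHandler;               # hypothetical handler name
use base 'XML::SAX::Base';

sub start_document {
    my ($self) = @_;
    $self->{path}   = [];           # stack of currently open element names
    $self->{titles} = [];           # text collected from each title
}

sub start_element {
    my ($self, $el) = @_;
    push @{ $self->{path} }, $el->{Name};
    push @{ $self->{titles} }, '' if $self->_in_title;
}

sub characters {
    my ($self, $data) = @_;
    # Text may arrive in several chunks, so append rather than assign.
    $self->{titles}[-1] .= $data->{Data} if $self->_in_title;
}

sub end_element {
    my ($self, $el) = @_;
    pop @{ $self->{path} };
}

# True only while the parser is directly inside /books/book/title.
sub _in_title {
    my ($self) = @_;
    return join('/', @{ $self->{path} }) eq 'books/book/title';
}

package main;
use XML::SAX::ParserFactory;

my $handler = TitleHandler->new;
my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);

# Made-up sample; for a real file, call $parser->parse_uri($filename).
$parser->parse_string(
    '<books><book><title>Programming Perl</title></book>'
  . '<book><title>Perl Cookbook</title></book></books>'
);
print "$_\n" for @{ $handler->{titles} };
```

Only the stack and the collected titles stay in memory, but all of the bookkeeping that XML::Twig does for you is now your responsibility.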
22.8.4. See Also
Recipe 22.6; the documentation for the module
XML::Twig