In Perl, how can I parse an XML file that is too large to fit in available memory?
I have a very large XML file (if you care, it's an AIXM file from EAD, but that's not important). To figure out how it is used, I want to write a simple script that goes through it and, for every node, records which subnodes occur below it and how many times, so I can see which nodes contain <AptUid> and whether most <Rdn> nodes have a <GeoLat> node or not, that sort of thing.
I tried to just load the whole thing into a hashref using XML::Simple, but it's too big to fit into memory. Is there an XML parser that will allow me to just look at the file a piece at a time?
See Processing an XML document chunk by chunk in XML::Twig.
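As a rough sketch of that chunk-by-chunk approach, the following counts <Rdn> elements and how many of them have a <GeoLat> child. The tag names come from the question; that <GeoLat> is a direct child of <Rdn> is an assumption about the AIXM layout, and count_rdn is just a name made up for this example.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: $source is an XML string or an open filehandle.
# Returns (number of <Rdn> elements, number with a direct <GeoLat> child).
sub count_rdn {
    my ($source) = @_;
    my ($total, $with_geolat) = (0, 0);

    my $twig = XML::Twig->new(
        twig_handlers => {
            # Called each time an <Rdn> element has been completely parsed.
            Rdn => sub {
                my ($t, $elt) = @_;
                $total++;
                $with_geolat++ if $elt->first_child('GeoLat');
                $t->purge;    # discard what has been parsed so far
            },
        },
    );
    $twig->parse($source);
    return ($total, $with_geolat);
}

# Usage: perl script.pl big.xml
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my ($total, $with_geolat) = count_rdn($fh);
    print "$with_geolat of $total Rdn elements have a GeoLat child\n";
}
```

The call to purge is what keeps memory use bounded: only the element currently being built (and its open ancestors) stays in memory.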
You want to use a SAX parser such as XML::SAX. Implement start_element and end_element methods to build your node tree.
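A minimal sketch of such a handler, assuming XML::SAX is installed with at least one parser (XML::SAX::PurePerl ships with it). Rather than building a full tree, it only keeps a stack of open elements and a parent/child count table; the package name ChildCounter and the hash keys stack and counts are made up for this example.

```perl
use strict;
use warnings;

package ChildCounter;
use base 'XML::SAX::Base';

# Start tag: count this element under its parent, then push it on the stack.
sub start_element {
    my ($self, $el) = @_;
    my $stack = $self->{stack} ||= [];
    $self->{counts}{ $stack->[-1] }{ $el->{Name} }++ if @$stack;
    push @$stack, $el->{Name};
}

# End tag: the element is closed, so pop it off the stack.
sub end_element {
    my ($self, $el) = @_;
    pop @{ $self->{stack} };
}

package main;
use XML::SAX::ParserFactory;

# Usage: perl script.pl big.xml
if (@ARGV) {
    my $handler = ChildCounter->new;
    my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
    $parser->parse_uri($ARGV[0]);

    my $counts = $handler->{counts} || {};
    for my $parent (sort keys %$counts) {
        for my $child (sort keys %{ $counts->{$parent} }) {
            print "$parent > $child: $counts->{$parent}{$child}\n";
        }
    }
}
```

Memory use grows with the number of distinct parent/child pairs and the nesting depth, not with the size of the file.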
Try the XML::Parser module. Should be what you need.
You should use a streaming parser, such as XML::Parser (which in turn is a layer above expat). You will have to register handlers for the tags you are interested in, and do the book-keeping yourself. As with other streaming models, such as SAX, you do not get a whole view of the file at once (except for the subset you explicitly consume in your code).
Here's a solution using XML::Parser. Comments welcome.
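use strict;
use warnings;
use XML::Parser;

my %elemMap;    # $elemMap{parent}{child} = number of occurrences
my @context;    # stack of the currently open element names

# Start tag: count this element under its parent, then push it.
sub on_start {
    my ($p, $elemName, @alist) = @_;
    my $parent = $context[-1];
    $elemMap{$parent}{$elemName}++ if defined $parent;
    push @context, $elemName;
}

# End tag: the element is closed, so drop it from the stack.
sub on_end {
    pop @context;
}

my $p = XML::Parser->new(Handlers => { Start => \&on_start, End => \&on_end });
$p->parse(\*STDIN);    # stream the document from standard input

# Report every parent > child pair in sorted order.
for my $elem (sort keys %elemMap) {
    for my $childElem (sort keys %{ $elemMap{$elem} }) {
        print "$elem > $childElem: $elemMap{$elem}{$childElem}\n";
    }
}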
When you are first trying to figure out the structure of an unknown XML file, open it in less or more and start paging through it. Don't use an editor that tries to load the entire file into memory unless you like waiting for your machine a lot.
Building a parser when you have no idea how the data is structured is going to be very frustrating, so don't jump into coding first. Explore until you know enough to begin coding.