What is the fastest way to pull a few element values out of XML files in Perl?

Posted: 2022-12-22 17:58

I have a bunch of XML files that are about 1-2 megabytes in size. Actually, more than a bunch, there are millions. They're all well-formed and many are even validated against their schema (confirmed with libxml2).

All were created by the same app, so they're in a consistent format (though this could theoretically change in the future).

I want to check the values of one element in each file from within a Perl script. Speed is important (I'd like to take less than a second per file) and as noted I already know the files are well-formed.

I am sorely tempted to simply 'open' the files in Perl and scan through until I see the element I am looking for, grab the value (which is near the start of the file), and close the file.
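The open-and-scan idea described above could be sketched roughly as follows, using only core Perl. This is a minimal sketch, not a parser: the sub name first_pickme is hypothetical, and the regex assumes the layout shown in the sample data (double-quoted attributes, and <node> as the first child of the matching <someparentnode>). It buffers lines so that wrapped tags and wrapped text still match, and it stops reading as soon as the value is complete.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the text of <node> inside the first <someparentnode attrib="pickme">,
# or undef if it is not found. Reads from an already opened filehandle and
# stops as soon as the value is complete. (first_pickme is a hypothetical name.)
sub first_pickme {
    my ($fh) = @_;
    my $buffer = '';
    while ( my $line = <$fh> ) {
        $buffer .= $line;

        # Only try the match once a closing </node> has arrived; this keeps
        # the number of regex attempts low and copes with wrapped lines.
        next unless index( $line, '</node>' ) >= 0;

        # Assumption: attributes use double quotes and <node> is the first
        # child of the matching parent, as in the sample data.
        if ( $buffer =~ m{<someparentnode\b[^>]*\sattrib="pickme"[^>]*>\s*<node>(.*?)</node>}s ) {
            return $1;
        }
    }
    return;
}

for my $file (@ARGV) {
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $value = first_pickme($fh);
    close $fh;
    print "$file: ", ( defined $value ? $value : '(not found)' ), "\n";
}
```

Because nothing here understands XML, a change in quoting style, a comment, or a CDATA section containing similar text would break it, which is the trade-off against using a real parser.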

On the other hand, I could use an XML parser (which might protect me from future changes to the XML formatting) but I suspect it will be slower than I'd like.

Can anyone recommend an appropriate approach and/or parser?

Thanks in advance.

Update

Here's the structure/complexity of the data I am trying to pull out:

<doc>
  ...
  <someparentnode attrib="notme" attrib2="5">
    <node>Not this one</node>
  </someparentnode>
  <someparentnode attrib="pickme" attrib2="5">
    <node>This is the data I want</node>
  </someparentnode>
  <someparentnode attrib="notme" 
     attrib2="reallyreallylonglineslikethisonearewrapped">
    <node>Not this one either and it may be 
      wrapped too.</node>
  </someparentnode>
  ...    
</doc>

The hierarchy goes several levels deeper than that, but I think that covers the sorts of things I am trying to do.

Two stand-alone, XML-aware options (which I wrote, so I might be biased ;--) are xml_grep (included with XML::Twig) and xml_grep2 (in App::xml_grep2).

You would write xml_grep -t '*[@attrib="pickme"]' *.xml or xml_grep2 -t '//*[@attrib="pickme"]' *.xml (the -t option gives you the result as text instead of XML). In both cases every document is parsed in full, but the next version of xml_grep will add an option to limit the number of results per file and to stop parsing each file as soon as that number is reached.

Otherwise, if you need speed and the code needs to be integrated into your script, you can use XML::Twig with a handler triggered on the element(s) you want and a call to finish_now once you've found it, which aborts parsing and lets you go on to the next file.

XML::LibXML is also an option, although you will then have to either parse each document completely and use XPath (easy but might be slower), use SAX (may be faster but is painful to code), or use the pull parser (probably the best option, but I have never used it).
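For reference, the pull-parser route might look roughly like the sketch below, which uses XML::LibXML::Reader. It is unbenchmarked and assumes the element and attribute names from the sample data; the sub name pickme_value is hypothetical. The reader skips forward from one <someparentnode> to the next, and only the matching element is expanded into a small DOM for the XPath lookup. Returning at the first match means the rest of the file is never read.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::LibXML::Reader;

# Pull-parse one document and return the text of <node> under the first
# <someparentnode attrib="pickme">, or undef if there is none.
# %source is handed straight to the reader, for example
# location => $file or string => $xml. (pickme_value is a hypothetical name.)
sub pickme_value {
    my (%source) = @_;
    my $reader = XML::LibXML::Reader->new(%source)
        or die "Cannot create a reader\n";

    # nextElement skips forward to the next element with the given name
    # and returns 1 while one is found.
    while ( $reader->nextElement('someparentnode') == 1 ) {
        my $attrib = $reader->getAttribute('attrib');
        next unless defined $attrib && $attrib eq 'pickme';

        # Build a DOM copy of just this element, then query it with XPath.
        my $element = $reader->copyCurrentNode(1);
        return $element->findvalue('./node');
    }
    return;
}

for my $file (@ARGV) {
    my $value = pickme_value( location => $file );
    print defined $value ? "$value\n" : "$file: not found\n";
}
```

The XPath './node' only looks at direct children; if the wanted element sits deeper in the real files, the expression would need to be adjusted to match.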

Update after your update: the code with XML::Twig would look like this:

#!/usr/bin/perl
use strict;
use warnings;

use XML::Twig;

my $twig= XML::Twig->new( twig_handlers => { '*[@attrib="pickme"]' => \&pickme });

foreach my $file (@ARGV)
  { $twig->parsefile( $file); }

sub pickme
  { my( $twig, $node)= @_;
    print $node->text, "\n";
    $twig->finish_now;
  }

If you want to do it fast, I would recommend you use XML::Bare instead of XML::Simple or XML::Twig.

I'm using it to parse several 2-5 MB XML files and the speedup is amazing: 0.2 seconds versus 4 minutes in some cases. Details here: http://darkpan.com/files/xml-parsing-perl-gripes.txt.

Awk

awk 'BEGIN{
 RS="</doc>"
 FS="</someparentnode>"
}

{
  for(i=1;i<=NF;i++){
     if( $i~/pickme/){
        m=split($i,a,"</node>")
        for(o=1;o<=m;o++){
          if(a[o]~/<node>/){
            gsub(/.*<node>/,"",a[o])
            print a[o]
          }
        }
     }
  }
}' file

Perl

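#!/usr/bin/perl
use strict;
use warnings;

# Read the input one <doc>...</doc> record at a time.
$/ = '</doc>';
while ( my $doc = <> ) {
    chomp $doc;
    # Split the record so each chunk holds one <someparentnode> element.
    for my $chunk ( split m{</someparentnode>}, $doc ) {
        next unless $chunk =~ /pickme/;
        for my $piece ( split m{</node>}, $chunk ) {
            next unless $piece =~ /<node>/;
            # Drop everything up to and including the opening tag.
            ( my $text = $piece ) =~ s/.*<node>//s;
            print "$text\n";
        }
    }
}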

output

$ perl script.pl file
This is the data I want

$ ./shell.sh
This is the data I want
