
XML::LibXML: speed question

This script takes about 50 minutes to run (file size: 22.3 MiB, CPU: Atom).

Is this normal (the 50 minutes)?

Could I tweak the script to make it faster?

#!/usr/local/bin/perl
use XML::LibXML;
use DBI;

my $dbh = DBI->connect( "DBI:SQLite:dbname=$db", undef, undef, $options );
my $sth = $dbh->prepare( "INSERT INTO $table ( id, titel, ... ) VALUES ( ?, ?, ... )" );

my $parser = XML::LibXML->new();
my $doc = $parser->load_xml( location => $file );
my @nodes = $doc->findnodes( '//Mediathek/Filme' );

my @keys = qw( Id Titel ... );

for my $node ( @nodes ) {
    my %hash;
    @hash{@keys} = ();
    for my $child ( $node->findnodes( './*' ) ) {
        $hash{$child->nodeName} = $child->textContent;
    }
    $sth->execute( @hash{@keys} );
}


I'm pretty sure Ashley is right in pointing to the transactions and the costly I/O they entail.

As for the XML part, given the input doc size of 22 MB, you're going to need about 200 MB of memory but processing should be reasonably fast, in the range of seconds, not minutes.

One thing that looks inefficient is your whole-doc-scan XPath expression. Can Mediathek/Filme really appear anywhere in the document? Or is it rather something like /Archiv/Mediathek/Filme? Using // is inefficient unless the engine optimizes this expression (which XML::LibXML doesn't do, as far as I know).
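To illustrate the difference, here is a hedged, minimal sketch. The real root element of the question's file is unknown, so the `<Archiv>` wrapper and the `/Archiv/Mediathek/Filme` path are assumptions for the example only:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Toy document standing in for the 22 MiB file; the <Archiv> root
# element is invented here, since the question doesn't show it.
my $doc = XML::LibXML->load_xml( string => <<'XML' );
<Archiv><Mediathek><Filme><Id>1</Id></Filme></Mediathek></Archiv>
XML

# '//Mediathek/Filme' scans every node in the tree looking for matches;
# an absolute path only descends along the one route it names.
my @slow = $doc->findnodes('//Mediathek/Filme');
my @fast = $doc->findnodes('/Archiv/Mediathek/Filme');
```

Both expressions return the same nodes here; the difference is how much of the tree the engine has to visit to find them.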

Another thing is that you could use $node->getChildElements instead of $node->findnodes("*") (no need to write ./*), but I don't think it'll matter much.


XML::LibXML is very fast. And so is SQLite, if you batch INSERTs. SQLite's write rate is limited by disk rotation speed, as part of its guarantee not to write broken data, so the speed gain you're looking for is probably in a transaction. Batch up many or all of your INSERTs before committing; the limiting factor on batch size will be RAM, I think. The DBI docs describe how to do this.
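A minimal, untested sketch of what that batching could look like with DBI's standard transaction methods, using an in-memory SQLite database and invented column names in place of the question's real schema:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# In-memory database and toy schema, standing in for the real $db/$table.
my $dbh = DBI->connect( 'DBI:SQLite:dbname=:memory:', undef, undef,
    { RaiseError => 1, AutoCommit => 1 } );
$dbh->do('CREATE TABLE filme ( id INTEGER, titel TEXT )');

my $sth = $dbh->prepare('INSERT INTO filme ( id, titel ) VALUES ( ?, ? )');

$dbh->begin_work;    # suspend AutoCommit: one disk sync for the whole batch
eval {
    $sth->execute( @$_ ) for [ 1, 'Foo' ], [ 2, 'Bar' ];
    $dbh->commit;    # all rows hit the disk together
};
if ($@) {
    $dbh->rollback;  # undo the half-finished batch on any error
    die $@;
}
```

With AutoCommit on, each `execute` is its own transaction and its own sync to disk; wrapped in one transaction, thousands of inserts cost roughly one sync.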

Again, this is untested, but it’s good to learn transactions even if I’m wrong. :P


You could try a couple of things.

  1. Given that you are effectively streaming your XML, could you re-implement it using a SAX processor? XML::SAX::ExpatXS is blindingly fast and uses the standard SAX interfaces.

  2. You could consider using a bulk insert for your SQL, inserting multiple rows in a single statement; this will limit the number of index rebuilds.
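The second suggestion could be sketched as follows. This is a hypothetical example against an in-memory SQLite database, with invented column names; one `( ?, ? )` placeholder group is emitted per row so the whole batch goes through a single statement:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Toy setup standing in for the question's real $db/$table.
my $dbh = DBI->connect( 'DBI:SQLite:dbname=:memory:', undef, undef,
    { RaiseError => 1 } );
$dbh->do('CREATE TABLE filme ( id INTEGER, titel TEXT )');

my @rows = ( [ 1, 'Foo' ], [ 2, 'Bar' ], [ 3, 'Baz' ] );

my $batch_size = 100;
while ( my @batch = splice @rows, 0, $batch_size ) {
    # One placeholder group per row: "( ?, ? ), ( ?, ? ), ..."
    my $placeholders = join ', ', ('( ?, ? )') x scalar @batch;
    $dbh->do( "INSERT INTO filme ( id, titel ) VALUES $placeholders",
        undef, map { @$_ } @batch );
}
```

Since the placeholder list is rebuilt per batch, this costs a statement re-prepare for each batch; combining it with a transaction (as in the other answer) is probably where most of the gain is.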
