XML::LibXML: Speed question
This script takes about 50 minutes to run (file size: 22.3 MiB, CPU: Atom).
Is this normal (the 50 minutes)? Could I tweak the script to make it faster?
#!/usr/local/bin/perl
use XML::LibXML;
use DBI;
my $dbh = DBI->connect( "DBI:SQLite:dbname=$db", undef, undef, $options );
my $sth = $dbh->prepare( "INSERT INTO $table ( id, titel, ... ) VALUES ( ?, ?, ... )" );
my $parser = XML::LibXML->new();
my $doc = $parser->load_xml( location => $file );
my @nodes = $doc->findnodes( '//Mediathek/Filme' );
my @keys = qw( Id Titel ... );
for my $node ( @nodes ) {
    my @nodes = $node->findnodes( './*' );
    my %hash;
    @hash{@keys} = ();
    for my $node ( @nodes ) {
        $hash{$node->nodeName} = $node->textContent;
    }
    $sth->execute( @hash{@keys} );
}
I'm pretty sure Ashley is right when pointing to the transactions and the associated costly IO.
As for the XML part, given the input doc size of 22 MB, you're going to need about 200 MB of memory but processing should be reasonably fast, in the range of seconds, not minutes.
One thing that looks inefficient is your whole-doc-scan XPath expression. Can Mediathek/Filme really appear anywhere in the document? Or is it rather something like /Archiv/Mediathek/Filme? Using // is inefficient unless the engine optimizes this expression (which XML::LibXML doesn't do, as far as I know).
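To illustrate the difference, here is a minimal sketch that runs both expressions against a tiny made-up document. The root element name Archiv is only the guess from the paragraph above, and the sample data is invented; substitute the real path from your file.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical miniature document. The root element "Archiv" and the
# sample records are assumptions for illustration, not the real layout.
my $xml = <<'XML';
<Archiv>
  <Mediathek>
    <Filme><Id>1</Id><Titel>Erster Film</Titel></Filme>
    <Filme><Id>2</Id><Titel>Zweiter Film</Titel></Filme>
  </Mediathek>
</Archiv>
XML

my $doc = XML::LibXML->load_xml( string => $xml );

# Descendant scan: the engine may have to visit every node in the document.
my @scanned = $doc->findnodes( '//Mediathek/Filme' );

# Absolute path: only the named children along the path are walked.
my @direct = $doc->findnodes( '/Archiv/Mediathek/Filme' );

# Collect the child elements of each record; '*' is equivalent to './*'.
my @rows;
for my $film ( @direct ) {
    my %row = map { $_->nodeName => $_->textContent } $film->findnodes( '*' );
    push @rows, \%row;
}

printf "%d nodes via //, %d via absolute path\n",
    scalar @scanned, scalar @direct;
```

Both expressions select the same nodes here; whether the absolute path is measurably faster on a 22 MiB file is something to time rather than assume.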
Another thing is that you could use $node->getChildElements instead of $node->findnodes("*") (no need to write ./*), but I don't think it'll matter much.
XML::LibXML is very fast. And so is SQLite if you batch INSERTs. SQLite write activity is limited by disk spin speed as part of its guarantee to not write broken data. So the speed gain you're looking for is probably in a transaction. Batch up many or all of your INSERTs before committing (the limiting factor on batch size will be RAM, I think). The DBI docs describe how to do this.
Again, this is untested, but it’s good to learn transactions even if I’m wrong. :P
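A minimal sketch of the transaction idea, assuming DBD::SQLite is installed. The in-memory database, the table name filme, the two columns, the generated rows, and the batch size of 250 are all placeholders rather than anything from the original script.

```perl
use strict;
use warnings;
use DBI;

# In-memory database and the table "filme" are placeholders (assumptions).
my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', undef, undef,
    { RaiseError => 1, AutoCommit => 1 } );
$dbh->do( 'CREATE TABLE filme ( id INTEGER, titel TEXT )' );

my $sth = $dbh->prepare( 'INSERT INTO filme ( id, titel ) VALUES ( ?, ? )' );

# Stand-in for the rows extracted from the XML file.
my @rows = map { [ $_, "Film $_" ] } 1 .. 1000;

my $batch_size = 250;    # arbitrary; tune against available memory
my $pending    = 0;

$dbh->begin_work;        # suspends AutoCommit until the next commit
for my $row ( @rows ) {
    $sth->execute( @$row );
    if ( ++$pending == $batch_size ) {
        $dbh->commit;        # one sync to disk per batch, not per row
        $dbh->begin_work;
        $pending = 0;
    }
}
$dbh->commit;            # flush whatever is left in the last batch

my ( $count ) = $dbh->selectrow_array( 'SELECT COUNT(*) FROM filme' );
print "inserted $count rows\n";
```

With a file-backed database, each commit is the point where SQLite waits for the disk, so fewer commits should mean less waiting. For a one-off import, a single transaction around the whole loop is the simplest variant.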
You could try a couple of things.
Given you are effectively streaming your XML, could you re-implement it using a SAX Processor? - XML::SAX::ExpatXS is blindingly fast and uses the standard SAX interfaces.
You could consider using a bulk insert for your SQL, inserting multiple rows in a single statement; this will limit the number of index rebuilds.
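The bulk-insert suggestion could look roughly like the sketch below. It assumes an SQLite version that accepts multi-row VALUES lists (3.7.11 or later); the table, columns, generated rows, and the chunk size of 100 are placeholders. The chunk size is kept small because SQLite limits the number of bound parameters per statement.

```perl
use strict;
use warnings;
use DBI;

# In-memory database and the table "filme" are placeholders (assumptions).
my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', undef, undef,
    { RaiseError => 1 } );
$dbh->do( 'CREATE TABLE filme ( id INTEGER, titel TEXT )' );

# Stand-in for the rows extracted from the XML file.
my @rows = map { [ $_, "Film $_" ] } 1 .. 250;

my $chunk_size = 100;    # 100 rows x 2 columns = 200 bound parameters

$dbh->begin_work;
while ( my @chunk = splice @rows, 0, $chunk_size ) {
    # Build "( ?, ? ), ( ?, ? ), ..." with one group per row in the chunk.
    my $placeholders = join ', ', ( '( ?, ? )' ) x @chunk;
    $dbh->do(
        "INSERT INTO filme ( id, titel ) VALUES $placeholders",
        undef,
        map { @$_ } @chunk,    # flatten the rows into one bind list
    );
}
$dbh->commit;

my ( $count ) = $dbh->selectrow_array( 'SELECT COUNT(*) FROM filme' );
print "inserted $count rows\n";
```

Note that the statement text changes with the size of the last chunk, so this trades the reuse of a single prepared statement for fewer statements overall. Whether that beats a prepared single-row INSERT inside one transaction is worth measuring.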