
How can I create the XML::Simple data structure using a Perl XML SAX parser?

Summary: I am looking for a fast XML parser (most likely a wrapper around some standard SAX parser) which will produce per-record data structures 100% identical to those produced by XML::Simple.

Details:

We have a large code infrastructure which processes records one by one and expects each record to be a data structure in the format produced by XML::Simple, which it has used since the early Jurassic era.

An example simple XML is:

<root>
    <rec><f1>v1</f1><f2>v2</f2></rec>
    <rec><f1>v1b</f1><f2>v2b</f2></rec>
    <rec><f1>v1c</f1><f2>v2c</f2></rec>
</root>

And example rough code is:

sub process_record {
    my ($obj, $record_hash) = @_;
    # do_stuff
}
# With default options XMLin drops the root element, so the records are under {rec}
my $records = XML::Simple->new->XMLin(@args)->{rec};
foreach my $record (@$records) { $obj->process_record($record) }

As everyone knows, XML::Simple is, well, simple. More importantly, it is very slow and a memory hog, due to being a DOM parser and needing to build and store 100% of the data in memory. So it's not the best tool for parsing, record by record, an XML file consisting of a large number of small records.

However, rewriting the entire code (which consists of a large number of "process_record"-like methods) to work with a standard SAX parser seems like a big task not worth the resources, even at the cost of living with XML::Simple.

I'm looking for an existing module which will probably be based on a SAX parser (or anything fast with small memory footprint) which can be used to produce $record hashrefs one by one based on the XML pictured above that can be passed to $obj->process_record($record) and be 100% identical to what XML::Simple's hashrefs would have been.

I don't care much what the interface of the new module is, e.g. whether I need to call next_record() or give it a callback coderef that accepts a record.


XML::Twig has a simplify method which you can call on an XML element and which, according to the docs, will:

Return a data structure suspiciously similar to XML::Simple's

Here is an example:

use strict;
use warnings;
use feature 'say';    # needed for say() below
use XML::Twig;
use Data::Dumper;

my $twig = XML::Twig->new(
    twig_handlers => {
        rec => \&rec,
    }
)->parsefile( 'data.xml' );


sub rec {
    my ($twig, $rec) = @_;
    my $data = $rec->simplify;
    say Dumper $data;
    $rec->purge;
}

NB. The $rec->purge call frees the record from memory as soon as it has been handled.

Running this against your XML example produces this:

$VAR1 = {
          'f1' => 'v1',
          'f2' => 'v2'
        };

$VAR1 = {
          'f1' => 'v1b',
          'f2' => 'v2b'
        };

$VAR1 = {
          'f1' => 'v1c',
          'f2' => 'v2c'
        };

Which I hope is suspiciously like what comes out of XML::Simple :)
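To connect this back to the callback-style interface mentioned in the question, here is a minimal sketch of a wrapper that hands each simplified record to a coderef. The helper name each_record is hypothetical, the XML is passed as a string only to keep the example self-contained, and whether the hashrefs match XML::Simple exactly for your real data (attributes, repeated elements, the options you pass to XMLin) is something you would need to verify.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper (not part of XML::Twig): calls $callback once per
# <rec> element, passing the hashref returned by simplify.
sub each_record {
    my ($xml, $callback) = @_;
    XML::Twig->new(
        twig_handlers => {
            rec => sub {
                my ($twig, $rec) = @_;
                my $record = $rec->simplify;
                $callback->($record);
                $twig->purge;    # discard everything parsed so far
            },
        },
    )->parse($xml);
    return;
}

my $xml = '<root>'
        . '<rec><f1>v1</f1><f2>v2</f2></rec>'
        . '<rec><f1>v1b</f1><f2>v2b</f2></rec>'
        . '</root>';

my @seen;
each_record( $xml, sub { push @seen, $_[0] } );
printf "%d records, first f1 = %s\n", scalar @seen, $seen[0]{f1};
```

In the real code the callback would be something like sub { $obj->process_record($_[0]) }, and parsefile would replace parse for input on disk.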

/I3az/


As the author of XML::Simple, I'd just like to correct some misconceptions in your question.

XML::Simple isn't a DOM parser; in fact, it isn't a parser at all. It delegates all parsing duties to either a SAX parser or XML::Parser. The speed of parsing will depend on which parser module is the default on your system. When you run 'make test' for the XML::Simple distribution, the output will list the default parser.

If the default parser on your system is XML::SAX::PurePerl then it will be slow and, more importantly, buggy too. If that's the case then I'd recommend installing either XML::SAX::Expat or XML::SAX::ExpatXS for an immediate speed-up. (Whichever SAX parser is installed last will be the default from that point.)
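If you would rather not rerun 'make test' to find out, a short script along these lines should show which SAX parsers are registered and which one the factory returns by default. This is a sketch using XML::SAX and XML::SAX::ParserFactory; the output depends entirely on what is installed on your system.

```perl
use strict;
use warnings;
use XML::SAX;
use XML::SAX::ParserFactory;

# XML::SAX->parsers returns an array ref of hashes describing each
# registered parser; the class name is stored under the Name key.
my @names = map { $_->{Name} } @{ XML::SAX->parsers };
print "Registered SAX parsers: @names\n";

# Ask the factory for a parser with no feature requirements and
# report the class of the object it hands back.
my $default = ref XML::SAX::ParserFactory->parser;
print "Default SAX parser: $default\n";
```

If the second line prints XML::SAX::PurePerl, that is the slow case described above.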

Having said that, your requirements are a bit contradictory: you want something that returns your whole document as a hash, and yet you don't want a parser that slurps the whole document into memory.

I understand your short-term goals, but as a longer term solution, I'd recommend migrating your code to XML::LibXML. It is a DOM parser but it's very fast because all the grunt work is done in C. Best of all the built-in XPath support makes it even simpler to use than XML::Simple - see this article.
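As a rough illustration of the XPath style, here is a small sketch against the sample XML from the question. The variable names are made up for the example, and it loads the whole document into memory, so it shows the API rather than a low-memory approach.

```perl
use strict;
use warnings;
use XML::LibXML;

my $xml = <<'XML';
<root>
    <rec><f1>v1</f1><f2>v2</f2></rec>
    <rec><f1>v1b</f1><f2>v2b</f2></rec>
</root>
XML

my $doc = XML::LibXML->load_xml( string => $xml );

# One XPath expression selects every record; findvalue then evaluates
# a relative expression against each record node.
my @f1_values = map { $_->findvalue('f1') } $doc->findnodes('/root/rec');
print "$_\n" for @f1_values;

# A predicate picks out a single record by the content of a child element.
my $match = $doc->findvalue('/root/rec[f1="v1b"]/f2');
print "f2 of the record whose f1 is v1b: $match\n";
```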


Take a look at XML::LibXML::Reader.
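A minimal sketch of how the pull-style reader could provide a next_record() interface follows. The next_record helper is hypothetical, and it only handles flat records like the ones in the question (child elements containing text), so it is not a drop-in replacement for XML::Simple's output in the general case.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical helper: advance to the next <rec> element and return it
# as a flat hashref of child element name => text, or nothing at the end.
sub next_record {
    my ($reader) = @_;
    return unless $reader->nextElement('rec') == 1;

    # Copy only the current record's subtree into a small DOM node.
    my $node = $reader->copyCurrentNode(1);

    my %record;
    $record{ $_->nodeName } = $_->textContent for $node->findnodes('*');
    return \%record;
}

my $xml = '<root>'
        . '<rec><f1>v1</f1><f2>v2</f2></rec>'
        . '<rec><f1>v1b</f1><f2>v2b</f2></rec>'
        . '</root>';

my $reader = XML::LibXML::Reader->new( string => $xml );

my @records;
while ( my $record = next_record($reader) ) {
    push @records, $record;
}
printf "%d records read\n", scalar @records;
```

For a file on disk, the reader would be constructed with location => $path instead of string => $xml, and each record could be passed straight to process_record rather than collected in an array.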
