Parsing XML file with perl - regex
I'm just a beginner in Perl and urgently need to prepare a small script that takes the first 3 items from an XML file and puts them in a new one. Here's an example of the XML file:
<article>
{lot of other stuff here}
</article>
<article>
{lot of other stuff here}
</article>
<article>
{lot of other stuff here}
</article>
<article>
{lot of other stuff here}
</article>
What I'd like to do is get the first 3 items, along with all the tags in between, and put them into another file. Thanks in advance for all the help. Regards, Peter
Never ever use Regex to handle markup languages.
The original version of this answer (see below) used XML::XPath. Grant McLean said in the comments:
XML::XPath is an old and unmaintained module. XML::LibXML is a modern, maintained module with an almost identical API and it's faster too.
so I made a new version that uses XML::LibXML (thanks, Grant):
use warnings;
use strict;
use XML::LibXML;
my $doc = XML::LibXML->load_xml(location => 'articles.xml');
my $xp = XML::LibXML::XPathContext->new($doc->documentElement);
my $xpath = '/articles/article[position() < 4]';
foreach my $article ( $xp->findnodes($xpath) ) {
    # now do something with $article
    print $article.": ".$article->getName."\n";
}
For me this prints:
XML::LibXML::Element=SCALAR(0x346ef90): article
XML::LibXML::Element=SCALAR(0x346ef30): article
XML::LibXML::Element=SCALAR(0x346efa8): article
Links to the relevant documentation:
- The type of $doc will be XML::LibXML::Document.
- The type of $xp is XML::LibXML::XPathContext.
- The return type of $xp->findnodes() is XML::LibXML::NodeList.
- The type of $article is XML::LibXML::Element.
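The loop above only prints each node. To actually write the first three articles to a new file, as the question asks, something along these lines might work. This is a minimal sketch, not part of the original answer: the helper name first_articles and the inline sample data are made up for illustration, and it assumes the articles sit directly under a single root element (as the /articles/article XPath implies).

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: returns a new document containing copies of the
# first $limit <article> children of the source document's root element.
sub first_articles {
    my ($doc, $limit) = @_;
    my $n    = int $limit;                 # keep the XPath predicate numeric
    my $root = $doc->documentElement;

    # New document whose root is named like the source root
    my $out      = XML::LibXML::Document->new('1.0', 'UTF-8');
    my $out_root = $out->createElement($root->nodeName);
    $out->setDocumentElement($out_root);

    for my $article ($root->findnodes("article[position() <= $n]")) {
        # importNode makes a deep copy that belongs to the new document
        $out_root->appendChild($out->importNode($article));
    }
    return $out;
}

# Inline sample data (assumed structure) so the sketch runs on its own
my $sample = XML::LibXML->load_xml(string =>
    '<articles><article>1</article><article>2</article>'
  . '<article>3</article><article>4</article></articles>');

print first_articles($sample, 3)->toString(1);
```

With a real file you would load it with load_xml(location => 'articles.xml') as in the answer, and call ->toFile('top3.xml', 1) on the result instead of printing it (top3.xml being a placeholder name).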
Original version of the answer, based on the XML::XPath package:
use warnings;
use strict;
use XML::XPath;
my $xp = XML::XPath->new(filename => 'articles.xml');
my $xpath = '/articles/article[position() < 4]';
foreach my $article ( $xp->findnodes($xpath)->get_nodelist ) {
    # now do something with $article
    print $article.": ".$article->getName ."\n";
}
which prints this for me:
XML::XPath::Node::Element=REF(0x38067b8): article
XML::XPath::Node::Element=REF(0x38097e8): article
XML::XPath::Node::Element=REF(0x3809ae8): article
- The type of $xp is XML::XPath, obviously.
- The return type of $xp->findnodes() is XML::XPath::NodeSet.
- The type of $article will be XML::XPath::Node::Element in this case.
Have a look at the docs to find out what you can do with them.
Here:
use strict;
use warnings;

open my $input, "<", "file.xml" or die $!;
open my $output, ">", "truncated-file.xml" or die $!;
my $n_articles = 0;
while (<$input>) {
    # copy every line through unchanged
    print $output $_;
    # stop once the third closing tag has been written
    if (m:</article>:) {
        $n_articles++;
        if ($n_articles >= 3) {
            last;
        }
    }
}
close $input or die $!;
close $output or die $!;
You really don't need an XML parser to do such a simple job.
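One thing to watch for with this approach: the loop stops right after the third closing tag, so if the articles are wrapped in a root element, the output file never gets that element's closing tag. A rough sketch of the same line-by-line idea with that handled is below. The function name truncate_articles and its $closing parameter are hypothetical, the </articles> root name is only an assumption taken from the XPath in the other answer, and like the original it assumes no line contains more than one </article>.

```perl
use strict;
use warnings;

# Hypothetical helper: copy text up to and including the line holding the
# $limit-th </article>, then append $closing (the wrapper's end tag, if any).
sub truncate_articles {
    my ($text, $limit, $closing) = @_;
    my $seen = 0;
    my $out  = '';
    for my $line (split /^/m, $text) {    # splitting on /^/ keeps line endings
        $out .= $line;
        if ($line =~ m{</article>}) {
            $seen++;
            last if $seen >= $limit;
        }
    }
    $out .= $closing if defined $closing;
    return $out;
}

# Inline sample data (assumed structure) so the sketch runs on its own
my $sample = join '', map { "<article>\n$_\n</article>\n" } 1 .. 4;
print truncate_articles("<articles>\n$sample</articles>\n", 3, "</articles>\n");
```

Reading from and writing to real files works the same way as in the answer; only the extra closing line is new.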