Parsing XML file with perl - regex
I'm just a beginner in Perl and urgently need to prepare a small script that takes the first three items from an XML file and puts them in a new one. Here's an example of the XML file:
<article>
{lot of other stuff here}
</article>
<article>
{lot of other stuff here}
</article>
<article>
{lot of other stuff here}
</article>
<article>
{lot of other stuff here}
</article>
What I'd like to do is get the first three items, along with all the tags in between, and put them into another file. Thanks in advance for all the help. Regards, Peter
Never ever use Regex to handle markup languages.
The original version of this answer (see below) used XML::XPath. Grant McLean said in the comments:
"XML::XPath is an old and unmaintained module. XML::LibXML is a modern, maintained module with an almost identical API and it's faster too."
So I made a new version that uses XML::LibXML (thanks, Grant):
use warnings;
use strict;
use XML::LibXML;

my $doc   = XML::LibXML->load_xml(location => 'articles.xml');
my $xp    = XML::LibXML::XPathContext->new($doc->documentElement);
my $xpath = '/articles/article[position() < 4]';

foreach my $article ( $xp->findnodes($xpath) ) {
    # now do something with $article
    print $article.": ".$article->getName."\n";
}
For me this prints:
XML::LibXML::Element=SCALAR(0x346ef90): article
XML::LibXML::Element=SCALAR(0x346ef30): article
XML::LibXML::Element=SCALAR(0x346efa8): article
Links to the relevant documentation:
- The type of $doc will be XML::LibXML::Document.
- The type of $xp is XML::LibXML::XPathContext.
- The return type of $xp->findnodes() is XML::LibXML::NodeList.
- The type of $article is XML::LibXML::Element.
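To go one step further and actually write the first three articles to a new file, as the question asks, something along these lines might work. It is only a sketch: the inline sample document, the <articles> root element and the output file name top3.xml are assumptions made for illustration, not part of the original answer.

```perl
use strict;
use warnings;
use XML::LibXML;

# Assumed sample input; in practice you would probably use
# XML::LibXML->load_xml(location => 'articles.xml') instead.
my $xml = <<'XML';
<articles>
  <article><title>one</title></article>
  <article><title>two</title></article>
  <article><title>three</title></article>
  <article><title>four</title></article>
</articles>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# New document with the same (assumed) root element name.
my $out  = XML::LibXML::Document->new('1.0', 'UTF-8');
my $root = $out->createElement('articles');
$out->setDocumentElement($root);

# Copy the first three <article> elements, including all their children.
foreach my $article ( $doc->findnodes('/articles/article[position() < 4]') ) {
    $root->appendChild( $out->importNode($article) );
}

# 'top3.xml' is a hypothetical output file name; the second
# argument asks for indented output.
$out->toFile('top3.xml', 1);
```

importNode makes a deep copy, so the source document is left untouched and each copied article keeps everything nested inside it.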
Original version of the answer, based on the XML::XPath package:
use warnings;
use strict;
use XML::XPath;

my $xp    = XML::XPath->new(filename => 'articles.xml');
my $xpath = '/articles/article[position() < 4]';

foreach my $article ( $xp->findnodes($xpath)->get_nodelist ) {
    # now do something with $article
    print $article.": ".$article->getName ."\n";
}
which prints this for me:
XML::XPath::Node::Element=REF(0x38067b8): article
XML::XPath::Node::Element=REF(0x38097e8): article
XML::XPath::Node::Element=REF(0x3809ae8): article
- The type of $xp is XML::XPath, obviously.
- The return type of $xp->findnodes() is XML::XPath::NodeSet.
- The type of $article will be XML::XPath::Node::Element in this case.
Have a look at the docs to find out what you can do with them.
Here:
open my $input, "<", "file.xml" or die $!;
open my $output, ">", "truncated-file.xml" or die $!;

my $n_articles = 0;
while (<$input>) {
    print $output $_;
    if (m:</article>:) {
        $n_articles++;
        if ($n_articles >= 3) {
            last;
        }
    }
}

close $input or die $!;
close $output or die $!;
You really don't need an XML parser to do such a simple job.
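For anyone who wants to try the loop without touching real files, here is a minimal self-contained sketch of the same idea using in-memory filehandles. The sample input is made up, and it assumes each closing </article> tag sits on its own line; if the real file has a wrapping root element, the copy would end without that element's closing tag.

```perl
use strict;
use warnings;

# Assumed sample input: four articles, each closing tag on its own line.
my $xml = join '', map { "<article>\n  item $_\n</article>\n" } 1 .. 4;

# In-memory filehandles stand in for file.xml and truncated-file.xml.
my $truncated = '';
open my $input,  '<', \$xml       or die $!;
open my $output, '>', \$truncated or die $!;

my $n_articles = 0;
while ( my $line = <$input> ) {
    print {$output} $line;
    # Stop copying once the third closing tag has been written.
    last if $line =~ m{</article>} && ++$n_articles >= 3;
}

close $input  or die $!;
close $output or die $!;

print $truncated;
```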