
Grep and Extract Data in Perl

I have HTML content stored in a variable. How do I extract data that is found between a set of common tags in the page? For example, I am interested in the data (represented by DATA_1 and DATA_2) kept between a set of tags that appear one line after the other:

...
<td class="jumlah">*DATA_1*</td>
<td class="ud"><a href="">*DATA_2*</a></td>
...

I would then like to store the mapping DATA_2 => DATA_1 in a hash.


Since it is HTML I think this could work for you?

https://metacpan.org/pod/XML::XPath

XPath is the way.


Since it's HTML, you probably want the XPath module made for working with HTML, HTML::TreeBuilder::XPath.

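First you'll need to parse your string using the HTML::TreeBuilder methods, which HTML::TreeBuilder::XPath inherits. Assuming your webpage's content is in a variable named $content, do it like this:

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($content);  # parse the string held in $content
$tree->eof;              # signal the end of the input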

Now you can use XPath expressions to select the nodes you care about. This expression gets all td nodes that are in a tr in a table in the body of the html element; in scalar context, findnodes returns a NodeSet object:

my $tdNodes = $tree->findnodes('/html/body/table/tr/td');

Finally you can just iterate over all the nodes in a loop to find what you want:

foreach my $node ($tdNodes->get_nodelist) {
  my $data = $node->findvalue('.'); # the text content of the node
  print "$data\n";
}

See the HTML::TreeBuilder documentation for more on its methods and the NodeSet documentation for how to use the NodeSet result object. w3schools has a passable XPath tutorial.

With all this, you should be able to do pretty robust HTML parsing to grab out any element you want. You can even specify classes, ids, and more in your XPath queries to be really specific about which nodes you want. In my opinion, parsing HTML using this modified XPath library is a lot faster and more maintainable than dealing with a bunch of one-off regexes.
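
To tie this back to the question, here is a minimal sketch of building the DATA_2 => DATA_1 hash with class-based XPath queries. The sample HTML, the values in it, and the names %map, $key, and $value are made up for illustration, and the sketch assumes that each pair of cells sits in the same tr; adjust the expressions if your page is laid out differently.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample input standing in for $content.
my $content = <<'HTML';
<html><body><table>
<tr>
<td class="jumlah">10</td>
<td class="ud"><a href="">apples</a></td>
</tr>
<tr>
<td class="jumlah">25</td>
<td class="ud"><a href="">pears</a></td>
</tr>
</table></body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($content);
$tree->eof;

my %map;
# Visit each row, then query relative to that row.
for my $row ($tree->findnodes('//tr')) {
    my $key   = $row->findvalue('./td[@class="ud"]/a');  # DATA_2
    my $value = $row->findvalue('./td[@class="jumlah"]'); # DATA_1
    next unless length $key;    # skip rows without the expected cells
    $map{$key} = $value;
}

$tree->delete;    # free the parse tree once the data is extracted

print "$_ => $map{$_}\n" for sort keys %map;
```

Going row by row keeps each DATA_1 paired with the DATA_2 from the same tr, which is less fragile than collecting two separate lists and matching them up by position.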


Use an HTML parsing module such as HTML::TreeBuilder or HTML::Parser, as described in the answers to this question.

Purely theoretically, you could try to do this with regular expressions, but as noted in the linked question's answers and countless other times on SO, parsing HTML with regexes is a Bad Idea with capital letters: too easy to get wrong, too hard to do well, and impossible to get 100% right, since HTML is not a regular language.


You might try this module: HTML::TreeBuilder::XPath. The doc says:

This module adds typical XPath methods to HTML::TreeBuilder, to make it easy to query a document.
