Grep and Extract Data in Perl

2022-12-31 11:09

I have HTML content stored in a variable. How do I extract data found between a set of common tags in the page? For example, I am interested in the data (represented by DATA) kept between a set of tags that appear one line after the other:

...
<td class="jumlah">*DATA_1*</td>
<td class="ud"><a href="">*DATA_2*</a></td>
...

I would then like to store the mapping DATA_2 => DATA_1 in a hash.

Since it is HTML, I think this could work for you:

https://metacpan.org/pod/XML::XPath

XPath is the way.

Since it's HTML, you probably want the XPath module made for working with HTML, HTML::TreeBuilder::XPath.

First you'll need to parse your string using the HTML::TreeBuilder methods. Assuming your webpage's content is in a variable named $content, do it like this:

use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($content);  # parse the HTML string held in $content

Now you can use XPath expressions to get iterators over the nodes you care about. This first expression gets all td nodes that are in a tr in a table in the body in the html element:

my $tdNodes = $tree->findnodes('/html/body/table/tr/td');

Finally you can just iterate over all the nodes in a loop to find what you want:

foreach my $node ($tdNodes->get_nodelist) {
  my $data = $node->findvalue('.'); # the content of the node
  print "$data\n";
}

See the HTML::TreeBuilder documentation for more on its methods and the NodeSet documentation for how to use the NodeSet result object. w3schools has a passable XPath tutorial.

With all this, you should be able to do pretty robust HTML parsing to grab out any element you want. You can even specify classes, ids, and more in your XPath queries to be really specific about which nodes you want. In my opinion, parsing HTML using this modified XPath library is a lot faster and more maintainable than dealing with a bunch of one-off regexes.
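To tie this back to the question, here is a minimal sketch of building the DATA_2 => DATA_1 hash. It assumes each pair of cells sits in its own tr, one pair per row, which the question's snippet does not confirm. The sample HTML and the values in it are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample input; in practice $content holds the real page.
my $content = <<'HTML';
<html><body><table>
<tr>
<td class="jumlah">10</td>
<td class="ud"><a href="">alpha</a></td>
</tr>
<tr>
<td class="jumlah">20</td>
<td class="ud"><a href="">beta</a></td>
</tr>
</table></body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($content);

my %map;    # DATA_2 => DATA_1

# Assumption: every row that has a "jumlah" cell also has one "ud" cell.
for my $row ($tree->findnodes('//tr[td[@class="jumlah"]]')) {
    my $data_1 = $row->findvalue('./td[@class="jumlah"]');
    my $data_2 = $row->findvalue('./td[@class="ud"]/a');
    $map{$data_2} = $data_1;
}

$tree->delete;    # release the parse tree once we are done with it

print "$_ => $map{$_}\n" for sort keys %map;
```

If the two cells are not guaranteed to share a row, the XPath expressions would need adjusting to match the real layout. Note also that a repeated DATA_2 value overwrites the earlier entry in the hash.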

Use an HTML parsing module, such as HTML::TreeBuilder or HTML::Parser, as described in the answers to this question.

Purely theoretically, you could try doing this with regular expressions, but as noted in the linked question's answers and countless other times on SO, parsing HTML with regexes is a Bad Idea with capital letters - too easy to get wrong, too hard to get right, and impossible to get 100% right, since HTML is not a regular language.

You might try this module: HTML::TreeBuilder::XPath. The doc says:

This module adds typical XPath methods to HTML::TreeBuilder, to make it easy to query a document.
