Parsing HTML with Perl works for two lines but not for multiple

I have written the following Perl script:

use HTML::TreeBuilder;

my $html = HTML::TreeBuilder->new_from_content(<<END_HTML);

<span class=time>1 h </span> 
<a href="http://foo.com/User">User</a>: There are not enough <b>big</b>
<b>fish</b> in the lake ;
END_HTML

my $source   = "foo";
my @time     = "10-14-2011";
my $name     = $html->find('a')->as_text;  
my $comment  = $html->as_text;
my @keywords = map { $_->as_text } $html->find('b');

This outputs:

foo, 10-14-2011, User, 1h User: There are not enough big fish in the lake, big fish

That is exactly what I wanted from the test HTML, but it only works with the snippet above, which I put in for testing purposes.

However, the full HTML file contains multiple 'a' and 'b' elements, for instance, and when I print the results these columns come out blank.

How can I account for multiple values for specific searches?


Without sight of your real HTML it is hard to help, but in list context $html->find('a') returns a list of all <a> elements (in scalar context it returns only the first match), so you could write something like

foreach my $anchor ($html->find('a')) {
  print $anchor->as_text, "\n";
}

But that will find all <a> elements, and it is unlikely that that is what you want. $html->look_down() is far more flexible, and provides for searching by attribute as well as by tag name.
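
As a minimal sketch of what a look_down search could look like: the second link, the foo.com URL prefix, and the choice of the class attribute as a criterion are assumptions made up for illustration, since the real HTML is not shown.

```perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Sample markup; the second <a> is hypothetical, added so there is something to filter out.
my $html = HTML::TreeBuilder->new_from_content(<<'END_HTML');
<span class="time">1 h</span>
<a href="http://foo.com/User">User</a>: There are not enough <b>big</b>
<b>fish</b> in the lake ;
<a href="http://bar.com/about">About</a>
END_HTML

# List context: every <a> whose href matches the pattern.
my @user_links = $html->look_down(_tag => 'a', href => qr{^http://foo\.com/});
print $_->as_text, "\n" for @user_links;

# A string criterion must match the attribute value exactly.
my ($time) = $html->look_down(_tag => 'span', class => 'time');
print $time->as_text, "\n" if $time;
```

Each criterion is an attribute name paired with a string, a regex, or you can pass a code reference instead; the pseudo-attribute _tag stands for the tag name.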

I cannot begin to guess about your problem with comments without seeing what data you are dealing with.


If you need to process each text element independently then you probably need to call the objectify_text method. This turns every text element in the tree into a pseudo-element with a ~text tag name and a text attribute. For instance, <p>paragraph text</p> would be transformed into <p><~text text="paragraph text" /></p>. These elements can be discovered using $html->find('~text') as normal. Here is some code to demonstrate:

use strict;
use warnings;

use HTML::TreeBuilder;

my $html = HTML::TreeBuilder->new_from_content(<<END_HTML);

<span class=time>1 h </span> 
<a href="http://foo.com/User">User</a>: There are not enough <b>big</b>
<b>fish</b> in the lake ;
END_HTML

$html->objectify_text;
print $_->attr('text'), "\n" for $html->find('~text');

OUTPUT

1 h 

User
: There are not enough 
big

fish
 in the lake ; 
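
The blank lines in that output are whitespace-only text segments. If those are not wanted, the pseudo-elements can be filtered like any other list. This is only a sketch, assuming that whitespace-only segments should be dropped and that the tree should be restored afterwards with deobjectify_text:

```perl
use strict;
use warnings;

use HTML::TreeBuilder;

my $html = HTML::TreeBuilder->new_from_content(<<'END_HTML');
<span class=time>1 h </span>
<a href="http://foo.com/User">User</a>: There are not enough <b>big</b>
<b>fish</b> in the lake ;
END_HTML

$html->objectify_text;

# Keep only the segments that contain something other than whitespace.
my @segments = grep { /\S/ } map { $_->attr('text') } $html->find('~text');
print "[$_]\n" for @segments;

# Turn the ~text pseudo-elements back into ordinary text nodes.
$html->deobjectify_text;
```

The brackets in the print statement are there only to make leading and trailing spaces visible.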
