How to use HTML::TokeParser to extract data

2023-03-22 20:42

I want to write code that extracts specific information from the imdb.com Awards section. With the snippet below I can only print the text as a whole:

use strict; 
use warnings;
use autodie;
use utf8;
use WWW::Mechanize;
use HTML::TokeParser;

#Example
my $url = 'http://www.imdb.com/title/tt1375666/awards';

my $mech = WWW::Mechanize->new;
$mech->agent_alias( 'Windows Mozilla' );
$mech->get( $url );

if ($mech->find_link(text_regex => qr/(?:Academy Awards|Golden Globes)/i)) {

    my $tp = HTML::TokeParser->new(\$mech->content);

    while (my $token = $tp->get_tag('big')) {
        $token = $tp->get_trimmed_text('big');
        if ( $token =~ /(?:Academy Awards|Golden Globes)/ ) {

            print "$token\n";

        }
    }

}

However, I don't know how to separate the individual tokens, because most of them share the same tags. I also don't know how to write the loop over each category/recipient pair so that each one is printed on its own line when present. What I have in mind is something like this:

my $year = $tp->get_trimmed_text();
my $result = $tp->get_trimmed_text();
my $award = $tp->get_trimmed_text();
my $category = $tp->get_trimmed_text();
my $recipient = $tp->get_trimmed_text();

print "$year $result $award $category $recipient\n";

$year Won Oscar $category $recipient1..n
etc.
$year Nominated Oscar $category $recipient1..n

etc.
$year Won Golden Globe $category $recipient1..n
etc.
$year Nominated Golden Globe $category $recipient1..n
etc.

I'm not sure if this is the most efficient approach but I also tried HTML::TableExtract with much less success.

Thanks.

HTML::TokeParser is low level; it is the kind of thing someone might use to implement HTML::TreeBuilder. You probably want HTML::TreeBuilder::XPath instead. Combine it with the Firefox plugin XPather to find the right expression, and you end up with something like:

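use HTML::TreeBuilder::XPath;
# $mech is the WWW::Mechanize object from the question
my $tree = HTML::TreeBuilder::XPath->new_from_content( $mech->content );
for my $result ( $tree->findnodes(q{id('tn15content')//table//td}) ) {
    print $result->as_trimmed_text, "\n";
}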
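
To get one line per award entry rather than one line per cell, you can select the rows first and then the cells inside each row. The following is only a sketch: the HTML is a made-up stand-in with placeholder values (one `<tr>` per entry, five `<td>` cells each), not the real IMDb markup, so the XPath expressions and the cell count are assumptions you would need to adjust after inspecting the page.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample markup with placeholder values; the real page
# is laid out differently, so treat the structure as an assumption.
my $html = <<'HTML';
<div id="tn15content">
<table>
<tr><td>2011</td><td>Won</td><td>Oscar</td><td>Category A</td><td>Person A</td></tr>
<tr><td>2011</td><td>Nominated</td><td>Oscar</td><td>Category B</td><td>Person B</td></tr>
</table>
</div>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

my @lines;
# Select each row under the container, then the cells of that row.
for my $row ( $tree->findnodes(q{//*[@id='tn15content']//table//tr}) ) {
    my @cells = map { $_->as_trimmed_text } $row->findnodes('./td');
    next unless @cells == 5;    # skip header or spacer rows
    push @lines, join ' ', @cells;
}

print "$_\n" for @lines;

$tree->delete;    # free the tree when done
```

If the real table uses rowspans (for example one year cell covering several rows), the rows will have differing cell counts, and you would need to remember the last seen year and result instead of skipping short rows.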

If XPath is not quite your cup of tea, I'm sure you could do something similar with pQuery:

pQuery($content)
->find('#tn15content')
->find('td')
->each(sub {
    print pQuery($_)->text, "\n";
});

Or the same with plain HTML::TreeBuilder and look_down:

$tree->look_down( id => 'tn15content' )
->look_down( qw/_tag td /,
  sub { print $_[0]->as_trimmed_text, "\n"; return } ,
);
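
If you would rather stay with HTML::TokeParser as in the question, the main trick is to pass an end tag such as `'/td'` to `get_trimmed_text`, so that it stops at the end of the current cell instead of running on. Below is a minimal sketch on made-up markup with placeholder values (one `<tr>` per entry, five `<td>` cells); the field names and the five-cell layout are assumptions for illustration, not what IMDb actually serves.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup with placeholder values.
my $html = <<'HTML';
<table>
<tr><td>2011</td><td>Won</td><td>Golden Globe</td><td>Category A</td><td><a href="#">Person A</a></td></tr>
<tr><td>2011</td><td>Nominated</td><td>Golden Globe</td><td>Category B</td><td><a href="#">Person B</a></td></tr>
</table>
HTML

my $tp = HTML::TokeParser->new( \$html ) or die "Cannot create parser";

my @fields = qw(year result award category recipient);
my @records;

# Outer loop: advance to each row.
while ( $tp->get_tag('tr') ) {
    my @cells;

    # Inner loop: stop at the next <td> or at the end of this row.
    while ( my $tag = $tp->get_tag( 'td', '/tr' ) ) {
        last if $tag->[0] eq '/tr';

        # Read all text up to the closing </td>, nested tags included.
        push @cells, $tp->get_trimmed_text('/td');
    }

    next unless @cells == @fields;    # skip rows with another layout

    my %record;
    @record{@fields} = @cells;
    push @records, \%record;
}

for my $r (@records) {
    print join( ' ', @{$r}{@fields} ), "\n";
}
```

Collecting the rows into hashes first keeps the parsing apart from the printing, so filtering on `$r->{award}` (Oscar versus Golden Globe, say) becomes a simple `grep` over `@records`.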
