How to use HTML::TokeParser to extract data

I want to write code that extracts specific information from the Awards section of imdb.com. With the snippet below I can print the matching text as a whole:

use strict; 
use warnings;
use autodie;
use utf8;
use WWW::Mechanize;
use HTML::TokeParser;

#Example
my $url = 'http://www.imdb.com/title/tt1375666/awards';

my $mech = WWW::Mechanize->new;
$mech->agent_alias( 'Windows Mozilla' );
$mech->get( $url );

if ($mech->find_link(text_regex => qr/(?:Academy Awards|Golden Globes)/i)) {

    my $tp = HTML::TokeParser->new(\$mech->content);

    while (my $token = $tp->get_tag('big')) {
        $token = $tp->get_trimmed_text('big');
        if ( $token =~ /(?:Academy Awards|Golden Globes)/ ) {

            print "$token\n";

        }
    }

}

However, I don't know how to separate the individual tokens, because most of them use the same tags. I also don't know how to write the loop so that each category/recipient pair, if present, is printed on its own line. Ideally I would end up with something like this:

my $year = $tp->get_trimmed_text();
my $result = $tp->get_trimmed_text();
my $award = $tp->get_trimmed_text();
my $category = $tp->get_trimmed_text();
my $recipient = $tp->get_trimmed_text();

print "$year $result $award $category $recipient\n";

  1. $year Won Oscar $category $recipient1..n
  2. etc.
  3. $year Nominated Oscar $category $recipient1..n
  4. etc.
  5. $year Won Golden Globe $category $recipient1..n
  6. etc.
  7. $year Nominated Golden Globe $category $recipient1..n
  8. etc.

I'm not sure whether this is the most efficient approach, but I also tried HTML::TableExtract with much less success.

Thanks.


HTML::TokeParser is low-level; it is the kind of module someone might use to implement HTML::TreeBuilder. You probably want HTML::TreeBuilder::XPath instead. Combine it with the Firefox plugin XPather to find the right expression, and you end up with something like:

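use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new_from_content( $mech->content );
# print the text of every table cell under the element with id "tn15content"
for my $result ( $tree->findnodes(q{id('tn15content')//table//td}) ) {
    print $result->as_trimmed_text, "\n";
}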
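To keep the cells of each award together instead of printing one flat list, the same idea can be applied row by row. The following is only a rough sketch: the inline HTML is a hypothetical stand-in for the awards table (the real page markup may differ), and `rows_from_html` is a helper name made up for this example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical stand-in for the awards table; the real page markup may differ.
my $html = <<'HTML';
<div id="tn15content"><table>
<tr><td>2011</td><td>Won</td><td>Award X</td><td>Category A</td><td>Recipient 1</td></tr>
<tr><td>2011</td><td>Nominated</td><td>Award X</td><td>Category B</td><td>Recipient 2</td></tr>
</table></div>
HTML

# Hypothetical helper: returns one array reference of cell texts per table row.
sub rows_from_html {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
    my @rows;
    for my $tr ( $tree->findnodes(q{//div[@id="tn15content"]//tr}) ) {
        # Evaluate a relative XPath against this row only
        my @cells = map { $_->as_trimmed_text } $tr->findnodes('./td');
        push @rows, \@cells if @cells;
    }
    $tree->delete;    # free the tree once the text has been copied out
    return @rows;
}

print join( ' | ', @$_ ), "\n" for rows_from_html($html);
```

With real data you would pass `$mech->content` instead of the sample string and then pick the year, result, award, category and recipient out of each row by position.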

If XPath is not quite your cup of tea, I'm sure you could do something similar with pQuery:

pQuery( $content)
->find('#tn15content')
->find('td')
->each(sub{
    print  pQuery($_)->text, "\n"
});

Or the same with plain HTML::TreeBuilder and look_down:

$tree->look_down( id => 'tn15content' )
->look_down( qw/_tag td /,
  sub { print $_[0]->as_trimmed_text, "\n"; return } ,
);
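If you would rather stay with HTML::TokeParser, one way to separate cells that all share the same tag is to collect the `td` texts until the closing `tr` and treat each row as one record. This is a minimal sketch under loud assumptions: the sample markup is hypothetical (not the real IMDb page), it assumes that continuation rows leave out the year, result and award cells, and `parse_rows` is a made-up helper name.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical markup: the second row omits year/result/award and
# only carries a further category/recipient pair.
my $html = <<'HTML';
<table>
<tr><td>2011</td><td>Won</td><td>Award X</td><td>Category A</td><td>Recipient 1</td></tr>
<tr><td>Category B</td><td>Recipient 2</td></tr>
<tr><td>2011</td><td>Nominated</td><td>Award X</td><td>Category C</td><td>Recipient 3</td></tr>
</table>
HTML

# Hypothetical helper: returns one array reference of cell texts per row.
sub parse_rows {
    my ($content) = @_;
    my $tp = HTML::TokeParser->new( \$content ) or die "cannot parse HTML";
    my ( @rows, @cells );
    while ( my $tag = $tp->get_tag( 'td', '/tr' ) ) {
        if ( $tag->[0] eq 'td' ) {
            push @cells, $tp->get_trimmed_text('/td');
        }
        else {    # reached '/tr': the current row is complete
            push @rows, [@cells] if @cells;
            @cells = ();
        }
    }
    return @rows;
}

# Carry year/result/award forward for rows that only hold category/recipient.
my ( $year, $result, $award ) = ('') x 3;
my @records;
for my $row ( parse_rows($html) ) {
    ( $year, $result, $award ) = @{$row}[ 0 .. 2 ] if @$row == 5;
    my ( $category, $recipient ) = @{$row}[ -2, -1 ];
    push @records, "$year $result $award $category $recipient";
}
print "$_\n" for @records;
```

Each line of `@records` then has the `$year $result $award $category $recipient` shape asked for in the question; the row-length check would need adjusting to whatever the real table looks like.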