How to use HTML::TokeParser to extract data
I want to write code to extract specific information from the imdb.com Awards section. With the snippet below I can print the text as a whole:
use strict;
use warnings;
use autodie;
use utf8;
use WWW::Mechanize;
use HTML::TokeParser;
#Example
my $url = 'http://www.imdb.com/title/tt1375666/awards';
my $mech = WWW::Mechanize->new;
$mech->agent_alias( 'Windows Mozilla' );
$mech->get( $url );
if ($mech->find_link(text_regex => qr/(?:Academy Awards|Golden Globes)/i)) {
    my $tp = HTML::TokeParser->new(\$mech->content);
    while (my $token = $tp->get_tag('big')) {
        $token = $tp->get_trimmed_text('big');
        if ( $token =~ /(?:Academy Awards|Golden Globes)/ ) {
            print "$token\n";
        }
    }
}
but I don't know how to separate the different tokens, because most of them share the same tags, or how to define the loop so that each category/recipient is printed on a new line if present. What I'm after is something like:
my $year = $tp->get_trimmed_text();
my $result = $tp->get_trimmed_text();
my $award = $tp->get_trimmed_text();
my $category = $tp->get_trimmed_text();
my $recipient = $tp->get_trimmed_text();
print "$year $result $award $category $recipient\n";
- $year Won Oscar $category $recipient1..n
- etc.
- $year Nominated Oscar $category $recipient1..n
- etc.
- $year Won Golden Globe $category $recipient1..n
- etc
- $year Nominated Golden Globe $category $recipient1..n
- etc.
I'm not sure if this is the most efficient approach but I also tried HTML::TableExtract with much less success.
Thanks.
HTML::TokeParser is low level; it is the kind of thing someone might use to implement HTML::TreeBuilder. You want HTML::TreeBuilder::XPath instead. Combine it with the Firefox plugin XPather to work out the path, and you end up with something like:
for my $result ( $tree->findnodes(q{id('tn15content')//table//td}) ) {
    print $result->as_trimmed_text, "\n";
}
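The loop above assumes a $tree that already exists. Here is a minimal sketch of how it might be built, assuming HTML::TreeBuilder::XPath is installed. The markup in $sample_html is a made-up stand-in for $mech->content (the real page will differ), cell_texts is a hypothetical helper name, and the sketch selects the div by attribute predicate rather than with id():

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Made-up stand-in for $mech->content; the real page markup will differ.
my $sample_html = <<'HTML';
<div id="tn15content">
  <table>
    <tr><td>2011</td><td>Won</td><td>Oscar</td></tr>
    <tr><td>2011</td><td>Nominated</td><td>Golden Globe</td></tr>
  </table>
</div>
HTML

# Hypothetical helper: return the trimmed text of every table cell
# found inside the element whose id is tn15content.
sub cell_texts {
    my ($content) = @_;
    my $tree  = HTML::TreeBuilder::XPath->new_from_content($content);
    my @texts = map { $_->as_trimmed_text }
        $tree->findnodes(q{//div[@id="tn15content"]//table//td});
    $tree->delete;    # free the tree once we are done with it
    return @texts;
}

print "$_\n" for cell_texts($sample_html);
```

In the original script you would pass $mech->content to the helper instead of the sample string.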
If XPath is not quite your cup of tea, I'm sure you could do something similar with pQuery:
pQuery( $content )
    ->find('#tn15content')
    ->find('td')
    ->each(sub {
        print pQuery($_)->text, "\n";
    });
Or the same with plain HTML::TreeBuilder and look_down:
$tree->look_down( id => 'tn15content' )
     ->look_down( qw/_tag td/,
         sub { print $_[0]->as_trimmed_text, "\n"; return },
     );
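To get one line per table row rather than one line per cell, you could walk the tr elements and join each row's cells. This is only a rough sketch with plain HTML::TreeBuilder. The markup in $rows_html is invented for illustration, award_rows is a hypothetical helper name, and if the real table uses rowspan for the year or award cells, those values will only show up in the first row of each group and would need to be carried forward by hand:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Invented sample markup; the real awards table will look different.
my $rows_html = <<'HTML';
<div id="tn15content">
  <table>
    <tr><td>2011</td><td>Won</td><td>Oscar</td><td>Some Category</td><td>Recipient A</td></tr>
    <tr><td>2011</td><td>Nominated</td><td>Golden Globe</td><td>Other Category</td><td>Recipient B</td></tr>
  </table>
</div>
HTML

# Hypothetical helper: return one string per table row, with the
# trimmed text of that row's cells joined by single spaces.
sub award_rows {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($content);
    my @rows;
    my $div = $tree->look_down( id => 'tn15content' );
    if ($div) {
        for my $tr ( $div->look_down( _tag => 'tr' ) ) {
            my @cells = map { $_->as_trimmed_text } $tr->look_down( _tag => 'td' );
            push @rows, join( ' ', @cells ) if @cells;    # skip rows without td cells
        }
    }
    $tree->delete;    # free the tree once we are done with it
    return @rows;
}

print "$_\n" for award_rows($rows_html);
```

Filtering on the award name (Oscar, Golden Globe) can then be done with an ordinary regex on each returned row.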