Extract all links from an HTML page, excluding links from a specific table
I'm pretty new to Perl/HTML. Here is what I'm trying to do with WWW::Mechanize and HTML::TreeBuilder:
For each chemical element page on Wikipedia, I need to extract all hyperlinks that point to the other chemical elements' pages on Wikipedia and print each unique pair in this format:
Atomic_Number1 (Chemical Element Title1) -> Atomic_Number2 (Chemical Element Title2)
The only problem is that there is a mini periodic table on every chemical element's page (top-right of the page). Because that table links to every element, it makes the result the same for every page. I'm having trouble extracting all links from the page EXCEPT those in that table.
[Note: I only looked at $elem == 6 (Carbon) for ease of debugging; see the $elem == 6 check in the second loop.]
Here is my code:
#!/usr/bin/perl -w
use strict;
use warnings;
use WWW::Mechanize;
use HTML::TreeBuilder;

my $mech = WWW::Mechanize->new( autocheck => 1 );
my $table_url = "http://en.wikipedia.org/wiki/Periodic_table";
# Build the user-agent string by concatenation so it contains no line breaks
$mech->agent('Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_5; en-us) '
    . 'AppleWebKit/533.17.8 (KHTML, like Gecko) Version/5.0.1 '
    . 'Safari/533.17.8');
$mech->get($table_url);
my $tree = HTML::TreeBuilder->new_from_content($mech->content);
my %elem_set;

## obtain a hash of atomic number => [title, link] for each element
foreach my $td ($tree->look_down(_tag => 'td')) {
    # If there's no <a> in this <td>, then skip it:
    my $a = $td->look_down(_tag => 'a') or next;
    my $tdText = $td->as_text;
    # only investigate up to the 114th element
    if ($tdText =~ m/^(\d+)\S+$/ && $1 <= 114) {
        $elem_set{$1} = [$a->attr('title'), $a->attr('href')];
    }
}

## In each element's page, look for links to other elements in the set
foreach my $elem (keys %elem_set) {
    if ($elem == 6) {
        # reconstruct element url to ensure only fetch pages in English
        my $elem_url = "http://en.wikipedia.org" . $elem_set{$elem}[1];
        $mech->get($elem_url);
        #####################################################################
        ### need help here to exclude links from that mini periodic table ###
        #####################################################################
        my @target_links = $mech->links();
        for my $link (@target_links) {
            my $text = $link->text;
            next unless defined $text;    # e.g. image links have no text
            if ($link->url =~ m/^\/(wiki)\/.+$/ && $text =~ m/^\w+$/) {
                printf("%s, %s\n", $text, $link->url);
            }
        }
    }
}
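Use WWW::Mechanize's update_html method to remove that table before finding the links. update_html replaces the HTML that the $mech object holds and re-parses its links and forms, so you can take $mech->content, edit it however you want, and hand the result back before calling $mech->links().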
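Here is a minimal sketch of that idea, not a tested solution for the live Wikipedia page. strip_tables is a hypothetical helper name, and the 'infobox' class used to pick out the table holding the mini periodic table is an assumption; check the page source for the real class or id and adjust the pattern. Note that this removes the whole matching table, so any other links inside it are dropped too.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return $html with every <table> whose class
# attribute matches $class_re removed.
sub strip_tables {
    my ($html, $class_re) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # look_down returns matches in document order (outer before inner);
    # delete in reverse so nested matches are removed before their parents.
    my @tables = $tree->look_down(_tag => 'table', class => $class_re);
    $_->delete for reverse @tables;

    my $clean = $tree->as_HTML;
    $tree->delete;    # free the parse tree
    return $clean;
}

# Intended use inside the second loop (class name is an assumption):
#   $mech->get($elem_url);
#   $mech->update_html( strip_tables($mech->content, qr/\binfobox\b/) );
#   my @target_links = $mech->links();   # no longer includes the table's links
```

Doing the removal with HTML::TreeBuilder rather than a regex on the raw HTML avoids trouble with nested tables, since the parser knows where each table really ends.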