
Why does WWW::Mechanize GET certain pages but not others?

I'm new to Perl and HTML. I'm trying to use $mech->get($url) to get something from the periodic table at http://en.wikipedia.org/wiki/Periodic_table, but it keeps returning an error message like this:

Error GETing http://en.wikipedia.org/wiki/Periodic_table: Forbidden at PeriodicTable.pl line 13

But $mech->get($url) works fine if $url is http://search.cpan.org/.

Any help will be much appreciated!


Here is my code:

#!/usr/bin/perl -w

use strict;
use warnings;
use WWW::Mechanize;
use HTML::TreeBuilder;
my $mech = WWW::Mechanize->new( autocheck => 1 );

$mech = WWW::Mechanize->new();

my $table_url = "http://en.wikipedia.org/wiki/Periodic_table/";

$mech->get( $table_url );


It's because Wikipedia denies access to some programs based on the User-Agent header supplied with the request.

You can make your script appear to be a 'normal' web browser by setting the agent string after instantiation and before the get(), for example:

$mech->agent( 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-us) AppleWebKit/533.17.8 (KHTML, like Gecko) Version/5.0.1 Safari/533.17.8' );

That worked for me with the URL in your posting. Shorter strings will probably work too.

(You should remove the trailing slash from the URL too, I think.)
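Putting those two changes together, here is a minimal sketch of a working fetch (the long Safari agent string is only an example; any browser-like value should do):

#!/usr/bin/perl

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 1 );

# Pretend to be a regular browser so Wikipedia doesn't answer 403 Forbidden.
$mech->agent( 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-us) AppleWebKit/533.17.8 (KHTML, like Gecko) Version/5.0.1 Safari/533.17.8' );

# Note: no trailing slash on the article URL.
my $table_url = "http://en.wikipedia.org/wiki/Periodic_table";

$mech->get( $table_url );

print $mech->status, "\n";    # should print 200 on success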

WWW::Mechanize is a subclass of LWP::UserAgent - see docs there for more info, including on the agent() method.

You should limit your use of this method of access, though. Wikipedia explicitly denies access to some spiders in its robots.txt file, and the default user agent for LWP::UserAgent (which starts with libwww) is on that list.
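If you want to check programmatically whether a given agent is allowed to fetch a page, a small sketch using WWW::RobotRules (part of the libwww-perl distribution) might look like this; the agent name and URLs are just illustrative:

#!/usr/bin/perl

use strict;
use warnings;
use LWP::Simple qw(get);
use WWW::RobotRules;

# The user agent name to check against the site's robots.txt.
my $agent = 'libwww-perl';

my $rules = WWW::RobotRules->new( $agent );

# Fetch and parse Wikipedia's robots.txt.
my $robots_url = 'http://en.wikipedia.org/robots.txt';
my $robots_txt = get( $robots_url );
$rules->parse( $robots_url, $robots_txt ) if defined $robots_txt;

# Ask whether this agent may fetch the page.
my $page = 'http://en.wikipedia.org/wiki/Periodic_table';
print $rules->allowed( $page )
    ? "$agent may fetch $page\n"
    : "$agent is disallowed by robots.txt\n";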


When you have these sorts of problems, you need to watch the HTTP transactions so you can see what the web server is sending back to you. In this case, you'd see that Mech connects and gets a response, but Wikipedia refuses to serve the page to your bot. I like HTTP Scoop on the Mac.
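If you don't have a dedicated sniffer handy, LWP itself can show you the traffic. Since WWW::Mechanize inherits from LWP::UserAgent, you can register handlers that dump each request and response (this is the add_handler() idiom from the LWP::UserAgent docs; $mech and $table_url are the variables from the code above):

# Dump every outgoing request and incoming response for debugging.
$mech->add_handler( "request_send",  sub { shift->dump; return } );
$mech->add_handler( "response_done", sub { shift->dump; return } );

$mech->get( $table_url );    # the dump will show the 403 Forbidden and its headers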
