
Check if a URL is blocked by robots.txt using Perl

Can anybody show me sample code to check whether a URL is blocked by robots.txt? A robots.txt file can specify a full URL or a directory. Is there a helper function for this in Perl?


Check out WWW::RobotRules:

   The following methods are provided:

   $rules = WWW::RobotRules->new($robot_name)
       This is the constructor for WWW::RobotRules objects. The first
       argument given to new() is the name of the robot.

   $rules->parse($robot_txt_url, $content, $fresh_until)
       The parse() method takes as arguments the URL that was used to
       retrieve the /robots.txt file, and the contents of the file.

   $rules->allowed($uri)
       Returns TRUE if this robot is allowed to retrieve this URL.


WWW::RobotRules is the standard class for parsing robots.txt files and then checking URLs to see if they're blocked.

You may also be interested in LWP::RobotUA, which integrates that into LWP::UserAgent, automatically fetching and checking robots.txt files as needed.
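Putting the methods above together, here is a minimal sketch of checking a URL with WWW::RobotRules. The robot name and the inline robots.txt content are made up for illustration; in real code you would fetch /robots.txt from the site first (or let LWP::RobotUA do it for you).

```perl
use strict;
use warnings;
use WWW::RobotRules;

# Hypothetical robot name for illustration.
my $rules = WWW::RobotRules->new('MyBot/1.0');

# Inline robots.txt content; normally fetched from the site.
my $robots_txt = <<'END';
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
END

# parse() takes the URL the robots.txt was retrieved from
# and the contents of that file.
$rules->parse('http://example.com/robots.txt', $robots_txt);

print $rules->allowed('http://example.com/index.html')
    ? "allowed\n" : "blocked\n";   # allowed
print $rules->allowed('http://example.com/cgi-bin/foo.pl')
    ? "allowed\n" : "blocked\n";   # blocked
```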


Load the robots.txt file and search for lines beginning with "Disallow:". Then check whether the pattern following "Disallow:" appears at the start of your URL's path. If it does, the URL is banned by robots.txt.

Example - You find the following line in the robots.txt:

Disallow: /cgi-bin/

Now remove the "Disallow: " prefix and check whether the remaining part, "/cgi-bin/", comes directly after the domain.

If your URL looks like:

www.stackoverflow.com/cgi-bin/somewhatelse.pl

it is banned.

If your URL looks like:

www.stackoverflow.com/somewhatelse.pl

it is ok. You can find the complete set of rules at http://www.robotstxt.org/. This is the way to go if you cannot install additional modules for any reason.
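A naive sketch of that manual approach, in plain Perl with no modules (the `is_banned` helper is made up for this example). It only handles literal Disallow prefixes and ignores User-agent sections, Allow lines, and wildcards, which is exactly why a real module is preferable:

```perl
use strict;
use warnings;

# Naive check: does any Disallow prefix match the start of the path?
# Ignores User-agent grouping, Allow rules, and wildcard patterns.
sub is_banned {
    my ($robots_txt, $path) = @_;
    for my $line (split /\n/, $robots_txt) {
        if ($line =~ /^\s*Disallow:\s*(\S+)/i) {
            my $prefix = $1;
            # Banned if the path starts with the disallowed prefix.
            return 1 if index($path, $prefix) == 0;
        }
    }
    return 0;
}

my $robots = "User-agent: *\nDisallow: /cgi-bin/\n";
print is_banned($robots, '/cgi-bin/somewhatelse.pl') ? "banned\n" : "ok\n"; # banned
print is_banned($robots, '/somewhatelse.pl')         ? "banned\n" : "ok\n"; # ok
```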

Better would be to use a module from CPAN. There is a great module there that I use for exactly this: LWP::RobotUA. LWP (libwww-perl) is, in my opinion, the standard for web access in Perl, and this module is part of it and ensures your crawler behaves politely.


Hmm, you don't seem to have even looked! On the first page of search results, I see various download engines that handle robots.txt automatically for you, and at least one that does exactly what you asked.


Note that WWW::RobotRules skips "substring" (wildcard) rules such as:

User-agent: *
Disallow: *anytext*

With that rule, the URL http://example.com/some_anytext.html is still allowed (not banned).
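A short reproduction of that reported behaviour (robot name made up for the example), assuming WWW::RobotRules treats the Disallow value as a literal path prefix, so a wildcard pattern like `*anytext*` never matches:

```perl
use strict;
use warnings;
use WWW::RobotRules;

my $rules = WWW::RobotRules->new('MyBot/1.0');

# A wildcard rule that WWW::RobotRules reportedly does not expand.
$rules->parse('http://example.com/robots.txt',
              "User-agent: *\nDisallow: *anytext*\n");

# The pattern is matched as a literal prefix, so this URL passes.
print $rules->allowed('http://example.com/some_anytext.html')
    ? "allowed\n" : "blocked\n";
```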

