Slow down spidering of website
Is there a way to force a spider to slow down its spidering of a website? Anything that can be put in headers or robots.txt?
I thought I remembered reading something about this being possible, but I cannot find anything now.
If you're referring to Google, you can throttle the speed at which Google spiders your site by using your Google Webmaster account (Google Webmaster Tools).
There is also the Crawl-delay directive, which you can put in robots.txt:
User-agent: *
Crawl-delay: 10
The Crawl-delay value is the number of seconds the crawler should wait between page requests. Of course, like everything else in robots.txt, it only works if the crawler chooses to honor it, so YMMV.
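As an illustration of what honoring it looks like from the crawler side, Python's standard urllib.robotparser exposes the directive via crawl_delay(). This is just a minimal sketch; the URL is a placeholder:

import time
import urllib.robotparser

# Fetch and parse the site's robots.txt (placeholder domain).
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# A polite crawler sleeps for the advertised delay between requests;
# crawl_delay() returns None when no Crawl-delay rule applies.
delay = rp.crawl_delay("*") or 1
print(f"Sleeping {delay} seconds between requests")
time.sleep(delay)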
Beyond using the Google Webmaster Tools for the Googlebot (see Robert Harvey's answer), Yahoo! and Bing support the nonstandard Crawl-delay directive in robots.txt: http://en.wikipedia.org/wiki/Robots.txt#Nonstandard_extensions
When push comes to shove, however, a misbehaving bot that's slamming your site will just have to be blocked at a higher level (e.g. load balancer, router, caching proxy, whatever is appropriate for your architecture).
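If there is nothing at that layer to lean on, a crude application-level throttle can approximate the same idea. The sketch below is a minimal WSGI middleware with made-up limits and key choices, not a drop-in solution:

import time
from collections import defaultdict, deque

class ThrottleMiddleware:
    # Reject clients that exceed max_requests within window_seconds.
    # The default limits here are illustrative assumptions, not recommendations.
    def __init__(self, app, max_requests=10, window_seconds=10):
        self.app = app
        self.max_requests = max_requests
        self.window = window_seconds
        self.history = defaultdict(deque)  # client key -> recent request timestamps

    def __call__(self, environ, start_response):
        # Key requests by user agent and IP; a real setup might key on IP alone.
        key = (environ.get("HTTP_USER_AGENT", ""), environ.get("REMOTE_ADDR", ""))
        now = time.time()
        times = self.history[key]
        while times and now - times[0] > self.window:
            times.popleft()
        if len(times) >= self.max_requests:
            start_response("429 Too Many Requests",
                           [("Content-Type", "text/plain"),
                            ("Retry-After", str(self.window))])
            return [b"Slow down.\n"]
        times.append(now)
        return self.app(environ, start_response)

Blocking or throttling in front of the application (load balancer, proxy) is still preferable, since rejected requests then never consume application resources.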
See Throttling your web server for a solution using Perl. Randal Schwartz said that he survived a Slashdot attack using this solution.
I don't think robots.txt will do anything beyond allow and disallow. Most of the search engines let you customize how they crawl your site.
For example: Bing and Google
If you have a specific agent that is causing issues, you might either block it specifically, or see if you can configure it.
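One way to see which agent is actually causing the load is to count user agents in the access log. This sketch assumes a combined log format where the user agent is the last quoted field, and "access.log" is a placeholder path:

import collections
import re

# The user agent is the final quoted field in a combined-format log line.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

counts = collections.Counter()
with open("access.log") as log:
    for line in log:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1

# Show the ten busiest user agents.
for agent, hits in counts.most_common(10):
    print(f"{hits:8d}  {agent}")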