Blocking Web Scrapers [duplicate]

This question already has answers here: Detecting 'stealth' web-crawlers (11 answers). Closed 9 years ago.

What are ways that websites can block web scrapers? How can you identify if your server is being accessed by a bot?


  • CAPTCHAs
  • A form submitted in less than a second
  • A form field hidden by CSS (a honeypot) that arrives with a value on submit
  • Frequent page visits
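
As a rough illustration of the hidden-field and submit-time checks in the list above, here is a minimal Python sketch. The field name `website`, the one-second threshold, and the function `looks_like_bot` are all assumptions made up for the example, not part of any particular framework; the form is assumed to arrive as a plain dict and the render time is assumed to be stored server-side (for example in the session).

```python
import time

MIN_FILL_SECONDS = 1.0      # assumed threshold, from the "less than a second" heuristic
HONEYPOT_FIELD = "website"  # hypothetical name of a field hidden from humans with CSS


def looks_like_bot(form, rendered_at, submitted_at=None):
    """Return True if a form submission trips either heuristic.

    form         -- dict of submitted field names to string values
    rendered_at  -- timestamp (seconds) when the form was served
    submitted_at -- timestamp (seconds) of the submission; defaults to now
    """
    if submitted_at is None:
        submitted_at = time.time()
    # A human never sees the hidden field, so any value in it is suspicious.
    if form.get(HONEYPOT_FIELD, "").strip():
        return True
    # A submission faster than a person could plausibly type is suspicious.
    return (submitted_at - rendered_at) < MIN_FILL_SECONDS
```

Neither check is conclusive on its own (browser autofill can populate hidden fields, for instance), so it is probably safer to treat a hit as a signal to show a CAPTCHA than as grounds for an outright block.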

Simple bots cannot scrape text from Flash, images, or sound.


Unfortunately, your question is similar to asking how to block spam. There's no fixed answer, and nothing will stop a person or bot that is persistent.

However, here are some methods that can be implemented:

  1. Check the User-Agent header (though it can be spoofed).
  2. Use robots.txt (well-behaved bots will, hopefully, respect it).
  3. Detect IP addresses that access a lot of pages too consistently (every "x" seconds).
  4. Review who is visiting your site, either manually or with automated flags in your system, and block the routes that scrapers take.
  5. Don't use a standard template on your site, use generic CSS classes, and don't put HTML comments in your code.
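
To make the "too consistently" idea from the list concrete, here is one possible sketch in Python. It keeps the last few request timestamps per IP and flags an address when the gaps between requests barely vary. The class name `RegularityDetector`, the window size, and the jitter threshold are assumptions chosen for illustration; real traffic would need tuning, and the state here lives in memory only.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev


class RegularityDetector:
    """Flag IPs whose requests arrive at suspiciously even intervals."""

    def __init__(self, window=10, max_jitter=0.05):
        self.window = window          # number of recent requests to examine (assumed)
        self.max_jitter = max_jitter  # max stdev/mean ratio treated as machine-like (assumed)
        self.hits = defaultdict(lambda: deque(maxlen=window))

    def record(self, ip, timestamp):
        """Record a request; return True if this IP's timing looks automated."""
        times = self.hits[ip]
        times.append(timestamp)
        if len(times) < self.window:
            return False  # not enough history yet
        ts = list(times)
        gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
        avg = mean(gaps)
        if avg <= 0:
            return True  # a burst of requests with identical timestamps
        # Coefficient of variation: near zero means metronome-like timing.
        return pstdev(gaps) / avg < self.max_jitter
```

A scraper that adds random delays will slip past this, so it only catches the simple "every x seconds" case described above.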


You can use robots.txt to block bots that take notice of it (while still letting through known crawlers such as Google), but that won't stop those that ignore it. You can get the user agent from your web server logs, or update your code to record it somewhere. If you then want to, you can block particular user agents from accessing your website by returning an empty or default page and/or a particular HTTP status code.
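
A minimal sketch of that user-agent check in Python, assuming you can get at the header value in your request handler. The blocklist entries and the function `respond` are illustrative assumptions, and rejecting requests with no User-Agent at all is an extra choice made here, not something the answer prescribes.

```python
# Substrings of user agents to reject; illustrative only, build your own from your logs.
BLOCKED_AGENT_SUBSTRINGS = ("python-requests", "curl", "scrapy")


def respond(user_agent):
    """Return (status_code, body) for a request with the given User-Agent value."""
    agent = (user_agent or "").lower()
    # Assumption: a missing User-Agent is treated the same as a blocked one.
    if not agent or any(s in agent for s in BLOCKED_AGENT_SUBSTRINGS):
        return 403, ""  # empty body with a "Forbidden" status
    return 200, "<html>...</html>"  # placeholder for the real page
```

As noted elsewhere in this thread, the header is trivial to spoof, so this only filters out bots that don't bother to disguise themselves.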


I don't think there is a way of doing exactly what you need, because a crawler/scraper can set every header when requesting a page, including User-Agent, so you won't be able to tell whether a request comes from a real Mozilla Firefox user or from a scraper/crawler.


Scrapers rely to some extent on the consistency of markup from page load to page load. If you want to make life difficult for them, come up with a means of serving altered markup from request to request.
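
One way this could look, sketched in Python: generate fresh CSS class names on every response so that selectors a scraper learned from one page load don't match the next. The function `render_product` and the markup are hypothetical, and the styles are inlined only to keep the example self-contained.

```python
import secrets
from html import escape


def render_product(title, price):
    """Render a fragment whose class names change on every call."""
    # Prefix with a letter so the result is always a valid CSS class name.
    title_cls = "c" + secrets.token_hex(4)
    price_cls = "c" + secrets.token_hex(4)
    return (
        f"<style>.{title_cls}{{font-weight:bold}}.{price_cls}{{color:green}}</style>"
        f'<h1 class="{title_cls}">{escape(title)}</h1>'
        f'<span class="{price_cls}">{escape(price)}</span>'
    )
```

This only defeats scrapers that key on class names; one that selects by tag or position would still work, so the element structure would have to vary as well to make much difference.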


Something like "Bad Behavior" might help: http://www.bad-behavior.ioerror.us/

From their site:

Bad Behavior is designed to integrate into your PHP-based Web site, running as early as possible to throw out spam bots before they have the opportunity to vandalize your site with their junk, or even to scrape your pages for e-mail addresses and forms to fill out.

Not only does Bad Behavior block actual vandalism to your site, it also blocks many e-mail address harvesters, resulting in less e-mail spam, and many automated Web site cracking tools, helping to improve your Web site’s security.
