What are the best measures to protect content from being crawled?
I've been crawling a lot of websites for content recently and I'm surprised that no site so far has been able to put up much resistance. Ideally the site I'm working on shouldn't be this easy to harvest, so I was wondering: what are the best methods to stop bots from harvesting your web content? Obvious solutions:
- Robots.txt (yeah, right)
- IP blacklists
What can be done to catch bot activity? What can be done to make data extraction difficult? What can be done to give them bad data?
Regarding SEO concerns, is there a way to limit access to certain blocks of data (kind of like a <nofollow> block of text)?
Just looking for ideas, no right/wrong answer
Use a client-side decryption/decoding scheme: send encoded data back and rely on JavaScript to decode it into something readable. Crawlers will still get your content, but it will be useless to them (at least until your site becomes big enough that people target it specifically).
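A minimal sketch of what that could look like, assuming the server Base64-encodes each block of text and wraps it in a hypothetical `encoded` class; the page then decodes it in the browser:

```javascript
// Runs in the browser. Assumes the server ships article text Base64-encoded
// inside elements marked with a hypothetical "encoded" class, e.g.:
//   <div class="encoded">SGVsbG8sIHdvcmxkIQ==</div>
// A crawler that never runs JavaScript only ever sees the Base64 blob.
document.addEventListener('DOMContentLoaded', () => {
  for (const el of document.querySelectorAll('.encoded')) {
    // atob() undoes the Base64; TextDecoder turns the bytes back into UTF-8 text.
    const bytes = Uint8Array.from(atob(el.textContent.trim()), c => c.charCodeAt(0));
    el.textContent = new TextDecoder('utf-8').decode(bytes);
  }
});
```

Base64 is trivially reversible, so this only filters out crawlers that don't bother to execute JavaScript; anything driving a real browser still gets the plain text.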
However, why would you want to do that? Do you not want the site to be indexed by search engines?
Trying to stop web scrapers isn't easy. Without a complex, constantly evolving solution, all you can do is raise the bar of difficulty and hope they aren't determined enough to keep going. Some things you can do are:
- Rate limit. Make sure you don't do this based on IP alone, but rather on unique sessions, to avoid blocking users behind a NAT (see the sketch after this list).
- Force users to execute JavaScript to access the page. There are several ways to do this, and it makes scraping significantly harder but still not impossible; plenty of scripting tools (Ruby, Selenium, etc.) let you drive a real web browser.
- IP blacklists. Block proxy servers, Tor exit nodes, Amazon EC2 ranges, etc.
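As a rough illustration of the rate-limiting point, here is a sketch using Node.js with Express and express-session (my stack choice, not anything prescribed by the question; the window and threshold are placeholders):

```javascript
// Sketch: rate limit per session rather than per IP, so users sharing a NAT
// aren't punished collectively. Assumes Node.js + Express + express-session.
const express = require('express');
const session = require('express-session');

const app = express();
app.use(session({ secret: 'change-me', resave: false, saveUninitialized: true }));

const WINDOW_MS = 60 * 1000; // 1-minute sliding window (placeholder)
const MAX_REQUESTS = 60;     // placeholder threshold

app.use((req, res, next) => {
  const now = Date.now();
  // Keep only the timestamps still inside the window, then record this hit.
  req.session.hits = (req.session.hits || []).filter(t => now - t < WINDOW_MS);
  req.session.hits.push(now);
  if (req.session.hits.length > MAX_REQUESTS) {
    return res.status(429).send('Slow down.');
  }
  next();
});

app.get('/', (req, res) => res.send('Normal content'));
app.listen(3000);
```

A scraper that refuses cookies gets a fresh session on every request, so in practice you would combine this with a cookie/JavaScript check like the one further down.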
It is also important to whitelist search engines to avoid SEO / traffic loss. You can whitelist most search engines by looking at their user agent and then verifying the IP it came from (via WHOIS or a reverse DNS lookup).
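For example, Google documents that a genuine Googlebot can be confirmed with a reverse DNS lookup followed by a forward lookup; a sketch of that check (same Node.js assumption as above):

```javascript
// Sketch: confirm that a request claiming to be Googlebot really comes from
// Google, using reverse DNS plus a forward lookup to close the loop.
const dns = require('dns').promises;

async function isRealGooglebot(ip) {
  try {
    const [hostname] = await dns.reverse(ip);
    // Genuine Googlebot hosts end in googlebot.com or google.com.
    if (!/\.(googlebot|google)\.com$/.test(hostname)) return false;
    // Forward-confirm: the hostname must resolve back to the same IP.
    const addresses = await dns.resolve(hostname);
    return addresses.includes(ip);
  } catch {
    return false; // no PTR record, lookup failure, etc.
  }
}
```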
For full disclosure, I am the co-founder of Distil Networks and we offer an anti-scraping solution as a service. That makes me biased, in that I don't believe there is a static answer to your question; you can't do one thing and stop. It is an arms race that you will always have to keep fighting.
Track activity by IP (maybe combined with user agent) and try to detect a bot by the delay between page calls. If too many URLs are requested within a certain interval, start sending back modified content, a redirect, or whatever you had in mind.
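A sketch of that idea as an Express middleware (my stack assumption; the interval, threshold, and the "modified content" are placeholders):

```javascript
// Sketch: track request timestamps per IP + user agent and, once a client asks
// for too many URLs within the interval, quietly change what it gets served.
const express = require('express');
const app = express();

const history = new Map();   // key -> recent timestamps (expire old keys in production)
const WINDOW_MS = 10 * 1000; // placeholder interval
const MAX_HITS = 20;         // placeholder threshold

app.use((req, res, next) => {
  const key = `${req.ip}|${req.get('user-agent') || ''}`;
  const now = Date.now();
  const hits = (history.get(key) || []).filter(t => now - t < WINDOW_MS);
  hits.push(now);
  history.set(key, hits);
  req.suspectedBot = hits.length > MAX_HITS; // routes decide what to do about it
  next();
});

app.get('/article/:id', (req, res) => {
  if (req.suspectedBot) {
    return res.send('Lorem ipsum...'); // or redirect, or scrambled data
  }
  res.send('The real article text');
});

app.listen(3000);
```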
Have JavaScript set a cookie on the client. On the server side, check for the existence of this cookie, and serve your content only if the cookie is present.
If no cookie is present, send a page with JavaScript that sets the cookie and reloads the page.
This should prevent all automated web tools that do not execute any JavaScript.
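A minimal sketch of that cookie check, with the server side again assumed to be Node.js/Express (any server-side language works the same way; the cookie name is a placeholder):

```javascript
// Sketch: serve the real page only when a JavaScript-set cookie is present;
// otherwise send a stub whose script sets the cookie and reloads the page.
const express = require('express');
const app = express();

app.get('/', (req, res) => {
  const hasCookie = (req.headers.cookie || '').includes('js_check=1');
  if (hasCookie) {
    return res.send('The real content');
  }
  res.send(`<!doctype html>
    <script>
      document.cookie = 'js_check=1; path=/';
      location.reload();
    </script>`);
});

app.listen(3000);
```

As noted above, anything driving a real browser (Selenium, a headless browser) passes this check; it only filters out plain HTTP clients.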
You can't prevent crawling if the crawler REALLY wants your content, but you can mess with them.
Ways to detect bots:
- by user agent
- by IP
- by log analysis (most of the time, bots load one page every x seconds)
- have JavaScript load a specific file, e.g. yadda.gif. If a client loaded a given page but never downloaded yadda.gif, it doesn't have JS enabled and the odds are that it's a bot (or is using NoScript). See the sketch after this list.
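A sketch of the beacon idea from the last bullet: the page requests the image from JavaScript, and a later log-analysis pass flags clients that fetched pages but never the beacon (yadda.gif and the path are placeholders from the answer):

```javascript
// Client side: request the beacon once the page has loaded. A client that shows
// up in the page logs but never in the yadda.gif logs is very likely a bot
// (or a user running NoScript).
document.addEventListener('DOMContentLoaded', () => {
  new Image().src = '/yadda.gif?page=' + encodeURIComponent(location.pathname);
});
```

On the server, a periodic job can then diff the set of clients that requested pages against the set that requested /yadda.gif.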
Possible punishments:
- redirect to microsoft.com :-)
- set the output rate really low so it takes forever to download anything. You can do this with a throttling module for Apache (e.g. mod_ratelimit) or with PHP's output buffering functions; a Node.js variant is sketched after this list.
- return gibberish, devowel the content, or something along those lines.
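The throttling suggestion names Apache and PHP; for consistency with the other sketches here, the same trick in Node.js, trickling the response out one character at a time:

```javascript
// Sketch: punish a suspected bot by streaming the response absurdly slowly.
const http = require('http');

http.createServer((req, res) => {
  const body = 'This page is in no particular hurry to arrive.';
  let i = 0;
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  const timer = setInterval(() => {
    if (i >= body.length) {
      clearInterval(timer);
      return res.end();
    }
    res.write(body[i++]); // one character per second
  }, 1000);
  req.on('close', () => clearInterval(timer)); // stop if the client gives up
}).listen(3000);
```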
Implement a CAPTCHA so that only humans can view your site.