What are the best measures to protect content from being crawled?
I've been crawling a lot of websites for content recently and I'm surprised that no site so far has been able to put up much resistance. Ideally the site I'm working on shouldn't be this easy to harvest, so I was wondering: what are the best methods to stop bots from harvesting your web content? Obvious solutions:
- Robots.txt (yeah, right)
- IP blacklists
What can be done to catch bot activity? What can be done to make data extraction difficult? What can be done to give them bad data?
Regarding SEO concerns, is there a way to limit access to certain blocks of data (kind of like a <nofollow> block of text)?
Just looking for ideas, no right/wrong answer
Use a client-side decryption/decoding scheme: send encoded data back and rely on JavaScript to decode it into something readable. Crawlers will still get your content, but it will be useless to them (at least until your site becomes big enough that people target it specifically).
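A minimal sketch of what that could look like, assuming the server Base64-encodes each block of text and wraps it in a hypothetical `encoded` class; the page then decodes it in the browser:

```javascript
// Runs in the browser. Assumes the server ships article text Base64-encoded
// inside elements marked with a hypothetical "encoded" class, e.g.:
//   <div class="encoded">SGVsbG8sIHdvcmxkIQ==</div>
// A crawler that never runs JavaScript only ever sees the Base64 blob.
document.addEventListener('DOMContentLoaded', () => {
  for (const el of document.querySelectorAll('.encoded')) {
    // atob() undoes the Base64; TextDecoder turns the bytes back into UTF-8 text.
    const bytes = Uint8Array.from(atob(el.textContent.trim()), c => c.charCodeAt(0));
    el.textContent = new TextDecoder('utf-8').decode(bytes);
  }
});
```

Base64 is trivially reversible, so this only filters out crawlers that don't bother to execute JavaScript; anything driving a real browser still gets the plain text.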
However, why would you want to do that? Do you not want the site to be indexed by search engines?
Trying to stop web scrapers isn't easy. Without a complex, constantly evolving solution, all you can do is raise the bar of difficulty and hope they aren't determined enough to keep going. Some things you can do are:
- Rate limit. Make sure you don't do this based on IP alone, but rather on unique sessions, to avoid blocking users behind a NAT (see the sketch after this list).
- Force users to execute JavaScript to access the page. There are several ways to do this, and it makes scraping significantly harder but still not impossible; plenty of scripting tools (Ruby, Selenium, etc.) let you drive a real web browser.
- IP blacklists. Block proxy servers, Tor exit nodes, Amazon EC2 ranges, etc.
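As a rough illustration of the rate-limiting point, here is a sketch using Node.js with Express and express-session (my stack choice, not anything prescribed by the question; the window and threshold are placeholders):

```javascript
// Sketch: rate limit per session rather than per IP, so users sharing a NAT
// aren't punished collectively. Assumes Node.js + Express + express-session.
const express = require('express');
const session = require('express-session');

const app = express();
app.use(session({ secret: 'change-me', resave: false, saveUninitialized: true }));

const WINDOW_MS = 60 * 1000; // 1-minute sliding window (placeholder)
const MAX_REQUESTS = 60;     // placeholder threshold

app.use((req, res, next) => {
  const now = Date.now();
  // Keep only the timestamps still inside the window, then record this hit.
  req.session.hits = (req.session.hits || []).filter(t => now - t < WINDOW_MS);
  req.session.hits.push(now);
  if (req.session.hits.length > MAX_REQUESTS) {
    return res.status(429).send('Slow down.');
  }
  next();
});

app.get('/', (req, res) => res.send('Normal content'));
app.listen(3000);
```

A scraper that refuses cookies gets a fresh session on every request, so in practice you would combine this with a cookie/JavaScript check like the one further down.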
It is also important to whitelist search engines to avoid SEO / traffic loss. You can whitelist most search engines by looking at their user agent and then verifying the IP it came from (via WHOIS or a reverse DNS lookup).
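For example, Google documents that a genuine Googlebot can be confirmed with a reverse DNS lookup followed by a forward lookup; a sketch of that check (same Node.js assumption as above):

```javascript
// Sketch: confirm that a request claiming to be Googlebot really comes from
// Google, using reverse DNS plus a forward lookup to close the loop.
const dns = require('dns').promises;

async function isRealGooglebot(ip) {
  try {
    const [hostname] = await dns.reverse(ip);
    // Genuine Googlebot hosts end in googlebot.com or google.com.
    if (!/\.(googlebot|google)\.com$/.test(hostname)) return false;
    // Forward-confirm: the hostname must resolve back to the same IP.
    const addresses = await dns.resolve(hostname);
    return addresses.includes(ip);
  } catch {
    return false; // no PTR record, lookup failure, etc.
  }
}
```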
For full disclosure, I am the co-founder of Distil Networks and we offer an anti-scraping solution as a service. That makes me biased, in that I don't believe there is a static answer to your question; you can't do one thing and stop. It is an arms race that you will always have to keep fighting.
Track activity by IP (maybe combined with user agent) and try to detect a bot by the delay between page calls. If too many URLs are requested within a certain interval, start sending back modified content, a redirect, or whatever you had in mind.
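A sketch of that idea as an Express middleware (my stack assumption; the interval, threshold, and the "modified content" are placeholders):

```javascript
// Sketch: track request timestamps per IP + user agent and, once a client asks
// for too many URLs within the interval, quietly change what it gets served.
const express = require('express');
const app = express();

const history = new Map();   // key -> recent timestamps (expire old keys in production)
const WINDOW_MS = 10 * 1000; // placeholder interval
const MAX_HITS = 20;         // placeholder threshold

app.use((req, res, next) => {
  const key = `${req.ip}|${req.get('user-agent') || ''}`;
  const now = Date.now();
  const hits = (history.get(key) || []).filter(t => now - t < WINDOW_MS);
  hits.push(now);
  history.set(key, hits);
  req.suspectedBot = hits.length > MAX_HITS; // routes decide what to do about it
  next();
});

app.get('/article/:id', (req, res) => {
  if (req.suspectedBot) {
    return res.send('Lorem ipsum...'); // or redirect, or scrambled data
  }
  res.send('The real article text');
});

app.listen(3000);
```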
Have JavaScript set a cookie on the client. On the server side, check for the existence of this cookie, and serve your content only if the cookie is present.
If no cookie is present, send a page with JavaScript that sets the cookie and reloads the page.
This should prevent all automated web tools that do not execute any JavaScript.
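A minimal sketch of that cookie check, with the server side again assumed to be Node.js/Express (any server-side language works the same way; the cookie name is a placeholder):

```javascript
// Sketch: serve the real page only when a JavaScript-set cookie is present;
// otherwise send a stub whose script sets the cookie and reloads the page.
const express = require('express');
const app = express();

app.get('/', (req, res) => {
  const hasCookie = (req.headers.cookie || '').includes('js_check=1');
  if (hasCookie) {
    return res.send('The real content');
  }
  res.send(`<!doctype html>
    <script>
      document.cookie = 'js_check=1; path=/';
      location.reload();
    </script>`);
});

app.listen(3000);
```

As noted above, anything driving a real browser (Selenium, a headless browser) passes this check; it only filters out plain HTTP clients.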
You can't prevent crawling if the crawler REALLY wants your content, but you can mess with them.
Ways to detect bots:
- by user agent
- by IP
- by log analysis (most of the time, bots load one page every x seconds)
- have JavaScript load a specific file, e.g. yadda.gif. If a client loaded a given page but never downloaded yadda.gif, it doesn't have JS enabled and the odds are that it's a bot (or is using NoScript). See the sketch after this list.
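A sketch of the beacon idea from the last bullet: the page requests the image from JavaScript, and a later log-analysis pass flags clients that fetched pages but never the beacon (yadda.gif and the path are placeholders from the answer):

```javascript
// Client side: request the beacon once the page has loaded. A client that shows
// up in the page logs but never in the yadda.gif logs is very likely a bot
// (or a user running NoScript).
document.addEventListener('DOMContentLoaded', () => {
  new Image().src = '/yadda.gif?page=' + encodeURIComponent(location.pathname);
});
```

On the server, a periodic job can then diff the set of clients that requested pages against the set that requested /yadda.gif.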
Possible punishments:
- redirect to microsoft.com :-)
- set the output rate really low so it takes forever to download anything. You can do this with a throttling module for Apache (e.g. mod_ratelimit) or with PHP's output buffering functions; a Node.js variant is sketched after this list.
- return gibberish, devowel the content, or something along those lines.
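The throttling suggestion names Apache and PHP; for consistency with the other sketches here, the same trick in Node.js, trickling the response out one character at a time:

```javascript
// Sketch: punish a suspected bot by streaming the response absurdly slowly.
const http = require('http');

http.createServer((req, res) => {
  const body = 'This page is in no particular hurry to arrive.';
  let i = 0;
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  const timer = setInterval(() => {
    if (i >= body.length) {
      clearInterval(timer);
      return res.end();
    }
    res.write(body[i++]); // one character per second
  }, 1000);
  req.on('close', () => clearInterval(timer)); // stop if the client gives up
}).listen(3000);
```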
Implement a CAPTCHA so that only humans can view your site.