How can I allow only the good crawlers (Google, Bing, Yahoo) to access my website content?
I just want to let Google, Bing, and Yahoo crawl my website to build their indexes, but I do not want competing websites to use a crawling service to steal my content. What should I do?
You can prevent Google and the other well-behaved crawlers from indexing your website (they honor robots.txt), but you cannot prevent a malicious crawler from reading it.
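As a sketch, a robots.txt that allows only the major engines might look like the following (the user-agent tokens are the ones each engine documents; a bad bot can simply ignore this file):

```
# Allow the major search engines
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

User-agent: Slurp
Disallow:

# Disallow everyone else (only honored by polite crawlers)
User-agent: *
Disallow: /
```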
Why not try tracking browsing patterns? If you are getting lots of clicks or weird browsing patterns that wouldn't come from a human, throw up a CAPTCHA page.
Try crawling google.com with a custom crawler and see what they do; you can do the same :). Browsing patterns are the key to your problem :).
There are many ways to detect crawlers, but it is difficult to differentiate between the good ones and the bad ones. There is a way, though: place a hidden link on your website. No human visitor will ever follow it, so any request for it comes from a crawler. Then, based on the user agent, exempt the good crawlers from the trap. This will not catch 100% of bad bots, but in my experience it catches more than 70%. I have tried it.
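The honeypot idea above could be sketched like this; the trap path, bot tokens, and function names are my own illustrative assumptions. The page would contain an invisible link such as `<a href="/trap-page" style="display:none">do not follow</a>`, and the server bans any IP that requests it unless the user agent claims to be a known engine:

```python
GOOD_BOT_TOKENS = ("Googlebot", "Bingbot", "Slurp")  # Yahoo's crawler is Slurp

banned_ips = set()

def handle_request(path, ip, user_agent):
    """Return True if the request may proceed, False if it is blocked."""
    if ip in banned_ips:
        return False
    if path == "/trap-page":
        # Good crawlers are exempted by user agent. The header is spoofable,
        # so also list /trap-page in robots.txt, which polite bots obey anyway.
        if not any(tok in user_agent for tok in GOOD_BOT_TOKENS):
            banned_ips.add(ip)
            return False
    return True
```

Note the user-agent check is the weak link: a scraper can claim to be Googlebot, so pairing this with IP verification (see the later answers) makes it stronger.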
I want the world to be able to find me, but I want to be invisible? At least one of us is confused...
There are two types of crawler: 1. Renderless crawlers, which request your website content without executing anything else such as CSS or JavaScript. 2. Rendered crawlers, which behave exactly like the browsers most people use.
To block all crawlers you could put a CAPTCHA on your site, but that is annoying for users. To allow only certain crawlers, add a little script that monitors for bad crawlers using factors such as: 1. The browser user agent. 2. How many pages a single IP address requests on your site in a given period of time. 3. Whether the client can execute JavaScript (not recommended on its own, because Google may use a renderless crawler too).
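Factor 1 (the user agent) is trivially spoofed, so the usual way to verify a claimed Googlebot or Bingbot is a reverse-DNS lookup followed by a forward-DNS confirmation; the major engines document the domains their crawlers resolve to. A sketch, with the domain list and function names as my assumptions:

```python
import socket

# Domains the major engines document for their crawler hosts.
GOOD_BOT_DOMAINS = (".googlebot.com", ".google.com", ".search.msn.com", ".crawl.yahoo.net")

def hostname_is_good_bot(hostname):
    """Check whether a reverse-DNS hostname belongs to a known engine."""
    return hostname.endswith(GOOD_BOT_DOMAINS)

def ip_is_good_bot(ip):
    """Reverse-resolve the IP, then forward-resolve to confirm (network call)."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except socket.herror:
        return False
    if not hostname_is_good_bot(hostname):
        return False
    # Forward-confirm: the hostname must resolve back to the same IP,
    # otherwise anyone controlling their reverse DNS could fake it.
    return ip in socket.gethostbyname_ex(hostname)[2]
```

`ip_is_good_bot` needs live DNS; `hostname_is_good_bot` is the pure allow-list check.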
If someone is out to steal your content they most likely won't care for nor obey the restrictions anyway.
The only option I can think of is identifying where they crawl from and blocking them from seeing the site at all.
It's a complex problem, but it can certainly be solved or at least minimized.
The ideal scenario is to apply AI techniques to identify crawling patterns and keep blocking and banning the offenders. You can treat it as a security threat to your business, but keep in mind that you need to measure the trade-off: spending a lot of money on a perfect solution isn't justified if the primary goal is just to avoid wasted bandwidth. See my point?
I know the question is old, but maybe someone will stop by here and see another point of view.
You need to block the crawlers' IP addresses.
A list of fresh crawler IP addresses:
http://myip.ms/info/bots/Google_Bing_Yahoo_Facebook_etc_Bot_IP_Addresses.html