How can I allow only the good crawlers (Google, Bing, Yahoo) to access my website content?
I just want to let Google, Bing, and Yahoo crawl my website to build their indexes, but I do not want competing websites to use a crawling service to steal my content. What should I do?
You can prevent Google and the other well-behaved crawlers from indexing your website (they honor robots.txt), but you cannot prevent a malicious crawler from reading it.
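As a sketch, a robots.txt that allows only the major engines might look like the following (the user-agent tokens are the ones each engine documents; a bad bot can simply ignore this file):

```
# Allow the major search engines
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

User-agent: Slurp
Disallow:

# Disallow everyone else (only honored by polite crawlers)
User-agent: *
Disallow: /
```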
Why not try tracking browsing patterns? If you are getting lots of clicks or weird browsing patterns that wouldn't come from a human, throw up a CAPTCHA page.
Try crawling google.com with a custom crawler and see what they do; you can do the same :). Browsing patterns are the key to your problem :).
There are many ways to detect crawlers, but it is difficult to differentiate between the good ones and the bad ones. There is a way, though: place a hidden link on your website. No human visitor will ever follow it, so any request for it comes from a crawler. Then, based on the user agent, exempt the good crawlers from the trap. This will not catch 100% of bad bots, but in my experience it catches more than 70%. I have tried it.
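The honeypot idea above could be sketched like this; the trap path, bot tokens, and function names are my own illustrative assumptions. The page would contain an invisible link such as `<a href="/trap-page" style="display:none">do not follow</a>`, and the server bans any IP that requests it unless the user agent claims to be a known engine:

```python
GOOD_BOT_TOKENS = ("Googlebot", "Bingbot", "Slurp")  # Yahoo's crawler is Slurp

banned_ips = set()

def handle_request(path, ip, user_agent):
    """Return True if the request may proceed, False if it is blocked."""
    if ip in banned_ips:
        return False
    if path == "/trap-page":
        # Good crawlers are exempted by user agent. The header is spoofable,
        # so also list /trap-page in robots.txt, which polite bots obey anyway.
        if not any(tok in user_agent for tok in GOOD_BOT_TOKENS):
            banned_ips.add(ip)
            return False
    return True
```

Note the user-agent check is the weak link: a scraper can claim to be Googlebot, so pairing this with IP verification (see the later answers) makes it stronger.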
I want the world to be able to find me, but I want to be invisible? At least one of us is confused...
There are two types of crawler: 1. Renderless crawlers, which request your website content without executing anything else such as CSS or JavaScript. 2. Rendered crawlers, which behave exactly like the browsers most people use.
To block all crawlers you could put a CAPTCHA on your site, but that is annoying for users. To allow only certain crawlers, add a little script that monitors for bad crawlers using factors such as: 1. The browser user agent. 2. How many pages a single IP address requests on your site in a given period of time. 3. Whether the client can execute JavaScript (not recommended on its own, because Google may use a renderless crawler too).
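Factor 1 (the user agent) is trivially spoofed, so the usual way to verify a claimed Googlebot or Bingbot is a reverse-DNS lookup followed by a forward-DNS confirmation; the major engines document the domains their crawlers resolve to. A sketch, with the domain list and function names as my assumptions:

```python
import socket

# Domains the major engines document for their crawler hosts.
GOOD_BOT_DOMAINS = (".googlebot.com", ".google.com", ".search.msn.com", ".crawl.yahoo.net")

def hostname_is_good_bot(hostname):
    """Check whether a reverse-DNS hostname belongs to a known engine."""
    return hostname.endswith(GOOD_BOT_DOMAINS)

def ip_is_good_bot(ip):
    """Reverse-resolve the IP, then forward-resolve to confirm (network call)."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except socket.herror:
        return False
    if not hostname_is_good_bot(hostname):
        return False
    # Forward-confirm: the hostname must resolve back to the same IP,
    # otherwise anyone controlling their reverse DNS could fake it.
    return ip in socket.gethostbyname_ex(hostname)[2]
```

`ip_is_good_bot` needs live DNS; `hostname_is_good_bot` is the pure allow-list check.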
If someone is out to steal your content they most likely won't care for nor obey the restrictions anyway.
The only option I can think of is identifying where they crawl from and blocking them from seeing the site at all.
It's a complex problem, but it can certainly be solved or at least minimized.
The ideal scenario is to apply AI techniques to identify crawling patterns and keep blocking and banning the offenders. You can treat it as a security threat to your business, but keep in mind that you need to measure the trade-off: spending a lot of money on a perfect solution isn't justified if the primary goal is just to avoid wasted bandwidth. See my point?
I know the question is old, but maybe someone will stop by here and see another point of view.
You need to block the crawlers' IP addresses.
A list of fresh crawler IP addresses:
http://myip.ms/info/bots/Google_Bing_Yahoo_Facebook_etc_Bot_IP_Addresses.html