
How to determine whether real users are browsing my site or it's just a crawler, in PHP

I want to know whether a user is actually looking at my site (I know the page is just loaded by the browser and displayed to a human; that does not mean a human is actually looking at it).

I know two methods that may work.

  1. JavaScript.

    If the page was loaded by a browser, it will run the JS code automatically, unless JS is disabled in the browser. Then use AJAX to call back to the server.

  2. A 1×1 transparent image in the HTML.

    Use an img tag to call back to the server (see the sketch after this list).
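
For illustration, here is a minimal sketch of a server-side endpoint that serves both methods: the AJAX callback and the img tag can point at the same URL. The file name beacon.php and the log path are hypothetical.

<?php
// beacon.php: hypothetical endpoint hit by the AJAX callback or the <img> tag.
// Logs the visit, then returns a 1x1 transparent GIF so the same URL also
// works as the tracking-pixel src.

$line = sprintf(
    "%s\t%s\t%s\n",
    date('c'),
    $_SERVER['REMOTE_ADDR'] ?? '-',
    $_SERVER['HTTP_USER_AGENT'] ?? '-'
);
file_put_contents('/tmp/beacon.log', $line, FILE_APPEND);

// 1x1 transparent GIF, base64-encoded.
header('Content-Type: image/gif');
header('Cache-Control: no-store');
echo base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');

In the HTML you would then embed <img src="beacon.php" width="1" height="1" alt=""> or fire an XMLHttpRequest at the same URL.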

Does anyone know the pitfalls of these methods, or a better method?

Also, I don't know how to detect a 0×0 or 1×1 iframe, which could be used to defeat the above methods.


  1. A bot can access a browser, e.g. http://browsershots.org

  2. The bot can request that 1x1 image.

In short, there is no real way to tell. The best you can do is use a CAPTCHA, but that degrades the experience for humans.

Just use a CAPTCHA where it is actually required (user sign-up, etc.).
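
As a rough sketch of the kind of gate that suggests (an arithmetic challenge stands in for a real image CAPTCHA, purely for illustration; production sites should use an established CAPTCHA library or service):

<?php
// captcha.php: hypothetical minimal challenge using a simple sum
// instead of a distorted image, just to show the flow.
session_start();

if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    if (isset($_SESSION['captcha_answer'])
        && (int)($_POST['answer'] ?? -1) === $_SESSION['captcha_answer']) {
        echo 'Looks human.';
    } else {
        echo 'Wrong answer, try again.';
    }
    unset($_SESSION['captcha_answer']); // one attempt per challenge
    exit;
}

$a = random_int(1, 9);
$b = random_int(1, 9);
$_SESSION['captcha_answer'] = $a + $b;

echo "<form method='post'>
        What is $a + $b?
        <input name='answer'>
        <input type='submit' value='Check'>
      </form>";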


I want to know whether a user is actually looking at my site (I know the page is just loaded by the browser and displayed to a human; that does not mean a human is actually looking at it).

The image way seems better, as JavaScript might be turned off by normal users as well. Robots generally don't load images, so this should indeed work. Nonetheless, if you're just looking to filter a known set of robots (say, Google and Yahoo), you can simply check the HTTP User-Agent header, as those robots will actually identify themselves as robots.


You can create a Google Webmasters account; it tells you how to configure your site for bots and also shows how robots will read your website.


I agree with others here, this is really tough. Generally, nice crawlers will identify themselves as crawlers, so using the User-Agent header is a pretty good way to filter out those guys. A good source for user-agent strings can be found at http://www.useragentstring.com. I've used Chris Schuld's PHP script (http://chrisschuld.com/projects/browser-php-detecting-a-users-browser-from-php/) to good effect in the past.

You can also filter these guys at the server level using the Apache config or an .htaccess file, but I've found keeping up with that to be a losing battle.
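
For example, a mod_rewrite fragment in .htaccess can reject requests by user-agent substring; the patterns below are illustrative only, not a real blacklist:

# .htaccess: deny requests whose User-Agent matches the listed substrings.
# Requires mod_rewrite; the pattern is an example, not a maintained blacklist.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (badbot|scrapy|curl) [NC]
RewriteRule .* - [F,L]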

However, if you watch your server logs you'll see lots of suspect activity with valid (browser) user-agents or funky user-agents, so this will only get you so far. You can play the blacklist/whitelist IP game, but that will get old fast.

Lots of crawlers do load images (e.g. Google Image Search), so I don't think that will work all the time.

Very few crawlers have JavaScript engines, so that is probably a good way to differentiate them. And let's face it, how many users actually turn off JavaScript these days? I've seen the stats on that, but I think those stats are heavily skewed by the sheer number of crawlers/bots out there that don't identify themselves. However, one caveat: I have seen that the Google bot does run JavaScript now.

So, bottom line, it's tough. I'd go with a hybrid strategy for sure: if you filter using user-agent, images, IP, and JavaScript, I'm sure you'll get most bots, but expect some to get through despite that.
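
A hedged sketch of what such a hybrid check might look like; the signal weights and the threshold are assumptions for illustration, not a proven recipe:

<?php
// Hypothetical hybrid heuristic combining the signals discussed above.
// Each signal is weak on its own; together they give a rough score, not a verdict.
function botScore(array $server, bool $ranJs, bool $loadedPixel): int {
    $score = 0;
    $ua = strtolower($server['HTTP_USER_AGENT'] ?? '');

    // 1. Self-identified crawlers.
    if (preg_match('/bot|crawl|spider|slurp/', $ua)) {
        $score += 3;
    }
    // 2. No JavaScript beacon received for this session.
    if (!$ranJs) {
        $score += 2;
    }
    // 3. Tracking pixel never requested.
    if (!$loadedPixel) {
        $score += 1;
    }
    return $score; // e.g. treat a score >= 3 as "probably a bot"
}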

Another idea: you could always use a known JavaScript browser quirk to test whether the reported user-agent (if it claims to be a browser) really is that browser.


"Nice" robots like those from google or yahoo will usually respect a robots.txt file. Filtering by useragent might also help.

But in the end, if someone wants automated access, it will be very hard to prevent; you should be sure it is worth the effort.


Inspect the User-Agent header of the HTTP request. Crawlers should set this to something other than a known browser.

Here are the Google bot headers: http://code.google.com/intl/nl-NL/web/controlcrawlindex/docs/crawlers.html

In PHP you can get the user-agent with:

$Uagent = $_SERVER['HTTP_USER_AGENT'];

Then you just compare it with the known headers; as a tip, preg_match() could be handy to do this all in a few lines of code.
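
For example, a minimal sketch of that tip; the pattern list is illustrative, not exhaustive:

// Match a few well-known crawler signatures in one expression.
if (preg_match('/googlebot|bingbot|slurp|baiduspider/i', $Uagent)) {
    // The request came from a self-identified crawler.
}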

