
Best way to detect bot from user agent?

Time goes by, but there's still no perfect solution... Does anyone have a bright idea for differentiating a bot from a human-loaded web page? Is the state of the art still loading a long list of well-known search-engine bots and parsing the USER AGENT?

Testing has to be done before the page is loaded! No GIFs or CAPTCHAs!


If possible, I would try a honeypot approach to this one. It will be invisible to most users and will discourage many bots, though not the determined ones, since they could implement special code for your site that just skips the honeypot field once they figure out your game. But that would take far more attention from the bot's owners than it is probably worth for most of them; there will be tons of other sites accepting spam without any additional effort on their part.

One thing that gets skipped over from time to time: it is important to let the bot think that everything went fine. No error messages or denial pages; just reload the page as you would for any other user, except skip adding the bot's content to the site. This way there are no red flags to be picked up in the bot's logs and acted upon by its owner, and it will take much more scrutiny to figure out that you are silently dropping the comments. A sketch of the idea follows.
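A minimal sketch of that honeypot, assuming a Flask app; the field name website_hp and the save_comment() helper are hypothetical:

    # Honeypot sketch (assumes Flask; "website_hp" and save_comment() are hypothetical).
    # The extra field is hidden from humans with CSS, so mostly bots fill it in.
    from flask import Flask, request, redirect

    app = Flask(__name__)

    FORM_HTML = """
    <form method="post" action="/comment">
      <textarea name="comment"></textarea>
      <input type="text" name="website_hp" style="display:none" tabindex="-1" autocomplete="off">
      <input type="submit" value="Post">
    </form>
    """

    @app.route("/comment", methods=["GET", "POST"])
    def comment():
        if request.method == "POST":
            if not request.form.get("website_hp"):
                save_comment(request.form["comment"])  # your real storage call
            # Honeypot or not, respond exactly the same way: no red flags in the bot's logs.
            return redirect("/comment")
        return FORM_HTML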


Without a challenge (like a CAPTCHA), you're just shooting in the dark. The user agent can trivially be set to any arbitrary string.
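To illustrate how trivial that is, here's a one-off sketch using Python's requests library; the URL is a placeholder:

    # Any client can claim to be any browser; the server cannot verify the header.
    import requests

    resp = requests.get(
        "https://example.com/",
        headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                               "AppleWebKit/537.36 (KHTML, like Gecko) "
                               "Chrome/120.0 Safari/537.36"},
    )
    print(resp.status_code)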


What the others have said is true to an extent: if a bot-maker wants you to think a bot is a genuine user, there's no way to prevent that. But many of the popular search engines do identify themselves. There's a list here (http://www.jafsoft.com/searchengines/webbots.html), among other places. You could load these into a database and match incoming user agents against them. I seem to remember that it's against Google's guidelines to serve custom pages to their bots, though.
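That lookup can be as simple as substring matching; the token list below is just an illustration, not the jafsoft list:

    # Naive user-agent matching against known crawler tokens.
    # Substring matching, since bot UA strings vary in version details.
    KNOWN_BOT_TOKENS = ["googlebot", "bingbot", "slurp", "duckduckbot", "baiduspider"]

    def is_known_bot(user_agent):
        ua = user_agent.lower()
        return any(token in ua for token in KNOWN_BOT_TOKENS)

    print(is_known_bot(
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    ))  # True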


The user agent is set by the client and can therefore be manipulated. A malicious bot certainly would not send you an 'I-Am-MalBot' user agent; it would call itself some version of IE. Using the user agent to prevent spam or anything similar is therefore pointless.

So, what do you want to do? What's your final goal? If we knew that, we could be of better help.


The creators of SO should know why they are using a CAPTCHA to prevent bots from editing content. The reason is that there is actually no way to be sure a client is not a bot, and I think there never will be.


I code web crawlers myself for different purposes, and I use a web browser's UserAgent.

As far as I know, you cannot distinguish bots from humans if the bot is using a legitimate UserAgent, like:

    Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.11 (KHTML, like Gecko) Chrome/9.0.570.1 Safari/534.11

The only thing I can think of is JavaScript. Most custom web bots (like those I code) can't execute JavaScript, because that's a browser's job. But if the bot is hooked into or driving a real web browser (like Firefox), it will go undetected.
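One way to exploit that gap: have the page set a token via JavaScript and check for it on the next request. A sketch; the cookie name js_ok is made up, and this only catches bots that don't execute JS:

    # The page embeds this snippet; only JS-capable clients will set the cookie.
    JS_SNIPPET = '<script>document.cookie = "js_ok=1; path=/";</script>'

    def looks_like_browser(cookies):
        # True only if the client executed the page's JavaScript on a prior request.
        return cookies.get("js_ok") == "1"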


I'm sure I'm going to get voted down for this, but I had to post it.

In any case, CAPTCHAs are the best way right now to protect against bots, short of approving all user-submitted content.

-- Edit --

I just noticed your P.S., and I'm not sure of any way to diagnose a bot without interacting with it. Your best bet in this case might be to catch the bot as early as possible and impose a one-month IP restriction, during which the bot should give up if you constantly return HTTP 404 to it. Bots are often run from a server and don't change their IP, so this should work as a mediocre approach.
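A rough sketch of that kind of expiring blocklist, with an in-memory dict standing in for whatever storage (database, firewall rules) you would really use:

    # Expiring IP blocklist sketch; a real setup would persist this somewhere.
    import time

    BLOCK_SECONDS = 30 * 24 * 3600  # roughly one month
    blocked = {}  # ip -> unblock timestamp

    def block(ip):
        blocked[ip] = time.time() + BLOCK_SECONDS

    def is_blocked(ip):
        until = blocked.get(ip)
        if until is None:
            return False
        if time.time() >= until:
            del blocked[ip]  # restriction expired
            return False
        return True  # caller should answer this client with HTTP 404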


I would suggest using Akismet, a spam-prevention service, rather than any sort of CAPTCHA or CSS trick, because it is excellent at catching spam without ruining the user experience.
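For reference, Akismet exposes a simple REST endpoint for this; a minimal comment-check call might look like the sketch below (you need your own API key, the sample values are placeholders, and the full parameter list is in Akismet's API docs):

    # Minimal Akismet comment-check sketch; key and values are placeholders.
    import requests

    def is_spam(api_key, blog_url, user_ip, user_agent, content):
        resp = requests.post(
            "https://%s.rest.akismet.com/1.1/comment-check" % api_key,
            data={
                "blog": blog_url,
                "user_ip": user_ip,
                "user_agent": user_agent,
                "comment_content": content,
            },
        )
        return resp.text == "true"  # Akismet answers "true" for spam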


Honest bots, such as search engines, will typically request your robots.txt. From that you can learn their user-agent string and add it to your bot list.

Clearly this doesn't help with malicious bots pretending to be human, but for some applications it could be good enough, if all you want to do is filter search-engine bots out of your logs (for example).
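A quick way to harvest those user agents from a combined-format access log; the log path and format are assumptions:

    # Collect user agents of clients that requested /robots.txt from a
    # combined-format access log (last quoted field is the user agent).
    import re

    LINE_RE = re.compile(r'"(?:GET|HEAD) /robots\.txt [^"]*".*"([^"]*)"$')

    bot_agents = set()
    with open("/var/log/nginx/access.log") as log:
        for line in log:
            m = LINE_RE.search(line)
            if m:
                bot_agents.add(m.group(1))

    for ua in sorted(bot_agents):
        print(ua)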
