Regexp that matches user-agents of end-user browsers but NOT crawlers with >90 % accuracy
I'm trying to construct a regexp that evaluates to true for the User-Agent strings of "browsers navigated by humans", but false for bots. Needless to say, the matching will not be exact, but if it gets things right in, say, 90 % of cases, that is more than good enough.
My approach so far is to target the User-Agent string of the five major desktop browsers (MSIE, Firefox, Chrome, Safari, Opera). Specifically, I want the regexp NOT to match if the user-agent is a bot (Googlebot, msnbot, etc.).
Currently I'm using the following regexp, which appears to achieve the desired precision:
^(Mozilla.*(Gecko|KHTML|MSIE|Presto|Trident)|Opera).*$
I've observed a small number of false negatives, which are mostly mobile browsers. The exceptions all match:
(BlackBerry|HTC|LG|MOT|Nokia|NOKIAN|PLAYSTATION|PSP|SAMSUNG|SonyEricsson)
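For illustration, here is a minimal Python sketch of how the two patterns above could be combined, so that a User-Agent counts as a browser if it matches either the main regexp or the mobile exceptions. The function name `looks_like_browser` and the sample strings are assumptions made for this example, not part of the original setup.

```python
import re

# Main pattern from the question (desktop browsers).
BROWSER_RE = re.compile(r"^(Mozilla.*(Gecko|KHTML|MSIE|Presto|Trident)|Opera).*$")

# Pattern matching the observed false negatives (mostly mobile browsers).
MOBILE_RE = re.compile(
    r"(BlackBerry|HTC|LG|MOT|Nokia|NOKIAN|PLAYSTATION|PSP|SAMSUNG|SonyEricsson)"
)


def looks_like_browser(user_agent):
    """Hypothetical helper: True if either pattern matches the User-Agent."""
    return bool(BROWSER_RE.search(user_agent) or MOBILE_RE.search(user_agent))


# Illustrative sample strings (assumed, not taken from real traffic logs).
samples = [
    "Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0",
    "Nokia6300/2.0 (05.00) Profile/MIDP-2.0 Configuration/CLDC-1.1",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "msnbot/2.0b (+http://search.msn.com/msnbot.htm)",
]
for ua in samples:
    print(looks_like_browser(ua), ua)
```

Note that both patterns are case-sensitive here, as in the question. Short alternatives such as LG or MOT can match unrelated strings, so the mobile pattern is worth checking against real logs before relying on it.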
My question is: Given the desired accuracy level, how would you improve the regexp? Can you think of any major false positives or false negatives to the given regexp?
Please note that the question is specifically about regexp-based User-Agent matching. There are a bunch of other approaches to solving this problem, but those are out of the scope of this question.
You could construct a blacklist by checking which user agents access robots.txt.
Many crawlers don’t send an Accept-Language header, while AFAIK all browsers do. You could combine this information with your regex to get more accurate results.
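A rough sketch of combining the Accept-Language check with the regexp from the question might look like this in Python. It assumes the request headers are available as a plain dict; the helper name `is_probably_human` is made up for this example, and how you obtain the headers depends on your server or framework.

```python
import re

# Regexp from the question.
BROWSER_RE = re.compile(r"^(Mozilla.*(Gecko|KHTML|MSIE|Presto|Trident)|Opera).*$")


def is_probably_human(headers):
    """Hypothetical helper: require a browser-like User-Agent AND a
    non-empty Accept-Language header."""
    # Header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    user_agent = normalized.get("user-agent", "")
    accept_language = normalized.get("accept-language", "").strip()
    return bool(BROWSER_RE.search(user_agent)) and bool(accept_language)


# A browser-like request with a language preference.
print(is_probably_human({
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0",
    "Accept-Language": "en-US,en;q=0.5",
}))

# Same User-Agent, but no Accept-Language header at all.
print(is_probably_human({
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0",
}))
```

Requiring both signals trades some false negatives for fewer false positives, since a bot that spoofs a browser User-Agent but omits Accept-Language is rejected.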
I'd rather do the opposite: a pattern for bots is much simpler to write.
Personally, I use the following regex:
/bot\b|index|spider|crawl|wget|slurp|Mediapartners-Google/i
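As a sketch, the same pattern could be used from Python like this, with the `/i` flag translated to `re.IGNORECASE`. The function name `is_bot` and the sample User-Agent strings are assumptions for illustration only.

```python
import re

# Bot pattern from above; re.IGNORECASE corresponds to the /i flag.
BOT_RE = re.compile(
    r"bot\b|index|spider|crawl|wget|slurp|Mediapartners-Google",
    re.IGNORECASE,
)


def is_bot(user_agent):
    """Hypothetical helper: True if the User-Agent looks like a crawler."""
    return BOT_RE.search(user_agent) is not None


# Illustrative sample strings (assumed, not taken from real traffic logs).
samples = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)",
    "Wget/1.12 (linux-gnu)",
    "Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0",
]
for ua in samples:
    print(is_bot(ua), ua)
```

Anything that does not match is then treated as a human-driven browser, so unknown crawlers that avoid these keywords will slip through.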