
Analysing an algorithm, possibly based on regular intervals, to check for bots and spiders

I'm trying to build a script which shows me a list of IPs that are bots/spiders.

I wrote a script which imports the Apache access log into a MySQL database, so I can try to manage it with PHP and MySQL.

I've noticed a lot of bots have regular intervals: they send out a request every 2 or 3 seconds. Is there an easy way of showing these patterns with a query or PHP script? Or, even harder I think, is there an algorithm that can recognise these bots/spiders?

DB:

CREATE TABLE IF NOT EXISTS `access_log` (
  `IP` varchar(16) NOT NULL,
  `datetime` datetime NOT NULL,
  `method` varchar(255) NOT NULL,
  `status` varchar(255) NOT NULL,
  `referrer` varchar(255) NOT NULL,
  `agent` varchar(255) NOT NULL,
  `site` smallint(6) NOT NULL
);


Official bots will identify themselves. There's a list at http://www.robotstxt.org/db.html
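As a starting point, you can flag the self-identifying crawlers straight from the `agent` column you already store. Here's a minimal sketch in Python; the token list is a small illustrative subset of well-known crawlers, not the full robotstxt.org database:

```python
# Common crawler tokens -- an illustrative subset; the robotstxt.org
# database linked above is far more complete.
BOT_TOKENS = ("googlebot", "bingbot", "yandex", "baiduspider", "slurp",
              "duckduckbot", "crawler", "spider")

def is_declared_bot(user_agent):
    """True if the User-Agent string identifies itself as a known bot."""
    ua = user_agent.lower()
    return any(token in ua for token in BOT_TOKENS)
```

You could run this over the distinct `agent` values in `access_log` and exclude the matching IPs before looking for the unofficial ones below.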

For the unofficial ones I guess you could try looking for some of the following:

  • Page requests with no other resource requests (images, CSS, JavaScript, etc.)
  • Strange URL requests (lots of requests for login pages, especially ones that don't exist, such as wp-admin on a Drupal site)
  • Successive page views in a short amount of time
  • Exactly the same URL signatures coming from many different IPs
  • No HTTP referrer for IPs that you've never seen before
  • Lots of comment posts in a short session
  • Requests from public proxy servers

Those are some of the things I've noticed about the annoying ba***s that keep trying to scrape and spam my site, anyway. Some of these signals would probably need to be combined in order to filter out real requests with the same characteristics.
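For the regular-interval pattern the question asks about, one simple approach is to group requests by IP, compute the gaps between consecutive timestamps, and flag IPs whose gaps have a very low standard deviation (a human clicking around produces irregular gaps; a bot polling every 2-3 seconds produces near-constant ones). A sketch in Python, assuming you've fetched `(IP, datetime)` pairs from the `access_log` table; the thresholds `min_requests` and `max_stddev` are arbitrary starting points you'd want to tune:

```python
from datetime import datetime, timedelta
from statistics import pstdev

def regular_interval_ips(rows, min_requests=10, max_stddev=1.0):
    """Return IPs whose inter-request gaps are suspiciously regular.

    rows: iterable of (ip, datetime) tuples, e.g. fetched from the
    access_log table. An IP is flagged when it made at least
    min_requests requests and the standard deviation of the gaps
    between consecutive requests is at most max_stddev seconds.
    """
    by_ip = {}
    for ip, ts in rows:
        by_ip.setdefault(ip, []).append(ts)

    suspects = []
    for ip, times in by_ip.items():
        if len(times) < min_requests:
            continue
        times.sort()
        # Gaps in seconds between consecutive requests from this IP.
        gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
        if pstdev(gaps) <= max_stddev:
            suspects.append(ip)
    return suspects
```

The same grouping could be done in SQL (e.g. with `UNIX_TIMESTAMP` differences per IP) and the filtering in PHP; the variance-of-gaps idea stays the same either way.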
