Analysing an algorithm, possibly based on regular intervals, to check for bots and spiders
I'm trying to build a script that shows me a list of IPs that are bots/spiders.
I wrote a script that imports the Apache access log into a MySQL database so I can try to manage it with PHP and MySQL.
I've noticed that a lot of bots request at regular intervals: they send out a request every 2 or 3 seconds. Is there an easy way of showing these patterns with a query or a PHP script? Or, even harder I think, is there an algorithm that can recognise these bots/spiders?
DB:
CREATE TABLE IF NOT EXISTS `access_log` (
`IP` varchar(16) NOT NULL,
`datetime` datetime NOT NULL,
`method` varchar(255) NOT NULL,
`status` varchar(255) NOT NULL,
`referrer` varchar(255) NOT NULL,
`agent` varchar(255) NOT NULL,
`site` smallint(6) NOT NULL
);
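One way to turn the "request every 2 or 3 seconds" observation into a test: for each IP, compute the gaps between consecutive request timestamps and check whether they are nearly constant (a low coefficient of variation). Here's a minimal sketch of that idea in Python; the same logic could be ported to PHP, or computed in SQL over the `access_log` table. The threshold and minimum-request values are guesses, not tuned figures.

```python
from statistics import mean, pstdev

def looks_regular(timestamps, max_cv=0.2, min_requests=5):
    """Flag an IP as bot-like when its request intervals are nearly constant.

    timestamps: sorted request times (seconds) for a single IP.
    max_cv: maximum coefficient of variation (stdev / mean) of the gaps;
            a value near zero means metronome-like timing. The 0.2
            threshold is an assumption to be tuned against real logs.
    """
    if len(timestamps) < min_requests:
        return False  # not enough data to judge
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    avg = mean(gaps)
    if avg == 0:
        return True  # bursts faster than the clock resolution
    return pstdev(gaps) / avg <= max_cv

# A crawler hitting every 3 seconds vs. a human browsing irregularly:
bot = [0, 3, 6, 9, 12, 15]
human = [0, 4, 30, 31, 120, 500]
print(looks_regular(bot))    # True
print(looks_regular(human))  # False
```

You would feed this the `datetime` values per IP (ordered), converted to seconds. Humans do sometimes click rhythmically, so this signal works best combined with others rather than on its own.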
Official bots will identify themselves. There's a list at http://www.robotstxt.org/db.html
For the unofficial ones I guess you could try looking for some of the following:
- Page requests with no other resource requests (images, CSS, JavaScript, etc.)
- Strange URL requests (lots of requests for login pages, especially ones that don't exist, such as wp-admin on a Drupal site)
- Many successive page views in a short amount of time
- Exactly the same URL signatures coming from many different IPs
- No HTTP referrer for IPs that you've never seen before
- Lots of comment posts in a short session
- Requests from public proxy servers
Those are some of the things I've noticed about the annoying ba***s that keep trying to scrape and spam my site, anyway. Some of these signals would probably need to be combined in order to filter out real requests with the same characteristics.
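Combining the signals above could be as simple as a weighted score per IP session, with a cutoff above which you treat the visitor as a bot. The sketch below is illustrative only: the field names, weights, and threshold are all assumptions, and you'd derive the inputs from queries against the `access_log` table.

```python
def bot_score(session):
    """Sum weighted heuristic signals for one IP's session.

    `session` is a dict of precomputed per-IP facts; weights are
    illustrative assumptions, not measured values.
    """
    score = 0
    if session["pages"] > 0 and session["assets"] == 0:
        score += 2  # fetched HTML but never images/CSS/JS
    if session["probed_missing_admin_urls"]:
        score += 3  # e.g. requesting wp-admin on a Drupal site
    if session["pages_per_minute"] > 30:
        score += 2  # paging faster than a human can read
    if not session["has_referrer"] and session["first_seen"]:
        score += 1  # brand-new IP arriving with no referrer
    return score

session = {"pages": 40, "assets": 0, "probed_missing_admin_urls": True,
           "pages_per_minute": 55, "has_referrer": False, "first_seen": True}
print(bot_score(session))  # 8 -- well above a human-like score
```

The advantage of scoring over hard rules is that no single signal blocks a legitimate visitor: a human on a fresh IP with no referrer scores 1, while a scraper typically trips several signals at once.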