BOT/Spider Trap Ideas

2023-01-17 23:45 Q&A author:

I have a client whose domain seems to be getting hit pretty hard by what appears to be a DDoS. In the logs it's normal looking user agents with random IPs but they're flipping through pages too fast to be human. They also don't appear to be requesting any images. I can't seem to find any pattern and my suspicion is it's a fleet of Windows Zombies.

The client has had issues with spam attacks in the past; they even had to point their MX records at Postini to stop the 6.7 GB/day of junk server-side.

I want to set up a bot trap in a directory disallowed by robots.txt. I've just never attempted anything like this before, so I'm hoping someone out there has some creative ideas for trapping bots!

EDIT: I already have plenty of ideas for catching one. The question is what to do to it when it lands in the trap.

You can set up a PHP script whose URL is explicitly forbidden by robots.txt. In that script, you can pull the source IP of the suspected bot hitting you (via $_SERVER['REMOTE_ADDR']), and then add that IP to a database blacklist table.

Then, in your main app, you can check the source IP, do a lookup for that IP in your blacklist table, and if you find it, throw a 403 page instead. (Perhaps with a message like, "We've detected abuse coming from your IP, if you feel this is in error, contact us at ...")

On the upside, you get automatic blacklisting of bad bots. On the downside, it's not terribly efficient, and it can be dangerous. (One person innocently checking that page out of curiosity can result in the ban of a large swath of users.)
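As a rough illustration of the trap-plus-lookup idea above, here is a minimal sketch using PDO. The table name `blacklist`, its columns, and the three function names are assumptions made for this example, and the `INSERT OR IGNORE` statement is SQLite syntax, so another database would need its own equivalent.

```php
<?php
// Sketch only: the table, column, and function names are assumptions for this example.

/** Create the blacklist table if it does not exist yet. */
function init_blacklist(PDO $db): void
{
    $db->exec(
        'CREATE TABLE IF NOT EXISTS blacklist (
            ip TEXT PRIMARY KEY,
            added_at INTEGER NOT NULL
        )'
    );
}

/** Called from the trap URL: record the caller's IP (no-op if already listed). */
function blacklist_ip(PDO $db, string $ip): void
{
    // INSERT OR IGNORE is SQLite-specific; other databases have their own form.
    $stmt = $db->prepare('INSERT OR IGNORE INTO blacklist (ip, added_at) VALUES (?, ?)');
    $stmt->execute([$ip, time()]);
}

/** Called from the main app: true if this IP has hit the trap before. */
function is_blacklisted(PDO $db, string $ip): bool
{
    $stmt = $db->prepare('SELECT 1 FROM blacklist WHERE ip = ?');
    $stmt->execute([$ip]);
    return $stmt->fetchColumn() !== false;
}

// In the trap script:   blacklist_ip($db, $_SERVER['REMOTE_ADDR']);
// In the main app:      if (is_blacklisted($db, $_SERVER['REMOTE_ADDR'])) {
//                           http_response_code(403);
//                           exit('Abuse detected from your IP.');
//                       }
```

Keeping the lookup in one prepared statement on a primary-key column keeps the per-request cost small, though it is still a database hit on every page view.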

Edit: Alternatively (or additionally, I suppose) you can fairly simply add a GeoIP check to your app, and reject hits based on country of origin.
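A country check could be wired in roughly like this. PHP has no built-in GeoIP lookup, so `$lookupCountry` is a hypothetical callable standing in for whatever GeoIP database or extension you actually use; the function name and the idea of returning a two-letter country code (or null when unknown) are assumptions of this sketch.

```php
<?php
// Sketch only: $lookupCountry is a hypothetical stand-in for a real GeoIP lookup.

/**
 * Returns true when the IP resolves to one of the blocked country codes.
 *
 * @param callable $lookupCountry   fn(string $ip): ?string, two-letter code or null
 * @param string[] $blockedCountries upper-case two-letter country codes
 */
function is_country_blocked(string $ip, callable $lookupCountry, array $blockedCountries): bool
{
    $country = $lookupCountry($ip);
    if ($country === null) {
        return false; // unknown origin: let it through rather than guess
    }
    return in_array(strtoupper($country), $blockedCountries, true);
}
```

Passing the lookup in as a callable keeps the rejection logic independent of the GeoIP source, which also makes it easy to swap or test.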

What you can do is get another box (a kind of sacrificial lamb) that is not on the same pipe as your main host, then have it host a page that redirects to itself, but with a randomized page name in the URL. This could get the bot stuck in an infinite loop, tying up the CPU and bandwidth on your sacrificial lamb but not on your main box.
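The self-redirecting page could look something like the following sketch. The `/trap/` prefix, the `.html` suffix, and the function name are assumptions for illustration; note also that an HTTP client which caps the number of redirects it follows will simply give up after a few hops, so this only ties up naive crawlers.

```php
<?php
// Sketch only: the /trap/ prefix and the function name are assumptions.

/** Build a fresh, random URL under the trap prefix for the next redirect hop. */
function next_trap_url(string $prefix = '/trap/'): string
{
    // 8 random bytes become 16 hex characters, so every hop gets a new page name.
    return $prefix . bin2hex(random_bytes(8)) . '.html';
}

// On the sacrificial host, every request under the prefix would answer with:
//   header('Location: ' . next_trap_url(), true, 302);
//   exit;
```

The web server on that box would need a rewrite rule so that every path under the prefix is served by this one script.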

I tend to think this is a problem better solved with network security more so than coding, but I see the logic in your approach/question.

There are a number of questions and discussions about this on Server Fault that may be worth investigating.

https://serverfault.com/search?q=block+bots

Well, I must say I'm kind of disappointed; I was hoping for some creative ideas. I did find the ideal solution here: http://www.kloth.net/internet/bottrap.php

<html>
    <head><title> </title></head>
    <body>
    <p>There is nothing here to see. So what are you doing here ?</p>
    <p><a href="http://your.domain.tld/">Go home.</a></p>
    <?php
      /* whitelist: end processing and exit */
      if ($_SERVER['REMOTE_ADDR'] === '10.22.33.44') { exit; }
      if (preg_match("/Super Tool/", $_SERVER['HTTP_USER_AGENT'] ?? '')) { exit; }
      /* end of whitelist */
      $badbot = 0;
      /* scan the blacklist.dat file for addresses of SPAM robots
         to prevent filling it up with duplicates */
      $filename = "../blacklist.dat";
      $fp = fopen($filename, "r") or die ("Error opening file ... <br>\n");
      while ($line = fgets($fp,255)) {
        $u = explode(" ",$line);
        $u0 = $u[0];
        /* exact match: a regex built from the address treats dots as wildcards and matches substrings */
        if (trim($u0) === $_SERVER['REMOTE_ADDR']) {$badbot++;}
      }
      fclose($fp);
      if ($badbot == 0) { /* we just see a new bad bot not yet listed ! */
      /* send a mail to hostmaster */
        $tmestamp = time();
        $datum = date("Y-m-d (D) H:i:s",$tmestamp);
        $from = "badbot-watch@domain.tld";
        $to = "hostmaster@domain.tld";
        $subject = "domain-tld alert: bad robot";
        $msg = "A bad robot hit {$_SERVER['REQUEST_URI']} $datum \n";
        $msg .= "address is {$_SERVER['REMOTE_ADDR']}, agent is {$_SERVER['HTTP_USER_AGENT']}\n";
        mail($to, $subject, $msg, "From: $from");
      /* append bad bot address data to blacklist log file: */
        $referer = $_SERVER['HTTP_REFERER'] ?? '-';   /* these headers are often absent */
        $agent = $_SERVER['HTTP_USER_AGENT'] ?? '-';
        $fp = fopen($filename,'a+');
        fwrite($fp,"{$_SERVER['REMOTE_ADDR']} - - [$datum] \"{$_SERVER['REQUEST_METHOD']} {$_SERVER['REQUEST_URI']} {$_SERVER['SERVER_PROTOCOL']}\" $referer $agent\n");
        fclose($fp);
      }
    ?>
    </body>
</html>

Then, to protect your pages, put <?php include($_SERVER['DOCUMENT_ROOT'] . "/blacklist.php"); ?> on the first line of every page. blacklist.php contains:

<?php
    $badbot = 0;
    /* look for the IP address in the blacklist file */
    $filename = "../blacklist.dat";
    $fp = fopen($filename, "r") or die ("Error opening file ... <br>\n");
    while ($line = fgets($fp,255))  {
      $u = explode(" ",$line);
      $u0 = $u[0];
      /* exact match: a regex built from the address treats dots as wildcards and matches substrings */
      if (trim($u0) === $_SERVER['REMOTE_ADDR']) {$badbot++;}
    }
    fclose($fp);
    if ($badbot > 0) { /* this is a bad bot, reject it */
      sleep(12);
      print ("<html><head>\n");
      print ("<title>Site unavailable, sorry</title>\n");
      print ("</head><body>\n");
      print ("<center><h1>Welcome ...</h1></center>\n");
      print ("<p><center>Unfortunately, due to abuse, this site is temporarily not available ...</center></p>\n");
      print ("<p><center>If you feel this is in error, send a mail to the hostmaster at this site,<br>
             if you are an anti-social ill-behaving SPAM-bot, then just go away.</center></p>\n");
      print ("</body></html>\n");
      exit;
    }
?>

I plan to take Scott Chamberlain's advice, and to be safe I plan to implement a Captcha on the script. If the user answers correctly, it'll just die or redirect back to the site root. Just for fun, I'm putting the trap in a directory named /admin/ and of course adding Disallow: /admin/ to robots.txt.

EDIT: In addition, I am redirecting any bot that ignores the rules to this page: http://www.seastory.us/bot_this.htm

You could first take a look at where the IPs are coming from. My guess is that they are all coming from one country, like China or Nigeria, in which case you could set up something in .htaccess to disallow all IPs from those two countries. As for creating a trap for bots, I haven't the slightest idea.
