
What are some methods I can use to detect robots?

Just because software is automated doesn't mean it will abide by your robots.txt. What methods are available to detect when someone is crawling or DDoSing your website? Assume your site has hundreds of thousands of pages and is worth crawling or DDoSing.

Here's a dumb idea I had that probably doesn't work: give each user a cookie with a unique value, and use the cookie to know when someone is making second/third/etc. requests. This probably doesn't work because crawlers probably don't accept cookies, and so in this scheme a robot would look like a new user on every request.

Does anyone have better ideas?


You could put links in your pages that are not visible to, or clickable by, end users. Many bots simply follow every link they find. Once someone requests one of those links, you almost certainly have a crawler/robot.
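For example, here's a minimal sketch of the idea (the trap URL and class name are hypothetical): render a link that real users never see or click, e.g. an anchor styled display:none or positioned off-screen, Disallow that URL in robots.txt so well-behaved crawlers skip it, and flag any IP that requests it anyway.

using System;
using System.Collections.Concurrent;

// Sketch of a trap-link tracker. Wire the hidden URL (say, /dont-follow-me)
// to call RegisterHit, and check IsFlagged on normal requests to decide
// whether to block or throttle that client.
public class TrapLinkDetector
{
    // IP address -> time of the first trap hit
    private readonly ConcurrentDictionary<string, DateTime> _trapped =
        new ConcurrentDictionary<string, DateTime>();

    // Called from the handler behind the hidden link.
    public void RegisterHit(string clientIp)
    {
        _trapped.TryAdd(clientIp, DateTime.UtcNow);
    }

    // Called on ordinary requests.
    public bool IsFlagged(string clientIp)
    {
        return _trapped.ContainsKey(clientIp);
    }
}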


Project Honey Pot keeps a list of 'bad' bots.

Here's a class I wrote to contact their web service. You'll have to modify it somewhat, since it uses a couple of proprietary libraries, but mostly it should be good to go. Their service sometimes sends back errors, but it does help cut down on some of the bad traffic.

using System;
using System.Linq;
using System.Net;
using System.Xml.Linq;
using SeaRisenLib2.Text;
using XmlLib;

/// <summary>
/// Summary description for HoneyPot
/// </summary>
public class HoneyPot
{
    private const string KEY = "blacklistkey"; // blacklist key - need to register at httpbl.org to get it
    private const string HTTPBL = "dnsbl.httpbl.org"; // blacklist lookup host
    public HoneyPot()
    {
    }

    public static Score GetScore_ByIP(string ip)
    {
        string sendMsg = "", receiveMsg = "";
        int errorCount = 0; // track where in try/catch we fail for debugging
        try
        {
            // for testing: ip = "188.143.232.31";
            //ip = "173.242.116.72";
            if ("127.0.0.1" == ip) return null; // localhost development computer
            IPAddress address;
            if (!IPAddress.TryParse(ip, out address))
                throw new Exception("Invalid IP address to HoneyPot.GetScore_ByIP:" + ip);
            errorCount++; // 1
            string reverseIP = ip.ToArray('.').Reverse().ToStringCSV(".");
            sendMsg = string.Format("{0}.{1}.{2}", KEY, reverseIP, HTTPBL);
            errorCount++; // 2
            //IPHostEntry value = Dns.GetHostByName(sendMsg);
            IPHostEntry value = Dns.GetHostEntry(sendMsg);
            errorCount++; // 3
            address = value.AddressList[0];
            errorCount++; // 4
            receiveMsg = address.ToString();
            errorCount++; // 5
            int[] ipArray = receiveMsg.ToArray('.').Select(s => Convert.ToInt32(s)).ToArray();
            errorCount++; // 6
            if (127 != ipArray[0]) // error
                throw new Exception("HoneyPot error");
            errorCount++; // 7
            Score score = new Score()
            {
                DaysSinceLastSeen = ipArray[1],
                Threat = ipArray[2],
                BotType = ipArray[3]
            };
            errorCount++; // 8
            return score;
        }
        catch (Exception ex)
        {
            Log.Using("VisitorLog/HoneyPotErrors", log =>
            {
                log.SetString("IPrequest", ip);
                log.SetString("SendMsg", sendMsg, XmlFile.ELEMENT);
                log.SetString("RecvMsg", receiveMsg, XmlFile.ELEMENT);
                log.SetString("Exception", ex.Message, XmlFile.ELEMENT);
                log.SetString("ErrorCount", errorCount.ToString());
            });
        }
        return null;
    }

    // Bitwise values
    public enum BotTypeEnum : int
    {
        SearchEngine = 0,
        Suspicious = 1,
        Harvester = 2,
        CommentSpammer = 4
    }

    public class Score
    {
        public Score()
        {
            BotType = -1;
            DaysSinceLastSeen = -1;
            Threat = -1;
        }

        public int DaysSinceLastSeen { get; internal set; }
        public int Threat { get; internal set; }
        /// <summary>
        /// Use BotTypeEnum to understand value.
        /// </summary>
        public int BotType { get; internal set; }

        /// <summary>
        /// Convert HoneyPot Score values to String (DaysSinceLastSeen.Threat.BotType)
        /// </summary>
        /// <returns></returns>
        public override string ToString()
        {
            return string.Format("{0}.{1}.{2}",
                DaysSinceLastSeen,
                Threat,
                BotType);
        }

        public static explicit operator XElement(Score score)
        {
            XElement xpot = new XElement("HoneyPot");
            if (null != score)
            {
                if (score.DaysSinceLastSeen >= 0)
                    xpot.SetString("Days", score.DaysSinceLastSeen);
                if (score.Threat >= 0)
                    xpot.SetString("Threat", score.Threat);
                if (score.BotType >= 0)
                    xpot.SetString("Type", score.BotType);
                foreach (BotTypeEnum t in Enum.GetValues(typeof(BotTypeEnum)))
                {
                    // Log enum values as string for each bitwise value represented in score.BotType
                    int value = (int)t;
                    if ((value == score.BotType) || ((value & score.BotType) > 0))
                        xpot.GetCategory(t.ToString());
                }
            }
            return xpot;
        }

        public static explicit operator Score(XElement xpot)
        {
            Score score = null;
            if (null != xpot)
                score = new Score()
                {
                    DaysSinceLastSeen = xpot.GetInt("Days"),
                    Threat = xpot.GetInt("Threat"),
                    BotType = xpot.GetInt("Type")
                };
            return score;
        }
    }

    /// <summary>
    /// Log score value to HoneyPot child Element (if score not null).
    /// </summary>
    /// <param name="score"></param>
    /// <param name="parent"></param>
    public static void LogScore(HoneyPot.Score score, XElement parent)
    {
        if ((null != score) && (null != parent))
        {
            parent.Add((XElement)score);
        }
    }

}
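For reference, a minimal usage sketch: the BotGate wrapper and the threat cutoff of 25 are my own assumptions, not part of the class above, so tune them for your traffic.

using System.Web;

public static class BotGate
{
    // Hypothetical helper: true when Project Honey Pot has seen this visitor
    // recently with a threat score above an (assumed) cutoff.
    public static bool LooksLikeBadBot(HttpRequest request)
    {
        HoneyPot.Score score = HoneyPot.GetScore_ByIP(request.UserHostAddress);
        return score != null && score.Threat > 25; // 25 is an arbitrary threshold
    }
}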


Though technically it would not "detect" bot crawlers, I have an interesting way to block them. My method is to create an IIS filter or Apache plug-in. You would encrypt all of your HTML, ASP, PHP, etc. pages. The only page left unencrypted would be the index page. The index page simply sets a cookie containing an encrypted public key, then redirects to a second index page. The IIS filter or Apache plug-in would then check each visitor to make sure they have this cookie. If they do, the filter decrypts the requested page and passes it on to the web server for processing.

This method would allow a normal visitor to view your web pages, but if a bot that rejects cookies tried to read them, all it would ever get is the encrypted content.
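A minimal sketch of just the cookie-gate half of this idea, written as an ASP.NET IHttpModule; the cookie name, the index path, and the decryption step are placeholders, not a full implementation.

using System;
using System.Web;

// Visitors without the key cookie are bounced back to the index page
// (the one unencrypted page, which sets the cookie). The decrypt-and-serve
// step would go where the final comment is.
public class CookieGateModule : IHttpModule
{
    private const string KeyCookie = "pubkey"; // hypothetical cookie name

    public void Init(HttpApplication application)
    {
        application.BeginRequest += (sender, e) =>
        {
            HttpContext context = ((HttpApplication)sender).Context;

            // Let the index page through so it can set the cookie.
            if (context.Request.Path.Equals("/index.aspx", StringComparison.OrdinalIgnoreCase))
                return;

            if (context.Request.Cookies[KeyCookie] == null)
            {
                // No cookie: likely a bot that rejects cookies.
                context.Response.Redirect("/index.aspx");
                return;
            }

            // Cookie present: decrypt the requested page here before the
            // rest of the pipeline processes it (omitted in this sketch).
        };
    }

    public void Dispose() { }
}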


A blacklist is probably not a good way of doing it. It would be better to have a whitelist of known bots that are allowed to make more than a certain number of hits per second. If someone not on that whitelist makes too many hits per second, start dropping their connections for a few seconds. This will help prevent DDoSing and still let unknown bots scan your site (although a lot more slowly than the ones you think matter).

You can keep a log of the offenders and see who is breaking the rules repeatedly :)
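Here's a sketch of that kind of per-client throttling; the limits and whitelist entries below are assumptions (and remember that a user-agent string alone is spoofable, so verify whitelisted bots by reverse DNS or published IP ranges).

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Fixed-window rate limiter with a whitelist of trusted bots.
public class RequestThrottle
{
    private const int MaxHitsPerWindow = 10;                    // assumed limit
    private static readonly TimeSpan Window = TimeSpan.FromSeconds(1);

    // Bots allowed to exceed the limit (example entries).
    private static readonly HashSet<string> Whitelist =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Googlebot", "bingbot" };

    private class Counter { public DateTime WindowStart; public int Hits; }

    private readonly ConcurrentDictionary<string, Counter> _counters =
        new ConcurrentDictionary<string, Counter>();

    /// <returns>true to serve the request, false to drop the connection.</returns>
    public bool Allow(string clientIp, string botName)
    {
        if (botName != null && Whitelist.Contains(botName))
            return true; // trusted crawlers bypass the per-IP limit

        DateTime now = DateTime.UtcNow;
        Counter counter = _counters.GetOrAdd(clientIp, _ => new Counter { WindowStart = now });
        lock (counter)
        {
            if (now - counter.WindowStart > Window)
            {
                counter.WindowStart = now; // start a fresh one-second window
                counter.Hits = 0;
            }
            counter.Hits++;
            return counter.Hits <= MaxHitsPerWindow;
        }
    }
}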
