How can I optimize this UserAgent parser for loop in C#?

I am writing a C# program to count the browsers that appear in the UserAgent column of a web server log. I wish to output the browser type, the browser major version, and the number of hits.

How can I optimize this?

I am using a regex to compare the UserAgent string with predefined strings to test for Firefox, Opera, etc. I then use a second regex to rule out a possible mismatch, and a third regex to obtain the major version. I use a struct to hold this information for each browser:

private struct Browser
{
    public int ID;
    public string name;
    public string regex_match;
    public string regex_not;
    public string regex_version;
    public int regex_group;
}

I then load the browser information and loop over all of the records for the UserAgent:

Browser[] browsers = new Browser[5];
for (int i = 0; i < 5; i++)
{
    browsers[i].ID = i;
}
browsers[0].name = "Firefox";
browsers[1].name = "Opera";
browsers[2].name = "Chrome";
browsers[3].name = "Safari";
browsers[4].name = "Internet Explorer";
browsers[0].regex_match = "(?i)firefox/([\\d\\.]*)";
browsers[1].regex_match = "(?i)opera/([\\d\\.]*)";
browsers[2].regex_match = "(?i)chrome/([\\d\\.]*)";
browsers[3].regex_match = "(?i)safari/([\\d\\.]*)";
browsers[4].regex_match = "(?i)msie([+_ ]|)([\\d\\.]*)";
browsers[0].regex_not = "(?i)flock";
browsers[1].regex_not = "";
browsers[2].regex_not = "";
browsers[3].regex_not = "(?i)android|arora|chrome|shiira";
browsers[4].regex_not = "(?i)webtv|omniweb|opera";
browsers[0].regex_version = "(?i)firefox/([\\d\\.]*)";
browsers[1].regex_version = "(?i)opera/([\\d\\.]*)";
browsers[2].regex_version = "(?i)chrome/([\\d\\.]*)";
browsers[3].regex_version = "(?i)version/([\\d\\.]*)";
browsers[4].regex_version = "(?i)msie([+_ ]|)([\\d\\.]*)";
browsers[0].regex_group = 1;
browsers[1].regex_group = 1;
browsers[2].regex_group = 1;
browsers[3].regex_group = 1;
browsers[4].regex_group = 2;
Dictionary<string, int> browser_counts = new Dictionary<string, int>();
for (int i = 0; i < 65000; i++)
{
    foreach (Browser b in browsers)
    {
        if (Regex.IsMatch(csUserAgent[i], b.regex_match))
        {
            if (b.regex_not != "")
            {
                if (Regex.IsMatch(csUserAgent[i], b.regex_not))
                {
                    continue;
                }
            }
            string strBrowser = b.name;
            if (b.regex_version != "")
            {
                string strVersion = Regex.Match(csUserAgent[i], b.regex_version).Groups[b.regex_group].Value;
                int intPeriod = strVersion.IndexOf('.');
                if (intPeriod > 0)
                {
                    strBrowser += " " + strVersion.Substring(0, intPeriod);
                }
            }
            if (!browser_counts.ContainsKey(strBrowser))
            {
                browser_counts.Add(strBrowser, 1);
            }
            else
            {
                browser_counts[strBrowser]++;
            }
            break;
        }
    }
}


You could

  • construct a hashtable of the most frequently seen user-agent strings, so that repeated strings skip the regexes entirely.

  • store compiled Regex instances (new Regex(pattern, RegexOptions.Compiled)) instead of just the pattern strings

  • combine the regexes into a single regex and take advantage of RegexOptions.Compiled together with RegexOptions.CultureInvariant and RegexOptions.IgnoreCase

  • instead of matching twice (once with IsMatch and once with Match), call Match once and check Match.Success
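The first suggestion can be sketched as a small memoizing wrapper. This is only an illustration under assumptions: CachedClassifier, Classify and Misses are hypothetical names that do not appear in your code, and the classifier delegate stands in for whatever regex logic you end up with. It also assumes user-agent strings are never null and that the number of distinct strings is small enough to keep in memory.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: remembers the label computed for each distinct
// user-agent string so the (expensive) classifier runs once per string.
public sealed class CachedClassifier
{
    private readonly Func<string, string> _classify;
    private readonly Dictionary<string, string> _cache =
        new Dictionary<string, string>(StringComparer.Ordinal);

    // Number of times the underlying classifier actually ran.
    public int Misses { get; private set; }

    public CachedClassifier(Func<string, string> classify)
    {
        _classify = classify;
    }

    public string Classify(string userAgent)
    {
        string label;
        if (!_cache.TryGetValue(userAgent, out label))
        {
            // First time this exact string is seen: do the real work
            // and store the result (a null label is cached as well).
            label = _classify(userAgent);
            _cache[userAgent] = label;
            Misses++;
        }
        return label;
    }
}
```

Web logs tend to repeat a limited set of user-agent strings, so most lookups should hit the dictionary. Whether that holds for your data is something only a profile run on your actual log can confirm.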

This is only a starting point - I might come up with more ideas on reading the code :)

Edit: One more:

  • avoid parsing the version with another regex - only Safari requires special treatment according to your config. Try to 'catch' the version with the same regex as the browserid. (I'd simply make an exception for Safari for now)

E.g. you could have a single static regex instance like this:

private static readonly Regex _regex = new Regex(
    "(?i)"
    + "(?<browserid>(?:firefox/|opera/|chrome/|safari/|msie[+_ ]?))"
    + "(?<version>[\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);

You can conveniently access the proper subgroups by using match.Groups["browserid"] and match.Groups["version"]. This removes nearly all need for your list of Browser structs.

The only thing the list is still needed for is the exclusion regex (regex_not). I suggest re-profiling with the single positive regex first, though, to see whether a performance problem remains before chasing the smaller gains.
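As a rough sketch of how the named groups could replace most of the struct list: the code below maps the captured browserid to a display name, takes the major version from the same match, and makes the one exception for Safari. BrowserLabeler, Label, Names and SafariVersion are hypothetical names for illustration, and the regex_not exclusions are deliberately left out, so this is not a drop-in replacement for your loop.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BrowserLabeler
{
    private const RegexOptions Options =
        RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.CultureInvariant;

    // Single combined pattern; the leftmost token in the string wins.
    private static readonly Regex BrowserRegex = new Regex(
        @"(?<browserid>firefox/|opera/|chrome/|safari/|msie[+_ ]?)(?<version>[\d.]*)",
        Options);

    // Safari reports its version in a separate "Version/x.y" token.
    private static readonly Regex SafariVersion = new Regex(
        @"version/(?<version>[\d.]*)", Options);

    // Case-insensitive keys, because the regex ignores case as well.
    private static readonly Dictionary<string, string> Names =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "firefox/", "Firefox" },
            { "opera/", "Opera" },
            { "chrome/", "Chrome" },
            { "safari/", "Safari" },
            { "msie", "Internet Explorer" },
        };

    // Returns e.g. "Chrome 10", or null when no known browser token is found.
    public static string Label(string userAgent)
    {
        Match m = BrowserRegex.Match(userAgent);
        if (!m.Success)
            return null;

        // "msie ", "msie+" and "msie_" all collapse to the key "msie".
        string id = m.Groups["browserid"].Value.TrimEnd('+', '_', ' ');
        string name = Names[id];
        string version = m.Groups["version"].Value;

        if (name == "Safari")
            version = SafariVersion.Match(userAgent).Groups["version"].Value;

        // Keep only the major version, as in the original loop.
        int dot = version.IndexOf('.');
        return dot > 0 ? name + " " + version.Substring(0, dot) : name;
    }
}
```

With this sketch, a user agent containing "Chrome/10.0.648.204 Safari/534.16" is labeled "Chrome 10", because the Chrome token appears first in the string. That ordering is what made the chrome entry in the Safari regex_not necessary in your version; the remaining exclusions (flock, android, webtv, and so on) would still need a separate check.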

Benchmark

I wrote a benchmark (see below). I'll be updating this incrementally until I lose interest :) (I know my dataset isn't representative. If you upload a file, I'll test it with that.)

  1. replacing the separate regexes with the single statically compiled regex cuts the run time from 14s to 2.1s (a more than 6x speedup); this is with only the outermost match replaced

  2. replacing the regex_not/regex_version patterns with precompiled regexes did not make much of a difference with my test set (but my test data contains no actual matching user agents, so that makes sense)


using System;
using System.Linq;
using System.Collections.Generic;
using System.Text.RegularExpressions;


public class Program
{
    private struct Browser
    {
        public int ID;
        public string name;
        public Regex regex_match, regex_not, regex_version;
        public int regex_group;
    }

    private static readonly Regex _regex = new Regex("(?i)"
        + "(?<browserid>(?:firefox/|opera/|chrome/|safari/|msie[+_ ]?))"
        + "(?<version>[\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);

    public static void Main(string[] args)
    {

        Browser[] browsers = new Browser[5];
        for (int i = 0; i < 5; i++)
        {
            browsers[i].ID = i;
        }
        browsers[0].name = "Firefox";
        browsers[1].name = "Opera";
        browsers[2].name = "Chrome";
        browsers[3].name = "Safari";
        browsers[4].name = "Internet Explorer";
        browsers[0].regex_match = new Regex("(?i)firefox/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
        browsers[1].regex_match = new Regex("(?i)opera/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
        browsers[2].regex_match = new Regex("(?i)chrome/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
        browsers[3].regex_match = new Regex("(?i)safari/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
        browsers[4].regex_match = new Regex("(?i)msie([+_ ]|)([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
        // OPTIMIZATION #2
        browsers[0].regex_not = new Regex("(?i)flock", RegexOptions.Compiled | RegexOptions.CultureInvariant);
        browsers[1].regex_not = null;
        browsers[2].regex_not = null;
        browsers[3].regex_not = new Regex("(?i)android|arora|chrome|shiira", RegexOptions.Compiled | RegexOptions.CultureInvariant);
        browsers[4].regex_not = new Regex("(?i)webtv|omniweb|opera", RegexOptions.Compiled | RegexOptions.CultureInvariant);
        // OPTIMIZATION #2
        browsers[0].regex_version = new Regex("(?i)firefox/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
        browsers[1].regex_version = new Regex("(?i)opera/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
        browsers[2].regex_version = new Regex("(?i)chrome/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
        browsers[3].regex_version = new Regex("(?i)version/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
        browsers[4].regex_version = new Regex("(?i)msie([+_ ]|)([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
        browsers[0].regex_group = 1;
        browsers[1].regex_group = 1;
        browsers[2].regex_group = 1;
        browsers[3].regex_group = 1;
        browsers[4].regex_group = 2;
        Dictionary<string, int> browser_counts = new Dictionary<string, int>();

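        // The regex ignores case, so the captured browserid may be "Firefox/" or "MSIE ";
        // compare keys case-insensitively to avoid a KeyNotFoundException.
        var lookupBrowserId = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase) {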
            { "firefox/", 0 },
            { "opera/", 1 },
            { "chrome/", 2 },
            { "safari/", 3 },
            { "msie+", 4 },
            { "msie_", 4 },
            { "msie ", 4 },
            { "msie", 4 },
        };

        for (int i=1; i<20; i++)
        foreach (var line in System.IO.File.ReadAllLines("/etc/dictionaries-common/words"))
        {
            // OPTIMIZATION #1 START
            Match match = _regex.Match(line);

            {
                if (match.Success)
                {
                    Browser b = browsers[lookupBrowserId[match.Groups["browserid"].Value]];
                    // OPTIMIZATION #1 END

                    // OPTIMIZATION #2
                    if (b.regex_not != null && b.regex_not.IsMatch(line))
                            continue;

                    string strBrowser = b.name;
                    if (b.regex_version != null)
                    {
                        // OPTIMIZATION #2
                        string strVersion = b.regex_version.Match(line).Groups[b.regex_group].Value;
                        int intPeriod = strVersion.IndexOf('.');
                        if (intPeriod > 0)
                        {
                            strBrowser += " " + strVersion.Substring(0, intPeriod);
                        }
                    }
                    if (!browser_counts.ContainsKey(strBrowser))
                    {
                        browser_counts.Add(strBrowser, 1);
                    }
                    else
                    {
                        browser_counts[strBrowser]++;
                    }
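                    // No break here: this loop iterates over log lines, not over
                    // browsers, so a break would skip the rest of the file.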
                }
            }
        }
    }
}