How can I optimize this UserAgent parser for loop in C#?
I am writing a C# program to analyze the number of browsers in the UserAgent column of a web server log. I wish to output the browser type, browser major version, and the number of hits.
How can I optimize this?
I am using regex to compare the UserAgent string with predefined strings to test for Firefox, Opera, etc. I then use a second regex to rule out a possible false match, and a third regex to obtain the major version. I use a struct to hold this information for each browser:
private struct Browser
{
public int ID;
public string name;
public string regex_match;
public string regex_not;
public string regex_version;
public int regex_group;
}
I then load the browser information and loop over all of the records for the UserAgent:
Browser[] browsers = new Browser[5];
for (int i = 0; i < 5; i++)
{
browsers[i].ID = i;
}
browsers[0].name = "Firefox";
browsers[1].name = "Opera";
browsers[2].name = "Chrome";
browsers[3].name = "Safari";
browsers[4].name = "Internet Explorer";
browsers[0].regex_match = "(?i)firefox/([\\d\\.]*)";
browsers[1].regex_match = "(?i)opera/([\\d\\.]*)";
browsers[2].regex_match = "(?i)chrome/([\\d\\.]*)";
browsers[3].regex_match = "(?i)safari/([\\d\\.]*)";
browsers[4].regex_match = "(?i)msie([+_ ]|)([\\d\\.]*)";
browsers[0].regex_not = "(?i)flock";
browsers[1].regex_not = "";
browsers[2].regex_not = "";
browsers[3].regex_not = "(?i)android|arora|chrome|shiira";
browsers[4].regex_not = "(?i)webtv|omniweb|opera";
browsers[0].regex_version = "(?i)firefox/([\\d\\.]*)";
browsers[1].regex_version = "(?i)opera/([\\d\\.]*)";
browsers[2].regex_version = "(?i)chrome/([\\d\\.]*)";
browsers[3].regex_version = "(?i)version/([\\d\\.]*)";
browsers[4].regex_version = "(?i)msie([+_ ]|)([\\d\\.]*)";
browsers[0].regex_group = 1;
browsers[1].regex_group = 1;
browsers[2].regex_group = 1;
browsers[3].regex_group = 1;
browsers[4].regex_group = 2;
Dictionary<string, int> browser_counts = new Dictionary<string, int>();
for (int i = 0; i < 65000; i++)
{
foreach (Browser b in browsers)
{
if (Regex.IsMatch(csUserAgent[i], b.regex_match))
{
if (b.regex_not != "")
{
if (Regex.IsMatch(csUserAgent[i], b.regex_not))
{
continue;
}
}
string strBrowser = b.name;
if (b.regex_version != "")
{
string strVersion = Regex.Match(csUserAgent[i], b.regex_version).Groups[b.regex_group].Value;
int intPeriod = strVersion.IndexOf('.');
if (intPeriod > 0)
{
strBrowser += " " + strVersion.Substring(0, intPeriod);
}
}
if (!browser_counts.ContainsKey(strBrowser))
{
browser_counts.Add(strBrowser, 1);
}
else
{
browser_counts[strBrowser]++;
}
break;
}
}
}
You could
- construct a hashtable of the most frequently matched user agents and avoid matching the regexen for those.
- store compiled regexes, i.e. new Regex(pattern, RegexOptions.Compiled), instead of just the pattern string.
- combine the regexes into a single regex and take advantage of RegexOptions.Compiled together with RegexOptions.CultureInvariant / RegexOptions.IgnoreCase.
- instead of matching twice (once with IsMatch and once with Matches), match once (Matches) and check whether the MatchCollection is empty.
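The first suggestion (a hashtable of user agents that have already been seen) might look like the sketch below. Web server logs tend to repeat the same UserAgent strings many times, so each distinct string only has to go through the regexes once. UserAgentCache and the classify delegate are hypothetical names, not part of the original code; classify stands in for whatever regex-based logic turns a UserAgent into a label such as "Firefox 3".

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: remembers the label computed for each distinct UserAgent string.
public sealed class UserAgentCache
{
    private readonly Dictionary<string, string> _labels =
        new Dictionary<string, string>(StringComparer.Ordinal);
    private readonly Func<string, string> _classify;

    // Number of times the (expensive) classifier actually ran.
    public int Misses { get; private set; }

    public UserAgentCache(Func<string, string> classify)
    {
        if (classify == null) throw new ArgumentNullException("classify");
        _classify = classify;
    }

    // Returns the cached label, running the classifier only the first time a string is seen.
    public string Lookup(string userAgent)
    {
        string label;
        if (!_labels.TryGetValue(userAgent, out label))
        {
            label = _classify(userAgent);
            _labels.Add(userAgent, label);
            Misses++;
        }
        return label;
    }
}
```

In the counting loop, one would call cache.Lookup(csUserAgent[i]) and increment browser_counts for the returned label. Whether this pays off depends on how repetitive the log is, so it is worth measuring on real data.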
This is only a starting point - I might come up with more ideas on reading the code :)
Edit: one more:
- avoid parsing the version with another regex. Only Safari requires special treatment according to your config. Try to 'catch' the version with the same regex as the browserid. (I'd simply make an exception for Safari for now.)
E.g. you could have a single static regex instance like this:
private static readonly Regex _regex = new Regex(
"(?i)"
+ "(?<browserid>(?:firefox/|opera/|chrome/|safari/|msie[+_ ]?))"
+ "(?<version>[\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
You can conveniently access the proper subgroups by using match.Groups["browserid"] and match.Groups["version"]. This eliminates nearly all use for your list of Browser structs.
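To illustrate reading the named groups, here is a minimal sketch that turns one UserAgent string into a label like "Firefox 3". BrowserLabel.From and the name table are assumptions made for this example; the exclusion patterns (regex_not) and Safari's separate version/ token are deliberately left out.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BrowserLabel
{
    private static readonly Regex _regex = new Regex(
        "(?<browserid>firefox/|opera/|chrome/|safari/|msie[+_ ]?)(?<version>[\\d\\.]*)",
        RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.IgnoreCase);

    // Keys are compared case-insensitively because the captured text keeps the input's casing.
    private static readonly Dictionary<string, string> _names =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "firefox/", "Firefox" }, { "opera/", "Opera" },
            { "chrome/", "Chrome" }, { "safari/", "Safari" },
            { "msie", "Internet Explorer" },
        };

    // Returns e.g. "Firefox 3", or null when no known browser token is found.
    public static string From(string userAgent)
    {
        Match match = _regex.Match(userAgent);
        if (!match.Success) return null;

        // Drop the optional separator after "msie" so a single key covers all variants.
        string id = match.Groups["browserid"].Value.TrimEnd('+', '_', ' ');
        string version = match.Groups["version"].Value;

        // Keep only the major version, mirroring the IndexOf('.') logic in the question.
        int period = version.IndexOf('.');
        return period > 0 ? _names[id] + " " + version.Substring(0, period) : _names[id];
    }
}
```

Because the regex returns the leftmost match, a Chrome UserAgent (which also contains a Safari/ token later in the string) is labelled as Chrome here without needing an exclusion pattern.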
The only thing the struct list still caters for is the exclusion regex (regex_not). I suggest re-profiling with the single positive regex first, though, to see whether there is still a performance problem left before frying smaller fish.
Benchmark
I wrote a benchmark (see below). I'll be updating this incrementally until I lose interest :) (I know my dataset isn't representative. If you upload a file, I'll test it with that.)
- replacing the separate regexes with the single statically compiled regex speeds things up from 14s to 2.1s (a more than 6x speedup); this is with only the outermost match replaced
- replacing the regex_not/regex_version patterns with precompiled regexes did not make much of a difference with my test set (but I don't have actual matching user agents, so that makes sense)
using System;
using System.Linq;
using System.Collections.Generic;
using System.Text.RegularExpressions;
public class Program
{
private struct Browser
{
public int ID;
public string name;
public Regex regex_match, regex_not, regex_version;
public int regex_group;
}
private static readonly Regex _regex = new Regex("(?i)"
+ "(?<browserid>(?:firefox/|opera/|chrome/|safari/|msie[+_ ]?))"
+ "(?<version>[\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
public static void Main(string[] args)
{
Browser[] browsers = new Browser[5];
for (int i = 0; i < 5; i++)
{
browsers[i].ID = i;
}
browsers[0].name = "Firefox";
browsers[1].name = "Opera";
browsers[2].name = "Chrome";
browsers[3].name = "Safari";
browsers[4].name = "Internet Explorer";
browsers[0].regex_match = new Regex("(?i)firefox/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
browsers[1].regex_match = new Regex("(?i)opera/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
browsers[2].regex_match = new Regex("(?i)chrome/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
browsers[3].regex_match = new Regex("(?i)safari/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
browsers[4].regex_match = new Regex("(?i)msie([+_ ]|)([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
// OPTIMIZATION #2
browsers[0].regex_not = new Regex("(?i)flock", RegexOptions.Compiled | RegexOptions.CultureInvariant);
browsers[1].regex_not = null;
browsers[2].regex_not = null;
browsers[3].regex_not = new Regex("(?i)android|arora|chrome|shiira", RegexOptions.Compiled | RegexOptions.CultureInvariant);
browsers[4].regex_not = new Regex("(?i)webtv|omniweb|opera", RegexOptions.Compiled | RegexOptions.CultureInvariant);
// OPTIMIZATION #2
browsers[0].regex_version = new Regex("(?i)firefox/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
browsers[1].regex_version = new Regex("(?i)opera/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
browsers[2].regex_version = new Regex("(?i)chrome/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
browsers[3].regex_version = new Regex("(?i)version/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
browsers[4].regex_version = new Regex("(?i)msie([+_ ]|)([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
browsers[0].regex_group = 1;
browsers[1].regex_group = 1;
browsers[2].regex_group = 1;
browsers[3].regex_group = 1;
browsers[4].regex_group = 2;
Dictionary<string, int> browser_counts = new Dictionary<string, int>();
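// Case-insensitive keys: the regex ignores case, so the captured text may be "Firefox/" or "MSIE ".
var lookupBrowserId = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase) {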
{ "firefox/", 0 },
{ "opera/", 1 },
{ "chrome/", 2 },
{ "safari/", 3 },
{ "msie+", 4 },
{ "msie_", 4 },
{ "msie ", 4 },
{ "msie", 4 },
};
for (int i=1; i<20; i++)
foreach (var line in System.IO.File.ReadAllLines("/etc/dictionaries-common/words"))
{
// OPTIMIZATION #1 START
Match match = _regex.Match(line);
{
if (match.Success)
{
Browser b = browsers[lookupBrowserId[match.Groups["browserid"].Value]];
// OPTIMIZATION #1 END
// OPTIMIZATION #2
if (b.regex_not != null && b.regex_not.IsMatch(line))
continue;
string strBrowser = b.name;
if (b.regex_version != null)
{
// OPTIMIZATION #2
string strVersion = b.regex_version.Match(line).Groups[b.regex_group].Value;
int intPeriod = strVersion.IndexOf('.');
if (intPeriod > 0)
{
strBrowser += " " + strVersion.Substring(0, intPeriod);
}
}
if (!browser_counts.ContainsKey(strBrowser))
{
browser_counts.Add(strBrowser, 1);
}
else
{
browser_counts[strBrowser]++;
}
// No break here: the per-browser foreach is gone, so a break would stop processing the remaining lines.
}
}
}
}
}
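Timings like the ones quoted in the Benchmark section can be taken with System.Diagnostics.Stopwatch. The sketch below is an assumption about how one might measure the difference, not the harness used for the numbers above; RegexTiming and its method names are hypothetical.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class RegexTiming
{
    // Runs the action `repeat` times and returns the elapsed wall-clock time.
    public static TimeSpan Time(int repeat, Action action)
    {
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < repeat; i++) action();
        watch.Stop();
        return watch.Elapsed;
    }

    // Counts matching lines with the static, pattern-string API (as in the question).
    public static int CountWithPatternString(string[] lines, string pattern)
    {
        int hits = 0;
        foreach (string line in lines)
            if (Regex.IsMatch(line, pattern)) hits++;
        return hits;
    }

    // Counts matching lines with one precompiled Regex instance.
    public static int CountWithCompiled(string[] lines, Regex regex)
    {
        int hits = 0;
        foreach (string line in lines)
            if (regex.IsMatch(line)) hits++;
        return hits;
    }
}
```

For example, RegexTiming.Time(20, () => RegexTiming.CountWithCompiled(lines, regex)) returns the elapsed time for 20 passes over lines. The static Regex methods keep a small cache of recently used patterns, so the gap between the two variants depends on how many distinct patterns are in play.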