Grouping millions of strings with substitution
I have a large number (XXM-XXXM) of strings that look like this (a small sample):
I have no idea of all the possible error strings, nor of the permutations thereof. I want to group all similar errors together, and generate some statistics showing an error count for each error string group.
So, essentially, I'd like to group the most similar strings together, and strings can belong to multiple groups.
Thanks!
Disclaimer: I have never solved a problem like this before.
I can think of a couple of ways to frame your problem:
- you are trying to assign each line to one of a set of clusters
- look into data-mining algorithms
- the diff between lines within a cluster will be small, while the diff between lines from two different clusters will be considerably bigger
- you could quickly gather similar lines by comparing the set intersection of two lines:

      set(line1.split()) & set(line2.split())

- the element count in the resulting set is an indicator of how close these two lines are.
A bit of Python code could look like this:

    import fileinput

    CLUSTER_COUNT = 5
    MAX_DISTANCE = 5

    def main():
        clusters = [Cluster() for i in range(CLUSTER_COUNT)]
        for line in fileinput.input():
            words = set(line.split())
            # put the line into the closest cluster
            cluster = sorted(clusters, key=lambda c: c.distanceTo(words))[0]
            cluster.addLine(words, line)

        # print out results (FIXME: write clusters to separate files)
        for cluster in clusters:
            print("CLUSTER:", cluster.intersection)
            for line in cluster.lines:
                print(line)
            print("-" * 80)
            print()

    class Cluster(object):
        def __init__(self):
            self.intersection = set()
            self.lines = []

        def distanceTo(self, words):
            # an empty cluster sits at a fixed default distance from every line
            if len(self.intersection) == 0:
                return MAX_DISTANCE
            # the fewer words the line shares with the cluster, the larger the distance
            return len(words) - len(self.intersection & words)

        def addLine(self, words, line):
            self.lines.append(line)
            if len(self.intersection) == 0:
                self.intersection = words
            else:
                self.intersection = self.intersection & words

    if __name__ == '__main__':
        main()
If you run it on your main data, you should end up with a couple of clusters. Note: alter the code to write the clusters to separate files. I think you will want to run the clusters through the code again, recursively, until you find the subsets you're interested in.
What I would do here is write a parameter recogniser. This recognises the following sub-strings (when surrounded by whitespace) as parameters:
- decimal number
- a.b.c.d, where a,b,c,d are decimal numbers
- filename.php
No doubt there will be more, but the list shouldn't grow too big. Then you replace each parameter by a place-holder: %d, %url, %phpfile. Now you can just sort the strings.
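For illustration, here is a minimal token-based sketch of such a recogniser in Python. It uses the answer's placeholder names (%d, %url, %phpfile); the exact regexes are my guesses at the three parameter types listed above:

    import re

    # Each recogniser matches a whole whitespace-delimited token.
    PARAMETER_TYPES = [
        ("%url",     re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")),  # a.b.c.d
        ("%phpfile", re.compile(r"^\S+\.php$")),               # filename.php
        ("%d",       re.compile(r"^\d+$")),                    # decimal number
    ]

    def normalise(line):
        out = []
        for token in line.split():
            for placeholder, pattern in PARAMETER_TYPES:
                if pattern.match(token):
                    token = placeholder
                    break
            out.append(token)
        return " ".join(out)

Sorting or counting the normalised lines then collapses each group of errors into a single entry.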
You can find unrecognised parameter types by looking through the output strings that occur rarely. For instance, if there is a parameter type h:m:s for a time of day, strings containing this unsubstituted parameter will be unique, or nearly so, and you can find this new parameter type by simply eyeballing a list of the 100 or so 'most nearly unique' strings. Then add h:m:s to your list, replace all such occurrences with %time, and run it again.
After reading some more of the pages about CRM114, the Controllable Regex Mutilator:
Spam is the big target with CRM114, but it's not a specialized Email-only tool. CRM114 has been used to sort web pages, resumes, blog entries, log files, and lots of other things. Accuracy can be as high as 99.9 %. In other words, CRM114 learns, and it learns fast.
This might, in short, be exactly what you need. And you can count on it to be optimized.
Interesting question... Just a spontaneous idea that can be coded quite quickly:
- Decide for each word whether it's data (11.22.33.55, 3829, somepage.php, etc.) or description (Load, Connection, page, etc.) by counting the frequency of each word in a sample of your data set. Description words presumably occur significantly more often than a particular data word. You'll have to tweak the threshold to find one that partitions the words into the two categories. If this doesn't work for all cases, e.g., because there's an IP address that occurs very often, a manual black list should fix this.
- Process each line by computing a signature from the set of its description words only (a string hash of the concatenated description words should work). Count the occurrence of each signature.
Performance (very rough estimate): in the first phase it suffices to sample your data; in the second phase you have to process each line sequentially, so you stay within O(n), where n is the number of lines in your data set (use a hash map for O(1) insertion in the first phase and O(1) lookup in the second). Memory consumption depends on the number of distinct words (first phase) and distinct sentences (second phase). If there is a large variation of words, the dictionary used for counting occurrences in the first phase could become problematic.
In Python, I would try the specialized (and performant) dict subclass collections.Counter as the data structure for these counts.
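A rough sketch of the two phases using collections.Counter (the min_fraction threshold is an assumption you would have to tune, as the answer says):

    from collections import Counter

    def group_by_signature(lines, min_fraction=0.001):
        lines = list(lines)  # we pass over the data twice

        # Phase 1: word frequencies (on the full data here; a sample also works).
        word_counts = Counter(word for line in lines for word in line.split())
        total = sum(word_counts.values())
        description_words = {w for w, c in word_counts.items()
                             if c / total >= min_fraction}

        # Phase 2: one sequential pass, counting each description-word signature.
        return Counter(" ".join(w for w in line.split() if w in description_words)
                       for line in lines)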
Well, I think pairwise comparison is out of scope for this problem; even comparing hashed/mapped values takes a lot of time at this scale. So bucketing based on a numeric mapping of each error message, and then validating the buckets, could be a workable approach.
For this, you need some way to map each individual string to a number. It is essentially the reverse of a hashing technique, which puts the emphasis on avoiding collisions; here, collisions between nearly identical strings are fine. So the mapping function should be a function of the positional properties of the characters and of the length of the string.
Once the mapping is done, we create buckets over value ranges, e.g. group 1: (0-10), group 2: (10-20), and so on (or whatever ranges suit the values). Since we said that similar strings should get similar values, similar strings are placed in the same bucket.
So whenever a new message arrives, we map it to its numeric equivalent and put it into the appropriate bucket.
It is very hard to answer without any assumptions on the input.
My approach would be:
- Take a look at your input and make rules about the grouping, and implement these rules (possibly with regexes).
- Then create a "misc" group for those that didn't match any.
- Collect data, and see if there are patterns in that group that you can write regexes for.
- Repeat until you have a satisfyingly small "misc" group.
Only you can tell (without assumptions) whether you want to group a particular set of "similar" strings together or not. There must be a finite number of error message types, so this should yield a good enough result after a few iterations.
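A minimal sketch of that loop (the rule names, the two patterns, and the errors.log file name are all assumptions for illustration, with the patterns loosely modelled on the sample lines in the question):

    import re
    from collections import Counter

    # Hand-written rules; add more as you discover patterns in the "misc" group.
    RULES = [
        ("page load failed",   re.compile(r"^Load of page \S+ by client \S+ failed")),
        ("connection timeout", re.compile(r"^Connection to \S+ port \d+ timed out")),
    ]

    def classify(line):
        for name, pattern in RULES:
            if pattern.search(line):
                return name
        return "misc"

    counts = Counter()
    misc = []
    for line in open("errors.log"):        # assumed input file name
        group = classify(line)
        counts[group] += 1
        if group == "misc":
            misc.append(line)
    # Eyeball the most common lines in `misc`, add new rules, and repeat.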
Another approach, assuming the similarity is based on the numbers and IP addresses in the strings, would be to search for those and use them to classify the items. This again involves being more concrete about your rules, but it might spare you the iterated rule-making (if the assumption is actually valid).
One could think of measuring the Levenshtein distance between strings (or using a similar algorithm), but for classification that would require far too many steps.
Hmm, why reinvent the wheel? Have a look at the free version of Splunk; it is designed for exactly this kind of task.
If it were up to me, I would use Python and regular expressions. Disclaimer: I've never used Python on the scale that you're talking about.
Here are the first two items from your sample output, rewritten in regular expression notation:
- r"Load of page \w+.php by client [.\d]+ failed due to error \d+"
- r"Connection to [.\d]+ port \d+ timed out client [.\d]+ source port \d+"
Not so bad, right? You can use these as the "keys" for your top-level buckets.
Then your code for matching a given string (s) to a bucket becomes incredibly simple:
    for bucket in buckets:
        if re.match(bucket.regex, s):
            bucket.matchingStrings.append(s)
            break
    else:
        # no bucket matched: generate a new bucket/regex for s here
        pass
Generating these regular expressions will be the tricky part. You need to have rules for choosing parts of the string that need to be generalized out. In your case, the general parts seem to be numbers, IP addresses and filenames. You'll have to come up with the right expressions for each scenario, but here's a simple substitution that replaces numbers in the string with the regex pattern representing a number.
    pattern = r"\d+"
    # note: the backslash must be doubled in the replacement string
    re.sub(pattern, r"\\d+", "Failed to count from 0 to 600")
    # returns r"Failed to count from \d+ to \d+"
I bet you would get pretty far with the \d+ substitution and nothing more.
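Generating the bucket regex automatically could be as simple as escaping the literal string and then generalizing the digit runs. The Bucket class and make_bucket_regex helper below are my own illustration of that idea, not part of any existing library; the sample lines come from the question:

    import re

    class Bucket(object):
        def __init__(self, regex):
            self.regex = regex
            self.matchingStrings = []

    def make_bucket_regex(s):
        # Escape regex metacharacters in the literal line, then turn every run
        # of digits into the generic \d+ pattern (re.escape leaves digits alone).
        return re.sub(r"\d+", r"\\d+", re.escape(s))

    lines = [
        "Load of page alt.php by client 2.3.4.92 failed due to error 3829",
        "Load of page alt2.php by client 2.3.4.95 failed due to error 3829",
        "Connection to 11.22.33.55 port 3829 timed out client 1.2.3.6 source port 3944",
    ]

    buckets = []
    for s in lines:
        for bucket in buckets:
            if re.match(bucket.regex, s):
                bucket.matchingStrings.append(s)
                break
        else:
            new_bucket = Bucket(make_bucket_regex(s))
            new_bucket.matchingStrings.append(s)
            buckets.append(new_bucket)

Note that with only the \d+ substitution, alt.php and alt2.php still end up in different buckets, which is why filename and IP rules like the ones above are worth adding.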
Not an algorithmic solution in itself, but it might come in handy and give you some more freedom in terms of computational complexity to spend: log file analysis is one of the primary use cases of the MapReduce implementation Hadoop and its friends. So if you're running into scalability problems, you might start to think about solving the problem on a manageable subset (the map step) and then combining the output of the subsets into a single one (the reduce step). For example, you might find buckets in subsets, then compare all the sets of buckets and merge the similar ones.
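In plain Python terms, the shape of that split might look like this (signature() is a stand-in for whatever normalisation you end up using, not a fixed API):

    import re
    from collections import Counter

    def signature(line):
        # Placeholder normalisation: collapse digit runs; swap in your own rules.
        return re.sub(r"\d+", "%d", line.strip())

    def map_chunk(lines):
        # "Map" step: summarise one manageable subset of the log.
        return Counter(signature(line) for line in lines)

    def reduce_counts(partial_counts):
        # "Reduce" step: merge the per-chunk summaries into a single count.
        total = Counter()
        for partial in partial_counts:
            total.update(partial)
        return total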
From your recent edit, it seems like you are comparing strings when doing intersections. Don't compare strings; just compare hashes. The chance that a 64-bit hash collides is basically zero. This will accelerate your string comparison time by saving you many, many cache misses.
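A sketch of that idea in Python (here hash() stands in for whatever 64-bit hash you pick; within a single run it is stable):

    def word_hashes(line):
        # Hash each word once; from here on only integers are compared.
        return frozenset(hash(word) for word in line.split())

    a = word_hashes("Load of page alt.php by client 2.3.4.92 failed due to error 3829")
    b = word_hashes("Load of page alt2.php by client 2.3.4.95 failed due to error 3829")
    shared = len(a & b)   # number of words the two lines have in common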
I have done something rather similar, where I needed to find similar strings in a database. A trie with some additions helps a lot here. Common trie implementations only support insertion and (prefix) search. However, it is also possible to compute the Levenshtein distance to all data in a trie, and then use this to get the closest strings.
The Levenshtein algorithm implemented over a trie can also be used to figure out what changed between two strings and produce a template. Something like this:
    similars = get_similar_strings(input_string, max_distance)
    for each similar in similars do
        if similar is a plain string then
            // construct a template from the string
        else
            // similar is already a template: increase its count
    done
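For reference, here is a minimal sketch of the "Levenshtein distance against everything in a trie" search, using a plain dict-based trie; this is my illustration of the technique, not the answer's own implementation:

    class TrieNode(object):
        def __init__(self):
            self.children = {}
            self.is_end = False

    def insert(root, word):
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def get_similar_strings(root, word, max_distance):
        """Return (stored_string, distance) for every entry within max_distance."""
        results = []
        first_row = list(range(len(word) + 1))
        for ch, child in root.children.items():
            _search(child, ch, ch, word, first_row, max_distance, results)
        return results

    def _search(node, ch, prefix, word, previous_row, max_distance, results):
        # Build one row of the Levenshtein matrix for this trie edge.
        current_row = [previous_row[0] + 1]
        for col in range(1, len(word) + 1):
            insert_cost = current_row[col - 1] + 1
            delete_cost = previous_row[col] + 1
            replace_cost = previous_row[col - 1] + (word[col - 1] != ch)
            current_row.append(min(insert_cost, delete_cost, replace_cost))

        if node.is_end and current_row[-1] <= max_distance:
            results.append((prefix, current_row[-1]))

        # Only descend further if some prefix is still within reach.
        if min(current_row) <= max_distance:
            for next_ch, child in node.children.items():
                _search(child, next_ch, prefix + next_ch, word, current_row,
                        max_distance, results)

    root = TrieNode()
    for s in ("error 42 on host a", "error 43 on host a", "disk full on host b"):
        insert(root, s)
    print(get_similar_strings(root, "error 44 on host a", 2))

Template construction would then be a separate step on top of the returned matches, as the pseudocode above sketches.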
I would do something like this.
    #include <map>
    #include <set>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    using namespace std;

    map<string, int> mStringToInt;

    struct structOneRecord
    {
        vector<int> vWords;
        // These two members are modified through a set<> iterator, so they must
        // be mutable; they do not take part in the ordering below.
        mutable vector<vector<int>> vLines; // All the lines under this record
        mutable vector<int> vIndexOfMutableWords;

        bool operator < (const structOneRecord &n) const
        {
            if(vWords.size() != n.vWords.size())
                return vWords.size() < n.vWords.size();
            else
            {
                // Count differences
                vector<int> vCurrentIndexs;
                for(int i=0; i<(int)vWords.size(); i++)
                {
                    if(vWords[i] != n.vWords[i])
                        vCurrentIndexs.push_back(i);
                }

                if(vCurrentIndexs.size() == 0)
                    return false;

                int iHalf = (int)vWords.size() / 2; // The differences can't be bigger than half the phrase
                if((int)vCurrentIndexs.size() < iHalf)
                {
                    if(vIndexOfMutableWords.size() == 0)
                        return false;
                    else
                    {
                        if(vIndexOfMutableWords.size() == vCurrentIndexs.size())
                        {
                            for(int i=0; i<(int)vIndexOfMutableWords.size(); i++)
                            {
                                if(vIndexOfMutableWords[i] != vCurrentIndexs[i])
                                    return vWords[vCurrentIndexs[0]] < n.vWords[vCurrentIndexs[0]]; // Not equal
                            }
                            // All differences are at the known mutable positions -> treat as equal
                            return false;
                        }
                    }
                }

                return vWords[vCurrentIndexs[0]] < n.vWords[vCurrentIndexs[0]];
            }
        }
    };

    vector<string> SplitString(const string &strInput, char cDelimiter, bool bSkipEmpty)
    {
        vector<string> vRetValue;

        stringstream ss(strInput);
        string strItem;
        while(std::getline(ss, strItem, cDelimiter))
        {
            // Skip empty items
            if(bSkipEmpty && strItem.size()==0)
                continue;

            vRetValue.push_back(strItem);
        }

        return vRetValue;
    }

    int main()
    {
        // To test
        vector<string> vInput;
        vInput.push_back("Connection to 11.22.33.44 port 3940 timed out client 1.2.3.4 source port 3940");
        vInput.push_back("Error loading page somepage.php by client 2.3.4.5");
        vInput.push_back("Load of page someotherpage.php by client 2.3.4.8 failed due to error 4930");
        vInput.push_back("Connection to 11.22.33.55 port 3829 timed out client 1.2.3.6 source port 3944");
        vInput.push_back("Load of page alt.php by client 2.3.4.92 failed due to error 3829");
        vInput.push_back("Load of page alt2.php by client 2.3.4.95 failed due to error 3829");

        set<structOneRecord> sRecords;

        for(int i=0; i<(int)vInput.size(); i++)
        {
            // Map every word to a small integer id
            vector<string> vWords = SplitString(vInput[i], ' ', true);
            structOneRecord stRecord;
            stRecord.vWords.resize(vWords.size());
            for(int j=0; j<(int)vWords.size(); j++)
            {
                map<string, int>::iterator mIte = mStringToInt.find(vWords[j]);
                if(mIte == mStringToInt.end())
                    mIte = mStringToInt.insert(mStringToInt.begin(), make_pair(vWords[j], (int)mStringToInt.size()));

                stRecord.vWords[j] = mIte->second;
            }

            set<structOneRecord>::iterator sIte = sRecords.find(stRecord);
            if(sIte != sRecords.end())
            {
                sIte->vLines.push_back(stRecord.vWords);
                if(sIte->vIndexOfMutableWords.size() == 0)
                {
                    // Count differences against the record's template
                    vector<int> vCurrentIndexs;
                    for(int k=0; k<(int)stRecord.vWords.size(); k++)
                    {
                        if(sIte->vWords[k] != stRecord.vWords[k])
                            vCurrentIndexs.push_back(k);
                    }

                    sIte->vIndexOfMutableWords = vCurrentIndexs;
                }
            }
            else
            {
                stRecord.vLines.push_back(stRecord.vWords);
                sRecords.insert(stRecord);
            }
        }

        return 0;
    }
This would give you output that can easily be printed by reconstructing, for each record, a string from the words referenced by vWords (replacing the words at the mutable indexes with "%%"), and it also gives you the lines ordered by record "type".
Edit: I forgot to mention that, for each structOneRecord, vLines.size() gives you the error count.
I think some manual intervention will be required and frequency counting will be the best approach.