Classify array of strings based on commonalities
I have huge list (200000) of strings (multi 开发者_如何转开发word). I want to group these strings based on comman array of word match among these strings. I cant think of a low computation time algorithm for this
"AB 500"
"Bus AB 500" "News CA" "News CA BLAH"My plan was
a. Tokenize them to words. b. Create a global array tokens c. Compare those strings with common tokens.As you guessed this does not help. Can you suggest an algorithm for this? I am writing this in python..
200000 is not that much, you can do this
- Split each string to get tokens e.g. "News CA BLAH" -> ["Blah", "CA", "News"]
- create a dict entry each length of list e.g. in case of ["Blah", "CA", "News"] all combinations in order
- Now just loop thru the dict and see the groups
example code:
data="""AB 500
Bus AB 500
News CA
News CA BLAH"""
def getCombinations(tokens):
count = len(tokens)
for L in range(1,count+1):
for i in range(count-L+1):
yield tuple(tokens[i:i+L])
groupDict = {}
for s in data.split("\n"):
tokens = s.split()
for groupKey in getCombinations(tokens):
if groupKey not in groupDict:
groupDict[groupKey] = [s]
else:
groupDict[groupKey].append(s)
for group, values in groupDict.iteritems():
if len(values) > 1:
print group, "->", values
it outputs:
('News', 'CA') -> ['News CA', 'News CA BLAH']
('AB',) -> ['AB 500', 'Bus AB 500']
('500',) -> ['AB 500', 'Bus AB 500']
('CA',) -> ['News CA', 'News CA BLAH']
('AB', '500') -> ['AB 500', 'Bus AB 500']
('News',) -> ['News CA', 'News CA BLAH']
Do you mean something like this?
>>> from collections import defaultdict
>>> L=["AB 500",
... "Bus AB 500",
... "News CA",
... "News CA BLAH"]
>>> d=defaultdict(list)
>>> for s in L:
... for w in s.split():
... d[w].append(s)
...
>>> print d["News"]
['News CA', 'News CA BLAH']
>>> print d["CA"]
['News CA', 'News CA BLAH']
>>> print d["500"]
['AB 500', 'Bus AB 500']
Unless repetition of words is an important feature for your use case, I suggest sets. I.e.:
thestrings = [
"AB 500",
"Bus AB 500",
"News CA",
"News CA BLAH",
]
thesets = dict((s, set(s.split())) for s in thestrings)
similarities = dict()
for s in thestrings:
for o in thestrings:
if s>=o: continue
sims = len(thesets[s] & thesets[o])
if not sims: continue
similarities[s, o] = sims
for s, o in sorted(similarities, similarities.get, reverse=True):
print "%-16r %-16r %2d" % (s, o, similarities[s, o])
Is this close to what you're looking for? It does classify the 4 strings you give in the way you desire, but that's a very feeble sample, of course, so I'm double checking;-).
What would happen, if the string "AB 500 News CA" is added to your list? Do the two groups of strings have to merge? If not, how to split up the list of strings and why?
A very general workflow for problems like this (if i understood it correctly) goes like this:
- Get a list of candidate pairs via an inverted Index/All pairs similarity search/Simhashing
- Calc some distance functions for each pair and combine them into a single weight
- Each weighted pair ((a, b), weight) represents now an edge in a graph, which you can cluster into the "word-match groups" via hierarchical clustering/power iteration
精彩评论