Classify array of strings based on commonalities

2022-12-11 18:40 问答作者：

I have huge list (200000) of strings (multi 开发者_如何转开发word). I want to group these strings based on comman array of word match among these strings. I cant think of a low computation time algorithm for this

"AB 500"

"Bus AB 500"

"News CA"

"News CA BLAH"

My plan was

a. Tokenize them to words.

b. Create a global array tokens

c. Compare those strings with common tokens.

As you guessed this does not help. Can you suggest an algorithm for this? I am writing this in python..

200000 is not that much, you can do this

Split each string to get tokens e.g. "News CA BLAH" -> ["Blah", "CA", "News"]
create a dict entry each length of list e.g. in case of ["Blah", "CA", "News"] all combinations in order
Now just loop thru the dict and see the groups

example code:

data="""AB 500
Bus AB 500
News CA
News CA BLAH"""

def getCombinations(tokens):
    count = len(tokens)
    for L in range(1,count+1):
        for i in range(count-L+1):
            yield tuple(tokens[i:i+L])

groupDict = {}
for s in data.split("\n"):
    tokens = s.split()
    for groupKey in getCombinations(tokens):
        if groupKey not in groupDict:
            groupDict[groupKey] = [s]
        else:
            groupDict[groupKey].append(s)

for group, values in groupDict.iteritems():
    if len(values) > 1:
        print group, "->", values

it outputs:

('News', 'CA') -> ['News CA', 'News CA BLAH']
('AB',) -> ['AB 500', 'Bus AB 500']
('500',) -> ['AB 500', 'Bus AB 500']
('CA',) -> ['News CA', 'News CA BLAH']
('AB', '500') -> ['AB 500', 'Bus AB 500']
('News',) -> ['News CA', 'News CA BLAH']

Do you mean something like this?

>>> from collections import defaultdict
>>> L=["AB 500",
... "Bus AB 500",
... "News CA",
... "News CA BLAH"]
>>> d=defaultdict(list)
>>> for s in L:
...     for w in s.split():
...         d[w].append(s)
... 
>>> print d["News"]
['News CA', 'News CA BLAH']
>>> print d["CA"]
['News CA', 'News CA BLAH']
>>> print d["500"]
['AB 500', 'Bus AB 500']

Unless repetition of words is an important feature for your use case, I suggest sets. I.e.:

thestrings = [
"AB 500",
"Bus AB 500",
"News CA",
"News CA BLAH",
]

thesets = dict((s, set(s.split())) for s in thestrings)

similarities = dict()
for s in thestrings:
  for o in thestrings:
    if s>=o: continue
    sims = len(thesets[s] & thesets[o])
    if not sims: continue
    similarities[s, o] = sims

for s, o in sorted(similarities, similarities.get, reverse=True):
  print "%-16r %-16r %2d" % (s, o, similarities[s, o])

Is this close to what you're looking for? It does classify the 4 strings you give in the way you desire, but that's a very feeble sample, of course, so I'm double checking;-).

What would happen, if the string "AB 500 News CA" is added to your list? Do the two groups of strings have to merge? If not, how to split up the list of strings and why?

A very general workflow for problems like this (if i understood it correctly) goes like this:

Get a list of candidate pairs via an inverted Index/All pairs similarity search/Simhashing
Calc some distance functions for each pair and combine them into a single weight
Each weighted pair ((a, b), weight) represents now an edge in a graph, which you can cluster into the "word-match groups" via hierarchical clustering/power iteration

继续阅读：algorithm classification python string

Classify array of strings based on commonalities

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？