Efficient way of calculating likeness scores of strings when sample size is large?

2022-12-09 09:59 问答作者：

Let's say that you have a list of 10,000 email addresses, and you'd like to find what some of the closest "neighbors" in this list are - defined as email addresses that are suspiciously close to other email addresses in your list.

I'm aware of how to calculate the Levenshtein distance between two strings (thanks to this question), which will give me a score of how many operations are needed to transform one string into another.

Let's say that I define "suspiciously close to another email address" as two strings having a Levenshtein score less than N.

Is there a more eff开发者_高级运维icient way to find pairs of strings whose score is lower than this threshold besides comparing every possible string to every other possible string in the list? In other words, can this type of problem be solved quicker than O(n^2)?

Is Levenshtein score a poor choice of algorithms for this problem?

Yup - you can find all strings within a given distance of a string in O(log n) time by using a BK-Tree. Alternate solutions involving generating every string with distance n may be faster for a levenshtein distance of 1, but the amount of work rapidly balloons out of control for longer distances.

This problem is known as clustering and is a part of a bigger deduplication problem (where you get to decide which member of the cluster is "the right" one), also known as merge-purge.

I once read a few research papers on exactly this topic (the names are below) and basically, the authors used a limited-size sliding window over a sorted list of strings. They would only compare (using an edit distance algorithm) N*N strings inside the window, thereby reducing the computational complexity. If any two strings looked similar, they were combined into a cluster (by inserting a record into a separate cluster table).

The first pass through the list was followed by a second pass where the strings were reversed before getting sorted. This way the strings with different heads had another chance to get close enough to be evaluated as part of the same window. On this second pass, if a string looked close enough to two (or more) strings in the window, and those strings were already parts of their own clusters (found by the first pass), the two clusters would then be merged (by updating the cluster table) and the current string would be added to the newly merged cluster. This clustering approach is known as the union-find algorithm.

Then they improved the algorithm by replacing the window with a list of top X substantially unique prototypes. Each new string would be compared to each of the top X prototypes. If string looked close enough to one of the prototypes, it would then be added to the prototype's cluster. If none of the prototypes looked similar enough, the string would become a new prototype, pushing the oldest prototype out of the top X list. (There was an heuristic logic employed to decide which of the strings in the prototype's cluster should be used as the new prototype representing the entire cluster). Again, if the string looked similar to several prototypes, all of their clusters would be merged.

I once implemented this algorithm for deduplication of name/address records with sizes of the lists being around 10-50 million records and it worked pretty damn fast (and found duplicates well too).

Overall for such problems, the trickiest part of course is finding the right value of the similarity threshold. The idea is to capture all the dups w/o producing too many false positives. Data with different characteristics tends to require different thresholds. The choice of an edit-distance algorithm is also important as some algorithms are better for OCR errors while others are better for typos and yet others are better for phonetic errors (such as when getting a name over the phone).

Once the clustering algorithm is implemented, a good way to test it is to get a list of unique samples and artificially mutate each sample to produce its variations, while preserving the fact that all the variations have come from the same parent. This list is then shuffled and fed to the algorithm. Comparing the original clustering with the clustering produced by the deduplication algorithm will give you the efficiency score.

Bibliography:

Hernandez M. 1995, The Merge/Purge Problem for Large Databases.

Monge A. 1997, An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records.

You can do it with Levenshtein in O(kl), where k is your maximum distance and l is the maximum string.

Basically when you know how to calculate basic Levenshtein then it's easy to figure out that every result that is further than k from the main diagonal has to be bigger than k. So if you calculating main diagonal with width 2k + 1 will suffice.

If you have 10000 email addresses you won't need a faster algorithm. Computer can calculate with O(N^2) fast enough.

Levenshtein is quite good for this kind of problem.

Also what you might consider is transforming emails with soundex before comparing. You'll probably get better results.

I don't think you can do better than O(n^2) but you can do some smaller optimizations which could be just enough of a speedup to make your application usable:

You could first sort all email addresses by th part after the @ and only compare addresses where that is the same
You can stop calculating the distance between two addresses when it becomes bigger than n

EDIT: Actually you can do better than O(n^2), just look at Nick Johnsons answer below.

Let's say you have 3 strings:

1 - "abc" 2 - "bcd" 3 - "cde"

The L Distance between 1 & 2 is 2 (subtract 'a', add 'd'). The L Distance between 2 & 3 is 2 (subtract 'b', add 'e').

Your question is whether we can infer an L Distance between 1 & 3 by using the 2 comparisons above. The answer is no.

The L Distance between 1 & 3 is 3 (replace every character), there is no way that this can be inferred because of the scores of the first 2 calculations. The scores do not reveal whether deletions, insertions or substitution operations were performed.

So, I'd say that Levenshtein is a poor choice for a large list.

10,000 email addresses sound not too much. For similarity search in a larger space you can use shingling and min-hashing. This algorithm is a bit more complicated to implement, but is much more efficient on a large space.

If you really are comparing email addresses then one obvious way to do this would be to combine a levenshtein algorithm with domain mapping. I can think of times when I've signed up for something multiple times using the same domain, but variations on the username portion of the email address.

It's possible to do better, at the condition of reversing the problem.

I suppose here that your 10.000 addresses are pretty 'fixed', otherwise you will have to add an update mechanism.

The idea is to use the Levenshtein distance, but in 'reverse' mode, in Python:

class Addresses:
  def __init__(self,addresses):
    self.rep = dict()
    self.rep[0] = self.generate_base(addresses)
      # simple dictionary which associate an address to itself

    self.rep[1] = self.generate_level(1)
    self.rep[2] = self.generate_level(2)
    # Until N

The generate_level method generates all possible variations from the previous set, minus the variations that already exist at a previous level. It preserves the 'origin' as the value associated to the key.

Then, you just have to lookup your word in the various set:

  def getAddress(self, address):
    list = self.rep.keys()
    list.sort()
    for index in list:
      if address in self.rep[index]:
        return (index, self.rep[index][address]) # Tuple (distance, origin)
    return None

Doing so, you compute the various sets once (it takes some times... but then you can serialize it and keep it forever).

And then lookup is much more efficient than O(n^2), though giving it exactly is kind of difficult since it depends on the size of the sets that are generated.

For reference, have a look at: http://norvig.com/spell-correct.html

继续阅读：algorithm cluster-analysis complexity-theory edit-distance string

Efficient way of calculating likeness scores of strings when sample size is large?

Bibliography:

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

Bibliography:

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？