Similar Data Algorithm

I have a couple of DBs of user information, each one 10k-20k entries, each one from a few different sources, and each one constantly growing. I'm looking to create a tool that can, within a certain tolerance, notice similar emails or similar names (first + ' ' + last). I'm running a MySQL database and can work with either C++ or PHP to run the comparison. Can anyone suggest any existing solutions/tutorials that would allow me to just run a check against the database or an array of data and return possible duplicates? I'd just want it to pick up a few common mistakes like these:

josh@test.com <> josh@test.test.com <> jash@test.com
Josh O <> josh t O <> Joshua O

Maybe have the tolerance adjustable to within a certain number of characters' difference between the entries? Thank you very, very much for any advice or solutions; I've not had much success Googling.


I have some great news for you, and some horrible news for you.

The great news is that PHP has implementations of a few algorithms to compare strings built right in:

  • similar_text
  • levenshtein

It also has two relatively popular ways to break down English-ish words into simpler representations that are suitable for comparison (all four functions are demonstrated in the sketch after this list):

  • soundex
  • metaphone
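
To give a feel for what these four built-ins return, here's a quick sketch run against the question's own examples (expected outputs noted in the comments):

    <?php
    // levenshtein(): the number of single-character edits between two strings.
    echo levenshtein('josh@test.com', 'jash@test.com'), "\n";  // 1

    // similar_text(): count of matching characters; the optional third
    // argument receives the similarity as a percentage.
    similar_text('josh@test.com', 'josh@test.test.com', $percent);
    printf("%.1f%%\n", $percent);  // 83.9%

    // soundex() / metaphone(): phonetic keys, so spelling variants of a
    // name collapse to the same value.
    echo soundex('Joshua'), ' ', soundex('Josh'), "\n";      // J200 J200
    echo metaphone('Joshua'), ' ', metaphone('Josh'), "\n";  // JX JX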

While that's great news, the horrible news is that with 10-20k entries, you're going to need to perform somewhere close to one and a half metric ass-tons of comparisons if you use the first two options, and they aren't great performers. I'm not too sure about what that would be in big-O notation, but I think it's somewhere in the range of O(run away).

Pre-calculating a similarity breakdown using the latter two functions and then using some variety of grouping operation on the resulting data might prove to be a major performance and time win.
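As a rough sketch of that idea, assuming a hypothetical users table with id/first_name/last_name columns and an extra VARCHAR name_key column added to hold the phonetic key (adjust to your real schema):

    <?php
    $pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

    // One pass over the table to fill in the phonetic key per name.
    $rows = $pdo->query('SELECT id, first_name, last_name FROM users');
    $update = $pdo->prepare('UPDATE users SET name_key = ? WHERE id = ?');
    foreach ($rows as $row) {
        $key = metaphone($row['first_name'] . ' ' . $row['last_name']);
        $update->execute([$key, $row['id']]);
    }

    // Grouping on the key replaces pairwise comparison with one query:
    // rows that share a key are the duplicate candidates.
    $dupes = $pdo->query(
        'SELECT name_key, GROUP_CONCAT(id) AS ids, COUNT(*) AS n
         FROM users GROUP BY name_key HAVING n > 1'
    )->fetchAll();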


That would depend on your notion of "similarity". If you are looking for the number of characters that must be inserted, deleted or replaced in order to transform one string into another, the algorithm is called Levenshtein distance. Be warned, though, that it is quite slow for long strings (as each comparison uses a number of operations that is proportional to mn, where m and n are the lengths of the strings being compared), but if your data is email addresses and other short strings, you should be fine (and your biggest problem would be the number of comparisons, as you would need to compare each pair of strings to each other).
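A minimal sketch of that brute-force pairwise check in PHP, with the adjustable tolerance the question asks for (note that older PHP versions cap levenshtein() input at 255 characters, which is fine for e-mail addresses):

    <?php
    // Compare every pair of entries and keep those within $tolerance edits.
    // The cost is in the number of pairs, not each individual comparison.
    function findSimilar(array $entries, int $tolerance = 2): array {
        $pairs = [];
        $n = count($entries);
        for ($i = 0; $i < $n; $i++) {
            for ($j = $i + 1; $j < $n; $j++) {
                if (levenshtein($entries[$i], $entries[$j]) <= $tolerance) {
                    $pairs[] = [$entries[$i], $entries[$j]];
                }
            }
        }
        return $pairs;
    }

    // 'josh@test.com' and 'jash@test.com' are 1 edit apart, so they're flagged.
    print_r(findSimilar(['josh@test.com', 'jash@test.com', 'bob@example.com']));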


Given a maximum character distance, this sounds like a job for the bitap algorithm (Wu and Manber, "Fast Searching with Text Errors"). It's the core algorithm of the agrep program, and it can be quite fast when the number of acceptable character errors is limited. Google has published an implementation in library form for several languages. (The code for just doing the approximate match is relatively short and well documented.)
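
For reference, here's a compact sketch of fuzzy bitap in PHP, in the simplified substitutions-only form (the full Wu-Manber formulation also tracks insertions and deletions); it assumes the pattern is shorter than ~62 characters so the bitmask fits in a 64-bit PHP integer:

    <?php
    // Returns the position in $text where $pattern occurs with at most
    // $k substituted characters, or -1 if there is no such match.
    function bitapFuzzy(string $text, string $pattern, int $k): int {
        $m = strlen($pattern);
        if ($m === 0) return 0;

        // Per-character bitmasks: bit $i is CLEARED where $pattern[$i] === $c.
        $mask = array_fill(0, 256, ~0);
        for ($i = 0; $i < $m; $i++) {
            $mask[ord($pattern[$i])] &= ~(1 << $i);
        }

        // $R[$d] tracks partial matches with at most $d errors;
        // ~1 means "only the empty prefix is matched so far".
        $R = array_fill(0, $k + 1, ~1);

        for ($i = 0, $n = strlen($text); $i < $n; $i++) {
            $c = ord($text[$i]);
            $oldRd1 = $R[0];
            $R[0] = ($R[0] | $mask[$c]) << 1;
            for ($d = 1; $d <= $k; $d++) {
                $tmp = $R[$d];
                // Extend either a d-error match on a matching character,
                // or a (d-1)-error match by substituting this character.
                $R[$d] = (($R[$d] | $mask[$c]) & $oldRd1) << 1;
                $oldRd1 = $tmp;
            }
            if (($R[$k] & (1 << $m)) === 0) {
                return $i - $m + 1;  // start offset of the match
            }
        }
        return -1;
    }

    // 'josh' is found in 'jash@test.com' with one substitution, at offset 0.
    echo bitapFuzzy('jash@test.com', 'josh', 1), "\n";  // prints 0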

You're still looking at O(n²) for the total number of e-mail to e-mail comparisons (~400M for 20k e-mails). But a well-tuned implementation of a good comparison function like bitap should help to reduce the constant. You can probably also cull a bunch of comparisons by dividing the e-mails into groups based on length and only matching e-mails between groups that are within a limited difference in size (e.g., if your tolerance is 3 character differences, it's pointless to compare any 10-character e-mails to any 20-character e-mails); a sketch of that culling follows below. You should also be able to parallelize the comparisons if you have a multicore machine. Again, these are reductions in the constant, not the order, but I'd guess a good C++ implementation on a fast machine could handle this in a couple of minutes.
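
Here's what that length-bucketing cull could look like, sketched in PHP for consistency with the snippets above (the same idea carries straight over to C++):

    <?php
    // E-mails whose lengths differ by more than the tolerance can't be
    // within that many edits of each other, so only nearby buckets ever
    // need to be compared.
    function bucketedPairs(array $emails, int $tolerance): iterable {
        $buckets = [];
        foreach ($emails as $e) {
            $buckets[strlen($e)][] = $e;
        }
        foreach ($buckets as $len => $group) {
            foreach ($group as $i => $a) {
                // Remaining entries in the same bucket.
                foreach (array_slice($group, $i + 1) as $b) {
                    yield [$a, $b];
                }
                // Buckets up to $tolerance longer; the shorter side of
                // each cross-bucket pair is the one that generates it.
                for ($d = 1; $d <= $tolerance; $d++) {
                    foreach ($buckets[$len + $d] ?? [] as $b) {
                        yield [$a, $b];
                    }
                }
            }
        }
    }

    $emails = ['josh@test.com', 'jash@test.com', 'josh@test.test.com'];
    foreach (bucketedPairs($emails, 3) as [$a, $b]) {
        if (levenshtein($a, $b) <= 3) {
            echo "$a <> $b\n";  // possible duplicate
        }
    }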
