开发者

Finding groups of similar strings in a large set of strings

I have a reasonably large set of strings (say 100) which has a number of subgroups characterised by their similarity. I am trying to find/design an algorithm which would find theses groups reasonably efficiently.

As an example let's say the input list is on the left below, and the output groups are on the right.

Input                           Output
-----------------               -----------------
Jane Doe                        Mr Philip Roberts
Mr Philip Roberts               Phil Roberts     
Foo McBar                       Philip Roberts   
Da开发者_开发技巧vid Jones                     
Phil Roberts                    Foo McBar        
Davey Jones            =>         
John Smith                      David Jones      
Philip Roberts                  Dave Jones       
Dave Jones                      Davey Jones      
Jonny Smith                     
                                Jane Doe         

                                John Smith       
                                Jonny Smith 

Does anybody know of any ways to solve this reasonably efficiently?

The standard method for finding similar strings seems to be the Levenshtein distance, but I can't see how I can make good use of that here without having to compare every string to every other string in the list, and then somehow decide on a difference threshold for deciding if the two strings are in the same group or not.

An alternative would be an algorithm that hashes strings down to an integer, where similar strings hash to integers which are close together on the number-line. I have no idea what algorithm that would be though, if one even exists

Does anybody have any thoughts/pointers?


UPDATE: @Will A: Perhaps names weren't as good an example as I first thought. As a starting point I think I can assume that in the data I will be working with, a small change in a string will not make it jump from one group to another.


Another popular method is to associate the strings by their Jaccard index. Start with http://en.wikipedia.org/wiki/Jaccard_index.

Here's a article about using the Jaccard-index (and a couple of other methods) to solve a problem like yours:

http://matpalm.com/resemblance/


The problem you're trying to solve is a typical clusterization problem.

Start with simple K-Means algorithm and use Levenshtein distance as a function for calculating distance between elements and clusters centers.

BTW, algorithm for Levenshtein distance calculation is implemented in Apache Commons StringUtils - StringUtils.getLevenshteinDistance

The main problem of K-Means is that you should specify the number of clusters (subgroups in your terms). So, you'll have 2 options: improve K-Means with some euristic or use another clusterization algorithm which doesn't require specifying clusters number (but that algorithm can show worse performance and can be very difficult in implemenation if you decide to implement it yourself).


If we're talking about actual pronouncable words, comparing the (start of) their metaphone might be of assistance:

MRFLPRBRTS: Mr Philip Roberts
FLRBRTS: Phil Roberts   
FLPRBRTS: Philip Roberts 
FMKBR: Foo McBar      
TFTJNS: David Jones    
TFJNS: Dave Jones     
TFJNS: Davey Jones    
JNT: Jane Doe       
JNSM0: John Smith     
JNSM0: Jonny Smith


For the example you give, I reckon Levenshtein distance would be unsuitable as "Bonny Smith" would be 'very similar' to "Jonny Smith" and would almost certainly end up being considered in the same class.

I think you need to approach this (if working with names) from the point-of-view of certain names having synonyms (e.g. "John", "Jon", "Jonny", "Johnny" etc.) and matching based on these.


I have solved a problem like that, first of all I normalized the text, and get out of the string words without value to the entire string, like InC. OF USA ...

This unvalue words have to be defined by you.

After normalize I run a inspection by names using Jaro Winkler distance, and I grouped the results in an object with a list of similar objects.

It was really good.

I ran this in java with 30K people names

I hope this idea is going to be useful to someone


There is a solution for this exact problem documented in an open source java library for fuzzy matching https://github.com/intuit/fuzzy-matcher

The idea used there is to break down the names in words (tokens) and use text matching algorithm to find similarity in words (like Soundex, Jaccard or Lavenshtiein).

Then use the score found from each word and average the score for each name.

Performance for such matching is fairly critical, since if we keep matching every name with every other this becomes an exponentially growing complexity.

This library relies on equivalence and transitive property of the match algorithm Where if "David" matches with "Davey" then the reverse match is implied, and you don't have to run those matches

This library has a few more tricks to reduce the match complexity, and I was able to run a match against 4000 names in around 2 secs.


Here is SQL code for a Levenshtein function:

CREATE FUNCTION [Levenshtein](@str_1 nvarchar(4000), @str_2 nvarchar(4000))
RETURNS int
AS

BEGIN
 DECLARE    @str_1_len int
        ,   @str_2_len int
        ,   @str_1_itr int
        ,   @str_2_itr int
        ,   @str_1_char nchar
        ,   @Levenshtein int
        ,   @LD_temp int
        ,   @cv0 varbinary(8000)
        ,   @cv1 varbinary(8000)

SELECT  @str_1_len = LEN(@str_1)
    ,   @str_2_len = LEN(@str_2)
    ,   @cv1 = 0x0000
    ,   @str_2_itr = 1
    ,   @str_1_itr = 1
    ,   @Levenshtein = 0


WHILE @str_2_itr <= @str_2_len

SELECT  @cv1 = @cv1 + CAST(@str_2_itr AS binary(2))
    ,   @str_2_itr = @str_2_itr + 1

WHILE @str_1_itr <= @str_1_len
BEGIN
    SELECT  @str_1_char = SUBSTRING(@str_1, @str_1_itr, 1)
        ,   @Levenshtein = @str_1_itr
        ,   @cv0 = CAST(@str_1_itr AS binary(2))
        ,   @str_2_itr = 1

    WHILE @str_2_itr <= @str_2_len
    BEGIN
        SET @Levenshtein = @Levenshtein + 1
        SET @LD_temp = CAST(SUBSTRING(@cv1, @str_2_itr+@str_2_itr-1, 2) AS int) +
        CASE WHEN @str_1_char = SUBSTRING(@str_2, @str_2_itr, 1) THEN 0 ELSE 1 END
        IF @Levenshtein > @LD_temp SET @Levenshtein = @LD_temp
        SET @LD_temp = CAST(SUBSTRING(@cv1, @str_2_itr+@str_2_itr+1, 2) AS int)+1
        IF @Levenshtein > @LD_temp SET @Levenshtein = @LD_temp
        SELECT @cv0 = @cv0 + CAST(@Levenshtein AS binary(2)), @str_2_itr = @str_2_itr + 1
    END

SELECT @cv1 = @cv0, @str_1_itr = @str_1_itr + 1
END

RETURN @Levenshtein
END
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜