Compare 5000 strings with PHP Levenshtein

2022-12-14 16:37 问答作者：

I have 5000, so开发者_开发技巧metimes more, street address strings in an array. I'd like to compare them all with levenshtein to find similar matches. How can I do this without looping through all 5000 and comparing them directly with every other 4999?

Edit: I am also interested in alternate methods if anyone has suggestions. The overall goal is to find similar entries (and eliminate duplicates) based on user-submitted street addresses.

I think a better way to group similar addresses would be to:

create a database with two tables - one for the address (and a id), one for the soundexes of words or literal numbers in the address (with the foreign key of the addresses table)
uppercase the address, replace anything other than [A-Z] or [0-9] with a space
split the address by space, calculate the soundex of each 'word', leave anything with just digits as is and store it in the soundexes table with the foreign key of the address you started with

for each address (with id $target) find the most similar addresses:

SELECT similar.id, similar.address, count(*) 
FROM adress similar, word cmp, word src
WHERE src.address_id=$target
AND src.soundex=cmp.soundex
AND cmp.address_id=similar.id
ORDER BY count(*)
LIMIT $some_value;

calculate the levenstein difference between your source address and the top few values returned by the query.

(doing any sort of operation on large arrays is often faster in databases)

I think you cannot avoid looping through the array as the levenstein() function takes only strings and not an array as input.

You can do something like:

for($i=0;$i<count($array)-1;$i++)
{
    for($j=$i+1;$j<count($array);$j++)
    {
        $lev = levenshtein($array[$i],$array[$j]);
        if($lev == 0)
        {
            // exact match
        }
        else if($lev <= THRESHOLD)
        {
            // similar
        }
    }
}

You can use a bk-tree to speed-up the search/comparison.

http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees says:

Now we can make a particularly useful observation about the Levenshtein Distance: It forms a Metric Space.
[...]
Assume for a moment we have two parameters, query, the string we are using in our search, and n the maximum distance a string can be from query and still be returned. Say we take an arbitary string, test and compare it to query. Call the resultant distance d. Because we know the triangle inequality holds, all our results must have at most distance d+n and at least distance d-n from test.
[...]
Tests show that searching with a distance of 1 queries no more than 5-8% of the tree, and searching with two errors queries no more than 17-25% of the tree - a substantial improvement over checking every node!

edit: But that doesn't help you with your ("12 Bird Road, Apt 6" and "12 Bird Rd. #6") problem. Only with your brute-force m*n comparison.

Due to the nature of the Levenshtein algorithm (specifically the fact that it's a comparision between two strings), I can't see how this is possible.

You could of course reduce the number of comparisons by performing some basic matching requirements first, but this is out of the scope of what you're asking.

As a (quite possibly irrelevant) option, you could always use something like soundex which would let you pre-compute the string values. (You can also use it directly in MySQL I believe.)

You could group them based on soundexes then limit the comparisons to the nearest N cases...

 $mashed=array();
 foreach ($address as $key=>$val) {
      $mashed[$key]=soundex($val);
 }
 sort($mashed);

Then iterate through the keys of $mashed.

If you want to find all similar values, you will have to compare all items to all others. But choosing the right array functions will speed things up significantly. Here is a quick example (the results array could have been better):

$results = array();
$count = count($entries);
while ($count != 0) {
    # The entry to process
    $entry = array_shift($entries);
    # Get levenshtein distances to all others
    $result = array_map(
        'levenshtein',
        # array_map() needs two arrays, this one is an array consisting of
        # multiple entries of the value that we are processing
        array_fill($entry, 0, $count),
        $toCompare
    );
    $results[] = array($entry => $result);
    $count--;
}

Given you problem, I see no other way than to compare every address with every other address if you want to use Lehvenstein distance.

First of all, you should normalize addressess, get rid of abbreviations etc.

Ave -> Avenue
Rd. -> Road

You could have some fixed max Lehvenstein distance ( N ) for similar addresses.

If so, you could you could abort the Lehvenstein algorithm when you are sure that the edit distance for current address pair is larger than N. For this you need to write a custom version of Lehvenstein algorithm. This will make the algorithm a little quicker.

There are also some related trivial optimizations. For example: if address A is 10 chars long and address B is 20 chars long and you consider addresses that have Lehvenstein distance of less than 8 to be similar. You can look lengths of addresses and immediately decide that they are not similar.

$stringA = "this is php programming language";
$stringB = "this is complete programming script in which java php and  all other minor languages include";

echo "string 1---->".$stringA."<br />";
echo "string 2---->".$stringB."<br />";
// changing string to arrays
$array1 = explode(' ', $stringA);
$array2 = explode(' ', $stringB);

// getting same element from two strings
$c = array_intersect($array1, $array2);
// changing array to the string
$d=implode(' ',$c);

echo "string same elements---> ".$d."<br />";


// getting difrent  element from two arrays
$result = array_diff($array2, $array1);
// changing array to the string
$zem = implode(' ',$result);

if (!empty($zem)) {
  echo "string diffrence---> ".$zem."<br />";
} else {
  echo "string diffrence--->both strings are same <br />";
}

similar_text($stringA, $d, $p);
echo " similarity between the string is ".$p."% <br />";

继续阅读：database levenshtein-distance php similarity street-address

Compare 5000 strings with PHP Levenshtein

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？