Compare 5000 strings with PHP Levenshtein
I have 5000, so开发者_开发技巧metimes more, street address strings in an array. I'd like to compare them all with levenshtein to find similar matches. How can I do this without looping through all 5000 and comparing them directly with every other 4999?
Edit: I am also interested in alternate methods if anyone has suggestions. The overall goal is to find similar entries (and eliminate duplicates) based on user-submitted street addresses.
I think a better way to group similar addresses would be to:
create a database with two tables - one for the address (and a id), one for the soundexes of words or literal numbers in the address (with the foreign key of the addresses table)
uppercase the address, replace anything other than [A-Z] or [0-9] with a space
split the address by space, calculate the soundex of each 'word', leave anything with just digits as is and store it in the soundexes table with the foreign key of the address you started with
for each address (with id $target) find the most similar addresses:
SELECT similar.id, similar.address, count(*) FROM adress similar, word cmp, word src WHERE src.address_id=$target AND src.soundex=cmp.soundex AND cmp.address_id=similar.id ORDER BY count(*) LIMIT $some_value;
calculate the levenstein difference between your source address and the top few values returned by the query.
(doing any sort of operation on large arrays is often faster in databases)
I think you cannot avoid looping through the array as the levenstein() function takes only strings and not an array as input.
You can do something like:
for($i=0;$i<count($array)-1;$i++)
{
for($j=$i+1;$j<count($array);$j++)
{
$lev = levenshtein($array[$i],$array[$j]);
if($lev == 0)
{
// exact match
}
else if($lev <= THRESHOLD)
{
// similar
}
}
}
You can use a bk-tree to speed-up the search/comparison.
http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees says:
Now we can make a particularly useful observation about the Levenshtein Distance: It forms a Metric Space.
[...]
Assume for a moment we have two parameters, query, the string we are using in our search, and n the maximum distance a string can be from query and still be returned. Say we take an arbitary string, test and compare it to query. Call the resultant distance d. Because we know the triangle inequality holds, all our results must have at most distance d+n and at least distance d-n from test.
[...]
Tests show that searching with a distance of 1 queries no more than 5-8% of the tree, and searching with two errors queries no more than 17-25% of the tree - a substantial improvement over checking every node!
edit: But that doesn't help you with your ("12 Bird Road, Apt 6" and "12 Bird Rd. #6") problem. Only with your brute-force m*n comparison.
Due to the nature of the Levenshtein algorithm (specifically the fact that it's a comparision between two strings), I can't see how this is possible.
You could of course reduce the number of comparisons by performing some basic matching requirements first, but this is out of the scope of what you're asking.
As a (quite possibly irrelevant) option, you could always use something like soundex
which would let you pre-compute the string values. (You can also use it directly in MySQL I believe.)
You could group them based on soundexes then limit the comparisons to the nearest N cases...
$mashed=array();
foreach ($address as $key=>$val) {
$mashed[$key]=soundex($val);
}
sort($mashed);
Then iterate through the keys of $mashed.
C.
If you want to find all similar values, you will have to compare all items to all others. But choosing the right array functions will speed things up significantly. Here is a quick example (the results array could have been better):
$results = array();
$count = count($entries);
while ($count != 0) {
# The entry to process
$entry = array_shift($entries);
# Get levenshtein distances to all others
$result = array_map(
'levenshtein',
# array_map() needs two arrays, this one is an array consisting of
# multiple entries of the value that we are processing
array_fill($entry, 0, $count),
$toCompare
);
$results[] = array($entry => $result);
$count--;
}
Given you problem, I see no other way than to compare every address with every other address if you want to use Lehvenstein distance.
First of all, you should normalize addressess, get rid of abbreviations etc.
- Ave -> Avenue
- Rd. -> Road
You could have some fixed max Lehvenstein distance ( N ) for similar addresses.
If so, you could you could abort the Lehvenstein algorithm when you are sure that the edit distance for current address pair is larger than N. For this you need to write a custom version of Lehvenstein algorithm. This will make the algorithm a little quicker.
There are also some related trivial optimizations. For example: if address A is 10 chars long and address B is 20 chars long and you consider addresses that have Lehvenstein distance of less than 8 to be similar. You can look lengths of addresses and immediately decide that they are not similar.
$stringA = "this is php programming language";
$stringB = "this is complete programming script in which java php and all other minor languages include";
echo "string 1---->".$stringA."<br />";
echo "string 2---->".$stringB."<br />";
// changing string to arrays
$array1 = explode(' ', $stringA);
$array2 = explode(' ', $stringB);
// getting same element from two strings
$c = array_intersect($array1, $array2);
// changing array to the string
$d=implode(' ',$c);
echo "string same elements---> ".$d."<br />";
// getting difrent element from two arrays
$result = array_diff($array2, $array1);
// changing array to the string
$zem = implode(' ',$result);
if (!empty($zem)) {
echo "string diffrence---> ".$zem."<br />";
} else {
echo "string diffrence--->both strings are same <br />";
}
similar_text($stringA, $d, $p);
echo " similarity between the string is ".$p."% <br />";
精彩评论