Fuzzy Text Search: Regex Wildcard Search Generator?
I'm wondering if there is some way to do fuzzy string matching in PHP: looking for a word in a long string and finding a potential match even if it's misspelled, for example off by one character due to an OCR error.
I was thinking a regex generator might be able to do it. So given an input of "crazy" it would generate this regex:
.*((crazy)|(.+razy)|(c.+azy)|(cr.+zy)|(cra.+y)|(craz.+)).*
It would then return all matches for that word or variations of that word.
How to build the generator: I would probably split the search word into an array of characters and build the regex with a foreach over that array, replacing the character at each position (the array key) with ".+".
Is this a good way to do fuzzy text search, or is there a better way? What about some kind of string comparison that gives me a score based on how close it is? In short, I'm trying to see if some badly converted OCR text contains a word.
String distance functions are useless when you don't know what the right word is. I'd suggest pspell functions:
$p = pspell_new("en");
print_r(pspell_suggest($p, "crazzy"));
http://www.php.net/manual/en/function.pspell-suggest.php
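echo generateRegex("crazy");

// Builds an alternation that matches $word exactly, or with any
// single character replaced by a wildcard.
function generateRegex($word)
{
    $len = strlen($word);
    $regex = "\b((" . $word . ")";
    for ($i = 0; $i < $len; $i++) {
        $temp = $word;
        $temp[$i] = '.'; // wildcard at position $i
        $regex .= "|(" . $temp . ")";
    }
    $regex = $regex . ")\b";
    return $regex;
}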
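To actually run a pattern like this against text, it needs regex delimiters, and the word should be escaped in case it contains regex metacharacters. Here is a minimal sketch of that. The helper name `findFuzzyMatches`, the case-insensitive flag, and the use of `\w` instead of `.` as the wildcard (so a wildcard cannot swallow a space) are my assumptions, not part of the answer above:

```php
<?php
// Hypothetical helper (not from the original answer): returns the words in
// $text that equal $word or differ from it by one substituted character.
function findFuzzyMatches(string $word, string $text): array
{
    // First alternative: the exact word, with metacharacters escaped.
    $alternatives = [preg_quote($word, '/')];
    $len = strlen($word);
    for ($i = 0; $i < $len; $i++) {
        // Escape the parts around position $i and put a wildcard at $i.
        $alternatives[] = preg_quote(substr($word, 0, $i), '/')
            . '\w'
            . preg_quote(substr($word, $i + 1), '/');
    }
    $pattern = '/\b(?:' . implode('|', $alternatives) . ')\b/i';
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

print_r(findFuzzyMatches('crazy', 'The crozy fox was crazy, not lazy or craxy.'));
```

This only catches single-character substitutions. An inserted or dropped character (such as "crazzy") will not match.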
Levenshtein distance is one example of a string edit distance. There are different metrics for different purposes; familiarize yourself with them and find the one that works for you.
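PHP has a built-in `levenshtein()` function, so the "score" idea from the question can be tried directly: split the OCR text into words and keep those within a small edit distance of the target. A rough sketch follows. The helper name `findNearWords`, the tokenization rule, and the default threshold of 1 are illustrative assumptions:

```php
<?php
// Hypothetical helper: maps each word in $text that is within $maxDistance
// edits of $needle to its Levenshtein distance.
function findNearWords(string $needle, string $text, int $maxDistance = 1): array
{
    $hits = [];
    // Assumption: words are runs of ASCII letters and digits.
    $words = preg_split('/[^A-Za-z0-9]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        // Compare case-insensitively by lowercasing both sides.
        $distance = levenshtein(strtolower($needle), strtolower($word));
        if ($distance <= $maxDistance) {
            $hits[$word] = $distance;
        }
    }
    return $hits;
}

print_r(findNearWords('crazy', 'The crozy fox was crazy, not lazy or crazzy.'));
```

Unlike the single-wildcard regex, this also catches inserted or dropped characters. Because it splits on punctuation, a letter misread by OCR as a symbol would break the word in two, so the tokenization may need adjusting for real data.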