Algorithm in .NET to detect simple typos
Is there an existing algorithm for .NET that would be able to detects typos from a list of predefined words?
Example, suppose the word "Stuff" is in my list and someone types "Stuf", "sutff", or "stff" or "stiff". I would like to be able to suggest the person that the word "Stuff" might 开发者_开发知识库be the right word.
I am not talking about anything grammatical or anything more than maybe one letter missing, substituted or mixed.
My goal is to prevent the same word being entered in a list with different typing. Not that uppercase and lowercase do not cause a problem for me as everything is always lowercase.
This is a well-studied problem and there are many good algorithms for doing this. Most of them work by constructing some sort of data structure to hold all of the legal words in a way where words with similar edit distance can be found efficiently. Informally, the edit distance between two strings is the number of changes you would need to make to transform one string into another. For example, given the words "misspell" and "mispell," the edit distance is one (just insert another 's' into the word), while the edit distance between "cat" and "dog" is three (replace each letter).
A misspelled word is likely only a small edit distance away from the word that was intended, so if you can store the words in a way where you can, for any string, query words that are a small edit distance away from the string, you can provide a candidate list of possible words that the user may have meant.
One common data structure for holding this data is a trie, a 26-way tree structure where each node stores a prefix of a word and each edge corresponds to appending some letter to the current prefix. If you have such a structure, you can find words that are "close" to a particular word (perhaps a certain edit distance away) using a simple recursive tree search. At each point, keep track of how far of an edit distance you wish to be away from the target word and how much of the misspelled word that you have processed so far. At each point, you can either follow the edge in the trie corresponding to the letter in the word, or you can use up one edit distance to insert a new letter by following a different edge in the trie.
Another data structure that is often used for this is a BK-tree, which stores strings in a way where you can efficiently query all words a certain edit distance away from some source string. This would more directly solve your problem, though there are fewer online resources on how to construct BK-trees compared with tries.
Once you've found the words that are within some edit distance, you'll probably want to rank them somehow before presenting them to the user. This requires you to have some idea of what sorts of typos people tend to make in practice. Common typos include
- Transpositions, where two letters are swapped: "thme" instead of "them"
- Substitutions, where the wrong letter is used: "arithmatic" instead of "arithmetic"
- Deletions, where a letter is left out: "helo" instead of "hello"
- Repetition, where a letter is duplicated: "tommorrow" instead of "tomorrow"
To build a good spell checker, you would ideally have some sort of probabilities associated with each type of mistake. That way, when you had your list of possible corrections, you could rank them from most probable to least probable.
Hope this helps!
Here's a good step by step to do what you are looking for implemented in python, but there are links to implementations in C# and other languages too.
http://norvig.com/spell-correct.html
Maybe you want to look for a trigram search? A trigram search need to create every possibilites of 3 letters of your input and look for similar strings in the match.
Posting my implementation in C#, allowed for strings with length >= 2. It checks if 2 words match disregarding Transpositions, Substitutions, Deletions (valid for words with length >=2 after deletion) and Repetitions (multiple allowed).
public bool CheckWordsSameWithTypo(string word1, string word2)
{
if (word1.Length < 2 || word2.Length < 2) return false;
//transposition "thme" instead of "them"
bool matchWithTrans = false;
#region transLogic
var transOptions1 = new List<string>();
var transOptions2 = new List<string>();
for (int i = 1; i < word1.Length; i++)
{
var wordArr = word1.ToArray();
char letter1 = wordArr[i -1];
char letter2 = wordArr[i];
wordArr[i -1] = letter2;
wordArr[i] = letter1;
transOptions1.Add(new string(wordArr));
}
for (int i = 1; i < word2.Length; i++)
{
var wordArr = word2.ToArray();
char letter1 = wordArr[i -1];
char letter2 = wordArr[i];
wordArr[i -1] = letter2;
wordArr[i] = letter1;
transOptions2.Add(new string(wordArr));
}
if (transOptions1.Any(p => p.Equals(word2)) || transOptions2.Any(p => p.Equals(word1))) matchWithTrans = true;
#endregion
//substitution "arithmatic" instead of "arithmetic"
bool matchWithSubst = false;
#region substLogic
var substOptionPatterns1 = new List<string>();
var substOptionPatterns2 = new List<string>();
for (int i = 0; i < word1.Length; i++)
{
var wordArr = word1.ToArray();
wordArr[i] = '.';
substOptionPatterns1.Add(new string(wordArr));
}
for (int i = 0; i < word2.Length; i++)
{
var wordArr = word2.ToArray();
wordArr[i] = '.';
substOptionPatterns2.Add(new string(wordArr));
}
foreach(var patt in substOptionPatterns1)
{
Regex regex = new Regex(patt);
if (regex.IsMatch(word2)) matchWithSubst = true;
}
foreach(var patt in substOptionPatterns2)
{
Regex regex = new Regex(patt);
if (regex.IsMatch(word1)) matchWithSubst = true;
}
#endregion
//deletion "helo" instead of "hello"
bool matchWithDeletion = false;
#region deletionLogic
var delOptions1 = new List<string>();
var delOptions2 = new List<string>();
for (int i = 0; i < word1.Length; i++)
{
delOptions1.Add(word1.Remove(i, 1));
}
for (int i = 0; i < word2.Length; i++)
{
delOptions2.Add(word2.Remove(i, 1));
}
if (delOptions1.Any(p => p.Length>1 && p.Equals(word2)) || delOptions2.Any(p => p.Length>1 && p.Equals(word1))) matchWithDeletion = true;
#endregion
//repetition "tommorrow" instead of "tomorow"
bool matchWithRepetition = false;
#region repsLogic
StringBuilder word1_distinctBuilder = new StringBuilder(word1.Substring(0, 1));
for (int i = 1; i < word1.Length; i++)
{
string currentLetter = word1.Substring(i, 1);
if(!word1_distinctBuilder.ToString().Substring(word1_distinctBuilder.ToString().Length-1, 1).Equals(currentLetter))
{
word1_distinctBuilder.Append(currentLetter);
}
}
StringBuilder word2_distinctBuilder = new StringBuilder(word2.Substring(0, 1));
for (int i = 1; i < word2.Length; i++)
{
string currentLetter = word2.Substring(i, 1);
if(!word2_distinctBuilder.ToString().Substring(word2_distinctBuilder.ToString().Length-1, 1).Equals(currentLetter))
{
word2_distinctBuilder.Append(currentLetter);
}
}
matchWithRepetition = word1_distinctBuilder.ToString().Equals(word2_distinctBuilder.ToString());
#endregion
return matchWithTrans || matchWithSubst || matchWithDeletion || matchWithRepetition;
}
精彩评论