Using the Levenshtein distance in a spell checker

2023-02-18 07:45 问答作者：

I am working on a spell checker in C++ and I'm stuck at a certai开发者_如何学JAVAn step in the implementation.

Let's say we have a text file with correctly spelled words and an inputted string we would like to check for spelling mistakes. If that string is a misspelled word, I can easily find its correct form by checking all words in the text file and choosing the one that differs from it with a minimum of letters. For that type of input, I've implemented a function that calculates the Levenshtein edit distance between 2 strings. So far so good.

Now, the tough part: what if the inputted string is a combination of misspelled words? For example, "iloevcokies". Taking into account that "i", "love" and "cookies" are words that can be found in the text file, how can I use the already-implemented Levenshtein function to determine which words from the file are suitable for a correction? Also, how would I insert blanks in the correct positions?

Any idea is welcome :)

Spelling correction for phrases can be done in a few ways. One way requires having an index of word bi-grams and tri-grams. These of course could be immense. Another option would be to try permutations of the word with spaces inserted, then doing a lookup on each word in the resulting phrase. Take a look at a simple implementation of a spell checker by Peter Norvig from Google. Either way, consider using an n-gram index for better performance, there are libraries available in C++ for reference.

Google and other search engines are able to do spelling correction on phrases because they have a large index of queries and associated result sets, which allows them to calculate a statistically good guess. Overall, the spelling correction problem can become very complex with methods like context-sensitive correction and phonetic correction. Given that using permutations of possible sub-terms can become expensive you can utilize certain types of heuristics, however this can get out of scope quick.

You may also consider using and existing spelling library, such as aspell.

A starting point for an idea: one of the top hits of your L-distance for "iloevcokies" should be "cookies". If you can change your L-distance function to also track and return a min-index and max-index (i.e., this match is best starting from character 5 and going to character 10) then you could remove that substring and re-check L-distance for the string before it and after it, then concatenate those for a suggestion....

Just a thought, good luck....

I will suppose that you have an existing index, on which you run your levenshtein distance (for example, a Trie, but any sorted index generally work well).

You can consider the addition of white-spaces as a regular edit operation, it's just that there is a twist: you need (then) to get back to the root of your index for the next word.

This way you get the same index, almost the same route, approximately the same traversal, and it should not even impact your running time that much.

继续阅读：algorithm levenshtein-distance spell-checking

Using the Levenshtein distance in a spell checker

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？