Max edit distance and suggestion based on word frequency
I need a spell checker with the following specification:
- Very scalable.
- Able to set a maximum edit distance for the suggested words.
- Able to rank suggestions by provided word frequencies (most common word first).
I took a look at Hunspell.
I found the MAXDIFF parameter in the man page, but it doesn't seem to work as expected. Maybe I'm using it the wrong way.
file t.aff:
MAXDIFF 1
file dico.dic:
5
rouge
vert
bleu
bleue
orange
code:
NHunspell.Hunspell h = new NHunspell.Hunspell("t.aff", "dico.dic");
List<string> s = h.Suggest("bleuue");
This returns the same thing whether t.aff is empty or not:
bleue
bleu
We decided to use Apache Solr, which exactly fulfills our needs.
http://wiki.apache.org/solr/SpellCheckComponent#spellcheck
A MAXDIFF of one should return fewer suggestions, but it can still return more than one.
Even a MAXDIFF of zero can give more than a single result, although it should lower the chance. It depends on the n-gram matching, which is what MAXDIFF tunes, so it is not a maximum edit distance. Try a MAXDIFF of zero for fewer results, but this still doesn't guarantee that you will get a single suggestion.
For your requirement to sort by the most frequent word, the Google ngram corpus is publicly available and can serve as a source of word frequencies.
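If you stay with Hunspell, one possible workaround is to post-process whatever Suggest() returns: drop candidates beyond your maximum edit distance, then order the rest by frequency. The sketch below is only an illustration, not part of NHunspell. The SuggestionFilter class, its Rank method, and the frequency dictionary are hypothetical names, and you would have to fill the dictionary from your own word counts. Note that it can only filter what Hunspell already suggested; it cannot add candidates.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of NHunspell.
public static class SuggestionFilter
{
    // Plain Levenshtein distance (insert, delete, substitute), two-row version.
    public static int EditDistance(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            var tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return prev[b.Length];
    }

    // Keeps candidates within maxDistance of the input, most frequent first.
    // Words missing from the (assumed) frequency table count as frequency 0;
    // ties are broken by the smaller edit distance.
    public static List<string> Rank(string input, IEnumerable<string> candidates,
                                    IDictionary<string, long> frequencies,
                                    int maxDistance)
    {
        return candidates
            .Where(w => EditDistance(input, w) <= maxDistance)
            .OrderByDescending(w => frequencies.TryGetValue(w, out var f) ? f : 0L)
            .ThenBy(w => EditDistance(input, w))
            .ToList();
    }
}
```

With the suggestions from the question, SuggestionFilter.Rank("bleuue", h.Suggest("bleuue"), frequencies, 1) keeps only bleue, because bleu is two edits away from bleuue. Raising the limit to 2 keeps both and lets the frequency table decide the order.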