
English Lexicon for Search Query Correction

I'm building a spelling corrector for search engine queries by implementing the method described in "Spelling correction as an iterative process that exploits the collective knowledge of web users".

The high-level approach is as follows: for a given query, generate possible correction candidates (words from the query log within a certain edit distance) for each unigram and bigram, then perform a modified Viterbi search to find the most likely sequence of candidates given bigram frequencies. This process is repeated until the output sequence stops changing, i.e. it has converged on the maximum-probability correction.
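For concreteness, here is a rough sketch of what I have in mind for the candidate-generation step; the names (candidates, edit_distance, the vocab iterable of query-log terms) are just placeholders, and a real implementation would index the vocabulary rather than scan it linearly:

    def edit_distance(a: str, b: str) -> int:
        """Standard Levenshtein distance via dynamic programming."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # delete ca
                               cur[j - 1] + 1,              # insert cb
                               prev[j - 1] + (ca != cb)))   # substitute
            prev = cur
        return prev[-1]

    def candidates(term, vocab, max_dist=2):
        """Query-log terms within max_dist edits of term (term itself included)."""
        return [w for w in vocab if edit_distance(term, w) <= max_dist]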

The modification to the Viterbi search is that if two adjacent words are both found in a trusted lexicon, at most one of them can be corrected. This is especially important for avoiding the correction of properly spelled single-word queries to higher-frequency words.
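To make that constraint concrete, here is a rough sketch of how I plan to enforce it inside the Viterbi transition step. This is only my reading of the paper, not its exact formulation: bigram_logprob is a placeholder for a log-probability taken from the query-log bigram counts, candidate_lists holds the candidates for each original term (including the term itself), and the initial scores are left uniform for brevity:

    import math

    def viterbi_correct(query_terms, candidate_lists, lexicon, bigram_logprob):
        """One Viterbi pass over correction candidates with the lexicon constraint:
        if two adjacent original terms are both in the trusted lexicon,
        at most one of them may be replaced by a different candidate."""
        # trellis[i][cand] = (best log-prob of a path ending in cand, backpointer)
        trellis = [{c: (0.0, None) for c in candidate_lists[0]}]
        for i in range(1, len(query_terms)):
            both_trusted = (query_terms[i - 1] in lexicon and
                            query_terms[i] in lexicon)
            column = {}
            for cand in candidate_lists[i]:
                best_score, best_prev = -math.inf, None
                for prev, (score, _) in trellis[i - 1].items():
                    # Skip transitions that would correct both trusted words.
                    if (both_trusted and prev != query_terms[i - 1]
                            and cand != query_terms[i]):
                        continue
                    s = score + bigram_logprob(prev, cand)
                    if s > best_score:
                        best_score, best_prev = s, prev
                if best_prev is not None:
                    column[cand] = (best_score, best_prev)
            trellis.append(column)
        # Backtrack from the highest-scoring final candidate.
        cand, (_, prev) = max(trellis[-1].items(), key=lambda kv: kv[1][0])
        corrected = [cand]
        for column in reversed(trellis[:-1]):
            corrected.append(prev)
            prev = column[prev][1]
        corrected.reverse()
        return corrected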

My question is where to find such a lexicon. It should be in English and contain proper nouns (first/last names, places, brand names, etc.) that are likely to show up in search queries, as well as both common and uncommon English words. Even a push in the right direction would be useful.

Also, if anyone reading this has suggestions for improving on the methodology described in the paper, I am open to those as well, since this is my first foray into NLP.


The best lexicon for this purpose is probably the Google Web 1T 5-gram data set.

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13

Unfortunately, it is not free unless your university is a member of LDC.

You could also try the corpora that ship with packages like Python's NLTK, but the Google data set seems best for your purpose since it is already related to search queries.
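For example, NLTK's bundled words and names corpora can at least bootstrap the common-word and first-name parts of such a lexicon (places and brand names would still need another source). A minimal sketch, assuming a lower-cased set is acceptable for lookups:

    import nltk
    nltk.download('words')   # Unix-style English word list
    nltk.download('names')   # lists of common first names

    from nltk.corpus import names, words

    # Case-insensitive trusted lexicon: common English words plus first names.
    lexicon = {w.lower() for w in words.words()} | {n.lower() for n in names.words()}
    print(len(lexicon))      # size of the combined lexicon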
