
Full-text search with spelling changes/mistakes

We have many objects, and each object comes with a description of around 100-200 words (for example, a book's author name and a short summary).

The user gives input as a series of words. How can we implement a search that tolerates approximate text and minor spelling changes? For example, "Joshua Bloch", "Joshua blosh", and "joshua block" could all lead to the same text result.


If you are using Lucene for your full-text search, there is a "Did you mean" extension that is probably what you want.


How to implement search with approximate text and minor spelling changes? For example, "Joshua Bloch", "Joshua blosh", "joshua block" could lead to the same text result.

Does your database support Soundex? Soundex matches similar-sounding words, which seems to fit the example you gave above. Even if your database doesn't have native Soundex support, you can still write an implementation and save the Soundex code for each author name in a separate field. That field can then be used for matching later.

However, Soundex is not a replacement for full-text search; it will only help in specific cases like author names. If you are looking to find some specific text from, say, the book's blurb, then you are better off with a full-text search option (like PostgreSQL's).
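
As a rough illustration of the Soundex idea above, here is a minimal sketch of the classic American Soundex rules in Python. The function name `soundex` is just a choice for this example and is not taken from any particular database; a real database's built-in function may differ in edge cases.

```python
def soundex(name: str) -> str:
    """Return the 4-character American Soundex code for a name."""
    # Map each consonant to its Soundex digit; vowels, H, W and Y get no digit.
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            codes[ch] = digit

    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""

    result = letters[0]                  # the first letter is kept as-is
    prev = codes.get(letters[0], "")     # digit of the previous coded letter
    for ch in letters[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:      # skip adjacent letters with the same digit
            result += digit
        if ch not in "HW":               # H and W do not separate duplicates
            prev = digit
    return (result + "000")[:4]          # pad or truncate to 4 characters


# The three spellings from the question share one code.
for surname in ("Bloch", "blosh", "block"):
    print(surname, soundex(surname))     # each prints B420
```

Storing this code in a separate column when a record is saved, and comparing it with the code of the user's input, gives the matching described above. It only works word by word, so it suits names better than whole descriptions.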


If you are looking for an actual implementation of this feature, here is a brilliant program written by Peter Norvig: http://norvig.com/spell-correct.html

It also has links to implementations in many other languages including Java, C etc.
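
To give a feel for the approach, here is a condensed sketch in the spirit of that article, not a copy of it. It generates every string within one or two edits of the input and picks the most frequent known word. The tiny `CORPUS` string is a made-up stand-in for your own descriptions, and the helper names are chosen for this example.

```python
import re
import string
from collections import Counter


def words(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def known(candidates, counts):
    """Keep only the candidates that occur in the word counts."""
    return {w for w in candidates if w in counts}


def correct(word, counts):
    """Return the most frequent known word within two edits, else the word itself."""
    word = word.lower()
    candidates = (known([word], counts)
                  or known(edits1(word), counts)
                  or known({e2 for e1 in edits1(word) for e2 in edits1(e1)}, counts)
                  or {word})
    return max(candidates, key=lambda w: counts.get(w, 0))


# Hypothetical corpus; in practice, build the counts from your own descriptions.
CORPUS = "Effective Java by Joshua Bloch. Java Puzzlers by Joshua Bloch and Neal Gafter."
COUNTS = Counter(words(CORPUS))

print(correct("blosh", COUNTS))   # bloch
print(correct("block", COUNTS))   # bloch
```

Because the vocabulary comes from your own data, "block" is corrected to "bloch" here; with a general English word list it would be treated as correctly spelled, so the choice of corpus matters.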


You can use the spell checker JOrtho. From the content in your database you can generate a custom dictionary and set it. Then all words that are neither in the dictionary nor in your database are marked as misspelled.
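
The custom-dictionary idea can be sketched independently of any particular library. The following Python example does not use JOrtho or its API; it only illustrates building a vocabulary from stored descriptions and flagging query words that are missing from it, using the standard `difflib` module for suggestions. The function names and the 0.7 cutoff are assumptions made for this sketch.

```python
import difflib
import re


def build_dictionary(descriptions):
    """Collect every lowercase word that appears in the stored descriptions."""
    vocab = set()
    for text in descriptions:
        vocab.update(re.findall(r"[a-z]+", text.lower()))
    return vocab


def check_query(query, vocab, cutoff=0.7):
    """Map each query word missing from vocab to a list of close known words."""
    report = {}
    for word in re.findall(r"[a-z]+", query.lower()):
        if word not in vocab:
            report[word] = difflib.get_close_matches(
                word, sorted(vocab), n=3, cutoff=cutoff)
    return report


# Hypothetical data standing in for rows read from the database.
vocab = build_dictionary(["Effective Java by Joshua Bloch"])
print(check_query("joshua blosh", vocab))
```

Words that come back with suggestions can be replaced before running the actual search, or shown to the user as a "did you mean" hint.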


Instead of using Lucene directly, please check out Solr. Lucene is a library that you embed in your application to add search functionality. Solr is a standalone search server built on top of Lucene, which your application talks to over HTTP APIs. For most systems, Solr will save you from dealing with the complexity of Lucene.


Apache Lucene may fit the bill. It is a high-performance, full-text search engine library written entirely in Java.
