Fuzzy Queries in Lucene

2023-01-09 20:56 Q&A author:

I am using Lucene in Java and indexing a table in our database based on company name. After indexing, I wish to do a fuzzy match (Levenshtein distance) on a value we are about to insert into the database. The reason is that we do not want to enter duplicates caused by spelling errors.

For example, if I have the company name "Widget Makers XYZ", I don't want to insert "Widget Maker XYZ".

From what I've read, Lucene's fuzzy match algorithm should give me a number between 0 and 1. I want to do some testing and then determine an adequate threshold for deciding what is valid or invalid.

The problem is that I am stuck, and after searching what seems like everywhere on the internet, I need the StackOverflow community's help.

Like I said I have indexed the database on company name, and then have the following code:

IndexSearcher searcher = new IndexSearcher(directory);  

QueryParser parser = new QueryParser(Version.LUCENE_30, "company", analyzer);

Query fuzzy_query = new FuzzyQuery(new Term("company", "Center"));

I run into the problem after this: I do not know how to get the fuzzy match value. I know the code must look something like the following, but no collector seems to fit my needs. (As you can see, right now I am only able to count the number of matches, which is useless to me.)

TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);

searcher.search(fuzzy_query, collector);

System.out.println("\ncollector.getTotalHits() = " + collector.getTotalHits());

Also I am unable to use the ComplexPhraseQueryParser class which is shown in the Lucene documentation. I am doing:

import org.apache.lucene.queryParser.*;

Does anybody have an idea as to why it's inaccessible or what I am doing wrong? Apologies for the length of the question.

You do not need Lucene to get the score. Take a look at the SimMetrics library; it is very simple to use. Just add the jar and use it like this:

// Java naming in SimMetrics: class Levenshtein, method getSimilarity (result is between 0 and 1)
Levenshtein ld = new Levenshtein();
float sim = ld.getSimilarity(string1, string2);

Also note that, depending on the type of data (e.g. longer strings, the number of whitespace characters, etc.), you might want to look at other algorithms such as Jaro-Winkler or Smith-Waterman.

You could use the above to decide whether to collapse fuzzy duplicate strings into one "master" string, and then index.
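
If you would rather not add a dependency while experimenting with thresholds, the same idea can be sketched in plain Java. This is only an illustration: the class and method names (FuzzyDuplicateCheck, distance, similarity, isLikelyDuplicate) are hypothetical and belong to neither Lucene nor SimMetrics, the normalization (1 minus distance divided by the longer length) is one common choice rather than necessarily what either library does, and the 0.9 threshold is an assumption you would tune on your own company names.

```java
// Illustrative sketch only: all names here are hypothetical,
// not part of Lucene or SimMetrics.
final class FuzzyDuplicateCheck {

    // Dynamic-programming Levenshtein distance using two rolling rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b costs j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" costs i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Normalizes the distance to the range [0, 1]; 1.0 means identical strings.
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0;
        }
        return 1.0 - (double) distance(a, b) / maxLen;
    }

    // Case-insensitive check; the threshold is an assumption to tune on real data.
    static boolean isLikelyDuplicate(String candidate, String existing, double threshold) {
        return similarity(candidate.toLowerCase(), existing.toLowerCase()) >= threshold;
    }

    public static void main(String[] args) {
        System.out.println(similarity("Widget Makers XYZ", "Widget Maker XYZ"));
        System.out.println(isLikelyDuplicate("Widget Maker XYZ", "Widget Makers XYZ", 0.9));
    }
}
```

For the example from the question, the two names differ by a single deleted character out of 17, so they land well above a 0.9 threshold, while unrelated names fall far below it. Comparing a new name against every existing one is linear in the number of companies, so for a large table you might still use the Lucene fuzzy query to shortlist candidates and apply a check like this only to the hits.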

You can get the match values with:

TopDocs topDocs = collector.topDocs();
for(ScoreDoc scoreDoc : topDocs.scoreDocs) {
    System.out.println(scoreDoc.score);
}
