Fuzzy Queries in Lucene

I am using Lucene in Java and indexing a table in our database by company name. After indexing, I want to do a fuzzy match (Levenshtein distance) against a value we are about to insert into the database, because we do not want to enter duplicates caused by spelling errors.

For example, if I have the company name "Widget Makers XYZ", I don't want to insert "Widget Maker XYZ".

From what I've read, Lucene's fuzzy match algorithm should give me a number between 0 and 1. I want to do some testing and then determine an adequate threshold for deciding what is valid or invalid.

The problem is I am stuck, and after searching what seems like everywhere on the internet, need the StackOverflow community's help.

As I said, I have indexed the database on company name, and I have the following code:

IndexSearcher searcher = new IndexSearcher(directory);  

new QueryParser(Version.LUCENE_30, "company", analyzer);

Query fuzzy_query = new FuzzyQuery(new Term("company", "Center"));

The problem comes afterwards: I do not know how to get the fuzzy match value. I know the code must look something like the following, but no collector seems to fit my needs. (As you can see, right now I can only count the number of matches, which is useless to me.)

TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);

searcher.search(fuzzy_query, collector);

System.out.println("\ncollector.getTotalHits() = " + collector.getTotalHits());

Also I am unable to use the ComplexPhraseQueryParser class which is shown in the Lucene documentation. I am doing:

import org.apache.lucene.queryParser.*;

Does anybody have an idea why it is inaccessible or what I am doing wrong? Apologies for the length of the question.


You do not need Lucene to get the score. Take a look at the SimMetrics library; it is exceedingly simple to use. Just add the jar and use it like this:

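// Class and method names as in the Java SimMetrics 1.x jar (the .NET port spells them differently)
import uk.ac.shef.wit.simmetrics.similaritymetrics.Levenshtein;
Levenshtein ld = new Levenshtein();
float sim = ld.getSimilarity(string1, string2);  // 0.0 = no similarity, 1.0 = identical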

Also note that, depending on the type of data (longer strings, amount of whitespace, etc.), you might want to look at other algorithms such as Jaro-Winkler or Smith-Waterman.
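To give a feel for how Jaro-Winkler differs from plain edit distance, here is a minimal hand-rolled sketch. It is not the SimMetrics implementation; the class name `JaroWinkler`, the 0.1 prefix scale, and the 4-character prefix cap are just the commonly quoted defaults and are assumptions on my part.

```java
final class JaroWinkler {

    // Jaro similarity: counts characters that match within a sliding window,
    // then penalises matched characters that appear in a different order.
    static double jaro(String s, String t) {
        if (s.isEmpty() && t.isEmpty()) return 1.0;
        if (s.isEmpty() || t.isEmpty()) return 0.0;

        int window = Math.max(0, Math.max(s.length(), t.length()) / 2 - 1);
        boolean[] sMatched = new boolean[s.length()];
        boolean[] tMatched = new boolean[t.length()];
        int matches = 0;

        for (int i = 0; i < s.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(t.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!tMatched[j] && s.charAt(i) == t.charAt(j)) {
                    sMatched[i] = true;
                    tMatched[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;

        // Walk both strings' matched characters in order and count mismatches.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < s.length(); i++) {
            if (!sMatched[i]) continue;
            while (!tMatched[k]) k++;
            if (s.charAt(i) != t.charAt(k)) outOfOrder++;
            k++;
        }
        int transpositions = outOfOrder / 2;

        double m = matches;
        return (m / s.length() + m / t.length() + (m - transpositions) / m) / 3.0;
    }

    // Winkler's adjustment: boost the Jaro score for a shared prefix (up to 4 chars).
    static double jaroWinkler(String s, String t) {
        double j = jaro(s, t);
        int limit = Math.min(4, Math.min(s.length(), t.length()));
        int prefix = 0;
        while (prefix < limit && s.charAt(prefix) == t.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1.0 - j);
    }
}
```

Because of the prefix boost, this tends to rank names that start the same way higher than Levenshtein would, which may or may not be what you want for company names; it is worth trying both on a sample of your real data.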

You could use the above to collapse fuzzy duplicate strings into one "master" string and then index that.
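If adding a jar is not an option, the duplicate check can be sketched with a hand-written Levenshtein distance normalised to the 0 to 1 range. This is only an illustration: the class name `LevenshteinSimilarity`, the helper `isLikelyDuplicate`, and the lower-casing and trimming step are my own assumptions, and the threshold is something you would have to tune on your own data.

```java
import java.util.Locale;

final class LevenshteinSimilarity {

    // Classic dynamic-programming edit distance, keeping only two rows in memory.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                            prev[j] + 1),      // deletion
                                   prev[j - 1] + cost);        // substitution
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // 1.0 means identical, 0.0 means every position differs.
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) return 1.0;
        return 1.0 - (double) distance(a, b) / maxLen;
    }

    // Hypothetical helper: normalise case and surrounding whitespace, then
    // compare against a caller-chosen threshold.
    static boolean isLikelyDuplicate(String candidate, String existing, double threshold) {
        String x = candidate.trim().toLowerCase(Locale.ROOT);
        String y = existing.trim().toLowerCase(Locale.ROOT);
        return similarity(x, y) >= threshold;
    }

    public static void main(String[] args) {
        // One deleted character out of 17, so the similarity is roughly 0.94.
        System.out.println(similarity("Widget Makers XYZ", "Widget Maker XYZ"));
    }
}
```

With a threshold of, say, 0.9, "Widget Maker XYZ" would be flagged as a likely duplicate of "Widget Makers XYZ" before it is inserted. Comparing a new name against every existing row is linear in the table size, so for a large table you may still want Lucene to narrow down the candidates first.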


You can get the match values with:

TopDocs topDocs = collector.topDocs();
for(ScoreDoc scoreDoc : topDocs.scoreDocs) {
    System.out.println(scoreDoc.score);
}