
Alternatives to Lucene Default Fuzzy Matching Implementation

Lucene's fuzzy matching is based on a basic edit-distance algorithm. Are there other fuzzy matching implementations for Lucene that use different similarity metrics? They should also identify homophones. Please also compare the various fuzzy matching approaches for Lucene.


I don't think Lucene offers any other string matching algorithms; you can, however, add one yourself. Here is a good library that contains most well-known string comparison algorithms.
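As an illustration of what "adding one yourself" could look like for the homophone requirement, here is a minimal sketch of American Soundex, a classic phonetic code under which similar-sounding names get the same four-character key. The class name `SoundexSketch` and the helper `soundsAlike` are made up for this example; this is not Lucene code, and wiring it into an analyzer or query is left out.

```java
// Hypothetical helper, not part of Lucene: a plain American Soundex encoder.
final class SoundexSketch {
    // Soundex digit for each letter A..Z; '0' means "not coded" (vowels, Y, H, W).
    private static final String CODES = "01230120022455012623010202";

    // Returns the 4-character Soundex key, or "" if the word has no ASCII letters.
    static String encode(String word) {
        StringBuilder out = new StringBuilder();
        char prev = '0';
        for (int i = 0; i < word.length(); i++) {
            char c = Character.toUpperCase(word.charAt(i));
            if (c < 'A' || c > 'Z') {
                continue; // ignore anything that is not an ASCII letter
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c); // the first letter is kept as-is
                prev = code;
                continue;
            }
            if (c == 'H' || c == 'W') {
                continue; // H and W do not separate letters with the same code
            }
            if (code != '0' && code != prev) {
                out.append(code);
            }
            prev = code; // a vowel resets prev, so a repeated code after it is kept
            if (out.length() == 4) {
                break;
            }
        }
        if (out.length() == 0) {
            return "";
        }
        while (out.length() < 4) {
            out.append('0'); // pad short keys with zeros
        }
        return out.toString();
    }

    // Two words "sound alike" in this sketch when their Soundex keys are equal.
    static boolean soundsAlike(String a, String b) {
        String keyA = encode(a);
        return !keyA.isEmpty() && keyA.equals(encode(b));
    }
}
```

With this sketch, `SoundexSketch.soundsAlike("Smith", "Smyth")` is true because both encode to `S530`. Soundex is tuned for English surnames, so it is only one of several possible phonetic keys and may be too coarse for other kinds of text.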


Something I've been doing is pretty simple and works in most scenarios. In my case, I have 6.7 million event names in a dirty table that contains slightly altered or drilled-down versions of event names, and the table I'm fuzzy matching against has all the clean event names.

``-- x = start position, y = length of the substring to compare
select distinct a.Column, b.Column
from tableA a
inner join tableB b
on SUBSTRING(b.Column, x, y) = SUBSTRING(a.Column, x, y)
order by a.Column asc;``

My problem was that a plain fuzzy match with no substring returned only about 11 results, because the naming conventions of the two tables differ so much. This solution matches all of the drill-down-style events with their broader counterparts in the clean table.
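For anyone who wants the same idea outside the database, the substring join can be sketched in Java roughly as follows. The names `SubstringJoin`, `key`, and `match` are invented for this example. Note that `start` here is zero-based (SQL `SUBSTRING` is one-based) and that the comparison is case-sensitive, which may differ from your database collation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the substring join above, done in memory.
final class SubstringJoin {
    // The matching key: up to `length` characters from zero-based `start`,
    // clipped to the end of the string (short strings yield a shorter key).
    static String key(String s, int start, int length) {
        int from = Math.min(start, s.length());
        int to = Math.min(from + length, s.length());
        return s.substring(from, to);
    }

    // Maps each dirty name to the clean names that share its substring key.
    // Dirty names without any match are left out, like an inner join.
    static Map<String, List<String>> match(List<String> dirty, List<String> clean,
                                           int start, int length) {
        // Index the clean names by key so each dirty name needs one lookup.
        Map<String, List<String>> cleanByKey = new HashMap<>();
        for (String c : clean) {
            cleanByKey.computeIfAbsent(key(c, start, length), k -> new ArrayList<>()).add(c);
        }
        // TreeMap keeps the result sorted by dirty name, like "order by ... asc".
        Map<String, List<String>> result = new TreeMap<>();
        for (String d : dirty) {
            List<String> hits = cleanByKey.get(key(d, start, length));
            if (hits != null) {
                result.put(d, hits);
            }
        }
        return result;
    }
}
```

Indexing the clean names by key first avoids comparing every dirty name against every clean name, which matters with millions of rows.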
