elasticsearch fuzzy matching max_expansions & min_similarity

2023-03-30 00:29

I'm using fuzzy matching in my project, mainly to find misspellings and alternative spellings of the same names. I need to understand exactly how Elasticsearch's fuzzy matching works and how it uses the two parameters mentioned in the title.

As I understand it, min_similarity is the percentage by which the queried string must match the string in the database. I couldn't find an exact description of how this value is calculated.

As I understand it, max_expansions is the Levenshtein distance within which the search should be executed. If it really were the Levenshtein distance, it would be the ideal solution for me. However, it doesn't behave that way. For example, I have the word "Samvel":

queryStr      max_expansions           matches?
samvel        0                        Error: should not be 0 (but a Levenshtein distance can be 0!)
samvel        1                        Yes
samvvel       1                        Yes
samvvell      1                        Yes (but it shouldn't have)
samvelll      1                        Yes (but it shouldn't have)
saamvelll     1                        No (but for some weird reason it matches Samvelian)
saamvelll     anything greater than 1  No

The documentation says something I don't really understand:

Add max_expansions to the fuzzy query allowing to control the maximum number 
of terms to match. Default to unbounded (or bounded by the max clause count in 
boolean query).

So can anyone please explain exactly how these parameters affect the search results?

The min_similarity is a value between zero and one. From the Lucene docs:

For example, for a minimumSimilarity of 0.5 a term of the same length 
as the query term is considered similar to the query term if the edit 
distance between both terms is less than length(term)*0.5

The 'edit distance' that is referred to is the Levenshtein distance.
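
To make the rule above concrete, here is a minimal Python sketch of the Levenshtein distance together with a similarity score derived from it. The formula `1 - distance / length of the shorter string` is an assumption chosen to be consistent with the quoted Lucene description; it is not taken from the Elasticsearch source, and the function names `levenshtein` and `similarity` are made up for this illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep a matching character
            ))
        prev = curr
    return prev[-1]


def similarity(query: str, term: str) -> float:
    """Assumed formula: 1 - edit distance / length of the shorter string."""
    shortest = min(len(query), len(term))
    if shortest == 0:
        return 1.0 if query == term else 0.0
    return 1.0 - levenshtein(query, term) / shortest


if __name__ == "__main__":
    # Compare the indexed term with the query strings from the question.
    for candidate in ["samvel", "samvvel", "samvvell", "saamvelll"]:
        print(candidate,
              levenshtein("samvel", candidate),
              round(similarity("samvel", candidate), 3))
```

Under this assumed formula, a min_similarity threshold corresponds to a maximum edit distance that depends on the term length, which is why it is given as a fraction rather than as an absolute number of edits.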

The way this query works internally is:

1. It finds all terms in the index that could match the search term, taking min_similarity into account.
2. It then searches for all of those terms.

You can imagine how heavy this query could be!

To combat this, you can set the max_expansions parameter to specify the maximum number of matching terms that should be considered.
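
The two-step process and the effect of the cap can be imitated with a small in-memory Python sketch. This is only a toy model, not how Lucene is implemented: the helpers `edit_distance`, `expand` and `search` are hypothetical, the similarity formula is the assumed `1 - distance / shorter length`, and keeping the most similar terms when the cap applies is also an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def expand(query, index_terms, min_similarity, max_expansions=None):
    """Step 1: collect the index terms that are similar enough to the query."""
    scored = []
    for term in index_terms:
        shortest = min(len(query), len(term))
        sim = 1.0 - edit_distance(query, term) / shortest if shortest else 0.0
        if sim >= min_similarity:
            scored.append((sim, term))
    # Assumption: when the cap applies, the most similar terms are kept.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    terms = [term for _, term in scored]
    return terms if max_expansions is None else terms[:max_expansions]


def search(query, docs, min_similarity, max_expansions=None):
    """Step 2: return the documents that contain any of the expanded terms."""
    index_terms = {word for doc in docs for word in doc.lower().split()}
    expanded = set(expand(query, index_terms, min_similarity, max_expansions))
    return [doc for doc in docs if expanded & set(doc.lower().split())]


if __name__ == "__main__":
    docs = ["Samvel", "Samvelian", "Samuel"]
    print(search("samvel", docs, min_similarity=0.5))                    # no cap
    print(search("samvel", docs, min_similarity=0.5, max_expansions=1))  # capped
```

In this model max_expansions only limits how many candidate terms survive step 1; it says nothing about how many edits are allowed, which is controlled by min_similarity.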
