How to match against subsets of a search string in SOLR/lucene

2023-02-08 11:13 Q&A author:

I've got an unusual situation. Normally when you search a text index you are searching for a small number of keywords against documents with a larger number of terms.

For example, you might search for "quick brown" and expect to match "the quick brown fox jumps over the lazy dog".

I have the situation where I have lots of small phrases in my document store and I wish to match them against a larger query phrase.

For example if I have a query:

"the quick brown fox jumps over the lazy dog"

and the documents

"quick brown"
"fox over"
"lazy dog"

I'd like to find the documents that have a phrase that occurs in the query. In this case "quick brown" and "lazy dog" (but not "fox over" because although the tokens match it's not a phrase in the search string).

Is this sort of query possible with SOLR/lucene?

It sounds like you want to use ShingleFilter in your analysis so that you index word bigrams: add ShingleFilterFactory at both index and query time.

At index time your documents are then indexed as such:

"quick brown" -> quick_brown
"fox over" -> fox_over
"lazy dog" -> lazy_dog

At query time your query becomes:

"the quick brown fox jumps over the lazy dog" -> "the_quick quick_brown brown_fox fox_jumps jumps_over over_the the_lazy lazy_dog"

On its own this is still no good, because by default the query parser will form a phrase query from these tokens. So, in your query analyzer only, add PositionFilterFactory after the ShingleFilterFactory. This "flattens" the positions in the query so that the query parser treats the output as synonyms, which yields a BooleanQuery with these sub-queries (all SHOULD clauses, so it's basically an OR query):

BooleanQuery:

the_quick OR
quick_brown OR
brown_fox OR
...

This should be the most performant way, since the result is really just a BooleanQuery of TermQuery clauses.
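The idea above can be tried outside Solr with a minimal Python sketch. It is not Solr's implementation: the helper names `shingles` and `matching_documents` are made up for illustration, tokenization is a plain whitespace split, and only bigrams are produced (no single-word tokens).

```python
def shingles(text, size=2, sep="_"):
    """Return the set of word n-grams (shingles) of `size` words in `text`.

    Hypothetical helper: lowercases and splits on whitespace, which is far
    simpler than a real Solr analysis chain.
    """
    words = text.lower().split()
    return {sep.join(words[i:i + size]) for i in range(len(words) - size + 1)}


def matching_documents(query, documents):
    """Return documents that share at least one shingle with the query.

    This mimics a boolean OR over the query's shingle terms.
    """
    query_terms = shingles(query)
    return [doc for doc in documents if shingles(doc) & query_terms]


documents = ["quick brown", "fox over", "lazy dog"]
query = "the quick brown fox jumps over the lazy dog"

# "fox over" is dropped: fox_over is not among the query's bigrams.
print(matching_documents(query, documents))  # ['quick brown', 'lazy dog']
```

Note that in Solr, ShingleFilterFactory also emits the single words unless outputUnigrams is set to false, so that setting would probably be needed to get the bigram-only tokens shown above.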

Sounds like you want the DisMax "minimum match" parameter. I wrote a blog article on the concept a little while ago: http://blog.websolr.com/post/1299174416. There's also the Solr wiki page on minimum match.

The "minimum match" concept is applied to all the "optional" terms in your query, that is, the terms that aren't explicitly marked with +/- as "+mandatory" or "-prohibited". By default, the minimum match is 100%, meaning that 100% of the optional terms must be present. In other words, all of your terms are considered mandatory.

This is why your longer query isn't currently matching documents containing shorter fragments of that phrase. The other keywords in the longer search phrase are treated as mandatory.

If you drop the minimum match down to 1, then only one of your optional terms will be considered mandatory. In some ways this is the opposite of the default of 100%. It's like your query of quick brown fox… is turned into quick OR brown OR fox OR … and so on.

If you set your minimum match to 2, then at least two of your optional terms must match, as if your search phrase were broken up into groups of two terms. A search for quick brown fox turns into (quick brown) OR (brown fox) OR (quick fox) … and so on. (Excuse my pseudo-query there, I trust you see the point.)

The minimum match parameter also supports percentages -- say, 20% -- and some even more complex expressions. So there's a fair amount of tweakability.
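As a rough illustration of the semantics described above, here is a small Python sketch. It is a simplification, not DisMax itself: `required_matches` and `matches` are hypothetical helpers, only plain integers and simple percentages such as "20%" are handled, every query word is treated as one optional term, and percentages are assumed to round down with a floor of one required term.

```python
def required_matches(mm, optional_count):
    """Translate an mm value into the number of optional terms that must match.

    Assumption: percentages round down, and at least one term is always
    required. Negative values and conditional expressions are not handled.
    """
    if isinstance(mm, str) and mm.endswith("%"):
        needed = optional_count * int(mm[:-1]) // 100
    else:
        needed = int(mm)
    return max(1, min(needed, optional_count))


def matches(query, document, mm):
    """Return True if enough of the query's terms occur in the document."""
    query_terms = query.lower().split()
    doc_terms = set(document.lower().split())
    hits = sum(1 for term in query_terms if term in doc_terms)
    return hits >= required_matches(mm, len(query_terms))


query = "quick brown fox"
print(matches(query, "quick fox", "100%"))  # False: all three terms are required
print(matches(query, "quick fox", 2))       # True: two of the three terms match
print(matches(query, "lazy fox", 2))        # False: only one term matches
print(matches(query, "lazy fox", 1))        # True: one matching term is enough
```

Because the check only counts matching terms, it says nothing about whether those terms are adjacent in the query.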

Setting the mm parameter alone will not satisfy your needs, since

"the quick brown fox jumps over the lazy dog"

will match all three documents

"quick brown"
"fox over"
"lazy dog"

and as you said:

I'd like to find the documents that have a phrase that occurs in the query. In this case "quick brown" and "lazy dog" (but not "fox over" because although the tokens match it's not a phrase in the search string).
