Better search results using Lucene

2022-12-21 07:05

I've got a database with a lot of books in it, with fields like title, description, authors, etc.

I'm indexing title with a boost of 100f and description with a boost of 0.1f, both fields tokenized and stemmed.

I'm searching with a single input field that searches in all available fields, using a BooleanQuery joined with BooleanClause.Occur.SHOULD and containing a WildcardQuery for each field. I also remove all "stopwords" from the query to start with.

The problem I'm having shows up when I search for the following string (without the quotes):

"de wetenschap van het leven". After removing the stop words I get "wetenschap leven".

The title query becomes "*wetenschap* *leven*" and the description query is the same; both are wrapped in a BooleanQuery joined with BooleanClause.Occur.SHOULD.

The following books are in the database:

Wetenschappelijk denken. Een inleiding voor de medische en biomedische wetenschappen en voor de andere levenswetenschap.
De wetenschap van de aarde. Over een levende planeet
Atlas van de menselijke levensloop
De wetenschap van het leven. Over eenheid in biologische diversiteit

The book I'm looking for is returned within the first 4 results, which is good, but in this implementation we cut off at 3 and the rest is hidden behind a "read more" link. Just upping the cutoff is not an option.

For me, the "De wetenschap van het leven. Over eenheid in biologische diversiteit" book matches the query "more" than the others (or so I feel), but I'm unable to find the correct index/search combination to make this work. Does anyone have an idea?

A few suggestions:

Do not remove stop words - they seem to be an important part of your search query.
Do not use wildcards - search just for the words you need. I believe the best will be to use a PhraseQuery - e.g. "de wetenschap van het leven".
Do not search past sentence end. This is tougher - you may need to index each sentence separately.
Read Debugging Relevance Issues in Search - you will probably get other ideas there.
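
To illustrate the first two suggestions, here is a minimal plain-Java sketch (not Lucene code) that models a phrase match as consecutive tokens in the title. The class and method names are hypothetical, and real Lucene analysis (stemming, position handling for removed stop words) is ignored. Under those assumptions, the full phrase with its stop words matches only the intended title, while the stripped-down "wetenschap leven" no longer matches as a phrase.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; it does not use the Lucene API.
class PhraseMatchSketch {
    // Lowercase and split on anything that is not a letter.
    // Assumption: no stemming and no stop-word removal.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // True when the phrase tokens occur consecutively and in order in the title.
    static boolean containsPhrase(String title, String phrase) {
        return Collections.indexOfSubList(tokens(title), tokens(phrase)) >= 0;
    }
}
```

Calling `PhraseMatchSketch.containsPhrase(title, "de wetenschap van het leven")` on the four titles from the question is true only for the last one. In a real index the stop words would also have to be kept at index time for a PhraseQuery to behave this way.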

I think a SpanQuery (specifically a SpanNearQuery) might be what you need.

Given a document "a quick brown fox jumps over a lazy dog"

it can find a match for "brown fox" and "lazy dog". You can adjust the slop setting to control the allowed distance between the two query phrases or terms. In short, it gives you a lot of tools to tweak your search.
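
The idea of slop can be sketched in plain Java as follows. This is only a simplified model of ordered proximity matching between two single terms, not the SpanNearQuery implementation; the class and method names are hypothetical, and slop is assumed to mean the number of other tokens allowed between the two terms.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; it does not use the Lucene API.
class NearMatchSketch {
    // True when `first` occurs before `second` with at most `slop`
    // other tokens between them (whitespace tokenization, no stemming).
    static boolean nearInOrder(String text, String first, String second, int slop) {
        String[] tokens = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(first)) {
                continue;
            }
            for (int j = i + 1; j < tokens.length; j++) {
                int gap = j - i - 1; // tokens strictly between the two terms
                if (tokens[j].equals(second) && gap <= slop) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

With the example document, "brown" and "fox" are adjacent, so they match with a slop of 0, while "brown" and "dog" have five tokens between them and only match once the slop is at least 5.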

Also, I'm unfamiliar with the Dutch(?) language, but you might want to stem your queries if possible and avoid leading wildcards: they are quite expensive and tend to lower precision.

I improved the relevance by adding a phrase search for the entire string as well. This way we still get the "search in everything" behavior, and the titles are a lot more relevant than the rest.
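
The effect of the extra phrase clause can be sketched with a toy additive score in plain Java. This is not Lucene's scoring formula; the class, the method names, and the bonus value are hypothetical, and stemming and field boosts are left out. Each wildcard term that occurs inside some title token adds 1, and an exact phrase match adds a bonus, much like one more SHOULD clause would.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical toy scorer for illustration only; it does not use the Lucene API.
class PhraseBoostSketch {
    // Lowercase and split on anything that is not a letter.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Number of query terms contained in at least one title token (like *term*).
    static int wildcardHits(List<String> titleTokens, List<String> terms) {
        int hits = 0;
        for (String term : terms) {
            for (String token : titleTokens) {
                if (token.contains(term)) {
                    hits++;
                    break;
                }
            }
        }
        return hits;
    }

    // Wildcard hits plus a bonus when the whole phrase occurs consecutively.
    static double score(String title, List<String> terms, String phrase, double phraseBonus) {
        List<String> titleTokens = tokens(title);
        double s = wildcardHits(titleTokens, terms);
        if (Collections.indexOfSubList(titleTokens, tokens(phrase)) >= 0) {
            s += phraseBonus;
        }
        return s;
    }

    // Titles sorted by descending score; ties keep their input order.
    static List<String> rank(List<String> titles, List<String> terms, String phrase, double phraseBonus) {
        List<String> sorted = new ArrayList<>(titles);
        sorted.sort(Comparator.comparingDouble(
                (String t) -> score(t, terms, phrase, phraseBonus)).reversed());
        return sorted;
    }
}
```

With a bonus of 0, three of the four titles from the question tie on the terms "wetenschap" and "leven", so their order is arbitrary. With any positive bonus for the phrase "de wetenschap van het leven", the intended title ranks first.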
