What analyzer is good for my situation? Hibernate Search case

2023-03-10 06:49 Asked by:

We are running a book search application. It is implemented with Hibernate Search.

The Book entity is defined as follows:

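import javax.persistence.Entity;
import org.hibernate.search.annotations.DocumentId;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;

@Entity
@Indexed
public class Book {
    // Identifier used as the document id in the index
    @DocumentId
    private Integer UID;

    // Both text fields are indexed with the default analyzer
    @Field
    private String title;

    @Field
    private String description;
    ...
}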

If a user searches for a book name, say Microsoft access 2007, books whose title or description contains microsoft, access, or 2007 are returned. That is what we expected. However, some of the books are totally unrelated and match only because of the keyword 2007. I am looking for a way to take the importance of each keyword into account. In this case, 2007 is the least important keyword, but the search currently treats microsoft, access, and 2007 as equally important.

The second use case: is there a good analyzer that can be used for both indexing and querying to support multi-word phrases? As I understand it, the default analyzer of Hibernate Search just tokenizes the search text into single words.

If the search words are microsoft access 2007, results should get the best score if they contain "microsoft access".

Other search examples: "salt lake city", "united states". Results that match only salt, lake, or city are not what we expect; at the very least, they should rank behind results that contain "salt lake city".

Can anyone offer me some clues?

thanks!

Lucene should already discount terms that occur frequently and thus don't discriminate well among documents. If you want to increase that effect, you have a few choices:

Change the similarity function from the default, and use the new function to weight terms differently
Boost low-df (high idf) terms in the query by first looking up the number of documents that contain a given term, and adjusting that term's weight accordingly
Write a classifier that can a priori decide which terms are not going to be as effective (e.g., year numbers), and adjust their weight accordingly
Use something like WordNet or Wikipedia as a source of phrases (e.g., leadership skills) that you index as a single token. This will involve a modified TokenStream as configured by your analyzer.
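
As a rough illustration of the third option above, the sketch below flags year-like terms and lowers their weight using the `term^boost` syntax of the Lucene query parser. The class name `WeightedQueryBuilder`, the 1900-2099 year range, and the boost value 0.2 are all assumptions made for this example, not part of Lucene or Hibernate Search. It also does not escape query-parser special characters in the user input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Hibernate Search or Lucene.
final class WeightedQueryBuilder {

    // Assumed boost for terms judged to be weak discriminators (illustrative value).
    static final double WEAK_TERM_BOOST = 0.2;

    // A deliberately simple "classifier": four-digit numbers from 1900 to 2099
    // are treated as year-like and therefore weak.
    static boolean isWeakTerm(String term) {
        return term.matches("(19|20)\\d{2}");
    }

    // Builds a query-parser string such as "microsoft access 2007^0.2".
    static String build(String userInput) {
        List<String> parts = new ArrayList<>();
        for (String term : userInput.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (term.isEmpty()) {
                continue; // blank input yields an empty query string
            }
            parts.add(isWeakTerm(term) ? term + "^" + WEAK_TERM_BOOST : term);
        }
        return String.join(" ", parts);
    }
}
```

The resulting string could then be handed to a Lucene query parser for the title and description fields; a real classifier would likely need more rules than this single pattern.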

I don't know how to differentiate a good 2007 from a bad one.

One thing you could do is use an analyzer that ignores numbers for description but a regular analyzer for title. That way only numbers in the title will be picked up. In practice it's not a whole analyzer but a simple filter that you can write and add to the analyzer stack.

You can also index description twice, once ignoring numbers and once not ignoring them. You can then play with the boost factor at query time to search both fields but give the one with numbers a low priority.

Another solution is to ignore some number patterns in your custom filter (i.e. year-style numbers, single-digit numbers, etc.): these would be the most common type of noisy numbers that you would want ignored (that's what I would go for first, I think).
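
The decision logic of such a filter might look like the sketch below. It is plain Java with no Lucene dependency; in a real analyzer stack the same test would sit inside a custom Lucene TokenFilter. The class name `NoisyNumberFilter` and the two patterns (years 1900-2099, single digits) are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper showing only the "is this number noise?" decision.
final class NoisyNumberFilter {

    // Assumed definition of a year-style number: 1900 to 2099.
    private static final Pattern YEAR = Pattern.compile("(19|20)\\d{2}");
    // A token that is exactly one digit.
    private static final Pattern SINGLE_DIGIT = Pattern.compile("\\d");

    static boolean isNoisyNumber(String token) {
        return YEAR.matcher(token).matches()
                || SINGLE_DIGIT.matcher(token).matches();
    }

    // Returns the tokens that survive the filter, in their original order.
    static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!isNoisyNumber(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

Other numbers, such as 365 or a five-digit code, pass through untouched, so they stay searchable.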

As for the phrase search, simply use Lucene's PhraseQuery or the friendlier Hibernate Search query DSL:

Query luceneQuery = mythQB
    .phrase()
    .onField("history")
    .matching("Thou shalt not kill")
    .createQuery();

The whole query DSL is described in the Hibernate Search reference documentation.
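
For the ranking asked about in the question (phrase matches first, single-word matches behind them), one possible approach is to combine a boosted phrase with the individual terms in a single Lucene query-parser string. The sketch below only builds that string; the class name `PhraseFirstQuery` and the boost value 4 are assumptions for illustration, and double quotes or other special characters in the user input are not escaped.

```java
import java.util.Locale;

// Hypothetical helper, not part of Hibernate Search or Lucene.
final class PhraseFirstQuery {

    // Assumed boost applied to the exact phrase (illustrative value).
    static final int PHRASE_BOOST = 4;

    // "salt lake city" becomes: "salt lake city"^4 salt lake city
    static String build(String userInput) {
        String terms = userInput.trim()
                .toLowerCase(Locale.ROOT)
                .replaceAll("\\s+", " ");
        if (!terms.contains(" ")) {
            return terms; // a single word has no phrase to boost
        }
        return "\"" + terms + "\"^" + PHRASE_BOOST + " " + terms;
    }
}
```

With OR as the default operator, documents matching only salt, lake, or city are still returned, but documents containing the full phrase should score higher.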
