Which analyzer is good for my situation? (Hibernate Search case)
We are running a search app for books. It is implemented with Hibernate Search.
The Book entity is defined as follows:
@Entity
@Indexed
public class Book {
    @DocumentId
    private Integer UID;
    @Field
    private String title;
    @Field
    private String description;
    ...}
First use case: if a user searches for a book name, say Microsoft access 2007, books whose title or description contains microsoft, access or 2007 are returned. That is what we expected. However, some of the books are totally unrelated and match only because of the keyword 2007. I am looking for a solution that understands the importance of each keyword. In this case, 2007 is less important to the search, but at the moment the search treats microsoft, access and 2007 as equally important.
Second use case: is there a good analyzer that can be used for both indexing and querying to support multi-word phrases? I thought the default analyzer of Hibernate Search just tokenizes the search text into single words.
If the search text is microsoft access 2007, results should get the best score if they contain "microsoft access".
Other search examples: "salt lake city", "united states". Results that match only salt, lake or city are not what we want, or at least they should rank behind results that contain "salt lake city".
Can anyone offer me some clues?
thanks!
Lucene should already discount terms that occur frequently and thus don't discriminate well among documents. If you want to increase that effect, you have a few choices:
- Change the similarity function from the default, and use the new function to weight terms differently
- Boost low-df (high idf) terms in the query by first looking up the number of documents that contain a given term, and adjusting that term's weight accordingly
- Write a classifier that can a priori decide which terms are not going to be as effective (e.g., year numbers), and adjust their weight accordingly
- Use something like WordNet or Wikipedia as a source of phrases (e.g., leadership skills) that you index as a single token. This will involve a modified TokenStream as configured by your analyzer.
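To make the second option concrete, here is a minimal sketch in plain Java. It assumes you have already looked up the document frequency of each query term (with Lucene this could come from IndexReader.docFreq(Term)), and it uses an idf formula in the spirit of classic TF-IDF scoring. The class name, method names and the idea of emitting term^boost query-parser syntax are illustrative choices, not something Hibernate Search does for you.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: turns document frequencies into per-term query boosts.
class IdfBoostSketch {

    // Rare terms (low docFreq) get a larger weight than common ones.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Builds a query string in Lucene query-parser syntax, e.g. "term^1.23 other^4.56".
    // Terms missing from docFreqs are treated as occurring in zero documents.
    static String boostedQuery(String[] terms, Map<String, Integer> docFreqs, int numDocs) {
        StringBuilder sb = new StringBuilder();
        for (String term : terms) {
            int df = docFreqs.getOrDefault(term, 0);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(term)
              .append('^')
              .append(String.format(Locale.ROOT, "%.2f", idf(df, numDocs)));
        }
        return sb.toString();
    }
}
```

If 2007 appears in far more documents than access, it ends up with a visibly smaller boost. Since Lucene already applies idf on its own, this only exaggerates the effect, so the exact formula is something you would need to tune.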
I don't know how to differentiate a good 2007 from a bad one.
One thing you could do is use an analyzer that ignores numbers for description but a regular analyzer for title. That way only numbers in the title will be picked up. In practice it's not a whole analyzer but a simple filter that you can write and add to the analyzer stack.
You can also index description twice, once ignoring numbers and once not ignoring them. You can then play with the boost factor at query time to search both fields but give the one with numbers a low priority.
Another solution is to ignore some number patterns in your custom filter (i.e. year-style numbers, single-digit numbers, etc.): these would be the most common type of noisy numbers that you would want ignored (that's what I would go for first, I think).
As for the phrase search, simply use Lucene's PhraseQuery or the friendlier Hibernate Search DSL:
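The filtering rule itself could look roughly like the sketch below. It is plain Java working on a list of tokens rather than a real Lucene TokenFilter, and the definition of "noisy" (a single digit, or a four-digit number from 1900 to 2099) is only my assumption; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of the rule a custom token filter could apply.
class NoisyNumberFilterSketch {

    // Assumed notion of "noisy": one digit, or a year-style number 1900-2099.
    private static final Pattern NOISY = Pattern.compile("\\d|(19|20)\\d{2}");

    // Returns the tokens that survive; other numbers and all words are kept.
    static List<String> dropNoisyNumbers(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!NOISY.matcher(token).matches()) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

With this rule, microsoft access 2007 is reduced to microsoft and access, while a token such as 360 is kept. In a real analyzer the same check would go into the filter's per-token step.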
Query luceneQuery = mythQB
.phrase()
.onField("history")
.matching("Thou shalt not kill")
.createQuery();
The whole doc for the query DSL is here