Lucene wildcard matching fails on chemical notations(?)
Using Hibernate Search Annotations (mostly just @Field(index = Index.TOKENIZED)
) I've indexed a number of fields related to a persisted class of mine called Compound. I've setup text search over all the indexed fields, using the MultiFieldQueryParser
, which has so far worked fine.
Among the fields indexed and searchable is a field called compoundName, with sample values:
3-Hydroxyflavone
6,4'-Dihydroxyflavone
When I search for either of these values in full the related Compound instances are returned. However problems occur when I use the partial name and introduce wildcards:
- searching for
3-Hydroxyflav*
still gives the correct hit, but - searching for
6,4'-Dihydroxyflav*
fails to find anything.
Now as I'm quite new to Lucene / Hibernate-search, I'm not quit开发者_运维百科e sure where to look at this point.. I think it might have something to do with the '
present in the second query, but I don't know how to proceed.. Should I look into Tokenizers / Analyzers / QueryParsers or something else entirely?
Or can anyone tell me how I can get the second wildcard search to match, preferably without breaking the MultiField-search behavior?
I'm using Hibernate-Search 3.1.0.GA & Lucene-core 2.9.3.
Some relevant code bits to illustrate my current approach:
Relevant parts of the indexed Compound class:
@Entity
@Indexed
@Data
@EqualsAndHashCode(callSuper = false, of = { "inchikey" })
public class Compound extends DomainObject {
@NaturalId
@NotEmpty
@Length(max = 30)
@Field(index = Index.TOKENIZED)
private String inchikey;
@ManyToOne
@IndexedEmbedded
private ChemicalClass chemicalClass;
@Field(index = Index.TOKENIZED)
private String commonName;
...
}
How I currently search over the indexed fields:
String[] searchfields = Compound.getSearchfields();
MultiFieldQueryParser parser =
new MultiFieldQueryParser(Version.LUCENE_29, searchfields, new StandardAnalyzer(Version.LUCENE_29));
FullTextSession fullTextSession = Search.getFullTextSession(getSession());
FullTextQuery fullTextQuery =
fullTextSession.createFullTextQuery(parser.parse("searchterms"), Compound.class);
List<Compound> hits = fullTextQuery.list();
Use WhitespaceAnalyzer instead of StandardAnalyzer. It will just split at whitespace, and not at commas, hyphens etc. (It will not lowercase them though, so you will need to build your own chain of whitespace + lowercase, assuming you want your search to be case-insensitive). If you need to do things differently for different fields, you can use a PerFieldAnalyzer.
You can't just set it to un-tokenized, because that will interpret your entire body of text as one token.
I think your problem is a combination of analyzer and query language problems. It is hard to say what exactly causes the problem. To find this out I recommend you inspect you index using the Lucene index tool Luke.
Since in your Hibernate Search configuration you are not using a custom analyzer the default - StandardAnalyzer - is used. This would be consistent with the fact that you use StandardAnalyzer in the constructor of MultiFieldQueryParser (always use the same analyzer for indexing and searching!). What I am not so sure of is how "6,4'-Dihydroxyflavone" gets tokenized by StandardAnalyzer. That the first thing you have to find out. For example the javadoc says:
Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
It might be that you need to write your own analyzer which tokenizes your chemical names the way you need it for your use cases.
Next the query parser. Make sure you understand the query syntax - Lucene query syntax. Some characters have special meaning, for example a '-'. It could be that your query is parsed the wrong way.
Either way, first step os to find out how your chemical names get tokenized. Hope that helps.
I wrote my own analyzer:
import java.util.Set;
import java.util.regex.Pattern;
import org.apache.lucene.index.memory.PatternAnalyzer;
import org.apache.lucene.util.Version;
public class ChemicalNameAnalyzer extends PatternAnalyzer {
private static Version version = Version.LUCENE_29;
private static Pattern pattern = compilePattern();
private static boolean toLowerCase = true;
private static Set stopWords = null;
public ChemicalNameAnalyzer(){
super(version, pattern, toLowerCase, stopWords);
}
public static Pattern compilePattern() {
StringBuilder sb = new StringBuilder();
sb.append("(-{0,1}\\(-{0,1})");//Matches an optional dash followed by an opening round bracket followed by an optional dash
sb.append("|");//"OR" (regex alternation)
sb.append("(-{0,1}\\)-{0,1})");
sb.append("|");//"OR" (regex alternation)
sb.append("((?<=([a-zA-Z]{2,}))-(?=([^a-zA-Z])))");//Matches a dash ("-") preceded by two or more letters and succeeded by a non-letter
return Pattern.compile(sb.toString());
}
}
精彩评论