Lucene wildcard matching fails on chemical notations(?)

2023-01-17 20:29 问答作者：

Using Hibernate Search Annotations (mostly just @Field(index = Index.TOKENIZED)) I've indexed a number of fields related to a persisted class of mine called Compound. I've setup text search over all the indexed fields, using the MultiFieldQueryParser, which has so far worked fine.

Among the fields indexed and searchable is a field called compoundName, with sample values:

3-Hydroxyflavone
6,4'-Dihydroxyflavone

When I search for either of these values in full the related Compound instances are returned. However problems occur when I use the partial name and introduce wildcards:

searching for 3-Hydroxyflav* still gives the correct hit, but
searching for 6,4'-Dihydroxyflav* fails to find anything.

Now as I'm quite new to Lucene / Hibernate-search, I'm not quit开发者_运维百科e sure where to look at this point.. I think it might have something to do with the ' present in the second query, but I don't know how to proceed.. Should I look into Tokenizers / Analyzers / QueryParsers or something else entirely?

Or can anyone tell me how I can get the second wildcard search to match, preferably without breaking the MultiField-search behavior?

I'm using Hibernate-Search 3.1.0.GA & Lucene-core 2.9.3.

Some relevant code bits to illustrate my current approach:

Relevant parts of the indexed Compound class:

@Entity
@Indexed
@Data
@EqualsAndHashCode(callSuper = false, of = { "inchikey" })
public class Compound extends DomainObject {
    @NaturalId
    @NotEmpty
    @Length(max = 30)
    @Field(index = Index.TOKENIZED)
    private String                  inchikey;

    @ManyToOne
    @IndexedEmbedded
    private ChemicalClass           chemicalClass;

    @Field(index = Index.TOKENIZED)
    private String                  commonName;
...
}

How I currently search over the indexed fields:

String[] searchfields = Compound.getSearchfields();
MultiFieldQueryParser parser = 
    new MultiFieldQueryParser(Version.LUCENE_29, searchfields, new StandardAnalyzer(Version.LUCENE_29));
FullTextSession fullTextSession = Search.getFullTextSession(getSession());
FullTextQuery fullTextQuery = 
    fullTextSession.createFullTextQuery(parser.parse("searchterms"), Compound.class);
List<Compound> hits = fullTextQuery.list();

Use WhitespaceAnalyzer instead of StandardAnalyzer. It will just split at whitespace, and not at commas, hyphens etc. (It will not lowercase them though, so you will need to build your own chain of whitespace + lowercase, assuming you want your search to be case-insensitive). If you need to do things differently for different fields, you can use a PerFieldAnalyzer.

You can't just set it to un-tokenized, because that will interpret your entire body of text as one token.

I think your problem is a combination of analyzer and query language problems. It is hard to say what exactly causes the problem. To find this out I recommend you inspect you index using the Lucene index tool Luke.

Since in your Hibernate Search configuration you are not using a custom analyzer the default - StandardAnalyzer - is used. This would be consistent with the fact that you use StandardAnalyzer in the constructor of MultiFieldQueryParser (always use the same analyzer for indexing and searching!). What I am not so sure of is how "6,4'-Dihydroxyflavone" gets tokenized by StandardAnalyzer. That the first thing you have to find out. For example the javadoc says:

Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.

It might be that you need to write your own analyzer which tokenizes your chemical names the way you need it for your use cases.

Next the query parser. Make sure you understand the query syntax - Lucene query syntax. Some characters have special meaning, for example a '-'. It could be that your query is parsed the wrong way.

Either way, first step os to find out how your chemical names get tokenized. Hope that helps.

I wrote my own analyzer:

import java.util.Set;
import java.util.regex.Pattern;

import org.apache.lucene.index.memory.PatternAnalyzer;
import org.apache.lucene.util.Version;

public class ChemicalNameAnalyzer extends PatternAnalyzer {

    private static Version version = Version.LUCENE_29;
    private static Pattern pattern = compilePattern();
    private static boolean toLowerCase = true;
    private static Set stopWords = null;

    public ChemicalNameAnalyzer(){
        super(version, pattern, toLowerCase, stopWords);
    }

    public static Pattern compilePattern() {
        StringBuilder sb =  new StringBuilder();
        sb.append("(-{0,1}\\(-{0,1})");//Matches an optional dash followed by an opening round bracket followed by an optional dash  
        sb.append("|");//"OR" (regex alternation)
        sb.append("(-{0,1}\\)-{0,1})"); 
        sb.append("|");//"OR" (regex alternation)
        sb.append("((?<=([a-zA-Z]{2,}))-(?=([^a-zA-Z])))");//Matches a dash ("-") preceded by two or more letters and succeeded by a non-letter
        return Pattern.compile(sb.toString());
    }
}

继续阅读：hibernate-search lucene matching wildcard

Lucene wildcard matching fails on chemical notations(?)

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？