Lucene.Net - How to treat a space-separated phrase as a single token?

2023-02-04 13:59

I've implemented a search facility using Lucene.Net. The index includes UK academic qualifications, including "A Level".

I'd like users to be able to search using the phrase "A Level", but with the StandardAnalyzer the "A" is stripped out as a stop word, so only "Level" is indexed and searched.

What's my best option to work around this? I'm guessing I need to somehow tokenise "A Level" to "A-Level" or similar by creating a custom analyser.

Is this the best approach?

Edits:

Note that I don't want the whole search to be a phrase query, i.e. in my search box I want the user to be able to enter <"A Level" AND English Maths Physics> and have this return any documents with "A Level" and any of English, Maths or Physics. Question updated to reflect this.

I'd specifically like to keep 'A' as a stop word in all cases apart from 'A Level'.

The phrase 'A Level' is not in its own specific field; it's in a free-text field that may include the phrase.

Use PhraseQuery - it can be combined with any other query inside a BooleanQuery.

EDITED

You don't need to search for the entire phrase. For your sample it looks like the following (sorry, it is pseudo-code, since I can't test it right now):

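 // Untested sketch. "MyField" is a placeholder field name; terms are lower-cased
 // because StandardAnalyzer lower-cases tokens at index time.
 BooleanQuery rootQuery = new BooleanQuery();
 PhraseQuery q1 = new PhraseQuery();
 q1.Add(new Term("MyField", "a"));      // matches only if "a" was not dropped as a stop word when indexing
 q1.Add(new Term("MyField", "level"));
 TermQuery q2 = new TermQuery(new Term("MyField", "english"));
 TermQuery q3 = new TermQuery(new Term("MyField", "maths"));
 TermQuery q4 = new TermQuery(new Term("MyField", "physics"));
 rootQuery.Add(q1, BooleanClause.Occur.SHOULD); // or MUST - depends on you
 rootQuery.Add(q2, BooleanClause.Occur.SHOULD);
 rootQuery.Add(q3, BooleanClause.Occur.SHOULD);
 rootQuery.Add(q4, BooleanClause.Occur.SHOULD);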

I do not think this is currently doable with Lucene. I have a half-finished plug-in which does this; you can see it here. It doesn't set the position and offset attributes, which means that phrase searching won't work correctly, but hopefully it should give you a head start.
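
To make the idea of merging "A Level" into one token a bit more concrete, here is a minimal, library-free sketch in plain Java. It is not Lucene's TokenFilter API and not the plug-in mentioned above; the class name, the Token type, the tiny stop-word set and the merged form "a_level" are all assumptions made for illustration. It only shows the logic such a filter would need: merge before stop-word removal, and keep each token's original position so that gaps left by removed words are still visible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Library-free sketch: NOT Lucene's TokenFilter API, only the merging logic.
class PhraseMergeSketch {
    // Hypothetical token type: the text plus its position in the original word sequence.
    static final class Token {
        final String text;
        final int position;
        Token(String text, int position) { this.text = text; this.position = position; }
        @Override public String toString() { return text + "@" + position; }
    }

    // Small illustrative stop-word set (an assumption, not Lucene's default list).
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "the"));

    // Splits on whitespace, lower-cases, merges "a level" into "a_level",
    // then drops any remaining stop words.
    static List<Token> analyze(String text) {
        List<Token> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] words = trimmed.toLowerCase(Locale.ROOT).split("\\s+");
        for (int i = 0; i < words.length; i++) {
            boolean startsPhrase = words[i].equals("a")
                    && i + 1 < words.length && words[i + 1].equals("level");
            if (startsPhrase) {
                out.add(new Token("a_level", i)); // merged token takes the position of "a"
                i++;                              // skip "level", it is part of the merged token
            } else if (!STOP_WORDS.contains(words[i])) {
                out.add(new Token(words[i], i));  // original position kept, so gaps stay visible
            }
        }
        return out;
    }
}
```

With this sketch, analyze("Took a course after A Level Physics") gives [took@0, course@2, after@3, a_level@4, physics@6]: the standalone "a" is still dropped, while "A Level" survives as one token. The same merging would have to be applied to the query text as well, otherwise a search for "A Level" would never produce the merged token.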

How did you index the content - which analyzer did you use? If you are using StandardAnalyzer then you can specify the stop words in the constructor (you can pass an empty list):

Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29, new Hashtable());

So index the content with the analyzer above. After that you can query the content using the QueryParser (be sure to use the same analyzer), or you can construct the query manually:

        // Terms are lower-cased because StandardAnalyzer lower-cases tokens at index time;
        // a manually built Term is not analyzed, so it must match the indexed form exactly.

        // Phrase query
        PhraseQuery phraseQuery = new PhraseQuery();
        phraseQuery.Add(new Term("MyField", "a"));
        phraseQuery.Add(new Term("MyField", "level"));

        // Or query
        BooleanQuery orQuery = new BooleanQuery();
        orQuery.Add(new BooleanClause(new TermQuery(new Term("MyField", "english")), BooleanClause.Occur.SHOULD));
        orQuery.Add(new BooleanClause(new TermQuery(new Term("MyField", "maths")), BooleanClause.Occur.SHOULD));
        orQuery.Add(new BooleanClause(new TermQuery(new Term("MyField", "physics")), BooleanClause.Occur.SHOULD));

        // Main query
        BooleanQuery query = new BooleanQuery();
        query.Add(phraseQuery, BooleanClause.Occur.MUST);
        query.Add(orQuery, BooleanClause.Occur.MUST);
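
To see why the empty stop-word list matters for the phrase part, here is a small library-free model in plain Java of what the query above asks for. It does not use Lucene at all; the class and method names are made up for illustration, and the tokenizer is a crude stand-in for StandardAnalyzer (lower-case, split on non-alphanumerics, optionally drop stop words).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Library-free model of "phrase MUST + (english OR maths OR physics) MUST".
class StopWordPhraseSketch {
    // Crude stand-in for an analyzer: lower-case, split, drop the given stop words.
    static List<String> tokenize(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!w.isEmpty() && !stopWords.contains(w)) {
                tokens.add(w);
            }
        }
        return tokens;
    }

    // True if the phrase terms occur next to each other, in order.
    static boolean containsPhrase(List<String> tokens, String... phrase) {
        return Collections.indexOfSubList(tokens, Arrays.asList(phrase)) >= 0;
    }

    // True if at least one of the terms occurs anywhere.
    static boolean containsAny(List<String> tokens, String... terms) {
        for (String t : terms) {
            if (tokens.contains(t)) {
                return true;
            }
        }
        return false;
    }

    // Both parts are required, mirroring the two MUST clauses of the main query.
    static boolean matches(List<String> tokens) {
        return containsPhrase(tokens, "a", "level")
                && containsAny(tokens, "english", "maths", "physics");
    }
}
```

In this model, "Studied A Level Maths" matches when tokenized with an empty stop-word set, but not when "a" is treated as a stop word, because the term "a" is then simply not there for the phrase to find. "Reached a high level in Maths" does not match either way, since "a" and "level" are not adjacent.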

Bye

The KeywordAnalyzer does not tokenize strings, unlike the StandardAnalyzer. I'm assuming there is a .NET implementation of this - possibly this one?

I'll often do something like this (beware, Java follows):

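private ReusableAnalyzer getReusableAnalyzer(String fieldName, Reader reader) {
    // Decide per field whether the whole value is one token or is split normally.
    boolean phrase = treatAsPhrase(fieldName);
    ReusableAnalyzer ra = new ReusableAnalyzer();
    TokenStream result = phrase ? new KeywordTokenizer(reader) : new StandardTokenizer(version, reader);
    // ... (the rest of the method is not shown in the original post)
}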

whereby I use the field name to determine whether to treat the text as a "phrase" or not.
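
As a rough illustration of that per-field decision, here is a self-contained plain-Java sketch without any Lucene classes. The field name "qualification", the PHRASE_FIELDS set and the body of treatAsPhrase are assumptions for the example only; the fragment above does not show how treatAsPhrase is actually implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Library-free sketch of choosing a tokenizing strategy by field name.
class PerFieldTokenizingSketch {
    // Hypothetical set of fields whose whole value should stay one token.
    static final Set<String> PHRASE_FIELDS = new HashSet<>(Arrays.asList("qualification"));

    static boolean treatAsPhrase(String fieldName) {
        return PHRASE_FIELDS.contains(fieldName);
    }

    static List<String> tokenize(String fieldName, String text) {
        if (treatAsPhrase(fieldName)) {
            // Keyword-style: the entire value is a single, unchanged token.
            return Collections.singletonList(text);
        }
        // Otherwise: lower-case and split on non-alphanumerics.
        List<String> tokens = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!w.isEmpty()) {
                tokens.add(w);
            }
        }
        return tokens;
    }
}
```

Here tokenize("qualification", "A Level") yields the single token "A Level", while tokenize("description", "A Level") yields "a" and "level". Note that this only helps when the phrase lives in a field of its own; a keyword-style token matches the whole field value, so it would not find "A Level" inside a longer free-text field like the one described in the question.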

This is doable in Lucene with a little more customization.

1) Create a separate field in which stop words are preserved. You'll need to create your own analyzer which inherits from StandardAnalyzer but specifies no stop words in the base constructor.

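public class PreserveStopWordsAnalyzer : StandardAnalyzer
{
    // An empty Hashtable means no stop words are removed.
    // Version is fully qualified to avoid a clash with System.Version.
    public PreserveStopWordsAnalyzer() : base(Lucene.Net.Util.Version.LUCENE_29, new Hashtable())
    {}
}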

2) Search quoted terms against the 'stop word' field. For example:

+RegularField:English +StopWordField:"A Level"
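
If users type the phrase in quotes in a single search box, one possible way to get from their input to a query like the one above is to prefix each quoted phrase with the stop-word field name before handing the string to the query parser. The plain-Java sketch below shows only that string rewrite; the class and method names are hypothetical, and it assumes the parser's default field is the regular one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Library-free sketch: route quoted phrases to the stop-word-preserving field.
class QuotedPhraseRoutingSketch {
    // Matches a double-quoted phrase and captures its contents.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]+)\"");

    // Prefixes every quoted phrase with the given field name.
    // Limitation: a phrase that already carries a field prefix is prefixed again.
    static String routePhrases(String userQuery, String stopWordField) {
        Matcher m = QUOTED.matcher(userQuery);
        return m.replaceAll(Matcher.quoteReplacement(stopWordField) + ":\"$1\"");
    }
}
```

For the input "A Level" AND English Maths Physics (with the quotes typed by the user) and the field name StopWordField, this returns StopWordField:"A Level" AND English Maths Physics. Unquoted words are left alone, so they still go to the default field where 'A' remains a stop word.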
