
Lucene.Net - How to treat a space-separated phrase as a single token?

I've implemented a search facility using Lucene.Net. The index includes UK academic qualifications, including "A Level".

I'd like the users to be able to search using the phrase "A Level", but using the Standard Analyser the "A" is stripped out as a stop-word and therefore only "Level" is indexed/searched.

What's my best option to work around this? I'm guessing I need to somehow tokenise "A Level" to "A-Level" or similar by creating a custom analyser.

Is this the best approach?

Edits:

Note that I don't want the whole search to be a phrase query, i.e. in my search box I want the user to be able to enter <"A Level" AND English Maths Physics> and have this return anything with "A Level" and any of English, Maths or Physics. Question updated to reflect this.

I'd specifically like to keep the use of 'A' as a stop word in all cases apart from 'A Level'.

The phrase 'A Level' is not in its own specific field; it's in a free-text field that may include the phrase.


Use a PhraseQuery - it can be combined with any other query inside a BooleanQuery.

EDITED

You don't need to search for the entire phrase. For your sample it looks like the following (sorry, it is pseudo-code, since I can't test it right now):

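 // "MyField" is a placeholder field name. Terms are lower-case because StandardAnalyzer lower-cases text at index time.
 BooleanQuery rootQuery = new BooleanQuery();
 PhraseQuery q1 = new PhraseQuery();
 q1.Add(new Term("MyField", "a")); // matches only if "a" was not dropped as a stop word at index time
 q1.Add(new Term("MyField", "level"));
 TermQuery q2 = new TermQuery(new Term("MyField", "english"));
 TermQuery q3 = new TermQuery(new Term("MyField", "maths"));
 TermQuery q4 = new TermQuery(new Term("MyField", "physics"));
 rootQuery.Add(q1, BooleanClause.Occur.SHOULD); //or MUST - depends on you
 rootQuery.Add(q2, BooleanClause.Occur.SHOULD);
 rootQuery.Add(q3, BooleanClause.Occur.SHOULD);
 rootQuery.Add(q4, BooleanClause.Occur.SHOULD);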


I do not think this is currently doable with Lucene out of the box. I have a half-finished plug-in which does this; you can see it here. It doesn't set the position and offset attributes, which means that phrase searching won't work correctly, but hopefully it should give you a head start.
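A workaround that stays outside Lucene, along the lines the question suggests, is to rewrite the protected phrase into a single token before the text reaches the analyzer, and to apply the same rewrite to the user's query string. The sketch below is only an illustration: `PhraseProtector` is a hypothetical helper, not part of Lucene.Net, and the replacement token `alevel` is an arbitrary choice (a plain alphanumeric word, so the tokenizer has no reason to split it). 'A' remains a stop word everywhere else.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Lucene.Net): collapses protected phrases
// into single tokens before the text is handed to the analyzer.
public static class PhraseProtector
{
    // Phrase -> replacement token. "alevel" is an arbitrary choice for this sketch.
    private static readonly Dictionary<string, string> Phrases =
        new Dictionary<string, string> { { "A Level", "alevel" } };

    public static string Protect(string text)
    {
        foreach (KeyValuePair<string, string> entry in Phrases)
        {
            // Match the words of the phrase separated by whitespace or hyphens,
            // on word boundaries, ignoring case: "A Level", "a-level", "A  LEVEL".
            string pattern = @"\b"
                + string.Join(@"[\s-]+", entry.Key.Split(' ').Select(w => Regex.Escape(w)))
                + @"\b";
            text = Regex.Replace(text, pattern, entry.Value, RegexOptions.IgnoreCase);
        }
        return text;
    }
}

public static class Program
{
    public static void Main()
    {
        // Apply the same rewrite to indexed text and to the raw query string.
        Console.WriteLine(PhraseProtector.Protect("Three A Level passes in Maths"));
        Console.WriteLine(PhraseProtector.Protect("\"A Level\" AND English Maths Physics"));
        Console.WriteLine(PhraseProtector.Protect("A high level overview"));
    }
}
```

The first two inputs come back with `alevel` in place of the phrase; the third is unchanged because "A" is not immediately followed by "level". If the field is also stored for display, store the original text and pass only the rewritten text to the analyzer.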


How did you index the content, and which analyzer did you use? If you are using StandardAnalyzer, you can specify the stop words in the constructor (an empty list disables stop-word removal):

Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29, new Hashtable());

So index the content with the analyzer above. After that you can query the content using the QueryParser (be sure to use the same analyzer), or you can construct the query manually:

        // Phrase query (terms are lower-case because StandardAnalyzer lower-cases text at index time)
        PhraseQuery phraseQuery = new PhraseQuery();
        phraseQuery.Add(new Term("MyField", "a"));
        phraseQuery.Add(new Term("MyField", "level"));

        // Or query
        BooleanQuery orQuery = new BooleanQuery();
        orQuery.Add(new BooleanClause(new TermQuery(new Term("MyField", "english")), BooleanClause.Occur.SHOULD));
        orQuery.Add(new BooleanClause(new TermQuery(new Term("MyField", "maths")), BooleanClause.Occur.SHOULD));
        orQuery.Add(new BooleanClause(new TermQuery(new Term("MyField", "physics")), BooleanClause.Occur.SHOULD));

        // Main query
        BooleanQuery query = new BooleanQuery();
        query.Add(phraseQuery, BooleanClause.Occur.MUST);
        query.Add(orQuery, BooleanClause.Occur.MUST);

Bye


The KeywordAnalyzer does not tokenize strings, unlike the StandardAnalyzer. I'm assuming there is a .NET implementation of this - possibly this?

I'll often do something like this (beware, Java follows):

private ReusableAnalyzer getReusableAnalyzer(String fieldName, Reader reader) {
    boolean phrase = treatAsPhrase(fieldName);
    ReusableAnalyzer ra = new ReusableAnalyzer();
    TokenStream result = phrase ? new KeywordTokenizer(reader) : new StandardTokenizer(version, reader);

whereby I use the field name to determine whether to treat the text as a "phrase" or not.


This is doable in Lucene with a little more customization.

1) Create a separate field in which stop words are preserved. You'll need to create your own analyzer which inherits from StandardAnalyzer but specifies no stop words in the base constructor.

public class PreserveStopWordsAnalyzer : StandardAnalyzer
{
    public PreserveStopWordsAnalyzer() : base(Version.LUCENE_29, new Hashtable())
    {}
}

2) Search quoted terms against the 'stop word' field. For example:

+RegularField:English +StopWordField:"A Level"
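The two steps above might be wired together roughly as follows. This is a sketch that assumes Lucene.Net 2.9.x (later releases changed some constructor signatures), and the `TwoFieldSearch` class is hypothetical; only the field names `RegularField` and `StopWordField` come from the query above. `PerFieldAnalyzerWrapper` applies the stop-word-preserving analyzer to `StopWordField` only, so the same analyzer instance can be handed to both the `IndexWriter` and the `QueryParser`.

```csharp
using System.Collections;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Version = Lucene.Net.Util.Version;

// Hypothetical wrapper class for this sketch; assumes Lucene.Net 2.9.x.
public static class TwoFieldSearch
{
    // Default analyzer removes stop words; StopWordField keeps them.
    public static PerFieldAnalyzerWrapper BuildAnalyzer()
    {
        PerFieldAnalyzerWrapper analyzer =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_29));
        analyzer.AddAnalyzer(
            "StopWordField",
            new StandardAnalyzer(Version.LUCENE_29, new Hashtable()));
        return analyzer;
    }

    // Parses a query such as: +RegularField:English +StopWordField:"A Level"
    // Unqualified terms are searched in RegularField.
    public static Query Parse(string userQuery)
    {
        QueryParser parser =
            new QueryParser(Version.LUCENE_29, "RegularField", BuildAnalyzer());
        return parser.Parse(userQuery);
    }
}
```

At index time the same free text would need to be added to each document twice, once under each field name, using the analyzer returned by `BuildAnalyzer()`.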
