What are the best practices for combining analyzers in Lucene?
I have a situation where I'm using a StandardAnalyzer in Lucene to index text strings as follows:
public void indexText(String suffix, boolean includeStopWords) {
    StandardAnalyzer analyzer = null;
    if (includeStopWords) {
        analyzer = new StandardAnalyzer(Version.LUCENE_30);
    }
    else {
        // Get stop words to exclude them.
        Set<String> stopWords = (Set<String>) Stop_Word_Listener.getStopWords();
        analyzer = new StandardAnalyzer(Version.LUCENE_30, stopWords);
    }
    try {
        // Index text.
        Directory index = new RAMDirectory();
        IndexWriter w = new IndexWriter(index, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
        this.addTextToIndex(w, this.getTextToIndex());
        w.close();
        // Read index.
        IndexReader ir = IndexReader.open(index);
        Text_TermVectorMapper ttvm = new Text_TermVectorMapper();
        int docId = 0;
        ir.getTermFreqVector(docId, PropertiesFile.getProperty(text), ttvm);
        // Set output.
        this.setWordFrequencies(ttvm.getWordFrequencies());
        ir.close();
    }
    catch (Exception ex) {
        logger.error("Error message\n", ex);
    }
}
private void addTextToIndex(IndexWriter w, String value) throws IOException {
    Document doc = new Document();
    doc.add(new Field(text, value, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
    w.addDocument(doc);
}
This works perfectly well, but I would also like to combine it with stemming using a SnowballAnalyzer.
This class also has two instance variables shown in a constructor below:
public Text_Indexer(String textToIndex) {
this.textToIndex = textToIndex;
this.wordFrequencies = new HashMap<String, Integer>();
}
Can anyone tell me how best to achieve this with the code above?
Thanks
Mr Morgan.
Lucene provides the org.apache.lucene.analysis.Analyzer
base class, which you can extend if you want to write your own Analyzer.
For an example, check out the org.apache.lucene.analysis.standard.StandardAnalyzer
class, which extends Analyzer.
Then, in YourAnalyzer, you'll chain StandardAnalyzer and SnowballAnalyzer by using the filters those analyzers use, like this:
TokenStream result = new StandardFilter(tokenStream);
result = new SnowballFilter(result, stopSet);
Then, in your existing code, you'll be able to construct IndexWriter with your own Analyzer implementation that chains Standard and Snowball filters.
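For concreteness, here is a minimal sketch of such an analyzer against the Lucene 3.0-era API used in the question (the class name and the "English" stemmer name are assumptions, and several of these filter constructors changed in later Lucene versions):

import java.io.Reader;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Hypothetical analyzer that applies StandardAnalyzer's chain plus Snowball stemming.
public class StandardSnowballAnalyzer extends Analyzer {

    private final Set<?> stopWords;

    public StandardSnowballAnalyzer(Set<?> stopWords) {
        this.stopWords = stopWords; // may be null if no stop-word removal is wanted
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Tokenize, then run the same filters StandardAnalyzer uses,
        // and finish with Snowball stemming.
        TokenStream result = new StandardTokenizer(Version.LUCENE_30, reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        if (stopWords != null) {
            result = new StopFilter(true, result, stopWords);
        }
        return new SnowballFilter(result, "English");
    }
}

You would then pass new StandardSnowballAnalyzer(stopWords) (or null to skip stop-word removal) to the IndexWriter constructor in indexText().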
Totally off-topic:
I suppose you'll eventually need to set up your own custom way of handling requests. That is already implemented in Solr.
First, write your own search component by extending SearchComponent and define it in solrconfig.xml, like this:
<searchComponent name="yourQueryComponent" class="org.apache.solr.handler.component.YourQueryComponent"/>
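On the Java side, a skeleton for that component might look roughly like this (sketched against the Solr 1.4-era SearchComponent API; the exact set of SolrInfoMBean methods you must implement differs between Solr versions):

package org.apache.solr.handler.component;

import java.io.IOException;

// Hypothetical custom component matching the searchComponent definition above.
public class YourQueryComponent extends SearchComponent {

    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
        // Inspect or modify the incoming request before the standard components run.
    }

    @Override
    public void process(ResponseBuilder rb) throws IOException {
        // Add your own results or metadata to the response.
    }

    // SolrInfoMBean methods (required as abstract in older Solr releases).
    @Override
    public String getDescription() { return "Your custom query component"; }

    @Override
    public String getSource() { return "$Source$"; }

    @Override
    public String getSourceId() { return "$Id$"; }

    @Override
    public String getVersion() { return "1.0"; }
}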
Then write your search handler (request handler) by extending SearchHandler, and define it in solrconfig.xml:
<requestHandler name="YourRequestHandlerName" class="org.apache.solr.handler.component.YourRequestHandler" default="true">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">1000</int>
    <str name="fl">*</str>
    <str name="version">2.1</str>
  </lst>
  <arr name="components">
    <str>yourQueryComponent</str>
    <str>facet</str>
    <str>mlt</str>
    <str>highlight</str>
    <str>stats</str>
    <str>debug</str>
  </arr>
</requestHandler>
Then, when you send a URL query to Solr, simply include the additional parameter qt=YourRequestHandlerName, and your request handler will be used for that request.
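For example, assuming a default local Solr install and the standard select path:

http://localhost:8983/solr/select?q=text:lucene&qt=YourRequestHandlerName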
More about SearchComponents.
More about RequestHandlers.
The SnowballAnalyzer provided by Lucene already uses the StandardTokenizer, StandardFilter, LowerCaseFilter, StopFilter, and SnowballFilter. So it sounds like it does exactly what you want (everything StandardAnalyzer does, plus the snowball stemming).
If it didn't, you could build your own analyzer pretty easily by combining whatever tokenizers and TokenStreams you wish.
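For example, a drop-in change to the question's indexText() could look roughly like this (Lucene 3.0-era contrib API; the "English" stemmer name is an assumption, and note that SnowballAnalyzer without a stop set removes no stop words, unlike the no-arg StandardAnalyzer):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

// Inside indexText(): replace the StandardAnalyzer construction.
Analyzer analyzer;
if (includeStopWords) {
    // Pass StandardAnalyzer's default English stop set explicitly
    // to keep the same stop-word behaviour as before.
    analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English", StandardAnalyzer.STOP_WORDS_SET);
}
else {
    Set<String> stopWords = (Set<String>) Stop_Word_Listener.getStopWords();
    analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English", stopWords);
}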
In the end I rearranged the program code to call the SnowballAnalyzer as an option. The output is then indexed via the StandardAnalyzer.
It works and is fast but if I can do everything with just one analyzer, I'll revisit my code.
Thanks to mbonaci and Avi.