What are the best practices for combining analyzers in Lucene?
I have a situation where I'm using a StandardAnalyzer in Lucene to index text strings as follows:
public void indexText(String suffix, boolean includeStopWords) {
    StandardAnalyzer analyzer = null;
    if (includeStopWords) {
        analyzer = new StandardAnalyzer(Version.LUCENE_30);
    }
    else {
        // Get stop words to exclude them.
        Set<String> stopWords = (Set<String>) Stop_Word_Listener.getStopWords();
        analyzer = new StandardAnalyzer(Version.LUCENE_30, stopWords);
    }
    try {
        // Index text.
        Directory index = new RAMDirectory();
        IndexWriter w = new IndexWriter(index, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
        this.addTextToIndex(w, this.getTextToIndex());
        w.close();
        // Read index.
        IndexReader ir = IndexReader.open(index);
        Text_TermVectorMapper ttvm = new Text_TermVectorMapper();
        int docId = 0;
        ir.getTermFreqVector(docId, PropertiesFile.getProperty(text), ttvm);
        // Set output.
        this.setWordFrequencies(ttvm.getWordFrequencies());
        ir.close();
    }
    catch (Exception ex) {
        logger.error("Error message\n", ex);
    }
}
private void addTextToIndex(IndexWriter w, String value) throws IOException {
    Document doc = new Document();
    doc.add(new Field(text, value, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
    w.addDocument(doc);
}
This works perfectly well, but I would also like to combine it with stemming using a SnowballAnalyzer.
This class also has two instance variables shown in a constructor below:
public Text_Indexer(String textToIndex) {
this.textToIndex = textToIndex;
this.wordFrequencies = new HashMap<String, Integer>();
}
Can anyone tell me how best to achieve this with the code above?
Thanks
Mr Morgan.
Lucene provides the org.apache.lucene.analysis.Analyzer
base class, which you can extend if you want to write your own Analyzer.
For an example, check out the org.apache.lucene.analysis.standard.StandardAnalyzer
class, which extends Analyzer.
Then, in YourAnalyzer, you'll chain StandardAnalyzer and SnowballAnalyzer by using the filters those analyzers use, like this:
TokenStream result = new StandardFilter(tokenStream);
result = new SnowballFilter(result, stopSet);
Then, in your existing code, you'll be able to construct IndexWriter with your own Analyzer implementation that chains Standard and Snowball filters.
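For concreteness, here is a minimal sketch of such an analyzer against the Lucene 3.0-era API used in the question (the class name and the "English" stemmer name are assumptions, and several of these filter constructors changed in later Lucene versions):

import java.io.Reader;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Hypothetical analyzer that applies StandardAnalyzer's chain plus Snowball stemming.
public class StandardSnowballAnalyzer extends Analyzer {

    private final Set<?> stopWords;

    public StandardSnowballAnalyzer(Set<?> stopWords) {
        this.stopWords = stopWords; // may be null if no stop-word removal is wanted
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Tokenize, then run the same filters StandardAnalyzer uses,
        // and finish with Snowball stemming.
        TokenStream result = new StandardTokenizer(Version.LUCENE_30, reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        if (stopWords != null) {
            result = new StopFilter(true, result, stopWords);
        }
        return new SnowballFilter(result, "English");
    }
}

You would then pass new StandardSnowballAnalyzer(stopWords) (or null to skip stop-word removal) to the IndexWriter constructor in indexText().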
Totally off-topic:
I suppose you'll eventually need to set up your own custom way of handling requests. That is already implemented in Solr.
First, write your own search component by extending SearchComponent and define it in solrconfig.xml, like this:
<searchComponent name="yourQueryComponent" class="org.apache.solr.handler.component.YourQueryComponent"/>
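On the Java side, a skeleton for that component might look roughly like this (sketched against the Solr 1.4-era SearchComponent API; the exact set of SolrInfoMBean methods you must implement differs between Solr versions):

package org.apache.solr.handler.component;

import java.io.IOException;

// Hypothetical custom component matching the searchComponent definition above.
public class YourQueryComponent extends SearchComponent {

    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
        // Inspect or modify the incoming request before the standard components run.
    }

    @Override
    public void process(ResponseBuilder rb) throws IOException {
        // Add your own results or metadata to the response.
    }

    // SolrInfoMBean methods (required as abstract in older Solr releases).
    @Override
    public String getDescription() { return "Your custom query component"; }

    @Override
    public String getSource() { return "$Source$"; }

    @Override
    public String getSourceId() { return "$Id$"; }

    @Override
    public String getVersion() { return "1.0"; }
}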
Then write your search handler (request handler) by extending SearchHandler, and define it in solrconfig.xml:
<requestHandler name="YourRequestHandlerName" class="org.apache.solr.handler.component.YourRequestHandler" default="true">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">1000</int>
    <str name="fl">*</str>
    <str name="version">2.1</str>
  </lst>
  <arr name="components">
    <str>yourQueryComponent</str>
    <str>facet</str>
    <str>mlt</str>
    <str>highlight</str>
    <str>stats</str>
    <str>debug</str>
  </arr>
</requestHandler>
Then, when you send a URL query to Solr, simply include the additional parameter qt=YourRequestHandlerName, and your request handler will be used for that request.
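For example, assuming a default local Solr install and the standard select path:

http://localhost:8983/solr/select?q=text:lucene&qt=YourRequestHandlerName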
More about SearchComponents.
More about RequestHandlers.
The SnowballAnalyzer provided by Lucene already uses the StandardTokenizer, StandardFilter, LowerCaseFilter, StopFilter, and SnowballFilter. So it sounds like it does exactly what you want (everything StandardAnalyzer does, plus the snowball stemming).
If it didn't, you could build your own analyzer pretty easily by combining whatever tokenizers and TokenStreams you wish.
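For example, a drop-in change to the question's indexText() could look roughly like this (Lucene 3.0-era contrib API; the "English" stemmer name is an assumption, and note that SnowballAnalyzer without a stop set removes no stop words, unlike the no-arg StandardAnalyzer):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

// Inside indexText(): replace the StandardAnalyzer construction.
Analyzer analyzer;
if (includeStopWords) {
    // Pass StandardAnalyzer's default English stop set explicitly
    // to keep the same stop-word behaviour as before.
    analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English", StandardAnalyzer.STOP_WORDS_SET);
}
else {
    Set<String> stopWords = (Set<String>) Stop_Word_Listener.getStopWords();
    analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English", stopWords);
}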
In the end I rearranged the program code to call the SnowballAnalyzer as an option. The output is then indexed via the StandardAnalyzer.
It works and is fast but if I can do everything with just one analyzer, I'll revisit my code.
Thanks to mbonaci and Avi.