
Lucene bigram tokenizer that takes punctuation into account

Is there any way to use Lucene's ShingleAnalyzerWrapper to generate bigrams that take punctuation marks (e.g. . , ;) into account? Quick example: the field "one two; three four" should yield only two bigrams: (one two) and (three four).


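You could create a ShingleAnalyzerWrapper that wraps an analyzer based on LetterTokenizer. LetterTokenizer splits the input at every non-letter character, so punctuation never ends up inside a token. Note that it discards the punctuation rather than marking it as a boundary, so check whether the shingles still pair words across it (for example "two three"). Something like: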

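import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LetterTokenizer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;

// Analyzer whose tokens are maximal runs of letters (pre-4.0 Lucene API).
public class MyCharAnalyzer extends Analyzer {

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LetterTokenizer(reader);
  }
}

// The default maximum shingle size is 2, i.e. bigrams.
ShingleAnalyzerWrapper myBigramWrapper = new ShingleAnalyzerWrapper(new MyCharAnalyzer());
// Unigrams are emitted by default; turn them off to get bigrams only.
myBigramWrapper.setOutputUnigrams(false);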

If you want finer control over which characters count as punctuation, you could subclass CharTokenizer and override its isTokenChar() method.
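
To make the target behaviour concrete, here is a minimal plain-Java sketch of "bigrams that never cross punctuation". It does not use Lucene at all: the class PunctuationBigrams and the isBoundary() predicate are hypothetical names made up for illustration, and the set of boundary characters (. , ;) is an assumption taken from the question. It can serve as a reference for what a custom tokenizer plus shingle filter should produce.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; not part of Lucene's API.
final class PunctuationBigrams {

    // Hypothetical predicate: characters that end a segment.
    // Bigrams are never formed across one of these.
    static boolean isBoundary(char c) {
        return c == '.' || c == ',' || c == ';';
    }

    // Returns the word bigrams of text, restarting after each boundary character.
    static List<String> bigrams(String text) {
        List<String> result = new ArrayList<>();
        String previous = null;                  // last word seen in the current segment
        StringBuilder word = new StringBuilder();

        // Iterate one step past the end so the final word is flushed.
        for (int i = 0; i <= text.length(); i++) {
            char c = i < text.length() ? text.charAt(i) : ' ';
            if (Character.isLetterOrDigit(c)) {
                word.append(c);
                continue;
            }
            // Any other character ends the current word.
            if (word.length() > 0) {
                String current = word.toString();
                if (previous != null) {
                    result.add(previous + " " + current);
                }
                previous = current;
                word.setLength(0);
            }
            // Punctuation additionally resets the segment.
            if (isBoundary(c)) {
                previous = null;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("one two; three four")); // prints [one two, three four]
    }
}
```

Characters that are neither letters, digits nor boundary characters (whitespace, hyphens and so on) only separate words here; changing isBoundary() is the plain-Java counterpart of deciding what you treat as punctuation.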
