Lucene bigram tokenizer that takes punctuation marks into account
Is there any way to use Lucene's ShingleAnalyzerWrapper to generate bigrams that take punctuation marks (e.g. ".", "," and ";") into account? Quick example: the field "one two; three four" should yield only 2 bigrams: (one two) and (three four).
You could create a ShingleAnalyzerWrapper that uses an analyzer based on LetterTokenizer. LetterTokenizer breaks the input text at non-letters. Something like:
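    // Package names as in Lucene 3.x, which matches the tokenStream(String, Reader) API used here
    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LetterTokenizer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;

    public class MyCharAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            // LetterTokenizer emits each maximal run of letters as one token
            return new LetterTokenizer(reader);
        }
    }

    // Wrap the analyzer so that its tokens are combined into shingles (bigrams by default)
    ShingleAnalyzerWrapper myBigramWrapper = new ShingleAnalyzerWrapper(new MyCharAnalyzer());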
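To make the target output from the question concrete, here is a minimal plain-Java sketch of bigrams that never span a punctuation mark. It does not use Lucene at all; the class name PunctuationAwareBigrams and the choice of ".", "," and ";" as boundaries are assumptions for illustration only, and it may be handy as a reference when checking what an analyzer chain actually produces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: shows the expected bigrams only.
public class PunctuationAwareBigrams {

    /** Returns word bigrams that never cross a '.', ',' or ';'. */
    public static List<String> bigrams(String text) {
        List<String> result = new ArrayList<>();
        // Cut the text into segments at the punctuation marks from the question.
        for (String segment : text.split("[.,;]")) {
            String trimmed = segment.trim();
            if (trimmed.isEmpty()) {
                continue; // nothing between two punctuation marks
            }
            String[] words = trimmed.split("\\s+");
            // Pair each word with its successor inside the same segment.
            for (int i = 0; i + 1 < words.length; i++) {
                result.add(words[i] + " " + words[i + 1]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [one two, three four]
        System.out.println(bigrams("one two; three four"));
    }
}
```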
If you want finer control over what counts as punctuation, you could subclass CharTokenizer and override the isTokenChar() method.
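The idea behind isTokenChar() can be sketched in plain Java: a predicate decides, character by character, what belongs to a token, and every other character ends the current token. The sketch below is not Lucene code; the class TokenCharSketch and the rule "letters, digits and apostrophes are token characters" are assumptions chosen for illustration, and in a real CharTokenizer subclass only the predicate would need to be written.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone illustration of predicate-driven tokenization.
public class TokenCharSketch {

    // Assumed rule: letters, digits and apostrophes are part of a token.
    public static boolean isTokenChar(int c) {
        return Character.isLetterOrDigit(c) || c == '\'';
    }

    /** Splits text into maximal runs of characters accepted by isTokenChar. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int c = text.codePointAt(i);
            i += Character.charCount(c);
            if (isTokenChar(c)) {
                current.appendCodePoint(c);
            } else if (current.length() > 0) {
                // A non-token character closes the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString()); // flush the trailing token
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [don't, stop, it's, 4pm]
        System.out.println(tokenize("don't stop; it's 4pm"));
    }
}
```

Changing the predicate is the only step needed to treat other characters, such as hyphens, as part of a token.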