Tokenizing hashtags in Lucene.Net
I am using Lucene.Net (version 2.9). When indexing tweets, I would like to preserve tokens such as '@name' or '#Note'.
I am using the Lucene AnalyzerViewer tool (http://www.codeproject.com/KB/cs/lucene_analysis.aspx?msg=3326095#xx3326095xx) to review the tokens produced by different analyzers.
For example, these are the tokens produced from the text "#Note: Excercise, to live longer."
- Whitespace Analyzer: [#Note:] [Excercise,] [to] [live] [longer.]
- Standard Analyzer: [note] [excercise] [live] [longer]
- Simple Analyzer: [note] [excercise] [to] [live] [longer]
Only the Whitespace Analyzer preserves the hashtag. I therefore created a custom analyzer that uses WhitespaceTokenizer and then lower-cases the tokens.
Custom analyzer code:
using Lucene.Net.Analysis;

public class CustomAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
    {
        // Split on whitespace only, so '#' and '@' stay attached to their tokens.
        TokenStream result = new WhitespaceTokenizer(reader);
        // Make sure everything is lower case.
        result = new LowerCaseFilter(result);
        // Return the built token stream.
        return result;
    }
}
However, the custom analyzer leaves punctuation attached to the tokens. Tokens produced by the custom analyzer: [#note:] [excercise,] [to] [live] [longer.]
Any suggestions for a filter that preserves the '#' and '@' tags but removes the punctuation?
Thanks in advance.
In the Java version of Lucene there is a PatternAnalyzer, which lets you specify a regular-expression pattern that is used to split the text into tokens.
Documentation: http://lucene.apache.org/java/2_9_4/api/contrib-memory/org/apache/lucene/index/memory/PatternAnalyzer.html
You could look for a .NET version of this analyzer or port it yourself.
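If porting PatternAnalyzer is more than you need, another option might be to keep the whitespace tokenizer and clean each token afterwards. The following is only a minimal sketch of that cleaning step in plain C#, with no Lucene.Net calls: the class name TagTokenCleaner and the method Clean are hypothetical names invented for illustration, and wiring the method into a custom TokenFilter is left out because the attribute API differs between Lucene.Net releases. It assumes that a tag marker only matters as the first character of a token.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Lucene.Net): strips punctuation from a
// single token while keeping a leading '#' or '@'.
public static class TagTokenCleaner
{
    // Matches any character that is not a letter, a digit, or an underscore.
    private static readonly Regex NonWord =
        new Regex(@"[^\p{L}\p{N}_]", RegexOptions.Compiled);

    public static string Clean(string token)
    {
        if (string.IsNullOrEmpty(token))
        {
            return string.Empty;
        }

        // Remember the tag marker, if the token starts with one.
        string prefix = (token[0] == '#' || token[0] == '@')
            ? token.Substring(0, 1)
            : string.Empty;

        // Remove every non-word character from the rest of the token.
        string body = NonWord.Replace(token.Substring(prefix.Length), string.Empty);

        // A bare "#" or "@" (or pure punctuation) becomes an empty string,
        // which the caller can skip.
        return body.Length == 0 ? string.Empty : prefix + body;
    }
}

public static class Demo
{
    public static void Main()
    {
        // Tokens as produced by the custom analyzer in the question.
        string[] tokens = { "#note:", "excercise,", "to", "live", "longer." };
        foreach (string token in tokens)
        {
            Console.WriteLine(TagTokenCleaner.Clean(token));
        }
        // Prints: #note, excercise, to, live, longer (one per line)
    }
}
```

Note that this sketch also removes punctuation inside a token, so "don't" becomes "dont"; adjust the pattern if apostrophes or hyphens should survive.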