Lucene.NET: Camel case tokenizer?
I've started playing with Lucene.NET today and I wrote a simple test method to do indexing and searching on source code files. The problem is that the standard analyzers/tokenizers treat the whole camel case source code identifier name as a single token.
I'm looking for a way to split camel case identifiers like MaxWidth into three tokens: maxwidth, max and width. I've looked for such a tokenizer, but I couldn't find one. Before writing my own: does something like this already exist? Or is there a better approach than writing a tokenizer from scratch?
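To make the target behaviour concrete, here is a minimal plain-Java sketch (no Lucene dependency) of the expansion described above: the whole identifier lowercased, followed by each camel case part. The names CamelCaseExpansion and expand are made up for illustration, and the split rule (a lowercase letter or digit followed by an uppercase letter, plus the end of an acronym) is only an assumption about what should count as a word boundary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene or Lucene.NET.
class CamelCaseExpansion {
    // Zero-width split points: lowercase/digit followed by uppercase ("Max|Width"),
    // or the end of an acronym followed by a capitalised word ("HTML|Parser").
    private static final String BOUNDARY =
        "(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";

    static List<String> expand(String identifier) {
        List<String> tokens = new ArrayList<>();
        // The full identifier always comes first, lowercased.
        tokens.add(identifier.toLowerCase(Locale.ROOT));
        String[] parts = identifier.split(BOUNDARY);
        // Only add the parts when the identifier actually contains a boundary.
        if (parts.length > 1) {
            for (String part : parts) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(expand("MaxWidth"));   // [maxwidth, max, width]
        System.out.println(expand("HTMLParser")); // [htmlparser, html, parser]
    }
}
```

An identifier without any boundary, such as width, comes back as a single token.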
UPDATE: in the end I decided to get my hands dirty and I wrote a CamelCaseTokenFilter myself. I'll write a post about it on my blog and I'll update the question.
Solr has a WordDelimiterFilterFactory, which creates a token filter that does something similar to what you need. Maybe you can translate the source code into C#.
The link below might be helpful for writing a custom tokenizer:
http://karticles.com/NoSql/lucene_custom_tokenizer.html
Here is my implementation (written for Java Lucene):

package corp.sap.research.indexing;

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CamelCaseFilter extends TokenFilter {

    private final CharTermAttribute _termAtt;

    // The constructor name must match the class name.
    protected CamelCaseFilter(TokenStream input) {
        super(input);
        this._termAtt = addAttribute(CharTermAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken())
            return false;
        // Replace the current term with its space-separated camel case parts.
        String splitString = splitCamelCase(_termAtt.toString());
        _termAtt.setEmpty();
        _termAtt.append(splitString);
        return true;
    }

    // Inserts a space at every camel case boundary, e.g. "MaxWidth" -> "Max Width".
    static String splitCamelCase(String s) {
        return s.replaceAll(
            String.format("%s|%s|%s",
                "(?<=[A-Z])(?=[A-Z][a-z])",   // end of an acronym: "HTML|Parser"
                "(?<=[^A-Z])(?=[A-Z])",       // non-uppercase before uppercase: "Max|Width"
                "(?<=[A-Za-z])(?=[^A-Za-z])"  // letter before non-letter: "value|2"
            ),
            " "
        );
    }
}
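A quick way to see what splitCamelCase produces is to run the same regular expression outside Lucene. The sketch below copies the three lookaround patterns from the filter into a standalone class; SplitCamelCaseDemo is a made-up name and the sample identifiers are only illustrations.

```java
// Standalone check of the regex used by the filter; no Lucene dependency.
class SplitCamelCaseDemo {

    static String splitCamelCase(String s) {
        return s.replaceAll(
            "(?<=[A-Z])(?=[A-Z][a-z])"         // end of an acronym
            + "|(?<=[^A-Z])(?=[A-Z])"          // non-uppercase before uppercase
            + "|(?<=[A-Za-z])(?=[^A-Za-z])",   // letter before non-letter
            " ");
    }

    public static void main(String[] args) {
        System.out.println(splitCamelCase("MaxWidth"));     // Max Width
        System.out.println(splitCamelCase("HTMLParser"));   // HTML Parser
        System.out.println(splitCamelCase("value2String")); // value 2 String
        System.out.println(splitCamelCase("maxwidth"));     // maxwidth
    }
}
```

Note that the filter rewrites each term as one string containing spaces rather than emitting max and width as separate tokens; getting separate tokens would presumably require buffering the parts inside incrementToken and returning them one at a time.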