Lucene.NET: Camel case tokenizer?

2023-01-15 10:28 问答作者：

I've started playing with Lucene.NET today and I wrote a simple test method to do indexing and searching on source code files. The problem is that the standard analyze开发者_如何学Gors/tokenizers treat the whole camel case source code identifier name as a single token.

I'm looking for a way to treat camel case identifiers like MaxWidth into three tokens: maxwidth, max and width. I've looked for such a tokenizer, but I couldn't find it. Before writing my own: is there something in this direction? Or is there a better approach than writing a tokenizer from scratch?

UPDATE: in the end I decided to get my hands dirty and I wrote a CamelCaseTokenFilter myself. I'll write a post about it on my blog and I'll update the question.

Solr has a WordDelimiterFactory which generates a tokenizer similar to what you need. Maybe you can translate the source code into C#.

Below link might be helpful to write custom tokenizer...

http://karticles.com/NoSql/lucene_custom_tokenizer.html

Here is my implementation :

package corp.sap.research.indexing;

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CamelCaseFilter extends TokenFilter {

    private final CharTermAttribute _termAtt;

    protected CamelCaseScoreFilter(TokenStream input) {
        super(input);
        this._termAtt = addAttribute(CharTermAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken())
            return false;
        CharTermAttribute a = this.getAttribute(CharTermAttribute.class);
        String spliettedString = splitCamelCase(a.toString());
        _termAtt.setEmpty();
        _termAtt.append(spliettedString);
        return true;

    }


    static String splitCamelCase(String s) {
           return s.replaceAll(
              String.format("%s|%s|%s",
                 "(?<=[A-Z])(?=[A-Z][a-z])",
                 "(?<=[^A-Z])(?=[A-Z])",
                 "(?<=[A-Za-z])(?=[^A-Za-z])"
              ),
              " "
           );
        }
}

继续阅读：lucene lucene.net tokenize

Lucene.NET: Camel case tokenizer?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？