Culture-independent stemmer/analyzer for Lucene.NET
We're currently developing a full-text-search-enabled app, and Lucene.NET is our weapon of choice. The app is expected to be used by people from different countries, so Lucene.NET has to be able to search across Russian, English and other texts equally well.
Are there any universal and culture-independent stemmers and analyzers to suit our needs? I understand that eventually we'd have to use culture-specific ones, but we want to get up and running with this potentially quick and dirty approach.
Given that the spelling, grammar and character sets of English and Russian are significantly different, any stemmer that tried to handle both would either be very large or perform poorly (most likely both).
It would probably be much better to use a stemmer for each language, and pick which one to use based on either UI clues (what language is being used to query) or by explicit selection.
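As a rough illustration of picking an analyzer by explicit selection, here is a minimal Python sketch. It is not Lucene.NET code: the function names, the `ANALYZERS` registry and the suffix lists are all hypothetical placeholders, and the suffix stripping is a toy stand-in for a real per-language stemmer.

```python
# Hypothetical per-language analyzers. The suffix lists are toy placeholders,
# not real stemming rules; a real setup would plug in proper stemmers here.

def _strip_suffix(token, suffixes):
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze_english(text):
    # Lowercase, split on whitespace, then strip a few English suffixes.
    return [_strip_suffix(t, ("ing", "ed", "s")) for t in text.lower().split()]


def analyze_russian(text):
    # Lowercase, split on whitespace, then strip a few Russian endings.
    return [_strip_suffix(t, ("ами", "ов", "ы", "а")) for t in text.lower().split()]


# Registry keyed by the language code chosen in the UI (assumed codes).
ANALYZERS = {"en": analyze_english, "ru": analyze_russian}


def analyze(text, language):
    """Dispatch to the analyzer registered for an explicitly chosen language."""
    try:
        analyzer = ANALYZERS[language]
    except KeyError:
        raise ValueError(f"no analyzer registered for language {language!r}")
    return analyzer(text)
```

Supporting another language then means registering one more entry rather than growing a single do-everything stemmer.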
Having said that, it's unlikely that any Russian text will match an English search term correctly or vice-versa.
This sounds like a case where a little more business analysis would help more than code.
There is no such thing as a language-independent stemmer. In fact, whether stemming improves retrieval performance varies per language. The best you can do is language guessing on the documents and queries, then dispatch to the appropriate analyzer/stemmer.
Language guessing on short queries is hard, though (as in state-of-the-art, not quick 'n' dirty). If your queries are short, you might want to use a simple whitespace analyzer on the queries and not stem anything.
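A very crude sketch of that routing, in Python rather than Lucene.NET, might look like the following. It guesses the script (Cyrillic vs. Latin) rather than the language, which is a much weaker signal: Latin script does not imply English and Cyrillic does not imply Russian. The function names, the returned analyzer labels and the `min_tokens` threshold are assumptions made for illustration only.

```python
import unicodedata


def guess_script(text):
    """Return 'cyrillic', 'latin', or None when there is no clear majority."""
    counts = {"cyrillic": 0, "latin": 0}
    for ch in text:
        if not ch.isalpha():
            continue
        # Unicode character names start with the script, e.g. "CYRILLIC SMALL LETTER KA".
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["cyrillic"] += 1
        elif name.startswith("LATIN"):
            counts["latin"] += 1
    if counts["cyrillic"] == counts["latin"]:
        return None
    return max(counts, key=counts.get)


def choose_analyzer(text, is_query, min_tokens=5):
    """Pick an analyzer label; short queries get plain whitespace tokenization."""
    # Assumption: queries with fewer than min_tokens words are too short to guess.
    if is_query and len(text.split()) < min_tokens:
        return "whitespace"
    script = guess_script(text)
    # Assumed mapping from script to analyzer; anything unclear falls back.
    return {"cyrillic": "russian", "latin": "english"}.get(script, "whitespace")
```

For example, `choose_analyzer("lucene stemmer", is_query=True)` falls back to the unstemmed whitespace analyzer, while a full Russian document would be routed to the Russian one.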