
Django+Haystack+Whoosh: how to deal with language inflection

Many languages in Europe are inflectional. This means that one word can appear in multiple forms in text. For example, the word "computer" in Polish, "komputer", has multiple forms: "komputera", "komputerowi", "komputerem", "komputery", etc.

How should I use django+haystack+whoosh properly to deal with language inflection?

Whenever I search for any form such as "komputer", "komputera", or "komputerowi", I mean the same thing: "komputer".

In NLP there is a basic approach based either on stemming words (cutting off suffixes) or on converting a form to its base form ("komputerowi" => "komputer"). There are some libraries that can help with that.

My first thought was to prepare a special template filter that converts every recognized word in a given variable to its base form rather than the inflected form. Then I could use it in search index templates in django+haystack. If the search query is also converted before it is evaluated by the Whoosh engine, this should work well. See the example:

haystack search index template:
    {{some_indexed_text|convert_to_base_form_filter}}

text to index: "Nie ma komputera"  => "Nie ma komputer" <- this is really indexed
 search query: "komputery"         => "komputer"   <-- this will match 
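
The idea can be sketched in plain Python. This is only an illustration under stated assumptions: `BASE_FORMS` is a hypothetical hand-written lookup table standing in for a real morphological dictionary, and `convert_to_base_form` is a hypothetical helper name. In a Django project the same function could presumably be registered as a template filter and also applied to the query string before it reaches the backend.

```python
import re

# Hypothetical lookup table: inflected form -> base form.
# A real project would use a morphological dictionary or lemmatizer instead.
BASE_FORMS = {
    "komputera": "komputer",
    "komputerowi": "komputer",
    "komputerem": "komputer",
    "komputery": "komputer",
}

WORD_RE = re.compile(r"\w+")


def convert_to_base_form(text):
    """Replace every recognized word in `text` with its base form."""
    def replace(match):
        word = match.group(0)
        # Words missing from the table are left untouched.
        return BASE_FORMS.get(word.lower(), word)

    return WORD_RE.sub(replace, text)


# The same function runs on both sides, so both forms meet in the index.
indexed = convert_to_base_form("Nie ma komputera")
query = convert_to_base_form("komputery")
print(indexed)  # Nie ma komputer
print(query)    # komputer
```

Because indexed text and query go through the same normalization, any inflected form in the query matches any inflected form in the document.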

But I don't think this is an "elegant" solution to the problem, and some other features won't work, such as spelling suggestions.

So how should I solve this issue? Maybe I should use a search engine other than Whoosh?


Whoosh has, by default, only stemming for the English language.
To implement stemming for another language, first look inside:

/your_path_to_whoosh/whoosh/lang/analysis.py

This is where StemmingAnalyzer (the default analyzer) is defined, and it is an excellent starting point. The stem function, imported from porter.py, is the other important place to look.

So, the three steps are:

  • Implement your own stemming function, taking as a reference the stem function in porter.py and any grammar and language references you will need to get the rules right.

  • Implement your own Analyzer, taking StemmingAnalyzer inside analysis.py as a reference. The file is heavily documented, so you should have no problem navigating through it. You'll see that StemmingAnalyzer is a chain of a Tokenizer with a regex to match words, a lowercase filter, and the stemming filter, which calls the above stemming function. StemFilter takes the stemming function as a parameter, so you don't have to reimplement the filter.

  • Pass your brand new Analyzer function at schema creation time, see here: http://files.whoosh.ca/whoosh/docs/latest/schema.html#creating-a-schema
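
The three steps above can be sketched roughly as follows. This is a minimal illustration, not a usable Polish stemmer: `POLISH_SUFFIXES`, `stem_pl`, `build_polish_analyzer` and `build_schema` are hypothetical names, and the suffix list covers only the forms from the question. The Whoosh pieces used are `RegexTokenizer`, `LowercaseFilter`, `StemFilter(stemfn=...)` and `fields.TEXT(analyzer=...)`.

```python
# Step 1: a toy stemming function (illustrative suffix list only;
# real Polish stemming needs far more rules). Longest suffix first.
POLISH_SUFFIXES = ("owi", "em", "a", "y")


def stem_pl(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in POLISH_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Step 2: chain tokenizer, lowercase filter and stem filter.
def build_polish_analyzer():
    # Imported here so stem_pl can be used and tested without Whoosh installed.
    from whoosh.analysis import LowercaseFilter, RegexTokenizer, StemFilter

    return RegexTokenizer() | LowercaseFilter() | StemFilter(stemfn=stem_pl)


# Step 3: pass the analyzer to the field at schema creation time.
def build_schema():
    from whoosh import fields

    return fields.Schema(text=fields.TEXT(analyzer=build_polish_analyzer()))


print([stem_pl(w) for w in ("komputera", "komputerowi", "komputery")])
```

Since the analyzer is attached to the field, it should apply both when documents are indexed and when queries against that field are parsed, so no separate query preprocessing is needed.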


For future readers: Whoosh can handle other languages with its Snowball stemmers.

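from whoosh import fields
from whoosh.analysis import StemmingAnalyzer
from whoosh.lang.snowball.russian import RussianStemmer

# Use the Snowball stemmer for Russian instead of the default English one.
stemmer_ru = RussianStemmer()
analyzer = StemmingAnalyzer(stemfn=stemmer_ru.stem)
schema = fields.Schema(
    name=fields.TEXT(analyzer=analyzer),
)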


Whoosh LanguageAnalyzer:

Configures a simple analyzer for the given language, with a LowercaseFilter, StopFilter, and StemFilter.

https://whoosh.readthedocs.io/en/latest/api/analysis.html#whoosh.analysis.LanguageAnalyzer
