Django+Haystack+Whoosh: how to deal with language inflection

2023-01-15 11:20

Many European languages are inflectional: a single word can appear in text in many different forms. For example, the Polish word for 'computer', "komputer", has the forms "komputera", "komputerowi", "komputerem", "komputery", etc.

How should I use django+haystack+whoosh properly to deal with language inflection?

Whenever I search for any of the forms "komputer", "komputera" or "komputerowi", I mean the same thing -> "komputer".

In NLP the basic approaches are either stemming (cutting off suffixes) or lemmatization, i.e. converting a form to its base form ("komputerowi" => "komputer"). There are some libraries that can help with that.

My first thought was to write a special template filter that converts every recognized word in a given variable to its base form. Then I could use it in the search index templates in django+haystack. If the search query is also converted before it is evaluated by the whoosh engine, this should work well. See the example:

haystack search index template:
    {{some_indexed_text|convert_to_base_form_filter}}

text to index: "Nie ma komputera" => "Nie ma komputer"  <- this is what really gets indexed
search query:  "komputery"        => "komputer"         <- this will match
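
A minimal sketch of what such a conversion could look like, in plain Python. The BASE_FORMS table and the convert_to_base_form name are made up for illustration; a real project would take base forms from a morphological dictionary or a lemmatizer library rather than from a hand-written dict.

```python
import re

# Hypothetical lookup table: inflected form -> base form.
# A real implementation would query a morphological dictionary instead.
BASE_FORMS = {
    "komputera": "komputer",
    "komputerowi": "komputer",
    "komputerem": "komputer",
    "komputery": "komputer",
}

WORD_RE = re.compile(r"\w+")


def convert_to_base_form(text):
    """Replace every recognized word in text with its base form."""
    def replace(match):
        word = match.group(0)
        # Look the word up case-insensitively; unknown words stay unchanged.
        return BASE_FORMS.get(word.lower(), word)
    return WORD_RE.sub(replace, text)


# The same function has to be applied on both sides.
indexed = convert_to_base_form("Nie ma komputera")  # text going into the index
query = convert_to_base_form("komputery")           # the user's search query
print(indexed)
print(query)
```

In a Django project the function could be registered as the template filter from the example via register.filter on a django.template.Library instance, and the query string would have to go through the same function before it reaches Haystack.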

But I don't think this is an "elegant" solution to the problem, and some other features won't work, such as spelling suggestions.

So, how should I solve this issue? Maybe I should use a search engine other than whoosh?

By default, Whoosh only provides stemming for the English language.
To implement stemming for another language, first look inside:

/your_path_to_whoosh/whoosh/lang/analysis.py

This is where StemmingAnalyzer (the default analyzer) is defined, and it is an excellent starting point. The stem function, imported from porter.py, is the other important place to look.

So, the three steps are:

1. Implement your own stemming function, using the stem function in porter.py as a reference, along with whatever grammar and language references you need to get the rules right.
2. Implement your own analyzer, using StemmingAnalyzer inside analysis.py as a reference. The file is heavily documented, so you should have no problem navigating it. You'll see that StemmingAnalyzer is basically a chain of a tokenizer (with a regex that matches words), a lowercase filter and the stemming filter, which calls the stemming function above. StemFilter takes the stemming function as a parameter, so you don't have to reimplement the filter.
3. Pass your new analyzer at schema creation time, see here: http://files.whoosh.ca/whoosh/docs/latest/schema.html#creating-a-schema
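
The three steps above can be sketched roughly as follows. The suffix list and the names polish_stem, build_polish_analyzer and build_schema are assumptions made up for this illustration, and the rules are nowhere near a complete description of Polish; RegexTokenizer, LowercaseFilter and StemFilter are among the Whoosh classes that StemmingAnalyzer itself chains together.

```python
# Step 1: a toy stemming function. The suffix list is illustrative only.
SUFFIXES = ("owi", "ami", "ach", "ów", "om", "em", "ie", "a", "y", "u")
MIN_STEM_LENGTH = 3  # avoid reducing short words like "ma" to nothing


def polish_stem(word):
    """Strip the longest matching suffix, keeping a minimum stem length."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


# Step 2: an analyzer that chains tokenizer, lowercasing and stemming.
def build_polish_analyzer():
    # Imported here so the stemmer above can be tried without Whoosh installed.
    from whoosh.analysis import LowercaseFilter, RegexTokenizer, StemFilter
    return RegexTokenizer() | LowercaseFilter() | StemFilter(stemfn=polish_stem)


# Step 3: pass the analyzer to the field at schema creation time.
def build_schema():
    from whoosh.fields import Schema, TEXT
    return Schema(text=TEXT(analyzer=build_polish_analyzer()))


forms = ["komputer", "komputera", "komputerowi", "komputerem", "komputery"]
stems = [polish_stem(form) for form in forms]
print(stems)
```

Because the analyzer is attached to the field, Whoosh applies it both when indexing and when parsing queries against that field, so inflected forms on either side are reduced in the same way.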

For future readers: Whoosh can handle different languages with its bundled Snowball stemmers.

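from whoosh import fields
from whoosh.analysis import StemmingAnalyzer
from whoosh.lang.snowball.russian import RussianStemmer

# Use the Snowball stemmer for Russian instead of the default English one.
stemmer_ru = RussianStemmer()
analyzer = StemmingAnalyzer(stemfn=stemmer_ru.stem)
# Attach the analyzer to the text field when the schema is created.
schema = fields.Schema(
    name=fields.TEXT(analyzer=analyzer),
)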

Whoosh LanguageAnalyzer:

Configures a simple analyzer for the given language, with a LowercaseFilter, StopFilter, and StemFilter.

https://whoosh.readthedocs.io/en/latest/api/analysis.html#whoosh.analysis.LanguageAnalyzer
