Lucene / Solr: what request handlers to use for query strings in Chinese or Japanese?
For my Solr server, some of the query strings will be in Asian languages such as Chinese or Japanese.
For such query strings, would the Standard or Dismax request handler work? My understanding is that both the Standard and the Dismax handler tokenize the query string by whitespace. And that wouldn't work for Chinese or Japanese, right?
In that case, what request handler should I use? And if I need to set up custom request handlers for those languages, how do I do it?
Thanks.
It's not about the request handler but the language analyzers.
Lucene has a CJK package for this purpose. See here for info on using it in Solr.
See also this thread for alternatives.
Your queries will be parsed according to the analyzers of the fields you're querying, whether you're using the standard Solr query parser or DisMax query parser.
So in this case, as Mauricio says, the question is about how your strings of text are analyzed into tokens.
For Chinese and Korean, there is CJK, which performs basic N-gram analysis to break text down into overlapping pairs of adjacent characters (bigrams). It's not the best way to analyze in terms of relevance and index size, but it works.
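To make the bigram idea concrete, here is a minimal sketch in plain Python. It is not Lucene's CJK code; the function name `char_bigrams` and the sample query are made up for illustration, and it assumes the input is a single run of CJK text with no whitespace or Latin characters.

```python
def char_bigrams(text):
    """Split a run of text into overlapping two-character tokens."""
    if len(text) < 2:
        # Nothing to pair up: return the lone character, or no tokens at all.
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


query = "全文検索"  # hypothetical query, "full-text search"

# Whitespace splitting leaves the whole string as one token.
print(query.split())        # ['全文検索']

# Bigram analysis yields smaller tokens that can match inside longer text.
print(char_bigrams(query))  # ['全文', '文検', '検索']
```

Note that the middle token straddles two words and is not a real term, which is one reason this approach costs relevance and index size.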
For Japanese, I highly recommend the new Kuromoji morphological analyzers in Solr and Lucene 3.6.0. They use a dictionary and some other statistics to tokenize text into real terms. That lets you do all sorts of really excellent quality
Docs are sparse at the moment, so check out these links…
- Kuromoji - Japanese morphological analyzer
- LUCENE-3305
- Sample schema using Kuromoji analyzers on Websolr
- My presentation at the 20 Apr 2012 #herokujp meetup, on full-text search with an emphasis on analyzing Japanese.
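For a rough feel of what dictionary-based tokenization buys you over bigrams, here is a toy sketch in plain Python. It uses greedy longest-match segmentation, which is a much simpler technique than what Kuromoji actually does (Kuromoji also relies on statistics, as noted above). The dictionary, the function name and the sample text are all hypothetical.

```python
def longest_match_segment(text, dictionary):
    """Greedy left-to-right segmentation: take the longest dictionary word
    starting at each position, falling back to a single character."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical three-word dictionary: "full text", "search", "engine".
TOY_DICTIONARY = {"全文", "検索", "エンジン"}

print(longest_match_segment("全文検索エンジン", TOY_DICTIONARY))
# ['全文', '検索', 'エンジン']
```

Unlike the bigram approach, every token here is a real term, so the index does not fill up with fragments that cross word boundaries.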