
Solr / Sunspot - determine indexing language at runtime, dynamically choose analyzers

I would like to use Solr + Sunspot to index a bilingual FR-EN site. The issue: a Post model can be written in either French or English. I can determine the language at runtime, but I also need Solr to index the model accordingly.

For example, French posts would need a French stemmer:

<filter class="solr.SnowballPorterFilterFactory" language="French"/>

What are my options? Can I change Solr analyzers at runtime? Can I make a set of analyzers for each language?


This is a great question, and a feature that's being discussed for inclusion in Sunspot.

Sunspot uses dynamic field naming conventions to set up its schema. For example, here are two existing definitions for text fields:

<dynamicField name="*_text" stored="false" type="text" multiValued="true" indexed="true"/>
<dynamicField name="*_texts" stored="true" type="text" multiValued="true" indexed="true"/>

These correspond to the fieldType name="text" defined earlier in the schema.

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

You could add a similar definition for the different languages you'd like to index (as Mauricio also mentions), and then set up some new dynamicField definitions to use them.

1. A fieldType definition for a French text field

<fieldType name="text_fr" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
  </analyzer>
</fieldType>

2. A dynamicField definition for the French text field

<dynamicField name="*_text_fr" stored="false" type="text_fr" multiValued="true" indexed="true"/>
<dynamicField name="*_texts_fr" stored="true" type="text_fr" multiValued="true" indexed="true"/>

3. Using the French text field in Sunspot

The latest Sunspot 1.2 (not quite released, so use 1.2.rc4) supports an :as option, which lets you specify the Solr field name directly.

searchable do
  text :description, :as => 'description_text_fr'
end
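To pick the field at runtime rather than hard-coding French, one option is to index each language into its own field and leave the others empty. Here is a rough, untested-against-Sunspot sketch in plain Ruby. The helper names (solr_field_name, value_for_language), the 'text_en' suffix, and a language attribute on Post are all my own assumptions, not part of Sunspot; the English suffix also assumes you have added a matching text_en fieldType and dynamicField.

```ruby
# Hypothetical mapping from a language code to a dynamicField suffix.
# 'text_fr' matches the *_text_fr pattern above; 'text_en' assumes an
# equivalent English fieldType and dynamicField exist in the schema.
FIELD_SUFFIXES = {
  'fr' => 'text_fr',
  'en' => 'text_en'
}.freeze

# Returns the Solr field name for an attribute in the given language.
# Unknown languages fall back to the default *_text field.
def solr_field_name(attribute, language)
  suffix = FIELD_SUFFIXES.fetch(language.to_s, 'text')
  "#{attribute}_#{suffix}"
end

# Returns the value to index in a language-specific field, or nil when
# the record is written in a different language.
def value_for_language(value, record_language, field_language)
  record_language.to_s == field_language.to_s ? value : nil
end

# Possible use inside the model (assumes Post#language returns 'fr' or 'en'
# and that Sunspot skips nil values when indexing):
#
#   searchable do
#     text :description_fr, :as => solr_field_name(:description, :fr) do
#       value_for_language(description, language, :fr)
#     end
#     text :description_en, :as => solr_field_name(:description, :en) do
#       value_for_language(description, language, :en)
#     end
#   end
```

The idea is that each post only populates the field whose analyzer matches its language, so French text is never run through the English stemmer and vice versa.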

Like I said, this is something I'm thinking of adding to Sunspot 1.3 or 1.4. Personally, I'd like to see something like :lang => :en on a text field definition to choose the appropriate field definition. Do feel free to chime in on the Sunspot mailing list with your thoughts!


Can't say anything about Sunspot, but in pure Solr I'd create separate field types in your Solr schema (one fieldType for French, another for English), then create one field for English content (using the English fieldType) and another field for French content (using the French fieldType).

Since you know which language to use at runtime, you'd just pick one field or the other to run your searches and get results.
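As a minimal sketch of that last step, here is how the field could be picked when building the request parameters in Ruby. The field names content_fr and content_en and the select_query helper are placeholders I made up for illustration; they would need to match whatever fields you define in your schema. It uses the dismax parser's qf parameter to restrict the search to one field.

```ruby
require 'uri'

# Hypothetical field names, one per language; each must be declared in
# schema.xml with the fieldType for that language.
SEARCH_FIELDS = {
  'fr' => 'content_fr',
  'en' => 'content_en'
}.freeze

# Builds the query string for a Solr select request that searches only
# the field matching the given language.
def select_query(terms, language)
  field = SEARCH_FIELDS.fetch(language.to_s) do
    raise ArgumentError, "unsupported language: #{language}"
  end
  URI.encode_www_form('q' => terms, 'defType' => 'dismax', 'qf' => field)
end
```

The resulting string can be appended to your Solr select URL. Raising on an unknown language is just one choice; falling back to a default field would work as well.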
