Solr associations

For the last couple of days we have been considering Solr as our search engine of choice. Most of the features we need work out of the box or can be easily configured. There is, however, one feature that we absolutely need that seems to be well hidden (or missing) in Solr.

I'll try to explain with an example. We have lots of documents that are actually businesses:

<document>
  <name>Apache</name>
  <cat>1</cat>
  ...
</document>
<document>
  <name>McDonalds</name>
  <cat>2</cat>
  ...
</document>

In addition, we have another XML file with all the categories and synonyms:

<cat id="1">
  <name>software</name>
  <synonym>IT</synonym>
</cat>
<cat id="2">
  <name>fast food</name>
  <synonym>restaurant</synonym>
</cat>

We want to associate businesses with categories so that we can search using the name and/or synonyms of the category. But we do not want to merge these files at indexing time, because we need to be able to update the categories (adding or removing synonyms, etc.) without indexing all the businesses again.

Is there anything in Solr that supports this kind of association, or do we need to develop something specific ourselves?

All feedback and suggestions are welcome.

Thanks in advance, Tom


Basically you have a design decision here. The usual thing people do with Solr indexes is to denormalize them, i.e. copy the category definition into each business document. As you do not want to do this, I suggest keeping two types of documents: one for the businesses and another for the categories. You can keep both in the same index, as Solr does not require all documents to have the same fields. The business documents seem straightforward, but you have to make them searchable by both the business name and the category id. I suggest creating a category document for each synonym, so that you can search by synonym and find the id (and category name).

To search using synonyms, you will need two searches:

  • Search for the category id using the name or synonym text.
  • Search for businesses using the category id.
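
The two steps above could be sketched roughly as follows. This only builds the request parameters for Solr's standard select handler; the field names `type`, `synonym` and `cat_id`, and the core URL in the usage note, are assumptions made for illustration and are not part of the original schema.

```python
from urllib.parse import urlencode


def escape_phrase(text):
    """Escape backslashes and double quotes so the text is safe inside a quoted phrase."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def category_params(term):
    """Step 1: parameters for finding category documents by name or synonym.

    Assumes (hypothetically) that each category document has type=category,
    a 'synonym' field and a 'cat_id' field.
    """
    return {
        "q": f'type:category AND synonym:"{escape_phrase(term)}"',
        "fl": "cat_id",  # only the category id is needed for step 2
    }


def business_params(cat_ids):
    """Step 2: parameters for finding businesses in any of the given categories."""
    ids = " OR ".join(str(c) for c in cat_ids)
    return {"q": f"type:business AND cat:({ids})"}


def select_url(base_url, params):
    """Build the URL for the select handler of a given core."""
    return f"{base_url}/select?{urlencode(params)}"
```

For example, `select_url("http://localhost:8983/solr/demo", category_params("IT"))` gives the URL for the first request. The ids read from its response would then be passed to `business_params` for the second request.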


There is actually a filter class called solr.SynonymFilterFactory.

It should allow you to map a category's text equivalents (its name and synonyms) to the category number if you use it in the query analyzer only, something like the following:

    <fieldType name="category" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="category_Synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

That way you index ONLY the category ID on each business. This means you won't have to send all the businesses to Solr again when a category changes. Also, if someone queries "software" or "IT", the query is mapped to the category ID.

Your category_Synonyms.txt should have lines such as the following:

1, software, IT

The only drawback here is that you'll have to come up with a way of editing the text file when you change the names or synonyms. So I guess this will only help if you change the category names infrequently, unless someone else knows of a way to do this easily.
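
One possible way to avoid editing the file by hand is to regenerate it from the categories XML whenever the categories change. The sketch below assumes the `<cat>` elements are wrapped in a single root element (called `<cats>` here, a hypothetical name) and that no term contains a comma, which would need escaping in the synonyms file. After rewriting the file, Solr would typically need a core reload to pick up the change, but the businesses would not need to be reindexed.

```python
import xml.etree.ElementTree as ET


def synonym_lines(categories_xml):
    """Turn the categories XML into lines for the synonyms file.

    Each line has the form "<id>, <name>, <synonym>, ...", so that with
    expand="false" every term is mapped to the first entry, the category id.
    """
    root = ET.fromstring(categories_xml)
    lines = []
    for cat in root.iter("cat"):
        # Category name first, then every non-empty synonym.
        terms = [cat.findtext("name", "").strip()]
        terms += [s.text.strip() for s in cat.findall("synonym") if s.text]
        terms = [t for t in terms if t]
        lines.append(", ".join([cat.get("id")] + terms))
    return lines


def write_synonyms(categories_xml, path="category_Synonyms.txt"):
    """Write the generated lines to the file referenced by the field type."""
    with open(path, "w", encoding="utf-8") as out:
        out.write("\n".join(synonym_lines(categories_xml)) + "\n")
```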

I actually added the above to my own Solr and ran the analysis tool on it. Here is the result:

[Image: Solr analysis tool output]

As you can see, it has turned "software" into 1.

Please note that you MUST set the expand parameter to false.

I hope this helps.

Dave


You cannot search on information that is not indexed, unless you implement some kind of query translation/expansion that translates some query terms into their indexed equivalents before submitting the query.

So, if the user types "restaurant", then your query is translated to include a filter by cat=2.

As far as I know Solr doesn't include this feature, so you have to implement it on your own or adapt a suitable module (like http://lucene-qe.sourceforge.net/).
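
A minimal sketch of such a translation step, done on the client before the query is sent, might look like this. The lookup table is hard-coded from the question's example data and the function name is made up for illustration; in practice the table would be loaded from the categories file. `q` and `fq` are Solr's standard query and filter-query parameters.

```python
# Hypothetical lookup table built from the question's example categories.
SYNONYM_TO_CAT = {"software": 1, "it": 1, "fast food": 2, "restaurant": 2}


def translate(user_query, table=SYNONYM_TO_CAT, max_words=2):
    """Split a user query into free text (q) and a category filter (fq)."""
    original = user_query.split()
    words = [w.lower() for w in original]
    cat_ids, remaining = [], []
    i = 0
    while i < len(words):
        # Try the longest phrase first so "fast food" wins over "fast".
        for n in range(min(max_words, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in table:
                if table[phrase] not in cat_ids:
                    cat_ids.append(table[phrase])
                i += n
                break
        else:
            # No category term starts here: keep the word as free text.
            remaining.append(original[i])
            i += 1
    params = {"q": " ".join(remaining) or "*:*"}
    if cat_ids:
        params["fq"] = "cat:(" + " OR ".join(map(str, cat_ids)) + ")"
    return params
```

With this table, `translate("restaurant")` yields a match-all query filtered by `cat:(2)`, while a query with no category terms is passed through unchanged.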


In addition to the excellent ideas offered earlier, you can also look at multivalued fields. Your category field can then contain any number of values (updated when needed), and a search queries all of them.
