"2d Search" in Solr or how to get the best item of the multivalued field 'items'?

The title is a bit awkward, but I couldn't find a better one. My problem is as follows:

I have several users stored as documents, and for each document I store several key-value pairs, or items (each with an id). Now, if I apply highlighting with hl.snippets=5, I can get the first 5 items. But every user could have several hundred items, so

Another problem is that

  • the highlighted text won't contain the id and so retrieving additional information of the highlighted item text is ugly.

Example where items are emails:

user1 has item1 { text:"developers developers developers", id:1, title:"ms" }
          item2 { text:"c# development",                   id:2, title:"nice!" }
          ...
          item77 ...

user2 has item1 { text:"nice restaurant", id:3, title:"bla"}
          item2 { text:"best cafe",       id:4, title:"blup"}
          ...
          item223 ...

Now if I use highlighting for the text field and query against "restaurant" I get user2 and the text nice <b>restaurant</b>. But how can I determine the id of the highlighted text to display e.g. the title of this item? And what happens if more relevant items are listed at the end of the item-list? Highlighting won't display those ...
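For reference, the highlighting request described above could be built roughly like this. It is only a sketch: the host, the core name "users" and the field name "text" are assumptions, while hl, hl.fl and hl.snippets are the standard Solr highlighting parameters.

```python
from urllib.parse import urlencode

# Assumption: a core named "users" at this hypothetical URL, with the
# item texts stored in a multivalued field called "text".
SOLR_SELECT = "http://localhost:8983/solr/users/select"


def highlight_url(term, snippets=5):
    """Build a select URL that asks Solr for highlighted snippets."""
    params = {
        "q": f"text:{term}",
        "hl": "true",             # turn highlighting on
        "hl.fl": "text",          # field to take snippets from
        "hl.snippets": snippets,  # upper bound on snippets per field
        "wt": "json",
    }
    return f"{SOLR_SELECT}?{urlencode(params)}"


url = highlight_url("restaurant")
print(url)
```

The snippets that come back are plain strings, which is exactly why the id of the matching item is lost.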

So how can I find the best items of a document with multiple such items?

I added my two findings as answers, but as I will point out each of them has its own drawbacks.

Could anyone point me to a better solution?


One of my rules of thumb for designing Solr schemas is: the document is what you will search for.

If you want to search for 'items', then these 'items' are your documents. How you store other stuff, like 'users', is secondary. So 'users' could be in another index like you mentioned, they could be "denormalized" (e.g. their information duplicated in each document), in a relational database, etc. depending on RDBMS availability, how many 'users' there are, how many fields these 'users' have, etc.

EDIT: now you explain that the 'items' are emails, and a possible search is 'restaurant X' and you want to find the best 'items' (emails). Therefore, the document is the email. The schema could be as simple as this: (id, title, text, user).

You could enable highlighting to get snippets of the 'text' or 'title' fields matching the 'restaurant X' query.

If you want to give the end-user information about the users that wrote about 'restaurant X', you could facet on the 'user' field. Then the end-user would see that John wrote 10 emails about 'restaurant X' and Robert wrote 6. The end-user thinks "This John dude must know a lot about this restaurant", so he drills down into a search for 'restaurant X' with a filter query user:John.
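A rough sketch of those two requests as Solr query parameters, assuming the (id, title, text, user) schema above. The helper names are made up for illustration; facet, facet.field, facet.mincount and fq are regular Solr parameters.

```python
from urllib.parse import urlencode


def facet_params(query):
    """Search the emails and count the matches per user."""
    return [
        ("q", query),
        ("facet", "true"),
        ("facet.field", "user"),  # one count per distinct user
        ("facet.mincount", "1"),  # hide users without any match
    ]


def drill_down_params(query, user):
    """Same search, restricted to one user with a filter query."""
    return facet_params(query) + [("fq", f"user:{user}")]


query = 'text:"restaurant X"'
print(urlencode(facet_params(query)))
print(urlencode(drill_down_params(query, "John")))
```

The filter query narrows the result set without changing the scores of the emails that remain.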


You could use two indices: users->items as described in the question, and an index with 'pure items' referencing back to the user.

Then you will need 2 queries (that's the reason I called the question '2d Search in Solr'):

  1. query the user index => list of e.g. 10 users
  2. query the items index for each user from step 1 => best items

Assume the following example:

userA emails are "restaurant X is bad but restaurant X is cheap", "different topic", "different topicB" and

userB emails are "restaurant X is not nice", "revisited restaurant X and it was ok now", "again in restaurant X and I think it is the best".

Now I query the user index for "restaurant X" and the first user will be userB, which is what I want. If I queried only the item index, I would get item1 of the less relevant userA.
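The difference between the two rankings can be reproduced with a toy in-memory version of the example. Note that the scoring here is just a phrase count, which is an assumption made for illustration and not how Solr scores documents; all function names are hypothetical.

```python
from collections import defaultdict

# The six emails from the example, stored as (user, text) pairs.
EMAILS = [
    ("userA", "restaurant X is bad but restaurant X is cheap"),
    ("userA", "different topic"),
    ("userA", "different topicB"),
    ("userB", "restaurant X is not nice"),
    ("userB", "revisited restaurant X and it was ok now"),
    ("userB", "again in restaurant X and I think it is the best"),
]


def score(text, phrase):
    # Toy relevance: number of phrase occurrences (NOT Solr's scoring).
    return text.lower().count(phrase.lower())


def rank_users(phrase):
    """Step 1: rank users by the summed score of all their emails."""
    totals = defaultdict(int)
    for user, text in EMAILS:
        totals[user] += score(text, phrase)
    return sorted(totals, key=totals.get, reverse=True)


def best_items(user, phrase, limit=3):
    """Step 2: rank the matching emails of a single user."""
    hits = [(score(t, phrase), t) for u, t in EMAILS if u == user]
    hits = [h for h in hits if h[0] > 0]
    hits.sort(key=lambda h: h[0], reverse=True)
    return [t for _, t in hits[:limit]]


def best_item_overall(phrase):
    """What a query against the item index alone would put first."""
    return max(EMAILS, key=lambda e: score(e[1], phrase))


print(rank_users("restaurant X"))
print(best_item_overall("restaurant X"))
```

With this scoring, userB ranks first at the user level (three matches in total against two), while the single best email still belongs to userA.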

Drawbacks:

  • bad performance, because you will need one query against the user index and e.g. 10 more to get the most relevant items for each user.
  • maintaining two indices.

Update: to avoid many queries, I will try the following: use the user index to get some highlighted snippets, then offer a 'get relevant items' button for every user, which triggers a query against the item index.


You can use the collapse patch and store each item as a separate document linking back to the user.

The problem with that approach is that you won't get the most relevant user, i.e. the most relevant item is not necessarily from the most relevant user (who may instead have several slightly less relevant items).

See the "Assume the following example:" part in my second answer.
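On Solr versions that ship result grouping (the feature that came out of the field collapsing work), the same idea can be written with plain request parameters instead of a patch. A sketch, assuming one document per item with a 'user' field; the helper name is made up.

```python
from urllib.parse import urlencode


def grouped_params(query, per_user=3):
    """Collapse the item hits into one group per user."""
    return {
        "q": query,
        "group": "true",
        "group.field": "user",    # one group per distinct user value
        "group.limit": per_user,  # items returned inside each group
        "wt": "json",
    }


print(urlencode(grouped_params('text:"restaurant X"')))
```

By default the groups are ordered by their best-scoring item, so the drawback above still applies: the user with the single best item comes first, not the user with the most relevant items overall.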
