Best way to filter fields stored in a remote database in solr/lucene?

2023-02-25 02:26 Q&A author:

I have an index of about 100k documents that represent a movie entity.

Users can put films on various lists (like favorites etc.)

These lists are stored in a mysql database and are not indexed in solr.

I could store the user ids in multivalued fields that represent a list, but that is quite bad because the fields would get very, very long and the indexing would be problematic too.

So currently I do the following (pseudocode):

$favorites = SELECT document_id FROM favorites WHERE user_id = $user_id
$documents = 'http://solr.com:8393/select/?q=XYZ&fq=document_id:(' . join(' OR ', $favorites) . ')';

This works well and is fast, but the number of items in filter queries is limited to 1024 (I tried that). The filter queries also add up: if one filter query has 500 values, a second filter query on another field can have at most 524.

It's okay for now because I limited the entries per list to 1024, which is quite a lot, but I think this approach is very clumsy and produces a lot of overhead.

Isn't there a better solution? Like writing a Solr module that connects directly to the database, or something similar? I'd like to do it in PHP.

If there is no other way, can I somehow raise the 1024 limit? It works very fast now, and I think that with good hardware a higher limit wouldn't be a problem.

Edit: as asked in the comments, here are my original schema and a working example query.

<field name="film_id" type="int" indexed="true" stored="true" required="true"/> 
<field name="imdb_id" type="int" indexed="true" stored="true" /> 
<field name="parent_id" type="int" indexed="true" stored="true"/> 
<field name="malus" type="int" indexed="true" stored="true"/> 
<field name="type" type="int" indexed="true" stored="true"/> 
<field name="year" type="int" indexed="true" stored="true" termVectors="true"/> 
<field name="locale_title" type="string" indexed="false" stored="true"/> 
<field name="aka_title" type="filmtitle" indexed="true" stored="true" multiValued="true" omitNorms="true" termVectors="true" /> 
<field name="sort_title" type="string" indexed="true" stored="true"/> 
<field name="director" type="person" indexed="true" stored="true" multiValued="true" omitNorms="true"/> 
<field name="director_phonetic" type="person_phonetic" multiValued="true" omitNorms="true"/> 
<field name="actor" type="person" indexed="true" stored="true" multiValued="true" omitNorms="true"/> 
<field name="actor_phonetic" type="person_phonetic" multiValued="true" omitNorms="true"/> 
<field name="country" type="string" indexed="true" stored="true" multiValued="true"/> 
<field name="description" type="text" indexed="true" stored="true" /> 
<field name="genre" type="genre" indexed="true" stored="true" multiValued="true" termVectors="true"/> 
<field name="url" type="string" indexed="true" stored="true" multiValued="false"/> 
<field name="image_url" type="string" indexed="false" stored="true" multiValued="false"/>
<field name="rating" type="int" indexed="true" stored="true" required="false" default="50"/>
<field name="affiliate" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="product_type" type="string" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="product_*" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="blockbuster" type="boolean" indexed="true" stored="true" /> 
<copyField source="film_id" dest="id"/>
<field name="director_id" type="string" indexed="true" stored="true" multiValued="true" termVectors="true"/>
<field name="actor_id" type="string" indexed="true" stored="true" multiValued="true" termVectors="true"/>

These are my additions to the default schema.xml.

A sample search result can be viewed here.

A sample query would be:

http://my-server.com:8983/solr/select/?
q=description:nazis
&fq=product_bluray:amazon
&fq=film_id:(1185616 1054606 88763 361748 78748)

Here the user would search for movies that:

are available on amazon as a bluray
have the term "nazis" in the description
AND are on his favorite list

The list contains the movies (documents) with the ids 1185616 1054606 88763 361748 78748; the list itself is stored in the mysql database.
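
For illustration, here is a minimal Python sketch of how a request like the sample query could be assembled once the favorite ids have been read from the database. The function name build_select_url and its parameters are made up for this example; only the URL, the field names and the ids come from the sample above. It assumes the ids are integers and that the default query operator is OR, as the space-separated sample query implies.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, favorite_ids, extra_filters=()):
    """Build a Solr select URL restricted to the given film ids.

    Hypothetical helper: the name and signature are assumptions for
    this sketch and not part of any Solr client library.
    """
    if not favorite_ids:
        # "film_id:()" is not a valid query, so refuse an empty list.
        raise ValueError("favorite_ids must not be empty")
    # Space-separated ids in parentheses, as in the sample query.
    id_filter = "film_id:(" + " ".join(str(int(i)) for i in favorite_ids) + ")"
    params = [("q", query)]
    params += [("fq", f) for f in extra_filters]
    params.append(("fq", id_filter))
    # urlencode percent-encodes ":" and "(" and turns spaces into "+".
    return base_url + "?" + urlencode(params)


url = build_select_url(
    "http://my-server.com:8983/solr/select/",
    "description:nazis",
    [1185616, 1054606, 88763, 361748, 78748],
    extra_filters=["product_bluray:amazon"],
)
```

Each fq parameter is sent separately, so Solr intersects the favorites filter with the other filters; the number of ids in the list is still subject to the 1024 limit described above.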

PS: I don't know whether I formulated the question well; I hope it's understandable. If not, please feel free to edit!

Step one is to make sure you really want to use Solr. Looking at your schema, there's an awful lot in there that a normal RDBMS with basic text indexing could handle. Take half an hour and look at PostgreSQL, unless you've already determined that a good old-fashioned RDBMS with some extra bells and whistles just won't do for you.

There's a lot of interest in this problem in the Solr community, but there isn't a real solution.

The obvious approach is to reindex a "favorited" document every time someone favorites it, with their username in a multivalued field. This is brain-dead, of course, but that doesn't mean it won't work, depending on how often your users mess with their favorites lists. If your documents are on the small side (I assume they are only a few K) and you can get enough hardware to keep the whole index in memory (likely, since you've only got 100K documents), this might be the approach to consider. You can test it by building an index of a size that actually fits into the memory available and implementing the strategy. See if it's fast enough.

You may also be able to 'batch' these operations if people don't add a gazillion favorites in one go, like this:

Day 1: I add ten items to my favorites. You stick their IDs in a database and use that list of IDs to filter my queries.
Night 1: You update all the documents that have been favorited by anyone during the day, adding my username to the "favoritedBy" multiValued field. Remove my favorited list from the DB, since it's now represented in the Solr index itself.
Day 2: I add three more items to my favorites. You filter on both favoritedBy:myusername and id:(newID1 OR newID2 OR newID3).

This may work for you if people add a reasonable number of favorites per day and you don't have a lot of traffic at night.
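
The day/night scheme above could be sketched roughly as follows, in Python. This is only an outline under stated assumptions: the function names build_favorites_filter and nightly_batch are invented for the example, usernames are assumed to be plain tokens that need no query escaping, and the actual reindexing call to Solr is left out because it depends on your client and Solr version.

```python
from collections import defaultdict


def build_favorites_filter(username, pending_ids):
    """Daytime filter: favorites already in the index plus today's new ones.

    Hypothetical helper. "favoritedBy" is the multiValued field from the
    steps above; pending_ids are the ids still held in the database.
    """
    clauses = ["favoritedBy:" + username]
    if pending_ids:
        ids = " OR ".join(str(int(i)) for i in pending_ids)
        clauses.append("id:(" + ids + ")")
    return " OR ".join(clauses)


def nightly_batch(pending):
    """Invert {username: [doc ids]} into {doc id: sorted usernames}.

    The result tells the nightly job which usernames to add to the
    favoritedBy field of each document it has to reindex.
    """
    by_doc = defaultdict(set)
    for username, doc_ids in pending.items():
        for doc_id in doc_ids:
            by_doc[doc_id].add(username)
    return {doc_id: sorted(users) for doc_id, users in by_doc.items()}


day_two_filter = build_favorites_filter("myusername", [101, 102, 103])
to_reindex = nightly_batch({"alice": [1, 2], "bob": [2]})
```

Grouping by document means each favorited document is reindexed once per night, no matter how many users favorited it during the day.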
