Mixed Lucene / MySQL Query or Concept
Our project requires near-real-time search and constant updating. The data is currently stored in a MySQL database, and the Lucene index is updated as the database is modified.
We currently have the search capability where we want it. However, we are attempting to add the ability to "tag" documents in the index/database. Since the data sets can be millions of records, we don't want to update the Lucene index for tagging (or, if there is a way to mass-update Lucene, that might work too). Instead, we have a table of document IDs in MySQL that we would like to use to determine the tag sets.
The best option I've found so far is to retrieve both lists of IDs as integer arrays, sort them (so I only need to loop through once), then loop through looking for matches between the two (though this isn't ideal, since we may lose the original sort order).
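A minimal sketch of that merge, assuming both ID lists fit in memory as int arrays:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IdIntersection {
    /**
     * Intersects two ID lists by sorting both and walking them once.
     * Note the result comes back in ID order, which is why the
     * original relevance ordering is lost.
     */
    public static List<Integer> intersect(int[] luceneIds, int[] taggedIds) {
        Arrays.sort(luceneIds);
        Arrays.sort(taggedIds);
        List<Integer> matches = new ArrayList<Integer>();
        int i = 0, j = 0;
        while (i < luceneIds.length && j < taggedIds.length) {
            if (luceneIds[i] == taggedIds[j]) {
                matches.add(luceneIds[i]);
                i++; j++;
            } else if (luceneIds[i] < taggedIds[j]) {
                i++;
            } else {
                j++;
            }
        }
        return matches;
    }
}
```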
Attempting to use the list of Lucene IDs in an "IN" query in MySQL fails because the number of documents can be in the millions, and MySQL chokes on it.
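One commonly used alternative (not from the original post; table and column names are hypothetical) is to bulk-insert the Lucene IDs into a temporary table and join against it, rather than building a giant IN clause:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class TempTableFilter {
    // Sketch: replace a huge "IN (...)" with a temp-table join.
    // For millions of IDs, writing them to a file and using
    // LOAD DATA INFILE would likely be faster than batched inserts.
    public static void filterByTag(Connection conn, int[] luceneIds,
                                   String tag) throws Exception {
        Statement st = conn.createStatement();
        st.execute("CREATE TEMPORARY TABLE search_hits (doc_id INT PRIMARY KEY)");

        PreparedStatement insert =
            conn.prepareStatement("INSERT INTO search_hits (doc_id) VALUES (?)");
        for (int id : luceneIds) {
            insert.setInt(1, id);
            insert.addBatch();
        }
        insert.executeBatch();

        PreparedStatement query = conn.prepareStatement(
            "SELECT t.doc_id FROM doc_tags t " +
            "JOIN search_hits h ON h.doc_id = t.doc_id WHERE t.tag = ?");
        query.setString(1, tag);
        ResultSet rs = query.executeQuery();
        while (rs.next()) {
            System.out.println(rs.getInt("doc_id"));
        }
    }
}
```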
Any insight into how we could optimize this, or a better way to do it?
Another suggestion was a second index and a MultiSearcher, but I'm not entirely sure how to go about that, since we would still need to update the index with potentially a million rows when updating or deleting a tag set.
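As an aside (not from the original exchange): the classic Lucene answer to "small, frequently changing fields next to a big, stable index" is ParallelReader rather than MultiSearcher. It treats a second index as extra fields of the first, so only the small tag index needs rebuilding when tag sets change, with the strict caveat that both indexes must contain the same documents in the same order. A sketch against the Lucene 3.x-era API, with hypothetical paths:

```java
import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.ParallelReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class TagIndexSearch {
    // Sketch (Lucene 3.x API): combine the big document index with a small
    // tag-only index. ParallelReader requires identical doc numbering in
    // both indexes, so keeping them in sync is the hard part.
    public static IndexSearcher openSearcher() throws Exception {
        IndexReader docs = IndexReader.open(FSDirectory.open(new File("/idx/docs")));
        IndexReader tags = IndexReader.open(FSDirectory.open(new File("/idx/tags")));

        ParallelReader combined = new ParallelReader();
        combined.add(docs);
        combined.add(tags);

        return new IndexSearcher(combined);
    }
}
```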
For your "mass updates", can't you perform a delta-update to the Lucene index based on a timestamp or similar in your MySql table? I've done this in solr, rather than directly in Lucene, but as Solr is a wrapper around Lucene functionality, this is essentially the same (or so I'm assuming...).
Solr Delta import command.
Relevant question (perhaps).
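Done directly in Lucene, the same delta idea might look roughly like this (a sketch against the 3.x-era API; the table, columns, and field names are hypothetical):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class DeltaIndexer {
    // Sketch: re-index only rows changed since the last run, keyed on a
    // last_modified timestamp column. updateDocument() deletes any existing
    // document matching the id term and adds the new version.
    public static void deltaUpdate(Connection conn, IndexWriter writer,
                                   Timestamp lastRun) throws Exception {
        PreparedStatement ps = conn.prepareStatement(
            "SELECT id, body FROM docs WHERE last_modified > ?");
        ps.setTimestamp(1, lastRun);
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
            String id = rs.getString("id");
            Document doc = new Document();
            doc.add(new Field("id", id,
                              Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("body", rs.getString("body"),
                              Field.Store.NO, Field.Index.ANALYZED));
            writer.updateDocument(new Term("id", id), doc);
        }
        writer.commit();
    }
}
```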
For all that follows, the assumption is that the entire collection does not fit in RAM.
Indexing technology is designed primarily for situations with far more reads than writes. It would be good to first measure the actual read/write frequencies and thus quantify what "constant updating" means.
If the update frequency is too high, you might want to handle this part of the search directly in your database system (if MySQL doesn't do the job, there is also PostgreSQL); response speed will then depend on the database's indexing mechanisms and the memory available to cache them.
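As a rough illustration of keeping that part in the database, MySQL's built-in full-text search looks like this (assumes a FULLTEXT index on the body column, which in MySQL versions of this era means a MyISAM table; names are hypothetical):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class DbSearch {
    // Sketch: full-text search done entirely in MySQL, no Lucene involved.
    public static void search(Connection conn, String terms) throws Exception {
        PreparedStatement ps = conn.prepareStatement(
            "SELECT id FROM docs WHERE MATCH(body) AGAINST(? IN BOOLEAN MODE)");
        ps.setString(1, terms);
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
            System.out.println(rs.getLong("id"));
        }
    }
}
```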
Otherwise, you may want to look into Solr (which is more than just a thin wrapper around Lucene: it provides extra functionality that is built on Lucene but not available from Lucene itself).
In particular:
- Solr performance factors
- Multi Core approach (i.e. core swapping)
- StreamingUpdateSolrServer
Maybe you can use different strategies depending on the batch size of the update and the performance trade-off of commits/optimization. For huge batch updates, it may be easier to copy to a standby core, batch-update it, commit/optimize, and swap the cores. However, it won't be "near real time" (NRT) any longer; the notion of NRT in Lucene is local and depends directly on available RAM and collection size.
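The core swap itself is just a call to Solr's CoreAdmin API; a minimal sketch, with hypothetical host and core names:

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class CoreSwap {
    // Sketch: after batch-updating and optimizing the "standby" core,
    // swap it with the "live" core via the CoreAdmin SWAP action.
    public static void swapCores() throws Exception {
        URL url = new URL("http://localhost:8983/solr/admin/cores"
                + "?action=SWAP&core=live&other=standby");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        InputStream in = conn.getInputStream();   // executes the request
        in.close();
        if (conn.getResponseCode() != 200) {
            throw new RuntimeException("Core swap failed: HTTP "
                    + conn.getResponseCode());
        }
    }
}
```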