Full Text Search with multiple indexes and complex requirements

2023-02-14 14:50 Q&A author:

We are building an application which will require us to index data for each of our users so that we can provide full text search on their data. Here are some notable things about the application:

A) The data for every user is totally unrelated to every other user's data. This gives us a few advantages:

we can keep our indexes small.
merging/compacting a fragmented index will take less time.
if some indexes become inaccessible for whatever reason (corruption?), only those users are affected. Other users are unaffected and the service stays available for them.

B) Each user can have a few different types of data. We want to keep each type in a separate folder, for the same reasons as above.

So, our index hierarchy will look something like:

/user1/type1/<index files>

/user1/type2/<index files>

/user2/type1/<index files>

/user3/type3/<index files>
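The layout above can be computed rather than hard-coded. Here is a minimal sketch, assuming a hypothetical root directory and a hypothetical helper named `index_dir`. Rejecting unusual characters in names is an added precaution, not something the requirements call for:

```python
import re
from pathlib import PurePosixPath

# Hypothetical root; the question only specifies /<user>/<type>/<index files>.
INDEX_ROOT = PurePosixPath("/")

# Only plain names are accepted as path components (an assumption).
_SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+")


def index_dir(user: str, data_type: str,
              root: PurePosixPath = INDEX_ROOT) -> PurePosixPath:
    """Return the directory holding the index for one user/type pair."""
    for name in (user, data_type):
        # Reject names such as "../other" that could escape the user's folder.
        if not _SAFE_NAME.fullmatch(name):
            raise ValueError(f"unsafe path component: {name!r}")
    return root / user / data_type


print(index_dir("user1", "type1"))
```

Because every user/type pair resolves to its own directory, a corrupted index can be rebuilt or taken offline without touching any other directory.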

C) Often, probably with every iteration, we'll add "types" of data that can be indexed.

So we want to have an efficient/programmatic way to add schemas for different "types". We would like to avoid having a fixed schema for indexing. I like Lucene's schema-less way of indexing stuff.

D) Users can fire search queries that search either:

within a specific "type" for that user
across all types for that user: in this case we want to fire a parallel query like Lucene has (ParallelMultiSearcher).
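To make requirement D concrete, here is a rough sketch of the fan-out-and-merge pattern in Python. It is not Lucene's API: the `Searcher` callables and `make_toy_searcher` are hypothetical stand-ins for per-type indexes, and scoring is a simple term count chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Tuple

Hit = Tuple[float, str]                 # (score, document id)
Searcher = Callable[[str], List[Hit]]   # stand-in for one per-type index


def make_toy_searcher(docs: Dict[str, str]) -> Searcher:
    """Build a toy in-memory searcher that scores by term frequency."""
    def search(query: str) -> List[Hit]:
        hits = []
        for doc_id, text in docs.items():
            count = text.lower().split().count(query.lower())
            if count:
                hits.append((float(count), doc_id))
        return hits
    return search


def search_user(searchers: Dict[str, Searcher], query: str,
                data_type: Optional[str] = None, limit: int = 10) -> List[Hit]:
    """Search one type, or all of a user's types in parallel."""
    if data_type is not None:
        selected = [searchers[data_type]]
    else:
        selected = list(searchers.values())
    # Query every selected index concurrently, then merge by score.
    with ThreadPoolExecutor(max_workers=max(1, len(selected))) as pool:
        per_index = list(pool.map(lambda s: s(query), selected))
    merged = [hit for hits in per_index for hit in hits]
    merged.sort(key=lambda hit: hit[0], reverse=True)
    return merged[:limit]
```

Scores produced by independently built indexes may not be directly comparable, so a real merge step might need some form of normalization.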

E) We require real time update for the index. This is a must.

F) We are planning to shard our index across multiple machines. For this also, we want:

if a shard becomes inaccessible, only those users whose data resides in that shard are affected. Other users get uninterrupted service.
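One way to get this isolation is to pin each user to exactly one shard. The sketch below assumes hash-based placement; the function names and shard names are hypothetical, and hashing is only one possible placement strategy.

```python
import hashlib
from typing import List, Set


def shard_for_user(user_id: str, shards: List[str]) -> str:
    """Map a user to exactly one shard so all of that user's indexes live together."""
    # A stable hash (unlike Python's built-in hash()) gives the same answer
    # on every machine and across restarts.
    digest = hashlib.sha1(user_id.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


def is_available(user_id: str, shards: List[str], down: Set[str]) -> bool:
    """A user is affected only when their own shard is down."""
    return shard_for_user(user_id, shards) not in down
```

Modulo placement moves many users when the number of shards changes. An explicit user-to-shard lookup table or consistent hashing avoids that, at the cost of extra bookkeeping.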

We were considering Lucene, Sphinx and Solr to do this. This is what we found:

Sphinx: No efficient way to do A, B, C, F. Or is there?
Lucene: Everything looks possible, as it is very low level. But we have to write wrappers to do F and build a communication layer between the web server and the search server.
Solr: Not sure if we can do A, B, C easily. Can we?

So, my question is: what is the best software for the above requirements? I am inclined towards Solr first, and then Lucene, provided we can meet all the requirements.

I can't see Solr being able to handle A or B, as Solr's model is to have everything in one index (per core). Solr can handle C if you use the dynamic field types. Although Solr can do real time indexing, it is not as fast as Lucene (even with Embedded Solr, in my experience). This all points to Lucene being your only choice.

I think Solr might work really well for you here.

The key feature of Solr that will work well in your situation is the notion of cores. See http://wiki.apache.org/solr/CoreAdmin

One way you can implement this is to make each user/type combination a separate Solr core. This satisfies (A) and (B). The client can either direct the search at a single core, or it can direct the search at multiple cores at once (and optionally across different Solr servers), which is what you want when you search across all types for a single user. This satisfies (D) and (F). Or you can use one core for each user, with a "type" field that you can filter on.
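As a rough illustration of the core-per-user/type idea, the sketch below only builds request URLs. The `user_type` core naming convention, the host names, and the function names are all assumptions; the `shards` parameter is Solr's distributed search parameter, which takes a comma-separated list of `host:port/path` entries.

```python
from typing import Dict
from urllib.parse import parse_qs, urlencode, urlsplit


def core_name(user: str, data_type: str) -> str:
    # Hypothetical naming convention: one core per user/type pair.
    return f"{user}_{data_type}"


def select_url(host: str, user: str, data_type: str, query: str) -> str:
    """URL for searching a single user/type core."""
    core = core_name(user, data_type)
    return f"http://{host}/solr/{core}/select?" + urlencode({"q": query})


def select_all_types_url(locations: Dict[str, str], user: str, query: str) -> str:
    """URL for searching every type of one user in a single request.

    `locations` maps each data type to the host:port serving that core.
    """
    types = sorted(locations)
    shards = ",".join(
        f"{locations[t]}/solr/{core_name(user, t)}" for t in types
    )
    # Any one of the cores can coordinate the distributed request.
    first = types[0]
    base = f"http://{locations[first]}/solr/{core_name(user, first)}/select?"
    return base + urlencode({"q": query, "shards": shards})


url = select_all_types_url(
    {"type1": "localhost:8983", "type2": "otherhost:8983"}, "user1", "hello"
)
print(parse_qs(urlsplit(url).query)["shards"][0])
```

Distributed search generally expects the participating cores to share a compatible schema and unique document keys, so this approach fits best when the per-type schemas are kept aligned.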

As for (C), Solr has the notion of dynamic fields. See http://wiki.apache.org/solr/SchemaXml#Dynamic_fields
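With dynamic fields, a new "type" can be indexed without a schema change as long as its field names match a declared pattern. The sketch below assumes the schema declares suffix patterns such as `*_t`, `*_i`, `*_f` and `*_b` (check your own schema.xml); the suffix mapping and the `to_dynamic_fields` helper are hypothetical.

```python
from typing import Any, Dict, FrozenSet

# Assumed suffix conventions. They only work if the schema declares matching
# <dynamicField> patterns; "_t" is assumed to map to an analyzed text type.
_SUFFIX_BY_TYPE = {bool: "_b", int: "_i", float: "_f", str: "_t"}


def to_dynamic_fields(record: Dict[str, Any],
                      fixed: FrozenSet[str] = frozenset({"id"})) -> Dict[str, Any]:
    """Rename the fields of a record so they match dynamic field patterns."""
    doc = {}
    for name, value in record.items():
        if name in fixed:
            # Fields declared explicitly in the schema keep their names.
            doc[name] = value
            continue
        suffix = _SUFFIX_BY_TYPE.get(type(value))
        if suffix is None:
            raise TypeError(f"no dynamic field suffix for {name!r}")
        doc[name + suffix] = value
    return doc


print(to_dynamic_fields({"id": "42", "title": "hello", "pages": 3}))
```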

As far as (E) goes, Solr doesn't have "true" real-time indexing yet. But if a lag of a few seconds is acceptable, then Solr can handle that.
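If a bounded lag is acceptable, one option is to ask Solr to commit within a maximum delay instead of committing after every document. The sketch below only builds the request and does not send it. It assumes a Solr version whose `/update` handler accepts JSON documents and the `commitWithin` parameter; the host, core name, and function name are hypothetical.

```python
import json
from typing import Any, Dict, List, Tuple
from urllib.parse import urlencode


def update_request(host: str, core: str, docs: List[Dict[str, Any]],
                   max_lag_ms: int = 2000) -> Tuple[str, bytes, Dict[str, str]]:
    """Build an update request whose documents become searchable within max_lag_ms."""
    # commitWithin lets the server batch commits while bounding the delay.
    url = f"http://{host}/solr/{core}/update?" + urlencode({"commitWithin": max_lag_ms})
    body = json.dumps(docs).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return url, body, headers


url, body, headers = update_request(
    "localhost:8983", "user1_type1", [{"id": "1", "title_t": "hello"}]
)
print(url)
```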
