
Strategy for arbitrary predicate queries in CouchDB

We have an application that could hugely benefit from using a document-based data store like CouchDB, but we have a query use-case that I'm struggling to implement with map/reduce.

Our documents really only contain two types of data:

  1. Numeric attributes
  2. Boolean attributes

The boolean attributes essentially mark a document as belonging to one or more non-exclusive sets. The numeric attributes will always only need to be summed. One way of structuring the document is like this:

{
  "id": 3123123,
  "attr": {"x": 2, "y": 4, "z": 6},
  "sets": ["A", "B", "C"]
}

With this structure, it's easy to work out aggregate x, y, z values for the sets A, B and C, but it gets more complicated when you want to see the aggregates for intersections like A&C.
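For example, a view along these lines gives per-set totals (a sketch; say, a view by_set in a design document stats, with both names being illustrative; the custom reduce sums the [x, y, z] value arrays element-wise, which the built-in _sum reduce can also handle in recent CouchDB versions):

// map: one row per set the document belongs to
function (doc) {
  if (doc.sets && doc.attr) {
    doc.sets.forEach(function (set) {
      emit(set, [doc.attr.x || 0, doc.attr.y || 0, doc.attr.z || 0]);
    });
  }
}

// reduce: element-wise sum of the [x, y, z] values (valid for rereduce too)
function (keys, values, rereduce) {
  return values.reduce(function (acc, v) {
    return [acc[0] + v[0], acc[1] + v[1], acc[2] + v[2]];
  });
}

Querying with ?key="A" then returns the summed x, y, z for every document in set A.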

In this small case I could emit keys for all permutations of ABC ("A, B, C, AB, AC, BC, ABC"), but I'm worried about how this will scale. Our documents could belong to some combination of 80 sets, and the application is fronted by a user interface that can construct any conceivable combination of them.

I'm inclined to think that this isn't a job for CouchDB, and perhaps MongoDB or something else would be better suited to this problem.

Am I missing anything?


A data structure that can efficiently compute and cache all those values is going to be quite complex. I'm not certain that any database system is able to do this without iterating over subsets. Intersection is a notoriously hard operation, and CouchDB doesn't really have anything available to handle intersection properly.

As you correctly identified, emitting all permutations (subsets, to be precise) is going to be a memory hog because it's still going to multiply your items by a huge factor (2^n key-value pairs for n sets). You can reduce this by collapsing prefixes together (the CouchDB key structure lets you retrieve the values for ["A"] and ["A","B"] when you emit for ["A","B","C"] using the group_level option) but only by a factor of 2 (2^(n-1) key-value pairs for n sets).
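To illustrate, a sketch of that exhaustive emit (assuming the [x, y, z] value convention from the question, paired with the same element-wise-sum reduce; the bit-mask trick assumes each document belongs to a modest number of sets):

// map: emit one sorted key array per non-empty subset of doc.sets,
// so a query with key=["A","C"]&group=true aggregates exactly the
// documents belonging to both A and C. This is the 2^n approach.
function (doc) {
  var sets = (doc.sets || []).slice().sort();
  var value = [doc.attr.x || 0, doc.attr.y || 0, doc.attr.z || 0];
  // assumes sets.length is small enough for a 31-bit mask
  for (var mask = 1; mask < (1 << sets.length); mask++) {
    var key = [];
    for (var i = 0; i < sets.length; i++) {
      if (mask & (1 << i)) key.push(sets[i]);
    }
    emit(key, value);
  }
}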

So, if your items have on average three associated sets, you're going to be fine (4 key-value pairs instead of 3), but four associated sets is heavier (8 instead of 4) and five is starting to get annoying (16 instead of 5). This also makes items with many associated sets vulnerable to performance issues (a 10-set item would create more than 500 key-value pairs).

A middle-ground approach would be to emit keys only up to four sets in length (which merely doubles the required memory), and fall back to application-side processing when a deeper intersection is required (fetch the matching items without reduction and run the aggregation in the application). With some luck, the number of items involved will be small - if it isn't, you can always raise the maximum key length to trade more memory for more performance.
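One possible shape for that fallback, reusing the hypothetical by_set view from above and CouchDB's standard view query parameters (the URL and names are illustrative):

// Fetch candidate documents for one of the requested sets, then
// filter and aggregate the intersection in the application.
async function intersectionTotals(dbUrl, sets) {
  var res = await fetch(
    dbUrl + '/_design/stats/_view/by_set' +
    '?key=' + encodeURIComponent(JSON.stringify(sets[0])) +
    '&reduce=false&include_docs=true'
  );
  var body = await res.json();
  var totals = { x: 0, y: 0, z: 0 };
  body.rows.forEach(function (row) {
    var doc = row.doc;
    // keep only documents that belong to every requested set
    if (sets.every(function (s) { return doc.sets.indexOf(s) >= 0; })) {
      totals.x += doc.attr.x || 0;
      totals.y += doc.attr.y || 0;
      totals.z += doc.attr.z || 0;
    }
  });
  return totals;
}

Picking the rarest of the requested sets as the candidate filter keeps the fetched result set as small as possible.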

An opposite approach would be to have the application update 2^n totals whenever a document is inserted or updated (by fetching all "totals" documents that match a subset of the current item). Those totals would be stored in a different database and queried by key. This approach is better if you can afford to update totals on the fly (or your architecture lets you update them by listening to the main database's _changes feed), as it makes the queries lightning-fast.
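A sketch of that updater, for instance driven by the main database's _changes feed (plain HTTP against a hypothetical totals database; conflict handling and subtracting a document's previous contribution on update are omitted for brevity):

// all non-empty subsets of a document's (sorted) set names
function subsets(sets) {
  var sorted = sets.slice().sort(), out = [];
  for (var mask = 1; mask < (1 << sorted.length); mask++) {
    var s = [];
    for (var i = 0; i < sorted.length; i++) {
      if (mask & (1 << i)) s.push(sorted[i]);
    }
    out.push(s);
  }
  return out;
}

// bump one totals document per subset in the separate totals database
async function applyDoc(totalsUrl, doc) {
  for (var key of subsets(doc.sets)) {
    var id = encodeURIComponent(key.join('|'));
    var res = await fetch(totalsUrl + '/' + id);
    var tot = res.ok
      ? await res.json()
      : { _id: key.join('|'), x: 0, y: 0, z: 0 };
    tot.x += doc.attr.x; tot.y += doc.attr.y; tot.z += doc.attr.z;
    await fetch(totalsUrl + '/' + id, {
      method: 'PUT',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(tot)
    });
  }
}

A query for any intersection is then a single GET on the totals document whose id is the sorted, joined set names.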
