
Strategy for arbitrary predicate queries in CouchDB

We have an application that could hugely benefit from using a document-based data store like CouchDB, but we have a query use-case that I'm struggling to implement with map/reduce.

Our documents really only contain two types of data:

  1. Numeric attributes
  2. Boolean attributes

The boolean attributes essentially mark a document as belonging to one or more non-exclusive sets. The numeric attributes will always only need to be summed. One way of structuring the document is like this:

{
  "id": 3123123,
  "attr": {"x": 2, "y": 4, "z": 6},
  "sets": ["A", "B", "C"]
}

With this structure, it's easy to work out aggregate x, y, z values for the sets A, B and C, but it gets more complicated when you want to see the aggregates for intersections like A&C.
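For example, a view along these lines gives per-set totals (a sketch; say, a view by_set in a design document stats, with both names being illustrative; the custom reduce sums the [x, y, z] value arrays element-wise, which the built-in _sum reduce can also handle in recent CouchDB versions):

// map: one row per set the document belongs to
function (doc) {
  if (doc.sets && doc.attr) {
    doc.sets.forEach(function (set) {
      emit(set, [doc.attr.x || 0, doc.attr.y || 0, doc.attr.z || 0]);
    });
  }
}

// reduce: element-wise sum of the [x, y, z] values (valid for rereduce too)
function (keys, values, rereduce) {
  return values.reduce(function (acc, v) {
    return [acc[0] + v[0], acc[1] + v[1], acc[2] + v[2]];
  });
}

Querying with ?key="A" then returns the summed x, y, z for every document in set A.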

In this small case I could emit keys for all permutations of ABC ("A, B, C, AB, AC, BC, ABC"), but I'm worried about how this will scale. Our documents could belong to some combination of 80 sets, and the application is fronted by a user interface that can construct any conceivable combination of them.

I'm inclined to think that this isn't a job for CouchDB, and perhaps MongoDB or something else would be better suited to this problem.

Am I missing anything?


A data structure that can efficiently compute and cache all those values is going to be quite complex. I'm not certain that any database system is able to do this without iterating over subsets. Intersection is a notoriously hard operation, and CouchDB doesn't really have anything available to handle intersection properly.

As you correctly identified, emitting all permutations (subsets, to be precise) is going to be a memory hog because it's still going to multiply your items by a huge factor (2^n key-value pairs for n sets). You can reduce this by collapsing prefixes together (the CouchDB key structure lets you retrieve the values for ["A"] and ["A","B"] when you emit for ["A","B","C"] using the group_level option) but only by a factor of 2 (2^(n-1) key-value pairs for n sets).
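To illustrate, a sketch of that exhaustive emit (assuming the [x, y, z] value convention from the question, paired with the same element-wise-sum reduce; the bit-mask trick assumes each document belongs to a modest number of sets):

// map: emit one sorted key array per non-empty subset of doc.sets,
// so a query with key=["A","C"]&group=true aggregates exactly the
// documents belonging to both A and C. This is the 2^n approach.
function (doc) {
  var sets = (doc.sets || []).slice().sort();
  var value = [doc.attr.x || 0, doc.attr.y || 0, doc.attr.z || 0];
  // assumes sets.length is small enough for a 31-bit mask
  for (var mask = 1; mask < (1 << sets.length); mask++) {
    var key = [];
    for (var i = 0; i < sets.length; i++) {
      if (mask & (1 << i)) key.push(sets[i]);
    }
    emit(key, value);
  }
}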

So, if your items have on average three associated sets, you're going to be fine (4 key-value pairs instead of 3), but four associated sets is heavier (8 instead of 4) and five is starting to get annoying (16 instead of 5). This also makes items with many associated sets vulnerable to performance issues (a 10-set item would create more than 500 key-value pairs).

A middle-ground approach would be to emit keys only up to four sets in length (which merely doubles the required memory), and fall back to application-side processing when a deeper intersection is required (fetch the matching items without reduction and run the aggregation in the application). With some luck, the number of items involved will be small - if it isn't, you can always raise the maximum key length to trade more memory for more performance.
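One possible shape for that fallback, reusing the hypothetical by_set view from above and CouchDB's standard view query parameters (the URL and names are illustrative):

// Fetch candidate documents for one of the requested sets, then
// filter and aggregate the intersection in the application.
async function intersectionTotals(dbUrl, sets) {
  var res = await fetch(
    dbUrl + '/_design/stats/_view/by_set' +
    '?key=' + encodeURIComponent(JSON.stringify(sets[0])) +
    '&reduce=false&include_docs=true'
  );
  var body = await res.json();
  var totals = { x: 0, y: 0, z: 0 };
  body.rows.forEach(function (row) {
    var doc = row.doc;
    // keep only documents that belong to every requested set
    if (sets.every(function (s) { return doc.sets.indexOf(s) >= 0; })) {
      totals.x += doc.attr.x || 0;
      totals.y += doc.attr.y || 0;
      totals.z += doc.attr.z || 0;
    }
  });
  return totals;
}

Picking the rarest of the requested sets as the candidate filter keeps the fetched result set as small as possible.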

An opposite approach would be to have the application update 2^n totals whenever a document is inserted or updated (by fetching all "totals" documents that match a subset of the current item). Those totals would be stored in a different database and queried by key. This approach is better if you can afford to update totals on the fly (or your architecture lets you update them by listening to the main database's _changes feed), as it makes the queries lightning-fast.
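A sketch of that updater, for instance driven by the main database's _changes feed (plain HTTP against a hypothetical totals database; conflict handling and subtracting a document's previous contribution on update are omitted for brevity):

// all non-empty subsets of a document's (sorted) set names
function subsets(sets) {
  var sorted = sets.slice().sort(), out = [];
  for (var mask = 1; mask < (1 << sorted.length); mask++) {
    var s = [];
    for (var i = 0; i < sorted.length; i++) {
      if (mask & (1 << i)) s.push(sorted[i]);
    }
    out.push(s);
  }
  return out;
}

// bump one totals document per subset in the separate totals database
async function applyDoc(totalsUrl, doc) {
  for (var key of subsets(doc.sets)) {
    var id = encodeURIComponent(key.join('|'));
    var res = await fetch(totalsUrl + '/' + id);
    var tot = res.ok
      ? await res.json()
      : { _id: key.join('|'), x: 0, y: 0, z: 0 };
    tot.x += doc.attr.x; tot.y += doc.attr.y; tot.z += doc.attr.z;
    await fetch(totalsUrl + '/' + id, {
      method: 'PUT',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(tot)
    });
  }
}

A query for any intersection is then a single GET on the totals document whose id is the sorted, joined set names.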
