Can a "pre-computed" map-reduce index (à la RavenDB/CouchDB) be used for this kind of algorithm?

2023-01-27 12:39 问答作者：

I'm trying开发者_运维技巧 to see if a specific algorithm can be translated to the kind of map-reduce index RavenDB/CouchDB uses, ie, "pre-computed" map-reduce (which means the indexes are refreshed on insertion and updates, not when performing the actual query).

Let's say we have a typical online store with 50,000 products, grouped in categories. Every product has a collection of "Attribute Values", ie, something like "[Red, Round, Metal]".

Since we have so much products on our website, and there's probably a lot of items in each of the categories, we want to give the user another way to "filter" the products he's currently seeing.

For example, if a category is "Less than $20", there's a whole bunch of products in this category. But our user only need to see products which are less than $20 and Red. Unfortunately, there's no sub-category "Red" in the "Less than $20" category.

Our algorithm would take the current list of products, and generate a list of "interesting" Attributes and Attribute Values, ie, given a list of products, it would output something like:

Color
   Red (40)
   Blue (32)
   Yellow (17)
Material
   Metal (37)
   Plastic (36)
   Wood (23)
Shape
   Square (56)
   Round (17)
   Cylinder (12)

Could this sort of algorithm be somehow pre-computed à la RavenDB/CouchDB map-reduce index? If not, why exactly (so I can identify that kind of algorithm in the future) and if yes, how?

A C# 4.0 Visual Studio Test Solution is available that demonstrates the potential data structures and sample data, as well as a try at a map-reduce implementation (which doesn't seem to be pre-computable).

General case: It's always possible to use a CouchDB-style map-reduce view, but it's not necessarily practical.

In the end, it's mostly a counting-based argument: if you need to ask the question for any subset of your 500,000 products, then your database must be able to provide a distinct answer to each of 2^500,000 different possible questions, which uses a prohibitive amount of memory if you have to emit a B-tree leaf for every one of them (and you need to emit data unless the answer to most of these queries is zero, false, an empty set or a similar null value).

CouchDB provides a first small optimization through the existence of range queries (meaning that in an ideal case, it can use as little as N B-tree leaves to answer N² questions). However, in your example, this would only reduce the number of leaves down to 2^250,000 (and that's a theoretical lower bound).

CouchDB provides a second small optimization through key prefix queries, meaning that you can compress [A], [A,B] and [A,B,C] queries into a single [A,B,C] key. So, instead of your 2^250,000 possibilities, you're down to a "mere" 2^249,999 ...

So, while you could think up an emitting strategy for answering the question for any subset, it would take more storage space than is actually available on our planet. In the general case, to answer N different questions you need to emit at least sqrt(N/2) B-tree leaves, so count your questions and determine if that lower bound on the number of leaves is acceptable.

Only for categories and subcategories: if you give up on arbitrary lists of products and only ask questions of the form "give me the significant attributes in category A filtered by attributes B and C", then your number of emits drops to:

 AvgCategories * AvgAttr * 2 ^ (AvgAttr - 1) * 500,000

You're basically emitting for each product the keys [Category,Attr,Attr,...] for all categories of the product and all combinations of attributes of the product, which lets you query by category + attributes. If you have on average 1 category and 3 attributes per product, this works out to about 6 million entries, which is fairly acceptable.

This should be quite straightforward to implement in something like CouchDB. Have the map phase of your index output one key, value pair for each attribute the object has, with the value simply being '1'. Then, have the reduce phase sum up all input values and output the sum. The end result will be an index of the form you describe.

继续阅读：algorithm mapreduce ravendb

Can a "pre-computed" map-reduce index (à la RavenDB/CouchDB) be used for this kind of algorithm?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？