Can a "pre-computed" map-reduce index (à la RavenDB/CouchDB) be used for this kind of algorithm?
I'm trying开发者_运维技巧 to see if a specific algorithm can be translated to the kind of map-reduce index RavenDB/CouchDB uses, ie, "pre-computed" map-reduce (which means the indexes are refreshed on insertion and updates, not when performing the actual query).
Let's say we have a typical online store with 50,000 products, grouped in categories. Every product has a collection of "Attribute Values", ie, something like "[Red, Round, Metal]".
Since we have so much products on our website, and there's probably a lot of items in each of the categories, we want to give the user another way to "filter" the products he's currently seeing.
For example, if a category is "Less than $20", there's a whole bunch of products in this category. But our user only need to see products which are less than $20 and Red. Unfortunately, there's no sub-category "Red" in the "Less than $20" category.
Our algorithm would take the current list of products, and generate a list of "interesting" Attributes and Attribute Values, ie, given a list of products, it would output something like:
Color
Red (40)
Blue (32)
Yellow (17)
Material
Metal (37)
Plastic (36)
Wood (23)
Shape
Square (56)
Round (17)
Cylinder (12)
Could this sort of algorithm be somehow pre-computed à la RavenDB/CouchDB map-reduce index? If not, why exactly (so I can identify that kind of algorithm in the future) and if yes, how?
A C# 4.0 Visual Studio Test Solution is available that demonstrates the potential data structures and sample data, as well as a try at a map-reduce implementation (which doesn't seem to be pre-computable).
General case: It's always possible to use a CouchDB-style map-reduce view, but it's not necessarily practical.
In the end, it's mostly a counting-based argument: if you need to ask the question for any subset of your 500,000 products, then your database must be able to provide a distinct answer to each of 2500,000 different possible questions, which uses a prohibitive amount of memory if you have to emit a B-tree leaf for every one of them (and you need to emit data unless the answer to most of these queries is zero, false, an empty set or a similar null value).
CouchDB provides a first small optimization through the existence of range queries (meaning that in an ideal case, it can use as little as N B-tree leaves to answer N2 questions). However, in your example, this would only reduce the number of leaves down to 2250,000 (and that's a theoretical lower bound).
CouchDB provides a second small optimization through key prefix queries, meaning that you can compress [A], [A,B] and [A,B,C] queries into a single [A,B,C] key. So, instead of your 2250,000 possibilities, you're down to a "mere" 2249,999 ...
So, while you could think up an emitting strategy for answering the question for any subset, it would take more storage space than is actually available on our planet. In the general case, to answer N different questions you need to emit at least sqrt(N/2)
B-tree leaves, so count your questions and determine if that lower bound on the number of leaves is acceptable.
Only for categories and subcategories: if you give up on arbitrary lists of products and only ask questions of the form "give me the significant attributes in category A filtered by attributes B and C", then your number of emits drops to:
AvgCategories * AvgAttr * 2 ^ (AvgAttr - 1) * 500,000
You're basically emitting for each product the keys [Category,Attr,Attr,...]
for all categories of the product and all combinations of attributes of the product, which lets you query by category + attributes. If you have on average 1 category and 3 attributes per product, this works out to about 6 million entries, which is fairly acceptable.
This should be quite straightforward to implement in something like CouchDB. Have the map phase of your index output one key, value pair for each attribute the object has, with the value simply being '1'. Then, have the reduce phase sum up all input values and output the sum. The end result will be an index of the form you describe.
精彩评论