'AVG' and 'SUM' functionality in MongoDB, any tips?
I'm a relative newbie to MongoDB, but from what I've read there are various ways to find averages and sums of values in a MongoDB database, each with its own benefits and drawbacks.
I'm primarily asking for a way of finding the sum of a selection of values, and the average of a selection of values, as efficiently (quickly) as possible.
The documents in the collection being queried resemble this structure (with a lot of other fields):
{
    "_id": ObjectId('4e650107580fd649e5000005'),
    "date_added": ISODate("2011-09-05T00:00:00Z"),
    "value": 1500
}
Precalculating things like sums is not always possible in my application, because the selection of values to be summed can change (it depends on date ranges - e.g. what is the average between a start date and an end date?). Precalculating averages runs into the same problem.
From what I've read, MapReduce is definitely not ideal for real-time (i.e. on demand) lookup, so that seems to be out of the question too.
At the moment I'm querying the collection in this way (note: this is using pymongo):
response = request.db['somecollection'].find(
    {
        'date_added': {
            '$gte': date_start,
            '$lte': date_end
        }
    },
    {
        'value': 1
    }
).limit(500)
Then I do the calculation in Python using a for loop over the response. The limit of 500 results is arbitrary, to keep the query from becoming too slow. I'm only retrieving the value, and none of the other fields.
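Spelled out, that client-side loop is just the following (a minimal sketch over the response cursor above; the float() cast avoids integer division on Python 2):

total = 0
count = 0
for doc in response:  # each doc contains only _id and value
    total += doc['value']
    count += 1

value_sum = total
value_avg = float(total) / count if count else None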
Is this the most efficient method of doing this calculation, or are there other ways to accomplish what I need?
Caveats:
- I can't use the group function, because I will probably be using sharding in the future
- I can't use MapReduce, because it will be used on-the-fly by users
- I can't precalculate a lot of my sums/averages, because the selection of values to sum/average is almost always different
- I have looked around Stack Overflow and the web for recommendations on how to do this kind of thing, and it's fairly open-ended
EDIT:
I should point out that the number of documents returned by the query I posted above could be anything from 1 to several hundred, but will probably max out at around 150 (with an average of about 60 or 70).
Give map-reduce a try; it's probably not as slow as you think. I've used it for real-time aggregation over some large data sets, and although it's sometimes not lightning fast, it's more often fine. It's best if you can filter down the size of the initial data you're aggregating, e.g.:
db.collection.mapReduce(m, r, { query : { year: 2011 } });
If you need to speed things up even more, consider distributing the data over a sharded cluster. Then the map-reduce processing can be scaled out across multiple shards running in parallel.
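As a concrete illustration, here is a minimal map/reduce sketch in pymongo that computes the sum and the count in one pass over the date range from the question, so the average falls out at the end. It uses inline_map_reduce to keep the results in memory rather than writing them to a collection:

from bson.code import Code

# Map emits one (key, {sum, count}) pair per document so the reducer
# can accumulate a running sum and a running count together.
mapper = Code("""
    function () {
        emit("stats", { sum: this.value, count: 1 });
    }
""")

# The reducer returns the same shape it receives, as required for
# re-reduction across chunks.
reducer = Code("""
    function (key, values) {
        var result = { sum: 0, count: 0 };
        values.forEach(function (v) {
            result.sum += v.sum;
            result.count += v.count;
        });
        return result;
    }
""")

results = request.db['somecollection'].inline_map_reduce(
    mapper,
    reducer,
    query={'date_added': {'$gte': date_start, '$lte': date_end}},
)

if results:
    stats = results[0]['value']
    value_sum = stats['sum']
    value_avg = float(stats['sum']) / stats['count']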
MongoDB notes
OK, so Map/Reduce and aggregation have some serious issues currently.
Big caveat: the MongoDB instance can only have one "javascript engine" instance. This means that you cannot run two simultaneous Map/Reduces on the server. And you only get one core for running the map-reduce.
In the case of what you are doing, you're basically "rolling your own" M/R. The downside is the extra network traffic. The upside is that you can now throw more cores at the problem (from the web-servers).
Your key question
I can't precalculate a lot of my sums/averages because the selection of values to sum/average is almost always different
There is no general method for optimizing "all possible" queries. If you want the system to be able to sum and aggregate along every field for every range, then you will eventually find a set of fields/ranges that are too big.
The way to "solve" this is to reduce the set of fields and ranges.
So keep daily / hourly counters and sum over those counters. At the least you reduce the number of documents you need to scan in order to answer your query.
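A minimal sketch of that pre-aggregation idea, assuming a hypothetical daily_stats collection keyed by day (the collection and field names are illustrative, not from the question):

import datetime

# Given a newly inserted raw document like the one in the question:
doc = {'date_added': datetime.datetime(2011, 9, 5), 'value': 1500}

# Bump a per-day counter document with an upserted $inc.
day = doc['date_added'].replace(hour=0, minute=0, second=0, microsecond=0)
request.db['daily_stats'].update(
    {'day': day},
    {'$inc': {'sum': doc['value'], 'count': 1}},
    upsert=True,
)

# A date-range query then scans one small document per day instead
# of every raw document in the range.
total = 0
count = 0
for stats in request.db['daily_stats'].find(
        {'day': {'$gte': date_start, '$lte': date_end}}):
    total += stats['sum']
    count += stats['count']
average = float(total) / count if count else None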
The simple answer is:
- If possible, precalculate everything you can precalculate.
- If you need to aggregate data by date ranges and the aggregation should run as quickly as possible, use map/reduce plus sharding to distribute the calculation across multiple machines.
But at the same time, the MongoDB guide says:
The price of using MapReduce is speed: group is not particularly speedy, but MapReduce is slower and is not supposed to be used in “real time.” You run MapReduce as a background job, it creates a collection of results, and then you can query that collection in real time.
So it sounds like MongoDB is not the best solution for real-time data aggregation.
MongoDB is slated to get native aggregation functions for things like sum/avg/min/max in version 2.1.1 (currently planned for Nov 1, 2011). For more detail and status, see the issue at: https://jira.mongodb.org/browse/SERVER-447
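Once that lands, the query from the question should reduce to something like the following pipeline (a sketch based on the operators proposed in SERVER-447; the exact pymongo call may differ when it ships):

# $match filters by date range; $group computes the sum and average
# over all matching documents in a single server-side pass.
pipeline = [
    {'$match': {'date_added': {'$gte': date_start, '$lte': date_end}}},
    {'$group': {
        '_id': None,
        'total': {'$sum': '$value'},
        'average': {'$avg': '$value'},
    }},
]
result = request.db['somecollection'].aggregate(pipeline)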