Atomicity in Map/Reducing over new records (MongoDB)
Here's the situation: I've got a MongoDB cluster and a web app which does a pretty intensive Map/Reduce query. This query runs periodically (every 5 minutes) in a cron job, and the results are stored (using the merge output option) into a collection.
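For context, here's a minimal sketch of what that cron job might look like in the mongo shell. The events collection, the event_totals output collection, and the trivial map/reduce pair are all hypothetical stand-ins for the real (more intensive) query:

// Hypothetical map/reduce pair: count records per userId.
var map = function () {
  emit(this.userId, 1);
};
var reduce = function (key, values) {
  return Array.sum(values);
};

// Run over the whole collection; merge overwrites matching keys
// in the output collection with the newly computed values.
db.events.mapReduce(map, reduce, { out: { merge: "event_totals" } });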
What works: Currently, the query runs over every record in the collection. That collection is slowly growing into the millions of records, and each run takes a little longer than the last.
The obvious solution is to run the Map/Reduce only over new records, and use the reduce function over the old stored values to calculate the correct result. MongoDB is great: it lets you specify a reduce output option instead of merge to do just that.
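Using the same hypothetical map/reduce pair as above, switching the output mode is a one-line change. With reduce, any key that already exists in the output collection gets re-reduced together with the new result instead of being overwritten:

db.events.mapReduce(map, reduce, {
  query: { /* somehow match only the new records */ },
  out: { reduce: "event_totals" }
});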
What I can't figure out: How to correctly perform the M/R only over the new records in the initial collection. I see two potential solutions, neither of which is good. Ideas?
- I could flag records that have been processed. The problem is: how do I flag exactly the same records that I just M/R'd over?
- I could query for the matching records first, pass the list of ids as an $in: [id1, id2, ...] clause to the Map/Reduce query, and then send an update to set my flag using the same $in (see the sketch after this list). But that's really inelegant, and I don't know how it will perform when the list of records is huge.
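To make that second idea concrete, here's a sketch of the $in approach against the same hypothetical collections; note that the ids array lives in shell memory, which is exactly what worries me for a huge batch:

// Collect the ids of everything not yet processed.
var ids = db.events.find({ processed: { $exists: false } }, { _id: 1 })
                   .toArray()
                   .map(function (doc) { return doc._id; });

// Map/reduce over exactly those records.
db.events.mapReduce(map, reduce, {
  query: { _id: { $in: ids } },
  out: { reduce: "event_totals" }
});

// Flag the same records as processed.
db.events.update({ _id: { $in: ids } },
                 { $set: { processed: true } },
                 { multi: true });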
tl;dr: How do I select only the new records in a Map/Reduce query that reduces its result into a collection?
A kind soul on the #mongodb IRC channel helped me figure this one out. A simple solution is to add a state-machine field and do the following (in pseudo-code):
set {state:'processing'} where {state:{$exists:false}}
mapreduce {...} where {state:'processing'}
set {state:'done'} where {state:'processing'}
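In real mongo shell syntax (still with the hypothetical collections and map/reduce pair from above), that looks roughly like this:

// 1. Claim every record that hasn't been touched yet.
db.events.update({ state: { $exists: false } },
                 { $set: { state: "processing" } },
                 { multi: true });

// 2. Map/reduce only the claimed records, folding results into the output.
db.events.mapReduce(map, reduce, {
  query: { state: "processing" },
  out: { reduce: "event_totals" }
});

// 3. Mark the claimed records as done.
db.events.update({ state: "processing" },
                 { $set: { state: "done" } },
                 { multi: true });

The trick is that step 1 freezes the set of records to process: anything inserted after that update runs simply has no state field yet, so it is untouched by this batch and gets picked up by the next cron run.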
Now, this is suboptimal because it wastes a lot of disk space on a collection with millions of records. But the real question is, why did I not think of this sooner?