Atomicity in Map/Reducing over new records (MongoDB)

Here's the situation: I've got a MongoDB cluster and a web app that runs a pretty intensive Map/Reduce query. The query runs periodically (every 5 min) in a cron job, and the results are stored (using the merge output option) into a collection.

What works: currently, the query runs over every record in its collection. That collection is slowly growing into the millions of rows, and each run takes a little longer than the last.

The obvious solution is to run the Map/Reduce over new records only, and use the reduce function against the old stored values to compute the correct result. MongoDB is great here: it lets you specify reduce as the output option instead of merge to do just that.
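
For the record, that output option looks something like this in the mongo shell (a minimal sketch; the events collection, the eventCounts output collection, and the map/reduce pair are names I've made up here):

    // Hypothetical map/reduce pair: count events per user.
    var mapFn = function () { emit(this.userId, 1); };
    var reduceFn = function (key, values) { return Array.sum(values); };

    db.events.mapReduce(mapFn, reduceFn, {
        // 'reduce' re-runs reduceFn against the values already stored
        // in eventCounts, instead of overwriting them.
        out: { reduce: 'eventCounts' }
    });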

What I can't figure out: how to correctly perform the M/R only over new records in the initial collection. I see two potential solutions, neither of which is good. Ideas?

  1. I could flag records that have been processed. The problem is: how do I flag exactly the records the M/R just ran over?
  2. I could query for the matching items, pass the list of ids as an $in: [id1, id2, ...] query to the Map/Reduce, and then send an update that sets my flag using the same $in (see the sketch after this list). But that's really inelegant, and I don't know how it will perform when the list of records is huge.
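
Concretely, option 2 would be something like this (a sketch of the approach I'm dubious about, reusing the made-up mapFn/reduceFn and collection names from the snippet above):

    // Collect the ids of every record not yet processed.
    var ids = db.events.find({ processed: { $exists: false } }, { _id: 1 })
                       .toArray()
                       .map(function (doc) { return doc._id; });

    // Map/Reduce only over those ids...
    db.events.mapReduce(mapFn, reduceFn, {
        query: { _id: { $in: ids } },
        out: { reduce: 'eventCounts' }
    });

    // ...then flag exactly the same ids.
    db.events.update(
        { _id: { $in: ids } },
        { $set: { processed: true } },
        { multi: true }
    );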

tl;dr: How do I select only new records in a Map/Reduce query that reduces its results into a collection?


A kind soul on the #mongodb IRC channel helped me figure this one out. A simple solution is to add a state-machine field to each record and do the following (in pseudo-code):

    set {state:'processing'} where {state:{$exists:false}}
    mapreduce {...}          where {state:'processing'}
    set {state:'done'}       where {state:'processing'}
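
In concrete shell terms that pseudo-code comes out roughly as follows (a sketch: the events and eventCounts collection names and the mapFn/reduceFn pair are placeholders for whatever the real job uses):

    // 1. Claim every record that has never been processed.
    db.events.update(
        { state: { $exists: false } },
        { $set: { state: 'processing' } },
        { multi: true }
    );

    // 2. Map/Reduce only the claimed records, folding the results
    //    into the existing output collection via out: {reduce: ...}.
    db.events.mapReduce(mapFn, reduceFn, {
        query: { state: 'processing' },
        out: { reduce: 'eventCounts' }
    });

    // 3. Mark the claimed records as done.
    db.events.update(
        { state: 'processing' },
        { $set: { state: 'done' } },
        { multi: true }
    );

The nice property is that records inserted while step 2 runs have no state field yet, so they don't match 'processing' and simply wait for the next cron run.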

Now, this is suboptimal, because it stores an extra state field on every document, which wastes disk space on a collection with millions of records. But the real question is: why did I not think of this sooner?
