MongoDB incremental mapReduce: select only new documents added after the last mapReduce
Let's say I have a collection with documents that look like this (just a simplified example, but it should show the schema):
> db.data.find()
{ "_id" : ObjectId("4e9c1f27aa3dd60ee98282cf"), "type" : "A", "value" : 11 }
{ "_id" : ObjectId("4e9开发者_运维知识库c1f33aa3dd60ee98282d0"), "type" : "A", "value" : 58 }
{ "_id" : ObjectId("4e9c1f40aa3dd60ee98282d1"), "type" : "B", "value" : 37 }
{ "_id" : ObjectId("4e9c1f50aa3dd60ee98282d2"), "type" : "B", "value" : 1 }
{ "_id" : ObjectId("4e9c1f56aa3dd60ee98282d3"), "type" : "A", "value" : 85 }
{ "_id" : ObjectId("4e9c1f5daa3dd60ee98282d4"), "type" : "B", "value" : 12 }
Now I need to collect some statistics on that collection. For example:
db.data.mapReduce(function() {
    emit(this.type, this.value);
}, function(key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) { total += values[i]; }
    return total;
},
{ out: 'stat' })
will collect the totals in the 'stat' collection.
> db.stat.find()
{ "_id" : "A", "value" : 154 }
{ "_id" : "B", "value" : 50 }
At this point everything is perfect, but I'm stuck on the next step:
- the 'data' collection is constantly updated with new data (old documents stay unchanged; there are only inserts, no updates)
- I would like to periodically update the 'stat' collection, but I do not want to query the whole 'data' collection every time, so I chose to run an incremental mapReduce
- It may seem easier to just update the 'stat' collection on every insert into 'data' and not use mapReduce at all, but the real case is more complex than this example and I would like to compute statistics only on demand.
- To do this I should be able to query only the documents that were added after my last mapReduce
- As far as I understand, I cannot simply rely on ObjectId (i.e. store the last one and later select every document with ObjectId > stored), because ObjectId is not equivalent to an auto-increment ID in SQL databases (for example, different shards will produce different ObjectIds).
- I could change the ObjectId generator, but I am not sure how to do that properly in a sharded environment.
So the question is:
Is there any way to select only the documents that were added after the last mapReduce, so that I can run an incremental mapReduce? Or is there another strategy for keeping statistics up to date on a constantly growing collection?
You can cache the time and use it as a barrier for your next incremental map-reduce.
We're testing this at work and it seems to be working. Correct me if I'm wrong, but you can't safely do map-reduce while an insert is happening across shards. The versions become inconsistent and your map-reduce operation will fail. (If you find a solution to this, please do let me know! :)
We use bulk-inserts instead, once every 5 minutes. Once all the bulk inserts are done, we run the map-reduce like this (in Python):
import bson
from bson.code import Code

m = Code(<map function>)
r = Code(<reduce function>)
# pseudo code
end = last_time + 5 minutes
# Use the time and optionally any other keys you need here
q = bson.SON([("date", {"$gte": last_time, "$lt": end})])
collection.map_reduce(m, r, out={"reduce": <output_collection>}, query=q)
Note that we used the reduce output mode and not merge, because we don't want to overwrite what we had before; we want to combine the old results and the new results with the same reduce function.
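For illustration, here is a minimal, self-contained pymongo sketch of the same time-barrier idea, using the question's 'data'/'stat' collections, an assumed 'date' field on each document, and a made-up 'mapreduce_meta' collection to cache the barrier between runs. It relies on Collection.map_reduce(), which exists in pymongo 3.x and earlier (it was removed in pymongo 4):

import datetime
from bson.code import Code
from pymongo import MongoClient

db = MongoClient().test

mapper = Code("function() { emit(this.type, this.value); }")
reducer = Code("""
function(key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) { total += values[i]; }
    return total;
}
""")

# Read the barrier left by the previous run (fall back to the epoch on the first run).
meta = db.mapreduce_meta.find_one({"_id": "stat"}) or {}
last_time = meta.get("last_time", datetime.datetime(1970, 1, 1))
end = datetime.datetime.utcnow()

# Reduce only the documents inserted since the last run into the existing 'stat' results.
db.data.map_reduce(
    mapper,
    reducer,
    out={"reduce": "stat"},
    query={"date": {"$gte": last_time, "$lt": end}},
)

# Move the barrier forward for the next incremental run.
db.mapreduce_meta.update_one(
    {"_id": "stat"}, {"$set": {"last_time": end}}, upsert=True
)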
You can get just the time portion of the ID using _id.getTime() (from the Java driver docs: http://api.mongodb.org/java/2.6/org/bson/types/ObjectId.html). That should be sortable across all shards.
EDIT: Sorry, that was the Java docs... The _id.generation_time.in_time_zone(Time.zone) trick from http://mongotips.com/b/a-few-objectid-tricks/ is the Ruby equivalent; in the mongo shell the same information is available via _id.getTimestamp().
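For reference, a small sketch of reading that embedded timestamp with the bson package that ships with pymongo (the id is one from the question's example data):

from bson.objectid import ObjectId

oid = ObjectId("4e9c1f27aa3dd60ee98282cf")
# generation_time is a timezone-aware UTC datetime derived from the ObjectId's
# leading 4-byte timestamp, so it only has second-level precision.
print(oid.generation_time)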
I wrote up a complete pymongo-based solution that uses incremental map-reduce and caches the time, intended to be run from a cron job. It locks itself so two instances can't run concurrently:
https://gist.github.com/2233072
""" This method performs an incremental map-reduce on any new data in 'source_table_name'
into 'target_table_name'. It can be run in a cron job, for instance, and on each execution will
process only the new, unprocessed records.
The set of data to be processed incrementally is determined non-invasively (meaning the source table is not
written to) by using the queued_date field 'source_queued_date_field_name'. When a record is ready to be processed,
simply set its queued_date (which should be indexed for efficiency). When incremental_map_reduce() is run, any documents
with queued_dates between the counter in 'counter_key' and 'max_datetime' will be map/reduced.
If reset is True, it will drop 'target_table_name' before starting.
If max_datetime is given, it will only process records up to that date.
If limit_items is given, it will only process (roughly) that many items. If multiple
items share the same date stamp (as specified in 'source_queued_date_field_name') then
it has to fetch all of those or it'll lose track, so it includes them all.
If unspecified/None, counter_key defaults to counter_table_name:LastMaxDatetime.
"""
We solve this issue using 'normalized' ObjectIds. The steps we take:
- normalize the id - take the timestamp from the current/stored/last processed id and set the other parts of the id to their minimum values. C# code:
new ObjectId(objectId.Timestamp, 0, short.MinValue, 0)
- run map-reduce over all items that have an id greater than our normalized id, skipping already processed items.
- store the last processed id, and mark all processed items.
Note: Some boundary items will be processed several times. To fix that we set a flag on the processed items.
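A rough Python equivalent of the normalization step, using bson's ObjectId.from_datetime() (which keeps only the timestamp and zeroes the remaining bytes); where the last processed id is stored and the 'processed' flag name are assumptions for the sketch:

from bson.objectid import ObjectId
from pymongo import MongoClient

db = MongoClient().test

# Assume the last processed _id was stored somewhere between runs (illustrative location).
last_id = db.mapreduce_meta.find_one({"_id": "stat"})["last_id"]

# 'Normalize' it: keep only the timestamp part and zero the machine/process/counter
# bytes, mirroring the C# snippet above.
normalized = ObjectId.from_datetime(last_id.generation_time)

# Select everything with an id greater than the normalized one, skipping documents
# already flagged in a previous run (boundary items within the same second can repeat).
query = {"_id": {"$gt": normalized}, "processed": {"$ne": True}}
new_docs = db.data.find(query)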