Persistence in MapReduce
Let's say you have divided your work for the map phase of map/reduce and the mapping is running. Each unit of work takes about 1 minute. Now suppose you need to stop processing. How would you persist the state of the map/reduce job so that you waste the least amount of time when you start back up?
You'd have to memoize the results in a way that lets you skip most of the processing for rows you've already seen. If there's a candidate key that identifies each row, you can use it to look in a cache and fetch the processed results stored there.
Setting up your cluster with Memcached or Redis would be one approach for achieving memoization.
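As a rough illustration of the idea, here is a minimal sketch in Python. The names `map_with_cache`, `key_fn`, and `map_fn` are made up for this example, and a plain `dict` stands in for the cache; in a real cluster you would swap in a Redis or Memcached client (or anything else that survives a restart) behind the same get/set pattern.

```python
from typing import Any, Callable, Hashable, Iterable, List, MutableMapping


def map_with_cache(
    rows: Iterable[Any],
    key_fn: Callable[[Any], Hashable],
    map_fn: Callable[[Any], Any],
    cache: MutableMapping,
) -> List[Any]:
    """Apply map_fn to each row, reusing cached results for keys seen before."""
    results = []
    for row in rows:
        key = key_fn(row)  # candidate key that identifies the row
        if key not in cache:
            # Cache miss: do the expensive work and store the result right
            # away, so stopping the job loses at most the unit in flight.
            cache[key] = map_fn(row)
        results.append(cache[key])
    return results


if __name__ == "__main__":
    processed = []  # records which rows were actually computed

    def slow_map(row):
        processed.append(row["id"])
        return row["value"] * 2

    rows = [{"id": 1, "value": 10}, {"id": 2, "value": 20}, {"id": 3, "value": 30}]
    cache = {}  # hypothetical stand-in for a persistent store

    # First run is "stopped" after two rows.
    map_with_cache(rows[:2], lambda r: r["id"], slow_map, cache)
    # After the restart, only the third row is recomputed.
    print(map_with_cache(rows, lambda r: r["id"], slow_map, cache))
    print(processed)
```

Writing each result to the cache as soon as it is computed is what keeps the wasted time small: with 1-minute work units, a restart should redo at most the units that were in progress when the job stopped.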