
Hadoop: Iterative MapReduce Performance

Is it correct to say that parallel computation with iterative MapReduce is justified mainly when the training data is too large for non-parallel computation of the same logic?

I am aware that there is overhead for starting MapReduce jobs, and this can be critical for the overall execution time when a large number of iterations is required. A minimal sketch of the usual iterative driver pattern in Java is shown below.
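In this sketch, MyMapper and MyReducer are hypothetical stand-ins for the per-iteration logic. The point is structural: every pass through the loop submits a completely new job, so the fixed startup cost (scheduling, task JVM launch, HDFS round-trips) is paid once per iteration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class IterativeDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            int iterations = Integer.parseInt(args[0]);
            Path input = new Path(args[1]);

            for (int i = 0; i < iterations; i++) {
                // A brand-new job per iteration: scheduling, task JVM
                // startup and HDFS round-trips are paid every time.
                Job job = Job.getInstance(conf, "iteration-" + i);
                job.setJarByClass(IterativeDriver.class);
                job.setMapperClass(MyMapper.class);   // hypothetical mapper
                job.setReducerClass(MyReducer.class); // hypothetical reducer
                FileInputFormat.addInputPath(job, input);
                Path output = new Path(args[1] + "-iter-" + i);
                FileOutputFormat.setOutputPath(job, output);
                if (!job.waitForCompletion(true)) break;
                input = output; // this iteration's output feeds the next
            }
        }
    }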

In many cases, I can imagine that sequential computation is faster than parallel computation with iterative MapReduce, as long as the data set fits in memory.
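For contrast, here is a hedged sketch of the sequential, in-memory counterpart (the update rule is a hypothetical stand-in): the data is read from disk exactly once, so every iteration after the first costs only CPU time, with no job startup or HDFS writes in between.

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    public class SequentialDriver {
        public static void main(String[] args) throws Exception {
            // The data set is read from disk exactly once and kept in RAM.
            List<String> data = Files.readAllLines(Paths.get(args[0]));
            int iterations = Integer.parseInt(args[1]);
            double model = 0.0; // hypothetical model state

            for (int i = 0; i < iterations; i++) {
                // Each pass touches only memory: no job startup,
                // no serialization between iterations.
                for (String record : data) {
                    model += record.length() * 1e-6; // stand-in for the real update rule
                }
            }
            System.out.println("final model state: " + model);
        }
    }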


Most of the time, no parallel processing system makes much sense if a single machine can do the job. The complexity associated with most parallelization tasks is significant, and you need a good reason to take it on.

Even when it is obvious that a task can't be solved in acceptable time without parallel processing, parallel execution frameworks come in different flavours: from lower-level, science-oriented tools like PVM or MPI to high-level, specialized (e.g. map/reduce) frameworks like Hadoop.

Among the parameters you should consider are startup time and scalability (how close to linearly the system scales). Hadoop is not a good choice if you need answers quickly, but it may be a good choice if your process fits the map/reduce model.
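As a rough back-of-the-envelope model (my notation, not from the answer above): let k be the number of iterations, s the fixed per-job startup cost, and c_seq and c_par the per-iteration compute time on one machine and on the cluster, respectively. Then

    T_seq ≈ k * c_seq
    T_MR  ≈ k * (s + c_par)

so iterative MapReduce only wins when the per-iteration speedup exceeds the startup cost, i.e. when c_seq - c_par > s. With a hypothetical s of 30 seconds and k = 100, startup alone adds 50 minutes to the total.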


You may refer to the HaLoop project (http://code.google.com/p/haloop), which addresses exactly this problem.

