
Why is the right number of reduces in Hadoop 0.95 or 1.75?

The Hadoop documentation states:

The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum).

With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces, doing a much better job of load balancing.
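In code terms, I read the recommendation roughly as the sketch below (old org.apache.hadoop.mapred API, Hadoop 0.20/1.x style; ClusterStatus.getMaxReduceTasks() should already give the cluster-wide product from the formula, and the class name ReducerCount is just a placeholder):

    import org.apache.hadoop.mapred.ClusterStatus;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ReducerCount {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(ReducerCount.class);

            // Cluster-wide reduce-slot capacity, i.e. roughly
            // (no. of nodes * mapred.tasktracker.reduce.tasks.maximum).
            ClusterStatus cluster = new JobClient(conf).getClusterStatus();
            int reduceSlots = cluster.getMaxReduceTasks();

            // 0.95 -> one wave of reduces that fills (almost) every slot at once.
            conf.setNumReduceTasks(Math.max(1, (int) (0.95 * reduceSlots)));

            // ... set mapper/reducer classes, input/output paths, and submit.
        }
    }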

Are these values pretty constant? What are the results when you choose a value between these numbers, or outside of them?


The values should be what your situation needs them to be. :)

Below is my understanding of the benefit of each value:

The .95 is to allow maximum utilization of the available reducers. If Hadoop is left at its default of a single reducer, there is no distribution of the reduce work, and it takes longer than it should. In my (limited) experience, the speedup is close to linear in the number of reducers: a job that takes 16 minutes with 1 reducer takes about 2 minutes with 8 reducers.

The 1.75 is a value that attempts to compensate for performance differences between the machines in the cluster. It creates more than a single pass of reducers, so that the faster machines take on additional reducers while the slower machines do not.
This figure (1.75) will need to be adjusted to your hardware much more than the .95 value. If you have 1 quick machine and 3 slower ones, maybe you'll only want 1.10. It takes some experimentation to find the value that fits your hardware configuration; if the number of reducers is too high, the slow machines become the bottleneck again.
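To make the experimentation concrete, here is a minimal sketch that treats the wave factor as a per-job knob instead of a hard-coded 1.75; "my.reduce.wave.factor" is a made-up property name, not a standard Hadoop setting:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class TunableWaveFactor {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(TunableWaveFactor.class);

            // Made-up, job-specific property: 1.75 for fairly uniform hardware,
            // something like 1.10 when one fast machine would otherwise sit
            // waiting on several slow ones.
            float waveFactor = conf.getFloat("my.reduce.wave.factor", 1.75f);

            int reduceSlots = new JobClient(conf).getClusterStatus().getMaxReduceTasks();
            conf.setNumReduceTasks(Math.max(1, (int) (waveFactor * reduceSlots)));
        }
    }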


To add to what Nija said above, plus a bit of personal experience:

0.95 makes a bit of sense because you are utilizing close to the maximum capacity of your cluster, but at the same time you are leaving some empty task slots for the case where some of your reducers fail. If you use 1x the number of reduce task slots, a failed reduce has to wait until at least one other reducer finishes. If you use 0.85 or 0.75 of the reduce task slots, you're not utilizing as much of your cluster as you could.


These numbers are not really valid anymore. According to the book "Hadoop: The Definitive Guide" and the Hadoop wiki, the current target is for each reducer to run for about five minutes.

Fragment from the book:

Choosing the Number of Reducers

The single reducer default is something of a gotcha for new users to Hadoop. Almost all real-world jobs should set this to a larger number; otherwise, the job will be very slow since all the intermediate data flows through a single reduce task. Choosing the number of reducers for a job is more of an art than a science. Increasing the number of reducers makes the reduce phase shorter, since you get more parallelism. However, if you take this too far, you can have lots of small files, which is suboptimal. One rule of thumb is to aim for reducers that each run for five minutes or so, and which produce at least one HDFS block’s worth of output.
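Following the at-least-one-block part of that rule of thumb, a rough sizing helper might look like the sketch below; reducersFor and estimatedOutputBytes are hypothetical names, and the output estimate is something you have to supply yourself (e.g. from a previous run or a sample of the input):

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.mapred.JobConf;

    public class BlockSizedReducers {
        // Aim for at least one HDFS block's worth of output per reducer.
        public static int reducersFor(long estimatedOutputBytes, JobConf conf) throws Exception {
            long blockSize = FileSystem.get(conf).getDefaultBlockSize();
            long wanted = estimatedOutputBytes / blockSize;
            return (int) Math.min(Math.max(1, wanted), Integer.MAX_VALUE);
        }

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(BlockSizedReducers.class);
            // Example: roughly 10 GB of expected reduce output.
            conf.setNumReduceTasks(reducersFor(10L * 1024 * 1024 * 1024, conf));
        }
    }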
