
Tools for optimizing scalability of a Hadoop application?

I'm working with my team on a small application that takes a lot of input (a day's worth of logfiles) and produces useful output after several (now 4, in the future perhaps 10) map-reduce steps (Hadoop & Java).

Now I've done a partial POC of this app and run it on 4 old desktops (my Hadoop test cluster). What I've noticed is that if you do the partitioning "wrong", the horizontal-scaling characteristics are wrecked beyond recognition. I found that comparing a test run on a single node (say 20 minutes) with one on all 4 nodes only gave a 50% speedup (about 10 minutes), where I expected a 75% (or at least >70%) speedup (about 5 or 6 minutes).

The general principle of making map-reduce scale horizontally is to ensure that the partitions are as independent as possible. I found that in my case I did the partitioning of each step "wrong" because I simply used the default hash partitioner; this makes the records jump around to a different partition in the next map-reduce step.

I expect (haven't tried it yet) that I can speed the thing up and make it scale much better if I can convince as many records as possible to stay in the same partition (i.e. build a custom partitioner).
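To illustrate what I mean, a custom partitioner would look roughly like this (an untested sketch; the tab-separated key whose first field is the grouping field is just a placeholder for whatever my real keys contain):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Sketch of a custom partitioner: instead of hashing the whole key (what the
    // default HashPartitioner does), partition only on the part of the key that
    // stays constant across the chained map-reduce steps, so the same records
    // keep landing in the same partition from step to step.
    public class GroupFieldPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            // Placeholder key layout: "groupField<TAB>rest" - partition on groupField only.
            String groupField = key.toString().split("\t", 2)[0];
            return (groupField.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

Each job would then register it with job.setPartitionerClass(GroupFieldPartitioner.class).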

In the case described above I found this solution by hand; I deduced what went wrong by thinking hard about it while driving to work.

Now my questions to you all:

- What tools are available to detect issues like this?
- Are there any guidelines/checklists to follow?
- How do I go about measuring things like "the number of records that jumped partition"? (rough idea sketched below)
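For that last point, the only rough idea I have so far is to use Hadoop counters: carry the previous step's partition number along in each record and have the next step's mapper count every record that now hashes to a different partition. Something like this (untested; my jobs don't carry the previous partition number yet, so this is purely a sketch):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch: count records whose partition changes between two chained jobs.
    // Assumes the previous step tagged each value with the partition number it
    // was reduced in ("<prevPartition><TAB><payload>") - a made-up convention.
    public class PartitionJumpMapper extends Mapper<Text, Text, Text, Text> {
        private int numPartitions;

        @Override
        protected void setup(Context context) {
            numPartitions = context.getNumReduceTasks();
        }

        @Override
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2);
            int prevPartition = Integer.parseInt(parts[0]);
            // Same computation as the default HashPartitioner for this job.
            int currentPartition = (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
            if (prevPartition != currentPartition) {
                context.getCounter("Partitioning", "RECORDS_JUMPED").increment(1);
            }
            context.write(key, new Text(parts[1]));
        }
    }

The counter total would then show up in the job's counter output next to the built-in counters.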

Any suggestions (tools, tutorials, book, ...) are greatly appreciated.


Make sure that you're not running into the small files problem. Hadoop is optimized for throughput rather than latency, so it will process many log files joined into one large sequence file much more quickly than it will many individual files stored on HDFS. Using sequence files in this way eliminates the extra time needed to do housekeeping for the individual map and reduce tasks and improves data locality. But yes, it's important that your map outputs are reasonably well distributed to the reducers, to ensure that a few reducers are not overloaded with a disproportionate amount of work.
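For example, packing the day's log files into one sequence file before the first job could look roughly like this (a sketch using the SequenceFile.Writer API; the input and output paths are placeholders):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Sketch: pack many small log files into a single SequenceFile so the
    // map-reduce jobs read one large file instead of thousands of small ones.
    public class LogsToSequenceFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path input = new Path("/logs/raw/one-day");       // placeholder
            Path output = new Path("/logs/seq/one-day.seq");  // placeholder

            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, output, LongWritable.class, Text.class);
            long lineNo = 0;
            for (FileStatus status : fs.listStatus(input)) {
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(fs.open(status.getPath())));
                String line;
                while ((line = reader.readLine()) != null) {
                    writer.append(new LongWritable(lineNo++), new Text(line));
                }
                reader.close();
            }
            writer.close();
        }
    }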


Take a look at the Karmasphere (formerly known as Hadoop Studio) plugin for NetBeans/Eclipse: http://karmasphere.com/Download/download.html. There's a free version that can help with detecting and test-running Hadoop jobs.
I have tested it a little and it looks promising.

