How efficient are open-source computation platforms like Hadoop?

2023-03-23 16:52

How efficient are open-source distributed computation frameworks like Hadoop? By efficiency, I mean the share of CPU cycles that can be used for the "actual job" in tasks that are mostly pure computation. In other words, how many CPU cycles are spent on overhead, or wasted because they sit unused? I'm not looking for specific numbers, just a rough picture. E.g. can I expect to use 90% of the cluster's CPU power? 99%? 99.9%?

To be more specific, let's say I want to calculate pi, and I have an algorithm X. When I run it on a single core in a tight loop, let's say I get some performance Y. If I do this calculation in a distributed fashion using e.g. Hadoop, how much performance degradation can I expect?

I understand this would depend on many factors, but what would be the rough magnitude? I'm thinking of a cluster with maybe 10 - 100 servers (80 - 800 CPU cores total), if that matters.

Thanks!

Technically, Hadoop has considerable overhead in several dimensions:
a) Per-task overhead, which can be estimated at 1 to 3 seconds.
b) HDFS data-reading overhead, due to passing data through a socket and CRC calculation. This is harder to estimate.
These overheads can be very significant if you have a lot of small tasks and/or your data processing is light.
On the other hand, if you have big files (fewer tasks) and your data processing is heavy (say, a few MB/sec per core), then Hadoop's overhead can be neglected.
The bottom line: Hadoop's overhead is variable and depends heavily on the nature of the processing you are doing.
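To put rough numbers on this, here is a minimal back-of-the-envelope sketch in Python. It assumes a fixed 2-second per-task overhead (the midpoint of the 1 to 3 second estimate above) and ignores HDFS read cost, idle slots, and stragglers, so treat it as an optimistic upper bound rather than a measurement. The function name `cpu_efficiency` is made up for this illustration and is not part of Hadoop.

```python
def cpu_efficiency(compute_seconds: float, overhead_seconds: float) -> float:
    """Fraction of a task's wall time spent on useful computation.

    Simplified model: efficiency = compute / (compute + overhead).
    """
    return compute_seconds / (compute_seconds + overhead_seconds)


# Assumption: 2 s of fixed overhead per task (midpoint of the 1-3 s estimate).
OVERHEAD_SECONDS = 2.0

for compute_seconds in (2.0, 18.0, 198.0, 1998.0):
    efficiency = cpu_efficiency(compute_seconds, OVERHEAD_SECONDS)
    print(f"{compute_seconds:6.0f} s of work per task -> {efficiency:.1%} useful CPU time")
```

Under this model, tasks need about 18 seconds of pure computation each to reach 90%, a bit over 3 minutes for 99%, and over half an hour for 99.9%. So for a CPU-bound job like the pi example, the practical lever is making each task long enough that the fixed startup cost becomes noise.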

This question is too broad and vague to answer usefully. There are many different open-source platforms, varying very widely in their quality. Some early Beowulfs were notoriously wasteful, for example, whereas modern MPI2 is pretty lean.

Also, "efficiency" means different things in different domains. It might mean the amount of CPU overhead spent on constructing and passing messages relative to the work payload (in which case you're comparing MPI vs Map/Reduce), or it might mean the number of CPU cycles wasted by the interpreter/VM, if any (in which case you're comparing C++ vs Python).

It depends on the problem you are trying to solve, too. In some domains, you have lots of little messages flying back and forth, in which case the CPU cost of constructing them matters a lot (like high-frequency trading). In others, you have relatively few but large work-blocks, so the cost of packing the messages is small compared to the computational efficiency of the math inside the work block (like Folding@Home).

So in summary, this is an impossible question to answer generally, because there's no one answer. It depends on specifically what you're trying to do with the distributed platform, and what machinery it is running on.

MapR is one of the alternatives to Apache Hadoop, and Srivas (CTO and founder of MapR) has compared MapR with Apache Hadoop. The presentation and video below include metrics comparing the two. It looks like Apache Hadoop does not use the hardware efficiently.

http://www.slideshare.net/mcsrivas/design-scale-and-performance-of-maprs-distribution-for-hadoop

http://www.youtube.com/watch?v=fP4HnvZmpZI

Apache Hadoop seems to be inefficient in some dimensions, but there is a lot of activity in the Apache Hadoop community around scalability, reliability, availability, and efficiency. Next Generation MapReduce and HDFS scalability/availability are some of the things currently being worked on. These are expected to be available in Hadoop version 0.23.

Until recently, the focus of the Hadoop community seemed to be on scalability, but it is now shifting towards efficiency as well.
