Possible to use Map Reduce and Hadoop to parallel process batch jobs?
Our organization has hundreds of batch jobs that run overnight. Many of these jobs require 2, 3, or 4 hours to complete; some even require up to 7 hours. Currently, these jobs run in single-threaded mode, so our attempts to increase performance are limited to vertical scaling of the machine with additional CPU and memory.
We are exploring the idea of leveraging parallel processing techniques, such as Map Reduce, to cut down the time required for these jobs to complete. Most of our batch processes pull in large data sets, typically from a database, process the data row by row, and dump the result as a file into another database. In most cases, processing of individual rows is independent of other rows.
Now we are looking at Map Reduce frameworks to break up these jobs into smaller pieces for parallel processing. Our organization has over 400 employee desktop PCs, and we would like to utilize these machines outside business hours as the parallel processing grid.
What do we need to get this working? Is Hadoop the only component required? Do we also need HBase? We are slightly confused by all the different offerings and need some assistance.
Thanks
There are a couple of questions here -- about MapReduce, and about making use of 400 PCs for the job.
What you're describing is definitely possible, but I think it might be too early to commit to a particular programming model like MapReduce at this stage.
Let's take the idea of using 400 desktops first. This is, in principle, completely doable. It has its own challenges -- note, for instance, that leaving a bunch of desktop-class machines on overnight will never be as power-efficient as dedicated cluster nodes. The desktop nodes are also not as reliable as cluster nodes: some might be shut off, some might have network problems, and something left running on them might slow down a compute job. But there are frameworks that can deal with this. The one I'm familiar with is Condor, which made its name making use of exactly this sort of situation. It runs on Windows and Linux (and does fine in mixed environments), and it is very flexible; you can automatically have it make use of unused machines even during the day.
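To give a feel for it (not a working setup), a Condor submit description file for farming out independent chunks of work might look roughly like this; the process_chunk.py script and the count of 400 tasks are hypothetical placeholders for whatever your jobs actually do:

    # submit.condor -- hypothetical sketch; script name and task count are placeholders
    universe   = vanilla
    executable = process_chunk.py
    arguments  = $(Process)            # each task receives its own chunk index 0..399
    output     = chunk_$(Process).out
    error      = chunk_$(Process).err
    log        = batch.log
    queue 400

Condor then matches those 400 independent tasks onto whichever desktop machines are idle and collects the output for you.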
There are likely other such "opportunistic computing" systems out there, and maybe others can suggest them. You could use other clustering solutions too and run your jobs through a traditional queueing system (SGE, Rocks, etc.), but most of those assume that the machines are always theirs to use.
As to MapReduce, if most of your computing is really of the form (independent reads from a DB) → (independent computations) → (put independent rows into a second DB), I think MapReduce might even be overkill for what you want. You could probably script something to partition the job into individual tasks and run them independently, without the overhead of an entire MapReduce system and its associated, very particular filesystem. But if you want to, you can run MapReduce on top of a scheduling / resource-manager system like Condor; Hadoop on top of Condor has a long history.
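To make the "script something to partition the job" idea concrete, here is a minimal sketch in Python; the chunk size, the fetch_rows/process_row/write_results functions, and the dummy data inside them are all hypothetical stand-ins for your actual per-row logic and databases:

    # partition_job.py -- hypothetical sketch of splitting a row-by-row batch job
    # into independent chunks that can be processed in parallel.
    from concurrent.futures import ProcessPoolExecutor

    CHUNK_SIZE = 10_000  # rows per task; tune to your data

    def fetch_rows(offset, limit):
        # Placeholder: in practice, SELECT one slice of rows from the source DB
        # (e.g. ORDER BY id LIMIT ... OFFSET ...). Here we just fake some rows.
        return list(range(offset, offset + limit))

    def process_row(row):
        # Placeholder for the existing per-row business logic, unchanged.
        return row * 2

    def write_results(rows):
        # Placeholder: in practice, bulk-insert the processed rows into the
        # target database; here we do nothing.
        pass

    def process_chunk(chunk_index):
        rows = fetch_rows(chunk_index * CHUNK_SIZE, CHUNK_SIZE)
        write_results([process_row(r) for r in rows])
        return len(rows)

    if __name__ == "__main__":
        total_chunks = 40  # in practice, derive this from a row count
        with ProcessPoolExecutor() as pool:
            done = sum(pool.map(process_chunk, range(total_chunks)))
        print(f"processed {done} rows")

Here each chunk runs as a local process, but the same process_chunk step could just as easily be submitted as one Condor job per chunk, which is exactly the kind of embarrassingly parallel workload that fits your desktop grid.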