Possible to use Map Reduce and Hadoop to parallel process batch jobs?
Our organization has hundreds of batch jobs that run overnight. Many of these jobs require 2, 3, or 4 hours to complete; some even require up to 7 hours. Currently, these jobs run in single-threaded mode, so our attempts to increase performance are limited to vertical scaling of the machine with additional CPU and memory.
We are exploring the idea of leveraging parallel processing techniques, such as Map Reduce, to cut down the time required for these jobs to complete. Most of our batch processes pull in large data sets, typically from a database, process the data row by row, and dump the result as a file into another database. In most cases, processing of individual rows is independent of other rows.
Now we are looking at Map Reduce frameworks to break up these jobs into smaller pieces for parallel processing. Our organization has over 400 employee desktop PCs, and we would like to utilize these machines outside business hours as the parallel processing grid.
What do we need to get this working? Is Hadoop the only component required? Do we also need HBase? We are slightly confused by all the different offerings and need some assistance.
Thanks
There are a couple of questions here -- about MapReduce, and about making use of 400 PCs for the job.
What you're describing is definitely possible, but I think it might be too early to commit to a particular programming model like MapReduce at this stage.
Let's take the idea of using 400 desktops first. This is, in principle, completely doable. It has its own challenges -- note, for instance, that leaving a bunch of desktop-class machines on overnight will never be as power-efficient as dedicated cluster nodes. The desktop nodes are also not as reliable as cluster nodes: some might be shut off, some might have network problems, and something left running on them might slow down a compute job. But there are frameworks that can deal with this. The one I'm familiar with is Condor, which made its name making use of exactly this sort of situation. It runs on Windows and Linux (and does fine in mixed environments), and it is very flexible; you can automatically have it make use of unused machines even during the day.
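To give a feel for it (not a working setup), a Condor submit description file for farming out independent chunks of work might look roughly like this; the process_chunk.py script and the count of 400 tasks are hypothetical placeholders for whatever your jobs actually do:

    # submit.condor -- hypothetical sketch; script name and task count are placeholders
    universe   = vanilla
    executable = process_chunk.py
    arguments  = $(Process)            # each task receives its own chunk index 0..399
    output     = chunk_$(Process).out
    error      = chunk_$(Process).err
    log        = batch.log
    queue 400

Condor then matches those 400 independent tasks onto whichever desktop machines are idle and collects the output for you.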
There are likely other such "opportunistic computing" systems out there, and maybe others can suggest them. You could use other clustering solutions too and run your jobs through a traditional queueing system (SGE, Rocks, etc.), but most of those assume that the machines are always theirs to use.
As to MapReduce, if most of your computing is really of the form (independent reads from a DB) → (independent computations) → (put independent rows into a second DB), I think MapReduce might even be overkill for what you want. You could probably script something to partition the job into individual tasks and run them independently, without the overhead of an entire MapReduce system and its associated, very particular filesystem. But if you want to, you can run MapReduce on top of a scheduling / resource-manager system like Condor; Hadoop on top of Condor has a long history.
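To make the "script something to partition the job" idea concrete, here is a minimal sketch in Python; the chunk size, the fetch_rows/process_row/write_results functions, and the dummy data inside them are all hypothetical stand-ins for your actual per-row logic and databases:

    # partition_job.py -- hypothetical sketch of splitting a row-by-row batch job
    # into independent chunks that can be processed in parallel.
    from concurrent.futures import ProcessPoolExecutor

    CHUNK_SIZE = 10_000  # rows per task; tune to your data

    def fetch_rows(offset, limit):
        # Placeholder: in practice, SELECT one slice of rows from the source DB
        # (e.g. ORDER BY id LIMIT ... OFFSET ...). Here we just fake some rows.
        return list(range(offset, offset + limit))

    def process_row(row):
        # Placeholder for the existing per-row business logic, unchanged.
        return row * 2

    def write_results(rows):
        # Placeholder: in practice, bulk-insert the processed rows into the
        # target database; here we do nothing.
        pass

    def process_chunk(chunk_index):
        rows = fetch_rows(chunk_index * CHUNK_SIZE, CHUNK_SIZE)
        write_results([process_row(r) for r in rows])
        return len(rows)

    if __name__ == "__main__":
        total_chunks = 40  # in practice, derive this from a row count
        with ProcessPoolExecutor() as pool:
            done = sum(pool.map(process_chunk, range(total_chunks)))
        print(f"processed {done} rows")

Here each chunk runs as a local process, but the same process_chunk step could just as easily be submitted as one Condor job per chunk, which is exactly the kind of embarrassingly parallel workload that fits your desktop grid.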