How do I make this Java code parallelizable? How do I make it cloudable?
I'm working on a system at the moment. It's a complex system but it boils down to a Solver
class with a method like this:
public int solve(int problem); // returns the solution, or 0 if no solution found
Now, when the system is up and running, a run time of about 5 seconds for this method is expected and is perfectly fast enough. However, I plan to run some tests that look a bit like this:
List<Integer> problems = getProblems();
List<Integer> solutions = new ArrayList<Integer>(problems.size);
Solver solver = getSolver();
for (int problem: problems) {
solutions.add(solver.solve(problem));
}
// see what percentage of solutions are zero
// get arithmetic mean of non-zero solutions
// etc etc
The problem is I want to run this on a large number of problems, and don't want to wait forever for the results. So say I have a million test problems and I want the tests to complete in the time it takes me to make a cup of tea, I have two questions:
Say I have a million core processor and that instances of
Solver
are threadsafe but with no locking (they're immutable or something), and that all the computation they do is in memory (i.e. there's no disk or network or other stuff going on). Can I just replace the solutions list with a threadsafe list and kick off threads to solve each problem and expect it开发者_开发百科 to be faster? How much faster? Can it run in 5 seconds?Is there a decent cloud computing service out there for Java where I can buy 5 million seconds of time and get this code to run in five seconds? What do I need to do to prepare my code for running on such a cloud? How much does 5 million seconds cost anyway?
Thanks.
You have expressed your problem with two major points of serialisation: Problem production and solution consumption (currently expressed as Lists of integers). You want to get the first problems as soon as you can (currently you won't get them until all problems are produced).
I am assuming as well that there is a correlation between the problem list order and the solution list order – that is solutions.get(3)
is the solution for problems.get(3)
– this would be a huge problem for parallelising it. You'd be better off having a Pair<P, S>
of problem/solution so you don't need to maintain the correlation.
Parallelising the solver method will not be difficult, although exactly how you do it will depend a lot on the compute costs of each solve method (generally the more expensive the method the lower the overhead costs of parallelising, so if these are very cheap you need to batch them). If you end up with a distributed solution you'll have much higher costs of course. The Executor
framework and the fork/join extensions would be a great starting point.
You're asking extremely big questions. There is overhead for threads, and a key thing to note is that they run in the parent process. If you wanted to run a million of these solvers at the same time, you'd have to fork them into their own processes.
You can use one program per input, and then use a simple batch scheduler like Condor (for Linux) or HPC (for Windows). You can run those on Amazon too, but there's a bit of a learning curve, it's not just "upload Java code & go".
- Sure, you could use a standard worker-thread paradigm to run things in parallel. But there will be some synchronization overhead (e.g., updates to the solutions list will cause lock contention when everything tries to finish at the same time), so it won't run in exactly 5 seconds. But it would be faster than 5 million seconds :-)
- Amazon EC2 runs between US$0.085 and US$0.68 per hour depending on how much CPU you need (see pricing). So, maybe about $120. Of course, you'll need to set up something separate to distribute your jobs across various CPUs. One option might be just to use Hadoop (see this question about whether Hadoop is right for running simulations.
You could read things like Guy Steele's talk on parallelism for more info on how to think parallel.
Use an appropriate Executor. Have a look at http://download.oracle.com/javase/6/docs/api/java/util/concurrent/Executors.html#newCachedThreadPool()
Check out these article on concurrency:
- http://www.vogella.de/articles/JavaConcurrency/article.html
- http://www.baptiste-wicht.com/2010/09/java-concurrency-part-7-executors-and-thread-pools/
Basically, Java 7's new Fork/Join model will work really well for this approach. Essentially you can set up your million+ tasks and it will spread them as best it can accross all available processors. You would have to provide your custom "Cloud" task executor, but it can be done.
This assumes, of course, that your "solving" algorithm is rediculously parallel. In short, as long as the Solver is fully self-contained, they should be able to be split among an arbitrary number of processors.
精彩评论