What operations cause parallel code to run slowly?
Reading this: The Hitchhiker's Guide to Concurrency, more specifically the section about Amdahl's Law - the idea that a parallel program only goes as fast as its slowest part, and that the more parallel a program is from the beginning, the faster it will be after the introduction of more cores - I find myself wondering: how can I ensure that I write code that is as parallel as possible from the start? How can I ensure that my code would reap the maximum possible benefits of adding multiple cores? And also, what kinds of operations would cause code not to be parallel, or parallel code to be slow? Code examples would of course be appreciated.
Not all code is inherently able to be made to run in parallel.
Parallel execution means at the same time, simultaneously. Code can only run simultaneously if it does not rely on the end result of the other code running alongside it. If you are evaluating an expression like this:
((((x+y)+z)+a)*b)
Each bracket must be worked out before the next stage, so the operations cannot be done in parallel. To make a program parallel, it is important to identify when a large task can be broken up into pieces that can be done at the same time.
Consider a summation: I need to add up 100,000 numbers.
int sum = 0;
for (int i = 0; i < 100000; i++) {
    sum += numbers[i];
}
Addition is commutative and associative, so a + b + c + d can be split up into (a + b + c) + d, a + (b + c) + d, (a + b) + (c + d), etc. The last case is an even distribution of the work: half in one bracket, half in the other.
So suppose we did the big task like this:
int sumA = 0;
for (int i = 0; i < 100000 / 2; i++) {
    sumA += numbers[i];
}

int sumB = 0;
for (int i = 100000 / 2; i < 100000; i++) {
    sumB += numbers[i];
}

int sum = sumA + sumB;
Split in two, the two loops could run at the same time and still give the same answer; we just have to recombine the partial sums at the end.
This is the key to parallel programming: splitting up work into sections for each worker (CPU/node/machine), then collecting the results and assembling the final answer. We call this scattering and gathering.
Many computing tasks can be split up; some cannot. A program that lends itself well to being split up is ideal to parallelize; one that does not is not. Consider also that there is significant overhead in scattering and gathering - splitting up the data, recombining it, possibly transferring it - so it is not always worth dividing the task.
So, in answer to your question: it is not so much what you can do to make your program parallel as whether or not your program is naturally able to be so.
Using a purely functional language (where there are no side effects) is a good start (examples: Haskell, Lisp, etc.)
If you're in a more OO environment, write immutable classes when you can (in Java: use final fields, and have those fields be of immutable types themselves), and encapsulate your mutable fields so you don't have to worry about them being changed by code running on another thread.
Finally, even if your code is extremely parallelizable, it can still run slowly! Near the end of the project, profile your app and work on the slowest parts first.
Good luck!
As with many concepts in computer science: it depends.
First of all, the Erlang way of creating a parallel application is to use the Erlang VM's lightweight processes. The scheduler allocates execution time on different cores on a per-process basis. More processes mean more opportunities for processes to execute in parallel, but this doesn't by itself mean the application benefits.
Basically: http://www.erlang.org/doc/efficiency_guide/processes.html
To gain performance by using the SMP emulator, your application must have more than one runnable Erlang process most of the time. Otherwise, the Erlang emulator can still only run one Erlang process at the time, but you must still pay the overhead for locking. Although we try to reduce the locking overhead as much as possible, it will never become exactly zero.
Other properties of your application must also be taken into consideration. One major benefit of Erlang/OTP is the isolation between processes. For example, a web server or telecom node may handle each incoming request/call in parallel by spawning a process per request. But the other major benefits are process (per-request) isolation, supervision, and fault-tolerant behavior that "come for free". Erlang/OTP was made to perform well in these situations.
On the other hand, an application built from purely sequential operations (e.g. lists:foldl/3) may not benefit at all from spawning processes. It may instead lose performance due to the process-spawning overhead.
Also think about what other parts of your system parallel execution may affect. By spawning many processes that all access the same resource, you may run into, e.g., I/O problems. I've written parallel code that ran super fast but was forced to remove the parallelism because accessing external resources became the bottleneck.
After some rambling I'll try to answer your question.
- Parallelization is achieved by spawning more than one process that can execute at the same time as another process.
- There is a cost involved when spawning processes, even if it's cheap.
- Sequential operations may not benefit at all from spawning a process per operation; on the contrary, it may hurt performance.
- Other major benefits may be gained when spawning processes in Erlang/OTP. Make sure they work together with your application.
- Watch out for scaling problems when accessing the same type of resource from parallel processes.
- Test both with and without parallel execution by spawning or not spawning processes for your operations.
- Test some more.
Keep in mind that typically, the serialized portion of the application is dependent on the problem size. Also typically, the serialized portion is either fixed, or grows slowly with problem size, so that overall as you increase the problem size, the percentage influence of the serial portion diminishes. See also Gustafson's law.
Typically, the serialized portion of your application will be obvious. One common pattern is a single server process that handles requests, e.g. from a network socket, and distributes work to worker processes. The whole system can go no faster than that single process can read from the network and distribute tasks to workers. To increase the parallelism, you need to have multiple servers in parallel.
Another similar example is having a single process take control over a shared resource. In order to request subresources of this shared resource, all requests from multiple sources must be serialized. One way to increase parallelism here is to have multiple controllers, each owning a subsection of the full resource; when requesting access, a source picks which controller to ask at random (or by some other way of distributing requests semi-evenly). This is essentially sharding the resource.
how can I ensure that I write code that is as parallel as possible from the start?
In general, the problem has to do with data dependency. An operation depends on another operation if it needs that operation's result in order to begin executing itself. There are some important tricks to be played here. Suppose you know an operation can only have 2 possible outcomes. Then you can start both computations and later "select" the right one. In fact, CPU design often exploits this idea in computational units (see for instance the Wikipedia article on carry-select adders). Another trick applies when you know that the result is going to be a specific value with, say, 99% certainty. Then you can speculatively execute the next operation and hope your prediction holds. If it fails, you can always roll back the computation and redo it. This is also done in modern CPUs; see for instance branch prediction with speculative execution and reorder buffering.
At this point, it should be clear to you that modern CPU architectures already employ a vast slew of parallelization tricks. The problem is that they have squeezed out all the local parallelization we can hope to get and now we need to take the ideas "one-level-up" into our programs.
How can I ensure that my code would reap the maximum possible benefits of adding multiple cores?
What kills parallel computation is dependencies. If you depend on another part of the computation, you will need to communicate between parallel threads of execution. This in turn leads to stalls, since you are waiting on other parts to send messages. An often-used trick is latency hiding: rather than wait for the message to arrive, you do something else in between, hoping that the data has fully arrived by the time you need it. Better still, arrange your code so that it does not have to communicate at all.
This is why functional programming is seen as a powerful tool for parallelism. In FP there is no shared state, so the code is already in a form where it is easy to fork out small bundles of computation to different CPUs. Haskell is, in my opinion, the language with the most maturity for this idea, thanks to its par combinator.
In general, if your program has few dependencies in the data flow, then it is easy to make fast when adding multiple cores. You want to avoid serialization choke points in your data flow as much as possible.
Erlang: My hesitation to mention Erlang is that Erlang gets its parallelism indirectly. In Erlang we describe a program as a set of concurrent activities, namely a set of processes that work together to solve a problem. Processes are isolated and communicate by sending each other messages. Erlang's main unique strength is fault tolerance: an error in one process cannot possibly affect another process, thanks to isolation - so you can build software that recovers even when a single process goes into a walking-zombie state.
Now this model lends itself heavily to an apparent parallel evaluation strategy: while one core is handling one process, excess cores can grab other processes to handle. Again, if processes are not dependent on each other, waiting for a message to arrive, then adding more cores will yield a speedup.
It is no magical silver bullet, though. If you serialize all your messages through a single process as a choke point, your code will not be parallel; Amdahl or Gustafson will come into play and you are left with a serial program.