Load-balancing in a parallel processing application
I'm building a network-distributed parallel processing application that uses a combination of CPU and GPU resources across many machines.
The app has to perform some very computationally expensive operations on a very large dataset over thousands of iterations:
    for (int step = 0; step < requested_iterations; step++)
        for (int i = 0; i < width; i++)
            for (int j = 0; j < height; j++)
                for (int k = 0; k < depth; k++)
                    matrix[i, j, k] = G * f(matrix[i, j, k]);
Also, the matrix operations have to be executed synchronously: that is, each iteration depends on the results of the iteration (frame) immediately before it.
The hardware available in this ad-hoc grid, comprising both dedicated servers and idle desktop machines, varies greatly in performance from machine to machine. I'm wondering what the best way is to balance the workload across the entire system.
Some idiosyncrasies:
The grid should be as robust as possible. Some simulations require weeks to run, and it would be nice not to have to cancel a run if one out of 100 machines goes offline.
Some of the lower-end machines (desktops that are idle but have to wake up when someone logs in) may join and leave the grid at any time.
The dedicated servers may also join and leave the grid, but this is predictable.
So far, the best idea I've been able to come up with is:
- Have each node track the time it takes to process a group of n cells in the matrix (cells processed per unit time) and report this to a central repository.
- Weight this time against the total time for a frame of the simulation (across the entire grid) and the total size of the problem domain. So, each node would get a score expressed in work units (matrix cells) per unit time, and a scalar rating expressing its performance versus the rest of the grid.
- On each frame, distribute the workload based on those scores so that each machine finishes as close to the same time as possible. If machine A is 100x faster than machine B, it will receive 100x as many matrix cells to process in a given frame (assuming that the matrix size is large enough to warrant including the extra machines). A rough sketch of this split follows the list.
- Nodes that leave the grid (desktops that are logged into, etc.) will have their workload redistributed among the remaining nodes.
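For illustration, here's a rough C# sketch of how I imagine that proportional split working (the class and parameter names are placeholders, not from my actual code):

    using System.Collections.Generic;
    using System.Linq;

    static class FrameBalancer
    {
        // cellsPerSecond: each node's self-reported throughput.
        // totalCells: width * height * depth for the current frame.
        public static Dictionary<string, long> SplitFrame(
            IReadOnlyDictionary<string, double> cellsPerSecond, long totalCells)
        {
            double gridRate = cellsPerSecond.Values.Sum();
            var shares = new Dictionary<string, long>();
            long assigned = 0;
            foreach (var (node, rate) in cellsPerSecond)
            {
                long share = (long)(totalCells * (rate / gridRate)); // proportional slice
                shares[node] = share;
                assigned += share;
            }
            // Give the rounding remainder to the fastest node so every cell is covered.
            string fastest = cellsPerSecond.OrderByDescending(kv => kv.Value).First().Key;
            shares[fastest] += totalCells - assigned;
            return shares;
        }
    }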
Or,
Arrange the nodes in a tree structure, where each node has a "weight" assigned. Nodes that are higher in the tree have a weight based on their own ability combined with that of their children. This weight is adjusted per frame. When a node loses communication with its child, it uses a cached tree graph to contact the orphaned children and re-balance its branch.
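Something like this is what I have in mind for the per-node weight bookkeeping (again, just a sketch with made-up names):

    using System.Collections.Generic;
    using System.Linq;

    // A node's advertised weight is its own measured capability plus its children's,
    // recomputed each frame and reported up to the parent.
    class GridNode
    {
        public string Id = "";
        public double OwnCapability;          // e.g. cells per second measured locally
        public List<GridNode> Children = new();

        public double Weight => OwnCapability + Children.Sum(c => c.Weight);
    }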
If it makes a difference, the app is a combination of C# and OpenCL.
Links to papers, example apps, and especially tutorials are welcome.
Edit
This isn't homework. I'm turning a simulator I wrote as part of my thesis into a more useful product. Right now the work is distributed uniformly with no accounting for performance of each machine, and no facility to recover from machines joining or leaving the grid.
Thanks for the excellent, detailed responses.
For heterogeneous clusters, I like to let each processor request a new job as it becomes available. The implementation involves a lightweight server that can handle many requests at a time (but usually only returns a job number), and might go something like this (a sketch follows the list):
- Break the job down into its smallest components (we know there are 1000 tasks now)
- Start a network server (preferably UDP with timeouts to avoid network congestion) which counts upwards
- Start your cluster processes.
- Each process asks, "What job number should I perform?" and the server replies with a number
- As the process finishes, it asks for the next job number. When all tasks are complete, the server returns a -1 to the processes, so they shut down.
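A minimal C# sketch of what that counter server and its worker loop might look like (the class, port, and delegate names are invented for illustration; a real version would add the UDP timeouts and retry logic mentioned above):

    using System;
    using System.Net;
    using System.Net.Sockets;
    using System.Text;

    static class JobCounter
    {
        // Server: hands out job numbers 0..totalTasks-1, then -1 to signal shutdown.
        public static void RunServer(int port, int totalTasks)
        {
            using var server = new UdpClient(port);
            var remote = new IPEndPoint(IPAddress.Any, 0);
            int next = 0;
            while (true)
            {
                server.Receive(ref remote);               // any datagram means "give me a job"
                int job = next < totalTasks ? next++ : -1;
                byte[] reply = Encoding.ASCII.GetBytes(job.ToString());
                server.Send(reply, reply.Length, remote);
            }
        }

        // Worker: ask, process, repeat until the server replies -1.
        public static void RunWorker(string host, int port, Action<int> processTask)
        {
            using var client = new UdpClient();
            client.Connect(host, port);
            var from = new IPEndPoint(IPAddress.Any, 0);
            while (true)
            {
                client.Send(new byte[] { 1 }, 1);         // any payload works as a request
                int job = int.Parse(Encoding.ASCII.GetString(client.Receive(ref from)));
                if (job < 0) break;
                processTask(job);                         // the actual matrix-slab work
            }
        }
    }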
This is a lighter-weight alternative to what you suggest above. Your fast processors still do more work than your slower machines, but you don't have to calculate how long the tasks take. If a processor drops out for whatever reason, it will stop asking for tasks. Your server could choose to recycle task numbers after a certain amount of time.
This is pretty much what a cluster scheduler would do on its own, except the processors don't have startup and shutdown costs, so your individual tasks can be smaller without penalty.
I would go for a decentralized solution.
Every node picks (rather than is given) the same amount of work from the center. After a few runs, every node is able to determine its own average computational power and communicate it to the others.
Eventually, every node will have a table of every other node's average computational power. With this information (it could even be persistent, why not?), each node can decide to "ask" another node with more power to take on some work by signing a contract.
Before starting any piece of work, a node broadcasts: "I am starting X". Once finished, it always broadcasts: "I finished X".
Well, it's not that easy, because there will be cases where you start a job, your hard disk fails, and you never finish it. The others, especially those waiting on a result from you, have to figure this out, pick your job back out of the basket, and start it over from the beginning. This is where a "ping" technique with a timer comes in.
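A rough sketch of the kind of bookkeeping each node could keep for this (all names and the timeout value are made up):

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;

    // Each peer remembers who announced which job and when, and reclaims any job
    // whose owner has stopped answering pings.
    record Announcement(string NodeId, int JobId, DateTime StartedUtc);

    class JobWatchdog
    {
        private readonly ConcurrentDictionary<int, Announcement> _inFlight = new();
        private readonly TimeSpan _pingTimeout = TimeSpan.FromMinutes(5);   // illustrative value

        public void OnStarted(string node, int job) =>
            _inFlight[job] = new Announcement(node, job, DateTime.UtcNow);

        public void OnFinished(int job) => _inFlight.TryRemove(job, out _);

        // lastSeen maps node id -> time of its last ping reply; called periodically.
        public IEnumerable<int> JobsToReclaim(IReadOnlyDictionary<string, DateTime> lastSeen)
        {
            foreach (var a in _inFlight.Values)
                if (!lastSeen.TryGetValue(a.NodeId, out var seen) ||
                    DateTime.UtcNow - seen > _pingTimeout)
                    yield return a.JobId;   // owner looks dead: put the job back in the basket
        }
    }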
Bad: the initial tuning phase can take a non-trivial amount of time.
Good: you end up with an almost fault-tolerant solution. Leave the nodes alone for a week, and even if some of them fail, your grid stays alive and keeps doing its work.
Many years ago I did something like this with pretty good results, though definitely not at the scale you describe. And scale, actually, makes a difference.
So the choice is up to you.
Hope this helps.
I wouldn't bother tracking those stats too much at the server level. You're going to introduce a fair amount of overhead.
Instead, the control server should just maintain a list of work units. As a client becomes available, let it grab the next unit in line and process it. Rinse, repeat.
Once the list of work units for a given matrix is exhausted, allow currently incomplete work units to be reassigned.
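A minimal sketch of that bookkeeping might look like this (the names are my own invention):

    using System.Collections.Generic;
    using System.Linq;

    // Hand out the next unit in line; once the list is exhausted, hand out
    // still-incomplete units again and simply discard duplicate results.
    class WorkUnitList
    {
        private readonly Queue<int> _unassigned;
        private readonly List<int> _inFlight = new();
        private readonly HashSet<int> _completed = new();

        public WorkUnitList(int unitCount) =>
            _unassigned = new Queue<int>(Enumerable.Range(0, unitCount));

        // Returns the next unit to work on, or null when the whole matrix is done.
        public int? NextUnit()
        {
            if (_unassigned.Count > 0)
            {
                int unit = _unassigned.Dequeue();
                _inFlight.Add(unit);
                return unit;
            }
            foreach (int unit in _inFlight)     // reassign whatever is still incomplete
                if (!_completed.Contains(unit))
                    return unit;
            return null;
        }

        public void MarkDone(int unit)
        {
            _completed.Add(unit);
            _inFlight.Remove(unit);
        }
    }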
Examples based on a matrix containing 10 work units and 5 servers.
Equally fast, all available:
Server 1 checks in and grabs unit 1. This proceeds for the next 4 machines (i.e., Server 2 gets unit 2...). When unit 1 is done, Server 1 then grabs unit 6. The others grab the rest. Once the last server checks in, the matrix is done.
Slightly disparate performance, all available:
You start the round robin again and the first 5 units are acquired by the servers. However, Server 1 takes 30% longer than the others. This means Server 2 will grab unit 6, etc. At some point Server 1 will check in unit 1; meanwhile units 2 through 5 will have been completed and 6 through 10 will have been assigned. Server 1 is assigned unit 6, as it's not done yet. However, Server 2 will check in its completed work before Server 1 finishes. No big deal, just throw away that last result.
Hugely disparate performance, all available:
You start the round robin again and the first 5 units are acquired by the servers. Let's say Server 1 takes 400% more time than the others. This means Server 2 will grab unit 6, etc. After Server 2 checks in unit 6, it will see that unit #1 is still in process. Go ahead and assign it to Server 2, which will complete it before Server 1 returns.
In this case you should probably monitor for those machines that are consistently reporting work late and drop them from further consideration. Of course, you will have to make some allowances for those that go offline due to shutdown or personal usage. Probably some type of weighted rating where, once it drops below a certain threshold, you simply deny the machine further work; perhaps the rating is reset every so often to allow rebalancing once the machine settles back into a steady state.
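Something along these lines could work for the rating (the constants here are guesses, not recommendations):

    using System;

    // A per-machine reliability rating that decays when work comes back late and
    // blocks further assignment once it falls below a threshold.
    class MachineRating
    {
        public double Score { get; private set; } = 1.0;

        public void ReportResult(bool wasLate) =>
            Score = wasLate ? Score * 0.5 : Math.Min(1.0, Score + 0.1);

        public bool EligibleForWork => Score >= 0.25;

        public void PeriodicReset() => Score = 1.0;   // allow rebalancing later
    }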
Machine disappears
This has the exact same plan as the "Huge Disparate Performance" listed above. The only difference is that the machine will either never report in, or will do so after some unknown amount of time.
If for some reason you have more machines than units then an interesting thing happens: multiple servers will be assigned the same work unit right off the bat. You can either stop this by putting in place some type of delay (like a unit must be in process for x minutes before allowing it to be reassigned) or simply allow it to happen. This should be thought through.
What have we done? First, we alleviated the need to track individual performance. Second, we've allowed machines to just disappear while making sure the work still gets completed. Third, we've ensured that the work will be completed in the least amount of time possible.
It's a little more chatty than simply assigning blocks of multiple units to machines based on performance; however, it allows even the fast machines to be unplugged from the network while ensuring total recoverability. Heck, you could kill all of the machines and later turn on some of them to pick up where you left off.