
multi-threading question for large batch process

We have a batch process consisting of about 5 calculations that are performed on each row of data (20 million rows total). Our production server will have around 24 reasonably fast processors.

Performance is critical for us. Assuming that our algorithms are pretty efficient, what would be the best way to achieve maximum time performance for this? Specifically, should we be able to achieve better performance through multi-threading, using threadpools, etc? Also, could use of the Process object to divide up the batch into multiple programs be of benefit?


A few thoughts:

First, you need to be a bit more definite about what "best" means - there are trade-offs involved in processing at this scale. Memory, I/O and CPU utilization are all considerations: how much memory each calculation requires, how much data has to move in and out, and so on.

Assuming that you are the only process on the machine, you have lots of memory, and you are primarily interested in optimizing throughput, here are some suggestions:

  • You will definitely want some manner of multi-threaded processing.
  • Thread pools are a reasonable approach, but you need to make sure that I/O latency is not your most time-consuming step.
  • You may want to use a hand-off process to separate the calculation from persisting the results of the calculation. You will also probably want to batch the results so as to reduce the number of round trips to the database (see the sketch after this list).
  • You will also want to batch loading records into memory, again, to minimize trips to the database.
  • Avoid blocking operations and other memory barriers where possible to reduce locking contention.
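
As a rough sketch of the hand-off idea above (assuming .NET 4's BlockingCollection; the batch size and the Compute and SaveBatch methods are illustrative stand-ins, not your real code), calculation threads push results into a queue and a single writer thread persists them in batches:

```csharp
using System;
using System.Collections.Generic;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class HandOffSketch
{
    const int BatchSize = 1000;

    static double Compute(int i) { return Math.Sqrt(i); }             // stand-in for the real calculations
    static void SaveBatch(List<double> batch) { /* bulk insert/update goes here */ }

    static void Main()
    {
        // Bounded queue: calculation threads hand results to a single writer thread.
        var results = new BlockingCollection<double>(10000);

        Task writer = Task.Factory.StartNew(() =>
        {
            var batch = new List<double>(BatchSize);
            foreach (double r in results.GetConsumingEnumerable())
            {
                batch.Add(r);
                if (batch.Count == BatchSize)
                {
                    SaveBatch(batch);                  // one round trip per 1000 results
                    batch.Clear();
                }
            }
            if (batch.Count > 0) SaveBatch(batch);     // flush whatever is left
        });

        Parallel.For(0, 1000000, i => results.Add(Compute(i)));
        results.CompleteAdding();                      // tell the writer no more results are coming
        writer.Wait();
    }
}
```

The bounded capacity also throttles the calculation threads if the writer falls behind, which keeps memory use under control.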

In addition to thread pools, there is also the Task Parallel Library, which offers facilities for simplifying the development of such parallel computations. It is specifically designed to scale to the number of cores and optimize the way threads are used. There's also Parallel LINQ, which you may find useful.
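
As a rough sketch of what that can look like on .NET 4 (the Row type and Calculate method here are made up for illustration):

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

class Row { public double A, B, Result; }

class TplSketch
{
    // Stand-in for the real per-row calculations.
    static double Calculate(Row r) { return r.A * r.B + Math.Abs(r.A - r.B); }

    static void Main()
    {
        // Hypothetical in-memory batch; in practice rows would be loaded in chunks.
        Row[] rows = Enumerable.Range(0, 100000)
                               .Select(i => new Row { A = i, B = i % 7 })
                               .ToArray();

        // TPL partitions the loop across the available cores automatically.
        Parallel.ForEach(rows, row => { row.Result = Calculate(row); });

        // PLINQ version of the same kind of work, for query-shaped processing.
        double total = rows.AsParallel().Sum(row => row.Result);
        Console.WriteLine(total);
    }
}
```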


Overall, if you can wait for .NET 4, PFX (the Parallel Extensions) is likely to be the best model.

Until then, avoid lots of process/thread starts and ends, i.e. use the threadpool (starting a process is extremely expensive, starting a thread is very expensive).

Simple approach: batch up the calculations into jobs that should each complete in roughly 50ms, and then start queuing them up. The hard part is ensuring everything has completed. A simple completion mechanism is a shared "completed" counter that each job increments; the main thread spins reading the counter until it reaches the expected final value.
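
A sketch of that pattern, where DoJob is a hypothetical stand-in for roughly 50ms worth of row calculations:

```csharp
using System;
using System.Threading;

class CounterCompletionSketch
{
    static int _completed;                                  // shared completion counter

    static void DoJob(int index) { Thread.Sleep(50); }      // simulate ~50ms of work

    static void Main()
    {
        const int jobCount = 100;

        for (int i = 0; i < jobCount; i++)
        {
            int jobIndex = i;
            ThreadPool.QueueUserWorkItem(_ =>
            {
                DoJob(jobIndex);                             // one ~50ms batch of row calculations
                Interlocked.Increment(ref _completed);       // mark this job as done
            });
        }

        // Main thread spins until every job has reported completion.
        while (Thread.VolatileRead(ref _completed) < jobCount)
            Thread.Sleep(1);

        Console.WriteLine("All jobs finished.");
    }
}
```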


This depends a lot on what the "5 calculations" constitute. If there is any significant computation required to perform those 5 calculations, then multithreading will be a huge benefit. The smaller the amount of work, the more care will need to go into partitioning in order to get a good gain.

Given that this is running "on each row of data", the most efficient way to handle this (if possible), would be to update this directly within your database. Pulling the data client side, processing, and repopulating will be much slower than trying to do the calculation directly in the DB. Most database servers will have good support for threading on their own, and do a good job of optimizing an update, so if you can work this to process the data directly in the DB, you will get the best performance.
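
If the calculations can be expressed in SQL, one set-based statement issued from the client may be all that is needed. A rough sketch, where the table and column names (dbo.Measurements, ColA through ColD, Result) and the connection string are purely hypothetical:

```csharp
using System;
using System.Data.SqlClient;

class InDatabaseUpdateSketch
{
    static void Main()
    {
        // The real calculation would replace this placeholder expression.
        const string sql =
            @"UPDATE dbo.Measurements
              SET Result = (ColA * ColB) + (ColC / NULLIF(ColD, 0))
              WHERE Result IS NULL;";

        using (var conn = new SqlConnection("Data Source=.;Initial Catalog=Prod;Integrated Security=True"))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.CommandTimeout = 0;       // a 20-million-row update can run for a while
            conn.Open();
            int rows = cmd.ExecuteNonQuery();
            Console.WriteLine("{0} rows updated", rows);
        }
    }
}
```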

If that's not possible, then I'd recommend looking into using the Task Parallel Library to handle this. Running on .NET 4 will be especially helpful, since the work stealing added to the thread pool will give you a better overall throughput.


Threadpools are a safe and easy way to do this - note that WaitHandle.WaitAll can only wait on 64 handles at once, which caps how many outstanding work items you can wait on with that pattern. Using the Process object just introduces new issues and complexities around debugging that aren't worth the perceived trade-offs - especially considering that any value you would gain comes from the parallelism that the pool already gives you.
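
For illustration, here is the WaitHandle-based completion pattern that runs into that 64-handle ceiling; ProcessSlice and the partition count of 24 are assumptions, not part of the original answer:

```csharp
using System;
using System.Threading;

class WaitHandleSketch
{
    static void ProcessSlice(int slice) { Thread.Sleep(100); }   // stand-in for one partition of the rows

    static void Main()
    {
        // WaitHandle.WaitAll accepts at most 64 handles, hence the limit mentioned above.
        const int workers = 24;
        var done = new ManualResetEvent[workers];

        for (int i = 0; i < workers; i++)
        {
            done[i] = new ManualResetEvent(false);
            int slice = i;
            ThreadPool.QueueUserWorkItem(state =>
            {
                ProcessSlice(slice);
                ((ManualResetEvent)state).Set();                  // signal completion of this partition
            }, done[i]);
        }

        WaitHandle.WaitAll(done);   // blocks until all 24 partitions have signalled
        Console.WriteLine("Batch complete.");
    }
}
```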


Only a detailed assessment can reveal the best way to optimize this task, but a pool of threads will very likely bring improvements. Identify the most common operations and distribute them across the pool. Above all, measure performance: only then can you know where the bottlenecks are and where to focus your optimization effort.


I'd suggest doing this within a database procedure, if possible. Otherwise, it probably doesn't matter how efficient your client-side processing is, the time will be dominated by marshalling the data back and forth across the network. Even if you run the process on the same machine, you can incur the penalty of serializing everything through your (presumably ODBC) driver. Unless, of course, you write a native procedure that can run within the address space of your database server (if your server supports that).

I guess I'd suggest writing a procedure that takes a lower and upper bound for selecting records, then writing a client-side program that forks off a few threads, allocates a DB connection per thread, then calls the server-side procedure with appropriately-sized bounds (say five threads with four million rows apiece). If your DB server is multithreaded, then this should give you decent performance.
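
A sketch of that layout, assuming a hypothetical dbo.ProcessRange stored procedure that takes the row bounds as parameters, plus a placeholder connection string:

```csharp
using System;
using System.Data;
using System.Data.SqlClient;
using System.Threading;

class PartitionedProcSketch
{
    const int TotalRows = 20000000;
    const int ThreadCount = 5;          // five threads, roughly four million rows apiece

    static void RunProc(int lower, int upper)
    {
        // Each thread gets its own connection and processes one range on the server.
        using (var conn = new SqlConnection("Data Source=.;Initial Catalog=Prod;Integrated Security=True"))
        using (var cmd = new SqlCommand("dbo.ProcessRange", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;
            cmd.Parameters.AddWithValue("@LowerBound", lower);
            cmd.Parameters.AddWithValue("@UpperBound", upper);
            cmd.CommandTimeout = 0;
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }

    static void Main()
    {
        var threads = new Thread[ThreadCount];
        int chunk = TotalRows / ThreadCount;

        for (int i = 0; i < ThreadCount; i++)
        {
            int lower = i * chunk + 1;
            int upper = (i == ThreadCount - 1) ? TotalRows : (i + 1) * chunk;
            threads[i] = new Thread(() => RunProc(lower, upper));
            threads[i].Start();
        }
        foreach (var t in threads) t.Join();
    }
}
```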

With any multithreaded approach, though, be aware that if you're updating many rows you can wind up with locking problems due to lock escalation if you don't commit your transactions often enough.


If you're using SQL Server 2005/2008, consider adding your calculations to SQL Server as CLR functions: http://msdn.microsoft.com/en-us/library/ms254498%28VS.80%29.aspx. This is much faster than doing the calculations in T-SQL and saves you the cost of moving data in and out of the database. SQL Server would manage the threads for you. You could also experiment with opening multiple connections, with each one working on a different set of rows, to gauge the impact on performance, connection time, etc.
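
A minimal sketch of such a CLR scalar function; the formula is only a placeholder for the real calculations:

```csharp
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public class RowCalculations
{
    // Deployed to SQL Server 2005/2008 as a CLR scalar function.
    [SqlFunction(IsDeterministic = true, IsPrecise = false)]
    public static SqlDouble CalcResult(SqlDouble a, SqlDouble b)
    {
        if (a.IsNull || b.IsNull) return SqlDouble.Null;
        return a * b + (a - b);     // placeholder calculation
    }
}
```

After compiling, the assembly would be registered with CREATE ASSEMBLY and the method exposed with CREATE FUNCTION, so it can be called from a set-based UPDATE and SQL Server can schedule the work itself.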

