Multithreaded app in multicore environment - weird load per core
Environment: Xeon processor with 16 cores, OS: Windows Server 2008 R2.
The application (.NET/C#), before parallelization, loads one core at almost 100%. The obvious way to gain something was to use the .NET 4 Task Parallel Library to speed the application up several times. Assume the parallelized part is really well suited to it: no locking occurs between threads (no shared resources, each parallel task is completely independent). But to my regret the gain is really low: the 16-threaded app runs only approx. 2 times faster than the sequential one.
Here is the first illustration - 16 threads on 16 cores
It seems really weird: all tasks are equal, yet the first 8 cores are loaded at almost the same level (~30%) and the other 8 show progressively decreasing load.
So I tried different configurations, for example 8 threads on 16 cores
It looks like the 8 threads are all running on 8 cores and are not moved from one core to another. Moreover, with 8 cores the average load per core is higher than with 16.
I did some research with a profiler: each thread behaves the same as in the single-threaded case in terms of the percentage of time spent in different methods. The only (and main) difference is the absolute time, which keeps growing as the number of threads grows (as if the performance of each core were degrading).
So the main tendencies that I can't explain: more threads mean a lower average load per core, and overall CPU usage is about 20-25% at most. And each operation in a thread runs slower as the number of threads grows.
Any ideas to explain these weird things?
UPD
After applying Server GC the picture has changed significantly
8 threads on 16 cores illustration:
12 threads on 16 cores illustration:
15 threads on 16 cores illustration:
So it looks like CPU usage increases as the number of threads grows. The first thing that bothers me is that all of the cores appear to be in use and the threads are jumping from core to core, so overall performance is not as good as it could be.
The second thing is that the app reaches its maximum speed at 12 threads; 15 threads give the same results, and 16 threads are even slower.
What is the possible reason?
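For anyone reproducing the Server GC change from the update: it is normally switched on with `<gcServer enabled="true"/>` in the `<runtime>` section of the application's config file, and the effective mode can be checked at run time. A minimal sketch, where the class and method names are made up for illustration:

```csharp
using System;
using System.Runtime;

public static class GcModeCheck
{
    // Hypothetical helper: reports the GC mode and the logical processor count
    // that the runtime actually sees for this process.
    public static string Describe()
    {
        return string.Format(
            "Server GC: {0}, latency mode: {1}, logical processors: {2}",
            GCSettings.IsServerGC,
            GCSettings.LatencyMode,
            Environment.ProcessorCount);
    }

    public static void Main()
    {
        Console.WriteLine(Describe());
    }
}
```

If this reports `Server GC: False` after the config change, the setting was probably not picked up by the process being measured.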
The pattern that you are seeing is often an indication of an I/O bottleneck. If your disks or network are running full-out to provide data to these calculations (or handle the results), then you could run it on a million cores with no additional benefit. I'd suggest using Sysinternals Process Explorer to examine network and disk I/O and see if there is an issue there before trying to get further into why this isn't parallelizing well.
Since it sounds like you have no synchronization internal to your method, the problem is likely in the partitioning.
Given that you're using the TPL, work must get sent to cores based on a partitioner. However, the actual source IEnumerable<T> is not thread safe, so items have to be pulled from it by one thread at a time. This, in effect, will often lead to performance characteristics like the ones you are showing above if the work per item is small compared to the number of items.
The way around this is to use the Partitioner class to pre-partition your work items into blocks, and then iterate through the "blocks" of items in parallel. For details, see How to: Speed Up Small Loop Bodies.
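As a rough sketch of that approach, assuming the work items can be addressed by an integer index: `Partitioner.Create(0, count)` yields index ranges, and each parallel task then runs an ordinary sequential loop over its range. `Process` is a hypothetical stand-in for the real per-item work, and the class name is made up for illustration.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class ChunkedLoopSketch
{
    // Hypothetical per-item work; replace with the real computation.
    public static double Process(int i)
    {
        return Math.Sqrt(i);
    }

    public static double[] Run(int count)
    {
        var results = new double[count];
        // Partitioner.Create throws if the range is empty, so handle that case first.
        if (count <= 0) return results;

        // Split [0, count) into contiguous index ranges (Tuple<int, int>).
        var ranges = Partitioner.Create(0, count);

        // One delegate call per range instead of one per item.
        Parallel.ForEach(ranges, range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
            {
                // Each index is written by exactly one task, so no locking is needed.
                results[i] = Process(i);
            }
        });
        return results;
    }

    public static void Main()
    {
        double[] r = Run(1000000);
        Console.WriteLine(r.Length);
    }
}
```

Whether this helps depends on how cheap each item is; if the per-item work is already heavy, the partitioning overhead is probably not the bottleneck.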