Why is multithreading this code leading to such inconsistent timings?

I have a function that processes a large image. By specification, the largest this image can be is 55 MB. The processing entails breaking the image into several different bands and then reconstituting the image by adding those bands back into an output image. Because the image is so large, I can't keep all four band images plus the input and output images in memory simultaneously on a 32-bit system. Instead, I put each image on disk and read it back in portions.

Prior to multithreading, the pseudocode looks like:

for (y is 0 to ysize) {
    unsigned short* ptr1 = ReadLineFromDisk(image1, y)
    unsigned short* ptr2 = ReadLineFromDisk(image2, y)
    unsigned short* ptr3 = ReadLineFromDisk(image3, y)
    unsigned short* ptr4 = ReadLineFromDisk(image4, y)
    unsigned short* outPtr = &(outImage[y*xsize])
    for (x is 0 to xsize, ++x, ++ptr1, ++ptr2, ++ptr3, ++ptr4, ++outPtr) {
        *outPtr = combination of *ptr1, *ptr2, *ptr3, *ptr4;
    }
}

This code runs in 3 seconds on a dual-core machine with a standard 500 GB hard drive, timed with a high-performance counter.

If I increase the number of lines read from disk to something like 100, and then step through the image in chunks of that size, the code looks like:

chunksize = 100;
for (y is 0 to ysize by chunksize) {
    unsigned short* ptr1 = ReadChunkFromDisk(image1, y)
    unsigned short* ptr2 = ReadChunkFromDisk(image2, y)
    unsigned short* ptr3 = ReadChunkFromDisk(image3, y)
    unsigned short* ptr4 = ReadChunkFromDisk(image4, y)
    unsigned short* outPtr = &(outImage[y*xsize])
    for (x is 0 to xsize*chunksize, ++x, ++ptr1, ++ptr2, ++ptr3, ++ptr4, ++outPtr) {
        *outPtr = combination of *ptr1, *ptr2, *ptr3, *ptr4;
    }
}

This code is faster than the previous code, down to 1.5 seconds.

Question 1: Why is that code faster?

I hypothesize that it's faster because, in my experience, large, contiguous reads are faster than smaller ones for the same amount of data. That is, if I read 100 lines of data all at once, that's faster than 100 individual reads, at least for a regular (non-SSD) hard drive. Is my hypothesis close to correct?
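
For what it's worth, ReadChunkFromDisk is essentially a seek plus one large sequential read. A simplified version (assuming each band is stored on disk as raw rows of xsize unsigned shorts with no header, and with the sizes passed explicitly here) looks like:

#include <cstdio>
#include <vector>

// Simplified chunk reader: one seek plus one big sequential read replaces
// `rows` separate line reads. Assumes raw rows of xsize unsigned shorts.
std::vector<unsigned short> ReadChunkFromDisk(FILE* f, int y, int rows, int xsize)
{
    std::vector<unsigned short> buf(static_cast<size_t>(rows) * xsize);
    std::fseek(f, static_cast<long>(y) * xsize * sizeof(unsigned short), SEEK_SET);
    std::fread(buf.data(), sizeof(unsigned short), buf.size(), f);  // no error handling in this sketch
    return buf;
}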

Even so, the processor is not being used intensively here. And increasing the chunk size only gets me so far: 1.5 seconds is the best I can get, and past a certain point performance appears to drop off a bit (I'm not sure why that would be either, except that maybe disk caching is playing a role). That leads me to

Question 2: Why would there be a sweet spot in the chunk size?

If I understand things here (and I don't think I really do), if everything could be held in memory, processing would be extremely quick because there would be no disk hits. If reading more at once also makes things faster, wouldn't reading in, say, a quarter of the image at a time be only a slight speed hit?

So then I switch to placing the outer loop in a lambda expression and using Intel's TBB to thread the code, something like:

chunksize = 100;
parallel_for (y is 0 to ysize by chunksize, in a lambda expression) {
    unsigned short* ptr1 = ReadChunkFromDisk(image1, y)
    unsigned short* ptr2 = ReadChunkFromDisk(image2, y)
    unsigned short* ptr3 = ReadChunkFromDisk(image3, y)
    unsigned short* ptr4 = ReadChunkFromDisk(image4, y)
    unsigned short* outPtr = &(outImage[y*xsize])
    for (x is 0 to xsize*chunksize, ++x, ++ptr1, ++ptr2, ++ptr3, ++ptr4, ++outPtr) {
        *outPtr = combination of *ptr1, *ptr2, *ptr3, *ptr4;
    }
}
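
Filled in with real types, the TBB version is roughly the following (using the simplified ReadChunkFromDisk from above; the per-pixel combination is just a placeholder sum):

#include <tbb/parallel_for.h>
#include <algorithm>
#include <cstdio>
#include <vector>

// Simplified reader from above. Note: if it seeks and reads a FILE* shared
// across threads, the seek+read pair needs its own lock to be thread-safe.
std::vector<unsigned short> ReadChunkFromDisk(FILE* f, int y, int rows, int xsize);

void CombineBands(FILE* band[4], unsigned short* outImage,
                  int xsize, int ysize, int chunksize)
{
    tbb::parallel_for(0, ysize, chunksize, [&](int y) {
        int rows = std::min(chunksize, ysize - y);   // last chunk may be short
        auto c1 = ReadChunkFromDisk(band[0], y, rows, xsize);
        auto c2 = ReadChunkFromDisk(band[1], y, rows, xsize);
        auto c3 = ReadChunkFromDisk(band[2], y, rows, xsize);
        auto c4 = ReadChunkFromDisk(band[3], y, rows, xsize);
        unsigned short* out = &outImage[static_cast<size_t>(y) * xsize];
        for (size_t i = 0; i < c1.size(); ++i)
            out[i] = c1[i] + c2[i] + c3[i] + c4[i];  // placeholder combination
    });
}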

This code ranges in speeds from 0.4 seconds to 1.6 seconds.

That brings me to:

Question 3: Shouldn't that speed increase be, at most, 2x, not 4x?

This is a dual-core machine I'm running these benchmarks on, so in a perfect world one thread reads from disk while the other processes. Even when it runs at the 4x speed increase, it's only using 80% of the processors, not 100%, so there is still a disk bottleneck. But a 4x increase means that something else is happening as well.

I also assume that the wide range of timings comes from the threads not being perfectly synchronized on their reads, if that's how the speed increase is happening. The real, final question is:

Question 4: How can I get that 4x speed increase consistently?


Answer 1: Yes, you're disk-bound, so the CPU will not be pegged all that much, and yes, reading larger chunks is more efficient (as long as the chunks are aligned with the disk cache).

Answer 2: A disk that has an 8 MB cache and is spinning at 10k RPM might get a throughput of 60 to 80 MB/sec, so the "sweet spot" would be to read chunks aligned with the cache size. You can increase your buffer, but keep it aligned with the cache size: i.e. 8MB, 16MB, 32MB, etc.
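
For example, a read loop whose buffer matches an assumed 8 MB disk cache would look something like this (substitute your drive's actual cache size, or a multiple of it):

#include <cstdio>
#include <vector>

// Read a file in cache-sized chunks. kChunk assumes an 8 MB drive cache;
// double or quadruple it to stay aligned with the cache size.
void ReadInCacheSizedChunks(const char* path)
{
    const size_t kChunk = 8u * 1024 * 1024;
    std::vector<unsigned char> buf(kChunk);
    FILE* f = std::fopen(path, "rb");
    if (!f) return;
    size_t n;
    while ((n = std::fread(buf.data(), 1, kChunk, f)) > 0) {
        // ... process n bytes here ...
    }
    std::fclose(f);
}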

Answer 3: Ideally you would want to dedicate one thread to reading from disk and the other to processing the data (and you may want several threads for the processing). Multi-threading the disk reads may yield a small performance increase, but generally it doesn't. I don't know why you think "something else" is happening when you get a 4x increase.
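
Here is a minimal sketch of that layout: one thread does nothing but read chunks and push them onto a queue, while a second thread pops them off and does the combining (the sizes and the stubbed-out reader are placeholders, not your real code):

#include <algorithm>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

constexpr int kXSize = 4096, kYSize = 4096;
constexpr int kChunkRows = 128;  // chosen to divide kYSize evenly in this sketch

struct Chunk { int y; std::vector<unsigned short> rows; };

std::queue<Chunk> g_queue;  // unbounded here; bound it in a real 32-bit build
std::mutex g_mutex;
std::condition_variable g_cv;
bool g_done = false;

// Stub: the real version would read the band files for rows [y, y+kChunkRows).
Chunk ReadChunk(int y) { return {y, std::vector<unsigned short>(kXSize * kChunkRows)}; }

void ReaderThread() {
    for (int y = 0; y < kYSize; y += kChunkRows) {
        Chunk c = ReadChunk(y);  // the only thread that touches the disk
        { std::lock_guard<std::mutex> lk(g_mutex); g_queue.push(std::move(c)); }
        g_cv.notify_one();
    }
    { std::lock_guard<std::mutex> lk(g_mutex); g_done = true; }
    g_cv.notify_all();
}

void ProcessorThread(std::vector<unsigned short>& out) {
    for (;;) {
        std::unique_lock<std::mutex> lk(g_mutex);
        g_cv.wait(lk, [] { return !g_queue.empty() || g_done; });
        if (g_queue.empty()) break;  // reader finished and queue is drained
        Chunk c = std::move(g_queue.front());
        g_queue.pop();
        lk.unlock();
        // The real combination of the four bands would happen here.
        std::copy(c.rows.begin(), c.rows.end(),
                  out.begin() + static_cast<size_t>(c.y) * kXSize);
    }
}

int main() {
    std::vector<unsigned short> out(static_cast<size_t>(kXSize) * kYSize);
    std::thread reader(ReaderThread);
    std::thread processor(ProcessorThread, std::ref(out));
    reader.join();
    processor.join();
}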

Answer 3 Update: Frankly, I don't know exactly why this happens either, but I've also seen it with multithreaded disk I/O in .NET applications. In fact, I have a C# test example which demonstrates the same kind of performance increase you're noticing. Note that in my test I'm loading HTML pages roughly the size you'd see in the "wild" (about 80 to 160 KB each), so I'm not aligning my reads with the disk cache.

It may be that multiple threads reading at once are actually more efficient because they make better use of the disk cache, despite issuing multiple reads. Of course, this is just an off-the-cuff assumption that I have no evidence to back up yet, so take it with a grain of salt! I think that if your files are large enough and your disk-reading thread has a buffer aligned with the disk cache, then adding more threads won't improve your speed at all. If you still see an improvement in speed, do let us know!

Answer 4: Try the following:

  1. Align your buffer with the cache size of your disk.
  2. Reduce the number of applications that are also trying to access disk at the same time.
  3. Load up as much of the image in memory as you can and run enough threads to fully utilize your CPU (you'll have to experiment with the number of threads; play around and see where the "sweet spot" is).
  4. Use only one disk-reading thread and make sure it's constantly reading (the reader/processor sketch in Answer 3 shows one way to structure this)!

And again, you're disk-bound, so you may never really get 100% CPU utilization.

Answer 4 Update:
I don't think Intel's TBB is actually the thing causing the performance increase you (and I) are seeing... as I said, my best guess is that multiple threads may be more efficient because they make better use of the disk cache. I'm not even sure that's a correct assumption, so don't quote me without testing!

Reading:
I found a very detailed dissertation, titled Asynchronous/Multi Threaded I/O on Commodity Systems with Multiple Disks – a performance study, that does some amazing analysis and testing of cases where multithreaded I/O outperforms single-threaded I/O. Take a look around page 86.

Dr. Dobbs also has an article on the subject; I didn't have a chance to read the whole thing, I just skimmed through it.
