Why does block I/O completion take so long when crossing CPUs?
I am trying to squeeze the most performance out of a Linux block driver for a high-end storage device. One problem that has me a bit stumped at the moment is this: if a user task starts an I/O operation (read or write) on one CPU, and the device interrupt occurs on another CPU, I incur about 80 microseconds of delay before the task resumes execution.
I can see this using O_DIRECT against the raw block device, so this is not page cache or filesystem related. The driver uses make_request to receive operations, so it has no request queue and does not use any kernel I/O scheduler (you'll have to trust me, it's way faster this way).
I can demonstrate to myself that the problem occurs between calling bio_endio on one CPU and the task being rescheduled on another CPU. If the task is on the same CPU, it resumes very quickly; if the task is on another physical CPU, it takes a lot longer -- usually about 80 microseconds longer on my current test system (x86_64 on the Intel 5520 [NUMA] chipset).
I can instantly double my performance by setting the process and IRQ CPU affinity to the same physical CPU, but that's not a good long-term solution -- I'd rather be able to get good performance no matter where the I/Os originate. And I have only one IRQ, so I can only steer it to one CPU at a time -- no good if many threads are running on many CPUs.
I can see this problem on kernels from CentOS 5.4's 2.6.18 to the mainline 2.6.32.
So the question is: why does it take longer for the user process to resume if I called bio_endio from another CPU? Is this a scheduler issue? And is there any way to eliminate or reduce the delay?
If you finish your I/O on a particular CPU, that processor is immediately free to start working on a new thread - and if you finish the I/O on the same processor as the thread that requested it, the next thread to run is likely to be the one whose I/O you just finished.
On the other hand, if you finish on a different processor, the thread that requested the I/O won't get to run immediately - it has to wait until whatever is currently executing finishes its quantum or otherwise relinquishes the CPU.
As far as I understand.
It could just be the latency inherent in issuing an IPI from the CPU that completed the bio to the CPU where the task gets scheduled - to test this, try booting with idle=poll.
This patch was just posted to LKML, implementing QUEUE_FLAG_SAME_CPU in the block device layer, which is described as:
Add a flag to make request complete on cpu where request is submitted. The flag implies QUEUE_FLAG_SAME_COMP. By default, it is off.
It sounds like it might be just what you need...
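Even without the patch, kernels that already wire QUEUE_FLAG_SAME_COMP up to the queue's sysfs attributes let you ask for completion on the submitting CPU's side from user space; a sketch, assuming your kernel exposes the knob ("sda" is an example device name):

```shell
# Sketch: request same-CPU completion via the queue's rq_affinity
# attribute, which toggles QUEUE_FLAG_SAME_COMP on kernels that have it.
# "sda" is an example device name; requires root.
echo 1 > /sys/block/sda/queue/rq_affinity
```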
Looks like I misunderstood the problem a bit: it seems to be related to cache misses. When the CPU handling interrupts wasn't the CPU that started the I/O, that CPU could hit 100% utilization, and then everything slowed down, giving the impression of a long delay communicating between CPUs.
Thanks to everyone for their ideas.