Why does block I/O completion take so long when crossing CPUs?
I am trying to squeeze the most performance out of a Linux block driver for a high-end storage device. One problem that has me a bit stumped at the moment is this: if a user task starts an I/O operation (read or write) on one CPU, and the device interrupt occurs on another CPU, I incur about 80 microseconds of delay before the task resumes execution.
I can see this using O_DIRECT against the raw block device, so this is not page cache or filesystem related. The driver uses make_request to receive operations, so it has no request queue and does not use any kernel I/O scheduler (you'll have to trust me, it's way faster this way).
I can demonstrate to myself that the problem occurs between calling bio_endio on one CPU and the task being rescheduled on another CPU. If the task is on the same CPU, it resumes very quickly; if the task is on another physical CPU, it takes a lot longer -- usually about 80 microseconds longer on my current test system (x86_64 on the Intel 5520 [NUMA] chipset).
I can instantly double my performance by setting the process and IRQ CPU affinity to the same physical CPU, but that's not a good long-term solution -- I'd rather be able to get good performance no matter where the I/Os originate. And I have only one IRQ, so I can only steer it to one CPU at a time -- no good if many threads are running on many CPUs.
I can see this problem on kernels from CentOS 5.4's 2.6.18 to the mainline 2.6.32.
So the question is: why does it take longer for the user process to resume if I called bio_endio from another CPU? Is this a scheduler issue? And is there any way to eliminate or reduce the delay?
If you finish your I/O on a particular CPU, that processor is immediately free to start working on a new thread - and if you finish the I/O on the same processor as the thread that requested it, the next thread to run is likely to be the one whose I/O you just finished.
On the other hand, if you finish on a different processor, the thread that requested the I/O won't get to run immediately - it has to wait until whatever is currently executing finishes its quantum or otherwise relinquishes the CPU.
As far as I understand.
It could just be the latency inherent in issuing an IPI from the CPU that completed the bio to the CPU where the task gets scheduled - to test this, try booting with idle=poll.
This patch was just posted to LKML, implementing QUEUE_FLAG_SAME_CPU in the block device layer, which is described as:
Add a flag to make request complete on cpu where request is submitted. The flag implies QUEUE_FLAG_SAME_COMP. By default, it is off.
It sounds like it might be just what you need...
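Even without the patch, kernels that already wire QUEUE_FLAG_SAME_COMP up to the queue's sysfs attributes let you ask for completion on the submitting CPU's side from user space; a sketch, assuming your kernel exposes the knob ("sda" is an example device name):

```shell
# Sketch: request same-CPU completion via the queue's rq_affinity
# attribute, which toggles QUEUE_FLAG_SAME_COMP on kernels that have it.
# "sda" is an example device name; requires root.
echo 1 > /sys/block/sda/queue/rq_affinity
```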
Looks like I misunderstood the problem a bit: it seems to be related to cache misses. When the CPU handling interrupts wasn't the CPU that started the I/O, that CPU could hit 100% utilization, and then everything slowed down, giving the impression of a long delay communicating between CPUs.
Thanks to everyone for their ideas.