Optimising for O_DIRECT writes
I'm trying to write an application which will need to write to disk very quickly. I've hit my performance target for writing to disk, which is great.
However, I've noticed that writing to disk so quickly is using a lot of CPU time: one core is maxed out, another is at 80%, and another two are at 10-20%. I've heard that O_DIRECT can decrease CPU consumption by avoiding the copies into kernel space and then out to disk.
I ran a small test program which confirmed this - CPU usage drops to 50% of one core - much better.
However, I never got quite the same throughput as I did with normal buffered writes, and to get close I had to use a really big record size (something like 130 MB!).
So, the question is, I guess:
- Is there a better way to decrease CPU usage than O_DIRECT for writes? or
- How can I get a similar throughput to what the kernel gets?
My environment is Linux, I'm using a RAID 50, and I'm able to buffer writes until I hit some optimal record size. There will be only one writer at a time.
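For reference, a minimal sketch of the kind of O_DIRECT test described above. O_DIRECT requires the buffer address, transfer length, and file offset to be block-aligned, hence posix_memalign; the file name, block size, and record size below are placeholders, not values from the original test:

```c
/* Minimal O_DIRECT write sketch (illustrative; file name and sizes are placeholders).
 * Build: gcc -O2 -o direct_write direct_write.c */
#define _GNU_SOURCE          /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE  4096                 /* logical block size of the device (assumed) */
#define RECORD_SIZE (4 * 1024 * 1024)    /* hypothetical record size; tune it */

int main(void)
{
    /* O_DIRECT bypasses the page cache, so buffer address, length and file
     * offset all have to be multiples of the block size. */
    int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, BLOCK_SIZE, RECORD_SIZE) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    memset(buf, 'x', RECORD_SIZE);

    if (write(fd, buf, RECORD_SIZE) < 0)   /* one large, aligned write */
        perror("write");

    free(buf);
    close(fd);
    return 0;
}
```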
Quoting this page:
With O_DIRECT the kernel will do DMA directly from/to the physical memory pointed [to] by the userspace buffer passed as [a] parameter to the read/write syscalls. So there will be no CPU and memory bandwidth spent in the copies between userspace memory and kernel cache, and there will be no CPU time spent in kernel in the management of the cache (like cache lookups, per-page locks etc..).
Basically, you are trading throughput for CPU performance when using O_DIRECT: the kernel stops optimizing the throughput for you, and in return you get predictable results and full control.
Long story short: with O_DIRECT you'll have to do the caching and other optimizations yourself if you want the throughput back. The huge record size doesn't seem so weird now.
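A rough sketch of what "doing the caching yourself" can look like, assuming the application produces small records: they are copied into a large aligned staging buffer that is flushed with a single O_DIRECT write whenever it fills. The names and the staging size are made up for illustration:

```c
/* Userspace "write cache" sketch: coalesce small records into one big,
 * aligned staging buffer and flush it with a single O_DIRECT write.
 * Names and sizes are illustrative. */
#include <string.h>
#include <unistd.h>

#define STAGING_SIZE (8 * 1024 * 1024)   /* multiple of the block size; tune it */

struct staging {
    int    fd;     /* opened with O_DIRECT */
    char  *buf;    /* allocated with posix_memalign() */
    size_t used;
};

/* Append one small record, flushing whenever the staging buffer fills up.
 * A final, partial flush would need to be padded up to the block size. */
static int staging_append(struct staging *s, const char *rec, size_t len)
{
    while (len > 0) {
        size_t room = STAGING_SIZE - s->used;
        size_t n = len < room ? len : room;

        memcpy(s->buf + s->used, rec, n);
        s->used += n;
        rec     += n;
        len     -= n;

        if (s->used == STAGING_SIZE) {                  /* buffer full */
            if (write(s->fd, s->buf, STAGING_SIZE) < 0) /* one aligned write */
                return -1;
            s->used = 0;
        }
    }
    return 0;
}
```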
I'm not aware of any other methods, but I'm not a Linux guru. Feel free to ask around :)
You would need to somehow arrange for more I/Os to be kept in flight at the same time AND submit them at the optimal size. When the kernel buffers your write I/Os together, there are a number of benefits (a short sketch after this list illustrates the first one):
- It may become possible to merge contiguous I/Os into bigger I/Os. If so, there's an opportunity to save overhead because rather than submitting eight small 4 KByte I/Os, the kernel can submit one 32 KByte I/O (for example).
- It opens up the possibility of parallel submission. If the kernel is able to batch up 256 KBytes, it may be able to send that down as eight simultaneous I/Os, thus achieving a higher I/O depth.
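As a rough userspace analogue of the first point, scatter-gather I/O lets you hand several small buffers to the kernel in a single submission instead of one syscall per buffer. The sizes below just mirror the example above, and writev() with O_DIRECT adds its own alignment rules, so treat this as a sketch:

```c
/* Eight scattered 4 KByte buffers handed to the kernel in one submission
 * with writev() instead of eight separate write() calls (illustrative sizes). */
#include <sys/uio.h>
#include <unistd.h>

int write_batch(int fd, char *bufs[8])
{
    struct iovec iov[8];

    for (int i = 0; i < 8; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len  = 4096;
    }

    /* One syscall carrying 32 KBytes, giving the block layer one larger
     * request to work with rather than eight tiny ones. */
    return writev(fd, iov, 8) < 0 ? -1 : 0;
}
```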
So:
Is there a better way to decrease CPU usage than O_DIRECT for writes?
Yes: send bigger I/Os, up to the optimal size preferred by your disk.
How can I get a similar throughput to what the kernel gets?
Ideally, do the above (send optimally sized I/Os), ensure that the maximum number of I/Os your disk likes is kept in flight at once (e.g. by submitting asynchronously, or via multiple threads/processes if you're going to use blocking routines), and submit the I/Os in the disk's LBA order. A slightly less optimal trick is to send huge I/Os down and force the kernel to split them to create parallelism, but this is less optimal.
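One way to keep several optimally sized writes in flight at once is Linux native AIO via libaio, which pairs well with O_DIRECT. A sketch under assumptions: libaio is available (link with -laio), and the queue depth, I/O size, alignment, and file name are placeholders to tune for your RAID:

```c
/* Keep QD aligned writes in flight at once with Linux native AIO + O_DIRECT.
 * Sketch only: queue depth, I/O size and file name are placeholders.
 * Build: gcc -O2 -o aio_write aio_write.c -laio */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define QD      8                     /* in-flight I/Os; tune for the array */
#define IO_SIZE (1024 * 1024)         /* per-I/O size; tune for the array   */

int main(void)
{
    int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd < 0) return 1;

    io_context_t ctx = 0;
    if (io_setup(QD, &ctx) < 0) return 1;      /* context holding up to QD I/Os */

    struct iocb iocbs[QD], *ptrs[QD];
    for (int i = 0; i < QD; i++) {
        void *buf;
        if (posix_memalign(&buf, 4096, IO_SIZE) != 0) return 1;
        memset(buf, 'x', IO_SIZE);

        /* Offsets laid out in ascending (LBA) order, one IO_SIZE apart. */
        io_prep_pwrite(&iocbs[i], fd, buf, IO_SIZE, (long long)i * IO_SIZE);
        ptrs[i] = &iocbs[i];
    }

    if (io_submit(ctx, QD, ptrs) < 0) return 1;   /* all QD writes now in flight */

    /* A real writer would reap completions and immediately resubmit new I/Os
     * to keep the queue depth constant; here we simply wait for everything. */
    struct io_event events[QD];
    io_getevents(ctx, QD, QD, events, NULL);

    io_destroy(ctx);
    close(fd);
    return 0;
}
```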
Have you tried mmap and msync? I don't know if it is faster or less CPU intensive, but as it represents a whole other approach to I/O (basically it's the kernel that does the I/O for you), it can be an interesting avenue.
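A rough sketch of that approach (the file name and size are placeholders): the file is sized with ftruncate, mapped with mmap, filled by writing through memory, and flushed with msync:

```c
/* mmap + msync sketch: write through a shared mapping and let the kernel
 * handle the writeback. File name and size are illustrative. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define FILE_SIZE (64UL * 1024 * 1024)

int main(void)
{
    int fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return 1;
    if (ftruncate(fd, FILE_SIZE) < 0) return 1;   /* size the file before mapping */

    char *map = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) return 1;

    memset(map, 'x', FILE_SIZE);                  /* "write" by touching memory */

    /* MS_ASYNC schedules writeback; MS_SYNC waits until it has completed. */
    if (msync(map, FILE_SIZE, MS_SYNC) < 0) return 1;

    munmap(map, FILE_SIZE);
    close(fd);
    return 0;
}
```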