Efficient Linux sockets (DMA/zero-copy)
I'm building a very high-performance Linux server (based on epoll, non-blocking sockets, and async disk IO [based on io_submit/io_getevents/eventfd]). Some of my benchmarks show that the way I handle sockets isn't efficient enough for my requirements. In particular, I'm concerned with getting data from the userspace buffer to the network card, and from the network card back to the userspace buffer (let's ignore the sendfile call for now).
From what I understand, calling read/write on a non-blocking Linux socket isn't fully asynchronous - the system call blocks while it copies the buffer from userspace to the kernel (or the other way around), and only then returns. Is there a way to avoid this overhead in Linux? In particular, is there a fully asynchronous write call that I can make on a socket that would return immediately, DMA the userspace buffer to the network card as necessary, and signal/set an event/etc. on completion? I know Windows has an interface for this, but I couldn't find anything about this in Linux.
Thanks!
There's been some talk on linux-kernel recently about providing an API for something along these lines, but the sticking point is that you can't DMA from general userspace buffers to the network card, because:
- What looks like contiguous data in the userspace linear address space is probably not contiguous in physical memory, which is a problem if the network card doesn't do scatter-gather DMA;
- On many machines, not all physical memory addresses are "DMA-able". There's no way at the moment for a userspace application to specifically request a DMA-able buffer.
On recent kernels, you could try using vmsplice and splice together to achieve what you want - vmsplice the pages (with SPLICE_F_GIFT) you want to send into a pipe, then splice them (with SPLICE_F_MOVE) from the pipe into the socket.
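A rough sketch of that vmsplice/splice sequence, in case it helps. The function names send_gifted_pages and roundtrip_one_page are made up for illustration, and the sketch assumes a blocking socket, a page-aligned buffer, and a length that is a multiple of the page size (the vmsplice(2) man page requires that alignment for SPLICE_F_GIFT). Whether the kernel really avoids the copy depends on the kernel version and the socket type, so treat this as something to benchmark rather than a guaranteed zero-copy path.

```c
#define _GNU_SOURCE            /* exposes vmsplice(2) and splice(2) in <fcntl.h> */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper (not from the original answer): push `len` bytes at
 * `buf` into `sock_fd` through a temporary pipe.
 * Assumptions: buf is page-aligned, len is a multiple of the page size,
 * and sock_fd is blocking. The caller must not modify buf afterwards.
 * Returns 0 on success, -1 with errno set on failure. */
int send_gifted_pages(int sock_fd, void *buf, size_t len)
{
    int pipefd[2];
    int rc = -1;
    int saved;

    if (pipe(pipefd) < 0)
        return -1;

    struct iovec iov = { .iov_base = buf, .iov_len = len };
    while (iov.iov_len > 0) {
        /* Step 1: gift as many user pages to the pipe as currently fit. */
        ssize_t queued = vmsplice(pipefd[1], &iov, 1, SPLICE_F_GIFT);
        if (queued <= 0) {
            if (queued == 0)
                errno = EIO;
            goto out;
        }
        iov.iov_base = (char *)iov.iov_base + queued;
        iov.iov_len -= (size_t)queued;

        /* Step 2: drain the pipe into the socket, asking the kernel to
         * move the pages instead of copying them where it can. */
        while (queued > 0) {
            ssize_t sent = splice(pipefd[0], NULL, sock_fd, NULL,
                                  (size_t)queued, SPLICE_F_MOVE);
            if (sent <= 0) {
                if (sent == 0)
                    errno = EIO;
                goto out;
            }
            queued -= sent;
        }
    }
    rc = 0;
out:
    saved = errno;             /* keep the first error across close() */
    close(pipefd[0]);
    close(pipefd[1]);
    errno = saved;
    return rc;
}

/* Demo: send one page over a local AF_UNIX socket pair and read it back.
 * Returns 0 if every byte arrived intact, -1 otherwise. */
int roundtrip_one_page(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    int sv[2];
    void *out = NULL;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    if (posix_memalign(&out, page, page) != 0)
        return -1;
    memset(out, 'x', page);

    /* `out` is deliberately left alone after this call: it was gifted. */
    if (send_gifted_pages(sv[0], out, page) < 0)
        return -1;

    char *in = malloc(page);
    if (in == NULL)
        return -1;
    size_t got = 0;
    while (got < page) {
        ssize_t n = read(sv[1], in + got, page - got);
        if (n <= 0)
            return -1;
        got += (size_t)n;
    }
    for (size_t i = 0; i < page; i++)
        if (in[i] != 'x')
            return -1;

    free(in);
    close(sv[0]);
    close(sv[1]);
    return 0;
}
```

In an epoll-based server you would more likely keep one pipe per connection instead of creating one per send, pass SPLICE_F_NONBLOCK, and resume the splice step when the socket becomes writable again after EAGAIN.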
AFAIK you are using the most efficient calls available if you can't use sendfile(2). Various aspects of efficient high-performance networking code are covered by The C10K problem.