Where are possible locations of queueing/buffering delays in Linux multicast?

Posted 2022-12-20 17:52

We make heavy use of multicast messaging across many Linux servers on a LAN, and we are seeing a lot of delays. We basically send an enormous number of small packets, and we are more concerned with latency than throughput. The machines are all modern multi-core machines (at least four cores, generally eight, 16 if you count hyperthreading), always with a load of 2.0 or less and usually with a load under 1.0. The networking hardware is also running at under 50% capacity.

The delays we see look like queueing delays: the packets will quickly start increasing in latency, until it looks like they jam up, then return back to normal.

The messaging structure is basically this: in the "sending thread", we pull messages from a queue, add a timestamp (using gettimeofday()), then call send(). The receiving program receives the message, timestamps the receive time, and pushes it onto a queue. A separate thread processes that queue, analyzing the difference between the sending and receiving timestamps. (Our internal queues are not part of the problem, since the timestamps are taken outside of our internal queuing.)
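The timestamping scheme described above can be sketched as follows. This is a Python illustration, not the poster's C code; the 16-byte header (sequence number plus send time in microseconds) and the function names are hypothetical. Two caveats apply: the receive timestamp should be taken immediately after recv() returns, and comparing the sender's clock with the receiver's clock only measures one-way latency as accurately as the two machines' clocks are synchronized (for example by NTP).

```python
import struct
import time

# Hypothetical wire layout: 8-byte sequence number, 8-byte send timestamp
# in microseconds (network byte order), then the application payload.
HEADER = struct.Struct("!QQ")


def now_us():
    """Wall-clock time in microseconds, like gettimeofday() on the sender."""
    return time.time_ns() // 1000


def make_message(seq, payload, clock=now_us):
    """Prefix the payload with a sequence number and a send timestamp."""
    return HEADER.pack(seq, clock()) + payload


def parse_message(data, clock=now_us):
    """Return (seq, payload, latency_us); call right after recv() returns."""
    seq, sent_us = HEADER.unpack_from(data)
    recv_us = clock()
    return seq, data[HEADER.size:], recv_us - sent_us


if __name__ == "__main__":
    # Fixed fake clocks so the result is deterministic.
    msg = make_message(1, b"hello", clock=lambda: 1_000_000)
    print(parse_message(msg, clock=lambda: 1_000_250))  # (1, b'hello', 250)
```

The sequence number is not needed for the latency figure itself, but it lets the receiver notice gaps (dropped datagrams) and match the same message across capture points.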

We don't really know where to start looking for an answer to this problem. We're not familiar with Linux internals. Our suspicion is that the kernel is queuing or buffering the packets, either on the send side or the receive side (or both). But we don't know how to track this down and trace it.

For what it's worth, we're using CentOS 4.x (RHEL kernel 2.6.9).

This is a great question. On CentOS, as on most flavors of *nix, every multicast socket has a UDP receive buffer and a UDP send buffer. Their default and maximum sizes are controlled through sysctl (and can be made persistent in /etc/sysctl.conf); you can view the current sizes by running /sbin/sysctl -a.

The items below show my default and maximum UDP receive and send buffer sizes in bytes. The larger these numbers, the more buffering, and therefore latency, the network stack and kernel can introduce if your application is too slow in consuming the data. If you have built in good tolerance for data loss, you can make these buffers very small and you will not see the latency build-up and recovery you described above. The trade-off is data loss when the buffer overflows, something you may already be seeing.
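To put a number on "more buffering, therefore more latency": once a receive buffer is full, a newly queued datagram waits roughly as long as it takes the application to drain everything ahead of it. A back-of-the-envelope sketch, assuming a hypothetical application drain rate of 50 MB/s and a 16 MiB buffer:

```python
def worst_case_queue_delay_s(buffer_bytes, drain_rate_bytes_per_s):
    """Time to drain a completely full socket buffer at a given rate.

    Simplification: treats the buffer as holding only payload bytes. Linux
    charges each datagram's full kernel allocation (overhead included)
    against the buffer, so with small packets the real payload capacity,
    and hence the delay, is lower than this estimate.
    """
    if drain_rate_bytes_per_s <= 0:
        raise ValueError("drain rate must be positive")
    return buffer_bytes / drain_rate_bytes_per_s


if __name__ == "__main__":
    # 16 MiB buffer drained by an application keeping up with 50 MB/s.
    print(worst_case_queue_delay_s(16 * 1024 * 1024, 50_000_000))  # about 0.34 s
```

The point of the arithmetic is that a large buffer turns a temporarily slow consumer into hundreds of milliseconds of latency rather than into drops.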

[~]$ /sbin/sysctl -a | grep mem
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

In most cases you need to set the default equal to the max, unless you set the buffer size yourself (SO_RCVBUF / SO_SNDBUF) when you create your socket.
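A minimal Python sketch of both halves, assuming Linux: reading the system-wide limits that sysctl reports (straight from /proc/sys) and requesting a per-socket receive buffer with SO_RCVBUF when the socket is created. The 64 KiB request is an arbitrary illustration value. Per the socket(7) man page, the kernel caps the request at net.core.rmem_max and doubles it to leave room for bookkeeping overhead, and getsockopt() returns the doubled value, so the printed size will not match the request exactly.

```python
import socket


def read_sysctl(name):
    """Read an integer sysctl such as net.core.rmem_max from /proc/sys."""
    path = "/proc/sys/" + name.replace(".", "/")
    with open(path) as f:
        return int(f.read().split()[0])


def open_udp_with_rcvbuf(requested_bytes):
    """Create a UDP socket and request a specific receive buffer size.

    Returns (socket, effective_size). The kernel caps the request at
    net.core.rmem_max and getsockopt() reports double the granted value.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
    return sock, sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    print("rmem_default =", read_sysctl("net.core.rmem_default"))
    print("rmem_max     =", read_sysctl("net.core.rmem_max"))
    # A deliberately small buffer: less queueing latency, more drops on overflow.
    sock, effective = open_udp_with_rcvbuf(64 * 1024)
    print("effective SO_RCVBUF =", effective)
    sock.close()
```

Setting the size per socket keeps the low-latency receivers small without shrinking the buffers of every other application on the box.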

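The last thing you can do (depending on your kernel version) is view the UDP statistics through your process's PID or, at the very least, for the box overall. Note that on kernels with network namespaces, /proc/PID/net/snmp reports the counters of the namespace the process lives in rather than of the process itself, which is why the two outputs below are identical.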

cat /proc/net/snmp | grep -i Udp
Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 81658157063 145 616548928 3896986

cat /proc/PID/net/snmp | grep -i Udp
Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 81658157063 145 616548928 3896986
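The output above comes from a two-line format: a header line of counter names followed by a line of values. A sketch that parses it and diffs two snapshots taken a second apart, so you can see whether error counters climb while latency builds up. The reading in the comments is a hedged rule of thumb: datagrams dropped because the receive buffer was full are expected to show up in InErrors, and newer kernels also break them out as RcvbufErrors.

```python
import time


def parse_snmp_udp(text):
    """Return the Udp counters from /proc/net/snmp-style text as a dict.

    The file pairs a header line ("Udp: InDatagrams NoPorts ...") with a
    value line ("Udp: 81658157063 145 ..."); column names vary by kernel.
    """
    rows = [line.split()[1:] for line in text.splitlines()
            if line.startswith("Udp:")]
    if len(rows) < 2:
        raise ValueError("no Udp header/value lines found")
    names, values = rows[0], rows[1]
    return dict(zip(names, (int(v) for v in values)))


def counter_deltas(before, after):
    """Per-counter increase between two snapshots."""
    return {k: after[k] - before[k] for k in after if k in before}


if __name__ == "__main__":
    with open("/proc/net/snmp") as f:
        first = parse_snmp_udp(f.read())
    time.sleep(1)
    with open("/proc/net/snmp") as f:
        second = parse_snmp_udp(f.read())
    # InErrors (or RcvbufErrors) rising while latency climbs suggests the
    # receive buffers are overflowing.
    print(counter_deltas(first, second))
```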

In case it wasn't clear from my post: the latency is caused by your application not consuming the data fast enough, which forces the kernel to buffer traffic in the buffers described above. The network, the kernel, and even your network card's ring buffers can play a role in latency, but those items typically add only a few milliseconds.
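One way to check this diagnosis directly is to watch how much data the kernel is holding for your socket. On Linux, /proc/net/udp lists every IPv4 UDP socket with a hex tx_queue:rx_queue column (buffer memory currently charged to the socket) and, on newer kernels, a trailing drops column. A sketch that parses it and polls one port; the port number 12345 is a placeholder. An rx_queue that grows while latency climbs, then drains as latency recovers, would point at the receiving application falling behind.

```python
import time


def parse_proc_net_udp(text):
    """Yield (local_port, tx_queue, rx_queue, drops) for each socket.

    Field 1 is "HEXADDR:HEXPORT", field 4 is "tx_queue:rx_queue" in hex.
    The drops column only exists on newer kernels; otherwise it is None.
    """
    for line in text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 5:
            continue
        port = int(fields[1].split(":")[1], 16)
        tx_hex, rx_hex = fields[4].split(":")
        drops = int(fields[12]) if len(fields) >= 13 else None
        yield port, int(tx_hex, 16), int(rx_hex, 16), drops


if __name__ == "__main__":
    WATCH_PORT = 12345  # hypothetical multicast port; use your own
    for _ in range(10):
        with open("/proc/net/udp") as f:
            for port, tx, rx, drops in parse_proc_net_udp(f.read()):
                if port == WATCH_PORT:
                    print(f"rx_queue={rx} bytes drops={drops}")
        time.sleep(0.1)
```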

Let me know your thoughts and I can give you more information on where to look in your app to squeeze some more performance.

Packets can queue up in the kernel on both the send and receive side, in the NIC, and in the network infrastructure. You will find a plethora of items you can test and tweak.

For the NIC, you can usually find interrupt coalescing parameters: how long the NIC will wait, batching packets, before notifying the kernel or sending to the wire.
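On Linux the coalescing settings are usually inspected with "ethtool -c IFACE" and changed with "ethtool -C IFACE rx-usecs N"; which parameters are supported varies by driver. A small Python sketch that runs the query and keeps the numeric fields; eth0 is a placeholder interface name, and ethtool must be installed.

```python
import subprocess


def parse_coalesce(text):
    """Turn `ethtool -c` output into {parameter: int} for numeric settings."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        # Skip the title line, "Adaptive RX: off  TX: off", and "n/a" values.
        if sep and value.isdigit():
            params[key.strip()] = int(value)
    return params


def read_coalesce(iface):
    """Run `ethtool -c IFACE` and parse its output."""
    out = subprocess.run(["ethtool", "-c", iface], capture_output=True,
                         text=True, check=True).stdout
    return parse_coalesce(out)


if __name__ == "__main__":
    params = read_coalesce("eth0")  # hypothetical interface name
    # rx-usecs / rx-frames control how long / how many packets the NIC
    # batches before interrupting; lower values favour latency over CPU.
    print({k: v for k, v in params.items() if k.startswith("rx-")})
```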

In Linux itself you have the socket send and receive "buffers"; the larger they are, the more likely you are to experience higher latency, as packets get handled in batched operations.

For your architecture and Linux version, you have to be aware of how expensive context switches are and whether there are locks in the path or pre-emptive scheduling enabled. Consider minimizing the number of applications running and using process affinity to lock processes to particular cores.
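Process affinity can also be set from inside the program (the taskset command does the same thing from the shell). A Linux-only Python sketch using os.sched_setaffinity, which wraps sched_setaffinity(2); the choice of core is a hypothetical example, not a recommendation for any particular machine.

```python
import os


def pin_to_cpu(cpu):
    """Restrict the caller (pid 0) to a single CPU core and return the mask.

    On Linux, pid 0 strictly means the calling thread; in a single-threaded
    program that is the whole process. Pinning keeps the receiver from being
    migrated between cores, which helps keep its caches warm.
    """
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    allowed = sorted(os.sched_getaffinity(0))
    print("allowed CPUs:", allowed)
    # Hypothetical choice: the highest-numbered allowed core.
    print("now pinned to:", pin_to_cpu(allowed[-1]))
```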

Don't forget timing: the Linux kernel version you are using has pretty poor accuracy on the gettimeofday() clock (2-4 ms), and it is quite an expensive call. Consider alternatives such as reading the core's TSC or an external HPET device.
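Python cannot read the TSC directly, so this sketch swaps in a related check: measuring how finely a clock ticks and roughly what each call costs, for the wall clock (the one gettimeofday() reads) and CLOCK_MONOTONIC. On modern kernels both are usually served from the TSC through the vDSO; on a kernel as old as 2.6.9 the numbers may look very different. Interpreter overhead inflates the per-call cost, so treat it as an upper bound and use C with clock_gettime() for a precise figure. A monotonic clock is only meaningful within one machine, so cross-host one-way latency still needs synchronized wall clocks.

```python
import time


def clock_cost_ns(clock_ns, calls=100_000):
    """Average cost in nanoseconds of one call to clock_ns() (upper bound)."""
    start = time.perf_counter_ns()
    for _ in range(calls):
        clock_ns()
    return (time.perf_counter_ns() - start) / calls


def smallest_step_ns(clock_ns, samples=100_000):
    """Smallest non-zero difference seen between consecutive readings."""
    best = None
    prev = clock_ns()
    for _ in range(samples):
        cur = clock_ns()
        if cur > prev and (best is None or cur - prev < best):
            best = cur - prev
        prev = cur
    return best


if __name__ == "__main__":
    clocks = {
        "time_ns (wall clock, CLOCK_REALTIME)": time.time_ns,
        "monotonic_ns (CLOCK_MONOTONIC)": time.monotonic_ns,
    }
    for name, fn in clocks.items():
        print(f"{name}: step={smallest_step_ns(fn)} ns, "
              f"cost={clock_cost_ns(fn):.0f} ns/call")
```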

Diagram from Intel: http://www.theinquirer.net/IMG/142/96142/latency-580x358.png?1272514422

If you decide you need to capture packets in the production environment, it may be worth using monitor ports on your switches and capturing the packets on non-production machines. That also allows you to capture packets at multiple points along the transmission path and compare what you see.
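Once you have captures from two points (for example a switch monitor port and the receiving host), the comparison is a join on a per-packet identifier. A sketch assuming each packet carries an application sequence number and that both timestamps come from the same, or synchronized, clocks; the sample numbers are made up.

```python
def per_hop_delays(capture_a, capture_b):
    """Match packets seen at two capture points and return b - a per packet.

    capture_a / capture_b map a packet identifier (here, a hypothetical
    application sequence number) to a capture timestamp in microseconds.
    Packets seen at A but missing at B are returned separately as losses.
    """
    delays = {seq: capture_b[seq] - t for seq, t in capture_a.items()
              if seq in capture_b}
    lost = sorted(seq for seq in capture_a if seq not in capture_b)
    return delays, lost


def summarize(delays):
    """(min, middle, max) delay; a rising middle value hints at queueing."""
    values = sorted(delays.values())
    if not values:
        return None
    return values[0], values[len(values) // 2], values[-1]


if __name__ == "__main__":
    # Hypothetical timestamps: switch monitor port (A) vs receiving host (B).
    a = {1: 1000, 2: 1010, 3: 1020, 4: 1030}
    b = {1: 1050, 2: 1075, 4: 1190}
    delays, lost = per_hop_delays(a, b)
    print(delays, lost, summarize(delays))
```

Repeating the join for each pair of adjacent capture points shows which hop the delay accumulates in.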
