Problem: heavy I/O operations interfere with a network application listening for UDP and SCTP data
We have an application that uses two types of sockets: a listening UDP socket and an active SCTP socket.
At certain times, scripts with heavy I/O activity (such as dd, tar, ...) run on the same machine. Most of the time when these I/O-heavy scripts run, we see the following problems:
- The UDP socket closes
- The SCTP socket is still alive and we can see it in /proc/net/sctp/assocs; however, no traffic is received on this socket anymore (until we restart the application)
Why are these I/O operations affecting the network based application in such a way?
Are there any kernel configurations to avoid these problems? I would have expected some packets to be lost on the UDP socket and some retries on the SCTP socket, but not this behavior. The application is running on a 64-bit server with 4 quad-core CPUs and RHEL:
# uname -a
Linux server1 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
When you say the UDP socket closes, what exactly do you mean? You try to send and it fails?
For SCTP, can you collect wireshark or pcap traces at the time these I/O operations run (preferably run wireshark on the peer)? My guess (an educated guess, without looking at the code) is that when these I/O operations come into the picture, your process gets starved for CPU time. The other end sends SCTP Heartbeat messages to which it gets no replies. Or, if data was flowing, the peer is not receiving any SACKs, as they have not yet been processed by the SCTP stack at your end.
The peer therefore aborts the association internally and stops sending you data (since it sees all the paths as down, it does not send an ABORT; in that case, your SCTP stack will still think the association is alive).
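If you want to check that last point from inside your application, lksctp exposes the association state through getsockopt(SCTP_STATUS). A rough sketch in C, assuming a one-to-one style SCTP socket; the helper name and return convention are just for illustration:

    /* Ask the local stack what it thinks the association state is. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/sctp.h>

    int check_assoc_state(int sctp_fd)
    {
        struct sctp_status status;
        socklen_t len = sizeof(status);

        memset(&status, 0, sizeof(status));
        if (getsockopt(sctp_fd, IPPROTO_SCTP, SCTP_STATUS, &status, &len) < 0) {
            perror("getsockopt(SCTP_STATUS)");
            return -1;
        }
        /* sstat_state stays SCTP_ESTABLISHED while the local stack believes the
         * association is up; sstat_unackdata is data still waiting for SACKs. */
        printf("state=%d unacked=%u pending=%u\n", status.sstat_state,
               (unsigned)status.sstat_unackdata, (unsigned)status.sstat_penddata);
        return status.sstat_state == SCTP_ESTABLISHED ? 0 : 1;
    }

If this keeps reporting SCTP_ESTABLISHED while the peer has already torn the association down, you are in exactly the half-dead situation described above.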
Try to confirm the values for the Heartbeat timeout, RTO timeout, SACK timeout, maximum path retransmissions and maximum association retransmissions at the peer end. I haven't worked with kernel SCTP, but sysctl should be able to give you those values.
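If the peer is also Linux, those defaults show up under net.sctp.* in sysctl. For the values actually in effect on your own association, lksctp also exposes them through socket options; a sketch in C, assuming a connected one-to-one SCTP socket fd (the SACK delay is a separate option and is left out):

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/sctp.h>

    void dump_sctp_timers(int fd)
    {
        struct sctp_rtoinfo rto;
        struct sctp_assocparams ap;
        struct sctp_paddrparams pp;
        socklen_t len;

        memset(&rto, 0, sizeof(rto));
        len = sizeof(rto);
        if (getsockopt(fd, IPPROTO_SCTP, SCTP_RTOINFO, &rto, &len) == 0)
            printf("RTO initial/min/max: %u/%u/%u ms\n",
                   rto.srto_initial, rto.srto_min, rto.srto_max);

        memset(&ap, 0, sizeof(ap));
        len = sizeof(ap);
        if (getsockopt(fd, IPPROTO_SCTP, SCTP_ASSOCINFO, &ap, &len) == 0)
            printf("max association retransmissions: %u\n",
                   (unsigned)ap.sasoc_asocmaxrxt);

        memset(&pp, 0, sizeof(pp));   /* zeroed address -> socket defaults */
        len = sizeof(pp);
        if (getsockopt(fd, IPPROTO_SCTP, SCTP_PEER_ADDR_PARAMS, &pp, &len) == 0)
            printf("heartbeat interval: %u ms, max path retransmissions: %u\n",
                   pp.spp_hbinterval, (unsigned)pp.spp_pathmaxrxt);
    }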
Either way, collecting pcap traces when you observe this problem would give us much better insight into what is going wrong. I hope this helps.
Here are some things I'd look into:
- What is the load on the UDP socket when the scripts are not running? Is it continuous or bursty?
- Does the socket ever spontaneously close when the scripts are not running?
- What is happening to the data being read off the socket? How much of it (raw or processed) is being written to disk?
- Can you monitor CPU, network, and disk I/O utilization to see if any of them are saturating?
- Can the scripts running the I/O operations be run at a lower priority or, conversely, can the process running the UDP socket be run at a higher priority? (See the sketch below.)
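On the priority point, this is roughly what raising the network process's priority looks like in C; the numbers are illustrative assumptions, not recommendations, and both calls normally need root:

    #include <stdio.h>
    #include <sched.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    int boost_priority(void)
    {
        struct sched_param sp = { .sched_priority = 10 };  /* illustrative value */

        /* Nice -10: scheduled ahead of default-priority dd/tar, still time-shared. */
        if (setpriority(PRIO_PROCESS, 0, -10) < 0)
            perror("setpriority");

        /* Heavier hammer: SCHED_FIFO pre-empts all normal (SCHED_OTHER) processes. */
        if (sched_setscheduler(0, SCHED_FIFO, &sp) < 0)
            perror("sched_setscheduler");

        return 0;
    }

The same idea in reverse is to start the dd/tar scripts under nice (and ionice, where the I/O scheduler supports priorities) so they yield CPU and disk bandwidth to the network process.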
One thing a lot of people don't check is the return value of send, and they don't check for error conditions like EINTR on recv. Maybe the heavy I/O load is causing some of your send's or recv's to get interrupted, and your app is treating these transient errors as hard errors and closing the socket without you realizing it.
I've seen this kind of thing happen and you should definitely check for it by cranking up your log level and seeing if your app is calling close unexpectedly.
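A sketch of what the defensive handling looks like in C; the wrapper name and return codes are only for illustration:

    #include <errno.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    ssize_t recv_retry(int fd, void *buf, size_t len)
    {
        for (;;) {
            ssize_t n = recv(fd, buf, len, 0);
            if (n >= 0)
                return n;                 /* data, or 0 on orderly shutdown */
            if (errno == EINTR)
                continue;                 /* interrupted by a signal: retry */
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                return -2;                /* transient: nothing to read yet */
            perror("recv");               /* only this path is a real error */
            return -1;
        }
    }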