UDP socket network disconnect behavior on Windows-Linux-Mac
I wrote an application with Boost.Asio that uses UDP multicast. I don't think the question is really specific to Boost.Asio but rather to socket programming in general, since Boost.Asio's network facilities are mostly wrappers around the socket functions.
I constructed the application based on the multicast examples ( http://www.boost.org/doc/libs/1_44_0/doc/html/boost_asio/example/multicast/receiver.cpp and ~/sender.cpp) and I deployed it on several machines running on Windows, Linux and a Mac with OSX Leopard. I'm very pleased that multicasting on all platforms works out of the box with the code derived from the examples.
I run into problems when I disconnect the network cable. Of course, disconnecting the cable will always cause problems ;) but there are subtle differences between the platforms that drive me crazy.
My testing setup is always as follows: One machine running a sender and a receiver, to see if the same machine receives its own multicast, and another machine running only the receiver. I pull the network cord on the machine running the sender and the receiver.
Observed behavior:
-Obviously the machine where the receiver runs doesn't receive any more messages. That was to be expected ;)
-When the machine where the network cable is unplugged runs Windows, the sender continues to send and the receiver on the same machine continues to receive. No errors detected. It seems Windows has an intrinsic fallback to loopback?
-When the machine where the network cable is unplugged runs Mac OSX, the sender continues to send with no error message displayed, but the receiver on the same machine doesn't receive anymore. Before you ask, I made sure NOT to set the disable-loopback option.
-When the machine where the network cable is unplugged runs Linux, the sender fails with a boost::error "Network is unreachable". Obviously, since the sender can't send the data, the receiver doesn't receive anything anymore.
For Linux, I can fake the behavior of Windows by catching the "unreachable" error (or catching a wrong number of bytes written) and setting a flag in my code, subsequently sending all data to 127.0.0.1 instead of the multicast address. I regularly check whether a send_to on the multicast endpoint still yields an error, so that I can detect a network reconnect and go back to multicasting. This works like a charm because the receiver is bound to INADDR_ANY and thus also listens on 127.0.0.1.
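To make the Linux workaround concrete, here is a minimal sketch of the fallback logic. It is not my actual code: the class name FallbackSender and the injected send function are made up for illustration, and it simplifies things by trying the multicast address on every send instead of probing it at intervals.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>
#include <string>
#include <utility>

// Hypothetical helper (not part of Boost.Asio or of the original code):
// sends to the multicast address and falls back to loopback while the
// network is unreachable.
class FallbackSender {
public:
    // send(dest, payload) returns 0 on success or an errno value on failure.
    using SendFn = std::function<int(const std::string& dest,
                                     const std::string& payload)>;

    FallbackSender(std::string multicastAddr, SendFn send)
        : multicastAddr_(std::move(multicastAddr)), send_(std::move(send)) {
        assert(send_ && "a send function is required");
    }

    // Tries the multicast address first, so a reconnect is noticed as soon as
    // a multicast send succeeds again. On ENETUNREACH the same payload is
    // sent to 127.0.0.1 instead. Returns the result of the last send attempt.
    int Send(const std::string& payload) {
        int err = send_(multicastAddr_, payload);
        usingLoopback_ = (err == ENETUNREACH);
        if (usingLoopback_) {
            err = send_("127.0.0.1", payload);
        }
        return err;
    }

    // True if the most recent Send() had to fall back to loopback.
    bool UsingLoopback() const { return usingLoopback_; }

private:
    std::string multicastAddr_;
    SendFn send_;
    bool usingLoopback_ = false;
};
```

With Boost.Asio, the send function would presumably wrap the socket's send_to overload that takes a boost::system::error_code and return ec.value().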
For Mac OSX I have no means of noticing when the network becomes unreachable to keep up the service for the receiver on the local machine.
I observed that on Mac OSX I get a "Network is unreachable" error momentarily once when the network cable is re-plugged and DHCP hasn't yet acquired a new IP address.
So basically: how can I make sure that on Mac OSX the local receiver can still receive from the local sender? Either by detecting a network loss like I do on Linux, or by tricking it into behaving like Windows.
Any advice from people who have deeper insight into network programming than I have is greatly appreciated.
When I encountered this problem, my solution was to arrange to get a notification from the OS when the network configuration has changed. When my program received that notification, it would wait a few seconds (to hopefully make sure the network configuration has finished changing), and then tear down and reconstruct all of its sockets. It's a pain, but it seems to work pretty well.
Of course, there is no OS-agnostic way (that I know of) to get a notification from the OS when the network config has changed, so I had to implement it differently under each OS.
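The "wait a few seconds, then rebuild" part is the same on every OS, so it can be sketched independently of the notification mechanism. The class below is only an illustration (the name RebuildDebouncer and its interface are hypothetical): the OS-specific callback calls NotifyChanged(), and the main loop polls ShouldRebuild() and recreates its sockets when that returns true.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>

// Hypothetical helper: collapses a burst of network-config-changed
// notifications into a single "rebuild your sockets now" signal that fires
// once the configuration has been quiet for a while.
class RebuildDebouncer {
public:
    using Clock = std::chrono::steady_clock;

    explicit RebuildDebouncer(std::chrono::milliseconds quietPeriod)
        : quietPeriod_(quietPeriod) {
        assert(quietPeriod.count() >= 0);
    }

    // Called from the watcher thread whenever the OS reports a change.
    void NotifyChanged(Clock::time_point now) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_ = true;
        lastChange_ = now;
    }

    // Polled from the main thread. Returns true once per burst of changes,
    // and only after quietPeriod has passed since the last notification.
    bool ShouldRebuild(Clock::time_point now) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!pending_ || now - lastChange_ < quietPeriod_) {
            return false;
        }
        pending_ = false;
        return true;
    }

private:
    std::chrono::milliseconds quietPeriod_;
    std::mutex mutex_;
    bool pending_ = false;
    Clock::time_point lastChange_{};
};
```

Passing the current time in explicitly is a design choice that keeps the class easy to test; in real use both calls would be given Clock::now().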
For MacOS/X, I spawn a separate watch-the-network-config thread, which looks like this:
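#include <stdio.h>
#include <SystemConfiguration/SystemConfiguration.h>

// Thread entry point. 'context' is passed through to IPConfigChangedCallback()
// as its 'info' argument. _threadKeepGoing is assumed to be a flag shared with
// the main thread.
void MyNetworkThreadWatcherFunc(void *context)
{
    SCDynamicStoreRef storeRef = NULL;
    CFRunLoopSourceRef sourceRef = NULL;
    if (CreateIPAddressListChangeCallbackSCF(IPConfigChangedCallback, context, &storeRef, &sourceRef) == noErr)
    {
        CFRunLoopAddSource(CFRunLoopGetCurrent(), sourceRef, kCFRunLoopDefaultMode);
        while(_threadKeepGoing) // may be set to false by main thread at shutdown time
        {
            // Blocks and dispatches callbacks; returns only after
            // CFRunLoopStop() is called on this thread's run loop.
            CFRunLoopRun();
        }
        // cleanup time: release our resources
        CFRunLoopRemoveSource(CFRunLoopGetCurrent(), sourceRef, kCFRunLoopDefaultMode);
        CFRelease(storeRef);
        CFRelease(sourceRef);
    }
}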
And there is also this setup/support code, called from the above function (so it has to appear before it in the source file):

// MoreSCErrorBoolean() was not included in the original listing. This stand-in
// is an assumption: it maps a failed SCF call to the current SCError() code.
static OSStatus MoreSCErrorBoolean(Boolean success)
{
    OSStatus err = noErr;
    if (!success)
    {
        err = SCError();
        if (err == kSCStatusOK) err = kSCStatusFailed;
    }
    return err;
}
static OSStatus MoreSCError(const void *value) {return MoreSCErrorBoolean(value != NULL);}
static OSStatus CFQError(CFTypeRef cf) {return (cf == NULL) ? -1 : noErr;}
static void CFQRelease(CFTypeRef cf) {if (cf != NULL) CFRelease(cf);}

// Create a SCF dynamic store reference and a corresponding CFRunLoop source. If you add the
// run loop source to your run loop then the supplied callback function will be called when the
// local IP address list changes.
static OSStatus CreateIPAddressListChangeCallbackSCF(SCDynamicStoreCallBack callback, void *contextPtr, SCDynamicStoreRef *storeRef, CFRunLoopSourceRef *sourceRef)
{
    OSStatus err;
    SCDynamicStoreContext context = {0, NULL, NULL, NULL, NULL};
    SCDynamicStoreRef ref = NULL;
    CFStringRef patterns[2] = {NULL, NULL};
    CFArrayRef patternList = NULL;
    CFRunLoopSourceRef rls = NULL;

    // Create a connection to the dynamic store, then create
    // search patterns that find all IPv4 and IPv6 entities.
    context.info = contextPtr;
    ref = SCDynamicStoreCreate(NULL, CFSTR("AddIPAddressListChangeCallbackSCF"), callback, &context);
    err = MoreSCError(ref);
    if (err == noErr)
    {
        // This pattern is "State:/Network/Service/[^/]+/IPv4".
        patterns[0] = SCDynamicStoreKeyCreateNetworkServiceEntity(NULL, kSCDynamicStoreDomainState, kSCCompAnyRegex, kSCEntNetIPv4);
        err = MoreSCError(patterns[0]);
        if (err == noErr)
        {
            // This pattern is "State:/Network/Service/[^/]+/IPv6".
            patterns[1] = SCDynamicStoreKeyCreateNetworkServiceEntity(NULL, kSCDynamicStoreDomainState, kSCCompAnyRegex, kSCEntNetIPv6);
            err = MoreSCError(patterns[1]);
        }
    }

    // Create a pattern list containing both patterns,
    // then tell SCF that we want to watch changes in keys
    // that match that pattern list, then create our run loop
    // source.
    if (err == noErr)
    {
        patternList = CFArrayCreate(NULL, (const void **) patterns, 2, &kCFTypeArrayCallBacks);
        err = CFQError(patternList);
    }
    if (err == noErr) err = MoreSCErrorBoolean(SCDynamicStoreSetNotificationKeys(ref, NULL, patternList));
    if (err == noErr)
    {
        rls = SCDynamicStoreCreateRunLoopSource(NULL, ref, 0);
        err = MoreSCError(rls);
    }

    // Clean up.
    CFQRelease(patterns[0]);
    CFQRelease(patterns[1]);
    CFQRelease(patternList);
    if (err != noErr)
    {
        CFQRelease(ref);
        ref = NULL;
    }
    *storeRef = ref;
    *sourceRef = rls;
    return err;
}

// Called on the watcher thread whenever a watched IPv4/IPv6 key changes.
static void IPConfigChangedCallback(SCDynamicStoreRef /*store*/, CFArrayRef /*changedKeys*/, void * /*info*/)
{
    printf("Network config changed! Place code here to send a notification to your main thread, telling it to close and recreate its sockets....\n");
}
And there are equivalent (and also fairly obscure) mechanisms for getting a network-config-changed notification under Linux (using socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE)) and Windows (using NotifyAddrChange()), which I can post if they would be helpful, but I don't want to spam up this page too much if you are only interested in the MacOS/X solution.
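For reference, the Linux variant can be sketched roughly as follows. This is a stripped-down illustration rather than my production code; the function names OpenNetworkChangeSocket and ContainsNetworkChange are made up, and error handling is minimal.

```cpp
#include <cassert>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <sys/socket.h>
#include <unistd.h>

// Opens a netlink socket subscribed to link-state and IPv4/IPv6 address
// changes. Returns the file descriptor, or -1 on failure.
int OpenNetworkChangeSocket() {
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
    if (fd < 0) return -1;

    sockaddr_nl addr = {};
    addr.nl_family = AF_NETLINK;
    addr.nl_groups = RTMGRP_LINK | RTMGRP_IPV4_IFADDR | RTMGRP_IPV6_IFADDR;
    if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

// Returns true if a buffer filled by recv() on the socket above contains at
// least one link or address change message.
bool ContainsNetworkChange(void* buf, int len) {
    assert(buf != nullptr || len == 0);
    for (nlmsghdr* nh = static_cast<nlmsghdr*>(buf); NLMSG_OK(nh, len);
         nh = NLMSG_NEXT(nh, len)) {
        switch (nh->nlmsg_type) {
            case RTM_NEWLINK:
            case RTM_DELLINK:
            case RTM_NEWADDR:
            case RTM_DELADDR:
                return true;
            default:
                break;
        }
    }
    return false;
}
```

The watcher thread would block in recv() on the descriptor, pass each received buffer to ContainsNetworkChange(), and notify the main thread when it returns true.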
I think what is happening in Windows is that even though you disconnect the cable, Windows still holds the ethernet interface open because you have some sockets bound to it, so the multicast address to which you are sending stays valid. It is also possible that Windows changes which interface the sender/receiver are using, so the change is transparent at the socket level.
I think what is happening in OS X is that when you disconnect the cable, the sender multicasts to the loopback interface, but the receiver is still joined on the disconnected ethernet interface. It may also be possible that OS X is configuring a self-assigned IP that the sender sends from, while the receiver is still listening on the old DHCP IP.
And in Linux, when you disconnect the cable, the ethernet interface loses its IPv4 address and the routes to 239.255.0.1 are removed. The loopback interface isn't configured to send anything outside 127.0.0.0/8, and so you get an error.
Perhaps the solution is to periodically rejoin the group on the OS X receiver? (And maybe you will also have to reconstruct the sender's endpoint periodically.)
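A rejoin could look something like the sketch below, written against the plain BSD socket API. The function name RejoinMulticastGroup is made up, and I have not verified that this actually cures the OS X behavior; it only shows the drop-then-add sequence.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <netinet/in.h>
#include <sys/socket.h>

// Leaves and re-joins an IPv4 multicast group on the default interface.
// Returns true if the final join succeeded.
bool RejoinMulticastGroup(int fd, const char* groupAddr) {
    assert(groupAddr != nullptr);

    ip_mreq mreq = {};
    if (inet_pton(AF_INET, groupAddr, &mreq.imr_multiaddr) != 1) {
        return false;  // not a valid dotted-quad address
    }
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);  // let the OS pick the interface

    // The drop may fail if the membership is already gone; that is fine here.
    (void)setsockopt(fd, IPPROTO_IP, IP_DROP_MEMBERSHIP, &mreq, sizeof(mreq));
    return setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq)) == 0;
}
```

In Boost.Asio terms this should correspond to setting the boost::asio::ip::multicast::leave_group option followed by join_group on the receiving socket.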
Another thing to try is use a self-assigned IP on OS X, so you have the same IP & routes when the cable is connected or disconnected.