Infinite timeouts or "fail fast" in custom network protocol?

Posted: 2022-12-12 13:13

Consider a custom network protocol used to control robotic peripherals over a LAN from a central .NET-based workstation. (In case it matters, the robot is busy moving fabs in a chip production environment.)

There are only 2 parties in the conversation: the .NET station and the robotic peripheral board.
The robotic side can only receive requests and send responses.
The .NET side can only initiate requests and receive responses.
There should always be exactly one response per request.
Consecutive requests can follow one another immediately without waiting for a response, but the number of simultaneously served requests must never exceed a fixed limit (for example, 5).
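The rules above could be tracked on the .NET side with something like the following sketch, written in Python for brevity. The request ids, the `RequestWindow` name, and its methods are assumptions made for illustration. The question does not say how responses are matched to requests.

```python
import itertools


class RequestWindow:
    """Tracks outstanding requests; at most `limit` may be in flight."""

    def __init__(self, limit=5):
        self.limit = limit
        self._ids = itertools.count(1)   # hypothetical request ids
        self.pending = {}                # request id -> payload awaiting a response

    def can_send(self):
        return len(self.pending) < self.limit

    def send(self, payload):
        # Refuse to exceed the fixed limit of simultaneously served requests.
        if not self.can_send():
            raise RuntimeError("in-flight limit reached")
        request_id = next(self._ids)
        self.pending[request_id] = payload
        return request_id

    def on_response(self, request_id):
        # Exactly one response per request: an unknown or repeated id
        # is treated as a protocol error.
        if request_id not in self.pending:
            raise KeyError(f"unexpected response for request {request_id}")
        return self.pending.pop(request_id)
```

Anything left in `pending` is a request that has not been answered yet, and those are the requests the timeout question is about.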

I had an exhaustive discussion with my friend (who owns the design; I took part as a bystander) about all the details and ideas. At the end of the discussion we strongly disagreed about the missing timeouts. My friend's argument is that the software on both sides should wait indefinitely. My argument was that any network protocol always needs timeouts. We simply could not agree.

One of my arguments is that in case of any failure you should "fail fast" whatever the cost: if the failure has already occurred, the cost of recovery keeps growing in proportion to the time it takes to learn about the failure. Say, after 1 minute on a LAN you should definitely stop waiting and raise an alarm.

His argument was that recovery should consist of repairing exactly what failed (in this case, restoring the network connection). Even if it takes hours to figure out that the network was lost and to fix it, the software should simply continue running transparently as soon as the LAN cables are reconnected.

I would never have seriously considered timeless protocols until this discussion.

Which side of the argument is right: "fail fast" or "never fail"?

Edit: An example of failure is loss of communication, normally detected by the TCP layer. This part was also discussed. If the TCP layer returns an error, the higher custom protocol layer will retry sends, and there is no argument about that. The question is: for how long should the lower level be allowed to keep trying?

Edit for accepted answer: The answer is more complex than the 2 choices: "The most common approach is to never give up the connection until an actual attempt to send fails with solid confirmation that the connection is long lost. To determine that the connection is long lost, use heartbeats, but keep the age of the loss for this confirmation only, not for an immediate alarm".

Example: With a telnet session, you can keep your terminal open forever and never know whether, between one press of Enter and the next, there were failures detectable by lower-level routines.
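One possible reading of that summary, as a minimal Python sketch: heartbeats only record how long the peer has been silent, and the connection is abandoned only when a real send has failed and the silence is already old. The `LinkMonitor` name, the 60-second `confirm_after` default, and the injectable `clock` are assumptions for illustration, not part of the accepted answer.

```python
import time


class LinkMonitor:
    """Records when the peer was last heard from (heartbeat or response)."""

    def __init__(self, confirm_after=60.0, clock=time.monotonic):
        self.confirm_after = confirm_after   # assumed threshold, in seconds
        self._clock = clock
        self._last_heard = clock()

    def on_heartbeat(self):
        # Call this for every heartbeat or response received from the peer.
        self._last_heard = self._clock()

    def silence_age(self):
        return self._clock() - self._last_heard

    def should_give_up(self, send_failed):
        # Silence alone never raises an alarm; it only confirms a failed send.
        return send_failed and self.silence_age() >= self.confirm_after
```

With this shape, a long silence while nothing needs sending is tolerated, which matches the telnet example: the session stays up until someone actually tries to use it.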

In the scenario where ...

Controller has sent a request
Robot hasn't received the request
Network fails

... then the request has been sent, but has been lost and will never arrive.

Therefore, when the network is restored, the controller must resend the request: the controller cannot simply wait forever for the response.
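A rough Python sketch of that bookkeeping follows. The controller remembers every request that has not been answered and replays those after reconnecting. The `PendingRequests` class and the request ids are hypothetical. The sketch also assumes the robot can recognize a repeated id, since a request may have been executed even though its response was lost.

```python
class PendingRequests:
    """Remembers requests that were sent but not yet answered."""

    def __init__(self):
        self._pending = {}   # request id -> payload, kept in send order

    def record_sent(self, request_id, payload):
        self._pending[request_id] = payload

    def record_response(self, request_id):
        # A response settles the request; it no longer needs resending.
        self._pending.pop(request_id, None)

    def to_resend(self):
        # After the connection is restored, replay these in original order.
        return list(self._pending.items())
```

For example, if requests 1, 2 and 3 were sent and only 2 was answered before the network failed, `to_resend()` returns requests 1 and 3.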

I prefer your "fail fast" method, but as I think you've discovered, this is largely a matter of preference.

The Cisco equipment that I work with works very similarly: you send a request, it responds (over telnet). The problem comes when the network fails: I lose the TCP connection. However, neither side will close that connection until it attempts to send data, and since the Cisco side rarely does that, it never closes. Worse, you can only have 1 connection at a time, so if there's a network failure, you're locked out. (They can be reset, but it's just a hassle.)

Now, to test a network connection, you need some sort of ping, just an "are you still there?" message. Many protocols do this, such as AIM and IRC. But those pings cost bandwidth, depending on how often you send them.

So, is the error detection worth the cost in bandwidth? How big does a ping really need to be? I'd say you should be able to get it under 50 octets per ping, and if you ping once every 10 s, 30 s, or 1 min, I'd say it's well worth it. The earlier you know you have a problem, the better. If the software itself can then use these pings to know it lost the connection and re-establish contact automatically, I'd say that's great, along the lines of "Computer, heal thyself", and it makes for less hassle for the operator.
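To put rough numbers on that, here is a small Python calculation using the figures above (50 octets per ping at 10 s, 30 s and 60 s intervals). It counts only the ping payload in one direction and ignores TCP/IP header overhead and any reply, so treat it as a lower bound.

```python
def ping_bandwidth(octets_per_ping, interval_seconds):
    """Average payload rate of periodic pings, in octets per second."""
    return octets_per_ping / interval_seconds


for interval in (10, 30, 60):
    rate = ping_bandwidth(50, interval)
    print(f"every {interval:>2} s: {rate:.2f} octets/s ({rate * 8:.1f} bit/s)")
```

Even at the shortest interval this is 5 octets per second, which is negligible on a LAN.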

If you're using TCP/IP, it can do this automatically for you: see TCP keepalives. Alternatively, you can do it within your application's protocol, as AIM and IRC do.
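As a minimal sketch, this is how TCP keepalives can be switched on for a socket in Python. `SO_KEEPALIVE` is widely available; the tuning options `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` are platform-specific, so the code applies them only where the `socket` module exposes them. The helper name and the timing values are assumptions chosen for illustration.

```python
import socket


def enable_keepalive(sock, idle=30, interval=10, probes=3):
    """Enable TCP keepalive probes on an existing socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Optional tuning: seconds idle before probing, seconds between
    # probes, and number of unanswered probes before the connection drops.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", probes)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
```

Without the tuning options the operating system defaults apply, and these are often measured in hours, so an application-level ping may still be the more practical choice when a dead link must be noticed within seconds.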
