开发者

Distributed network monitoring - how to tell if the monitored resource has fallen over, or the monitor it's self is at fault

I'm building a system for monitoring several large web sites (resources), using distributed web services controlled by a central controller.

I'm coming to a specific part of the design - the actual reporting of resources that are thought to have fallen 开发者_运维问答over.

My problem is that there is always the chance that the actual monitor it self is at fault, or has lost its network connection to a resource, and the resource is actually fine. I don't want to report issues if they are not really there.

My plan at the moment is to have the monitor request, that all other monitors check the resource if it encounters a problem, and then make a decision as to whether the resource has really fallen over based on collective results.

I'm sure there's someone out there with more experience of this type of programming than myself.

Is there a common solution to this type of problem? Is my solution a decent way of looking at this?


Your solution is about one of the only pragmatic ones.

There is nothing new under the sun. The IETF Routing Information Protocol wasn't the first attempt at addressing this problem, but it is well documented and works.

Note well, that there is no optimal (or perfect) solution to the class of problems which you are facing, the best you can do with in-band monitoring is make good guesses about where the fault is. In systems that need a very high degree of accuracy of fault information (e.g. the public switched telephone network) a parallel out-of-band monitoring network is established which itself must necessarily be monitored by humans.


Quis custodiet ipsos custodes? (Who will watch the watchers?) -- Juvenal, "Satires"

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜