Scalable algorithm to detect stale data

2023-03-28 14:54

Here is the problem:

"Agents" installed on many different servers send "heartbeat" signals up to a central server every 5 seconds. How can I find the ones that have missed their开发者_StackOverflow中文版 heartbeat for more than 10 seconds actively and raise an alert?

The problem is simple if you don't think about scalability. In the simplest form, you can record the timestamp of the latest heartbeat received from each agent in a database table and run a regular query to find the ones older than the threshold.
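For reference, the simple version described above could look roughly like this. It is only a minimal sketch: the table name `heartbeat`, the column names, and the in-memory SQLite database are assumptions made for illustration, not something from an actual setup.

```python
import sqlite3

THRESHOLD_SECONDS = 10  # alert threshold from the question

# Hypothetical schema: one row per agent holding its latest heartbeat time.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE heartbeat (agent_id TEXT PRIMARY KEY, last_seen REAL NOT NULL)"
)

def record_heartbeat(agent_id, now):
    # Insert the agent or overwrite its previous timestamp.
    conn.execute(
        "INSERT OR REPLACE INTO heartbeat (agent_id, last_seen) VALUES (?, ?)",
        (agent_id, now),
    )

def find_stale(now):
    # The "regular query": every agent whose last heartbeat is older
    # than the threshold.
    rows = conn.execute(
        "SELECT agent_id FROM heartbeat WHERE last_seen < ? ORDER BY agent_id",
        (now - THRESHOLD_SECONDS,),
    )
    return [row[0] for row in rows]
```

Every heartbeat is a write and every check reads the table through the timestamp filter, which is where this starts to hurt with millions of agents.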

This solution is however not scalable to millions of agents.

I am looking for algorithms or technologies that make this possible.

Use a map: AgentId --> LastHeartbeatTime
Use 11 sets (assuming a resolution of 1 second is enough); each holds the ids of the agents that reported in a given 1-second window.

Every time an agent reports a heartbeat:
1. Find it in the map
2. Delete it from the relevant set
3. Update it in the map
4. Add it to the relevant set

Define a thread: once per second, the oldest set expires. It should be empty. If it isn't, it contains the ids of agents which did not report in time. Once a set expires, you can reuse it (cyclic array of sets).

I believe it can be implemented without locks (maybe you'll need 12 sets).
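A minimal single-threaded sketch of the map plus cyclic array of sets, in Python. The class and method names are made up for illustration, and it uses the 12-set variant so that the expiring set is never the one currently receiving heartbeats. It assumes `expire` is called exactly once per second and does not attempt the lock-free part.

```python
class HeartbeatWheel:
    """Map of agent id -> last heartbeat second, plus a cyclic array of sets."""

    def __init__(self, timeout_seconds=10):
        # timeout + 2 buckets (12 for a 10-second timeout): the expiring
        # bucket is always distinct from the bucket for the current second.
        self.num_buckets = timeout_seconds + 2
        self.buckets = [set() for _ in range(self.num_buckets)]
        self.last_seen = {}  # agent id -> whole second of last heartbeat

    def _bucket(self, second):
        return self.buckets[second % self.num_buckets]

    def heartbeat(self, agent_id, now):
        second = int(now)
        old = self.last_seen.get(agent_id)       # 1. find it in the map
        if old is not None:
            self._bucket(old).discard(agent_id)  # 2. delete from its old set
        self.last_seen[agent_id] = second        # 3. update the map
        self._bucket(second).add(agent_id)       # 4. add to the current set

    def expire(self, now):
        """Call once per second; returns the agents left in the oldest set."""
        oldest_second = int(now) - (self.num_buckets - 1)
        bucket = self._bucket(oldest_second)
        stale = sorted(bucket)
        for agent_id in stale:
            del self.last_seen[agent_id]  # forget the agent until it reports again
        bucket.clear()                    # the set can now be reused
        return stale
```

Both `heartbeat` and `expire` touch only one or two sets, so the cost per heartbeat is constant and the cost per tick is proportional to the number of agents that actually went missing.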

Without knowing the language and platform it's somewhat hard to advise you on a detailed implementation, but my advice is somewhat similar to Lior Kogan's. In my opinion, however, you only need two sets and no map is involved:

Say you have two variables representing sets, A and B.

Every heartbeat removes the agent id from set A. Every 5 seconds, a different thread raises an alert for every agent id in B, then sets B = A, and finally creates a set with all of the agent ids and sets A to equal that (if the number of agent ids is really large, you can prepare the new set between one check and the next and only sleep for the remaining time).

Locking would only be needed while changing the variables pointing to each set, provided you use a lock-free set collection. Performance will largely depend on the algorithmic complexity of that implementation, and if you go this way, you should favor the one with the best performance for removals (not necessarily the best big-O, for instance if worst-case latency matters to you).
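A rough single-threaded sketch of the two-set idea, in Python. The names are hypothetical, and one detail is an assumption that goes beyond the description: a heartbeat here is also removed from B, so that an agent which recovers before the next check is not alerted. Locking and the lock-free set are left out.

```python
class TwoSetMonitor:
    def __init__(self, agent_ids):
        self.all_agents = frozenset(agent_ids)
        self.a = set(self.all_agents)  # agents not yet heard from this interval
        self.b = set()                 # agents that were also silent last interval

    def heartbeat(self, agent_id):
        self.a.discard(agent_id)
        # Assumption (not in the description above): also clear the agent
        # from B so a late heartbeat cancels the pending alert.
        self.b.discard(agent_id)

    def check(self):
        """Call every 5 seconds; returns the agent ids to alert on."""
        alerts = sorted(self.b)        # silent for two consecutive intervals
        self.b = self.a                # silent this interval: one more chance
        self.a = set(self.all_agents)  # everyone must report again
        return alerts
```

For example, with agents `x` and `y` where only `x` ever reports, the first `check()` returns nothing and the second returns `["y"]`. Rebuilding A from the full id set on every check is the expensive step, which is why preparing it between checks is suggested.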

As a side note, if memory is not an issue or failures are relatively few, you can do the alert check and the alerting itself on a thread of its own and possibly get an interesting performance speedup, at the cost of keeping the old B set in memory. Again, the platform and runtime matter: in Erlang that would be a breeze, but on Windows the cost of creating a full-blown new thread might exceed the performance benefit if the failures are few.

MongoDB is great for this type of use. While not exactly an algorithm, it does fit the bill for a fundamental technology that is needed to create this service. We use it here at CopperEgg for our RevealCloud product to do exactly what you say - we send an alert when the system has gone away for some bit of time - sampling every 5 seconds. I'd love to hear more about your thoughts and use case. Can you provide more details?
