Quickly detect remote process exit/crash

2023-02-14 17:36

I have a distributed app where resources get locked for exclusive use by tasks. Each task runs in its own process. I'd like to automatically unlock resources if a task process exits or the server it's running on dies (e.g. power failure).

How could I remotely detect such a process exit/failure within a few seconds?

After some Googling I came up with a few ideas, but I don't have direct experience with any of them...

Use advisory lock functions built into MySQL (get_lock) or PostgreSQL (pg_advisory_lock). These would automatically release the locks if the database connection closes, which would happen on a process exit or server crash.
Use a dedicated distributed lock manager, like ZooKeeper. This would work, but it seems like more than I need.
Make a TCP connection from the task process to a remote monitoring process with the TCP/socket keepalive option enabled. This seems doable, but I'd rather build on something that takes care of the low-level network details for me.

Another thought was to split the problem up. Since server crashes are fairly uncommon, I could use a local watchdog process to monitor for process exits and then use something else to monitor for server crashes.
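To make the local watchdog idea concrete, here is a rough sketch of what I have in mind, assuming Python and the standard subprocess module. The names run_and_watch and on_exit are made up for illustration; on_exit stands for whatever would release the task's locks.

```python
import subprocess
import sys


def run_and_watch(cmd, on_exit):
    """Start `cmd` as a child process and call `on_exit(returncode)` when it ends."""
    proc = subprocess.Popen(cmd)
    # wait() blocks until the child exits, whether normally or killed by a signal.
    returncode = proc.wait()
    on_exit(returncode)  # hypothetical hook: release the task's locks here
    return returncode


if __name__ == "__main__":
    # Stand-in task that exits with status 3 right away.
    task = [sys.executable, "-c", "import sys; sys.exit(3)"]
    run_and_watch(task, lambda rc: print("task exited with status", rc))
```

This only covers process exits on a machine that is still running; if the whole server dies, the watchdog dies with it.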

Thanks for any feedback!

You may want to read up on "The ϕ Accrual Failure Detector". I found it to be the most generic and theoretically sound approach to failure detection. It is never a question of "detecting failures within seconds" but always a trade-off between how fast and how reliable your failure detection is. By collecting and processing statistics about past failures that were correctly or mistakenly detected, you can estimate the probability of failure as a function of how long you have been waiting for a response from the remote server.
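To give a feel for the idea, here is a minimal sketch, not the paper's reference implementation. It assumes heartbeat inter-arrival times are roughly normally distributed; the class name, the window size, and the min_std floor are my own choices. The suspicion level ϕ is -log10 of the probability that a heartbeat would still arrive after the time already waited.

```python
import math
from collections import deque
from statistics import mean, pstdev


class PhiAccrualDetector:
    """Toy phi accrual failure detector over heartbeat arrival times."""

    def __init__(self, window=100, min_std=0.1):
        self.intervals = deque(maxlen=window)  # recent inter-arrival times, in seconds
        self.last_arrival = None
        self.min_std = min_std  # floor so perfectly regular heartbeats don't give sigma = 0

    def heartbeat(self, now):
        """Record a heartbeat received at time `now`."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level at time `now`; higher means more likely dead."""
        if not self.intervals:
            return 0.0  # not enough history to judge yet
        mu = mean(self.intervals)
        sigma = max(pstdev(self.intervals), self.min_std)
        elapsed = now - self.last_arrival
        # P(next heartbeat arrives later than `elapsed`) under the normal model.
        p_later = 0.5 * math.erfc((elapsed - mu) / (sigma * math.sqrt(2)))
        p_later = max(p_later, 1e-300)  # avoid log10(0) for very long silences
        return -math.log10(p_later)
```

You would call heartbeat() whenever a message arrives from the remote process and declare it dead once phi() crosses a threshold you pick. A low threshold detects failures quickly but produces more false alarms; a high one is slower but more reliable.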

TCP keep-alive is not much use here with its default settings: the probes are too infrequent, with the first one sent only after about 2 hours of idle time.
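For what it's worth, the 2-hour figure is only the default idle time; on Linux the timers can be lowered per socket. A minimal sketch, assuming Python's standard socket module: the function name and the timer values are illustrative, and the TCP_KEEP* constants are not exposed on every platform, so the code skips any that are missing.

```python
import socket


def enable_fast_keepalive(sock, idle=2, interval=1, probes=3):
    """Turn on TCP keep-alive and, where the platform allows, shorten its timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Platform-specific options (all three exist on Linux); skip any that are missing.
    options = (
        ("TCP_KEEPIDLE", idle),       # seconds of idle time before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", probes),      # unanswered probes before the connection is dropped
    )
    for name, value in options:
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
```

With these values a dead peer would be noticed after roughly idle + interval * probes seconds. Even then, keep-alive only tells you the remote kernel stopped answering, not that the task itself is healthy.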

If you don't want to roll your own implementation for anything, you can rely on external services.

You can try using something like lockable, which makes locking primitives available over a network:

# acquire lock
https://lockable.dev/api/acquire/my-lock-name

# release lock
https://lockable.dev/api/release/my-lock-name

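Locks automatically expire if they are not renewed, so one approach would be to acquire a lock when your process starts, set the expiration duration to a couple of seconds, and send a heartbeat to lockable every second. The expiration has to be longer than the heartbeat interval, otherwise the lock can lapse between two renewals.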

With this setup, if you ever see the lock release unexpectedly, it means your process has died.
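The expiry-and-renew logic can be sketched locally. This is an in-memory stand-in I made up to illustrate the mechanism, not lockable's actual API; LeaseTable and its methods are hypothetical names, and the clock is injectable so the behaviour can be checked without sleeping.

```python
import time


class LeaseTable:
    """In-memory stand-in for a lock service whose locks expire unless renewed."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._leases = {}  # lock name -> (owner, expiry time)

    def _live(self, name):
        """Return the current lease for `name`, dropping it if it has expired."""
        lease = self._leases.get(name)
        if lease is not None and lease[1] <= self._clock():
            del self._leases[name]  # lease ran out: treat the owner as dead
            lease = None
        return lease

    def acquire(self, name, owner, ttl):
        lease = self._live(name)
        if lease is not None and lease[0] != owner:
            return False  # someone else still holds a live lease
        self._leases[name] = (owner, self._clock() + ttl)
        return True

    def renew(self, name, owner, ttl):
        """Heartbeat: push the expiry forward if `owner` still holds the lease."""
        lease = self._live(name)
        if lease is None or lease[0] != owner:
            return False  # too late: the lease expired or changed hands
        self._leases[name] = (owner, self._clock() + ttl)
        return True

    def holder(self, name):
        lease = self._live(name)
        return lease[0] if lease else None
```

A task would call acquire() at startup and renew() on a timer; a monitor polling holder() sees None once the heartbeats stop for longer than the TTL. Note that a missed renewal can also mean a network problem rather than a dead process.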
