Quickly detect remote process exit/crash
I have a distributed app where resources get locked for exclusive use by tasks. Each task runs in its own process. I'd like to automatically unlock resources if a task process exits or the server it's running on dies (e.g. a power failure).
How could I remotely detect such a process exit/failure within a few seconds?
After some Googling I came up with a few ideas, but I don't have direct experience with any of them...
Use advisory lock functions built into MySQL (get_lock) or PostgreSQL (pg_advisory_lock). These would automatically release the locks if the database connection closes, which would happen on a process exit or server crash.
Use a dedicated distributed lock manager, like ZooKeeper. This would work, but it seems like more than I need.
Make a TCP connection from the task process to a remote monitoring process with the TCP/socket keepalive option enabled. This seems doable, but I'd rather build on something that takes care of the low-level network details for me.
Another thought was to split the problem up. Since server crashes are fairly uncommon, I could use a local watchdog process to monitor for process exits and then use something else to monitor for server crashes.
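To make the local-watchdog half of that split concrete, here is a minimal sketch of what I have in mind. The names `run_and_watch` and `on_exit` are made up for illustration; the callback is where the resource would be unlocked. It only covers process exit on the same machine, not a server crash.

```python
import subprocess


def run_and_watch(cmd, on_exit):
    """Start `cmd`, block until it exits, then report its return code.

    `on_exit` is a hypothetical callback (pid, returncode) that would
    release whatever resources the task had locked.
    """
    proc = subprocess.Popen(cmd)
    # wait() returns on a normal exit and also when the child is killed;
    # on POSIX a child killed by signal N yields a return code of -N.
    returncode = proc.wait()
    on_exit(proc.pid, returncode)
    return returncode
```

It could be called as `run_and_watch(["python", "task.py"], release_locks)`, where `task.py` and `release_locks` are placeholders.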
Thanks for any feedback!
You may want to read about "The ϕ Accrual Failure Detector". I found it to be the most generic and theoretically sound approach to failure detection. It is never a question of "detecting failures within seconds" but always a trade-off between how fast and how reliable your failure detection is. By collecting and processing statistics on failures that were correctly or mistakenly detected in the past, you can estimate the probability of failure as a function of how long you have been waiting for a response from the remote server.
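As a rough illustration of the idea, here is a simplified sketch, not a faithful implementation of the paper. It assumes heartbeat gaps are roughly normally distributed, and the class name, window size, and `min_std` floor are my own choices. The monitor records the gaps between heartbeats and turns "time since the last heartbeat" into a suspicion level phi; you then pick a threshold that trades speed for reliability.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Simplified phi accrual failure detector (normal-distribution model)."""

    def __init__(self, window_size=100, min_std=0.05):
        self.intervals = deque(maxlen=window_size)  # recent heartbeat gaps (s)
        self.last_heartbeat = None
        self.min_std = min_std  # floor so perfectly regular gaps don't divide by zero

    def heartbeat(self, now):
        """Record a heartbeat that arrived at time `now` (seconds)."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        """Suspicion level: -log10 of P(the next heartbeat is still to come)."""
        if not self.intervals:
            return 0.0  # not enough history to suspect anything
        n = len(self.intervals)
        mean = sum(self.intervals) / n
        var = sum((x - mean) ** 2 for x in self.intervals) / n
        std = max(math.sqrt(var), self.min_std)
        elapsed = now - self.last_heartbeat
        # Upper tail of the normal distribution: P(gap > elapsed).
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        p_later = max(p_later, 1e-300)  # avoid log10(0) after underflow
        return -math.log10(p_later)
```

A low threshold on `phi` reports failures sooner but produces more false alarms; a high threshold is slower but more certain.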
TCP keep-alive is useless here as configured by default: its "ping" is too coarse, with a default idle time of about 2 hours before the first probe is sent.
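For completeness, the timers can be shortened per socket on platforms that expose the options. A sketch, assuming the Linux option names (`TCP_KEEPIDLE`, `TCP_KEEPINTVL`, `TCP_KEEPCNT`); the helper name and the default values are only illustrative:

```python
import socket


def enable_fast_keepalive(sock, idle=2, interval=1, count=3):
    """Enable TCP keepalive with short timers where the platform allows it."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The three options below are Linux names; other platforms may lack them,
    # in which case the system-wide defaults stay in effect.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idle time before the first probe.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between unanswered probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes before the connection is considered dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock
```

With these values a dead host should be noticed after roughly `idle + interval * count` seconds on an otherwise idle connection. A plain process exit is noticed much sooner, because the kernel closes the socket and the peer sees the connection end.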
If you don't want to roll your own implementation for anything, you can rely on external services.
You can try using something like lockable, which makes locking primitives available over a network:
# acquire lock
https://lockable.dev/api/acquire/my-lock-name
# release lock
https://lockable.dev/api/release/my-lock-name
Locks automatically expire if they are not renewed, so one approach would be to acquire a lock when your process starts, set the expiration duration to 1 or 2 seconds, and send a heartbeat to lockable every second.
With this setup, if you ever see the lock release unexpectedly, it means your process has died.
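The expire-unless-renewed behaviour can be modelled locally to see how the timing works out. This is not lockable's API (only the two URLs above are taken from it); the `Lease` class and its methods are hypothetical and exist just to show the semantics:

```python
import time


class Lease:
    """A lock that expires unless its holder keeps renewing it."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl            # seconds a lease lasts without renewal
        self.clock = clock        # injectable so the timing can be tested
        self.expires_at = None    # None means "never acquired"

    def is_held(self):
        return self.expires_at is not None and self.clock() < self.expires_at

    def acquire(self):
        """Take the lease if nobody currently holds it."""
        if self.is_held():
            return False
        self.expires_at = self.clock() + self.ttl
        return True

    def renew(self):
        """Heartbeat: extend the lease, but only if it has not expired yet."""
        if not self.is_held():
            return False
        self.expires_at = self.clock() + self.ttl
        return True
```

A monitor that polls `is_held()` sees the lease lapse at most `ttl` seconds after the last heartbeat, so the expiration should be somewhat longer than the heartbeat interval to tolerate one delayed heartbeat.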