Debugging Erlang heart timeouts

2023-02-24 03:23 问答作者：

I use the heart program to restart an Erlang node when it becomes unresponsive. However, I am开发者_如何学编程 finding it hard to understand why the node freezes. SASL logs don't show any errors, and my own logs don't seem to show anything remarkable happening at those times. Can anybody give advice on debugging this sort of thing?

By default the heart program issues a SIGKILL to kill off the unresponsive VM so it can quickly start a new one. This makes getting any useful information about the VM pretty much impossible. Something I've tried in the past is to patch the heart program to avoid the hard kill and instead get the VM to create a crash dump and a coredump. I used a patch like this (this one is for Erlang/OTP R14B02):

--- erts/etc/common/heart.c.orig 2011-04-17 12:11:24.000000000 -0400
+++ erts/etc/common/heart.c 2011-04-17 12:12:36.000000000 -0400
@@ -559,10 +559,11 @@
     int res;
     if(heart_beat_kill_pid != 0){
    pid = (pid_t) heart_beat_kill_pid;
-   res = kill(pid,SIGKILL);
+   res = kill(pid,SIGUSR1);
+   sleep(4);
    for(i=0; i < 5 && res == 0; ++i){
        sleep(1);
-       res = kill(pid,SIGKILL);
+       res = kill(pid,i < 2 ? SIGQUIT : SIGKILL);
    }
    if(errno != ESRCH){
        print_error("Unable to kill old process, "

As you can see, with this patch heart will first issue a SIGUSR1 to try to get the VM to create a crash dump. Since this can take awhile, heart then sleeps for 4 seconds. You might have to increase this sleep time if you're not getting full crash dumps. After that, heart then tries twice to issue a SIGQUIT with the hope of getting a coredump, and if that fails, issues a SIGKILL.

Note that this patch will slow down heart's VM restart due to the time required to wait for the crash dumps and coredumps. If you use it in production, be aware of this limitation.

You could try to call erlang:halt/1 from your HEART_COMMAND thus creating a crash dump from the unresponsive node.

You can try using the erl_call tool with e.g. -a erlang halt 123.

If the erlang node can't respond to this is also interesting information.

Did you try increasing `HEART_BEAT_TIMEOUT? Maybe the node is just bogged down a bit an misses the timeout but doesn't freeze.

If you have any idea of why it is freezing you could try to trace the module using dbg.

http://www.erlang.org/doc/man/dbg.html

In short try

dbg:tracer(), dbg:p(all,c), dbg:tpl(Module, Function, x).

If you want to stop this tracing issue

dbg:ctpl()

See documentation for more info.

Note: Change Module and Function to whatever you want to trace, leave x as it is. You can also skip Function and only give Module, x.

Warning: Running this on a live system can be dangerous as the amount of information that is going to be printed to the shell can be enormous.

继续阅读：erlang heartbeat

Debugging Erlang heart timeouts

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？