Debugging Erlang heart timeouts

I use the heart program to restart an Erlang node when it becomes unresponsive. However, I am finding it hard to understand why the node freezes. SASL logs don't show any errors, and my own logs don't seem to show anything remarkable happening at those times. Can anybody give advice on debugging this sort of thing?


By default the heart program issues a SIGKILL to kill off the unresponsive VM so it can quickly start a new one. This makes getting any useful information about the VM pretty much impossible. Something I've tried in the past is to patch the heart program to avoid the hard kill and instead get the VM to create a crash dump and a coredump. I used a patch like this (this one is for Erlang/OTP R14B02):

--- erts/etc/common/heart.c.orig 2011-04-17 12:11:24.000000000 -0400
+++ erts/etc/common/heart.c 2011-04-17 12:12:36.000000000 -0400
@@ -559,10 +559,11 @@
     int res;
     if(heart_beat_kill_pid != 0){
    pid = (pid_t) heart_beat_kill_pid;
-   res = kill(pid,SIGKILL);
+   res = kill(pid,SIGUSR1);
+   sleep(4);
    for(i=0; i < 5 && res == 0; ++i){
        sleep(1);
-       res = kill(pid,SIGKILL);
+       res = kill(pid,i < 2 ? SIGQUIT : SIGKILL);
    }
    if(errno != ESRCH){
        print_error("Unable to kill old process, "

As you can see, with this patch heart first issues a SIGUSR1 to try to get the VM to create a crash dump. Since this can take a while, heart then sleeps for 4 seconds. You might have to increase this sleep time if you're not getting full crash dumps. After that, heart issues a SIGQUIT twice, one second apart, in the hope of getting a coredump. If the VM is still alive after that, heart falls back to SIGKILL.

Note that this patch will slow down heart's VM restart due to the time required to wait for the crash dumps and coredumps. If you use it in production, be aware of this limitation.
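
For anyone who wants to try the escalation order outside of heart first, here is a minimal standalone sketch of the same idea in C. It is not the heart.c code: signal_for_step and escalate_kill are hypothetical names I made up for illustration, and the timing differs slightly from the patch (here each signal is followed by its sleep). The crash dump wait is a parameter instead of a hard-coded 4 seconds.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Signal to send at each step: step 0 asks the Erlang VM for a crash dump,
 * steps 1 and 2 ask for a core dump, later steps are a hard kill. */
int signal_for_step(int step)
{
    if (step <= 0)
        return SIGUSR1;
    if (step <= 2)
        return SIGQUIT;
    return SIGKILL;
}

/* Hypothetical helper, not part of heart.c.
 * Returns 0 once the process no longer exists, -1 otherwise. */
int escalate_kill(pid_t pid, unsigned int dump_wait_secs)
{
    int step;

    /* Non-positive pids address process groups or every process,
     * so refuse them outright. */
    if (pid <= 0) {
        errno = EINVAL;
        return -1;
    }

    for (step = 0; step < 6; ++step) {
        if (kill(pid, signal_for_step(step)) != 0)
            return errno == ESRCH ? 0 : -1;
        /* Give the crash dump more time than the later steps. */
        sleep(step == 0 ? dump_wait_secs : 1);
    }

    /* Signal 0 sends nothing; it only probes whether the pid still exists. */
    if (kill(pid, 0) != 0 && errno == ESRCH)
        return 0;
    return -1;
}
```

Calling escalate_kill(pid, 4) would roughly mirror the 4 second wait in the patch. Raise the second argument if the crash dumps come out truncated.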


You could try to call erlang:halt/1 from your HEART_COMMAND, thus creating a crash dump from the unresponsive node.

You can try using the erl_call tool with e.g. -a erlang halt 123.

If the Erlang node can't respond to this, that is also interesting information.

Did you try increasing HEART_BEAT_TIMEOUT? Maybe the node is just bogged down a bit and misses the timeout but doesn't actually freeze.


If you have any idea of why it is freezing, you could try to trace the suspect module using dbg.

http://www.erlang.org/doc/man/dbg.html

In short, try:

dbg:tracer(), dbg:p(all,c), dbg:tpl(Module, Function, x).

To stop this tracing, issue:

dbg:ctpl()

See documentation for more info.

Note: Change Module and Function to whatever you want to trace, leave x as it is. You can also skip Function and only give Module, x.

Warning: Running this on a live system can be dangerous as the amount of information that is going to be printed to the shell can be enormous.
