Debugging strange error that depends on the selected scheduler

I am seeing strange behavior in some software I am working on. It is a realtime machine controller, written in C++, running on Linux, and it makes extensive use of multithreading.

When I run the program without asking it to be realtime, everything works as I expect. But when I ask it to switch to its realtime mode, there is a clearly reproducible bug that makes the application crash. I guess it must be some kind of deadlock, because a mutex runs into a timeout and ultimately triggers an assertion.

My question is how to hunt this one down. Looking at the backtrace from the resulting core dump is not very helpful, as the cause of the problem lies somewhere in the past.

The following code does the switching between 'normal' and 'realtime' behaviour:

In main.cpp (simplified, return-codes are checked via assertions):

if (startAsRealtime) {
    // Request SCHED_RR at priority 99, the highest realtime priority on Linux.
    struct sched_param sp;
    memset(&sp, 0, sizeof(sched_param));
    sp.sched_priority = 99;
    sched_setscheduler(getpid(), SCHED_RR, &sp);
}

In every thread (simplified, return-codes are checked via assertions):

if (startAsRealtime) {
    sched_param param;
    // Use the attributes set here instead of inheriting the creator's scheduling.
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_getschedparam(&attr, &param);
    param.sched_priority = priority;
    pthread_attr_setschedpolicy(&attr, SCHED_RR);
    pthread_attr_setschedparam(&attr, &param);
}

Thanks in advance


If you're using glibc as your C library, you could use the answer to the question "Is it possible to list mutexs which a thread holds" to find out which thread is holding the mutex that is timing out. That should start to narrow things down: you can then inspect that thread and find out why it's not giving up the mutex.
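
If poking at glibc internals is not an option, a portable alternative is to have the mutex itself remember its owner. The following is only a minimal sketch under the assumption that your locks can be wrapped; TrackedMutex and lockOrReport are hypothetical names, not something from your code base or from any library.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>

// Hypothetical wrapper: a timed mutex that remembers which thread owns it.
class TrackedMutex {
public:
    // Try to lock for at most `timeout`; returns true on success.
    bool try_lock_for(std::chrono::milliseconds timeout) {
        if (!mutex_.try_lock_for(timeout)) {
            return false;
        }
        // Recorded just after acquisition, so a concurrent reader may
        // briefly see the previous value. Good enough for diagnostics.
        owner_.store(std::this_thread::get_id());
        return true;
    }

    void unlock() {
        // Sanity check: only the recorded owner should unlock.
        assert(owner_.load() == std::this_thread::get_id());
        owner_.store(std::thread::id{});
        mutex_.unlock();
    }

    // Id of the thread holding the lock, or a default id if it is free.
    std::thread::id owner() const { return owner_.load(); }

private:
    std::timed_mutex mutex_;
    std::atomic<std::thread::id> owner_{std::thread::id{}};
};

// Lock with a timeout; on failure, log who holds the mutex before giving up.
// `site` is a free-form label for the call site.
inline bool lockOrReport(TrackedMutex& m, std::chrono::milliseconds timeout,
                         const char* site) {
    if (m.try_lock_for(timeout)) {
        return true;
    }
    std::cerr << site << ": timed out, mutex held by thread "
              << m.owner() << '\n';
    return false;
}
```

The printed thread id is an implementation-defined value, so you would probably also log std::this_thread::get_id() once at the start of each thread to be able to map it back to a named thread.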


One of your realtime threads might be spinning in a loop (not yielding), thus starving other threads and resulting in a mutex timeout.
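
To check for this kind of starvation it helps to know which policy and priority each thread really ends up with, since that decides who can starve whom. As a rough sketch, each thread could log the following at startup; policyName and describeCurrentThreadSched are made-up helper names, and only pthread_getschedparam and pthread_self are real POSIX calls.

```cpp
#include <pthread.h>
#include <sched.h>
#include <string>

// Human-readable name for a scheduling policy constant.
inline const char* policyName(int policy) {
    switch (policy) {
        case SCHED_OTHER: return "SCHED_OTHER";
        case SCHED_FIFO:  return "SCHED_FIFO";
        case SCHED_RR:    return "SCHED_RR";
        default:          return "unknown";
    }
}

// Describes the calling thread's effective policy and priority, in the
// form "<policy> prio=<n>". Returns an empty string if the query fails.
inline std::string describeCurrentThreadSched() {
    int policy = 0;
    sched_param param{};
    if (pthread_getschedparam(pthread_self(), &policy, &param) != 0) {
        return {};
    }
    return std::string(policyName(policy)) + " prio=" +
           std::to_string(param.sched_priority);
}
```

If a thread that is supposed to be low priority reports the same SCHED_RR priority as your control loop, or a high-priority thread never sleeps or blocks, that thread is a good candidate for the one starving the mutex holder.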

There could also be a race condition that only manifests itself when you switch to "realtime mode": the different timing of events in realtime mode may happen to trigger some kind of deadlock.

If you have places where you acquire multiple levels of locks, or lock recursively, those should be the first places you suspect.
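
For the multiple-lock case, the classic failure is two threads taking the same two mutexes in opposite order. Here is a small sketch of one way to rule that out, using std::lock from the standard library; the Axis struct and the transfer function are invented for illustration and have nothing to do with your actual code.

```cpp
#include <mutex>

// Hypothetical shared state, each instance guarded by its own mutex.
struct Axis {
    std::mutex lock;
    int position = 0;
};

// Moves `delta` units from one axis to the other while holding both locks.
// std::lock acquires both mutexes using a deadlock-avoidance algorithm, so
// one thread calling transfer(a, b) and another calling transfer(b, a)
// cannot deadlock on each other.
inline void transfer(Axis& from, Axis& to, int delta) {
    if (&from == &to) {
        return;  // locking the same std::mutex twice would be undefined
    }
    std::lock(from.lock, to.lock);
    // Adopt the already-held locks so they are released on scope exit.
    std::lock_guard<std::mutex> guardFrom(from.lock, std::adopt_lock);
    std::lock_guard<std::mutex> guardTo(to.lock, std::adopt_lock);
    from.position -= delta;
    to.position += delta;
}
```

An alternative with the same effect is to define one global order for all mutexes and always acquire them in that order. Either way, any place in your code that takes two locks by hand in a fixed sequence is worth comparing against every other such place.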

If you really have no clue where the problem is, try the binary search approach for bracketing the problem. Recursively cut out half of the functionality until you narrow it down to the actual problem. You might have to mock some subsystems that are temporarily cut out.

You can apply this binary search technique to your mutex acquisition timeouts to find which one is the culprit.
