Debugging strange error that depends on the selected scheduler

2023-03-08 00:58

I am experiencing strange behavior in a piece of software I am working on. It is a realtime machine controller, written in C++, running on Linux, and it makes extensive use of multithreading.

When I run the program without asking it to be realtime, everything works as I expect. But when I ask it to switch to its realtime mode, there is a clearly reproducible bug that makes the application crash. I guess it must be some kind of deadlock, because a mutex runs into a timeout and ultimately triggers an assertion.

My question is how to hunt this one down. Looking at the backtrace from the core dump is not very helpful, because the cause of the problem lies somewhere in the past.

The following code does the switching between 'normal' and 'realtime' behaviour:

In main.cpp (simplified, return codes are checked via assertions):

if (startAsRealtime) {
    // On Linux this applies to the thread whose ID equals getpid(), i.e. the main thread.
    struct sched_param sp;
    memset(&sp, 0, sizeof(sp));
    sp.sched_priority = 99;
    sched_setscheduler(getpid(), SCHED_RR, &sp);
}

In every thread (simplified, return codes are checked via assertions):

if (startAsRealtime) {
    sched_param param;
    // Use the attributes set here instead of inheriting the creator's scheduling.
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_getschedparam(&attr, &param);
    param.sched_priority = priority;
    pthread_attr_setschedpolicy(&attr, SCHED_RR);
    pthread_attr_setschedparam(&attr, &param);
}

Thanks in advance

If you're using glibc as your C library, you could use the answer to the question "Is it possible to list mutexs which a thread holds" to find the thread that is holding the mutex which is timing out. That should start to narrow things down: you can then inspect that thread and find out why it's not giving up the mutex.
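As a rough sketch of that idea in code: glibc's NPTL implementation stores the kernel thread ID of the current owner inside the mutex, so you can log it just before the timeout assertion fires. Note that `__data.__owner` is an internal glibc field, not a documented API, and the helper names `currentTid` and `mutexOwnerTid` are made up for this example. Treat it as a debugging aid only.

```cpp
#include <cassert>
#include <pthread.h>
#include <sys/syscall.h>
#include <unistd.h>

// Kernel thread ID of the calling thread (the number gdb shows as "LWP").
inline pid_t currentTid() {
    return static_cast<pid_t>(syscall(SYS_gettid));
}

// ASSUMPTION: glibc/NPTL on Linux. __data.__owner is an internal field that
// holds the TID of the locking thread and is 0 while the mutex is unlocked.
// The read is unsynchronized, so the value is only a hint for debugging.
inline pid_t mutexOwnerTid(const pthread_mutex_t& m) {
    return static_cast<pid_t>(m.__data.__owner);
}
```

If you print `mutexOwnerTid(theMutex)` in the timeout path, you can match the number against the LWP IDs that `info threads` shows in gdb when you load the core.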

One of your realtime threads might be spinning in a loop (not yielding), thus starving other threads and resulting in a mutex timeout.
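To check for this, it can help to log the policy and priority each thread actually ends up with. Under SCHED_RR, time slicing only happens between threads of the same priority, so a runnable higher-priority thread that never blocks keeps lower-priority threads (including one holding the mutex) off that CPU. Below is a minimal sketch; `policyName` and `describeScheduling` are hypothetical helper names, not part of any library.

```cpp
#include <cassert>
#include <pthread.h>
#include <sched.h>
#include <string>

// Map a scheduling policy constant to a readable name.
inline const char* policyName(int policy) {
    switch (policy) {
        case SCHED_OTHER: return "SCHED_OTHER";
        case SCHED_FIFO:  return "SCHED_FIFO";
        case SCHED_RR:    return "SCHED_RR";
        default:          return "unknown";
    }
}

// Return the effective policy and priority of a thread, e.g. for a log line.
inline std::string describeScheduling(pthread_t thread) {
    int policy = 0;
    sched_param param{};
    int rc = pthread_getschedparam(thread, &policy, &param);
    assert(rc == 0);
    (void)rc;  // keep release builds (NDEBUG) free of unused-variable warnings
    return std::string(policyName(policy)) + " prio=" +
           std::to_string(param.sched_priority);
}
```

Calling `describeScheduling(pthread_self())` at the top of every thread function would show whether the attributes from the question really took effect and which thread outranks the one holding the mutex.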

There could also be a race condition that only manifests itself when you switch to "realtime mode", where the changed timing of events happens to trigger some kind of deadlock.

If you have places where you acquire multiple levels of locks, or lock recursively, those should be the first places you suspect.
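For the multi-lock case, one common way to rule out lock-order deadlocks is to take all the locks in a single step. Here is a small sketch using C++17's `std::scoped_lock`, assuming your code uses `std::mutex` (with raw pthread mutexes you would have to enforce a fixed order by hand). The `Axis` type and `moveBoth` function are invented for illustration.

```cpp
#include <mutex>

// Hypothetical shared state guarded by its own mutex.
struct Axis {
    std::mutex lock;
    int position = 0;
};

// Locks both mutexes with std::scoped_lock, which uses a deadlock-avoidance
// algorithm, so callers may pass the two axes in either order.
// a and b must be distinct objects.
inline void moveBoth(Axis& a, Axis& b, int delta) {
    std::scoped_lock guard(a.lock, b.lock);
    a.position += delta;
    b.position -= delta;
}
```

With this, one thread calling `moveBoth(x, y, 1)` while another calls `moveBoth(y, x, 1)` cannot end up with each holding one lock and waiting for the other.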

If you really have no clue where the problem is, try the binary search approach for bracketing the problem. Recursively cut out half of the functionality until you narrow it down to the actual problem. You might have to mock some subsystems that are temporarily cut out.

You can apply this binary search technique to your mutex acquisition timeouts to find which one is the culprit.
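One way to make that search cheaper is to tag every timed acquisition with its call site, so the log tells you directly which one expired. The sketch below assumes `std::timed_mutex`; your code may well use `pthread_mutex_timedlock` instead, in which case the same idea applies. `lockOrReport` is a hypothetical helper name.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <mutex>

// Try to take the mutex within the timeout. On failure, report which call
// site gave up, then let the caller decide what to do (e.g. assert).
// On success the caller owns the mutex and must unlock it.
inline bool lockOrReport(std::timed_mutex& m,
                         std::chrono::milliseconds timeout,
                         const char* site) {
    if (m.try_lock_for(timeout)) {
        return true;
    }
    std::fprintf(stderr, "mutex timeout at %s\n", site);
    return false;
}
```

A call might look like `assert(lockOrReport(stateMutex, std::chrono::milliseconds(100), "controller loop"));`, where `stateMutex` and the site string are placeholders for your own names.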
