开发者

Debugging segmentation fault in a multi-threaded (using clone) program

I wrote a code to create some threads and whenever one of the threads finish a new thread is created to replace it. As I was not able to create very large number of threads (>450) using pthreads, I used clone system call instead. (Please note that I am aware of the implication of having such a huge number of threads, but this program is meant to only stress the system).

As clone() requires the stack space for the child thread to be specified as parameter, I malloc the required chunk of stack space for each thread and free it up when the thread finishes. When a thread finishes I send a signal to the parent to notify it of the same.

The code is given below:

#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <unistd.h>
#include <errno.h>

#define NUM_THREADS 5

unsigned long long total_count=0;
int num_threads = NUM_THREADS;
static int thread_pids[NUM_THREADS];
static void *thread_stacks[NUM_THREADS];
int ppid;

int worker() {
 int i;
 union sigval s={0};
 for(i=0;i!=99999999;i++);
 if(sigqueue(ppid, SIGUSR1, s)!=0)
  fprintf(stderr, "ERROR sigqueue");
 fprintf(stderr, "Child [%d] done\n", getpid());
 return 0;
}

void sigint_handler(int signal) {
 char fname[35]="";
 FILE *fp;
 int ch;
 if(signal == SIGINT) {
  fprintf(stderr, "Caught SIGINT\n");
  sprintf(fname, "/proc/%d/status", getpid());
  fp = fopen(fname,"r");
  while((ch=fgetc(fp))!=EOF)
   fprintf(stderr, "%c", (char)ch);
  fclose(fp);
  fprintf(stderr, "No. of threads created so far = %llu\n", total_count);
  exit(0);
 } else
  fprintf(stderr, "Unhandled signal (%d) received\n", signal);
}


int main(int argc, char *argv[]) {
 int rc, i; long t;
 void *chld_stack, *chld_stack2;
 siginfo_t siginfo;
 sigset_t sigset, oldsigset;

 if(argc>1) {
  num_threads = atoi(argv[1]);
  if(num_threads<1) {
   fprintf(stderr, "Number of threads must be >0\n");
   return -1;
  }
 }
 signal(SIGINT, sigint_handler);

 /* Block SIGUSR1 */
 sigemptyset(&sigset);
 sigaddset(&sigset, SIGUSR1); 
 if(sigprocmask(SIG_BLOCK, &sigset, &oldsigset)==-1)
  fprintf(stderr, "ERROR: cannot block SIGUSR1 \"%s\"\n", strerror(errno));

 printf("Number of threads = %d\n", num_threads);
 ppid = getpid();
 for(t=0,i=0;t<num_threads;t++,i++) {
  chld_stack = (void *) malloc(148*512);
  chld_stack2 = ((char *)chld_stack + 148*512 - 1);
  if(chld_stack == NULL) {
   fprintf(stderr, "ERROR[%ld]: malloc for stack-space failed\n", t);
   break;
  }
  rc = clone(worker, chld_stack2, CLONE_VM|CLONE_FS|CLONE_FILES, NULL);
  if(rc == -1) {
   fprintf(stderr, "ERROR[%ld]: return code from pthread_create() is %d\n", t, errno);
   break;
  }
  thread_pids[i]=rc;
  thread_stacks[i]=chld_stack;
  fprintf(stderr, " [index:%d] = [pid:%d] ; [stack:0x%p]\n", i, thread_pids[i], thread_stacks[i]);
  total_count++;
 }
 sigemptyset(&sigset);
 sigaddset(&sigset, SIGUSR1); 
 while(1) {
  fprintf(stderr, "Waiting for signal from childs\n");
  if(sigwaitinfo(&sigset, &siginfo) == -1)
   fprintf(stderr, "- ERROR returned by sigwaitinfo : \"%s\"\n", strerror(errno));
  fprintf(stderr, "Got some signal from pid:%d\n", siginfo.si_pid);

  /* A child finished, free the stack area allocated for it */ 
  for(i=0;i<NUM_THREADS;i++) {
   fprintf(stderr, " [index:%d] = [pid:%d] ; [stack:%p]\n", i, thread_pids[i], thread_stacks[i]);
   if(thread_pids[i]==siginfo.si_pid) {
    free(thread_stacks[i]);
    thread_stacks[i]=NULL;
    break;
   }
  }
  fprintf(stderr, "Search for child ended with i=%d\n",i);
  if(i==NUM_THREADS) 
   continue;
  /* Create a new thread in its place */
  chld_stack = (void *) malloc(148*512);
  chld_stack2 = ((char *)chld_stack + 148*512 - 1);
  if(chld_stack == NULL) {
   fprintf(stderr, "ERROR[%ld]: malloc for stack-space failed\n", t);
   break;
  }
  rc = clone(worker, chld_stack2, CLONE_VM|CLONE_FS|CLONE_FILES, NULL);
  if(rc == -1) {
   fprintf(stderr, "ERROR[%ld]: return code from clone() is %d\n", t, errno);
   break;
  }
  thread_pids[i]=rc;
  thread_stacks[i]=chld_stack;
  total_count++;
 }
 fprintf(stderr, "Broke out of infinite loop. [total_count=%llu] [i=%d]\n",total_count, i);
 return 0;
}

I have used couple of arrays to keep track of the child processes' pid and the stack area base address (for freeing it).

When I run this program it terminates after sometime. Running with gdb tells me that one of the thread gets a SIGSEGV (segmentation fault). But it doesn't gives me any location, the output is similar to the following:

Program received signal SIGSEGV, Segmentation fault.
[Switching to LWP 15864]
0x00000000 in ?? ()

I tri开发者_StackOverflow中文版ed running it under valgrind with the following commandline:

valgrind --tool=memcheck --leak-check=yes --show-reachable=yes -v --num-callers=20 --track-fds=yes ./a.out

But it keeps running without any issues under valgrind.

I am puzzled as to how to debug this program. I felt that this might be some stack overflow or something but increasing the stack size (upto 74KB) didn't solved the problem.

My only query is why and where is the segmentation fault or how to debug this program.


Found the actual issue.
When the worker thread signals the parent process using sigqueue(), the parent sometimes gets the control immediately and frees up the stack before the child executes the return statement. When the same child thread uses return statement, it causes segmentation fault as the stack got corrupted.
To solve this I replaced

exit(0)

instead of

return 0;


I think i found the answer

Step 1

Replace this:

static int thread_pids[NUM_THREADS];
static void *thread_stacks[NUM_THREADS];

By this:

static int *thread_pids;
static void **thread_stacks;

Step 2

Add this in the main function (after checking arguments):

thread_pids = malloc(sizeof(int) * num_threads);
thread_stacks = malloc(sizeof(void *) * num_threads);

Step 3

Replace this:

chld_stack2 = ((char *)chld_stack + 148*512 - 1);

By this:

chld_stack2 = ((char *)chld_stack + 148*512);

In both places you use it.

I dont know if its really your problem, but after testing it i didnt get any segmentation fault. Btw i did only get segmentation faults when using more than 5 threads.

Hope i helped!

edit: tested with 1000 threads and runs perfectly

edit2: Explanation why the static allocation of thread_pids and thread_stacks causes an error.

The best way to do this is with an example.

Assume num_threads = 10;

The problem occurs in the following code:

for(t=0,i=0;t<num_threads;t++,i++) {
...

thread_pids[i]=rc; 
thread_stacks[i]=chld_stack;

...
}

Here you try to access memory which does not belong to you (0 <= i <= 9, but both arrays have a size of 5). That can cause either segmentation fault or data corruption. Data corruption may happen if both arrays are allocated one after the other, resulting in writing to the other array. Segmentation can happen if you write in memory you dont have allocated (statically or dynamically).

You may be lucky and have no errors at all, but the code is surely not safe.

About the non-aligned pointer: I think i dont have to explain more than in my comment.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜