OpenMP threads appear to execute serially

I have an application that needs to evaluate the reverse Polish notation (RPN) form of a mathematical expression many times in parallel. My problem is that I'm not seeing any performance gain from OpenMP. (I'm using VS2008 with the /openmp compiler option set.)

My main loop looks like this:

int nMaxThreads = std::min(omp_get_max_threads(), s_MaxNumOpenMPThreads);
int nThreadID;
omp_set_num_threads(nMaxThreads);

#pragma omp parallel for schedule(static) private(nThreadID)
for (i=0; i<nBulkSize; ++i)
{
  nThreadID = omp_get_thread_num();
  printf("Thread %d Idx %d start",nThreadID, i);
  results[i] = EvalRPNInParallel(i, nThreadID);
  printf(" -- %d Idx %d end\n",nThreadID, i);
}

The printfs are there solely for debugging, to see whether anything runs in parallel (in which case the output of the 4 threads should be interleaved). From the debug output I can see that multiple threads are indeed being spawned. Each thread gets its own chunk of the loop, but the threads do not appear to execute in parallel: thread 0 calculates its chunk, then thread 1 calculates its chunk, and so on. The execution time is exactly as if OpenMP weren't active at all. EvalRPNInParallel is a member function that does the RPN calculation. I do not use any locks, mutexes, or OpenMP barriers inside this function.

double Foo::EvalRPNInParallel(int nOffset, int nThreadID) const
{
  double *Stack = &m_vStackBuffer[nThreadID * (m_vStackBuffer.size() / 4)];
  for (const SToken *pTok = m_pRPN;  ; ++pTok)
  {
    switch (pTok->Cmd)
    {
      case  cmADD:  --sidx; Stack[sidx] += Stack[1+sidx]; continue;
      case  cmSUB:  --sidx; Stack[sidx] -= Stack[1+sidx]; continue;
      case  cmMUL:  --sidx; Stack[sidx] *= Stack[1+sidx]; continue;
      case  cmVAR:  Stack[++sidx] = *(pTok->Val.ptr + nOffset);  continue;
      // ...
      // ...
      // ...
      case  cmEND:  return Stack[m_nFinalResultIdx];  
    }
  }
}

The strange thing is that if I deliberately slow down EvalRPNInParallel with an unnecessary for loop, I do see EvalRPNInParallel executing in parallel, as I would expect. Does anyone have an idea why I'm not seeing any gain from using OpenMP here?

[update] I also tried the following OpenMP constructs; neither showed any parallel execution:

int nIterationsPerThread = nBulkSize/nMaxThreads;
#pragma omp parallel for private(nThreadID, j, k) shared(nMaxThreads, nIterationsPerThread) ordered
for (i=0; i<nMaxThreads; ++i)
{
  for (j=0; j<nIterationsPerThread; ++j)
  {
    nThreadID = omp_get_thread_num();
    k = i*nIterationsPerThread + j;
    printf("Thread %d Idx %d start",nThreadID, k);
    results[k] = ParseCmdCodeBulk(k, nThreadID);
    printf(" -- %d Idx %d end\n",nThreadID, k);
  }
}

Using sections:

#pragma omp parallel shared(nBulkSize) private(nThreadID, i)
{
  #pragma omp sections nowait
  {
    #pragma omp section
    for (i=0; i<(nBulkSize/2); ++i)
    {
      nThreadID = omp_get_thread_num();
      printf("Thread %d Idx %d start",nThreadID, i);
      results[i] = ParseCmdCodeBulk(i, nThreadID);
      printf(" -- %d Idx %d end\n",nThreadID, i);
    } // end of section

    #pragma omp section
    for (i=nBulkSize/2; i<nBulkSize; ++i)
    {
      nThreadID = omp_get_thread_num();
      printf("Thread %d Idx %d start",nThreadID, i);
      results[i] = ParseCmdCodeBulk(i, nThreadID);
      printf(" -- %d Idx %d end\n",nThreadID, i);
    } // end of section
  }
} // end of sections


Classic Heisenberg: observing a thread affects its behavior. The printf() function is slow, surely much slower than your expression evaluator, and it has to acquire a lock to prevent the characters in the string from getting intermingled with console output requested by other threads. Because the threads queue up on that lock, the odds that more than one of them makes it into the EvalRPNInParallel function concurrently are just not very good. Your diagnostics can't show you that, by the way.
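One way to watch the threads without printf() getting in the way is to record what happened into a preallocated buffer and print it only after the loop has finished. The following is a minimal sketch, not the asker's code: CheapEval is a hypothetical stand-in for EvalRPNInParallel, TraceEntry, RunTraced and PrintTrace are made-up names, and it assumes a C++11 compiler (newer than the VS2008 mentioned in the question) for std::chrono.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical stand-in for EvalRPNInParallel: cheap and lock-free.
static double CheapEval(int i) { return i * 0.5 + 1.0; }

// One record per loop iteration, written only by the thread that runs it.
struct TraceEntry
{
  int    thread;   // OpenMP thread number (0 when built without OpenMP)
  double startUs;  // microseconds since the loop started
  double endUs;
};

std::vector<TraceEntry> RunTraced(int nBulkSize, std::vector<double>& results)
{
  assert(nBulkSize >= 0);
  results.assign(nBulkSize, 0.0);
  std::vector<TraceEntry> trace(nBulkSize);

  typedef std::chrono::steady_clock Clock;
  typedef std::chrono::duration<double, std::micro> Micros;
  const Clock::time_point t0 = Clock::now();

  #pragma omp parallel for schedule(static)
  for (int i = 0; i < nBulkSize; ++i)
  {
#ifdef _OPENMP
    const int tid = omp_get_thread_num();
#else
    const int tid = 0;
#endif
    const Clock::time_point a = Clock::now();
    results[i] = CheapEval(i);
    const Clock::time_point b = Clock::now();

    // Each iteration writes to its own slot, so no lock is needed.
    trace[i].thread  = tid;
    trace[i].startUs = Micros(a - t0).count();
    trace[i].endUs   = Micros(b - t0).count();
  }
  return trace;
}

// Call this only after the parallel loop has finished.
void PrintTrace(const std::vector<TraceEntry>& trace)
{
  for (int i = 0; i < static_cast<int>(trace.size()); ++i)
    std::printf("Idx %d thread %d start %.1f us end %.1f us\n",
                i, trace[i].thread, trace[i].startUs, trace[i].endUs);
}
```

If the start/end intervals of different threads overlap in the printed trace, the evaluations did run concurrently. Reading the clock still costs something per iteration, so the trace is a diagnostic aid rather than a benchmark.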

And the usual advice applies: only optimize your code after you have measured it three times to find out what the bottleneck might be. I'd be surprised if a single evaluation takes more than a couple of microseconds. You cannot win in that case, because starting the thread already takes longer. The same measurement you make to find the bottleneck will also tell you whether threading gets you ahead.
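A measurement along those lines could look like the sketch below, which times the same loop with and without threads. It is an illustration under assumptions, not the asker's code: CheapWork is a hypothetical stand-in for one RPN evaluation, TimedRun is a made-up helper, and std::chrono again assumes a C++11 compiler.

```cpp
#include <cassert>
#include <chrono>
#include <vector>

// Hypothetical stand-in for one RPN evaluation.
static double CheapWork(int i) { return i * 0.5 + 1.0; }

// Fills results and returns the elapsed wall-clock time in seconds.
double TimedRun(bool useThreads, std::vector<double>& results)
{
  assert(!results.empty());  // timing an empty loop tells you nothing
  const int n = static_cast<int>(results.size());
  const auto t0 = std::chrono::steady_clock::now();

  // The if() clause runs the loop on a single thread when useThreads is false.
  #pragma omp parallel for schedule(static) if(useThreads)
  for (int i = 0; i < n; ++i)
    results[i] = CheapWork(i);

  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(t1 - t0).count();
}
```

Call TimedRun(false, results) and TimedRun(true, results) several times each with a realistic nBulkSize and compare the numbers. The first parallel call typically includes the cost of creating the thread team, so it is worth looking at later runs as well.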
