OpenMP threads appear to execute serially

2023-02-21 12:35 Q&A author:

I have an application that essentially evaluates the reverse Polish notation of a mathematical expression in parallel, many times. My problem is that I'm not seeing any performance gain when using OpenMP. (I'm using VS2008 with the /openmp compiler option set.)

My main loop looks like this:

int nMaxThreads = std::min(omp_get_max_threads(), s_MaxNumOpenMPThreads);
int nThreadID;
omp_set_num_threads(nMaxThreads);

#pragma omp parallel for schedule(static) private(nThreadID)
for (i=0; i<nBulkSize; ++i)
{
  nThreadID = omp_get_thread_num();
  printf("Thread %d Idx %d start",nThreadID, i);
  results[i] = EvalRPNInParallel(i, nThreadID);
  printf(" -- %d Idx %d end\n",nThreadID, i);
}

The printfs are there solely for debugging, to see whether any parallel action is taking place (which should interleave the output of the 4 threads). From the debug output I can see that multiple threads are indeed being spawned. Each thread gets a certain chunk of the loop, but the threads do not appear to execute in parallel: thread 0 calculates its chunk of the loop, then thread 1 calculates its chunk, and so on. No parallel execution whatsoever. The execution time is exactly as if OpenMP weren't even active. EvalRPNInParallel is a member function that does the RPN calculation. I do not use any locks, mutexes, or OpenMP barriers inside this function.

double Foo::EvalRPNInParallel(int nOffset, int nThreadID) const
{
  double *Stack = &m_vStackBuffer[nThreadID * (m_vStackBuffer.size() / 4)];
  for (const SToken *pTok = m_pRPN;  ; ++pTok)
  {
    switch (pTok->Cmd)
    {
      case  cmADD:  --sidx; Stack[sidx] += Stack[1+sidx]; continue;
      case  cmSUB:  --sidx; Stack[sidx] -= Stack[1+sidx]; continue;
      case  cmMUL:  --sidx; Stack[sidx] *= Stack[1+sidx]; continue;
      case  cmVAR:  Stack[++sidx] = *(pTok->Val.ptr + nOffset);  continue;
      // ...
      // ...
      // ...
      case  cmEND:  return Stack[m_nFinalResultIdx];  
    }
  }
}

The strange thing is that if I deliberately slow down EvalRPNInParallel with an unnecessary for loop, I do see parallel execution of EvalRPNInParallel, as I would expect. Does anyone have an idea why I'm not seeing any gain from using OpenMP here?

[update] I also tried the following OpenMP constructs; neither showed any parallel execution:

int nIterationsPerThread = nBulkSize/nMaxThreads;
#pragma omp parallel for private(nThreadID, j, k) shared(nMaxThreads, nIterationsPerThread) ordered
for (i=0; i<nMaxThreads; ++i)
{
  for (j=0; j<nIterationsPerThread; ++j)
  {
    nThreadID = omp_get_thread_num();
    k = i*nIterationsPerThread + j;
    printf("Thread %d Idx %d start",nThreadID, k);
    results[k] = ParseCmdCodeBulk(k, nThreadID);
    printf(" -- %d Idx %d end\n",nThreadID, k);
  }
}

Using sections:

#pragma omp parallel shared(nBulkSize) private(nThreadID, i)
{
  #pragma omp sections nowait
  {
    #pragma omp section
    for (i=0; i<(nBulkSize/2); ++i)
    {
      nThreadID = omp_get_thread_num();
      printf("Thread %d Idx %d start",nThreadID, i);
      results[i] = ParseCmdCodeBulk(i, nThreadID);
      printf(" -- %d Idx %d end\n",nThreadID, i);
    } // end of section

    #pragma omp section
    for (i=nBulkSize/2; i<nBulkSize; ++i)
    {
      nThreadID = omp_get_thread_num();
      printf("Thread %d Idx %d start",nThreadID, i);
      results[i] = ParseCmdCodeBulk(i, nThreadID);
      printf(" -- %d Idx %d end\n",nThreadID, i);
    } // end of section
  }
} // end of sections

Classic Heisenberg: observing a thread affects its behavior. The printf() function is slow, surely much slower than your expression evaluator. It also has to acquire a lock to prevent the characters in the string from getting intermingled with console output requested by other threads. The odds that more than one thread makes it into the EvalRPNInParallel function concurrently are therefore just not very good, which, by the way, you can't observe with your diagnostics.
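One way to see how the iterations are distributed without taking the console lock inside the loop is to record the thread ID per index in a buffer and inspect it afterwards. The following is only a minimal sketch: EvalStandIn and RecordThreadPerIndex are hypothetical names, EvalStandIn stands in for the real evaluator, and the #ifdef guards are an assumption that lets the code also build without OpenMP enabled.

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical stand-in for the real expression evaluator:
// a cheap computation with no shared state.
static double EvalStandIn(int i) { return i * 0.5 + 1.0; }

// Fills results and returns, for each loop index, the ID of the thread
// that processed it. Nothing is printed inside the loop, so the threads
// never queue up on the console lock.
std::vector<int> RecordThreadPerIndex(int nBulkSize, std::vector<double>& results)
{
  assert(nBulkSize >= 0);
  results.assign(nBulkSize, 0.0);
  std::vector<int> owner(nBulkSize, -1);

  #pragma omp parallel for schedule(static)
  for (int i = 0; i < nBulkSize; ++i)
  {
    int tid = 0;              // stays 0 when built without OpenMP
#ifdef _OPENMP
    tid = omp_get_thread_num();
#endif
    results[i] = EvalStandIn(i);
    owner[i] = tid;           // each iteration writes only its own slot
  }
  return owner;
}
```

Printing the returned vector after the loop shows which thread handled which chunk. It does not by itself prove that the chunks overlapped in time; for that, the wall-clock time of the whole loop is the more reliable indicator.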

And the usual advice applies: only optimize your code after you have measured it three times to find out what the bottleneck might be. I'd be surprised if evaluating one expression takes more than a couple of microseconds. You cannot win in that case, since starting the thread already takes longer. The same measurement you make to find the bottleneck will also tell you whether threading gets you ahead.
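A rough way to make that measurement is to time the same loop once serially and once with OpenMP and compare. This is a sketch under stated assumptions, not the asker's code: CheapWork and TimeLoop are hypothetical names, CheapWork is a placeholder for the evaluator, and std::chrono requires a C++11 compiler (newer than VS2008). The if clause on the parallel construct switches threading on or off at run time.

```cpp
#include <cassert>
#include <chrono>
#include <vector>

// Hypothetical placeholder for the per-index work being measured.
static double CheapWork(int i) { return i * 0.5 + 1.0; }

// Runs the loop over results.size() indices, with a thread team when
// bParallel is true and on a single thread otherwise, and returns the
// elapsed wall-clock time in seconds.
double TimeLoop(std::vector<double>& results, bool bParallel)
{
  assert(!results.empty());
  const int n = static_cast<int>(results.size());

  const std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();

  // With if(false) the region is executed by one thread only.
  #pragma omp parallel for schedule(static) if(bParallel)
  for (int i = 0; i < n; ++i)
    results[i] = CheapWork(i);

  const std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(t1 - t0).count();
}
```

Calling TimeLoop(buf, false) and TimeLoop(buf, true) several times each and comparing the numbers gives a first indication. If the parallel run is not faster for a realistic bulk size, the per-iteration work is probably too small to amortize the threading overhead.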
