How can I optimize a calculation-intensive C++ program with a known bottleneck?

I am developing some scientific software for my university. It is written in C++ on Windows (VS2008). The algorithm must calculate some values for a large number of matrix pairs; that is, at its core sits a loop iterating over the matrices, collecting some data, e.g.:

sumA = sumAsq = sumB = sumBsq = diffsum = diffsumsq = result = 0;
for (int y=0; y < height; ++y)
{
    for (int x=0; x < width; ++x)
    { 
        valA = matrixA(x,y);
        valB = matrixB(x,y);
        sumA+=valA;
        sumAsq+=valA*valA;
        sumB+=valB;
        sumBsq+=valB*valB;
        diffsum+=valA-valB;
        diffsumsq+=(valA-valB)*(valA-valB);
    }
}
result = sumA + sumB / sumAsq + sumBsq * diffsum * diffsumsq;

This routine is executed millions of times for different matrixA, matrixB pairs. My problem is that this program is extremely slow, compiled in Release mode with all optimizations activated. Using the "pause-when-busy-and-inspect" debugger technique, I established that the program sits inside this loop virtually every time, even though, as you might expect, this routine is surrounded by a whole bunch of conditions and control branches. What puzzles me the most is that during its execution on a dual-processor Xeon-based system, the program utilizes one of the 4 cores (no surprise, it is single-threaded for now) but only up to about 25% of its limit, and with relatively large oscillations, where I would expect steady, 100% load until the program terminates.

The current version is actually a rewrite, created with optimizing the performance in mind. I was devastated when I found out it's actually slower than the original. The previous version used Boost matrices, which I replaced with OpenCV matrices after establishing them to be over 10 times faster by comparing the execution time of multiplying two 1000x100 matrices. I access the matrix by manually dereferencing a raw pointer to its data, which I hoped would gain me some performance. I made the calculation routine a multi-line #define macro to enforce its inlining and to avoid function calls and returns. I improved the math behind the calculations so that the final value is calculated in a single pass through the matrices (the old version required two passes). I expected to get huge gains, and yet the opposite is true. I'm nowhere near my old program's efficiency, not to mention commercial software for the particular application.

I was wondering if it perhaps had something to do with the matrix data being 8-bit chars. I once saw that access to floats was actually slower than access to doubles in my old program; perhaps chars are even slower, since the processor retrieves data in 32-bit chunks (this Xeon probably grabs even 64 bits). I also considered turning the matrices into vectors to avoid a loop-inside-loop construct, as well as some form of vectorization, like, for example, calculating the data for 4 (fewer? more?) consecutive matrix cells on a single loop iteration. Any other ideas, please?

EDIT: actual code in the new, OpenCV-based version:

const char *Aptr, *Bptr;
double sumA = 0, sumB = 0, sumAsq = 0, sumBsq = 0, diffsum = 0, diffsumsq = 0;
char Aval, Bval;

for (int y=0; y < height; ++y)
{
    Aptr = (char*)(AMatrix.imageData + AMatrix.widthStep * y);
    Bptr = (char*)(BMatrix.imageData + BMatrix.widthStep * y);
    for (int x=0; x < width; ++x)
    {
        Aval = Aptr[x];
        Bval = Bptr[x];

        sumA+=Aval;
        sumB+=Bval;
        sumAsq+=Aval*Aval;
        sumBsq+=Bval*Bval;
        diffsum+=Aval-Bval;
        diffsumsq+=(Aval-Bval)*(Aval-Bval);
    }
}


Various thoughts:

  • You say that you're only managing to achieve a CPU load of about 25%. I can think of two reasons for this:
    1. You're swapping. What is the size of your matrices? Do they fit entirely in physical memory? Look at your application's memory usage and working set size.
    2. The rest of your application's code is blocking on I/O. Does the code that surrounds your core routine do any I/O? It could be blocking there for large stretches of time, but of course you're not seeing that using the "pause-when-busy-and-inspect" technique because whenever the process unblocks again, it returns straight into your compute-intensive core routine.
  • Take a look at the assembly code for your core routine. Does it look reasonable?
  • Do you actually need to compute diffsum within the loop? It looks as if you could do diffsum=sumA-sumB once outside the loop (see the sketch after this list) -- but there may be numerical considerations that prevent you from doing this.
  • As renick has already commented, this looks like a prime target for SSE optimization. Again, you should make sure the compiler is generating reasonable assembly code (if you're using intrinsics and not writing the assembly yourself).
  • If you don't want to write SSE code yourself, at least make sure that your compiler's SSE flag is set. This will allow the compiler to use the SSE unit instead of the FPU for scalar floating-point operations, which by itself will improve performance, because the stack-based FPU on the x86 is notoriously ill-suited to compiler code generation.
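
For instance, a minimal sketch of the diffsum hoisting from the third bullet, using the names from the question's edited code:

for (int x = 0; x < width; ++x)
{
    Aval = Aptr[x];
    Bval = Bptr[x];
    sumA += Aval;   sumAsq += Aval * Aval;
    sumB += Bval;   sumBsq += Bval * Bval;
    diffsumsq += (Aval - Bval) * (Aval - Bval); // still per-element
}
// after both loops, algebraically:
diffsum = sumA - sumB;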


Your inner loop is calling functions! No matter how trivial they are, you pay a heavy penalty. You should try to linearize the matrix accesses (in essence, make them 1D) so that you can access them with plain pointer dereferencing:

vala = *matrixA++; 
valb = *matrixB++; 

and since you are doing simple additions and subtractions, look at SSE/SSE2 etc. depending on your target CPU capabilities and your arithmetic (integer, floating point etc.).
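
A sketch of the fully linearized loop (assuming each matrix's data is contiguous, i.e. widthStep == width; otherwise keep the per-row pointers; pA/pB are placeholder names for the raw data pointers):

const char *pA = Adata; // raw data pointers; names are placeholders
const char *pB = Bdata;
for (int i = 0; i < width * height; ++i)
{
    int a = *pA++;          // widen the chars once
    int b = *pB++;
    int d = a - b;
    sumA += a;     sumAsq += a * a;
    sumB += b;     sumBsq += b * b;
    diffsum += d;  diffsumsq += d * d;
}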

EDIT: MMX/SSE2 intrinsics are functions that map one-to-one onto CPU SIMD instructions. See these Microsoft pages to get started, and additionally I suggest looking at the Intel site for the IA-32/Intel 64 programmer's guides, or similar manuals from AMD.
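
To give a taste of the intrinsics style, here is a hedged SSE2 sketch that sums one row of bytes (it assumes unsigned 8-bit data and a width that is a multiple of 16; signed chars would need a bias or a widening step first):

#include <emmintrin.h> // SSE2 intrinsics, available in VS2008

unsigned long long rowSum(const unsigned char *row, int width)
{
    __m128i acc = _mm_setzero_si128();
    for (int x = 0; x < width; x += 16)
    {
        __m128i v = _mm_loadu_si128((const __m128i*)(row + x));
        // _mm_sad_epu8 against zero sums each 8-byte half of v into
        // the low bits of the corresponding 64-bit lane.
        acc = _mm_add_epi64(acc, _mm_sad_epu8(v, _mm_setzero_si128()));
    }
    unsigned long long halves[2];
    _mm_storeu_si128((__m128i*)halves, acc);
    return halves[0] + halves[1];
}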

I also highly recommend this book on Optimization for Intel Architectures. It will explain all the hidden capabilities of your CPU and compiler.


Can you check the assembler code this loop is generating? If you only get 25% processor use, this loop may be memory bound. There are about eight local variables there, and I imagine the compiler is not mapping all of them to registers, so many memory operations are being done in each iteration. One option would be to write that loop in assembler.

Why do you walk the matrix column by column? Matrices are stored in memory row after row, so if you access a whole column in the inner loop, you are probably triggering more loads across your memory hierarchy (caches and so on).


If you use the "pause" technique, it should tell you more than just that you're in that loop. It should tell you where in the loop.

Never guess when you can just find out. That said, here's my guess :-) You are doing all the summing in floating-point variables, but getting the original numbers as integer chars, right? Then you can expect the conversions from int to double to take some time, and if so, you will see your pauses happening in those instructions a good part of the time. So basically I'm wondering why you don't do it all in integer arithmetic.
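
A sketch of the all-integer variant, using the names from the question's edited code (64-bit accumulators guard against overflow on large matrices; the final formula can still be evaluated in double at the end):

long long sumA = 0, sumB = 0, sumAsq = 0, sumBsq = 0,
          diffsum = 0, diffsumsq = 0;
for (int y = 0; y < height; ++y)
{
    const char *Aptr = AMatrix.imageData + AMatrix.widthStep * y;
    const char *Bptr = BMatrix.imageData + BMatrix.widthStep * y;
    for (int x = 0; x < width; ++x)
    {
        int a = Aptr[x], b = Bptr[x], d = a - b; // widen once
        sumA += a;     sumAsq += a * a;
        sumB += b;     sumBsq += b * b;
        diffsum += d;  diffsumsq += d * d;
    }
}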

You say the utilization never goes over 25%. Could that be because it's only using one of the 4 cores?

You say the utilization often drops below 25%. That suggests the thread may be blocking on file I/O. If so, your pauses should catch it in the act and confirm it. In that case, you might be able to speed up the I/O by using larger blocks, or possibly by opening/closing files less often. Note that improvements to your inner loop will shrink the amount of time spent in that loop, but they will not shrink the time spent in I/O, so the percentage of time in I/O will increase (causing an apparent decrease in utilization) until you also shrink that.

Utilization is actually not a very useful number. It's only really an indicator of the CPU/IO split, and doesn't tell you at all if you are doing too much of either of those.

As @renick said, get rid of the address calculations. You should be able to step through this loop at the assembly-language level and see it doing nothing more than you would do if you put on your "guru" hat and wrote the assembly yourself.

In any case, vectorizing could be a big win.


If I were in your shoes, I would try to find out what exactly is causing the difference in performance between the old and the new code. Perhaps the Boost matrices use some sort of caching or lazy/eager evaluation.


You should also try multithreading the loop via a simple setup like OpenMP. The 25% CPU usage sounds like a quad core running a single worker thread.
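
A minimal sketch, assuming the matrix pairs are independent (processPair, pairs and results are hypothetical names; VS2008 needs the /openmp compiler switch):

#pragma omp parallel for
for (int i = 0; i < (int)pairs.size(); ++i)
{
    // each pair is processed independently, so iterations can run
    // on separate cores
    results[i] = processPair(pairs[i].first, pairs[i].second);
}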


You should try to get rid of your loops, and try vectorizing the operations instead. Using a library like Eigen your code would look something like this:

#include <Eigen/Dense>

Eigen::MatrixXd matrixA(height, width);
Eigen::MatrixXd matrixB(height, width);
double sumA = matrixA.sum();
double sumAsq = matrixA.array().square().sum();
double sumB = matrixB.sum();
double sumBsq = matrixB.array().square().sum();

Eigen::MatrixXd diff = matrixA - matrixB;
double diffSum = diff.sum();
double diffSumSq = diff.array().square().sum();

return sumA + sumB / sumAsq + sumBsq * diffSum * diffSumSq;
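
Since the actual data here is 8-bit, a hedged variant for sumA and sumAsq: Eigen::Map can wrap the existing OpenCV buffer without a copy, casting to a wider type before squaring (this assumes contiguous rows, i.e. widthStep == width):

#include <Eigen/Dense>

typedef Eigen::Matrix<unsigned char, Eigen::Dynamic, Eigen::Dynamic,
                      Eigen::RowMajor> MatrixXuc;

// Wrap the raw buffer; no copy is made.
Eigen::Map<const MatrixXuc> A(
    (const unsigned char*)AMatrix.imageData, height, width);
long long sumA   = A.cast<long long>().sum();
long long sumAsq = A.cast<long long>().array().square().sum();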


Could you allocate matrices with the same parameters once, outside of the loop, and reuse them? I think that ought to save you something.


"25% of its limit, and with relatively large oscillations, where I would expect steady, 100% load until the program terminates."

You mentioned that the function is surrounded by a whole bunch of conditions and control branches. I would guess this causes the CPU pipeline to be flushed instead of being utilized efficiently. Try to rewrite your software so that it doesn't require a lot of branching.

I would also recommend using one of the math libraries, like Eigen, ATLAS or GSL.


TLDR: first, optimize the matrix multiplication algorithm, then watch your number of temporaries, then optimize your matrix internal allocations.

Long answer:

I think the most important thing to address is the optimization of your matrix multiplication. With the intuitive algorithm, matrix multiplication is O(n^3), which adds up quickly even for small matrices.

To give you an example, naive 2x2 matrix multiplication takes 8 multiplication operations ("mo"); 3x3 takes 27 mo, and 4x4 takes 64 mo.

I'm not sure how it is implemented in your case, but if it's the intuitive algorithm (a triple for loop), changing it to multiplication using LU-decomposed matrices should increase your performance drastically.

This is because once you have the decomposed matrices you can optimize the multiplication algorithm heavily (no sense multiplying rows and columns for the zero elements).

Further, consider using cached values instead of repeating the operations for adding to diffsumsq:

old code:

diffsum+=valA-valB; // compute difference once
diffsumsq+=(valA-valB)*(valA-valB); // compute it two more times, then multiply

new code:

diff = valA-valB; // compute difference once
diffsum+= diff; // use computed difference
diffsumsq+=diff*diff; // just multiply

The second variant makes the difference computation three times cheaper: the subtraction is executed only once per element instead of three times.

You can further watch your number of temporary objects: when valA and valB are matrix objects rather than scalars, each binary operation creates a temporary (which means allocating another x*y matrix in memory and copying values). For example, the expression:

diffsumsq+=(valA-valB)*(valA-valB);

creates a temporary for the difference in the first parenthesis, then another one for the difference in the second, then another one for the product.

In my example above, you're better off writing

diff = valA;
diff -= valB;

instead of

diff = valA - valB;

as this way you avoid allocating the temporary that is assigned to diff (in the second variant).

Another thing to watch is your memory consumption: since this gets executed a lot, you could either pre-allocate memory or pool the used matrices and reuse memory instead of creating new objects.
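
A minimal reuse sketch (scratch, pairs and processPair are hypothetical names): the buffer is allocated once and handed to every iteration instead of being re-created.

std::vector<int> scratch(width * height); // allocated once, reused

for (size_t i = 0; i < pairs.size(); ++i)
{
    // processPair writes its intermediates into scratch instead of
    // allocating a fresh buffer each time
    processPair(pairs[i], scratch);
}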

I'm sure there are other things to watch for.

Edit: How can you multiply the matrices? Their dimensions should match: the number of columns in valA should be equal to the number of rows in valB (if I remember my matrix multiplication correctly).

One more thing:

I made the calculation routine a multi-line #define macro to enforce its inlining and to avoid function calls and returns.

You don't need macros to optimize C++ code. To avoid function calls and returns, use inline functions; macros come with their own problems.
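
For instance, a hedged sketch of the accumulation step as an inline function (names and accumulator types are illustrative):

inline void accumulate(int a, int b,
                       double &sumA, double &sumAsq,
                       double &sumB, double &sumBsq,
                       double &diffsum, double &diffsumsq)
{
    int d = a - b;                 // difference computed once
    sumA += a;     sumAsq += a * a;
    sumB += b;     sumBsq += b * b;
    diffsum += d;  diffsumsq += d * d;
}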
