开发者

fastest way to iterate matrix of known dimensions

I was wondering what the fastest way to iterate through a matrix is in c/c++.

The best method i've come up with so far is to map the matrix to a single dimension.

Then use pointer arithmetic, any other method that might be faster?

Dimensions are known at run time but not compile time, matrix is fully populated.

#include <iostream>
#include <time.h>
#define XMAX 500
#define YMAX 400
#define ZMAX 300

int main()
{
    srand(0);   
    register double sum = 0;
    register int i;
    register int j;
    register int k;

    double *arr_ptr;
    arr_ptr = new double[XMAX*YMAX*ZMAX];

    for (i=0; i<XMAX*YMAX*ZMAX; ++i)
    {
        *(arr_ptr+i) = rand()/double(RAND_MAX);
    }

    clock_t start, finish;
    start = clock();

    for (i=0; i<XMAX; ++i)
    {
        for (j=0; j<YMAX; ++j)
        {
            for (k=0; k<ZMAX; ++k)
            {
                sum += *(arr_ptr+i*YMAX*ZMAX+j*ZMAX+k);
            }
        }
   开发者_Python百科 }

    finish = clock();
    std::cout << "sum: " << sum << "\telapsed: " << finish - start << std::endl;
    std::cin.get();

    delete[] arr_ptr;
}


Why bother with the three nested for-loops? You could just do

for (i=0; i<XMAX*YMAX*ZMAX; ++i)
{
    sum += *(arr_ptr+i);
}


This is 650ms faster than your code for XMAX 500, YMAX 400 and ZMAX 100, run 100 times, according to ideone.com compiler.

double *p_current, *p_end;

p_current = arr_ptr;
p_end = (arr_ptr + XMAX*YMAX*ZMAX);
while(p_current != p_end) {
    sum += *p_current++;
}

See: old version, new version


Actually it doesn't matter because the compiler will optimize it anyway. So arr[i][j][k] and *(arr_ptr+i*YMAX*ZMAX+j*ZMAX+k) will be equally fast.


In your example the bounds are even constants, so normal three dimensional arrays would work with this, be it C or C++.

Then, C and C++ are really different languages on what is concerned dynamically allocated arrays with variable bounds, don't mix them up. For C++ use vector classes and such things. They are made for this and they should be efficient.

In C, since C99 there are VLA, variable length arrays. Contrary to urban myth they may be quite efficient, if you don't allocate them on the stack. Use malloc as for any big chunk of memory in C.

double (*arr_ptr)[XMAX][YMAX][ZMAX]
  = malloc(sizeof(*arr_ptr));

for (register size_t i=0; i<XMAX; ++i)
  for (register size_t j=0; j<YMAX; ++j)
    for (register size_t k=0; k<ZMAX; ++k)
       (*arr_ptr)[i][j][k] = rand()/double(RAND_MAX);

.

free(arr_ptr);

Modern processors have quite complex addressing schemes, so it might not be necessary to effectively do the complete index calculation. Your compiler usually knows better than you.

Then, to be efficient it might be much more important how you declare and handle your loop variables. Use the correct types for indexing, size_t is the correct unsigned type for that. int may easily overflow when you compute 3-D flattened indices and having a signed type here makes not much sense.

Then declare those variables as local as possible, makes things clearer for you and the compiler.

register is just a contract with the compiler that you never will take the address of such an index. Usually this doesn't improve things a lot. But may prevent you from doing inefficient things when you modify your code later.

And last but not least, if you really worry about efficiency, check what your compiler produces. E.g gcc has the option -S to produce the intermediate assembler. Read it instead of speculating about efficiency.


OpenCV uses pointer arithmetic:

double *ptr = arr_ptr;
for (i=0; i<XMAX*YMAX*ZMAX; ++i)
{
    sum += *ptr++;
}

I guess it might be a little faster somehow. Try it and show us the timings!


double *ptr = arr_ptr;
for (int i=XMAX*YMAX*ZMAX; i>0; --i)
{
    sum += *ptr++;
} 

Comparing the loop variable to zero, instead of to some constant may save one or two clock cycles for each iteration (for example, using JNZ instruction on intel CPUs)


The first thing that needs to be said is that stack allocated multidimensional arrays are stored in memory (in C and in C++) in row major order. Namely matrix[ 2 ][ 2 ] = { { 1, 2 }, { 3, 4 } } is going to be stored in memory just like you had actually declared an array[ 4 ] = { 1, 2, 3, 4 } and the matrix[][] syntax is just syntactic sugar for *( matrix + i * 2 + j).

So the fastest way to traverse the matrix depends on how you traverse it: in row major order or column major and how big the matrix is:

  • if the entire matrix can fit in the CPU cache than the traversal order doesn't matter;
  • if the matrix is bigger than the CPU cache than doing row major traversal cause less CPU cache misses.

The best way to know if you have a performance problem in matrix operations and what causes it is to profile your code.


For very large blocks of data, consider parallel operations. In this case, a sum can be computed with a gather operation -- the form of which will depend on the parallel framework you choose.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜