
Improve locality and decrease cache pollution in a medical image reconstruction implementation

I'm doing research at my university on an image reconstruction algorithm for medical use.

I've been stuck on something for about 3 weeks now; I need to improve the performance of the following code:

// For each line of response (LOR) assigned to this process, accumulate the
// forward projection, exploiting the X/Y/XY/Z symmetries of the scanner.
for (lor=lor0[mypid]; lor <= lor1[mypid]; lor++)
{
  LOR_X    = P.symmLOR[lor].x;
  LOR_Y    = P.symmLOR[lor].y;
  LOR_XY   = P.symmLOR[lor].xy;
  lor_z    = P.symmLOR[lor].z;
  LOR_Z_X  = P.symmLOR[lor_z].x;
  LOR_Z_Y  = P.symmLOR[lor_z].y;
  LOR_Z_XY = P.symmLOR[lor_z].xy;

  // CRS row pointers: the non-zeros of this LOR's row live in [s0, s1).
  s0 = P.a2r[lor];
  s1 = P.a2r[lor+1];

  for (s=s0; s < s1; s++)
  {
    pixel     = P.a2b[s];   // column (pixel) index of non-zero s
    v         = P.a2p[s];   // matrix value of non-zero s

    b[lor]    += v * x[pixel];

    p          = P.symm_Xpixel[pixel];
    b[LOR_X]  += v * x[p];

    p          = P.symm_Ypixel[pixel];
    b[LOR_Y]  += v * x[p];

    p          = P.symm_XYpixel[pixel];
    b[LOR_XY] += v * x[p];

    // do Z symmetry.
    pixel_z    = P.symm_Zpixel[pixel];
    b[lor_z]  += v * x[pixel_z];

    p           = P.symm_Xpixel[pixel_z];
    b[LOR_Z_X]  += v * x[p];

    p           = P.symm_Ypixel[pixel_z];
    b[LOR_Z_Y]  += v * x[p];

    p           = P.symm_XYpixel[pixel_z];
    b[LOR_Z_XY] += v * x[p];
  }
}

For anyone who wants to know, the code implements the MLEM forward-projection function, and all the variables are float.

After several tests, I noticed that the big delay is in this part of the code (you know, the 90/10 rule).

Later, I used PAPI (http://cl.cs.utk.edu/papi/) to measure L1D cache misses. As I suspected, PAPI confirmed that the performance drop is due to a high number of misses, particularly from the random accesses to the b vector (which is huge).
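
For reference, a measurement like this can be set up with PAPI's classic high-level counter API; this is just a sketch, not the exact code I used:

#include <stdio.h>
#include <papi.h>

static long long measure_l1d_misses(void)
{
    int       events[1] = { PAPI_L1_DCM };   /* preset event: L1 data-cache misses */
    long long counts[1] = { 0 };

    PAPI_start_counters(events, 1);
    /* ... run the forward-projection loop here ... */
    PAPI_stop_counters(counts, 1);

    return counts[0];
}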

Reading information on the Internet, I only know of two options to improve performance so far: improve data locality or decrease cache pollution.

For the first improvement, I'll try to make the code cache-aware, as proposed by Ulrich Drepper in What Every Programmer Should Know About Memory (www.akkadia.org/drepper/cpumemory.pdf), section A.1 on matrix multiplication.

I believe that blocking the SpMV (sparse matrix-vector multiplication) will improve the performance.
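
To make "blocking" concrete, here is a minimal sketch of a column-blocked CRS SpMV, ignoring the symmetry handling. The blocked layout (csr_block_t with one a2r/a2b/a2p triplet per column block) is an assumption for illustration, not my existing data structure; the idea is that each block only reads a slice of x small enough to stay in L1 while the whole block is traversed.

typedef struct {
    const int   *a2r;   /* per-block row pointers, length nrows+1           */
    const int   *a2b;   /* column (pixel) indices restricted to this block  */
    const float *a2p;   /* matrix values                                    */
} csr_block_t;

/* b += A * x, processed one column block at a time. */
static void spmv_blocked(const csr_block_t *blocks, int nblocks, int nrows,
                         const float *x, float *b)
{
    for (int cb = 0; cb < nblocks; cb++) {        /* column blocks */
        const csr_block_t *blk = &blocks[cb];
        for (int r = 0; r < nrows; r++) {         /* rows (LORs)   */
            float acc = 0.0f;
            for (int s = blk->a2r[r]; s < blk->a2r[r + 1]; s++)
                acc += blk->a2p[s] * x[blk->a2b[s]];
            b[r] += acc;                          /* one store per row per block */
        }
    }
}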

On the other hand, every time the program accesses the b vector, we get what is known as cache pollution.

Is there a way to load a value from the b vector with a SIMD instruction without going through the cache?

Also, is it possible to use a function like void _mm_stream_ps(float *p, __m128 a) to store ONE float value into vector b without polluting the cache?

I can't use _mm_stream_ps as-is because it always stores 4 floats, and the accesses to the b vector are clearly random.

I hope my dilemma is clear.

More info: v is the value of a sparse-matrix entry stored in CRS format. I realize that other optimizations could be done if I changed the CRS format to a different one; however, as I said before, I have run tests for months and I know the performance drop comes from the random accesses to vector b: from about 400,000,000 L1D misses I go down to roughly 100 misses when I don't store into vector b.

Thanks.


I'd say, first try to help your compiler a bit.

  • Declare the bounds for the outer loop as const before the loop.
  • Declare all the variables that you can (all the LOR_...) as local variables, something like size_t LOR_X = P.symmLOR[lor].x; or size_t s0 = P.a2r[lor];
  • The same applies in particular to loop variables, if you happen to have a modern, C99-compliant compiler: for (size_t s = s0; s < s1; s++)
  • Separate the loads and stores for your b vector. The locations of the items that you access there do not depend on s, so create a local accumulator for each of the distinct cases that you handle before the inner loop, update these locals inside the inner loop, and store the results back into b after the inner loop (see the sketch after this list).
  • Perhaps split your inner loop into several loops. The index computations are relatively cheap, and your system might then better recognize the streaming access to some of your vectors.
  • Look at the assembler that your compiler produces and identify the code for the inner loop. Experiment a bit to move as many of the "far" loads and stores out of that loop as possible.
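
As a concrete illustration of the points about b, here is a minimal sketch of the rewritten loop. The const size_t / int types are assumptions, since the original declarations are not shown; the arithmetic is unchanged, only where the b[] stores happen moves.

for (lor = lor0[mypid]; lor <= lor1[mypid]; lor++)
{
  const size_t LOR_X    = P.symmLOR[lor].x;
  const size_t LOR_Y    = P.symmLOR[lor].y;
  const size_t LOR_XY   = P.symmLOR[lor].xy;
  const size_t lor_z    = P.symmLOR[lor].z;
  const size_t LOR_Z_X  = P.symmLOR[lor_z].x;
  const size_t LOR_Z_Y  = P.symmLOR[lor_z].y;
  const size_t LOR_Z_XY = P.symmLOR[lor_z].xy;

  const size_t s0 = P.a2r[lor];
  const size_t s1 = P.a2r[lor + 1];

  /* one accumulator per b[] slot touched by this LOR */
  float acc   = 0.0f, acc_x  = 0.0f, acc_y  = 0.0f, acc_xy  = 0.0f;
  float acc_z = 0.0f, acc_zx = 0.0f, acc_zy = 0.0f, acc_zxy = 0.0f;

  for (size_t s = s0; s < s1; s++)
  {
    const int   pixel   = P.a2b[s];
    const int   pixel_z = P.symm_Zpixel[pixel];
    const float v       = P.a2p[s];

    acc     += v * x[pixel];
    acc_x   += v * x[P.symm_Xpixel[pixel]];
    acc_y   += v * x[P.symm_Ypixel[pixel]];
    acc_xy  += v * x[P.symm_XYpixel[pixel]];
    acc_z   += v * x[pixel_z];
    acc_zx  += v * x[P.symm_Xpixel[pixel_z]];
    acc_zy  += v * x[P.symm_Ypixel[pixel_z]];
    acc_zxy += v * x[P.symm_XYpixel[pixel_z]];
  }

  /* eight stores per LOR instead of eight stores per non-zero */
  b[lor]      += acc;
  b[LOR_X]    += acc_x;
  b[LOR_Y]    += acc_y;
  b[LOR_XY]   += acc_xy;
  b[lor_z]    += acc_z;
  b[LOR_Z_X]  += acc_zx;
  b[LOR_Z_Y]  += acc_zy;
  b[LOR_Z_XY] += acc_zxy;
}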

Edit: after re-reading gravitron's answer and your comment, the important thing here really is to declare the variables as locally as possible and to check in the assembler that the compiler succeeds in keeping the cache-missing loads and stores out of the inner loop.


A simple optimization to reduce the random access on vector b would be to never write into vector b within the inner for loop.

Instead, load all the values needed from vector b into temporary variables, run the entire inner for loop while updating these temporaries, then write the temporary variables back into vector b.

The temporary variables will at worst end up on a few shared cache lines; depending on your compiler and environment, you may also hint that the compiler should keep these variables in registers.


I won't even pretend that I know what the code is doing :) But one possible cause for some extra memory access is aliasing: if the compiler can't be sure that b, x, and the various P.symm arrays don't overlap, then writes to b will affect how reads from x and the P.symm's can be scheduled. If the compiler is particularly pessimistic, it might even force the fields of P to be re-fetched from memory. All of this will contribute to the cache misses you're seeing. Two simple ways to improve this are:

  1. Use __restrict on b. This guarantees that b doesn't overlap the other arrays, so writes to it won't affect reads (or writes) from other arrays.

  2. Reorder things so that all the reads from the P.symm's come first, then the reads from x, and finally all the writes to b. This should break up some of the dependencies, letting the compiler schedule the reads from the P.symm's in parallel, then the reads from x in parallel, and hopefully do the writes to b sensibly (see the sketch after this list).
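
A minimal sketch of both points, assuming the body of the inner loop is pulled into a helper so the restrict qualifiers can be expressed (the function name accumulate and its parameter list are illustrative, not part of the original code):

/* restrict (C99; __restrict on older compilers) promises the compiler that b
 * does not alias x or the symmetry tables, so all reads can be scheduled
 * freely ahead of all writes. */
static inline void accumulate(float v, int pixel, int pixel_z,
                              const float *restrict x,
                              const int   *restrict symm_X,
                              const int   *restrict symm_Y,
                              const int   *restrict symm_XY,
                              float *restrict b,
                              int lor,   int LOR_X,   int LOR_Y,   int LOR_XY,
                              int lor_z, int LOR_Z_X, int LOR_Z_Y, int LOR_Z_XY)
{
    /* all reads first ... */
    const float x0   = x[pixel];
    const float x_x  = x[symm_X[pixel]];
    const float x_y  = x[symm_Y[pixel]];
    const float x_xy = x[symm_XY[pixel]];
    const float z0   = x[pixel_z];
    const float z_x  = x[symm_X[pixel_z]];
    const float z_y  = x[symm_Y[pixel_z]];
    const float z_xy = x[symm_XY[pixel_z]];

    /* ... then all writes */
    b[lor]      += v * x0;
    b[LOR_X]    += v * x_x;
    b[LOR_Y]    += v * x_y;
    b[LOR_XY]   += v * x_xy;
    b[lor_z]    += v * z0;
    b[LOR_Z_X]  += v * z_x;
    b[LOR_Z_Y]  += v * z_y;
    b[LOR_Z_XY] += v * z_xy;
}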

One other stylistic thing (which will help with point #2) is to not reuse the variable p so much. There's no reason you can't have e.g. p_x, p_y, p_xy, etc., and it will make reordering the code easier.

Once that's all in place, you can start sprinkling prefetch instructions (e.g. __builtin_prefetch on gcc) ahead of known cache misses.
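
For example, a rough sketch inside the inner loop, assuming gcc or clang; the prefetch distance of 8 non-zeros ahead is a guess that would need tuning:

/* Prefetch the x[] entry that will be needed a few iterations from now.
 * Arguments: address, 0 = read, 1 = low temporal locality. */
if (s + 8 < s1)
    __builtin_prefetch(&x[P.a2b[s + 8]], 0, 1);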

Hope that helps.


These are good answers, and I would ask: why so much indexing by index values that don't change much locally?

Also, it wouldn't kill you to do a few random pauses to see where the time typically goes.

