OpenMP C parallelisation algorithm
in the boo开发者_高级运维k "Using OpenMP" is an example for bad memory access in C and I think this is the main problem in my attempt to parallelism the gaussian algorithm.
The example looks something like this:
k= 0 ;
for( int j=0; j<n ; j++)
for(int i = 0; i<n; i++)
a[i][j] = a[i][j] - a[i][k]*a[k][j] ;
So, I do understand why this causes a bad memory access. In C a 2d array is stored by rows and here in every i step a new row will be copied from memory to cache.
I am trying to find a solution for this, but im not getting a good speed up. The effects of my attempts are minor.
Can someone give me a hint what I can do?
The easiest way would be to swap the for loops, but I want to do it columnwise.
The second attempt:
for( int j=0; j<n-1 ; j+=2)
for(int i = 0; i<n; i++)
{
a[i][j] = a[i][j] - a[i][k]*a[k][j] ;
a[i][j+1] = a[i][j+1] - a[i][k]*a[k][j+1] ;
}
didn't make a difference at all.
The third attempt:
for( int j=0; j<n ; j++)
{
d= a[k][j] ;
for(int i = 0; i<n; i++)
{
e = a[i][k] ;
a[i][j] = a[i][j] - e*d ;
}
}
Thx alot
Greets Stepp
use flat array instead, eg:
#define A(i,j) A[i+j*ldA]
for( int j=0; j<n ; j++)
{
d= A(k,j) ;
...
}
Your loop order will cause a cache miss on every iteration, as you point out. So just swap the order of the loop statements:
for (int i = 0; i < n; i++) // now "i" is first
for (int j = 0; j < n; j++)
a[i][j] = a[i][j] - a[i][k]*a[k][j];
This will fix the row in a
and vary just the columns, which means your memory accesses will be contiguous.
This memory access problem is just related to CACHE usage not to Openmp.
To make a good use of cache in general you should access contiguous memory locations. Remember also that if two or more threads are accessing the same memory area then you can have a "false shearing" problem forcing cache to be reloaded unnecessarily.
See for example:
http://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads/
精彩评论