开发者

How to make the following code faster

int u1, u2;  
unsigned long elm1[20], _mulpre[16][20], res1[40], res2[40]; 64 bits long     
res1, res2 initialized to zero.  

l = 60;  
while (l)  
{  
    for (i = 0; i < 20; i += 2)  
    {  
        u1 = (elm1[i] >> l) & 15;  
        u2 = (elm1[i + 1] >> l) & 15;

        for (k = 0; k < 20; k += 2)  
        {  
            simda = _mm_load_si128 ((__m128i *) &_mulpre[u1][k]);  
            simdb = _mm_load_si128 ((__m128i *) &res1[i + k]);  
            simdb = _mm_xor_si128  (simda, simdb);  
            _mm_store_si128 ((__m128i *)&res1[i + k], simdb);  

            simda = _mm_load_si128 ((__m128i *)&_mulpre[u2][k]);  
            sim开发者_运维知识库db = _mm_load_si128 ((__m128i *)&res2[i + k]);  
            simdb = _mm_xor_si128  (simda, simdb);  
            _mm_store_si128 ((__m128i *)&res2[i + k], simdb);  
        } 
    }
    l -= 4;
    All res1, res2 values are left shifted by 4 bits.  
}

The above mentioned code is called many times in my program (profiler shows 98%).

EDIT: In the inner loop, res1[i + k] values are loaded many times for same (i + k) values. I tried with this inside the while loop, I loaded all the res1 values into simd registers (array) and use array elements inside the innermost for loop to update array elements . Once both for loops are done, I stored the array values back to the res1, re2. But computation time increases with this. Any idea where I got wrong? The idea seemed to be correct

Any suggestion to make it faster is welcome.


Unfortunately the most obvious optimisations are probably already being done by the compiler:

  • You can pull &_mulpre[u1] and &mulpre[u2] our of the inner loop.
  • You can pull &res1[i] our of the inner loop.
  • Using different variables for the two inner operations, and reordering them, might allow for better pipelining.

Possibly swapping the outer loops would improve cache locality on elm1.


Well, you could always call it fewer times :-)

The total input & output data looks relatively small, depending on you design and expected input it might be feasible to just cache computations or do lazy evaluation instead of up-front.


There is very little you can do with a routine such as this, since loads and stores will be the dominant factor (you're doing 2 loads + 1 store = 4 bus cycles for a single computational instruction).


l = 60;  
while (l)  
{  
    for (i = 0; i < 20; i += 2)  
    {  
        u1 = (elm1[i] >> l) & 15;  
        u2 = (elm1[i + 1] >> l) & 15;

        for (k = 0; k < 20; k += 2)  
        {  
            _mm_stream_si128 ((__m128i *)&res1[i + k],
                    _mm_xor_si128  (
                                    _mm_load_si128 ((__m128i *) &_mulpre[u1][k]),
                                    _mm_load_si128 ((__m128i *) &res1[i + k]
                                   ));  

            mm_stream_si128 ((__m128i *)&res2[i + k],    
                    _mm_xor_si128  (
                                    _mm_load_si128 ((__m128i *)&_mulpre[u2][k]), 
                                    _mm_load_si128 ((__m128i *)&res2[i + k])
                                   ));  
        } 
    }
    l -= 4;
    All res1, res2 values are left shifted by 4 bits.  
}
  1. Do remember your are using intrinsic, using less _128mi/_mm128 value will speed up your program.
  2. try _mm_stream_si128(), it might speed up the storing process.
  3. try prefetch
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜