How to make the following code faster

2023-01-30 12:03 问答作者：

int u1, u2;  
unsigned long elm1[20], _mulpre[16][20], res1[40], res2[40]; 64 bits long     
res1, res2 initialized to zero.  

l = 60;  
while (l)  
{  
    for (i = 0; i < 20; i += 2)  
    {  
        u1 = (elm1[i] >> l) & 15;  
        u2 = (elm1[i + 1] >> l) & 15;

        for (k = 0; k < 20; k += 2)  
        {  
            simda = _mm_load_si128 ((__m128i *) &_mulpre[u1][k]);  
            simdb = _mm_load_si128 ((__m128i *) &res1[i + k]);  
            simdb = _mm_xor_si128  (simda, simdb);  
            _mm_store_si128 ((__m128i *)&res1[i + k], simdb);  

            simda = _mm_load_si128 ((__m128i *)&_mulpre[u2][k]);  
            sim开发者_运维知识库db = _mm_load_si128 ((__m128i *)&res2[i + k]);  
            simdb = _mm_xor_si128  (simda, simdb);  
            _mm_store_si128 ((__m128i *)&res2[i + k], simdb);  
        } 
    }
    l -= 4;
    All res1, res2 values are left shifted by 4 bits.  
}

The above mentioned code is called many times in my program (profiler shows 98%).

EDIT: In the inner loop, res1[i + k] values are loaded many times for same (i + k) values. I tried with this inside the while loop, I loaded all the res1 values into simd registers (array) and use array elements inside the innermost for loop to update array elements . Once both for loops are done, I stored the array values back to the res1, re2. But computation time increases with this. Any idea where I got wrong? The idea seemed to be correct

Any suggestion to make it faster is welcome.

Unfortunately the most obvious optimisations are probably already being done by the compiler:

You can pull &_mulpre[u1] and &mulpre[u2] our of the inner loop.
You can pull &res1[i] our of the inner loop.
Using different variables for the two inner operations, and reordering them, might allow for better pipelining.

Possibly swapping the outer loops would improve cache locality on elm1.

Well, you could always call it fewer times :-)

The total input & output data looks relatively small, depending on you design and expected input it might be feasible to just cache computations or do lazy evaluation instead of up-front.

There is very little you can do with a routine such as this, since loads and stores will be the dominant factor (you're doing 2 loads + 1 store = 4 bus cycles for a single computational instruction).

l = 60;  
while (l)  
{  
    for (i = 0; i < 20; i += 2)  
    {  
        u1 = (elm1[i] >> l) & 15;  
        u2 = (elm1[i + 1] >> l) & 15;

        for (k = 0; k < 20; k += 2)  
        {  
            _mm_stream_si128 ((__m128i *)&res1[i + k],
                    _mm_xor_si128  (
                                    _mm_load_si128 ((__m128i *) &_mulpre[u1][k]),
                                    _mm_load_si128 ((__m128i *) &res1[i + k]
                                   ));  

            mm_stream_si128 ((__m128i *)&res2[i + k],    
                    _mm_xor_si128  (
                                    _mm_load_si128 ((__m128i *)&_mulpre[u2][k]), 
                                    _mm_load_si128 ((__m128i *)&res2[i + k])
                                   ));  
        } 
    }
    l -= 4;
    All res1, res2 values are left shifted by 4 bits.  
}

Do remember your are using intrinsic, using less _128mi/_mm128 value will speed up your program.
try _mm_stream_si128(), it might speed up the storing process.
try prefetch

继续阅读：c optimization simd sse sse2

How to make the following code faster

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？