Variable running time of a C program

My SIMD implementation takes a varying amount of time even though it is run on a fixed input. The running time varies between roughly 100 million and 120 million clock cycles. The program calls a function around 600 times, and the most expensive part of that function accesses memory about 2000 times. Overall, memory traffic in my program is therefore quite high.

Is the variation in running time due to memory access patterns/initial memory contents?

I used valgrind to profile my program. It shows that each memory access takes about 8 instructions. Is this normal?

The following is the body of the function that is called 600 times. mulprev[32][20] is the array that is accessed most often.

j = 15;  
u3v = _mm_set_epi64x (0xF, 0xF);
while (j + 1)  
{

    l = j << 2;  
    for (i = 0; i < 20; i++)
    {
        val1v   = _mm_load_si128 ((__m128i *) &elm1v[i]);       
        uv  = _mm_and_si128 (_mm_srli_epi64 (val1v, l), u3v);
        u1  = _mm_extract_epi16 (uv, 0);
        u2  = _mm_extract_epi16 (uv, 4) + 16;

        for (ival = i, ival1 = i + 1, k = 0; k < 20; k += 2, ival += 2, ival1 += 2)
        {
            temp11v = _mm_load_si128 ((__m128i *) &mulprev[u1][k]); 
            temp12v = _mm_load_si128 ((__m128i *) &mulprev[u2][k]);

            val1v   = _mm_load_si128 ((__m128i *) &res[ival]);
            val2v   = _mm_load_si128 ((__m128i *) &res[ival1]); 

            bv  = _mm_xor_si128 (val1v, _mm_unpacklo_epi64 (temp11v, temp12v));
            av  = _mm_xor_si128 (val2v, _mm_unpackhi_epi64 (temp11v, temp12v));

            _mm_store_si128 ((__m128i *) &res[ival], bv);                                   
            _mm_store_si128 ((__m128i *) &res[ival1], av); 
        }
    }

    if (j == 0)
        break;
    val0v = _mm_setzero_si128 ();

    for (i = 0; i < 40; i++)
    {
        testv   = _mm_load_si128 ((__m128i *)  &res[i]);
        val1v   = _mm_srli_epi64 (testv, 60);
        val2v   = _mm_xor_si128  (val0v, _mm_slli_epi64 (testv, 4));
        _mm_store_si128 (&res[i], val2v);
        val0v   = val1v;
    }
    j--;
}       

I want to reduce the computation time of my program. Any suggestions?


You are performing almost no computation in between loads and stores, hence your execution time will most likely be dominated by the cost of moving data to and from cache and memory. Even worse, your data set appears to be relatively small. Probably the only way you can optimise this further is to improve the memory access pattern (make accesses sequential where possible, ensure that cache lines are not wasted, etc.) and/or combine these operations with other code which operates on the same data set before/after this routine (so that the cost of the loads/stores is amortised somewhat).
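
To illustrate what "combining operations" can mean, here is a minimal scalar sketch, not the original routine. It assumes a simplified case where a table is XORed into `res` word by word and the result is then shifted left by 4 bits with a carry into the next word. The function names and the flat `tab` array are hypothetical. The two-pass version loads and stores every word of `res` twice; the fused version does the same work with one load and one store per word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Two passes: each word of res is loaded and stored twice. */
static void xor_then_shift_two_pass(uint64_t *res, const uint64_t *tab, size_t n)
{
    assert(res != NULL && tab != NULL);
    for (size_t i = 0; i < n; i++)
        res[i] ^= tab[i];

    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t w = res[i];
        res[i] = (w << 4) ^ carry;  /* shift left by 4, pull in previous carry */
        carry  = w >> 60;           /* top 4 bits move into the next word */
    }
}

/* Fused: the same result with one load and one store per word. */
static void xor_then_shift_fused(uint64_t *res, const uint64_t *tab, size_t n)
{
    assert(res != NULL && tab != NULL);
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t w = res[i] ^ tab[i];
        res[i] = (w << 4) ^ carry;
        carry  = w >> 60;
    }
}
```

Whether the real routine can be fused this way depends on its data dependencies (there, `res` is updated at shifting offsets inside the inner loop), so treat this only as an illustration of the idea.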

EDIT: note that I gave a very similar answer when you asked much the same question for an apparently earlier version of this routine: How to make the following code faster - you seem to have missed the point that your main performance problem here is memory access, not computation.


Computers are complicated. The variation could easily be caused by background processes interfering in some way. It is hard to suggest improvements without additional info. Generally, the best optimizations are the high-level ones: choose better algorithms and minimize expensive operations. If you don't think there is much room for improvement there, don't expect large gains. You say that your memory accesses take a lot of cycles. I could suggest that you use restrict-qualified pointers where possible, but it's hard to give general advice on optimization issues. You sort of have to try things out yourself.
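
As a rough sketch of the restrict suggestion: in C99, qualifying pointer parameters with `restrict` promises the compiler that the buffers do not overlap, which can let it keep values in registers and reorder or vectorize loads and stores. The function `xor_into` below is a hypothetical helper, not part of the code in the question, and whether it helps there would have to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* XOR src into dst. The restrict qualifiers promise that dst and src
 * never alias; calling this with overlapping buffers is undefined. */
static void xor_into(uint64_t *restrict dst, const uint64_t *restrict src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] ^= src[i];
}
```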


8 cycles for a memory access is quite a long time. Another process might be polluting the CPU caches and causing your program a lot of cache misses, or, if your memory is dynamically allocated, you might be seeing unaligned memory access penalties.
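
If the buffers are allocated dynamically, one way to rule out alignment problems is to request the alignment explicitly. The sketch below uses C11 `aligned_alloc`. The 64-byte cache line size and the helper names `alloc_cache_aligned` and `is_aligned` are assumptions for illustration, so check the line size of your CPU.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64  /* assumed cache line size in bytes */

/* Allocate at least `bytes` bytes starting on a cache line boundary.
 * The size is rounded up because C11 aligned_alloc expects a multiple
 * of the alignment. Release the memory with free(). */
static void *alloc_cache_aligned(size_t bytes)
{
    size_t rounded = (bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    return aligned_alloc(CACHE_LINE, rounded);
}

/* Return nonzero if p is a multiple of `alignment` bytes. */
static int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0);
    return ((uintptr_t)p % alignment) == 0;
}
```

A 64-byte aligned buffer is also 16-byte aligned, which is what `_mm_load_si128` and `_mm_store_si128` require.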

It could be anything.
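
One way to narrow it down is to repeat the measurement many times and compare the fastest and slowest runs: a stable minimum with occasional slow outliers points to outside interference rather than to the code itself. The harness below is a minimal sketch. `time_repeated` and `sample_work` are hypothetical names, and `clock()` reports CPU time in seconds rather than clock cycles.

```c
#include <assert.h>
#include <time.h>

typedef struct {
    double min_s;  /* fastest run, in seconds of CPU time */
    double max_s;  /* slowest run, in seconds of CPU time */
} timing_range;

/* Call fn `runs` times and record the fastest and slowest run. */
static timing_range time_repeated(void (*fn)(void), int runs)
{
    assert(fn != NULL && runs > 0);
    timing_range r = {0.0, 0.0};
    for (int i = 0; i < runs; i++) {
        clock_t t0 = clock();
        fn();
        double s = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (i == 0 || s < r.min_s) r.min_s = s;
        if (i == 0 || s > r.max_s) r.max_s = s;
    }
    return r;
}

/* Placeholder workload; replace it with the routine under test. */
static volatile unsigned long sink;
static void sample_work(void)
{
    unsigned long acc = 0;
    for (unsigned long i = 0; i < 100000UL; i++)
        acc ^= i * 2654435761UL;
    sink = acc;
}
```

Calling `time_repeated(sample_work, 100)` and printing both fields shows how wide the spread is on a given machine.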
