Loop optimization by the IBM xlC compiler with AltiVec

I was just playing around with the AltiVec extension on a POWER6 cluster we have. When I compiled the code below without any optimizations, I got a speedup of 4, as I was expecting. However, when I compiled it again with the -O3 flag, I got a speedup of 60!

Just wondering if anyone has more experience with this and can provide some insight into how the compiler is rearranging my code to produce such a speedup. Is the only possible optimization here through assembly and instruction pipelining, or is there something else I am missing that I could include in my future work?

#include <iostream>  // needed for std::cout below

int main(void) {
        const int m = 1000;

        __vector signed int va;
        __vector signed int vb;
        __vector signed int vc;
        __vector signed int vd;

        int a[m];
        int b[m];
        int c[m];

        for( int i=0 ; i < m ; i++ ) {
                a[i] = i;
                b[i] = i;
                c[i] = 0;
        }

        for( int cnt = 0 ; cnt < 10000000 ; cnt++ ) {
                vd = (__vector signed int){cnt,cnt,cnt,cnt};

                for( int i = 0 ; i < m/4 ; i+=4 ) {
                        va = vec_ld(0, &a[i]);
                        vb = vec_ld(0, &b[i]);
                        vc = vec_add(vd, vec_add(va,vb));
                        vec_st(vc, 0, &c[i]);
                }
        }

        std::cout << c[0] << ", " << c[1] << ", " << c[2] << ", " << c[3] << "\n";

        return 0;
}


I've done some work on POWER7, and I have seen very odd things with the xlC compiler, but nothing as odd as this (not 60x, at least).

One thing to note about the PowerPC series (at least POWER6 and POWER7) is that instruction latencies are very long and out-of-order execution is very weak compared to x86/x64.

Therefore, the inner loop (as written in your code) will get extremely low IPC (instructions per cycle): within one iteration the adds wait on the loads and the store waits on the adds, so there is little independent work to overlap with those latencies.

Now, the only way I can imagine you getting a 60x speedup is if the inner loop is completely unrolled under -O3. This is possible because the trip count of the inner loop can be statically determined to be 63: the bound m/4 is 250 and i advances by 4, so i takes the values 0, 4, ..., 248.
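To double-check the trip count, here is a minimal sketch that counts the iterations of a loop with the same bound and step as the inner loop in the question. The helper names `count_iterations` and `trip_count` are made up for this illustration; they are not part of the original code.

```cpp
// Number of times the body of
//     for (int i = 0; i < bound; i += step)
// runs, found by simply running the loop.
int count_iterations(int bound, int step) {
    int n = 0;
    for (int i = 0; i < bound; i += step) {
        ++n;
    }
    return n;
}

// Closed form of the same count, valid for bound >= 0 and step > 0.
constexpr int trip_count(int bound, int step) {
    return (bound + step - 1) / step;
}

// The question's inner loop: bound = m/4 = 250, step = 4.
static_assert(trip_count(1000 / 4, 4) == 63, "inner loop runs 63 times");
```

With 63 iterations of 4 ints each, the loop as posted touches only c[0] through c[251]. A bound of m instead of m/4 would give 250 iterations and cover all 1000 elements, if that was the intent.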

Unrolling that inner loop will basically allow the entire pipeline to be filled.
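To make "unrolling" concrete, here is a rough sketch of the transformation written by hand. It is not the code xlC actually emits; it uses plain ints as a stand-in for vec_ld / vec_add / vec_st so that it builds anywhere, and the function names `add_rolled` and `add_unrolled4` are invented for the example.

```cpp
#include <cstddef>

// Rolled form: one element per iteration, each iteration a short
// load -> add -> add -> store dependency chain.
void add_rolled(const int* a, const int* b, int* c, std::size_t n, int k) {
    for (std::size_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i] + k;
    }
}

// Unrolled by 4: the four chains in the loop body are independent of
// each other, so the hardware can overlap their latencies.
void add_unrolled4(const int* a, const int* b, int* c, std::size_t n, int k) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        const int s0 = a[i]     + b[i];
        const int s1 = a[i + 1] + b[i + 1];
        const int s2 = a[i + 2] + b[i + 2];
        const int s3 = a[i + 3] + b[i + 3];
        c[i]     = s0 + k;
        c[i + 1] = s1 + k;
        c[i + 2] = s2 + k;
        c[i + 3] = s3 + k;
    }
    // Remainder loop for the elements left over when n is not a multiple of 4.
    for (; i < n; ++i) {
        c[i] = a[i] + b[i] + k;
    }
}
```

Both functions produce the same results; only the amount of independent work visible in a single loop body differs. A complete unroll, as guessed above, would go further and remove the loop branch entirely.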

Of course I'm just guessing. Your best bet is to look at the assembly.

Also, how are you timing this? A lot of the weird behavior I've seen on PowerPC is from the timers themselves...
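For comparison, one way to time a region with a monotonic clock is sketched below. It assumes a C++11 compiler with `std::chrono`, which may not hold for the xlC release on your cluster, and the helper names `time_seconds` and `best_of` are made up for this example.

```cpp
#include <chrono>

// Runs f once and returns the elapsed wall-clock time in seconds,
// measured with a clock that never jumps backwards.
template <typename F>
double time_seconds(F f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Runs f `repeats` times (at least once) and returns the fastest run,
// which damps noise from the timer and from other jobs on the node.
template <typename F>
double best_of(int repeats, F f) {
    double best = time_seconds(f);
    for (int r = 1; r < repeats; ++r) {
        const double t = time_seconds(f);
        if (t < best) {
            best = t;
        }
    }
    return best;
}
```

Wrapping both the scalar and the vector version in the same helper, and printing something computed from the result (as your code already does with c[0] to c[3]), keeps the two measurements comparable.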

EDIT:

Since your sample code is fairly simple, it should be very easy to spot (in the assembly) whether or not that inner loop is partially or completely unrolled.
