while (n > 1) is 25% faster than while (n)?
I have two logically equivalent functions:
long ipow1(int base, int exp) {
    // HISTORICAL NOTE:
    // This wasn't here in the original question; I edited it in.
    if (exp == 0) return 1;
    long result = 1;
    while (exp > 1) {
        if (exp & 1) result *= base;
        exp >>= 1;
        base *= base;
    }
    return result * base;
}

long ipow2(int base, int exp) {
    long result = 1;
    while (exp) {
        if (exp & 1) result *= base;
        exp >>= 1;
        base *= base;
    }
    return result;
}
NOTICE:
These loops are equivalent because in the former case we are returning result * base (handling the case when exp is, or has been reduced to, 1), but in the second case we are returning result.
Strangely enough, both with -O3 and with -O0, ipow1 consistently outperforms ipow2 by about 25%. How is this possible?
I'm on Windows 7, x64, gcc 4.5.2, compiling with gcc ipow.c -O0 -std=c99.
And this is my profiling code:
#include <stdio.h>
#include <windows.h>   /* QueryPerformanceCounter / QueryPerformanceFrequency */

int main(int argc, char *argv[]) {
    LARGE_INTEGER ticksPerSecond;
    LARGE_INTEGER tick;
    LARGE_INTEGER start_ticks, end_ticks, cputime;
    double totaltime = 0;
    int repetitions = 10000;
    int rep = 0;
    long nopti = 0;

    for (rep = 0; rep < repetitions; rep++) {
        if (!QueryPerformanceFrequency(&ticksPerSecond)) printf("\tno go QueryPerformance not present");
        if (!QueryPerformanceCounter(&tick)) printf("no go counter not installed");

        QueryPerformanceCounter(&start_ticks);
        /* start real code */
        for (int i = 0; i < 55; i++) {
            for (int j = 0; j < 11; j++) {
                nopti = ipow1(i, j); // or ipow2
            }
        }
        /* end code */
        QueryPerformanceCounter(&end_ticks);

        cputime.QuadPart = end_ticks.QuadPart - start_ticks.QuadPart;
        totaltime += (double)cputime.QuadPart / (double)ticksPerSecond.QuadPart;
    }

    printf("\tTotal elapsed CPU time: %.9f sec with %d repetitions - %ld:\n", totaltime, repetitions, nopti);
    return 0;
}
No, really, the two ARE NOT equivalent. ipow2 returns correct results when ipow1 doesn't.
http://ideone.com/MqyqU
P.S. I don't care how many comments you leave "explaining" why they're the same, it takes only a single counter-example to disprove your claims.
P.P.S. -1 on the question for your insufferable arrogance toward everyone who already tried to point this out to you.
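The linked counter-example is not reproduced here, but the obvious failing case is exp == 0 against the original version of ipow1 (before the exp == 0 guard was edited in): its loop body never runs, so it falls through to return result * base, i.e. base, instead of 1. A minimal sketch of that case, with ipow1_original as a made-up name for the un-edited function:

#include <stdio.h>

/* un-edited version from the original question: no exp == 0 guard */
static long ipow1_original(int base, int exp) {
    long result = 1;
    while (exp > 1) {
        if (exp & 1) result *= base;
        exp >>= 1;
        base *= base;
    }
    return result * base;
}

static long ipow2(int base, int exp) {
    long result = 1;
    while (exp) {
        if (exp & 1) result *= base;
        exp >>= 1;
        base *= base;
    }
    return result;
}

int main(void) {
    printf("ipow1_original(5, 0) = %ld\n", ipow1_original(5, 0)); /* 5 -- wrong   */
    printf("ipow2(5, 0)          = %ld\n", ipow2(5, 0));          /* 1 -- correct */
    return 0;
}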
It's because with while (exp > 1) the loop runs until exp has been shifted down to 1 (the last pass executes with exp = 2, shifts it to 1, and then the loop ends). With while (exp), the loop runs until exp reaches 0 (there is one more pass, with exp = 1, which shifts it to 0 before the loop ends).
So with while (exp) you have an extra iteration, which takes extra time to run.
EDIT: Even though the exp > 1 version does one multiplication after the loop, keep in mind that the multiplication is not the only work done inside the loop.
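For a concrete illustration, here is a minimal trace of the values exp takes under each loop condition, starting from exp = 8 (just a sketch of the two loop headers, nothing else from the functions):

#include <stdio.h>

int main(void) {
    int e;

    /* while (exp > 1): stops once exp has been shifted down to 1 */
    for (e = 8; e > 1; e >>= 1) printf("%d ", e);   /* prints: 8 4 2   -> 3 passes */
    printf("\n");

    /* while (exp): one more pass, the one where exp == 1 */
    for (e = 8; e; e >>= 1) printf("%d ", e);       /* prints: 8 4 2 1 -> 4 passes */
    printf("\n");

    return 0;
}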
If you don't want to read all of this, skip to the bottom: I come up with a 21% difference just by analyzing the code.
Different systems, different compiler versions, and even the same compiler version built by different folks/distros will give different instruction mixes; this is just one example of what you might get.
long ipow1(int base, int exp) {
    long result = 1;
    while (exp > 1) {
        if (exp & 1) result *= base;
        exp >>= 1;
        base *= base;
    }
    return result * base;
}

long ipow2(int base, int exp) {
    long result = 1;
    while (exp) {
        if (exp & 1) result *= base;
        exp >>= 1;
        base *= base;
    }
    return result;
}
0000000000000000 <ipow1>:
0: 83 fe 01 cmp $0x1,%esi
3: ba 01 00 00 00 mov $0x1,%edx
8: 7e 1d jle 27 <ipow1+0x27>
a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
10: 40 f6 c6 01 test $0x1,%sil
14: 74 07 je 1d <ipow1+0x1d>
16: 48 63 c7 movslq %edi,%rax
19: 48 0f af d0 imul %rax,%rdx
1d: d1 fe sar %esi
1f: 0f af ff imul %edi,%edi
22: 83 fe 01 cmp $0x1,%esi
25: 7f e9 jg 10 <ipow1+0x10>
27: 48 63 c7 movslq %edi,%rax
2a: 48 0f af c2 imul %rdx,%rax
2e: c3 retq
2f: 90 nop
0000000000000030 <ipow2>:
30: 85 f6 test %esi,%esi
32: b8 01 00 00 00 mov $0x1,%eax
37: 75 0a jne 43 <ipow2+0x13>
39: eb 19 jmp 54 <ipow2+0x24>
3b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
40: 0f af ff imul %edi,%edi
43: 40 f6 c6 01 test $0x1,%sil
47: 74 07 je 50 <ipow2+0x20>
49: 48 63 d7 movslq %edi,%rdx
4c: 48 0f af c2 imul %rdx,%rax
50: d1 fe sar %esi
52: 75 ec jne 40 <ipow2+0x10>
54: f3 c3 repz retq
Isolating the loops:
while (exp > 1) {
    if (exp & 1) result *= base;
    exp >>= 1;
    base *= base;
}
//if exp & 1 not true jump to 1d to skip
10: 40 f6 c6 01 test $0x1,%sil
14: 74 07 je 1d <ipow1+0x1d>
//result *= base
16: 48 63 c7 movslq %edi,%rax
19: 48 0f af d0 imul %rax,%rdx
//exp>>=1
1d: d1 fe sar %esi
//base *= base
1f: 0f af ff imul %edi,%edi
//while(exp>1) staying in the loop
22: 83 fe 01 cmp $0x1,%esi
25: 7f e9 jg 10 <ipow1+0x10>
Comparing something to zero normally saves you an instruction, and you can see that here:
while (exp) {
    if (exp & 1) result *= base;
    exp >>= 1;
    base *= base;
}
//base *= base
40: 0f af ff imul %edi,%edi
//if exp & 1 not true jump to skip
43: 40 f6 c6 01 test $0x1,%sil
47: 74 07 je 50 <ipow2+0x20>
//result *= base
49: 48 63 d7 movslq %edi,%rdx
4c: 48 0f af c2 imul %rdx,%rax
//exp>>=1
50: d1 fe sar %esi
//no need for a compare
52: 75 ec jne 40 <ipow2+0x10>
Your timing method is going to generate a lot of error/chaos. Depending on the beat frequency of the loop and the accuracy of the timer you can create a lot of gain in one and a lot of loss in another. This method normally gives better accuracy:
starttime = ...
for (rep = bignumber; rep; rep--) {
    // code under test
    ...
}
endtime = ...
total = endtime - starttime;
Of course, if you are running this on an operating system, the timing is going to have a decent amount of error in it anyway.
Also, you want to use volatile variables for your timer variables; that helps the compiler not to re-arrange the order of execution (been there, seen that).
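Applied to the benchmark from the question, that suggestion would look roughly like this. This is only a sketch: it keeps the QueryPerformanceCounter calls from the question, assumes ipow1 is defined as above, and uses a volatile sink so the calls are not optimized away:

#include <stdio.h>
#include <windows.h>

long ipow1(int base, int exp);       /* defined as in the question */

int main(void) {
    LARGE_INTEGER freq, start, end;
    volatile long sink = 0;          /* keep the result "live" */
    const int repetitions = 10000;

    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start); /* read the timer once, outside the repetition loop */

    for (int rep = repetitions; rep; rep--) {
        for (int i = 0; i < 55; i++)
            for (int j = 0; j < 11; j++)
                sink = ipow1(i, j);  /* or ipow2 */
    }

    QueryPerformanceCounter(&end);   /* read it once more when everything is done */

    double total = (double)(end.QuadPart - start.QuadPart) / (double)freq.QuadPart;
    printf("Total elapsed time: %.9f sec (%ld)\n", total, sink);
    return 0;
}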
If we look at this from the perspective of the base multiplies:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

unsigned int mults;

long ipow1(int base, int exp) {
    long result = 1;
    while (exp > 1) {
        if (exp & 1) result *= base;
        exp >>= 1;
        base *= base;
        mults++;
    }
    result *= base;
    return result;
}

long ipow2(int base, int exp) {
    long result = 1;
    while (exp) {
        if (exp & 1) result *= base;
        exp >>= 1;
        base *= base;
        mults++;
    }
    return result;
}

int main ( void )
{
    int i;
    int j;

    mults = 0;
    for (i = 0; i < 55; i++) {
        for (j = 0; j < 11; j++) {
            ipow1(i, j);
        }
    }
    printf("mults %u\n", mults);

    mults = 0;
    for (i = 0; i < 55; i++) {
        for (j = 0; j < 11; j++) {
            ipow2(i, j);
        }
    }
    printf("mults %u\n", mults);
}
The output is:
mults 1045
mults 1595
That is about 50% more for ipow2(). Actually, it is not just the multiplies; it is that you are going through the loop about 50% more times.
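Those counts can be cross-checked without instrumenting the functions at all, since the mults counter is incremented once per loop pass. A small sketch that just counts loop passes for the same 55 x 11 inputs:

#include <stdio.h>

int main(void) {
    unsigned int it1 = 0, it2 = 0;
    int i, j, e;

    for (i = 0; i < 55; i++) {
        for (j = 0; j < 11; j++) {
            for (e = j; e > 1; e >>= 1) it1++;  /* passes of the while (exp > 1) loop */
            for (e = j; e; e >>= 1) it2++;      /* passes of the while (exp) loop     */
        }
    }

    printf("exp > 1 loop:  %u passes\n", it1);  /* 1045 */
    printf("exp != 0 loop: %u passes\n", it2);  /* 1595 */
    return 0;
}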
ipow1() gets a little back on the other multiplies:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

unsigned int mults;

long ipow1(int base, int exp) {
    long result = 1;
    while (exp > 1) {
        if (exp & 1) mults++;
        exp >>= 1;
        base *= base;
    }
    mults++;
    return result;
}

long ipow2(int base, int exp) {
    long result = 1;
    while (exp) {
        if (exp & 1) mults++;
        exp >>= 1;
        base *= base;
    }
    return result;
}

int main ( void )
{
    int i;
    int j;

    mults = 0;
    for (i = 0; i < 55; i++) {
        for (j = 0; j < 11; j++) {
            ipow1(i, j);
        }
    }
    printf("mults %u\n", mults);

    mults = 0;
    for (i = 0; i < 55; i++) {
        for (j = 0; j < 11; j++) {
            ipow2(i, j);
        }
    }
    printf("mults %u\n", mults);
}
ipow1() performs the result *= base multiplication a different number of times (more) than ipow2():
mults 990
mults 935
Being long * int multiplications, these can be more expensive, but not by enough to make up for the losses around the loop in ipow2().
Even without disassembling, you can make a rough guess at the operations/instructions you hope the compiler uses. I am accounting here for processors in general, not necessarily x86; some processors will run this code better than others (from the perspective of the number of instructions executed, not counting all the other factors).
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

unsigned int ops;

long ipow1(int base, int exp) {
    long result = 1;
    ops++; //result = immediate
    while (exp > 1) {
        ops++; // compare exp - 1
        ops++; // conditional jump
        //if (exp & 1)
        ops++; //exp&1
        ops++; //conditional jump
        if (exp & 1)
        {
            result *= base;
            ops++;
        }
        exp >>= 1;
        ops++;
        //ops+=?; //using a signed number can cost you this on some systems
        //always use unsigned unless you have a specific reason to use signed.
        //if this had been a short or char variable it might cost you even more
        //operations
        //if this needs to be signed it is what it is, just be aware of
        //the cost
        base *= base;
        ops++;
    }
    result *= base;
    ops++;
    return result;
}

long ipow2(int base, int exp) {
    long result = 1;
    ops++;
    while (exp) {
        //ops++; //cmp exp-0, often optimizes out;
        ops++; //conditional jump
        //if (exp & 1)
        ops++;
        ops++;
        if (exp & 1)
        {
            result *= base;
            ops++;
        }
        exp >>= 1;
        ops++;
        //ops+=?; //right shifting a signed number
        base *= base;
        ops++;
    }
    return result;
}

int main ( void )
{
    int i;
    int j;

    ops = 0;
    for (i = 0; i < 55; i++) {
        for (j = 0; j < 11; j++) {
            ipow1(i, j);
        }
    }
    printf("ops %u\n", ops);

    ops = 0;
    for (i = 0; i < 55; i++) {
        for (j = 0; j < 11; j++) {
            ipow2(i, j);
        }
    }
    printf("ops %u\n", ops);
}
Assuming I counted all the major operations and didn't unfairly give one function more than the other:
ops 7865
ops 9515
ipow2 is 21% slower using this analysis.
I think the big killer is the 50% more times through the loop. Granted, it is data dependent; you might find inputs in a benchmark test that make the difference between the functions larger or smaller than the 25% you are seeing.
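For instance, a rough sketch of that data dependence (loop-pass counts only, nothing measured): the extra pass is a large fraction of the total work for small exponents and negligible for large ones, so a benchmark dominated by small exponents, like this one, exaggerates the gap.

#include <stdio.h>

/* count loop passes under each condition for a single exponent */
static int passes_gt1(int e) { int n = 0; while (e > 1) { e >>= 1; n++; } return n; }
static int passes_nz(int e)  { int n = 0; while (e)     { e >>= 1; n++; } return n; }

int main(void) {
    int samples[] = { 1, 2, 10, 1000, 1000000 };
    int k;
    for (k = 0; k < 5; k++) {
        int e = samples[k];
        printf("exp = %7d: %2d vs %2d passes\n", e, passes_gt1(e), passes_nz(e));
    }
    return 0;
}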
Your functions are not "logically equal".
while (exp > 1){...}
is NOT logically equal to
while (exp){...}
Why do you say it is?
Does this really generate the same assembly code? When I tried (with gcc 4.5.1 on OpenSuse 11.4, I will admit) I found slight differences.
ipow1.s:
cmpl $1, -24(%rbp)
jg .L4
movl -20(%rbp), %eax
cltq
imulq -8(%rbp), %rax
leave
ipow2.s:
cmpl $0, -24(%rbp)
jne .L4
movq -8(%rbp), %rax
leave
Perhaps the processor's branch prediction is just more effective with jg than with jne? It seems unlikely that one branch instruction would run 25% faster than another (especially when cmpl has done most of the heavy lifting).