while (n > 1) is 25% faster than while (n)?
I have two logically equivalent functions:
long ipow1(int base, int exp) {
    // HISTORICAL NOTE:
    // This wasn't here in the original question; I edited it in.
    if (exp == 0) return 1;
    long result = 1;
    while (exp > 1) {
        if (exp & 1) result *= base;
        exp >>= 1;
        base *= base;
    }
    return result * base;
}

long ipow2(int base, int exp) {
    long result = 1;
    while (exp) {
        if (exp & 1) result *= base;
        exp >>= 1;
        base *= base;
    }
    return result;
}
NOTICE:
These loops are equivalent because in the former case we are returning result * base (handling the case when exp is, or has been reduced to, 1), but in the second case we are returning result.
Strangely enough, both with -O3 and with -O0, ipow1 consistently outperforms ipow2 by about 25%. How is this possible?
I'm on Windows 7, x64, gcc 4.5.2, compiling with gcc ipow.c -O0 -std=c99.
And this is my profiling code:
#include <stdio.h>
#include <windows.h>   /* QueryPerformanceCounter / QueryPerformanceFrequency */

int main(int argc, char *argv[]) {
    LARGE_INTEGER ticksPerSecond;
    LARGE_INTEGER tick;
    LARGE_INTEGER start_ticks, end_ticks, cputime;
    double totaltime = 0;
    int repetitions = 10000;
    int rep = 0;
    long nopti = 0;

    for (rep = 0; rep < repetitions; rep++) {
        if (!QueryPerformanceFrequency(&ticksPerSecond)) printf("\tno go QueryPerformance not present");
        if (!QueryPerformanceCounter(&tick)) printf("no go counter not installed");

        QueryPerformanceCounter(&start_ticks);
        /* start real code */
        for (int i = 0; i < 55; i++) {
            for (int j = 0; j < 11; j++) {
                nopti = ipow1(i, j); // or ipow2
            }
        }
        /* end code */
        QueryPerformanceCounter(&end_ticks);

        cputime.QuadPart = end_ticks.QuadPart - start_ticks.QuadPart;
        totaltime += (double)cputime.QuadPart / (double)ticksPerSecond.QuadPart;
    }

    printf("\tTotal elapsed CPU time: %.9f sec with %d repetitions - %ld:\n", totaltime, repetitions, nopti);
    return 0;
}
No, really, the two ARE NOT equivalent. ipow2 returns correct results when ipow1 doesn't.
http://ideone.com/MqyqU
P.S. I don't care how many comments you leave "explaining" why they're the same, it takes only a single counter-example to disprove your claims.
P.P.S. -1 on the question for your insufferable arrogance toward everyone who already tried to point this out to you.
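The linked counter-example is not reproduced here, but the obvious failing case is exp == 0 against the original version of ipow1 (before the exp == 0 guard was edited in): its loop body never runs, so it falls through to return result * base, i.e. base, instead of 1. A minimal sketch of that case, with ipow1_original as a made-up name for the un-edited function:

#include <stdio.h>

/* un-edited version from the original question: no exp == 0 guard */
static long ipow1_original(int base, int exp) {
    long result = 1;
    while (exp > 1) {
        if (exp & 1) result *= base;
        exp >>= 1;
        base *= base;
    }
    return result * base;
}

static long ipow2(int base, int exp) {
    long result = 1;
    while (exp) {
        if (exp & 1) result *= base;
        exp >>= 1;
        base *= base;
    }
    return result;
}

int main(void) {
    printf("ipow1_original(5, 0) = %ld\n", ipow1_original(5, 0)); /* 5 -- wrong   */
    printf("ipow2(5, 0)          = %ld\n", ipow2(5, 0));          /* 1 -- correct */
    return 0;
}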
It's because with while (exp > 1) the loop runs until exp has been shifted down to 1 (the last pass executes with exp = 2, shifts it to 1, and then the loop ends). With while (exp), the loop runs until exp reaches 0 (there is one more pass, with exp = 1, which shifts it to 0 before the loop ends).
So with while (exp) you have an extra iteration, which takes extra time to run.
EDIT: Even though the exp > 1 version does one multiplication after the loop, keep in mind that the multiplication is not the only work done inside the loop.
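For a concrete illustration, here is a minimal trace of the values exp takes under each loop condition, starting from exp = 8 (just a sketch of the two loop headers, nothing else from the functions):

#include <stdio.h>

int main(void) {
    int e;

    /* while (exp > 1): stops once exp has been shifted down to 1 */
    for (e = 8; e > 1; e >>= 1) printf("%d ", e);   /* prints: 8 4 2   -> 3 passes */
    printf("\n");

    /* while (exp): one more pass, the one where exp == 1 */
    for (e = 8; e; e >>= 1) printf("%d ", e);       /* prints: 8 4 2 1 -> 4 passes */
    printf("\n");

    return 0;
}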
If you don't want to read all of this, skip to the bottom: I come up with a 21% difference just by analyzing the code.
Different systems, different compiler versions, and even the same compiler version built by different folks/distros will give different instruction mixes; this is just one example of what you might get.
long ipow1(int base, int exp) {
    long result = 1;
    while (exp > 1) {
        if (exp & 1) result *= base;
        exp >>= 1;
        base *= base;
    }
    return result * base;
}

long ipow2(int base, int exp) {
    long result = 1;
    while (exp) {
        if (exp & 1) result *= base;
        exp >>= 1;
        base *= base;
    }
    return result;
}
0000000000000000 <ipow1>:
0: 83 fe 01 cmp $0x1,%esi
3: ba 01 00 00 00 mov $0x1,%edx
8: 7e 1d jle 27 <ipow1+0x27>
a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
10: 40 f6 c6 01 test $0x1,%sil
14: 74 07 je 1d <ipow1+0x1d>
16: 48 63 c7 movslq %edi,%rax
19: 48 0f af d0 imul %rax,%rdx
1d: d1 fe sar %esi
1f: 0f af ff imul %edi,%edi
22: 83 fe 01 cmp $0x1,%esi
25: 7f e9 jg 10 <ipow1+0x10>
27: 48 63 c7 movslq %edi,%rax
2a: 48 0f af c2 imul %rdx,%rax
2e: c3 retq
2f: 90 nop
0000000000000030 <ipow2>:
30: 85 f6 test %esi,%esi
32: b8 01 00 00 00 mov $0x1,%eax
37: 75 0a jne 43 <ipow2+0x13>
39: eb 19 jmp 54 <ipow2+0x24>
3b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
40: 0f af ff imul %edi,%edi
43: 40 f6 c6 01 test $0x1,%sil
47: 74 07 je 50 <ipow2+0x20>
49: 48 63 d7 movslq %edi,%rdx
4c: 48 0f af c2 imul %rdx,%rax
50: d1 fe sar %esi
52: 75 ec jne 40 <ipow2+0x10>
54: f3 c3 repz retq
Isolating the loops:
while (exp > 1) {
    if (exp & 1) result *= base;
    exp >>= 1;
    base *= base;
}
//if exp & 1 not true jump to 1d to skip
10: 40 f6 c6 01 test $0x1,%sil
14: 74 07 je 1d <ipow1+0x1d>
//result *= base
16: 48 63 c7 movslq %edi,%rax
19: 48 0f af d0 imul %rax,%rdx
//exp>>=1
1d: d1 fe sar %esi
//base *= base
1f: 0f af ff imul %edi,%edi
//while(exp>1) staying in the loop
22: 83 fe 01 cmp $0x1,%esi
25: 7f e9 jg 10 <ipow1+0x10>
Comparing something to zero normally saves you an instruction, and you can see that here:
while (exp) {
    if (exp & 1) result *= base;
    exp >>= 1;
    base *= base;
}
//base *= base
40: 0f af ff imul %edi,%edi
//if exp & 1 not true jump to skip
43: 40 f6 c6 01 test $0x1,%sil
47: 74 07 je 50 <ipow2+0x20>
//result *= base
49: 48 63 d7 movslq %edi,%rdx
4c: 48 0f af c2 imul %rdx,%rax
//exp>>=1
50: d1 fe sar %esi
//no need for a compare
52: 75 ec jne 40 <ipow2+0x10>
Your timing method is going to generate a lot of error/chaos. Depending on the beat frequency of the loop and the accuracy of the timer you can create a lot of gain in one and a lot of loss in another. This method normally gives better accuracy:
starttime = ...
for (rep = bignumber; rep; rep--) {
    // code under test
    ...
}
endtime = ...
total = endtime - starttime;
Of course, if you are running this on an operating system, the timing is going to have a decent amount of error in it anyway.
Also, you want to use volatile variables for your timer variables; that helps the compiler not to re-arrange the order of execution (been there, seen that).
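Applied to the benchmark from the question, that suggestion would look roughly like this. This is only a sketch: it keeps the QueryPerformanceCounter calls from the question, assumes ipow1 is defined as above, and uses a volatile sink so the calls are not optimized away:

#include <stdio.h>
#include <windows.h>

long ipow1(int base, int exp);       /* defined as in the question */

int main(void) {
    LARGE_INTEGER freq, start, end;
    volatile long sink = 0;          /* keep the result "live" */
    const int repetitions = 10000;

    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start); /* read the timer once, outside the repetition loop */

    for (int rep = repetitions; rep; rep--) {
        for (int i = 0; i < 55; i++)
            for (int j = 0; j < 11; j++)
                sink = ipow1(i, j);  /* or ipow2 */
    }

    QueryPerformanceCounter(&end);   /* read it once more when everything is done */

    double total = (double)(end.QuadPart - start.QuadPart) / (double)freq.QuadPart;
    printf("Total elapsed time: %.9f sec (%ld)\n", total, sink);
    return 0;
}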
If we look at this from the perspective of the base multiplies:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

unsigned int mults;

long ipow1(int base, int exp) {
    long result = 1;
    while (exp > 1) {
        if (exp & 1) result *= base;
        exp >>= 1;
        base *= base;
        mults++;
    }
    result *= base;
    return result;
}

long ipow2(int base, int exp) {
    long result = 1;
    while (exp) {
        if (exp & 1) result *= base;
        exp >>= 1;
        base *= base;
        mults++;
    }
    return result;
}

int main ( void )
{
    int i;
    int j;

    mults = 0;
    for (i = 0; i < 55; i++) {
        for (j = 0; j < 11; j++) {
            ipow1(i, j);
        }
    }
    printf("mults %u\n", mults);

    mults = 0;
    for (i = 0; i < 55; i++) {
        for (j = 0; j < 11; j++) {
            ipow2(i, j);
        }
    }
    printf("mults %u\n", mults);
}
The output is:
mults 1045
mults 1595
That is about 50% more for ipow2(). Actually, it is not just the multiplies; it is that you are going through the loop about 50% more times.
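Those counts can be cross-checked without instrumenting the functions at all, since the mults counter is incremented once per loop pass. A small sketch that just counts loop passes for the same 55 x 11 inputs:

#include <stdio.h>

int main(void) {
    unsigned int it1 = 0, it2 = 0;
    int i, j, e;

    for (i = 0; i < 55; i++) {
        for (j = 0; j < 11; j++) {
            for (e = j; e > 1; e >>= 1) it1++;  /* passes of the while (exp > 1) loop */
            for (e = j; e; e >>= 1) it2++;      /* passes of the while (exp) loop     */
        }
    }

    printf("exp > 1 loop:  %u passes\n", it1);  /* 1045 */
    printf("exp != 0 loop: %u passes\n", it2);  /* 1595 */
    return 0;
}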
ipow1() gets a little back on the other multiplies:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

unsigned int mults;

long ipow1(int base, int exp) {
    long result = 1;
    while (exp > 1) {
        if (exp & 1) mults++;
        exp >>= 1;
        base *= base;
    }
    mults++;
    return result;
}

long ipow2(int base, int exp) {
    long result = 1;
    while (exp) {
        if (exp & 1) mults++;
        exp >>= 1;
        base *= base;
    }
    return result;
}

int main ( void )
{
    int i;
    int j;

    mults = 0;
    for (i = 0; i < 55; i++) {
        for (j = 0; j < 11; j++) {
            ipow1(i, j);
        }
    }
    printf("mults %u\n", mults);

    mults = 0;
    for (i = 0; i < 55; i++) {
        for (j = 0; j < 11; j++) {
            ipow2(i, j);
        }
    }
    printf("mults %u\n", mults);
}
ipow1() performs the result *= base multiplication a different number of times (more) than ipow2():
mults 990
mults 935
Being long * int multiplications, these can be more expensive, but not by enough to make up for the losses around the loop in ipow2().
Even without disassembling, you can make a rough guess at the operations/instructions you hope the compiler uses. I am accounting here for processors in general, not necessarily x86; some processors will run this code better than others (from the perspective of the number of instructions executed, not counting all the other factors).
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

unsigned int ops;

long ipow1(int base, int exp) {
    long result = 1;
    ops++; //result = immediate
    while (exp > 1) {
        ops++; // compare exp - 1
        ops++; // conditional jump
        //if (exp & 1)
        ops++; //exp&1
        ops++; //conditional jump
        if (exp & 1)
        {
            result *= base;
            ops++;
        }
        exp >>= 1;
        ops++;
        //ops+=?; //using a signed number can cost you this on some systems
        //always use unsigned unless you have a specific reason to use signed.
        //if this had been a short or char variable it might cost you even more
        //operations
        //if this needs to be signed it is what it is, just be aware of
        //the cost
        base *= base;
        ops++;
    }
    result *= base;
    ops++;
    return result;
}

long ipow2(int base, int exp) {
    long result = 1;
    ops++;
    while (exp) {
        //ops++; //cmp exp-0, often optimizes out;
        ops++; //conditional jump
        //if (exp & 1)
        ops++;
        ops++;
        if (exp & 1)
        {
            result *= base;
            ops++;
        }
        exp >>= 1;
        ops++;
        //ops+=?; //right shifting a signed number
        base *= base;
        ops++;
    }
    return result;
}

int main ( void )
{
    int i;
    int j;

    ops = 0;
    for (i = 0; i < 55; i++) {
        for (j = 0; j < 11; j++) {
            ipow1(i, j);
        }
    }
    printf("ops %u\n", ops);

    ops = 0;
    for (i = 0; i < 55; i++) {
        for (j = 0; j < 11; j++) {
            ipow2(i, j);
        }
    }
    printf("ops %u\n", ops);
}
Assuming I counted all the major operations and didn't unfairly give one function more than the other:
ops 7865
ops 9515
ipow2 is 21% slower using this analysis.
I think the big killer is the 50% more times through the loop. Granted, it is data dependent; you might find inputs in a benchmark test that make the difference between the functions larger or smaller than the 25% you are seeing.
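For instance, a rough sketch of that data dependence (loop-pass counts only, nothing measured): the extra pass is a large fraction of the total work for small exponents and negligible for large ones, so a benchmark dominated by small exponents, like this one, exaggerates the gap.

#include <stdio.h>

/* count loop passes under each condition for a single exponent */
static int passes_gt1(int e) { int n = 0; while (e > 1) { e >>= 1; n++; } return n; }
static int passes_nz(int e)  { int n = 0; while (e)     { e >>= 1; n++; } return n; }

int main(void) {
    int samples[] = { 1, 2, 10, 1000, 1000000 };
    int k;
    for (k = 0; k < 5; k++) {
        int e = samples[k];
        printf("exp = %7d: %2d vs %2d passes\n", e, passes_gt1(e), passes_nz(e));
    }
    return 0;
}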
Your functions are not "logically equal".
while (exp > 1){...}
is NOT logically equal to
while (exp){...}
Why do you say it is?
Does this really generate the same assembly code? When I tried (with gcc 4.5.1 on OpenSuse 11.4, I will admit) I found slight differences.
ipow1.s:
cmpl $1, -24(%rbp)
jg .L4
movl -20(%rbp), %eax
cltq
imulq -8(%rbp), %rax
leave
ipow2.s:
cmpl $0, -24(%rbp)
jne .L4
movq -8(%rbp), %rax
leave
Perhaps the processor's branch prediction is just more effective with jg than with jne? It seems unlikely that one branch instruction would run 25% faster than another (especially when cmpl has done most of the heavy lifting).