How can executing more instructions speed up execution?
When I run the following function, I get somewhat unexpected results.
On my machine, the code below consistently takes about 6 seconds to run. However, if I uncomment the ";dec [variable + 24]" line, and therefore execute more code, it takes about 4.5 seconds to run. Why?
.DATA
variable dq 4 dup(0) ; four zero-initialised qwords
.CODE
runAssemblyCode PROC
mov rax, 2330 * 1000 * 1000
start:
dec [variable]
dec [variable + 8]
dec [variable + 16]
;dec [variable + 24]
dec rax
jnz start
ret
runAssemblyCode ENDP
END
I have noticed that there are similar questions already on Stack Overflow, but their code samples are not as simple as this and I couldn't find any succinct answers to this question.
I have tried padding the code with nop instructions to see if it is an alignment problem, and also set the affinity to a single processor. Neither made any difference.
The simple answer is that modern CPUs are extremely complex. There is a lot going on under the hood that appears unpredictable or random to the observer.
Inserting that extra instruction might cause it to schedule instructions differently, which, in a tight loop like this, might make a difference. But that's just a guess.
As far as I can see, it touches the same cache line as the previous instruction, so it doesn't seem to be a kind of prefetching. I can't really think of a logical explanation, but again, the CPU makes use of a lot of undocumented heuristics and guesses to execute code as fast as possible, and sometimes, that means weird corner cases where they fail, and the code becomes slower than you'd expect.
Have you tested this on different CPU models? Would be interesting to see if this is just on your specific CPU, or if other x86 CPUs exhibit the same thing.
bob.s
.data
variable:
.word 0,0,0,0
.word 0,0,0,0
.word 0,0,0,0
.word 0,0,0,0
.word 0,0,0,0
.word 0,0,0,0
.text
.globl runAssemblyCode
runAssemblyCode:
mov $0xFFFFFFFF,%eax
start_loop:
decl variable+0
decl variable+8
decl variable+16
#decl variable+24
dec %eax
jne start_loop
retq
ted.c
#include <stdio.h>
#include <time.h>
void runAssemblyCode ( void );
int main ( void )
{
volatile unsigned int ra,rb;
ra=(unsigned int)time(NULL);
runAssemblyCode();
rb=(unsigned int)time(NULL);
printf("%u\n",rb-ra);
return(0);
}
gcc -O2 ted.c bob.s -o ted
This was with the extra instruction:
00000000004005d4 <runAssemblyCode>:
4005d4: b8 ff ff ff ff mov $0xffffffff,%eax
00000000004005d9 <start_loop>:
4005d9: ff 0c 25 28 10 60 00 decl 0x601028
4005e0: ff 0c 25 30 10 60 00 decl 0x601030
4005e7: ff 0c 25 38 10 60 00 decl 0x601038
4005ee: ff 0c 25 40 10 60 00 decl 0x601040
4005f5: ff c8 dec %eax
4005f7: 75 e0 jne 4005d9 <start_loop>
4005f9: c3 retq
4005fa: 90 nop
I don't see a difference; maybe you can correct my code, or others can try it on their systems to see what they get.
A decrement directly on memory is an extremely painful instruction; on top of that, anything wider than a byte-sized memory decrement that is unaligned is going to be painful for the memory system. So this routine should be sensitive to cache lines as well as the number of cores, etc.
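Whether the decrements share a cache line, and whether they are aligned, can be checked from the addresses in the disassembly above. This is only a sketch: the 64-byte line size is an assumption (typical for x86, but not confirmed for any of the machines here), and the function names are made up for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumption: 64-byte cache lines, typical for x86 but not guaranteed. */
#define CACHE_LINE_BYTES 64u

/* Index of the cache line that contains addr. */
uint64_t cache_line_index(uint64_t addr)
{
    return addr / CACHE_LINE_BYTES;
}

/* Nonzero if an access of `size` bytes at addr is naturally aligned. */
int is_aligned(uint64_t addr, uint64_t size)
{
    return (addr % size) == 0;
}

/* Print the line index and alignment of each address (hypothetical helper). */
void report_addresses(const uint64_t *addrs, int count, uint64_t size)
{
    for (int i = 0; i < count; i++) {
        printf("0x%llx: line %llu, %s\n",
               (unsigned long long)addrs[i],
               (unsigned long long)cache_line_index(addrs[i]),
               is_aligned(addrs[i], size) ? "aligned" : "unaligned");
    }
}
```

Fed the four addresses from the disassembly (0x601028, 0x601030, 0x601038, 0x601040) with a size of 4, this reports all four as aligned, with the first three in one line and the fourth at the start of the next. In another build the linker may place the variable elsewhere, so the result is specific to this binary.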
On an AMD Phenom 9950 quad-core processor
it took about 13 seconds with or without the extra instruction.
On an Intel(R) Core(TM)2 CPU 6300
it took about 9-10 seconds with or without the extra instruction.
A two processor: Intel(R) Xeon(TM) CPU
took about 13 seconds with or without the extra instruction.
On this: Intel(R) Core(TM)2 Duo CPU T7500
8 seconds with or without.
All are running Ubuntu 64 bit 10.04 or 10.10, might be an 11.04 in there.
Some more machines, 64 bit, ubuntu
Intel(R) Xeon(R) CPU X5450 (8 core)
6 seconds with or without extra instruction.
Intel(R) Xeon(R) CPU E5405 (8 core)
9 seconds with or without.
What is the speed of your DDR/DRAM in your system? What kind of processor are you running (cat /proc/cpuinfo if on linux).
Intel(R) Xeon(R) CPU E5440 (8 core)
6 seconds with or without
Ahh, found a single core, xeon though: Intel(R) Xeon(TM) CPU
15 seconds with or without the extra instruction
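The timings above come from time(NULL), which only resolves whole seconds, so a difference smaller than a second would not show up. A possible finer-grained harness, assuming a POSIX system with clock_gettime and CLOCK_MONOTONIC available; the helper names are invented for this sketch.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* expose clock_gettime under strict -std= modes */
#endif
#include <time.h>

/* Seconds elapsed from start to end, as a double. */
double timespec_diff_seconds(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Run fn once and return its wall-clock duration in seconds.
   Returns -1.0 if the clock cannot be read. */
double time_call_seconds(void (*fn)(void))
{
    struct timespec start, end;

    if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
        return -1.0;
    fn();
    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
        return -1.0;
    return timespec_diff_seconds(start, end);
}
```

In ted.c this would replace the two time(NULL) calls with something like printf("%.3f\n", time_call_seconds(runAssemblyCode)). With glibc versions older than 2.17, linking may also need -lrt.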
It's not that bad. On average, one iteration of the loop takes about 2.6 ns in the 6-second run and about 1.9 ns in the 4.5-second run. Assuming a 2 GHz CPU, which has a period of 0.5 ns, the difference is (2.6 - 1.9) / 0.5 = 1.4, or roughly one clock cycle per iteration, which is nothing surprising.
The time difference only becomes so noticeable because of the number of iterations you requested: one cycle per iteration is 0.5 ns * 2330000000 = about 1.2 seconds, close to the difference you observed.
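The same arithmetic can be written out as a small sketch. The 2 GHz clock is the assumption made in this answer, not a measured value, and the function names are made up for illustration.

```c
/* Average time per loop iteration, in nanoseconds. */
double ns_per_iteration(double total_seconds, double iterations)
{
    return total_seconds * 1e9 / iterations;
}

/* Clock period in nanoseconds for a frequency given in GHz. */
double clock_period_ns(double ghz)
{
    return 1.0 / ghz;
}

/* Difference between two runs, in clock cycles per iteration.
   ghz is an assumed clock frequency, not something this code measures. */
double cycles_per_iteration_delta(double slow_seconds, double fast_seconds,
                                  double iterations, double ghz)
{
    double delta_ns = ns_per_iteration(slow_seconds, iterations)
                    - ns_per_iteration(fast_seconds, iterations);
    return delta_ns / clock_period_ns(ghz);
}
```

With the unrounded figures from the question (6 s and 4.5 s over 2330000000 iterations) and the assumed 2 GHz clock, cycles_per_iteration_delta comes out at about 1.3 cycles per iteration.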