How can executing more instructions speed up execution?

When I run the following function, I get somewhat unexpected results.

On my machine, the code below consistently takes about 6 seconds to run. However, if I uncomment the ";dec [variable + 24]" line, thereby executing more code, it takes about 4.5 seconds to run. Why?

.DATA
variable dq 4 dup(0)
.CODE             

runAssemblyCode PROC
    mov rax, 2330 * 1000 * 1000
start:
    dec [variable]
    dec [variable + 8]
    dec [variable + 16]
    ;dec [variable + 24]
    dec rax
    jnz start
    ret 
runAssemblyCode ENDP 
END

I have noticed that there are similar questions already on Stack Overflow, but their code samples are not as simple as this and I couldn't find any succinct answers to this question.

I have tried padding the code with nop instructions to see if it is an alignment problem, and also set the affinity to a single processor. Neither made any difference.


The simple answer is that modern CPUs are extremely complex. There is a lot going on under the hood that appears unpredictable or random to the observer.

Inserting that extra instruction might cause it to schedule instructions differently, which, in a tight loop like this, might make a difference. But that's just a guess.

As far as I can see, it touches the same cache line as the previous instruction, so it doesn't seem to be a kind of prefetching. I can't really think of a logical explanation, but again, the CPU makes use of a lot of undocumented heuristics and guesses to execute code as fast as possible, and sometimes, that means weird corner cases where they fail, and the code becomes slower than you'd expect.

Have you tested this on different CPU models? Would be interesting to see if this is just on your specific CPU, or if other x86 CPUs exhibit the same thing.


bob.s

.data
variable:
    .word 0,0,0,0
    .word 0,0,0,0
    .word 0,0,0,0
    .word 0,0,0,0
    .word 0,0,0,0
    .word 0,0,0,0

.text
.globl runAssemblyCode
runAssemblyCode:
  mov    $0xFFFFFFFF,%eax

start_loop:
  decl variable+0
  decl variable+8
  decl variable+16
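  # decl variable+24   # optional fourth decrement; GNU as on x86 treats ';' as a statement separator, so '#' is needed to comment it out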
  dec    %eax
  jne    start_loop
  retq

ted.c

#include <stdio.h>
#include <time.h>

void runAssemblyCode ( void );

int main ( void )
{
    volatile unsigned int ra,rb;

    ra=(unsigned int)time(NULL);
    runAssemblyCode();
    rb=(unsigned int)time(NULL);
    printf("%u\n",rb-ra);
    return(0);
}

gcc -O2 ted.c bob.s -o ted

This was with the extra instruction:

00000000004005d4 <runAssemblyCode>:
  4005d4:   b8 ff ff ff ff          mov    $0xffffffff,%eax

00000000004005d9 <start_loop>:
  4005d9:   ff 0c 25 28 10 60 00    decl   0x601028
  4005e0:   ff 0c 25 30 10 60 00    decl   0x601030
  4005e7:   ff 0c 25 38 10 60 00    decl   0x601038
  4005ee:   ff 0c 25 40 10 60 00    decl   0x601040 
  4005f5:   ff c8                   dec    %eax
  4005f7:   75 e0                   jne    4005d9 <start_loop>
  4005f9:   c3                      retq   
  4005fa:   90                      nop

I don't see a difference. Maybe you can correct my code, or others can try it on their systems to see what they get.
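One possible refinement, offered as a sketch and not as a correction of anyone's numbers: time(NULL) in ted.c only resolves whole seconds, so a gap of less than a second between two variants could go unnoticed. Assuming a POSIX system, clock_gettime with CLOCK_MONOTONIC gives much finer resolution. The helper names elapsed_seconds and time_call are made up for illustration and are not part of ted.c.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* request clock_gettime from <time.h> */
#endif
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Difference between two timespec readings, in seconds. */
double elapsed_seconds(const struct timespec *begin,
                       const struct timespec *end)
{
    return (double)(end->tv_sec - begin->tv_sec)
         + (double)(end->tv_nsec - begin->tv_nsec) / 1e9;
}

/* Times a single call of fn with the monotonic clock.
   Hypothetical helper, not part of the original ted.c. */
double time_call(void (*fn)(void))
{
    struct timespec begin, end;
    int rc;

    assert(fn != NULL);
    rc = clock_gettime(CLOCK_MONOTONIC, &begin);
    assert(rc == 0);
    fn();
    rc = clock_gettime(CLOCK_MONOTONIC, &end);
    assert(rc == 0);
    return elapsed_seconds(&begin, &end);
}
```

In main, something like printf("%.3f\n", time_call(runAssemblyCode)); could stand in for the two time(NULL) calls. On older glibc versions the link step may also need -lrt.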

A decrement directly on memory is an extremely painful instruction, and if you are doing something other than byte-based memory decrements on unaligned addresses, it is going to be even more painful for the memory system. So this routine should be sensitive to cache lines as well as the number of cores, etc.
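To make the cache-line point concrete, here is a small sketch that maps an address to its cache line. The 64-byte line size is an assumption (common on x86, but not confirmed for any machine in this thread), and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the cache line that contains addr.
   line_size must be a power of two; 64 is only an assumed value. */
uint64_t cache_line_index(uint64_t addr, uint64_t line_size)
{
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
    return addr / line_size;
}

/* Returns 1 if an access of width bytes starting at addr stays
   inside a single cache line, 0 if it straddles two lines. */
int fits_in_one_line(uint64_t addr, uint64_t width, uint64_t line_size)
{
    assert(width != 0);
    return cache_line_index(addr, line_size)
        == cache_line_index(addr + width - 1, line_size);
}
```

Fed the addresses from the disassembly in this answer, 0x601028, 0x601030 and 0x601038 fall in the same 64-byte line, while 0x601040 starts the next one. None of the four 4-byte accesses straddles a line. Whether that placement affects the timing at all is not something this sketch can show.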

Results, each with or without the extra instruction:

AMD Phenom 9950 Quad-Core Processor: about 13 seconds
Intel(R) Core(TM)2 CPU 6300: about 9-10 seconds
Intel(R) Xeon(TM) CPU (two processors): about 13 seconds
Intel(R) Core(TM)2 Duo CPU T7500: 8 seconds

All are running Ubuntu 64 bit 10.04 or 10.10, might be an 11.04 in there.

Some more machines, 64 bit, Ubuntu:

Intel(R) Xeon(R) CPU X5450 (8 core): 6 seconds
Intel(R) Xeon(R) CPU E5405 (8 core): 9 seconds
Intel(R) Xeon(R) CPU E5440 (8 core): 6 seconds
Intel(R) Xeon(TM) CPU (single core): 15 seconds

What is the speed of the DDR/DRAM in your system? What kind of processor are you running (cat /proc/cpuinfo if on Linux)?


It's not that bad. On average, one iteration of the slower loop (6 seconds in total) takes about 2.6 ns, while one iteration of the faster loop (4.5 seconds) takes about 1.9 ns. Assuming a 2 GHz CPU, which has a period of 0.5 ns, the difference is about (2.6 - 1.9) / 0.5 ≈ 1.4 clock cycles per iteration, nothing surprising.
The time difference becomes so noticeable, though, because of the number of iterations you requested: a single 0.5 ns cycle times 2,330,000,000 iterations is already about 1.2 seconds, close to the 1.5-second difference you observed.
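The arithmetic can be reproduced with a few lines of C. This is only a sketch: the 2 GHz clock is the assumption made in this answer, not a measured value, and the function names are invented for illustration.

```c
#include <assert.h>

/* Average time per loop iteration in nanoseconds. */
double ns_per_iteration(double seconds, unsigned long long iterations)
{
    assert(iterations > 0);
    return seconds * 1e9 / (double)iterations;
}

/* Average clock cycles per iteration; a clock of N GHz means
   N cycles per nanosecond. clock_ghz is an assumed figure. */
double cycles_per_iteration(double seconds, unsigned long long iterations,
                            double clock_ghz)
{
    return ns_per_iteration(seconds, iterations) * clock_ghz;
}
```

With the figures from the question (6 and 4.5 seconds for 2,330,000,000 iterations) the unrounded per-iteration times are roughly 2.58 ns and 1.93 ns, so at the assumed 2 GHz the gap is about 1.3 cycles per iteration.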
