How can executing more instructions speed up execution?

2023-04-10 14:15 Q&A author:

When I run the following function, I get somewhat unexpected results.

On my machine, the code below consistently takes about 6 seconds to run. However, if I uncomment the ";dec [variable + 24]" line, and therefore execute more code, it takes about 4.5 seconds to run. Why?

.DATA
variable dq 4 dup(0)    ; four zero-initialized qwords (MASM syntax is count DUP(value))
.CODE

runAssemblyCode PROC
    mov rax, 2330 * 1000 * 1000
start:
    dec [variable]
    dec [variable + 8]
    dec [variable + 16]
    ;dec [variable + 24]
    dec rax
    jnz start
    ret 
runAssemblyCode ENDP 
END

I have noticed that there are similar questions already on Stack Overflow, but their code samples are not as simple as this and I couldn't find any succinct answers to this question.

I have tried padding the code with nop instructions to see if it is an alignment problem, and also set the affinity to a single processor. Neither made any difference.

The simple answer is because modern CPUs are extremely complex. There is a lot going on under the hood that appears unpredictable or random to the observer.

Inserting that extra instruction might cause it to schedule instructions differently, which, in a tight loop like this, might make a difference. But that's just a guess.

As far as I can see, it touches the same cache line as the previous instruction, so it doesn't seem to be a kind of prefetching. I can't really think of a logical explanation, but again, the CPU makes use of a lot of undocumented heuristics and guesses to execute code as fast as possible, and sometimes, that means weird corner cases where they fail, and the code becomes slower than you'd expect.

Have you tested this on different CPU models? Would be interesting to see if this is just on your specific CPU, or if other x86 CPUs exhibit the same thing.

bob.s

.data
variable:
    .word 0,0,0,0
    .word 0,0,0,0
    .word 0,0,0,0
    .word 0,0,0,0
    .word 0,0,0,0
    .word 0,0,0,0

.text
.globl runAssemblyCode
runAssemblyCode:
  mov    $0xFFFFFFFF,%eax

start_loop:
  decl variable+0
  decl variable+8
  decl variable+16
  # decl variable+24    # GNU as on x86 uses '#' for comments; ';' only separates statements
  dec    %eax
  jne    start_loop
  retq

ted.c

#include <stdio.h>
#include <time.h>

void runAssemblyCode ( void );

int main ( void )
{
    volatile unsigned int ra,rb;

    ra=(unsigned int)time(NULL);
    runAssemblyCode();
    rb=(unsigned int)time(NULL);
    printf("%u\n",rb-ra);
    return(0);
}

gcc -O2 ted.c bob.s -o ted
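One thing to keep in mind with ted.c is that time() only has one-second resolution, so a sub-second difference between the two variants could go unnoticed. Below is a minimal sketch of a finer-grained measurement using POSIX clock_gettime. The names elapsed_seconds and stand_in_loop are hypothetical and not part of the original code; stand_in_loop only exists so the sketch builds without bob.s.

```c
#define _POSIX_C_SOURCE 199309L  /* expose clock_gettime on strict compilers */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for runAssemblyCode so the sketch builds on its own. */
volatile unsigned int stand_in_counter[4];

void stand_in_loop(void)
{
    for (unsigned int i = 0; i < 1000000u; i++) {
        stand_in_counter[0]--;
        stand_in_counter[1]--;
        stand_in_counter[2]--;
    }
}

/* Run fn once and return the elapsed monotonic time in seconds. */
double elapsed_seconds(void (*fn)(void))
{
    struct timespec t0, t1;

    assert(fn != NULL);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    fn();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

In ted.c this would be used as printf("%.3f\n", elapsed_seconds(runAssemblyCode)); in place of the two time(NULL) calls. On older glibc versions the link step may also need -lrt.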

this was with the extra instruction:

00000000004005d4 <runAssemblyCode>:
  4005d4:   b8 ff ff ff ff          mov    $0xffffffff,%eax

00000000004005d9 <start_loop>:
  4005d9:   ff 0c 25 28 10 60 00    decl   0x601028
  4005e0:   ff 0c 25 30 10 60 00    decl   0x601030
  4005e7:   ff 0c 25 38 10 60 00    decl   0x601038
  4005ee:   ff 0c 25 40 10 60 00    decl   0x601040 
  4005f5:   ff c8                   dec    %eax
  4005f7:   75 e0                   jne    4005d9 <start_loop>
  4005f9:   c3                      retq   
  4005fa:   90                      nop
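Whether the four operands share a cache line depends on where the linker places variable. The following is a small sketch for checking the addresses in the listing above. It assumes 64-byte cache lines, which is typical on x86 but not guaranteed, and same_cache_line is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed cache-line size; 64 bytes is typical on x86 but not guaranteed. */
#define ASSUMED_LINE_BYTES 64u

/* Return nonzero when both addresses fall in the same assumed cache line. */
int same_cache_line(uintptr_t a, uintptr_t b)
{
    assert(ASSUMED_LINE_BYTES != 0u);
    return a / ASSUMED_LINE_BYTES == b / ASSUMED_LINE_BYTES;
}
```

Under that assumption, 0x601028, 0x601030 and 0x601038 from the listing fall in one line, while 0x601040 starts the next one. This applies to this particular build only; a different link layout can move the boundary.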

I don't see a difference. Maybe you can correct my code, or others can try it on their systems to see what they get...

A memory decrement is an extremely painful instruction (it is a read-modify-write every time), and if you are doing something other than byte-based memory decrements on an unaligned address, it is going to be even more painful for the memory system. So this routine should be sensitive to cache lines as well as the number of cores, etc.

Results, each with or without the extra instruction:

AMD Phenom 9950 Quad-Core Processor: about 13 seconds
Intel(R) Core(TM)2 CPU 6300: about 9-10 seconds
A two-processor Intel(R) Xeon(TM) CPU: about 13 seconds
Intel(R) Core(TM)2 Duo CPU T7500: 8 seconds

All are running Ubuntu 64 bit 10.04 or 10.10, might be an 11.04 in there.

Some more machines, 64-bit Ubuntu:

Intel(R) Xeon(R) CPU X5450 (8 core): 6 seconds with or without the extra instruction.
Intel(R) Xeon(R) CPU E5405 (8 core): 9 seconds with or without.

What is the speed of the DDR/DRAM in your system? What kind of processor are you running (cat /proc/cpuinfo if on Linux)?

Intel(R) Xeon(R) CPU E5440 (8 core): 6 seconds with or without.
Ahh, found a single core (a Xeon, though), Intel(R) Xeon(TM) CPU: 15 seconds with or without the extra instruction.

It's not that bad. On average, one iteration of the loop as posted takes about 2.6 ns (6 s / 2330000000), while one iteration with the extra dec takes about 1.9 ns (4.5 s / 2330000000). Assuming a 2 GHz CPU, which has a period of 0.5 ns, the difference is (2.6 - 1.9) / 0.5 = 1.4, so a bit more than 1 clock cycle per iteration, nothing surprising.
The time difference only becomes so noticeable because of the number of iterations you requested: 0.5 ns * 2330000000 = about 1.2 seconds for every cycle of difference per iteration, which is close to the 1.5 seconds you observed.
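The arithmetic above can be sketched as two small helpers. The function names are hypothetical, and the 2 GHz clock is the same assumption made in the answer, not a measured value.

```c
#include <assert.h>

/* Average time per loop iteration in nanoseconds. */
double per_iteration_ns(double total_seconds, unsigned long long iterations)
{
    assert(iterations > 0);
    return total_seconds * 1e9 / (double)iterations;
}

/* Convert a duration in nanoseconds to clock cycles at the given frequency.
 * 1 GHz corresponds to exactly one cycle per nanosecond. */
double ns_to_cycles(double ns, double clock_ghz)
{
    return ns * clock_ghz;
}
```

With the times from the question, per_iteration_ns(6.0, 2330000000ULL) is about 2.58 and per_iteration_ns(4.5, 2330000000ULL) is about 1.93. Passing the 0.64 ns difference to ns_to_cycles with an assumed 2.0 GHz gives roughly 1.3 cycles per iteration when the unrounded times are used.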
