
nested function call faster or not?

I have this silly argument with a friend and need an authoritative word on it.

I have these two snippets and want to know which one is faster. [A or B]

(assuming that compiler does not optimize anything)

[A]

if ( foo () )

[B]

int t = foo ();
if ( t )

EDIT: Guys, this might look like a silly question to you, but I have a hardware engineer friend who was arguing that even WITHOUT optimization (take any processor/compiler pair), CASE B is always faster because it DOES NOT fetch the result of the previous instruction from memory but accesses it directly from the Common Data Bus via bypassing (remember the 5-stage pipeline).

My argument was that, without the compiler indicating how much data to copy or check, this is not possible (you have to go to memory to get the data, unless the compiler optimizes that away).


The "optimisation" required to convert [B] into [A] is so trivial (especially if t is not used anywhere else) that the compiler probably won't even call it an optimisation. It might be something that it just does as a matter of course, whether or not optimisations are explicitly enabled.

The only way to tell is to ask your compiler to generate an assembly source listing for both bits of code, then compare them.


Executive Summary
1. We are talking about nanoseconds. Light moves a whopping 30 cm in that time.
2. Sometimes, if you are really lucky, [A] is faster.


Side note: [B] may have a different meaning
If the return type of foo is not int but an object that has implicit conversions to both int and bool, different code paths are executed. One might contain a Sleep.

Assuming a function returning int:

Depends on the compiler
Even with the restriction of "no optimization", there is no guarantee what the generated code will look like. B could be 10 times faster and the compiler would still be compliant (and you most likely wouldn't notice).

Depends on the hardware
Depending on your architecture, there might not even be a difference for the generated code, no matter how much your compiler tries.

Assuming a modern compiler on a modern x86 / x64 architecture:

On typical compilers, the difference is at most minuscule
Assuming a compiler that stores t in a stack variable, the two extra stack accesses typically cost 2 clock cycles (less than a nanosecond on my CPU). That is negligible compared to the "surrounding cost": the call to foo, the cost of foo itself, and the branch. An unoptimized call with a full stack frame can easily cost you 20-200 cycles depending on the platform.

For comparison: the cycle cost of a single memory access that is not in the 1st level cache (roughly: 100 cycles from 2nd level, 1000 from main memory, hundreds of thousands from disk).

...or even nonexistent
Even if your compiler isn't optimizing, your CPU might. Due to pairing / microcode generation, the cycle cost may actually be identical.


For the record, gcc, when compiling with optimization specifically disabled (-O0), produces different code for the two inputs (in my case, the body of foo was return rand(); so that the result would not be determined at compile time).

Without temporary variable t:

        movl    $0, %eax
        call    foo
        testl   %eax, %eax
        je      .L4
        /* inside of if block */
.L4:
        /* rest of main() */

Here, the return value of foo is stored in the EAX register, and the register is tested against itself to see if it is 0, and if so, it jumps over the body of the if block.

With temporary variable t:

        movl    $0, %eax
        call    foo
        movl    %eax, -4(%rbp)
        cmpl    $0, -4(%rbp)
        je      .L4
        /* inside of if block */
.L4:
        /* rest of main() */

Here, the return value of foo is stored in the EAX register, then written to a slot on the stack. Then the value at that stack location is compared to the literal 0, and if they are equal, it jumps over the body of the if block.

And so if we assume further that the processor is not doing any "optimizations" when it generates the microcode for this, then the version without the temporary should be a few clock cycles faster. It's not going to be substantially faster, because even though the version with a temporary involves a stack store, the stack value is almost certainly still going to be in the processor's L1 cache when the comparison instruction executes immediately afterwards, so there's no round trip to RAM.

Of course the code becomes identical as soon as you turn on any optimization level, even -O1, and who compiles anything that is so critical that they care about a handful of clock cycles with all optimizations off?

Edit: With regard to your further information about your hardware engineer friend, I can't see how accessing a value in the L1 cache would ever be faster than accessing a register directly. I could see it being just about as fast if the value never even leaves the pipeline, but I can't see it being faster, especially since it still has to execute the movl instruction in addition to the comparison. But show him the assembly code above and ask what he thinks; it will be more productive than trying to discuss the problem in terms of C.


They are likely both going to be the same. That int will be stored into a register in either case.


It really depends on how the compiler is built. But I think in most cases, A will be faster. Here's why:

In B, the compiler might not bother finding out whether t is ever used again, so it will be forced to preserve the value after the if statement. And that could mean pushing it onto the stack.


A will likely be just a tiny bit faster because it does not do a variable assignment. The difference we're talking about is way too small to measure.

