memcpy() performance - Ubuntu x86_64

I am observing some weird behavior that I am unable to explain. Here are the details:

#include <time.h>
#include <iostream>

void memcpy_test() {
    const int size = 32*4;
    char* src = new char[size];
    char* dest = new char[size];
    unsigned int num_cpy = 1024*1024*16;
    struct timespec start_time__, end_time__;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start_time__);
    for(unsigned int i=0; i < num_cpy; ++i) {
        __builtin_memcpy(dest, src, size);
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end_time__);
    // average cost per copy in nanoseconds (tv_sec included so longer
    // runs don't wrap)
    double elapsed_ns = (end_time__.tv_sec - start_time__.tv_sec) * 1e9
                      + (end_time__.tv_nsec - start_time__.tv_nsec);
    std::cout << "time = " << elapsed_ns / num_cpy << std::endl;
    delete [] src;
    delete [] dest;
}

int main() {
    memcpy_test();
}

When I specify -march=native in the compiler options, the generated binary runs 2.7 times slower. Why is that? If anything, I would expect -march=native to produce optimized code. Are there other functions that could show this type of behavior?

EDIT 1: Another interesting point is that if size > 32*4, there is no difference between the run times of the binaries thus generated.

EDIT 2: Following is a detailed performance analysis of __builtin_memcpy():

size = 32 * 4: without -march=native - 7.5 ns, with -march=native - 19.3 ns

size = 32 * 8: without -march=native - 26.3 ns, with -march=native - 26.5 ns

EDIT 3 :

This observation does not change even if I allocate the buffers as int64_t/int32_t instead of char.

EDIT 4 :

size = 8192: without -march=native ~ 2750 ns, with -march=native ~ 2750 ns. (Earlier there was an error in reporting this number; it was wrongly written as 26.5. It is now correct.)

I have run these many times, and the numbers are consistent across runs.


I have replicated your findings with g++ (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2 and Linux 2.6.38-10-generic #46-Ubuntu x86_64 on my Core 2 Duo. Results will probably vary depending on your compiler version and CPU. I get ~26 ns (with -march=native) and ~9 ns (without).

When I specify -march=native in the compiler options, the generated binary runs 2.7 times slower. Why is that?

Because the -march=native version gets compiled into the following (found using objdump -D; you could also use gcc -S -fverbose-asm):

    rep movsq %ds:(%rsi),%es:(%rdi) ; where rcx = 128 / 8

And the version without gets compiled into 16 load/store pairs like:

    mov    0x20(%rbp),%rdx
    mov    %rdx,0x20(%rbx)

This is apparently faster on our machines.
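
If you want to compare the two strategies directly without recompiling, here is a minimal sketch that reproduces the rep movsq variant by hand (my own code, assuming x86-64 GCC-style inline assembly and a length that is a multiple of 8 bytes):

    #include <cstddef>

    // Hand-rolled equivalent of what GCC emits with -march=native:
    // copies n bytes (n must be a multiple of 8) using `rep movsq`.
    static inline void copy_rep_movsq(void* dst, const void* src, std::size_t n) {
        std::size_t qwords = n / 8;                       // rcx = n / 8
        asm volatile("rep movsq"
                     : "+D"(dst), "+S"(src), "+c"(qwords) // rdi, rsi, rcx
                     :
                     : "memory");
    }

Timing this against __builtin_memcpy in the loop from the question should show roughly the same gap for size = 128.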

If anything, I would expect -march=native to produce optimized code.

In this case it turned out to be a pessimization to favor rep movsq over a series of moves, but that might not always be the case. The rep movsq version is shorter code, which might be better in some (most?) cases. Or it could be a bug in the optimizer.

Are there other functions that could show this type of behavior?

Any function for which the generated code differs when you specify -march=native. Suspects include functions implemented as macros or as static functions in headers, and anything whose name begins with __builtin. Possibly also (floating-point) math functions.
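
As a hedged illustration (the function name is mine, and whether the builtin kicks in depends on your GCC version and flags), a fixed-size memcmp is exactly the kind of call that may be expanded inline rather than left as a library call:

    #include <cstring>

    // With builtins enabled (the default), GCC may expand this constant-size
    // memcmp inline; with -fno-builtin-memcmp it stays a call into glibc.
    bool blocks_equal(const char* a, const char* b) {
        return std::memcmp(a, b, 16) == 0;
    }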

Another interesting point is that if size > 32*4, there is no difference between the run times of the binaries thus generated

This is because both versions then compile to rep movsq; 128 is probably the largest size for which GCC will generate a series of loads/stores (it would be interesting to see whether this holds on other platforms as well). By the way, when the compiler doesn't know the size at compile time (e.g. int size = atoi(argv[1]);), it simply turns into a call to memcpy, with or without the switch.
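
A minimal sketch of that run-time-size case (the function name is mine): since the length is not a compile-time constant, GCC cannot pick a fixed-size expansion and falls back to calling the library memcpy:

    #include <cstddef>
    #include <cstring>

    // `n` is only known at run time, so GCC emits a plain `call memcpy`
    // here, with or without -march=native.
    void copy_runtime(char* dst, const char* src, std::size_t n) {
        std::memcpy(dst, src, n);
    }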


This is quite a well-known issue (and a really old one).

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052

Look at one of the comments near the bottom of the bug report:

"Just FYI: mesa is now defaulting to -fno-builtin-memcmp to workaround this problem"

Looks like glibc's memcpy is far better than the builtin one...
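
A minimal sketch of that workaround idea (my own example, not from the bug report): besides passing -fno-builtin-memcpy on the command line, you can defeat the builtin expansion at a single call site by going through a volatile function pointer, which forces a real call into glibc:

    #include <cstddef>
    #include <cstring>

    // GCC cannot expand the copy inline through a volatile function
    // pointer, so this always reaches glibc's memcpy.
    void* (*volatile libc_memcpy)(void*, const void*, std::size_t) = std::memcpy;

    void copy_via_libc(char* dst, const char* src, std::size_t n) {
        libc_memcpy(dst, src, n);
    }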
