memcpy() performance - Ubuntu x86_64

I am observing some weird behavior that I am unable to explain. Here are the details:

#include <time.h>
#include <iostream>

void memcpy_test() {
    const int size = 32*4;
    char* src = new char[size];
    char* dest = new char[size];
    unsigned int num_cpy = 1024*1024*16;
    struct timespec start_time__, end_time__;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start_time__);
    for(unsigned int i=0; i < num_cpy; ++i) {
        __builtin_memcpy(dest, src, size);
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end_time__);
    // average cost per copy in nanoseconds (tv_sec included so longer
    // runs don't wrap)
    double elapsed_ns = (end_time__.tv_sec - start_time__.tv_sec) * 1e9
                      + (end_time__.tv_nsec - start_time__.tv_nsec);
    std::cout << "time = " << elapsed_ns / num_cpy << std::endl;
    delete [] src;
    delete [] dest;
}

int main() {
    memcpy_test();
}

When I specify -march=native in the compiler options, the generated binary runs 2.7 times slower. Why is that? If anything, I would expect -march=native to produce optimized code. Are there other functions that could show this type of behavior?

EDIT 1: Another interesting point is that if size > 32*4, there is no difference between the run times of the binaries thus generated.

EDIT 2: Following is a detailed performance analysis of __builtin_memcpy():

size = 32 * 4: without -march=native - 7.5 ns, with -march=native - 19.3 ns

size = 32 * 8: without -march=native - 26.3 ns, with -march=native - 26.5 ns

EDIT 3 :

This observation does not change even if I allocate the buffers as int64_t/int32_t instead of char.

EDIT 4 :

size = 8192: without -march=native ~ 2750 ns, with -march=native ~ 2750 ns. (Earlier there was an error in reporting this number; it was wrongly written as 26.5. It is now correct.)

I have run these many times, and the numbers are consistent across runs.


I have replicated your findings with g++ (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2 and Linux 2.6.38-10-generic #46-Ubuntu x86_64 on my Core 2 Duo. Results will probably vary depending on your compiler version and CPU. I get ~26 ns (with -march=native) and ~9 ns (without).

When I specify -march=native in the compiler options, the generated binary runs 2.7 times slower. Why is that?

Because the -march=native version gets compiled into the following (found using objdump -D; you could also use gcc -S -fverbose-asm):

    rep movsq %ds:(%rsi),%es:(%rdi) ; where rcx = 128 / 8

And the version without gets compiled into 16 load/store pairs like:

    mov    0x20(%rbp),%rdx
    mov    %rdx,0x20(%rbx)

This is apparently faster on our machines.
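
If you want to compare the two strategies directly without recompiling, here is a minimal sketch that reproduces the rep movsq variant by hand (my own code, assuming x86-64 GCC-style inline assembly and a length that is a multiple of 8 bytes):

    #include <cstddef>

    // Hand-rolled equivalent of what GCC emits with -march=native:
    // copies n bytes (n must be a multiple of 8) using `rep movsq`.
    static inline void copy_rep_movsq(void* dst, const void* src, std::size_t n) {
        std::size_t qwords = n / 8;                       // rcx = n / 8
        asm volatile("rep movsq"
                     : "+D"(dst), "+S"(src), "+c"(qwords) // rdi, rsi, rcx
                     :
                     : "memory");
    }

Timing this against __builtin_memcpy in the loop from the question should show roughly the same gap for size = 128.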

If anything, I would expect -march=native to produce optimized code.

In this case it turned out to be a pessimization to favor rep movsq over a series of moves, but that might not always be the case. The rep movsq version is shorter code, which might be better in some (most?) cases. Or it could be a bug in the optimizer.

Are there other functions that could show this type of behavior?

Any function for which the generated code differs when you specify -march=native. Suspects include functions implemented as macros or as static functions in headers, and anything whose name begins with __builtin. Possibly also (floating-point) math functions.
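
As a hedged illustration (the function name is mine, and whether the builtin kicks in depends on your GCC version and flags), a fixed-size memcmp is exactly the kind of call that may be expanded inline rather than left as a library call:

    #include <cstring>

    // With builtins enabled (the default), GCC may expand this constant-size
    // memcmp inline; with -fno-builtin-memcmp it stays a call into glibc.
    bool blocks_equal(const char* a, const char* b) {
        return std::memcmp(a, b, 16) == 0;
    }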

Another interesting point is that if size > 32*4, there is no difference between the run times of the binaries thus generated

This is because both versions then compile to rep movsq; 128 is probably the largest size for which GCC will generate a series of loads/stores (it would be interesting to see whether this holds on other platforms as well). By the way, when the compiler doesn't know the size at compile time (e.g. int size = atoi(argv[1]);), it simply turns into a call to memcpy, with or without the switch.
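
A minimal sketch of that run-time-size case (the function name is mine): since the length is not a compile-time constant, GCC cannot pick a fixed-size expansion and falls back to calling the library memcpy:

    #include <cstddef>
    #include <cstring>

    // `n` is only known at run time, so GCC emits a plain `call memcpy`
    // here, with or without -march=native.
    void copy_runtime(char* dst, const char* src, std::size_t n) {
        std::memcpy(dst, src, n);
    }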


This is quite a well-known issue (and a really old one).

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052

Look at one of the comments near the bottom of the bug report:

"Just FYI: mesa is now defaulting to -fno-builtin-memcmp to workaround this problem"

Looks like glibc's memcpy is far better than the builtin one...
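
A minimal sketch of that workaround idea (my own example, not from the bug report): besides passing -fno-builtin-memcpy on the command line, you can defeat the builtin expansion at a single call site by going through a volatile function pointer, which forces a real call into glibc:

    #include <cstddef>
    #include <cstring>

    // GCC cannot expand the copy inline through a volatile function
    // pointer, so this always reaches glibc's memcpy.
    void* (*volatile libc_memcpy)(void*, const void*, std::size_t) = std::memcpy;

    void copy_via_libc(char* dst, const char* src, std::size_t n) {
        libc_memcpy(dst, src, n);
    }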
