memcpy() performance - Ubuntu x86_64
I am observing some weird behavior that I am unable to explain. Here are the details:
#include <sched.h>
#include <sys/resource.h>
#include <time.h>
#include <iostream>

void memcpy_test() {
    int size = 32 * 4;
    char* src = new char[size];
    char* dest = new char[size];
    unsigned int num_cpy = 1024 * 1024 * 16;
    struct timespec start_time__, end_time__;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start_time__);
    for (unsigned int i = 0; i < num_cpy; ++i) {
        __builtin_memcpy(dest, src, size);
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end_time__);
    // Average time per copy in nanoseconds; tv_sec must be included,
    // since the loop runs for a noticeable fraction of a second.
    double elapsed_ns = (end_time__.tv_sec - start_time__.tv_sec) * 1e9
                      + (end_time__.tv_nsec - start_time__.tv_nsec);
    std::cout << "time = " << elapsed_ns / num_cpy << std::endl;
    delete[] src;
    delete[] dest;
}
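For reference, here is a minimal driver for the function above (my own addition; the file name and compile flags in the comments are assumptions, and -lrt is needed because clock_gettime lives in librt on glibc of this era):

// Hypothetical driver, not part of the original question.
// Build without -march=native:  g++ -O2 memcpy_test.cpp -lrt -o memcpy_plain
// Build with -march=native:     g++ -O2 -march=native memcpy_test.cpp -lrt -o memcpy_native
int main() {
    memcpy_test();
    return 0;
}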
When I specify -march=native in the compiler options, the generated binary runs 2.7 times slower. Why is that? If anything, I would expect -march=native to produce optimized code. Are there other functions that could show this type of behavior?
EDIT 1: Another interesting point is that if size > 32*4, there is no difference between the run times of the two binaries.
EDIT 2: Here is the detailed performance analysis (__builtin_memcpy()):
size = 32 * 4: without -march=native - 7.5 ns, with -march=native - 19.3 ns
size = 32 * 8: without -march=native - 26.3 ns, with -march=native - 26.5 ns
EDIT 3: This observation does not change even if I allocate int64_t/int32_t buffers instead of char.
EDIT 4: size = 8192: without -march=native ~ 2750 ns, with -march=native ~ 2750 ns. (Earlier there was an error in reporting this number; it was wrongly written as 26.5, now it is correct.)
I have run these many times, and the numbers are consistent across runs.
I have replicated your findings with g++ (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2 and Linux 2.6.38-10-generic #46-Ubuntu x86_64 on my Core 2 Duo. Results will probably vary depending on your compiler version and CPU. I get ~26 and ~9.
When I specify -march=native in the compiler options, the generated binary runs 2.7 times slower. Why is that?
Because the -march=native version gets compiled into (found using objdump -D; you could also use gcc -S -fverbose-asm):
rep movsq %ds:(%rsi),%es:(%rdi) ; where rcx = 128 / 8
And the version without gets compiled into 16 load/store pairs like:
mov 0x20(%rbp),%rdx
mov %rdx,0x20(%rbx)
Which apparently is faster on our computers.
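For illustration, here is a rough C++ rendering of that sequence (my own sketch, not from the answer; the memcpy-based type punning just keeps the 8-byte loads well-defined, and at -O2 the optimizer typically reduces each pair to one load and one store, though exact codegen varies by compiler version):

#include <cstdint>
#include <cstring>

// Approximation of the sixteen 8-byte load/store pairs that GCC
// emits without -march=native for a 128-byte copy.
inline void copy128_pairs(char* dest, const char* src) {
    for (int i = 0; i < 128; i += 8) {
        std::uint64_t reg;
        std::memcpy(&reg, src + i, sizeof reg);   // 8-byte load into a register
        std::memcpy(dest + i, &reg, sizeof reg);  // 8-byte store
    }
}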
If anything, I would expect -march=native to produce optimized code.
In this case it turned out to be a pessimization to favor rep movsq over a series of moves, but that might not always be the case. The first version is shorter, which might be better in some (most?) cases. Or it could be a bug in the optimizer.
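If you want to time the rep movsq form in isolation, here is a sketch using GCC inline assembly on x86_64 (my own construction, not taken from the compiler output above):

#include <cstddef>

// Copies n_qwords * 8 bytes with rep movsq, the instruction the
// -march=native binary uses. The instruction advances rdi and rsi
// and decrements rcx, hence the "+" read-write constraints.
inline void copy_rep_movsq(void* dest, const void* src, std::size_t n_qwords) {
    asm volatile("rep movsq"
                 : "+D"(dest), "+S"(src), "+c"(n_qwords)
                 :
                 : "memory");
}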
Are there other functions that could show this type of behavior?
Any function for which the generated code differs when you specify -march=native. Suspects include functions implemented as macros or as static functions in headers, and anything with a name beginning with __builtin. Possibly also (floating-point) math functions.
Another interesting point is that if size > 32*4, there is no difference between the run times of the two binaries.
This is because they both then compile to rep movsq; 128 is probably the largest size for which GCC will generate a series of load/stores (it would be interesting to see whether this also holds on other platforms). BTW, when the compiler doesn't know the size at compile time (e.g. int size = atoi(argv[1]);), it simply turns into a call to memcpy, with or without the switch.
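To see that call for yourself, here is a small sketch (the function name is mine) where the size is only known at run time:

#include <cstring>

// `size` is not a compile-time constant, so GCC emits `call memcpy`
// here, with or without -march=native.
void copy_runtime_size(char* dest, const char* src, int size) {
    std::memcpy(dest, src, size);
}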
It's quite a well-known issue (and a really old one).
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
Look at one of the bottom comments in the bug report:
"Just FYI: mesa is now defaulting to -fno-builtin-memcmp to workaround this problem"
Looks like glibc's memcpy is far better than the builtin...
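Here is a minimal sketch of that workaround adapted for memcpy (the flag is from GCC's standard -fno-builtin-FUNCTION family; the file name is hypothetical):

#include <cstring>

// Compiled with:  g++ -O2 -fno-builtin-memcpy copy.cpp
// the call below goes to glibc's memcpy instead of being expanded
// inline by GCC (as rep movsq or a series of load/store pairs).
void copy_via_glibc(char* dest, const char* src, std::size_t n) {
    std::memcpy(dest, src, n);
}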