VS2010 memcpy fast magic

Hello (and sorry for my bad English). For portability reasons I need to write my own memory copy function, but my best attempt is 40-70% slower than the standard memcpy in Visual Studio 2010, and I don't know why. Below is my main copy loop, which copies the data in 128-byte chunks (the rest of the function performs a bounded number of operations and can be treated as O(1)):

MOVDQA XMM0,DQWORD PTR DS:[ESI]
MOVDQA XMM1,DQWORD PTR DS:[ESI+10]
MOVDQA XMM2,DQWORD PTR DS:[ESI+20]
MOVDQA XMM3,DQWORD PTR DS:[ESI+30]
MOVDQA DQWORD PTR DS:[EDI],XMM0
MOVDQA DQWORD PTR DS:[EDI+10],XMM1
MOVDQA DQWORD PTR DS:[EDI+20],XMM2
MOVDQA DQWORD PTR DS:[EDI+30],XMM3
MOVDQA XMM4,DQWORD PTR DS:[ESI+40]
MOVDQA XMM5,DQWORD PTR DS:[ESI+50]
MOVDQA XMM6,DQWORD PTR DS:[ESI+60]
MOVDQA XMM7,DQWORD PTR DS:[ESI+70]
MOVDQA DQWORD PTR DS:[EDI+40],XMM4
MOVDQA DQWORD PTR DS:[EDI+50],XMM5
MOVDQA DQWORD PTR DS:[EDI+60],XMM6
MOVDQA DQWORD PTR DS:[EDI+70],XMM7
LEA ESI,[ESI+80]
LEA EDI,[EDI+80]
DEC ECX
JNE SHORT 002410B9

And here is what I found in the standard memcpy:

MOVDQA XMM0,DQWORD PTR DS:[ESI]
MOVDQA XMM1,DQWORD PTR DS:[ESI+10]
MOVDQA XMM2,DQWORD PTR DS:[ESI+20]
MOVDQA XMM3,DQWORD PTR DS:[ESI+30]
MOVDQA DQWORD PTR DS:[EDI],XMM0
MOVDQA DQWORD PTR DS:[EDI+10],XMM1
MOVDQA DQWORD PTR DS:[EDI+20],XMM2
MOVDQA DQWORD PTR DS:[EDI+30],XMM3
MOVDQA XMM4,DQWORD PTR DS:[ESI+40]
MOVDQA XMM5,DQWORD PTR DS:[ESI+50]
MOVDQA XMM6,DQWORD PTR DS:[ESI+60]
MOVDQA XMM7,DQWORD PTR DS:[ESI+70]
MOVDQA DQWORD PTR DS:[EDI+40],XMM4
MOVDQA DQWORD PTR DS:[EDI+50],XMM5
MOVDQA DQWORD PTR DS:[EDI+60],XMM6
MOVDQA DQWORD PTR DS:[EDI+70],XMM7
LEA ESI,[ESI+80]
LEA EDI,[EDI+80]
DEC EDX
JNE SHORT 6B150A72

As you can see, this loop is almost identical to mine, yet my function gets slower and slower relative to the standard memcpy as the amount of data to copy increases.

Can anyone tell me where my mistake is?

P.S. Here is my code from main():

void main(void){    

LARGE_INTEGER freq;
QueryPerformanceFrequency(&freq);

int* mas = new int[10000000];
for(int i = 0; i < 10000000; ++i)
    mas[i] = i;

LARGE_INTEGER mmcpy = { 0 };
LARGE_INTEGER mmsse = { 0 };

for(int i = 0; i < 10000; ++i)
{
    LARGE_INTEGER beforeMemcpy_sse, afterMemcpy_sse;
    QueryPerformanceCounter(&beforeMemcpy_sse);
    TestMemcpy_sse(mas, (char*)mas + 300000, 4400000);
    QueryPerformanceCounter(&afterMemcpy_sse);

    LARGE_INTEGER beforeMemcpy, afterMemcpy;
    QueryPerformanceCounter(&beforeMemcpy);
    memcpy(mas, (char*)mas + 300000, 4400000);
    QueryPerformanceCounter(&afterMemcpy);

    mmcpy.QuadPart += afterMemcpy.QuadPart - beforeMemcpy.QuadPart ;
    mmsse.QuadPart += afterMemcpy_sse.QuadPart - beforeMemcpy_sse.QuadPart;
}

delete[] mas;

/*printf("Memcpy Time: %f\n", (afterMemcpy.QuadPart - beforeMemcpy.QuadPart) / (float)freq.QuadPart);
printf("SSE Memcpy Time: %f\n\n", (afterMemcpy_sse.QuadPart - beforeMemcpy_sse.QuadPart) / (float)freq.QuadPart);*/

printf("Memcpy Time: %f\n", mmcpy.QuadPart / ((float)freq.QuadPart * 10000));
printf("SSE Memcpy Time: %f\n\n", mmsse.QuadPart / ((float)freq.QuadPart * 10000));

system("pause");

}


That's because the second memcpy accesses data that is already warm in the cache (warmed by the first copy). You copy within a region of about 5 MB and then copy within it again, and your L3 cache is likely 6-12 MB. Try switching the order of the copies and see what results you get. :-)
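One way to try the order swap without editing the loop by hand each time is to alternate which routine runs first on every iteration. The following is only a rough sketch under some assumptions: `my_copy` is a hypothetical stand-in for your SSE routine (it just calls memmove here), `time_once`, `Totals` and `compare_alternating` are names made up for illustration, and it uses std::chrono instead of QueryPerformanceCounter so it is not tied to Windows.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the hand-written routine under test.
inline void my_copy(void* dst, const void* src, std::size_t n) {
    std::memmove(dst, src, n);
}

// Thin wrapper so both routines share one function-pointer type.
inline void std_copy(void* dst, const void* src, std::size_t n) {
    std::memcpy(dst, src, n);
}

using CopyFn = void (*)(void*, const void*, std::size_t);

// Time a single call of fn, in seconds.
inline double time_once(CopyFn fn, void* dst, const void* src, std::size_t n) {
    auto t0 = std::chrono::steady_clock::now();
    fn(dst, src, n);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

struct Totals {
    double a;  // accumulated seconds for the first routine
    double b;  // accumulated seconds for the second routine
};

// Alternate which routine goes first, so neither one always runs
// right after the other has pulled the data into the cache.
inline Totals compare_alternating(CopyFn a, CopyFn b, void* dst,
                                  const void* src, std::size_t n,
                                  int iterations) {
    assert(iterations > 0);
    Totals t = {0.0, 0.0};
    for (int i = 0; i < iterations; ++i) {
        if (i % 2 == 0) {
            t.a += time_once(a, dst, src, n);
            t.b += time_once(b, dst, src, n);
        } else {
            t.b += time_once(b, dst, src, n);
            t.a += time_once(a, dst, src, n);
        }
    }
    return t;
}
```

You would call it as `compare_alternating(my_copy, std_copy, dst, src, size, 10000)` and divide each total by the iteration count. The sketch assumes separate source and destination buffers, since memcpy's behavior is undefined when the regions overlap.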


You might be seeing cache effects. Depending on the size of your cache, the first test (your own memcpy function) could be copying part of the mas array with a cold cache, while the test of the built-in memcpy then runs with a warm cache.

Generally, when measuring the performance of code like this, you should average over many runs and take care to avoid cache effects: either test with a data set much larger than your cache, or use one deliberately small enough to fit in cache and warm the cache before testing.
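To make that concrete, here is a minimal sketch of a measurement with an untimed warm-up pass followed by several timed runs. The function name `median_copy_seconds` is hypothetical, the median is my choice instead of a plain average (it is less sensitive to the occasional outlier), and std::chrono is assumed in place of QueryPerformanceCounter for portability.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Returns the median time, in seconds, of `runs` timed memcpy calls.
// One untimed copy is done first so that every timed run starts from
// the same cache state.
inline double median_copy_seconds(char* dst, const char* src,
                                  std::size_t n, int runs) {
    assert(runs > 0);

    std::memcpy(dst, src, n);  // warm-up pass, deliberately not timed

    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        std::memcpy(dst, src, n);
        auto t1 = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double>(t1 - t0).count());
    }

    // Sort the samples and take the middle one.
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}
```

If the buffers fit in cache, the warm-up pass means all timed runs see warm data; if they are much larger than the cache, every run is equally cold. Either way, measuring each routine with the same procedure compares them under the same conditions.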
