VS2010 memcpy fast magic

Posted: 2023-04-03 10:46

Hello (and sorry for my bad English). For portability reasons I need to write my own memory copy function, but my best attempt is 40-70% slower than Visual Studio 2010's standard memcpy, and I don't know why. Below is my main copy loop, which copies the data in 128-byte chunks (the rest of the function performs a bounded number of operations and can be treated as O(1)):

MOVDQA XMM0,DQWORD PTR DS:[ESI]
MOVDQA XMM1,DQWORD PTR DS:[ESI+10]
MOVDQA XMM2,DQWORD PTR DS:[ESI+20]
MOVDQA XMM3,DQWORD PTR DS:[ESI+30]
MOVDQA DQWORD PTR DS:[EDI],XMM0
MOVDQA DQWORD PTR DS:[EDI+10],XMM1
MOVDQA DQWORD PTR DS:[EDI+20],XMM2
MOVDQA DQWORD PTR DS:[EDI+30],XMM3
MOVDQA XMM4,DQWORD PTR DS:[ESI+40]
MOVDQA XMM5,DQWORD PTR DS:[ESI+50]
MOVDQA XMM6,DQWORD PTR DS:[ESI+60]
MOVDQA XMM7,DQWORD PTR DS:[ESI+70]
MOVDQA DQWORD PTR DS:[EDI+40],XMM4
MOVDQA DQWORD PTR DS:[EDI+50],XMM5
MOVDQA DQWORD PTR DS:[EDI+60],XMM6
MOVDQA DQWORD PTR DS:[EDI+70],XMM7
LEA ESI,[ESI+80]
LEA EDI,[EDI+80]
DEC ECX
JNE SHORT 002410B9

And this is what I found in the standard memcpy:

MOVDQA XMM0,DQWORD PTR DS:[ESI]
MOVDQA XMM1,DQWORD PTR DS:[ESI+10]
MOVDQA XMM2,DQWORD PTR DS:[ESI+20]
MOVDQA XMM3,DQWORD PTR DS:[ESI+30]
MOVDQA DQWORD PTR DS:[EDI],XMM0
MOVDQA DQWORD PTR DS:[EDI+10],XMM1
MOVDQA DQWORD PTR DS:[EDI+20],XMM2
MOVDQA DQWORD PTR DS:[EDI+30],XMM3
MOVDQA XMM4,DQWORD PTR DS:[ESI+40]
MOVDQA XMM5,DQWORD PTR DS:[ESI+50]
MOVDQA XMM6,DQWORD PTR DS:[ESI+60]
MOVDQA XMM7,DQWORD PTR DS:[ESI+70]
MOVDQA DQWORD PTR DS:[EDI+40],XMM4
MOVDQA DQWORD PTR DS:[EDI+50],XMM5
MOVDQA DQWORD PTR DS:[EDI+60],XMM6
MOVDQA DQWORD PTR DS:[EDI+70],XMM7
LEA ESI,[ESI+80]
LEA EDI,[EDI+80]
DEC EDX
JNE SHORT 6B150A72

As you can see, this loop is almost identical to my own, but my function gets slower and slower (compared to the standard memcpy) as the amount of data to copy increases.

Can anyone tell me where my mistake is?

P.S. Here is my code from main():

#include <windows.h>   // LARGE_INTEGER, QueryPerformanceCounter
#include <cstdio>
#include <cstdlib>
#include <cstring>

// TestMemcpy_sse is my own copy function (defined elsewhere).

int main(void){

    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);

    int* mas = new int[10000000];
    for(int i = 0; i < 10000000; ++i)
        mas[i] = i;

    // Accumulated ticks for each copy routine
    LARGE_INTEGER mmcpy = { 0 };
    LARGE_INTEGER mmsse = { 0 };

    for(int i = 0; i < 10000; ++i)
    {
        // My function is always timed first...
        LARGE_INTEGER beforeMemcpy_sse, afterMemcpy_sse;
        QueryPerformanceCounter(&beforeMemcpy_sse);
        TestMemcpy_sse(mas, (char*)mas + 300000, 4400000);
        QueryPerformanceCounter(&afterMemcpy_sse);

        // ...and the standard memcpy second, on the same region
        LARGE_INTEGER beforeMemcpy, afterMemcpy;
        QueryPerformanceCounter(&beforeMemcpy);
        memcpy(mas, (char*)mas + 300000, 4400000);
        QueryPerformanceCounter(&afterMemcpy);

        mmcpy.QuadPart += afterMemcpy.QuadPart - beforeMemcpy.QuadPart;
        mmsse.QuadPart += afterMemcpy_sse.QuadPart - beforeMemcpy_sse.QuadPart;
    }

    delete[] mas;

    /*printf("Memcpy Time: %f\n", (afterMemcpy.QuadPart - beforeMemcpy.QuadPart) / (float)freq.QuadPart);
    printf("SSE Memcpy Time: %f\n\n", (afterMemcpy_sse.QuadPart - beforeMemcpy_sse.QuadPart) / (float)freq.QuadPart);*/

    // Average time per call, in seconds
    printf("Memcpy Time: %f\n", mmcpy.QuadPart / ((float)freq.QuadPart * 10000));
    printf("SSE Memcpy Time: %f\n\n", mmsse.QuadPart / ((float)freq.QuadPart * 10000));

    system("pause");
    return 0;
}

It's because the second memcpy is accessing cache-warmed data (warmed by the first copy). You copy within a region of about 5 MB, and then you copy within it again; your L3 cache is likely 6 MB to 12 MB. Try switching the order of the copies and see what results you get. :-)
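The order swap suggested here could be checked with something like the sketch below. It is only an illustration under a few assumptions: `custom_copy` is a hypothetical stand-in for the asker's TestMemcpy_sse (which is not shown in full), `std::chrono::steady_clock` is used in place of QueryPerformanceCounter so the code is not tied to Windows, and separate source and destination buffers are used because memcpy requires non-overlapping regions.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

using CopyFn = void (*)(void*, const void*, std::size_t);

// Hypothetical stand-in for the asker's TestMemcpy_sse.
void custom_copy(void* dst, const void* src, std::size_t n) {
    std::memmove(dst, src, n);
}

// Thin wrapper so the standard memcpy has the same signature as CopyFn.
void std_copy(void* dst, const void* src, std::size_t n) {
    std::memcpy(dst, src, n);
}

// Time a single call of `fn`, in seconds.
double time_once(CopyFn fn, void* dst, const void* src, std::size_t n) {
    auto t0 = std::chrono::steady_clock::now();
    fn(dst, src, n);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

struct Totals {
    double first = 0.0;   // total seconds for the routine called first
    double second = 0.0;  // total seconds for the routine called second
};

// In every iteration call `first`, then `second`, on the same buffers.
Totals run_pair(CopyFn first, CopyFn second,
                std::vector<char>& dst, const std::vector<char>& src,
                int iterations) {
    assert(dst.size() >= src.size());
    Totals t;
    for (int i = 0; i < iterations; ++i) {
        t.first  += time_once(first,  dst.data(), src.data(), src.size());
        t.second += time_once(second, dst.data(), src.data(), src.size());
    }
    return t;
}

// Run the comparison in both orders and print the totals side by side.
void compare_both_orders(std::size_t bytes, int iterations) {
    std::vector<char> src(bytes, 1), dst(bytes, 0);
    Totals a = run_pair(custom_copy, std_copy, dst, src, iterations);
    Totals b = run_pair(std_copy, custom_copy, dst, src, iterations);
    std::printf("custom first: custom %f s, memcpy %f s\n", a.first, a.second);
    std::printf("memcpy first: memcpy %f s, custom %f s\n", b.first, b.second);
}
```

Calling `compare_both_orders(4400000, 1000)` would mirror the copy size in the question. If whichever routine runs second tends to look faster in both orders, that points to a cache effect rather than to a difference between the two loops.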

You might be seeing cache effects. Depending on the size of your cache, you could be copying a subset of the mas array with a cold cache in the first test (your own memcpy function) and then hitting a warm cache when you test the built-in memcpy.

Generally, when measuring the performance of code like this, you should average over many runs and be careful to avoid cache effects: either test with a data set much larger than your cache, or use one that is deliberately small enough to fit in cache and warm the cache before testing.
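A minimal sketch of that measurement approach is shown below. The function name `average_copy_seconds`, its parameters, and the `std_copy` wrapper are hypothetical names chosen for illustration, and `std::chrono::steady_clock` is used as a portable substitute for QueryPerformanceCounter. The routine runs a few untimed warm-up copies first and then averages over the timed runs.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

using CopyFn = void (*)(void*, const void*, std::size_t);

// Wrapper giving the standard memcpy the CopyFn signature.
void std_copy(void* dst, const void* src, std::size_t n) {
    std::memcpy(dst, src, n);
}

// Average seconds per copy of `bytes` bytes.
// `warmup_runs` untimed copies are done first so that, for small buffers,
// every timed run starts from the same (warm) cache state.
double average_copy_seconds(CopyFn copy, std::size_t bytes,
                            int warmup_runs, int timed_runs) {
    assert(timed_runs > 0);
    std::vector<char> src(bytes, 0x5A), dst(bytes, 0);

    for (int i = 0; i < warmup_runs; ++i)
        copy(dst.data(), src.data(), bytes);

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < timed_runs; ++i)
        copy(dst.data(), src.data(), bytes);
    auto t1 = std::chrono::steady_clock::now();

    // Sanity check: the destination must now match the source.
    assert(std::memcmp(dst.data(), src.data(), bytes) == 0);

    return std::chrono::duration<double>(t1 - t0).count() / timed_runs;
}
```

To apply the advice, one might call this once with a buffer that fits comfortably in cache (for example 64 KB) and once with a buffer far larger than the last-level cache (for example 256 MB), passing each copy routine in turn. Both routines then see the same cache conditions, regardless of which is measured first.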
