
Benchmarking SSE instructions

I'm benchmarking some SSE code (multiplying 4 floats by 4 floats) against traditional C code doing the same thing. I think my benchmark code must be incorrect in some way, because it seems to say that the non-SSE code is faster than the SSE code by a factor of 2-3.

Can someone tell me what is wrong with the benchmarking code below? And perhaps suggest another approach that accurately shows the speeds for both the SSE and non-SSE code. (A sketch of a finer-grained timer follows the listing.)

#include <time.h>
#include <string.h>
#include <stdio.h>

#define ITERATIONS 100000

#define MULT_FLOAT4(X, Y) ({ \
asm volatile ( \
    "movaps (%0), %%xmm0\n\t" \
    "mulps (%1), %%xmm0\n\t" \
    "movaps %%xmm0, (%1)" \
    :: "r" (X), "r" (Y) \
    : "xmm0", "memory"); })  /* declare the clobbered register and memory */

int main(void)
{
    int i, j;
    float a[4] __attribute__((aligned(16))) = { 10, 20, 30, 40 };
    time_t timer, sse_time, std_time;

    timer = time(NULL);
    for(j = 0; j < 5000; ++j)
        for(i = 0; i < ITERATIONS; ++i) {
            float b[4] __attribute__((aligned(16))) = { 0.1, 0.1, 0.1, 0.1 };

            MULT_FLOAT4(a, b);

        }
    sse_time = time(NULL) - timer;

    timer = time(NULL);
    for(j = 0; j < 5000; ++j)
        for(i = 0; i < ITERATIONS; ++i) {
            float b[4] __attribute__((aligned(16))) = { 0.1, 0.1, 0.1, 0.1 };

            b[0] *= a[0];
            b[1] *= a[1];
            b[2] *= a[2];
            b[3] *= a[3];

        }
    std_time = time(NULL) - timer;

    printf("sse_time %d\nstd_time %d\n", sse_time, std_time);

    return 0;
}
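As an aside on measurement: time(NULL) only has one-second resolution, so short runs round badly and small differences disappear. Here is a minimal sketch of a finer-grained harness, assuming POSIX clock_gettime is available (the helper name now_sec is mine; older glibc needs -lrt):

#define _POSIX_C_SOURCE 199309L  /* expose clock_gettime under strict -std */
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: wall-clock seconds at nanosecond resolution. */
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    double t0 = now_sec();
    /* ... run one kernel variant here ... */
    double t1 = now_sec();
    printf("elapsed %.6f s\n", t1 - t0);
    return 0;
}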


When you enable optimizations, the non-SSE code is eliminated completely (its results are never used), whereas the inline-asm SSE code remains, so that case is trivial. The more interesting part is when optimizations are turned off: the SSE code is still slower, even though the surrounding loop code is the same.
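To benchmark the scalar version under optimization at all, the stores to b have to be made observable. A minimal sketch, assuming GCC, using the empty-asm barrier idiom (a common trick, not something the original code used):

int main(void)
{
    int i;
    float a[4] __attribute__((aligned(16))) = { 10, 20, 30, 40 };

    for (i = 0; i < 100000; ++i) {
        float b[4] __attribute__((aligned(16))) = { 0.1f, 0.1f, 0.1f, 0.1f };

        b[0] *= a[0];
        b[1] *= a[1];
        b[2] *= a[2];
        b[3] *= a[3];

        /* Empty asm that claims to read b and clobber memory: the compiler
           can no longer prove the stores to b are dead, so the multiplies
           survive -O2 and can actually be timed. */
        asm volatile("" : : "r" (b) : "memory");
    }

    return 0;
}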

Non-SSE code of the innermost loop's body:

movl    $0x3dcccccd, %eax
movl    %eax, -80(%rbp)
movl    $0x3dcccccd, %eax
movl    %eax, -76(%rbp)
movl    $0x3dcccccd, %eax
movl    %eax, -72(%rbp)
movl    $0x3dcccccd, %eax
movl    %eax, -68(%rbp)
movss   -80(%rbp), %xmm1
movss   -48(%rbp), %xmm0
mulss   %xmm1, %xmm0
movss   %xmm0, -80(%rbp)
movss   -76(%rbp), %xmm1
movss   -44(%rbp), %xmm0
mulss   %xmm1, %xmm0
movss   %xmm0, -76(%rbp)
movss   -72(%rbp), %xmm1
movss   -40(%rbp), %xmm0
mulss   %xmm1, %xmm0
movss   %xmm0, -72(%rbp)
movss   -68(%rbp), %xmm1
movss   -36(%rbp), %xmm0
mulss   %xmm1, %xmm0
movss   %xmm0, -68(%rbp)

SSE code of the innermost loop's body:

movl    $0x3dcccccd, %eax
movl    %eax, -64(%rbp)
movl    $0x3dcccccd, %eax
movl    %eax, -60(%rbp)
movl    $0x3dcccccd, %eax
movl    %eax, -56(%rbp)
movl    $0x3dcccccd, %eax
movl    %eax, -52(%rbp)
leaq    -48(%rbp), %rax
leaq    -64(%rbp), %rdx
movaps (%rax), %xmm0
mulps (%rdx), %xmm0
movaps %xmm0, (%rdx)

I'm not sure about this, but here's my guess:

As you can see, the compiler stores the 4 float values with 4 separate 32-bit stores, which are then read back by a single 16-byte load. This causes a store-forwarding stall, which is costly when it happens; you can look it up in the Intel optimization manuals. It doesn't occur in the scalar version, and that makes the performance difference.

To make it faster you need to make sure this stall doesn't occur. If you are using a constant array of 4 floats, make it const and store the results in another aligned array; that way the compiler hopefully won't emit those unnecessary 4-byte movs before the load. Or, if you need to fill in the result array, do it with a single 16-byte store instruction. If you can't avoid the 4-byte movs, you need to do something else after the stores but before the load (for example, calculating something else).
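For example, with SSE intrinsics the compiler can keep the constant in a register and write the result with one aligned 16-byte store, so the stall never arises. A minimal sketch of the multiply itself, swapping intrinsics in for the original inline-asm macro:

#include <stdio.h>
#include <xmmintrin.h>  /* SSE intrinsics */

int main(void)
{
    float a[4] __attribute__((aligned(16))) = { 10, 20, 30, 40 };
    float out[4] __attribute__((aligned(16)));

    __m128 b  = _mm_set1_ps(0.1f);  /* constant lives in a register      */
    __m128 va = _mm_load_ps(a);     /* one aligned 16-byte load          */
    __m128 r  = _mm_mul_ps(va, b);  /* 4 multiplies at once              */
    _mm_store_ps(out, r);           /* one aligned 16-byte store         */

    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}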
