SSE2 instructions not working in inline assembly with C++

2023-01-25 10:18 Q&A author:

I have this function, which uses SSE2 to add some values together. It's supposed to add lhs and rhs together and store the result back into lhs:

template<typename T>
void simdAdd(T *lhs,T *rhs)
{
    asm volatile("movups %0,%%xmm0"::"m"(lhs));
    asm volatile("movups %0,%%xmm1"::"m"(rhs));

    switch(sizeof(T))
    {
        case sizeof(uint8_t):
        asm volatile("paddb %%xmm0,%%xmm1":);
        break;

        case sizeof(uint16_t):
        asm volatile("paddw %%xmm0,%%xmm1":);
        break;

        case sizeof(float):
        asm volatile("addps %%xmm0,%%xmm1":);
        break;

        case sizeof(double):
        asm volatile("addpd %%xmm0,%%xmm1":);
        break;

        default:
        std::cout<<"error"<<std::endl;
        break;
    }

    asm volatile("movups %%xmm0,%0":"=m"(lhs));
}

My code uses the function like this:

float *values=new float[4];
float *values2=new float[4];

values[0]=1.0f;
values[1]=2.0f;
values[2]=3.0f;
values[3]=4.0f;

values2[0]=1.0f;
values2[1]=2.0f;
values2[2]=3.0f;
values2[3]=4.0f;

simdAdd(values,values2);
for(uint32_t count=0;count<4;count++) std::cout<<values[count]<<std::endl;

However, this isn't working: when the code runs, it outputs 1,2,3,4 instead of 2,4,6,8.

I've found that inline assembly support isn't reliable in most modern compilers (as in, the implementations are just plain buggy). You are generally better off using compiler intrinsics, which are declarations that look like C functions but actually compile to a specific opcode.

Intrinsics let you specify an exact sequence of opcodes, but leave the register coloring to the compiler. That is much more reliable than trying to move data between C variables and asm registers, which is where inline assemblers have always fallen down for me. It also lets the compiler schedule your instructions, which can give better performance if it works around pipeline hazards. I.e., in this case you could do:

void simdAdd(float *lhs,float *rhs)
{
   _mm_storeu_ps( lhs, _mm_add_ps(_mm_loadu_ps( lhs ), _mm_loadu_ps( rhs )) );
}
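If you have more than four floats, the same intrinsics can be wrapped in a loop. This is only a rough sketch: addArrays and its n parameter are names I made up for illustration, and it assumes an x86 target with SSE enabled.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps

// Hypothetical helper (not part of the original answer):
// computes lhs[i] += rhs[i] for every i in [0, n).
void addArrays(float *lhs, const float *rhs, std::size_t n)
{
    std::size_t i = 0;

    // Main loop: four floats per iteration, using unaligned loads and stores
    // so the arrays do not need 16-byte alignment.
    for (; i + 4 <= n; i += 4) {
        __m128 a = _mm_loadu_ps(lhs + i);
        __m128 b = _mm_loadu_ps(rhs + i);
        _mm_storeu_ps(lhs + i, _mm_add_ps(a, b));
    }

    // Scalar tail for the last n % 4 elements.
    for (; i < n; ++i) {
        lhs[i] += rhs[i];
    }
}
```

The scalar tail matters: without it, a length that is not a multiple of four would either skip elements or read past the end of the arrays.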

In your case, anyway, you've two problems:

1. The terrible GCC inline assembly syntax, which blurs the difference between pointers and values. Use *lhs and *rhs instead of just lhs and rhs; apparently the "m" and "=m" constraints mean "implicitly use a pointer to this thing that I'm passing you instead of the thing itself."
2. GCC uses a source,destination operand order. The addps stores its result in the second operand, so you need to output xmm1, not xmm0.

I've put a fixed example on codepad (to avoid cluttering up this answer, and to demonstrate that it works).
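For reference, here is roughly what a corrected float-only version could look like. This is my own sketch, not the codepad example: simdAddAsm and the Vec4 struct are made-up names, and it assumes GCC or Clang on x86. It keeps everything in one asm statement (the compiler is free to reuse xmm registers between separate statements), tells the compiler that all 16 bytes are read and written, and lists the registers it clobbers.

```cpp
// Hypothetical float-only variant of the question's function.
// Computes lhs[i] += rhs[i] for i in 0..3 with GCC-style extended asm.
inline void simdAddAsm(float *lhs, const float *rhs)
{
    // A 16-byte view so the "m" operands describe all four floats,
    // not just the first one.
    struct Vec4 { float v[4]; };

    asm volatile(
        "movups %0, %%xmm0\n\t"        // xmm0 = lhs[0..3]
        "movups %1, %%xmm1\n\t"        // xmm1 = rhs[0..3]
        "addps  %%xmm0, %%xmm1\n\t"    // AT&T order: xmm1 = xmm1 + xmm0
        "movups %%xmm1, %0"            // store the sum (xmm1) back to lhs
        : "+m"(*reinterpret_cast<Vec4 *>(lhs))        // read and written
        : "m"(*reinterpret_cast<const Vec4 *>(rhs))   // read only
        : "xmm0", "xmm1");                            // registers we overwrite
}
```

Called with the question's two arrays of {1, 2, 3, 4}, this leaves {2, 4, 6, 8} in the first array.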

A couple of things I see wrong here. Firstly, your statements that load the XMM registers and store the values back to your variable are wrong.

asm volatile("movups %0,%%xmm0"::"m"(lhs));
asm volatile("movups %0,%%xmm1"::"m"(rhs));
...
asm volatile("movups %%xmm0,%0":"=m"(lhs));

Should read

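asm volatile("movups %0,%%xmm0"::"m"(*lhs));
asm volatile("movups %0,%%xmm1"::"m"(*rhs));
...
asm volatile("movups %%xmm1,%0":"=m"(*lhs));

Note the *'s. You were loading and adding the pointer values themselves, and then storing the result back into the temporary that was used to pass the pointer argument (which is consequently discarded, without anything being written to your arrays, when the function returns). The final store also has to read xmm1 rather than xmm0: in AT&T syntax the destination is the second operand, so addps %%xmm0,%%xmm1 leaves the sum in xmm1.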
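To see the pointer-versus-pointee difference in isolation, here is a small sketch. The function names are made up, and it assumes GCC or Clang on x86-64, where a pointer is 8 bytes. Both functions do the same movq from an "m" operand; the only difference is whether the operand is *p or p.

```cpp
#include <cstdint>

// "m"(*p): the memory operand is the object p points at,
// so this returns the stored value.
inline std::uint64_t loadPointee(const std::uint64_t *p)
{
    std::uint64_t out;
    asm("movq %1, %0" : "=r"(out) : "m"(*p));
    return out;
}

// "m"(p): the memory operand is the pointer variable itself,
// so this returns the address held in p, not the data behind it.
inline std::uint64_t loadPointerBits(const std::uint64_t *p)
{
    std::uint64_t out;
    asm("movq %1, %0" : "=r"(out) : "m"(p));
    return out;
}
```

The second function is what the original movups lines were effectively doing: reading the bytes of the pointer (plus whatever follows it in memory) instead of the floats.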

Even with these fixes, this is not a good technique in general. I had written my own example with asm statements, but it was flawed because I forgot to account for the unaligned nature of the parameters being passed in. It becomes very cumbersome with asm statements, and it is far easier and more readable with intrinsic functions. Just take care to use the correct data types:

template<typename T>
void simdAdd(T *lhs,T *rhs)
{
    switch(sizeof(T))
    {
        case sizeof(uint8_t):
        {
          __m128i lh128;
          lh128 = _mm_add_epi8( _mm_loadu_si128( (__m128i *)lhs ),
                                _mm_loadu_si128( (__m128i *)rhs ) );
          _mm_storeu_si128( (__m128i *)lhs, lh128 );
        }
        break;

        case sizeof(uint16_t):
        {
          __m128i lh128;
          lh128 = _mm_add_epi16( _mm_loadu_si128( (__m128i *)lhs ),
                                 _mm_loadu_si128( (__m128i *)rhs ) );
          _mm_storeu_si128( (__m128i *)lhs, lh128 );
        }
        break;

        case sizeof(float):
        {
          __m128 lh128;
          lh128 = _mm_add_ps( _mm_loadu_ps( (float *)lhs ),
                              _mm_loadu_ps( (float *)rhs ) );
          _mm_storeu_ps( (float *)lhs, lh128 );
        }
        break;

        case sizeof(double):
        {
          __m128d lh128;
          lh128 = _mm_add_pd( _mm_loadu_pd( (double *)lhs ),
                              _mm_loadu_pd( (double *)rhs ) );
          _mm_storeu_pd( (double *)lhs, lh128 );
        }
        break;

        default:
        std::cout<<"error"<<std::endl;
        break;
    }
}

Be aware that the size of a data type is not sufficient to tell you which type you were passed. Just because a template type has the same size as one of the basic types you are checking doesn't mean it is that type. That is why I force the casts in my example. This is generally unsafe unless you are certain the function will only ever be used with the types you have specified. For example, calling it with a float-sized integer (such as a 32-bit int) will silently give a wrong answer, because the integer bits are added as if they were floats, and the compiler won't be able to warn you about it.
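One way around the size-versus-type problem is to drop the template and overload on the exact element type, so the compiler picks the instruction and rejects anything you have not written an overload for. This is a sketch under my own assumptions: simdAddExact is a made-up name, and the int32_t overload is an addition that the code above does not have.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics (also pulls in the SSE ones)

// Hypothetical overload set: each adds one 16-byte vector, lhs += rhs.
inline void simdAddExact(float *lhs, const float *rhs)
{
    _mm_storeu_ps(lhs, _mm_add_ps(_mm_loadu_ps(lhs), _mm_loadu_ps(rhs)));
}

inline void simdAddExact(double *lhs, const double *rhs)
{
    _mm_storeu_pd(lhs, _mm_add_pd(_mm_loadu_pd(lhs), _mm_loadu_pd(rhs)));
}

inline void simdAddExact(std::uint8_t *lhs, const std::uint8_t *rhs)
{
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i *>(lhs));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i *>(rhs));
    _mm_storeu_si128(reinterpret_cast<__m128i *>(lhs), _mm_add_epi8(a, b));
}

inline void simdAddExact(std::uint16_t *lhs, const std::uint16_t *rhs)
{
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i *>(lhs));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i *>(rhs));
    _mm_storeu_si128(reinterpret_cast<__m128i *>(lhs), _mm_add_epi16(a, b));
}

// A float-sized integer now gets an integer add instead of addps.
inline void simdAddExact(std::int32_t *lhs, const std::int32_t *rhs)
{
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i *>(lhs));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i *>(rhs));
    _mm_storeu_si128(reinterpret_cast<__m128i *>(lhs), _mm_add_epi32(a, b));
}
```

Passing a pointer to a type with no matching overload (say, a pointer to a struct) fails at compile time instead of printing "error" at run time.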
