开发者

how to work with 128 bits C variable and xmm 128 bits asm?

in gcc, i want to do a 128 bits xor with 2 C variables, via asm code: how?

asm (
    "movdqa %1, %%xmm1;"
    "movdqa %0, %%xmm0;"
     "pxor %%xmm1,%%xmm0;"
     "movdqa %%xmm0, %0;"

    :"=x"(buff) /* output开发者_Python百科 operand */
    :"x"(bu), "x"(buff)
    :"%xmm0","%xmm1"
    );

but i have a Segmentation fault error; this is the objdump output:

movq   -0x80(%rbp),%xmm2

movq   -0x88(%rbp),%xmm3

movdqa %xmm2,%xmm1

movdqa %xmm2,%xmm0

pxor   %xmm1,%xmm0

movdqa %xmm0,%xmm2

movq   %xmm2,-0x78(%rbp)


You would see segfault issues if the variables aren't 16-byte aligned. The CPU can't MOVDQA to/from unaligned memory addresses, and would generate a processor-level "GP exception", prompting the OS to segfault your app.

C variables you declare (stack, global) or allocate on the heap aren't generally aligned to a 16 byte boundary, though occasionally you may get an aligned one by chance. You could direct the compiler to ensure proper alignment by using the __m128 or __m128i data types. Each of those declares a properly-aligned 128 bit value.

Further, reading the objdump, it looks like the compiler wrapped the asm sequence with code to copy the operands from the stack to the xmm2 and xmm3 registers using the MOVQ instruction, only to have your asm code then copy the values to xmm0 and xmm1. After xor-ing into xmm0, the wrapper copies the result to xmm2 only to then copy it back to the stack. Overall, not terribly efficient. MOVQ copies 8 bytes at a time, and expects (under some circumstances), an 8-byte aligned address. Getting an unaligned address, it could fail just like MOVDQA. The wrapper code, however, adds an aligned offset (-0x80, -0x88, and later -0x78) to the BP register, which may or may not contain an aligned value. Overall, there's no guaranty of alignment in the generated code.

The following ensures the arguments and result are stored in correctly aligned memory locations, and seems to work fine:

#include <stdio.h>
#include <emmintrin.h>

void print128(__m128i value) {
    int64_t *v64 = (int64_t*) &value;
    printf("%.16llx %.16llx\n", v64[1], v64[0]);
}

void main() {
    __m128i a = _mm_setr_epi32(0x00ffff00, 0x00ffff00, 0x00ffff00, 0x10ffff00), /* low dword first! */
            b = _mm_setr_epi32(0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff),
            x;

    asm (
        "movdqa %1, %%xmm0;"      /* xmm0 <- a */
        "movdqa %2, %%xmm1;"      /* xmm1 <- b */
        "pxor %%xmm1, %%xmm0;"    /* xmm0 <- xmm0 xor xmm1 */
        "movdqa %%xmm0, %0;"      /* x <- xmm0 */

        :"=x"(x)          /* output operand, %0 */
        :"x"(a), "x"(b)   /* input operands, %1, %2 */
        :"%xmm0","%xmm1"  /* clobbered registers */
    );

    /* printf the arguments and result as 2 64-bit hex values */
    print128(a);
    print128(b);
    print128(x);
}

compile with (gcc, ubuntu 32 bit)

gcc -msse2 -o app app.c

output:

10ffff0000ffff00 00ffff0000ffff00
0000ffff0000ffff 0000ffff0000ffff
10ff00ff00ff00ff 00ff00ff00ff00ff

In the code above, _mm_setr_epi32 is used to initialize a and b with 128 bit values, as the compiler may not support 128 integer literals.

print128 writes out the hexadecimal representation of a 128 bit integer, as printf may not be able to do so.


The following is shorter and avoids some of the duplicate copying. The compiler adds its hidden wrapping movdqa's to make pxor %2,%0 magically work without you having to load the registers on your own:

#include <stdio.h>
#include <emmintrin.h>

void print128(__m128i value) {
    int64_t *px = (int64_t*) &value;
    printf("%.16llx %.16llx\n", px[1], px[0]);
}

void main() {
    __m128i a = _mm_setr_epi32(0x00ffff00, 0x00ffff00, 0x00ffff00, 0x10ffff00),
            b = _mm_setr_epi32(0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff);

    asm (
        "pxor %2, %0;"    /* a <- b xor a  */

        :"=x"(a)          /* output operand, %0 */
        :"x"(a), "x"(b)   /* input operands, %1, %2 */
        );

    print128(a);
}

compile as before:

gcc -msse2 -o app app.c

output:

10ff00ff00ff00ff 00ff00ff00ff00ff

Alternatively, if you'd like to avoid the inline assembly, you could use the SSE intrinsics instead (PDF). Those are inlined functions/macros that encapsulate MMX/SSE instructions with a C-like syntax. _mm_xor_si128 reduces your task to a single call:

#include <stdio.h>
#include <emmintrin.h>

void print128(__m128i value) {
    int64_t *v64 = (int64_t*) &value;
    printf("%.16llx %.16llx\n", v64[1], v64[0]);
}

void main()
{
    __m128i x = _mm_xor_si128(
        _mm_setr_epi32(0x00ffff00, 0x00ffff00, 0x00ffff00, 0x10ffff00), /* low dword first !*/
        _mm_setr_epi32(0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff));

    print128(x);
}

compile:

gcc -msse2 -o app app.c

output:

10ff00ff00ff00ff 00ff00ff00ff00ff


Umm, why not use the __builtin_ia32_pxor intrinsic?


Under late model gcc (mine is 4.5.5) the option -O2 or above implies -fstrict-aliasing which causes the code given above to complain:

supersuds.cpp:31: warning: dereferencing pointer ‘v64’ does break strict-aliasing rules
supersuds.cpp:30: note: initialized from here

This can be remedied by supplying additional type attributes as follows:

typedef int64_t __attribute__((__may_alias__)) alias_int64_t; 
void print128(__m128i value) {
    alias_int64_t *v64 = (int64_t*)  &value;
    printf("%.16lx %.16lx\n", v64[1], v64[0]); 
}

I first tried the attribute directly without the typedef. It was accepted, but I still got the warning. The typedef seems to be a necessary piece of the magic.

BTW, this is my second answer here and I still hate the fact that I can't yet tell where I'm permitted to edit, so I wasn't able to post this where it belonged.

And one more thing, under AMD64, the %llx format specifier needs to be changed to %lx.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜