开发者

numpy calling sse2 via ctypes

In brief, I am trying to call into a shared library from python, more specifically, from numpy. The shared library is implemented in C using sse2 instructions. Enabling optimisation, i.e. building the library with -O2 or –O1, I am facing strange segfaults when calling into the shared library via ctypes. Disabling optimisation (-O0), everything works out as expected, as is the case when linking the library to a c-program directly (optimised or not). Attached you find a snipped which exhibits the delineated behaviour on my system. With optimisation enabled, gdb reports a segfault in __builtin_ia32_loadupd (__P) at emmintrin.h:113. The value of __P is reported as optimised out.

test.c:

#include <emmintrin.h>
#include <complex.h>
void test(const int m, const double* x, double complex* y) {

    int i;
    __m128d _f, _x, _b;
    double complex f __attribute__( (aligned(16)) );
    double comple开发者_如何学JAVAx b __attribute__( (aligned(16)) );
    __m128d* _p;

    b = 1;
    _b = _mm_loadu_pd( (double *) &b );

    _p = (__m128d*) y;

    for(i=0; i<m; ++i) {
        f = cexp(-I*x[i]);
        _f = _mm_loadu_pd( (double *) &f );
        _x = _mm_loadu_pd( (double *) &x[i] );      
        _f = _mm_shuffle_pd(_f, _f, 1);
        *_p = _mm_add_pd(*_p, _f);
        *_p = _mm_add_pd(*_p, _x); 
        *_p = _mm_mul_pd(*_p,_b);
        _p++;
    }
    return;
}

Compiler flags: gcc -o libtest.so -shared -std=c99 -msse2 -fPIC -O2 -g -lm test.c

test.py:

import numpy as np
import os

def zerovec_aligned(nr, dtype=np.float64, boundary=16):
    '''Create an aligned array of zeros.
    '''
    size = nr * np.dtype(dtype).itemsize
    tmp = np.zeros(size + boundary, dtype=np.uint8)
    address = tmp.__array_interface__['data'][0]
    offset = boundary - address % boundary
    return tmp[offset:offset + size].view(dtype=dtype)


lib = np.ctypeslib.load_library('libtest', '.' )
lib.test.restype = None
lib.test.argtypes = [np.ctypeslib.ctypes.c_int,
                     np.ctypeslib.ndpointer(np.float64, flags=('C', 'A') ),
                     np.ctypeslib.ndpointer(np.complex128, flags=('C', 'A', 'W') )]


n = 13
y = zerovec_aligned(n, dtype=np.complex128)
x = np.ones(n, dtype=np.float64)
# x = zerovec_aligned(n, dtype=np.float64)
# x[:] = 1.

lib.test(n,x,y)

Calling test from C works as expected:

call_from_c.c:

#include <stdio.h>
#include <complex.h>
#include <stdlib.h>
#include <emmintrin.h>

void test(const int m, const double* x, double complex* y);

int main() {

    int i; 
    const int n = 11;
    double complex *y = (double complex*) _mm_malloc(n*sizeof(double complex), 16);
    double *x = (double *) malloc(n*sizeof(double));
    for(i=0; i<n; ++i) {
        x[i] = 1;
        y[i] = 0;
    }

    test(n, x, y);
    for(i=0; i<n; ++i)
            printf("[%f %f]\n", creal(y[i]), cimag(y[i]));

    return 1;

}

Compile and call:

gcc -std=c99 -otestc -msse2 -L. -ltest call_from_c.c

export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:.

./testc

... works.

My system:

  • Ubuntu Linux i686 2.6.31-22-generic
  • Compiler: gcc (Ubuntu 4.4.1-4ubuntu9)
  • Python: Python 2.6.4 (r264:75706, Dec 7 2009, 18:45:15) [GCC 4.4.1]
  • Numpy: 1.4.0

I have taken provisions (cf. python code) that y is aligned and the alignment of x should not matter (I think; explicitly aligning x does not solve the problem though).

Note also that i use _mm_loadu_pd instead of _mm_load_pd when loading b and f. For the C-only version _mm_load_pd works (as expected). However, when calling the function via ctypes using _mm_load_pd always segfaults (independent of optimisation).

I have tried several days to sort out this issue without success ... and I am on the verge beating my monitor to death. Any input welcome. Daniel


I just got bitten by this trying to call some SSE-code from python, the problem seems to be that GCC wants to assume that the stack is aligned on 16-byte boundaries (the largest native type on the architecture, i.e. the SSE-types), and calculates all offset with that assumption. When that assumption is false, the SSE-instructions will trap.

The answer seems to be to compile with

gcc -mstackrealign
which changes the function prologues to always align the stack to 16 bytes.


Try building your extension using numpy build system to discount potential cflags/ldflags differences: http://projects.scipy.org/numpy/wiki/NumpySconsExtExamples


Have you try upgrading to Numpy 1.5.0b2. Just run the following (but be careful it might break other things (you will have to recompile all pyrex):

sudo easy_install -U numpy

I was having similar problems with ctypes when I was trying to use H5PY (I had to recompile the .deb to get with the latest version of numpy) and also there was major issues with weave that the latest upgrade fixed.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜