Bilinear filter with SSE4.1 intrinsics

2023-03-04 21:11 问答作者：

I am trying to figure out a reasonably fast bilinear filtering function just for one filtered sample at a time now as an exercise in getting used to using intrinsics - up to SSE41 is fine.

So far I have the following:

inline __m128i DivideBy255_8xUint16(const __m128i value)
{
    //  Blinn 16bit divide by 255 trick but across 8 packed 16bit values
    const __m128i plus128 = _mm_add_epi16(value, _mm_set1_epi16(128));
    const __m128i plus128ThenDivideBy256 = _mm_srli_epi16(plus128, 8);          //  TODO:   Should this be an arithmetic or logical shift or does it matter?
    const __m128i partial = _mm_add_epi16(plus128, plus128ThenDivideBy256);
    const __m128i result = _mm_srli_epi16(partial, 8);                          //  TODO:   Should this be an arithmetic or logical shift or does it matter?


    return result;
}


inline uint32_t BilinearSSE41(const uint8_t* data, uint32_t pitch, uint32_t width, uint32_t height, float u, float v)
{
    //  TODO:   There are probably intrinsics I haven't found yet to avoid using these?
    //  0x80 is high bit set which means zero out that component
    const __m128i unpack_fraction_u_mask = _mm_set_epi8(0x80, 0, 0x80, 0, 0x80, 0, 0x80, 0, 0x80, 0, 0x80, 0, 0x80, 0, 0x80, 0);
    const __m128i unpack_fraction_v_mask = _mm_set_epi8(0x80, 1, 0x80, 1, 0x80, 1, 0x80, 1, 0x80, 1, 0x80, 1, 0x80, 1, 0x80, 1);
    const __m128i unpack_two_texels_mask = _mm_set_epi8(0x80, 7, 0x80, 6, 0x80, 5, 0x80, 4, 0x80, 3, 0x80, 2, 0x80, 1, 0x80, 0);


    //  TODO:   Potentially wasting two channels of operations for now
    const __m128i size = _mm_set_epi32(0, 0, height - 1, width - 1);
    const __m128 uv = _mm_set_ps(0.0f, 0.0f, v, u);

    const __m128 floor_uv_f = _mm_floor_ps(uv);
    const __m128 fraction_uv_f = _mm_sub_ps(uv, floor_uv_f);
    const __m128 fraction255_uv_f = _mm_mul_ps(fraction_uv_f, _mm_set_ps1(255.0f));
    const __m128i fraction255_uv_i = _mm_cvttps_epi32(fraction255_uv_f);    //  TODO:   Did this get rounded correctly?

    const __m128i fraction255_u_i = _mm_shuffle_epi8(fraction255_uv_i, unpack_fraction_u_mask); //  Splat fraction_u*255 across all 16 bit words
    const __m128i fraction255_v_i = _mm_shuffle_epi8(fraction255_uv_i, unpack_fraction_v_mask); //  Splat fraction_v*255 across all 16 bit words

    const __m128i inverse_fraction255_u_i = _mm_sub_epi16(_mm_set1_epi16(255), fraction255_u_i);
    const __m128i inverse_fraction255_v_i = _mm_sub_epi16(_mm_set1_epi16(255), fraction255_v_i);

    const __m128i floor_uv_i = _mm_cvttps_epi32(floor_uv_f);
    const __m128i clipped_floor_uv_i = _mm_min_epu32(floor_uv_i, size); //  TODO:   I haven't clamped this probably if uv was less than zero yet...


    //  TODO:   Calculating the addresses in the SSE register set would maybe be better

    int u0 = _mm_extract_epi32(floor_uv_i, 0);
    int v0 = _mm_extract_epi32(floor_uv_i, 1);


    const uint8_t* row = data + (u0<<2) + pitch*v0;


    const __m128i row0_packed = _mm_loadl_epi64((const __m128i*)data);
    const __m128i row0 = _mm_shuffle_epi8(row0_packed, unpack_two_texels_mask);

    const __m128i row1_packed = _mm_loadl_epi64((const __m128i*)(data + pitch));
    const __m128i row1 = _mm_shuffle_epi8(row1_packed, unpack_two_texels_mask);


    //  Compute (row0*fraction)/255 + row1*(255 - fraction)/255 - probably slight precision loss across addition!
    const __m128i vlerp0 = DivideBy255_8xUint16(_mm_mullo_epi16(row0, fraction255_v_i));
    const __m128i vlerp1 = DivideBy255_8xUint16(_mm_mullo_epi16(row1, inverse_fraction255_v_i));
    const __m128i vlerp = _mm_adds_epi16(vlerp0, vlerp1);

    const __m128i hlerp0 = DivideBy255_8xUint16(_mm_mullo_epi16(vlerp, fraction255_u_i));
    const __m128i hlerp1 = DivideBy255_8xUint16(_mm_srli_si128(_mm_mullo_epi16(vlerp, inverse_fraction255_u_i), 16 - 2*4));
    const __m128i hlerp = _mm_adds_epi16(hlerp0, hlerp1);


    //  Pack down to 8bit from 16bit components and return 32bit ARGB result
    return _mm_extract_epi32(_mm_packus_epi16(hlerp, hlerp), 0);
}

The code assumes the image data is ARGB8 and has an extra column and row to handle edge cases without having to branch.

I am after advice on what instructions I can use to bring down the size of this gangly mess and of course how 开发者_C百科it can be improved to run faster!

Thanks :)

Nothing specific to say about your code. But I wrote my own Bilinear scaling code using SSE2. See the StackOverflow question Help me improve some more SSE2 code for more details.

In my code I calculate the horizontal and vertical fractions and indexes first rather than per pixel. I think this is faster.

My code under core2 cpus seems to be memory limited rather than cpu so not doing the precalc might be faster.

Noticed your comment "TODO: Should this be an arithmetic or logical shift or does it matter?"

Arithmetic shift is for signed integers. Logical shift is for unsigned integers.

    0x80000000 >> 4 is 0xf8000000 // Arithmetic shift
    0x80000000 >> 4 is 0x08000000 // Logical shift

继续阅读：c filtering intrinsics optimization sse

Bilinear filter with SSE4.1 intrinsics

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？