Optimizing loop with few instructions(SSE2, SSE4) with TBB

2023-02-10 03:42 问答作者：

I have a simple image processing related algorithm. Briefly, an image(mean) in float is subtracted by an 8-bit image the result is then save to an float image(dest)

this function is mainly written by intrinsics.

I have tried to optimize this function with TBB, parrallel_for, but I received no gain in speed but penalty.

What should I do ? Should I use more low-level scheme such as TBB task to optimize the code ?

float           *m, **m_data,
                *o, **o_data;
unsigned char   *p, **src_data;
register unsigned long len, i;
unsigned long   nr,
                nc;

src_data    =   src->UByteData;    // 2d array
m_data      =   mean->FloatData;   // 2d array
o_data      =   dest->FloatData;   // 2d array
nr          =   src->Rows;
nc          =   src->Cols;

__m128i xmm0;

for(i=0; i<nr; i++)
{
    m = m_data[i];
    o = o_data[i];
    p = src_data[i];
    len = nc;
    do
    {
        _mm_prefetch((const char *)(p + 16),  _MM_HINT_NTA);
        _mm_prefetch((const char *)(m + 16),  _MM_HINT_NTA);

        xmm0 = _mm_load_si128((__m128i *) (p));

        _mm_stream_ps(
                        o,
                        _mm_sub_ps(
                                    _mm_cvtepi32_ps(_mm_cvtepu8_epi32(_mm_srli_si128(xmm0, 0))),
                                    _mm_load_ps(m + offset)
                                )
                    );
        _mm_stream_ps(
                        o + 4,
                        _mm_sub_ps(
                                    _mm_cvtepi32_ps(_mm_cvtepu8_epi32(_mm_srli_si128(xmm0, 4))),
                                    _开发者_开发知识库mm_load_ps(m + offset + 4)
                                )
                    );
        _mm_stream_ps(
                        o + 8,
                        _mm_sub_ps(
                                    _mm_cvtepi32_ps(_mm_cvtepu8_epi32(_mm_srli_si128(xmm0, 8))),
                                    _mm_load_ps(m + offset + 8)
                                )
                    );
        _mm_stream_ps(
                        o + 12,
                        _mm_sub_ps(
                                    _mm_cvtepi32_ps(_mm_cvtepu8_epi32(_mm_srli_si128(xmm0, 12))),
                                    _mm_load_ps(m + offset + 12)
                                )
                    );

        p += 16;
        m += 16;
        o += 16;
        len -= 16;
    }
    while(len);
}

You are doing almost no computation here, relative to the number of loads and stores, so it's likely that you are being limited by memory bandwidth rather than computation. This would explain why you don't see any improvement in throughput when you optimise the computation.

I would get rid of the _mm_prefetch instructions though - they are almost certainly not helping here and may even be hurting performance.

If possible you should combine this loop with any other operations that you are doing before/after this - that way you amortise the cost of memory I/O over more computation.

继续阅读：image-processing optimization parallel-processing sse2 tbb

Optimizing loop with few instructions(SSE2, SSE4) with TBB

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？