Are there SIMD instructions to speed up checksum calculations?

2023-03-19 15:30 Q&A author:

I'm going to have to code a very basic checksum function, something like:

char sum(const char * data, const int len)
{
    char sum(0);
    for (const char * end=data+len ; data<end ; ++data)
        sum += *data;
    return sum;
}

That's trivial. Now, how should I optimize this? First, I should probably use some std::for_each with a lambda or something like that:

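#include <algorithm>  // std::for_each

char sum2(const char * data, const int len)
{
    char sum(0);
    // Same byte-wise sum as above, expressed with a lambda that captures sum by reference.
    std::for_each(data, data+len, [&sum](char b){sum+=b;});
    return sum;
}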

Next, I could use multiple threads/cores to sum up chunks and then add the results. I won't write that down: I'm afraid the cost of creating threads (or fetching them from a pool), splitting the array, and dispatching the work would outweigh any gain, since I would mostly calculate checksums for small arrays of 10-100 bytes, rarely up to 1000.

But what I really want is something lower level: SIMD instructions that sum up bytes in 128-bit registers, or that add the bytes of two registers lane by lane without propagating the carry between them, or both.

Is there any such thing out there?

Note: This IS actual premature optimization, but it's fun, so what the hell?

Edit: I still need a way to sum up all the bytes in an SSE register, something better than

char ptr[16];
_mm_storeu_si128((__m128i*)ptr, sum);
checksum += ptr[0] + ptr[1] + ptr[2]  + ptr[3]  + ptr[4]  + ptr[5]  + ptr[6]  + ptr[7]
          + ptr[8] + ptr[9] + ptr[10] + ptr[11] + ptr[12] + ptr[13] + ptr[14] + ptr[15];

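Yes, there are such instructions. The MMX instruction set has "Packed ADD" (PADDB for bytes), which adds eight bytes at once with no carry between them:

_mm_add_pi8 in Visual C++ (an Intel intrinsic from <mmintrin.h>, which gcc also provides)
__builtin_ia32_paddb in gcc

And in the SSE2 instruction set, the same operation works on sixteen bytes in a 128-bit register:

_mm_add_epi8 in Visual C++ (from <emmintrin.h>, also available in gcc)
__builtin_ia32_paddb128 in gcc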
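
As a rough sketch of how the SSE2 intrinsic could be applied to the checksum from the question: the function name sum_sse2 is made up for illustration, the length is taken as std::size_t rather than int, and an x86 target with SSE2 enabled is assumed. The final reduction simply spills the register to memory, as in the question's edit.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics: _mm_add_epi8, _mm_loadu_si128, ...
#include <cstddef>

// Hypothetical helper: wrapping 8-bit sum of len bytes, 16 bytes per step.
char sum_sse2(const char* data, std::size_t len)
{
    __m128i acc = _mm_setzero_si128();  // 16 independent byte sums
    std::size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        // Unaligned load, so data needs no particular alignment.
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        acc = _mm_add_epi8(acc, v);     // per-byte add, carries do not cross lanes
    }

    // Spill the 16 partial sums and add them up in scalar code.
    unsigned char lanes[16];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), acc);
    unsigned total = 0;
    for (int k = 0; k < 16; ++k)
        total += lanes[k];

    // The remaining 0-15 bytes are handled one at a time.
    for (; i < len; ++i)
        total += static_cast<unsigned char>(data[i]);

    return static_cast<char>(total);    // keep only the low 8 bits
}
```

For the 10-100 byte inputs mentioned in the question, the loop runs only a handful of times, so whether this beats the plain loop (which compilers may auto-vectorize anyway) is something to measure rather than assume.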

EDIT: A faster way to add the partial sums:

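__m128i sums;  // assumed to already hold the 16 per-byte partial sums

// Fold the register onto itself: after the four steps, byte 0 holds
// the sum of all 16 bytes (modulo 256).
sums = _mm_add_epi8(sums, _mm_srli_si128(sums, 1));
sums = _mm_add_epi8(sums, _mm_srli_si128(sums, 2));
sums = _mm_add_epi8(sums, _mm_srli_si128(sums, 4));
sums = _mm_add_epi8(sums, _mm_srli_si128(sums, 8));
// Only the lowest byte is the total; bytes 1-3 hold other partial sums.
checksum += (char)_mm_cvtsi128_si32(sums);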
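
Another option for the same horizontal sum, not taken from the answer above but a commonly used SSE2 idiom, is _mm_sad_epu8 (PSADBW) against zero. The sketch below assumes SSE2, and the helper name hsum_bytes is hypothetical.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: adds the 16 bytes of v and returns the low 8 bits.
unsigned char hsum_bytes(__m128i v)
{
    // Sum of absolute differences against zero: bytes 0-7 are summed into
    // the low 64-bit half, bytes 8-15 into the high 64-bit half.
    __m128i sad = _mm_sad_epu8(v, _mm_setzero_si128());
    // Shift the high half down by 8 bytes and add the two partial sums.
    __m128i total = _mm_add_epi32(sad, _mm_srli_si128(sad, 8));
    return static_cast<unsigned char>(_mm_cvtsi128_si32(total));
}
```

Because the two partial sums are kept in wider fields, the full total (at most 16 * 255) is available before truncation, which could matter if the checksum were ever widened beyond 8 bits.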

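Look at _mm_add_epi8, which adds a contiguous 128-bit block as sixteen bytes at once. (_mm_add_ps has the same shape but operates on four packed floats, so it does not fit a byte checksum.) You'll need to zero-pad your array to a multiple of 16 bytes, or process the last few bytes without SIMD.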
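
The zero-padding idea can be sketched without touching the caller's buffer, by copying the leftover bytes into a zeroed 16-byte scratch array. The helper name load_tail_zero_padded is hypothetical, and SSE2 is assumed.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstring>

// Hypothetical helper: loads the last n (< 16) bytes into a vector whose
// unused lanes are zero, so they do not change a byte-wise sum.
__m128i load_tail_zero_padded(const char* tail, std::size_t n)
{
    alignas(16) char padded[16] = {0};  // scratch buffer, all zeros
    std::memcpy(padded, tail, n);       // copy only the bytes that exist
    return _mm_load_si128(reinterpret_cast<const __m128i*>(padded));
}
```

The result can be fed into the accumulator like any full block, for example acc = _mm_add_epi8(acc, load_tail_zero_padded(data + i, len - i)), where acc, data, i and len stand for the caller's own loop state.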
