Processing byte pixels with SSE/SSE2 intrinsics in C

2022-12-14 10:02 问答作者：

I am programming, for cross-platform C, a library to do various things to webcam images. All operations are per-pixel and highly parallelizable - for example applying bit masks, multiplying color values by constants, etc. Therefore I think I can gain performance by using SSE/SSE2 intrinsics.

However, I am having a data format problem. My webcam library gives me webcam frames as 开发者_C百科a pointer (void*) to a buffer containing 24- or 32-bit byte pixels in ABGR or BGR format. I have been casting these to char* so that ptr++ etc behaves correctly. However, all the SSE/SSE2 operations expect either four integers or four floats, in the __m128 or __m64 data types. If I do this (assuming I have read the color values from the buffer into chars r, g, and b):

float pixel[] = {(float)r, (float)g, {float)b, 0.0f};

then load another float array full of constants

float constants[] = {0.299, 0.587, 0.114, 0.0f};

cast both float pointers to __m128, and use the __mm_mul_ps intrinsic to do r * 0.299, g * 0.587 etc etc... there is no overall performance gain because all the shuffling stuff around takes up so much time!

Does anyone have any suggestions for how I can load these byte pixel values quickly and efficiently into the SSE registers so that I actually get a performance gain from operating on them as such?

If you are willing to use MMX...

MMX gives you a bunch of 64 bit registers that can treat each register as 8, 8-bit values.

Like the 8-bit values you're working with.

There's a good primer here.

I think your performance bottleneck could come from the casting to float, that is a rather expensive operation.

If I remember well, that casting is about 50 clock cycles in most architectures... and considering the worst case in which the FP multiplications could take, let's say, about 4 clocks each one with no overlapping in the pipeline, doing all of them in parallel in 1 cycle could save you 15 cycles at most, still no gain.

I'd definitively go for working always with the same number format (integer in this case), if streamed with MMX like Shmoopty said, then better.

First, the data you're copying from (I'm guessing it's pointed to by that void* pointer) should be memory aligned for optimal performance - if not copy it to a memory aligned buffer.

Second, you can still use SSE2 once you've moved your data into a memory aligned buffer, it's quite easy - I used the code here without any issues with the intrinsics (but had problems with the assembly as detailed here).

Hope this is useful - I too worked with images and stored them as unsigned char in the main memory and copied them to the SSE2 registers (made sense since R,G, or B varied from 0-255) - but I used the assembly code since I felt it was easier.

But if you want to make it cross-platform, I suppose using the intrinsics would be cleaner.

Good luck!

继续阅读：c image-processing optimization webcam

Processing byte pixels with SSE/SSE2 intrinsics in C

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？