Processing byte pixels with SSE/SSE2 intrinsics in C
I am writing a cross-platform C library that does various things to webcam images. All operations are per-pixel and highly parallelizable, for example applying bit masks, multiplying color values by constants, etc. Therefore I think I can gain performance by using SSE/SSE2 intrinsics.
However, I am having a data format problem. My webcam library gives me webcam frames as a pointer (void*) to a buffer containing 24- or 32-bit byte pixels in ABGR or BGR format. I have been casting these to char* so that ptr++ etc. behaves correctly. However, all the SSE/SSE2 operations expect either four integers or four floats, in the __m128 or __m64 data types. If I do this (assuming I have read the color values from the buffer into chars r, g, and b):
float pixel[] = {(float)r, (float)g, (float)b, 0.0f};
then load another float array full of constants
float constants[] = {0.299f, 0.587f, 0.114f, 0.0f};
cast both float pointers to __m128, and use the _mm_mul_ps intrinsic to do r * 0.299, g * 0.587, etc., there is no overall performance gain, because shuffling all the data around takes up so much time!
Does anyone have any suggestions for how I can load these byte pixel values quickly and efficiently into the SSE registers so that I actually get a performance gain from operating on them as such?
If you are willing to use MMX...
MMX gives you eight 64-bit registers, each of which can be treated as eight 8-bit values, just like the 8-bit values you're working with.
There's a good primer here.
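As a rough illustration of the packed-byte idea, here is a minimal sketch that swaps MMX for the SSE2 integer intrinsics, which apply the same kind of byte-wise operation to 16 bytes per register instead of 8. The function name brighten_bytes_sse2 and its parameters are made up for this example; it assumes the operation is a saturating add of a constant to every byte, and it uses unaligned loads so the buffer can come straight from the webcam library.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: add `amount` to every byte in the buffer,
 * clamping each result at 255 (unsigned saturation). */
void brighten_bytes_sse2(uint8_t *pixels, size_t byte_count, uint8_t amount)
{
    assert(pixels != NULL || byte_count == 0);

    /* Broadcast the constant into all 16 byte lanes. */
    const __m128i delta = _mm_set1_epi8((char)amount);
    size_t i = 0;

    /* Main loop: 16 bytes per iteration, no alignment requirement. */
    for (; i + 16 <= byte_count; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(pixels + i));
        v = _mm_adds_epu8(v, delta);  /* 16 saturating byte adds at once */
        _mm_storeu_si128((__m128i *)(pixels + i), v);
    }

    /* Scalar tail for the last (byte_count % 16) bytes. */
    for (; i < byte_count; ++i) {
        unsigned sum = (unsigned)pixels[i] + amount;
        pixels[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

The bytes never leave their 8-bit form, so there is no per-pixel conversion cost. Note that this treats every byte alike, so on a 32-bit pixel it would also brighten the alpha byte.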
I think your performance bottleneck could come from the casting to float, which is a rather expensive operation.
If I remember correctly, that cast costs about 50 clock cycles on most architectures. Even in the worst case, where the four FP multiplications take about 4 clocks each with no overlapping in the pipeline (16 cycles in total), doing all of them in parallel in 1 cycle could save you 15 cycles at most, so there is still no gain.
I'd definitely stick to a single number format throughout (integer in this case); if the work is streamed through MMX as Shmoopty said, even better.
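To make the "stay in integers" suggestion concrete, here is a minimal sketch of the weighted sum from the question done in 8.8 fixed point with SSE2. Several things here are assumptions rather than anything stated in the thread: the pixels are 32-bit with bytes in memory order B, G, R, A; the weights 0.299, 0.587 and 0.114 are approximated as 77/256, 150/256 and 29/256; and the function name bgra_to_gray_sse2 is invented for the example.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: one gray byte per 4-byte pixel, integer math only.
 * Assumes byte order B, G, R, A and a pixel count that is a multiple of 4. */
void bgra_to_gray_sse2(const uint8_t *bgra, uint8_t *gray, size_t pixel_count)
{
    assert(pixel_count % 4 == 0);

    const __m128i zero = _mm_setzero_si128();
    /* 16-bit weights scaled by 256, in memory order B, G, R, A (two pixels). */
    const __m128i weights = _mm_setr_epi16(29, 150, 77, 0, 29, 150, 77, 0);

    for (size_t i = 0; i < pixel_count; i += 4) {
        /* Load 4 pixels (16 bytes); unaligned load, so any pointer works. */
        __m128i px = _mm_loadu_si128((const __m128i *)(bgra + 4 * i));

        /* Widen bytes to 16-bit lanes: pixels 0-1 and pixels 2-3. */
        __m128i lo = _mm_unpacklo_epi8(px, zero);
        __m128i hi = _mm_unpackhi_epi8(px, zero);

        /* Multiply and add adjacent pairs: (29*B + 150*G), (77*R + 0*A). */
        __m128i lo32 = _mm_madd_epi16(lo, weights);
        __m128i hi32 = _mm_madd_epi16(hi, weights);

        /* Add each 32-bit lane to its neighbour to finish the per-pixel sum. */
        lo32 = _mm_add_epi32(lo32, _mm_shuffle_epi32(lo32, _MM_SHUFFLE(2, 3, 0, 1)));
        hi32 = _mm_add_epi32(hi32, _mm_shuffle_epi32(hi32, _MM_SHUFFLE(2, 3, 0, 1)));

        /* Divide by 256 to undo the weight scaling. */
        lo32 = _mm_srli_epi32(lo32, 8);
        hi32 = _mm_srli_epi32(hi32, 8);

        /* 32-bit lanes 0 and 2 hold the results (16-bit word 4 = lane 2). */
        gray[i + 0] = (uint8_t)_mm_cvtsi128_si32(lo32);
        gray[i + 1] = (uint8_t)_mm_extract_epi16(lo32, 4);
        gray[i + 2] = (uint8_t)_mm_cvtsi128_si32(hi32);
        gray[i + 3] = (uint8_t)_mm_extract_epi16(hi32, 4);
    }
}
```

The scalar stores at the end keep the sketch short; a tuned version would probably pack the results and write them with a single store. Whether this actually beats a plain scalar loop should be measured on the target machine.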
First, the data you're copying from (I'm guessing it's pointed to by that void* pointer) should be memory-aligned for optimal performance; if it is not, copy it to a memory-aligned buffer.
Second, you can still use SSE2 once you've moved your data into a memory-aligned buffer, and it's quite easy. I used the code here without any issues with the intrinsics (but had problems with the assembly as detailed here).
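A minimal sketch of those two steps, under some assumptions: _mm_malloc and _mm_free are used for the 16-byte-aligned allocation (they are reachable through the SSE headers on GCC, Clang and MSVC, but that is worth checking for your toolchain), the helper names copy_to_aligned and mask_aligned_sse2 are invented for the example, and the per-pixel operation is a 32-bit mask.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2; assumed to also expose _mm_malloc/_mm_free */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy a frame into a 16-byte-aligned buffer whose
 * size is rounded up to a multiple of 16 (padding bytes are zeroed).
 * Release the result with _mm_free(). Returns NULL on allocation failure. */
uint8_t *copy_to_aligned(const void *frame, size_t byte_count, size_t *padded_count)
{
    size_t padded = (byte_count + 15) & ~(size_t)15;
    uint8_t *buf = (uint8_t *)_mm_malloc(padded ? padded : 16, 16);
    if (buf == NULL)
        return NULL;
    memcpy(buf, frame, byte_count);
    memset(buf + byte_count, 0, padded - byte_count);
    *padded_count = padded;
    return buf;
}

/* Hypothetical helper: AND every 32-bit group with `mask`, using aligned
 * loads and stores. Requires the buffer produced by copy_to_aligned(). */
void mask_aligned_sse2(uint8_t *buf, size_t padded_count, uint32_t mask)
{
    assert(((uintptr_t)buf & 15) == 0 && padded_count % 16 == 0);

    const __m128i m = _mm_set1_epi32((int)mask);
    for (size_t i = 0; i < padded_count; i += 16) {
        __m128i v = _mm_load_si128((const __m128i *)(buf + i));
        _mm_store_si128((__m128i *)(buf + i), _mm_and_si128(v, m));
    }
}
```

The copy is not free, so for a single cheap pass over the frame it may be worth comparing this against working in place with the unaligned variants _mm_loadu_si128 and _mm_storeu_si128.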
Hope this is useful. I too worked with images, stored them as unsigned char in main memory, and copied them to the SSE2 registers (which made sense since R, G, and B each vary from 0 to 255), but I used the assembly code since I felt it was easier.
But if you want to make it cross-platform, I suppose using the intrinsics would be cleaner.
Good luck!