Fast rgb565 to YUV (or even rgb565 to Y)

2022-12-16 19:22 问答作者：

I'm working on a thing where I want to have the output option to go to a video overlay. Some support rgb565, If so sweet, just copy the data across.

If not I have to copy data across with a conversion and it's a frame buffer at a time. I'm going to try a few things, but I thought this might be one of those things that optimisers would be keen on having a go at for a bit of a challenge.

There a variety of YUV formats that are commonly supported easiest would be the Plane of Y followed by either interleaved or individual planes of UV.

Using Linux / xv, but at the level I'm dealing with it's just bytes and an x86.

I'm going to focus on speed at the cost of quality, but there are potentially hundreds of different paths to try out. There's a balance in there somewhere.

I looked at mmx but I'm not sure if there is anything useful there. There's nothing that strikes me as particularly suited to the task and it's a lot of shuffling to get things into the right place in registers.

Trying a crude version with Y = Green*0.5 + R*0.25 + Blue*notm开发者_开发问答uch. The U and V are even less of a concern quality wise. You can get away with murder on those channels.

For a simple loop.

loop:
movzx eax,[esi]
add esi,2
shr eax,3
shr al,1
add ah,ah
add al,ah
mov [edi],al
add edi,1
dec count
jnz loop

of course every instruction depends on the one before and word reads aren't the best so interleaving two might gain a bit

loop: 
mov eax,[esi]
add esi,4
mov ebx,eax
shr eax,3
shr ebx,19
add ah,ah
add bh,bh
add al,ah
add bl,bh
mov ah,bl
mov [edi],ax
add edi,2
dec count
jnz loop

It would be quite easy to do that with 4 at a time, maybe for a benefit.

Can anyone come up with anything faster, better?

An interesting side point to this is whether or not a decent compiler can produce similar code.

A decent compiler, given the appropriate switches to tune for the CPU variants of most interest, almost certainly knows a lot more about good x86 instruction selection and scheduling than any mere mortal!

Take a look at the Intel(R) 64 and IA-32 Architectures Optimization Reference Manual...

If you want to get into hand-optimising code, a good strategy might be to get the compiler to generate assembly source for you as a starting point, and then tweak that; profile before and after every change to ensure that you're actually making things better.

What you really want to look at, I think, is using MMX or the integer SSE instructions for this. That will let you work with a few pixels at a time. I imagine your compiler will be able to generate such code if you specify the correct switches, especially if your code is written nicely enough.

Regarding your existing codes, I wouldn't bother with interleaving instructions of different iterations to gain performance. The out-of-order engine of all x86 processors (excluding Atom) and the caches should handle that pretty well.

Edit: If you need to do horizontal adds you might want to use the PHADDD and PHADDW instructions. In fact, if you have the Intel Software Designer's Manual, you should look for the PH* instructions. They might have what you need.

继续阅读：assembly blit optimization x86 yuv

Fast rgb565 to YUV (or even rgb565 to Y)

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？