Faster alpha blending using a lookup table?
I have made a lookup table that lets you blend two single-byte channels (256 levels per channel) by a single-byte alpha value using no floating point math (and hence no float-to-int conversions). Each entry in the table is a channel value pre-scaled by an alpha value, i.e., the fraction of the channel that the alpha represents.
In all, a full 3-channel RGB blend requires two lookups into the array per channel plus an addition, for a total of 6 lookups and 3 additions. In the example below, I split the colors into separate variables for ease of demonstration. It shows how to blend the three channels R, G, and B by an alpha value ranging from 0 to 255.
BYTE r1, r2, rDest;
BYTE g1, g2, gDest;
BYTE b1, b2, bDest;
BYTE av; // Alpha value
BYTE rem = 255 - av; // Remaining fraction
rDest = _lookup[r1][rem] + _lookup[r2][av];
gDest = _lookup[g1][rem] + _lookup[g2][av];
bDest = _lookup[b1][rem] + _lookup[b2][av];
It works great. It is as precise as you can get with 256-level color channels; in fact, you would get the exact same values as the actual floating point calculations, since the lookup table was calculated using doubles to begin with. The lookup table is too big to fit in this post (65536 bytes). (If you would like a copy of it, email me at ten.turtle.toes@gmail.com, but don't expect a reply until tomorrow because I am going to sleep now.)
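For reference, a table with these properties could be generated along the following lines. This is a minimal sketch, not the author's actual code; it assumes each entry is round(value * alpha / 255.0), which is what the blend above with rem = 255 - av implies.

typedef unsigned char BYTE;

// Hypothetical generator for the 256x256 premultiplication table
// (256 * 256 = 65536 bytes, matching the size quoted above).
// _lookup[v][a] holds v scaled by the alpha fraction a/255 and
// rounded to the nearest integer, so that
//   _lookup[v][255 - a] + _lookup[v][a] == v   for any v, a.
static BYTE _lookup[256][256];

void BuildBlendLookup()
{
    for (int v = 0; v < 256; ++v)
        for (int a = 0; a < 256; ++a)
            _lookup[v][a] = static_cast<BYTE>(v * a / 255.0 + 0.5);
}

Built this way, the table reproduces the rounded floating point result, which is consistent with the claim above.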
So... what do you think? Is it worth it or not?
I would be interested in seeing some benchmarks.
There is an algorithm that can do perfect alpha blending without any floating point calculations or lookup tables. You can find more info in the following document (the algorithm and code are described at the end).
I also did an SSE implementation of this long ago, if you are interested...
#include <emmintrin.h>   // SSE2 intrinsics
#include <xmmintrin.h>   // _mm_prefetch
#include <cassert>
#include <cstddef>

typedef unsigned int u32; // assumed typedefs for the original's u32/s8
typedef char s8;

void PreOver_SSE2(void* dest, const void* source1, const void* source2, size_t size)
{
    static const size_t STRIDE = sizeof(__m128i) * 4;
    static const u32 PSD = 64;

    static const __m128i round  = _mm_set1_epi16(128);
    static const __m128i lomask = _mm_set1_epi32(0x00FF00FF);

    assert(source1 != NULL && source2 != NULL && dest != NULL);
    assert(size % STRIDE == 0);

    const __m128i* source128_1 = reinterpret_cast<const __m128i*>(source1);
    const __m128i* source128_2 = reinterpret_cast<const __m128i*>(source2);
    __m128i* dest128 = reinterpret_cast<__m128i*>(dest);

    __m128i d, s, a, rb, ag, t;

    for (size_t k = 0, length = size/STRIDE; k < length; ++k)
    {
        // TODO: put prefetch between calculations? (R.N)
        _mm_prefetch(reinterpret_cast<const s8*>(source128_1 + PSD), _MM_HINT_NTA);
        _mm_prefetch(reinterpret_cast<const s8*>(source128_2 + PSD), _MM_HINT_NTA);

        // work on entire cacheline before next prefetch
        for (int n = 0; n < 4; ++n, ++dest128, ++source128_1, ++source128_2)
        {
            // TODO: assembly optimization: use PSHUFD on moves before calculations, lower latency than MOVDQA (R.N)
            // http://software.intel.com/en-us/articles/fast-simd-integer-move-for-the-intel-pentiumr-4-processor/
            // TODO: load entire cacheline at the same time? are there enough registers? 32 bit mode (special compile for 64bit?) (R.N)
            s = _mm_load_si128(source128_1);    // AABBGGRR
            d = _mm_load_si128(source128_2);    // AABBGGRR

            // PRELERP(S, D) = S + D - ((((S*D[A] + 0x80) >> 8) + (S*D[A] + 0x80)) >> 8)
            // T = S*D[A] + 0x80  =>  PRELERP(S, D) = S + D - (((T >> 8) + T) >> 8)

            // broadcast dest alpha into both 16-bit lanes of each pixel
            a  = _mm_srli_epi32(d, 24);         // 000000AA
            rb = _mm_slli_epi32(a, 16);         // 00AA0000
            a  = _mm_or_si128(rb, a);           // 00AA00AA

            rb = _mm_and_si128(lomask, s);      // 00BB00RR
            rb = _mm_mullo_epi16(rb, a);        // BBBBRRRR
            rb = _mm_add_epi16(rb, round);      // BBBBRRRR
            t  = _mm_srli_epi16(rb, 8);
            t  = _mm_add_epi16(t, rb);
            rb = _mm_srli_epi16(t, 8);          // 00BB00RR

            ag = _mm_srli_epi16(s, 8);          // 00AA00GG
            ag = _mm_mullo_epi16(ag, a);        // AAAAGGGG
            ag = _mm_add_epi16(ag, round);
            t  = _mm_srli_epi16(ag, 8);
            t  = _mm_add_epi16(t, ag);
            ag = _mm_andnot_si128(lomask, t);   // AA00GG00

            rb = _mm_or_si128(rb, ag);          // AABBGGRR pack
            rb = _mm_sub_epi8(s, rb);           // sub S - [(D[A]*S)/255]
            d  = _mm_add_epi8(d, rb);           // add D + [S - (D[A]*S)/255]

            _mm_stream_si128(dest128, d);
        }
    }
    _mm_mfence(); // ensure last WC buffers get flushed to memory
}
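For readers who don't want to decode the intrinsics, here is a scalar sketch of the same PRELERP step for a single premultiplied 8-bit channel. It is my illustration, not code from the referenced document; the add-and-shift pair stands in for the division by 255, exactly as in the SSE version above.

typedef unsigned char u8;
typedef unsigned int u32;

// PRELERP for premultiplied-alpha sources, one channel at a time:
//   result = S + D - (S * D[A]) / 255
// with the division by 255 done by the round-add-shift trick.
inline u8 PreOverChannel(u8 s, u8 d, u8 dAlpha)
{
    u32 t   = u32(s) * dAlpha + 0x80;   // S*D[A] + 0x80
    u32 sub = ((t >> 8) + t) >> 8;      // (S*D[A]) / 255, rounded
    return u8(s + d - sub);             // S + D - S*D[A]/255
}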
Today's processors can do a lot of calculation in the time it takes to fetch one value from memory, especially if it isn't in cache. That makes it especially important to benchmark possible solutions, because you can't easily reason about what the result will be.
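If you want a quick starting point, a rough harness along these lines will usually tell you more than reasoning about it. This is only a sketch: BlendChannel is a placeholder for whichever variant you want to time, and the numbers will depend heavily on your data size and cache behavior.

#include <chrono>
#include <cstdio>
#include <vector>

typedef unsigned char BYTE;

// One candidate to time: the plain integer blend.  Swap in the
// lookup-table version (or any other) to compare like for like.
inline BYTE BlendChannel(BYTE c1, BYTE c2, BYTE av)
{
    return BYTE((c1 * (255 - av) + c2 * av) / 255);
}

int main()
{
    const size_t N = 1 << 24;                       // ~16M channel blends
    std::vector<BYTE> a(N), b(N), alpha(N), out(N);
    for (size_t i = 0; i < N; ++i) {                // arbitrary test data
        a[i] = BYTE(i);
        b[i] = BYTE(i * 7);
        alpha[i] = BYTE(i * 13);
    }

    auto start = std::chrono::steady_clock::now();
    for (size_t i = 0; i < N; ++i)
        out[i] = BlendChannel(a[i], b[i], alpha[i]);
    auto stop = std::chrono::steady_clock::now();

    unsigned checksum = 0;                          // keep the optimizer honest
    for (size_t i = 0; i < N; ++i)
        checksum += out[i];

    std::chrono::duration<double, std::milli> ms = stop - start;
    std::printf("%zu blends in %.2f ms (checksum %u)\n", N, ms.count(), checksum);
    return 0;
}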
I don't know why you're concerned about floating point conversions; this can all be done with integers.
BYTE r1, r2, rDest;
BYTE g1, g2, gDest;
BYTE b1, b2, bDest;
BYTE av; // Alpha value
BYTE rem = 255 - av; // Remaining fraction
rDest = (r1*rem + r2*av) / 255;
gDest = (g1*rem + g2*av) / 255;
bDest = (b1*rem + b2*av) / 255;
If you want to get really clever, you can replace the divide with a multiply followed by a right-shift.
Edit: Here's the version using a right-shift. Adding a constant might reduce any truncation errors, but I'll leave that as an exercise for the reader.
BYTE r1, r2, rDest;
BYTE g1, g2, gDest;
BYTE b1, b2, bDest;
BYTE av; // Alpha value
int aMult = 0x10000 * av / 255;
int rem = 0x10000 - aMult; // Remaining fraction
rDest = (r1*rem + r2*aMult) >> 16;
gDest = (g1*rem + g2*aMult) >> 16;
bDest = (b1*rem + b2*aMult) >> 16;
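For what it's worth, the usual way to add such a rounding constant is to bias the sum by half of the implicit divisor before the shift (a minimal sketch, assuming round-to-nearest is the goal):

// Add half of 0x10000 (the implicit divisor of >> 16) so the result
// rounds to nearest instead of truncating.
rDest = (r1*rem + r2*aMult + 0x8000) >> 16;
gDest = (g1*rem + g2*aMult + 0x8000) >> 16;
bDest = (b1*rem + b2*aMult + 0x8000) >> 16;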
Mixing colors is computationally trivial. I would be surprised if this yielded a significant benefit, but as always, you must first prove that this calculation is actually a performance bottleneck. And I would suggest that a better solution is to use hardware accelerated blending.
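For example, if the pixels are being drawn through OpenGL anyway, the fixed-function blend stage does this per fragment at no CPU cost. A minimal sketch, assuming a current GL context and geometry submitted elsewhere:

#include <GL/gl.h>

// Standard "source over" blending: the GPU computes
// dst = src * srcAlpha + dst * (1 - srcAlpha) for every fragment drawn.
void EnableAlphaBlending()
{
    glEnable(GL_BLEND);
    glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
}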