Fast search and replace of a nibble in an int [C; micro-optimisation]

This is a variant of the question Fast search of some nibbles in two ints at same offset (C, microoptimisation), with a different task:

The task is to find every occurrence of a predefined nibble in an int32 and replace it with another nibble. For example, the nibble to search for is 0x5 and the nibble to replace it with is 0xe:

int:   0x3d542753 (input)
           ^   ^
output:0x3dE427E3 (output int)

The pair (nibble to search for, nibble to replace with) can be different, but both are known at compile time.

I profiled my program: this is one of the hottest spots (gprof shows 75% of the time is spent in this function), and it is called a very large number of times (confirmed with gcov). It sits in the 3rd or 4th level of nested loops, with an estimated run count of (n^3)*(2^n) for n = 18..24.

My current code is slow (I rewrote it as a function here, but in the program it is the body of a loop):

static inline __attribute__((always_inline)) uint32_t nibble_replace (uint32_t A)
{
  int i;
  uint32_t mask = 0xf;
  uint32_t search = 0x5;
  uint32_t replace = 0xe;
  for (i = 0; i < 8; i++) {   // one iteration per nibble
    if ((A & mask) == search)
        A = (A & ~mask)       // clear the i-th nibble
           | replace;         // and insert the replacement
    mask <<= 4; search <<= 4; replace <<= 4;
  }
  return A;
}

Is it possible to rewrite this function in a parallel, branch-free way, using some bit-logic magic? The magic is something like (t - 0x11111111) & ~t & 0x88888888, and should ideally also be usable with SSE*. Check the accepted answer of the linked question to get a feeling for the kind of magic needed.
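
For reference, here is a minimal sketch of that style of trick: the classic zero-byte detection idiom narrowed to nibbles (has_zero_nibble is an illustrative name, not code from the question):

#include <stdint.h>

/* Nonzero if and only if t contains at least one 0x0 nibble. Borrows can
   propagate upward, so flag bits above the lowest zero nibble are not
   reliable; treat the result as a yes/no predicate. */
static inline uint32_t has_zero_nibble(uint32_t t)
{
    return (t - 0x11111111u) & ~t & 0x88888888u;
}

To test for a specific nibble such as 0x5, XOR first: has_zero_nibble(t ^ 0x55555555u) is nonzero exactly when t contains a 0x5 nibble.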

My compiler is gcc 4.5.2 and the CPU is an Intel Core 2 Solo, running in 32-bit mode (x86) or, in the near future, in 64-bit mode (x86-64).


This seemed like a fun question, so I wrote a solution without looking at the other answers. On my system it is about 4.9x as fast as the original, and about 25% faster than DigitalRoss's solution.

static inline uint32_t nibble_replace_2(uint32_t x)
{
    uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;
    uint32_t y = (~(ONES * SEARCH)) ^ x;
    y &= y >> 2;
    y &= y >> 1;
    y &= ONES;
    y *= 15; /* This is faster than y |= y << 1; y |= y << 2; */
    return x ^ (((SEARCH ^ REPLACE) * ONES) & y);
}

I would explain how it works, but... I think explaining it spoils the fun.
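
If you want to check it without spoiling anything, here is a minimal test harness; it's a sketch assuming nibble_replace (the loop version from the question) and nibble_replace_2 are both in scope:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* The example from the question: every 0x5 nibble becomes 0xE. */
    assert(nibble_replace_2(0x3d542753u) == 0x3de427e3u);

    /* Cross-check against the straightforward loop version. */
    uint32_t x = 0x9e3779b9u;               /* arbitrary starting value */
    for (int i = 0; i < 1000000; i++) {
        assert(nibble_replace_2(x) == nibble_replace(x));
        x = x * 1664525u + 1013904223u;     /* LCG step to vary the input */
    }
    puts("all tests passed");
    return 0;
}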

Note on SIMD: This kind of stuff is very, very easy to vectorize. You don't even have to know how to use SSE or MMX. Here is how I vectorized it:

static void nibble_replace_n(uint32_t *restrict p, uint32_t n)
{
    uint32_t i;
    for (i = 0; i < n; ++i) {
        uint32_t x = p[i];
        uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;
        uint32_t y = (~(ONES * SEARCH)) ^ x;
        y &= y >> 2;
        y &= y >> 1;
        y &= ONES;
        y *= 15;
        p[i] = x ^ (((SEARCH ^ REPLACE) * ONES) & y);
    }
}

Using GCC, this function will automatically be converted to SSE code at -O3, assuming proper use of the -march flag. You can pass -ftree-vectorizer-verbose=2 to GCC to ask it to print out which loops are vectorized, e.g.:

$ gcc -std=gnu99 -march=native -O3 -Wall -Wextra -o opt opt.c
opt.c:66: note: LOOP VECTORIZED.

Automatic vectorization gave me an extra speed gain of about 64%, and I didn't even have to reach for the processor manual.

Edit: I noticed an additional 48% speedup by changing the types in the auto-vectorized version from uint32_t to uint16_t. This brings the total speedup to about 12x over the original. Changing to uint8_t causes vectorization to fail. I suspect there's some significant extra speed to be found with hand assembly, if it's that important.
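
The uint16_t version isn't shown above, so here is a sketch of what I mean, assuming the constants simply narrow to 16 bits (nibble_replace_n16 is an illustrative name):

#include <stdint.h>

static void nibble_replace_n16(uint16_t *restrict p, uint32_t n)
{
    uint32_t i;
    for (i = 0; i < n; ++i) {
        uint16_t x = p[i];
        const uint16_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x1111;
        uint16_t y = (uint16_t)~(ONES * SEARCH) ^ x;
        y &= y >> 2;
        y &= y >> 1;
        y &= ONES;
        p[i] = x ^ (uint16_t)(y * (SEARCH ^ REPLACE));
    }
}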

Edit 2: Changed *= 7 to *= 15; this invalidates the speed tests above.

Edit 3: Here's a change that is obvious in retrospect: since y ends up with at most one set bit per nibble, the *= 15 broadcast can be folded into the final multiply:

static inline uint32_t nibble_replace_2(uint32_t x)
{
    uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;
    uint32_t y = (~(ONES * SEARCH)) ^ x;
    y &= y >> 2;
    y &= y >> 1;
    y &= ONES;
    return x ^ (y * (SEARCH ^ REPLACE));
}