How to optimize this simple function which translates input bits into words?
I have written a function that reads an input buffer of bytes and produces an output buffer of words, where every word is either 0x0081 for an ON bit of the input or 0x007F for an OFF bit. The length of the input buffer is given, and both arrays have enough physical space. I also have about 2 KB of free RAM which I can use for lookup tables or similar.
Now, I have found that this function is the bottleneck in a real-time application. It will be called very frequently. Can you suggest a way to optimize it? One possibility I see is to use only one buffer and do in-place substitution.
void inline BitsToWords(int8 *pc_BufIn,
                        int16 *pw_BufOut,
                        int32 BufInLen)
{
    int32 i, j, z = 0;
    for (i = 0; i < BufInLen; i++)
    {
        for (j = 0; j < 8; j++, z++)
        {
            pw_BufOut[z] = (((pc_BufIn[i] >> (7 - j)) & 0x01) == 1 ?
                            0x0081 : 0x007f);
        }
    }
}
Please do not offer any library-, compiler-, or CPU/hardware-specific optimizations, because this is a multi-platform project.
I also have about 2Kbyte free RAM which I can use for lookup tables
Your lookup table can be placed in a const array at compile time, so it could live in ROM. Does that give you room for the straightforward 4 KB table?
If you can afford 4 KB of ROM space, the only problem is building the table as an initialized array in a .c file; but that only has to be done once, and you can write a script to do it (which helps ensure it's correct, and also helps if you decide the table needs to change for some reason in the future).
You'd have to profile to ensure that the copy from ROM to the destination array is actually faster than calculating what needs to go into the destination - I wouldn't be surprised if something along the lines of:
/* untested code - please forgive any bonehead errors */
void inline BitsToWords(int8 *pc_BufIn,
                        int16 *pw_BufOut,
                        int32 BufInLen)
{
    while (BufInLen--) {
        unsigned int tmp = *pc_BufIn++;
        *pw_BufOut++ = (tmp & 0x80) ? 0x0081 : 0x007f;
        *pw_BufOut++ = (tmp & 0x40) ? 0x0081 : 0x007f;
        *pw_BufOut++ = (tmp & 0x20) ? 0x0081 : 0x007f;
        *pw_BufOut++ = (tmp & 0x10) ? 0x0081 : 0x007f;
        *pw_BufOut++ = (tmp & 0x08) ? 0x0081 : 0x007f;
        *pw_BufOut++ = (tmp & 0x04) ? 0x0081 : 0x007f;
        *pw_BufOut++ = (tmp & 0x02) ? 0x0081 : 0x007f;
        *pw_BufOut++ = (tmp & 0x01) ? 0x0081 : 0x007f;
    }
}
ends up being faster. I'd expect that an optimized build of that function would have everything in registers or encoded into the instructions except for a single read of each input byte and a single write of each output word. Or pretty close to that.
You might be able to further optimize by acting on more than one input byte at a time, but then you have to deal with alignment issues and how to handle input buffers that aren't a multiple of the chunk size you're dealing with. Those aren't problems that are too hard to deal with, but they do complicate things, and it's unclear what kind of improvement you might be able to expect.
I assume you can't use parallelism?
This is only a guess - you really need to be guided by a profiler - but I think lookup tables could work.
If I understand correctly, each byte in the input array produces 16 bytes in the output. So a lookup table that gives the 16 byte output for a single byte input should take 4KiB - which is more than you have to spare.
You could split each byte into two parts of 4 bits instead, which reduces the required table to 16 entries of four 16-bit words, i.e. 128 bytes:
static const int16 values[16][4] = { /* ... */ };

void inline BitsToWords(int8 *pc_BufIn, int16 *pw_BufOut, int32 BufInLen)
{
    for (int32 i = 0; i < BufInLen; ++i, pw_BufOut += 8)
    {
        /* high nibble first, to match the original bit order */
        memcpy(pw_BufOut,     values[(pc_BufIn[i] >> 4) & 0x0F], sizeof values[0]);
        memcpy(pw_BufOut + 4, values[pc_BufIn[i] & 0x0F],        sizeof values[0]);
    }
}
Also, if you're finding that the loop overhead is excessive, you could use a Duff's Device.
First attempt:
void inline BitsToWords(int8 *pc_BufIn,
                        int16 *pw_BufOut,
                        int32 BufInLen)
{
    int32 i, j;
    int8 tmp;
    int16 translate[2] = { 0x007f, 0x0081 };
    for (i = 0; i < BufInLen; i++)
    {
        tmp = pc_BufIn[i];
        for (j = 0x80; j != 0; j >>= 1)
        {
            *pw_BufOut++ = translate[(tmp & j) != 0];
        }
    }
}
Second attempt, stealing shamelessly from Michael Burr (who already got a +1 from me):
void inline BitsToWords(int8 *pc_BufIn,
                        int16 *pw_BufOut,
                        int32 BufInLen)
{
    while (BufInLen--) {
        int16 tmp = *pc_BufIn++;
        *pw_BufOut++ = 0x007f + ((tmp >> 6) & 0x02);
        *pw_BufOut++ = 0x007f + ((tmp >> 5) & 0x02);
        *pw_BufOut++ = 0x007f + ((tmp >> 4) & 0x02);
        *pw_BufOut++ = 0x007f + ((tmp >> 3) & 0x02);
        *pw_BufOut++ = 0x007f + ((tmp >> 2) & 0x02);
        *pw_BufOut++ = 0x007f + ((tmp >> 1) & 0x02);
        *pw_BufOut++ = 0x007f + (tmp & 0x02);
        *pw_BufOut++ = 0x007f + ((tmp << 1) & 0x02);
    }
}
On the assumption that pc_BufIn and pw_BufOut point to non-overlapping memory regions, I can think of several optimizations off the top of my head. The first is that you can declare the pointers to be non-aliased:
void inline BitsToWords(int8 * restrict pc_BufIn,
                        int16 * restrict pw_BufOut,
                        int32 BufInLen)
This will allow the compiler to perform optimizations that otherwise wouldn't be permitted. Note that your compiler may use a different keyword; some use __restrict__ or a compiler-specific attribute. The only requirement is that pc_BufIn and pw_BufOut do not overlap. This should give you an immediate performance speedup, since the compiler will not attempt to re-read pc_BufIn whenever pw_BufOut is written to (saving you 7 reads for every 8 writes).
If that keyword is not available, an alternative optimization is possible:
{
    int8 *bufInEnd = pc_BufIn + BufInLen;
    while (pc_BufIn != bufInEnd)
    {
        int8 tmp = *pc_BufIn++;
        for (int j = 0; j < 8; j++)
        {
            *pw_BufOut++ = (((tmp & (0x80 >> j)) != 0) ?
                            0x0081 : 0x007f);
        }
    }
}
The above slight rewrite is, to me, easier to follow, as it states unequivocally the path the CPU takes. The optimization itself should be obvious: store the value at pc_BufIn[i] in a temporary local variable instead of hitting the pointer on every iteration of the inner loop.
Another, less obvious optimization would utilize the increasingly common vector hardware available on most CPUs (including ARM's NEON and Intel's SSE) to synthesize the result 16 bytes at a time. I'd recommend investigating that option.
If you're going for raw speed, then using a lookup table (to avoid the inner loop with bit shifts) is probably the best approach.
static const int16 lookup[256][8] = {
    { 0x007f, 0x007f, 0x007f, 0x007f, 0x007f, 0x007f, 0x007f, 0x007f },
    { 0x007f, 0x007f, 0x007f, 0x007f, 0x007f, 0x007f, 0x007f, 0x0081 },
    { 0x007f, 0x007f, 0x007f, 0x007f, 0x007f, 0x007f, 0x0081, 0x007f },
    { 0x007f, 0x007f, 0x007f, 0x007f, 0x007f, 0x007f, 0x0081, 0x0081 },
    /* skip 251 entries */
    { 0x0081, 0x0081, 0x0081, 0x0081, 0x0081, 0x0081, 0x0081, 0x0081 },
};
void inline BitsToWords(int8 *input, int16 *output, int32 length) {
    while ( length-- ) {
        memcpy( output, lookup[(uint8)*input++], 16 );
        output += 8;
    }
}
The problem there is that the lookup table itself would be 4KB (256*16), which is larger than you have available. That can be worked around in one of two ways. The simplest and smallest solution would be something like this:
static const int16 lookup[16][4] = {
    { 0x007f, 0x007f, 0x007f, 0x007f },
    { 0x007f, 0x007f, 0x007f, 0x0081 },
    { 0x007f, 0x007f, 0x0081, 0x007f },
    { 0x007f, 0x007f, 0x0081, 0x0081 },
    /* skip 11 entries */
    { 0x0081, 0x0081, 0x0081, 0x0081 },
};
void inline BitsToWords(int8 *input, int16 *output, int32 length) {
    while ( length-- ) {
        uint8 c = (uint8)*input++;
        memcpy( output,     lookup[c >> 4],   8 ); /* high nibble first */
        memcpy( output + 4, lookup[c & 0x0F], 8 );
        output += 8;
    }
}
The more complex, but possibly faster, way would be to use a De Bruijn sequence to encode all of the possible lookup values. This would reduce the lookup table from 4 KB to 512+14 bytes, but would require an additional level of indirection through another index table (256 bytes), for a total of 782 bytes. This would remove one of the memcpy() calls, as well as the shift and the bitwise AND, at the cost of one more index lookup. Probably not necessary in your case, but interesting all the same.
I was going to suggest boost::for_each, since it would unroll the loop, but the end isn't known at compile time. The best you'll get, I think, is to unroll the inner loop; boost::for_each over an mpl::range may be an option there.
You can extract pc_BufIn[i] into the outer loop. Also, at first glance, if you count backwards in the inner loop you can skip the 7-j calculation.
I might suggest creating a lookup table of the 8 possible single-bit masks (i.e. 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80) and then using these to compare against your bitfield in a loop. Pseudocode (masks above stored in bitmask, in the appropriate order):
for(i=0; i<BufInLen; i++)
    for(j=0; j<8; j++, z++)
        pw_BufOut[z] = (pc_BufIn[i] & bitmask[j]) == 0 ? 0x007f : 0x0081;
First off, since you are bit twiddling, change everything to unsigned. This eliminates any adverse effects due to sign extension or other sign related operations.
You could use a modified Duff's device:
void inline BitsToWords(int8 *pc_BufIn,
                        int16 *pw_BufOut,
                        int32 BufInLen)
{
    uint32 i, j, z = 0;
    for (i = 0; i < BufInLen; i++)
    {
        uint8 byte = pc_BufIn[i];
        for (j = 0; j < 2; ++j)
        {
            switch ((byte >> 4) & 0x0F)  /* high nibble first */
            {
            case 0:  // 0000 binary
                pw_BufOut[z++] = 0x7F;
                pw_BufOut[z++] = 0x7F;
                pw_BufOut[z++] = 0x7F;
                pw_BufOut[z++] = 0x7F;
                break;
            case 1:  // 0001 binary
                pw_BufOut[z++] = 0x7F;
                pw_BufOut[z++] = 0x7F;
                pw_BufOut[z++] = 0x7F;
                pw_BufOut[z++] = 0x81;
                break;
            case 2:  // 0010 binary
                pw_BufOut[z++] = 0x7F;
                pw_BufOut[z++] = 0x7F;
                pw_BufOut[z++] = 0x81;
                pw_BufOut[z++] = 0x7F;
                break;
            // And so on ...
            case 15: // 1111 binary
                pw_BufOut[z++] = 0x81;
                pw_BufOut[z++] = 0x81;
                pw_BufOut[z++] = 0x81;
                pw_BufOut[z++] = 0x81;
                break;
            } // End: switch
            byte <<= 4; /* move the low nibble up for the second pass */
        }
    }
}
If you don't mind keeping 256 precomputed outputs in memory, you could generate all the possible 8-word results up front and skip the inner loop entirely, replacing it with a copy of perm[pc_BufIn[i]] into the output buffer (where perm is an array holding all 256 permutations).
What comes to mind immediately:
- Unroll the inner loop (the compiler might do so already, but you can optimize further if you do so manually, see below)
- Instead of keeping "z", keep a pointer that is incremented (the compiler might do that already)
- Instead of performing a comparison, shift each extracted bit down so that it lands in bit position 1, then add it to 0x7F. That gives you 0x7F or 0x81 for each unrolled item.
Best thing is to see what sort of assembler that is generated for your target platforms, and see what the compiler is doing.
EDIT: I wouldn't use a lookup table. The cost of the extra cache misses will probably be more than the cost of the simple calculation.
EDIT2: Let me get to the other computer and fire up the compiler, and I'll see what I can do.
Firstly, you're doing this for an 8 segment display, aren't you?
You may want to #include <stdint.h>. It contains typedefs for sized integers with names like uint8_t and uint_fast8_t. Your types serve similar purposes to the first form, but the fast versions can be bigger if the target processor works better with data of that size. You probably won't want to change your array types, though; mostly just your local variable types.
void inline BitsToWords(int8 *pc_BufIn,
                        int16 *pw_BufOut,
                        int32 BufInLen)
{
    /* int32 i, j, z = 0;  -- no longer needed; pointer math replaces
     * the indexes.  Depending on your processor you may also want
     * different local types here. */
    int8 *end = pc_BufIn + BufInLen; /* so that you can do pointer math
                                      * rather than indexing */
    while (pc_BufIn < end)
    {
        uint_fast8_t cur = (uint8_t)*(pc_BufIn++);
        uint_fast8_t down = 8;
        do
        {
            /* When the top bit is set, add 2.  Doing this with
             * addition avoids a jump. */
            *(pw_BufOut++) = 0x007f + ((cur & 0x80) >> 6);
            cur <<= 1; /* next bit, most significant first */
        } while (--down);
    }
}
In this code I changed the order of the second loop to count down rather than up. This is often more efficient if your lower limit is 0 or -1. Also, it seemed like you were going from the most significant bit to the least anyway.
Alternately you could unroll the inner loop and produce faster code and do away with the down
variable. Your compiler may already be doing this for you.
*(pw_BufOut++) = 0x007f + ((cur & 0x80) >> 6);
cur <<= 1; /* next bit */
*(pw_BufOut++) = 0x007f + ((cur & 0x80) >> 6);
cur <<= 1; /* next bit */
*(pw_BufOut++) = 0x007f + ((cur & 0x80) >> 6);
cur <<= 1; /* next bit */
*(pw_BufOut++) = 0x007f + ((cur & 0x80) >> 6);
cur <<= 1; /* next bit */
*(pw_BufOut++) = 0x007f + ((cur & 0x80) >> 6);
cur <<= 1; /* next bit */
*(pw_BufOut++) = 0x007f + ((cur & 0x80) >> 6);
cur <<= 1; /* next bit */
*(pw_BufOut++) = 0x007f + ((cur & 0x80) >> 6);
cur <<= 1; /* next bit */
*(pw_BufOut++) = 0x007f + ((cur & 0x80) >> 6);
For the outer loop I changed it to just increment a pointer rather than use array[index] and an index test as your conditional. Many processors can actually do the pointer+offset addressing for you, and on those processors the pointer++ method may not be a win. In that case I would suggest reversing the outer loop and counting your index down until index < 0. Decrementing it just before the test will often set the same flags as explicitly testing the value against 0, and compilers usually take advantage of that when optimizations are turned on.
Another thing that you may want to try is to use bigger chunks than bytes as your input. You would have to worry about endian issues and non-word sized input arrays.
One more thing you may want to consider is not doing this operation for an entire variable-length buffer at one time. You could do one input byte per call and then pass that 8-word (16-byte) chunk of memory to something else (a piece of hardware, I assume). Then you may be able to reduce the memory requirements for your output array, which may improve cache performance.