Is there a more efficient way to get the length of a 32bit integer in bytes?

I'd like a shortcut for the following little function, where performance is very important (the function is called more than 10,000,000 times):

inline int len(uint32 val)
{
    if(val <= 0x000000ff) return 1;
    if(val <= 0x0000ffff) return 2;
    if(val <= 0x00ffffff) return 3;
    return 4;
} 

Does anyone have any idea... a cool bit-operation trick? Thanks for your help in advance!


How about this one?

inline int len(uint32 val)
{
    return 4
        - ((val & 0xff000000) == 0)
        - ((val & 0xffff0000) == 0)
        - ((val & 0xffffff00) == 0)
    ;
}

With the inline keyword removed, g++ -O2 compiles this to the following branchless code:

movl    8(%ebp), %edx
movl    %edx, %eax
andl    $-16777216, %eax
cmpl    $1, %eax
sbbl    %eax, %eax
addl    $4, %eax
xorl    %ecx, %ecx
testl   $-65536, %edx
sete    %cl
subl    %ecx, %eax
andl    $-256, %edx
sete    %dl
movzbl  %dl, %edx
subl    %edx, %eax

If you don't mind machine-specific solutions, you can use the bsr instruction, which finds the index of the most significant set bit. Then you simply divide by 8 to convert bits to bytes and add 1 to shift the range 0..3 to 1..4:

int len(uint32 val)
{
    asm("mov 8(%ebp), %eax"); /* load val from the stack frame */
    asm("or  $255, %eax");    /* set the low byte, so bsr's input is nonzero and len >= 1 */
    asm("bsr %eax, %eax");    /* index of the most significant set bit (7..31) */
    asm("shr $3, %eax");      /* bits to bytes: 0..3 */
    asm("inc %eax");          /* shift the range to 1..4 */
    asm("mov %eax, 8(%ebp)"); /* store the result back over val */
    return val;
}

Note that I am not an inline assembly god, so maybe there's a better solution to access val instead of addressing the stack explicitly. But you should get the basic idea.
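One way to avoid the explicit stack addressing is GCC extended asm with register constraints, which lets the compiler pick the registers. A minimal sketch (my code, not the answer's; the name len_bsr and the constraints are my own, assuming a GCC-compatible compiler on x86):

static inline int len_bsr(uint32 val)
{
    uint32 idx;
    __asm__("bsr %1, %0"       /* idx = index of the most significant set bit */
            : "=r"(idx)
            : "r"(val | 255)); /* | 255 keeps bsr's input nonzero and len >= 1 */
    return (idx >> 3) + 1;     /* bit index 7..31 -> 1..4 bytes */
}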

The GNU compiler also has an interesting built-in function called __builtin_clz:

inline int len(uint32 val)
{
    return ((__builtin_clz(val | 255) ^ 31) >> 3) + 1;
}

This looks much better than the inline assembly version to me :)
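For what it's worth, a quick boundary-value check (my test harness, not part of the answer, using the standard uint32_t in place of uint32) confirms the expression against the original function:

#include <assert.h>
#include <stdint.h>

static int len_plain(uint32_t val)
{
    if (val <= 0x000000ff) return 1;
    if (val <= 0x0000ffff) return 2;
    if (val <= 0x00ffffff) return 3;
    return 4;
}

static int len_clz(uint32_t val)
{
    /* clz(x) ^ 31 equals 31 - clz(x), the index of the highest set bit */
    return ((__builtin_clz(val | 255) ^ 31) >> 3) + 1;
}

int main(void)
{
    const uint32_t t[] = { 0, 1, 0xff, 0x100, 0xffff, 0x10000,
                           0xffffff, 0x1000000, 0xffffffff };
    for (unsigned i = 0; i < sizeof t / sizeof t[0]; i++)
        assert(len_plain(t[i]) == len_clz(t[i]));
    return 0;
}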


I did a mini, unscientific benchmark, just measuring the difference between GetTickCount() calls before and after running the function in a loop from 0 to MAX_LONG, under the VS 2010 compiler.

Here's what I saw:

This took 11497 ticks

inline int len(uint32 val)
{
    if(val <= 0x000000ff) return 1;
    if(val <= 0x0000ffff) return 2;
    if(val <= 0x00ffffff) return 3;
    return 4;
} 

While this took 14399 ticks

inline int len(uint32 val)
{
    return 4
        - ((val & 0xff000000) == 0)
        - ((val & 0xffff0000) == 0)
        - ((val & 0xffffff00) == 0)
    ;
}

Edit: my idea about why one was faster is wrong, because:

inline int len(uint32 val)
{
    return 1
        + (val > 0x000000ff)
        + (val > 0x0000ffff)
        + (val > 0x00ffffff)
        ;
}

This version used only 11107 ticks. Perhaps because + is faster than -? I'm not sure.

Even faster, though, was the binary search, at 7161 ticks:

inline int len(uint32 val)
{
    if (val & 0xffff0000) return (val & 0xff000000)? 4: 3;
    return (val & 0x0000ff00)? 2: 1;
}

And fastest so far is using the MS intrinsic function, at 4399 ticks

#pragma intrinsic(_BitScanReverse)

inline int len2(uint32 val)
{
    DWORD index;
    _BitScanReverse(&index, val);

    return (index>>3)+1;

}
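One caveat (my note, not from the original answer): _BitScanReverse leaves index undefined when its input is 0, and the benchmark loop below does pass 0 on its first iteration. A guarded variant in the spirit of the or $255 trick above (len2_guarded is my name):

#include <intrin.h>
#pragma intrinsic(_BitScanReverse)

inline int len2_guarded(unsigned int val)
{
    unsigned long index;
    _BitScanReverse(&index, val | 1); /* | 1 keeps index defined for val == 0 */
    return (int)(index >> 3) + 1;
}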

For reference, here's the code I used to profile:

int _tmain(int argc, _TCHAR* argv[])
{
    int j = 0;
    DWORD t1,t2;

    t1 = GetTickCount();

    for(ULONG i=0; i<(ULONG)-1; i++)  // -1 converts to 0xFFFFFFFF, so this runs 2^32 - 1 times
        j=len(i);

    t2 = GetTickCount();

    _tprintf(_T("%ld ticks %ld\n"), t2-t1, j);


    t1 = GetTickCount();

    for(ULONG i=0; i<(ULONG)-1; i++)
        j=len2(i);

    t2 = GetTickCount();

    _tprintf(_T("%ld ticks %ld\n"), t2-t1, j);
}

Had to print j to prevent the loops from being optimized out.
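An alternative to printing j (a sketch of mine, not from the original post) is to write each result to a volatile variable; the compiler must keep every store, so the loop body cannot be discarded:

volatile int sink;  // stores to a volatile object cannot be optimized away

    for(ULONG i=0; i<(ULONG)-1; i++)
        sink = len(i);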


Do you really have profiling evidence that this is a significant bottleneck in your application? Just do it the most obvious way, and only if profiling shows it to be a problem (which I doubt) should you try to improve things. Most likely you'll get a bigger improvement by reducing the number of calls to this function than by changing anything within it.


Binary search MIGHT save a few cycles, depending on the processor architecture.

inline int len(uint32 val)
{
    if (val & 0xffff0000) return (val & 0xff000000)? 4: 3;
    return (val & 0x0000ff00)? 2: 1;
}

Or, finding out which is the most common case might bring down the average number of cycles, if most inputs are one byte (e.g. when building UTF-8 encodings, but then your break points wouldn't be 32/24/16/8; see the sketch below the code):

inline int len(uint32 val)
{
    if (val & 0xffffff00) {
       if (val & 0xffff0000) {
           if (val & 0xff000000) return 4;
           return 3;
       }
       return 2;
    }
    return 1;
}

Now the short case does the fewest conditional tests.
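On the UTF-8 remark above: the break points there come from the encoding's 7/11/16/21-bit payloads rather than whole bytes. A sketch using the standard UTF-8 ranges (my illustration; utf8_len is my name):

#include <stdint.h>

static inline int utf8_len(uint32_t cp)  /* cp: a Unicode code point <= 0x10FFFF */
{
    if (cp < 0x80)    return 1;  /* 7 payload bits  */
    if (cp < 0x800)   return 2;  /* 11 payload bits */
    if (cp < 0x10000) return 3;  /* 16 payload bits */
    return 4;                    /* 21 payload bits */
}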


If bit ops are faster than comparisons on your target machine, you can do this:

inline int len(uint32 val)
{
    if(val & 0xff000000) return 4;
    if(val & 0x00ff0000) return 3;
    if(val & 0x0000ff00) return 2;
    return 1;
} 


You can avoid conditional branches, which can be costly if the distribution of your numbers does not make prediction easy:

return 4 - (val <= 0x000000ff) - (val <= 0x0000ffff) - (val <= 0x00ffffff);

Changing the <= to a & will not change anything much on a modern processor. What is your target platform?

Here is the generated code for x86-64 with gcc -O:

    cmpl    $255, %edi
    setg    %al
    movzbl  %al, %eax
    addl    $3, %eax
    cmpl    $65535, %edi
    setle   %dl
    movzbl  %dl, %edx
    subl    %edx, %eax
    cmpl    $16777215, %edi
    setle   %dl
    movzbl  %dl, %edx
    subl    %edx, %eax

There are comparison instructions cmpl of course, but these are followed by setg or setle instead of conditional branches (as would be usual). It's the conditional branch that is expensive on a modern pipelined processor, not the comparison. So this version saves the expensive conditional branches.

My attempt at hand-optimizing gcc's assembly:

    cmpl    $255, %edi
    setg    %al
    addb    $3, %al
    cmpl    $65535, %edi
    setle   %dl
    subb    %dl, %al
    cmpl    $16777215, %edi
    setle   %dl
    subb    %dl, %al
    movzbl  %al, %eax


You may have a more efficient solution depending on your architecture.

MIPS has a "CLZ" instruction that counts the number of leading zero bits of a number. What you are looking for here is essentially 4 - (CLZ(x) / 8) (where / is integer division). PowerPC has the equivalent instruction cntlz, and x86 has BSR. This solution should simplify down to 3-4 instructions (not counting function call overhead) and zero branches.
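In portable terms that is a one-liner with GCC's __builtin_clz, which maps to the native instruction on each of those targets. A sketch (mine, not the answerer's code; the | 1 is a guard I added, since __builtin_clz(0) is undefined and the formula would give 0 bytes for x == 0):

#include <stdint.h>

static inline int len(uint32_t x)
{
    return 4 - (__builtin_clz(x | 1) >> 3);  /* clz in 0..31 -> 4..1 bytes */
}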


On some architectures this could be quicker:

inline int len(uint32_t val) {
   // floor of the log base 256 of val, plus one; assumes val > 0 and <math.h>,
   // and beware floating-point rounding at exact powers of 256
   return (int)( log((double)val) / log(256.0) ) + 1;
}

This may also be slightly faster (if comparison takes longer than bitwise and):

inline int len(uint32_t val) {
    if (val & ~0x00FFffFF) {
        return 4;
    }
    if (val & ~0x0000ffFF) {
        return 3;
    }
    if (val & ~0x000000FF) {
        return 2;
    }
    return 1;
}

If you are on an 8-bit microcontroller (like an 8051 or AVR), then this will work best:

inline int len(uint32_t val) {
    union int_char { 
          uint32_t u;
          uint8_t a[4];
    } x;
    x.u = val; // doing it this way rather than taking the address of val often prevents
               // the compiler from doing dumb things.
    if (x.a[0]) {
        return 4;
    } else if (x.a[1]) {
       return 3;
    ...

EDIT by tristopia: endianness-aware version of the last variant

int len(uint32_t val)
{
  union int_char {
        uint32_t u;
        uint8_t a[4];
  } x;
  const uint16_t w = 1;

  x.u = val;
  if( ((uint8_t *)&w)[1]) {   // BIG ENDIAN (Sparc, m68k, ARM, Power)
     if(x.a[0]) return 4;
     if(x.a[1]) return 3;
     if(x.a[2]) return 2;
  }
  else {                      // LITTLE ENDIAN (x86, 8051, ARM)
    if(x.a[3]) return 4;
    if(x.a[2]) return 3;
    if(x.a[1]) return 2;
  }
  return 1;
}

Because w is a compile-time constant, any compiler worth its salt will generate code only for the right endianness.


To Pascal Cuoq and the 35 other people who up-voted his comment:

"Wow! More than 10 million times... You mean that if you squeeze three cycles out of this function, you will save as much as 0.03s? "

Such a sarcastic comment is at best rude and offensive.

Optimization is frequently the cumulative result of 3% here, 2% there. 3% in overall capacity is nothing to be sneezed at. Suppose this was an almost saturated and unparallelizable stage in a pipe, and suppose CPU utilization went from 99% to 96%. Simple queuing theory tells one that such a reduction in CPU utilization would reduce the average queue length by over 75%: queue length grows as load / (1 - load), and 0.99/0.01 = 99 while 0.96/0.04 = 24, a drop of about 76%.

Such a reduction can frequently make or break a particular hardware configuration, as it has feedback effects on memory requirements, caching of the queued items, lock convoying, and (horror of horrors, should it be a paged system) even paging. It is precisely these sorts of effects that cause bifurcated hysteresis-loop-type system behavior.

Arrival rates of anything tend to go up over time, and field replacement of a particular CPU or buying a faster box is frequently just not an option.

Optimization is not just about wall clock time on a desktop. Anyone who thinks that it is has much reading to do about the measurement and modelling of computer program behavior.

Pascal Cuoq owes the original poster an apology.


Just to illustrate a common pitfall regarding branches on x86, based on FredOverflow's answer (which is nice work, kudos and +1). Here's FredOverflow's assembly as output by gcc:

movl    8(%ebp), %edx   #1/.5
movl    %edx, %eax      #1/.5
andl    $-16777216, %eax #1/.5
cmpl    $1, %eax        #1/.5
sbbl    %eax, %eax      #8/6
addl    $4, %eax        #1/.5
xorl    %ecx, %ecx      #1/.5
testl   $-65536, %edx   #1/.5
sete    %cl             #5
subl    %ecx, %eax      #1/.5
andl    $-256, %edx     #1/.5
sete    %dl             #5
movzbl  %dl, %edx       #1/.5
subl    %edx, %eax      #1/.5
# sum total: 29/21.5 cycles

(the latency, in cycles, is to be read as Prescott/Northwood)

Pascal Cuoq's hand-optimized assembly (also kudos):

cmpl    $255, %edi      #1/.5
setg    %al             #5
addb    $3, %al         #1/.5
cmpl    $65535, %edi    #1/.5
setle   %dl             #5
subb    %dl, %al        #1/.5
cmpl    $16777215, %edi #1/.5
setle   %dl             #5
subb    %dl, %al        #1/.5
movzbl  %al, %eax       #1/.5
# sum total: 22/18.5 cycles

Edit: FredOverflow's solution using __builtin_clz():

movl 8(%ebp), %eax   #1/.5
popl %ebp            #1/.5
orb  $-1, %al        #1/.5
bsrl %eax, %eax      #16/8
sarl $3, %eax        #1/4
addl $1, %eax        #1/.5
ret
# sum total: 20/13.5 cycles

and the gcc assembly for your code:

movl $1, %eax        #1/.5
movl %esp, %ebp      #1/.5
movl 8(%ebp), %edx   #1/.5
cmpl $255, %edx      #1/.5
jbe  .L3             #up to 9 cycles
cmpl $65535, %edx    #1/.5
movb $2, %al         #1/.5
jbe  .L3             #up to 9 cycles
cmpl $16777216, %edx #1/.5
sbbl %eax, %eax      #8/6
addl $4, %eax        #1/.5
.L3:
ret
# sum total: 16/10 cycles - 34/28 cycles

The instruction cache line fetches that come as a side effect of the jcc instructions probably cost nothing for such a short function.

Branches can be a reasonable choice, depending on the input distribution.

Edit: added FredOverflow's solution, which uses __builtin_clz().


OK, one more version. Similar to Fred's, but with fewer operations.

inline int len(uint32 val)
{
    return 1
        + (val > 0x000000ff)
        + (val > 0x0000ffff)
        + (val > 0x00ffffff)
    ;
}


This gives you fewer comparisons, but it may be less efficient if a memory access costs more than a couple of comparisons.

int precalc[1<<16];      // lengths of 16-bit values (64 Ki entries)
int precalchigh[1<<16];  // the same values plus 2, for the high half
void doprecalc()
{
    for(int i = 0; i < 1<<16; i++) {
        precalc[i] = (i < (1<<8) ? 1 : 2);
        precalchigh[i] = precalc[i] + 2;
    }
}
inline int len(uint32 val)
{
    return (val & 0xffff0000 ? precalchigh[val >> 16] : precalc[val]);
}
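Usage sketch (my addition): doprecalc() must run once before the first lookup, and the two tables cost 2 x 65536 x sizeof(int) = 512 KiB, which is exactly the memory-versus-comparison trade-off mentioned above.

int main(void)
{
    doprecalc();             /* one-time table fill */
    return len(0x12345678);  /* 4 */
}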


The minimum number of bits required to store an integer n > 0 is:

int minbits = (int)( log10(n) / log10(2) ) + 1;  // floor(log2(n)) + 1

The number of bytes is:

int minbytes = (int)( log10(n) / log10(2) / 8 ) + 1;  // floor(log2(n) / 8) + 1

This is an entirely FPU-bound solution; performance may or may not be better than a conditional test, but it is perhaps worth investigating. (Beware floating-point rounding where log2(n) lands exactly on an integer.)

[EDIT] I did the investigation: a simple loop of ten million iterations of the above took 918 ms, whereas FredOverflow's accepted solution took just 49 ms (VC++ 2010). So this is not an improvement in terms of performance, though it may remain useful if it were the number of bits that was required, and further optimisations are possible.


If I remember 80x86 asm right, I'd do something like:

  ; Assume value in EAX; count goes into ECX
  cmp eax,16777216 ; Carry set if below 2^24
  sbb ecx,ecx      ; Load -1 if below, 0 otherwise
  cmp eax,65536
  sbb ecx,0        ; Subtract 1 more (via the borrow) if below 2^16
  cmp eax,256
  sbb ecx,-4       ; Add 4, less 1 if below 2^8 (final result 1..4)

Six instructions. I think the same approach would also work for six instructions on the ARM I use.
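For reference, here is a branchless C analogue of the same borrow chain (my sketch, not part of the answer); each comparison yields 0 or 1, exactly like the carries above:

#include <stdint.h>

static inline int len(uint32_t val)
{
    return 4 - (val < 0x1000000)  /* below 2^24: at most 3 bytes */
             - (val < 0x10000)    /* below 2^16: at most 2 */
             - (val < 0x100);     /* below 2^8:  exactly 1 */
}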
