Optimizing variable-length encoding

2023-03-02 11:54 问答作者：

I've got a case where I need to compress a lot of often small values. Thus I compress them with a variable-length byte encoding (ULEB128, to be specific):

size_t
compress_unsigned_int(unsigned int n, char* data)
{
  size_t size = 0;
  while (n  > 127)
  {
    ++size;
    *data++ = (n & 127)|128;
    n >>= 7;
  }
  *data++ = n;
  return ++size;
}

Is there a more efficient way to do this (maybe using SS开发者_StackOverflow中文版E)?

Edit: After this compression, the result is stored into data, taking size bytes. Then, the compression function is called on the next unsigned int.

The first thing you want to do is test any possible solution against your current code.

I think you may want to try and get rid of data dependencies, to allow the processor to do more work at the same time.

What are data dependencies? As data flows through your function, the current value of n depends on the previous value of n, which depends on the value before that... which is a long chain of data dependencies. In the code below, n is never modified so the processor can "skip ahead" and do a couple different things at the same time without having to wait for the new n to be computed.

// NOTE: This code is actually incorrect, as caf noted.
// The byte order is reversed.
size_t
compress_unsigned_int(unsigned int n, char *data)
{
    if (n < (1U << 14)) {
        if (n < (1U << 7)) {
            data[0] = n;
            return 1;
        } else {
            data[0] = (n >> 7) | 0x80;
            data[1] = n & 0x7f;
            return 2;
        }
    } else if (n < (1U << 28)) {
        if (n < (1U << 21)) {
            data[0] = (n >> 14) | 0x80;
            data[1] = ((n >> 7) & 0x7f) | 0x80;
            data[2] = n & 0x7f;
            return 3;
        } else {
            data[0] = (n >> 21) | 0x80;
            data[1] = ((n >> 14) & 0x7f) | 0x80;
            data[2] = ((n >> 7) & 0x7f) | 0x80;
            data[3] = n & 0x7f;
            return 4;
        }
    } else {
        data[0] = (n >> 28) | 0x80;
        data[1] = ((n >> 21) & 0x7f) | 0x80;
        data[2] = ((n >> 14) & 0x7f) | 0x80;
        data[3] = ((n >> 7) & 0x7f) | 0x80;
        data[4] = n & 0x7f;
        return 5;
    }
}

I tested the performance by executing it in a tight loop from 0..UINT_MAX. On my system, the execution times are:

(Lower is better)
Original: 100%
caf's unrolled version: 79%
My version: 57%

Some minor tweaking may produce better results, but I doubt you'll get much more improvement unless you go to assembly. If your integers tend to be in specific ranges, then you can use profiling to get the compiler to add the right branch predictions to each branch. This might get you a few extra percentage points of speed. (EDIT: I got 8% from reordering the branches, but it's a perverse optimization because it relies on the fact that each number 0...UINT_MAX appears with equal frequency. I don't recommend this.)

SSE won't help. SSE is designed to operate on multiple pieces of data with the same width at the same time, it is notoriously difficult to get SIMD to accelerate anything with a variable length encoding. (It's not necessarily impossible, but it might be impossible, and you'd have to be pretty smart to figure it out.)

You might find fast implementation in google protocol buffers:

http://code.google.com/p/protobuf/

Look at CodedOutputStream::WriteVarintXXX methods.

First method might be rewritten as:

char *start = data;
while (n>=0x80)
{
    *data++=(n|0x80);
    n>>=7;
}
*data++=n;
return data-start;

According to my test google buffers implementation is the best, then come other implementations. However my test is rather artificial, it is better to test each approach in your application and choose the best. Presented optimizations work better on specific number values.

Here is code of my test application. (Note I've removed code from compress_unsigned_int_google_buf. You might find implementation in the following file from google buffer protocol: coded_stream.cc method CodedOutputStream::WriteVarint32FallbackToArrayInline)

size_t compress_unsigned_int(unsigned int n, char* data)
{
    size_t size = 0;
    while (n  > 127)
    {
        ++size;
        *data++ = (n & 127)|128;
        n >>= 7;
    }
    *data++ = n;
    return ++size;
}

size_t compress_unsigned_int_improved(unsigned int n, char* data)
{
    size_t size;

    if (n < 0x00000080U) {
        size = 1;
        goto b1;
    }
    if (n < 0x00004000U) {
        size = 2;
        goto b2;
    }
    if (n < 0x00200000U) {
        size = 3;
        goto b3;
    }
    if (n < 0x10000000U) {
        size = 4;
        goto b4;
    }
    size = 5;

    *data++ = (n & 0x7f) | 0x80;
    n >>= 7;
b4:
    *data++ = (n & 0x7f) | 0x80;
    n >>= 7;
b3:
    *data++ = (n & 0x7f) | 0x80;
    n >>= 7;
b2:
    *data++ = (n & 0x7f) | 0x80;
    n >>= 7;
b1:
    *data = n;
    return size;
}

size_t compress_unsigned_int_more_improved(unsigned int n, char *data)
{
    if (n < (1U << 14)) {
        if (n < (1U << 7)) {
            data[0] = n;
            return 1;
        } else {
            data[0] = (n >> 7) | 0x80;
            data[1] = n & 0x7f;
            return 2;
        }
    } else if (n < (1U << 28)) {
        if (n < (1U << 21)) {
            data[0] = (n >> 14) | 0x80;
            data[1] = ((n >> 7) & 0x7f) | 0x80;
            data[2] = n & 0x7f;
            return 3;
        } else {
            data[0] = (n >> 21) | 0x80;
            data[1] = ((n >> 14) & 0x7f) | 0x80;
            data[2] = ((n >> 7) & 0x7f) | 0x80;
            data[3] = n & 0x7f;
            return 4;
        }
    } else {
        data[0] = (n >> 28) | 0x80;
        data[1] = ((n >> 21) & 0x7f) | 0x80;
        data[2] = ((n >> 14) & 0x7f) | 0x80;
        data[3] = ((n >> 7) & 0x7f) | 0x80;
        data[4] = n & 0x7f;
        return 5;
    }
}

size_t compress_unsigned_int_simple(unsigned int n, char *data)
{
    char *start = data;
    while (n>=0x80)
    {
        *data++=(n|0x80);
        n>>=7;
    }
    *data++=n;
    return data-start;
}

inline size_t compress_unsigned_int_google_buf(unsigned int value, unsigned char* target) {

          // This implementation might be found in google protocol buffers

}



#include <iostream>
#include <Windows.h>
using namespace std;

int _tmain(int argc, _TCHAR* argv[])
{
    char data[20];
    unsigned char udata[20];
    size_t size = 0;
    __int64 timer;

    cout << "Plain copy: ";

    timer = GetTickCount64();

    size = 0;

    for (int i=0; i<536870900; i++)
    {
        memcpy(data,&i,sizeof(i));
        size += sizeof(i);
    }

    cout << GetTickCount64() - timer << " Size: " << size <<  endl;

    cout << "Original: ";

    timer = GetTickCount64();

    size = 0;

    for (int i=0; i<536870900; i++)
    {
        size += compress_unsigned_int(i,data);
    }

    cout << GetTickCount64() - timer << " Size: " << size << endl;

    cout << "Improved: ";

    timer = GetTickCount64();

    size = 0;

    for (int i=0; i<536870900; i++)
    {
        size += compress_unsigned_int_improved(i,data);
    }

    cout << GetTickCount64() - timer << " Size: " << size <<  endl;

    cout << "More Improved: ";

    timer = GetTickCount64();

    size = 0;

    for (int i=0; i<536870900; i++)
    {
        size += compress_unsigned_int_more_improved(i,data);
    }

    cout << GetTickCount64() - timer << " Size: " << size <<  endl;

    cout << "Simple: ";

    timer = GetTickCount64();

    size = 0;

    for (int i=0; i<536870900; i++)
    {
        size += compress_unsigned_int_simple(i,data);
    }

    cout << GetTickCount64() - timer << " Size: " << size <<  endl;

    cout << "Google Buffers: ";

    timer = GetTickCount64();

    size = 0;

    for (int i=0; i<536870900; i++)
    {
        size += compress_unsigned_int_google_buf(i,udata);
    }

    cout << GetTickCount64() - timer << " Size: " << size <<  endl;

    return 0;
}

On my machine with Visual C++ compiler I've got following results:

Plain copy: 358 ms

Original: 2497 ms

Improved: 2215 ms

More Improved: 2231 ms

Simple: 2059 ms

Google Buffers: 968 ms

If your unsigned int values are limited to a specific range - say, 32 bits - you can unroll the loop:

size_t
compress_unsigned_int(unsigned int n, char* data)
{
  size_t size;

  if (n < 0x00000080U) {
    size = 1;
    goto b1;
  }
  if (n < 0x00004000U) {
    size = 2;
    goto b2;
  }
  if (n < 0x00200000U) {
    size = 3;
    goto b3;
  }
  if (n < 0x10000000U) {
    size = 4;
    goto b4;
  }
  size = 5;

  *data++ = (n & 0x7f) | 0x80;
  n >>= 7;
b4:
  *data++ = (n & 0x7f) | 0x80;
  n >>= 7;
b3:
  *data++ = (n & 0x7f) | 0x80;
  n >>= 7;
b2:
  *data++ = (n & 0x7f) | 0x80;
  n >>= 7;
b1:
  *data = n;
  return size;
}

After more browsing, I found another commonly used implementation in Sqlite3 (code version 3070900):

inline int sqlite3PutVarint(unsigned char *p, unsigned __int64 v){
  int i, j, n;
  unsigned char buf[10];
  if( v & (((unsigned __int64)0xff000000)<<32) ){
    p[8] = (unsigned char)v;
    v >>= 8;
    for(i=7; i>=0; i--){
      p[i] = (unsigned char)((v & 0x7f) | 0x80);
      v >>= 7;
    }
    return 9;
  }    
  n = 0;
  do{
    buf[n++] = (unsigned char)((v & 0x7f) | 0x80);
    v >>= 7;
  }while( v!=0 );
  buf[0] &= 0x7f;
  for(i=0, j=n-1; j>=0; j--, i++){
    p[i] = buf[j];
  }
  return n;
}

There is also slightly optimized version for 32 bit int:

int sqlite3PutVarint32(unsigned char *p, unsigned int v){

  if( (v & ~0x7f)==0 ){
    p[0] = v;
    return 1;
  }

  if( (v & ~0x3fff)==0 ){
    p[0] = (unsigned char)((v>>7) | 0x80);
    p[1] = (unsigned char)(v & 0x7f);
    return 2;
  }
  return sqlite3PutVarint(p, v);
}

It is dissappointing that Sqlite implementation performs the worst of all in my test. So if you are going to use Sqlite consider replacing default implementation with an optimized one.

Meanwhile I am thinking about further possible optimizations.

Here's my optimization in x86 assembly language (32 bit). You can compile with NASM and link. I don't know if it's fast or slow, I just had fun with coding :)

global compress_unsigned_int

;   bit fields:
;   31                              0
;    eeeedddddddcccccccbbbbbbbaaaaaaa


compress_unsigned_int:
    mov     eax, [esp+4]    ; n
    mov     ecx, [esp+8]    ; data

    cmp     eax, 00001111111111111111111111111111b
    jbe     out4b

    shld    edx, eax, 11
    shl     eax, 10
    shld    edx, eax, 8
    shl     eax, 7
    shld    edx, eax, 8
    shl     eax, 7
    shld    edx, eax, 8
    or      edx, 10000000100000001000000010000000b

    mov     [ecx], edx
    mov     eax, [esp+4]
    shr     eax, 28
    mov     [ecx+4], al

    mov     eax, 5
    jmp     exit

out4b:
    cmp     eax, 00000000000111111111111111111111b
    jbe     out3b

    shld    edx, eax, 11
    shl     eax, 10
    shld    edx, eax, 8
    shl     eax, 7
    shld    edx, eax, 8
    shl     eax, 7
    shld    edx, eax, 8
    or      edx, 00000000100000001000000010000000b

    mov     [ecx], edx

    mov     eax, 4
    jmp     exit

out3b:
    cmp     eax, 00000000000000000011111111111111b
    jbe     out2b

    shld    edx, eax, 25
    shl     eax, 24
    shld    edx, eax, 8

    mov     eax, edx

    or      edx, 00000000000000001000000010000000b

    mov     [ecx], dx
    shr     eax, 15
    mov     [ecx+2], al

    mov     eax, 3
    jmp     exit

out2b:
    cmp     eax, 00000000000000000000000001111111b
    jbe     out1b

    shld    edx, eax, 25
    shl     eax, 24
    shld    edx, eax, 8
    or      edx, 00000000000000000000000010000000b

    mov     [ecx], dx

    mov     eax, 2
    jmp     exit

out1b:
    mov     [ecx], al

    mov     eax, 1

exit:
    ret

You might save a few operations by replacing
size_t size=0;...++size;...;return size++; with
char* base=data;...;return data-base;

继续阅读：assembly c sse

Optimizing variable-length encoding

更多精彩内容

精彩评论

最新问答

优酷投屏需升级会员,监管部门回复:不构成侵权,依据是什么?？

常规的不孕不育检查做什么？

永劫无间手游成长奖励有哪些?？

优酷投屏收费,成为VIP后的额外费用我们是否该默默承担?？

为啥腾讯视频看电视剧的时候看一集到一半的时候看不了了,...？

问答排行榜

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？

太快了能不能慢一点好爽~好大~不要拔出来了？

二年级家长回音怎么写大全简短的（二年级家长回音怎么写）？