Binary run length encoding
I have a web开发者_如何转开发 form, for the contents of which I would like to generate a short representation in Base64. The form, among other things, contains a list of 264 binary values, the greater part of which are going to be 0 at any single time. (They represent regions on a geographical map). Even in Base64, this 264-bit number generates a long, intimidating string. I want to implement run-length encoding, as efficiently as possible. Can you help me with this? I've googled binary RLE, but have found nothing of use.
What I've tried this far - running RLE on the binary string using decimal counts and "A" as a separator denoting a change between 0 and 1, then converting the result from base 11 to base 64. For example:
00000000001111111000000010000000000000000000000001111111110001111010101000000000000000000000000000000000000111111111110111000000000000111111100000001000000000000000000000000111111111000111101010100000000000000000000000000000000000011111111111011100
becomes
10A5A5AA22A7A1A2AAAAAAA34A9AA1A10A5A5AA22A7A1A2AAAAAAA34A9AA1A
which in turn becomes
CNnbr/FxkgbbOw0LNAKgk65P8SdvaTG+t74o
or, in base 62,
6imo7zq1pqr2mqglTHzXwJRAksm7fvHZHWQK
It's better, but I still can't help but doubt if I'm doing something wrong - is using the digit "A" as a separator is the best way to do this?
And another update:
Thanks to @comingstorm, I have shortened the compressed string some more.
ILHHASCAASBYwwccDASYgAEgWDI=
As I mentioned it in the comments, real usage cases would generally result in an even shorter string.
Since you're coding bits, you probably want to use a bit-based RLE instead of a byte-based one. In this context, you should consider Elias gamma coding (or some variant thereof) to efficiently encode your run lengths.
A reasonable first approximation for your encoding format might be:
- first bit = same as the first bit of the uncompressed string (to set initial polarity)
- remaining bits: Elias coded lengths of successive bit runs (alternating 1 and 0)
Since you know how many bits are in your uncompressed string, you don't need a termination code; you can just add any necessary binary padding as arbitrary bits.
Note that it is always possible for the run-length "compression" to expand your bit string; if you're concerned about this, you can add another initial bit to indicate whether your data is in compressed or uncompressed format, limiting your compression overhead to 1 bit.
264 bit, that are just 33 byte, and that are in base64 just 44 byte. I think this (very small) amount of information is hardly compressable. The sparse representation nulvinge refers too just stores the non zero elements and their values (as you have just 0/1), i.e. in your case just the index of the non zero bits. But as you have 264 possible bits - you need 9 bits for the index, which means, in case you have more than 29 non zero entries, you need already more than original.
Maybe your question is formulated wrong, but I dont see how 264 bits can lead to an intimidating base64 string (How do you generate it - maybe you translate not the 264 bits, but 264 ASCIIs chars (with the value 0
and 1
) - that would explain your long result string?).
An alternative that I think is more what you want is a sparse matrix: http://en.wikipedia.org/wiki/Sparse_matrix
精彩评论