Compression of numeric strings
Can anyone suggest compression algorithms to operate on numeric strings of 20-30 digits?
You can easily compress a 30-character string down to 15 bytes by storing each digit as a four-bit binary value. For example, 1592 can be represented as a series of four-bit values:
0001 0101 1001 0010
Grouped in pairs of four-bit values, this gives the two bytes 0x15 and 0x92. Neither is a printable ASCII character, so treat the result as binary data rather than text.
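As a rough illustration of the four-bit packing described above, here is a minimal Python sketch. The function names and the choice of 0xF as a padding nibble for odd-length strings are my own assumptions, not part of any standard.

```python
PAD = 0xF  # assumed filler nibble; never a valid digit, marks an odd-length input


def pack_bcd(digits: str) -> bytes:
    """Pack a string of decimal digits into bytes, two digits per byte."""
    if not digits or any(c not in "0123456789" for c in digits):
        raise ValueError("expected a non-empty string of ASCII digits")
    nibbles = [int(c) for c in digits]
    if len(nibbles) % 2:
        nibbles.append(PAD)  # pad so the nibbles pair up evenly
    # High nibble holds the first digit of each pair, low nibble the second.
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))


def unpack_bcd(data: bytes) -> str:
    """Reverse pack_bcd, dropping the padding nibble if present."""
    nibbles = []
    for b in data:
        nibbles.extend((b >> 4, b & 0x0F))
    if nibbles and nibbles[-1] == PAD:
        nibbles.pop()
    return "".join(str(n) for n in nibbles)
```

With this sketch, pack_bcd("1592") yields the two bytes 0x15 and 0x92, and a 30-digit string packs into 15 bytes.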
Further, if your strings contain many identical consecutive digits, you can implement a variation of Run-Length Encoding.
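One simple form of run-length encoding could look like the sketch below. The (digit, count) pair representation and the function names are illustrative assumptions. It only pays off when the strings really do contain long runs.

```python
from itertools import groupby


def rle_encode(digits):
    """Collapse each run of identical characters into a (char, run_length) pair."""
    return [(d, len(list(run))) for d, run in groupby(digits)]


def rle_decode(runs):
    """Expand (char, run_length) pairs back into the original string."""
    return "".join(d * n for d, n in runs)
```

For example, "1000000007" becomes [("1", 1), ("0", 8), ("7", 1)]. A string with no repeated digits would grow rather than shrink under this scheme.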
Assuming the strings can contain a decimal point (floating-point numbers), there are 11 possible symbols:
[0,1,2,3,4,5,6,7,8,9, .]
This means you need 4 bits per symbol, since 3 bits can represent at most 8 symbols. Using 4 bits per symbol instead of an 8-bit character halves the size.
If you only have integer digits in your string, an easy solution is to convert the number to hexadecimal. You still use 4 bits per symbol but get a better compression ratio, since none of the 16 possible values are wasted.
If you use Huffman compression you will get the optimal number of bits per symbol for a code that assigns a whole number of bits to each symbol. You can read more about Huffman compression here.
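To see what Huffman coding would buy for digit strings, the following sketch computes only the code length assigned to each symbol, not the bit patterns themselves. The function name is hypothetical, and symbol frequencies are taken from the input string itself. For ten equally likely digits it gives six 3-bit codes and four 4-bit codes, an average of 3.4 bits per digit instead of 4.

```python
import heapq
from collections import Counter


def huffman_code_lengths(text):
    """Return {symbol: code length in bits} for a Huffman code built from text."""
    if not text:
        raise ValueError("need at least one symbol")
    freq = Counter(text)
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): 1}
    # Heap entries are (weight, tiebreak, {symbol: depth}); the unique
    # integer tiebreak keeps Python from ever comparing the dicts.
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]
```

The gain over plain 4-bit packing is modest unless some digits occur much more often than others.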
Split it into two 15-digit numbers and store them as two 64-bit integers? Or are they floats?
Break it up into a couple of unsigned ints?
"9347692367596047327509604839"
becomes:
9 347692367 596047327 509604839
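A sketch of this chunking idea, assuming 9-digit groups (any 9-digit value fits in a 32-bit unsigned int) and assuming the string as a whole has no leading zeros, since those would be lost. The function names and the big-endian layout are my own choices.

```python
import struct


def pack_chunks(digits: str) -> bytes:
    """Split digits into 9-digit groups and store each as a 32-bit unsigned int."""
    if not digits or any(c not in "0123456789" for c in digits):
        raise ValueError("expected a non-empty string of ASCII digits")
    groups = []
    # Cut from the right so that only the leftmost group can be short.
    while digits:
        groups.append(int(digits[-9:]))
        digits = digits[:-9]
    groups.reverse()
    return struct.pack(f">{len(groups)}I", *groups)


def unpack_chunks(data: bytes) -> str:
    """Reverse pack_chunks; every group after the first is zero-padded to 9 digits."""
    groups = struct.unpack(f">{len(data) // 4}I", data)
    return str(groups[0]) + "".join(f"{g:09d}" for g in groups[1:])
```

The 28-digit example above becomes four 32-bit values, 16 bytes in total. Note the zero-padding on decode: a group such as 000000005 is stored as 5 and must be widened again.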
One obvious solution is to "compress" them as a binary numeric representation rather than a string representation. See this Stack Overflow question for example libraries.
I would definitely go for the easiest solution, and just store them as integers (of a suitable size, be it 32-bit, 64-bit or 128-bit, depending on needs; 30 decimal digits need about 100 bits, so 128-bit would be the fit here). Compressing it with an algorithm supporting characters would waste a lot of space, since it would have to cater for a lot more than 10 different values (0-9) per character.
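In a language with arbitrary-precision integers this is nearly a one-liner. Below is a Python sketch using the built-in int.to_bytes and int.from_bytes. The function names and the optional width parameter (for restoring leading zeros) are assumptions for illustration.

```python
def digits_to_bytes(digits: str) -> bytes:
    """Store a decimal digit string as the shortest big-endian binary integer."""
    n = int(digits)
    length = max(1, (n.bit_length() + 7) // 8)  # at least one byte, even for 0
    return n.to_bytes(length, "big")


def bytes_to_digits(data: bytes, width: int = 0) -> str:
    """Reverse digits_to_bytes; pass width to restore leading zeros."""
    return str(int.from_bytes(data, "big")).zfill(width)
```

A 30-digit number takes at most 13 bytes this way, compared with 15 bytes for four bits per digit and 30 bytes as text.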
One of the most common ways to compress numbers (assuming you have more than one to compress; it's kind of hard to compress one thing) is delta encoding. It works on the principle that if you know the first number is x, and the numbers after it are relatively similar, you can encode the subsequent numbers as (x+c1), (x+c2), etc.
In this scheme, you only have to encode the full x value once, and if your c values are much smaller than x, you can save a lot of space. You can also use a version of this that sorts the numbers first, so that each delta refers to the previous number instead of a single base value. With this method you can cover a wider range of numbers more efficiently.
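The sorted variant might be sketched as follows. The function names are illustrative, and the sketch assumes the original order of the numbers does not need to be preserved, since sorting discards it.

```python
from itertools import accumulate


def delta_encode(numbers):
    """Sort the numbers, then keep the first value and each successive difference."""
    if not numbers:
        return []
    ordered = sorted(numbers)
    return [ordered[0]] + [b - a for a, b in zip(ordered, ordered[1:])]


def delta_decode(deltas):
    """Rebuild the sorted numbers by taking a running sum of the deltas."""
    return list(accumulate(deltas))
```

For instance, [1000000007, 1000000001, 1000000012] encodes to [1000000001, 6, 5]. The small deltas can then be stored in far fewer bits than the full values, for example with a variable-length integer encoding.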