Compression performance on certain types of data
I am testing my new image file format which, without going into unnecessary detail, consists of PPM RGB 24-bit-per-pixel data run through zlib's compression stream, with an 8-byte header prepended.
While writing tests to evaluate the code that implements this, one test case produced pretty terrible results.
    unsigned char *image = new unsigned char[3000*3000*3];
    for (int i = 0; i < 3000*3000; ++i) {
        image[i*3]   = i % 255;         // red
        image[i*3+1] = (i/2) % 255;     // green
        image[i*3+2] = (i*i*i) % 255;   // blue
    }
Now what I'm doing here is creating a 3000x3000 fully packed 3-byte-per-pixel image whose red and green components increase steadily in stripes, while the blue component varies quite a bit.
When I compressed this using the zlib stream for my .ppmz format, it was able to reduce the size from 27,000,049 bytes (the reason it is not an even 27 million is that 49 bytes are in the headers) to 25,545,520 bytes. The compressed file is 94.6% of the original size.
This got me rather flustered at first, because I figured that even if the blue component was too chaotic to help much, at least the red and green components repeat themselves quite a bit. A smart enough compressor ought to be able to shrink it to about 1/3 the size...
To test that, I took the original 27MB uncompressed file and RAR'd it, and it came out to 8,535,878 bytes. This is quite good, at 31.6%, even better than one-third!
Then I realized I had made a mistake defining my test image. I was using mod 255 when I wanted the bytes to wrap through their full 0–255 range, which is mod 256:
    unsigned char *image = new unsigned char[3000*3000*3];
    for (int i = 0; i < 3000*3000; ++i) {
        image[i*3]   = i % 256;         // red
        image[i*3+1] = (i/2) % 256;     // green
        image[i*3+2] = (i*i*i) % 256;   // blue
    }
The thing is, there is now just one more value that my pixels can take, one which I was previously skipping. But when I ran my code again, the .ppmz became a measly 145,797-byte file. WinRAR squeezed it into 62 KB.
Why would this tiny change account for this massive difference? Even mighty WinRAR couldn't get the original file under 8 MB. What is it about values repeating every 256 steps, rather than every 255 steps, that changes things so completely? I get that with the %255 the first two color components' patterns are slightly out of phase, but the behavior is hardly random. And then there's just crazy modular arithmetic being dumped into the last channel. But I don't see how any of that could account for such a huge gap in performance.
I wonder if this is more of a math question than a programming question, but I really don't see how the original data could contain any more entropy than my newly modified data. I think the power-of-2 dependence indicates something about the algorithms themselves.
Update: I've done another test: I switched the third line back to (i*i*i)%255 but left the others at %256. The ppmz compression ratio rose a tiny bit to 94.65%, and RAR yielded a 30.9% ratio. So it appears they can handle the linearly increasing sequences just fine, even when they are out of sync, but something quite strange is going on whereby arithmetic mod 2^8 is a hell of a lot friendlier to our compression algorithms than other moduli.
Well, first of all, computers like powers of two. :)
Most such compression algorithms use compression blocks that are typically aligned to large powers of two. When your cycle aligns perfectly with these blocks, there is only one "unique sequence" to compress. If your data is not aligned, your sequence shifts a little across each block, and the algorithm may not be able to recognize it as one "sequence".
EDIT: (updated from comments)
The second reason is the integer overflow in i*i*i. The result is a double modulus: one mod 2^32 and then one mod 255. This double modulus greatly increases the length of the cycle, making it close to random and difficult for the compression algorithm to find the "pattern".
Mystical has a big part of the answer, but it also pays to look at the mathematical properties of the data itself, especially the blue channel.
(i * i * i) % 255 repeats with a period of 255, taking on 255 distinct values, all equally often. A naïve coder (one ignoring the pattern between different pixels, or between the R and B channels) would need 7.99 bits/pixel to code the blue channel.
(i * i * i) % 256, on the other hand:
is 0 whenever i is a multiple of 8 (8 cubed is 512, which is of course 0 mod 256);
is 64 whenever i is 4 more than a multiple of 8;
is 192 whenever i is 4 less than a multiple of 8 (together these cover all multiples of 4);
is one of 16 different values whenever i is an even non-multiple of 4, depending on i's residue mod 64;
takes on one of 128 distinct values whenever i is odd.
This makes for only 147 different possibilities for the blue pixel, with some occurring much more often than others, and a naïve entropy of 6.375 bits/pixel for the blue channel.