开发者

Why can data be compressed only once?

So the compression process takes a chunk of binary data A and outputs a smaller chunk of binary data B. What characteristics of B make it unable to g开发者_开发问答o through this process again?


Data has something called entropy: the amount of new information each new bit gives. For example, 10101010101010101010 has low entropy because you don't need the next bit to know what comes next. A perfect compression algorithm would compress to maximum entropy, so every bit gives information and so cannot be removed, making the size a minimum.


It is not true that data that is already compressed cannot be compressed again. If you take a file consisting of 1 million zeros and compress it using gzip, the resulting compressed file is 1010 bytes. If you compress the compressed file again it is further reduced to just 75 bytes.

$ python
>>> f = open('0.txt', 'w')
>>> f.write('0'*1000000)
>>> f.close()
>>>
$ wc -c 0.txt
1000000 0.txt

$ gzip 0.txt
$ wc -c 0.txt.gz
1010 0.txt.gz

$ mv 0.txt.gz 0.txt
$ gzip 0.txt
$ wc -c 0.txt.gz
75 0.txt.gz

The reason why it is unlikely that compression works twice is because the compression process removes redundancy. When you have less redundancy it is harder to compress the file further.


For a very academic answer to this question, have a look at Information Etropy! If you're like me, though, the article will make your head hurt.

A simpler answer: Assume you could compress again and again, say by a factor of 10 each time. You could compress Wikipedia down to a gigabyte, then 100M, then 10M... do this 9 times and you'd be down to one byte. If all the information in Wikipedia could be compressed down to one byte, people would not have needed to write it, they could just have expanded one of the 256 possible bytes, one of them would have been the contents of Wikipedia :)

A slightly more sensible answer: Text is redundant: There is information in those bytes that could be expressed a little more tightly. The Wikipedia article mentions the fact that 'q' is almost always followed by 'u', for example. 'E' occurs more often than 'T'. And so forth. Similarly, in a program, often 0 is found more often than any other number. This consistency can be exploited and 'squeezed out.' But once you've done that once, the original redundance is mostly gone. The compressed file has hardly any more 'wasted bits.'


First, this only applies to lossless compression. Lossy compression (like jpg), can theoretically be applied over and over. Of course the quality of the compressed material drops each time.

For lossless compression, we can think of compression as a routine that takes some data and transforms it to another form (A->B). Since it is lossless, we must be able to then take B and go A<-B. If we follow this through, then it means that if we take every sequence of 4 bits (16 patterns) and compress them, we must get 16 different results. That means that on average, no compression was done!

Compression takes advantage of the fact that for certain kinds of data some sequences of data are less common. These less common forms will become larger when compressed. The more common forms that we have chosen our scheme for will then become smaller. On average, the messages are either the same size or larger.

Taking it one step further, if we repeatedly recompress the same message it will on average not change size (again, this is the best case).


Compression works by recognizing patterns and saying "this pattern is here, here and here, so I'll store it once and remember to put it there there and there when I decompress".

Most patterns would get caught in the first compression. You can achieve further compression after it's compress, but... there aren't many patterns left.


It's not that it can be compressed only once, it's that there is a minimum size you can compress any data before you start losing bits of it (the way you do with a low quality jpg or MP3 file). Most compression algorithms these days are good enough that one pass gets you within a couple of % of that so a second time isn't really worthwhile rather that in not being possible.

To understand the minimum size without reading too much theory, think of a question with two possible answers Yes and No. The smallest you can make this result is a single bit where 0 = No and 1 = Yes (or vice versa). Even that has made a bunch of assumptions (that the person receiving the data understands this encoding already for instance).

On a more complex level the same is true for all other data. In a situation where you have eight possible answers, all equally probable (this is important), the minimum size is three bits - the smallest number of bits to allow you eight options (000, 001, 010, 011, 100, 101, 110, 111).

There are some clever things you can do to reduce it a little under certain circumstances (for instance you use a smaller number of bits for very common answers at the expense of needing more than might be needed for less common ones but at a lower overall average) but ultimately there is a minimum amount of storage needed to contain the information.


Take a piece of paper and fold it - you've compressed it by 50%. Now do it again - and keep trying. Notice how it becomes harder and harder, and at some point you have to stop?

Data compression suffers from the same limits. Sure, you can compress it again, and you might save a little more space, but it's a clear example of diminishing returns - every further compression attempt requires more effort for marginal improvements.


For any number N, there are 2^(N+1)-1 different possible input files of length N bits or shorter. If every different input file will yield a different output file, then for every possible input file of length k which can be reduced to some shorter length, there must be at least one shorter file that gets longer.


You can compress data as much as you like, but the effect might not be what you want. After the first level of compression, if you run the same algorithm on it, it probably won't compress enough to make it worthwhile.

Think about this, here is your data:

1001 0011 1110 0100 0011 1001

I'll use a made-up compressor to tokenize by nybble (4 bits) the data as such:

if 1001, compress as 101 since no nybble begins with 101 and 1001 occurs twice if 0011, compress as 110 since no nybble begins with 110 and 0011 occurs twice

After compression:

101 110 1110 0100 110 101 or 1011 1011 1001 0011 0101

This wouldn't actually work in the real world, but as you can imagine, you could compress this again as it still is binary data.

The next compression does this:

if 1011, compress as 111

After compression: 111 111 1001 0011 0101 or 1111 1110 0100 1101 01

But as you can see, there aren't anymore duplicate nybbles, so the compressor I used would have nothing left to compress.

Again, this isn't a real compressor, just an easy way to understand the concept.


The problem of lossless compression is basically, how can this information be expressed more succinctly? For example, you could observe that in the preceding text, the character "e" is most often followed by a spacEand substitutEan uncommon character for that pattern. Similarly, a space followed by a "t" could bEreplaced by a different, uncommon sequence, andThEsequencEofTwo "s":es could similarly bEshortened. When UrunOutOf commN sequencesTo substitute, Ucannot proceed any furHer (or Umay haveTo switchToAdifferent pattern substitution strategy).

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜