Any theoretical limit to compression?
Imagi开发者_如何学运维ne that you had all the supercomputers in the world at your disposal for the next 10 years. Your task was to compress 10 full-length movies losslessly as much as possible. Another criteria was that a normal computer should be able to decompress it on the fly and should not need to spend much of his HD to install the decompressing software.
My question is, how much more compression could you achieve than the best alternatives today? 1%, 5%, 50%? More specifically: is there a theoretical limit to compression, given a fixed dictionary size (if it is called that for video compression as well)?
The limits of compression are dictated by the randomness of the source. Welcome to the study of information theory! See data compression.
There is a theoretical limit: I suggest reading this article on Information theory and the pigeon hole principle. It seems to sum up the issue in a very easy to understand way.
If you have a fixed catalogue of all the movies you were ever going to compress, you could just send an id for the movie and have the "decompression" lookup up the data with that index. So compression could be to a fixed size of log2(N) bits, where N was the number of movies.
I suspect the practical lower bound is rather higher than this.
Do you really mean lossless? Most of today's video compression is lossy, I thought.
It is important to redefine the limits with the latest developments regarding information theory. Therefore, it is essential to report the hypotheses for which the limit is valid.
In information theory, 3 fundamental hypotheses are used which are the following:
the information is defined by the entropy function H(X).
the information that identifies the source is known both by the encoder and by the decoder.
the source and its isomorphisms are considered. It means that we can decode a symbol at a time.
First limit, the most famous, defined by Shannon in which all 3 hypotheses are true.
NH(X) With H(X) entropy of the source X.
Second limit. we remove the second hypothesis the decoder does not know the source.
NH(X)+source information
Third limit, let's remove the third hypothesis. In this case, the Set Shaping Theory SST is used, a new method that is revolutionizing information theory. This theory studies the one-to-one functions f that transform a set of strings into a set of equal size made up of strings of greater length. With this method, we get the following limit:
N2H(Y)+ source information≈NH(X) with f(X)=Y and N2>N.
In practice, we obtain a gain in terms of compression equivalent to the information necessary to describe the source is obtained. The information needed to describe the source represents the inefficiency of the entropy coding.
In this case, however, it is not possible to decode a symbol at a time (the code is not instantaneous) but the message must be decoded in full before obtaining the original message.
Important progress has been made in this area. It was possible to apply this theory to a concrete case of data compression "Practical applications of Set Shaping Theory in Huffman coding".
Another interesting aspect is that the authors shared the code and the function that performs the transform described in the set shaping theory. The file is shared on Matlab file exchange: https://www.mathworks.com/matlabcentral/fileexchange/115590-test-sst-huffman-coding?s_tid=FX_rc1_behav
精彩评论