Condition for single bit code for a character in Huffman code?
This is a question I ran into in school settings, but it keeps bugging me so I decided to ask it here.
In Huffman compression, fixed-length sequences (characters) are encoded with variable-length sequences. The code sequence length depends on the frequencies (or probabilities) of the source characters.
My question is: what is the minimum highest character frequency with which that character is guaranteed to be encoded by a single bit?
It turns out that the answer is 0.4: if the highest frequency p satisfies p >= 0.4, a 1-bit code for the corresponding character is guaranteed. In other words, this is a sufficient condition.
It is also true that p >= 1/3 is a necessary condition. That is, there may be examples where 0.4 > p >= 1/3, and the shortest code is 1-bit, but there are no cases like that if p < 1/3.
The way to reason about this is to look at how the code tree is constructed, in particular at the frequencies of the 3 last surviving subtrees. A proof appears in Johnsen, "On the redundancy of binary Huffman codes", 1980 (unfortunately paywalled).
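You can check the two bounds empirically by building the tree and tracking code lengths. Below is a minimal sketch (the symbol sets and probabilities are made-up examples, not from the references above):

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Build a Huffman tree over {symbol: frequency} and return
    {symbol: code length in bits}."""
    tiebreak = count()  # unique counter so the heap never compares symbol tuples
    heap = [(f, next(tiebreak), (sym,)) for sym, f in freqs.items()]
    heapq.heapify(heap)
    depths = {sym: 0 for sym in freqs}
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        # Every symbol under a merged subtree sinks one level deeper.
        for sym in syms1 + syms2:
            depths[sym] += 1
        heapq.heappush(heap, (f1 + f2, next(tiebreak), syms1 + syms2))
    return depths

# p = 0.45 >= 0.4: a 1-bit code is guaranteed.
print(huffman_code_lengths({'a': 0.45, 'b': 0.25, 'c': 0.15, 'd': 0.15}))
# p = 0.30 < 1/3: no 1-bit code is possible.
print(huffman_code_lengths({'a': 0.30, 'b': 0.25, 'c': 0.25, 'd': 0.20}))
```

In the first distribution `'a'` ends up at depth 1; in the second it ends up at depth 2, consistent with the sufficient (p >= 0.4) and necessary (p >= 1/3) conditions.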
In general, about 50% of the incoming symbol stream would have to consist of a given symbol for Huffman to code it as a single bit. The reason is that Huffman codes are prefix-free (one symbol's encoding cannot be the prefix of another's), so encoding a symbol with a single bit forces the first bit of every other symbol to be the opposite value: if one symbol is encoded as `0`, everything else must start with `1` plus at least one more bit. Since you're eliminating half of the possible encoding space for any given bit length, you need to be gaining a way to encode at least half of the symbols being input in order to break even.
Note that there is a special case where the symbol space consists of only 3 symbols. In that case, whichever symbol has the greatest frequency will be encoded with 1 bit (the other two become the second-bit variations of whichever first-bit value isn't chosen); if two or more tie for the greatest probability, either one could get the 1-bit code. Thus, in the 3-symbol case a symbol with, say, 34% probability could be coded as a single bit (say, `0`) while the other two, at 33% or lower probability, are coded as `10` and `11`.
So if you're considering all possibilities, then technically anything with probability 1/3 or above could potentially be encoded as a single bit (in the 3-symbol case).
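The 3-symbol case is small enough to write out directly. A sketch, using made-up probabilities 0.34/0.33/0.33:

```python
# Hypothetical 3-symbol alphabet; 'x' barely leads at 34%.
probs = {'x': 0.34, 'y': 0.33, 'z': 0.33}

# With three symbols, Huffman merges the two least likely once, then pairs
# that subtree with the remaining symbol at the root, so the most frequent
# symbol always sits one edge below the root and gets the 1-bit code.
rest = sorted(probs, key=probs.get)  # ascending by probability
codes = {rest[2]: '0', rest[0]: '10', rest[1]: '11'}
print(codes)
```

Here `'x'` gets the single-bit code `0` even though it covers barely more than a third of the stream.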