convert huffman code string to binary
I am having problem with how to convert huffman encoding string to binary python.
This question involves nothing of the huffman algorithm.
It is like this:
I can get an encoded huffman string, say 01010101010
. Note, it is a string.
But now I want to save the string representation into real binary.
In the huffman encoded string, every 0 and 1 is a byte.
What I want is every 0 and 1 is a bit.
How can I do that in python?
Edit 1:
Please forgive I did not describe my开发者_开发百科 problem clear enough.
Let me explain my current approach of writing to zeros and ones to binary.
Say, we can a code string s='010101010'.
- I use
int
to convert it to integer - Then use
unichr
to convert it to string so that I can write it to file - write the string to file in binary mode
Also to be noted, I need to read the file in order to decode the huffman code.
So my approach is,
- read the bytes from file
- restore them to int
- convert the int to their binary representation string.
- decode the string
And at step 2, the problem happens and I became clueless.
As some huffman string can be short(like, 10
), while some can be long(010101010101001
). This results in their different byte length in their int value(
some short string may take just one byte,while long ones can take two or even more
)
The following code illustrates my problem:
ss=['010101','10010101010']
# first one is short and takes only one byte in its int value
# second one is long and takes two bytes
print 'write it to file'
with open('binary.bin','wb') as f:
for s in ss:
n=int(s,2)
print n
s=unichr(n)
f.write(s)
print 'read it to file'
with open('binary.bin','rb') as f:
for s in f.read():
print ord(s)
I am reading one byte a time in the second with part, but this is actually not correct. Because string 10010101010
takes up two bytes.
So, when I read those bytes from the file, How many bytes should I read at once?
There are two different "binary" representations in Python that you might want to use.
Bignum
One is a "bignum" or arbitrary-precision integer. This type is called long
in Python 2.x and int
in Python 3.x. As the name suggests, this representation is semantically an integer of arbitrary length, so it's useful if you plan to do arithmetic on the resulting number. To parse a string of binary digits, use
# Python 2
long(digit_str, 2)
or
# Python 3
int(digit_str, 2)
bitstring
library
Alternatively, as Marc B suggested in the comments, use the bitstring
library. Specifically, for conversion, use the bitstring.pack
function.
For Huffman coding, using bitstring
is probably preferable to storing data in a byte
-string, since Huffman codes are generally not multiples of 8 bits; bitstring
lets you manipulate bit-strings of arbitrary lengths. Disadvantage: bitstring
is not part of the standard library.
One possible approach(using bitstring library) which makes some sense, but still contain incorrectness:
Use bitstring library(Thanks to Mechanical snail and Marc B )
For writing to file.
Steps:
- encode the plain text to binary representation string
- concatenate all those strings to form one longer one
- use bitstring.BitArray to convert to hex format
- write the hex string to a file
For reading:
- read the hex string from file
- convert it back to bit string using BitArray
- start decoding
Code:
ss=['01010100','10010101010','010101110101010101'] #encoded message
from bitstring import BitArray,BitStream
print 'write it to file'
with open('binary.bin','wb') as f:
s=''.join(ss);
b=BitArray(bin=s)
f.write(b.tobytes())# thanks to Scott, tobytes() method is very useful
print 'read it to file'
b=BitArray(filename='binary.bin')
print b.bin
You have a string that you need to convert into a number. int
takes an optional 'base' as an argument. So for the string in your example,
>>> int('01010101010', 2)
682
Once you have a number (not a string), it doesn't make sense to want "real" binary, since the number is the same, you can display it in any base. That means the binary 100
is the same number as the decimal 4
, inside your program they're not different numbers. So, once you turn your string into a number you can fiddle with the bits in it.
精彩评论