开发者

convert huffman code string to binary

I am having problem with how to convert huffman encoding string to binary python.

This question involves nothing of the huffman algorithm.

It is like this:

I can get an encoded huffman string, say 01010101010. Note, it is a string.

But now I want to save the string representation into real binary.

In the huffman encoded string, every 0 and 1 is a byte.

What I want is every 0 and 1 is a bit.

How can I do that in python?

Edit 1:

Please forgive I did not describe my开发者_开发百科 problem clear enough.

Let me explain my current approach of writing to zeros and ones to binary.

Say, we can a code string s='010101010'.

  1. I use int to convert it to integer
  2. Then use unichr to convert it to string so that I can write it to file
  3. write the string to file in binary mode

Also to be noted, I need to read the file in order to decode the huffman code.

So my approach is,

  1. read the bytes from file
  2. restore them to int
  3. convert the int to their binary representation string.
  4. decode the string

And at step 2, the problem happens and I became clueless.

As some huffman string can be short(like, 10), while some can be long(010101010101001). This results in their different byte length in their int value( some short string may take just one byte,while long ones can take two or even more )

The following code illustrates my problem:

ss=['010101','10010101010'] 
# first one is short and takes only one byte in its int value
# second one is long and takes two bytes   

print 'write it to file'
with open('binary.bin','wb') as f:
    for s in ss:
        n=int(s,2)
        print n
        s=unichr(n)
        f.write(s)

print 'read it to file'
with open('binary.bin','rb') as f:
    for s in f.read(): 
        print ord(s)

I am reading one byte a time in the second with part, but this is actually not correct. Because string 10010101010 takes up two bytes.

So, when I read those bytes from the file, How many bytes should I read at once?


There are two different "binary" representations in Python that you might want to use.

Bignum

One is a "bignum" or arbitrary-precision integer. This type is called long in Python 2.x and int in Python 3.x. As the name suggests, this representation is semantically an integer of arbitrary length, so it's useful if you plan to do arithmetic on the resulting number. To parse a string of binary digits, use

# Python 2
long(digit_str, 2)

or

# Python 3
int(digit_str, 2)

bitstring library

Alternatively, as Marc B suggested in the comments, use the bitstring library. Specifically, for conversion, use the bitstring.pack function.

For Huffman coding, using bitstring is probably preferable to storing data in a byte-string, since Huffman codes are generally not multiples of 8 bits; bitstring lets you manipulate bit-strings of arbitrary lengths. Disadvantage: bitstring is not part of the standard library.


One possible approach(using bitstring library) which makes some sense, but still contain incorrectness:

Use bitstring library(Thanks to Mechanical snail and Marc B )

For writing to file.

Steps:

  1. encode the plain text to binary representation string
  2. concatenate all those strings to form one longer one
  3. use bitstring.BitArray to convert to hex format
  4. write the hex string to a file

For reading:

  1. read the hex string from file
  2. convert it back to bit string using BitArray
  3. start decoding

Code:

ss=['01010100','10010101010','010101110101010101'] #encoded message


from bitstring import BitArray,BitStream
print 'write it to file'
with open('binary.bin','wb') as f:
    s=''.join(ss);
    b=BitArray(bin=s)                 
    f.write(b.tobytes())# thanks to Scott, tobytes() method is very useful

print 'read it to file'
b=BitArray(filename='binary.bin')
print b.bin


You have a string that you need to convert into a number. int takes an optional 'base' as an argument. So for the string in your example,

>>> int('01010101010', 2)
682

Once you have a number (not a string), it doesn't make sense to want "real" binary, since the number is the same, you can display it in any base. That means the binary 100 is the same number as the decimal 4, inside your program they're not different numbers. So, once you turn your string into a number you can fiddle with the bits in it.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜