Python zlib output: how to recover it from a MySQL utf-8 table?

In Python, I compressed a string using zlib and then inserted it into a MySQL column of type BLOB, using the utf-8 encoding. The string comes back as utf-8, but it's not clear how to get it back into a format where I can decompress it. Here is some pseudo-output:

valueInserted = zlib.compress('a') = 'x\x9cK\x04\x00\x00b\x00b'

valueFromSqlColumn = u'x\x9cK\x04\x00\x00b\x00b'

zlib.decompress(valueFromSqlColumn) UnicodeEncodeError: 'ascii' codec can't encode character u'\x9c' in position 1: ordinal not in range(128)

If I do this, it inserts some extra characters:

valueFromSqlColumn.encode('utf-8') = 'x\xc2\x9cK\x04\x00\x00b\x00b'

Any suggestions?


Unicode is designed to be compatible with latin-1, so try:

>>> import zlib
>>> u = zlib.compress("test").decode('latin1')
>>> u
u'x\x9c+I-.\x01\x00\x04]\x01\xc1'

And then

>>> zlib.decompress(u.encode('latin1'))
'test'

EDIT: Fixed typo, latin-1 isn't designed to be compatible with unicode, it's the other way around.


You have a unicode object that is really encoding bytes. That's unfortunate, since unicode strings should really only be encoding text, right?

Anyway, what we want to do is construct a byte string; in Python 2.x this is a str. We see from the printed string you gave, u'x\x9cK\x04\x00\x00b\x00b', that the byte values are encoded as unicode codepoints. We can get the numerical value of a codepoint with the function ord(..), and then get the one-byte string for that number with the function chr(..). Let's try this:

>>> ord(u"A")
65
>>> chr(_)
'A'

So we can decode the string ourselves:

>>> udata = u'x\x9cK\x04\x00\x00b\x00b'
>>> bdata = "".join(chr(ord(uc)) for uc in udata)
>>> bdata
'x\x9cK\x04\x00\x00b\x00b'

(Wait, what does the above code do? The join stuff? What we first do is create a list of the code points in the string:

>>> [ord(uc) for uc in udata]
[120, 156, 75, 4, 0, 0, 98, 0, 98]

Then we interpret the numbers as bytes, converting them individually:

>>> [chr(ord(uc)) for uc in udata]
['x', '\x9c', 'K', '\x04', '\x00', '\x00', 'b', '\x00', 'b']

Finally, we join them with "" as separator using "".join(list-of-strings)

End of Wait..)

However, cls cleverly notes that the Latin-1 encoding has the property that a character's byte value in the Latin-1 encoding is equal to the character's codepoint in Unicode. Given, of course, that the character is inside the range 0 to 255 where Latin-1 is defined. This means we can do the byte conversion directly with Latin-1:

>>> udata = u'x\x9cK\x04\x00\x00b\x00b'
>>> udata.encode("latin-1")
'x\x9cK\x04\x00\x00b\x00b'

Which as you can see, gives the same result.
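
A quick sanity check, not part of the original answer: either recovered byte string really does decompress back to the original input.

>>> import zlib
>>> zlib.decompress(bdata)
'a'
>>> zlib.decompress(udata.encode("latin-1"))
'a'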


valueInserted = zlib.compress('a') = 'x\x9cK\x04\x00\x00b\x00b'

Note that this is a str object. You say that you "inserted it into a mysql column that is of type blob, using the utf-8 encoding". As the compressed string is binary, not text, "blob" is an appropriate type of column, but ANY encoding or other transformation is a very bad idea. You need to be able to recover from the database EXACTLY, right down to the last bit, what you inserted, otherwise the decompression will fail, either by raising an error or (less likely, but worse) by silently producing garbage.

What you say you get back, after whatever process you go through to insert it and extract it again, is:

valueFromSqlColumn = u'x\x9cK\x04\x00\x00b\x00b'

Note carefully that there is only one tiny visual difference: u'something' instead of 'something'. That makes it a unicode object. Based on your own evidence so far, "comes back as UTF-8" is not correct. A unicode object and a str object encoded in utf8 are not the same thing.
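
Not part of the original answer: a quick interactive check of that distinction, which also shows where the extra \xc2 byte in your .encode('utf-8') output came from.

>>> import zlib
>>> compressed = zlib.compress('a')           # str: the raw bytes
>>> as_unicode = compressed.decode('latin1')  # unicode object, same codepoints
>>> as_utf8 = as_unicode.encode('utf8')       # str again, but different bytes
>>> compressed
'x\x9cK\x04\x00\x00b\x00b'
>>> as_unicode
u'x\x9cK\x04\x00\x00b\x00b'
>>> as_utf8
'x\xc2\x9cK\x04\x00\x00b\x00b'
>>> compressed == as_utf8
False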

Guess 1: insert as raw string, extract with latin1 decode.

Guess 2: insert as compressed.decode('latin1').encode('utf8'), extract with utf8 decode.
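
Not from the original answer: a quick check that either guess reproduces exactly the unicode object you showed, which is why the same latin1 encode recovers the bytes in both cases.

>>> import zlib
>>> compressed = zlib.compress('a')
>>> # Guess 1: raw bytes stored, decoded as latin1 on the way out
>>> compressed.decode('latin1')
u'x\x9cK\x04\x00\x00b\x00b'
>>> # Guess 2: re-encoded as utf8 for storage, decoded as utf8 on the way out
>>> compressed.decode('latin1').encode('utf8').decode('utf8')
u'x\x9cK\x04\x00\x00b\x00b'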

You really need to understand the process of inserting and extracting, including what encodes and decodes happen by default.
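
Not from the original answer, and only a sketch: if the driver is MySQLdb, the connection arguments below are the usual place where those default encodes and decodes are decided (the connection parameters themselves are placeholders).

>>> import MySQLdb
>>> conn = MySQLdb.connect(host='localhost', user='me', passwd='secret',
...                        db='test', charset='utf8', use_unicode=False)
>>> conn.character_set_name()   # the charset the connection actually negotiated
'utf8'
>>> # use_unicode=False makes the driver hand column values back as str (raw
>>> # bytes) rather than unicode, which is what you want for a BLOB of zlib output.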

Then you really need to fix your code. However in the meantime you can probably kludge up what you've got.

Note this:

>>> valueFromSqlColumn = u'x\x9cK\x04\x00\x00b\x00b'
>>> all(ord(char) <= 255 for char in valueFromSqlColumn)
True

Do some trials with more complicated input than 'a'. If, as I guess, you see that all of the unicode characters have an ordinal in range(256), then you have a simple kludge:

>>> compressed = valueFromSqlColumn.encode('latin1')
>>> compressed
'x\x9cK\x04\x00\x00b\x00b'
>>> zlib.decompress(compressed)
'a'

Why this works is that Latin1 encoding/decoding doesn't change the ordinal. You could recover the original compressed value by:

>>> compressed2 = ''.join(chr(ord(uc)) for uc in valueFromSqlColumn)
>>> compressed2
'x\x9cK\x04\x00\x00b\x00b'
>>> compressed2 == compressed
True

if you think using .encode('latin1') is too much like voodoo.

If the above doesn't work (i.e. some ordinals are not in range(256)), then you will need to produce a small runnable script that shows exactly and reproducibly how you are compressing, inserting into the database, and retrieving from the database ... sprinkle lots of print "variable", repr(variable) around your code so that you can see what is happening.
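
A skeleton for such a script, not from the original answer. It assumes the MySQLdb driver and a made-up table stuff with a BLOB column payload; substitute however you really connect, insert and select.

import zlib
import MySQLdb

original = 'something more complicated than a single letter'
compressed = zlib.compress(original)
print "compressed  ", repr(compressed)

# Placeholder connection details and table/column names.
conn = MySQLdb.connect(host='localhost', user='me', passwd='secret', db='test')
cur = conn.cursor()
cur.execute("INSERT INTO stuff (payload) VALUES (%s)", (MySQLdb.Binary(compressed),))
conn.commit()

cur.execute("SELECT payload FROM stuff")
fetched = cur.fetchone()[0]
print "fetched     ", type(fetched), repr(fetched)

# If fetched prints as u'...', this is where the latin1 kludge would go:
#     fetched = fetched.encode('latin1')
print "decompressed", repr(zlib.decompress(fetched))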
