Python UTF8 string confusion

Been banging my head on this for a while and I've read a bunch of articles and the issue isn't any clearer. I have a bunch of strings stored in my database, imagine the following:

x = '\xd0\xa4'
y = '\x92'

At the Python shell I get the following:

print x
Ф
print y
?

Which is exactly what I want to see. However then there is the following:

print unicode(x, 'utf8')
Ф

But not this:

unicode(y, 'utf8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 0: unexpected code byte

My feeling is that our strings are getting mangled because Django tries to convert them to unicode, but I'm just guessing at this point. Any insights or workarounds appreciated.

UPDATE: When I look at the database at the row that contains the '\x92' value, I see this character as ’. An apostrophe. I'm viewing the contents of the database using a Unicode UTF-8 encoding.


Looks like you have a typo; should be x = '\xd0\xa4'. It helps a great deal if you copy and paste exactly what you ran and what appeared in the output.

"\x92" is not a valid UTF-8 string. This explains the exception that you got.

More of a puzzle is why print y produced ?. What are you calling "the Python console"?? It appears to be operating in "replace" mode and substituting "?" ... are you sure that it's a plain "?" and not a white "?" inside a black diamond? Why do you say that "?" is exactly what you expect to see?

UPDATE: You now say """When I look at the database at the row that contains the '\x92' value, I see this character as ’. An apostrophe. I'm viewing the contents of the database using a Unicode UTF-8 encoding."""

That's not an apostrophe. It seems that that piece of data has been encoded using one of the cp125X (aka windows-125X) encodings. Illustrating using cp1252 (the usual suspect):

IDLE 2.6.4      
>>> import unicodedata
>>> uc = '\x92'.decode('cp1252')
>>> print repr(uc)
u'\u2019'
>>> print uc
’
>>> unicodedata.name(uc)
'RIGHT SINGLE QUOTATION MARK'
>>> 

Instead of "viewing the contents of the database using a Unicode UTF-8 encoding" (whatever that means), try writing a small snippet of Python code to extract the offending string and then do print repr(bad_string). Show us the code that you ran, plus the output of the repr(). Also tell us which version of Python, what platform (Windows or unix-based), and what version of what database software. And the part of the CREATE TABLE statement relevant to the column in question.
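As a rough sketch of that kind of diagnostic, the helper below decodes a byte string under a few candidate encodings and reports the Unicode name of each resulting character. It assumes Python 3 (the session above is Python 2), and describe_bytes and its default codec list are hypothetical names chosen for illustration, not part of any library:

```python
import unicodedata


def describe_bytes(raw, codecs=('utf-8', 'cp1252', 'iso-8859-1')):
    """Decode raw under each candidate codec and name the resulting characters.

    Hypothetical helper for illustration; the codec list is only a guess
    at the usual suspects.
    """
    report = {}
    for codec in codecs:
        try:
            text = raw.decode(codec)
        except UnicodeDecodeError as exc:
            # Record why this codec rejected the bytes.
            report[codec] = 'error: ' + exc.reason
        else:
            # Characters without a Unicode name (e.g. C1 controls) get 'UNNAMED'.
            report[codec] = [unicodedata.name(ch, 'UNNAMED') for ch in text]
    return report


bad_string = b'\x92'
print(repr(bad_string))
for codec, result in describe_bytes(bad_string).items():
    print(codec, result)
```

If only one candidate yields a sensible character name, that is a strong hint (though not proof) of the encoding the data was written in.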

Also please read this and this.


\x92 is not a valid utf-8 encoded character.

You don't notice that because you use simple (non-unicode) strings for x and y until you try to decode them into unicode strings. When you then print them, they are simply dumped to the terminal "as is" and the terminal itself interprets the bytes according to its encoding setting.

There is a third parameter to unicode() that tells python what to do in case of encoding (decoding) errors:

>>> unicode('\x92', 'utf8', 'replace')
u'\ufffd'
>>> print _
�
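For anyone on Python 3, where unicode() no longer exists, the same choice is made through the errors argument of bytes.decode(). A minimal sketch, assuming Python 3; the variable names are only illustrative:

```python
# Python 3 sketch: the error handlers accepted by bytes.decode().
bad = b'\x92'  # a lone byte that is not valid UTF-8

# 'strict' (the default) raises UnicodeDecodeError.
try:
    bad.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'replace' substitutes U+FFFD REPLACEMENT CHARACTER for the bad byte.
replaced = bad.decode('utf-8', 'replace')

# 'ignore' silently drops the bad byte.
ignored = bad.decode('utf-8', 'ignore')

print(strict_failed, ascii(replaced), ascii(ignored))
```

Note that 'replace' and 'ignore' both lose information: the original byte cannot be recovered from the result, so they hide the problem rather than fix the data.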


I thought any unicode character other than the ASCII subset had a multi-byte representation in UTF-8. Your y makes sense as a single-byte-per-char string, but not as a UTF-8 string. Because the single byte is outside the 0x00 to 0x7F ASCII range, the codec will expect an extra byte or more for the conversion to a "real" unicode character.
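That can be checked directly by encoding a few characters and counting the bytes. A small sketch, assuming Python 3; the sample characters are just the ones that come up in this thread:

```python
# Python 3 sketch: number of UTF-8 bytes per character.
samples = {
    'A': 'LATIN CAPITAL LETTER A (ASCII)',
    '\u0424': 'CYRILLIC CAPITAL LETTER EF',
    '\u2019': 'RIGHT SINGLE QUOTATION MARK',
}

# Map each character to the length of its UTF-8 encoding.
lengths = {ch: len(ch.encode('utf-8')) for ch in samples}

for ch, description in samples.items():
    print(ascii(ch), description, lengths[ch], 'byte(s)')
```

Only the ASCII character fits in one byte; the other two need two and three bytes respectively.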

I'm not as familiar with Python as I once was, though, and I'm not confident about this answer.

EDIT: hops's answer is the better one, IMO.


I see now where you're confused. Let's look at this:

x = '\xd0\xa4'
y = '\x92'

If I print x, I get Ф. This is because my terminal is using UTF-8 as its character encoding. Thus, when it gets D0 A4, it attempts to decode it as UTF-8, and gets a "Ф". If I change my terminal to use, say, ISO-8859-1 ("latin1"), and I say print x, my terminal will attempt to decode D0 A4 using ISO-8859-1, and since D0 A4 is also a valid ISO-8859-1 string, it will decode, but this time, to "Ð¤".
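What the terminal does with those two bytes can be imitated by decoding them explicitly. A minimal sketch, assuming Python 3, where the byte string is written with a b prefix:

```python
# Python 3 sketch: the same two bytes interpreted under two encodings.
raw = b'\xd0\xa4'

# As UTF-8, the pair D0 A4 is one character: U+0424 (Cyrillic capital EF).
as_utf8 = raw.decode('utf-8')

# As ISO-8859-1, every byte is its own character: U+00D0 then U+00A4.
as_latin1 = raw.decode('iso-8859-1')

print(len(as_utf8), ascii(as_utf8))
print(len(as_latin1), ascii(as_latin1))
```

The bytes never change; only the interpretation does, which is why the same stored value can look right in one tool and wrong in another.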

Now, for print y. This isn't a UTF-8 string, so my terminal can't decode this. It shows me this error, in my case, by printing "�". I'm wondering if you see "�" or "?" - you should probably see the former, but it depends on what your terminal does in the face of bad output.

Your terminal's encoding should match whatever $LANG says, and your program should output data in whatever encoding $LANG specifies. Nowadays, $LANG is typically ???.UTF-8, where the ??? varies. (Mine is en_US.UTF-8)

Now, when you say unicode(y, 'utf8'), Python attempts to decode this as UTF-8, and appropriately throws an exception.

I'm using Gnome Terminal, and can change my character encoding by going to Terminal → Set Character Encoding


0x92 (hex) = 10 010010 (binary)

A byte of the form 10xxxxxx is a continuation byte: it can only follow a lead byte and can never be the first byte of a character. The payload 010010 would fit in a single byte, whose "header" must be 0 (--> 00010010), and characters may not be represented with more bytes than needed, so "\x92" on its own is not a valid UTF-8 encoded string.
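The role of a byte can be read off its leading bits. The function below is a sketch of that check, assuming Python 3; utf8_byte_role is a hypothetical helper, and it looks at the bit pattern only, so it does not reject lead bytes that UTF-8 forbids for other reasons (such as 0xC0 or 0xF5):

```python
def utf8_byte_role(byte):
    """Classify one byte value (0-255) by its leading bits.

    Hypothetical helper: bit-pattern check only, not a full UTF-8 validator.
    """
    if byte & 0b10000000 == 0b00000000:
        return 'single byte (ASCII)'       # 0xxxxxxx
    if byte & 0b11000000 == 0b10000000:
        return 'continuation byte'         # 10xxxxxx
    if byte & 0b11100000 == 0b11000000:
        return 'lead of 2-byte sequence'   # 110xxxxx
    if byte & 0b11110000 == 0b11100000:
        return 'lead of 3-byte sequence'   # 1110xxxx
    if byte & 0b11111000 == 0b11110000:
        return 'lead of 4-byte sequence'   # 11110xxx
    return 'never valid in UTF-8'          # 11111xxx


for value in (0x92, 0xD0, 0xA4, 0x41):
    print(hex(value), format(value, '08b'), utf8_byte_role(value))
```

Under this classification 0x92 is a continuation byte with no lead byte before it, whereas D0 A4 is a lead byte followed by exactly one continuation byte.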

I guess your database uses some one-byte-per-character encoding (such as latin-1). If you're coding the database queries yourself, you must ensure that the connection encoding is correct or that strings are decoded correctly. With Django models, everything should work automatically.
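If you do end up decoding the values yourself, one possible workaround is to try UTF-8 first and fall back to a single-byte encoding. This is only a sketch under loud assumptions: it assumes Python 3, decode_with_fallback is a hypothetical helper, and cp1252 as the fallback is a guess that has to be confirmed against the real data:

```python
def decode_with_fallback(raw, primary='utf-8', fallback='cp1252'):
    """Decode raw as the primary encoding, else as the fallback encoding.

    Hypothetical helper. The fallback codec is an assumption about how the
    stray bytes were written; verify it before relying on the result.
    """
    try:
        return raw.decode(primary)
    except UnicodeDecodeError:
        # Bytes that the fallback codec does not define become U+FFFD
        # instead of raising a second exception.
        return raw.decode(fallback, 'replace')


print(ascii(decode_with_fallback(b'\xd0\xa4')))  # valid UTF-8, used as is
print(ascii(decode_with_fallback(b'\x92')))      # not UTF-8, fallback applies
```

A fallback like this guesses per value, so it can misread text that happens to be valid in both encodings; fixing the connection or column encoding at the source is the more reliable route.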
