Python unicode: why does it work on one machine but sometimes fail on another?

I find Unicode in Python really troublesome. Why doesn't Python use UTF-8 for all strings? I am in China, so I have to use Chinese strings that can't be represented in ASCII. I use u'' to denote such a string. It works well on my Ubuntu machine, but on another Ubuntu machine (a VPS provided by linode.com) it sometimes fails. The error is:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)

The code I am using is:

self.talk(user.record["fullname"] + u"准备好了")


The famous UnicodeDecodeError typically shows up when you do string manipulation like the one you just did:

user.record["fullname"] + u" 准备好了"

What you are doing here is concatenating a str with a unicode object, so Python implicitly coerces the str to unicode before concatenating. The coercion is done like this:

unicode(user.record["fullname"]) + u" 准备好了"
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
         Problem

And that is the problem: when you call unicode(something), Python decodes the string using the default encoding, which is ASCII in Python 2.*. If your string user.record["fullname"] contains any non-ASCII character, it raises the famous UnicodeDecodeError.
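The coercion can be reproduced by writing the ASCII decode out explicitly. This is only a sketch: the sample name is made up, coerce_like_python2 is a hypothetical helper, and the code is written so that it also runs on Python 3, where the decode never happens on its own.

```python
# Hypothetical sample value: the surname U+9648 stored as UTF-8 bytes.
fullname = u"\u9648".encode("utf-8")  # three bytes, the first one is 0xe9


def coerce_like_python2(raw):
    """Decode a byte string with the ASCII codec, as Python 2 does implicitly."""
    return raw.decode("ascii")


try:
    coerce_like_python2(fullname)
    error_message = None
except UnicodeDecodeError as exc:
    # Same failure as in the question: byte 0xe9 is outside the ASCII range.
    error_message = str(exc)

print(error_message)
```

A name made of ASCII characters only passes through this decode without error, which would explain why the failure only shows up some of the time.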

So how can you solve it?

# Decode the str to unicode using the right encoding.
# I used utf-8 here because it is usually the right one, but it may not be (another problem!)
a = user.record["fullname"].decode('utf-8')

self.talk(a + u" 准备好了")
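If the same decode is needed in several places, it can be wrapped in a small helper. The sketch below assumes the data is UTF-8 unless told otherwise; to_text is a hypothetical name, not part of the original code.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding byte strings with the given encoding.

    The UTF-8 default is an assumption; pass the real encoding if it differs.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value  # already text, leave it untouched


# Works whether the stored name is a byte string or already text.
greeting = to_text(b"\xe9\x99\x88") + u" 准备好了"
print(greeting)
```

Because values that are already text pass through unchanged, the helper can be applied without checking what user.record["fullname"] currently holds.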

PS: In Python 3 the default encoding is UTF-8. Also, you can't concatenate a string (str, the old unicode) with a byte string (bytes in Python 3), so there is no more implicit coercion.
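A quick way to see the Python 3 behaviour, as a sketch that assumes a Python 3 interpreter (try_concat is a hypothetical helper):

```python
import sys


def try_concat(raw, text):
    """Try to concatenate a byte string and a text string; report the outcome."""
    try:
        return raw + text
    except TypeError as exc:
        # Python 3 refuses to mix bytes and str instead of guessing an encoding.
        return "TypeError: %s" % exc


result = try_concat(b"abc", u"def")
print(result)
print(sys.getdefaultencoding())
```

On Python 3 the concatenation fails immediately with a TypeError for every input, rather than only for the inputs that happen to contain non-ASCII bytes.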


You need to decode all non-Unicode strings as early as possible. Try to ensure you have no UTF-8 bytestrings stored anywhere in memory, and you have only unicode objects. For example, make sure that the elements of user.record are all converted to unicode on creation, so you don't get any errors like this one. Or just use Python 3 where it's hard to mix them.
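One way to decode at the boundary is to convert a record as soon as it is read. This is a sketch only: decode_record is a hypothetical helper, and UTF-8 is an assumption about how the data was stored.

```python
def decode_record(raw_record, encoding="utf-8"):
    """Return a copy of raw_record with every byte-string value decoded to text."""
    decoded = {}
    for key, value in raw_record.items():
        if isinstance(value, bytes):
            value = value.decode(encoding)
        decoded[key] = value
    return decoded


# Hypothetical raw data as it might arrive from a file, socket or database.
raw = {"fullname": b"\xe9\x99\x88", "age": 30}
record = decode_record(raw)
print(record["fullname"] + u" 准备好了")
```

After this step the rest of the program only ever sees text, so later concatenations cannot trigger a decode.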


Because in Python 2.x the default encoding is ASCII unless it is changed manually. Here is a crude hack to include in your script before any other code:

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

This will change default Python encoding to UTF-8.


It took me a long time, but I found it.

Look at the output of printenv, especially LANG:

LANG=en_CA <- server 2 (not working)

LANG=en_US.UTF-8 <- server 1 (working; on Linode, coincidentally)

Set the new locale:

sudo update-locale LANG=en_US.UTF-8 LANGUAGE

Log out, log back in, and Bob's your uncle :)
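To compare two machines from inside Python, the encoding-related settings can be collected in one place. A minimal sketch using only the standard library; encoding_report is a hypothetical helper, and which of these values matters depends on where the bytes come from.

```python
import locale
import os
import sys


def encoding_report():
    """Collect the encoding-related settings visible to this interpreter."""
    return {
        "LANG": os.environ.get("LANG"),      # None if the variable is unset
        "LC_ALL": os.environ.get("LC_ALL"),  # None if the variable is unset
        "preferred": locale.getpreferredencoding(False),
        "default": sys.getdefaultencoding(),
        "stdout": getattr(sys.stdout, "encoding", None),
    }


report = encoding_report()
for name in sorted(report):
    print("%s = %s" % (name, report[name]))
```

Running this on both servers and comparing the output shows whether they really differ in locale.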
