What is the relationship between 'unicode' and 'encode'?
print u'\xe4\xf6\xfc'.encode('utf-8')
print unicode(u'\xe4\xf6\xfc')
traceback:
盲枚眉
Traceback (most recent call last):
File "D:\zjm_code\a.py", line 6, in <module>
print unicode(u'\xe4\xf6\xfc')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
In the Python shell:
>>>u"äöü".encode('utf-8')
Unsupported characters in input
In Python 2:
case a: (unicode object).encode(somecodec) -> string of bytes
case b: (string of bytes).decode(somecodec) -> unicode object
case c: unicode(string of bytes, somecodec) -> unicode object
Cases b and c are identical. In each of the three cases, you can omit the codec name: it then defaults to 'ascii', the ASCII codec (supporting only the 128 ASCII characters; you'll get an exception otherwise).
Whenever a 'string of bytes' is required on the left of the arrow, you can pass a unicode object (it's converted with the 'ascii' codec).
Whenever a 'unicode' is required on the left of the arrow, you can pass a string of bytes (it's converted with the 'ascii' codec).
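The three cases can be checked with a short script. This is a minimal sketch rather than part of the original answer: text_type is a hypothetical alias (unicode on Python 2, str on Python 3) so that the same lines run on both, and the codec is always passed explicitly because the 'ascii' default described above is specific to Python 2.

```python
# text_type is a hypothetical alias: unicode on Python 2, str on Python 3.
text_type = type(u'')

text = u'\xe4\xf6\xfc'

# case a: unicode object -> string of bytes
data = text.encode('utf-8')
assert data == b'\xc3\xa4\xc3\xb6\xc3\xbc'

# case b: string of bytes -> unicode object
assert data.decode('utf-8') == text

# case c: same conversion as case b, spelled as a constructor call
assert text_type(data, 'utf-8') == text

# With the 'ascii' codec, anything outside the 128 ASCII characters fails.
try:
    text.encode('ascii')
    ascii_encode_failed = False
except UnicodeEncodeError:
    ascii_encode_failed = True

try:
    data.decode('ascii')
    ascii_decode_failed = False
except UnicodeDecodeError:
    ascii_decode_failed = True
```

Both flags end up True: encoding fails with UnicodeEncodeError and decoding with UnicodeDecodeError, which is the same exception family as in the question's traceback.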
The encoding error:
print unicode(u'\xe4\xf6\xfc')
The unicode() call does nothing here, since its parameter is already a unicode object. print then tries to output that unicode object, and to do so it has to convert it to a byte string in the encoding of your terminal. But Python doesn't seem to know which encoding your terminal uses and therefore goes with the "safe" alternative of ASCII. Since u'\xe4\xf6\xfc' cannot be represented in ASCII, this leads to an encoding error.
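One way to avoid depending on the guessed encoding is to encode explicitly before printing. The helper below is only a sketch: encode_for_output and its fallback parameter are hypothetical names, not part of Python, and the 'replace' error handler is just one possible choice, trading the exception for '?' characters.

```python
import sys


def encode_for_output(text, encoding=None, fallback='ascii'):
    """Encode text for a stream, substituting '?' for unencodable characters.

    Hypothetical helper: if no encoding is given, use the one reported by
    sys.stdout, and fall back to `fallback` when none is reported.
    """
    if encoding is None:
        encoding = getattr(sys.stdout, 'encoding', None) or fallback
    return text.encode(encoding, 'replace')


# With ASCII the three umlauts cannot be represented and become '?'.
as_ascii = encode_for_output(u'\xe4\xf6\xfc', 'ascii')

# With UTF-8 every character is representable.
as_utf8 = encode_for_output(u'\xe4\xf6\xfc', 'utf-8')
```

Passing the result to print in Python 2 writes the bytes as they are, so no implicit ASCII conversion takes place.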
Unicode, encode and decode:
Generally, encode() converts a unicode object to a string with a certain character encoding like UTF-8 or ISO-8859-1. Every unicode code point is converted to a sequence of bytes in that encoding:
>>> u'\xe4\xf6\xfc'.encode('utf-8')
'\xc3\xa4\xc3\xb6\xc3\xbc'
The opposite is decode(): it converts a string in a certain encoding to a unicode object containing the corresponding unicode code points.
>>> '\xc3\xa4\xc3\xb6\xc3\xbc'.decode('utf-8')
u'\xe4\xf6\xfc'
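To make the "sequence of bytes per code point" idea concrete, the following sketch encodes each character separately and counts the bytes. The euro sign u'\u20ac' is added here purely for illustration and does not appear in the answer above.

```python
text = u'\xe4\xf6\xfc'

# Each of these code points takes two bytes in UTF-8 ...
utf8_lengths = [len(ch.encode('utf-8')) for ch in text]

# ... but only one byte in ISO-8859-1 (latin-1).
latin1_lengths = [len(ch.encode('iso-8859-1')) for ch in text]

# The euro sign takes three bytes in UTF-8 and has no ISO-8859-1 encoding.
euro = u'\u20ac'
euro_utf8 = euro.encode('utf-8')
try:
    euro.encode('iso-8859-1')
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False
```

The same unicode object can therefore yield byte strings of different lengths, or no byte string at all, depending on the codec chosen.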
Printing:
print with a string parameter just prints the raw bytes of that string. Whether that results in the desired output depends on the character encoding of the terminal.
>>> print '\xc3\xa4\xc3\xb6\xc3\xbc' # utf-8 encoding on utf-8 terminal
äöü
>>> print '\xe4\xf6\xfc' # same encoded as latin-1
���
When given a unicode parameter, print first tries to encode the unicode object in the terminal's encoding. This only works if Python guesses the right encoding for the terminal and that encoding can actually represent all the characters of the unicode object. Otherwise the encoding raises an exception or the output contains wrong characters.
>>> print u'\xe4\xf6\xfc' # it correctly assumes a utf-8 terminal
äöü
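What a terminal shows for "wrong" bytes can be simulated by decoding the same bytes with different codecs. The sketch below is an illustration, not something stated in the answer above. In particular, the idea that the question's garbled first output line came from a GBK (cp936) console is an assumption based on the characters shown there.

```python
utf8_bytes = u'\xe4\xf6\xfc'.encode('utf-8')
latin1_bytes = u'\xe4\xf6\xfc'.encode('latin-1')

# UTF-8 bytes read by a latin-1 terminal: two wrong characters per umlaut.
seen_as_latin1 = utf8_bytes.decode('latin-1')

# UTF-8 bytes read by a GBK terminal (assumed to be the asker's console).
seen_as_gbk = utf8_bytes.decode('gbk')

# latin-1 bytes are not valid UTF-8; a UTF-8 terminal shows replacement
# characters, which the 'replace' handler imitates with U+FFFD.
seen_as_utf8 = latin1_bytes.decode('utf-8', 'replace')
```

Under that assumption, seen_as_gbk reproduces the three CJK characters printed before the traceback in the question, and seen_as_utf8 corresponds to the replacement characters in the latin-1 example above.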
This is covered in the Python tutorial and the Unicode HOWTO.
The unicode function converts non-unicode strings (by default ASCII, but it accepts other encodings too) into unicode. Your error here is that you're passing a string that is already unicode and asking it to be converted to unicode...
The encode method on a unicode string converts it back to an encoded byte string; again, ASCII is the default.