Why does Python sometimes upgrade a string to unicode and sometimes not?
I'm confused. Consider this code working the way I expect:
>>> foo = u'Émilie and Juañ are turncoats.'
>>> bar = "foo is %s" % foo
>>> bar
u'foo is \xc3\x89milie and Jua\xc3\xb1 are turncoats.'
And this code not at all working the way I expect:
>>> try:
...     raise Exception(foo)
... except Exception as e:
...     foo2 = e
...
>>> bar = "foo2 is %s" % foo2
------------------------------------------------------------
Traceback (most recent call last):
File "<ipython console>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
Can someone explain what's going on here? Why does it matter whether the unicode data is in a plain unicode string or stored in an Exception object? And why does this fix it:
>>> bar = u"foo2 is %s" % foo2
>>> bar
u'foo2 is \xc3\x89milie and Jua\xc3\xb1 are turncoats.'
I am quite confused! Thanks for the help!
UPDATE: My coding buddy Randall has added to my confusion in an attempt to help me! Send in the reinforcements to explain how this is supposed to make sense:
>>> class A:
...     def __str__(self): return "string"
...     def __unicode__(self): return "unicode"
...
>>> "%s %s" % (u'niño', A())
u'ni\xc3\xb1o unicode'
>>> "%s %s" % (A(), u'niño')
u'string ni\xc3\xb1o'
Note that the order of the arguments here determines which method is called!
The Python Language Reference has the answer:
If format is a Unicode object, or if any of the objects being converted using the %s conversion are Unicode objects, the result will also be a Unicode object.
foo = u'Émilie and Juañ are turncoats.'
bar = "foo is %s" % foo
This works because foo is a unicode object. The rule above takes effect and the result is a Unicode string.
bar = "foo2 is %s" % foo2
In this case, foo2 is an Exception object, which is obviously not a unicode object. So the interpreter tries to convert it to a normal str using your default encoding. That encoding is apparently ascii, which cannot represent those characters, so the conversion bails out with an exception.
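The failing step can be reproduced in isolation. The snippet below is a minimal sketch, not taken from the question: it encodes the same text to ASCII by hand, which is roughly what the implicit conversion to str attempts, and then encodes it explicitly as UTF-8. The helper name encode_ascii_or_error is made up for illustration, and the literal uses escapes so the snippet does not depend on the source file's encoding.

```python
# Same text as the question's foo, written with escapes.
text = u'\xc9milie and Jua\xf1 are turncoats.'


def encode_ascii_or_error(value):
    """Hypothetical helper: return the ASCII bytes, or the error raised."""
    try:
        return value.encode('ascii')
    except UnicodeEncodeError as exc:
        return exc


# ASCII has no code for the accented characters, so this yields the error.
result = encode_ascii_or_error(text)
print(type(result).__name__)  # prints "UnicodeEncodeError"

# Naming an encoding that can represent the characters succeeds.
explicit = text.encode('utf-8')
```

Spelling out the encoding, as in the last line, avoids depending on the interpreter's default.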
bar = u"foo2 is %s" % foo2
Here it works again, because the format string is a unicode object. So the interpreter tries to convert foo2 to a unicode object as well, which succeeds.
As to Randall's question: this surprises me too. However, this is according to the standard (reformatted for readability):
%s converts any Python object using str(). If the object or format provided is a unicode string, the resulting string will also be unicode.
How such a unicode object is created is left unclear. So both are legal:
- call __str__, decode back to a Unicode string, and insert it into the output string
- call __unicode__ and insert the result directly into the output string
The mixed behaviour of the Python interpreter is rather hideous indeed. I would consider this to be a bug in the standard.
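One way to picture the order dependence is a formatter that walks the arguments left to right and only switches to "unicode mode" once it meets a unicode argument. The sketch below is a toy model under that assumption, not the interpreter's actual implementation. The names model_format and is_text are invented for illustration, and the model simply joins its arguments with spaces instead of parsing a format string.

```python
class A(object):
    def __str__(self):
        return "string"

    def __unicode__(self):
        return u"unicode"


def is_text(obj):
    # Stand-in for "is a unicode object" (on Python 3 every str qualifies).
    return isinstance(obj, type(u""))


def model_format(args):
    """Toy model: use str() until a unicode argument appears, then __unicode__."""
    parts = []
    unicode_mode = False
    for arg in args:
        if is_text(arg):
            # First unicode argument: everything after it is formatted as unicode.
            unicode_mode = True
            parts.append(arg)
        elif unicode_mode and hasattr(arg, "__unicode__"):
            parts.append(arg.__unicode__())
        else:
            # Still in str mode: the object is converted with __str__.
            parts.append(str(arg))
    return u" ".join(parts)


after = model_format([u'ni\xf1o', A()])   # A() comes after the unicode argument
before = model_format([A(), u'ni\xf1o'])  # A() comes before it
```

Under this model, after ends in "unicode" and before starts with "string", matching the pattern in Randall's example.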
Edit: Quoting the Python 3.0 changelog, emphasis mine:
Everything you thought you knew about binary data and Unicode has changed.
[...]
- As a consequence of this change in philosophy, pretty much all code that uses Unicode, encodings or binary data most likely has to change. The change is for the better, as in the 2.x world there were numerous bugs having to do with mixing encoded and unencoded text.
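For comparison, here is a minimal sketch of the question's example on Python 3, where str is always Unicode text and bytes is a separate type. It is Python 3 only, and the variable names mixed_ok and encoded are added here for illustration.

```python
# Python 3 only. Same text as the question's foo, written with escapes.
foo = '\xc9milie and Jua\xf1 are turncoats.'

try:
    raise Exception(foo)
except Exception as e:
    foo2 = e

# str(foo2) is already text, so no implicit ASCII encoding takes place.
bar = "foo2 is %s" % foo2

# Mixing bytes and text is rejected with a TypeError instead of being guessed at.
try:
    b"foo is " + foo
    mixed_ok = True
except TypeError:
    mixed_ok = False

# The conversion has to be written out, with the encoding named.
encoded = b"foo is " + foo.encode("utf-8")
```

The formatting that failed in the question succeeds here, and the only way to combine text with bytes is an explicit encode or decode.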