Why does Python sometimes upgrade a string to unicode and sometimes not?

Posted 2022-12-31 13:55

I'm confused. Consider this code working the way I expect:

>>> foo = u'Émilie and Juañ are turncoats.'
>>> bar = "foo is %s" % foo
>>> bar
u'foo is \xc3\x89milie and Jua\xc3\xb1 are turncoats.'

And this code not at all working the way I expect:

>>> try:
...     raise Exception(foo)
... except Exception as e:
...     foo2 = e
... 
>>> bar = "foo2 is %s" % foo2
------------------------------------------------------------
Traceback (most recent call last):
  File "<ipython console>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

Can someone explain what's going on here? Why does it matter whether the unicode data is in a plain unicode string or stored in an Exception object? And why does this fix it:

>>> bar = u"foo2 is %s" % foo2
>>> bar
u'foo2 is \xc3\x89milie and Jua\xc3\xb1 are turncoats.'

I am quite confused! Thanks for the help!

UPDATE: My coding buddy Randall has added to my confusion in an attempt to help me! Send in the reinforcements to explain how this is supposed to make sense:

>>> class A:
...     def __str__(self): return "string"
...     def __unicode__(self): return "unicode"
... 
>>> "%s %s" % (u'niño', A())
u'ni\xc3\xb1o unicode'
>>> "%s %s" % (A(), u'niño')
u'string ni\xc3\xb1o'

Note that the order of the arguments here determines which method is called!

The Python Library Reference (the section on string formatting operations) has the answer:

If format is a Unicode object, or if any of the objects being converted using the %s conversion are Unicode objects, the result will also be a Unicode object.

foo = u'Émilie and Juañ are turncoats.'
bar = "foo is %s" % foo

This works because foo is a unicode object, so the rule above applies and the result is a unicode string.

bar = "foo2 is %s" % foo2

In this case, foo2 is an Exception object, not a unicode object, so the interpreter converts it with str(). That conversion has to encode the unicode message using your default encoding. Here that is apparently ascii, which cannot represent those characters, so the conversion fails with a UnicodeEncodeError.
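The failing step can be reproduced in isolation. The following is a minimal sketch, assuming the default encoding is ascii as the traceback suggests; it performs the encode explicitly instead of relying on the implicit conversion, and it spells the intended characters with escape sequences so the snippet itself stays ASCII. It is written to run on both Python 2 and Python 3.

```python
# The text from the question, with the non-ASCII characters written as escapes.
foo = u'\xc9milie and Jua\xf1 are turncoats.'

# Encoding with the 'ascii' codec fails, just like the implicit
# conversion that "%s" triggers for the Exception object on Python 2.
try:
    foo.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# A codec that can represent the characters succeeds.
utf8_bytes = foo.encode('utf-8')

print(ascii_failed)
```

The point is only that the error comes from an encode step, not from the formatting operator itself.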

bar = u"foo2 is %s" % foo2

Here it works again, because the format string is a unicode object. So the interpreter tries to convert foo2 to a unicode object as well, which succeeds.
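If you would rather not depend on how the exception object gets converted, one option is to format the original message instead of the exception. This is a sketch, assuming the exception was raised with a single unicode argument as in the question; the name `original` is introduced here only for illustration. It runs on both Python 2 and Python 3.

```python
# The text from the question, with the non-ASCII characters written as escapes.
foo = u'\xc9milie and Jua\xf1 are turncoats.'

try:
    raise Exception(foo)
except Exception as e:
    foo2 = e

# args holds the arguments the exception was constructed with,
# so args[0] is the original unicode object, untouched.
original = foo2.args[0]

# Formatting the unicode object directly needs no implicit encode.
bar = u"foo2 is %s" % original
```

Because `original` is already a unicode object, the promotion rule quoted above applies and no ascii encoding is attempted.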

As to Randall's question: this surprises me too. However, this is according to the standard (reformatted for readability):

%s converts any Python object using str(). If the object or format provided is a unicode string, the resulting string will also be unicode.

How such a unicode object is created is left unclear. So both are legal:

call __str__, decode back to a Unicode string, and insert it into the output string
call __unicode__ and insert the result directly into the output string
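The two strategies can be modelled by hand. This is only an illustrative sketch, not the interpreter's actual implementation: the class `Dual` is a hypothetical stand-in for Randall's class A, and `via_str` and `via_unicode` are made-up helper names. The methods are called explicitly so the snippet runs on both Python 2 and Python 3.

```python
class Dual(object):
    # Hypothetical stand-in for the class A from the question.
    def __str__(self):
        return "string"

    def __unicode__(self):
        return u"unicode"


def via_str(obj):
    # Strategy 1: call __str__, then decode the result if it is a byte string.
    text = obj.__str__()
    if isinstance(text, bytes):  # true on Python 2, where str is bytes
        text = text.decode("ascii")
    return text


def via_unicode(obj):
    # Strategy 2: call __unicode__ and use the result directly.
    return obj.__unicode__()


# Mirrors the two orderings from the question.
first = u"%s %s" % (via_str(Dual()), u"ni\xf1o")
second = u"%s %s" % (u"ni\xf1o", via_unicode(Dual()))
```

With these helpers, `first` uses the `__str__` result and `second` uses the `__unicode__` result, which matches the two outputs Randall observed.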

The mixed behaviour of the Python interpreter is rather hideous indeed. I would consider this to be a bug in the standard.

Edit: Quoting the "What's New in Python 3.0" document:

Everything you thought you knew about binary data and Unicode has changed.

[...]

As a consequence of this change in philosophy, pretty much all code that uses Unicode, encodings or binary data most likely has to change. The change is for the better, as in the 2.x world there were numerous bugs having to do with mixing encoded and unencoded text.
