Printing objects and unicode, what's under the hood? What are the good guidelines?

2023-01-12 17:12

I'm struggling with print and unicode conversion. Here is some code executed in the Python 2.5 interpreter on Windows.

>>> import sys
>>> print sys.stdout.encoding
cp850
>>> print u"é"
é
>>> print u"é".encode("cp850")
é
>>> print u"é".encode("utf8")
├®
>>> print u"é".__repr__()
u'\xe9'

>>> class A():
...    def __unicode__(self):
...       return u"é"
...
>>> print A()
<__main__.A instance at 0x0000000002AEEA88>

>>> class B():
...    def __repr__(self):
...       return u"é".encode("cp850")
...
>>> print B()
é

>>> class C():
...    def __repr__(self):
...       return u"é".encode("utf8")
...
>>> print C()
├®

>>> class D():
...    def __str__(self):
...       return u"é"
...
>>> print D()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)

>>> class E():
...    def __repr__(self):
...       return u"é"
...
>>> print E()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)

So, when a unicode string is printed, it is not its __repr__() method that is called and printed.

But when an object is printed, __str__() (or __repr__(), if __str__ is not implemented) is called, not __unicode__(). Neither can return a unicode string containing non-ASCII characters without raising an error.

But why? If __repr__() or __str__() returns a unicode string, why shouldn't the behavior be the same as when we print a unicode string? In other words: why is print D() different from print D().__str__()?

Am I missing something?

These samples also show that if you want to print an object represented with unicode strings, you have to encode it to a byte string (type str). But nice printing (avoiding the "├®") depends on the sys.stdout encoding.

So, do I have to add u"é".encode(sys.stdout.encoding) to each of my __str__ or __repr__ methods? Or return repr(u"é")? What if I use piping? Is it the same encoding as sys.stdout's?

My main issue is to make a class "printable", i.e. print A() prints something fully readable (not with the \x*** unicode characters). Here is the bad behavior/code that needs to be modified:

class User(object):
    name = u"Luiz Inácio Lula da Silva"
    def __repr__(self):
        # Attempt 1: returns unicode, so print raises UnicodeEncodeError
        return "<User: %s>" % self.name
        # Attempt 2: won't display gracefully
        # e.g. print repr(u'é') -> u'\xe9'
        return repr("<User: %s>" % self.name)
        # Attempt 3: won't display gracefully on a cp850 console
        # e.g. print u"é".encode("utf8") -> print '\xc3\xa9' -> ├®
        return ("<User: %s>" % self.name).encode("utf8")

Thanks!

Python doesn't have many semantic type constraints on given functions and methods, but it has a few, and here's one of them: __str__ (in Python 2.*) must return a byte string. As usual, if a unicode object is found where a byte string is required, the current default encoding (usually 'ascii') is applied in the attempt to make the required byte string from the unicode object in question.
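The implicit conversion described above behaves much like an explicit .encode('ascii') call with no error handler. Here is a minimal sketch of that step, written so that it also runs on Python 3 (where the implicit conversion no longer exists). The helper name implicit_str is made up for illustration:

```python
def implicit_str(text, default_encoding="ascii"):
    """Mimic what Python 2 does when __str__ hands back a unicode object:
    encode it with the default encoding, using strict error handling."""
    return text.encode(default_encoding)

# Pure ASCII text converts without trouble.
ascii_result = implicit_str(u"Pele")

# u"\xe9" (e with acute accent) has no ASCII representation.
try:
    implicit_str(u"Pel\xe9")
    failed = False
    reason = None
except UnicodeEncodeError as exc:
    failed = True
    reason = exc.reason
```

This is the same UnicodeEncodeError the question shows for classes D and E.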

For this operation, the encoding (if any) of any given file object is irrelevant, because what's being returned from __str__ may be about to be printed, or may be going to be subject to completely different and unrelated treatment. Your purpose in calling __str__ does not matter to the call itself and its results; Python, in general, doesn't take into account the "future context" of an operation (what you are going to do with the result after the operation is done) in determining the operation's semantics.

That's because Python doesn't always know your future intentions, and it tries to minimize the amount of surprise. print str(x) and s = str(x); print s (the same operations performed in one gulp vs. two), in particular, must have the same effects; in the second case, there will be an exception if str(x) cannot validly produce a byte string (that is, for example, x.__str__() can't), and therefore the exception should also occur in the first case.

print itself (since 2.4, I believe), when presented with a unicode object, takes into consideration the .encoding attribute (if any) of the target stream (by default sys.stdout); other operations, as yet unconnected to any given target stream, don't -- and str(x) (i.e. x.__str__()) is just such an operation.
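To see why the stream's encoding matters, it helps to look at the bytes each encoding produces and at how a cp850 console would render them. This is only a sketch of the idea, not of how print is implemented; it uses escapes so that it does not depend on the source file's encoding:

```python
text = u"\xe9"                        # the character e with acute accent

as_cp850 = text.encode("cp850")       # a single byte, 0x82
as_utf8 = text.encode("utf8")         # two bytes, 0xC3 0xA9

# A cp850 console interprets every byte it receives as cp850.
shown_cp850 = as_cp850.decode("cp850")    # round-trips to the original text
shown_utf8 = as_utf8.decode("cp850")      # two unrelated characters instead
```

The second case is the "├®" from the question: the UTF-8 bytes are valid, but the console reads them with the wrong code page.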

Hope this helped show the reason for the behavior that is annoying you...

Edit: the OP now clarifies "My main issue is to make a class "printable", i.e. print A() prints something fully readable (not with the \x*** unicode characters).". Here's the approach I think works best for that specific goal:

import sys

DEFAULT_ENCODING = 'UTF-8'  # or whatever you like best

class sic(object):

    def __unicode__(self):  # the "real thing"
        return u'Pel\xe9'

    def __str__(self):      # tries to "look nice"
        return unicode(self).encode(sys.stdout.encoding or DEFAULT_ENCODING,
                                    'replace')

    def __repr__(self):     # must be unambiguous
        return repr(unicode(self))

That is, this approach focuses on __unicode__ as the primary way for the class's instances to format themselves -- but since (in Python 2) print calls __str__ instead, it has that one delegate to __unicode__ with the best it can do in terms of encoding. Not perfect, but then Python 2's print statement is far from perfect anyway;-).
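The encoding choice made inside __str__ can also be pulled out into a small helper, which makes the fallback easier to see. On Python 2, sys.stdout.encoding is typically None when output is piped or redirected, and that is when the default applies. This is a sketch under those assumptions; console_bytes and FakeStream are names invented here, not part of any library:

```python
import sys

DEFAULT_ENCODING = "UTF-8"  # same fallback as in the class above

def console_bytes(text, stream=None, default=DEFAULT_ENCODING):
    """Encode text for stream, falling back to default when the stream
    has no usable encoding; unencodable characters become '?'."""
    if stream is None:
        stream = sys.stdout
    encoding = getattr(stream, "encoding", None) or default
    return text.encode(encoding, "replace")

class FakeStream(object):
    """Stand-in for a file object with a fixed encoding attribute."""
    def __init__(self, encoding):
        self.encoding = encoding

on_console = console_bytes(u"Pel\xe9", FakeStream("cp850"))  # console code page
when_piped = console_bytes(u"Pel\xe9", FakeStream(None))     # falls back to UTF-8
ascii_only = console_bytes(u"Pel\xe9", FakeStream("ascii"))  # 'replace' kicks in
```

The 'replace' error handler trades fidelity for robustness: printing never raises, but characters the target encoding lacks come out as question marks.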

__repr__, for its part, must strive to be unambiguous, that is, not to "look nice" at the expense of risking ambiguity (ideally, when feasible, it should return a byte string that, if passed to eval, would make an instance equal to the present one... that's far from always feasible, but the lack of ambiguity is the absolute core of the distinction between __str__ and __repr__, and I strongly recommend respecting that distinction!).
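As an illustration of the eval round trip mentioned above, here is a small hypothetical class (Point is not from the question) whose __repr__ returns an expression that rebuilds an equal instance:

```python
class Point(object):
    """Hypothetical value class used only to illustrate an eval-able repr."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __repr__(self):
        # %r inserts the repr of each field, so nested values stay unambiguous
        return "Point(%r, %r)" % (self.x, self.y)

p = Point(1, 2)
clone = eval(repr(p))   # rebuilds an equal instance from the repr text
```

Using %r rather than %s for the fields is what keeps the output unambiguous: a string field would appear quoted and escaped instead of being pasted in raw.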

I presume your sys.getdefaultencoding() is still 'ascii', and I think this is what is used whenever str() or repr() is applied to an object. You could change that with sys.setdefaultencoding(), although site.py removes that function from the sys module at startup, so it is normally reachable only from sitecustomize or after reload(sys). As soon as you write to a stream, though, be it stdout or a file, you have to comply with its encoding. This would also apply to piping in the shell, IMO. I assume that print honors the stdout encoding, but the exception happens before print writes anything, while the object is being converted to a byte string.
