UTF-8 in Python
This seems to be a common question among international developers, but I haven't found a straight answer yet. I'm getting the following string from a feed: "Carlos e Carlos mostram o que há de melhor na internet"
The following error is returned to the console: UnicodeDecodeError: 'utf8' codec can't decode bytes in position 31-33: invalid data
thanks in advance,
fbr
You can't just decode using some random encoding, even if it is UTF-8; you must decode using the encoding declared in the HTTP headers or an equivalent declaration within the document (such as the META element of HTML).
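As a rough illustration of reading the declared encoding, here is a minimal Python 3 sketch that pulls the charset parameter out of a Content-Type header value using the standard library's email.message.Message. The function name charset_from_content_type and the sample header strings are made up for this example; how you obtain the header depends on whatever HTTP client you use.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the charset named in a Content-Type header value, or `default`.

    Hypothetical helper for illustration; not part of any library.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the value and returns `default`
    # when the header carries no charset parameter.
    return msg.get_content_charset(default)


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("application/rss+xml"))
```

When the header carries no charset, the helper returns the default, and you would then look for a declaration inside the document itself.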
If the encoding isn't available or is incorrect, specify in the decode operation what should happen on an invalid byte sequence; 'replace' usually suffices for this.
>>> print u'Carlos e Carlos mostram o que há de melhor na internet'.encode('latin1').decode('utf-8', 'replace')
Carlos e Carlos mostram o que h�e melhor na internet
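The session above is Python 2. For Python 3, the same idea might look like the sketch below: try the declared encoding first, and only fall back to a lenient decode when that fails. The name decode_feed and its parameters are hypothetical, and the UTF-8 fallback is an assumption rather than something the feed guarantees.

```python
def decode_feed(raw, declared_charset=None, fallback="utf-8"):
    """Decode bytes with the declared charset; replace invalid bytes otherwise.

    Hypothetical helper for illustration; `fallback` is an assumed default.
    """
    encoding = declared_charset or fallback
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # The declared charset was wrong or unknown: decode leniently so
        # invalid byte sequences become U+FFFD instead of raising.
        return raw.decode(fallback, errors="replace")


# Simulate a feed that actually sends Latin-1 bytes.
raw = "Carlos e Carlos mostram o que há de melhor na internet".encode("latin-1")

print(decode_feed(raw, "iso-8859-1"))  # correct charset: the text survives intact
print(decode_feed(raw))                # no charset: the invalid byte is replaced
```

Note that 'replace' only hides the error; the accented character is still lost, so it is worth finding the real encoding whenever you can.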