Python failing to encode bad unicode to ascii

I have some Python code that's receiving a string with bad unicode in it. When I try to ignore the bad characters, Python still chokes (version 2.6.1). Here's how to reproduce it:

s = 'ad\xc2-ven\xc2-ture'
s.encode('utf8', 'ignore')

It raises:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)

What am I doing wrong?


In Python 2.x, converting a byte string (str) to a unicode instance is done with str.decode(), not str.encode():

 >>> s.decode("ascii", "ignore")
 u'ad-ven-ture'
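
To see why the original call fails with a decode error even though encode() was called, here is a rough sketch of the two steps Python 2 performs, written out explicitly. It assumes the implicit step uses the default ASCII codec (the Python 2 default); the names error_name and cleaned are only for illustration, and the b prefix is there so the snippet also runs on Python 3.

```python
# The byte string from the question.
s = b'ad\xc2-ven\xc2-ture'

# Roughly what Python 2 does for s.encode('utf8', 'ignore'):
# a strict ASCII decode first, then the requested encode.
# The 'ignore' handler only applies to the second step.
error_name = None
try:
    s.decode('ascii').encode('utf8', 'ignore')
except UnicodeDecodeError as exc:
    error_name = type(exc).__name__

print(error_name)

# Passing the error handler to the decode step is what drops the bad bytes.
cleaned = s.decode('ascii', 'ignore')
print(repr(cleaned))
```

The failure happens in the decode step, before 'ignore' is ever consulted, which is why the handler has to be given to decode() instead.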


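You are confusing "unicode" with "UTF-8". Your string s is not unicode; it's a bytestring in a particular encoding (not valid UTF-8; more likely ISO-8859-1 or similar). Going from a bytestring to unicode is done by decoding the data, not encoding. Going from unicode to a bytestring is encoding. When you call encode() on a bytestring, Python 2 first implicitly decodes it with the default ASCII codec, which is why you get a UnicodeDecodeError no matter which error handler you pass to encode(). Perhaps you meant to make s a unicode string: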

>>> s = u'ad\xc2-ven\xc2-ture'
>>> s.encode('utf8', 'ignore')
'ad\xc3\x82-ven\xc3\x82-ture'

Or perhaps you want to treat the bytestring as UTF-8 but ignore invalid sequences, in which case you would decode the bytestring with 'ignore' as the error handler:

>>> s = 'ad\xc2-ven\xc2-ture'
>>> u = s.decode('utf-8', 'ignore')
>>> u
u'adventure'
>>> u.encode('utf-8')
'adventure'