Dealing with wacky encodings in Python
I have a Python script that pulls in data from many sources (databases, files, etc.). Supposedly, all the strings are unicode, but what I end up getting is any variation on the following theme (as returned by repr()
):
u'D\\xc3\\xa9cor'
u'D\xc3\xa9cor'
'D\\xc3\\xa9cor'
'D\xc3\xa9cor'
Is there a reliable way to take any four of the above strings and return the proper unicode string?
u'D\xe9cor' # --> Décor
The only way I can think of right now uses eval()
, replace()
, and a deep, burning shame that will never wash a开发者_高级运维way.
That's just UTF-8 data. Use .decode
to convert it into unicode
.
>>> 'D\xc3\xa9cor'.decode('utf-8')
u'D\xe9cor'
You can perform an additional string-escape decode for the 'D\\xc3\\xa9cor'
case.
>>> 'D\xc3\xa9cor'.decode('string-escape').decode('utf-8')
u'D\xe9cor'
>>> 'D\\xc3\\xa9cor'.decode('string-escape').decode('utf-8')
u'D\xe9cor'
>>> u'D\\xc3\\xa9cor'.decode('string-escape').decode('utf-8')
u'D\xe9cor'
To handle the 2nd case as well, you need to detect if the input is unicode
, and convert it into a str
first.
>>> def conv(s):
... if isinstance(s, unicode):
... s = s.encode('iso-8859-1')
... return s.decode('string-escape').decode('utf-8')
...
>>> map(conv, [u'D\\xc3\\xa9cor', u'D\xc3\xa9cor', 'D\\xc3\\xa9cor', 'D\xc3\xa9cor'])
[u'D\xe9cor', u'D\xe9cor', u'D\xe9cor', u'D\xe9cor']
Write adapters that know which transformations should be applied to their sources.
>>> 'D\xc3\xa9cor'.decode('utf-8')
u'D\xe9cor'
>>> 'D\\xc3\\xa9cor'.decode('string-escape').decode('utf-8')
u'D\xe9cor'
Here's the solution I came to before I saw KennyTM's proper, more concise soltion:
def ensure_unicode(string):
try:
string = string.decode('string-escape').decode('string-escape')
except UnicodeEncodeError:
string = string.encode('raw_unicode_escape')
return unicode(string, 'utf-8')
精彩评论