Python Text Encoding
I have this text in a file - Recuérdame (notice it's a French word). When I read this file with a python script, I get this text as Recu&#233;rdame.
I read it as a unicode string. Do I need to find out what the encoding of the text is and decode it, or is my terminal playing tricks on me?
Yes, you need to know the encoding of the text file to turn it into a unicode string (from the bytes that make up the file).
For example, if you know the encoding is UTF-8:
with open('foo.txt', 'rb') as f:
    contents = f.read().decode('utf-8-sig')  # -sig takes care of BOM if present
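To see what the -sig variant actually changes, here is a minimal Python 3 sketch. The temporary file and the helper name read_text are made up for the example and are not part of the original answer. It writes a UTF-8 file that starts with a BOM and decodes it both ways.

```python
import codecs
import os
import tempfile


def read_text(path, encoding="utf-8-sig"):
    """Read the raw bytes of a file and decode them with the given encoding."""
    with open(path, "rb") as f:
        return f.read().decode(encoding)


# Create a throwaway sample file: UTF-8 BOM followed by UTF-8 encoded text.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + "Recuérdame".encode("utf-8"))

with_sig = read_text(sample_path, "utf-8-sig")  # BOM is dropped
without_sig = read_text(sample_path, "utf-8")   # BOM survives as U+FEFF

os.remove(sample_path)

print(repr(with_sig))
print(repr(without_sig))
```

With plain 'utf-8' the BOM stays in the decoded string as a leading U+FEFF character, which is easy to miss because most terminals do not display it.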
The text in your file does not seem to be encoded Unicode, however; the accented character is apparently stored as an XML entity, which will have to be converted manually (tip of the hat to jleedev for the link).
It is not a Unicode string. It's a string in whatever encoding it is encoded in, so it's a UTF-8 or a Latin-1 or some other encoded string. In this case, &#233; is an HTML/XML entity (a numeric character reference) representing é specifically. It's an escape used in HTML and XML to represent non-ASCII data.
To decode that into Unicode, look at Fredrik Lundh's method: http://effbot.org/zone/re-sub.htm#unescape-html
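If you are on Python 3.4 or later (an assumption, since the question does not say which version is in use), the standard library's html.unescape should do the same job as the linked recipe. A minimal sketch, with the sample strings chosen for illustration:

```python
import html

raw = "Recu&#233;rdame"       # text as it appears in the file
decoded = html.unescape(raw)  # replaces character references with the characters they stand for

# Hexadecimal and named references are handled as well.
from_hex = html.unescape("Recu&#xE9;rdame")
from_name = html.unescape("Recu&eacute;rdame")

print(decoded)
```

All three variants come out as the same string, so there is no need to special-case decimal, hex, or named forms yourself.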
It is HTML, and this construct is called an "entity". You can use
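import re

def entity_decode(match):
    # Groups: the whole entity, an optional "x" (hex marker), the digits
    _, is_hex, entity = match.groups()
    base = 16 if is_hex else 10
    return unichr(int(entity, base))  # Python 2; use chr() on Python 3

print re.sub(r"(?i)(&#(x?)([^;]+);)",
             entity_decode,
             "Recu&#233;rdame")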
to decode all numeric entities.
Edit: Yes, they are of course not Latin-1; it should now work with all numeric entities (named ones such as &eacute; are not matched by this pattern).
Working with xlrd, I have a line ...xl_data.find(str(cell_value))... which gives the error: "'ascii' codec can't encode character u'\xdf' in position 3: ordinal not in range(128)". None of the suggestions in the forums helped with my German words. But changing it to ...xl_data.find(cell.value)... gives no error. So I suppose that passing str() values as arguments to certain calls when working with xlrd causes specific encoding problems.
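A likely explanation, offered as an assumption because the comment does not show the full code: in Python 2, calling str() on a unicode value encodes it with the ASCII codec, which fails on ß (u'\xdf'). The sketch below is Python 3 and reproduces the same failure by encoding explicitly. The sample word and sentence are made up so that ß sits at position 3, as in the error message.

```python
word = "Flu\u00df"  # hypothetical cell value; "ß" (U+00DF) is at index 3

# Forcing the text through the ASCII codec fails, just like str() did in Python 2.
try:
    word.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc

# Searching with the text value itself needs no encoding step and works fine.
haystack = "Der Flu\u00df ist tief"
found_at = haystack.find(word)

print(error)
print(found_at)
```

This matches the observation in the comment: passing the cell value through unchanged avoids the implicit ASCII conversion.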