Reading Text with Accent - Python
I wrote a Python script that connects to Gmail and prints the text of an email. However, my emails often contain words with accents, and that is where my problem lies.
For example a text that I got: "PLANO DE S=C3=9ADE" should be printed as "PLANO DE SAÚDE".
How can I make my email text legible? What can I use to convert these accented letters?
Thanks,
The code suggested by Andrey works fine on Windows, but on Linux I am still getting the wrong output:
>>> b = 'PLANO DE S=C3=9ADE'
>>> s = b.decode('quopri').decode('utf-8')
>>> print s
PLANO DE SÃDE
Rafael,
Thanks, you are correct about the word, it was misspelled. But the problem is still the same here. Another example, where the correct word is "Observações":
>>> b = 'Observa=C3=A7=C3=B5es'
>>> s = b.decode('quopri').decode('utf-8')
>>> print s
Observações
I am using Debian with UTF-8 locale:
$ locale
LANG=en_US.UTF-8
Andrey,
Thanks for your time. I agree with your explanation, but I still have the same problem here. Take a look at my test:
>>> s = 'Observa=C3=A7=C3=B5es'
>>> s2 = s.decode('quopri').decode('utf-8')
>>> print s
Observa=C3=A7=C3=B5es
>>> print s2
Observações
>>> import locale
>>> ENCODING = locale.getpreferredencoding()
>>> print s.encode(ENCODING)
Observa=C3=A7=C3=B5es
>>> print s2.encode(ENCODING)
Observações
>>> print ENCODING
UTF-8
This encoding is called quoted-printable. In your example, you have a string (Python's unicode) encoded in UTF-8 bytes (Python's str), which are in turn encoded in quoted-printable bytes. So the right way to get a string value is:
>>> b = 'PLANO DE S=C3=9ADE'
>>> s = b.decode('quopri').decode('utf-8')
>>> print s
PLANO DE SÚDE
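For readers on Python 3, where str no longer has a decode method, the same two steps can be sketched with the standard quopri module. The helper name decode_qp_utf8 is made up for illustration, and the sketch assumes the payload really is UTF-8 underneath the quoted-printable layer:

```python
import quopri


def decode_qp_utf8(raw):
    """Undo quoted-printable first, then interpret the bytes as UTF-8."""
    utf8_bytes = quopri.decodestring(raw)  # e.g. b'PLANO DE S\xc3\x9aDE'
    return utf8_bytes.decode("utf-8")


first = decode_qp_utf8(b"PLANO DE S=C3=9ADE")      # 'PLANO DE SÚDE'
second = decode_qp_utf8(b"Observa=C3=A7=C3=B5es")  # 'Observações'
```

The input has to be bytes, because quoted-printable is a byte-level transfer encoding; the text only exists after the final UTF-8 decode.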
Update: There might be some issues with the console encoding though. s holds a fully correct Unicode string value (of Python type unicode). But when you use the print statement, the value must be converted to bytes (Python's str) in order to be written to OS file descriptor number 1 (the standard output pipe). So the print statement implementation checks your console encoding, then makes some guesses and prints the results. In fact, in Python 2 the results will differ between printing from the interactive shell, running your process non-interactively, and running your process while redirecting the output to a file.
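One plausible way to get the garbled "Observações" output seen in the question is that the UTF-8 bytes are written correctly but something on the display side reads them as Latin-1. Whether that is what actually happens on the asker's machine is an assumption; the sketch below (Python 3 syntax) only reproduces the symptom:

```python
text = "Observa\u00e7\u00f5es"          # the correct word, 'Observações'
utf8_bytes = text.encode("utf-8")       # b'Observa\xc3\xa7\xc3\xb5es'

# Reading each UTF-8 byte as one Latin-1 character yields the mojibake.
garbled = utf8_bytes.decode("latin-1")  # 'ObservaÃ§Ãµes'

# The damage is reversible as long as no byte was dropped on the way.
repaired = garbled.encode("latin-1").decode("utf-8")
```

If the output looks like garbled here, the decoding steps are fine and the mismatch is between the bytes written and the encoding the terminal expects.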
The best way to output encoded strings in Python 2 is not agreed upon. The two ways that make the most sense are:
1) Use locale's encoding guess and manually encode strings.
import locale
ENCODING = locale.getpreferredencoding()
print s.encode(ENCODING)
2) Use an encoding option (command-line, hard-coded or whatever).
import sys
from getopt import getopt
ENCODING = 'UTF-8'  # default when --encoding is not given
opts, args = getopt(sys.argv[1:], '', ['encoding='])
for opt, arg in opts:
    if opt == '--encoding':
        ENCODING = arg
print s.encode(ENCODING)
Update 2: If nothing helps and you are still sure that your console encoding and font are set to UTF-8, then try this:
import sys, os
ENCODING = 'UTF-8'
stdout = os.fdopen(sys.stdout.fileno(), 'wb')
s = u'привет' # Don't forget to use a Unicode literal starting with u''
stdout.write(s.encode(ENCODING))
At this point you should see the Russian word привет in Cyrillic characters in your console :)
If this is the case, then you should use this binary stdout instead of the normal sys.stdout.
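The same idea carries over to Python 3, sketched here under the assumption that you want to bypass the text layer and choose the encoding yourself. The helper write_encoded is a hypothetical name, and an in-memory io.BytesIO stands in for the console so the snippet runs anywhere:

```python
import io


def write_encoded(stream, text, encoding="utf-8"):
    """Encode the text ourselves and write raw bytes to a binary stream."""
    stream.write(text.encode(encoding))
    stream.write(b"\n")


# Stand-in for a binary console stream, used only for demonstration.
buffer = io.BytesIO()
write_encoded(buffer, "привет")
written = buffer.getvalue()  # the UTF-8 bytes of the word plus a newline
```

On a real console in Python 3 the binary stream to pass would be sys.stdout.buffer, which plays the role of the os.fdopen(...) object above.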
Your string is wrong, look:
'PLANO DE S=C3=9ADE' == 'PLANO DE S\xc3\x9aDE'
Where is the missing "A" in SAÚDE?
If you decode 'PLANO DE S=C3=9ADE' as quoted-printable, you will get only 'PLANO DE SÚDE'.
Running this code here on Linux (Ubuntu 9.10):
>>> b = 'PLANO DE S=C3=9ADE'
>>> s = b.decode('quopri').decode('utf-8')
>>> print s
PLANO DE SÚDE
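To make the missing letter concrete, here is a small comparison in Python 3 syntax using the standard quopri module. The second input, with an extra "A" before the escape, is my assumption of what the intended text would have looked like on the wire:

```python
import quopri

# The "A" in "=9ADE" belongs to the escape "=9A"; it is not a letter of the word.
as_sent = quopri.decodestring(b"PLANO DE S=C3=9ADE").decode("utf-8")

# Assumed intended form: a literal "A" followed by the escaped "Ú".
with_a = quopri.decodestring(b"PLANO DE SA=C3=9ADE").decode("utf-8")

# as_sent == 'PLANO DE SÚDE', with_a == 'PLANO DE SAÚDE'
```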