How do I resolve difficulties with decoding and printing Greek characters using Python?

2023-03-09 02:43 Q&A author:

I am creating a simple game designed to prompt the user for the Greek translation of an English word. For example:

cow: # here, the gamer would answer with *η αγελάδα* in order to score one point.

I use a helper function to read and decode from a txt file. I do so using the following code in said function:

# The variable filename is my helper function's sole parameter; it takes the
# above-mentioned txt file as an argument.
words_text = codecs.open(filename, 'r', 'utf-8')

This helper function then reads each line. The lines resemble something like this:

# In stack data, when I debug, it reads as u"η αγελάδα - cow\r\n".
u"\u03b7 \u03b1\u03b3\u03b5\u03bb\u03ac\u03b4\u03b1 - cow\r\n"

The first line of the file, however, is read with an unwanted prefix, \ufeff:

# u"\ufeffη αγελάδα - cow\r\n"
u"\ufeff\u03b7 \u03b1\u03b3\u03b5\u03bb\u03ac\u03b4\u03b1 - cow\r\n"

Note: After reviewing Mark's answer, I found out that the prepended character (\ufeff) is a byte order mark (BOM), a signature that some editors write at the start of a file to mark it as UTF-8.

It's a minor issue, but I am not sure of the tidiest way to remove it. In any case, my helper function then creates and returns a new dictionary that looks something like this:

{u'\u03b7 \u03b1\u03b3\u03b5\u03bb\u03ac\u03b4\u03b1': 'cow'}
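As a rough sketch of what such a helper might look like (the name load_words, the ' - ' separator handling, and the use of 'utf-8-sig' are assumptions for illustration, not the original code):

```python
import codecs


def load_words(filename):
    """Return a {greek: english} dict from lines shaped like 'greek - english'."""
    words = {}
    # Assumption: 'utf-8-sig' decodes UTF-8 and drops a leading BOM if present.
    words_text = codecs.open(filename, 'r', 'utf-8-sig')
    try:
        for line in words_text:
            line = line.strip()  # removes the trailing '\r\n'
            if not line:
                continue
            greek, sep, english = line.partition(u' - ')
            if sep:  # skip lines that do not contain the separator
                words[greek.strip()] = english.strip()
    finally:
        words_text.close()
    return words
```

Splitting on the first ' - ' keeps the Greek phrase, including its article, as the key and the English word as the value.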

Then, in my main function, I use the following in order to store the user's input:

# This is the code for the prompt I noted at the beginning.
# The variable gr_en_dict is the dictionary noted right above.
for key in gr_en_dict:
    user_reply = raw_input('%s: ' % (gr_en_dict[key])).decode(sys.stdout.encoding)

I then compare the value of the user's input with the appropriate key in the dictionary:

# I imported unicodedata as ud.
if ud.normalize('NFC', user_reply) == ud.normalize('NFC', key):
    score += 1
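To illustrate why normalizing before the comparison can matter, here is a small sketch. The same accented letter can arrive either as one precomposed code point or as a base letter plus a combining accent, depending on the input method. The function name same_word is hypothetical.

```python
import unicodedata as ud

precomposed = u'\u03ac'       # GREEK SMALL LETTER ALPHA WITH TONOS
decomposed = u'\u03b1\u0301'  # alpha followed by COMBINING ACUTE ACCENT


def same_word(a, b):
    """Compare two strings after bringing both to NFC form."""
    return ud.normalize('NFC', a) == ud.normalize('NFC', b)


print(precomposed == decomposed)           # plain comparison of code points
print(same_word(precomposed, decomposed))  # comparison after normalization
```

The two strings render identically, but only the normalized comparison treats them as equal.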

In a response to a question similar to mine, the user ΤΖΩΤΖΙΟΥ said to import the module unicodedata and to call the normalize method (which I did in the code above), but I suspect that might not be necessary. Unfortunately, this step of the program is of no concern just yet because I have a problem decoding the user's input. To demonstrate, when I print the canonical string representation of user_reply and that of the corresponding key in my dictionary [using the built-in repr()] I get the following result:

user's input (user_reply):

u'? \u03b1?\u03b5??\u03b4\u03b1'

If I print the user's input without the repr() function, it looks like this:

? α?ε??δα

key in my dictionary:

u'\u03b7 \u03b1\u03b3\u03b5\u03bb\u03ac\u03b4\u03b1'

If I print it without repr(), I get an error:

UnicodeEncodeError: 'charmap' codec can't encode character u'\u03b7' in position 0: character maps to <undefined>

Notice the question marks in the user's input and the error I get when I try to print the Greek word proper. This seems to be the crux of my problem.

So, what exactly do I need to do in order to decode the user's input and to display all Greek characters properly?

When using my native code page:

C:\>chcp
Active code page: 437

C:\>\python25\python
Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.encoding
'cp437'
>>> print '? α?ε??δα'
? α?ε??δα
>>>

When using the Greek code page (strangely, the text appears correctly only when I copy it to the clipboard first and then paste it into a word-processing application; I would post an image of what it actually prints in the default console, but I lack the reputation to do so):

C:\>chcp 869
Active code page: 869

C:\>\python25\python
Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.encoding
'cp869'
>>> print ' η αγελάδα'
 η αγελάδα
>>> print 'η αγελάδα'
η αγελάδα
>>>

Update: I had to change the default console's font to Lucida Console. That resolved the discrepancy.

For part of your question, use:

words_text = codecs.open(filename, 'r', 'utf-8-sig')

and it will strip the byte order mark (\ufeff) for you if one is present.
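A small self-contained sketch of the difference between the two codecs, assuming a temporary file written with a BOM (the helper first_line and the temporary file are only there for the demonstration):

```python
import codecs
import os
import tempfile

line = u'\u03b7 \u03b1\u03b3\u03b5\u03bb\u03ac\u03b4\u03b1 - cow\r\n'

# Write a throwaway file that starts with the UTF-8 BOM.
fd, path = tempfile.mkstemp()
os.write(fd, codecs.BOM_UTF8 + line.encode('utf-8'))
os.close(fd)


def first_line(path, encoding):
    """Read the first line of the file with the given codec."""
    f = codecs.open(path, 'r', encoding)
    try:
        return f.readline()
    finally:
        f.close()


with_bom = first_line(path, 'utf-8')         # BOM is kept as u'\ufeff'
without_bom = first_line(path, 'utf-8-sig')  # BOM is removed while decoding
os.remove(path)

print(repr(with_bom))
print(repr(without_bom))
```

Files without a BOM can still be read with 'utf-8-sig', so it should be safe to use when the presence of the mark is uncertain.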

Technically, this:

user_reply = raw_input('%s: ' % (gr_en_dict[key])).decode(sys.stdout.encoding)

should be:

user_reply = raw_input('%s: ' % (gr_en_dict[key])).decode(sys.stdin.encoding)

but in practice they should be the same encoding.
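One way to keep the decoding step in a single place is a small helper like the sketch below. The name to_unicode and the 'utf-8' fallback are assumptions; the fallback is there because sys.stdin.encoding can be None in Python 2 when input is piped rather than typed.

```python
import sys


def to_unicode(raw, encoding=None):
    """Decode bytes read from the console; pass text through unchanged."""
    if not isinstance(raw, bytes):
        return raw  # already text, e.g. the result of input() in Python 3
    # Assumed fallback order: explicit argument, stdin's encoding, then UTF-8.
    encoding = encoding or getattr(sys.stdin, 'encoding', None) or 'utf-8'
    return raw.decode(encoding)
```

In the game loop this might be used as user_reply = to_unicode(raw_input(prompt)), where prompt stands for the string built from the dictionary value.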

I believe the problem is that your default console's encoding does not support all Greek characters. When I change to a Greek code page, things begin to work better. Note that I can paste the correct characters into the print statement below, but cp437 contains only a handful of Greek letters, so when the string is printed the unsupported characters are replaced with question marks:

C:\>chcp
Active code page: 437

C:\>python
Python 2.7.1 (r271:86832, Nov 27 2010, 18:30:46) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.encoding
'cp437'
>>> print 'η αγελάδα - cow'
? α?ε??δα - cow

If I switch to a Greek code page (869 or 1253), it works:

C:\>chcp 869
Active code page: 869

C:\>python
Python 2.7.1 (r271:86832, Nov 27 2010, 18:30:46) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.encoding
'cp869'
>>> print 'η αγελάδα - cow'
η αγελάδα - cow
>>>
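The behaviour in the two console sessions above can be reproduced without changing the console at all, by asking the codecs directly. This is only a sketch; can_encode and as_shown_by are hypothetical helper names.

```python
word = u'\u03b7 \u03b1\u03b3\u03b5\u03bb\u03ac\u03b4\u03b1'  # η αγελάδα


def can_encode(text, codec):
    """Return True if every character of text exists in the code page."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def as_shown_by(text, codec):
    """Round-trip through the codec, replacing unsupported characters with '?'."""
    return text.encode(codec, 'replace').decode(codec)


for codec in ('cp437', 'cp869', 'cp1253'):
    print('%s %s' % (codec, can_encode(word, codec)))

print(repr(as_shown_by(word, 'cp437')))
```

A strict encode raises the same kind of UnicodeEncodeError the question shows, while the 'replace' error handler produces the question marks.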

The standard Windows shell has issues with extended characters. I would suggest using something like Windows PowerShell.

For the '\ufeff' character, which is the byte order mark, you could perform the following check after reading in the file:

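import codecs

words_text = codecs.open(filename, 'r', 'utf-8')
words_text_lines = words_text.readlines()
words_text.close()

# Drop the BOM (u'\ufeff') if it is the first character of the first line.
if words_text_lines and words_text_lines[0][0] == unicode(codecs.BOM_UTF8, 'utf8'):
    words_text_lines[0] = words_text_lines[0][1:]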

That way you're discarding it if it's there.
