Python Unicode in and out of IDE

When I run my program from within the Eclipse IDE, the following piece of code works perfectly:

address_name = self.text_ctrl_address.GetValue().encode('utf-8')
self.address_list = [i for i in data if address_name.upper() in i[5].upper().encode('utf-8')]

But when I run the same piece of code directly with Python, I get a UnicodeDecodeError.

What does the IDE do differently so that it doesn't hit this error?

ps: I encode both unicode strings because it is the only way to test one string against another containing letters like ñ or ç.

Edit:

Sorry, I should have given more details. This piece of code belongs to a dialog built with wxPython. The GetValue() function gets text from a line edit widget, and the code tries to match that text against a database. The program runs on Windows (and because of this, Michael Shopsin above might be right: "Win-1252 to UTF-8 is a serious nuisance"). I've read many times that I should always work with Unicode and avoid encoding, but if I don't encode, certain string methods don't seem to work very well depending on the characters in a word (I am in Spain, so there are lots of non-ASCII characters). By "directly" I meant double-clicking the file itself, as opposed to running it from within the IDE.


UnicodeDecodeError indicates that the error happens during decoding of a bytestring into Unicode.

In particular, it may happen if you try to encode a bytestring instead of Unicode string on Python 2:

>>> u"\N{EM DASH}".encode('utf-8').encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

u"\N{EM DASH}".encode('utf-8') is a bytestring and invoking .encode('utf-8') the 2nd time leads to implicit .decode(sys.getdefaultencoding()) that leads to the UnicodeDecodeError.
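The implicit step can be made explicit. The following is a minimal sketch, assuming the default encoding is ascii (the Python 2 default); it performs by hand the decode that Python 2 does silently, and it runs on both Python 2 and Python 3:

```python
# The UTF-8 bytes of an em dash: b'\xe2\x80\x94'
data = u"\N{EM DASH}".encode("utf-8")

error = None
try:
    # Python 2 performs this decode implicitly when .encode() is
    # called on a bytestring; here it is spelled out.
    data.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc
    print(error)
```

The message printed matches the one in the traceback above, which suggests that the failing step is the hidden ascii decode and not the UTF-8 encode itself.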

What does the IDE do differently so that it doesn't hit this error?

It probably works in the IDE because the IDE changes sys.getdefaultencoding() to utf-8, which you should not do: it may hide bugs, as your question demonstrates. In general, it may also break third-party libraries that do not expect a non-ASCII sys.getdefaultencoding() on Python 2.
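To check whether that is what happens in your setup, you could run a small diagnostic script both from the IDE and by double-clicking it, and compare the output. This is only a sketch; the variable names are mine, and the script adds nothing beyond the standard sys module:

```python
from __future__ import print_function  # same output on Python 2 and 3

import sys

# Encoding used for implicit str <-> unicode conversions on Python 2.
default_encoding = sys.getdefaultencoding()

# Encoding of the output stream; may be None when output is redirected.
stdout_encoding = getattr(sys.stdout, "encoding", None)

print("default encoding:", default_encoding)
print("stdout encoding:", stdout_encoding)
```

If the first line differs between the two runs (for example utf-8 in the IDE and ascii when double-clicked), the IDE is changing the default encoding.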

I encode both unicode strings because it is the only way to test one string against another containing letters like ñ or ç.

You should use unicodedata.normalize() instead:

>>> import unicodedata
>>> a, b = u'\xf1', u'n\u0303'
>>> print(a)
ñ
>>> print(b)
ñ
>>> a == unicodedata.normalize('NFC', b)
True

Note: the code in your question may produce surprising results:

#XXX BROKEN, DON'T DO IT
...address_name.upper() in i[5].upper().encode('utf-8')...

address_name.upper() calls the bytes.upper method, while i[5].upper() calls the unicode.upper method. The former does not support Unicode and may depend on the current locale. The latter is better, but to perform a case-insensitive comparison, use the .casefold() method instead (available on str since Python 3.3):

key = unicode_address_name.casefold()
... if key == i[5].casefold()...
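Putting normalization and case folding together, the filter from the question could look roughly like the sketch below. It assumes Python 3.3+ and that every value is already a Unicode string; fold(), filter_rows() and the sample rows are hypothetical names and data made up for illustration, and the substring test with `in` mirrors the question's code:

```python
import unicodedata


def fold(text):
    """Normalize to NFC, then casefold, so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", text).casefold()


def filter_rows(rows, query, column=5):
    """Return the rows whose text column contains the query, ignoring case."""
    key = fold(query)
    return [row for row in rows if key in fold(row[column])]


# Made-up sample data; column 5 holds the address, as in the question.
rows = [
    (1, "", "", "", "", "Calle Espan\u0303a 12"),  # n + combining tilde
    (2, "", "", "", "", "Pla\xe7a Major 3"),       # precomposed c-cedilla
    (3, "", "", "", "", "Main Street 1"),
]

matches = filter_rows(rows, "ESPA\xd1A")  # precomposed uppercase N-tilde
print([row[0] for row in matches])
```

No .encode() call is needed anywhere: the comparison stays in Unicode from the widget value to the database rows.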

In general, if you need to sort Unicode strings, you could use icu.Collator. Compare the default lexicographical sort:

>>> L = [u'sandwiches', u'angel delight', u'custard', u'éclairs', u'glühwein']
>>> sorted(L)
[u'angel delight', u'custard', u'gl\xfchwein', u'sandwiches', u'\xe9clairs']

with the order in en_GB locale:

>>> import icu # PyICU
>>> collator = icu.Collator.createInstance(icu.Locale('en_GB'))
>>> sorted(L, key=collator.getSortKey)
[u'angel delight', u'custard', u'\xe9clairs', u'gl\xfchwein', u'sandwiches']


I solved the problem by changing the encoding from UTF-8 to cp1252 (Windows Western Europe). Apparently UTF-8 was not the right encoding for some of the Windows characters involved. Thanks to Michael Shopsin above for the insight.

The program runs on Windows and uses a wxPython dialog, getting values from a line edit widget and matching the string against a database.

Thank you all for the attention, and I hope this post can help people in the future with a similar problem.
