Python Unicode in and out of IDE
When I run my program from within the Eclipse IDE, the following piece of code works perfectly:
address_name = self.text_ctrl_address.GetValue().encode('utf-8')
self.address_list = [i for i in data if address_name.upper() in i[5].upper().encode('utf-8')]
But when I run the same piece of code directly with Python, I get a "UnicodeDecodeError".
What does the IDE do differently so that it doesn't hit this error?
PS: I encode both unicode strings because it is the only way to test one string against another containing letters like ñ or ç.
Edit:
Sorry, I should have given more details: this piece of code belongs to a dialog built with WxPython. The GetValue() function gets text from a line edit widget, and the code tries to match this text against a database. The program runs on Windows (and because of this, Michael Shopsin above might be right: "Win-1252 to UTF-8 is a serious nuisance"). I've read many times that I should always work with unicode and avoid encoding, but if I don't encode, certain string methods don't seem to work very well depending on the characters in a word (I am in Spain, so there are lots of non-ASCII characters). By "directly" I meant double-clicking the file itself rather than running it from within the IDE.
UnicodeDecodeError indicates that the error happens during decoding of a bytestring into Unicode.
In particular, it may happen if you try to encode a bytestring instead of Unicode string on Python 2:
>>> u"\N{EM DASH}".encode('utf-8').encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
u"\N{EM DASH}".encode('utf-8')
is a bytestring, and invoking .encode('utf-8') a second time leads to an implicit .decode(sys.getdefaultencoding()), which raises the UnicodeDecodeError.
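The implicit step can be spelled out by hand. The following is a minimal sketch, assuming the default encoding is 'ascii' (the Python 2 default); the variable names are only illustrative. Decoding the UTF-8 bytes as ASCII explicitly fails in the same way:

```python
# Build the UTF-8 bytestring for an em dash: three bytes, the first is 0xe2.
data = u"\N{EM DASH}".encode("utf-8")

try:
    # Python 2 performs this decode implicitly before it re-encodes a bytestring.
    data.decode("ascii")
except UnicodeDecodeError as exc:
    error_message = str(exc)
else:
    error_message = None

print(error_message)
```

The message names the 'ascii' codec and the offending byte 0xe2, which matches the traceback above.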
What does the IDE do differently so that it doesn't hit this error?
It probably works in the IDE because the IDE changes sys.getdefaultencoding() to utf-8, which you should not do. It may hide bugs, as your question demonstrates. In general, it may also break 3rd-party libraries that do not expect a non-ASCII sys.getdefaultencoding() on Python 2.
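One way to check this guess is to run the same small script both from the IDE and by double-clicking the file, then compare the output. This is only a sketch; describe_encodings is a hypothetical helper name, and the values depend on your environment (an unmodified Python 2 normally reports 'ascii' as the default encoding):

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings that most often differ between environments."""
    return {
        # Used for implicit bytes <-> unicode conversions on Python 2.
        "default": sys.getdefaultencoding(),
        # Encoding of the console or IDE output pane; may be None when redirected.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding suggested by the operating system locale settings.
        "preferred": locale.getpreferredencoding(False),
    }


for name, value in sorted(describe_encodings().items()):
    print("%s: %s" % (name, value))
```

If the "default" line differs between the two runs, the IDE is changing the default encoding behind your back.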
I encode both unicode strings because it is the only way to test one string against another containing letters like ñ or ç.
You should use unicodedata.normalize() instead:
>>> import unicodedata
>>> a, b = u'\xf1', u'n\u0303'
>>> print(a)
ñ
>>> print(b)
ñ
>>> a == unicodedata.normalize('NFC', b)
True
Note: the code in your question may produce surprising results:
#XXX BROKEN, DON'T DO IT
...address_name.upper() in i[5].upper().encode('utf-8')...
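address_name.upper() calls the bytes.upper method, while i[5].upper() calls the unicode.upper method. The former does not support Unicode and may depend on the current locale. The latter is better, but to perform a case-insensitive comparison, use the .casefold() method instead (it exists as str.casefold() on Python 3.3+; Python 2's unicode type does not have it, so there you need a backport or have to fall back to .lower()):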
key = unicode_address_name.casefold()
... if key in i[5].casefold()...
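Putting normalization and case folding together, the filter from the question could look roughly like the sketch below. It assumes Python 3.3+ (for str.casefold()), that every value is already a Unicode string, and that the text to match sits in column 5 as in the question. canonical_caseless, filter_rows and the sample rows are hypothetical names and data, not part of wxPython or of the original program:

```python
import unicodedata


def canonical_caseless(text):
    """Return a key for caseless matching that ignores composed/decomposed forms."""
    decomposed = unicodedata.normalize("NFD", text)
    # Case folding can undo normalization, so normalize once more afterwards.
    return unicodedata.normalize("NFD", decomposed.casefold())


def filter_rows(rows, query, column=5):
    """Keep the rows whose text in `column` contains `query`, ignoring case."""
    key = canonical_caseless(query)
    return [row for row in rows if key in canonical_caseless(row[column])]


# Made-up rows; the first one stores its accented letters in decomposed form.
rows = [
    (1, "", "", "", "", u"Calle N\u0303andu\u0301 12"),
    (2, "", "", "", "", u"Avenida Barcelona 3"),
]
matches = filter_rows(rows, u"\xd1AND\xda")  # the query uses precomposed capitals
print([row[0] for row in matches])
```

No .encode() call is needed anywhere: both sides stay Unicode until the comparison is done.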
In general, if you need to sort unicode strings, you could use icu.Collator. Compare the default lexicographical sort:
>>> L = [u'sandwiches', u'angel delight', u'custard', u'éclairs', u'glühwein']
>>> sorted(L)
[u'angel delight', u'custard', u'gl\xfchwein', u'sandwiches', u'\xe9clairs']
with the order in the en_GB locale:
>>> import icu # PyICU
>>> collator = icu.Collator.createInstance(icu.Locale('en_GB'))
>>> sorted(L, key=collator.getSortKey)
[u'angel delight', u'custard', u'\xe9clairs', u'gl\xfchwein', u'sandwiches']
I solved the problem by changing the encoding from UTF-8 to cp1252 (Windows Western Europe). Apparently the text I was handling on Windows was cp1252-encoded rather than UTF-8, so treating it as UTF-8 failed on some characters. Thanks to Michael Shopsin above for the insight.
The program runs on Windows and uses a WxPython dialog, getting values from a line edit widget and matching the string against a database.
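For anyone hitting the same thing, here is a small sketch of decoding incoming bytes explicitly instead of relying on an implicit conversion. decode_text is a hypothetical helper, and trying UTF-8 first and cp1252 second is only a heuristic that assumes those are the two encodings your data can arrive in:

```python
def decode_text(raw, encodings=("utf-8", "cp1252")):
    """Decode a bytestring by trying each candidate encoding in order."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")


# Simulate text delivered as cp1252 bytes: the single byte 0xf1 stands for ñ.
raw = u"Espa\xf1a".encode("cp1252")
text = decode_text(raw)
print(text == u"Espa\xf1a")
```

Once the text is decoded to Unicode at the boundary, the rest of the program can compare strings without any further encode or decode calls.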
Thank you all for the attention, and I hope this post can help people in the future with a similar problem.