Python not sorting unicode properly. Strcoll doesn't help
I've got a problem with sorting lists using unicode collation in Python 2.5.1 and 2.6.5 on OSX, as well as on Linux.
import locale
locale.setlocale(locale.LC_ALL, 'pl_PL.UTF-8')
print [i for i in sorted([u'a', u'z', u'ą'], cmp=locale.strcoll)]
Which should print:
[u'a', u'ą', u'z']
But instead prints out:
[u'a', u'z', u'ą']
Summing it up: it looks as if strcoll is broken. I tried it with various types of variables (e.g. non-unicode encoded strings).
What am I doing wrong?
Best regards, Tomasz Kopczuk.
Apparently, the only way for sorting to work on all platforms is to use the ICU library with PyICU bindings (PyICU on PyPI).
On OS X: sudo port install py26-pyicu, minding the bug described here: https://svn.macports.org/ticket/23429 (oh the joy of using macports).
PyICU's documentation is unfortunately severely lacking, but I managed to find out how it's done:
import PyICU
collator = PyICU.Collator.createInstance(PyICU.Locale('pl_PL.UTF-8'))
print [i for i in sorted([u'a', u'z', u'ą'], cmp=collator.compare)]
which gives:
[u'a', u'ą', u'z']
Another advantage (@bobince): the collator is thread-safe, so unlike setlocale it remains usable when the locale has to be chosen per request.
Just to add to tkopczuk's investigation: This is definitely a gcc bug, at least for version 4.2.1 on OS X 10.6.4. It can be reproduced by calling C strcoll() directly as in this snippet.
EDIT: Still on the same system, I find that for the UTF-8 versions of de_DE, fr_FR and pl_PL the problem is there, but for the ISO-8859-1 versions of fr_FR and de_DE the sort order is correct. Unfortunately for the OP, ISO-8859-2 pl_PL is also buggy:
The order for Polish ISO-8859 is:
LATIN SMALL LETTER A
LATIN SMALL LETTER Z
LATIN SMALL LETTER A WITH OGONEK
The LC_COLLATE culture and encoding settings were pl_PL, ISO8859-2.
The order for Polish Unicode is:
LATIN SMALL LETTER A
LATIN SMALL LETTER Z
LATIN SMALL LETTER A WITH OGONEK
The LC_COLLATE culture and encoding settings were pl_PL, UTF8.
The order for German Unicode is:
LATIN SMALL LETTER A
LATIN SMALL LETTER Z
LATIN SMALL LETTER A WITH DIAERESIS
The LC_COLLATE culture and encoding settings were de_DE, UTF8.
The order for German ISO-8859 is:
LATIN SMALL LETTER A
LATIN SMALL LETTER A WITH DIAERESIS
LATIN SMALL LETTER Z
The LC_COLLATE culture and encoding settings were de_DE, ISO8859-1.
The order for French ISO-8859 is:
LATIN SMALL LETTER A
LATIN SMALL LETTER E WITH ACUTE
LATIN SMALL LETTER Z
The LC_COLLATE culture and encoding settings were fr_FR, ISO8859-1.
The order for French Unicode is:
LATIN SMALL LETTER A
LATIN SMALL LETTER Z
LATIN SMALL LETTER E WITH ACUTE
The LC_COLLATE culture and encoding settings were fr_FR, UTF8.
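A similar check can be run from Python itself. The following is only a rough sketch, not the C snippet referred to above: the helper name collation_report and the locale names in the demo are my own assumptions, and locales that are not installed are reported as None instead of raising.

```python
import locale
import unicodedata
from functools import cmp_to_key


def collation_report(locale_names, chars):
    """Map each locale name to chars sorted with strcoll (None if unavailable)."""
    report = {}
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        for name in locale_names:
            try:
                locale.setlocale(locale.LC_COLLATE, name)
            except locale.Error:
                report[name] = None  # locale not installed on this system
                continue
            report[name] = sorted(chars, key=cmp_to_key(locale.strcoll))
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # always restore
    return report


if __name__ == "__main__":
    # Hypothetical selection of locales; adjust to what `locale -a` lists.
    names = ["pl_PL.UTF-8", "de_DE.UTF-8", "fr_FR.UTF-8"]
    for name, order in collation_report(names, ["a", "z", "ą"]).items():
        print(name)
        if order is None:
            print("  (locale not available)")
            continue
        for ch in order:
            print("  " + unicodedata.name(ch))
```

If a locale prints the accented letter after "z", the platform's collation tables for that locale are most likely the culprit rather than Python.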
Here is how I managed to sort Persian correctly (without PyICU), using Python 3.x:
First set the locale (don't forget to import locale and platform):
import locale
import platform

if platform.system() == 'Linux':
    locale.setlocale(locale.LC_ALL, 'fa_IR.UTF-8')
elif platform.system() == 'Windows':
    locale.setlocale(locale.LC_ALL, 'Persian_Iran.1256')
else:
    pass  # set the appropriate locale name for any other OS here
Then sort using key:
a = ['ا','ب','پ','ت','ث','ج','چ','ح','خ','د','ذ','ر','ز','ژ','س','ش','ص','ض','ط','ظ','ع','غ','ف','ق','ک','گ','ل','م','ن','و','ه','ي']
print(sorted(a,key=locale.strxfrm))
For a list of objects:
a = [{'id':"ا"},{'id':"ب"},{'id':"پ"},{'id':"ت"},{'id':"ث"},{'id':"ج"},{'id':"چ"},{'id':"ح"},{'id':"خ"},{'id':"د"},{'id':"ذ"},{'id':"ر"},{'id':"ز"},{'id':"ژ"},{'id':"س"},{'id':"ش"},{'id':"ص"},{'id':"ض"},{'id':"ط"},{'id':"ظ"},{'id':"ع"},{'id':"غ"},{'id':"ف"},{'id':"ق"},{'id':"ک"},{'id':"گ"},{'id':"ل"},{'id':"م"},{'id':"ن"},{'id':"و"},{'id':"ه"},{'id':"ي"}]
print(sorted(a, key=lambda x: locale.strxfrm(x['id'])))
Finally, you can switch back to the user's default locale:
locale.setlocale(locale.LC_ALL, '')
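The set, sort, restore steps above can also be wrapped so that the previous setting always comes back, even if sorting fails. This is a minimal sketch under my own assumptions: the names temporary_collation and sorted_in_locale are hypothetical, only LC_COLLATE is touched, and a plain code point sort is used when the locale is not installed. Keep in mind that setlocale changes the whole process, so this does not make it safe to use from several threads at once.

```python
import locale
from contextlib import contextmanager


@contextmanager
def temporary_collation(name):
    """Switch LC_COLLATE to `name` for the duration of a with-block."""
    saved = locale.setlocale(locale.LC_COLLATE)  # remember current setting
    locale.setlocale(locale.LC_COLLATE, name)    # raises locale.Error if missing
    try:
        yield
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore on exit


def sorted_in_locale(items, name):
    """Sort with the locale's collation, or by code point if it is unavailable."""
    try:
        with temporary_collation(name):
            return sorted(items, key=locale.strxfrm)
    except locale.Error:
        return sorted(items)


if __name__ == "__main__":
    print(sorted_in_locale(['پ', 'ب', 'ا'], 'fa_IR.UTF-8'))
```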
@gnibbler, using PyICU with the sorted() function does work in a Python3 Environment. After a little digging through the ICU API documentation and some experimentation, I came across the getSortKey() function:
import PyICU
collator = PyICU.Collator.createInstance(PyICU.Locale('de_DE.UTF-8'))
sorted(['a','b','c','ä'],key=collator.getSortKey)
which produces the desired collation:
['a', 'ä', 'b', 'c']
instead of the undesired collation:
sorted(['a','b','c','ä'])
['a', 'b', 'c', 'ä']
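One note for anyone trying this today: as far as I know, current PyICU releases are imported as icu rather than PyICU. A hedged sketch of the same idea follows; the wrapper icu_sorted and its fallback to a plain sorted() when the package is missing are my own additions, not part of PyICU.

```python
try:
    import icu  # module name used by current PyICU releases
except ImportError:
    icu = None  # PyICU not installed


def icu_sorted(items, locale_name="de_DE"):
    """Sort with an ICU collator; plain code point sort if PyICU is missing."""
    if icu is None:
        return sorted(items)
    collator = icu.Collator.createInstance(icu.Locale(locale_name))
    # getSortKey returns a byte string that compares in collation order
    return sorted(items, key=collator.getSortKey)


if __name__ == "__main__":
    print(icu_sorted(['a', 'b', 'c', 'ä']))
```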
import locale
from functools import cmp_to_key
iterable = [u'a', u'z', u'ą']
sorted(iterable, key=cmp_to_key(locale.strcoll)) # locale-aware sort order
(Ref.: http://docs.python.org/3.3/library/functools.html)
Since 2012 there's been a library natsort. It includes amazing functions such as natsorted and humansorted. More importantly, they work not only with lists! Code:
from natsort import natsorted, humansorted
lst = [u"a", u"z", u"ą"]
dct = {"ą": 1, "ż": 3, "Ż": 4, "b": 5}
lst_natsorted = natsorted(lst)
lst_humansorted = humansorted(lst)
dct_natsorted = dict(natsorted(dct.items()))
dct_humansorted = dict(humansorted(dct.items()))
print("List natsorted: ", lst_natsorted)
print("List humansorted: ", lst_humansorted, "\n")
print("Dictionary natsorted: ", dct_natsorted)
print("Dictionary humansorted: ", dct_humansorted)
Output:
List natsorted: ['a', 'ą', 'z']
List humansorted: ['a', 'ą', 'z']
Dictionary natsorted: {'Ż': 4, 'ą': 1, 'b': 5, 'ż': 3}
Dictionary humansorted: {'ą': 1, 'b': 5, 'ż': 3, 'Ż': 4}
As you can see, the two functions order the dictionaries differently, but for the given list both return the correct result.
By the way, this library is also great to sort strings containing numbers:
from natsort import natsorted, humansorted
lst_mixed = ["a9", "a10", "a1", "c4", "c40", "c5"]
mixed_sorted = sorted(lst_mixed)
mixed_natsorted = natsorted(lst_mixed)
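mixed_humansorted = humansorted(lst_mixed)
print("List with mixed strings sorted: ", mixed_sorted)
print("List with mixed strings natsorted: ", mixed_natsorted)
print("List with mixed strings humansorted: ", mixed_humansorted)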
Output:
List with mixed strings sorted: ['a1', 'a10', 'a9', 'c4', 'c40', 'c5']
List with mixed strings natsorted: ['a1', 'a9', 'a10', 'c4', 'c5', 'c40']
List with mixed strings humansorted: ['a1', 'a9', 'a10', 'c4', 'c5', 'c40']
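If you can't install natsort, the number handling can be roughly approximated with the standard library. To be clear, this is not how natsort is implemented and it ignores locale collation entirely; natural_key is a hypothetical helper that just splits each string into text and integer chunks.

```python
import re


def natural_key(text):
    """Split text into alternating string and integer chunks for sorting."""
    # re.split with a capturing group yields text at even indices and
    # digit runs at odd indices, so types always line up when compared.
    parts = re.split(r'(\d+)', text)
    return [int(p) if i % 2 else p.casefold() for i, p in enumerate(parts)]


lst_mixed = ["a9", "a10", "a1", "c4", "c40", "c5"]
print(sorted(lst_mixed, key=natural_key))
# ['a1', 'a9', 'a10', 'c4', 'c5', 'c40']
```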
On Ubuntu Lucid the sorting with cmp seems to work OK, but my output encoding is wrong.
>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'pl_PL.UTF-8')
'pl_PL.UTF-8'
>>> print [i for i in sorted([u'a', u'z', u'ą'], cmp=locale.strcoll)]
[u'a', u'\u0105', u'z']
Using key with locale.strxfrm does not work, unless I am missing something:
>>> print [i for i in sorted([u'a', u'z', u'ą'], key=locale.strxfrm)]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0105' in position 0: ordinal not in range(128)
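The traceback suggests that Python 2's strxfrm tries to encode the unicode object as ASCII first. A possible workaround, which I am offering as an assumption rather than a verified fix, is to encode to the locale's encoding yourself before calling strxfrm. The helper name strxfrm_key is made up, and UTF-8 is assumed to match the active locale; on Python 3 the string is passed through unchanged.

```python
import locale
import sys


def strxfrm_key(text, encoding='utf-8'):
    """Sort key built on locale.strxfrm that also accepts Python 2 unicode."""
    # Python 2 only: encode unicode objects to bytes (assumed UTF-8 locale)
    if sys.version_info[0] < 3 and not isinstance(text, bytes):
        text = text.encode(encoding)
    return locale.strxfrm(text)


if __name__ == "__main__":
    try:
        locale.setlocale(locale.LC_ALL, 'pl_PL.UTF-8')
    except locale.Error:
        pass  # locale not installed; the current locale stays in effect
    print(sorted([u'a', u'z', u'ą'], key=strxfrm_key))
```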