Sorting strings with accented characters in Python [duplicate]

This question already has answers here: Closed 12 years ago.

Possible Duplicate:

Python not sorting unicode properly. Strcoll doesn't help.

I'm trying to sort some words in alphabetical order. Here is how I do it:

#!/opt/local/bin/python2.7
# -*- coding: utf-8 -*-

import locale

# Make sure the locale is French
locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")
print "locale: " + str(locale.getlocale())

# The words are in alphabetical order
words = ["liche", "lichée", "lichen", "lichénoïde", "licher", "lichoter"]

for word in sorted(words, cmp=locale.strcoll):
    print word.decode("string-escape")

I expect the words to be printed in the same order in which they are defined, but here is what I get:

locale: ('fr_FR', 'UTF8')
liche
lichen
licher
lichoter
lichée
lichénoïde

The é character is treated as if it's greater than z.

It seems I'm misunderstanding how locale.strcoll is comparing strings. What comparator function should I use to get the words sorted alphabetically?


In the end, I chose to strip the diacritics and compare the stripped versions of the strings, so that I don't have to add the PyICU dependency.
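
For anyone wanting to try the same workaround, here is a minimal sketch of what stripping diacritics could look like. It is not my exact code: the names strip_diacritics and sort_key are made up for illustration, and it assumes Python 3, where str is already Unicode (with Python 2.7, as in the script above, the words would first have to be decoded to unicode objects). It only approximates alphabetical order by ignoring accents; it does not implement real French collation rules.

```python
import unicodedata


def strip_diacritics(text):
    """Return text with combining marks removed, e.g. 'é' becomes 'e'."""
    # NFD splits an accented character into its base letter followed by
    # one or more combining marks; we then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def sort_key(word):
    # Hypothetical helper: compare on the stripped form first, then use the
    # original word as a tie-breaker so words that differ only by accents
    # still sort in a deterministic order.
    return (strip_diacritics(word), word)


# Same words as in the question, deliberately shuffled.
words = ["lichoter", "lichénoïde", "licher", "lichée", "lichen", "liche"]
sorted_words = sorted(words, key=sort_key)

for word in sorted_words:
    print(word)
```

With this key, "lichée" is compared as "lichee" and "lichénoïde" as "lichenoide", so é is no longer treated as greater than z.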
