Sorting algorithm in OpenOffice calc
I'm having a really long day, the culmination of which is a dumb moment trying to sort a list of strings. Calc sorts them like this:
0DCv6UlY6T0
0ITZEBZrwMk
1062VEX2EfI
2jk7hilGMs0
2lZVu3haI6A
3f8s3KbFQ0Q
3hB09daYLmk
43Erj3qFxxo
6lj33w3YoOw
7jiNQnkfx0k
7TSMj6g3UoE
7Wba8IUk6v8
9hbG9dS7zl0
ALThJiGFBSc
by_VzOiPhZM
Ce250P1xep0
Cgx6DV6RJg8
d5dDgLRd1-o
DnyzZwaYDXE
dO5KLh2er4E
This isn't quite what I expected. Look at the last 3 values. Shouldn't the entry starting with a capital D come before the ones starting with lowercase d (or the other way around)? Why does it come between the lowercase d entries?
Funnily, command line sort in Linux does things the same way. Can somebody explain the logic behind such sorting? I need to replicate it (or reproduce it in Python, if it's already implemented somewhere).
It's because of locale. See the difference between:
sort inputfile
and with (what you probably want):
LANG="C" sort inputfile
Output of the second command:
0DCv6UlY6T0
0ITZEBZrwMk
1062VEX2EfI
2jk7hilGMs0
2lZVu3haI6A
3f8s3KbFQ0Q
3hB09daYLmk
43Erj3qFxxo
6lj33w3YoOw
7TSMj6g3UoE
7Wba8IUk6v8
7jiNQnkfx0k
9hbG9dS7zl0
ALThJiGFBSc
Ce250P1xep0
Cgx6DV6RJg8
DnyzZwaYDXE
by_VzOiPhZM
d5dDgLRd1-o
dO5KLh2er4E
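How capitals are ordered relative to lower-case letters depends on the locale (specifically LC_COLLATE). In the C locale strings are compared byte by byte, so every capital sorts before every lower-case letter; many other locales compare the letters without regard to case first and use case only to break ties. That explains the command line sort program (and ls, among others), and presumably also OpenOffice.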
E.g.
$ cat test
Abc
aabc
$ sort test
aabc
Abc
$ LC_COLLATE=C sort test
Abc
aabc
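The same two behaviours can be seen from Python. This is only a sketch: the helper name locale_sorted is made up for illustration, and the locale-aware result depends on which locales are installed on your machine, so no output is shown for it.

```python
import locale

words = ["Abc", "aabc"]

# Plain sorted() compares by code point, so "A" (65) sorts before "a" (97).
# For ASCII input this matches what LC_COLLATE=C sort produces.
c_order = sorted(words)
print(c_order)  # ['Abc', 'aabc']


def locale_sorted(items):
    """Sort with the collation rules of the environment's locale (hypothetical helper)."""
    try:
        # "" means: take the locale from the environment (LANG, LC_COLLATE, ...).
        locale.setlocale(locale.LC_COLLATE, "")
    except locale.Error:
        # The configured locale is not installed; fall back to code point order.
        return sorted(items)
    return sorted(items, key=locale.strxfrm)


# Result depends on the active locale, e.g. it may differ between C and en_US.UTF-8.
print(locale_sorted(words))
```

locale.strxfrm turns each string into a form that compares according to LC_COLLATE, which is why it can be passed directly as the key.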
For replication:
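data = ["abc", "aBB", "abD", "Aac", "AAb", "ABc", "ABa"]
# Compare the items case-insensitively by sorting on their upper-cased form
print(sorted(data, key=lambda item: item.upper()))
# ['AAb', 'Aac', 'ABa', 'aBB', 'abc', 'ABc', 'abD']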
The trick is to provide the key argument. This function is applied to each list item, and the result is used for comparisons during the sort.
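If you want something closer to what Calc showed, a two-part key might do. The sketch below is an approximation, not Calc's actual algorithm: calc_like_key is a made-up name, and the tie-break (lowercase before uppercase when the strings differ only in case) is an assumption about the locale's rules. It also ignores any special treatment of punctuation that a real locale may apply.

```python
def calc_like_key(s):
    # Primary key: the text with case ignored.
    # Tie-break (assumed rule): swapcase() makes an all-lowercase spelling
    # sort ahead of the same word with capitals.
    return (s.lower(), s.swapcase())


# A few of the IDs from the question, in LANG=C order.
ids = [
    "7TSMj6g3UoE",
    "7jiNQnkfx0k",
    "DnyzZwaYDXE",
    "by_VzOiPhZM",
    "d5dDgLRd1-o",
    "dO5KLh2er4E",
]

ordered = sorted(ids, key=calc_like_key)
for item in ordered:
    print(item)
```

For these six strings this gives 7jiNQnkfx0k, 7TSMj6g3UoE, by_VzOiPhZM, d5dDgLRd1-o, DnyzZwaYDXE, dO5KLh2er4E, which is the relative order Calc produced in the question.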