开发者

Python sorts "u11-Phrase 1000.wav" before "u11-Phrase 101.wav"; how can I overcome this?

I'm running Python 2.5 (r25:51908, Sep 19 2006, 09:52:17) [MSC v.1310 32 bit (Intel)] on win 32

When I'm asking Python

>>> "u11-Phrase 099.wav" <  "u11-Phrase 1000.wav"
True

That's fine. When I ask

>>> "u11-Phrase 100.wav" <  "u11-Phrase 1000.wav"
True

That's fine, too. But when I ask

>>> "u11-Phrase 101.wav" <  "u11-Phrase 1000.wav"
False

So according Python "u11-Phrase 100.wav" comes before "u11-Phrase 1000.wav" but "u11-Phrase 101.wav" comes after "u11-Phrase 1000.wav"! And this is problematic for me because I'm trying to write a file renaming program and this kind of sorting breaks the functionality.

What can I do to overcome this? Should I write my own cmp function and test for edge cases or is there a much simpler shortcut to give me the ordering I want?

On the other hand if I modify the strings such as

>>> "u11-Phrase 0101.wav" <  "u11-Phrase 1000.wav"
True

However those strings come from the file listing of directory such as:

files = glob.glob('*.wav')
files.sort()
for file in files:
    ...

So I'd rather not do surgical operations on the strings after they have been created by glob. And no, I don't want to change the original filenames in that folder开发者_如何学JAVA, too.

Any hints?


You are looking for human sorting.

The reason 101.wav is not less than 1000.wav is that computers (not just Python) sort strings character by character, and the first difference between these two strings is where the first string has a '1' and the second string has a '0'. '1' is not less than '0', so the strings compare as you have seen.

People naturally parse those strings into their components, and interpret the numbers numerically, not lexically. The code I linked to above will do that same sort of parsing.


You need to construct a proper sort key for each filename. Something like this should do what you want:

import re

def k(s):
    return [w.isdigit() and int(w) or w for w in re.split(r'(\d+)', s)]

files = ["u11-Phrase 099.wav", "u11-Phrase 1000.wav", "u11-Phrase 100.wav"]

print files
print sorted(files, key=k)

It gives this output:

['u11-Phrase 099.wav', 'u11-Phrase 1000.wav', 'u11-Phrase 100.wav']
['u11-Phrase 099.wav', 'u11-Phrase 100.wav', 'u11-Phrase 1000.wav']

The k function will split apart the filenames on sequences of digits and (more importantly) turn those sequences into integers:

>>> k('u11-Phrase 099.wav')
['u', 11, '-Phrase ', 99, '.wav']

We then use the fact that Python knows how to sort lists --- it sorts the lists by comparing each element one by one. The end result is that

>>> k('u11-Phrase 99.wav') < k('u11-Phrase 100.wav')
True

whereas

>>> 'u11-Phrase 99.wav' < 'u11-Phrase 100.wav'
False

as you've already found out.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜