Python sorts "u11-Phrase 1000.wav" before "u11-Phrase 101.wav"; how can I overcome this?

2022-12-14 08:05 Q&A author:

I'm running Python 2.5 (r25:51908, Sep 19 2006, 09:52:17) [MSC v.1310 32 bit (Intel)] on win 32

When I ask Python

>>> "u11-Phrase 099.wav" <  "u11-Phrase 1000.wav"
True

That's fine. When I ask

>>> "u11-Phrase 100.wav" <  "u11-Phrase 1000.wav"
True

That's fine, too. But when I ask

>>> "u11-Phrase 101.wav" <  "u11-Phrase 1000.wav"
False

So according to Python, "u11-Phrase 100.wav" comes before "u11-Phrase 1000.wav" but "u11-Phrase 101.wav" comes after "u11-Phrase 1000.wav"! This is a problem for me because I'm trying to write a file renaming program, and this kind of sorting breaks it.

What can I do to overcome this? Should I write my own cmp function and test for edge cases or is there a much simpler shortcut to give me the ordering I want?

On the other hand, if I zero-pad the number in the string, the comparison gives the result I want:

>>> "u11-Phrase 0101.wav" <  "u11-Phrase 1000.wav"
True

However, those strings come from a directory listing, like this:

files = glob.glob('*.wav')
files.sort()
for file in files:
    ...

So I'd rather not do surgical operations on the strings after they have been created by glob. And no, I don't want to change the original filenames in that folder, either.

Any hints?

You are looking for human sorting.

The reason 101.wav is not less than 1000.wav is that computers (not just Python) sort strings character by character, and the first difference between these two strings is where the first string has a '1' and the second string has a '0'. '1' is not less than '0', so the strings compare as you have seen.
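To see exactly where the two names part ways, here is a small sketch. The helper name first_difference is made up for illustration and is not part of any library:

```python
def first_difference(a, b):
    """Return (index, char_a, char_b) for the first position where a and b differ.

    Returns None if one string is a prefix of the other.
    """
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i, ca, cb
    return None

a = "u11-Phrase 101.wav"
b = "u11-Phrase 1000.wav"

# The names agree up to index 12; at index 13 a has '1' and b has '0'.
diff = first_difference(a, b)

# String comparison is decided by that single pair of characters.
a_sorts_first = a < b   # False, because '1' > '0'
```

Everything after index 13 is ignored by the comparison, which is why the length of the number plays no role.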

People naturally parse those strings into their components, and interpret the numbers numerically, not lexically. A human-sorting routine does that same sort of parsing.

You need to construct a proper sort key for each filename. Something like this should do what you want:

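import re

def k(s):
    # Split on runs of digits (the capturing group keeps them in the result)
    # and turn each run into an int so it compares numerically.
    return [int(w) if w.isdigit() else w for w in re.split(r'(\d+)', s)]

files = ["u11-Phrase 099.wav", "u11-Phrase 1000.wav", "u11-Phrase 100.wav"]

print(files)
print(sorted(files, key=k))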

It gives this output:

['u11-Phrase 099.wav', 'u11-Phrase 1000.wav', 'u11-Phrase 100.wav']
['u11-Phrase 099.wav', 'u11-Phrase 100.wav', 'u11-Phrase 1000.wav']

The k function will split apart the filenames on sequences of digits and (more importantly) turn those sequences into integers:

>>> k('u11-Phrase 099.wav')
['u', 11, '-Phrase ', 99, '.wav']
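It may help to look at the split step on its own, before any conversion to int. This is only a sketch; the name DIGIT_RUNS is mine:

```python
import re

# The parentheses make this a capturing group, so re.split keeps the
# digit runs in the result instead of throwing them away.
DIGIT_RUNS = re.compile(r'(\d+)')

parts = DIGIT_RUNS.split('u11-Phrase 099.wav')
# ['u', '11', '-Phrase ', '099', '.wav']

# A name that starts with digits still begins with a (possibly empty)
# text piece, so text sits at even indices and digit runs at odd ones.
leading = DIGIT_RUNS.split('099.wav')
# ['', '099', '.wav']
```

Because of that alternation, two keys always compare text with text and number with number at the same position. That matters on Python 3, where ordering an int against a str raises TypeError.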

We then use the fact that Python knows how to sort lists: it compares them element by element, and the first pair that differs decides the order. The end result is that

>>> k('u11-Phrase 99.wav') < k('u11-Phrase 100.wav')
True

whereas

>>> 'u11-Phrase 99.wav' < 'u11-Phrase 100.wav'
False

as you've already found out.
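Plugged into the glob loop from the question, it could look roughly like this. This is a sketch, not a drop-in: natural_key and sorted_wavs are names I made up, and I'm assuming the files sit in the current directory as in the question.

```python
import glob
import re

def natural_key(name):
    # re.split with a capturing group puts digit runs at odd indices,
    # so convert those to int and leave the text pieces alone.
    parts = re.split(r'(\d+)', name)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

def sorted_wavs(pattern='*.wav'):
    # The names that glob returns are left untouched; only the order changes.
    return sorted(glob.glob(pattern), key=natural_key)

# The filenames from the question, in the order a plain sort() gives.
names = ["u11-Phrase 099.wav", "u11-Phrase 100.wav",
         "u11-Phrase 1000.wav", "u11-Phrase 101.wav"]
ordered = sorted(names, key=natural_key)
# 099, 100, 101, 1000
```

In the renaming script you would then loop over sorted_wavs() instead of calling files.sort(). If you prefer to keep the list in place, files.sort(key=natural_key) does the same thing.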
