How can I partially sort a Python list?
I wrote a compiler cache for MSVC (much like ccache for gcc). One of the things I have to do is to remove the oldest object files in my cache directory to trim the cache to a user-defined size.
Right now, I basically have a list of tuples, each of which is the last access time and the file size:
# First tuple element is the access time, second tuple element is file size
items = [ (1, 42341),
          (3, 22),
          (0, 3234),
          (2, 42342),
          (4, 123) ]
Now I'd like to do a partial sort on this list so that the first N elements are sorted (where N is the smallest number of elements such that the sum of their sizes exceeds 45000). The result should be basically this:
# Partially sorted list; only first two elements are sorted because the sum of
# their second field is larger than 45000.
items = [ (0, 3234),
          (1, 42341),
          (3, 22),
          (2, 42342),
          (4, 123) ]
I don't really care about the order of the unsorted entries, I just need the N oldest items in the list whose cumulative size exceeds a certain value.
You could use the heapq module. Call heapify() on the list, followed by heappop() until your condition is met. heapify() is linear and heappop() is logarithmic, so it's likely as fast as you can get.
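import heapq

heapq.heapify(items)  # O(n): turns the list into a min-heap keyed on access time
size = 0
while items and size < 45000:
    item = heapq.heappop(items)  # O(log n): removes the oldest remaining entry
    size += item[1]
    print(item)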
Output:
(0, 3234)
(1, 42341)
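If you want the "partially sorted list" shape from the question rather than printed output, the same idea can be wrapped in a small helper. This is only a sketch: the name oldest_until and its return convention are made up for illustration, and it works on a copy so the caller's list is left alone.

```python
import heapq

def oldest_until(items, limit):
    """Return (oldest, rest).

    `oldest` holds the oldest entries in ascending access-time order,
    popped until their sizes sum to at least `limit`.
    `rest` holds the remaining entries in heap order (not sorted).
    """
    heap = list(items)      # copy, so the caller's list is not modified
    heapq.heapify(heap)     # O(n)
    oldest = []
    size = 0
    while heap and size < limit:
        item = heapq.heappop(heap)  # O(log n) per pop
        size += item[1]
        oldest.append(item)
    return oldest, heap

items = [(1, 42341), (3, 22), (0, 3234), (2, 42342), (4, 123)]
oldest, rest = oldest_until(items, 45000)
partially_sorted = oldest + rest
```

Here oldest is the list of files to delete, and oldest + rest is the full list with only its front sorted.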
I don't know of anything canned, but you could do this with a variant of any sort which incrementally builds the sorted list from one end to the other, but which simply stops when enough elements have been sorted. Quicksort would be the obvious choice. Selection sort would do, but it's a terrible sort. Heapsort, as Marco suggests, would also do it, taking the heapify of the whole array as a sunk cost. Mergesort couldn't be used this way.
To look at quicksort specifically, you would simply need to track a high water mark of how far into the array has been sorted so far, and the total file size of those elements. At the end of each sub-sort, you update those numbers by adding in the newly-sorted elements. Abandon the sort when it passes the target.
You might also find that performance improves if you change the pivot-selection step. If you only expect to sort a small fraction of the array, you might prefer lopsided pivots that leave a small left-hand partition.
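A rough sketch of that early-exit quicksort follows. The names partial_quicksort and _partition are invented for this example, the pivot is chosen at random rather than lopsidedly, and an explicit stack replaces recursion so the leftmost pending range is always handled first. Treat it as an illustration of the high water mark idea, not a tuned implementation.

```python
import random

def _partition(items, lo, hi):
    """Lomuto partition of items[lo:hi]; returns the pivot's final index."""
    pivot_index = random.randrange(lo, hi)
    items[pivot_index], items[hi - 1] = items[hi - 1], items[pivot_index]
    pivot = items[hi - 1]
    store = lo
    for k in range(lo, hi - 1):
        if items[k] < pivot:
            items[k], items[store] = items[store], items[k]
            store += 1
    items[store], items[hi - 1] = items[hi - 1], items[store]
    return store

def partial_quicksort(items, limit):
    """Sort items in place from the front, stopping once the sorted
    prefix's sizes (second tuple field) sum to at least `limit`.
    Returns the length of that sorted prefix (the high water mark)."""
    done = 0    # items[:done] is sorted and in its final position
    total = 0   # sum of sizes in items[:done]
    stack = [(0, len(items))]  # pending half-open ranges, leftmost on top
    while stack and total < limit:
        lo, hi = stack.pop()
        if hi - lo <= 1:
            # A range of zero or one element is already in place.
            for k in range(lo, hi):
                total += items[k][1]
                done += 1
            continue
        p = _partition(items, lo, hi)
        # Push right part, then the pivot, then the left part,
        # so the left part is sorted before anything to its right.
        stack.append((p + 1, hi))
        stack.append((p, p + 1))
        stack.append((lo, p))
    return done
```

For the list in the question and a limit of 45000, the function returns 2 and leaves (0, 3234) and (1, 42341) at the front; the order of the remaining three entries depends on the random pivots.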
Partial sorting (see the Wikipedia page) is more efficient than actual sorting. The algorithms are analogous to sorting algorithms. I'll outline heap-based partial sort (though it's not the most efficient on that page).
You want the oldest ones. You stick the elements in a heap, one by one, and pop off the newest element in the heap when it gets too big. Since the heap is kept small, you don't pay as much to insert and remove elements.
In the standard case, you want the smallest/biggest k elements. You want the oldest elements which satisfy a total condition, so keep track of the total condition by keeping a total_size variable.
Code:
import heapq

def partial_bounded_sort(lst, n):
    """
    Returns minimal collection of oldest elements
    s.t. total size >= n.
    """
    # `pqueue` holds (-atime, fsize) pairs.
    # We negate atime, because heapq implements a min-heap,
    # and we want to throw out newer things.
    pqueue = []
    total_size = 0
    for atime, fsize in lst:
        # Add it to the queue.
        heapq.heappush(pqueue, (-atime, fsize))
        total_size += fsize
        # Pop off newest items which aren't needed for maintaining size.
        # The `pqueue` check avoids indexing an empty heap when n <= 0.
        while pqueue and total_size - pqueue[0][1] >= n:
            _, topsize = heapq.heappop(pqueue)
            total_size -= topsize
    # Un-negate atime and do a final sort.
    oldest = sorted((-priority, fsize) for priority, fsize in pqueue)
    return oldest
There are a few things you can do to microoptimize this code. For example, you can fill in the list with the first few items and heapify it all at once.
The complexity could be better than that of sorting. In your particular problem, you don't know the number of elements you'll return, or even how many elements could be in the queue at once. In the worst case, you sort almost all of the list. You might be able to prevent this by preprocessing the list to see whether it's easier to find the set of new things or the set of old things.
If you want to keep track of which items are and aren't removed, you can keep two "pointers" into the original list: one to keep track of what you've processed, and one marking the "free" space. When processing an item, erase it from the list, and when throwing away an item from the heap, put it back into the list. The list will end up with the items that are not in the heap, plus some None entries at the end.
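Here is one way the two-pointer bookkeeping could look. The function name split_oldest_in_place is hypothetical, and the sketch assumes positive file sizes; it reuses the negated-atime heap from the code above and writes discarded items back into slots that have already been read.

```python
import heapq

def split_oldest_in_place(lst, n):
    """Return the oldest items whose total size is >= n, sorted by atime.

    Side effect: `lst` is overwritten so that it holds the items that
    were NOT selected, followed by None entries.
    """
    pqueue = []      # (-atime, fsize) pairs; the newest item is on top
    total_size = 0
    free = 0         # next slot in `lst` that may be overwritten
    for i in range(len(lst)):   # `i` is the "processed" pointer
        atime, fsize = lst[i]
        lst[i] = None           # erase the item once it is in the heap
        heapq.heappush(pqueue, (-atime, fsize))
        total_size += fsize
        # Throw away the newest items that aren't needed, putting them
        # back into the list. `free` never passes `i`, because at most
        # i + 1 items can have been discarded at this point.
        while pqueue and total_size - pqueue[0][1] >= n:
            neg_atime, size = heapq.heappop(pqueue)
            total_size -= size
            lst[free] = (-neg_atime, size)
            free += 1
    return sorted((-neg_atime, size) for neg_atime, size in pqueue)
```

With the question's list and n = 45000, this returns [(0, 3234), (1, 42341)] and leaves the list as [(3, 22), (2, 42342), (4, 123), None, None].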