Efficient priority list
I am looking for an efficient data structure to represent a priority list. Specifically, I need to assign a priority to a set of items and return only the top-scoring items. I have looked into priority queues, which operate on heaps, but they don't seem to really suit my needs: they reorganize the heap structure as soon as I poll the top-rated item from the queue.
The simplest solution would of course be a sorted linked list, but in the worst case its insertion operation takes linear time.
Does anyone have a better solution?
Heaps seem very suitable, and it seems like you are going about it wrongly.
Say you wanted the top x elements (how does this x compare to n, btw?)
What you are doing is putting all into a max-heap and getting the top x.
I suggest instead, you use a min-heap of exactly x elements.
First x elements you insert into heap.
Next incoming element, you compare against the min which can be done very quickly (O(1) time) in the heap. If smaller, you just ignore the incoming element.
If the incoming element is larger than the min, you replace the min with the incoming element and sift it down in the heap. This takes O(log x) time at worst.
Once done (in O(n log x) time), you can retrieve the elements from the heap in sorted order in O(x log x) time.
Depending on how your data is (and how small x is), using this min-heap solution can be really fast.
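The min-heap steps above can be sketched roughly like this in Python. The function name `top_x` is made up for illustration, and the sketch assumes the items are directly comparable scores.

```python
import heapq

def top_x(items, x):
    """Return the x largest items, largest first, using a min-heap of size x."""
    if x <= 0:
        return []
    heap = []
    for item in items:
        if len(heap) < x:
            # Fill phase: the first x elements go straight into the heap.
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # heap[0] is the current minimum, so this check is O(1).
            # Replace the minimum and sift down: O(log x).
            heapq.heapreplace(heap, item)
        # Otherwise the item is no better than the current minimum: ignore it.
    # Final retrieval in sorted order: O(x log x).
    return sorted(heap, reverse=True)
```

For example, `top_x([5, 1, 9, 3, 7, 2, 8], 3)` gives `[9, 8, 7]`. The standard library's `heapq.nlargest` covers the same use case if you don't need to control the steps yourself.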
If you really really want the inserts to be super-fast and don't care much about the retrieval then you can also do the following.
Insert the elements into a vector (array with amortized O(1) insert time) in the order they come.
Then use the Selection algorithm to find the xth largest element (in O(n) time, but the constants might be big). Say that number is S.
Now walk the array comparing each element with S and select the ones at least as large as S.
If x is reasonably sized and comparable to n (like n/2 or something) this might work out fine, but if x is small compared to n, I would suggest going with the min-heap.
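A rough sketch of the selection-based variant, for comparison. Note the assumption: it uses randomized quickselect, which is O(n) on average rather than the worst-case linear median-of-medians algorithm. The names `kth_largest` and `top_x_by_selection` are hypothetical.

```python
import random

def kth_largest(values, k):
    """Return the k-th largest value (1-based) using randomized quickselect."""
    if not 1 <= k <= len(values):
        raise ValueError("k out of range")
    candidates = list(values)
    while True:
        pivot = random.choice(candidates)
        larger = [v for v in candidates if v > pivot]
        equal_count = sum(1 for v in candidates if v == pivot)
        if k <= len(larger):
            # The answer lies among the elements larger than the pivot.
            candidates = larger
        elif k <= len(larger) + equal_count:
            return pivot
        else:
            # Skip everything >= pivot and keep searching the smaller elements.
            k -= len(larger) + equal_count
            candidates = [v for v in candidates if v < pivot]

def top_x_by_selection(values, x):
    """Return the x largest values (unsorted) by selecting a threshold S first."""
    if x <= 0 or not values:
        return []
    x = min(x, len(values))
    s = kth_largest(values, x)
    above = [v for v in values if v > s]
    ties = [v for v in values if v == s]
    # Take only as many ties as needed so exactly x elements come back.
    return above + ties[: x - len(above)]
```

The result is not sorted; sorting it afterwards costs another O(x log x).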
Hmm. Skip lists? They should have O(log n) expected insertion (like a heap-based queue), but getting the top element should be O(1), including removing it. They can even be implemented with lock-free algorithms.
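To make the skip list idea concrete, here is a minimal single-threaded sketch (not lock-free). It assumes scores are kept in descending order so the top element sits right after the head; `SkipList`, `_Node`, `insert` and `pop_top` are names invented for this example.

```python
import random

class _Node:
    __slots__ = ("score", "forward")

    def __init__(self, score, level):
        self.score = score
        self.forward = [None] * level  # forward[i] = next node on level i

class SkipList:
    """Minimal skip list ordered from the largest score to the smallest."""
    MAX_LEVEL = 16

    def __init__(self):
        self._head = _Node(None, self.MAX_LEVEL)  # sentinel, holds no score

    def _random_level(self):
        level = 1
        while level < self.MAX_LEVEL and random.random() < 0.5:
            level += 1
        return level

    def insert(self, score):
        # Walk down from the top level, remembering the last node
        # visited on each level (expected O(log n) steps overall).
        update = [self._head] * self.MAX_LEVEL
        node = self._head
        for i in range(self.MAX_LEVEL - 1, -1, -1):
            while node.forward[i] is not None and node.forward[i].score > score:
                node = node.forward[i]
            update[i] = node
        new = _Node(score, self._random_level())
        for i in range(len(new.forward)):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def pop_top(self):
        # The largest score is the first node on level 0. It is also the
        # first node on every level it belongs to, so unlinking it only
        # touches the head's pointers.
        first = self._head.forward[0]
        if first is None:
            raise IndexError("pop from empty skip list")
        for i in range(len(first.forward)):
            self._head.forward[i] = first.forward[i]
        return first.score
```

Removing the top costs time proportional to that node's level, which is constant on average with the coin-flip levels used here.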
If you need only the top k items and you never need to look at the others, you can use a simple linked list or array storing only the current top k items, plus a number (the worst score of the elements in the list).
In the Add() operation you simply compare the item with the worst value in the list and, if it is better, you swap the current worst with the added item. This takes O(k) time in the worst case for insertion, because you need to find the element that currently has the worst score. The average case, however, is O(1), since, as you add better elements to the list, the probability of having to do a swap tends to 0 (that is, you're not actually adding any items).
So if you generate elements at random, your performance is likely to be very good. Even if you generate ordered items (worst case), it might be fast enough for your value of k.
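Something along these lines, as a sketch only. It assumes plain numeric scores stored in a Python list, and `TopKList` with its `add` method is a hypothetical name, not an existing class.

```python
class TopKList:
    """Keeps only the k best scores seen so far, plus the worst score among them."""

    def __init__(self, k):
        if k <= 0:
            raise ValueError("k must be positive")
        self.k = k
        self.items = []
        self.worst = None  # worst score currently stored

    def add(self, score):
        """Offer a score; return True if it was kept."""
        if len(self.items) < self.k:
            self.items.append(score)
            if self.worst is None or score < self.worst:
                self.worst = score
            return True
        if score <= self.worst:
            # Common case once the list holds good items: O(1) rejection.
            return False
        # Rare case: find the current worst and overwrite it, O(k).
        idx = self.items.index(self.worst)
        self.items[idx] = score
        self.worst = min(self.items)  # recompute the new worst, O(k)
        return True
```

Feeding it 5, 1, 9, 3, 7, 2, 8 with k = 3 leaves 7, 8 and 9 in `items` (in no particular order) and 7 as `worst`.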
The JDK has a built-in pqueue class (java.util.PriorityQueue) which is based on a heap algorithm.
Sorry, I only just saw the bit about heaps not fitting your needs. Can you explain why? You can write a custom comparator (or make your items comparable) and the PriorityQueue will order your items appropriately.
A balanced tree would always guarantee a logarithmic worst case. Although linear time is usually regarded as feasible, there is still a tremendous difference between logarithmic and linear:
for a billion elements, the difference is between 1 billion operations and a few dozen. If each operation takes 1 millisecond, that means going from about 11 days to less than a second.
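The arithmetic behind those figures can be checked quickly. This assumes a base-2 logarithm with no constant factor; a real balanced tree may be somewhat taller than log2(n), which is why "a few dozen" is the safer wording.

```python
import math

n = 1_000_000_000   # number of elements
ms_per_op = 1       # assumed cost of one operation, in milliseconds

linear_ops = n
log_ops = math.ceil(math.log2(n))  # 30 steps for a perfectly balanced binary tree

linear_days = linear_ops * ms_per_op / 1000 / 60 / 60 / 24  # about 11.57 days
log_seconds = log_ops * ms_per_op / 1000                    # 0.03 seconds

print(f"linear: {linear_days:.2f} days, logarithmic: {log_seconds:.2f} seconds")
```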
Every node has at most two children.
The heap tree is complete and left-adjusted. Complete means that if the heap has height H, every leaf node is either at level H or H-1. All the levels are left-adjusted, which means that no right sub-tree has a height greater than its left sibling. So, if a leaf is at the same height as an internal node, the leaf can’t be on the left of that node.
Every node holds the highest priority in the subtree rooted at that node.
Binary trees are the most common kind, but a heap can also be built on d-ary trees: we can use any branching factor greater than 2 and keep the same array representation for the heap.
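As an illustration of the array representation, here is a small sketch of the index arithmetic for a d-ary max-heap stored in a 0-based list. The helper names `parent`, `children` and `is_max_heap` are made up for this example.

```python
def parent(i, d):
    """Index of the parent of node i in a d-ary heap stored in a 0-based array."""
    return (i - 1) // d

def children(i, d, n):
    """Indices of the children of node i, for a heap holding n elements."""
    first = d * i + 1
    return range(first, min(first + d, n))

def is_max_heap(a, d=2):
    """Check that every node has priority >= each of its children."""
    return all(a[parent(i, d)] >= a[i] for i in range(1, len(a)))
```

Because the elements fill the array level by level with no gaps, the "complete and left-adjusted" shape comes for free, and only the priority property needs checking. With d = 2 these formulas reduce to the usual binary heap ones: children at 2i + 1 and 2i + 2.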
But the improvement we get with trees comes with a price. First, as with any data structure that uses pointers (lists, graphs, trees, and so on) we have a memory overhead in comparison to arrays. While with the latter we just need to reserve space for the data (plus maybe, depending on the implementation details, some constant space for pointers and the node structure itself), every tree node requires extra space for the pointers to its children and possibly to its parent.