Efficient implementation of binary heaps
I'm looking for information on how to implement binary heaps efficiently. I feel like there should be a nice article somewhere about implementing heaps efficiently, but I haven't found one. In fact I've been unable to find any resources on the matter of efficient implementation beyond the basics like storing the heap in an array. I'm looking for techniques for making a fast binary heap beyond the ones I describe below.
I've already written a C++ implementation that is faster than Microsoft Visual C++'s and GCC's std::priority_queue or using std::make_heap, std::push_heap and std::pop_heap. The following are the techniques I've already got covered in my implementation. I only came up with the last 2 myself, though I doubt that those are new ideas:
(Edit: added section on memory optimization)
Instead, we can leave a hole in the heap where the top element was. Then we move that hole down the heap by iteratively moving the larger child up. This requires only 1 comparison per level that we go past. In this way the hole will become a leaf. At this point we can move the right-most bottom leaf into the position of the hole and move that value up until the heap property is restored. Since the value we moved was a leaf we don't expect it to move very far up the tree. So we should expect a little more than log n comparisons, which is better than before.
&(p[i]) == static_cast<char*>(p) + sizeof(T) * i
and the compiler has to perform this computation to get p[i]. sizeof(T) is a compile-time constant, and the multiplication can be done more efficiently if sizeof(T) is a power of two. My implementation got faster by adding 8 padding bytes to increase sizeof(T) from 24 to 32. Reduced efficiency of the cache probably means that this is not a win for sufficiently large data sets.
&(p[i]) == static_cast<char*>(p) + sizeof(T) * i == static_cast<char*>(p) + j
Then the left-child and right-child formulas for j-values become respectively 2*j and 2*j + sizeof(T). The parent formula is a little more tricky, and I haven't found a way to do it other than converting the j-value to an i-value and back like so:
parentOnJ(j) = parent(j/sizeof(T))*sizeof(T) == (j/(2*sizeof(T))*sizeof(T)
If sizeof(T) is a power of 2 then this will compile to 2 shifts. That is 1 operation more than the usual parent using indexes i. However we then save 1 operation on lookup. So the net effect is that finding the parent takes the same amount of time this way, while lookup of left-child and right-child becomes faster.
The answers of TokenMacGuy and templatetypedef point out memory based optimizations that reduce cache misses. For very large data sets or infrequently used priority queues, parts of the queue can be swapped out to disk by the OS. In that case it is worth it to add a lot of overhead to make optimal use of the cache because swapping in from disk is very slow. My data easily fits in memory and is in continuous use, so no part of the queue will likely be swapped to disk. I suspect that this is the case for most uses of priority queues.
There are other priority queues that are designed to make better use of the CPU cache. For example a 4-heap should have fewer cache misses and the amount of extra overhead is not that much. LaMarca and Ladner report in 1996 that they get 75% performance improvement from going to aligned 4-heaps. However, Hendriks reports in 2010 that:
The improvements to the implicit heap suggested by LaMarca and Ladner [17] to improve data locality and reduce cache misses were also tested. We implemented a four-way heap, that indeed shows a slightly better consistency than the two-way heap for very skewed input data, but only for very large queue sizes. Very large queue sizes are开发者_JAVA技巧 better handled by the hierarchical heap.
An interesting paper/article on this topic considers the behavior of caching/paging on the overall layout of the heap; The idea being that it's vastly more costly to pay for a cache miss or page in than nearly any other part of a datastructure's implementation. The paper discusses a heap layout that addresses this.
You're Doing It Wrong by Poul-Henning Kamp
As an elaboration on @TokenMacGuy's post, you might want to look into cache-oblivious data structures. The idea is to build data structures that, for arbitrary caching systems, minimize the number of cache misses. They're tricky, but they actually might be useful from your perspective since they perform well even when dealing with multi-layer cache systems (for example, registers / L1 / L2 / VM).
There's actually a paper detailing an optimal cache-oblivious priority queue that might be of interest. This data structure would have all sorts of advantages in terms of speed, since it would try to minimize the number of cache misses at every level.
On the first point: even having a "spare spot" for your array based implementation isn't a waste. Many operations need a temporary element anyway. Rather than initializing a new element each time, having a dedicated element at index [0] is handy.
精彩评论