
State-of-the-art data structures [closed]


What can you tell me about modern data structures? We all know the classic ones, like trees, tries, stacks, lists, B-trees and so on (I think Cormen's book is a pretty good list of the "classic ones"). But what about recent research? I can name at least two examples: finger trees and Judy arrays. I would like to know more.


That really depends on your definition of "recent." CLRS contains a great number of data structures, but by its very nature can't have covered all of the great ideas and techniques that have been developed over the years. Here's a sampler of research areas and some of the results that have come out of them:

Binary Search Trees

On the "classical" front, the recently-developed WAVL tree is a simple balanced tree structure that in the insert-only case behaves like an AVL tree (tightly balanced) and when inserts and deletes are intermixed never does worse than a red/black tree. There's also the zip tree, which is a way of encoding a probabilistic skiplist as a binary search tree and has a very simple set of rules.

Splay trees were the first binary search trees to achieve excellent performance by rearranging nodes during lookups. They have many excellent theoretical properties. Although they aren't "modern" in the sense that they were first developed in the 1980s, they kicked off a new area of research: the quest to see if there's a single binary search tree that is, in some sense, the "best" possible binary search tree. Such a tree would ideally give O(log n) lookups, but also have other desirable properties like "it takes less time to look up an item that's been searched for recently" and "looking up items near the most-recently-queried item should be fast." In the 2000s we saw the development of the tango tree and the multisplay tree, which have the unusual guarantee that the cost of performing a series of operations on those trees is never more than a factor of O(log log n) slower than performing those operations on any binary search tree.

In the early 2010s researchers found a connection between binary search tree algorithms and sets of points in the 2D plane. This "geometry of binary search trees" has been used to find new lower bounds on BST structures and has produced a new candidate BST that might, for some definition of the term, be "as good as possible."

Other balanced trees and tree-like structures to check out include the skiplist, the treap, and the scapegoat tree, developed in the 1990s.

Sketching and Sampling

Many recent data structures are designed to work on data streams, where items are only visible one element at a time. The goal is to compute statistics about the data stream while using far less space than would be needed to store all the elements. (Think of Google tracking popular search queries or Twitter looking for trending tags - there are simply too many things coming in at once to keep track of everything posted in real time.) The count-min sketch makes it possible to estimate how frequently various items have been seen in the data stream, and can be used to find items that appear frequently in the stream. The HyperLogLog estimator uses a tiny amount of space and can estimate how many different items have been seen. And estimators like the AMS estimator can be used to approximate how skewed the data distribution is.
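
As a concrete example, here's a minimal count-min sketch in Python; the class name, hashing scheme, and default sizes are illustrative choices, not canonical ones:

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counts in width * depth space.
    Estimates never undercount; collisions can only overcount."""
    def __init__(self, width=2048, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # One (roughly independent) hash per row, derived from SHA-256.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Every row overestimates, so the minimum is the best guess.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))
```

Making the rows wider shrinks the size of the overestimate; adding more rows shrinks the probability of getting a bad estimate.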

Memory-Optimized Structures

There's been much research into cache-oblivious data structures. The idea behind a cache-oblivious structure is that memory in a computer typically involves multiple layers of caches, and tuning data structures based on the sizes of those caches has been known, for a while, to provide performance boosts (see, for example, the B-tree). In the cache-oblivious model, the data structure needs to take optimal advantage of caching, but without knowing what the cache size is. Surprisingly, this is possible, and we have binary search trees and priority queues that meet this goal.

Hash Tables

Cuckoo hashing was developed in the early 2000s and provides a way to build hash tables where lookups take worst-case time O(1). They're fast in practice, and their analysis led to research into properties of random hypergraphs.
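
Here's a bare-bones sketch of the two-table scheme; the hash construction, initial size, and displacement limit are arbitrary choices for illustration:

```python
import hashlib

class CuckooTable:
    """Every key lives in one of exactly two slots (one per table),
    so a lookup probes at most two locations: worst-case O(1)."""
    def __init__(self, size=16):
        self.size = size
        self.tables = [[None] * size, [None] * size]

    def _h(self, i, key):
        digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.size

    def contains(self, key):
        return any(self.tables[i][self._h(i, key)] == key for i in (0, 1))

    def insert(self, key):
        if self.contains(key):
            return
        cur, i = key, 0
        for _ in range(self.size):      # bound the displacement chain
            slot = self._h(i, cur)
            if self.tables[i][slot] is None:
                self.tables[i][slot] = cur
                return
            # Evict the occupant (the "cuckoo" step) and try to
            # rehome it in the other table.
            cur, self.tables[i][slot] = self.tables[i][slot], cur
            i = 1 - i
        self._grow_and_rehash(cur)      # displacement cycle: rebuild

    def _grow_and_rehash(self, pending):
        keys = [k for t in self.tables for k in t if k is not None]
        self.size *= 2                  # _h depends on size, so every
        self.tables = [[None] * self.size, [None] * self.size]
        for k in keys + [pending]:      # key lands in a fresh slot
            self.insert(k)
```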

Linear probing, which has been known since the 1960s to perform well assuming random hash functions, was formally shown to require 5-independent hash functions to be fast in a theoretical sense.

In the 2010s some Google engineers developed the "Swiss Table," an extremely fast hash table based on linear probing which harnesses SIMD instructions for an extra performance boost.

Bloom Filter Replacements

The Bloom filter has been around since the 1970s and provides an excellent way to store a set in a small amount of space, provided that false positives are okay. In the 2000s, 2010s, and 2020s, many practical data structures were developed that improve on it in practice. The d-left counting Bloom filter supports insertions and deletions in about half the space of a counting Bloom filter. The more recent cuckoo filter uses less space than a traditional Bloom filter (for reasonable error rates) and supports insertions and deletions. And the XOR filter uses less space than a Bloom filter in all cases, assuming the items to store are known up front.
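
For reference, the baseline that all of these structures compete against takes only a few lines; the bit count and hash count below are illustrative defaults:

```python
import hashlib

class BloomFilter:
    """Classic Bloom filter: no deletions, no false negatives,
    false-positive rate tuned via num_bits and num_hashes."""
    def __init__(self, num_bits=8192, num_hashes=6):
        self.m, self.k = num_bits, num_hashes
        self.bits = bytearray(num_bits)    # one byte per bit, for clarity

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[p] for p in self._positions(item))
```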

XOR filters are related to Bloomier filters, a way of storing an approximation of a dictionary, analogous to how a Bloom filter stores an approximation of a set. They were used in a paper in the late 2010s to compress machine learning models with low error rates.

Several data structures purely of theoretical interest have also been developed that show it's possible to achieve the theoretical optimum number of bits in a Bloom filter replacement. Porat's matrix filters are probably the simplest, while Pagh, Pagh, and Rao's approach was (IIRC) the first to do so.

Succinct and Compact Data Structures

How few bits are needed to encode a data structure? That question has given rise to the field of succinct and compact data structures. It's possible to encode complex structures like trees using close to the information-theoretic minimum number of bits. Check out wavelet trees as an example of how to do this. There's also been work on building data structures that use data compression to reduce the size of the structure while also keeping the structure useful. The FM-index is a good example of this.
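
To give a taste of the idea, here's a sketch of a pointer-based wavelet tree supporting rank queries; a genuinely succinct implementation would replace the Python lists with bit vectors supporting O(1) rank:

```python
class WaveletTree:
    """Stores a sequence of integers; rank(c, i) counts occurrences
    of symbol c among the first i elements (c must lie in [lo, hi])."""
    def __init__(self, seq, lo=None, hi=None):
        self.lo = min(seq) if lo is None else lo
        self.hi = max(seq) if hi is None else hi
        self.left = self.right = None
        if self.lo == self.hi or not seq:
            return                      # leaf (or empty subtree)
        mid = (self.lo + self.hi) // 2
        # Prefix counts of elements routed to the right child.
        self.ones = [0]
        for c in seq:
            self.ones.append(self.ones[-1] + (c > mid))
        self.left = WaveletTree([c for c in seq if c <= mid], self.lo, mid)
        self.right = WaveletTree([c for c in seq if c > mid], mid + 1, self.hi)

    def rank(self, c, i):
        if i == 0 or self.lo == self.hi:
            return i
        mid = (self.lo + self.hi) // 2
        if c <= mid:                    # c routes left; translate the index
            return self.left.rank(c, i - self.ones[i])
        return self.right.rank(c, self.ones[i])

wt = WaveletTree([1, 3, 1, 2, 3, 1])
print(wt.rank(1, 4))                    # 2: two 1s among the first four
```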

Persistent Data Structures

Persistent data structures are data structures where performing an edit keeps both versions available - the structure as it was before the edit and the structure after. This was first explored in the 1980s by Sarnak and Tarjan and led to improvements in 2D point location.
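
The classic trick behind many of these structures is path copying: an edit copies only the nodes on the path it touches and shares every other node with the old version. A minimal sketch for an unbalanced BST:

```python
class Node:
    __slots__ = ("key", "left", "right")
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def insert(root, key):
    """Return a new root; the old version remains valid and unchanged."""
    if root is None:
        return Node(key)
    if key < root.key:
        return Node(root.key, insert(root.left, key), root.right)
    if key > root.key:
        return Node(root.key, root.left, insert(root.right, key))
    return root                 # key already present: share the whole tree

v0 = None
v1 = insert(v0, 5)
v2 = insert(v1, 3)              # v1 is still a usable, unchanged version
```

Each edit allocates O(height) new nodes, so a balanced variant costs O(log n) extra space per version.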

Purely functional data structures are one subcase of this; they can be used in functional languages (or in imperative languages to guarantee persistence). Chris Okasaki's work in this field has led to the development of new trees and priority queues, for example.

Kinetic Data Structures

Kinetic data structures are data structures that store moving objects. They support queries like "given this set of moving particles, which pair will be the next to collide?" and have found applications in computer graphics. Search "kinetic priority queue" for a good introduction.

Concurrent Data Structures

Concurrent data structures are data structures that work well in a multithreaded environment. There are clever ways to build hash tables and priority queues that can handle multiple readers and multiple writers at once without locking the whole structure, many of which have been picked up by Java's standard libraries and others of which can be used in concurrency-heavy workflows.
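
One of the simplest techniques here is lock striping, which early versions of Java's ConcurrentHashMap used: split the table into segments, each guarded by its own lock, so threads working on different segments never contend. A rough Python sketch:

```python
import threading

class StripedMap:
    """A hash map split into independently locked segments."""
    def __init__(self, num_stripes=16):
        self._stripes = [(threading.Lock(), {})
                         for _ in range(num_stripes)]

    def _stripe(self, key):
        return self._stripes[hash(key) % len(self._stripes)]

    def put(self, key, value):
        lock, segment = self._stripe(key)
        with lock:                  # blocks only this stripe
            segment[key] = value

    def get(self, key, default=None):
        lock, segment = self._stripe(key)
        with lock:
            return segment.get(key, default)
```

Modern concurrent hash tables go much further (fine-grained per-bucket locking, lock-free reads), but striping alone already removes most of the contention of a single global lock.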

Geometric Data Structures

Data structures for working with points in 2D, 3D, and higher-dimensional space have applications in both computer graphics and data processing. There's a huge space to cover here; here are some highlights.

Data structures for the point location problem maintain a subdivision of the 2D plane into regions and support queries of the form "given a point, which region contains it?" There are many approaches for solving this problem. Some use persistent data structures. Others are based on refining a triangular mesh into a hierarchy of meshes. Others work by splitting the world into trapezoids.

The orthogonal range search problem asks for data structures that store a collection of points and answer queries of the form "what are all the points in this axis-aligned box?" The range tree and k-d tree were developed in the 1970s and are still useful in practice. More modern approaches, which continue to be developed in the 2020s, use a mixture of fractional cascading and integer data structures to improve these runtimes in a theoretical sense.
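
Here's a compact sketch of the k-d tree for the 2D case: build by median split on alternating axes, then answer box queries by pruning any half the box can't reach (the names and layout are mine):

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]

class KDNode:
    def __init__(self, point: Point, axis: int, left=None, right=None):
        self.point, self.axis = point, axis
        self.left, self.right = left, right

def build(points: List[Point], depth: int = 0) -> Optional[KDNode]:
    if not points:
        return None
    axis = depth % 2                    # alternate x and y splits
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2              # median split keeps it balanced
    return KDNode(points[mid], axis,
                  build(points[:mid], depth + 1),
                  build(points[mid + 1:], depth + 1))

def range_search(node: Optional[KDNode], lo: Point, hi: Point, out: list):
    """Collect every point inside the axis-aligned box [lo, hi]."""
    if node is None:
        return
    p, a = node.point, node.axis
    if all(lo[i] <= p[i] <= hi[i] for i in range(2)):
        out.append(p)
    if lo[a] <= p[a]:                   # box may reach the left half
        range_search(node.left, lo, hi, out)
    if p[a] <= hi[a]:                   # box may reach the right half
        range_search(node.right, lo, hi, out)

tree = build([(1, 2), (3, 4), (5, 1), (2, 2)])
hits = []
range_search(tree, (0, 0), (3, 3), hits)   # [(1, 2), (2, 2)] in some order
```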

String Data Structures

Data structures for string processing, such as the suffix tree, are extremely relevant in computational biology and web search. I don't think CLRS even mentions their existence. You should definitely look into them, though, since they are responsible for much of the new work in genomics. There have been some pretty cool recent developments in suffix array construction algorithms, with algorithms like SA-IS bridging the gap between theoretically and practically fast algorithms.
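
For a sense of what SA-IS is speeding up, here's the naive suffix array construction together with pattern search over it; this is fine for small inputs, while SA-IS builds the identical array in linear time. (The key= parameter of bisect needs Python 3.10+.)

```python
import bisect

def suffix_array(s: str):
    # Naive construction: sort suffix start positions by the suffix text.
    # Worst-case O(n^2 log n); SA-IS produces the same array in O(n).
    return sorted(range(len(s)), key=lambda i: s[i:])

def occurrences(s: str, sa, pattern: str):
    # Matches form a contiguous run of the sorted suffixes, so two
    # binary searches find every occurrence in O(|pattern| log n).
    prefix = lambda i: s[i:i + len(pattern)]
    lo = bisect.bisect_left(sa, pattern, key=prefix)
    hi = bisect.bisect_right(sa, pattern, key=prefix)
    return sorted(sa[lo:hi])            # starting indices of the matches

sa = suffix_array("banana")             # [5, 3, 1, 0, 4, 2]
print(occurrences("banana", sa, "ana")) # [1, 3]
```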

Integer Data Structures

Many researchers have put effort into building data structures that take advantage of the fact that modern machines can operate on many bits in parallel. Structures like the fusion tree, exponential tree, and y-fast trie exploit these properties to sort and search arrays of integers faster than the O(n log n) barrier imposed by the comparison model. The fusion tree and its descendants (exponential trees and the like) have shown that with word-level parallelism you can get some pretty impressive theoretical speedups, though these structures aren't super fast in practice.

Integer data structures have also been used to give an O(m + n)-time algorithm for the single source shortest paths problem, strictly better than Dijkstra's algorithm in a theoretical sense, assuming weights are integers. This algorithm, though, is basically only of theoretical interest because the constant factors used in implementing it are too high to be practical.

Priority Queues

The Fibonacci heap was the first priority queue to support enqueue, meld, and decrease-key in (amortized) time O(1) and with extract-min taking (amortized) time O(log n), improving the performance of Dijkstra's algorithm, Prim's algorithm, and others like the Stoer-Wagner min cut algorithm. Since then, many other priority queues have been developed to simplify the Fibonacci heap or meet these time bounds in the worst case. Quake heaps are a much simpler structure (conceptually), but don't run as quickly in practice. Strict Fibonacci heaps achieve the same time bounds as the Fibonacci heap, but in a worst-case sense.

The pairing heap, based on the splay tree, is much faster than Fibonacci heaps in practice. It was originally conjectured to meet the same time bounds as the Fibonacci heap, but this was shown not to be the case many years after they were developed. As of 2021, the actual runtime of the pairing heap is not yet known.
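
Part of the pairing heap's appeal is how little code it needs. Here's a minimal min-heap version with the standard two-pass combine (a real implementation would add parent pointers to support decrease-key):

```python
class PairingHeap:
    class _Node:
        __slots__ = ("key", "children")
        def __init__(self, key):
            self.key, self.children = key, []

    def __init__(self):
        self.root = None

    @staticmethod
    def _meld(a, b):
        # Hang the tree with the larger root under the other one.
        if a is None:
            return b
        if b is None:
            return a
        if b.key < a.key:
            a, b = b, a
        a.children.append(b)
        return a

    def push(self, key):
        self.root = self._meld(self.root, self._Node(key))

    def pop_min(self):
        if self.root is None:
            raise IndexError("pop from an empty heap")
        out, kids = self.root.key, self.root.children
        # Pass 1: meld adjacent pairs of children, left to right.
        paired = [self._meld(kids[i],
                             kids[i + 1] if i + 1 < len(kids) else None)
                  for i in range(0, len(kids), 2)]
        # Pass 2: fold the paired trees together, right to left.
        root = None
        for tree in reversed(paired):
            root = self._meld(root, tree)
        self.root = root
        return out
```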

Dynamic Graphs

Many classical problems on graphs ("given two nodes, is there a path between them?" "find a minimum spanning tree." "check if the graph is planar") have known fast algorithms. But what happens if the graph is allowed to change over time? Suddenly, these problems get much harder to solve efficiently without recomputing everything from scratch on each edit.

In the case where the underlying graph is a forest, Euler tour trees, s-t trees (sometimes called link/cut trees), and top trees make it possible to answer many interesting questions like connectivity with very low update time per edit.

For general graphs, Holm et al's layered forest structure maintains connectivity and MSTs in graphs as edges are added or deleted with fairly good amortized runtimes. Kapron et al's cutset structure uses randomization to perform updates with good worst-case efficiency and a high probability of success.

Range Minimum Queries

The range minimum query problem is to preprocess an array of values so that queries of the form "what's the smallest entry in this subarray?" can be answered quickly. In the 1980s a solution was found that uses linear preprocessing time and O(1) query time, and that solution has been simplified and refined over the years to one that works very fast in practice (search "Fischer-Heun structure.")
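
The standard building block in that line of work is the sparse table, which answers static range-minimum queries in O(1) after O(n log n) preprocessing; Fischer-Heun layers a block decomposition on top to cut preprocessing to O(n). A sketch:

```python
class SparseTable:
    """table[k][i] holds min(arr[i : i + 2**k]); any query range is
    covered by two (possibly overlapping) power-of-two windows."""
    def __init__(self, arr):
        n = len(arr)
        self.log = [0] * (n + 1)            # floor(log2(i)), precomputed
        for i in range(2, n + 1):
            self.log[i] = self.log[i // 2] + 1
        self.table = [list(arr)]
        k = 1
        while (1 << k) <= n:
            prev, half = self.table[-1], 1 << (k - 1)
            self.table.append([min(prev[i], prev[i + half])
                               for i in range(n - (1 << k) + 1)])
            k += 1

    def query(self, lo, hi):
        """Minimum of arr[lo:hi] (half-open range, lo < hi)."""
        k = self.log[hi - lo]
        return min(self.table[k][lo], self.table[k][hi - (1 << k)])

st = SparseTable([9, 3, 7, 1, 8, 12, 10])
print(st.query(1, 4))                       # 1, the min of [3, 7, 1]
```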

RMQ has a surprising number of applications, especially in conjunction with trees (where it can be used to quickly find lowest common ancestors) and in particular suffix trees and suffix arrays (where it powers many impressive algorithms).


Some of the relatively recent (as in the last 30 years) data structure innovations have been probabilistic ones, like skip lists. I find these particularly interesting, but I don't keep up on research. Reading recent ACM Transactions on Algorithms might help you find some interesting, cutting-edge research.

But most anything "new" is going to be highly specialized. Only once in a very long while is a new but fundamentally important algorithm or structure created (like lists, trees, etc.).


There are many hundreds of specialized data structures. http://en.wikipedia.org/wiki/List_of_data_structures is a good start.


Cuckoo hashing is a scheme for resolving hash collisions in a hash table. It was first described by Rasmus Pagh and Flemming Friche Rodler in 2001.

http://en.wikipedia.org/wiki/Cuckoo_hashing

Other fresh ones include cache-oblivious data structures.
