Fastest Set Operations In The West

2023-01-26 20:10

I haven't been able to find any satisfactory coverage of this topic all in one place, so I was wondering: What are the fastest set intersect, union, and disjoin algorithms?

Are there any interesting ones with limited domains?

Can anyone beat O(Z) where Z is the actual size of intersection?

If your approach relies on sorted sets, please note that, but don't consider it a disqualifying factor. It seems to me that there must be a veritable storehouse of subtle optimizations to be shared, and I don't want to miss any of them.

A few algorithms I know rely on bitwise operations beyond the vanilla, so you may assume the presence of SSE4 and access to intrinsics like popcount. Please note this assumption.

Of interest: An Implementation of B-Y Intersect

Update

We've got some really good partial answers, but I'm still hoping for some more complete attacks on the problem. I'm particularly interested in seeing a more fully articulated use of bloom filters in attacking the problem.

Update

I've done some preliminary work on combining bloom filters with a cuckoo hash table. It's looking almost obnoxiously promising, because they have very similar demands. I've gone ahead and accepted an answer, but I'm not really satisfied at the moment.

If you're willing to consider set-like structures then bloom filters have trivial union and intersect operations.
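As a rough illustration of why these operations are trivial, here is a minimal Java sketch. The filter size, the two hash functions, and every name in it (`BloomSketch`, `mightContain`, and so on) are assumptions made up for the example, not part of any library. Union is a bitwise OR and intersection is a bitwise AND over the underlying bit arrays, provided both filters use the same size and hashes.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch. BITS, the two hash functions, and all
// names are illustrative assumptions, not a production design.
class BloomSketch {
    static final int BITS = 1024;
    final BitSet bits = new BitSet(BITS);

    // Two cheap bit positions derived from hashCode(); a real filter
    // would use stronger, more independent hashes.
    private static int h1(Object x) { return Math.floorMod(x.hashCode(), BITS); }
    private static int h2(Object x) { return Math.floorMod(x.hashCode() * 0x9E3779B1 + 7, BITS); }

    void add(Object x) { bits.set(h1(x)); bits.set(h2(x)); }

    // May report false positives, never false negatives.
    boolean mightContain(Object x) { return bits.get(h1(x)) && bits.get(h2(x)); }

    // Union: bitwise OR. The result has exactly the bits of a filter
    // built from all elements of both sets.
    static BloomSketch union(BloomSketch a, BloomSketch b) {
        BloomSketch r = new BloomSketch();
        r.bits.or(a.bits);
        r.bits.or(b.bits);
        return r;
    }

    // Intersection: bitwise AND. Every common element still tests
    // positive, but the false-positive rate can be higher than that of
    // a filter built directly from the true intersection.
    static BloomSketch intersect(BloomSketch a, BloomSketch b) {
        BloomSketch r = new BloomSketch();
        r.bits.or(a.bits);
        r.bits.and(b.bits);
        return r;
    }
}
```

Both operations cost time proportional to the filter size rather than to the number of stored elements, which is the appeal; the price is that the result only answers approximate membership queries.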

For reasonably dense sets, interval lists can beat O(n) for the operations you specify, where n is the number of elements in the set.

An interval list is essentially a strictly increasing list of numbers, [a1, b1, a2, b2, ..., an, bn], where each pair ai, bi denotes the interval [ai, bi). The strictly increasing constraint ensures that every describable set has a unique representation. Representing a set as an ordered collection of intervals allows your set operations to deal with multiple consecutive elements on each iteration.
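To make that concrete, here is a small Java sketch of intersection over two interval lists stored as flat `int[]` arrays in the [a1, b1, a2, b2, ...] layout described above. The class and method names are hypothetical, and it assumes both inputs are well formed (even length, strictly increasing).

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: names are illustrative, inputs are assumed well formed.
class IntervalListSketch {
    // Intersects two interval lists of the form [a1, b1, a2, b2, ...],
    // where each pair denotes the half-open interval [ai, bi).
    static int[] intersect(int[] x, int[] y) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < x.length && j < y.length) {
            // Overlap of the two current intervals, if any.
            int lo = Math.max(x[i], y[j]);
            int hi = Math.min(x[i + 1], y[j + 1]);
            if (lo < hi) {
                out.add(lo);
                out.add(hi);
            }
            // Advance past whichever interval ends first.
            if (x[i + 1] < y[j + 1]) {
                i += 2;
            } else {
                j += 2;
            }
        }
        return out.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

For example, intersecting {1, 5, 8, 12} with {3, 9, 10, 20} yields {3, 5, 8, 9, 10, 12}. The loop runs once per interval rather than once per element, which is where the gain on dense sets comes from.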

If the sets are actually hash sets, and both use the same hash function and table size, then for an intersection we can skip every bucket that is populated in only one of the sets. That could narrow the search a bit.
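A sketch of that bucket-skipping idea in Java, assuming a hand-rolled chained hash set so the buckets are visible (`java.util.HashSet` does not expose them). `BucketSetSketch`, `TABLE_SIZE`, and the rest are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Toy chained hash set of ints. Both operands must share TABLE_SIZE
// and the same index function; that is the assumption being exploited.
class BucketSetSketch {
    static final int TABLE_SIZE = 16;
    final List<List<Integer>> buckets = new ArrayList<>();

    BucketSetSketch() {
        for (int i = 0; i < TABLE_SIZE; i++) buckets.add(new ArrayList<>());
    }

    static int index(int x) { return Math.floorMod(Integer.hashCode(x), TABLE_SIZE); }

    void add(int x) {
        List<Integer> b = buckets.get(index(x));
        if (!b.contains(x)) b.add(x);
    }

    boolean contains(int x) { return buckets.get(index(x)).contains(x); }

    static BucketSetSketch intersect(BucketSetSketch a, BucketSetSketch b) {
        BucketSetSketch r = new BucketSetSketch();
        for (int i = 0; i < TABLE_SIZE; i++) {
            List<Integer> ba = a.buckets.get(i);
            List<Integer> bb = b.buckets.get(i);
            // Equal elements always land in the same bucket index, so a
            // bucket that is empty on either side contributes nothing.
            if (ba.isEmpty() || bb.isEmpty()) continue;
            for (Integer x : ba) {
                if (bb.contains(x)) r.buckets.get(i).add(x);
            }
        }
        return r;
    }
}
```

Elements are only ever compared against the matching bucket of the other set, so no rehashing is needed during the intersection.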

The following paper presents algorithms for union, intersection and difference on ordered sets that beat O(Z) if the intersection is larger than the difference (Z > n/2):

Confluently Persistent Sets and Maps

There is no better bound than O(Z): if you think about the problem logically, each of the intersect, union, and disjoin algorithms must read every input element at least once, so Z reads are a must. Also, since the sets are not sorted by default, no further optimization could beat O(Z).

Abstractly, a set is something that supports an operation, "is X a member?". You can define that operation on the intersection A ∩ B in terms of it on A and B. An implementation would look something like:

interface Set { boolean isMember(Object x); }

class Intersection implements Set {
    private final Set a, b;
    public Intersection(Set a, Set b) { this.a = a; this.b = b; }

    // x is in the intersection exactly when it is in both operands.
    public boolean isMember(Object x) {
        return a.isMember(x) && b.isMember(x);
    }
}

A and B could be implemented using an explicit set type, like a HashSet. The cost of that operation on each is quite cheap, let's approximate it with O(1); so the cost on the intersection is just 2 × O(1). ;-)

Admittedly if you build a large hierarchy of intersections like this, checking for a member can be more expensive, up to O(n) for n sets in the hierarchy. A potential optimisation for this could be to check the depth of the hierarchy against a threshold, and materialise it into a HashSet if it exceeds it. This will reduce the member operation cost, and perhaps amortise the construction cost when many intersections are applied.
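Here is one way the threshold idea might look in Java. Everything in it is an assumption for illustration: the `Membership` interface is a hypothetical stand-in for the set interface above, extended with `depth()` and `materialise()`, and the threshold of 3 is arbitrary. Note that materialising needs some way to enumerate elements, so the leaves here wrap explicit `java.util.Set` instances.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical interface: membership test plus what is needed to
// collapse a deep hierarchy into an explicit set.
interface Membership<T> {
    boolean isMember(T x);
    int depth();            // levels of lazy intersection below this node
    Set<T> materialise();   // enumerate the members into an explicit set
}

class Explicit<T> implements Membership<T> {
    final Set<T> items;
    Explicit(Set<T> items) { this.items = items; }
    public boolean isMember(T x) { return items.contains(x); }
    public int depth() { return 0; }
    public Set<T> materialise() { return items; }
}

class LazyIntersection<T> implements Membership<T> {
    final Membership<T> a, b;
    LazyIntersection(Membership<T> a, Membership<T> b) { this.a = a; this.b = b; }
    public boolean isMember(T x) { return a.isMember(x) && b.isMember(x); }
    public int depth() { return 1 + Math.max(a.depth(), b.depth()); }
    public Set<T> materialise() {
        // Enumerate one side and filter by membership in the other.
        Set<T> out = new HashSet<>();
        for (T x : a.materialise()) {
            if (b.isMember(x)) out.add(x);
        }
        return out;
    }
}

class Intersections {
    static final int THRESHOLD = 3; // arbitrary cut-off, an assumption

    // Stays lazy while the hierarchy is shallow; collapses it into a
    // HashSet-backed Explicit once it grows deeper than THRESHOLD.
    static <T> Membership<T> intersect(Membership<T> a, Membership<T> b) {
        Membership<T> lazy = new LazyIntersection<>(a, b);
        return lazy.depth() > THRESHOLD ? new Explicit<>(lazy.materialise()) : lazy;
    }
}
```

With this, a long chain of `Intersections.intersect` calls never answers `isMember` through more than THRESHOLD nested levels; the cost moves into the occasional materialise step.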
