Fastest Set Operations In The West
I 开发者_Go百科haven't been able to find any satisfactory coverage of this topic all in one place, so I was wondering: What are the fastest set intersect, union, and disjoin algorithms?
Are there any interesting ones with limited domains? Can anyone beat O(Z) where Z is the actual size of intersection?If your approach relies on sorted sets, please note that, but don't consider it a disqualifying factor. It seems to me that there must be a veritable storehouse of subtle optimizations to be shared, and I don't want to miss any of them.
A few algorithms I know rely on bitwise operations beyond the vanilla, so you may assume the presence of SSE4 and access to intrinsics like popcount. Please note this assumption.
Of interest: An Implementation of B-Y Intersect
Update
We've got some really good partial answers, but I'm still hoping for some more complete attacks on the problem. I'm particularly interested in seeing a more fully articulated use of bloom filters in attacking the problem.Update
I've done some preliminary work on combining bloom filters with a cuckoo hash table. It's looking almost obnoxiously promising, because they have very similar demands. I've gone ahead and accepted an answer, but I'm not really satisfied at the moment.If you're willing to consider set-like structures then bloom filters have trivial union and intersect operations.
For reasonably dense sets, interval lists can beat O(n) for the operations you specify, where n is the number of elements in the set.
An interval list is essentially a strictly increasing list of numbers, [a1, b1, a2, b2, ..., an, bn], where each pair ai, bi denotes the interval [ai, bi). The strictly increasing constraint ensures that every describable set has a unique representation. Representing a set as an ordered collection of intervals allows your set operations to deal with multiple consecutive elements on each iteration.
If set is actually a hashed set and both sets have the same hash function and table size then we can skip all buckets that exist only in one set. That could narrow search a bit.
The following paper presents algorithms for union, intersection and difference on ordered sets that beat O(Z) if the intersection is larger than the difference (Z > n/2):
Confluently Persistent Sets and Maps
there is no optimal solution than O(Z) because if you think of the problem logically each of the intersect, union and disjoin algorithms must at least read all of the input elements once, so Z reads is a must. also since the set is not sorted by default, no further optimizations could beat O(Z)
Abstractly, a set is something that supports an operation, "is X a member?". You can define that operation on the intersection A n B
in terms of it on A
and B
. An implementation would look something like:
interface Set { bool isMember(Object X); };
class Intersection {
Set a, b;
public Intersection(Set A, Set B) { this.a = A; this.b = B; }
public isMember(Object X) {
return a.isMember(X) and b.isMember(Y);
}
}
A
and B
could be implemented using an explicit set type, like a HashSet. The cost of that operation on each is quite cheap, let's approximate it with O(1); so the cost on the intersection is just 2 O(n). ;-)
Admittedly if you build a large hierarchy of intersections like this, checking for a member can be more expensive, up to O(n) for n sets in the hierarchy. A potential optimisation for this could be to check the depth of the hierarchy against a threshold, and materialise it into a HashSet if it exceeds it. This will reduce the member operation cost, and perhaps amortise the construction cost when many intersections are applied.
精彩评论