The fastest way to do a collection subtraction

2022-12-22 01:53 问答作者：

I have two Sets. Set b is the subset of Set a. they're both very huge Sets. I want to subtract b from a , what's the best practice to do this common operation ? I've written to many codes like this , and I don't think it's efficient. what's your idea ?

pseudo code : (this is not Java API) .

for(int i = 0 ; i < a.size(); i++) {
          for (int j=0 ; j < b.size() ;j++) {
              // do comparison , if found equals ,remove from a
              break;
          }
 }

And I want to find an algorithm , not only applies to Sets, also works for Array.

EDIT: The Set here is not JAVA API , it's a data structure . so I don't care if Java API has a removeAll开发者_运维知识库() method , I want find a common solution for this problem , I've encounter a lot of problems like this when I use Javascript and Actionscript .

I don't think you will get it much faster, but your code will look simpler and will not become slower by a.removeAll(b);. removeAll() is part of the Java-API.

For efficiency analysis: Your given code example is O(n^2), that scales not very good, but also isn't the horriblest thing on earth (exponential complexity is the thing you don't want). As long as you don't know the internal organisation of the data in the Collection, you will not get a better performance. removeAll() is implemented by the class itself and knows about the internal organisation. So if the data is organized in a Hash, you may get better results, if the data is organized in an unsorted array, the complexity will be the same. A Set have to efficently lookup if a new item is already in the set, so I suspect some sort of Hash as internal representation, especially if the implementation is called HashSet. :-)

EDIT: The OP changed it's question to mention it is not only for Java. removeAll() is a Java-API, so this (or something similar) may not be available in other languages. As said before, if the collections are unsorted arrays with no other restrictions the two for-loops are already the fastest solution. But if the data is organized different you have faster options. If the two collections are sorted data (in my example comes the smallest element first), you can do the following (reducing the complexity to O(n)):

int bIndex = 0;
for(int i = 0 ; i < a.size(); i++) {
          while (a[i] < b[bIndex]) {bIndex++;}
          if (a[i] == b[bIndex]) {markForRemoval(a[i]);} // I mark this only for removal, as the actual removal would make your index incorrect
}

If the data is organized as a hash in both collections you also need only one for-loop, accessing directly the element in b. Other possible organizations of data are possible.

In the end, there's not much of a choice other than to one by one compare elements and remove the ones which are in both.

To do it another way, you'd have to do something fancy like give all set members a unique value index, and construct a huge array of booleans representing each set, and then you could do bit operations to subtract B from A. I have no idea if that would be faster, given the overhead of creating unique value indices and manipulating the very large bitmasks.

I know you don't care about a Java solution, but since other people have recommended removeAll(), I'd like to point out that it's still doing essentially the same thing under the covers. Check the source for HashSet.

If the sets are maintained such that the elements are available at any given time in sorted order, then you can perform a single linear pass over both sets and create the difference in O(n) time. Now, again, that's if you can get at the ordered lists of elements for free — which is to say that the maintenance (i.e., add-element and remove-element operations) of the sets pays the cost of keeping the elements available in sorted order.

Any sort of "removeAll" operation that relies on performing lookups is necessarily going to be worse than O(n).

(It occurs to me that the construction of the difference set — that is, the answer constructed from the linear pass over the two lists — could be O(n log n) if you're not extremely careful.)

Given that b is a subset of a I'm not sure why your pseudo-code has 2 loops. Mine would simply be:

foreach b in B
    remove b from A

In practice how the run time of this compares with the run time of yours depends on, among other things, how you have implemented the set as a data structure.

well, the correct idea was already pointed out: the set should be implemented using a hash. hashes ideally have O(1) access cost, so you can get O(min(m,n)) cost for the overall operation assuming you can determine which set is bigger (like maintaining a counter during insert/remove operations).

in actionscript 3, you'd use a Dictionary. just use elements as keys and values.

removing looks like this:

for each (var key:* in set2) {//a simple for-in loop will also do the trick, since keys and values are equal, but for-each-in loops perform faster
    delete set1[key];
}

in JavaScript, you will need to give the entries ids when inserting, so you can use those ids as keys in a map. simply map ids to original values.

removing looks like this:

for (var key in set2) {
    delete set1[key];
}

The operation as you are writing it is O(N^2), but if the sets are large, you might want to use a hash.

// A is some kind of array, O(1) iteration
// B is a hash containing elements to remove, O(1) contains(elt)
List<T> removeAll(List<T> A, Set<T> B) {
  List<T> result; // empty, could preallocate at |A|
  for (elt : A) { // for each 'elt' belonging to A, hence O(|A|)
    if (! B.contains(elt) ) { // O(1) thanks to hash
      C.add(elt) ; // ensure this is O(1) with preallocation or linked list
    }
  }
  return result;
}

This requires indexing the set B, so you need a hash function. In Java, you could use Set<T> Bh = new HashSet<T>(B); which is O(|B|) in time and memory. So overall we get O(|A|+|B|) in time and roughly O(2|A|+2|B|)) in memory. Sure beats the quadratic of removeAll, you will feel the difference (TM).

It is probably better to copy elements into a new array (as done in the pseudo code), since removing elements from A directly could lead to overhead if you keep elements in order (left shifting elements in A is costly).

You have seen the removeAll method in the Set interface?

Also check out this stack overflow question.

I believe you'll find java.util.HashSet.removeAll(Collection toRemove) to perform well. On the other hand, if you don't have sets but sorted collections, you might be able to do a lot better.

继续阅读：actionscript-3 javascript performance

The fastest way to do a collection subtraction

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？