
How to test if one set of (unique) integers belongs to another set, efficiently?

I'm writing a program where I need to test whether one set of unique integers A is a subset of another set of unique integers B. This operation might be performed several hundred times per second, so I'm looking for an efficient algorithm to do it.

For example, if A = [1 2 3] and B = [1 2 3 4], it is true, but if B = [1 2 4 5 6], it's false.

I'm not sure how efficient it is to just sort and compare, so I'm wondering if there are any more efficient algorithms.

One idea I came up with was to give each number n its corresponding n'th prime: that is, 1 = 2, 2 = 3, 3 = 5, 4 = 7, etc. Then I could calculate the product of A, and if that product is a factor of the corresponding product of B, we could say with certainty that A is a subset of B. For example, if A = [1 2 3] and B = [1 2 3 4], the primes are [2 3 5] and [2 3 5 7] and the products are 2*3*5=30 and 2*3*5*7=210. Since 210%30=0, A is a subset of B. I'm expecting the largest integer to be a couple of million at most, so I think it's doable.

Are there any more efficient algorithms?


The asymptotically fastest approach is to put the elements of B in a hash table and query each element of A against it, which is O(N) time. You cannot do asymptotically better, since it already takes that much time just to read the data.

Most set data structures already support expected and/or amortized O(1) query time. Some languages even support this operation directly. For example, in Python you could just do

A < B
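
(Note that in Python, A < B tests for a proper subset; A <= B, or A.issubset(B), also returns true when A equals B.) A minimal sketch with built-in sets, assuming the elements are already hashable:

# Built-in hash sets: expected O(len(A)) per query once the sets exist.
A = {1, 2, 3}
B = {1, 2, 3, 4}

print(A <= B)            # True:  every element of A is in B
print(A < B)             # True:  A is also a proper subset of B
print(A.issubset(B))     # True:  same as A <= B
print({1, 2, 5} <= B)    # False: 5 is not in B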

Of course the picture changes drastically depending on what you mean by "this operation is repeated". If you can do precalculations on the data as you add it to the set (and presumably you can), the minimal O(N) work can be subsumed into other operations, such as constructing the set. But we can't advise without knowing much more.

Assuming you have full control of the set data structure, your approach of keeping a running product (whenever you add an element, you do a single O(1) multiplication) is a very good idea IF there exists a divisibility test that is faster than O(N)... in fact your solution is really clever, because we can just do a single ALU division and hope we're within float tolerance. Do note, however, that this only allows you roughly a 20x speedup at most, I think, since 21! > 2^64 and the product overflows a 64-bit word after about twenty elements. There might be tricks to play with congruence modulo an integer, but I can't think of any. I have a slight hunch, though, that there is no divisibility test faster than O(#primes), though I'd like to be proved wrong!
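
As a rough, minimal sketch of the running-product idea (in Python, whose integers are arbitrary precision, so exact divisibility can be tested without float tolerance; the helper names nth_primes and fingerprint are made up for illustration):

# Prime-product fingerprints, as proposed in the question: map value n to the
# n'th prime, keep the product of a set's primes, and test subset by divisibility.
def nth_primes(count):
    # Naive prime generation by trial division; fine for a sketch.
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

PRIMES = nth_primes(10)            # must cover the largest value used

def fingerprint(values):
    product = 1
    for v in values:
        product *= PRIMES[v - 1]   # value n -> n'th prime (1 -> 2, 2 -> 3, ...)
    return product

A = [1, 2, 3]
print(fingerprint([1, 2, 3, 4]) % fingerprint(A) == 0)     # True:  A is a subset
print(fingerprint([1, 2, 4, 5, 6]) % fingerprint(A) == 0)  # False: 3 is missing

In fixed-width arithmetic the product overflows after roughly twenty elements (as noted above, 21! > 2^64), so this exact-arithmetic version trades the single ALU division for big-integer operations.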

If you are doing this repeatedly on duplicate sets, you may benefit from caching, depending on what exactly you are doing: give each set a unique ID (though since this makes updates hard, you may ironically wish to do something exactly like your scheme to make fingerprints, but mod max_int_size with collision detection). To manage memory, you can pin extremely expensive set comparisons (e.g. checking whether a giant set is part of itself) in the cache, while otherwise using a most-recently-used policy if you run into memory issues. The nice thing about this is that it synergizes with an element-by-element rejection test: sets without many overlapping elements are thrown out quickly, but sets that overlap heavily take a long time to check, and if you repeat those calculations, caching comes in handy.
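
If the sets are immutable between tests (frozensets, say), one cheap way to get such caching in Python is functools.lru_cache, which evicts the least recently used entries. A minimal sketch under that assumption, not the pinning scheme described above:

from functools import lru_cache

@lru_cache(maxsize=10_000)
def cached_is_subset(A, B):
    # A and B must be hashable (e.g. frozensets) to act as cache keys.
    return A <= B

A = frozenset([1, 2, 3])
B = frozenset([1, 2, 3, 4])
print(cached_is_subset(A, B))   # computed the first time
print(cached_is_subset(A, B))   # answered from the cache on repeats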


Let A and B be two sets, and suppose you want to check whether A is a subset of B. The first idea that pops into my mind is to sort both sets and then simply check whether every element of A is contained in B, as follows:

Let n_A and n_B be the cardinality of A and B, respectively. Let i_A = 1, i_B = 1. Then the following algorithm (that is O(n_A + n_B)) will solve the problem:

// A and B assumed to be sorted
i_A = 1;
i_B = 1;
n_A = size(A);
n_B = size(B);
while (i_A <= n_A) {
  while (A[i_A] > B[i_B]) {
    i_B++;
    if (i_B > n_B) return false;
  }
  if (A[i_A] != B[i_B]) return false;
  i_A++;
}
return true;
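
For reference, a direct transcription of the loop above into runnable Python (0-based indexing; is_subset_sorted is just an illustrative name):

def is_subset_sorted(A, B):
    # Two-pointer scan; A and B must be sorted lists of unique integers.
    i_B, n_B = 0, len(B)
    for a in A:
        # Advance through B until we reach a value >= a.
        while i_B < n_B and B[i_B] < a:
            i_B += 1
        if i_B == n_B or B[i_B] != a:
            return False
        i_B += 1
    return True

print(is_subset_sorted([1, 2, 3], [1, 2, 3, 4]))     # True
print(is_subset_sorted([1, 2, 3], [1, 2, 4, 5, 6]))  # False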

The same thing, but in a more functional, recursive way (some will find the previous easier to understand, others might find this one easier to understand):

// A and B assumed to be sorted
function subset(A, B)
  n_A = size(A)
  n_B = size(B)
  function subset0(i_A, i_B)
    if (i_A > n_A) return true
    else if (i_B > n_B) return false
    else if (A[i_A] <= B[i_B]) return (A[i_A] == B[i_B]) && subset0(i_A + 1, i_B + 1)
    else return subset0(i_A, i_B + 1)
  subset0(1, 1)

In this last example, notice that subset0 is tail recursive: if (A[i_A] == B[i_B]) is false, there is no recursive call; otherwise, if (A[i_A] == B[i_B]) is true, then there is no need to keep that information around, since the result of true && subset0(...) is exactly the same as subset0(...). So any smart compiler will be able to transform this into a loop, avoiding stack overflows or any performance hit caused by function calls.

This will certainly work, but we might be able to optimize it a lot in the average case if you can provide more information about your sets, such as the probability distribution of the values in the sets, or whether you expect the answer to be biased (i.e., more often true, or more often false), etc.

Also, have you already written any code to actually measure its performance? Or are you trying to pre-optimize?

You should start by writing the simplest and most straightforward solution that works and measure its performance. Only if it's not satisfactory should you start trying to optimize.


I'll present an O(m+n) time-per-test algorithm. But first, two notes regarding the problem statement:

Note 1 - Your edits say that set sizes may be a few thousand, and numbers may range up to a million or two. In the following, let m and n denote the sizes of sets A and B, and let R denote the largest number allowed in the sets.

Note 2 - The multiplication method you proposed is quite inefficient. Although it uses O(m+n) multiplications, it is not an O(m+n) method, because the lengths of the products grow beyond O(m) and O(n), so it would take more than O(m^2 + n^2) time. That is worse than the O(m ln(m) + n ln(n)) time required by sorting-based methods, which in turn is worse than the O(m+n) time of the following method.

For the presentation below, I suppose that sets A, B can completely change between tests, which you say can occur several hundred times per second. If there are partial changes, and you know which p elements change in A from one test to next, and which q change in B, then the method can be revised to run in O(p+q) time per test.

Step 0. (Performed one time only, at outset.) Clear an array F, containing R bits or bytes, as you prefer.

Step 1. (Initial step of per-test code.) For i from 0 to n-1, set F[B[i]], where B[i] denotes the i'th element of set B. This is O(n).

Step 2. For i from 0 to m-1, { test F[A[i]]. If it is clear, report that A is not a subset of B, and go to step 4; else continue }. This is O(m).

Step 3. Report that A is a subset of B.

Step 4. (Clear used bits) For i from 0 to n-1, clear F[B[i]]. This is O(n).

The initial step (clearing array F) is O(R) but steps 1-4 amount to O(m+n) time.
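
A minimal sketch of steps 0-4 in Python, using a bytearray for F (the bound R and the function name are illustrative; in practice F would live alongside the sets so that step 0 runs only once):

R = 2_000_000                 # largest value that may appear in a set
F = bytearray(R + 1)          # step 0: allocated and cleared once, up front

def is_subset(A, B):
    for b in B:               # step 1: mark every element of B
        F[b] = 1
    try:
        # steps 2-3: A is a subset iff every element of A is marked
        return all(F[a] for a in A)
    finally:
        for b in B:           # step 4: clear only the entries we set
            F[b] = 0

print(is_subset([1, 2, 3], [1, 2, 3, 4]))     # True
print(is_subset([1, 2, 3], [1, 2, 4, 5, 6]))  # False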


Given the limit on the size of the integers, if the collection of B sets is small and seldom changes, consider representing each B set as a bitset (a bit array indexed by integer set member). This doesn't require sorting, and the test for each element is very fast.

If the A members are sorted and tend to be clustered together, you can get another speedup by testing all the elements in one word of the bitset at a time.
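
A minimal sketch of the bitset idea, using Python's arbitrary-precision integers as the bit array; here the subset test is a single bitwise AND over the whole set rather than the per-word loop suggested above:

def to_bitset(values):
    # Bit i of the result is set iff i is in the set.
    bits = 0
    for v in values:
        bits |= 1 << v
    return bits

B_bits = to_bitset([1, 2, 3, 4])       # built once, reused across many tests

def is_subset(A_bits, B_bits):
    # A is a subset of B iff no bit of A falls outside B.
    return (A_bits & ~B_bits) == 0

print(is_subset(to_bitset([1, 2, 3]), B_bits))   # True
print(is_subset(to_bitset([1, 2, 5]), B_bits))   # False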
