
How does one remove duplicate elements in place in an array in O(n) in C or C++?

Is there any method to remove the duplicate elements in an array in place in C/C++ in O(n)? Suppose the elements are a[5] = {1,2,2,3,4}; then the resulting array should contain {1,2,3,4}. The solution can be achieved using two for loops, but that would be O(n^2), I believe.


If, and only if, the source array is sorted, this can be done in linear time:

std::unique(a, a + 5); //Returns a pointer to the new logical end of a.

Otherwise you'll have to sort first, which is (99.999% of the time) O(n log n).
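For the unsorted case, a minimal sketch of the sort-then-unique route might look like this. The function name sort_and_unique is made up for illustration. Note that std::sort is typically an introsort, so it may use O(log n) stack space rather than strictly O(1).

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper: sorts a[0..n) and squeezes out adjacent duplicates.
// Returns the count of unique elements; a[0..result) holds them in ascending order.
std::size_t sort_and_unique(int* a, std::size_t n) {
    std::sort(a, a + n);                   // O(n log n): makes equal elements adjacent
    int* new_end = std::unique(a, a + n);  // O(n): keeps the first of each equal run
    return static_cast<std::size_t>(new_end - a);
}
```

Elements past the returned length are left in a valid but unspecified state, so the caller should treat the return value as the new logical size.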


Best case is O(n log n). Perform a heap sort on the original array: O(n log n) in time, O(1)/in-place in space. Then run through the array sequentially with 2 indices (source & dest) to collapse out repetitions. This has the side effect of not preserving the original order, but since "remove duplicates" doesn't specify which duplicates to remove (first? second? last?), I'm hoping that you don't care that the order is lost.
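A rough sketch of that heap-sort-then-collapse idea, assuming an array of int. The name heap_sort_dedup is hypothetical, and the standard library's std::make_heap / std::sort_heap stand in for a hand-written heap sort (typical implementations work in place, though the standard does not spell out their space usage).

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper: heap-sorts a[0..n), then collapses repeats with two indices.
// Returns the number of unique elements left at the front of the array.
std::size_t heap_sort_dedup(int* a, std::size_t n) {
    if (n == 0) return 0;
    std::make_heap(a, a + n);  // build a max-heap, O(n)
    std::sort_heap(a, a + n);  // turn the heap into an ascending range, O(n log n)

    std::size_t dest = 0;      // index of the last unique element written
    for (std::size_t src = 1; src < n; ++src) {
        if (a[src] != a[dest]) {
            a[++dest] = a[src];  // first element of a new run: keep it
        }
    }
    return dest + 1;
}
```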

If you do want to preserve the original order, there's no way to do things in-place. But it's trivial if you make an array of pointers to elements in the original array, do all your work on the pointers, and use them to collapse the original array at the end.
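One possible reading of the pointer-array idea, sketched under the assumption that "first occurrence wins". The name dedup_keep_order is invented here, and this is only one way to do the bookkeeping. It takes O(n log n) time and O(n) extra memory for the pointers.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: removes duplicates from a[0..n), keeping the first
// occurrence of each value in its original relative order. Returns the new length.
std::size_t dedup_keep_order(int* a, std::size_t n) {
    std::vector<int*> ptrs(n);
    for (std::size_t i = 0; i < n; ++i) ptrs[i] = a + i;

    // Order by value, breaking ties by address so the earliest occurrence leads its run.
    std::sort(ptrs.begin(), ptrs.end(), [](const int* x, const int* y) {
        return *x != *y ? *x < *y : x < y;
    });

    // Keep only the leading pointer of each run of equal values.
    auto last = std::unique(ptrs.begin(), ptrs.end(),
                            [](const int* x, const int* y) { return *x == *y; });
    ptrs.erase(last, ptrs.end());

    // Sorting the survivors by address restores the original order.
    std::sort(ptrs.begin(), ptrs.end());

    // ptrs[k] never points before a + k, so this forward copy reads each
    // surviving element before its slot can be overwritten.
    for (std::size_t k = 0; k < ptrs.size(); ++k) a[k] = *ptrs[k];
    return ptrs.size();
}
```

For example, {3, 1, 3, 2, 1, 4} collapses to {3, 1, 2, 4} with a returned length of 4.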

Anyone claiming it can be done in O(n) time and in-place is simply wrong, modulo some arguments about what O(n) and in-place mean. One obvious pseudo-solution, if your elements are 32-bit integers, is to use a 4-gigabit bit-array (512 megabytes in size) initialized to all zeros, flipping a bit on when you see that number and skipping over it if the bit was already on. Of course then you're taking advantage of the fact that n is bounded by a constant, so technically everything is O(1) but with a horrible constant factor. However, I do mention this approach since, if n is bounded by a small constant - for instance if you have 16-bit integers - it's a very practical solution.
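To make the 16-bit case concrete, here is a small sketch assuming the elements are std::uint16_t. The name dedup_u16 is hypothetical. The bit array is 65536 bits, i.e. 8 KiB, regardless of how long the input is.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: removes duplicates from a[0..n) in one pass,
// preserving first-occurrence order. Returns the new length.
std::size_t dedup_u16(std::uint16_t* a, std::size_t n) {
    std::bitset<65536> seen;  // one bit per possible 16-bit value, all zero initially
    std::size_t dest = 0;
    for (std::size_t src = 0; src < n; ++src) {
        if (!seen.test(a[src])) {  // first time this value appears
            seen.set(a[src]);
            a[dest++] = a[src];
        }
    }
    return dest;
}
```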


Yes. Because access (insertion or lookup) on a hashtable is O(1), you can remove duplicates in O(N).

Pseudocode:

hashtable h = {}
numdups = 0
for (i = 0; i < input.length; i++) {
    if (!h.contains(input[i])) {
        input[i-numdups] = input[i]
        h.add(input[i])
    } else {
        numdups = numdups + 1
    }
}

This is O(N).

Some commenters have pointed out that whether a hashtable is O(1) depends on a number of things. But in the real world, with a good hash, you can expect constant-time performance. And it is possible to engineer a hash that is O(1) to satisfy the theoreticians.
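In C++ the pseudocode above could be sketched roughly as follows, using std::unordered_set as the hashtable. The name dedup_hashed is made up for illustration. Note that the set itself takes O(n) extra memory in the worst case, and the O(1) per-operation cost is an average-case figure, not a worst-case guarantee.

```cpp
#include <cstddef>
#include <unordered_set>

// Hypothetical helper: removes duplicates from a[0..n), keeping first
// occurrences in their original order. Returns the new length.
std::size_t dedup_hashed(int* a, std::size_t n) {
    std::unordered_set<int> seen;
    seen.reserve(n);  // avoid rehashing while we scan
    std::size_t dest = 0;
    for (std::size_t src = 0; src < n; ++src) {
        // insert() reports via .second whether the value was newly added.
        if (seen.insert(a[src]).second) {
            a[dest++] = a[src];
        }
    }
    return dest;
}
```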


I'm going to suggest a variation on Borealid's answer, but I'll point out up front that it's cheating. Basically, it only works assuming some severe constraints on the values in the array - e.g. that all keys are 32-bit integers.

Instead of a hash table, the idea is to use a bitvector. This is an O(1) memory requirement which should in theory keep Rahul happy (but won't). With 32-bit integers, the bitvector will require 512MB (i.e. 2**32 bits) - assuming 8-bit bytes, as some pedant may point out.

As Borealid should point out, this is a hashtable - just using a trivial hash function. This does guarantee that there won't be any collisions. The only way there could be a collision is by having the same value in the input array twice - but since the whole point is to ignore the second and later occurrences, this doesn't matter.

Pseudocode for completeness...

src = dest = input.begin ();
while (src != input.end ())
{
  if (!bitvector [*src])
  {
    bitvector [*src] = true;
    *dest = *src; dest++;
  }
  src++;
}
//  at this point, dest gives the new end of the array

Just to be really silly (but theoretically correct), I'll also point out that the space requirement is still O(1) even if the array holds 64-bit integers. The constant term is a bit big, I agree, and you may have issues with 64-bit CPUs that can't actually use the full 64 bits of an address, but...


Take your example. If the array elements are bounded integers, you can create a lookup bit array.

If you find an integer such as 3, turn the 3rd bit on. If you find an integer such as 5, turn the 5th bit on.

If the array contains elements other than integers, or the elements are not bounded, a hashtable would be a good choice, since the cost of a hashtable lookup is constant.


The canonical implementation of the unique() algorithm looks something like the following:

template<typename Fwd>
Fwd unique(Fwd first, Fwd last)
{
    if( first == last ) return first;
    Fwd result = first;
    while( ++first != last ) {
        if( !(*result == *first) )
            *(++result) = *first;
    }
    return ++result;
}

This algorithm takes a range of sorted elements. If the range is not sorted, sort it before invoking the algorithm. The algorithm will run in-place, and return an iterator pointing to one-past-the-last-element of the unique'd sequence.

If you can't sort the elements then you've cornered yourself, and you have no choice but to use an algorithm whose running time is worse than O(n).

This algorithm runs in O(n) runtime. That's big-oh of n, worst case in all cases, not amortized time. It uses O(1) space.


The example you have given is a sorted array. An O(n) in-place solution is possible only in that case (given your constant-space constraint).
