
how to remove duplicates from an array without sorting

I have an array which might contain duplicate objects. I wonder if it's possible to find and remove the duplicates in the array:

  • without sorting (strict requirement)
  • without using a temporary secondary array
  • possibly in O(N), with N the number of elements in the array

In my case the array is a Lua array, which contains tables:

t = {
  {a, 1},
  {a, 2},
  {b, 1},
  {b, 3},
  {a, 2}
}

In my case, t[5] is a duplicate of t[2], while t[1] is not.


To summarize, you have the following options:

  • time: O(n^2), no extra memory - for each element in the array look for an equal one linearly
  • time: O(n*log n), no extra memory - sort first, walk over the array linearly after
  • time: O(n), memory: O(n) - use a lookup table (edit: Lua tables can be keys in other tables, but they hash by identity rather than by content, so you would need to derive a key from each table's contents)

Pick one. There's no way to do what you want in O(n) time with no extra memory.


Can't be done in O(n) but ...

what you can do is

  • Iterate through the array
  • For each member, search forward for repetitions and remove them.

Worst-case complexity is O(n^2).
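
A minimal in-place sketch of this in Lua, assuming the elements are two-element pairs as in the question and that the equal() helper below is what defines a duplicate:

-- Shallow equality for the two-element pairs from the question.
local function equal(x, y)
  return x[1] == y[1] and x[2] == y[2]
end

-- Remove duplicates in place: for each element, scan forward and
-- delete any later element that compares equal. Worst case O(n^2).
local function removeDuplicates(t)
  local i = 1
  while i <= #t do
    local j = i + 1
    while j <= #t do
      if equal(t[i], t[j]) then
        table.remove(t, j)   -- shifts the tail down, so do not advance j
      else
        j = j + 1
      end
    end
    i = i + 1
  end
  return t
end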


Iterate the array, sticking every value in a hash table, checking whether it already exists first. If it does, remove it from the original array (or don't write it to the new one). Not very memory efficient, but O(n), since you only iterate the array once.
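
A sketch of that idea in Lua, compacting in place rather than writing to a new array. Lua tables hash by identity rather than by content, so the keyOf() helper (an assumption about the data: both fields stringify meaningfully, e.g. numbers or strings) derives a string key from each pair:

-- O(n) de-duplication using a lookup table of already-seen keys.
local function keyOf(x)
  return tostring(x[1]) .. "\0" .. tostring(x[2])
end

local function removeDuplicatesHash(t)
  local seen = {}
  local out = 1
  for i = 1, #t do
    local k = keyOf(t[i])
    if not seen[k] then
      seen[k] = true
      t[out] = t[i]        -- keep the first occurrence, compacting in place
      out = out + 1
    end
  end
  for i = #t, out, -1 do   -- trim the leftover tail
    t[i] = nil
  end
  return t
end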


Yes, depending on how you look at it.

You can override object insertion to prevent duplicate items from being inserted in the first place. This is O(n) per insertion, which can actually be the faster option for smaller arrays.
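
For example, a hypothetical insertUnique() wrapper in Lua (not part of the question's code) that does the linear duplicate check at insertion time:

-- Reject elements that are already present, so the array never
-- contains duplicates. O(n) comparisons per insert.
local function insertUnique(t, v)
  for i = 1, #t do
    if t[i][1] == v[1] and t[i][2] == v[2] then
      return false           -- already present, reject
    end
  end
  t[#t + 1] = v
  return true
end

-- usage
local t = {}
insertUnique(t, {"a", 1})
insertUnique(t, {"a", 2})
insertUnique(t, {"a", 2})    -- rejected
print(#t)                    --> 2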

If you provide sorted insertion and deletion, the duplicate check drops to O(log n) comparisons: you always keep the list sorted as you insert and delete, so finding an element is a binary search (the insertion itself still has to shift elements, though). The trade-off is that a lookup is now O(log n) rather than the O(1) of a hash table.
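
A sketch of sorted insertion in Lua, assuming a key() function that gives the elements a total order (for the question's pairs you would have to define such an ordering yourself; the string key below is only an illustration):

local function key(v)
  return tostring(v[1]) .. "\0" .. tostring(v[2])
end

-- Keep the array sorted so the duplicate check is a binary search.
local function insertSorted(t, v)
  local k = key(v)
  local lo, hi = 1, #t + 1
  while lo < hi do                 -- binary search: O(log n) comparisons
    local mid = math.floor((lo + hi) / 2)
    if key(t[mid]) < k then
      lo = mid + 1
    else
      hi = mid
    end
  end
  if t[lo] and key(t[lo]) == k then
    return false                   -- duplicate, reject
  end
  table.insert(t, lo, v)           -- the shift itself is still O(n)
  return true
end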

This can also be implemented efficiently using things like red-black trees and multitrees, at the cost of additional memory. Such implementations offer several benefits for certain problems. For example, we can get O(log n)-like behaviour even for very, very large tables with a small memory footprint by using nested trees. The top-level tree provides a sort of pared-down overview of the dataset, while subtrees provide more refined access when needed.

For example, to see this, suppose we have N elements. We could partition them into n1 groups, partition each of those groups further into n2 groups, and so on. The leaf groups then hold N/(n1*n2*...) elements each.

As you can see, the product of the n's becomes huge very quickly even for small n's. If N = 1 trillion elements and n1 = 1000, n2 = 1000, n3 = 1000, an access takes only about 1000 + 1000 + 1000 + 1000 = 4000 comparisons: a linear scan of up to 1000 entries at each of the three levels, plus the leaf group of N/(n1*n2*n3) = 1000 elements. Also, we only need about 10^9 internal nodes, so the per-node memory overhead applies to a small fraction of the element count.

Compare this to the roughly 500 billion comparisons required on average for a direct linear search: the multitree is over 100 million times faster. Against a binary tree it uses about 1000 times less memory but is about 100 times slower (of course there is some overhead for keeping the trees consistent, but even that can be reduced).

If we were to use a binary tree it would have a depth of about 40, but there would be about 1 trillion nodes, which is a huge amount of additional memory. By storing multiple values per node (and, in the case above, each node actually holds partial values and further trees) we can significantly reduce the memory footprint and still get impressive performance.
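
For what it's worth, the arithmetic above is easy to check; a small Lua snippet using the same numbers (N = 10^12, three levels of 1000 groups each):

-- Rough cost model: a linear scan of about 1000 entries at each of
-- three levels, plus the leaf group of N / (1000*1000*1000) = 1000.
local N = 1e12
local n1, n2, n3 = 1000, 1000, 1000

local multitreeAccess = n1 + n2 + n3 + N / (n1 * n2 * n3)   -- 4000
local linearAverage   = N / 2                               -- 5e11
local binaryTreeDepth = math.log(N) / math.log(2)           -- ~40

print(linearAverage / multitreeAccess)     --> ~1.25e8 (over 100 million times faster than linear search)
print(multitreeAccess / binaryTreeDepth)   --> ~100    (about 100 times slower than a binary tree)
print(n1 * n2 * n3)                        --> 1e9 internal nodes, vs ~1e12 nodes for a binary tree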

Essentially, linear access prevails at small sizes and trees prevail at large sizes, but trees consume more memory. By using multitrees we can combine the best of both worlds: linear access over small groups, and a larger number of elements per node (compared to binary trees).

Such trees are not trivial to create, but they essentially follow the same algorithmic patterns as standard binary trees, red-black trees, AVL trees, etc.

So if you are dealing with large datasets, performance and memory need not be a huge issue. Essentially, as you probably know, as one goes up the other goes down; multitrees sort of find the optimal middle ground (assuming you choose your node sizes correctly).


The size of a leaf group in the multitree is N/(n_1 * n_2 * ... * n_m), and the memory footprint is dominated by the number of internal nodes, roughly n_1 * n_2 * ... * n_m (which can generally be reduced by an order of magnitude, or possibly by a factor of n_m).

