Algorithm to find duplicates in multiple linked lists

What is the fastest method of finding duplicates across multiple (large) linked lists? I will attempt to illustrate the problem with arrays instead, just to make it a bit more readable. (I used the numbers 0-9 for simplicity instead of pointers.)

list1[] = {1,2,3,4,5,6,7,8,9,0};
list2[] = {0,2,3,4,5,6,7,8,9,1};
list3[] = {4,5,6,7,8,9,0,1,2,3};
list4[] = {8,2,5};
list5[] = {1,1,2,2,3,3,4,4,5,5};

If I now ask 'does the number 8 exist in list1-5?', I could sort each list, remove its duplicates, merge all the lists into a "superlist", and check whether the number of (new) duplicates equals the number of lists that I searched through. Assuming that I got the correct number of duplicates, I can assume that what I searched for (8) exists in all of the lists. If I instead searched for 1, I would only get four duplicates, so it is not found in all of the lists.

Is there a faster/smarter/better way to achieve the above without sorting and/or changing the lists in any way?

P.S.: This question is asked mostly out of pure curiosity and nothing else! :)


Just put each number into a hash table and store the number of occurrences for that item in the table. When you find another, just increment the counter. O(n) algorithm (n items across all the lists).

If you want to store the lists that each item occurs in, then you need a set representation stored under each item as well. You can use any set representation: bit vector, list, array, etc. This will tell you which lists the item is a member of. This does not change it from O(n); it just increases the work by a constant factor.
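As a rough illustration of the above, here is a minimal sketch in C. It assumes int elements (as in the question's arrays), at most 32 lists so that a uint32_t can serve as the bit-vector set, and a fixed-capacity open-addressing table that is never resized. The names Slot, Table, table_probe, table_add_list and in_all_lists are made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One slot of a small open-addressing table (hypothetical layout). */
typedef struct {
    int      used;    /* 1 once the slot holds a key */
    int      key;     /* the list element */
    unsigned count;   /* occurrences across all lists */
    uint32_t lists;   /* bit k set: key was seen in list k */
} Slot;

typedef struct {
    Slot  *slots;     /* zero-initialized, e.g. from calloc */
    size_t cap;       /* power of two, larger than the number of distinct keys */
} Table;

/* Return the slot holding key, or the empty slot where it would go. */
static Slot *table_probe(const Table *t, int key)
{
    size_t i = ((uint32_t)key * 2654435761u) & (t->cap - 1);
    while (t->slots[i].used && t->slots[i].key != key)
        i = (i + 1) & (t->cap - 1);          /* linear probing */
    return &t->slots[i];
}

/* Record every element of list number list_no (0..31). */
static void table_add_list(Table *t, const int *list, size_t n, int list_no)
{
    assert(list_no >= 0 && list_no < 32);
    for (size_t i = 0; i < n; i++) {
        Slot *s = table_probe(t, list[i]);
        if (!s->used) {                      /* first time this key is seen */
            s->used = 1;
            s->key = list[i];
        }
        s->count++;
        s->lists |= (uint32_t)1 << list_no;
    }
}

/* Nonzero if key was seen in each of the first list_count lists. */
static int in_all_lists(const Table *t, int key, int list_count)
{
    assert(list_count >= 1 && list_count <= 32);
    uint32_t all = list_count == 32 ? UINT32_MAX
                                    : ((uint32_t)1 << list_count) - 1;
    const Slot *s = table_probe(t, key);
    return s->used && (s->lists & all) == all;
}
```

A caller would create the table with something like Table t = { calloc(64, sizeof(Slot)), 64 }; and call table_add_list once per list. With the five arrays from the question, in_all_lists reports 2 and 5 as present everywhere, while 8 is missing from list5 and 1 from list4.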


Define an array hash and set all the location values to 0

define hash[MAX_SYMBOLS] = {0};
define new_list[LENGTH]
define list[LENGTH] and populate

Now, for each element in your list, use its value as an index into hash and increment that location. Each occurrence of a value increments hash at that location once, so a duplicated value i will have hash[i] > 1. (This assumes every value is non-negative and smaller than MAX_SYMBOLS.)

for i=0 to (n - 1)
  do
    increment hash[list[i]]
endfor

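If you want to remove the duplicates and create a new list, scan the original list again. For each element whose hash entry is still nonzero, copy the element into the new list and clear its hash entry so that later copies of the same value are skipped. The new list then holds each value once, in the order in which the values first appeared in the original list.

define j = 0
for i=0 to (n - 1)
  do
    if hash[list[i]] is not 0
      then
        new_list[j] := list[i]
        increment j
        hash[list[i]] := 0
    endif
endfor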
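
For reference, the counting pass and the copy pass could look roughly like this in C. This is only a sketch: it assumes all values lie in 0..MAX_SYMBOLS-1, the function name dedupe is invented here, and it clears each counter after the first copy so that a value is written to the output only once.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SYMBOLS 10   /* assumed: values lie in 0..MAX_SYMBOLS-1 */

/* Copy each distinct value of list[0..n-1] to out in first-seen order.
   Returns the number of values written; out needs room for up to n ints. */
static size_t dedupe(const int *list, size_t n, int *out)
{
    unsigned hash[MAX_SYMBOLS] = {0};
    size_t j = 0;

    for (size_t i = 0; i < n; i++) {          /* counting pass */
        assert(list[i] >= 0 && list[i] < MAX_SYMBOLS);
        hash[list[i]]++;
    }
    for (size_t i = 0; i < n; i++) {          /* copy pass */
        if (hash[list[i]] != 0) {
            out[j++] = list[i];
            hash[list[i]] = 0;                /* skip later copies of this value */
        }
    }
    return j;
}
```

Between the two passes, hash[v] > 1 still identifies the duplicated values, so a caller that needs the counts should read them before the copy pass clears them.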

Note that with negative numbers you cannot use the values directly as indices. To handle them, first find the largest magnitude among the negative numbers, then add that magnitude to every value before using it to index the hash array. The array must be large enough to cover the shifted range.

find the highest magnitude of negative value into min_neg

for i=0 to (n - 1)
  do
    increment hash[list[i] + min_neg]
endfor
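
The shift could be sketched in C as below. The offset is added to the value list[i], not to the position i. The function name count_shifted and the explicit table_len parameter are assumptions made for this example, and the caller is expected to pass in a zeroed table.

```c
#include <assert.h>
#include <stddef.h>

/* Count occurrences of values in [-min_neg, table_len - min_neg) by adding
   min_neg to every value before indexing. table must be zeroed by the caller. */
static void count_shifted(const int *list, size_t n, int min_neg,
                          unsigned *table, size_t table_len)
{
    for (size_t i = 0; i < n; i++) {
        long idx = (long)list[i] + min_neg;   /* shifted index, never negative */
        assert(idx >= 0 && (size_t)idx < table_len);
        table[idx]++;
    }
}
```

The count for a value v then lives at table[v + min_neg].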

Or, in an implementation, you can allocate one contiguous block of memory and define a pointer to the middle of it, so that you can index in both directions, including with negative indices. You need to make sure that there is enough memory both in front of and behind the pointer.

int *hash = calloc(SYMBOLS, sizeof (int));   /* zero-initialized counters */
int *hash_ptr = hash + SYMBOLS / 2;

Now you can use hash_ptr[-6], or any hash_ptr[i] with -SYMBOLS/2 <= i < SYMBOLS/2 (for even SYMBOLS).
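
A slightly fuller sketch of the centered-pointer idea, assuming an even SYMBOLS and using calloc so that the counters start at zero. The Centered struct and the two helper functions are hypothetical names for this example. The original base pointer is kept because that is the one that must be passed to free.

```c
#include <assert.h>
#include <stdlib.h>

#define SYMBOLS 16   /* assumed even; valid values are -SYMBOLS/2 .. SYMBOLS/2 - 1 */

typedef struct {
    int *base;   /* start of the allocation; pass this to free() */
    int *mid;    /* base + SYMBOLS/2; mid[v] is the counter for value v */
} Centered;

/* Allocate zeroed counters; base is NULL if the allocation failed. */
static Centered centered_new(void)
{
    Centered c;
    c.base = calloc(SYMBOLS, sizeof *c.base);
    c.mid  = c.base ? c.base + SYMBOLS / 2 : NULL;
    return c;
}

/* Count one occurrence of v, which may be negative. */
static void centered_count(Centered c, int v)
{
    assert(v >= -SYMBOLS / 2 && v < SYMBOLS / 2);
    c.mid[v]++;
}
```

Indexing mid with a negative value is fine here because the resulting address still lies inside the allocated block.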


The question is a bit vague, so the answer depends on what you want.

A hash table is the correct answer for asking general questions about duplicates, because it allows you to go through each list just once to build a table that will answer most questions; however, some questions will not require one.

Possible cases that seem to answer your question:

Do you just need to know if a certain value is present in each list? - Search the first list until the value is found. If it is not found, you're done: the value is not in every list. Otherwise, repeat for each successive list. If every list has been searched and the value was found in each one, it is present in all of them. This algorithm does not need to look at every value in each list, or even at every list, so it would be the quickest.

Do you need to know whether any duplicates exist at all? - If any value in a hash table keyed by number has a count greater than 1, there are duplicates... If that is all you need to know, you can quit right there.

Do you need the number of duplicates in each list, separately? - Multiply each value by the number of lists and add the index of the list being processed. Store that as the hash key and count duplicates. When all lists are processed, you have a table that can answer all kinds of questions. To check duplicates for a specific value, multiply it by the list count and examine the consecutive hash keys starting there, one per list. If there is an entry for each list, the number is present in each list. If all the counts over that range are greater than 1, the number is duplicated in each list.

Etc.
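
To make the first and third cases above a bit more concrete, here is a hedged sketch in C. It assumes a simple singly linked Node holding non-negative ints (the question uses pointers), and it uses a plain zeroed array indexed by the composite key as a stand-in for a real hash table. All names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical singly linked node holding an int. */
typedef struct Node {
    int          value;
    struct Node *next;
} Node;

/* Case 1: stop at the first list that lacks value. */
static int in_every_list(const Node *const *heads, size_t list_count, int value)
{
    for (size_t k = 0; k < list_count; k++) {
        const Node *p = heads[k];
        while (p != NULL && p->value != value)
            p = p->next;                 /* stops early on a match */
        if (p == NULL)
            return 0;                    /* missing here: skip the remaining lists */
    }
    return 1;
}

/* Case 3: composite key = value * list_count + list_no. */
static size_t key_of(int value, size_t list_no, size_t list_count)
{
    assert(value >= 0 && list_no < list_count);
    return (size_t)value * list_count + list_no;
}

/* Add one list to counts, which the caller zeroes and sizes to
   (largest value + 1) * list_count entries. */
static void count_list(unsigned *counts, const Node *head,
                       size_t list_no, size_t list_count)
{
    for (; head != NULL; head = head->next)
        counts[key_of(head->value, list_no, list_count)]++;
}

/* Smallest per-list count of value: 0 means some list lacks it,
   2 or more means it is duplicated in every list. */
static unsigned min_count(const unsigned *counts, int value, size_t list_count)
{
    unsigned m = counts[key_of(value, 0, list_count)];
    for (size_t k = 1; k < list_count; k++) {
        unsigned c = counts[key_of(value, k, list_count)];
        if (c < m)
            m = c;
    }
    return m;
}
```

in_every_list touches as little of the data as possible for a single query, while the counts table costs one full pass over every list but can then answer many different queries.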
