开发者

How do I remove duplicate strings from an array in C?

I have an array of strings in C and an integer indicating how many strings are in the array.

char *strarray[MAX];  
int strcount;

In this array, the highest index (where 10 is higher than 0) is the most recent item added and the lowest index is the most distant item added. The order of items within the array matters.

I need a quick way to check the array for duplicates, remove all but the highest index duplicate, and collapse the array.

For example:

strarray[0] = "Line 1"; 
strarray[1] = "Line 2"; 
strarray[2] = "Line 3"; 
strarray[3] = "Line 2"; 
strarray[4] = "Line 4";

would become:

strarray[0] = "Line 1"; 
strarray[1] = "Line 3"; 
strarray[2] = "Line 2"; 
strarray[3] = "Line 4";

Index 1 of the original array was removed and indexes 2, 3, and 4 slid downwards to fill the gap.

I have one idea of how to do it. It is untested and I am currently attempting to code it but just from my faint understanding, I am sure this is a horrendous algorithm.

The algorithm presented below would be ran every time a new string is added to the strarray.

For the interest of showing that I am trying, I will include my proposed algorithm below:

  1. Search entire strarray for match to str
  2. If no match, do nothing
  3. If match found, put str in strarray
  4. Now we have a strarray with a max of 1 duplicate entry
  5. Add highest index strarray string to lowest index of temporary string array
  6. Continue downwards into strarray and check each element
  7. If duplicate found, skip it开发者_Go百科
  8. If not, add it to the next highest index of the temporary string array
  9. Reverse temporary string array and copy to strarray

Once again, this is untested (I am currently implementing it now). I just hope someone out there will have a much better solution.

The order of items is important and the code must utilize the C language (not C++). The lowest index duplicates should be removed and the single highest index kept.

Thank you!


The typical efficient unique function is to:

  1. Sort the given array.
  2. Verify that consecutive runs of the same item are setup so that only one remains.

I believe you can use qsort in combination with strcmp to accomplish the first part; writing an efficient remove would be all on you though.

Unfortunately I don't have specific ideas here; this is kind of a grey area for me because I'm usually using C++, where this would be a simple:

std::vector<std::string> src;
std::sort(src.begin(), src.end());
src.remove(std::unique(src.begin(), src.end()), src.end);

I know you can't use C++, but the implementation should essentially be the same.

Because you need to save the original order, you can have something like:

typedef struct
{
    int originalPosition;
    char * string;
} tempUniqueEntry;

Do your first sort with respect to string, remove unique sets of elements on the sorted set, then resort with respect to originalPosition. This way you still get O(n lg n) performance, yet you don't lose the original order.

EDIT2: Simple C implementation example of std::unique:

tempUniqueEntry* unique ( tempUniqueEntry * first, tempUniqueEntry * last )
{
  tempUniqueEntry *result=first;
  while (++first != last)
  {
    if (strcmp(result->string,first->string))
      *(++result)=*first;
  }
  return ++result;
}


I don't quite understand your proposed algorithm (I don't understand what it means to add a string to an index in step 5), but what I would do is:

unsigned int i;
for (i = n; i > 0; i--)
{
    unsigned int j;

    if (strarray[i - 1] == NULL)
    {
        continue;
    }

    for (j = i - 1; j > 0; j--)
    {
        if (strcmp(strarray[i - 1], strarray[j - 1]) == 0)
        {
            strarray[j - 1] = NULL;
        }
    }
}

Then you just need to filter the null pointers out of your array (which I'll leave as an exercise).

A different approach would be to iterate backwards over the array and to insert each item into a (balanced) binary search tree as you go. If the item is already in the binary search tree, flag the array item (such as setting the array element to NULL) and move on. When you've processed the entire array, filter out the flagged elements as before. This would have slightly more overhead and would consume more space, but its running time would be O(n log n) instead of O(n^2).


Can you control the input as it is going into the array? If so, just do something like this:

int addToArray(const char * toadd, char * strarray[], int strcount)
{
    const int toaddlen = strlen(toadd);

    // Add new string to end.
    // Remember to add one for the \0 terminator.
    strarray[strcount] = malloc(sizeof(char) * (toaddlen + 1));
    strncpy(strarray[strcount], toadd, toaddlen + 1);

    // Search for a duplicate.
    // Note that we are cutting the new array short by one.
    for(int i = 0; i < strcount; ++i)
    {
        if (strncmp(strarray[i], toaddlen + 1) == 0)
        {
            // Found duplicate.
            // Remove it and compact.
            // Note use of new array size here.  
            free(strarray[i]);
            for(int k = i + 1; k < strcount + 1; ++k)
                strarray[i] = strarray[k];

            strarray[strcount] = null;
            return strcount;
        }
    }

    // No duplicate found.
    return (strcount + 1);
}

You can always use the above function looping over the elements of an existing array, building a new array without duplicates.

PS: If you are doing this type of operation a lot, you should move away from an array as your storage structure, and used a linked list instead. They are much more efficient for removing elements from a location other than the end.


Sort the array with an algorithm like qsort (man 3 qsort in the terminal to see how it should be used) and then use the function strcmp to compare the strings and find duplicates

If you want to mantain the original order you could use a O(N^2) complexity algorithm nesting two for, the first each time pick an element to compare to the other and the second for will be used to scan the rest of the array to find if the chosen element is a duplicate.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜