
Python: find a duplicate in a container efficiently

I have a container cont. If I want to find out if it has duplicates, I'll just check len(cont) == len(set(cont)).

Suppose I want to find a duplicate element if it exists (just any arbitrary duplicate element). Is there any neat and efficient way to write that?

[Python 3]


You can add the elements to a set one at a time, and as soon as you try to add an element that is already in the set, you've found a duplicate.
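
A minimal sketch of that idea (find_dup is just an illustrative name):

def find_dup(cont):
    seen = set()
    for item in cont:
        if item in seen:  # second time we see this value
            return item
        seen.add(item)
    return None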


Ok, my first answer has gotten quite a bit of flack, so I thought I'd try a few different ways of doing this and report the differences. Here's my code.

import sys
import itertools

def getFirstDup(c, toTest):

    # Original idea using list slicing => 5.014 s
    if toTest == '1':
        for i in range(len(c)):
            if c[i] in c[:i]:
                return c[i]

    # Using two sets => 4.305 s
    elif toTest == '2':
        s = set()
        for i in c:
            s2 = s.copy()
            s.add(i)
            if len(s) == len(s2):
                return i

    # Using dictionary LUT => 0.763 s
    elif toTest == '3':
        d = {}
        for i in c:
            if i in d:
                return i
            else:
                d[i] = 1

    # Using set operations => 0.772 s
    elif toTest == '4':
        s = set()
        for i in c:
            if i in s:
                return i
            else:
                s.add(i)

    # Sorting then walking => 5.130 s
    elif toTest == '5':
        c = sorted(c)
        for i in range(1, len(c)):
            if c[i] == c[i - 1]:
                return c[i]

    # Sorting then groupby-ing => 5.086 s
    else:
        c = sorted(c)
        for k, g in itertools.groupby(c):
            if len(list(g)) > 1:
                return k

    return None


c = list(range(10000000))
c[5000] = 0

for i in range(10):
    print(getFirstDup(c, sys.argv[1]))

Basically, I try this in six different ways, as listed in the source file. I used the Linux time command and collected the real (wall-clock) runtimes, running the commands like so:

time python ./test.py 1

with 1 selecting which algorithm I wanted to try. Each algorithm looks for the first duplicate in 10,000,000 integers, and runs ten times. There is one duplicate in the list, which is otherwise "mostly sorted"; I also tried reverse-sorted lists without noticing a proportional difference between the algorithms.

My original suggestion did poorly at 5.014 s. My understanding of icyrock.com's solution also did poorly at 4.305 s. Next I tried using a dictionary as a LUT, which gave the best runtime at 0.763 s. I tried the in operator on sets and got 0.772 s, nearly as good as the dictionary LUT. I tried sorting and walking the list, which gave a pitiful 5.130 s. Finally, I tried John Machin's itertools suggestion, which gave a poor 5.086 s.

In summary, a dictionary LUT seems to be the way to go, with set operations (which may use a LUT in their implementation) a close second.


Update: I tried razpeitia's suggestion, and apart from the fact that you need to know precisely what duplicate key you're looking for, the actual algorithm did the worst so far (66.366 s).


Update 2: I'm sure someone is going to say that this test is biased because the duplicate location is near one end of the list. Try running the code using a different location before downvoting and report your results!


It's not apparent what the point is of finding one arbitrary element that is a duplicate of 1 or more other elements of the collection ... do you want to remove it? Merge its attributes with those of its twins / triplets / ... / N-tuplets? In any case, that's an O(N) operation, which if repeated until no more duplicates are detected is an O(N ** 2) operation.

However you can get a bulk deal at the algorithm warehouse: sort the collection -- O(N*log(N)) -- and then use itertools.groupby to bunch up the duplicates and cruise through the bunches, ignoring the bunches of size 1 and doing whatever you want with the bunches of size > 1 -- all of that is only about O(N).
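
A minimal sketch of that sort-then-groupby approach (all_duplicates is an illustrative name; it collects every duplicated value rather than just one):

import itertools

def all_duplicates(cont):
    # sorting makes equal elements adjacent: O(N*log(N))
    dups = []
    for key, group in itertools.groupby(sorted(cont)):
        if sum(1 for _ in group) > 1:  # a bunch of size > 1 is a duplicate
            dups.append(key)
    return dups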


from collections import Counter

cont = [1, 2, 3]
c = Counter(cont)
x = 2  # whichever item you want to check

if c[x] == 0:
    print("Not in cont")
elif c[x] == 1:
    print("Unique")
else:
    print("Duplicate")


You have to scan all the elements, because the duplicate could be among the last ones you check, so you can't do better than worst-case O(N) time, just like linear search. Unlike linear search, though, a straightforward duplicate scan uses O(N) memory, because you need to track what you've seen so far.

If the array is sorted you can find duplicates in O(N) time without using any additional memory because duplicate pairs will be next to each other.
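
A minimal sketch of that adjacent-pair walk, assuming the input is an already-sorted sequence (first_adjacent_dup is an illustrative name):

def first_adjacent_dup(a):
    # in a sorted sequence duplicates must sit next to each other,
    # so one pass over index pairs needs no extra storage
    for i in range(1, len(a)):
        if a[i] == a[i - 1]:
            return a[i]
    return None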


If your container is a list, you can just pass the value you're looking for to its count() method and check the result:

>>> l = [1,1,2,3]
>>> l.count(1)
2
>>> 

A dictionary can't have duplicate keys, nor can a set. Outside of these, I'd need to know what kind of container it is. I guess the real point is to always make sure you haven't missed an obvious solution to the problem before you go coding a custom solution. I fall prey to this myself sometimes :)


According to http://wiki.python.org/moin/TimeComplexity most of the list operations are terribly inefficient (just confirmed that x in myList does seem to be O(N) in python3).

The method given by the original poster is efficient because it is O(N) in time and space (this is the "best" you can do without making additional assumptions about your list, since list operations like x in myList are O(N)).

There is one major optimization possible: build up the set iteratively. This returns quickly on certain kinds of lists, e.g. [0,1,1,2,3,4,5,...]. However, you are implicitly assuming something about the distribution of your lists (for example, do you optimize for this case, or for duplicates at the end, or both?). The good thing about this optimization is that it doesn't affect the asymptotic speed. Here's how I would code it elegantly:

def hasDuplicate(iterable):
    visited = set()
    for item in iterable:
        if item in visited:
            return True
        visited.add(item)
    return False

You could also return the first duplicate, but you couldn't signal "no duplicate found" by returning None, since the iterable might itself contain None; you'd have to raise an exception instead.
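
A minimal sketch of that exception-raising variant (NoDuplicateFound and first_duplicate are illustrative names, not from the original answer):

class NoDuplicateFound(Exception):
    pass

def first_duplicate(iterable):
    visited = set()
    for item in iterable:
        if item in visited:
            return item
        visited.add(item)
    # None could be a legitimate element, so signal absence explicitly
    raise NoDuplicateFound("iterable contains no duplicate")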

Sidenote: there are ways to improve space efficiency for a minor hit to time efficiency (e.g. Bloom filters).
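
To illustrate that sidenote, here is one possible sketch of the Bloom-filter idea: a first pass flags probable repeats in a fixed-size bit array (false positives are possible), and a second pass verifies only those candidates exactly, so the exact sets stay small. The function name and the m and k parameters are illustrative assumptions, and the input must be re-iterable:

def find_dup_bloom(seq, m=1 << 20, k=3):
    bits = bytearray(m // 8)  # m-bit Bloom filter

    def positions(item):
        # k bit positions derived from Python's built-in hash
        return [hash((salt, item)) % m for salt in range(k)]

    candidates = set()
    for item in seq:  # pass 1: every true duplicate lands in candidates
        pos = positions(item)
        if all(bits[p >> 3] & (1 << (p & 7)) for p in pos):
            candidates.add(item)  # probably seen before
        for p in pos:
            bits[p >> 3] |= 1 << (p & 7)

    seen = set()
    for item in seq:  # pass 2: exact check restricted to candidates
        if item in candidates:
            if item in seen:
                return item
            seen.add(item)
    return None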


Another suggestion, similar to jonesy's answer. At least in Python 3 (I haven't tested Python 2.7), when the duplicate is placed with c[-5000] = 0, this becomes faster than solutions 3 and 4 of the original answer. Otherwise it is only slightly faster than solutions 1 and 2...

elif toTest == '7':
    # c.count(i) rescans the whole list for every element, but it
    # returns immediately when the very first element is a duplicate
    for i in c:
        if c.count(i) > 1:
            return i


Try this:

def getFirstDup(cont):
    # cont[:i] copies and linearly scans the prefix on every pass,
    # so this approach is O(N**2) overall
    for i in range(len(cont)):
        if cont[i] in cont[:i]:
            return cont[i]
    return None