
Python: find a duplicate in a container efficiently

I have a container cont. If I want to find out if it has duplicates, I'll just check len(cont) == len(set(cont)).

Suppose I want to find a duplicate element if it exists (just any arbitrary duplicate element). Is there any neat and efficient way to write that?

[Python 3]


You can add the elements to a set one at a time, and as soon as you try to add an element that is already in the set, you've found a duplicate.
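
A minimal sketch of that idea (find_dup is just an illustrative name):

def find_dup(cont):
    seen = set()
    for item in cont:
        if item in seen:  # second time we see this value
            return item
        seen.add(item)
    return None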


Ok, my first answer has gotten quite a bit of flack, so I thought I'd try a few different ways of doing this and report the differences. Here's my code.

import sys
import itertools

def getFirstDup(c, toTest):

    # Original idea using list slicing => 5.014 s
    if toTest == '1':
        for i in range(len(c)):
            if c[i] in c[:i]:
                return c[i]

    # Using two sets => 4.305 s
    elif toTest == '2':
        s = set()
        for i in c:
            s2 = s.copy()
            s.add(i)
            if len(s) == len(s2):
                return i

    # Using dictionary LUT => 0.763 s
    elif toTest == '3':
        d = {}
        for i in c:
            if i in d:
                return i
            else:
                d[i] = 1

    # Using set operations => 0.772 s
    elif toTest == '4':
        s = set()
        for i in c:
            if i in s:
                return i
            else:
                s.add(i)

    # Sorting then walking => 5.130 s
    elif toTest == '5':
        c = sorted(c)
        for i in range(1, len(c)):
            if c[i] == c[i - 1]:
                return c[i]

    # Sorting then groupby-ing => 5.086 s
    else:
        c = sorted(c)
        for k, g in itertools.groupby(c):
            if len(list(g)) > 1:
                return k

    return None


c = list(range(10000000))
c[5000] = 0

for i in range(10):
    print(getFirstDup(c, sys.argv[1]))

Basically, I try this in six different ways, as listed in the source file. I used the Linux time command and collected the real (wall-clock) runtimes, running the commands like so:

time python ./test.py 1

with 1 selecting which algorithm I wanted to try. Each algorithm looks for the first duplicate in 10,000,000 integers, and runs ten times. There is one duplicate in the list, which is otherwise "mostly sorted"; I also tried reverse-sorted lists without noticing a proportional difference between the algorithms.

My original suggestion did poorly at 5.014 s. My understanding of icyrock.com's solution also did poorly at 4.305 s. Next I tried using a dictionary as a LUT, which gave the best runtime at 0.763 s. I tried the in operator on sets and got 0.772 s, nearly as good as the dictionary LUT. I tried sorting and walking the list, which gave a pitiful 5.130 s. Finally, I tried John Machin's itertools suggestion, which gave a poor 5.086 s.

In summary, a dictionary LUT seems to be the way to go, with set operations (which may use a LUT in their implementation) a close second.


Update: I tried razpeitia's suggestion, and apart from the fact that you need to know precisely what duplicate key you're looking for, the actual algorithm did the worst so far (66.366 s).


Update 2: I'm sure someone is going to say that this test is biased because the duplicate location is near one end of the list. Try running the code using a different location before downvoting and report your results!


It's not apparent what the point is of finding one arbitrary element that is a duplicate of 1 or more other elements of the collection ... do you want to remove it? Merge its attributes with those of its twins / triplets / ... / N-tuplets? In any case, that's an O(N) operation, which if repeated until no more duplicates are detected is an O(N ** 2) operation.

However you can get a bulk deal at the algorithm warehouse: sort the collection -- O(N*log(N)) -- and then use itertools.groupby to bunch up the duplicates and cruise through the bunches, ignoring the bunches of size 1 and doing whatever you want with the bunches of size > 1 -- all of that is only about O(N).
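
A minimal sketch of that sort-then-groupby approach (all_duplicates is an illustrative name; it collects every duplicated value rather than just one):

import itertools

def all_duplicates(cont):
    # sorting makes equal elements adjacent: O(N*log(N))
    dups = []
    for key, group in itertools.groupby(sorted(cont)):
        if sum(1 for _ in group) > 1:  # a bunch of size > 1 is a duplicate
            dups.append(key)
    return dups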


from collections import Counter

cont = [1, 2, 3]
c = Counter(cont)
x = 2  # whichever item you want to check

if c[x] == 0:
    print("Not in cont")
elif c[x] == 1:
    print("Unique")
else:
    print("Duplicate")


You have to scan all the elements, because the duplicate could be among the last ones you check, so you can't do better than worst-case O(N) time, just like linear search. Unlike linear search, though, a straightforward duplicate scan uses O(N) memory, because you need to track what you've seen so far.

If the array is sorted you can find duplicates in O(N) time without using any additional memory because duplicate pairs will be next to each other.
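
A minimal sketch of that adjacent-pair walk, assuming the input is an already-sorted sequence (first_adjacent_dup is an illustrative name):

def first_adjacent_dup(a):
    # in a sorted sequence duplicates must sit next to each other,
    # so one pass over index pairs needs no extra storage
    for i in range(1, len(a)):
        if a[i] == a[i - 1]:
            return a[i]
    return None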


If your container is a list, you can just pass the value you're looking for to its count() method and check the result:

>>> l = [1,1,2,3]
>>> l.count(1)
2
>>> 

A dictionary can't have duplicate keys, nor can a set. Outside of these, I'd need to know what kind of container it is. I guess the real point is to always make sure you haven't missed an obvious solution to the problem before you go coding a custom solution. I fall prey to this myself sometimes :)


According to http://wiki.python.org/moin/TimeComplexity most of the list operations are terribly inefficient (just confirmed that x in myList does seem to be O(N) in python3).

The method given by the original poster is efficient because it is O(N) in time and space (this is the "best" you can do without making additional assumptions about your list, since list operations like x in myList are O(N)).

There is one major optimization possible: build up the set iteratively. This returns quickly on certain kinds of lists, e.g. [0,1,1,2,3,4,5,...]. However, you are implicitly assuming something about the distribution of your lists (for example, do you optimize for this case, or for duplicates at the end, or both?). The good thing about this optimization is that it doesn't affect the asymptotic speed. Here's how I would code it elegantly:

def hasDuplicate(iterable):
    visited = set()
    for item in iterable:
        if item in visited:
            return True
        visited.add(item)
    return False

You could also return the first duplicate, but you couldn't signal "no duplicate found" by returning None, since the iterable might itself contain None; you'd have to raise an exception instead.
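
A minimal sketch of that exception-raising variant (NoDuplicateFound and first_duplicate are illustrative names, not from the original answer):

class NoDuplicateFound(Exception):
    pass

def first_duplicate(iterable):
    visited = set()
    for item in iterable:
        if item in visited:
            return item
        visited.add(item)
    # None could be a legitimate element, so signal absence explicitly
    raise NoDuplicateFound("iterable contains no duplicate")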

Sidenote: there are ways to improve space efficiency for a minor hit to time efficiency (e.g. Bloom filters).
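
To illustrate that sidenote, here is one possible sketch of the Bloom-filter idea: a first pass flags probable repeats in a fixed-size bit array (false positives are possible), and a second pass verifies only those candidates exactly, so the exact sets stay small. The function name and the m and k parameters are illustrative assumptions, and the input must be re-iterable:

def find_dup_bloom(seq, m=1 << 20, k=3):
    bits = bytearray(m // 8)  # m-bit Bloom filter

    def positions(item):
        # k bit positions derived from Python's built-in hash
        return [hash((salt, item)) % m for salt in range(k)]

    candidates = set()
    for item in seq:  # pass 1: every true duplicate lands in candidates
        pos = positions(item)
        if all(bits[p >> 3] & (1 << (p & 7)) for p in pos):
            candidates.add(item)  # probably seen before
        for p in pos:
            bits[p >> 3] |= 1 << (p & 7)

    seen = set()
    for item in seq:  # pass 2: exact check restricted to candidates
        if item in candidates:
            if item in seen:
                return item
            seen.add(item)
    return None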


Another suggestion, similar to jonesy's answer. At least in Python 3 (I haven't tested Python 2.7), when the duplicate is placed with c[-5000] = 0, this becomes faster than solutions 3 and 4 of the original answer. Otherwise it is only slightly faster than solutions 1 and 2...

elif toTest == '7':
    # c.count(i) rescans the whole list for every element, but it
    # returns immediately when the very first element is a duplicate
    for i in c:
        if c.count(i) > 1:
            return i


Try this:

def getFirstDup(cont):
    # cont[:i] copies and linearly scans the prefix on every pass,
    # so this approach is O(N**2) overall
    for i in range(len(cont)):
        if cont[i] in cont[:i]:
            return cont[i]
    return None