Remove certain consecutive duplicates in list

2023-02-19 18:38 问答作者：

I have a list of strings like this:

['**', 'foo', '*', 'bar', 'bar', '**', '**', 'baz']

I want to replace the '**', '**' with a single '**', but leave 'bar', 'bar' intact. I.e. replace any consecutive number of '**' with a single one. My curren开发者_如何学Ct code looks like this:

p = ['**', 'foo', '*', 'bar', 'bar', '**', '**', 'baz']
np = [p[0]]
for pi in range(1,len(p)):
  if p[pi] == '**' and np[-1] == '**':
    continue
  np.append(p[pi])

Is there any more pythonic way to do this?

Not sure about pythonic, but this should work and is more terse:

star_list = ['**', 'foo', '*', 'bar', 'bar', '**', '**', 'baz']
star_list = [i for i, next_i in zip(star_list, star_list[1:] + [None]) 
             if (i, next_i) != ('**', '**')]

The above copies the list twice; if you want to avoid that, consider Tom Zych's method. Or, you could do as follows:

from itertools import islice, izip, chain

star_list = ['**', 'foo', '*', 'bar', 'bar', '**', '**', 'baz']
sl_shift = chain(islice(star_list, 1, None), [None])
star_list = [i for i, next_i in izip(star_list, sl_shift) 
             if (i, next_i) != ('**', '**')]

This can be generalized and made iterator-friendly -- not to mention more readable -- using a variation on the pairwise recipe from the itertools docs:

from itertools import islice, izip, chain, tee
def compress(seq, x):
    seq, shift = tee(seq)
    shift = chain(islice(shift, 1, None), (object(),))
    return (i for i, j in izip(seq, shift) if (i, j) != (x, x))

Tested:

>>> list(compress(star_list, '**'))
['**', 'foo', '*', 'bar', 'bar', '**', 'baz']

This is in my opinion pythonic

result = [v for i, v in enumerate(L) if L[i:i+2] != ["**", "**"]]

the only "trickery" being used is that L[i:i+2] is a list of one element when i == len(L)-1.

Note that of course the very same expression can also be used as a generator

This works. Not sure how Pythonic it is.

import itertools

p = ['**', 'foo', '*', 'bar', 'bar', '**', '**', 'baz']

q = []
for key, iter in itertools.groupby(p):
    q.extend([key] * (1 if key == '**' else len(list(iter))))

print(q)

from itertools import groupby

p = ['**', 'foo', '*', 'bar', 'bar', '**', '**', 'baz']
keep = set(['foo',  'bar', 'baz'])
result = []

for k, g in groupby(p):
    if k in keep:
        result.extend(list(g))
    else:
        result.append(k)

A generalised "pythonic" solution that works with any iterable (no backing up, no copying, no indexing, no slicing, doesn't fail if iterable is empty) and any thing-to-squeeze (including None):

>>> test = ['**', 'foo', '*', 'bar', 'bar', '**', '**', '**', 'baz', '**', '**',
...      'foo', '*','*', 'bar', 'bar','bar', '**', '**','foo','bar',]
>>>
>>> def squeeze(iterable, victim, _dummy=object()):
...     previous = _dummy
...     for item in iterable:
...         if item == victim == previous: continue
...         previous = item
...         yield item
...
>>> print test
['**', 'foo', '*', 'bar', 'bar', '**', '**', '**', 'baz', '**', '**', 'foo', '*'
, '*', 'bar', 'bar', 'bar', '**', '**', 'foo', 'bar']
>>> print list(squeeze(test, "**"))
['**', 'foo', '*', 'bar', 'bar', '**', 'baz', '**', 'foo', '*', '*', 'bar', 'bar
', 'bar', '**', 'foo', 'bar']
>>> print list(squeeze(["**"], "**"))
['**']
>>> print list(squeeze(["**", "**"], "**"))
['**']
>>> print list(squeeze([], "**"))
[]
>>>

Update for the edification of @eyquem who stated that victim could not be a sequence (or, presumably a set).

Having a container of victims means that there are two possible semantics:

>>> def squeeze2(iterable, victims, _dummy=object()):
...     previous = _dummy
...     for item in iterable:
...         if item == previous in victims: continue
...         previous = item
...         yield item
...
>>> def squeeze3(iterable, victims, _dummy=object()):
...     previous = _dummy
...     for item in iterable:
...         if item in victims and previous in victims: continue
...         previous = item
...         yield item
...
>>> guff = "c...d..e.f,,,g,,h,i.,.,.,.j"
>>> print "".join(squeeze2(guff, ".,"))
c.d.e.f,g,h,i.,.,.,.j
>>> print "".join(squeeze3(guff, ".,"))
c.d.e.f,g,h,i.j
>>>

A solution without itertools.groupby() :

p = ['**', 'foo', '*', 'bar', 'bar', '**', '**', '**', 'baz', '**', '**',
     'foo', '*','*', 'bar', 'bar','bar', '**', '**','foo','bar',]

def treat(A):
    prec = A[0]; yield prec
    for x in A[1:]:
        if (prec,x)!=('**','**'):  yield x
        prec = x

print p
print
print list(treat(p))

result

['**', 'foo', '*', 'bar', 'bar', '**', '**', '**',  
 'baz', '**', '**',
 'foo', '*', '*', 'bar', 'bar','bar', '**', '**',
 'foo', 'bar']


['**', 'foo', '*', 'bar', 'bar', '**',
 'baz', '**',
 'foo', '*', '*', 'bar', 'bar', 'bar', '**',
 'foo', 'bar']

Another solution, inspired from dugres

from itertools import groupby

p = ['**', 'foo', '*', 'bar', 'bar', '**', '**', '**', 'baz', '**', '**',
     'foo', '*','*', 'bar', 'bar','bar', '**', '**','foo','bar',]

res = []
for k, g in groupby(p):
    res.extend(  ['**'] if k=='**' else list(g) )    
print res

It's like Tom Zych's solution, but simpler

EDIT

p = ['**','**', 'foo', '*', 'bar', 'bar', '**', '**', '**', 'baz', '**', '**',
     'foo', '*','*', 'bar', 'bar','bar', '**', '**','foo','bar', '**', '**', '**']


q= ['**',12,'**',45, 'foo',78, '*',751, 'bar',4789, 'bar',3, '**', 5,'**',7, '**',
    73,'baz',4, '**',8, '**',20,'foo', 8,'*',36,'*', 36,'bar', 11,'bar',0,'bar',9,
    '**', 78,'**',21,'foo',27,'bar',355, '**',33, '**',37, '**','end']

def treat(B,dedupl):
    B = iter(B)
    prec = B.next(); yield prec
    for x in B:
        if not(prec==x==dedupl):  yield x
        prec = x

print 'gen = ( x for x in q[::2])'
gen = ( x for x in q[::2])
print 'list(gen)==p is ',list(gen)==p
gen = ( x for x in q[::2])
print 'list(treat(gen)==',list(treat(gen,'**'))

ch = '??h4i4???4t4y?45l????hmo4j5???'
print '\nch==',ch
print "''.join(treat(ch,'?'))==",''.join(treat(ch,'?'))

print "\nlist(treat([],'%%'))==",list(treat([],'%%'))

result

gen = ( x for x in q[::2])
list(gen)==p is  True
list(treat(gen)== ['**', 'foo', '*', 'bar', 'bar', '**', 'baz', '**', 'foo', '*', '*', 'bar', 'bar', 'bar', '**', 'foo', 'bar', '**']

ch== ??h4i4???4t4y?45l????hmo4j5???
''.join(treat(ch,'?'))== ?h4i4?4t4y?45l?hmo4j5?

list(treat([],'%%'))== []

Remark: A generator function allows to adapt the output to the type of the input with a writing around the call of the generator, it dosn't require to change th internal code of the genrator function;

That's not the case with the Tom Zynch's solution , that can't be adapted so easely to the type of the input

EDIT 2

I searched a one-line method, with a list comprehension or a generator expression.

I found to ways of doing this, I think it isn't possible to do without groupby()

from itertools import groupby
from operator import concat

p = ['**', '**','foo', '*', 'bar', 'bar', '**', '**', '**',
     'bar','**','foo','sun','sun','sun']
print 'p==',p,'\n'

dedupl = ("**",'sun')
print 'dedupl==',repr(dedupl)

print [ x for k, g in groupby(p) for x in ((k,) if k in dedupl else g) ]

# or

print reduce(concat,( [k] if k in dedupl else list(g) for k, g in groupby(p)),[])

Based on the same principle, it is easy to convert the function of dugres into a generator function:

from itertools import groupby

def compress(iterable, to_compress):
    for k, g in groupby(iterable):
        if k in to_compress:
            yield k
        else:
            for x in g: yield x

However , this generator function has two disadvantages:

it resorts to the function groupby(), which is not easy to understand by someone not used to Python
its execution's time is longer than the ones of my generator function treat() and the generator function of John Machin, that don't use groupby().

I slightly modified them to make them able to accept a sequence of items to be de-duplicated, and I measured the durations of execution:

from time import clock
from itertools import groupby

def squeeze(iterable, victims, _dummy=object()):
    if hasattr(iterable, '__iter__') and not hasattr(victims, '__iter__'):
        victims = (victims,)
    previous = _dummy
    for item in iterable:
        if item in victims and item==previous:
            continue
        previous = item
        yield item

def treat(B,victims):
    if hasattr(B, '__iter__') and not hasattr(victims, '__iter__'):
        victims = (victims,)
    B = iter(B)
    prec = B.next(); yield prec
    for x in B:
        if x  not in victims or x!=prec:  yield x
        prec = x

def compress(iterable, to_compress):
    if hasattr(iterable, '__iter__') and not hasattr(to_compress, '__iter__'):
        to_compress = (to_compress,)
    for k, g in groupby(iterable):
        if k in to_compress:
            yield k
        else:
            for x in g: yield x

p = ['**', '**','su','foo', '*', 'bar', 'bar', '**', '**', '**',
     'su','su','**','bin', '*','*','bar','bar','su','su','su']

n = 10000

te = clock()
for i in xrange(n):
    a = list(compress(p,('**','sun')))
print clock()-te,'  generator function with groupby()'

te = clock()
for i in xrange(n):
    b = list(treat(p,('**','sun')))
print clock()-te,'  generator function eyquem'


te = clock()
for i in xrange(n):
    c = list(squeeze(p,('**','sun')))
print clock()-te,'  generator function John Machin'

print p
print 'a==b==c is ',a==b==c
print a

The instruction

if hasattr(iterable, '__iter__') and not hasattr(to_compress, '__iter__'):
    to_compress = (to_compress,)

is necessary to avoid errors when the the iterable argument is a sequence and the other argument only one string: this latter needs then to be modified into a container, provided that the iterable argument isn't a string itself.

It is based on the fact that sequences like tuples, lists, stes... have the method iter, but strings haven't. The following code shows the problems:

def compress(iterable, to_compress):
    if hasattr(iterable, '__iter__') and not hasattr( to_compress, '__iter__'):
        to_compress = (to_compress,)
    print 't_compress==',repr(to_compress)
    for k, g in groupby(iterable):
        if k in to_compress:
            yield k
        else:
            for x in g: yield x


def compress_bof(iterable, to_compress):
    if not hasattr(to_compress, '__iter__'): # to_compress is a string
        to_compress = (to_compress,)
    print 't_compress==',repr(to_compress)
    for k, g in groupby(iterable):
        if k in to_compress:
            yield k
        else:
            for x in g: yield x


def compress_bug(iterable, to_compress_bug):
    print 't_compress==',repr(to_compress_bug)
    for k, g in groupby(iterable):
        #print 'k==',k,k in to_compress_bug
        if k in to_compress_bug:
            yield k
        else:
            for x in g: yield x


q = ';;;htr56;but78;;;;$$$$;ios4!'
print 'q==',q
dedupl = ";$"
print 'dedupl==',repr(dedupl)
print

print "''.join(compress    (q,"+repr(dedupl)+")) :\n",''.join(compress    (q,dedupl))+\
      ' <-CORRECT ONE'
print
print "''.join(compress_bof(q,"+repr(dedupl)+")) :\n",''.join(compress_bof(q,dedupl))+\
      '  <====== error ===='
print
print "''.join(compress_bug(q,"+repr(dedupl)+")) :\n",''.join(compress_bug(q,dedupl))

print '\n\n\n'


q = [';$', ';$',';$','foo', ';', 'bar','bar',';',';',';','$','$','foo',';$12',';$12']
print 'q==',q
dedupl = ";$12"
print 'dedupl==',repr(dedupl)
print
print 'list(compress    (q,'+repr(dedupl)+')) :\n',list(compress    (q,dedupl)),\
      ' <-CORRECT ONE'
print
print 'list(compress_bof(q,'+repr(dedupl)+')) :\n',list(compress_bof(q,dedupl))
print
print 'list(compress_bug(q,'+repr(dedupl)+')) :\n',list(compress_bug(q,dedupl)),\
      '  <====== error ===='
print

result

q== ;;;htr56;but78;;;;$$$$;ios4!
dedupl== ';$'

''.join(compress    (q,';$')) :
t_compress== ';$'
;htr56;but78;$;ios4! <-CORRECT ONE

''.join(compress_bof(q,';$')) :
t_compress== (';$',)
;;;htr56;but78;;;;$$$$;ios4!  <====== error ====

''.join(compress_bug(q,';$')) :
t_compress== ';$'
;htr56;but78;$;ios4!




q== [';$', ';$', ';$', 'foo', ';', 'bar', 'bar', ';', ';', ';', '$', '$', 'foo', ';$12', ';$12']
dedupl== ';$12'

list(compress    (q,';$12')) :
t_compress== (';$12',)
[';$', ';$', ';$', 'foo', ';', 'bar', 'bar', ';', ';', ';', '$', '$', 'foo', ';$12']  <-CORRECT ONE

list(compress_bof(q,';$12')) :
t_compress== (';$12',)
[';$', ';$', ';$', 'foo', ';', 'bar', 'bar', ';', ';', ';', '$', '$', 'foo', ';$12']

list(compress_bug(q,';$12')) :
t_compress== ';$12'
[';$', 'foo', ';', 'bar', 'bar', ';', '$', 'foo', ';$12']   <====== error ====

I obtained the following execution's time:

0.390163274941   generator function with groupby()
0.324547114228   generator function eyquem
0.310176572721   generator function John Machin
['**', '**', 'su', 'foo', '*', 'bar', 'bar', '**', '**', '**', 'su', 'su', '**', 'bin', '*', '*', 'bar', 'bar', 'su', 'su', 'su']
a==b==c is  True
['**', 'su', 'foo', '*', 'bar', 'bar', '**', 'su', 'su', '**', 'bin', '*', '*', 'bar', 'bar', 'su', 'su', 'su']

I prefer the solution of John Machin because there is no instruction B = iter(B) as in mine.

But the instruction previous = _dummy with _dummy = object() appears weird to me. So I finally think the better solution is the following code, that works even with a string as iterable argument, and in which the first object previous being defined isn't a fake:

def squeeze(iterable, victims):
    if hasattr(iterable, '__iter__') and not hasattr(victims, '__iter__'):
        victims = (victims,)
    for item in iterable:
        previous = item
        break
    for item in iterable:
        if item in victims and item==previous:
            continue
        previous = item
        yield item

EDIT 3

I had undesrtood that object() was used as a sentinel.

But I was puzzled by the fact object is CALLED. Yesterday, I was thinking that object is something so peculiar that it is impossible that object be in any iterable passed as an argument to squeeze(). So, I was wondering why you called it, John Machin, and that sowed doubt in my mind concerning its nature; that's why I asked you a confirmation that object is the super meta-class.

But today, I think I understand why object is called in your code.

In fact, it is quite possible that object be in an iterable, why not ? Super meta-class object is an object, so nothing prevents it to have been put in the iterable before a de-duplication is processed on the iterable, who knows ? Then using object itself as a sentinel is incorrect practice.

So you didn't use object but an instance object() as a sentinel.

But I wondered why to choose this mysterious thing that the return of a call to object is ?

My reflections went on about this point and I remarked something that must be the reason of this call:

calling object creates an instance , since object is the most base class in Python, and each time an instance is created, it is a different object from any prior created instance, with a value always different from the value of any prior object's instance:

a = object()
b = object()
c = object()
d = object()

print id(a),'\n',id(b),'\n',id(c),'\n',id(d)

print a==b,a==c,a==d
print b==c,b==d,c==d

result

10818752 
10818760 
10818768 
10818776
False False False
False False False

So it is sure that _dummy=object() is a unique object, having a unique id and a unique value. By the way , I wonder what is the value of an object's instance. Anyway the following code shows the problem with _dummy=object and the absence of problem with _dummy=object()

def imperfect_squeeze(iterable, victim, _dummy=object):
    previous = _dummy
    print 'id(previous)   ==',id(previous)
    print 'id(iterable[0])==',id(iterable[0])
    for item in iterable:
        if item in victim and item==previous:  continue
        previous = item; yield item

def squeeze(iterable, victim, _dummy=object()):
    previous = _dummy
    print 'id(previous)   ==',id(previous)
    print 'id(iterable[0])==',id(iterable[0])
    for item in iterable:
        if item in victim and item==previous:  continue
        previous = item; yield item

wat = object
li = [wat,'**','**','foo',wat,wat]
print 'imperfect_squeeze\n''li before ==',li
print map(id,li)
li = list(imperfect_squeeze(li,[wat,'**']))
print 'li after  ==',li
print


wat = object()
li = [wat,'**','**','foo',wat,wat]
print 'squeeze\n''li before ==',li
print map(id,li)
li = list(squeeze(li,[wat,'**']))
print 'li after  ==',li
print


li = [object(),'**','**','foo',object(),object()]
print 'squeeze\n''li before ==',li
print map(id,li)
li = list(squeeze(li,[li[0],'**']))
print 'li after  ==',li

result

imperfect_squeeze
li before == [<type 'object'>, '**', '**', 'foo', <type 'object'>, <type 'object'>]
[505317320, 18578968, 18578968, 13208848, 505317320, 505317320]
id(previous)   == 505317320
id(iterable[0])== 505317320
li after  == ['**', 'foo', <type 'object'>]

squeeze
li before == [<object object at 0x00A514C8>, '**', '**', 'foo', <object object at 0x00A514C8>, <object object at 0x00A514C8>]
[10818760, 18578968, 18578968, 13208848, 10818760, 10818760]
id(previous)   == 10818752
id(iterable[0])== 10818760
li after  == [<object object at 0x00A514C8>, '**', 'foo', <object object at 0x00A514C8>]

squeeze
li before == [<object object at 0x00A514D0>, '**', '**', 'foo', <object object at 0x00A514D8>, <object object at 0x00A514E0>]
[10818768, 18578968, 18578968, 13208848, 10818776, 10818784]
id(previous)   == 10818752
id(iterable[0])== 10818768
li after  == [<object object at 0x00A514D0>, '**', 'foo', <object object at 0x00A514D8>, <object object at 0x00A514E0>]

The problem is consisting in the absence of <type 'object'> as first element of the list after treatment by imperfect_squeeze() .

However, we must note that the "problem" is possible only with a list whose FIRST element is object: that's a lot of reflections about such a tiny probability.... but a rigorous coder takes account of all.

If we resort to list, instead of object , the results are different:

def imperfect_sqlize(iterable, victim, _dummy=list):
    previous = _dummy
    print 'id(previous)   ==',id(previous)
    print 'id(iterable[0])==',id(iterable[0])
    for item in iterable:
        if item in victim and item==previous:  continue
        previous = item; yield item

def sqlize(iterable, victim, _dummy=list()):
    previous = _dummy
    print 'id(previous)   ==',id(previous)
    print 'id(iterable[0])==',id(iterable[0])
    for item in iterable:
        if item in victim and item==previous:  continue
        previous = item; yield item

wat = list
li = [wat,'**','**','foo',wat,wat]
print 'imperfect_sqlize\n''li before ==',li
print map(id,li)
li = list(imperfect_sqlize(li,[wat,'**']))
print 'li after  ==',li
print

wat = list()
li = [wat,'**','**','foo',wat,wat]
print 'sqlize\n''li before ==',li
print map(id,li)
li = list(sqlize(li,[wat,'**']))
print 'li after  ==',li
print

li = [list(),'**','**','foo',list(),list()]
print 'sqlize\n''li before ==',li
print map(id,li)
li = list(sqlize(li,[li[0],'**']))
print 'li after  ==',li

result

imperfect_sqlize
li before == [<type 'list'>, '**', '**', 'foo', <type 'list'>, <type 'list'>]
[505343304, 18578968, 18578968, 13208848, 505343304, 505343304]
id(previous)   == 505343304
id(iterable[0])== 505343304
li after  == ['**', 'foo', <type 'list'>]

sqlize
li before == [[], '**', '**', 'foo', [], []]
[18734936, 18578968, 18578968, 13208848, 18734936, 18734936]
id(previous)   == 18734656
id(iterable[0])== 18734936
li after  == ['**', 'foo', []]

sqlize
li before == [[], '**', '**', 'foo', [], []]
[18734696, 18578968, 18578968, 13208848, 18735016, 18734816]
id(previous)   == 18734656
id(iterable[0])== 18734696
li after  == ['**', 'foo', []]

Is there any other object than object in Python that have this peculiarity ?

John Machin, why did you choose an instance of object as a sentinel in your generator function ? Did you already know the above peculiarity ?

继续阅读：duplicates list python

Remove certain consecutive duplicates in list

EDIT

EDIT 2

EDIT 3

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

EDIT

EDIT 2

EDIT 3

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？