Prune similar strings by length
How would one remove elements containing strings from a 开发者_C百科Python list by their similarity and length (if a string X is found in another, longer string Y, X must be removed)?
IN: [('this is string that stays', 0), ('this is string', 1), ('string that stays', 2), ('i am safe', 3)]
OUT: [('this is string that stays', 0), ('i am safe', 3)]
Here you go:
l = [('this is string that stays', 0), ('this is string', 1), ('string that stays', 2), ('i am safe', 3)]
survivors = set(s for s, _ in l)
for s1, _ in l:
if any(s1 != s2 and s1 in s2 for s2 in survivors):
survivors.discard(s1)
survivors
is what you want, except that it does not contain the numbers of the input tuples -- changing this should be an exercise for the reader :-P.
if you don't mind order(N*N)
>>> s=[('this is string that stays', 0), ('this is string', 1), ('string that stays', 2), ('i am safe', 3)]
>>> s=[i[0] for i in s]
>>> result=[s[i] for i in range(len(s)) if not any(s[i] in s[j] for j in range(i)+range(i+1,len(s)-i))]
>>> result
['this is string that stays', 'i am safe']
If you do care about efficiency, I suggest you split each string into a sequences of words (or even characters) and make a tree data structure such as a trie (http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=usingTries) which allows fast look ups on each sub sequence
All, other answers provide good solutions. I just want to add a note on your attempt:
for i in range(0, len(d)):
for j in range(1, len(d)):
if d[j][0] in d[i][0] and len(d[i][0]) > len(d[j][0]):
del d[j]
this fails with list indices out of range because your deleting while iterating over the list. Here's one way to prevent this problem:
d = [('this is string that stays', 0), ('this is string', 1), ('string that stays', 2), ('i am safe', 3)]
to_be_removed = list()
for i in range(0, len(d)):
for j in range(0, len(d)):
if i != j and d[j][0] in d[i][0] and len(d[i][0]) > len(d[j][0]):
to_be_removed.append(j)
for m, n in enumerate(to_be_removed):
del d[n - m]
print d
Try this :
IN = [('this is string that stays', 0), ('this is string', 1), ('string that stays', 2), ('i am safe', 3)]
OUT=[]
def check_item(liste, item2check):
for item, _ in liste:
if item2check in item and len(item2check) < len(item):
return True
return False
for item, rank in IN:
if not check_item(IN, item):
OUT.append((item, rank))
# or in a list-comprehension :
OUT = [(item, rank) for item, rank in IN if not check_item(IN, item)]
print OUT
>>> [('this is string that stays', 0), ('i am safe', 3)]
精彩评论