Longest common substring from more than two strings
I'm looking for a Python library for finding the longest common substring from a set of strings. There are two ways to solve this problem:
- using suffix trees
- using dynamic programming.
The method implemented is not important. What matters is that it can be used for a set of strings (not only two strings).
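For context, the classic dynamic-programming formulation for the two-string case looks roughly like this (a minimal sketch of the textbook algorithm, not a library; the answers below generalise to a set of strings):

    def lcs_two(a, b):
        # dp[i][j] = length of the longest common substring ending
        # at a[i-1] and b[j-1]; O(len(a) * len(b)) time and space
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        best, end = 0, 0
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                if a[i - 1] == b[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                    if dp[i][j] > best:
                        best, end = dp[i][j], i
        return a[end - best:end]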
These paired functions will find the longest common substring in an arbitrary list of strings:
    def long_substr(data):
        substr = ''
        if len(data) > 1 and len(data[0]) > 0:
            for i in range(len(data[0])):
                for j in range(len(data[0]) - i + 1):
                    if j > len(substr) and is_substr(data[0][i:i+j], data):
                        substr = data[0][i:i+j]
        return substr
    def is_substr(find, data):
        if len(data) < 1 and len(find) < 1:
            return False
        for i in range(len(data)):
            if find not in data[i]:
                return False
        return True
    print long_substr(['Oh, hello, my friend.',
                       'I prefer Jelly Belly beans.',
                       'When hell freezes over!'])
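For the strings above, this prints ell.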
No doubt the algorithm could be improved, and I've not had a lot of exposure to Python, so perhaps it could be more efficient syntactically as well, but it should do the job.
EDIT: inlined the second is_substr function as demonstrated by J.F. Sebastian. Usage remains the same. Note: no change to the algorithm.
    def long_substr(data):
        substr = ''
        if len(data) > 1 and len(data[0]) > 0:
            for i in range(len(data[0])):
                for j in range(len(data[0]) - i + 1):
                    if j > len(substr) and all(data[0][i:i+j] in x for x in data):
                        substr = data[0][i:i+j]
        return substr
Hope this helps,
Jason.
This can be written more compactly:
    def long_substr(data):
        substrs = lambda x: {x[i:i+j] for i in range(len(x)) for j in range(len(x) - i + 1)}
        s = substrs(data[0])
        for val in data[1:]:
            s.intersection_update(substrs(val))
        return max(s, key=len)
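On the question's sample strings this returns ell; if several substrings tie for the maximum length, max picks one of them arbitrarily, since sets are unordered.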
Sets are implemented as hash tables, which makes this a bit memory-hungry. If you (1) implement the set datatype as a trie and (2) store just the suffixes in the trie, then force each node to be an endpoint (the equivalent of adding all substrings), then in theory I would guess this would be pretty memory-efficient, especially since intersections of tries are easy (a rough sketch of that idea follows below).
Nevertheless, this is short, and premature optimization is the root of a significant amount of wasted time.
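To make the trie idea concrete, here is a minimal illustrative sketch (my own, and deliberately uncompressed, so it actually uses O(n^2) nodes per string and deep recursion; the memory-efficient variant imagined above would need a compressed suffix tree):

    class TrieNode:
        def __init__(self):
            self.children = {}

    def add_suffixes(root, s):
        # insert every suffix of s; every root-to-node path is then a substring of s
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node.children.setdefault(ch, TrieNode())

    def intersect(a, b):
        # keep only the paths present in both tries
        out = TrieNode()
        for ch in a.children.keys() & b.children.keys():
            out.children[ch] = intersect(a.children[ch], b.children[ch])
        return out

    def deepest_path(node):
        # the deepest path in the intersected trie is the longest common substring
        best = ''
        for ch, child in node.children.items():
            cand = ch + deepest_path(child)
            if len(cand) > len(best):
                best = cand
        return best

    def long_substr_trie(strings):
        roots = []
        for s in strings:
            root = TrieNode()
            add_suffixes(root, s)
            roots.append(root)
        acc = roots[0]
        for other in roots[1:]:
            acc = intersect(acc, other)
        return deepest_path(acc)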
    def common_prefix(strings):
        """ Find the longest string that is a prefix of all the strings.
        """
        if not strings:
            return ''
        prefix = strings[0]
        for s in strings:
            if len(s) < len(prefix):
                prefix = prefix[:len(s)]
            if not prefix:
                return ''
            for i in range(len(prefix)):
                if prefix[i] != s[i]:
                    prefix = prefix[:i]
                    break
        return prefix
From http://bitbucket.org/ned/cog/src/tip/cogapp/whiteutils.py
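Note this finds the longest common prefix rather than a general common substring. A quick usage example; if a prefix is all you need, the standard library's os.path.commonprefix does the same character-wise computation:

    import os.path

    strings = ['interspecies', 'interstellar', 'interstate']
    print(common_prefix(strings))          # inters
    print(os.path.commonprefix(strings))   # inters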
I prefer this version of is_substr, as I find it a bit more readable and intuitive:
    def is_substr(find, data):
        """
        Takes a substring to find; returns True only
        if it is found in every string in the data list.
        """
        if len(find) < 1 or len(data) < 1:
            return False  # expected input DNE
        is_found = True  # and-ing with False anywhere in data will return False
        for i in data:
            print "Looking for substring %s in %s..." % (find, i)
            is_found = is_found and find in i
        return is_found
    # This does not increase asymptotic complexity,
    # but can still waste more time than it saves. TODO: profile
    def shortest_of(strings):
        return min(strings, key=len)

    def long_substr(strings):
        substr = ""
        if not strings:
            return substr
        reference = shortest_of(strings)  # strings[0]
        length = len(reference)
        # find a suitable slice i:j
        for i in xrange(length):
            # only consider candidates at least len(substr) + 1 long
            for j in xrange(i + len(substr) + 1, length + 1):
                candidate = reference[i:j]  # ↓ is the slice recalculated every time?
                if all(candidate in text for text in strings):
                    substr = candidate
        return substr
Disclaimer: this adds very little to jtjacques' answer. However, it should hopefully be more readable and faster, and it didn't fit in a comment, hence why I'm posting it as an answer. I'm not satisfied with shortest_of, to be honest.
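For the "TODO: profile" above, a hypothetical micro-benchmark (the sample data is made up; substitute your own workload):

    import timeit

    sample = ['Oh, hello, my friend.',
              'I prefer Jelly Belly beans.',
              'When hell freezes over!'] * 10

    # time 100 runs of the function above on the repeated sample
    print(timeit.timeit(lambda: long_substr(sample), number=100))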
If someone is looking for a generalized version that can also take a list of sequences of arbitrary objects:
    def get_longest_common_subseq(data):
        substr = []
        if len(data) > 1 and len(data[0]) > 0:
            for i in range(len(data[0])):
                for j in range(len(data[0]) - i + 1):
                    if j > len(substr) and is_subseq_of_any(data[0][i:i+j], data):
                        substr = data[0][i:i+j]
        return substr
    # despite the name, this checks that find occurs in every sequence
    def is_subseq_of_any(find, data):
        if len(data) < 1 and len(find) < 1:
            return False
        for i in range(len(data)):
            if not is_subseq(find, data[i]):
                return False
        return True
    # Will also return True if possible_subseq == seq.
    def is_subseq(possible_subseq, seq):
        if len(possible_subseq) > len(seq):
            return False
        def get_length_n_slices(n):
            for i in xrange(len(seq) + 1 - n):
                yield seq[i:i+n]
        for slyce in get_length_n_slices(len(possible_subseq)):
            if slyce == possible_subseq:
                return True
        return False
    print get_longest_common_subseq([[1, 2, 3, 4, 5], [2, 3, 4, 5, 6]])
    print get_longest_common_subseq(['Oh, hello, my friend.',
                                     'I prefer Jelly Belly beans.',
                                     'When hell freezes over!'])
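The first call prints [2, 3, 4, 5] and the second prints ell.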
My answer is pretty slow, but very easy to understand. Working on a file with 100 strings of 1 kB each takes about two seconds, and it returns any one longest substring if there is more than one:
    ls = list()  # populate with the input strings before running
    ls.sort(key=len)
    s1 = ls.pop(0)
    maxl = len(s1)
    # 1 Create a list of all substrings of the shortest string, sorted longest
    #   first, so we don't have to check the whole list.
    subs = [s1[i:j] for i in range(maxl) for j in range(maxl, i, -1)]
    subs.sort(key=len, reverse=True)
    # 2 Check each substring against the remaining strings; if it is missing
    #   from any of them, move on, it's not common. The first one that passes
    #   every check is the longest by construction.
    def isasub(subs, ls):
        for sub in subs:
            for st in ls:
                if sub not in st:
                    break
            else:
                return sub
    print('the longest common substring is: ', isasub(subs, ls))
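The snippet above assumes ls has already been populated; a minimal sketch, assuming one input string per line in a file called strings.txt (the filename is hypothetical):

    with open('strings.txt') as f:
        ls = [line.rstrip('\n') for line in f if line.strip()]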
Caveman solution that will give you a DataFrame with the most frequent substrings in a string, based on the substring lengths you pass as a list:
    import pandas as pd

    lista = ['How much wood would a woodchuck', ' chuck if a woodchuck could chuck wood?']
    string = ''
    for i in lista:
        string = string + ' ' + str(i)
    string = string.lower()
    characters_you_would_like_to_remove_from_string = [' ', '-', '_']
    for i in characters_you_would_like_to_remove_from_string:
        string = string.replace(i, '')
    substring_length_you_want_to_check = [3, 4, 5, 6, 7, 8]
    results_list = []
    for string_length in substring_length_you_want_to_check:
        for i in range(len(string)):
            checking_str = string[i:i+string_length]
            if len(checking_str) == string_length:
                # count non-overlapping occurrences via the length difference
                # after removing every copy of the substring
                number_of_times_appears = (len(string) - len(string.replace(checking_str, ''))) / string_length
                results_list = results_list + [[checking_str, number_of_times_appears]]
    df = pd.DataFrame(data=results_list, columns=['string', 'freq'])
    df['freq'] = df['freq'].astype('int64')
    df = df.drop_duplicates()
    df = df.sort_values(by='freq', ascending=False)
    display(df[:10])  # display() works in Jupyter; use print(df[:10]) elsewhere
result is:
string freq
78 huck 4
63 wood 4
77 chuc 4
132 chuck 4
8 ood 4
7 woo 4
21 chu 4
23 uck 4
22 huc 4
20 dch 3
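For comparison, a minimal sketch of the same tally done with collections.Counter (the function name top_substrings is made up); note it counts overlapping windows, whereas the replace() trick above counts non-overlapping occurrences:

    from collections import Counter

    def top_substrings(text, lengths, k=10):
        # tally every window of each requested length
        counts = Counter()
        for n in lengths:
            for i in range(len(text) - n + 1):
                counts[text[i:i+n]] += 1
        return counts.most_common(k)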
The addition of a single break speeds up jtjacques's answer significantly on my machine (1000x or so for 16K files). The break is safe because if the slice starting at i is not a common substring, no longer slice starting at i can be one either:
    def long_substr(data):
        substr = ''
        if len(data) > 1 and len(data[0]) > 0:
            for i in range(len(data[0])):
                for j in range(len(substr) + 1, len(data[0]) - i + 1):
                    if all(data[0][i:i+j] in x for x in data[1:]):
                        substr = data[0][i:i+j]
                    else:
                        break
        return substr
You could use the SuffixTree module, which is a wrapper based on an ANSI C implementation of generalised suffix trees. The module is easy to handle.
Take a look at: here