Longest common substring from more than two strings

I'm looking for a Python library for finding the longest common sub-string from a set of strings. There are two ways to solve this problem:

  • using suffix trees
  • using dynamic programming.

The method implemented is not important. What matters is that it can be used for a set of strings (not only two strings).


These paired functions will find the longest common substring in an arbitrary array of strings:

def long_substr(data):
    substr = ''
    if len(data) > 1 and len(data[0]) > 0:
        for i in range(len(data[0])):
            for j in range(len(data[0])-i+1):
                if j > len(substr) and is_substr(data[0][i:i+j], data):
                    substr = data[0][i:i+j]
    return substr

def is_substr(find, data):
    if len(data) < 1 or len(find) < 1:
        return False
    for i in range(len(data)):
        if find not in data[i]:
            return False
    return True


print(long_substr(['Oh, hello, my friend.',
                   'I prefer Jelly Belly beans.',
                   'When hell freezes over!']))

No doubt the algorithm could be improved and I've not had a lot of exposure to Python, so maybe it could be more efficient syntactically as well, but it should do the job.

EDIT: in-lined the second is_substr function as demonstrated by J.F. Sebastian. Usage remains the same. Note: no change to algorithm.

def long_substr(data):
    substr = ''
    if len(data) > 1 and len(data[0]) > 0:
        for i in range(len(data[0])):
            for j in range(len(data[0])-i+1):
                if j > len(substr) and all(data[0][i:i+j] in x for x in data):
                    substr = data[0][i:i+j]
    return substr

Hope this helps,

Jason.


This can be done shorter:

def long_substr(data):
  substrs = lambda x: {x[i:i+j] for i in range(len(x)) for j in range(len(x) - i + 1)}
  s = substrs(data[0])
  for val in data[1:]:
    s.intersection_update(substrs(val))
  return max(s, key=len)

Sets are (probably) implemented as hash maps, which makes this a bit inefficient. If you (1) implement a set datatype as a trie and (2) store just the suffixes in the trie and then treat every node as an endpoint (the equivalent of adding all substrings), then in theory I would guess this baby is pretty memory efficient, especially since intersections of tries are super easy.

Nevertheless, this is short and premature optimization is the root of a significant amount of wasted time.
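
For what it's worth, the trie idea above can be sketched roughly as follows. This is only an illustration under assumptions, not a tuned implementation: the helper names (suffix_trie, intersect, deepest_path, long_substr_trie) are made up for this example, the trie is a plain nested dict, and it is uncompressed, so it can need on the order of n^2 nodes for a string of length n. The recursion also limits it to fairly short strings.

```python
def suffix_trie(text):
    """Build a nested-dict trie holding every suffix of text.

    Every node counts as an endpoint, so each root-to-node path
    spells one substring of text.
    """
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def intersect(a, b):
    """Keep only the paths that exist in both tries."""
    return {ch: intersect(a[ch], b[ch]) for ch in a if ch in b}


def deepest_path(node):
    """Return the longest string spelled by any path below node."""
    best = ''
    for ch, child in node.items():
        candidate = ch + deepest_path(child)
        if len(candidate) > len(best):
            best = candidate
    return best


def long_substr_trie(data):
    """Longest substring common to all strings in data ('' if none)."""
    if not data:
        return ''
    common = suffix_trie(data[0])
    for text in data[1:]:
        common = intersect(common, suffix_trie(text))
    return deepest_path(common)


print(long_substr_trie(['Oh, hello, my friend.',
                        'I prefer Jelly Belly beans.',
                        'When hell freezes over!']))
```

A real suffix tree compresses chains of single-child nodes, which is what brings the memory down to linear; the sketch skips that step to stay short.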


def common_prefix(strings):
    """ Find the longest string that is a prefix of all the strings.
    """
    if not strings:
        return ''
    prefix = strings[0]
    for s in strings:
        if len(s) < len(prefix):
            prefix = prefix[:len(s)]
        if not prefix:
            return ''
        for i in range(len(prefix)):
            if prefix[i] != s[i]:
                prefix = prefix[:i]
                break
    return prefix

From http://bitbucket.org/ned/cog/src/tip/cogapp/whiteutils.py
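
As a side note, the standard library already has a character-by-character common prefix in os.path.commonprefix. Despite living in os.path, it compares characters, not path components. The wrapper name common_prefix_stdlib below is just a label for this sketch. Keep in mind that either way this finds a common prefix, which is only the longest common substring when the shared text sits at the start of every string.

```python
import os.path


def common_prefix_stdlib(strings):
    """Longest string that is a prefix of all the strings ('' if none)."""
    # os.path.commonprefix works character by character and
    # returns '' for an empty list.
    return os.path.commonprefix(list(strings))


print(common_prefix_stdlib(['interstellar', 'internet', 'interval']))
```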


I prefer this for is_substr, as I find it a bit more readable and intuitive:

def is_substr(find, data):
  """
  Takes a substring to find; returns True only
  if it is found in every string in the data list.
  """

  if len(find) < 1 or len(data) < 1:
    return False # expected input does not exist

  is_found = True # and-ing with False anywhere in data makes the result False
  for i in data:
    print("Looking for substring %s in %s..." % (find, i))
    is_found = is_found and find in i
  return is_found


# this does not increase asymptotic complexity
# but can still waste more time than it saves. TODO: profile
def shortest_of(strings):
    return min(strings, key=len)

def long_substr(strings):
    substr = ""
    if not strings:
        return substr
    reference = shortest_of(strings) #strings[0]
    length = len(reference)
    #find a suitable slice i:j
    for i in range(length):
        #only consider candidates that are at least len(substr) + 1 long
        for j in range(i + len(substr) + 1, length + 1):
            candidate = reference[i:j]  # ↓ is the slice recalculated every time?
            if all(candidate in text for text in strings):
                substr = candidate
    return substr

Disclaimer: This adds very little to jtjacques' answer. However, it should hopefully be more readable and faster, and it didn't fit in a comment, which is why I'm posting it as an answer. I'm not satisfied with shortest_of, to be honest.


If someone is looking for a generalized version that can also take a list of sequences of arbitrary objects:

def get_longest_common_subseq(data):
    substr = []
    if len(data) > 1 and len(data[0]) > 0:
        for i in range(len(data[0])):
            for j in range(len(data[0])-i+1):
                if j > len(substr) and is_subseq_of_any(data[0][i:i+j], data):
                    substr = data[0][i:i+j]
    return substr

def is_subseq_of_any(find, data):
    if len(data) < 1 or len(find) < 1:
        return False
    for i in range(len(data)):
        if not is_subseq(find, data[i]):
            return False
    return True

# Will also return True if possible_subseq == seq.
def is_subseq(possible_subseq, seq):
    if len(possible_subseq) > len(seq):
        return False
    def get_length_n_slices(n):
        for i in range(len(seq) + 1 - n):
            yield seq[i:i+n]
    for slyce in get_length_n_slices(len(possible_subseq)):
        if slyce == possible_subseq:
            return True
    return False

print(get_longest_common_subseq([[1, 2, 3, 4, 5], [2, 3, 4, 5, 6]]))

print(get_longest_common_subseq(['Oh, hello, my friend.',
                                 'I prefer Jelly Belly beans.',
                                 'When hell freezes over!']))


My answer is pretty slow but very easy to understand. Working on a file with 100 strings of 1 KB each takes about two seconds. It returns any one of the longest substrings if there is more than one.

# Example input only (an assumption); replace with your own strings, e.g. the lines of a file.
ls = ['Oh, hello, my friend.',
      'I prefer Jelly Belly beans.',
      'When hell freezes over!']
ls.sort(key=len)
s1 = ls.pop(0)  # the shortest string
maxl = len(s1)

#1 Create a list of all substrings of the shortest string, sorted by length in descending order, so we don't have to check the whole list.

subs = [s1[i:j] for i in range(maxl) for j in range(maxl,i,-1)]
subs.sort(key=len, reverse=True)

#2 Check each substring against the next shortest string, then the next, and so on. If it is missing from any string, it is not common, so move on to the next candidate. The first candidate that passes all checks is the longest one, so return it.

def isasub(subs, ls):
    for sub in subs:
        for st in ls:
            if sub not in st:
                break
        else:
            return sub
    return ''  # no common substring at all

print('the longest common substring is: ',isasub(subs,ls))


Caveman solution that will give you a dataframe with the most frequent substrings in a string, based on the substring lengths you pass as a list:

import pandas as pd

lista = ['How much wood would a woodchuck',' chuck if a woodchuck could chuck wood?']

string = ''
for i in lista:
    string = string + ' ' + str(i)

string = string.lower()

characters_you_would_like_to_remove_from_string = [' ','-','_']

for i in characters_you_would_like_to_remove_from_string:
    string = string.replace(i,'')

substring_length_you_want_to_check = [3,4,5,6,7,8]

results_list = []

for string_length in substring_length_you_want_to_check:
    for i in range(len(string)):
        checking_str = string[i:i+string_length]
        if len(checking_str) == string_length:
            number_of_times_appears = (len(string) - len(string.replace(checking_str,'')))/string_length
            results_list = results_list+[[checking_str,number_of_times_appears]]


df = pd.DataFrame(data=results_list,columns=['string','freq'])

df['freq'] = df['freq'].astype('int64')

df = df.drop_duplicates()


df = df.sort_values(by='freq',ascending=False)

print(df[:10])  # display(df[:10]) also works inside a Jupyter notebook

The result is:

    string  freq
78    huck     4
63    wood     4
77    chuc     4
132  chuck     4
8      ood     4
7      woo     4
21     chu     4
23     uck     4
22     huc     4
20     dch     3
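
If pandas is not a requirement, roughly the same tally can be sketched with collections.Counter from the standard library. The function name top_substrings and its parameters are made up for this example. One difference to be aware of: sliding a window over the text counts overlapping occurrences, whereas the replace-based count above only counts non-overlapping ones, so the numbers can differ for self-overlapping substrings such as 'aaa'.

```python
from collections import Counter


def top_substrings(texts, lengths, remove=(' ', '-', '_'), top=10):
    """Return the `top` most frequent substrings of the given lengths."""
    joined = ' '.join(texts).lower()
    for ch in remove:
        joined = joined.replace(ch, '')
    # Count every window of each requested length (overlaps included).
    counts = Counter(
        joined[i:i + n]
        for n in lengths
        for i in range(len(joined) - n + 1)
    )
    return counts.most_common(top)


lista = ['How much wood would a woodchuck',
         ' chuck if a woodchuck could chuck wood?']
for substring, freq in top_substrings(lista, [3, 4, 5, 6, 7, 8]):
    print(substring, freq)
```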


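Adding a single 'break' speeds up jtjacques's answer significantly on my machine (1000X or so for 16K files). The break is safe because once a substring starting at position i is missing from some string, every longer substring starting at i is missing too: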

def long_substr(data):
    substr = ''
    if len(data) > 1 and len(data[0]) > 0:
        for i in range(len(data[0])):
            for j in range(len(substr)+1, len(data[0])-i+1):
                if all(data[0][i:i+j] in x for x in data[1:]):
                    substr = data[0][i:i+j]
                else:
                    break
    return substr
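
A quick way to convince yourself that the early exit does not change the result is to compare both versions on random input. The sketch below copies the two functions under the made-up names long_substr_exhaustive and long_substr_with_break; check_equivalence is likewise just a helper for this example, and the tiny alphabet is chosen so that common substrings actually occur.

```python
import random


def long_substr_exhaustive(data):
    # Original version: tries every slice of data[0].
    substr = ''
    if len(data) > 1 and len(data[0]) > 0:
        for i in range(len(data[0])):
            for j in range(len(data[0]) - i + 1):
                if j > len(substr) and all(data[0][i:i+j] in x for x in data):
                    substr = data[0][i:i+j]
    return substr


def long_substr_with_break(data):
    # Version with the early exit.
    substr = ''
    if len(data) > 1 and len(data[0]) > 0:
        for i in range(len(data[0])):
            for j in range(len(substr) + 1, len(data[0]) - i + 1):
                if all(data[0][i:i+j] in x for x in data[1:]):
                    substr = data[0][i:i+j]
                else:
                    break
    return substr


def check_equivalence(trials=200, seed=0):
    """Return True if both versions agree on `trials` random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        data = [''.join(rng.choice('ab') for _ in range(rng.randint(0, 8)))
                for _ in range(rng.randint(2, 4))]
        if long_substr_exhaustive(data) != long_substr_with_break(data):
            return False
    return True


print(check_equivalence())
```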


You could use the SuffixTree module that is a wrapper based on an ANSI C implementation of generalised suffix trees. The module is easy to handle....

Take a look at: here
