
Search for repeated substrings (of any kind) in a long string without spaces

I have asked the same question before, but did not get what I wanted, so I have to post it again.

I have a very long string that does not contain any spaces. I am trying to search for repeated substrings (of any kind, with no specific pattern) in this long string. The length of a repeat can be anywhere in a range (min, max), e.g. (min = 3, max = 5).

For example: String s = "atggucttuaccccggucttaacccc"; in which "gguctt" and "acccc" are two different repeated substrings (I do not know this before I run the code).

So I am wondering: in C#, is there any fast way to find the repeats and the positions where they occur?

Thanks in advance.


You are essentially searching the string for a substring, except that the candidates are every possible substring of the string itself.

I would begin by iterating through chunk lengths, from 2 (or whatever the smallest match should be) up to half the string's length (a chunk longer than half the string's length can't be repeated without overlapping itself).

For each chunk size, I would iterate through the string, taking chunks of the appropriate size and using a string matching algorithm like Boyer-Moore (or the built-in string search) to see whether the chunk is repeated. Note that it is only necessary to search the remainder of the string: if there were a repeat earlier in the string, it would already have been matched when that earlier region was the chunk. You can also limit the search region by excluding the last (chunk_size - 1) chars of the string, as a match can't possibly begin there (although your string searching algorithm might do this for you). I would also maintain a hashtable of all the chunks already checked to avoid checking them again. This is particularly important for the first few iterations, where the chunk size is small.

In pseudocode:

match_min = 2
match_max = 5

search_cache = Hashtable()
for (chunk_size = match_min; chunk_size <= min(match_max, len(str) / 2); chunk_size++) {
  // Stop while there is still room for a second, non-overlapping copy
  for (start = 0; start + 2 * chunk_size <= len(str); start++) {
    sub = str.substring(start, chunk_size)   // arguments: start index, length
    // We want to know if sub repeats later in the string.
    // Each distinct chunk is searched for only once, from its first occurrence.
    if (sub not in search_cache) {
      // find(sub, from) returns an absolute index into str, or -1 if not found
      search_cache[sub] = str.find(sub, start + chunk_size)
      if (search_cache[sub] != -1)
        print "MATCH FOUND %s at %d and %d" % (sub, start, search_cache[sub])
    }
  }
}

This will only report the first repeat of each chunk, but it could easily be modified to find all matches (just make the find function return all matches, and modify how the print statement works).

The cost of this is roughly O(c*m*n), where c is the amortized cost of one string search, m is the length of the string, and n is the number of chunk sizes tried (max - min + 1). It also depends on how much repetition the string contains: if the entropy is low, search_cache saves you more time. Since a single search is itself roughly linear in the string length, c is about O(m), which makes the whole thing roughly O(m^2) for a fixed range of chunk sizes.
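Since the question asks for C#, here is a minimal sketch of the hashtable idea above. It is a variant rather than a line-by-line port of the pseudocode: instead of calling a string search per chunk, it groups every start position by the chunk that begins there, which also yields all positions of each repeat. The names RepeatFinder and FindRepeats are made up for this illustration, and occurrences are allowed to overlap.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RepeatFinder
{
    // Hypothetical helper (not from the original post): returns every substring
    // whose length is between minLen and maxLen and that starts at two or more
    // positions, mapped to all of its start indices. Occurrences may overlap.
    public static Dictionary<string, List<int>> FindRepeats(string s, int minLen, int maxLen)
    {
        var repeats = new Dictionary<string, List<int>>();
        for (int len = minLen; len <= Math.Min(maxLen, s.Length - 1); len++)
        {
            // Group start positions by the chunk that begins there.
            var positions = new Dictionary<string, List<int>>();
            for (int start = 0; start + len <= s.Length; start++)
            {
                string chunk = s.Substring(start, len);
                if (!positions.TryGetValue(chunk, out var list))
                {
                    list = new List<int>();
                    positions[chunk] = list;
                }
                list.Add(start);
            }

            // Keep only the chunks seen at more than one position.
            foreach (var pair in positions.Where(p => p.Value.Count > 1))
                repeats[pair.Key] = pair.Value;
        }
        return repeats;
    }
}
```

Calling RepeatFinder.FindRepeats("atggucttuaccccggucttaacccc", 3, 5) and printing each key with string.Join(", ", value) lists every repeated chunk of length 3 to 5 with its start indices. Memory use grows with the number of distinct chunks, so for very long strings the suffix-based structures mentioned in the next answer are likely a better fit.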


If the string is long, you might want to look into suffix trees or suffix arrays. They solve this and similar problems efficiently.
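To illustrate the idea, here is a rough C# sketch of a suffix array used to find the longest repeated substring. It builds the array naively by sorting suffix start indices, which is easy to read but much slower than the dedicated construction algorithms a real implementation would use. The class and method names are invented for this example. Repeats show up as common prefixes of neighbouring suffixes in the sorted order.

```csharp
using System;
using System.Linq;

public static class SuffixArrayDemo
{
    // Naive construction: sort the suffix start indices by comparing the
    // suffixes themselves (ordinal comparison). For illustration only.
    public static int[] BuildSuffixArray(string s)
    {
        int[] sa = Enumerable.Range(0, s.Length).ToArray();
        Array.Sort(sa, (a, b) => string.CompareOrdinal(s, a, s, b, s.Length));
        return sa;
    }

    // Length of the common prefix of the suffixes starting at i and j.
    private static int CommonPrefix(string s, int i, int j)
    {
        int k = 0;
        while (i + k < s.Length && j + k < s.Length && s[i + k] == s[j + k]) k++;
        return k;
    }

    // Returns the longest substring that occurs at least twice (occurrences
    // may overlap) and two of its start positions; Text is empty if none.
    public static (string Text, int First, int Second) LongestRepeat(string s)
    {
        int[] sa = BuildSuffixArray(s);
        int best = 0, first = -1, second = -1;
        for (int r = 1; r < sa.Length; r++)
        {
            // Neighbours in sorted order share the longest common prefixes.
            int lcp = CommonPrefix(s, sa[r - 1], sa[r]);
            if (lcp > best)
            {
                best = lcp;
                first = Math.Min(sa[r - 1], sa[r]);
                second = Math.Max(sa[r - 1], sa[r]);
            }
        }
        return (best == 0 ? "" : s.Substring(first, best), first, second);
    }
}
```

For the string in the question, SuffixArrayDemo.LongestRepeat returns "gguctt" with start positions 2 and 14. To report every repeat within a length range instead, one could collect all neighbouring pairs whose common prefix is at least min characters long.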


Try this:

var matches = Regex.Matches("atggucttuaccccggucttaacccc", @"((.)\2+)");

It'll give you the positions of the matches too. More information here.

EDIT: Just realised you need arbitrary repeated substring matching, not just repeated character matching.
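For arbitrary repeated substrings, a regex can still work if the candidate is captured inside a lookahead and then required again via a backreference. The following is only a sketch under that assumption, not part of the original answer, and RegexRepeatDemo and Find are hypothetical names. It relies on backtracking, so it may be slow on very long strings.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexRepeatDemo
{
    // At each position, captures the longest run of minLen..maxLen characters
    // that appears again later in the string, without overlapping itself.
    // The lookahead consumes nothing, so every start position is tried.
    public static List<(int Index, string Text)> Find(string s, int minLen, int maxLen)
    {
        string pattern = @"(?=(.{" + minLen + "," + maxLen + @"}).*\1)";
        var results = new List<(int Index, string Text)>();
        foreach (Match m in Regex.Matches(s, pattern))
            results.Add((m.Index, m.Groups[1].Value));
        return results;
    }
}
```

Each result pairs the start index of the earlier occurrence with the repeated text. Only one length is reported per position (the longest that repeats), and the position of the later occurrence is not returned.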
