substring finding from a string

2023-04-13 03:35 问答作者：

Input: string S = AAGATATGATAGGAT.

Output: Maximal repeats such as GATA (as in positions 3 and 8), GAT (as in position 3, 8 and开发者_运维百科 13) and so on...

A maximal repeat is a substring t occurs k>1 times in S, and if t is extended to left or right, it will occur less than k times.
An internal node’s leaf descendants are suffixes, each of which has a left character.
If the left characters of all leaf descendants are not all identical, it’s called a “left-diverse” node.
Maximal repeats is left-diverse internal nodes.

Overall idea:

Build a suffix tree and then do a DFS (depth first search) on the tree
For each leaf, label it with its left character
For each internal node:
If at least one child is labelled with *, then label it with *
Else if its children’s labels are diverse, label with *.
Else then all children have same label, copy it to current node

Is the above idea is correct? How does the pseudo-code to be? Then I can try to write programming myself.

Your idea is good, but with a suffix tree you can do something even easier.

Let T be the suffix tree of your sequence .

Let x be a node in T, T_x is the subtree of T with root x.

Let N_x be the number of leaf in T_x

Let word(x) be the word created by traversing T from root to node x

Now using the definition of a suffix tree we get :

Number of repeats of word(x) = N_x and the position of this words are the label of each leaf

The algorithm for this would be a basic tree traversal, for each node in the tree calculate N_x, if N_x > 2 add this to your result (if you want the position too you can add the label of each leaf)

Pseudo code :

input :

mySequence

output:

Result (list of word that repeat with count and position)

Algorithm :

T = suffixTree(mySequence)

For each internal node X in T:
  T_X = subTree(T)
  N_X = Number of lead (T_X)
  if N_X >=2 :
  Result .add ( [word(X), N_X , list(label of leafs)] )
return Result

Example :

let's take the wikipedia example for suffix trees : "BANANA" :

substring finding from a string

we get :

N_A = 3 so "A" repeats 3 times in position {1,3,5}
N_N=2  so "N" repeats 2 times  in position {2,4}
N_NA=2  so "NA" repeats 2 times in position {2,4}

I found this paper that seems to treat your problem the same way you're doing, so yes I think your method is write :

Spelling approximate repeated or common motifs using a suffix tree

Extract

We present in this paper two algorithms. The first one extracts repeated motifs from a sequence defined over an alphabet Sigma. For instance, Sigma may be equal to {A, C, G, T} and the sequence represent an encoding of a DNA macromolecule. The motifs searched correspond to words over the same alphabet which occur a minimum number q of times in the sequence with at most e mismatches each time (q is called the quorum constraint).[...]

You can download it and have a look at it , the author gives pseudo code for your algorithm.

Hope this helps

继续阅读：algorithm language-agnostic suffix-tree

substring finding from a string

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？