开发者

substring finding from a string

Input: string S = AAGATATGATAGGAT.

Output: Maximal repeats such as GATA (as in positions 3 and 8), GAT (as in position 3, 8 and开发者_运维百科 13) and so on...

  • A maximal repeat is a substring t occurs k>1 times in S, and if t is extended to left or right, it will occur less than k times.

  • An internal node’s leaf descendants are suffixes, each of which has a left character.

  • If the left characters of all leaf descendants are not all identical, it’s called a “left-diverse” node.

  • Maximal repeats is left-diverse internal nodes.

Overall idea:

  • Build a suffix tree and then do a DFS (depth first search) on the tree

  • For each leaf, label it with its left character

  • For each internal node:

  • If at least one child is labelled with *, then label it with *

  • Else if its children’s labels are diverse, label with *.

  • Else then all children have same label, copy it to current node

Is the above idea is correct? How does the pseudo-code to be? Then I can try to write programming myself.


Your idea is good, but with a suffix tree you can do something even easier.

Let T be the suffix tree of your sequence .

Let x be a node in T, T_x is the subtree of T with root x.

Let N_x be the number of leaf in T_x

Let word(x) be the word created by traversing T from root to node x

Now using the definition of a suffix tree we get :

Number of repeats of word(x) = N_x and the position of this words are the label of each leaf

The algorithm for this would be a basic tree traversal, for each node in the tree calculate N_x, if N_x > 2 add this to your result (if you want the position too you can add the label of each leaf)

Pseudo code :

input :

mySequence

output:

Result (list of word that repeat with count and position)

Algorithm :

T = suffixTree(mySequence)

For each internal node X in T:

  T_X = subTree(T)
  N_X = Number of lead (T_X)
  if N_X >=2 :
  Result .add ( [word(X), N_X , list(label of leafs)] )

return Result

Example :

let's take the wikipedia example for suffix trees : "BANANA" :

substring finding from a string

we get :

N_A = 3 so "A" repeats 3 times in position {1,3,5}
N_N=2  so "N" repeats 2 times  in position {2,4}
N_NA=2  so "NA" repeats 2 times in position {2,4}

I found this paper that seems to treat your problem the same way you're doing, so yes I think your method is write :

Spelling approximate repeated or common motifs using a suffix tree

Extract

We present in this paper two algorithms. The first one extracts repeated motifs from a sequence defined over an alphabet Sigma. For instance, Sigma may be equal to {A, C, G, T} and the sequence represent an encoding of a DNA macromolecule. The motifs searched correspond to words over the same alphabet which occur a minimum number q of times in the sequence with at most e mismatches each time (q is called the quorum constraint).[...]

You can download it and have a look at it , the author gives pseudo code for your algorithm.

Hope this helps

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜