substring finding from a string
Input: string S = AAGATATGATAGGAT.
Output: Maximal repeats such as GATA (as in positions 3 and 8), GAT (as in position 3, 8 and开发者_运维百科 13) and so on...
A maximal repeat is a substring t occurs k>1 times in S, and if t is extended to left or right, it will occur less than k times.
An internal node’s leaf descendants are suffixes, each of which has a left character.
If the left characters of all leaf descendants are not all identical, it’s called a “left-diverse” node.
Maximal repeats is left-diverse internal nodes.
Overall idea:
Build a suffix tree and then do a DFS (depth first search) on the tree
For each leaf, label it with its left character
For each internal node:
If at least one child is labelled with *, then label it with *
Else if its children’s labels are diverse, label with *.
Else then all children have same label, copy it to current node
Is the above idea is correct? How does the pseudo-code to be? Then I can try to write programming myself.
Your idea is good, but with a suffix tree you can do something even easier.
Let T be the suffix tree of your sequence .
Let x be a node in T, T_x is the subtree of T with root x.
Let N_x be the number of leaf in T_x
Let word(x) be the word created by traversing T from root to node x
Now using the definition of a suffix tree we get :
Number of repeats of word(x) = N_x and the position of this words are the label of each leaf
The algorithm for this would be a basic tree traversal, for each node in the tree calculate N_x, if N_x > 2 add this to your result (if you want the position too you can add the label of each leaf)
Pseudo code :
input :
mySequence
output:
Result (list of word that repeat with count and position)
Algorithm :
T = suffixTree(mySequence)
For each internal node X in T:
T_X = subTree(T) N_X = Number of lead (T_X) if N_X >=2 : Result .add ( [word(X), N_X , list(label of leafs)] )
return Result
Example :
let's take the wikipedia example for suffix trees : "BANANA" :
we get :
N_A = 3 so "A" repeats 3 times in position {1,3,5}
N_N=2 so "N" repeats 2 times in position {2,4}
N_NA=2 so "NA" repeats 2 times in position {2,4}
I found this paper that seems to treat your problem the same way you're doing, so yes I think your method is write :
Spelling approximate repeated or common motifs using a suffix tree
Extract
We present in this paper two algorithms. The first one extracts repeated motifs from a sequence defined over an alphabet Sigma. For instance, Sigma may be equal to {A, C, G, T} and the sequence represent an encoding of a DNA macromolecule. The motifs searched correspond to words over the same alphabet which occur a minimum number q of times in the sequence with at most e mismatches each time (q is called the quorum constraint).[...]
You can download it and have a look at it , the author gives pseudo code for your algorithm.
Hope this helps
精彩评论