How to: From one string to another in a long list of strings
Imagine a long string of characters: "AATTAATCTATATATTGAAATGGGGCCCCAATTTTCCCAAATC ...."
I define 4 strings:
"AAT"
"ATG"
"TTT"
"ATC"
My mission is to find the "end point" for every occurrence of "AAT" in the long string of characters. The end points are the last three strings, "ATG", "TTT" and "ATC". In other words, for each start position "AAT" I need the index of the start and the index of the end position, which can be either "ATG", "TTT" or "ATC". I have been told to advance in steps of 3, but I'm not sure how to do it.
I have tried to do this:
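open1 = open(<text>)  # <text> stands for the path to my text file
u = open1.read()
string1 = "AAT"
mylist = []
p = 0
while True:
    p = u.find(string1, p)
    if p == -1:  # find() returns -1 when there are no more matches
        break
    mylist.append(p)
    p = p + 1  # continue searching just past this match
print(mylist)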
This prints the locations of the string "AAT" in my text file. I'm not sure how to move on from here. I guess I could find the positions of the other strings as well, but how do I create a function that starts from "AAT" and stops when it meets one of the end points?
I hope this is understandable.
You can do this with a regex:
>>> import re
>>> s = "AATTAATCTATATATTGAAATGGGGCCCCAATTTTCCCAAATC ...."
>>> [(m.start(), m.end()) for m in re.finditer('AAT.*?(?:ATG|TTT|ATC)', s)]
[(0, 8), (18, 34)]
re.finditer scans the string for non-overlapping matches of a regex and yields a MatchObject for each one. The start() and end() methods of the match object give the start and end index of the matched string.
The regex searches for AAT followed by anything up to and including the first occurrence of ATG, TTT or ATC. The non-greedy .*? is what makes it stop at the first end point rather than the last.
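The regex above does not enforce the "steps of 3" part of the question: an end point is accepted at any offset from the start. If the intent is that the end point must lie a whole number of triplets after "AAT" (my assumption about what "steps of 3" means), a plain loop is one way to sketch it. The name find_in_frame is made up for this example, and the trailing " ...." is dropped from the sample string.

```python
def find_in_frame(seq, start="AAT", ends=("ATG", "TTT", "ATC")):
    """Return (start_index, end_index) pairs, reading in steps of 3 from each start."""
    results = []
    pos = seq.find(start)
    while pos != -1:
        # Walk forward one triplet at a time, beginning right after the start string.
        for i in range(pos + 3, len(seq) - 2, 3):
            if seq[i:i + 3] in ends:
                # The end index is exclusive, like m.end() in the regex version.
                results.append((pos, i + 3))
                break
        # Look for the next start; starts may overlap earlier results.
        pos = seq.find(start, pos + 1)
    return results

s = "AATTAATCTATATATTGAAATGGGGCCCCAATTTTCCCAAATC"
print(find_in_frame(s))  # [(4, 22), (29, 35)]
```

Unlike re.finditer, this checks every "AAT" as a possible start, so the reported ranges can overlap. A start with no end point in its reading frame is simply skipped.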
You may need to construct the regex dynamically if you do not know the start and end strings until the program runs. This is simple to do:
start = "AAT"
end = ["ATG", "TTT", "ATC"]
regex = "%s.*?(?:%s)" % (start, '|'.join(end))
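Putting the pieces together, a minimal sketch might look like the following. The function name find_spans is made up for this example, and the re.escape calls are an addition: they are not needed for plain letters like these, but they keep the pattern correct if a start or end string ever contains a regex metacharacter.

```python
import re

def find_spans(text, start, ends):
    """Return (start_index, end_index) for each start...end match in text."""
    # Escape each piece so characters like '.' or '+' are matched literally.
    alternatives = "|".join(re.escape(e) for e in ends)
    pattern = re.compile("%s.*?(?:%s)" % (re.escape(start), alternatives))
    return [(m.start(), m.end()) for m in pattern.finditer(text)]

s = "AATTAATCTATATATTGAAATGGGGCCCCAATTTTCCCAAATC ...."
print(find_spans(s, "AAT", ["ATG", "TTT", "ATC"]))  # [(0, 8), (18, 34)]
```

If the text is read from a file that contains line breaks, note that . does not match a newline by default. Either strip the newlines first or pass re.DOTALL to re.compile.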