Print out a large list from file into multiple sublists with overlapping sequences in Python
Currently I have a very long sequence in a file, and I wish to split it into smaller subsequences stored in a list, where each subsequence overlaps the end of the previous one. Here is an example of what I mean:
(apologies about the cryptic sequence, this is all on 1 line)
file1.txt
abcdefessdfekgheithrfkopeifhghtryrhfbcvdfersdwtiyuyrterdhcbgjherytyekdnfiwytowihfiwoeirehjiwoqpft
list1 = ["abcdefessdfekgheithrfkopeifhght", "fhghtryrhfbcvdfersdwtiyuyrterdhc", "erdhcbgjherytyekdnfiwyt", "nfiwytowihfiwoeirehjiwoqpft"]
I can currently split the sequence into smaller subsequences without the overlaps using the following code:
def chunks(seq, n):
    # Average chunk length as a float, so rounding spreads the remainder across chunks.
    division = len(seq) / float(n)
    return [seq[int(round(division * i)):int(round(division * (i + 1)))] for i in range(n)]
In the above code, n specifies how many subsequences the sequence will be split into.
I was thinking of just grabbing the ends of each subsequence and concatenating them onto the ends of the elements in the list by hard-coding it, but this would be inefficient and hard. Is there an easy way to do this?
In reality, I would need an overlap of about 100 characters.
Thanks guys
>>> seq = "abcdefessdfekgheithrfkopeifhghtryrhfbcvdfersdwtiyuyrterdhcbgjherytyekdnfiwytowihfiwoeirehjiwoqpft"
>>> n = 4
>>> overlap = 5
>>> division = len(seq) // n  # integer division, so the result can be used as a slice index
>>> [seq[i*division:(i+1)*division+overlap] for i in range(n)]
['abcdefessdfekgheithrfkopeifhg', 'eifhghtryrhfbcvdfersdwtiyuyrt', 'yuyrterdhcbgjherytyekdnfiwyto', 'iwytowihfiwoeirehjiwoqpft']
It is probably slightly more efficient to do it like this:
>>> [seq[i:i+division+overlap] for i in range(0,n*division,division)]
['abcdefessdfekgheithrfkopeifhg', 'eifhghtryrhfbcvdfersdwtiyuyrt', 'yuyrterdhcbgjherytyekdnfiwyto', 'iwytowihfiwoeirehjiwoqpft']
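Note that both expressions above stop at n*division + overlap, so when len(seq) is not a multiple of n and the remainder is larger than overlap, the last few items are left out. A minimal sketch of one way around that, assuming Python 3 and a hypothetical helper name chunks_with_overlap (not part of the answer above), computes the chunk boundaries with integer arithmetic so that the final boundary is always len(seq):

```python
def chunks_with_overlap(seq, n, overlap):
    """Split seq into n chunks; each chunk also carries the first
    `overlap` items of the chunk that follows it."""
    size = len(seq)
    # Integer boundaries 0 = bounds[0] <= bounds[1] <= ... <= bounds[n] = len(seq).
    bounds = [i * size // n for i in range(n + 1)]
    # Slicing past the end of seq is safe, so the last chunk needs no special case.
    return [seq[bounds[i]:bounds[i + 1] + overlap] for i in range(n)]


seq = "abcdefessdfekgheithrfkopeifhghtryrhfbcvdfersdwtiyuyrterdhcbgjherytyekdnfiwytowihfiwoeirehjiwoqpft"
parts = chunks_with_overlap(seq, 4, 5)
print(parts)

# With no overlap the chunks join back into the original sequence.
print("".join(chunks_with_overlap(seq, 4, 0)) == seq)  # True
```

For the 97-character example with n = 4 and overlap = 5, this gives the same four chunks as the expressions above.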
If you want to split your sequence seq into subsequences of length length, where overlap is the number of characters/elements shared between each subsequence and its successor:
def split_with_overlap(seq, length, overlap):
    # Step by length - overlap, so each chunk starts overlap items before the previous one ends.
    return [seq[i:i+length] for i in range(0, len(seq), length - overlap)]
Then testing it on your original data:
>>> seq = 'abcdefessdfekgheithrfkopeifhghtryrhfbcvdfersdwtiyuyrterdhcbgjherytyekdnfiwytowihfiwoeirehjiwoqpft'
>>> split_with_overlap(seq, 31, 5)
['abcdefessdfekgheithrfkopeifhght', 'fhghtryrhfbcvdfersdwtiyuyrterdh', 'terdhcbgjherytyekdnfiwytowihfiw', 'ihfiwoeirehjiwoqpft']
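One thing to watch for: when the sequence ends inside an overlap region, split_with_overlap emits a short final chunk that is already fully contained in the previous one, and overlap >= length makes the range step zero or negative. The following is a sketch of a generator variant that guards against both. The name iter_overlapping_chunks is hypothetical, and a generator is just one option for a very long sequence, since it produces chunks one at a time:

```python
def iter_overlapping_chunks(seq, length, overlap):
    """Yield chunks of at most `length` items; consecutive chunks share
    `overlap` items."""
    if length <= 0 or not 0 <= overlap < length:
        raise ValueError("need length > 0 and 0 <= overlap < length")
    if len(seq) == 0:
        return
    step = length - overlap
    # Skip any start position whose chunk would lie entirely inside the
    # previous chunk; max(..., 1) keeps the first chunk for sequences
    # no longer than overlap.
    for start in range(0, max(len(seq) - overlap, 1), step):
        yield seq[start:start + length]


seq = "abcdefessdfekgheithrfkopeifhghtryrhfbcvdfersdwtiyuyrterdhcbgjherytyekdnfiwytowihfiwoeirehjiwoqpft"
for chunk in iter_overlapping_chunks(seq, 31, 5):
    print(chunk)
```

On the example data this prints the same four chunks as above. Because this is a generator, the argument check only runs once iteration starts. To apply it to the file from the question, the line could first be read with something like open('file1.txt').read().strip(), assuming the whole sequence sits on a single line.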