开发者

Print out a large list from file into multiple sublists with overlapping sequences in python

currently I have a very long sequence in a file and I wish to split this sequence into smaller subsequences, but I would like each subsequence to have an overlap from the previous sequence, and place them into a list. here is an example of what I mean:

(apologies about the cryptic sequence, this is all on 1 line)

file1.txt
abcdefessdfekgheithrfkopeifhghtryrhfbcvdfersdwtiyuyrterdhcbgjherytyekdnfiwytowihfiwoeirehjiwoqpft


list1 = ["abcdefessdfekgheithrfkopeifhght", "fhghtryrhfbcvdfersdwtiyuyrterdhc", "erdhcbgjherytyekdnfiwyt", "nfiwytowihfiwoeirehjiwoqpft"]

I can currently split each sequence into smaller saubsequences without the overlaps using the following code:

def chunks(seq, n):
    division = len(seq) / float (n)
        return [ seq[int(round(division * i)): int(round(division * (i + 1)))] for i in xrange(n) ]

in the above code the n specifies how many subsequences the list will be split into.

I was thinking of just grabbing the ends of each subsequence and just concate开发者_如何学编程nating them to the ends of the elements in the list by hard coding it... but this would be inefficient and hard. is there an easy way to do this?

in reality it would be more about 100 characters that i would require to be overlapped.

Thanks guys


seq="abcdefessdfekgheithrfkopeifhghtryrhfbcvdfersdwtiyuyrterdhcbgjherytyekdnfiwytowihfiwoeirehjiwoqpft"
>>> n = 4
>>> overlap = 5
>>> division = len(seq)/n
>>> [seq[i*division:(i+1)*division+overlap] for i in range(n)]
['abcdefessdfekgheithrfkopeifhg', 'eifhghtryrhfbcvdfersdwtiyuyrt', 'yuyrterdhcbgjherytyekdnfiwyto', 'iwytowihfiwoeirehjiwoqpft']

it is probably slightly more efficient to do it like this

>>> [seq[i:i+division+overlap] for i in range(0,n*division,division)]
['abcdefessdfekgheithrfkopeifhg', 'eifhghtryrhfbcvdfersdwtiyuyrt', 'yuyrterdhcbgjherytyekdnfiwyto', 'iwytowihfiwoeirehjiwoqpft']


If you want to split your sequence seq into subsequences of length length with overlap number of characters/elements shared between each subsequence and its successor:

def split_with_overlap(seq, length, overlap):
    return [seq[i:i+length] for i in range(0, len(seq), length - overlap)]

Then testing it on your original data:

>>> seq = 'abcdefessdfekgheithrfkopeifhghtryrhfbcvdfersdwtiyuyrterdhcbgjherytyekdnfiwytowihfiwoeirehjiwoqpft'

>>> split_with_overlap(seq, 31, 5)
['abcdefessdfekgheithrfkopeifhght', 'fhghtryrhfbcvdfersdwtiyuyrterdh', 'terdhcbgjherytyekdnfiwytowihfiw', 'ihfiwoeirehjiwoqpft']
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜