开发者

Given a list of slices, how do I split a sequence by them?

Given a list of slices, how do I separate a sequence based on them?

I have long amino-acid strings that I would like to split based on start-stop values in a list. 开发者_StackOverflow社区An example is probably the most clear way of explaining it:

str = "MSEPAGDVRQNPCGSKAC"
split_points = [[1,3], [7,10], [12,13]]

output >> ['M', '(SEP)', 'AGD', '(VRQN)', 'P', '(CG)', 'SKAC']

The extra parentheses are to show which elements were selected from the split_points list. I don't expect the start-stop points to ever overlap.

I have a bunch of ideas that would work, but seem terribly inefficient (code-length wise), and it seems like there must be a nice pythonic way of doing this.


Strange way to split strings you have there:

def splitter( s, points ):
    c = 0
    for x,y in points:
        yield s[c:x]
        yield "(%s)" % s[x:y+1]
        c=y+1
    yield s[c:]

print list(splitter(str, split_points))
# => ['M', '(SEP)', 'AGD', '(VRQN)', 'P', '(CG)', 'SKAC']

# if some start and endpoints are the same remove empty strings.
print list(x for x in splitter(str, split_points) if x != '')


Here is a simple solution for you. to grab each of the sets specified by the point.

In[4]:  str[p[0]:p[1]+1] for p in split_points]
Out[4]: ['SEP', 'VRQN', 'CG']

To get the parenthesis:

In[5]:  ['(' + str[p[0]:p[1]+1] + ')' for p in split_points]
Out[5]: ['(SEP)', '(VRQN)', '(CG)']

Here is the cleaner way of doing it to do the whole deal:

results = []

for i in range(len(split_points)):
    start, stop = split_points[i]
    stop += 1

    last_stop = split_points[i-1][1] + 1 if i > 0 else 0

    results.append(string[last_stop:start])        
    results.append('(' + string[start:stop] + ')')

results.append(string[split_points[-1][1]+1:])

All of the below solutions are bad, and more for fun than anything else, do not use them!

This more of a WTF solution, but I figured I'd post it since it was asked for in comments:

split_points = [(x, y+1) for x, y in split_points]
split_points = [((split_points[i-1][1] if i > 0 else 0, p[0]), p) for i, p in zip(range(len(split_points)), split_points)]
results = [string[n[0]:n[1]] + '\n(' + string[m[0]:m[1]] + ')' for n, m in split_points] + [string[split_points[-1][1][1]:]]
results = '\n'.join(results).split()

still trying to figure out the one liner, here's a two:

split_points = [((split_points[i-1][1]+1 if i > 0 else 0, p[0]), (p[0], p[1]+1)) for i, p in zip(range(len(split_points)), split_points)]
print '\n'.join([string[n[0]:n[1]] + '\n(' + string[m[0]:m[1]] + ')' for n, m in split_points] + [string[split_points[-1][1][1]:]]).split()

And the one liner that should never be used:

print '\n'.join([string[n[0]:n[1]] + '\n(' + string[m[0]:m[1]] + ')' for n, m in (((split_points[i-1][1]+1 if i > 0 else 0, p[0]), (p[0], p[1]+1)) for i, p in zip(range(len(split_points)), split_points))] + [string[split_points[-1][1]:]]).split()


Here's some code that will work.

result = []
last_end = 0
for sp in split_points:
  result.append(str[last_end:sp[0]])
  result.append('(' + str[sp[0]:sp[1]+1] + ')')
  last_end = sp[1]+1
result.append(str[last_end:])

print result

If you just want the parts in the parenthesis it becomes a little simpler:

result = [str[sp[0]:sp[1]+1] for sp in split_points]


Here's a solution that converts your split_points to regular string slices and then prints out the appropriate slices:

str = "MSEPAGDVRQNPCGSKAC"
split_points = [[1, 3], [7, 10], [12, 13]]

adjust = [s for sp in [[x, y + 1] for x, y in split_points] for s in sp]
zipped = zip([None] + adjust, adjust + [None])

out = [('(%s)' if i % 2 else '%s') % str[x:y] for i, (x, y) in
       enumerate(zipped)]

print out

>>> ['M', '(SEP)', 'AGD', '(VRQN)', 'P', '(CG)', 'SKAC']


>>> str = "MSEPAGDVRQNPCGSKAC"
>>> split_points = [[1,3], [7,10], [12,13]]
>>>
>>> all_points = sum(split_points, [0]) + [len(str)-1]
>>> map(lambda i,j: str[i:j+1], all_points[:-1], all_points[1:])
['MS', 'SEP', 'PAGDV', 'VRQN', 'NPC', 'CG', 'GSKAC']
>>>
>>> str_out = map(lambda i,j: str[i:j+1], all_points[:-1:2], all_points[1::2])
>>> str_in = map(lambda i,j: str[i:j+1], all_points[1:-1:2], all_points[2::2])
>>> sum(map(list, zip(['(%s)' % s for s in str_in], str_out[1:])), [str_out[0]])
['MS', '(SEP)', 'PAGDV', '(VRQN)', 'NPC', '(CG)', 'GSKAC']


Probably not for elegance, but just because I can do it in a oneliner :)

>>> reduce(lambda a,ij:a[:-1]+[str[a[-1]:ij[0]],'('+str[ij[0]:ij[1]+1]+')',
            ij[1]], split_points, [0])[:-1] + [str[split_points[-1][-1]+1:]]

['M', '(SEP)', 'PAGD', '(VRQN)', 'NP', '(CG)', 'SKAC']

Maybe you like it. Here some explanation:

In your question you pass one set of slices, and implicitly you want to have the complement set of slices as well (to generate the un-parenthesized [is that English?] slices). So basically, each slice [i,j] lacks the previous j. e.g. [7,10] lacks the 3 and [1,3] lacks the 0.

reduce processes lists and at each step passes the output so far (a) plus the next input element (ij). The trick is that apart from producing the plain output, we add each time an extra variable --- a sort of memory --- which is in the next step retrieved in a[-1]. In this particular example we store the last j value, and hence at all times we have the full information to provide both the unparenthesized and the parenthesized substring.

Finally, the memory is stripped with [:-1] and replaced by the remainder of the original str in [str[split_points[-1][-1]+1:]].

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜