开发者

Loop over a file and write the next line if a condition is met

Having a hard 开发者_开发知识库time fixing this or finding any good hints about it. I'm trying to loop over one file, modify each line slightly, and then loop over a different file. If the line in the second file starts with the line from the first then the following line in the second file should be written to a third file.

with open('ids.txt', 'rU') as f:
        with open('seqres.txt', 'rU') as g:
                for id in f:
                        id=id.lower()[0:4]+'_'+id[4]
                        with open(id + '.fasta', 'w') as h:
                                for line in g:
                                        if line.startswith('>'+ id):
                                                h.write(g.next())

All the correct files appear, but they are empty. Yes, I am sure the if has true cases. :-)

"seqres.txt" has lines with an ID number in a certain format, each followed by a line with data. The "ids.txt" has lines with the ID numbers of interest in a different format. I want each line of data with an interesting ID number in its own file.

Thanks a million to anyone with a little advice!


Here's a mostly flattened implementation. Depending on how many hits you're going to get for each ID, and how many entries there are in 'seqres' you could redesign it.

# Extract the IDs in the desired format and cache them
ids = [ x.lower()[0:4]+'_'+x[4] for x in open('ids.txt','rU')]
ids = set(ids)

# Create iterator for seqres.txt file and pull the first value
iseqres = iter(open('seqres.txt','rU'))
lineA = iseqres.next()

# iterate through the rest of seqres, staggering
for lineB in iseqres:
  lineID = lineA[1:7]
  if lineID in ids:
    with open("%s.fasta" % lineID, 'a') as h:
      h.write(lineB)
  lineA = lineB


I think there is still progress to be made from the code you declare as final. You can make the result a little less nested and avoid a couple sort of silly things.

from contextlib import nested
from itertools import tee, izip

# Stole pairwise recipe from the itertools documentation
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

with nested(open('ids.txt', 'rU'), open('seqres.txt', 'rU')) as (f, g):
    for id in f:
        id = id.lower()[0:4] + '_' + id[4]
        with open(id + '.fasta', 'w') as h:
            g.seek(0) # start at the beginning of g each time
            for line, next_line in pairwise(g):
                if line.startswith('>' + id):
                    h.write(next_line)

This is an improvement over the final code you posted in that

  • It does not unnecessarily read the whole files into memory, but simple iterates over the file objects. (This may or may not be the best option for g, really. It definitely scales better.)
  • It does not contain the crash condition using gl[line+1] if we are already on the last line of gl
    • Depending on how g actually looks, there might be something more applicable than pairwise.
  • It is not as deeply nested.
  • It conforms to PEP8 for things like spaces around operators and indentation depth.
  • This algorithm is O(n * m), where n and m are the number of lines in f and g. If the length of f is unbounded, you can use a set of its ids to reduce the algorithm to O(n) (linear in the number of lines in g).


For speed, you really want to avoid looping over the same file multiple times. This means you've turned into an O(N*M) algorithm, when you could be a using an O(N+M) one.

To achieve this, read your list of ids into a fast lookup structure, like a set. Since there are only 4600 this in-memory form shouldn't be any problem.

The new solution is also reading the list into memory. Probably not a huge problem with just a few hundred thousand lines, but its wasting more memory than you need, since you can do the whole thing in a single pass, only reading the smaller ids.txt file into memory. You can just set a flag when the previous line was something interesting, which will signal the next line to write it out.

Here's a reworked version:

with open('ids.txt', 'rU') as f:
    interesting_ids = set('>' + line.lower()[0:4] + "_" + line[4] for line in f)  # Get all ids in a set.

found_id = None
with open('seqres.txt', 'rU') as g:
    for line in g:
        if found_id is not None:
            with open(found_id+'.fasta','w') as h:
                h.write(line)

        id = line[:7]
        if id in interesting_ids: found_id = id
        else: found_id = None


The problem is that you are only looping through file g once - after you have read through it the first time the file index position is left at the end of the file, so any further reads will fail with EOF. You would need to reopen g every time round the loop.

However this will be massively inefficient - you are reading the same file repeatedly, once for every line in f. It will be orders of magnitude faster to read all of g into an array at the start and use that, so long as it will fit in memory.


After the first line in the ids.txt file has been processed, the file seqres.txt has been exhausted. There is something wrong with the nesting of your loops. Also, you're modifying the iterator inside the for line in g loop. Not a good idea.

If you really want to append the line that follows the line whose ID matches, then perhaps something like this might work better:

with open('ids.txt', 'rU') as f:
    ids = f.readlines()
with open('seqres.txt', 'rU') as g:
    seqres = g.readlines()

for id in ids:
    id=id.lower()[0:4]+'_'+id[4]
    with open(id + '.fasta', 'a') as h:
    for line in seqres:
        if line.startswith('>'+ id):
            h.write(seqres.next())
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜