Efficient file buffering & scanning methods for large files in python
The description of the problem I am having is a bit complicated, and I will err on the side of providing more complete information. For the impatient, here is the briefest way I can summarize it:
What is the fastest (least execution time) way to split a text file in to ALL (overlapping) substrings of size N (bound N, eg 36) while throwing out newline characters.
I am writing a module which parses files in the FASTA ascii-based genome format. These files comprise what is known as the 'hg18' human reference genome, which you can download from the UCSC genome browser (go slugs!) if you like.
As you will notice, the genome files are composed of chr[1..22].fa and chr[XY].fa, as well as a set of other small files which are not used in this module.
Several modules already exist for parsing FASTA files, such as BioPython's SeqIO. (Sorry, I'd post a link, but I don't have the points to do so yet.) Unfortunately, every module I've been able to find doesn't do the specific operation I am trying to do.
My module needs to split the genome data ('CAGTACGTCAGACTATACGGAGCTA' could be a line, for instance) in to every single overlapping N-length substring. Let me give an example using a very small file (the actual chromosome files are between 355 and 20 million characters long) and N=8
>>>import cStringIO >>>example_file = cStringIO.StringIO("""\ >header CAGTcag TFgcACF """) >>>for read in parse(example_file): ... print read ... CAGTCAGTF AGTCAGTFG GTCAGTFGC TCAGTFGCA CAGTFGCAC AGTFGCACF
The function that I found had the absolute best performance from the methods I could think of is this:
def parse(file):
size = 8 # of course in my code this is a function argument
file.readline() # skip past the header
buffer = ''
for line in file:
buffer += line.rstrip().upper()
while len(buffer) >= size:
yield buffer[:size]
buffer = buffer[1:]
This works, but unfortunately it still takes about 1.5 hours (see note below) to parse the human genome this way. Perhaps this is the very best I am going to see with this method (a complete code refactor might be in order, but I'd like to avoid it as this approach has some very specific advantages in other areas of the code), but I thought I would turn this over to the community.
Thanks!
- Note, this time includes a lot of extra calculation,开发者_如何学JAVA such as computing the opposing strand read and doing hashtable lookups on a hash of approximately 5G in size.
Post-answer conclusion: It turns out that using fileobj.read() and then manipulating the resulting string (string.replace(), etc.) took relatively little time and memory compared to the remainder of the program, and so I used that approach. Thanks everyone!
Could you mmap the file and start pecking through it with a sliding window? I wrote a stupid little program that runs pretty small:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND sarnold 20919 0.0 0.0 33036 4960 pts/2 R+ 22:23 0:00 /usr/bin/python ./sliding_window.py
Working through a 636229 byte fasta file (found via http://biostar.stackexchange.com/questions/1759) took .383 seconds.
#!/usr/bin/python
import mmap
import os
def parse(string, size):
stride = 8
start = string.find("\n")
while start < size - stride:
print string[start:start+stride]
start += 1
fasta = open("small.fasta", 'r')
fasta_size = os.stat("small.fasta").st_size
fasta_map = mmap.mmap(fasta.fileno(), 0, mmap.MAP_PRIVATE, mmap.PROT_READ)
parse(fasta_map, fasta_size)
Some classic IO bound changes.
- Use a lower level read operation like
os.read
and read in to a large fixed buffer. - Use threading/multiprocessing where one reads and buffers and the other processes.
- If you have multiple processors/machines use multiprocessing/mq to divy up processing across CPUs ala map-reduce.
Using a lower level read operation wouldn't be that much of a rewrite. The others would be pretty large rewrites.
I suspect the problem is that you have so much data stored in string format, which is really wasteful for your use case, that you're running out of real memory and thrashing swap. 128 GB should be enough to avoid this... :)
Since you've indicated in comments that you need to store additional information anyway, a separate class which references a parent string would be my choice. I ran a short test using chr21.fa from chromFa.zip from hg18; the file is about 48MB and just under 1M lines. I only have 1GB of memory here, so I simply discard the objects afterwards. This test thus won't show problems with fragmentation, cache, or related, but I think it should be a good starting point for measuring parsing throughput:
import mmap
import os
import time
import sys
class Subseq(object):
__slots__ = ("parent", "offset", "length")
def __init__(self, parent, offset, length):
self.parent = parent
self.offset = offset
self.length = length
# these are discussed in comments:
def __str__(self):
return self.parent[self.offset:self.offset + self.length]
def __hash__(self):
return hash(str(self))
def __getitem__(self, index):
# doesn't currently handle slicing
assert 0 <= index < self.length
return self.parent[self.offset + index]
# other methods
def parse(file, size=8):
file.readline() # skip header
whole = "".join(line.rstrip().upper() for line in file)
for offset in xrange(0, len(whole) - size + 1):
yield Subseq(whole, offset, size)
class Seq(object):
__slots__ = ("value", "offset")
def __init__(self, value, offset):
self.value = value
self.offset = offset
def parse_sep_str(file, size=8):
file.readline() # skip header
whole = "".join(line.rstrip().upper() for line in file)
for offset in xrange(0, len(whole) - size + 1):
yield Seq(whole[offset:offset + size], offset)
def parse_plain_str(file, size=8):
file.readline() # skip header
whole = "".join(line.rstrip().upper() for line in file)
for offset in xrange(0, len(whole) - size + 1):
yield whole[offset:offset+size]
def parse_tuple(file, size=8):
file.readline() # skip header
whole = "".join(line.rstrip().upper() for line in file)
for offset in xrange(0, len(whole) - size + 1):
yield (whole, offset, size)
def parse_orig(file, size=8):
file.readline() # skip header
buffer = ''
for line in file:
buffer += line.rstrip().upper()
while len(buffer) >= size:
yield buffer[:size]
buffer = buffer[1:]
def parse_os_read(file, size=8):
file.readline() # skip header
file_size = os.fstat(file.fileno()).st_size
whole = os.read(file.fileno(), file_size).replace("\n", "").upper()
for offset in xrange(0, len(whole) - size + 1):
yield whole[offset:offset+size]
def parse_mmap(file, size=8):
file.readline() # skip past the header
buffer = ""
for line in file:
buffer += line
if len(buffer) >= size:
for start in xrange(0, len(buffer) - size + 1):
yield buffer[start:start + size].upper()
buffer = buffer[-(len(buffer) - size + 1):]
for start in xrange(0, len(buffer) - size + 1):
yield buffer[start:start + size]
def length(x):
return sum(1 for _ in x)
def duration(secs):
return "%dm %ds" % divmod(secs, 60)
def main(argv):
tests = [parse, parse_sep_str, parse_tuple, parse_plain_str, parse_orig, parse_os_read]
n = 0
for fn in tests:
n += 1
with open(argv[1]) as f:
start = time.time()
length(fn(f))
end = time.time()
print "%d %-20s %s" % (n, fn.__name__, duration(end - start))
fn = parse_mmap
n += 1
with open(argv[1]) as f:
f = mmap.mmap(f.fileno(), 0, mmap.MAP_PRIVATE, mmap.PROT_READ)
start = time.time()
length(fn(f))
end = time.time()
print "%d %-20s %s" % (n, fn.__name__, duration(end - start))
if __name__ == "__main__":
sys.exit(main(sys.argv))
1 parse 1m 42s
2 parse_sep_str 1m 42s
3 parse_tuple 0m 29s
4 parse_plain_str 0m 36s
5 parse_orig 0m 45s
6 parse_os_read 0m 34s
7 parse_mmap 0m 37s
The first four are my code, while orig is yours and the last two are from other answers here.
User-defined objects are much more costly to create and collect than tuples or plain strings! This shouldn't be that surprising, but I had not realized it would make this much of a difference (compare #1 and #3, which really only differ in a user-defined class vs tuple). You said you want to store additional information, like offset, with the string anyway (as in the parse and parse_sep_str cases), so you might consider implementing that type in a C extension module. Look at Cython and related if you don't want to write C directly.
Case #1 and #2 being identical is expected: by pointing to a parent string, I was trying to save memory rather than processing time, but this test doesn't measure that.
I have a function for process a text file and use buffer in read and write and parallel computing with async pool of workets of process. I have a AMD of 2 cores, 8GB RAM, with gnu/linux and can process 300000 lines in less of 1 second, 1000000 lines in aproximately 4 seconds and aproximately 4500000 lines (more of 220MB) in aproximately 20 seconds:
# -*- coding: utf-8 -*-
import sys
from multiprocessing import Pool
def process_file(f, fo="result.txt", fi=sys.argv[1]):
fi = open(fi, "r", 4096)
fo = open(fo, "w", 4096)
b = []
x = 0
result = None
pool = None
for line in fi:
b.append(line)
x += 1
if (x % 200000) == 0:
if pool == None:
pool = Pool(processes=20)
if result == None:
result = pool.map_async(f, b)
else:
presult = result.get()
result = pool.map_async(f, b)
for l in presult:
fo.write(l)
b = []
if not result == None:
for l in result.get():
fo.write(l)
if not b == []:
for l in b:
fo.write(f(l))
fo.close()
fi.close()
First argument is function that rceive one line, process and return result for will write in file, next is file of output and last is file of input (you can not use last argument if you receive as first parameter in your script file of input).
精彩评论