Reading a very large file whose format is newline-independent
My Python code supports reading and writing data in a file format created by others called the BLT format. The BLT format is white space and newline independent in that a newline is treated just like other white space. The primary entry in this format is a "ballot" which ends with a "0", e.g.,
1 2 3 0
Since the format is newline independent, it could also be written as
1 2
3 0
Or you could have multiple ballots on a line:
1 2 3 0 4 5 6 0
These files can be very large, so I don't want to read an entire file into memory. Line-based reading is complicated since the data is not line-based. What is a good way to process these files in a memory-efficient way?
For me, the most straightforward way to solve this is with generators.
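    def tokens(filename):
        # Yield every whitespace-separated item in the file as an int,
        # reading one line at a time.
        with open(filename) as infile:
            for line in infile:
                for item in line.split():
                    yield int(item)

    def ballots(tokens):
        # Collect tokens into a list until a 0 terminator is seen.
        ballot = []
        for t in tokens:
            if t:
                ballot.append(t)
            else:
                yield ballot
                ballot = []

    t = tokens("datafile.txt")
    for b in ballots(t):
        print(b)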
I see @katrielalex posted a generator-using solution while I was posting mine. The difference between ours is that I'm using two separate generators, one for the individual tokens in the file and one for the specific data structure you wish to parse. The former is passed to the latter as a parameter, the basic idea being that you can write a function like ballots() for each of the data structures you wish to parse. You can either iterate over everything yielded by the generator, or call next() on either generator to get the next token or ballot (be prepared for a StopIteration exception when you run out, or else write the generators to generate a sentinel value such as None when they run out of real data, and check for that).
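As a minimal sketch of pulling ballots one at a time, the example below feeds ballots() from a hard-coded token list (an assumption for illustration, standing in for tokens("datafile.txt")). It shows both options: catching StopIteration, and passing a default to the built-in next() so that the caller gets a sentinel without the generator having to yield one itself.

```python
def ballots(tokens):
    """Group a stream of ints into ballots; a 0 ends each ballot."""
    ballot = []
    for t in tokens:
        if t:
            ballot.append(t)
        else:
            yield ballot
            ballot = []

# Hypothetical in-memory token stream used instead of a real file.
gen = ballots(iter([1, 2, 3, 0, 4, 5, 6, 0]))

first = next(gen)    # first ballot
second = next(gen)   # second ballot

# Option 1: catch StopIteration once the data runs out.
try:
    next(gen)
    exhausted = False
except StopIteration:
    exhausted = True

# Option 2: give next() a default value and test for it.
third = next(gen, None)
```

The second form tends to read better in a loop that also has other exit conditions, since no try/except is needed.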
It would be pretty straightforward to wrap the whole thing in a class. In fact...
    class Parser(object):

        def __init__(self, filename):

            def tokens(filename):
                with open(filename) as infile:
                    for line in infile:
                        for item in line.split():
                            yield int(item)

            # The token generator is created once and shared by all parse methods.
            self.tokens = tokens(filename)

        def ballots(self):
            ballot = []
            for t in self.tokens:
                if t:
                    ballot.append(t)
                else:
                    yield ballot
                    ballot = []

    p = Parser("datafile.txt")
    for b in p.ballots():
        print(b)
Use a generator:
    >>> def ballots(f):
    ...     ballot = []
    ...     for line in f:
    ...         for token in line.split():
    ...             if token == '0':
    ...                 yield ballot
    ...                 ballot = []
    ...             else:
    ...                 ballot.append(token)
This will read the file line by line, split on all whitespace, and append the tokens in the line one by one to a list. Whenever a zero is reached, that ballot is yielded and the list is reset to empty.
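One caveat with iterating line by line: because newlines are just whitespace in this format, a file could in principle put every ballot on a single very long line, and that line would then be read into memory all at once. The sketch below is one possible variant, not taken from the answers above, that reads fixed-size chunks instead. The chunk_size parameter and its default are assumptions chosen for illustration, and an in-memory io.StringIO stands in for a real open file.

```python
import io

def tokens(infile, chunk_size=64 * 1024):
    """Yield ints from a text file object, reading chunk_size characters at a time."""
    leftover = ""
    while True:
        chunk = infile.read(chunk_size)
        if not chunk:
            break
        chunk = leftover + chunk
        parts = chunk.split()
        # If the chunk does not end in whitespace, its last token may be
        # cut off mid-number, so hold it back until the next read.
        if parts and not chunk[-1].isspace():
            leftover = parts.pop()
        else:
            leftover = ""
        for item in parts:
            yield int(item)
    # Whatever is still held back at end of file is a complete token.
    if leftover:
        yield int(leftover)

# A tiny chunk size forces tokens such as "40" to straddle a chunk boundary.
sample = io.StringIO("1 2\n3 0 40 5 6 0")
result = list(tokens(sample, chunk_size=3))
```

A generator like this can be passed to a ballots(tokens) function in place of a line-based one, since both simply yield ints.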