
Do all iterators cache? How about csv.reader?

We know the following code loads the data line by line rather than reading it all into memory — i.e. each line, once read, becomes eligible for reclamation by Python's memory manager:

def fileGen(file):
    for line in file:
        yield line

with open("somefile") as file:
    for line in fileGen(file):
        print(line)
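For example, the laziness can be checked by feeding the generator a source that counts how many lines are actually pulled (`CountingLines` here is an illustrative helper, not part of the original code):

```python
def fileGen(file):
    for line in file:
        yield line

class CountingLines:
    """Stand-in for a file: yields lines and counts how many were requested."""
    def __init__(self, n):
        self.n = n
        self.pulled = 0
    def __iter__(self):
        for i in range(self.n):
            self.pulled += 1
            yield "line %d\n" % i

src = CountingLines(1_000_000)
gen = fileGen(src)
next(gen); next(gen); next(gen)
print(src.pulled)   # 3 -- only the lines consumed so far were read
```

If the generator were eager, `pulled` would jump straight to a million; it stays at the number of lines consumed.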

But is there any way to verify whether this is still true if we modify fileGen as follows?

def fileGen(file):
    for line in csv.reader(file):
        yield line

How can we tell whether csv.reader will cache the data it loads? Thanks.

regards, John


The most reliable way to find out what csv.reader is doing is to read the source. See _csv.c, lines 773 onwards. You'll see that the reader object has a pointer to the underlying iterator (typically a file iterator), and it calls PyIter_Next each time it needs another line. So it does not read ahead or otherwise cache the data it loads.
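That C loop is roughly equivalent to the following pure-Python sketch (a simplification for illustration, not the actual implementation — the real parser handles quoting, dialects, and multi-line records):

```python
def reader_sketch(iterable):
    """Rough pure-Python analogue of Reader_iternext in _csv.c:
    pull exactly one line per request, parse it, move on."""
    it = iter(iterable)          # corresponds to the stored input iterator
    while True:
        try:
            line = next(it)      # PyIter_Next: one line, only when asked
        except StopIteration:
            return
        # crude stand-in for the real field parser
        yield line.rstrip("\r\n").split(",")
        # 'line' is rebound on the next pass, dropping the old object
        # (the Py_DECREF in the C code)

rows = reader_sketch(["a,b\n", "c,d\n"])
print(next(rows))   # ['a', 'b'] -- the second line has not been requested yet
```

The key point the sketch shares with the C code: each call for a row triggers exactly one pull from the underlying iterator, and nothing is read ahead.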

Another way to find out what csv.reader is doing is to make a mock file object that can report when it is being queried. For example:

class MockFile:
    def __init__(self): self.line = 0
    def __iter__(self): return self
    def __next__(self):
        self.line += 1
        print("MockFile line", self.line)
        return "line,{0}".format(self.line)

>>> r = csv.reader(MockFile())
>>> next(r)
MockFile line 1
['line', '1']
>>> next(r)
MockFile line 2
['line', '2']

This confirms what we learned from reading the csv source code: it only requests the next line from the underlying iterator when its own next method is called.


John made it clear (see comments) that his concern is whether csv.reader keeps the lines alive, preventing them from being collected by Python's memory manager.

Again, you can either read the code (most reliable) or try an experiment. If you look at the implementation of Reader_iternext in _csv.c, you'll see that lineobj is the name given to the object returned by the underlying iterator, and there's a call to Py_DECREF(lineobj) on every path through the code. So csv.reader does not keep lineobj alive.

Here's an experiment to confirm that.

class FinalizableString(str):
    """A string that reports its deletion."""
    def __del__(self):
        print("*** Deleting", self)

class MockFile:
    def __init__(self): self.line = 0
    def __iter__(self): return self
    def __next__(self):
        self.line += 1
        return FinalizableString("line,{0}".format(self.line))

>>> r = csv.reader(MockFile())
>>> next(r)
*** Deleting line,1
['line', '1']
>>> next(r)
*** Deleting line,2
['line', '2']

So you can see that csv.reader does not hang on to the objects it gets from its iterator, and if nothing else is keeping them alive, then they get garbage-collected in a timely fashion.
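An alternative check, using weak references instead of a finalizer (this is my own variant, not from the original answer; it relies on CPython's reference counting reclaiming objects immediately, so the result may differ on other interpreters):

```python
import csv
import weakref

class TrackedLine(str):
    """str subclass: plain str can't be weakly referenced, but a
    Python-level subclass can."""
    pass

class MockFile:
    def __init__(self):
        self.line = 0
        self.refs = []          # weak references to every line handed out
    def __iter__(self):
        return self
    def __next__(self):
        self.line += 1
        s = TrackedLine("line,{0}".format(self.line))
        self.refs.append(weakref.ref(s))   # weak ref: does not keep s alive
        return s

f = MockFile()
r = csv.reader(f)
next(r)
next(r)
# On CPython, each dead weakref now returns None: the reader dropped
# its reference to every line as soon as it finished parsing it.
print([ref() for ref in f.refs])   # [None, None]
```

If csv.reader cached the lines, at least the most recent `ref()` would still return the live string.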


I have a feeling that there's something more to this question that you're not telling us. Can you explain why you are worried about this?

