
Do all iterators cache? How about csv.reader?

We know the following code loads the data line by line rather than reading it all into memory — i.e. each line, once read, becomes eligible for reclamation by Python's memory manager:

def fileGen(file):
    for line in file:
        yield line

with open("somefile") as file:
    for line in fileGen(file):
        print(line)
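For example, the laziness can be checked by feeding the generator a source that counts how many lines are actually pulled (`CountingLines` here is an illustrative helper, not part of the original code):

```python
def fileGen(file):
    for line in file:
        yield line

class CountingLines:
    """Stand-in for a file: yields lines and counts how many were requested."""
    def __init__(self, n):
        self.n = n
        self.pulled = 0
    def __iter__(self):
        for i in range(self.n):
            self.pulled += 1
            yield "line %d\n" % i

src = CountingLines(1_000_000)
gen = fileGen(src)
next(gen); next(gen); next(gen)
print(src.pulled)   # 3 -- only the lines consumed so far were read
```

If the generator were eager, `pulled` would jump straight to a million; it stays at the number of lines consumed.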

But is there any way to verify whether this is still true if we modify fileGen as follows?

def fileGen(file):
    for line in csv.reader(file):
        yield line

How can we tell whether csv.reader will cache the data it loads? Thanks.

regards, John


The most reliable way to find out what csv.reader is doing is to read the source. See _csv.c, lines 773 onwards. You'll see that the reader object has a pointer to the underlying iterator (typically a file iterator), and it calls PyIter_Next each time it needs another line. So it does not read ahead or otherwise cache the data it loads.
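That C loop is roughly equivalent to the following pure-Python sketch (a simplification for illustration, not the actual implementation — the real parser handles quoting, dialects, and multi-line records):

```python
def reader_sketch(iterable):
    """Rough pure-Python analogue of Reader_iternext in _csv.c:
    pull exactly one line per request, parse it, move on."""
    it = iter(iterable)          # corresponds to the stored input iterator
    while True:
        try:
            line = next(it)      # PyIter_Next: one line, only when asked
        except StopIteration:
            return
        # crude stand-in for the real field parser
        yield line.rstrip("\r\n").split(",")
        # 'line' is rebound on the next pass, dropping the old object
        # (the Py_DECREF in the C code)

rows = reader_sketch(["a,b\n", "c,d\n"])
print(next(rows))   # ['a', 'b'] -- the second line has not been requested yet
```

The key point the sketch shares with the C code: each call for a row triggers exactly one pull from the underlying iterator, and nothing is read ahead.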

Another way to find out what csv.reader is doing is to make a mock file object that can report when it is being queried. For example:

class MockFile:
    def __init__(self): self.line = 0
    def __iter__(self): return self
    def __next__(self):
        self.line += 1
        print("MockFile line", self.line)
        return "line,{0}".format(self.line)

>>> r = csv.reader(MockFile())
>>> next(r)
MockFile line 1
['line', '1']
>>> next(r)
MockFile line 2
['line', '2']

This confirms what we learned from reading the csv source code: it only requests the next line from the underlying iterator when its own next method is called.


John made it clear (see comments) that his concern is whether csv.reader keeps the lines alive, preventing them from being collected by Python's memory manager.

Again, you can either read the code (most reliable) or try an experiment. If you look at the implementation of Reader_iternext in _csv.c, you'll see that lineobj is the name given to the object returned by the underlying iterator, and there's a call to Py_DECREF(lineobj) on every path through the code. So csv.reader does not keep lineobj alive.

Here's an experiment to confirm that.

class FinalizableString(str):
    """A string that reports its deletion."""
    def __del__(self):
        print("*** Deleting", self)

class MockFile:
    def __init__(self): self.line = 0
    def __iter__(self): return self
    def __next__(self):
        self.line += 1
        return FinalizableString("line,{0}".format(self.line))

>>> r = csv.reader(MockFile())
>>> next(r)
*** Deleting line,1
['line', '1']
>>> next(r)
*** Deleting line,2
['line', '2']

So you can see that csv.reader does not hang on to the objects it gets from its iterator, and if nothing else is keeping them alive, then they get garbage-collected in a timely fashion.
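An alternative check, using weak references instead of a finalizer (this is my own variant, not from the original answer; it relies on CPython's reference counting reclaiming objects immediately, so the result may differ on other interpreters):

```python
import csv
import weakref

class TrackedLine(str):
    """str subclass: plain str can't be weakly referenced, but a
    Python-level subclass can."""
    pass

class MockFile:
    def __init__(self):
        self.line = 0
        self.refs = []          # weak references to every line handed out
    def __iter__(self):
        return self
    def __next__(self):
        self.line += 1
        s = TrackedLine("line,{0}".format(self.line))
        self.refs.append(weakref.ref(s))   # weak ref: does not keep s alive
        return s

f = MockFile()
r = csv.reader(f)
next(r)
next(r)
# On CPython, each dead weakref now returns None: the reader dropped
# its reference to every line as soon as it finished parsing it.
print([ref() for ref in f.refs])   # [None, None]
```

If csv.reader cached the lines, at least the most recent `ref()` would still return the live string.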


I have a feeling that there's something more to this question that you're not telling us. Can you explain why you are worried about this?

