Is list performing badly in Python?

I was trying to read data from a huge file and write it back, but I realised that the main cost came from assigning the data to a list rather than from reading or writing the data from/to the file...

    rows = [None] * 1446311
    begin = datetime.datetime.now()
    for i in range( 1446311 ):
        row = csvReader.next()
        rows[i] = row
    print datetime.datetime.now() - begin

The above code takes 18 sec, but only 5 sec if I comment out line 5 (rows[i] = row). I have built the list in advance (i.e. reserved the memory), so why is it still so slow? Is there anything I can do to make it faster? I tried [row for row in csvReader] but it performs worse...

regards, John


I get similar results, but not quite as dramatic as yours. (Note the use of the timeit module for timing code execution, and note that I've factored out the list creation since it's common to both test cases.)

import csv
from timeit import Timer

def write_csv(f, n):
    """Write n records to the file named f."""
    w = csv.writer(open(f, 'wb'))
    for i in xrange(n):
        w.writerow((i, "squared", "equals", i**2))

def test1(rows, f, n):
    # Read each record and store it in the pre-allocated list.
    for i, r in enumerate(csv.reader(open(f))):
        rows[i] = r

def test2(rows, f, n):
    # Read each record but keep no reference to it.
    for i, r in enumerate(csv.reader(open(f))):
        pass

def test(t):
    # Time a single run of test<t>; the rows list is built in the setup
    # statement so that its creation is not included in the timing.
    return (Timer('test%d(rows, F, N)' % t,
                  'from __main__ import test%d, F, N; rows = [None] * N' % t)
            .timeit(number=1))

>>> N = 1446311
>>> F = "test.csv"
>>> write_csv(F, N)
>>> test(1)
2.2321770191192627
>>> test(2)
1.7048690319061279

Here's my guess as to what is going on. In both tests, the CSV reader reads a record from the file and creates a data structure in memory representing that record.

In test2, where the record is not stored, the data structure gets deleted more or less immediately (on the next iteration of the loop, the row variable is updated, so the reference count of the previous record is decremented, and so the memory is reclaimed). This makes the memory used for the previous record available for reuse: this memory is already in the computer's virtual memory tables, and probably still in the cache, so it's (relatively) fast.

In test1, where the record is stored, each record has to live in a new region of memory, which has to be obtained from the operating system and brought into the cache, so it's (relatively) slow.

So the time is not taken up by list assignment, but by memory allocation.


Here are another couple of tests that illustrate what's going on, without the complicating factor of the csv module. In test3 we create a new 100-element list for each row and store it. In test4 we create a new 100-element list for each row, but we don't store it; we throw it away so that the memory can be reused the next time round the loop.

def test3(rows, f, n):
    for i in xrange(n):
        rows[i] = [i] * 100   # create a new 100-element list and keep it

def test4(rows, f, n):
    for i in xrange(n):
        temp = [i] * 100      # create a new 100-element list ...
        rows[i] = None        # ... but don't keep it, so its memory can be reused

>>> test(3)
9.2103338241577148
>>> test(4)
1.5666921138763428

So I think the lesson is that if you do not need to store all the rows in memory at the same time, don't do that! If you can, read them in one at a time, process them one at a time, and then forget about them so that Python can deallocate them.
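For example, here's a minimal sketch of that one-record-at-a-time approach; process_row is a hypothetical placeholder for whatever you actually need to do with each record:

import csv

def process_file(filename, process_row):
    # Stream the CSV: handle each record, then forget it so that
    # its memory can be reused for the next one.
    reader = csv.reader(open(filename, 'rb'))
    for row in reader:
        process_row(row)

Because no reference to a previous record survives the loop, each record's memory can be reclaimed immediately, which is the fast case illustrated by test2 and test4.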


EDIT: this first part is not so valid (see comments below)

Did you make a try like this:

rows = [None] * 1446311
for i in range( 1446311 ):
   rows[i] = csvReader.next()

Because from what I see in your code, you're copying the data twice: once from the file into memory with row = ..., and once from row into rows[i]. As you have immutable things here (strings), we are really talking about copying data, not copying references.

Moreover, even though you created the list in advance, you are putting big pieces of data into it; since you only put None in it at the beginning, no real memory space has been reserved for the rows themselves. So maybe you could just as well write something very simple like this:

rows = []
for i in range( 1446311 ):
   rows.append(csvReader.next())

or maybe even build the list directly from the reader:

rows = list(csvReader)
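To illustrate the earlier point about pre-allocation (a minimal sketch, assuming CPython; exact numbers depend on your platform): sys.getsizeof reports only the size of the list's internal array of references, and that size does not change when you replace the Nones with real rows, because the row data is allocated separately, one object at a time.

import sys

N = 1446311
rows = [None] * N
print sys.getsizeof(rows)   # size of the reference array only

for i in xrange(N):
    rows[i] = ["some", "row", "data", i]
print sys.getsizeof(rows)   # same size: the row objects live elsewhere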

EDIT: After reading Gareth's answer, I did some timing tests on my proposals. By the way, take care to add some protection when reading from an iterator, so that you stop nicely if the file is longer than expected:

>>> from timeit import Timer
>>> import csv
>>> # building some timing framework:
>>> def test(t):
    return min(Timer('test%d(F, N)' % t,
                  'from __main__ import test%d, F, N' % t)
            .repeat(repeat=10, number=1))

>>> F = r"some\big\csvfile.csv"
>>> N = 200000
>>> def test1(file_in, number_of_lines):
    csvReader = csv.reader(open(file_in, 'rb'))
    rows = [None] * number_of_lines
    for i, c in enumerate(csvReader):  # using iterator syntax
        if i >= number_of_lines:  # and limiting the number of lines
            break
        row = c
        rows[i] = row
    return rows

>>> test(1)
0.31833305864660133

>>> def test2(file_in, number_of_lines):
    csvReader = csv.reader(open(file_in, 'rb'))
    rows = [None] * number_of_lines
    for i, c in enumerate(csvReader):
        if i >= number_of_lines:
            break
        row = c
    return rows

>>> test(2)
0.25134269758603978  # remember that only the last line is stored!

>>> def test3(file_in, number_of_lines):
    csvReader = csv.reader(open(file_in, 'rb'))
    rows = [None] * number_of_lines
    for i, c in enumerate(csvReader):
        if i >= number_of_lines:
            break
        rows[i] = c
    return rows

>>> test(3)
0.30860502255637812

>>> def test4(file_in, number_of_lines):
    csvReader = csv.reader(open(file_in, 'rb'))
    rows = []
    for i, c in enumerate(csvReader):
        if i >= number_of_lines:
            break
        rows.append(c)
    return rows

>>> test(4)
0.32001576256431008

>>> def test5(file_in, number_of_lines):
    csvReader = csv.reader(open(file_in, 'rb'))
    rows = list(csvReader)  
    # problem: there's no way to limit the number of lines to parse!
    return rows

>>> test(5)
0.30347613834584308

What we can see is that, for an N greater than the number of lines in the document, there is no great difference in timing. test2 is, on my machine, unexpectedly only a little different. test5 is more elegant, but it cannot limit the number of lines parsed, which can be annoying.

So, if you need all the lines at once, my advice would be to go for the most elegant solution, even if it is a bit longer: test4. But maybe, as Gareth suggests, you do not need everything at once, and avoiding that is the best way to gain both speed and memory.
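If the only drawback of test5 is that it cannot limit the number of lines, one possible workaround (a sketch I haven't timed) is itertools.islice, which cuts the iterator off after a given number of items while keeping the one-liner style:

from itertools import islice
import csv

def read_rows(file_in, number_of_lines):
    # Build a list from at most number_of_lines records of the CSV.
    csvReader = csv.reader(open(file_in, 'rb'))
    return list(islice(csvReader, number_of_lines))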
