Is the Python list performing badly?
I was trying to read data from a huge file and write it back, but I realised that the main cost came from assigning data to a list rather than from reading or writing data from/to the file...
rows = [None] * 1446311
begin = datetime.datetime.now()
for i in range(1446311):
    row = csvReader.next()
    rows[i] = row
print datetime.datetime.now() - begin
The code above takes 18 seconds, but only 5 seconds if I comment out line 5 (rows[i] = row). I have built the list in advance (i.e. reserved the memory), so why is it still so slow? Is there anything I can do to make it faster? I tried [row for row in csvReader], but it performs worse...
regards, John
I get similar results, but not quite so dramatic as yours. (Note the use of the timeit module for timing code execution, and note that I've factored out the list creation since it's common to both test cases.)
import csv
from timeit import Timer

def write_csv(f, n):
    """Write n records to the file named f."""
    w = csv.writer(open(f, 'wb'))
    for i in xrange(n):
        w.writerow((i, "squared", "equals", i**2))

def test1(rows, f, n):
    for i, r in enumerate(csv.reader(open(f))):
        rows[i] = r

def test2(rows, f, n):
    for i, r in enumerate(csv.reader(open(f))):
        pass

def test(t):
    return (Timer('test%d(rows, F, N)' % t,
                  'from __main__ import test%d, F, N; rows = [None] * N' % t)
            .timeit(number=1))
>>> N = 1446311
>>> F = "test.csv"
>>> write_csv(F, N)
>>> test(1)
2.2321770191192627
>>> test(2)
1.7048690319061279
Here's my guess as to what is going on. In both tests, the CSV reader reads a record from the file and creates a data structure in memory representing that record.
In test2, where the record is not stored, the data structure gets deleted more or less immediately (on the next iteration of the loop, the row variable is updated, so the reference count of the previous record is decremented, and so the memory is reclaimed). This makes the memory used for the previous record available for reuse: this memory is already in the computer's virtual memory tables, and probably still in the cache, so it's (relatively) fast.
In test1, where the record is stored, each record has to be allocated in a new region of memory, which has to be allocated by the operating system and copied to the cache, so it's (relatively) slow.
So the time is not taken up by list assignment, but by memory allocation.
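If you want to see the reference-counting behaviour behind this, here's a small illustration using sys.getrefcount. This is a sketch I've added, not part of the timing experiment: the exact counts are CPython-specific, and getrefcount itself holds one temporary reference to its argument.

import sys

row = ['a', 'b', 'c']
print sys.getrefcount(row)  # typically 2: 'row' plus getrefcount's argument

rows = [None] * 3
rows[0] = row
print sys.getrefcount(row)  # one higher: the list now refers to it too

row = ['d', 'e', 'f']       # rebinding decrements the old record's count;
                            # if nothing else refers to it, CPython frees it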
Here are another couple of tests that illustrate what's going on, without the complicating factor of the csv module. In test3 we create a new 100-element list for each row, and store it. In test4 we create a new 100-element list for each row, but we don't store it; we throw it away so that the memory can be reused the next time round the loop.
def test3(rows, f, n):
    for i in xrange(n):
        rows[i] = [i] * 100

def test4(rows, f, n):
    for i in xrange(n):
        temp = [i] * 100   # create the list...
        rows[i] = None     # ...but don't keep it, so its memory can be reused
>>> test(3)
9.2103338241577148
>>> test(4)
1.5666921138763428
So I think the lesson is that if you do not need to store all the rows in memory at the same time, don't do that! If you can, read them in one at a time, process them one at a time, and then forget about them so that Python can deallocate them.
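For instance, here's a minimal sketch of that one-row-at-a-time approach (process_row is a hypothetical stand-in for whatever work you actually do per record, and test.csv is the file written above):

import csv

def process_row(row):
    # hypothetical per-record work; replace with your own logic
    return len(row)

total = 0
for row in csv.reader(open('test.csv', 'rb')):
    total += process_row(row)  # on the next iteration, row is rebound,
                               # so the previous record can be deallocated
print total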
EDIT: this first part is not so valid (see comments below)
Did you try something like this:
rows = [None] * 1446311
for i in range(1446311):
    rows[i] = csvReader.next()
Because from what I see in your code, you're copying the data twice: once from the file to memory with row = ..., and once from row to rows[i]. As you have immutable things here (strings), we really are talking about copying data, not copying references.
Moreover, even though you created the list in advance, you are putting big pieces of data inside it; as you only put None in it at the beginning, no real memory space has been reserved (see the small sys.getsizeof illustration below). So maybe you could just as well write something very simple like this:
rows = []
for i in range(1446311):
    rows.append(csvReader.next())
or maybe even consume the iterator directly!
rows = list(csvReader)
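To illustrate the point about None and memory reservation, here's a small check with sys.getsizeof. This is a sketch of my own; note that getsizeof reports only the list object itself, i.e. its array of pointers, not the objects it refers to:

import sys

rows = [None] * 1000
print sys.getsizeof(rows)  # size of the list's pointer array only

rows[0] = 'x' * 10**6      # a megabyte of string data...
print sys.getsizeof(rows)  # ...but the list's own size is unchanged:
                           # the data lives elsewhere and is only referenced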
EDIT: After reading Gareth's answer, I did some time-testing on my proposals. By the way, take care to put some protection in place when reading from an iterator, in order to stop nicely if the iterator is shorter or longer than expected:
>>> from timeit import Timer
>>> import csv
>>> # building some timing framework:
>>> def test(t):
        return min(Timer('test%d(F, N)' % t,
                         'from __main__ import test%d, F, N' % t)
                   .repeat(repeat=10, number=1))
>>> F = r"some\big\csvfile.csv"
>>> N = 200000
>>> def test1(file_in, number_of_lines):
        csvReader = csv.reader(open(file_in, 'rb'))
        rows = [None] * number_of_lines
        for i, c in enumerate(csvReader):  # using iterator syntax
            if i >= number_of_lines:       # and limiting the number of lines
                break
            row = c
            rows[i] = row
        return rows
>>> test(1)
0.31833305864660133
>>> def test2(file_in, number_of_lines):
        csvReader = csv.reader(open(file_in, 'rb'))
        rows = [None] * number_of_lines
        for i, c in enumerate(csvReader):
            if i >= number_of_lines:
                break
            row = c
        return rows
>>> test(2)
0.25134269758603978 # remember that only last line is stored!
>>> def test3(file_in, number_of_lines):
        csvReader = csv.reader(open(file_in, 'rb'))
        rows = [None] * number_of_lines
        for i, c in enumerate(csvReader):
            if i >= number_of_lines:
                break
            rows[i] = c
        return rows
>>> test(3)
0.30860502255637812
>>> def test4(file_in, number_of_lines):
        csvReader = csv.reader(open(file_in, 'rb'))
        rows = []
        for i, c in enumerate(csvReader):
            if i >= number_of_lines:
                break
            rows.append(c)
        return rows
>>> test(4)
0.32001576256431008
>>> def test5(file_in, number_of_lines):
        csvReader = csv.reader(open(file_in, 'rb'))
        rows = list(csvReader)
        # problem: there's no way to limit the number of lines to parse!
        return rows
>>> test(5)
0.30347613834584308
What we can see is that, for an N greater than the number of lines in the document, there is no great difference in timing. test2 is, on my machine, unexpectedly only a little faster than the others. test5 is more elegant, but cannot limit the number of lines parsed, which can be annoying.
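For what it's worth, the standard library's itertools.islice could keep the elegance of test5 while still capping the number of rows; a possible test6 (my own addition, not part of the timings above) might look like this:

import csv
from itertools import islice

def test6(file_in, number_of_lines):
    csvReader = csv.reader(open(file_in, 'rb'))
    # islice stops after number_of_lines rows, or earlier if the file ends
    return list(islice(csvReader, number_of_lines))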
So, if you need all the lines at once, my advice would be to go with the most elegant solution, even if it is a bit longer: test4. But maybe, as Gareth suggests, you do not need everything at once, and that is the best way to gain speed and save memory.