
How to efficiently output a dictionary as a CSV file using Python's csv module? Out of memory error

I am trying to serialize a list of dictionaries to a CSV text file using Python's csv module. My list has about 13,000 elements, each of which is a dictionary with ~100 keys whose values are simple text and numbers. My function "dictlist2file" simply calls DictWriter to serialize this, but I am getting out of memory errors.

My function is:

import csv
import time

def dictlist2file(dictrows, filename, fieldnames, delimiter='\t',
                  lineterminator='\n', extrasaction='ignore'):
    out_f = open(filename, 'w')

    # Write out header; if no fieldnames were given, derive them
    # from the first row's keys so DictWriter gets a real list
    if fieldnames is not None:
        header = delimiter.join(fieldnames) + lineterminator
    else:
        fieldnames = sorted(dictrows[0].keys())
        header = delimiter.join(fieldnames) + lineterminator
    out_f.write(header)

    print "dictlist2file: serializing %d entries to %s" \
          % (len(dictrows), filename)
    t1 = time.time()
    # Write out the dictionary rows
    data = csv.DictWriter(out_f, fieldnames,
                          delimiter=delimiter,
                          lineterminator=lineterminator,
                          extrasaction=extrasaction)
    data.writerows(dictrows)
    out_f.close()
    t2 = time.time()
    print "dictlist2file: took %.2f seconds" % (t2 - t1)

When I try this on my dictionary, I get the following output:

dictlist2file: serializing 13537 entries to myoutput_file.txt
Python(6310) malloc: *** mmap(size=45862912) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
...
  File "/Library/Frameworks/Python.framework/Versions/6.2/lib/python2.6/csv.py", line 149, in writerows
    rows.append(self._dict_to_list(rowdict))
  File "/Library/Frameworks/Python.framework/Versions/6.2/lib/python2.6/csv.py", line 141, in _dict_to_list
    return [rowdict.get(key, self.restval) for key in self.fieldnames]
MemoryError

Any idea what could be causing this? The list has only ~13,000 elements, and the dictionaries themselves are simple and small (~100 keys each), so I don't see why this should run out of memory or be so inefficient. It also takes minutes before it even reaches the MemoryError.

Thanks for your help.


DictWriter.writerows(...) takes all the dicts you pass to it and builds (in memory) an entire new list of lists, one per row. So if you have a lot of data, I can see how a MemoryError would pop up. Two ways you might proceed:

  1. Iterate over the list yourself and call DictWriter.writerow once for each dict. This does mean a lot of individual writes, though.
  2. Batch the rows up into smaller lists and call DictWriter.writerows on each batch (see the sketch below). Fewer calls than option 1, and you still avoid allocating one huge chunk of memory.
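
A minimal sketch of option 2, assuming the same dictrows, fieldnames, and writer options as in the question (the helper name write_in_batches and the batch size of 500 are arbitrary choices, not anything from the csv module):

import csv

def write_in_batches(out_f, dictrows, fieldnames, batch_size=500,
                     delimiter='\t', lineterminator='\n'):
    writer = csv.DictWriter(out_f, fieldnames,
                            delimiter=delimiter,
                            lineterminator=lineterminator,
                            extrasaction='ignore')
    # Hand DictWriter one small slice at a time so it never has to
    # build a single list holding all 13,000+ converted rows at once.
    for start in range(0, len(dictrows), batch_size):
        writer.writerows(dictrows[start:start + batch_size])

Option 1 is the same idea taken further: loop over dictrows and call writer.writerow(d) for each dict, which keeps memory flat at the cost of one call per row.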


You could be tripping over an internal Python issue. I'd report it at bugs.python.org.


I don't have an answer for what is happening with csv, but I found that the following substitute serializes the dictionaries to a file in just a few seconds:

for row in dictrows:
    out_f.write("%s%s" %(delimiter.join([row[name] for name in fieldnames]),
                         lineterminator))

where dictrows is a generator of dictionaries produced by csv.DictReader, and fieldnames is a list of the field names.

Any idea why csv doesn't perform similarly would be greatly appreciated. Thanks.


You say that even if you loop and call data.writerow(single_dict) for each dict, you still get the problem. Put in code to show the row count every 100 rows (something like the sketch below). How many dicts has it processed before it gets the MemoryError? Run more or fewer other processes to soak up more or less memory ... does the place where it fails vary?
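
A rough sketch of that kind of instrumentation, assuming the same data DictWriter object from the question (the interval of 100 is arbitrary):

count = 0
for rowdict in dictrows:
    data.writerow(rowdict)
    count += 1
    if count % 100 == 0:
        print "dictlist2file: processed %d rows" % count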

What is max(len(d) for d in dictrows)? How long are the strings in the dicts?

How much free memory do you have anyway?

Update: See whether DictWriter is the problem; eliminate it and use the basic csv functionality:

writer = csv.writer(.....)
for d in dictrows:
    row = [d[fieldname] for fieldname in fieldnames]
    writer.writerow(row)