writing large CSV files - dictionary based CSV writer seems to be the problem

I have a large bag-of-words array (words and their counts) that I need to write to a large flat CSV file.

In testing with around 1000 or so words, this works just fine. I use csv.DictWriter as follows:

self.csv_out = csv.DictWriter(open(self.loc+'.csv','w'), quoting=csv.QUOTE_ALL, fieldnames=fields)

where fields is a list of words (i.e. the keys in the dictionary that I pass to csv_out.writerow).

However, this seems to scale horribly: as the number of words increases, the time required to write a row grows much faster than the word count. The _dict_to_list method in csv seems to be the instigator of my troubles.

I'm not entirely sure how to even begin to optimize here. Are there any faster CSV routines I could use?


OK, this is by no means the answer, but I looked up the source code for the csv module and noticed that there is a very expensive "if ... not in" check in the module (lines 136-141 in Python 2.6). For every key in the row, it tests membership in the fieldnames list:

if self.extrasaction == "raise":
    wrong_fields = [k for k in rowdict if k not in self.fieldnames]
    if wrong_fields:
        raise ValueError("dict contains fields not in fieldnames: " +
                         ", ".join(wrong_fields))
return [rowdict.get(key, self.restval) for key in self.fieldnames]

So a quick workaround seems to be to pass extrasaction="ignore" when creating the writer, which skips that check entirely. This seems to speed things up very substantially.

Not a perfect solution, and perhaps somewhat obvious, but I'm posting it in case it is helpful to somebody else.
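As a rough illustration of that workaround, here is a minimal sketch. It assumes Python 3 (the question targets Python 2.6), and the function name write_counts and the sample data are made up for the example. Newer Python versions may implement the extra-field check differently, so the gain there may be smaller.

```python
import csv
import io


def write_counts(rows, fields, out):
    """Write dict rows to `out`, skipping the extra-field validation.

    `write_counts` is a hypothetical helper name, not part of the csv module.
    """
    writer = csv.DictWriter(
        out,
        fieldnames=fields,
        quoting=csv.QUOTE_ALL,
        extrasaction="ignore",  # keys not in `fields` are silently dropped
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


if __name__ == "__main__":
    fields = ["apple", "banana"]
    rows = [{"apple": 2, "banana": 1, "cherry": 5}, {"apple": 3}]
    buf = io.StringIO()  # for a real file: open(path, "w", newline="")
    write_counts(rows, fields, buf)
    print(buf.getvalue())
```

Note the trade-off: with extrasaction="ignore", a misspelled or unexpected key no longer raises ValueError, so it simply never reaches the file.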


The obvious optimisation is to use a csv.writer instead of a DictWriter, passing in iterables for each row instead of dictionaries. Does that not help?
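A minimal sketch of what that could look like, assuming Python 3 and that each row is still built as a dict of word counts. The helper name write_rows and its restval parameter are invented for this example.

```python
import csv
import io


def write_rows(rows, fields, out, restval=""):
    """Write dict rows through a plain csv.writer.

    Each dict is flattened into a list ordered like `fields`;
    missing words get `restval`. Hypothetical helper, not a csv API.
    """
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    writer.writerow(fields)  # header row
    for row in rows:
        writer.writerow([row.get(field, restval) for field in fields])


if __name__ == "__main__":
    fields = ["apple", "banana"]
    rows = [{"apple": 2, "banana": 1}, {"banana": 4}]
    buf = io.StringIO()  # for a real file: open(path, "w", newline="")
    write_rows(rows, fields, buf, restval=0)
    print(buf.getvalue())
```

This does one dictionary lookup per column and never validates the keys, so any key that is not in fields is simply left out.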

When you say "the number of words", do you mean the number of columns in the CSV? Because I've never seen a CSV that needs thousands of columns! Maybe you have transposed your data and are writing columns instead of rows? Each row should represent one datum, with sections as defined by the columns. If you really do need that sort of size, maybe a database is a better choice?
