Math operation on very large CSV (roughly 25-30K lines) in Python3

2023-04-05 07:11 问答作者：

I got some help in a previous question about this, however this is quite a different issue so I thought a new question would be best...

Once a month I need to parse through a very large CSV file, which I would normally do manually in Excel, however I'm now wanting to do it automatically in Python.

The CSV is structured like so:

[IDNUMBER],[DATE1],[DATE2],[DATE3],[STRING-OF-WHAT-HAPPENED],[DATE3 - DATE2 IN DAYS],[ORIGINAL-FILENAME]

What I need is essentially a printed out display of the following (or to a file, but I don't need to hold on to this data, I just plug it into some charts):

For each original-filename (which can be up to 1200 rows), I need an average of [DATE3 - DATE2 IN DAYS]. For example:

12345,2011-06-12,2011-07-01,2011-07-2,1,['1100.csv']
54321,2011-06-12,2011-07-01,2011-07-3,2,['1100.csv']
23452,2011-06-12,2011-07-01,2011-07-4,3,['1100.csv']

The average would be 2, and I'd need to know that number, and it would be helpful to know how many per file as well, which in this example would be 3.

Then move on to the next original-filename (the last item in the row), until the end of the CSV.

In excel I would use autofilter and select each listing in that column and just select the [date3 - da开发者_C百科te2] column and just get an average, but it's kind of tedious and time consuming.

Thanks!

If I understand you correctly you can do it as follows (hopefully the code is self explanatory):

You can use 'csv' module in Python to read the file.

import csv

fileHandler = open('yourFile', 'rU')
csvReader = csv.reader(fileHandler)

#Get the first row to initialize variables
firstRow = csvReader.next()
columnToAverage = 5
originalFileColumn = 6
runningDaysSum = atoi(firstRow[columnToAverage])
totalRows = 1
originalFileName = firstRow[originalFileColumn]

result = {}

for row in csvReader:
    #Move to next row
    if originalFileName != row[originalFileColumn]:
        average = runningDaysSum/totalRows
        result[originalFileName] = (average, totalRows)
        originalFileName = row[originalFileColumn]
        totalRows = 0
        runningDaysSum = 0

    runningDaysSum += atoi(row[columnToAverage])
    totalRows += 1

#For last row
result[originalFileName] = (average, totalRows)

Hope it helps.

Ok, I assume you've opened the file and can read and parse lines with csv. I suggest making a dictionary, with the keys being filenames, and the values being a tuple: (count, sum of datediffs). Something like this:

data = {}
while [read and parse line]:
    datediff = [DATE3 - DATE2]
    if filename not in data:
        data[filename] = (1, datediff)
    else:
        ct, sum = data[filename]
        data[filename] = (ct + 1, sum + datediff)

for fname in sorted(data.keys):
    ct, sum = data[filename]
    avg = sum / ct
    [print]

继续阅读：average csv python-3.x

Math operation on very large CSV (roughly 25-30K lines) in Python3

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？