Math operation on very large CSV (roughly 25-30K lines) in Python3
I got some help in a previous question about this, however this is quite a different issue so I thought a new question would be best...
Once a month I need to parse through a very large CSV file, which I would normally do manually in Excel, however I'm now wanting to do it automatically in Python.
The CSV is structured like so:
[IDNUMBER],[DATE1],[DATE2],[DATE3],[STRING-OF-WHAT-HAPPENED],[DATE3 - DATE2 IN DAYS],[ORIGINAL-FILENAME]
What I need is essentially a printed out display of the following (or to a file, but I don't need to hold on to this data, I just plug it into some charts):
For each original-filename (which can be up to 1200 rows), I need an average of [DATE3 - DATE2 IN DAYS]. For example:
12345,2011-06-12,2011-07-01,2011-07-2,1,['1100.csv']
54321,2011-06-12,2011-07-01,2011-07-3,2,['1100.csv']
23452,2011-06-12,2011-07-01,2011-07-4,3,['1100.csv']
The average would be 2, and I'd need to know that number, and it would be helpful to know how many per file as well, which in this example would be 3.
Then move on to the next original-filename (the last item in the row), until the end of the CSV.
In excel I would use autofilter and select each listing in that column and just select the [date3 - da开发者_C百科te2] column and just get an average, but it's kind of tedious and time consuming.
Thanks!
If I understand you correctly you can do it as follows (hopefully the code is self explanatory):
You can use 'csv' module in Python to read the file.
import csv
fileHandler = open('yourFile', 'rU')
csvReader = csv.reader(fileHandler)
#Get the first row to initialize variables
firstRow = csvReader.next()
columnToAverage = 5
originalFileColumn = 6
runningDaysSum = atoi(firstRow[columnToAverage])
totalRows = 1
originalFileName = firstRow[originalFileColumn]
result = {}
for row in csvReader:
#Move to next row
if originalFileName != row[originalFileColumn]:
average = runningDaysSum/totalRows
result[originalFileName] = (average, totalRows)
originalFileName = row[originalFileColumn]
totalRows = 0
runningDaysSum = 0
runningDaysSum += atoi(row[columnToAverage])
totalRows += 1
#For last row
result[originalFileName] = (average, totalRows)
Hope it helps.
Ok, I assume you've opened the file and can read and parse lines with csv
. I suggest making a dictionary, with the keys being filenames, and the values being a tuple: (count, sum of datediffs). Something like this:
data = {}
while [read and parse line]:
datediff = [DATE3 - DATE2]
if filename not in data:
data[filename] = (1, datediff)
else:
ct, sum = data[filename]
data[filename] = (ct + 1, sum + datediff)
for fname in sorted(data.keys):
ct, sum = data[filename]
avg = sum / ct
[print]
精彩评论