Need to do a math operation on every line in several CSV files in Python
I have about 100 CSV files I have to operate on once a month and I was trying to wrap my head around this but I'm running into a wall. I'm starting to understand some things about Python, but combining several things is still giving me issues, so I can't figure this out.
Here's my problem:
I have many CSV files, and here's what I need done:
add a "column" to the front of each row (or the back, doesn't matter really, but front is ideal). In addition, each line has 5 rows (not counting the filename that will be added), and here's the format:
6-digit ID number,YYYY-MM-DD(1),YYYY-MM-DD(2),YYYY-MM-DD(3),1-2-digit number
I need to subtract YYYY-MM-DD(3) from YYYY-MM-DD(2) for every line in the file (there is no header row), for every CSV in a given directory.
I need the filename inside the row because I will combine the files (which, if is included in the script would be awesome, but I think I can figure that part out), and I need to know what file the records came from. Format of filename is always '4-5-digit-number.csv'
I hope this makes sense, if it does not, please let me know. I'm kind of stumped as to where to even begin, so I don't have any sample code that even really began to work for me. Really frustrated, so I appreciate any help you guys may provide, thi开发者_如何学JAVAs site rocks!
Mylan
There's a tool in the standard library for each of these tasks:
To iterate over all CSV files in a directory, use the glob
module:
import glob
for csvfilename in glob.glob(r"C:\mydirectory\*.csv"):
#do_something
To parse a CSV file, use the csv
module:
import csv
with open(csvfilename, "rb") as csvfile:
reader = csv.reader(csvfile, delimiter=",")
for row in reader:
# row is a list of all the entries in the current row
To parse a date and calculate a difference, use the datetime
module:
from datetime import datetime
startdate = datetime.strptime("1999-10-20", "%Y-%m-%d")
enddate = datetime.strptime("2003-02-28", "%Y-%m-%d")
delta = enddate - startdate # difference in days
To add a value to the beginning of a row:
row[0:0] = [str(delta)]
To append the filename to the end of a row:
row.append(csvfilename)
And to write a row to a new CSV file:
with open(csvfilename, "wb") as csvfile:
writer = csv.writer(csvfile, delimiter=",")
writer.writerow(row)
Taken all together, you get:
import glob
import csv
from datetime import datetime
with open("combined_files_csv", "wb") as outfile:
writer = csv.writer(outfile, delimiter=",")
for csvfilename in glob.glob(r"C:\mydirectory\*.csv"):
with open(csvfilename, "rb") as infile:
reader = csv.reader(infile, delimiter=",")
for row in reader:
startdate = datetime.strptime(row[3], "%Y-%m-%d")
enddate = datetime.strptime(row[2], "%Y-%m-%d")
delta = enddate - startdate # difference in days
row[0:0] = [str(delta)]
row.append(csvfilename)
writer.writerow(row)
The basic outline of the program is going to be like this:
- Use the os module to get the filenames out of the directory/directories of interest
- Read in each file one at a time
- For each line in the file, split it into columns with
columns = line.split(",")
- Use datetime.date to convert strings like "2011-05-03" to datetime.dates.
- Subtract the third date from the second, which yields a datetime.timedelta.
- Put all your information in the format you want (hint: str(foo) yields a string representation of foo, for just about any type) and remember it for later
- Close your file, reopen it for writing, and write your new stuff in
精彩评论