Fastest way to read comma separated files (including datetimes) in python
I have data stored in comma delimited txt files. One of the columns represents a datetime.
I need to load each column into separate numpy arrays (and decode the date into a python datetime object).
What is the fastest way to do this (in terms of run time)?
NB. the files are several hundred MB of data and currently take several minutes to load in.
e.g. mydata.txt
15,3,0,2003-01-01 00:00:00,12.2
15,4.5,0,2003-01-01 00:00:00,13.7
15,6,0,2003-01-01 00:00:00,18.4
15,7.5,0,2003-01-01 00:00:00,17.9
15,9,0,2003-01-01 00:00:00,17.7
15,10.5,0,2003-01-01 00:00:00,16.3
15,12,0,2003-01-01 00:00:00,17.2
Here is my current code (it works, but is slow):
import csv
import datetime
import time
import numpy
a=[]
b=[]
c=[]
d=[]
timestmp=[]
myfile = open('mydata.txt',"r")
# Read in the data
csv_reader = csv.reader(myfile)
for row in csv_reader:
    a.append(row[0])
    b.append(row[1])
    c.append(row[2])
    timestmp.append(row[3])
    d.append(row[4])
a = numpy.array(a)
b = numpy.array(b)
c = numpy.array(c)
d = numpy.array(d)
# Convert Time string list into list of Python datetime objects
times = []
time_format = "%Y-%m-%d %H:%M:%S"
for i in xrange(len(timestmp)):
    times.append(datetime.datetime.fromtimestamp(time.mktime(time.strptime(timestmp[i], time_format))))
Is there a more efficient way to do this?
Any help is very much appreciated -thanks!
(edit: In the end the bottleneck turned out to be with the datetime conversion, and not reading the file as I originally assumed.)
First, you should run your sample script with Python's built-in profiler to see where the problem actually might be. You can do this from the command-line:
python -m cProfile myscript.py
Secondly, what jumps out at me at least: why is that loop at the bottom necessary? Is there a technical reason it can't be done while reading mydata.txt, in the loop you already have above the instantiation of the numpy arrays?
Thirdly, you should create the datetime objects directly, since datetime also supports strptime. You don't need to build a time struct, convert it to a timestamp, and then build a datetime from that timestamp. Your loop at the bottom can simply be rewritten like this:
times = []
timestamps = []
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"
for t in timestmp:
    parsed_time = datetime.datetime.strptime(t, TIME_FORMAT)
    times.append(parsed_time)
    timestamps.append(time.mktime(parsed_time.timetuple()))
I took the liberty of PEP-8ing your code a bit, such as changing your constant to all caps. Also, you can iterate over a list directly with the in operator rather than indexing into it with xrange.
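To illustrate the second point, here is a rough sketch (not tested against your data) of how the parsing could be folded into the read loop itself, assuming the numeric columns all convert cleanly to float:
import csv
import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

a, b, c, d, times = [], [], [], [], []
with open('mydata.txt') as myfile:
    for row in csv.reader(myfile):
        a.append(float(row[0]))
        b.append(float(row[1]))
        c.append(float(row[2]))
        # Parse the datetime while reading, so no second pass is needed
        times.append(datetime.datetime.strptime(row[3], TIME_FORMAT))
        d.append(float(row[4]))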
Try numpy.loadtxt(); the docstring has a good example.
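For instance, something along these lines should work (a sketch, not tested; here the timestamp column is converted to a POSIX timestamp float by a converter so everything fits into plain float arrays):
import time
import numpy

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

# Column 3 is turned into a float timestamp by the converter
data = numpy.loadtxt('mydata.txt', delimiter=',',
                     converters={3: lambda s: time.mktime(time.strptime(s, TIME_FORMAT))})
a, b, c, timestamps, d = data.T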
You can also try passing copy=False when calling numpy.array, since the default behavior is to copy the data; this can speed up the script (especially since you said it processes a lot of data).
npa = numpy.array(ar, copy=False)
If you follow Mahmoud Abdelkader's advice and use the profiler, and find out that the bottleneck is in the csv loader, you could always try replacing your csv_reader loop with this:
for line in open("ProgToDo.txt"):
row = line.split(',')
a.append(int(row[0]))
b.append(int(row[1]))
c.append(int(row[2]))
timestmp.append(row[3])
d.append(float(row[4]))
More probable, I think, is that you have a lot of data conversions. Especially the last loop for the time conversion will take a long time if you have millions of conversions to do! If you manage to do it all in one step (read + convert), and also take Terseus's advice on not copying the lists into duplicate numpy arrays, you will reduce execution time.
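One way to avoid the list-to-array copies entirely (a rough, untested sketch, assuming the row count can be obtained cheaply with an extra pass over the file) is to fill preallocated numpy arrays while reading and converting:
import datetime
import numpy

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

# First pass: count the rows so the arrays can be preallocated
n = sum(1 for _ in open("mydata.txt"))

a = numpy.empty(n)
b = numpy.empty(n)
c = numpy.empty(n)
d = numpy.empty(n)
times = []

# Second pass: read, convert, and store in one step
for i, line in enumerate(open("mydata.txt")):
    row = line.split(',')
    a[i] = float(row[0])
    b[i] = float(row[1])
    c[i] = float(row[2])
    times.append(datetime.datetime.strptime(row[3], TIME_FORMAT))
    d[i] = float(row[4])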
I'm not completely sure if this will help, but you may be able to speed up the reading of the file by using ast.literal_eval. For example:
from ast import literal_eval
myfile = open('mydata.txt',"r")
mylist = []
for line in myfile:
    line = line.strip()
    e = line.rindex(",")
    # The 19-character timestamp sits just before the last comma; quoting it
    # lets literal_eval parse the whole row as a Python list in one call
    row = literal_eval('[%s"%s"%s]' % (line[:e-19], line[e-19:e], line[e:]))
    mylist.append(row)
a, b, c, timestamp, d = zip(*mylist)
# a, b, c, timestamp, and d are what they were after your csv_reader loop
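Note that with this approach timestamp is still a tuple of strings, so if datetime objects are needed a strptime pass is still required afterwards, along the lines of this list comprehension (which assumes the same format string as the question):
import datetime
times = [datetime.datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in timestamp]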