Importing csv into Numpy datetime64
I am trying out the latest version of numpy 2.0 dev:
np.__version__
Out[44]: '2.0.0.dev-aded70c'
I am trying to import CSV data that looks like this:
date,system,pumping,rgt,agt,sps,eskom_import,temperature,wind,pressure,weather
2007-01-01 00:30,481.9,,,,,481.9,15,SW,1040,Fine
2007-01-01 01:00,471.9,,,,,471.9,15,SW,1040,Fine
2007-01-01 01:30,455.9,,,,,455.9,,,,
etc.
by using the following code:
convertdict = {0: lambda s: np.datetime64(s, 'm'),
               1: lambda s: float(s or 0), 2: lambda s: float(s or 0),
               3: lambda s: float(s or 0), 4: lambda s: float(s or 0),
               5: lambda s: float(s or 0), 6: lambda s: float(s or 0),
               7: lambda s: float(s or 0),
               8: str, 9: str, 10: str}
dt = [('date', np.datetime64), ('system', float), ('pumping', float),
      ('rgt', float), ('agt', float), ('sps', float),
      ('eskom_import', float), ('temperature', float), ('wind', str),
      ('pressure', float), ('weather', str)]
a = np.recfromcsv(fp, dtype=dt, converters=convertdict, usecols=range(0, 11),
                  names=True)
The dtype it generates for a.date is 'object':
array([2007-01-01T00:30+0200, 2007-01-01T01:00+0200, 2007-01-01T01:30+0200,
..., 2007-12-31T23:00+0200, 2007-12-31T23:30+0200,
2008-01-01T00:00+0200], dtype=object)
But I need it to be datetime64, like in this example (but including hours and minutes):
array(['2011-07-11', '2011-07-12', '2011-07-13', '2011-07-14',
'2011-07-15', '2011-07-16', '2011-07-17'], dtype='datetime64[D]')
It seems that the CSV import creates an embedded object data type for 'date' rather than a datetime64 data type. Any ideas on how to fix this?
Grové
I think the trick to avoiding the generic 'object' dtype is to not use the recfromcsv function at all. Reading the data file manually and parsing each field yourself yields the requested dtype='datetime64[m]':
import numpy as np

dt = np.dtype([('date',         '<M8[m]'),
               ('system',       '<f8'),
               ('pumping',      '<f8'),
               ('rgt',          '<f8'),
               ('agt',          '<f8'),
               ('sps',          '<f8'),
               ('eskom_import', '<f8'),
               ('temperature',  '<f8'),
               ('wind',         'U8'),     # fixed-width string; a bare str field has zero width
               ('pressure',     '<f8'),
               ('weather',      'U16')])   # widths are a guess, adjust to your data

def to_float(s):
    # Empty CSV fields become 0.0, as in the question's convertdict
    return float(s or 0)

# Count the data rows first so the array can be preallocated
# (assumes the file has one header line and no blank lines)
with open('data.csv') as fid:
    numlines = sum(1 for line in fid) - 1

data = np.zeros(numlines, dtype=dt)

with open('data.csv') as fid:
    fieldnames = fid.readline().strip().split(',')   # header
    for count, line in enumerate(fid):
        parsedline = line.strip().split(',')
        # 'T' separator gives an ISO 8601 string; parse at minute resolution
        data['date'][count] = np.datetime64(parsedline[0].replace(' ', 'T'), 'm')
        data['system'][count] = to_float(parsedline[1])
        data['pumping'][count] = to_float(parsedline[2])
        data['rgt'][count] = to_float(parsedline[3])
        data['agt'][count] = to_float(parsedline[4])
        data['sps'][count] = to_float(parsedline[5])
        data['eskom_import'][count] = to_float(parsedline[6])
        data['temperature'][count] = to_float(parsedline[7])
        data['wind'][count] = parsedline[8]
        data['pressure'][count] = to_float(parsedline[9])
        data['weather'][count] = parsedline[10]
>>> data['date']
array(['2007-01-01T00:30-0500', '2007-01-01T01:00-0500',
'2007-01-01T00:30-0500', '2007-01-01T01:00-0500',
'2007-01-01T00:30-0500', '2007-01-01T01:00-0500',
'2007-01-01T00:30-0500', '2007-01-01T01:00-0500'], dtype='datetime64[m]')
You could definitely improve on this code by making use of your "convertdict" and iterating over the fields of parsedline, but the idea is the same.
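For what it's worth, here is a rough sketch of that converter-driven variant. It is only one way to do it: the names FIELDS, to_minutes, to_float and load are made up for illustration, the string widths ('U8', 'U16') are guesses, and the sample rows are copied from the question so the snippet runs without a file on disk.

```python
import io
import numpy as np

# Sample rows copied from the question; a real script would pass an open file instead.
SAMPLE = """date,system,pumping,rgt,agt,sps,eskom_import,temperature,wind,pressure,weather
2007-01-01 00:30,481.9,,,,,481.9,15,SW,1040,Fine
2007-01-01 01:00,471.9,,,,,471.9,15,SW,1040,Fine
2007-01-01 01:30,455.9,,,,,455.9,,,,
"""

def to_minutes(s):
    # Use an ISO 8601 'T' separator and parse at minute resolution.
    return np.datetime64(s.replace(' ', 'T'), 'm')

def to_float(s):
    # Empty fields become 0.0, mirroring the question's convertdict.
    return float(s or 0)

# One (name, dtype, converter) triple per CSV column, in column order.
# The string widths are assumptions, not something the data dictates.
FIELDS = [
    ('date',         'M8[m]', to_minutes),
    ('system',       'f8',    to_float),
    ('pumping',      'f8',    to_float),
    ('rgt',          'f8',    to_float),
    ('agt',          'f8',    to_float),
    ('sps',          'f8',    to_float),
    ('eskom_import', 'f8',    to_float),
    ('temperature',  'f8',    to_float),
    ('wind',         'U8',    str),
    ('pressure',     'f8',    to_float),
    ('weather',      'U16',   str),
]

def load(lines):
    """Parse an iterable of CSV lines (header first) into a structured array."""
    dt = np.dtype([(name, kind) for name, kind, _ in FIELDS])
    lines = iter(lines)
    next(lines)  # skip the header row
    rows = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        values = line.rstrip('\n').split(',')
        # Apply each column's converter to the matching field.
        rows.append(tuple(conv(v) for (_, _, conv), v in zip(FIELDS, values)))
    return np.array(rows, dtype=dt)

data = load(io.StringIO(SAMPLE))
print(data['date'].dtype)
```

Collecting the rows in a list and converting once at the end means you do not need to know the number of lines up front; for a very large file, preallocating as in the code above may use less memory.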