genfromtxt dtype=None returns wrong shape
I'm a newcomer to numpy, and am having a hard time reading CSVs into a numpy array with genfromtxt.
I found a CSV file on the web that I'm using as an example. It's a mixture of floats and strings. It's here: http://pastebin.com/fMdRjRMv
I'm using numpy via pylab (initializing on an Ubuntu system via: ipython -pylab). numpy.version.version is 1.3.0.
Here's what I do:
Example #1:
data = genfromtxt("fMdRjRMv.txt", delimiter=',', dtype=None)
data.shape
(374, 15)
data[10,10] ## Take a look at an example element
'30'
type(data[10,10])
<type 'numpy.string_'>
There are no errant quotation marks in the CSV file, so I've no idea why it should think that the number is a string. Does anyone know why this is the case?
Example #2 (skipping the first row):
data = genfromtxt("fMdRjRMv.txt", delimiter=',', dtype=None, skiprows=1)
data.shape
(373,)
Does anyone know why it would read all of this into a 1-dimensional array?
Thanks so much!
In your example #1, the problem is that all the values in a single column must share the same datatype. Since the first line of your data file has the column names, this means that the datatype of every column is string.
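To see this effect in isolation, here is a minimal sketch. It assumes a recent NumPy (the encoding argument does not exist in 1.3.0) and uses a small made-up CSV in place of the pastebin file; the column names and values are purely illustrative.

```python
import io
import numpy as np

# Hypothetical stand-in for the CSV in the question: a header row, then mixed columns.
csv_text = "Name,Age,Score\nAlice,30,1.5\nBob,25,2.5\nCarol,41,3.0\n"

# The header line is parsed as data, so "Age" and "Score" force their columns to strings.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", dtype=None, encoding="utf-8")

print(data.shape)        # 2-D, because every column ended up with the same string type
print(data.dtype)        # a single string dtype for the whole array
print(repr(data[1, 1]))  # the age comes back as text, not as an integer
```

Because every column ends up as a string, genfromtxt can return an ordinary 2-D array, which is why example #1 has a shape like (374, 15).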
You have the right idea in example #2 of skipping the first row. Note however that 1.3.0 is a rather old version (I have 1.6.1). In newer versions skiprows is deprecated and you should use skip_header instead.
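As a rough illustration of skip_header, the sketch below again uses a small made-up CSV rather than the pastebin file, and assumes a recent NumPy that accepts the encoding argument.

```python
import io
import numpy as np

# Hypothetical stand-in for the CSV in the question.
csv_text = "Name,Age,Score\nAlice,30,1.5\nBob,25,2.5\nCarol,41,3.0\n"

# skip_header=1 drops the line of column names before type detection runs.
data = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=",",
    dtype=None,
    skip_header=1,
    encoding="utf-8",
)

print(data.shape)        # one entry per data row
print(data.dtype.names)  # auto-generated field names
print(data["f1"])        # the second column, now detected as integers
```

With the header out of the way, each column gets its own detected type instead of falling back to strings.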
The reason that the shape of the array is (373,) is that it is a structured array (see http://docs.scipy.org/doc/numpy/user/basics.rec.html), which is what numpy uses to represent inhomogeneous data. So data[10] gives you an entire row of your table. You can also access the data columns by name, for example data['f10']. You can find the names of the columns in data.dtype.names. It is also possible to use the original column names that are defined in the first line of your data file:
data = genfromtxt("fMdRjRMv.txt", dtype=None, delimiter=',', names=True)
Then you can access a column like data['Age'].
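Putting row and column access together, here is a small sketch on made-up data. The column names Name, Age and Score are assumptions for illustration only (the real file defines its own), and the encoding argument assumes a recent NumPy.

```python
import io
import numpy as np

# Hypothetical stand-in for the CSV in the question.
csv_text = "Name,Age,Score\nAlice,30,1.5\nBob,25,2.5\nCarol,41,3.0\n"

# names=True takes the field names from the first line of the file.
data = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=",",
    dtype=None,
    names=True,
    encoding="utf-8",
)

row = data[1]       # indexing with an integer gives one whole record
ages = data["Age"]  # indexing with a field name gives one whole column

print(data.dtype.names)
print(row)
print(ages.mean())
```

Each column extracted this way is a regular 1-D array of a single type, so the usual numeric operations work on it.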