开发者

Python: how to do basic data manipulation like in R?

I have been working with R for several years. R is very strong in data manipulation. I'm learning python and I would like to know how to manipulate data using python. Basically my data sets are organized as data frames (e.g excel sheet). I would like to know (by example) how this kind of basi开发者_Go百科c data manipulation task can be done using python?

1. Read csv file like the following

var1, var2, var3
1, 2, 3
4, 5, 6 
7, 8, 9

2. Subset data where var2 in ('5', '8') 
3. Make a new variable --> var4 = var3 * 3
4. Transpose this data
5. Write to csv file

Your help and example is most appreciated!


I disagree with Cpfohl's comment - perhaps because I've been through this same transition myself, and it's not obvious how a naive user would be able to formulate the problem more precisely. It is actually an active development problem right now with a number of projects that have all come up with non-overlapping functionality (e.g. in the financial timeseries world, in the brain imaging world, etc.).

The short answer is that python's various libraries for dealing with tables and csv files are not as good for a beginner as those in R, which are the end result of many years of users of varying levels.

First, have a look at recarrays in numpy. This is probably the closest data structure that is in a commonly used library that is similar to a data.frame in R. In particular, you'll probably like the numpy.recfromcsv function, though it is not as robust as e.g. read.csv in R (it will have trouble with non-standard line-endings, for example).

Subsetting a recarray is easy (though creating one can seem clunky):

import numpy as np
mydata = np.array([(1.0, 2), (3.0, 4)], dtype=[('x', float), ('y', int)])
mydata = mydata.view(np.recarray)
mydata[mydata.x > 2]

Modifying the nature of a numpy array is not generally as easy as in R, but there is a nice library of functions in numpy.lib.recfunctions (which must be imported separately - it doesn't come along with a simple import numpy). In particular, check out rec_append_fields and rec_join for adding columns.

Numpy has a function numpy.savetxt that will accept a simple delimiter argument to make a csv file, but it will not print column names sadly (at least, I don't see that it does). So, while I discourage adding unnecessary libraries (since it gives less portable code), you might just use matplotlib.mlab.rec2csv (you'll find some other similar functions in that neighborhood as well - the numpy community is trying to port generally useful numeric / data manip code to numpy proper. Who knows, maybe you'll do this?).

You'll notice I didn't answer (4), because that doesn't make sense. Tables don't transpose in python or R. Arrays or matrices do. So, convert your data to an array with a uniform dtype, then just use myarray.T.

Other tools you might look at are pytables (and the related package carray), larry, datarray, pandas and tabular. In particular, datarray is looking to create a system for labelled data arrays which would serve as a foundation to other projects (and I think has developers from the larry and pandas projects as well).

Hope that helps! Dav


import csv
from itertools import izip

with open('source.csv') as f:
    reader = csv.reader(f)
    # filter data
    data = (row for row in reader if row[1].strip() in ('5', '8'))
    # make a new variable
    data = (row + [int(row[2]) * 3] for row in data)
    # transpose data
    data = izip(*data)
    # write data to a new csv file
    with open('destination.csv', 'w') as fw:
        csv.writer(fw).writerows(data)


Simple answer: use pandas

1

In [2]: df = read_csv('foo.csv', index_col=None)

In [3]: df
Out[3]: 
   var1  var2  var3
0  1     2     3   
1  4     5     6   
2  7     8     9   

2

In [4]: df[df['var2'].isin([5, 8])]
Out[4]: 
   var1  var2  var3
1  4     5     6   
2  7     8     9   

3

In [5]: df['var4'] = df['var3'] * 2

In [6]: df
Out[6]: 
   var1  var2  var3  var4
0  1     2     3     6   
1  4     5     6     12  
2  7     8     9     18  

4

In [7]: df.T
Out[7]: 
      0  1   2 
var1  1  4   7 
var2  2  5   8 
var3  3  6   9 
var4  6  12  18

6

In [8]: df.to_csv('foo2.csv')

In [9]: !cat foo2.csv
index,var1,var2,var3,var4
0,1,2,3,6
1,4,5,6,12
2,7,8,9,18


There's a module for CSV parsing in the standard library. To get a list of row each containing a list of the cells, you can use list(csv.reader(...)).

Step 2 and 3 can be written in one list comprehension: [(var1, var2, var3, var3 * 3) for var1, var2, var3 in data if var2 in ('5', '8')].

I'm not aware of anything in the standard library for transposing lists of lists. Perhaps NumPy or SciPy has something. A quick and dirty way would be zipping the rows (zip(*lists)).

Writing back to file should be as simple as constructing as csv.writer and passing it each row in a loop.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜