rpy2: Converting a data.frame to a numpy array

I have a data.frame in R. It contains a lot of data: gene expression levels from many (125) arrays. I'd like the data in Python, due mostly to my incompetence in R and the fact that this was supposed to be a 30-minute job.

I would like the following code to work. To understand this code, know that the variable path contains the full path to my data set which, when loaded, gives me a variable called immgen. Know that immgen is an object (a Bioconductor ExpressionSet object) and that exprs(immgen) returns a data frame with 125 columns (experiments) and tens of thousands of rows (named genes). (Just in case it's not clear, this is Python code, using robjects.r to call R code)

import numpy as np
import rpy2.robjects as robjects
# ... some code to build path
robjects.r("load('%s')"%path) # loads immgen
e = robjects.r['data.frame']("exprs(immgen)")
expression_data = np.array(e)

This code runs, but expression_data is simply array([[1]]).

I'm pretty sure that e doesn't represent the data frame generated by exprs() due to things like:

In [40]: e._get_ncol()
Out[40]: 1

In [41]: e._get_nrow()
Out[41]: 1

But then again who knows? Even if e did represent my data.frame, that it doesn't convert straight to an array would be fair enough - a data frame has more in it than an array (rownames and colnames) and so maybe life shouldn't be this easy. However I still can't work out how to perform the conversion. The documentation is a bit too terse for me, though my limited understanding of the headings in the docs implies that this should be possible.

Anyone any thoughts?


This is the most straightforward and reliable way I've found to transfer a data frame from R to Python.

To begin with, I think exchanging the data through the R bindings is an unnecessary complication. R provides a simple method to export data; likewise, NumPy has decent methods for data import. The file format is the only common interface required here.

# in an R session
data(iris)
iris$Species = unclass(iris$Species)

write.table(iris, file="/path/to/my/file/np_iris.txt", row.names=F, sep=",")

# now start a python session
import numpy as NP

fpath = "/path/to/my/file/np_iris.txt"

A = NP.loadtxt(fpath, comments="#", delimiter=",", skiprows=1)

# print(type(A))
# returns: <type 'numpy.ndarray'>

print(A.shape)
# returns: (150, 5)

print(A[1:5,])
# returns:
# [[ 4.9  3.   1.4  0.2  1. ]
#  [ 4.7  3.2  1.3  0.2  1. ]
#  [ 4.6  3.1  1.5  0.2  1. ]
#  [ 5.   3.6  1.4  0.2  1. ]]

According to the documentation (and my own experience, for what it's worth), loadtxt is the preferred method for conventional data import.

You can also tell loadtxt the data type of each column through the dtype argument, by giving it a structured dtype with one field per column. Notice 'skiprows=1', which skips the first line of the file (the column headers); skiprows is a count of lines to skip, not a row index.
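As a rough illustration of per-column types, here is a minimal sketch. The three-row sample file, the field names, and the variable names are all made up for the example; only loadtxt and its dtype, delimiter and skiprows arguments come from NumPy itself.

```python
import os
import tempfile

import numpy as np

# Hypothetical three-row sample in the layout write.table produces:
# a quoted header line followed by comma-separated values.
csv_text = (
    '"Sepal.Length","Sepal.Width","Species"\n'
    "5.1,3.5,1\n"
    "4.9,3.0,1\n"
    "6.3,3.3,3\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
    fh.write(csv_text)
    sample_path = fh.name

# One field per column: two floats and an integer factor code.
col_types = np.dtype(
    [("sepal_length", "f8"), ("sepal_width", "f8"), ("species", "i4")]
)

# skiprows=1 skips the header line; the result is a 1-D structured array
# with one record per row, not a 2-D float array.
B = np.loadtxt(sample_path, delimiter=",", skiprows=1, dtype=col_types)
os.remove(sample_path)

print(B["species"])  # access a column by field name
```

With a structured dtype, columns are selected by name (B["species"]) rather than by position, which is one way to keep an integer column from being stored as floats.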

Finally, I converted the data frame's factor column to integer (which is the underlying data type of a factor) prior to exporting; 'unclass' is probably the easiest way to do this.
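If the labels are needed again on the Python side, the integer codes can be mapped back by indexing. This is only a sketch: the codes array is a made-up stand-in for the fifth column of the loaded data, and the level order is assumed to be the one R uses for iris$Species.

```python
import numpy as np

# Level order assumed to match levels(iris$Species) in R.
levels = np.array(["setosa", "versicolor", "virginica"])

# Hypothetical factor codes, as they would appear in the exported column.
codes = np.array([1, 1, 2, 3])

# R factor codes start at 1, NumPy indices at 0, hence the subtraction.
labels = levels[codes - 1]
print(labels)
```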

If you have big data (i.e., you don't want to load the entire data file into memory but still need to access it), NumPy's memory-mapped data structure ('memmap') is a good choice:

from tempfile import mkdtemp
import os.path as path

filename = path.join(mkdtemp(), 'tempfile.dat')

# now create a memory-mapped file with shape and data type
# based on the original R data frame
# (mode="w+" creates or overwrites the file, so point it at the new
# temporary file, not at the text file exported from R)
A_mm = NP.memmap(filename, dtype="float32", mode="w+", shape=(150, 5))

# write data to the memmap array (actually an array-like memory map to
# the data stored on disk); here the source is the array A loaded above
A_mm[:] = A[:]

# 'flush' writes to disk any changes you make to the array;
# deleting the memmap object closes the file
A_mm.flush()
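The memory-mapped file stores only raw values, so the dtype and shape have to be supplied again when it is reopened. A minimal round-trip sketch, using a small made-up array in place of the real data (all names here are illustrative):

```python
import os
import tempfile

import numpy as np

# Stand-in data; in practice this would be the array loaded from the R export.
data = np.arange(12, dtype="float32").reshape(4, 3)
mm_path = os.path.join(tempfile.mkdtemp(), "sample.dat")

# Create the file, copy the data in, and flush it to disk.
writer = np.memmap(mm_path, dtype="float32", mode="w+", shape=data.shape)
writer[:] = data
writer.flush()
del writer  # deleting the object closes the underlying file

# Reopen read-only; dtype and shape must match what was written.
reader = np.memmap(mm_path, dtype="float32", mode="r", shape=data.shape)

# Only the pages that are actually touched get read from disk.
row_sum = float(reader[1].sum())
print(row_sum)
```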


Why go through a data.frame when 'exprs(immgen)' returns a /matrix/ and your end goal is to have your data in a matrix?

Passing the matrix to numpy is straightforward (and can even be done without making a copy): http://rpy.sourceforge.net/rpy2/doc-2.1/html/numpy.html#from-rpy2-to-numpy

This should beat, in both simplicity and efficiency, the suggestion of exchanging numerical data as text in flat files.

You seem to be working with Bioconductor classes, and might be interested in the following: http://pypi.python.org/pypi/rpy2-bioconductor-extensions/
