rpy2: Converting a data.frame to a numpy array

2022-12-27 04:17 问答作者：

I have a data.frame in R. It contains a lot of data : gene expression levels from many (125) arrays. I'd like the data in Python, due mostly to my incompetence in R and the fact that this was supposed to be a 30 minute job.

I would like the following code to work. To understand this code, know that the variable path contains the full path to my data set which, when loaded, gives me a variable called immgen. Know that immgen is an object (a Bioconductor ExpressionSet object) and that exprs(immgen) returns a data frame with 125 columns (experiments) and tens of thousands of rows (named genes). (Just in case it's not clear, this is Python code, using robjects.r to call R code)

import numpy as np
import rpy2.robjects as robjects
# ... some code to build path
robjects.r("load('%s')"%path) # loads immgen
e = robjects.r['data.frame']("exprs(immgen)")
expression_data = np.array(e)

This code runs, but expression_data is simply array([[1]]).

I'm pretty sure that e doesn't represent the data frame generated by exprs() due to things like:

In [40]: e._get_ncol()
Out[40]: 1

In [41]: e._get_nrow()
Out[41]: 1

But then again who knows? Even if e did rep开发者_开发知识库resent my data.frame, that it doesn't convert straight to an array would be fair enough - a data frame has more in it than an array (rownames and colnames) and so maybe life shouldn't be this easy. However I still can't work out how to perform the conversion. The documentation is a bit too terse for me, though my limited understanding of the headings in the docs implies that this should be possible.

Anyone any thoughts?

This is the most straightforward and reliable way i've found to to transfer a data frame from R to Python.

To begin with, I think exchanging the data through the R bindings is an unnecessary complication. R provides a simple method to export data, likewise, NumPy has decent methods for data import. The file format is the only common interface required here.

data(iris)
iris$Species = unclass(iris$Species)

write.table(iris, file="/path/to/my/file/np_iris.txt", row.names=F, sep=",")

# now start a python session
import numpy as NP

fpath = "/path/to/my/file/np_iris.txt"

A = NP.loadtxt(fpath, comments="#", delimiter=",", skiprows=1)

# print(type(A))
# returns: <type 'numpy.ndarray'>

print(A.shape)
# returns: (150, 5)

print(A[1:5,])
# returns: 
 [[ 4.9  3.   1.4  0.2  1. ]
  [ 4.7  3.2  1.3  0.2  1. ]
  [ 4.6  3.1  1.5  0.2  1. ]
  [ 5.   3.6  1.4  0.2  1. ]]

According to the Documentation (and my own experience for what it's worth) loadtxt is the preferred method for conventional data import.

You can also pass in to loadtxt a tuple of data types (the argument is dtypes), one item in the tuple for each column. Notice 'skiprows=1' to step over the column headers (for loadtxt rows are indexed from 1, columns from 0).

Finally, i converted the dataframe factor to integer (which is actually the underlying data type for factor) prior to exporting--'unclass' is probably the easiest way to do this.

If you have big data (ie, don't want to load the entire data file into memory but still need to access it) NumPy's memory-mapped data structure ('memmap') is a good choice:

from tempfile import mkdtemp
import os.path as path

filename = path.join(mkdtemp(), 'tempfile.dat')

# now create a memory-mapped file with shape and data type 
# based on original R data frame:
A = NP.memmap(fpath, dtype="float32", mode="w+", shape=(150, 5))

# methods are ' flush' (writes to disk any changes you make to the array), and 'close'
# to write data to the memmap array (acdtually an array-like memory-map to 
# the data stored on disk)
A[:] = somedata[:]

Why going through a data.frame when 'exprs(immgen)' returns a /matrix/ and your end goal is to have your data in a matrix ?

Passing the matrix to numpy is straightforward (and can even be made without making a copy): http://rpy.sourceforge.net/rpy2/doc-2.1/html/numpy.html#from-rpy2-to-numpy

This should beat in both simplicity and efficiency the suggestion of going through text representation of numerical data in flat files as a way to exchange data.

You seem to be working with bioconductor classes, and might be interested in the following: http://pypi.python.org/pypi/rpy2-bioconductor-extensions/

继续阅读：bioconductor numpy python r rpy2

rpy2: Converting a data.frame to a numpy array

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？