How can I pass large arrays between numpy and R?
I'm using Python and numpy/scipy to do regex matching and stemming for a text-processing application, but I'd also like to use some of R's statistical packages.
What's the best way to pass the data from python to R? (And back?)
Also, I need to back up the array to disk at some point, so I'm open to saving from Python and loading it in R if that's the best solution. The matrices are pretty big (e.g. 100,000 x 10,000), so using sparse matrices might also be nice.
Apologies if this is a repost. I haven't been able to find anything that puts all these pieces together.
Have you already looked into RPy? It's a python interface to R. I guess that would spare you the data handling.
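For example, here's a minimal sketch of pushing a NumPy array into R with rpy2 (the maintained successor to RPy); the exact conversion API differs between rpy2 versions, so treat this as a starting point rather than the canonical recipe:

```python
# Sketch only: assumes rpy2 2.x-style numpy2ri activation; newer rpy2
# releases prefer explicit conversion contexts instead.
import numpy as np
import rpy2.robjects as ro
from rpy2.robjects import numpy2ri

numpy2ri.activate()                 # let numpy arrays cross into R automatically

x = np.random.rand(1000, 50)        # stand-in for your term matrix
r_colmeans = ro.r['colMeans']       # look up an R function by name
means = r_colmeans(x)               # x arrives in R as a numeric matrix
print(np.asarray(means)[:5])        # and the result comes back as a numpy array
```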
To back up your NumPy arrays you can use pickle, but since it seems to create a lot of overhead for huge data, large NumPy arrays are better saved using the HDF5 standard. Here's an article covering that: http://www.shocksolution.com/2010/01/10/storing-large-numpy-arrays-on-disk-python-pickle-vs-hdf5adsf/
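A minimal sketch of the HDF5 approach, assuming h5py (the linked article may use PyTables, which is another common binding); the resulting file can then be opened on the R side with a package such as rhdf5:

```python
# Sketch only: h5py is assumed here; PyTables would work similarly.
import numpy as np
import h5py

data = np.random.rand(1000, 100)    # stand-in for the real matrix

# write the array once, with compression to keep the file size down
with h5py.File('term_matrix.h5', 'w') as f:
    f.create_dataset('term_matrix', data=data, compression='gzip')

# read it back later (or open the same file from R)
with h5py.File('term_matrix.h5', 'r') as f:
    restored = f['term_matrix'][:]
```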
Use Rpy, http://rpy.sourceforge.net/, to call R from Python.
The caveat is that the R and Python versions need to be exactly the ones for which the Rpy binary was built, so you need to be careful with the installation.
I cannot comment on sharing "large data" between R and Python, but I have had a much easier time working with pyRserve than with RPy or RPy2.
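For what it's worth, a minimal pyRserve sketch looks something like the following; it assumes an Rserve instance is already running locally (e.g. started from R with `library(Rserve); Rserve()`):

```python
# Sketch only: assumes pyRserve is installed and Rserve is listening locally.
import numpy as np
import pyRserve

conn = pyRserve.connect()            # connect to the local Rserve instance
conn.r.m = np.random.rand(100, 10)   # assign a numpy array to the R variable `m`
col_means = conn.eval('colMeans(m)') # run R code and pull the result back
print(col_means)
conn.close()
```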
That being said, I am curious about the text processing you are doing. Python obviously has a lot to offer on the text processing side, and packages like NLTK and the Pattern package from CLiPS cover a fair amount of the statistics as well. Are you just more comfortable doing stats in R, or is there something specific missing in Python?