
Cache expensive operations in R

A very simple question:

I am writing and running my R scripts using a text editor to make them reproducible, as has been suggested by several members of SO.

This approach is working very well for me, but I sometimes have to perform expensive operations (e.g. read.csv or reshape on 2M-row data sets) whose results I'd rather cache in the R environment than recompute every time I run the script (which is usually many times as I progress and test new lines of code).

Is there a way to cache what a script does up to a certain point, so that each time I only run the incremental lines of code (just as I would by running R interactively)?

Thanks.


## load the file from disk only if it
## hasn't already been read into a variable
if (!exists("mytable")) {
  mytable <- read.csv(...)
}

Edit: fixed typo - thanks Dirk.


Some simple ways are doable with some combinations of

  • exists("foo") to test if a variable exists, else re-load or re-compute
  • file.info("foo.Rd")$ctime which you can compare to Sys.time() and see if it is newer than a given amount of time you can load, else recompute.
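For example, a minimal sketch combining both checks (the file name, the recompute step, and the one-hour threshold are assumptions for illustration):

## use the in-memory copy if present; otherwise load the on-disk
## cache if it is less than an hour old, else recompute and re-save
if (!exists("mydata")) {
  info <- file.info("mydata.RData")
  if (!is.na(info$ctime) &&
      difftime(Sys.time(), info$ctime, units = "hours") < 1) {
    load("mydata.RData")
  } else {
    mydata <- recompute_mydata()          # hypothetical expensive step
    save(mydata, file = "mydata.RData")
  }
}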

There are also caching packages on CRAN that may be useful.


After you do something you discover to be costly, save the results of that costly step in an R data file.

For example, if you loaded a csv into a data frame called myVeryLargeDataFrame and then created summary stats from that data frame into a df called VLDFSummary then you could do this:

save(myVeryLargeDataFrame, VLDFSummary,
  file="~/myProject/cachedData/VLDF.RData",
  compress="bzip2")

The compress option is optional; use it if you want the file written to disk to be compressed. See ?save for more details.

After you save the RData file you can comment out the slow data loading and summary steps as well as the save step and simply load the data like this:

load("~/myProject/cachedData/VLDF.RData")

This answer is not editor dependent. It works the same for Emacs, TextMate, etc. You can save to any location on your computer. I recommend keeping the slow code in your R script file, however, so you can always know where your RData file came from and be able to recreate it from the source data if needed.


(Belated answer, but I began using SO a year after this question was posted.)

This is the basic idea behind memoization (or memoisation). I've got a long list of suggestions, especially the memoise and R.cache packages, in this query.
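As a minimal sketch of the memoise package (the slow function and file name here are stand-ins):

library(memoise)

slow_summary <- function(path) {
  Sys.sleep(5)                 # stand-in for an expensive step
  summary(read.csv(path))
}
cached_summary <- memoise(slow_summary)

cached_summary("big.csv")      # slow the first time
cached_summary("big.csv")      # fast: returned from the in-memory cache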

You could also take advantage of checkpointing, which is also addressed as part of that same list.
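For checkpointing with nothing but base R, one pattern (the file and function names are hypothetical) is to persist each completed stage with saveRDS and skip it on later runs:

## skip the expensive stage if its checkpoint file already exists
if (file.exists("cache/step1.rds")) {
  step1 <- readRDS("cache/step1.rds")
} else {
  step1 <- monstrous_calculation()       # hypothetical expensive stage
  saveRDS(step1, "cache/step1.rds")
}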

I think your use case mirrors my second: "memoization of monstrous calculations". :)

Another trick I use a lot is storing data in memory-mapped files. The nice thing about this is that multiple R instances can access the same shared data, so I can have a lot of instances cracking at the same problem.
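The answer doesn't name a package, but as one assumption, the bigmemory package provides this behavior: a file-backed big.matrix lives on disk and can be attached by several R processes at once (paths and dimensions below are illustrative):

library(bigmemory)

## create the file-backed matrix once
x <- filebacked.big.matrix(nrow = 2e6, ncol = 10,
                           backingfile = "big.bin",
                           descriptorfile = "big.desc")

## in any other R session, attach the same data without copying it
y <- attach.big.matrix("big.desc")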


I want to do this too when I'm using Sweave. I'd suggest putting all of your expensive functions (loading and reshaping data) at the beginning of your code. Run that code, then save the workspace. Then, comment out the expensive functions, and load the workspace file with load(). This is, of course, riskier if you make unwanted changes to the workspace file, but in that event, you still have the code in comments if you want to start over from scratch.
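A minimal sketch of that workflow (the file names are assumed for illustration):

## expensive setup -- run once, then comment it out
# mydata <- read.csv("big.csv")
# reshaped <- reshape(mydata, ...)
# save.image("cached_workspace.RData")

## on later runs, restore the whole workspace in one step
load("cached_workspace.RData")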


Without going into too much detail, I usually follow one of three approaches:

  1. Use assign to assign a unique name to each important object throughout my execution. Then include an if(exists(...)) get(...) check at the top of each function to fetch the value or else recompute it (same as Dirk's suggestion; see the sketch after this list).
  2. Use cacheSweave with my Sweave documents. This does all the work for you of caching computations and retrieves them automatically. It's really trivial to use: just use the cacheSweave driver and add this flag to each block: <<..., cache=true>>=
  3. Use save and load to save the environment at crucial moments, again making sure that all names are unique.
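A sketch of the first approach wrapped in a small helper (the helper is my own; it relies on R's lazy argument evaluation, so expr is only evaluated on a cache miss):

## compute expr once per session, then reuse the stored copy by name
cached <- function(name, expr) {
  if (exists(name, envir = .GlobalEnv)) {
    get(name, envir = .GlobalEnv)
  } else {
    value <- expr                        # evaluated only on a cache miss
    assign(name, value, envir = .GlobalEnv)
    value
  }
}

mytable <- cached("mytable", read.csv("big.csv"))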


The 'mustashe' package is great for this kind of problem. In addition to caching results, it can also declare dependencies so that the code is re-run whenever those dependencies change.

Disclosure: I wrote this tool ('mustashe'), though I do not make any financial gains from others using it. I made it for this exact purpose for my own work and want to share it with others.

Below is a simple example. The foo variable is created and "stashed" for later. If the same code is re-run, the foo variable is loaded from disk and added to the global environment.

library(mustashe)

stash("foo", {
    foo <- some_long_running_opperation(1e3)
}
#> Stashing object.
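
For the dependency tracking mentioned above, stash() takes a depends_on argument (a sketch; the variable names are illustrative):

x <- 10

stash("bar", depends_on = "x", {
    bar <- x + some_long_running_operation(1e3)
})
#> Stashing object.

## if x later changes, re-running the stash() call recomputes bar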

The documentation has additional examples of more complex use-cases and a detailed explanation of how it works under the hood.
