
saving and loading all environments in R

I am developing a package to perform distributed computing in R (rmr under the RHadoop project on github). I am trying to make things as transparent as possible to the user and simply have the computation continue in another interpreter on some other machine as if it were on the same machine. Something like

lapply(my.list, my.function)

where each call to my.function can in principle happen on a different node in a cluster, hence a separate interpreter. I am using the pair save and load to a certain degree of success, but I would like to have a solution that works under all possible circumstances, not just in a large set of use cases.

No matter what my.function does, no matter where it is defined, no matter what other objects and packages it refers to, I would like to be sure that if it works locally, it also works remotely, including loading the necessary packages and everything. save writes a list of objects from a specific environment to a file, and load reads a file back into a specific environment. I would like to find or write something that saves and loads all the necessary objects from and to the necessary environments, so that evaluating my.function on each of the elements of my.list will have the same semantics locally and remotely.

Has this been done before, any packages I should check out, any other suggestions? I think this is the single hardest technical issue in rmr and you would be contributing your solution to an OSS project.


Typically save and load should work just as you want: when a function is saved (actually, it's a "closure" that gets saved), the environment where it was defined is also saved. If that function was defined as part of a package, a reference to that package is saved instead, and the package is loaded back in again when load sees the reference. (You get a warning when saving if the package did not have a namespace).
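As a quick check of the package case, here is a minimal sketch using stats::sd as an arbitrary example of a function defined in a package namespace. The variable names are only for illustration.

```r
# Round-trip a function that lives in a package namespace.
sd_copy <- unserialize(serialize(stats::sd, NULL))

# The namespace was stored as a reference, not copied, so the restored
# function points at the very same namespace environment.
same_env <- identical(environment(sd_copy), asNamespace("stats"))
same_env
```

On a remote machine the same reference would be resolved by loading the stats namespace there, which is why the package has to be installed on every node.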

The only problem should be the global environment. There, a reference is also saved but this will not save all the variables in the global environment, so you'd have to save them explicitly.
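One way to save the global variables explicitly could look like the sketch below. It assumes the codetools package (shipped with R as a recommended package) is available, and the helpers bundle_with_globals and restore_bundle are hypothetical names made up for this example, not part of any package. It only picks up globals that the function refers to directly, so globals used by other global functions it calls would need a recursive version.

```r
library(codetools)

# Hypothetical helper: pair a function with the global variables it refers to.
bundle_with_globals <- function(fun, envir = globalenv()) {
  nms <- findGlobals(fun, merge = TRUE)
  # Keep only the names that are really defined in the global environment.
  in_global <- vapply(nms, exists, logical(1), envir = envir, inherits = FALSE)
  list(fun = fun, globals = mget(nms[in_global], envir = envir))
}

# Hypothetical helper: recreate the globals, then hand back the function.
restore_bundle <- function(bundle, envir = globalenv()) {
  list2env(bundle$globals, envir = envir)
  bundle$fun
}

offset <- 10
add_offset <- function(x) x + offset

bytes <- serialize(bundle_with_globals(add_offset), NULL)

rm(offset)  # simulate a fresh interpreter where `offset` does not exist

remote_fun <- restore_bundle(unserialize(bytes))
result <- remote_fun(1)
result  # 11
```

This is meant to be run at top level (for example with Rscript), so that offset and add_offset really live in the global environment.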

Other environments are saved including their content, and then the parent environment is saved recursively as well (unless it's a package or globalenv, as described above).

Note that the alternatives saveRDS and serialize provide a little more control: you can supply a refhook function that is called whenever an environment is saved. You then do whatever you want to store the environment and return a string id. When loading, a similar refhook is called to recreate the environment from that string id. However, the hook is still not called for the global environment.

e <- new.env() # parent is global env
e$foo <- 42
ee <- new.env(parent=e)
ee$bar <- 13
f <- local(function() foo+bar, ee) 
f() # foo+bar = 55
b <- serialize(f, NULL) # Gives you the serialized bytes

g <- unserialize(b) # Loads from the bytes
g() # 55
# It created new environments...
!identical(environment(g), environment(f))
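For comparison, here is a rough sketch of the refhook variant mentioned above. The names registry, save_hook and load_hook are made up for this example, and the registry is just an in-memory environment in the same R session; in a real distributed setup it would have to be replaced by some storage that the remote interpreter can reach.

```r
# Registry standing in for wherever the environments would really be stored.
registry <- new.env()

# Output hook: stash the environment and return a string id in its place.
save_hook <- function(x) {
  if (!is.environment(x)) return(NULL)  # NULL means "serialize as usual"
  id <- paste0("env-", length(ls(registry)) + 1L)
  assign(id, x, envir = registry)
  id
}

# Input hook: receives the string id and returns the matching environment.
load_hook <- function(id) get(id, envir = registry)

e <- new.env()
e$foo <- 42
f <- local(function() foo + 1, e)

bytes <- serialize(f, NULL, refhook = save_hook)
g <- unserialize(bytes, refhook = load_hook)

# This time the hook handed back the original environment instead of a copy.
same_env <- identical(environment(g), environment(f))
value <- g()
value  # 43
```

Because the hooks decide what an id means, this is the place where one could ship environment contents separately, cache them per node, or skip environments the remote side already has.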

Hope this helps a bit.

Good luck with rmr!


After thinking about this question a bit further, it seems that the answers may be useful to your problem. If you are having some of the same problems in saving environments as the OP, then Gabor's answer is probably going to help you get on track. However, if basic serialization and saving of environments is the problem, my (admittedly less sophisticated) answer might help: convert the environments to lists via as.list() and then serialize those in the usual way, or consider serialization via JSON; my favorite package for that is RJSONIO.
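A minimal sketch of the as.list() route, using base R only. Note that the list keeps the objects but not the parent environment, so this only suits environments used as plain containers; the variable names are just for illustration.

```r
e <- new.env()
e$foo <- 42
e$bar <- "text"

# all.names = TRUE also keeps objects whose names start with a dot.
bytes <- serialize(as.list(e, all.names = TRUE), NULL)

# On the receiving side, turn the list back into a fresh environment.
restored <- list2env(unserialize(bytes), envir = new.env())

restored$foo  # 42
restored$bar  # "text"
```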

Tommy's answer, however, is much more informative about what's going on. Assuming you will be investigating these issues extensively, especially their serialization, I also recommend looking at Tommy's other excellent insights in this answer to a question on environments, closures, and frames.
