saving and loading all environments in R
I am developing a package to perform distributed computing in R (rmr under the RHadoop project on github). I am trying to make things as transparent as possible to the user and simply have the computation continue in another interpreter on some other machine as if it were on the same machine. Something like
lapply(my.list, my.function)
where each call to my.function can in principle happen on a different node in a cluster, hence in a separate interpreter. I am using the pair save and load with some success, but I would like to have a solution that works under all possible circumstances, not just in a large set of use cases.
No matter what my.function does, no matter where it is defined, no matter what other objects and packages it refers to, I would like to be sure that if it works locally, it also works remotely, including loading the necessary packages and everything. save and load respectively save a list of objects from, and load a file into, a specific environment. I would like to find or write something that saves and loads all the necessary objects from and to the necessary environments, so that evaluating my.function on each of the elements of my.list will have the same semantics locally and remotely.
Has this been done before, any packages I should check out, any other suggestions? I think this is the single hardest technical issue in rmr and you would be contributing your solution to an OSS project.
Typically save and load should work just as you want: when a function is saved (actually, it's a "closure" that gets saved), the environment where it was defined is also saved. If that function was defined as part of a package, a reference to that package is saved instead, and the package is loaded back in again when load sees the reference. (You get a warning when saving if the package did not have a namespace.)
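A quick way to see the package case in action, as a small sketch rather than anything rmr-specific: serialize a function that lives in a package namespace (stats::sd is used here purely as an example) and check that the copy points at the very same namespace instead of a duplicated environment.

```r
# A closure defined in a package namespace (stats::sd is just an example).
f <- stats::sd

bytes <- serialize(f, NULL)   # only a reference to the namespace is written
g <- unserialize(bytes)       # the namespace is looked up (and loaded if needed)

# The restored closure shares the real namespace; nothing was copied.
identical(environment(g), asNamespace("stats"))  # TRUE
```

On a remote node this only works if the same package is installed there, since the stream carries the reference and not the package contents.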
The only problem should be the global environment. There, a reference is also saved but this will not save all the variables in the global environment, so you'd have to save them explicitly.
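One possible way to save those variables explicitly is sketched below. It assumes the recommended codetools package is available, and collect_globals, scale.factor, remote.env and remote.fun are hypothetical names made up for the illustration. findGlobals is a static analysis, so it can miss variables reached through get(), eval() and the like, and it does not descend into other functions that my.function calls.

```r
library(codetools)

# Hypothetical helper: find the free variables of `fun` that are defined
# directly in `env` (by default the environment where `fun` was defined,
# which is the global environment for functions typed at top level).
collect_globals <- function(fun, env = environment(fun)) {
  candidates <- findGlobals(fun, merge = TRUE)
  present <- Filter(function(n) exists(n, envir = env, inherits = FALSE),
                    candidates)
  mget(present, envir = env)
}

scale.factor <- 10
my.function <- function(x) x * scale.factor

globals <- collect_globals(my.function)
payload <- serialize(list(fun = my.function, globals = globals), NULL)

# "Remote" side, simulated in the same session: rebuild the variables in a
# fresh environment and attach the function to it.
received <- unserialize(payload)
remote.env <- list2env(received$globals, envir = new.env())
remote.fun <- received$fun
environment(remote.fun) <- remote.env
remote.fun(4)  # 40
```

Attaching the function to a fresh environment, rather than writing into the remote global environment, is a design choice of this sketch; it keeps one task's variables from leaking into the next.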
Other environments are saved including their content, and then recursively the parent environment is also saved (unless it's a package or globalenv as described above).
Note that the alternatives saveRDS and serialize provide a little more control: you can supply a refhook function that is called whenever an environment is saved. You then do whatever you want to store the environment and return a string id. When loading, a similar refhook is called to recreate the environment from that string id. However, the hook is still not called for the global environment.
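Here is a minimal sketch of the refhook idea. The names store, out_hook and in_hook are made up for the example, and the in-memory registry only stands in for whatever shared storage a real cluster would use; it obviously works only within one R session.

```r
# Hypothetical registry mapping string ids to environments.
store <- new.env()

# Called while serializing, once per reference object (e.g. an environment).
# Returning a character id replaces the object in the stream;
# returning NULL lets it be serialized the usual way.
out_hook <- function(obj) {
  if (is.environment(obj)) {
    id <- paste0("env-", length(ls(store)) + 1)
    assign(id, obj, envir = store)
    id
  } else {
    NULL
  }
}

# Called while unserializing with the id; must return the object to use.
in_hook <- function(id) get(id, envir = store)

e <- new.env()
e$foo <- 42
f <- local(function() foo, e)

bytes <- serialize(f, NULL, refhook = out_hook)
g <- unserialize(bytes, refhook = in_hook)

identical(environment(g), e)  # TRUE: the hook handed back the original environment
g()                           # 42
```

Because the hook replaces the environment entirely, its contents and its parent chain are not written to the stream; storing them is the hook's job.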
e <- new.env() # parent is global env
e$foo <- 42
ee <- new.env(parent=e)
ee$bar <- 13
f <- local(function() foo+bar, ee)
f() # foo+bar = 55
b <- serialize(f, NULL) # Gives you the serialized bytes
g <- unserialize(b) # Loads from the bytes
g() # 55
# It created new environments...
!identical(environment(g), environment(f))
Hope this helps a bit.
Good luck with rmr!
After thinking about this question a bit further, it seems that the answers may be useful to your problem. If you are having some of the same problems in saving environments as the OP, then Gabor's answer is probably going to help you get on track. However, if basic serialization and saving of environments is the problem, my (admittedly less sophisticated) answer might help: convert to lists via as.list() and then serialize that in the usual way, or consider serialization via JSON; my favorite package for that is RJSONIO.
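For the as.list() route, a minimal sketch using only base R might look like this (the variable names are illustrative). Note that it flattens a single environment: the parent chain is not carried along, so anything the code expects to find in enclosing environments has to be shipped separately.

```r
e <- new.env()
e$foo <- 42
e$bar <- "text"

# Snapshot the environment as a plain list; all.names = TRUE also keeps
# objects whose names start with a dot.
snapshot <- as.list(e, all.names = TRUE)
bytes <- serialize(snapshot, NULL)

# On the receiving side, rebuild an environment from the list.
restored <- list2env(unserialize(bytes), envir = new.env())
restored$foo  # 42
```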
Tommy's answer, however, is much more informative about what's going on. Assuming you will be investigating these issues extensively, especially their serialization, I also recommend looking at Tommy's other excellent insights in this answer to a question on environments, closures, and frames.