Multicore and memory usage in R under Ubuntu

2023-02-13 11:56 Q&A author:

I am running R on an Ubuntu workstation with 8 virtual cores and 8 GB of RAM. I was hoping to routinely use the multicore package to run jobs on all 8 cores in parallel; however, I find that the whole R process gets duplicated 8 times. Since R actually seems to use much more memory than gc() reports (by a factor of 5, even after calling gc()), even relatively mild memory usage (one 200 MB object) becomes intractably memory-heavy once duplicated 8 times. I looked into bigmemory to have the child processes share the same memory space, but it would require a major rewrite of my code because it doesn't handle data frames.

Is there a way to make R as lean as possible before forking, i.e. have the OS reclaim as much memory as possible?

EDIT: I think I understand what is going on now. The problem is not where I thought it was: objects that exist in the parent process and are not modified do not get duplicated eight times. Instead, I believe my problem came from the kind of manipulation each child process performs. Each one has to handle a big factor with hundreds of thousands of levels, and I think this is the memory-heavy part. As a result, the overall memory load is indeed proportional to the number of cores, but not as dramatically as I thought. Another lesson I learned is that with 4 physical cores plus hyperthreading, using the hyperthreads is typically not a good idea for R. The gain is minimal, and the memory cost may be non-trivial. So I'll be working on 4 cores from now on.
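The point about unmodified parent objects can be illustrated with a minimal sketch. It assumes the parallel package (bundled with R, and where mclapply now lives) rather than multicore; the object name big, the throwaway tmp, and the choice of 2 workers are illustrative only. On Linux, fork is copy-on-write, so children that only read big share the parent's pages.

```r
library(parallel)  # bundled with R; mclapply forks on Unix-alikes

set.seed(1)
big <- rnorm(1e6)                      # roughly 8 MB of doubles
print(object.size(big), units = "Mb")  # size of this one object

tmp <- rnorm(1e6)                      # a temporary we no longer need
rm(tmp)
invisible(gc())                        # collect garbage before forking

# Forking is unavailable on Windows, so fall back to one worker there
n_cores <- if (.Platform$OS.type == "unix") 2L else 1L

# Each child only reads `big`; nothing here modifies it
sums <- mclapply(1:4, function(i) sum(big) + i, mc.cores = n_cores)
```

Memory grows per child only for what each child allocates itself, which is consistent with the big-factor explanation above.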

For those who would like to experiment, this is the type of code I was running:

# Create data
sampdata <- data.frame(id = 1:1000000)
for (letter in letters) {
sampdata[, letter] <- rnorm(1000000)
}
sampdata$groupid = ceiling(sampdata$id/2)

# Enable multicore
library(multicore)
options(cores=4) # number of cores to distribute the job to

# Actual job
system.time(do.call("cbind", 
    mclapply(subset(sampdata, select = c(a:z)), function(x) tapply(x, sampdata$groupid, sum))
))
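For readers on a current R installation, here is a scaled-down sketch of the same job using the bundled parallel package, which provides mclapply. It is a variant under stated assumptions, not the code that was originally run: the row count is cut to 10,000, the name grp is hypothetical, and the grouping factor is built once in the parent because tapply otherwise converts the grouping vector to a factor on every call.

```r
library(parallel)  # bundled with R; provides mclapply

set.seed(1)
n <- 10000                              # much smaller than the original 1e6
sampdata <- data.frame(id = 1:n)
for (letter in letters) sampdata[, letter] <- rnorm(n)
sampdata$groupid <- ceiling(sampdata$id / 2)

# Build the grouping factor once, before forking (hypothetical name `grp`)
grp <- factor(sampdata$groupid)

# Forking is unavailable on Windows, so fall back to one worker there
n_cores <- if (.Platform$OS.type == "unix") 2L else 1L

res <- do.call("cbind",
  mclapply(sampdata[letters],
           function(x) tapply(x, grp, sum),
           mc.cores = n_cores))
dim(res)                                # groups by columns
```

Whether precomputing the factor removes most of the per-child memory cost on the full data set is untested here; it only avoids repeating the conversion in every child.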

Have you tried data.table?

> system.time(ans1 <- do.call("cbind",
lapply(subset(sampdata,select=c(a:z)),function(x)tapply(x,sampdata$groupid,sum))
))
   user  system elapsed 
906.157  13.965 928.645 

> require(data.table)
> DT = as.data.table(sampdata)
> setkey(DT,groupid)
> system.time(ans2 <- DT[,lapply(.SD,sum),by=groupid])
   user  system elapsed 
186.920   1.056 191.582                # 4.8 times faster

> # massage minor diffs in results...
> ans2$groupid=NULL
> ans2=as.matrix(ans2)
> colnames(ans2)=letters
> rownames(ans1)=NULL

> identical(ans1,ans2)
[1] TRUE

Your example is very interesting. It is reasonably large (200MB), there are many groups (1/2 million), and each group is very small (2 rows). The 191s can probably be improved by quite a lot, but at least it's a start. [March 2011]

And now, this idiom (i.e. lapply(.SD,...)) has been improved a lot. With v1.8.2, on a faster computer than in the test above, and with the latest version of R, here is the updated comparison:

sampdata <- data.frame(id = 1:1000000)
for (letter in letters) sampdata[, letter] <- rnorm(1000000)
sampdata$groupid = ceiling(sampdata$id/2)
dim(sampdata)
# [1] 1000000      28
system.time(ans1 <- do.call("cbind",
  lapply(subset(sampdata,select=c(a:z)),function(x)
    tapply(x,sampdata$groupid,sum))
))
#   user  system elapsed
# 224.57    3.62  228.54
library(data.table)
DT = as.data.table(sampdata)
setkey(DT,groupid)
system.time(ans2 <- DT[,lapply(.SD,sum),by=groupid])
#   user  system elapsed
#  11.23    0.01   11.24                # 20 times faster

# massage minor diffs in results...
ans2[,groupid:=NULL]
ans2[,id:=NULL]
ans2=as.matrix(ans2)
rownames(ans1)=NULL

identical(ans1,ans2)
# [1] TRUE

sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United Kingdom.1252   LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252  LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] data.table_1.8.2 RODBC_1.3-6
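As a small-scale check of the comparison above, the following sketch restricts the aggregation to the letter columns with .SDcols, so neither id nor groupid needs to be dropped afterwards. It assumes a reasonably recent data.table is installed and uses 10,000 rows instead of one million, so it says nothing about the timings reported above.

```r
library(data.table)  # assumes data.table is installed

set.seed(1)
n <- 10000
sampdata <- data.frame(id = 1:n)
for (letter in letters) sampdata[, letter] <- rnorm(n)
sampdata$groupid <- ceiling(sampdata$id / 2)

# Base R reference result: one tapply per column, bound into a matrix
ans1 <- do.call("cbind",
  lapply(sampdata[letters], function(x) tapply(x, sampdata$groupid, sum)))

# data.table: sum only the columns named in .SDcols, grouped by groupid
DT <- as.data.table(sampdata)
ans2 <- DT[, lapply(.SD, sum), by = groupid, .SDcols = letters]

# Compare the numeric content, ignoring row and column names
same <- all.equal(unname(ans1),
                  unname(as.matrix(ans2[, letters, with = FALSE])))
same
```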

Things I've tried with 64-bit R on Ubuntu, ranked in order of success:

1. Work with fewer cores, as you are doing.
2. Split the mclapply jobs into pieces, and save the partial results to a database using DBI with append=TRUE.
3. Use the rm function along with gc() often.
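The second and third items could look roughly like the sketch below. It is an illustration under assumptions, not the code I ran: it assumes the DBI and RSQLite packages are installed, uses an in-memory SQLite database, and the table name sums, the chunk size of 4 columns, and the long output layout are all hypothetical. Only the parent process writes to the database, so forked children never touch the connection.

```r
library(parallel)
library(DBI)       # assumes DBI and RSQLite are installed

# Hypothetical in-memory database; a file path would be used for real jobs
con <- dbConnect(RSQLite::SQLite(), ":memory:")

set.seed(1)
n <- 10000
sampdata <- data.frame(id = 1:n)
for (letter in letters) sampdata[, letter] <- rnorm(n)
grp <- factor(ceiling(sampdata$id / 2))

n_cores <- if (.Platform$OS.type == "unix") 2L else 1L

# Split the 26 columns into pieces of at most 4 columns each
chunks <- split(letters, ceiling(seq_along(letters) / 4))

for (cols in chunks) {
  part <- mclapply(sampdata[cols], function(x) tapply(x, grp, sum),
                   mc.cores = n_cores)
  # One row per (group, column) pair; tapply results follow levels(grp)
  long <- data.frame(
    groupid  = rep(levels(grp), times = length(cols)),
    variable = rep(cols, each = nlevels(grp)),
    total    = unlist(part, use.names = FALSE)
  )
  dbWriteTable(con, "sums", long, append = TRUE)  # parent appends the piece
  rm(part, long)                                  # drop partial results
  invisible(gc())
}

n_rows <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM sums")$n
dbDisconnect(con)
```

Keeping each piece small bounds how much result data the parent holds at once; it does not by itself limit what each child allocates.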

I have tried all of these, and mclapply still begins to create larger and larger processes as it runs, leading me to suspect each process is holding onto some sort of residual memory it really doesn't need.

P.S. I was using data.table, and it seems each child process copies the data.table.
