
Why is caret train taking up so much memory?

When I fit a model using glm() directly, everything works and I don't come close to exhausting memory. But when I run train(..., method='glm'), I run out of memory.

Is this because train is storing a lot of data for each iteration of the cross-validation (or whatever the trControl procedure is)? I'm looking at trainControl and I can't find how to prevent this... any hints? I only care about the performance summary and maybe the predicted responses.

(I know it's not related to storing data from each iteration of the parameter-tuning grid search, because there's no tuning grid for glm, I believe.)


The problem is twofold: (i) train() doesn't just fit a model via glm(), it bootstraps that model, so even with the defaults train() does 25 bootstrap samples, which, coupled with problem (ii), is the (or a) source of your trouble; and (ii) train() simply calls glm() with its defaults, and those defaults are to store the model frame (argument model = TRUE of ?glm), which includes a copy of the data in model-frame form. The object returned by train() already stores a copy of the data in $trainingData, and the "glm" object in $finalModel also has a copy of the actual data.

At this point, simply running glm() via train() produces 25 copies of the fully expanded model.frame plus the original data, all of which need to be held in memory at some point during the resampling process; whether they are held concurrently or consecutively is not immediately clear from a quick look at the code, as the resampling happens in an lapply() call. There will also be 25 copies of the raw data.

Once the resampling is finished, the returned object will contain 2 copies of the raw data and a full copy of the model.frame. If your training data is large relative to available RAM or contains many factors to be expanded in the model.frame, then you could easily be using huge amounts of memory just carrying copies of the data around.

If you add model = FALSE to your train() call, that might make a difference. Here is a small example using the clotting data in ?glm:

clotting <- data.frame(u = c(5,10,15,20,30,40,60,80,100),
                       lot1 = c(118,58,42,35,27,25,21,19,18),
                       lot2 = c(69,35,26,21,18,16,13,12,12))
require(caret)

then

> m1 <- train(lot1 ~ log(u), data=clotting, family = Gamma, method = "glm", 
+             model = TRUE)
Fitting: parameter=none 
Aggregating results
Fitting model on full training set
> m2 <- train(lot1 ~ log(u), data=clotting, family = Gamma, method = "glm",
+             model = FALSE)
Fitting: parameter=none 
Aggregating results
Fitting model on full training set
> object.size(m1)
121832 bytes
> object.size(m2)
116456 bytes
> ## ordinary glm() call:
> m3 <- glm(lot1 ~ log(u), data=clotting, family = Gamma)
> object.size(m3)
47272 bytes
> m4 <- glm(lot1 ~ log(u), data=clotting, family = Gamma, model = FALSE)
> object.size(m4)
42152 bytes

So there is a size difference in the returned object and memory use during training will be lower. How much lower will depend on whether the internals of train() keep all copies of the model.frame in memory during the resampling process.
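
As a rough way to see where those copies live, you can inspect the components mentioned above. This is only a sketch for inspection, using the m1 fit from the example; the component names follow the usual train()/glm() object structure:

object.size(m1$trainingData)      ## copy of the data stored by train()
object.size(m1$finalModel$data)   ## copy of the data stored by the underlying glm() fit
object.size(m1$finalModel$model)  ## the expanded model frame (kept because model = TRUE)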

The object returned by train() is also significantly larger than that returned by glm(), as noted by @DWin in the comments.

To take this further, either study the code more closely, or email Max Kuhn, the maintainer of caret, to enquire about options to reduce the memory footprint.


Gavin's answer is spot on. I built the function for ease of use rather than for speed or efficiency [1].

First, using the formula interface can be an issue when you have a lot of predictors. This is something that R Core could fix; the formula approach requires a very large but sparse terms() matrix to be retained, and R has packages to deal with that issue effectively. For example, with n = 3,000 and p = 2,000, a 3-tree random forest model object was 1.5 times larger and took 23 times longer to execute when using the formula interface (282s vs 12s).

Second, you don't have to keep the training data (see the returnData argument in trainControl()).
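
For example, a minimal sketch of turning that off, reusing the clotting data from above (tc and m5 are just illustration names):

## Sketch: drop the stored copy of the training data via returnData = FALSE
tc <- trainControl(returnData = FALSE)
m5 <- train(lot1 ~ log(u), data = clotting, family = Gamma, method = "glm",
            model = FALSE, trControl = tc)
object.size(m5)   ## should be smaller still, since $trainingData is not retained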

Also, since R doesn't have any real shared memory infrastructure, Gavin is correct about the number of copies of the data that are retained in memory. Basically, a list is created for every resample and lapply() is used to process the list, then return only the resampled estimates. An alternative would be to sequentially make one copy of the data (for the current resample), do the required operations, then repeat for the remaining iterations. The issue there is I/O and the inability to do any parallel processing. [2]
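
To make that concrete, here is a conceptual sketch (not caret's actual code) of the lapply-over-resamples pattern, using the clotting data again; each iteration carries its own copy of the data and only a summary statistic is returned:

## Conceptual sketch of resampling via lapply(): one copy of the data per resample,
## fit the model, keep only a summary and discard the fit.
set.seed(1)
boot_stats <- lapply(seq_len(25), function(i) {
  idx <- sample(nrow(clotting), replace = TRUE)
  dat <- clotting[idx, ]                               ## per-resample copy of the data
  fit <- glm(lot1 ~ log(u), data = dat, family = Gamma, model = FALSE)
  deviance(fit)                                        ## return the estimate, not the fit
})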

If you have a large data set, I suggest using the non-formula interface (even though the actual model, like glm, eventually uses a formula). Also, for large data sets, train() saves the resampling indices for use by resamples() and other functions. You could probably remove those too.
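
As a sketch of what the non-formula interface looks like with the clotting data (the names x, y, logu and m6 are just for illustration; the log transformation now has to be done by hand, since there is no formula to do it for you):

## Sketch: the non-formula (x/y) interface -- predictors as a data frame, outcome as a vector
x <- data.frame(logu = log(clotting$u))
y <- clotting$lot1
m6 <- train(x, y, method = "glm", family = Gamma, model = FALSE,
            trControl = trainControl(returnData = FALSE))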

Yang, it would be good to know more about the data via str(data) so we can understand the dimensions and other aspects (e.g. factors with many levels, etc.).

I hope that helps,

Max

[1] I should note that we go to great lengths to fit as few models as possible when we can. The "sub-model" trick is used for many models, such as pls, gbm, rpart, earth and many others. Also, when a model has both formula and non-formula interfaces (e.g. lda() or earth()), we default to the non-formula interface.

[2] Every once in a while I get the insane urge to reboot the train() function. Using foreach might get around some of these issues.


I think the answers above are a bit outdated. The caret and caretEnsemble packages now include an additional trainControl() parameter, trim. trim defaults to FALSE, but setting it to TRUE will significantly reduce the model size. You should use it in combination with returnData = FALSE for the smallest possible model size. If you're using a model ensemble, you should also specify these two parameters in the greedy/stack ensemble's trainControl.

In my case, a 1.6 GB model shrank to ~500 MB with both parameters set in the ensemble control, and shrank further to ~300 MB when the parameters were also used in the greedy ensemble control.

Ensemble_control_A9 <- trainControl(trim = TRUE, method = "repeatedcv", number = 3,
                                    repeats = 2, verboseIter = TRUE, returnData = FALSE,
                                    returnResamp = "all", classProbs = TRUE,
                                    summaryFunction = twoClassSummary,
                                    savePredictions = TRUE, allowParallel = TRUE,
                                    sampling = "up")


Ensemble_greedy_A5 <- caretEnsemble(Ensemble_list_A5, metric = "ROC",
                                    trControl = trainControl(number = 2, trim = TRUE,
                                                             returnData = FALSE,
                                                             summaryFunction = twoClassSummary,
                                                             classProbs = TRUE))