Performance of rbind.data.frame

2023-03-05 16:41 问答作者：

I have a list of dataframes for which I开发者_如何学JAVA am certain that they all contain at least one row (in fact, some contain only one row, and others contain a given number of rows), and that they all have the same columns (names and types). In case it matters, I am also certain that there are no NA's anywhere in the rows.

The situation can be simulated like this:

#create one row
onerowdfr<-do.call(data.frame, c(list(), rnorm(100) , lapply(sample(letters[1:2], 100, replace=TRUE), function(x){factor(x, levels=letters[1:2])})))
colnames(onerowdfr)<-c(paste("cnt", 1:100, sep=""), paste("cat", 1:100, sep=""))
#reuse it in a list
someParts<-lapply(rbinom(200, 1, 14/200)*6+1, function(reps){onerowdfr[rep(1, reps),]})

I've set the parameters (of the randomization) so that they approximate my true situation.

Now, I want to unite all these dataframes in one dataframe. I thought using rbind would do the trick, like this:

system.time(
result<-do.call(rbind, someParts)
)

Now, on my system (which is not particularly slow), and with the settings above, this takes is the output of the system.time:

   user  system elapsed 
   5.61    0.00    5.62

Nearly 6 seconds for rbind-ing 254 (in my case) rows of 200 variables? Surely there has to be a way to improve the performance here? In my code, I have to do similar things very often (it is a from of multiple imputation), so I need this to be as fast as possible.

Can you build your matrices with numeric variables only and convert to a factor at the end? rbind is a lot faster on numeric matrices.

On my system, using data frames:

> system.time(result<-do.call(rbind, someParts))
   user  system elapsed 
  2.628   0.000   2.636

Building the list with all numeric matrices instead:

onerowdfr2 <- matrix(as.numeric(onerowdfr), nrow=1)
someParts2<-lapply(rbinom(200, 1, 14/200)*6+1, 
                   function(reps){onerowdfr2[rep(1, reps),]})

results in a lot faster rbind.

> system.time(result2<-do.call(rbind, someParts2))
   user  system elapsed 
  0.001   0.000   0.001

EDIT: Here's another possibility; it just combines each column in turn.

> system.time({
+   n <- 1:ncol(someParts[[1]])
+   names(n) <- names(someParts[[1]])
+   result <- as.data.frame(lapply(n, function(i) 
+                           unlist(lapply(someParts, `[[`, i))))
+ })
   user  system elapsed 
  0.810   0.000   0.813

Still not nearly as fast as using matrices though.

EDIT 2:

If you only have numerics and factors, it's not that hard to convert everything to numeric, rbind them, and convert the necessary columns back to factors. This assumes all factors have exactly the same levels. Converting to a factor from an integer is also faster than from a numeric so I force to integer first.

someParts2 <- lapply(someParts, function(x)
                     matrix(unlist(x), ncol=ncol(x)))
result<-as.data.frame(do.call(rbind, someParts2))
a <- someParts[[1]]
f <- which(sapply(a, class)=="factor")
for(i in f) {
  lev <- levels(a[[i]])
  result[[i]] <- factor(as.integer(result[[i]]), levels=seq_along(lev), labels=lev)
}

The timing on my system is:

   user  system elapsed 
   0.090    0.00    0.091

Not a huge boost, but swapping rbind for rbind.fill from the plyr package knocks about 10% off the running time (with the sample dataset, on my machine).

If you really want to manipulate your data.frames faster, I would suggest to use the package data.table and the function rbindlist(). I did not perform extensive tests but for my dataset (3000 dataframes, 1000 rows x 40 columns each) rbindlist() takes only 20 seconds.

This is ~25% faster, but there has to be a better way...

system.time({
  N <- do.call(sum, lapply(someParts, nrow))
  SP <- as.data.frame(lapply(someParts[[1]], function(x) rep(x,N)))
  k <- 0
  for(i in 1:length(someParts)) {
    j <- k+1
    k <- k + nrow(someParts[[i]])
    SP[j:k,] <- someParts[[i]]
  }
})

Make sure you're binding dataframe to dataframe. Ran into huge perf degradation when binding list to dataframe.

From the ecospace package, rbind_listdf works on chunks of 100 dataframes at a time. Compared to do.call(rbind) it seems to be more time and memory efficient than if you are merging a list of several hundred dataframes. For merging 5000 dataframes of ~5GB total size, I saw peak memory use was ~25% less.

继续阅读：dataframe r rbind

Performance of rbind.data.frame

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？