challenge: optimize unlisting [easy]

2023-01-22 07:57 问答作者：

Because SO is a bit slow lately, I'm posting an easy question. I would appreciate it if big fishes stayed on the bench for this one and give rookies a chance to respond.

Sometimes we have objects that have a ridiculous amount of large 开发者_StackOverflow社区list elements (vectors). How would you "unlist" this object into a single vector. Show proof that your method is faster than unlist().

If you don't need names and your list is one level deep, then if you can beat

.Internal(unlist(your_list, FALSE, FALSE))

I will vote up everything you do on SO for the next 1 year!!!

[Update: if non-unique names are needed and the list is not recursive, here is a version which improves over the unlist 100 times

 myunlist <- function(l){
    names <- names(l)
    vec <- unlist(l, F, F)
    reps <- unlist(lapply(l, length), F, F)
    names(vec) <- rep(names, reps)
    vec
    }

 myunlist(list(a=1:3, b=2))
 a a a b 
 1 2 3 2 

 > tl <- list(a = 1:20000, b = 1:5000, c = 2:30)
 > system.time(for(i in 1:200) unlist(tl))
 user  system elapsed 
 22.97    0.00   23.00 

 > system.time(for(i in 1:200) myunlist(tl))
 user  system elapsed 
 0.2     0.0     0.2 

 > system.time(for(i in 1:200) unlist(tl, F, F))
 user  system elapsed 
 0.02    0.00    0.02

]

[Update2: Responce to challenge Nr3 from Richie Cotton.

bigList3 <- replicate(500, rnorm(1e3), simplify = F)

unlist_vit <- function(l){
    names(l) <- NULL
    do.call(c, l)
    }

library(rbenchmark)

benchmark(unlist = unlist(bigList3, FALSE, FALSE),
          rjc    = unlist_rjc(bigList3),
          vit    = unlist_vit(bigList3),
          order  = "elapsed",
          replications = 100,
          columns = c("test", "relative", "elapsed")
          )

    test  relative elapsed
1 unlist   1.0000    2.06
3    vit   1.4369    2.96
2    rjc   3.5146    7.24

]

PS: I assume a "big fish" is the one with more reputation than you. So I am pretty much small here :).

A non-unlist() solution would have to be pretty darned fast to beat unlist() would it not? Here it takes less than two second to unlist a list with 2000 numeric vectors of length 100,000 each.

> bigList2 <- as.list(data.frame(matrix(rep(rnorm(1000000), times = 200), 
+                                       ncol = 2000)))
> print(object.size(bigList2), units = "Gb")
1.5 Gb
> system.time(foo <- unlist(bigList2, use.names = FALSE))
   user  system elapsed 
  1.897   0.000   2.019

With bigList2 and foo in my workspace, R is using ~9Gb of my available memory. The key is use.names = FALSE. Without it unlist() is painfully slow. Exactly how slow I'm still waiting to find out...

We can speed this up a little bit more by setting recursive = FALSE and then we have effectively the same as VitoshKa's answer (two representative timings):

> system.time(foo <- unlist(bigList2, recursive = FALSE, use.names = FALSE))
   user  system elapsed 
  1.379   0.001   1.416
> system.time(foo <- .Internal(unlist(bigList2, FALSE, FALSE)))
   user  system elapsed 
  1.335   0.000   1.344

... finally the use.names = TRUE version finished...:

> system.time(foo <- unlist(bigList2, use = TRUE))
    user   system  elapsed 
2307.839   10.978 2335.815

and it was consuming all my systems 16Gb of RAM so I gave up at that point...

c() has the logical argument recursive which will recursively unlist a vector when set to TRUE (default is obviously FALSE).

l <- replicate(500, rnorm(1e3), simplify = F)

microbenchmark::microbenchmark(
  unlist = unlist(l, FALSE, FALSE),
  c = c(l, recursive = TRUE, use.names = FALSE)
)

# Unit: milliseconds
# expr      min       lq     mean   median       uq      max neval
# unlist 3.083424 3.121067 4.662491 3.172401 3.985668 27.35040   100
#      c 3.084890 3.133779 4.090520 3.201246 3.920646 33.22832   100

As a medium-size fish, I'm jumping in with a first-attempt solution that gives a benchmark for little fishes to beat. It's about 3 times slower than unlist.

I'm using a smaller version of ucfagls's test list. (Since it fits in memory better.)

bigList3 <- as.list(data.frame(matrix(rep(rnorm(1e5), times = 200), ncol = 2000)))

The basic idea is to create one long vector to store the answer, then loop over list items copying values from the list.

unlist_rjc <- function(l)
{
  lengths <- vapply(l, length, FUN.VALUE = numeric(1), USE.NAMES = FALSE)
  total_len <- sum(lengths)
  end_index <- cumsum(lengths)
  start_index <- 1 + c(0, end_index)
  v <- numeric(total_len)
  for(i in seq_along(l))
  {
    v[start_index[i]:end_index[i]] <- l[[i]]
  }
  v
}

t1 <- system.time(for(i in 1:10) unlist(bigList2, FALSE, FALSE))
t2 <- system.time(for(i in 1:10) unlist_rjc(bigList2))
t2["user.self"] / t1["user.self"]  # 3.08

Challenges for little fishes:
1. Can you extend it to deal with other types than numeric?
2. Can you get it to work with recursion (nested lists)?
3. Can you make it faster?

I'll upvote anyone with less points than me whose answer meets one or more of these mini-challenges.

继续阅读：list optimization r vector

challenge: optimize unlisting [easy]

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？