challenge: optimize unlisting [easy]
Because SO is a bit slow lately, I'm posting an easy question. I would appreciate it if big fishes stayed on the bench for this one and give rookies a chance to respond.
Sometimes we have objects that have a ridiculous amount of large 开发者_StackOverflow社区list elements (vectors). How would you "unlist" this object into a single vector. Show proof that your method is faster than unlist()
.
If you don't need names and your list is one level deep, then if you can beat
.Internal(unlist(your_list, FALSE, FALSE))
I will vote up everything you do on SO for the next 1 year!!!
[Update: if non-unique names are needed and the list is not recursive, here is a version which improves over the unlist 100 times
myunlist <- function(l){
names <- names(l)
vec <- unlist(l, F, F)
reps <- unlist(lapply(l, length), F, F)
names(vec) <- rep(names, reps)
vec
}
myunlist(list(a=1:3, b=2))
a a a b
1 2 3 2
> tl <- list(a = 1:20000, b = 1:5000, c = 2:30)
> system.time(for(i in 1:200) unlist(tl))
user system elapsed
22.97 0.00 23.00
> system.time(for(i in 1:200) myunlist(tl))
user system elapsed
0.2 0.0 0.2
> system.time(for(i in 1:200) unlist(tl, F, F))
user system elapsed
0.02 0.00 0.02
]
[Update2: Responce to challenge Nr3 from Richie Cotton.
bigList3 <- replicate(500, rnorm(1e3), simplify = F)
unlist_vit <- function(l){
names(l) <- NULL
do.call(c, l)
}
library(rbenchmark)
benchmark(unlist = unlist(bigList3, FALSE, FALSE),
rjc = unlist_rjc(bigList3),
vit = unlist_vit(bigList3),
order = "elapsed",
replications = 100,
columns = c("test", "relative", "elapsed")
)
test relative elapsed
1 unlist 1.0000 2.06
3 vit 1.4369 2.96
2 rjc 3.5146 7.24
]
PS: I assume a "big fish" is the one with more reputation than you. So I am pretty much small here :).
A non-unlist()
solution would have to be pretty darned fast to beat unlist()
would it not? Here it takes less than two second to unlist a list with 2000 numeric vectors of length 100,000 each.
> bigList2 <- as.list(data.frame(matrix(rep(rnorm(1000000), times = 200),
+ ncol = 2000)))
> print(object.size(bigList2), units = "Gb")
1.5 Gb
> system.time(foo <- unlist(bigList2, use.names = FALSE))
user system elapsed
1.897 0.000 2.019
With bigList2
and foo
in my workspace, R is using ~9Gb of my available memory. The key is use.names = FALSE
. Without it unlist()
is painfully slow. Exactly how slow I'm still waiting to find out...
We can speed this up a little bit more by setting recursive = FALSE
and then we have effectively the same as VitoshKa's answer (two representative timings):
> system.time(foo <- unlist(bigList2, recursive = FALSE, use.names = FALSE))
user system elapsed
1.379 0.001 1.416
> system.time(foo <- .Internal(unlist(bigList2, FALSE, FALSE)))
user system elapsed
1.335 0.000 1.344
... finally the use.names = TRUE
version finished...:
> system.time(foo <- unlist(bigList2, use = TRUE))
user system elapsed
2307.839 10.978 2335.815
and it was consuming all my systems 16Gb of RAM so I gave up at that point...
c()
has the logical argument recursive
which will recursively unlist a vector when set to TRUE
(default is obviously FALSE
).
l <- replicate(500, rnorm(1e3), simplify = F)
microbenchmark::microbenchmark(
unlist = unlist(l, FALSE, FALSE),
c = c(l, recursive = TRUE, use.names = FALSE)
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# unlist 3.083424 3.121067 4.662491 3.172401 3.985668 27.35040 100
# c 3.084890 3.133779 4.090520 3.201246 3.920646 33.22832 100
As a medium-size fish, I'm jumping in with a first-attempt solution that gives a benchmark for little fishes to beat. It's about 3 times slower than unlist.
I'm using a smaller version of ucfagls
's test list. (Since it fits in memory better.)
bigList3 <- as.list(data.frame(matrix(rep(rnorm(1e5), times = 200), ncol = 2000)))
The basic idea is to create one long vector to store the answer, then loop over list items copying values from the list.
unlist_rjc <- function(l)
{
lengths <- vapply(l, length, FUN.VALUE = numeric(1), USE.NAMES = FALSE)
total_len <- sum(lengths)
end_index <- cumsum(lengths)
start_index <- 1 + c(0, end_index)
v <- numeric(total_len)
for(i in seq_along(l))
{
v[start_index[i]:end_index[i]] <- l[[i]]
}
v
}
t1 <- system.time(for(i in 1:10) unlist(bigList2, FALSE, FALSE))
t2 <- system.time(for(i in 1:10) unlist_rjc(bigList2))
t2["user.self"] / t1["user.self"] # 3.08
Challenges for little fishes:
1. Can you extend it to deal with other types than numeric?
2. Can you get it to work with recursion (nested lists)?
3. Can you make it faster?
I'll upvote anyone with less points than me whose answer meets one or more of these mini-challenges.
精彩评论