Subset a data.frame by list and apply function on each part, by rows
This may seem as a typical plyr
problem, but I have something different in mind.
Here's the function that I want to optimize (skip the for
loop).
# dummy data
set.seed(1985)
lst <- list(a=1:10, b=11:15, c=16:20)
m <- matrix(round(runif(200, 1, 7)), 10)
m <- as.data.frame(m)
dfsub <- function(dt, lst, fun) {
# check whether dt is `data.frame`
stopifnot (is.data.frame(dt))
# check if vectors in lst are "whole" / integer
# vector elements should be column indexes
is.wholenumber <- function(x, tol = .Machine$double.eps^0.5) abs(x - round(x)) < tol
# fall if any non-integers in list
idx <- rapply(lst, is.wholenumber)
stopifnot(idx)
# check for list length
stopifnot(ncol(dt) == length(idx))
# subset the data
subs <- list()
for开发者_如何学Go (i in 1:length(lst)) {
# apply function on each part, by row
subs[[i]] <- apply(dt[ , lst[[i]]], 1, fun)
}
# preserve names
names(subs) <- names(lst)
# convert to data.frame
subs <- as.data.frame(subs)
# guess what =)
return(subs)
}
And now a short demonstration... actually, I'm about to explain what I primarily intended to do. I wanted to subset a data.frame
by vectors gathered in list
object. Since this is a part of code from a function that accompanies data manipulation in psychological research, you can consider m
as a results from personality questionnaire (10 subjects, 20 vars). Vectors in list hold column indexes that define questionnaire subscales (e.g. personality traits). Each subscale is defined by several items (columns in data.frame
). If we presuppose that the score on each subscale is nothing more than sum
(or some other function) of row values (results on that part of questionnaire for each subject), you could run:
> dfsub(m, lst, sum)
a b c
1 46 20 24
2 41 24 21
3 41 13 12
4 37 14 18
5 57 18 25
6 27 18 18
7 28 17 20
8 31 18 23
9 38 14 15
10 41 14 22
I took a glance at this function and I must admit that this little loop isn't spoiling the code at all... BUT, if there's an easier/efficient way of doing this, please, let me know!
I'd take a different approach and keep everything as data frames so that you can use merge and ddply. I think you'll find this approach is a little more general, and it's easier to check that each step is performed correctly.
# Convert everything to long data frames
m$id <- 1:nrow(m)
library(reshape)
obs <- melt(m, id = "id")
obs$variable <- as.numeric(gsub("V", "", obs$variable))
varinfo <- melt(lst)
names(varinfo) <- c("variable", "scale")
# Merge and summarise
obs <- merge(obs, varinfo, by = "variable")
ddply(obs, c("id", "scale"), summarise,
mean = mean(value),
sum = sum(value))
after loading the plyr package, replace
subs <- list()
for (i in 1:length(lst)) {
# apply function on each part, by row
subs[[i]] <- apply(dt[ , lst[[i]]], 1, fun)
}
with
subs <- llply(lst,function(x) apply(dt[,x],1,fun))
@Hadley, I've checked your response since it's quite straightforward and easy for bookkeeping (besides the fact it's more general-purpose-solution). However, here's my not-so-long script that does the thing and requires only base
package (which is trivial since I install plyr
and reshape
just after installing R). Now, here's the source:
dfsub <- function(dt, lst, fun) {
# check whether dt is `data.frame`
stopifnot (is.data.frame(dt))
# convert data.frame factors to numeric
dt <- as.data.frame(lapply(dt, as.numeric))
# check if vectors in lst are "whole" / integer
# vector elements should be column indexes
is.wholenumber <- function(x, tol = .Machine$double.eps^0.5) abs(x - round(x)) < tol
# fall if any non-integers in list
idx <- rapply(lst, is.wholenumber)
stopifnot(idx)
# check for list length
stopifnot(ncol(dt) == length(idx))
# subset the data
subs <- list()
for (i in 1:length(lst)) {
# apply function on each part, by row
subs[[i]] <- apply(dt[ , lst[[i]]], 1, fun)
}
names(subs) <- names(lst)
# convert to data.frame
subs <- as.data.frame(subs)
# guess what =)
return(subs)
}
For your specific example, a one-line solution is sapply(lst,function(x) rowSums(m[,x]))
(although you might add some more lines to check for valid input and put in the column names).
Do you have other, more general, applications in mind? Or is this possibly a case of YAGNI?
精彩评论