Subset a data.frame by list and apply function on each part, by rows

2022-12-21 17:08 问答作者：

This may seem as a typical plyr problem, but I have something different in mind. Here's the function that I want to optimize (skip the for loop).

# dummy data
set.seed(1985)
lst <- list(a=1:10, b=11:15, c=16:20)
m <- matrix(round(runif(200, 1, 7)), 10)
m <- as.data.frame(m)


dfsub <- function(dt, lst, fun) {
    # check whether dt is `data.frame`
    stopifnot (is.data.frame(dt))
    # check if vectors in lst are "whole" / integer
    # vector elements should be column indexes
    is.wholenumber <- function(x, tol = .Machine$double.eps^0.5)  abs(x - round(x)) < tol
    # fall if any non-integers in list
    idx <- rapply(lst, is.wholenumber)
    stopifnot(idx)
    # check for list length
    stopifnot(ncol(dt) == length(idx))
    # subset the data
    subs <- list()
    for开发者_如何学Go (i in 1:length(lst)) {
            # apply function on each part, by row
            subs[[i]] <- apply(dt[ , lst[[i]]], 1, fun)
    }
    # preserve names
    names(subs) <- names(lst)
    # convert to data.frame
    subs <- as.data.frame(subs)
    # guess what =)
    return(subs)
}

And now a short demonstration... actually, I'm about to explain what I primarily intended to do. I wanted to subset a data.frame by vectors gathered in list object. Since this is a part of code from a function that accompanies data manipulation in psychological research, you can consider m as a results from personality questionnaire (10 subjects, 20 vars). Vectors in list hold column indexes that define questionnaire subscales (e.g. personality traits). Each subscale is defined by several items (columns in data.frame). If we presuppose that the score on each subscale is nothing more than sum (or some other function) of row values (results on that part of questionnaire for each subject), you could run:

> dfsub(m, lst, sum)
    a  b  c
1  46 20 24
2  41 24 21
3  41 13 12
4  37 14 18
5  57 18 25
6  27 18 18
7  28 17 20
8  31 18 23
9  38 14 15
10 41 14 22

I took a glance at this function and I must admit that this little loop isn't spoiling the code at all... BUT, if there's an easier/efficient way of doing this, please, let me know!

I'd take a different approach and keep everything as data frames so that you can use merge and ddply. I think you'll find this approach is a little more general, and it's easier to check that each step is performed correctly.

# Convert everything to long data frames
m$id <- 1:nrow(m)

library(reshape)
obs <- melt(m, id = "id")
obs$variable <- as.numeric(gsub("V", "", obs$variable))

varinfo <- melt(lst)
names(varinfo) <- c("variable", "scale")

# Merge and summarise
obs <- merge(obs, varinfo, by = "variable")

ddply(obs, c("id", "scale"), summarise, 
  mean = mean(value), 
  sum = sum(value))

after loading the plyr package, replace

subs <- list()
    for (i in 1:length(lst)) {
            # apply function on each part, by row
            subs[[i]] <- apply(dt[ , lst[[i]]], 1, fun)
    }

with

subs <- llply(lst,function(x) apply(dt[,x],1,fun))

@Hadley, I've checked your response since it's quite straightforward and easy for bookkeeping (besides the fact it's more general-purpose-solution). However, here's my not-so-long script that does the thing and requires only base package (which is trivial since I install plyr and reshape just after installing R). Now, here's the source:

dfsub <- function(dt, lst, fun) {
        # check whether dt is `data.frame`
        stopifnot (is.data.frame(dt))
        # convert data.frame factors to numeric
        dt <- as.data.frame(lapply(dt, as.numeric))
        # check if vectors in lst are "whole" / integer
        # vector elements should be column indexes
        is.wholenumber <- function(x, tol = .Machine$double.eps^0.5)  abs(x - round(x)) < tol
        # fall if any non-integers in list
        idx <- rapply(lst, is.wholenumber)
        stopifnot(idx)
        # check for list length
        stopifnot(ncol(dt) == length(idx))
        # subset the data
        subs <- list()
        for (i in 1:length(lst)) {
                # apply function on each part, by row
                subs[[i]] <- apply(dt[ , lst[[i]]], 1, fun)
        }
        names(subs) <- names(lst)
        # convert to data.frame
        subs <- as.data.frame(subs)
        # guess what =)
        return(subs)
}

For your specific example, a one-line solution is sapply(lst,function(x) rowSums(m[,x])) (although you might add some more lines to check for valid input and put in the column names).

Do you have other, more general, applications in mind? Or is this possibly a case of YAGNI?

继续阅读：apply dataframe lapply list r

Subset a data.frame by list and apply function on each part, by rows

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？