Merge list of data.frames with list element name as factor in merged data frame

I have a data.frame, like the following, where location is a factor and sample is some measurement sample:

  location sample
1      'A'   0.10
2      'A'   0.20
3      'A'   0.15
4      'B'   0.15
5      'B'   0.99
6      'B'   0.54
...

I have a function ECCDFpts(df), where df is a data.frame, that returns a set of <x,y> points on the empirical CCDF of df$sample, like so:

    x     y
1 0.0  1.00
2 0.1  0.99
3 0.2  0.75
...

Note that the number of <x,y> points returned is "arbitrary". There is not a one-to-one mapping between input samples and output <x,y> rows.

I would like to compute this CCDF data on a per factor (e.g., location) basis, yielding a data.frame like this:

  location    x    y
1      'A'  0.0  1.0
2      'A'  0.1  1.0
3      'A'  0.2  0.3
4      'B'  0.0  1.0
5      'B'  0.1  1.0
6      'B'  0.2  0.7
...

My current approach is to split the initial data frame on factor location:

eccdfs_by_factor <- by(data, data$location, ECCDFpts)

This yields a list of data.frames:

data$location: A
    x    y
1 0.0  1.0
2 0.1  1.0
3 0.2  0.3
-----------------
data$location: B
    x    y
1 0.0  1.0
2 0.1  1.0
3 0.2  0.7

I don't know how to merge or unsplit this back into my desired form, shown previously. I want to merge such that the name of the elements (data.frames) in the list becomes a column factor in the combined data.frame.

Solution:

This is a classic split-apply-combine problem. The cleanest solutions below use the plyr function ddply(...) to do the splitting, applying, and combining in one line. There's no need for the base by function I used above.
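For reference, here is a minimal sketch of what that one-liner might look like. The ECCDFpts below is a hypothetical stand-in (the real function is not shown in this thread): it assumes the CCDF is evaluated on a fixed grid of x values, and the sample data are just the six rows from the question.

```r
library(plyr)

# Hypothetical stand-in for ECCDFpts: evaluates the empirical CCDF of
# df$sample (1 minus the empirical CDF) on a fixed grid of x values.
ECCDFpts <- function(df, grid = seq(0, 1, by = 0.25)) {
  Fn <- ecdf(df$sample)
  data.frame(x = grid, y = 1 - Fn(grid))
}

# The six rows shown in the question
data <- data.frame(
  location = rep(c("A", "B"), each = 3),
  sample   = c(0.10, 0.20, 0.15, 0.15, 0.99, 0.54)
)

# Split by location, apply ECCDFpts to each piece, and combine the pieces;
# ddply adds the location column to the combined result.
result <- ddply(data, .(location), ECCDFpts)
```

The number of rows per location depends only on what the function returns, so it does not need to match the number of input samples.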


Update: If I understand correctly what you want...

library(plyr)
ldply(your_data)

For example:

x <- list(a=data.frame(x=c(1,2,3,4),y=c(2,3,4,5)),
          b=data.frame(x=c(4,3,2,1),y=c(5,4,3,2)))
ldply(x)

  .id x y
1   a 1 2
2   a 2 3
3   a 3 4
4   a 4 5
5   b 4 5
6   b 3 4
7   b 2 3
8   b 1 2
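To end up with a column called location rather than .id, one option is to rename it afterwards. This is only a sketch: ECCDFpts here is a hypothetical stand-in that evaluates the CCDF on a fixed grid, and the list is built with split and lapply instead of by.

```r
library(plyr)

# Hypothetical stand-in for ECCDFpts (the real one is not shown in the thread)
ECCDFpts <- function(df, grid = seq(0, 1, by = 0.5)) {
  data.frame(x = grid, y = 1 - ecdf(df$sample)(grid))
}

data <- data.frame(
  location = rep(c("A", "B"), each = 3),
  sample   = c(0.10, 0.20, 0.15, 0.15, 0.99, 0.54)
)

# A named list of data.frames, one element per location
pieces <- lapply(split(data, data$location), ECCDFpts)

# ldply stacks the list and stores the element names in a column named .id
out <- ldply(pieces)

# Rename .id to location
names(out)[names(out) == ".id"] <- "location"
```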


A one-shot solution uses the plyr package. Since I don't know your ECCDFpts function, I am going to write my own to illustrate the usage.

# DEFINE DUMMY DATA
mydata = data.frame(
  location = rep(LETTERS[1:3], each = 3),
  sample   = runif(9)
)

# DEFINE DUMMY FUNCTION
myfunc = function(dat){
   x = dat - mean(dat)
   y = dat - median(dat)
   return(data.frame(x, y)) 
}

# USE PLYR TO APPLY FUNCTION BY LOCATION
library(plyr)
ans = ddply(mydata, .(location), transform, x = myfunc(sample)$x, 
         y = myfunc(sample)$y)

  location sample       x      y
1        A  0.911  0.3279  0.232
2        A  0.678  0.0958  0.000
3        A  0.159 -0.4237 -0.520
4        B  0.908  0.3096  0.048
5        B  0.860  0.2615  0.000
6        B  0.027 -0.5711 -0.833
7        C  0.745  0.0694  0.000
8        C  0.343 -0.3327 -0.402
9        C  0.939  0.2633  0.194

EDIT. As identified in the comments by @David, the code can be further simplified as

# DEFINE DUMMY FUNCTION
myfunc = function(dat){
   x = with(dat, sample - mean(sample))
   y = with(dat, sample - median(sample))
   return(data.frame(x, y)) 
}

ans = ddply(mydata, .(location), myfunc)

  location       x        y
1        A -0.0308 -0.00564
2        A -0.0251  0.00000
3        A  0.0559  0.08102
4        B -0.4985 -0.69084
5        B  0.3062  0.11392
6        B  0.1923  0.00000
7        C -0.2894 -0.31495
8        C  0.0255  0.00000
9        C  0.2639  0.23838


The answers you've received are more than adequate, but for completeness I'd like to add a solution that explains how to get your desired result starting from your output from the by command. I'm going to use a slightly modified version of Ramnath's example for illustration:

mydata = data.frame(
  location = rep(LETTERS[1:3], each = 3),
  sample   = runif(9)
)

# DEFINE DUMMY FUNCTION - slightly different from Ramnath's
myfunc = function(dat){
    temp <- data.frame(x = runif(3), y = rnorm(3))
    return(temp) 
}         

You're splitting the data by location and applying your function using by:

rs <- by(mydata,mydata$location,FUN = myfunc)

mydata$location: A
          x           y
1 0.2730105 -0.06923224
2 0.9354096 -0.18336131
3 0.6359926 -0.04054326
----------------------------------------------------------- 
mydata$location: B
          x           y
1 0.5621529 -0.26404739
2 0.8098687  0.07912883
3 0.7334650  0.38287794
----------------------------------------------------------- 
mydata$location: C
          x          y
1 0.8443924 -0.9055125
2 0.7922256  0.1757586
3 0.4923929 -0.1931579

Now, a very handy thing to know is that we can put everything back together again using do.call and rbind:

result <- do.call(rbind,rs)

            x           y
A.1 0.2730105 -0.06923224
A.2 0.9354096 -0.18336131
A.3 0.6359926 -0.04054326
B.1 0.5621529 -0.26404739
B.2 0.8098687  0.07912883
B.3 0.7334650  0.38287794
C.1 0.8443924 -0.90551251
C.2 0.7922256  0.17575858
C.3 0.4923929 -0.19315789

But wait, you say! What about adding my location column? Well, notice what do.call(rbind,rs) did to the row names of your result! We can add the location column by just extracting the first character from the row names:

result$location <- substr(row.names(result),1,1)

This assumes, of course, that your locations are coded using a single character. But in general, the resulting row names should be in the form location.x, so you could always use strsplit or regular expressions to extract the location names.
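As a rough sketch of the regular-expression route, the following strips the trailing ".<row number>" that rbind appends. The list and its location names (north, south.east) are made up for illustration, and the pattern assumes a location name never itself ends in a dot followed by digits.

```r
# A made-up named list standing in for the output of by()
rs <- list(
  north      = data.frame(x = c(0.0, 0.1), y = c(1.0, 0.5)),
  south.east = data.frame(x = c(0.0, 0.1), y = c(1.0, 0.7))
)

result <- do.call(rbind, rs)

# Row names look like "north.1", "south.east.2": drop the final ".<digits>"
result$location <- sub("\\.[0-9]+$", "", row.names(result))
```

Anchoring the pattern at the end of the string keeps dots inside a location name intact, which a plain strsplit on "." would not.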

Finally, you can always simply modify the function you apply to each piece to add the location name as a column before returning the result, like so:

#Output not shown
myfunc1 = function(dat){
    temp <- data.frame(x = runif(3), y = rnorm(3))
    temp$location <- dat$location[1]
    return(temp) 
}
rs1 <- by(mydata,mydata$location,FUN = myfunc1)
result1 <- do.call(rbind,rs1)

So you'd just have to modify your ECCDFpts function in a similar manner.
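Something along these lines might work. Since the real ECCDFpts is not shown, the version below is a hypothetical stand-in that evaluates the CCDF on a fixed grid; the only part that matters is the line that copies the location into the result.

```r
# Hypothetical ECCDFpts variant that tags its output with the location
ECCDFpts1 <- function(df, grid = seq(0, 1, by = 0.5)) {
  pts <- data.frame(x = grid, y = 1 - ecdf(df$sample)(grid))
  pts$location <- df$location[1]  # every row in this piece shares one location
  pts
}

data <- data.frame(
  location = rep(c("A", "B"), each = 3),
  sample   = c(0.10, 0.20, 0.15, 0.15, 0.99, 0.54)
)

rs <- by(data, data$location, ECCDFpts1)
result <- do.call(rbind, rs)

# Put location first and drop the A.1, A.2, ... row names
result <- result[, c("location", "x", "y")]
row.names(result) <- NULL
```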
