Randomly subset data set multiple times and calculate means and variances
I never came to any conclusions re: this question, so I thought I would rephrase it and ask again.
I would like to subsample my dataset 10,000 times to generate means and 95% CIs for each of my responses.
Here is an example of how the data set is structured:
x <- read.table(tc <- textConnection("
study expt variable value1 value2
1 1 A 1.0 1.1
1 2 B 1.1 2.1
1 3 B 1.2 2.9
1 4 C 1.5 2.3
2 1 A 1.7 0.3
2 2 A 1.9 0.3
3 1 A 0.2 0.5"), header = TRUE); close(tc)
I would like to subsample each study/variable combination only once. So, for example, the subsetted dataset would look like this:
study expt variable value1 value2
1 1 A 1.0 1.1
1 2 B 1.1 2.1
1 4 C 1.5 2.3
2 1 A 1.7 0.3
3 1 A 0.2 0.5
Notice rows 3 and 6 are gone, because both measured a variable twice (B in the first case, A 开发者_如何转开发in the second case).
I want to draw subsampled data sets again and again so I may derive overall means of value1 and value2 with 95% CIs for each variable. So the output I would like after the whole subsampling routine would be:
variable mean_value1 lower_value1 upper_value1 mean_value2 etc....
A 2.3 2.0 2.6 2.1
B 2.5 2.0 3.0 2.5
C 2.1 1.9 2.3 2.6
Here is some code I have to grab the subset:
subsample<-function(x, B){
samps<-ddply(x, .(study,variable), nrow)[,3] #for each study/variable combination,
#how many experiments are there
expIdx<-which(!duplicated(x$study)) #what is the first row of each study
n<-length(samps) #how many studies are there
sapply(1:B, function(a) { #use sapply for the looping, as it's more efficient than for
idx<-floor(runif(n, rep(0,n), samps)) #get the experiment number-1 for each study
x$value[idx+expIdx] #now get a vector of values
})
Any help is appreciated. I recognize this is complicated so please let me know if you need clarification!
Split your data by Study, Experiment and Variable, then apply the bootstrap to each subset. There are many ways to do this, including:
sdfr <- with(dfr, split(dfr, list(Study, Experiment, Variable)))
sdfr <- Filter(nrow, sdfr) #to remove empty data frames
lapply(sdfr, function(x)
{
boot(x$Response1, statistic = mean, R = 10000, sim = "parametric")
})
Here's a solution, although fair warning, it's not going to scale terribly well and I'm unaware of the statistical validity of this kind of scheme:
#Replicate your example data
set.seed(1)
dat <- expand.grid(Study = 1:4,Experiment = 1:3, Response = LETTERS[1:4])
dat$Value1 <- runif(48)
dat$Value2 <- runif(48)
#Function to apply to each Response level
#Note the rather inefficient use of ddply
# in a for loop to do the 'stratified'
# subsampling you describe
myFun <- function(x,B){
rs <- matrix(NA,B,2)
for (i in 1:B){
temp <- ddply(x,.(Study), .fun = function(x) x[sample(1:nrow(x),1),])
rs[i,] <- colMeans(temp[,4:5])
}
c(Value1 = mean(x$Value1), quantile(rs[,1],probs=c(0.025,0.975)),
Value2 = mean(x$Value2), quantile(rs[,2],probs=c(0.025,0.975)))
}
ddply(dat,.(Response),.fun = myFun,B=50)
Example output
Response Value1 2.5% 97.5% Value2 2.5% 97.5%
1 A 0.4914725 0.2721876 0.8311799 0.4600546 0.2596446 0.6909686
2 B 0.5941457 0.4018281 0.8047503 0.5241470 0.2865285 0.7099486
3 C 0.4596998 0.2752685 0.6340614 0.5761497 0.3546133 0.8115933
4 D 0.5550651 0.2717772 0.7298913 0.4645609 0.1868757 0.7985816
精彩评论