Transform a data.frame, while filling missing values

I have the data frame

data <- data.frame(id=c("A","A","B","B"), day=c(5,6,1,2), duration=c(12,1440,5,6), obs.period=c(60,60,100,100))

It shows the subject ID, the day of the event, the duration of the event, and the subject's observation period.

I want to transform the data set so that it shows the whole observation period for each subject (all days of observation), with zero as the duration for days on which no event was observed.

For the above dataset this would be something like this:

id  day duration    obs.period
A   1   0   60
A   2   0   60
A   3   0   60
A   4   0   60
A   5   12  60
A   6   1440    60
A   7   0   60
A   8   0   60
    .       
    .       
    .       
A   60  0   60
B   1   5   100
B   2   6   100
B   3   0   100
B   4   0   100
    .       
    .       
    .       
    .       
B   100 0   100

Any ideas?


Here's one approach using the plyr package. First, create a function to expand the data into the appropriate number of rows. Then, index into that new data.frame with the duration info from the original data. Finally, call this function with ddply() and group on the id variable.

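library(plyr)
FUN <- function(x){
  # one row per day of this subject's observation period, duration defaulting to 0
  dat <- data.frame(
    id = x$id[1]
    , day = seq_len(x$obs.period[1])
    , duration = 0
    , obs.period = x$obs.period[1]
    )

  # match() finds the row of each observed day, so no vector recycling is involved
  dat$duration[match(x$day, dat$day)] <- x$duration
  return(dat)
}


ddply(data, "id", FUN)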

    id day duration obs.period
1    A   1        0         60
2    A   2        0         60
3    A   3        0         60
4    A   4        0         60
5    A   5       12         60
6    A   6     1440         60
...
61   B   1        5        100
62   B   2        6        100
63   B   3        0        100
...
160  B 100        0        100
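
For readers without plyr, the same split, expand, and combine steps can be sketched in base R. This is only a sketch under the question's assumptions (days start at 1 and obs.period is constant within an id); expand_one is a hypothetical helper name, not part of any package.

```r
data <- data.frame(id = c("A", "A", "B", "B"), day = c(5, 6, 1, 2),
                   duration = c(12, 1440, 5, 6), obs.period = c(60, 60, 100, 100))

# expand_one: hypothetical helper that builds the full day range for one subject
expand_one <- function(x) {
  n <- x$obs.period[1]
  out <- data.frame(id = x$id[1], day = seq_len(n), duration = 0, obs.period = n)
  # copy the observed durations into the rows of the matching days
  out$duration[match(x$day, out$day)] <- x$duration
  out
}

# split by id, expand each piece, and stack the pieces again
pieces <- lapply(split(data, data$id), expand_one)
result <- do.call(rbind, pieces)
rownames(result) <- NULL

table(result$id)  # number of rows per subject
```

The row count per subject should equal that subject's obs.period, which is a quick check on any of the approaches in this thread.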


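Create a data frame with the proper index columns (one row per id and per day of that id's observation period) but no duration column, then merge it with your data and replace the NAs in duration with zeros.

data <- data.frame(id=c("A","A","B","B"), day=c(5,6,1,2), duration=c(12,1440,5,6), obs.period=c(60,60,100,100))
# one row per subject with its observation period
periods <- unique(data[, c("id", "obs.period")])
# full grid: days 1..obs.period for each subject
zilch <- data.frame(id = rep(periods$id, periods$obs.period),
                    day = sequence(periods$obs.period),
                    obs.period = rep(periods$obs.period, periods$obs.period))
full <- merge(zilch, data, all.x = TRUE)
full$duration[is.na(full$duration)] <- 0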
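
To see what the merge step does on a small scale, here is a minimal sketch with a hypothetical 3-day grid (the names grid, obs and m are illustrative only). Rows of the grid with no match in the observations come back with NA in duration, which is why the replacement step is needed.

```r
# hypothetical toy data: one subject, three days, one observed event
grid <- data.frame(id = "A", day = 1:3)
obs  <- data.frame(id = "A", day = 2, duration = 12)

# all.x = TRUE keeps every grid row; unmatched rows get NA in duration
m <- merge(grid, obs, all.x = TRUE)

# days without an event become zero
m$duration[is.na(m$duration)] <- 0
m
```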


I would first create a data frame to contain the results.

ob.period <- with(data, tapply(obs.period, id, max))

n <- sum(ob.period)
result <- data.frame(id=rep(names(ob.period), ob.period),
                     day=unlist(lapply(ob.period, function(a) 1:a)),
                     duration=rep(0, n),
                     obs.period=rep(ob.period,ob.period))

Then I would paste id and day together, use match to find the relevant rows in the larger data frame, and plug in the duration values.

idday.sm <- paste(data$id, data$day, sep=":")
idday.lg <- paste(result$id, result$day, sep=":")

result$duration[match(idday.sm, idday.lg)] <- data$duration
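
As a small illustration of the key-matching step, assuming a hypothetical grid of 6 days per id: match() returns, for each key of the small data frame, its position in the key vector of the large one, and those positions are the rows to overwrite.

```r
# hypothetical keys: two observed events, and a full grid of 6 days per id
small <- paste(c("A", "B"), c(5, 1), sep = ":")
large <- paste(rep(c("A", "B"), each = 6), rep(1:6, 2), sep = ":")

pos <- match(small, large)
pos  # 5 7
```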


Here is an approach with plyr

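library(plyr)
fill1 <- function(df) {
  # days 1..obs.period for this subject, rather than a fixed range
  full_period <- seq_len(df[1,"obs.period"])
  to_fill <- setdiff(full_period, df$day)
  fill_id <- df[1,"id"]
  fill_dur <- 0
  fill_obs.p <- df[1,"obs.period"]
  rows_to_add <- data.frame(id=fill_id, day=to_fill, duration=fill_dur, obs.period=fill_obs.p)
  filled <- rbind(df,rows_to_add)
  # sort by day within the subject
  filled[order(filled$day), ]
}
ddply(data, "id", fill1)

Because ddply groups by id and each piece is sorted by day before it is returned, the result is ordered by id and day.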
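
The core of this approach is setdiff(), which picks out the days that have no event. A minimal sketch with a hypothetical 6-day period:

```r
# hypothetical example: events on days 5 and 6 of a 6-day period
observed_days <- c(5, 6)
missing_days <- setdiff(1:6, observed_days)
missing_days  # 1 2 3 4
```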
