Reshape data based on column in dataframe
I need to take a data.frame in the format of:
id1 id2 mean start end
1 A D 4 12 15
2 B E 开发者_StackOverflow社区 5 14 15
3 C F 6 8 10
and generate duplicate rows based on the difference in start - end
. For example, I need 3 rows for the first row, 1 for the second, and 2 for the third. The start and end fields should be in sequential order in the final data.frame. The end result for this data.frame should be:
id1 id2 mean start end
1 A D 4 12 13
2 A D 4 13 14
3 A D 4 14 15
21 B E 5 14 15
31 C F 6 8 9
32 C F 6 9 10
I have written this function which works, but isn't written in very R'esque code:
dupData <- function(df){
diff <- abs(df$start - df$end)
ret <- {}
#Expand our dataframe into the appropriate number of rows.
for (i in 1:nrow(df)){
for (j in 1:diff[i]){
ret <- rbind(ret, df[i,])
}
}
#If matching ID1 and ID2, generate a sequential ordering of start & end dates
for (k in 2:nrow(ret) - 1) {
if ( ret[k,1] == ret[k + 1, 1] & ret[k, 2] == ret[k, 2] ){
ret[k, 5] <- ret[k, 4] + 1
ret[k + 1, 4] <- ret[k, 5]
}
}
return(ret)
}
Does anyone have suggestions on how to optimize this code? Is there a function in plyr
which may be applicable?
#sample daters
df <- data.frame(id1 = c("A", "B", "C")
, id2 = c("D", "E", "F")
, mean = c(4,5,6)
, start = c(12,14,8)
, end = c(15, 15, 10)
)
There's probably a more general way to do this, but below uses rbind.fill
.
cbind(df[rep(1:nrow(df), times = apply(df[,4:5], 1, diff)), 1:3],
rbind.fill(apply(df[,4:5], 1, function(x)
data.frame(start = x[1]:(x[2]-1), end = (x[1]+1):x[2]))))
## id1 id2 mean start end
## 1 A D 4 12 13
## 1.1 A D 4 13 14
## 1.2 A D 4 14 15
## 2 B E 5 14 15
## 3 C F 6 8 9
## 3.1 C F 6 9 10
The survSplit
function of the survival
package does something along these lines, though it has a bit more options (eg specifying the cut times). You might be able to use it, or look at its code to see if you can implement your simplified version better.
No doubt this isn't one of those times where late is better than never, but i had a similar issue and came up with this...
library(plyr)
ddply(df, c("id1", "id2", "mean", "start", "end"), summarise,
sq=seq(1:(end-start)))
Two alternatives, many years later, offering alternatives using today's popular data.table
and tidyverse
packages:
Option 1:
library(data.table)
setDT(mydf)[, list(mean, start = start:(end-1)), .(id1, id2)][, end := start + 1][]
id1 id2 mean start end
1: A D 4 12 13
2: A D 4 13 14
3: A D 4 14 15
4: B E 5 14 15
5: C F 6 8 9
6: C F 6 9 10
Option 2:
library(tidyverse)
mydf %>%
group_by(id1, id2, mean) %>%
summarise(start = list(start:(end-1))) %>%
unnest(start) %>%
mutate(end = start+1)
精彩评论