Only Selecting Cases for all Time Periods
I have a longitudinal data set for a month in which there is some user attrition.
I'd like to subset the data just for those users who are active across all 30 days, but I could not find an example of this type of subset. Here is an example of the data layout:
date userID x
2001-11-08 1 20
2001-11-08 2 2
2001-11-08 3 10
2001-11-08 4 5
2001-11-08 5 1
2001-11-09 1 19
2001-11-09 3 4
2001-11-09 4 5
...
2001-11-30 1 15
subset(dnow, ave(as.numeric(date), userID, FUN=function(x) length(unique(x)))==30)
You should consider using the data processing tools in the plyr library.
library(plyr)
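# Simulated data: 31 consecutive days with 3 random user records per day.
# A Date is used so that adding an integer advances by days (adding to ISOdate() output advances by seconds).
startdate <- as.Date("2011-01-01")
userdata <- data.frame(
date = startdate + rep(1:31, each=3),
userID = 1 + round(9*runif(93)),
x = round(100*runif(93))
)
# Count distinct active days per user, so repeated records on one day are not double counted
summary <- ddply(userdata, .(userID), summarize, activedays=length(unique(date)))
# Keep users active on at least 30 days
summary[summary$activedays >= 30, ]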
You can find out more about plyr at Hadley's excellent website: http://had.co.nz/plyr/
I would use ave to count the number of days each user was active in the month (this assumes at most one row per user per day).
Data$activeDays <- ave(Data$userID, Data$userID, FUN=length)
Data[ Data$activeDays >= 30, ]
It would be a bit more tricky if your data set contains multiple months...
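For the multi-month case, one possible approach is to group by both user and month. The following is only a sketch under some assumptions: the toy data frame, the month, activeDays and daysInMonth columns, and the rule that a "full" month means every distinct day observed anywhere in the data for that month are all made up for illustration and are not part of the original question.

```r
# Hypothetical data: user 1 is active every day of Nov and Dec 2001,
# user 2 is active Nov 1-29 only (misses Nov 30).
Data <- data.frame(
  date   = c(seq(as.Date("2001-11-01"), as.Date("2001-12-31"), by = "day"),
             seq(as.Date("2001-11-01"), as.Date("2001-11-29"), by = "day")),
  userID = c(rep(1, 61), rep(2, 29)),
  x      = 1
)

# Year-month label for each row
Data$month <- format(Data$date, "%Y-%m")

# Distinct active days per user within each month
Data$activeDays <- ave(as.numeric(Data$date), Data$userID, Data$month,
                       FUN = function(d) length(unique(d)))

# Distinct days observed in the data for each month (assumed definition of a full month)
Data$daysInMonth <- ave(as.numeric(Data$date), Data$month,
                        FUN = function(d) length(unique(d)))

# Keep only user-months with activity on every observed day
fullMonths <- Data[Data$activeDays == Data$daysInMonth, ]
```

Here only user 1's rows remain, since user 2 has 29 of the 30 November days. If the data might lack some calendar days for everyone, comparing against the true number of days in each month would be a safer rule.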
which(tapply(userdata$date, userdata$userID, length) == 30)
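The tapply result is named by userID, so those names can be used to pull the matching rows back out of the data. A minimal sketch, assuming a small made-up data frame and a hypothetical nDays threshold of 3 standing in for the 30 days in the question:

```r
# Hypothetical toy data: user 1 is active on 3 days, user 2 on 2 days
userdata <- data.frame(
  date   = as.Date("2001-11-08") + c(0, 1, 2, 0, 2),
  userID = c(1, 1, 1, 2, 2),
  x      = c(20, 19, 15, 2, 4)
)

nDays <- 3  # would be 30 for the data in the question

# Distinct active days per user; the result is named by userID
counts <- tapply(userdata$date, userdata$userID,
                 function(d) length(unique(d)))

# IDs (as character names) of users active on all nDays days
keep <- names(counts)[counts == nDays]

# Rows belonging to those users
complete <- userdata[as.character(userdata$userID) %in% keep, ]
```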