开发者

How can I count the current people in a group from a list of start and end dates in R

Or rather, How can I do it better than I have fudged.

I have a dataframe with names and start and end dates in a group. I want to produce a dataframe with the number of people in the group over time. Note, some people haven't left yet (end date is NA)

Here's an example dataset

foo<-data.frame(name=c("Bob","Sue", "Richard", "Jane"), 
        Start=as.POSIXct(c("2006-03-23 GMT", "2007-01-20 GMT", "2007-01-20 GMT", "2006-03-23 GMT")),
        End=as.POSIXct(c("2009-01-20 GMT", "NA", "2006-03-23 GMT", "NA")))

Here I create a dataframe with dates covering the range I want. This feels very dirty.

daterange<-data.frame(date=as.POSIXct(
                paste(
                        rep(2006:2009, each=12), 
                        "-", 
                        rep(01:12, times=4),
                        "-", 
                        1,
                        " GMT", 
                        sep="")
                        )
                )

#cheat by setting NAs to soemthing far away
foo$End[is.na(foo$End)]<-as.POSIXct(Sys.time())+(365*24*60*60)

Now I use ddply to produce the result.

ddply(.data=daterange, .variable="date", function(df) {
   开发者_运维问答         result=nrow(subset(foo, Start<df$date & End>df$date))
            return(result)
        })

There must be an easier way ?


Here is an alternate approach using plyr. It directly works with the original data frame foo and does not require converting NA into a date. The code is self-explanatory and very readable. Any comments are welcome.

dates = seq(as.POSIXct('2006-01-01'), as.POSIXct('2009-12-01'), by = "month")
count = ldply(dates, function(d) 
   with(foo, sum((Start < d) + (d < End | is.na(End)) == 2)))
data.frame(dates, count)


Here's a different way that you might find easier:

foo<-data.frame(name=c("Bob","Sue", "Richard", "Jane"), 
    Start=as.POSIXct(c("2006-03-23 GMT", "2007-01-20 GMT", "2007-01-20 GMT", "2006-03-23 GMT")),
    End=as.POSIXct(c("2009-01-20 GMT", NA, "2006-03-23 GMT", NA)))


tmp <- expand.grid(foo$name,seq.POSIXt(as.POSIXct('2006-01-01'),
               as.POSIXct('2009-12-01'),by="month"))
colnames(tmp) <- c('name','date')
foo[is.na(foo)] <- max(tmp$date) + 1

tmp1 <- merge(tmp,foo,by="name")
tmp2 <- tmp1$Start <= tmp1$date & tmp1$End >= tmp1$date
aggregate(tmp2,by=list(date=tmp1$date),sum)

My two cents here are to use seq.* rather than pasteing dates together and that ddply is kind of an awkward tool if you're really just taking daterange one element at a time. I used aggregate, but you could have used lapply or something like that.

You could compress this into fewer lines if you really want to, but readability will suffer.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜