
Add indicator variable to long data frame for when the value increases from one year to the next

I have a long data frame with three columns: fyear, tic, and dcvt (fiscal year, ticker, and total convertible debt). There are about 18 fiscal years and a few thousand tickers. I would like to add an indicator variable that is one whenever dcvt goes up from one year to the next.

I tried ddply, but I lost the fyear column and wasn't sure how to get it back.

library(plyr)
temp <- data.frame(fyear = rep(1992:2009, 10), tic = rep(letters[1:10], each = 18), dcvt = rnorm(180, 200, 10))
my.fun <- function(x) x <- c(0, ifelse(tail(x, -1) - head(x, -1) > 0, 1, 0))
temp2 <- ddply(temp, "tic", colwise(my.fun, "dcvt"))

I also tried to cast to wide with the reshape2 package, then run for loops, but of course, that took forever.

Is there a way that I can do this quickly? Should I make a wide zoo object then use diff? I would like to avoid passing through a time series, if I can. Thanks!


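Using transform inside ddply helps here, because it keeps every column (including fyear). Note that the call below overwrites dcvt with the indicator; give the new column a different name if you want to keep the original values: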

ddply(temp, .(tic), transform, dcvt=c(0, diff(dcvt)>0))
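For comparison, here is a minimal base R sketch of the same idea using ave(), which applies a function within groups and returns a vector as long as its input. The column name up is a made-up name for illustration, and the sketch assumes rows can be sorted by fyear within each tic.

```r
# Sample data in the same shape as the question (seed added for reproducibility).
set.seed(1)
temp <- data.frame(fyear = rep(1992:2009, 10),
                   tic   = rep(letters[1:10], each = 18),
                   dcvt  = rnorm(180, 200, 10))

# Make sure years are in order within each ticker before differencing.
temp <- temp[order(temp$tic, temp$fyear), ]

# "up" is a hypothetical column name: 1 when dcvt rose from the previous
# year, 0 otherwise (and 0 for each ticker's first year).
temp$up <- ave(temp$dcvt, temp$tic,
               FUN = function(x) c(0, as.numeric(diff(x) > 0)))

head(temp)
```

Because the indicator goes into a new column, fyear and the original dcvt values are both preserved.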


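ddply() handles a data set of this size (10^2) quite well. However, for larger datasets, and for situations where you don't necessarily need to return a full data frame, I would consider the following do.call + lapply solution. It returns a plain vector of indicators with no leading 0 for each ticker's first year, so it is one element shorter per ticker than the data: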

my.fun <- function(cur.tic){
  as.numeric(diff(temp$dcvt[temp$tic == cur.tic]) > 0)
}

do.call("c", lapply(unique(temp$tic), my.fun))
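If the result needs to line up with the rows of the data frame, one possible variation is sketched below. It takes the data frame as an argument rather than reading the global temp, pads each ticker with a leading 0, and uses split() and unsplit() to put the values back in the original row order. The names up.by.tic and up are hypothetical, and rows are assumed to be sorted by fyear within each tic.

```r
# Sample data in the same shape as the question (seed added for reproducibility).
set.seed(1)
temp <- data.frame(fyear = rep(1992:2009, 10),
                   tic   = rep(letters[1:10], each = 18),
                   dcvt  = rnorm(180, 200, 10))

# Hypothetical helper: one indicator per row of df.
up.by.tic <- function(df) {
  # Split dcvt by ticker, then flag year-over-year increases in each piece.
  pieces <- lapply(split(df$dcvt, df$tic),
                   function(x) c(0, as.numeric(diff(x) > 0)))
  # Reassemble the pieces in the original row order.
  unsplit(pieces, df$tic)
}

temp$up <- up.by.tic(temp)
head(temp)
```

Splitting once avoids scanning the whole tic column for every ticker, which may matter with a few thousand tickers; no timings are claimed here.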

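To demonstrate the performance payoffs (unfairly, given the vector vs. data frame issue), I took the OP's sample data, created new data frames of magnitude 10^4, 10^5, and 10^6, and then ran system.time() on @kohske's ddply solution and the solution above. One caveat: my.fun as written above reads from the global temp, so it has to be redefined for each larger data frame (temp.2, temp.3, temp.4) for the lapply timings to reflect the larger data: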

Original data (10^2):

> system.time(do.call("c", lapply(unique(temp$tic), my.fun)))
   user  system elapsed 
  0.000   0.000   0.003 
> system.time(ddply(temp, .(tic), transform, dcvt=c(0, diff(dcvt)>0)))
   user  system elapsed 
  0.020   0.000   0.013 

10^4 sample data

> system.time(do.call("c", lapply(unique(temp.2$tic), my.fun)))
   user  system elapsed 
  0.000   0.000   0.002 
> system.time(ddply(temp.2, .(tic), transform, dcvt=c(0, diff(dcvt)>0)))
   user  system elapsed 
  0.040   0.000   0.036 

10^5 sample data

> system.time(do.call("c", lapply(unique(temp.3$tic), my.fun)))
   user  system elapsed 
  0.000   0.000   0.004 
> system.time(ddply(temp.3, .(tic), transform, dcvt=c(0, diff(dcvt)>0)))
   user  system elapsed 
  0.270   0.000   0.279 

10^6 sample data

> system.time(do.call("c", lapply(unique(temp.4$tic), my.fun)))
   user  system elapsed 
  0.010   0.000   0.018 
> system.time(ddply(temp.4, .(tic), transform, dcvt=c(0, diff(dcvt)>0)))
   user  system elapsed 
  6.110   0.070   6.186 

This is not a gripe about ddply(). It is just an effort to share some code that I found useful while working on a very similar issue with a much larger dataset recently.
