Making a better summary statistics table with plyr in R

Every time I get a new data set, the first thing I do is check the summary statistics. The summary function does a pretty good job, but I'm frequently interested in standard deviations, quantiles with different breakpoints, number of observations, etc. Also, the layout that summary produces isn't the easiest to digest, and it isn't what you see in journals (i.e., summary lays the variables out horizontally instead of vertically).

For example, here is what I get from summary with some made-up data.

> library(plyr)
> library(reshape2)
> my.data <- data.frame(firm = factor(rep(letters[1:5], each = 5)), returns = rnorm(n = 5 * 5), leverage = rep(c(0.3, 0.4, 0.5, 0.6, 0.7), each = 5) + .... [TRUNCATED] 
> my.summary <- summary(my.data)
> my.summary
 firm     returns           leverage     
 a:5   Min.   :-1.6765   Min.   :0.2863  
 b:5   1st Qu.:-0.6945   1st Qu.:0.3929  
 c:5   Median :-0.1930   Median :0.5061  
 d:5   Mean   :-0.1159   Mean   :0.5009  
 e:5   3rd Qu.: 0.4323   3rd Qu.:0.6011  
       Max.   : 1.1915   Max.   :0.7093  

But let's say I really want something more like this.

> my.manual.summary <- data.frame(mean = c(mean(my.data$returns), mean(my.data$leverage)), median = c(median(my.data$returns), median(my.data$leverage .... [TRUNCATED] 
> rownames(my.manual.summary) <- c("returns", "leverage")
> my.manual.summary
               mean     median        sd
returns  -0.1158633 -0.1929571 0.6996548
leverage  0.5008895  0.5061301 0.1453381

For this small data set (i.e., just a few firm characteristics) this is easy. But if I have more variables, or want to compute more statistics or do more slicing and dicing, it can get tedious.

I tried this with reshape2 and plyr, but I get an error.

> my.melted.data <- melt(my.data)
Using firm as id variables
> my.improved.summary <- ddply(my.melted.data[, -1], .(variable), c("mean", "median", "sd"), na.rm = T)
Error in proto[[i]] <- fs[[i]](x, ...) : 
  more elements supplied than there are to replace
In addition: Warning messages:
1: In mean.default(X[[1L]], ...) :
  argument is not numeric or logical: returning NA
2: In mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]) :
  argument is not numeric or logical: returning NA
3: In var(as.vector(x), na.rm = na.rm) : NAs introduced by coercion
4: In mean.default(X[[1L]], ...) :
  argument is not numeric or logical: returning NA

This leaves me with two questions:

  1. What am I doing wrong with ddply?
  2. Am I re-inventing the wheel here? Given that this is table 1 in everything I read and write, is there an existing solution that I haven't found?

Thanks!


Try stat.desc in the pastecs package. You can use it on your data set by calling stat.desc(my.data). To get the output in the format you want, you need to (a) transpose the data frame, (b) remove the non-numeric variables, and (c) retain only the summary statistics columns you require.
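The three steps above might look something like the following sketch. The data frame here is a stand-in: the original definition of my.data is truncated, so the noise added to leverage and the fixed seed are assumptions made only to get something runnable. The non-numeric column is dropped first, before transposing, which is a reordering of steps (a) and (b) for convenience.

```r
library(pastecs)

set.seed(1)  # assumption: fixed seed, only for reproducibility of the sketch

# Stand-in data; the noise term on leverage is an assumption
# (the original definition is truncated)
my.data <- data.frame(
  firm     = factor(rep(letters[1:5], each = 5)),
  returns  = rnorm(n = 5 * 5),
  leverage = rep(c(0.3, 0.4, 0.5, 0.6, 0.7), each = 5) + rnorm(n = 5 * 5, sd = 0.01)
)

# (b) keep only the numeric variables
numeric.data <- my.data[sapply(my.data, is.numeric)]

# stat.desc returns one column per variable and one row per statistic
all.stats <- stat.desc(numeric.data)

# (a) transpose so each variable is a row, then
# (c) keep only the statistics of interest
my.pastecs.summary <- as.data.frame(t(all.stats))[, c("mean", "median", "std.dev")]
my.pastecs.summary
```

Other rows of the stat.desc output (for example nbr.val for the number of observations) can be kept by adding their names to the column selection.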


I found the conceptual error in my code above. Because mean, median, and sd operate on a vector, I need to feed them a specific vector in the data frame that ddply creates based on .variables. (I was incorrectly applying an example from the manual, which used data frame operators nrow and ncol.) Here's the correct code:

my.melted.data <- melt(my.data)
my.improved.summary <- ddply(
  my.melted.data
  , .(variable)
  , function(x) data.frame(
    mean = mean(x$value)
    , median = median(x$value)
    , sd = sd(x$value)
  )
)

Ramnath's solution is easier, but this approach extends to any type of summary statistic you might want.
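As a rough illustration of that extensibility, here is a sketch that adds the number of observations and two quantiles to the same ddply pattern. The data frame is again a stand-in (the noise on leverage and the seed are assumptions, since the original definition is truncated), and the column names n, p10, and p90 are made up for this example.

```r
library(plyr)
library(reshape2)

set.seed(1)  # assumption: fixed seed, only for reproducibility of the sketch

# Stand-in data; the noise term on leverage is an assumption
my.data <- data.frame(
  firm     = factor(rep(letters[1:5], each = 5)),
  returns  = rnorm(n = 5 * 5),
  leverage = rep(c(0.3, 0.4, 0.5, 0.6, 0.7), each = 5) + rnorm(n = 5 * 5, sd = 0.01)
)

# Long format: one row per (firm, variable) observation
my.melted.data <- melt(my.data, id.vars = "firm")

# One output row per variable, one column per statistic
my.extended.summary <- ddply(
  my.melted.data,
  .(variable),
  function(x) data.frame(
    n      = sum(!is.na(x$value)),
    mean   = mean(x$value, na.rm = TRUE),
    sd     = sd(x$value, na.rm = TRUE),
    # unname() drops the "10%" / "90%" labels that quantile() attaches
    p10    = unname(quantile(x$value, 0.10, na.rm = TRUE)),
    median = median(x$value, na.rm = TRUE),
    p90    = unname(quantile(x$value, 0.90, na.rm = TRUE))
  )
)
my.extended.summary
```

Changing the grouping to .(firm, variable) should give the same statistics for each firm separately, which covers the slicing-and-dicing case from the question.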
