Making a better summary statistics table with plyr in R
Every time I get a new data set, the first thing I do is check out the summary statistics. The summary function does a pretty good job, but I'm frequently interested in standard deviations, quantiles with different breakpoints, number of observations, etc. Also, the output of summary isn't the easiest to digest, and it isn't what you see in journals (i.e., summary is horizontal instead of vertical).
For example, here is what I get from summary with some made up data.
> library(plyr)
> library(reshape2)
> my.data <- data.frame(firm = factor(rep(letters[1:5], each = 5)), returns = rnorm(n = 5 * 5), leverage = rep(c(0.3, 0.4, 0.5, 0.6, 0.7), each = 5) + .... [TRUNCATED]
> my.summary <- summary(my.data)
> my.summary
 firm      returns            leverage
 a:5   Min.   :-1.6765   Min.   :0.2863
 b:5   1st Qu.:-0.6945   1st Qu.:0.3929
 c:5   Median :-0.1930   Median :0.5061
 d:5   Mean   :-0.1159   Mean   :0.5009
 e:5   3rd Qu.: 0.4323   3rd Qu.:0.6011
       Max.   : 1.1915   Max.   :0.7093
But let's say I really want something more like this.
> my.manual.summary <- data.frame(mean = c(mean(my.data$returns), mean(my.data$leverage)), median = c(median(my.data$returns), median(my.data$leverage .... [TRUNCATED]
> rownames(my.manual.summary) <- c("returns", "leverage")
> my.manual.summary
               mean     median        sd
returns  -0.1158633 -0.1929571 0.6996548
leverage  0.5008895  0.5061301 0.1453381
For this small data set (i.e., just a few firm characteristics) this is easy. But if I have more variables, or want to compute more statistics or do more slicing and dicing, it can get tedious.
I tried this with reshape2 and plyr, but I get an error.
> my.melted.data <- melt(my.data)
Using firm as id variables
> my.improved.summary <- ddply(my.melted.data[, -1], .(variable), c("mean", "median", "sd"), na.rm = T)
Error in proto[[i]] <- fs[[i]](x, ...) :
more elements supplied than there are to replace
In addition: Warning messages:
1: In mean.default(X[[1L]], ...) :
argument is not numeric or logical: returning NA
2: In mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]) :
argument is not numeric or logical: returning NA
3: In var(as.vector(x), na.rm = na.rm) : NAs introduced by coercion
4: In mean.default(X[[1L]], ...) :
argument is not numeric or logical: returning NA
This leaves me with two questions:
- What am I doing wrong with ddply?
- Am I re-inventing the wheel here? Given that this is table 1 in everything I read and write, is there an existing solution that I haven't found?
Thanks!
Try the stat.desc function in the pastecs package. You can use it on your data set by calling stat.desc(my.data). To get the output in the format you desire, you need to (a) transpose the data frame, (b) remove non-numeric variables, and (c) retain only the summary statistics columns you require.
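Steps (a) through (c) could look roughly like the sketch below. It assumes pastecs is installed; the data is made up to mirror the shape of my.data in the question (the original definition is truncated, so the values here are only illustrative), and the names numeric.data, all.stats, and my.pastecs.summary are placeholders of mine.

```r
library(pastecs)

# Made-up data shaped like the question's my.data (illustrative values only).
set.seed(42)
my.data <- data.frame(
  firm = factor(rep(letters[1:5], each = 5)),
  returns = rnorm(n = 5 * 5),
  leverage = rep(c(0.3, 0.4, 0.5, 0.6, 0.7), each = 5)
)

# (b) Keep only the numeric columns before computing statistics.
numeric.data <- my.data[, sapply(my.data, is.numeric), drop = FALSE]

# stat.desc() returns one row per statistic and one column per variable.
all.stats <- stat.desc(numeric.data)

# (a) Transpose so each variable becomes a row, then
# (c) keep only the statistics wanted in the final table.
my.pastecs.summary <- as.data.frame(t(all.stats))[, c("mean", "median", "std.dev")]
my.pastecs.summary
```

Note that stat.desc labels the standard deviation std.dev rather than sd, so rename the column afterwards if you want the table to match the manual one in the question.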
I found the conceptual error in my code above. Because mean, median, and sd operate on a vector, I need to feed them a specific vector in the data frame that ddply creates based on .variables. (I was incorrectly applying an example from the manual, which used the data frame functions nrow and ncol.) Here's the correct code:
my.melted.data <- melt(my.data)
my.improved.summary <- ddply(
my.melted.data
, .(variable)
, function(x) data.frame(
mean = mean(x$value)
, median = median(x$value)
, sd = sd(x$value)
)
)
Ramnath's solution is easier, but this is extensible to any type of summary statistics you might want.
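As a rough illustration of that extensibility, here is a sketch that adds an observation count and two quantiles to the same ddply call. The data is made up to mirror the question (the original definition is truncated), and the column names n, p10, and p90 and the 10%/90% breakpoints are my own choices, not anything from the original post.

```r
library(plyr)
library(reshape2)

# Made-up data shaped like the question's my.data (illustrative values only).
set.seed(42)
my.data <- data.frame(
  firm = factor(rep(letters[1:5], each = 5)),
  returns = rnorm(n = 5 * 5),
  leverage = rep(c(0.3, 0.4, 0.5, 0.6, 0.7), each = 5)
)

# Long format: one row per observation, with columns firm, variable, value.
my.melted.data <- melt(my.data, id.vars = "firm")

# Each piece x is a data frame, so every statistic is computed on x$value.
my.extended.summary <- ddply(
  my.melted.data,
  .(variable),
  function(x) data.frame(
    n = sum(!is.na(x$value)),
    mean = mean(x$value, na.rm = TRUE),
    sd = sd(x$value, na.rm = TRUE),
    p10 = quantile(x$value, probs = 0.10, na.rm = TRUE, names = FALSE),
    median = median(x$value, na.rm = TRUE),
    p90 = quantile(x$value, probs = 0.90, na.rm = TRUE, names = FALSE)
  )
)
my.extended.summary
```

Changing the grouping to .(firm, variable) should give the same statistics per firm, which covers the slicing and dicing case.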