Making a better summary statistics table with plyr in R

2023-02-23 08:12 问答作者：

Every time I get a new data set the first thing I do is check out the summary statistics. The summary function does a pretty good job, but I'm frequently interested in standard deviations, quantiles with different breakpoints, number of observations, etc. Also, the presentation of summary isn't really the easiest way to digest or what you see in journals (i.e., summary is horizontal instead of vertical).

For example, here is what I get from summary with some made up data.

> library(plyr)
> library(reshape2)
> my.data <- data.frame(firm = factor(rep(letters[1:5], each = 5)), returns = rnorm(n = 5 * 5), leverage = rep(c(0.3, 0.4开发者_运维问答, 0.5, 0.6, 0.7), each = 5) + .... [TRUNCATED] 
> my.summary <- summary(my.data)
> my.summary
 firm     returns           leverage     
 a:5   Min.   :-1.6765   Min.   :0.2863  
 b:5   1st Qu.:-0.6945   1st Qu.:0.3929  
 c:5   Median :-0.1930   Median :0.5061  
 d:5   Mean   :-0.1159   Mean   :0.5009  
 e:5   3rd Qu.: 0.4323   3rd Qu.:0.6011  
       Max.   : 1.1915   Max.   :0.7093

But let's say I really want something more like this.

> my.manual.summary <- data.frame(mean = c(mean(my.data$returns), mean(my.data$leverage)), median = c(median(my.data$returns), median(my.data$leverage .... [TRUNCATED] 
> rownames(my.manual.summary) <- c("returns", "leverage")
> my.manual.summary
               mean     median        sd
returns  -0.1158633 -0.1929571 0.6996548
leverage  0.5008895  0.5061301 0.1453381

For this small data set (i.e., just a few firm characteristics) this is easy. But I have more or what to do more statistics or more slicing-dicing, it can get tedious.

I tried this with reshape2 and plyr, but get an error.

> my.melted.data <- melt(my.data)
Using firm as id variables
> my.improved.summary <- ddply(my.melted.data[, -1], .(variable), c("mean", "median", "sd"), na.rm = T)
Error in proto[[i]] <- fs[[i]](x, ...) : 
  more elements supplied than there are to replace
In addition: Warning messages:
1: In mean.default(X[[1L]], ...) :
  argument is not numeric or logical: returning NA
2: In mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]) :
  argument is not numeric or logical: returning NA
3: In var(as.vector(x), na.rm = na.rm) : NAs introduced by coercion
4: In mean.default(X[[1L]], ...) :
  argument is not numeric or logical: returning NA

This leaves me with two questions:

What am I doing wrong with ddply?
Am I re-inventing the wheel here? Given that this is table 1 in everything I read and write, is there an existing solution that I haven't found?

Thanks!

Try the stat.desc in the pastecs package. You can use it on your data set by calling stat.desc(my.data). To get the output in the format you desire, you need to (a) transpose the data frame, (b) remove non-numeric variables and (c) only retain the summary statistics columns you require

I found the conceptual error in my code above. Because mean, median, and sd operate on a vector, I need to feed them a specific vector in the data frame that ddply creates based on .variables. (I was incorrectly applying an example from the manual, which used data frame operators nrow and ncol.) Here's the correct code:

my.melted.data <- melt(my.data)
my.improved.summary <- ddply(
  my.melted.data
  , .(variable)
  , function(x) data.frame(
    mean = mean(x$value)
    , median = median(x$value)
    , sd = sd(x$value)
  )
)

Ramnath's solution is easier, but this is extensible to any type summary stats you might want.

继续阅读：plyr r

Making a better summary statistics table with plyr in R

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？