
Calculate means of rows

I have a dataframe called ants with multiple entries per site. It looks like this:

  Site     Date     Time  Temp SpCond Salinity Depth Turbidity Chlorophyll
1   71 6/8/2010 14:50:35 14.32  49.88    32.66 0.397       0.0         1.3
2   71 6/8/2010 14:51:00 14.31  49.94    32.70 1.073       0.0         2.0
3   71 6/8/2010 14:51:16 14.32  49.95    32.71 1.034      -0.1         1.6
4   71 6/8/2010 14:51:29 14.31  49.96    32.71 1.030      -0.2         1.6
5   70 6/8/2010 14:53:55 14.30  50.04    32.77 1.002      -0.2         1.2
6   70 6/8/2010 14:54:09 14.30  50.03    32.77 0.993      -0.5         1.2

Sites have different numbers of entries, usually 3 but sometimes fewer or more. Where both date and site number match, I would like to write a new dataframe with one entry per site giving the mean reading for each parameter. I would like empty or "na" cells to be omitted from the calculation and the subsequent dataframe.

I'm not sure if this is an apply function or a version of rowMeans maybe? Very stuck, any help much appreciated!


Nico's answer looks like mine would have, except that I would have added a named argument to be passed to mean() so that NAs in the aggregated columns do not sabotage the results. (I could not tell whether the OP knew or suspected that the NAs were in the by variables or in the other variables.)

aggregate(df[, -(1:3)], by=list(Site=df$Site, Date=df$Date), FUN=mean, na.rm=TRUE)

You would probably need to also run aggregate or tapply calls in parallel to count the number of non-NA values.
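A minimal sketch of such a count, assuming a cut-down stand-in for the question's data with one NA added for illustration (the name n.obs is made up here). The same aggregate() call is used, but with a function that counts the non-missing values instead of averaging them:

```r
## Small stand-in for the question's data, with one NA added for illustration
df <- data.frame(
  Site  = c(71, 71, 71, 70, 70),
  Date  = "6/8/2010",
  Temp  = c(14.32, NA, 14.32, 14.30, 14.30),
  Depth = c(0.397, 1.073, 1.034, 1.002, 0.993)
)

## Number of non-NA readings per Site/Date group, column by column
n.obs <- aggregate(df[, c("Temp", "Depth")],
                   by = list(Site = df$Site, Date = df$Date),
                   FUN = function(x) sum(!is.na(x)))
n.obs
```

The counts come back in the same Site/Date order as the means, so the two results can be compared side by side or merged on Site and Date.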

The formula method of aggregate may give different results, since its default is na.action=na.omit, which drops every row containing an NA in any variable before the means are computed:

aggregate(. ~ Site + Date, data=df[, -3], FUN=mean, na.rm=TRUE)
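To see where the two interfaces can disagree, here is a small sketch on made-up data (not the OP's full table) with a single NA in Temp. The names by.col and by.row are only for illustration:

```r
## Four readings, one with a missing Temp
df <- data.frame(
  Site  = c(71, 71, 70, 70),
  Date  = "6/8/2010",
  Temp  = c(14.32, NA, 14.30, 14.30),
  Depth = c(0.40, 1.00, 1.00, 0.99)
)

## Default method: each column drops only its own NAs
by.col <- aggregate(df[, c("Temp", "Depth")],
                    by = list(Site = df$Site, Date = df$Date),
                    FUN = mean, na.rm = TRUE)

## Formula method: na.action = na.omit removes the whole second row first
by.row <- aggregate(cbind(Temp, Depth) ~ Site + Date, data = df, FUN = mean)

by.col$Depth  # site 71 averages both Depth readings
by.row$Depth  # site 71 uses only the row where Temp is present
```

Which behaviour is wanted depends on whether a reading with one missing parameter should still contribute its other parameters.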


Here is one way using the plyr package and its ddply() function:

R> df
  Site     Date     Time  Temp SpCond Salinity Depth Turbidity Chlorophyll
1   71 6/8/2010 14:50:35 14.32  49.88    32.66 0.397       0.0         1.3
2   71 6/8/2010 14:51:00 14.31  49.94    32.70 1.073       0.0         2.0
3   71 6/8/2010 14:51:16 14.32  49.95    32.71 1.034      -0.1         1.6
4   71 6/8/2010 14:51:29 14.31  49.96    32.71 1.030      -0.2         1.6
5   70 6/8/2010 14:53:55 14.30  50.04    32.77 1.002      -0.2         1.2
6   70 6/8/2010 14:54:09 14.30  50.03    32.77 0.993      -0.5         1.2
R> library(plyr)
R> ddply(df, .(Site,Date), function(x) sapply(x[,-(1:3)], mean, na.rm=TRUE))
  Site     Date   Temp SpCond Salinity  Depth Turbidity Chlorophyll
1   70 6/8/2010 14.300 50.035   32.770 0.9975    -0.350       1.200
2   71 6/8/2010 14.315 49.933   32.695 0.8835    -0.075       1.625
R> 

I used a custom anonymous function to skip the first three columns.


You can also use aggregate

aggregate(df[, -(1:3)], by=list(Site=df$Site, Date=df$Date), FUN=mean, na.rm=TRUE)


Here is a complete new answer with a full log also covering your new specification:

R> Lines <- "  Site     Date     Time  Temp SpCond Salinity Depth Turbidity Chlorophyll
+ 71 6/8/2010 14:50:35 14.32  49.88    32.66 0.397       0.0         1.3
+ 71 6/8/2010 14:51:00 14.31  49.94    32.70 1.073       0.0         2.0
+ 71 6/8/2010 14:51:16 14.32  49.95    32.71 1.034      -0.1         1.6
+ 71 6/8/2010 14:51:29 14.31  49.96    32.71 1.030      -0.2         1.6
+ 70 6/8/2010 14:53:55 14.30  50.04    32.77 1.002      -0.2         1.2
+ 70 6/8/2010 14:54:09 14.30  50.03    32.77 0.993      -0.5         1.2
+ "
R> con <- textConnection(Lines)
R> df <- read.table(con, sep="", header=TRUE, stringsAsFactors=FALSE)
R> close(con)
R> df$pt <- as.POSIXct(strptime(paste(df$Date, df$Time), "%m/%d/%Y %H:%M:%S"))
R> library(plyr)
R> newdf <- ddply(df, .(Site,Date), function(x) sapply(x[,-(1:3)], mean, na.rm=TRUE))
R> newdf$pt <- as.POSIXct(newdf$pt, origin="1970-01-01")
R> newdf
  Site     Date  Temp SpCond Salinity  Depth Turbidity Chlorophyll                  pt
1   70 6/8/2010 14.30  50.03    32.77 0.9975    -0.350       1.200 2010-06-08 20:54:02
2   71 6/8/2010 14.32  49.93    32.70 0.8835    -0.075       1.625 2010-06-08 20:51:05
R> 
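The timestamp step can be tried on its own in base R. This is only a sketch: it uses two of the readings from the question, and the choice of tz = "UTC" is an assumption made so that the printed time does not depend on the local time zone:

```r
## Two readings from the question, parsed with an explicit time zone (an assumption)
stamps <- as.POSIXct(c("6/8/2010 14:53:55", "6/8/2010 14:54:09"),
                     format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

## mean() has a POSIXct method, so the class is kept here
mid <- mean(stamps)

## Going through plain numbers (as happens when the column means are combined
## into one numeric vector) loses the class; restore it with origin and tz
secs <- mean(as.numeric(stamps))
restored <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

format(mid, "%H:%M:%S")
format(restored, "%H:%M:%S")
```

Passing the same tz when parsing and when converting back may help keep the averaged time on the same clock as the original readings.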


You were close with rowMeans(), but you need colMeans() instead. The others have shown how to use built-in or add-on functionality and I would certainly recommend you use them. However, it might be useful to see how to do something like this by hand:

## using df from Dirk's answer, we split the data in Site Date combinations
df.sp <- with(df,
              split(data.frame(Temp, SpCond, Salinity, Depth, Turbidity,
                               Chlorophyll),
                    list(Site = Site, Date = Date)))
## The above gives  a list of data frames one per date-site combo,
## to which we apply the colMeans() function
df.mean <- data.frame(t(sapply(df.sp, colMeans, na.rm = TRUE)))

At this point we need to do some extra tidying if you want the output to be nice like the others' answers:

## Process the rownames on df.mean
name.parts <- strsplit(rownames(df.mean), "\\.")
## pull out the Site part (before the '.')
df.mean <- within(df.mean, Site <- as.numeric(sapply(name.parts, `[`, 1)))
## pull out the Date part (after the '.')
df.mean <- within(df.mean, Date <- sapply(name.parts, `[`, 2))
## rearrange the columns
df.mean <- df.mean[, c(7:8,1:6)]

Note again that in most cases you should use the canned functions described in the other answers. Sometimes, however, it might be quicker to cook your own solution, and the above might act as a guide to achieving this.
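If you do roll your own, it is worth checking it against a canned function. The following sketch assumes a two-column subset of the question's data and uses drop = TRUE in split() so that Site/Date combinations with no rows are skipped; the names pieces, by.hand, canned and same are made up for the example:

```r
df <- data.frame(
  Site  = c(71, 71, 71, 71, 70, 70),
  Date  = "6/8/2010",
  Temp  = c(14.32, 14.31, 14.32, 14.31, 14.30, 14.30),
  Depth = c(0.397, 1.073, 1.034, 1.030, 1.002, 0.993)
)

## Hand-rolled: split into groups, then column means per group
pieces  <- split(df[, c("Temp", "Depth")], list(df$Site, df$Date), drop = TRUE)
by.hand <- t(sapply(pieces, colMeans, na.rm = TRUE))

## Canned: aggregate() over the same columns and groups
canned <- aggregate(df[, c("Temp", "Depth")],
                    by = list(Site = df$Site, Date = df$Date),
                    FUN = mean, na.rm = TRUE)

## Both results are ordered by the sorted Site values, so the numbers line up
same <- all.equal(unname(by.hand),
                  unname(as.matrix(canned[, c("Temp", "Depth")])))
same
```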
