Calculating percentages in an apply statement (R)
I'm struggling with something quite simple, but I'm going around in circles, and just don't see where I make an error. I really hope that someone could lend me a handy suggestion, so that I'm no longer stuck!
My goal: I want to calculate the percentage of instances in an data.frame which have a result higher than 0. I've tried this with the for loop, but to no avail. So, after some more searching I used the apply function to calculate various metrics as mean, sd, and min/max. This works great, but for calculating the percentage the apply function doesn't work, even when I make a custom function, and insert this into the apply function.
This is the shortened version of my data.frame:
tradesList[c(1:5,10:15),c(1,7)]
Instrument TradeResult.Currency.
1 JPM -3
2 JPM 264
3 JPM 284
4 JPM 69
5 JPM 283
10 JPM -294
11 KFT -8
12 KFT -48
13 KFT 125
14 KFT -150
15 KFT 开发者_运维百科 -206
I want to summarize this data.frame, for example by displaying the average TradeResult for each Instrument:
> tapply(tradesList$TradeResult.Currency., tradesList$Instrument, mean)
JPM KFT
42.3 14.6
However, I would also like to calculate the percentage of rows which have an TradeResult > 0 per instrument. With the 'which' function checking for instances which are > 0 does work, however, apply won't accept this function as an argument.
> length(which(tradesList$TradeResult.Currency. > 0)) / length(tradesList$TradeResult.Currency.) * 100
[1] 50
> tapply(tradesList$TradeResult.Currency., tradesList$Instrument, (length(which(tradesList$TradeResult.Currency. > 0)) / length(tradesList$TradeResult.Currency.) * 100))
Error in match.fun(FUN) :
c("'(length(which(tradesList$TradeResult.Currency. > 0))/length(tradesList$TradeResult.Currency.) * ' is not a function, character or symbol", "' 100)' is not a function, character or symbol")
>
I searched in the help function for more information on this error, and tried various different ways of formulating the function (for example with brackets or quotes), but each way led to the same result.
Does someone know a whay to calculate the percentage of instances which are greater than zero? Perhaps I'm missing something?
Thanks alot in advance,
Regards,
Edit: Thanks alot for your quick comments G. Grothendieck, Gavin Simpson and DWin. Highly appreciated and quite helpful!
Solved: Here's what I have now:
> tmpData <- tradesList[c(1:5,10:15),c(1,7)]
> tmpData
Instrument TradeResult.Currency.
1 JPM -3
2 JPM 264
3 JPM 284
4 JPM 69
5 JPM 283
10 JPM -294
11 KFT -8
12 KFT -48
13 KFT 125
14 KFT -150
15 KFT -206
> 100* # to get percentages
+ with( tmpData,
+ tapply( (TradeResult.Currency. > 0) , Instrument, sum)/ # number GT 0
+ tapply( TradeResult.Currency., Instrument, length) ) # total number
JPM KFT
66.66667 20.00000
> 100 * tapply(tmpData$TradeResult.Currency. > 0, tmpData$Instrument, mean)
JPM KFT
66.66667 20.00000
> pcentFun <- function(x) {
+ res <- x > 0
+ 100 * (sum(res) / length(res))
+ }
>
> with(tmpData, tapply(TradeResult.Currency., Instrument, pcentFun))
JPM KFT
66.66667 20.00000
Thanks again!
Regards,
Write a simple function to do the computation:
pcentFun <- function(x) {
res <- x > 0
100 * (sum(res) / length(res))
}
Then we can apply that to groups of Instruments, via tapply()
> with(tradeList, tapply(TradeResult.Currency, Instrument, pcentFun))
JPM KFT
66.66667 20.00000
but aggregate()
would be more useful if you want the summary with instrument names:
> with(tradesList, aggregate(TradeResult.Currency,
+ by = list(Instrument = Instrument), pcentFun))
Instrument x
1 JPM 66.66667
2 KFT 20.00000
Try this:
100 * tapply(tradesList$TradeResult.Currency. > 0, tradesList$Instrument, mean)
With the sample data in the post it gives:
JPM KFT
66.67 20.00
and here it is using sqldf (note that the RSQLite driver translates dots to underscores since dots are also an SQL operator so we use underscores where dots were):
> library(sqldf)
> sqldf("select Instrument,
+ 100 * avg(TradeResult_Currency_ > 0) as '%>0',
+ avg(TradeResult_Currency_) as 'Avg Currency'
+ from tradesList group by Instrument")
Instrument %>0 Avg Currency
1 JPM 66.67 100.5
2 KFT 20.00 -57.4
These two could also be translated to aggregate
by suitable modification of the aggregate
solution already posted.
You can work with logical results using sum or mean to get meaningful summary results:
100* # to get percentages
with( tradesList,
tapply( (TradeResult.Currency. > 0) , Instrument, sum)/ # number GT 0
tapply( TradeResult.Currency., Instrument, length) ) # total number
Edit: I noticed that Gavin gave you an answer that returned a dataframe, a generally well understood class. The class of the values from both Gabor's and my replies were arrays of one dimension. They could be turned into named-vectors by surrounding the object by c() which doubles as a concatenation and coercion to vector function. And as they stand they are perfectly acceptable for rbinding or accessing with "[" in the expected ways and retrun expected results from names().
The tapply function returns arrays with the number of dimensions in the INDEX argument(s) and can be very effectively combined for matrix operations with table objects. I do a lot of dividing sums by counts, or counts by sums to get meaningful category statistics in 2, 3 or 4 dimensions.
精彩评论