Percentile for Each Observation w/r/t Grouping Variable

2022-12-18 15:31 问答作者：

I have some data that looks like the following. It is grouped by variable开发者_运维百科 "Year" and I want to extract the percentiles of each observation of Score, with respect to the Year it is from, preferably as a vector.

Year   Score
2001   89
2001   70
2001   72
2001   ...
..........
2004   87
2004   90

etc.

How can I do this? aggregate will not work, and I do not think apply will work either.

Following up on Vince's solution, you can also do this with plyr or by:

ddply(df, .(years), function(x) transform(x, percentile=ecdf(x$scores)(x$scores)))

Using ave

ave(d1$scores, d1$year, FUN=function(x) ecdf(x)(x))

I may be misunderstanding, but I think it can be done this way:

> years = c(2006, 2006, 2006, 2006, 2001, 2001, 2001, 2001, 2001)
> scores = c(13, 65, 23, 34, 78, 56, 89, 98, 100)
> tapply(scores, years, quantile)
$`2001`
  0%  25%  50%  75% 100% 
  56   78   89   98  100 

$`2006`
   0%   25%   50%   75%  100% 
13.00 20.50 28.50 41.75 65.00

Is this right?

I mean the actual percentile of each observation. – Ryan Rosario

Edit:

I think this may do it then:

> tapply(scores, years, function(x) { f = ecdf(x); sapply(x, f) })
$`2001`
[1] 0.4 0.2 0.6 0.8 1.0

$`2006`
[1] 0.25 1.00 0.50 0.75

With your data:

> tapply(scores, years, function(x) { f = ecdf(x); sapply(x, f) })
$`2000`
[1] 0.3333333 0.6666667 1.0000000

$`2008`
[1] 0.5 1.0

Edit 2:

This is probably faster:

tapply(scores, years, function(x) { f = ecdf(x); f(x) })

f() is vectorized :-)

Last, modification, I promise :-). If you want names:

> tapply(scores, years, function(x) { f = ecdf(x); r = f(x); names(r) <- x; r })
$`2000`
     1000      1700      2000 
0.3333333 0.6666667 1.0000000 

$`2008`
1500 2000 
 0.5  1.0

Using data.table is pretty straight-forward as well. Just for completeness and also as an easy way to find the data.table solution.

library(data.table)
year <- rep(2001:2005, 2)
score <- round(rnorm(10, 35, 3))

dt <- data.table(score)


dt[, .(Percentile = ecdf(score)(score)), by = list(year)]

You can also do something like this:

# first I'll create two dummy variables (Year, Score)
year <- rep(2001:2005, 2)
score <- round(rnorm(10, 35, 3))

# then coerce variables to data frame
d <- data.frame(year, score)

# then you can use split() function to apply
# function to each stratum of grouping variable
sapply(split(score, year), function(x) quantile(x, probs=seq(.1, .9, .1)))

Output will go something like this:

     2001 2002 2003 2004 2005
10%  34.3 32.1 34.3 29.6 36.1
20%  34.6 32.2 34.6 30.2 36.2
30%  34.9 32.3 34.9 30.8 36.3
40%  35.2 32.4 35.2 31.4 36.4
50%  35.5 32.5 35.5 32.0 36.5
60%  35.8 32.6 35.8 32.6 36.6
70%  36.1 32.7 36.1 33.2 36.7
80%  36.4 32.8 36.4 33.8 36.8
90%  36.7 32.9 36.7 34.4 36.9

You can utilize t() function to transpose rows and columns if you prefer. Writing a function will be a good way to tackle this kind of problems. I strongly recommend plyr package written by Hadley Wickam.

Hope this helps! All the best!

I found a method, but it requires a loop.

group.pctiles <- function(group.var, comparable) {
    unique.vals <- unique(group.var)
    pctiles <- vector(length = length(group.var))
    for (i in 1:length(unique.vals)) {
        slice <- which(group.var == unique.vals[i])
        F <- ecdf(comparable[slice])
        group.pctiles <- F(comparable[slice])
        pctiles[slice] <- group.pctiles
    }
    return(pctiles)
}

group.var is the variable that groups the data. In my example in my question, it is Year. comparable contains the values we want to find the percentiles for. In my question, comparable would be Score.

For the following data, I get the result below:

Year,School,Fees
2000,10,1000
2008,1,1050
2008,4,2000
2000,3,1700
2000,1,2000

> group.pctiles(dat, dat$Year, dat$Fees)
[1] 0.3333333 0.5000000 1.0000000 0.6666667 1.0000000

Then, I can cbind these percentiles back into the original data.frame for analysis, reporting, etc.

Anyone have a solution that doesn't require a loop?

How about something like:

Year <- c(2000,2008,2008,2000,2000)
Fees <- c(1000,1050,2000,1700,2000)
dat <- data.frame(Fees,Year,result=NA)
res <- tapply(Fees,Year,function(x) rank(x,ties.method="max")/length(x))
for(i in 1:length(res))
   dat[Year==as.numeric(names(res)[i]),"result"] <-res[[i]]

which yields:

  Fees Year    result
1 1000 2000 0.3333333
2 1050 2008 0.5000000
3 2000 2008 1.0000000
4 1700 2000 0.6666667
5 2000 2000 1.0000000

继续阅读：r

Percentile for Each Observation w/r/t Grouping Variable

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？