Group rows by attributes

2023-01-28 06:53 问答作者：

I have a data frame containing data about student lateness to various classes. Each row contains data about a late student and his class: date and time of the class, name of the class, class size, number of minutes late, and the gender of the student. In order to get the total percentage of late students for all classes, I need to count the number of rows (late students) and compare that with the total number of studen开发者_C百科ts that attended class.

I can't simply sum the class sizes for all of the rows; that would count the students of a given class several times, once for each late student in the class. Instead, I need to count each class size only once for each meeting of the class.

Example

Key: minutes late, class name, students in attendance, gender of tardy student, minutes late.

11/12/10 Stats 30 M 1
11/12/10 Stats 30 M 1
11/12/10 Stats 30 M 1
11/15/10 Stats 40 F 3
11/15/10 Stats 40 F 3
11/15/10 Stats 40 F 3
11/16/10 Radar 22 M 2
11/16/10 Radar 22 M 2
11/16/10 Radar 22 M 2
11/16/10 Radar 22 M 2
11/16/10 Radar 22 M 2

In this case, there are three different class meetings and 11 late students. How could I make sure each class meeting's class size is only counted once?

If I understand what you want correctly, this is easier to do with the plyr package, rather than tapply or by because it understands what amounts to a multivariate grouping. For instance:

ddply(df, .(DATE,CLASS), transform, PERCENT_LATE=length(MINUTES.LATE)/CLASS.SIZE))

The argument to length here can be any of the column names. ddply will split your dataframe for each combination of DATE and CLASS factor levels. The number of rows in each mini dataframe should then correspond to how many late students there were (since there is an entry for each late student). That is where the length(any variable) comes in. Divide it by the class size column for the fraction.

To follow on @Gavin's comment re: the redundant output, using summarise:

df.out <- ddply(x, .(DATE, CLASS), summarise    
    , NLATE = length(c(DATE, CLASS)) / 2
    , SIZE = unique(CLASS.SIZE)
    , PCLATE = 100 * (length(c(DATE, CLASS)) / 2 )/ unique(CLASS.SIZE)
    )
> df.out
      DATE CLASS NLATE SIZE PCLATE
1 11/12/10 Stats     3   30  10.00
2 11/15/10 Stats     3   40   7.50
3 11/16/10 Radar     5   22  22.73

Different functions for sum number late and class size . Need to use a "paste" strategy to create unique combo's of data and class name:

>  sum_late <- tapply( tst$V5, paste(tst$V1, tst$V2, sep="_"), length)
>  csize <- tapply( tst$V3, paste(tst$V1, tst$V2, sep="_"), head,1)
> pct_late <- 100*sum_late/csize
> pct_late
11/12/10_Stats 11/15/10_Stats 11/16/10_Radar 
      10.00000        7.50000       22.72727

Or with aggregate:

>  dfcount <- aggregate( tst$V5, list(tst$V1, tst$V2), length)
> dfcount$pct <- 100*aggregate( tst$V5, list(tst$V1, tst$V2), length)$x/aggregate( tst$V3, list(tst$V1, tst$V2), head,1)$x
> dfcount
   Group.1 Group.2 x      pct
1 11/16/10   Radar 5 22.72727
2 11/12/10   Stats 3 10.00000
3 11/15/10   Stats 3  7.50000

Edit: My solution can be made a lot simpler by computing the trivial % late on a per row basis first, then use aggregate() to sum these percentages by Date and Class:

> df2 <- within(df, pcLate <- 100 * (1 / Size)) 
> df2
         Date Class Size Sex MinsLate   pcLate
1  2010-11-12 Stats   30   M        1 3.333333
2  2010-11-12 Stats   30   M        1 3.333333
3  2010-11-12 Stats   30   M        1 3.333333
4  2010-11-15 Stats   40   F        3 2.500000
5  2010-11-15 Stats   40   F        3 2.500000
6  2010-11-15 Stats   40   F        3 2.500000
7  2010-11-16 Radar   22   M        2 4.545455
8  2010-11-16 Radar   22   M        2 4.545455
9  2010-11-16 Radar   22   M        2 4.545455
10 2010-11-16 Radar   22   M        2 4.545455
11 2010-11-16 Radar   22   M        2 4.545455
> with(df2, aggregate(pcLate, by = list(Date = Date, Class = Class), sum))
        Date Class        x
1 2010-11-16 Radar 22.72727
2 2010-11-12 Stats 10.00000
3 2010-11-15 Stats  7.50000

Original Answer:

Assuming df contains the example data you provide, we can do this in a couple of steps using aggregate()

First, grab the number of late students per class:

summ <- with(df, aggregate(MinsLate, by = list(Date = Date, Class = Class),
                           FUN = length))
names(summ)[3] <- "nLate"

Which gives us this starting point

> head(summ)
        Date Class nLate
1 2010-11-16 Radar     5
2 2010-11-12 Stats     3
3 2010-11-15 Stats     3

Then form the class sizes:

summ$Size <- with(df, aggregate(Size, by = list(Date = Date, Class = Class),
                                FUN = unique)$x)

Which gets us to here:

> head(summ)
        Date Class nLate Size
1 2010-11-16 Radar     5   22
2 2010-11-12 Stats     3   30
3 2010-11-15 Stats     3   40

Then compute the percentage late:

summ <- within(summ, pcLate <- 100 * (nLate / Size))

Which results in:

> head(summ)
        Date Class nLate Size   pcLate
1 2010-11-16 Radar     5   22 22.72727
2 2010-11-12 Stats     3   30 10.00000
3 2010-11-15 Stats     3   40  7.50000

If you need to do this a lot, wrap it into a function

tardiness <- function(df) {
    out <- with(df, aggregate(MinsLate, by = list(Date = Date, Class = Class),
                              FUN = length))
    names(out)[3] <- "nLate"
    out$Size <- with(df, aggregate(Size, by = list(Date = Date, Class = Class),
                                   FUN = unique)$x)
    out <- within(out, pcLate <- 100 * (nLate / Size))
    out
}

this does all the steps for us:

> tardiness(df)
        Date Class nLate Size   pcLate
1 2010-11-16 Radar     5   22 22.72727
2 2010-11-12 Stats     3   30 10.00000
3 2010-11-15 Stats     3   40  7.50000

继续阅读：dataframe filter r

Group rows by attributes

Example

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

Example

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？