Group rows by attributes
I have a data frame containing data about student lateness to various classes. Each row contains data about a late student and his class: date and time of the class, name of the class, class size, number of minutes late, and the gender of the student. In order to get the total percentage of late students for all classes, I need to count the number of rows (late students) and compare that with the total number of students that attended class.
I can't simply sum the class sizes for all of the rows; that would count the students of a given class several times, once for each late student in the class. Instead, I need to count each class size only once for each meeting of the class.
Example
Key: date, class name, students in attendance, gender of tardy student, minutes late.
11/12/10 Stats 30 M 1
11/12/10 Stats 30 M 1
11/12/10 Stats 30 M 1
11/15/10 Stats 40 F 3
11/15/10 Stats 40 F 3
11/15/10 Stats 40 F 3
11/16/10 Radar 22 M 2
11/16/10 Radar 22 M 2
11/16/10 Radar 22 M 2
11/16/10 Radar 22 M 2
11/16/10 Radar 22 M 2
In this case, there are three different class meetings and 11 late students. How could I make sure each class meeting's class size is only counted once?
If I understand what you want correctly, this is easier to do with the plyr package than with tapply or by, because it understands what amounts to a multivariate grouping. For instance:
library(plyr)
ddply(df, .(DATE, CLASS), transform, PERCENT_LATE = length(MINUTES.LATE) / CLASS.SIZE)
The argument to length here can be any of the column names. ddply splits your data frame for each combination of DATE and CLASS factor levels. The number of rows in each mini data frame then corresponds to how many late students there were (since there is an entry for each late student). That is where length(any variable) comes in. Divide it by the class size column to get the fraction.
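If plyr is not available, roughly the same per-row fraction can be sketched in base R with ave(). This is only an illustration: the data frame below is rebuilt by hand from the question's example, and the column names DATE, CLASS, CLASS.SIZE and MINUTES.LATE are assumed to match the ones used in the ddply call.

```r
# Hand-built copy of the question's example (assumed column names)
n <- c(3, 3, 5)  # late students per class meeting
df <- data.frame(
  DATE         = rep(c("11/12/10", "11/15/10", "11/16/10"), times = n),
  CLASS        = rep(c("Stats", "Stats", "Radar"), times = n),
  CLASS.SIZE   = rep(c(30, 40, 22), times = n),
  MINUTES.LATE = rep(c(1, 3, 2), times = n)
)

# ave() returns, for every row, the number of rows in its DATE x CLASS group
n_late <- ave(df$MINUTES.LATE, df$DATE, df$CLASS, FUN = length)

# Fraction of the class that was late, repeated on each row of the group
df$PERCENT_LATE <- n_late / df$CLASS.SIZE
```

As with transform in ddply, the result keeps one row per late student, so the fraction is repeated within each class meeting.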
To follow on @Gavin's comment re: the redundant output, using summarise:
df.out <- ddply(x, .(DATE, CLASS), summarise
, NLATE = length(c(DATE, CLASS)) / 2
, SIZE = unique(CLASS.SIZE)
, PCLATE = 100 * (length(c(DATE, CLASS)) / 2 )/ unique(CLASS.SIZE)
)
> df.out
DATE CLASS NLATE SIZE PCLATE
1 11/12/10 Stats 3 30 10.00
2 11/15/10 Stats 3 40 7.50
3 11/16/10 Radar 5 22 22.73
Use different functions for the number late and the class size. You need a "paste" strategy to create unique combinations of date and class name:
> sum_late <- tapply( tst$V5, paste(tst$V1, tst$V2, sep="_"), length)
> csize <- tapply( tst$V3, paste(tst$V1, tst$V2, sep="_"), head,1)
> pct_late <- 100*sum_late/csize
> pct_late
11/12/10_Stats 11/15/10_Stats 11/16/10_Radar
10.00000 7.50000 22.72727
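The snippet above never shows how tst was created. A minimal reproducible sketch, assuming it was read from the whitespace-separated example with read.table (which would explain the default names V1 to V5); the key helper variable is introduced here only for readability:

```r
# Assumed reconstruction of the example data; V1..V5 are read.table defaults
tst <- read.table(text = "
11/12/10 Stats 30 M 1
11/12/10 Stats 30 M 1
11/12/10 Stats 30 M 1
11/15/10 Stats 40 F 3
11/15/10 Stats 40 F 3
11/15/10 Stats 40 F 3
11/16/10 Radar 22 M 2
11/16/10 Radar 22 M 2
11/16/10 Radar 22 M 2
11/16/10 Radar 22 M 2
11/16/10 Radar 22 M 2
", stringsAsFactors = FALSE)

# One key per class meeting: date and class name pasted together
key <- paste(tst$V1, tst$V2, sep = "_")

sum_late <- tapply(tst$V5, key, length)          # late students per meeting
csize    <- tapply(tst$V3, key, function(v) v[1]) # class size, taken once
pct_late <- 100 * sum_late / csize
```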
Or with aggregate:
> dfcount <- aggregate( tst$V5, list(tst$V1, tst$V2), length)
> dfcount$pct <- 100*aggregate( tst$V5, list(tst$V1, tst$V2), length)$x/aggregate( tst$V3, list(tst$V1, tst$V2), head,1)$x
> dfcount
Group.1 Group.2 x pct
1 11/16/10 Radar 5 22.72727
2 11/12/10 Stats 3 10.00000
3 11/15/10 Stats 3 7.50000
Edit: My solution can be made a lot simpler by computing the trivial % late on a per-row basis first, then using aggregate() to sum these percentages by Date and Class:
> df2 <- within(df, pcLate <- 100 * (1 / Size))
> df2
Date Class Size Sex MinsLate pcLate
1 2010-11-12 Stats 30 M 1 3.333333
2 2010-11-12 Stats 30 M 1 3.333333
3 2010-11-12 Stats 30 M 1 3.333333
4 2010-11-15 Stats 40 F 3 2.500000
5 2010-11-15 Stats 40 F 3 2.500000
6 2010-11-15 Stats 40 F 3 2.500000
7 2010-11-16 Radar 22 M 2 4.545455
8 2010-11-16 Radar 22 M 2 4.545455
9 2010-11-16 Radar 22 M 2 4.545455
10 2010-11-16 Radar 22 M 2 4.545455
11 2010-11-16 Radar 22 M 2 4.545455
> with(df2, aggregate(pcLate, by = list(Date = Date, Class = Class), sum))
Date Class x
1 2010-11-16 Radar 22.72727
2 2010-11-12 Stats 10.00000
3 2010-11-15 Stats 7.50000
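The same sum can also be written with aggregate()'s formula interface, which names the result column for you. A self-contained sketch, assuming the data frame has the Date, Class, Size, Sex and MinsLate columns shown above (it is rebuilt by hand here):

```r
# Hand-built copy of the example data (assumed to match df above)
n <- c(3, 3, 5)  # late students per class meeting
df <- data.frame(
  Date     = as.Date(rep(c("2010-11-12", "2010-11-15", "2010-11-16"), times = n)),
  Class    = rep(c("Stats", "Stats", "Radar"), times = n),
  Size     = rep(c(30, 40, 22), times = n),
  Sex      = rep(c("M", "F", "M"), times = n),
  MinsLate = rep(c(1, 3, 2), times = n)
)

# Each late student contributes 100 / Size percentage points to the meeting
df$pcLate <- 100 / df$Size

# Sum the per-row contributions within each Date x Class combination
res <- aggregate(pcLate ~ Date + Class, data = df, FUN = sum)
```

Here res has one row per class meeting with columns Date, Class and pcLate.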
Original Answer:
Assuming df contains the example data you provide, we can do this in a couple of steps using aggregate().
First, grab the number of late students per class:
summ <- with(df, aggregate(MinsLate, by = list(Date = Date, Class = Class),
FUN = length))
names(summ)[3] <- "nLate"
Which gives us this starting point
> head(summ)
Date Class nLate
1 2010-11-16 Radar 5
2 2010-11-12 Stats 3
3 2010-11-15 Stats 3
Then form the class sizes:
summ$Size <- with(df, aggregate(Size, by = list(Date = Date, Class = Class),
FUN = unique)$x)
Which gets us to here:
> head(summ)
Date Class nLate Size
1 2010-11-16 Radar 5 22
2 2010-11-12 Stats 3 30
3 2010-11-15 Stats 3 40
Then compute the percentage late:
summ <- within(summ, pcLate <- 100 * (nLate / Size))
Which results in:
> head(summ)
Date Class nLate Size pcLate
1 2010-11-16 Radar 5 22 22.72727
2 2010-11-12 Stats 3 30 10.00000
3 2010-11-15 Stats 3 40 7.50000
If you need to do this a lot, wrap it into a function:
tardiness <- function(df) {
out <- with(df, aggregate(MinsLate, by = list(Date = Date, Class = Class),
FUN = length))
names(out)[3] <- "nLate"
out$Size <- with(df, aggregate(Size, by = list(Date = Date, Class = Class),
FUN = unique)$x)
out <- within(out, pcLate <- 100 * (nLate / Size))
out
}
This does all the steps for us:
> tardiness(df)
Date Class nLate Size pcLate
1 2010-11-16 Radar 5 22 22.72727
2 2010-11-12 Stats 3 30 10.00000
3 2010-11-15 Stats 3 40 7.50000
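The question also asks for a single overall percentage across all classes. One way to sketch that is to count each meeting's size once via unique() and divide the number of rows by the summed sizes. This assumes the class size is constant within a meeting, and it uses a hand-built copy of the example data; meetings and overall are names introduced here for illustration.

```r
# Hand-built copy of the example data (assumed to match df above)
n <- c(3, 3, 5)  # late students per class meeting
df <- data.frame(
  Date     = as.Date(rep(c("2010-11-12", "2010-11-15", "2010-11-16"), times = n)),
  Class    = rep(c("Stats", "Stats", "Radar"), times = n),
  Size     = rep(c(30, 40, 22), times = n),
  MinsLate = rep(c(1, 3, 2), times = n)
)

# One row per class meeting, so each class size is counted exactly once
meetings <- unique(df[, c("Date", "Class", "Size")])

# Late students (rows of df) over total attendance across all meetings
overall <- 100 * nrow(df) / sum(meetings$Size)
overall  # about 11.96 (11 late students out of 92 attendees)
```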