Grouping on multiple variables in R
I'm a power excel pivot table user who is forcing himself to learn R. I know exactly how to do this analysis in excel, but can't figure out the right way to code this in R.
I'm trying to group user data by 2 different variables, while grouping the variables into ranges (or bins), then summarizing other variables.
Here is what the data looks like:
userid visits posts revenue
1 25 0 25
2 2 2 0
3 86 7 8
4 128 24 94
5 30 5 18
… … … …
280000 80 10 100
280001 42 4 25
280002 31 8 17
Here is what I am trying to get the output to look like:
VisitRange PostRange # of Users Total Revenue Average Revenue
0 0 X Y Z
1-10 0 X Y Z
11-20 0 X Y Z
21-30 0 X Y Z
31-40 0 X Y Z
41-50 0 X Y Z
> 50 0 X Y Z
0 1-10 X Y Z
1-10 1-10 X Y Z
11-20 1-10 X Y Z
21-30 1-10 X Y Z
31-40 1-10 X Y Z
41开发者_如何学Go-50 1-10 X Y Z
> 50 1-10 X Y Z
want to group by visits and posts by 10 up to a certain level, then group anything higher than 50 as '> 51'
I've looked a tapply and ddply as ways to accomplish this, but I don't think they will work the way I am expecting, but I could be wrong.
Lastly, I know I could do this in SQL using and if/then statement to identify the range of visits and the range of posts (for example - If visits between 1 and 10, then '1-10'), then just group by visit range and post range, but my goal here is to start forcing myself to use R. Maybe R isn't the right tool here, but I think it is…
All help would be appreciated. Thanks in advance.
The idiom in the plyr
package and ddply
in particular, is very similar to pivot tables in Excel.
In your example, the only thing you need to do is the cut
your grouping variables into the desired breaks, before passing to ddply
. Here is an example:
First, create some sample data:
set.seed(1)
dat <- data.frame(
userid = 1:500,
visits =sample(0:50, 500, replace=TRUE),
posts = sample(0:50, 500, replace=TRUE),
revenue = sample(1:100, replace=TRUE)
)
Now, use cut
to divide your grouping variables into the desired ranges:
dat$PostRange <- cut(dat$posts, breaks=seq(0, 50, 10), include.lowest=TRUE)
dat$VisitRange <- cut(dat$visits, breaks=seq(0, 50, 10), include.lowest=TRUE)
Finally, use ddply
with summarise
:
library(plyr)
ddply(dat, .(VisitRange, PostRange),
summarise,
Users=length(userid),
`Total Revenue`=sum(revenue),
`Average Revenue`=mean(revenue))
The results:
VisitRange PostRange Users Total Revenue Average Revenue
1 [0,10] [0,10] 23 1318 57.30435
2 [0,10] (10,20] 23 1136 49.39130
3 [0,10] (20,30] 28 1499 53.53571
4 [0,10] (30,40] 20 923 46.15000
5 [0,10] (40,50] 14 826 59.00000
6 (10,20] [0,10] 23 1227 53.34783
7 (10,20] (10,20] 17 642 37.76471
8 (10,20] (20,30] 20 888 44.40000
9 (10,20] (30,40] 15 622 41.46667
10 (10,20] (40,50] 21 968 46.09524
11 (20,30] [0,10] 23 1226 53.30435
12 (20,30] (10,20] 19 1021 53.73684
13 (20,30] (20,30] 23 1380 60.00000
14 (20,30] (30,40] 8 313 39.12500
15 (20,30] (40,50] 19 710 37.36842
16 (30,40] [0,10] 18 782 43.44444
17 (30,40] (10,20] 25 1308 52.32000
18 (30,40] (20,30] 14 553 39.50000
19 (30,40] (30,40] 26 1131 43.50000
20 (30,40] (40,50] 20 1295 64.75000
21 (40,50] [0,10] 20 958 47.90000
22 (40,50] (10,20] 21 1168 55.61905
23 (40,50] (20,30] 20 1118 55.90000
24 (40,50] (30,40] 20 1009 50.45000
25 (40,50] (40,50] 20 934 46.70000
精彩评论