Calculating subtotals in R
I have a data frame with 900,000 rows and 11 columns in R. The column names and types are as follows:
column name: date / mcode / mname / ycode / yname / yissue / bsent / breturn / tsent / treturn / csales
type: Date / Char / Char / Char / Char / Numeric / Numeric / Numeric / Numeric / Numeric / Numeric
I want to calculate the subtotals. For example, I want to calculate the sums at each change in yname, and add subtotal to all numerical variables. There are 160 distinct ynames, so the resulting table should tell me the subtotal of each yname. I haven't sorted the data yet, but this is not a problem because I can sort the data in whatever way I want. Below is a snippet from my data:
date mcode mname ycode yname yissue bsent breturn tsent treturn csales
417572 2010-07-28 45740 ENDPOINT A 5772 XMAG 20100800 7 0 7 0 0
417573 2010-07-31 45740 ENDPOINT A 5772 XMAG 20100800 0 0 0 0 1
417574 2010-08-04 45740 ENDPOINT A 5772 XMAG 20100800 0 0 0 0 1
417575 2010-08-14 45740 ENDPOINT A 5772 XMAG 20100800 0 0 0 0 1
417576 2010-08-26 45740 ENDPOINT A 5772 XMAG 20100800 0 4 0 0 0
417577 2010-07-28 45741 ENDPOINT L 5772 XMAG 20100800 2 0 2 0 0
417578 2010-08-04 45741 ENDPOINT L 5772 XMAG 20100800 2 0 2 0 0
417579 2010-08-26 45741 ENDPOINT L 5772 XMAG 20100800 0 4 0 0 0
4175开发者_如何学运维80 2010-07-28 46390 ENDPOINT R 5772 XMAG 20100800 3 0 3 0 1
417581 2010-07-29 46390 ENDPOINT R 5772 XMAG 20100800 0 0 0 0 2
417582 2010-08-01 46390 ENDPOINT R 5779 YMAG 20100800 3 0 3 0 0
417583 2010-08-11 46390 ENDPOINT R 5779 YMAG 20100800 0 0 0 0 1
417584 2010-08-20 46390 ENDPOINT R 5779 YMAG 20100800 0 0 0 0 1
417585 2010-08-24 46390 ENDPOINT R 5779 YMAG 20100800 2 0 2 0 1
417586 2010-08-26 46390 ENDPOINT R 5779 YMAG 20100800 0 2 0 2 0
417587 2010-07-28 46411 ENDPOINT D 5779 YMAG 20100800 6 0 6 0 0
417588 2010-08-08 46411 ENDPOINT D 5779 YMAG 20100800 0 0 0 0 1
417589 2010-08-11 46411 ENDPOINT D 5779 YMAG 20100800 0 0 0 0 1
417590 2010-08-26 46411 ENDPOINT D 5779 YMAG 20100800 0 4 0 4 0
What function should I use here? Maybe something like SQL group by
?
OK. Assuming your data are in a data frame named foo
:
> head(foo)
date mcode mname ycode yname yissue bsent breturn tsent
417572 2010/07/28 45740 ENDPOINT A 5772 XMAG 20100800 7 0 7
417573 2010/07/31 45740 ENDPOINT A 5772 XMAG 20100800 0 0 0
417574 2010/08/04 45740 ENDPOINT A 5772 XMAG 20100800 0 0 0
417575 2010/08/14 45740 ENDPOINT A 5772 XMAG 20100800 0 0 0
417576 2010/08/26 45740 ENDPOINT A 5772 XMAG 20100800 0 4 0
417577 2010/07/28 45741 ENDPOINT L 5772 XMAG 20100800 2 0 2
treturn csales
417572 0 0
417573 0 1
417574 0 1
417575 0 1
417576 0 0
417577 0 0
Then this will do the aggregation of the numeric columns in your data:
> aggregate(cbind(bsent, breturn, tsent, treturn, csales) ~ yname, data = foo,
+ FUN = sum)
yname bsent breturn tsent treturn csales
1 XMAG 14 8 14 0 6
2 YMAG 11 6 11 6 5
That was using the snippet of data you included in your Q. I used the formula interface to aggregate()
, which is a bit nicer in this instance because you don't need all the foo$
bits on the variable names you wish the aggregate. If you have missing data (NA
)in your full data set, then you'll need add an extra argument na.rm = TRUE
which will get passed to sum()
, like so:
> aggregate(cbind(bsent, breturn, tsent, treturn, csales) ~ yname, data = foo,
+ FUN = sum, na.rm = TRUE)
Or the plyr
library, which is easily extensible to other data classes:
> library(plyr)
> result.2 <- ddply(df$a, .(df$b), sum)
> result.2
df.b V1
1 down 30
2 up 25
You can also use xtabs
or tapply
:
xtabs(cbind(bsent, breturn, tsent, treturn, csales) ~ yname, data)
tapply(data$bsent, data$yname, sum)
if your data is large and speed matters, i would recommend using the R function rowsum, which is a lot faster. i applied the 3 methods (f1 = aggregate, f2 = ddply, f3 = tapply) suggested in the answers to compare it with f4 = rowsum and here is what i find:
test replications elapsed relative
4 f4() 100 0.033 1.00
3 f3() 100 0.046 1.39
1 f1() 100 0.165 5.00
2 f2() 100 0.605 18.33
i have added my code below if someone wants to explore in more detail.
library(plyr);
library(rbenchmark);
val = rnorm(50);
name = rep(letters[1:5], each = 10);
data = data.frame(val, name);
f1 = function(){aggregate(data$val, by=list(data$name), FUN=sum)}
f2 = function(){ddply(data, .(name), summarise, sum = sum(val))}
f3 = function(){tapply(data$val, data$name, sum)}
f4 = function(){rowsum(x = data$val, group = data$name)}
benchmark(f1(), f2(), f3(), f4(),
columns=c("test", "replications", "elapsed", "relative"),
order="relative", replications=100)
Google wasn't super helpful when I tried to find an answer to a similar question. I thought I would share my solution below using the library(janitor)
package with split()
, and purrr::map_df()
.
My use case was to run a script that would grab CC expenses from many different people to be reviewed by a person.
library(janitor)
library(purrr)
library(dplyr)
mtcars %>%
split(.[,"cyl"]) %>% ## splits each change in cyl into a list of dataframes
map_df(., janitor::adorn_totals)
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
26 4 120.3 91 4.43 2.140 16.70 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Total 44 1156.5 909 44.78 25.143 210.51 10 8 45 17
21 6 160.0 110 3.90 2.620 16.46 0 1 4 4
21 6 160.0 110 3.90 2.875 17.02 0 1 4 4
21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Total 42 1283.2 856 25.10 21.820 125.84 4 3 27 24
18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
15 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Total 112 4943.4 2929 45.21 55.989 234.81 0 2 46 49
# if you're sending the output to be reviewed by a person, add a row!
mtcars %>%
split(.[,"cyl"]) %>%
map_df(., ~janitor::adorn_totals(.x) %>%
dplyr::add_row()) %>%
write.csv(., "demo.csv")
You can use aggregate
For instance, say that you have
val = rnorm(50)
name = rep(letters[1:5], each=10)
data <- data.frame(val, name)
Then you can do
aggregate(data$val, by=list(data$name), FUN=sum)
There is a R package called sqldf that enables you to use SQL commands on R data.frames. Besides like you already said, GROUP BY would be nice. You can easily store your data in a local MySQL database and connect to R using the package RMySQL (You can use most other DBMS too but MySQL is the easiest to set up).
As far as I can judge it plyr is a great package, too. But from the way you ask and compare your problem to GROUP BY, I guess you know something about SQL, so using this might be easier for you. There are comfortable functions like dbReadTable, plus if your data grows bigger you can select only subparts of your data to only run your analysis with what you really need.
精彩评论