How to merge two data frames on common columns in R with sum of others?
R Version 2.11.1 32-bit on Windows 7
I have two data sets, data_A and data_B:
data_A
USER_A USER_B ACTION
1 11 0.3
1 13 0.25
1 16 0.63
1 17 0.26
2 11 0.14
2 14 0.28
data_B
USER_A USER_B ACTION
1 13 0.17
1 14 0.27
2 11 0.25
Now I want to add the ACTION of data_B to the ACTION of data_A wherever their USER_A and USER_B are equal. For the example above, the result would be:
data_A
USER_A USER_B ACTION
1 11 0.3
1 13 0.25+0.17
1 16 0.63
1 17 0.26
2 11 0.14+0.25
2 14 0.28
How can I achieve this?
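You can use ddply from the plyr package and combine it with merge:
library(plyr)
# Join on the two key columns only, so the ACTION columns stay separate
merged <- merge(data_A, data_B, by=c("USER_A", "USER_B"), all.x=TRUE)
ddply(merged, .(USER_A, USER_B), summarise,
ACTION=sum(ACTION.x, ACTION.y, na.rm=TRUE))
Notice that merge is called with by=c("USER_A", "USER_B"), so that ACTION is not used as a join key, and with the parameter all.x=TRUE, which keeps every row of the first data.frame passed to merge, i.e. data_A. Rows without a match in data_B get NA in ACTION.y, which na.rm=TRUE ignores in the sum:
USER_A USER_B ACTION
1 1 11 0.30
2 1 13 0.42
3 1 16 0.63
4 1 17 0.26
5 2 11 0.39
6 2 14 0.28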
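As a quick check of how merge chooses its join columns, here is a small base R sketch. The data frames are rebuilt by hand from the tables in the question, and the names default_join and keyed_join are purely illustrative. When by is omitted, merge joins on every column name the two frames share, which here includes ACTION.

```r
# Data frames rebuilt from the tables in the question
data_A <- data.frame(
  USER_A = c(1, 1, 1, 1, 2, 2),
  USER_B = c(11, 13, 16, 17, 11, 14),
  ACTION = c(0.30, 0.25, 0.63, 0.26, 0.14, 0.28)
)
data_B <- data.frame(
  USER_A = c(1, 1, 2),
  USER_B = c(13, 14, 11),
  ACTION = c(0.17, 0.27, 0.25)
)

# Without `by`, merge joins on all shared columns: USER_A, USER_B and ACTION
default_join <- merge(data_A, data_B, all.x = TRUE)
names(default_join)

# With explicit keys, the two ACTION columns are kept apart
# as ACTION.x (from data_A) and ACTION.y (from data_B)
keyed_join <- merge(data_A, data_B, by = c("USER_A", "USER_B"), all.x = TRUE)
names(keyed_join)
```

Because no ACTION value in data_B equals the ACTION value of the same user pair in data_A, the default join finds no matches in this example and there is nothing left to sum afterwards; naming the key columns explicitly avoids that.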
This sort of thing is quite easy to do with a database-like operation. Here I use the sqldf package to do a left (outer) join and then summarise the resulting object:
require(sqldf)
tmp <- sqldf("select * from data_A left join data_B using (USER_A, USER_B)")
This results in:
> tmp
USER_A USER_B ACTION ACTION
1 1 11 0.30 NA
2 1 13 0.25 0.17
3 1 16 0.63 NA
4 1 17 0.26 NA
5 2 11 0.14 0.25
6 2 14 0.28 NA
Now we just need to sum the two ACTION columns:
data_C <- transform(data_A, ACTION = rowSums(tmp[, 3:4], na.rm = TRUE))
Which gives the desired result:
> data_C
USER_A USER_B ACTION
1 1 11 0.30
2 1 13 0.42
3 1 16 0.63
4 1 17 0.26
5 2 11 0.39
6 2 14 0.28
This can also be done using the standard R function merge:
> merge(data_A, data_B, by = c("USER_A","USER_B"), all.x = TRUE)
USER_A USER_B ACTION.x ACTION.y
1 1 11 0.30 NA
2 1 13 0.25 0.17
3 1 16 0.63 NA
4 1 17 0.26 NA
5 2 11 0.14 0.25
6 2 14 0.28 NA
So we can replace the sqldf() call above with:
tmp <- merge(data_A, data_B, by = c("USER_A","USER_B"), all.x = TRUE)
whilst the second line using transform() remains the same.
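Putting the base R pieces together, a minimal self-contained sketch might look like the following. The data frames are rebuilt by hand from the question, and the summed column is attached to the merged result rather than to data_A. That is a small deviation from the transform() line above: merge sorts its result on the key columns by default, so the rows only line up with data_A when data_A is already sorted that way, as it happens to be here.

```r
# Data frames rebuilt from the tables in the question
data_A <- data.frame(
  USER_A = c(1, 1, 1, 1, 2, 2),
  USER_B = c(11, 13, 16, 17, 11, 14),
  ACTION = c(0.30, 0.25, 0.63, 0.26, 0.14, 0.28)
)
data_B <- data.frame(
  USER_A = c(1, 1, 2),
  USER_B = c(13, 14, 11),
  ACTION = c(0.17, 0.27, 0.25)
)

# Left join on the two key columns; rows of data_A without a match
# in data_B get NA in ACTION.y
tmp <- merge(data_A, data_B, by = c("USER_A", "USER_B"), all.x = TRUE)

# Sum the two ACTION columns row by row, ignoring the NAs
data_C <- data.frame(
  tmp[c("USER_A", "USER_B")],
  ACTION = rowSums(tmp[c("ACTION.x", "ACTION.y")], na.rm = TRUE)
)
data_C
```

Selecting the ACTION columns by name instead of by position (3:4) also keeps the code working if further columns are added to either data frame.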
We can use {powerjoin}:
library(powerjoin)
power_left_join(
data_A, data_B, by = c("USER_A", "USER_B"),
conflict = ~ .x + ifelse(is.na(.y), 0, .y)
)
#> USER_A USER_B ACTION
#> 1 1 11 0.30
#> 2 1 13 0.42
#> 3 1 16 0.63
#> 4 1 17 0.26
#> 5 2 11 0.39
#> 6 2 14 0.28
In case of conflict, the function passed to the conflict argument is applied to each pair of conflicting columns.
We can also use sum(.x, .y, na.rm = TRUE) row-wise for the same effect:
power_left_join(data_A,data_B, by = c("USER_A", "USER_B"),
conflict = rw ~ sum(.x, .y, na.rm = TRUE))