Identify records in data frame A not contained in data frame B [closed]
This is my first time posting here, so please be kind ;-)
EDIT My question was closed before I had a chance to make the changes suggested to me. So I'm trying to do a better job now, thanks for everyone that answered so far!
QUESTION
How can I identify records/rows in data frame x.1
that are not contained in data frame x.2
based on all attributes available (i.e. all columns) in the most efficient way?
EXAMPLE DATA
> x.1 <- data.frame(a=c(1,2,3,4,5), b=c(1,2,3,4,5))
> x.1
a b
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
> x.2 <- data.frame(a=c(1,1,2,3,4), b=c(1,1,99,3,4))
> x.2
a b
1 1 1
2 1 1
3 2 99
4 3 3
5 4 4
DESIRED RESULT
a b
2 2 2
5 5 5
BEST SOLUTION SO FAR
by Prof. Brian Ripley and Gabor Grothendieck
> fun.12 <- function(x.1,x.2,...){
+ x.1p <- do.call("paste", x.1)
+ x.2p <- do.call("paste", x.2)
+ x.1[! x.1p %in% x.2p, ]
+ }
> fun.12(x.1,x.2)
a b
2 2 2
5 5 5
> sol.12 <- microbenchmark(fun.12(x.1,x.2))
> sol.12 <- median(sol.12$time)/1000000000
> sol.12
> [1] 0.000207784
A collection of all solutions tested so far is available at my blog
FINAL EDIT 2011-10-14
Here's the best solution wrapped into a function 'mergeX()':
setGeneric(
name="mergeX",
signature=c("src.1", "src.2"),
def=function(
src.1,
src.2,
...
){
standardGeneric("mergeX")
}
)
setMethod(
f="mergeX",
signature=signature(src.1="data.frame", src.2="data.frame"),
definition=function(
src.1,
src.2,
do.inverse=FALSE,
...
){
if(!do.inverse){
out <- merge(x=src.1, y=src.2, ...)
} else {
if("by.y" %in% names(list(...))){
src.2.0 <- src.2
src.2 <- src.1
src.1 <- src.2.0
}
src.1p <- do.call("paste", src.1)
src.2p <- do.call("paste", src.2)
out <- src.1[! src.1p %in% src.2p, ]
}
return(out)
}
)
Here are a few ways. #1 and #4 assume that the rows of x.1
are unique. (If rows of x.1
are not unique then they will return only one of the duplicates among the duplicated rows.) The others return all duplicates:
# 1
x.1[!duplicated(rbind(x.2, x.1))[-(1:nrow(x.2))],]
# 2
do.call("rbind", setdiff(split(x.1, rownames(x.1)), split(x.2, rownames(x.2))))
# 3
x.1p <- do.call("paste", x.1)
x.2p <- do.call("paste", x.2)
x.1[! x.1p %in% x.2p, ]
# 4
library(sqldf)
sqldf("select * from `x.1` except select * from `x.2`")
EDIT: x.1 and x.2 were swapped and this has been fixed. Also have corrected note on limitations at the beginning.
What about using merge
- the simplest possible solution - I'd think it's also the fastest.
tmp = merge(x.1, cbind(x.2, myid = 1:nrow(x.2)), all.x = TRUE)
# provided that there's no column myid in both dataframes
tmp[is.na(tmp$myid), 1:ncol(x.1)] # the result
Corresponds to:
select x1.*
from x1 natural left join x2
where x2.myid is NULL
(you can also use sqldf to do that).
Note that the column myid
is added to assure that there is some column w/o NA values. If you are sure there is already some column which doesn't contain NULL values, you can simplify the solution:
tmp = merge(x.1, x.2, all.x = TRUE)
# provided that there's no column myid in both dataframes
tmp[is.na(tmp$some_column), 1:ncol(x.1)] # the result
精彩评论