开发者

Identify records in data frame A not contained in data frame B [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help cl开发者_StackOverflow中文版arifying this question so that it can be reopened, visit the help center. Closed 11 years ago.

This is my first time posting here, so please be kind ;-)

EDIT My question was closed before I had a chance to make the changes suggested to me. So I'm trying to do a better job now, thanks for everyone that answered so far!

QUESTION

How can I identify records/rows in data frame x.1 that are not contained in data frame x.2 based on all attributes available (i.e. all columns) in the most efficient way?

EXAMPLE DATA

> x.1 <- data.frame(a=c(1,2,3,4,5), b=c(1,2,3,4,5))
> x.1
  a b
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5

> x.2 <- data.frame(a=c(1,1,2,3,4), b=c(1,1,99,3,4))
> x.2
  a  b
1 1  1
2 1  1
3 2 99
4 3  3
5 4  4

DESIRED RESULT

  a b
2 2 2
5 5 5

BEST SOLUTION SO FAR

by Prof. Brian Ripley and Gabor Grothendieck

> fun.12 <- function(x.1,x.2,...){
+     x.1p <- do.call("paste", x.1)
+     x.2p <- do.call("paste", x.2)
+     x.1[! x.1p %in% x.2p, ]
+ }
> fun.12(x.1,x.2)
  a b
2 2 2
5 5 5
> sol.12 <- microbenchmark(fun.12(x.1,x.2))
> sol.12 <- median(sol.12$time)/1000000000
> sol.12
> [1] 0.000207784

A collection of all solutions tested so far is available at my blog

FINAL EDIT 2011-10-14

Here's the best solution wrapped into a function 'mergeX()':

setGeneric(
    name="mergeX",
    signature=c("src.1", "src.2"),
    def=function(
        src.1,
        src.2,
        ...
    ){
    standardGeneric("mergeX")    
    }
)

setMethod(
    f="mergeX", 
    signature=signature(src.1="data.frame", src.2="data.frame"), 
    definition=function(
        src.1,
        src.2,
        do.inverse=FALSE,
        ...
    ){
    if(!do.inverse){
        out <- merge(x=src.1, y=src.2, ...)
    } else {
        if("by.y" %in% names(list(...))){
            src.2.0 <- src.2
            src.2 <- src.1
            src.1 <- src.2.0
        }
        src.1p <- do.call("paste", src.1)
        src.2p <- do.call("paste", src.2)
        out <- src.1[! src.1p %in% src.2p, ]
    }
    return(out)    
    }
)


Here are a few ways. #1 and #4 assume that the rows of x.1 are unique. (If rows of x.1 are not unique then they will return only one of the duplicates among the duplicated rows.) The others return all duplicates:

# 1
x.1[!duplicated(rbind(x.2, x.1))[-(1:nrow(x.2))],]

# 2
do.call("rbind", setdiff(split(x.1, rownames(x.1)), split(x.2, rownames(x.2))))

# 3
x.1p <- do.call("paste", x.1)
x.2p <- do.call("paste", x.2)
x.1[! x.1p %in% x.2p, ]

# 4
library(sqldf)
sqldf("select * from `x.1` except select * from `x.2`")

EDIT: x.1 and x.2 were swapped and this has been fixed. Also have corrected note on limitations at the beginning.


What about using merge - the simplest possible solution - I'd think it's also the fastest.

tmp = merge(x.1, cbind(x.2, myid = 1:nrow(x.2)), all.x = TRUE)
    # provided that there's no column myid in both dataframes
tmp[is.na(tmp$myid), 1:ncol(x.1)] # the result

Corresponds to:

select x1.* 
from x1 natural left join x2 
where x2.myid is NULL

(you can also use sqldf to do that).

Note that the column myid is added to assure that there is some column w/o NA values. If you are sure there is already some column which doesn't contain NULL values, you can simplify the solution:

tmp = merge(x.1, x.2, all.x = TRUE)
    # provided that there's no column myid in both dataframes
tmp[is.na(tmp$some_column), 1:ncol(x.1)] # the result
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜