Compare two data.frames to find the rows in data.frame 1 that are not present in data.frame 2
I have the following 2 data.frames:
a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])
I want to find the row a1 has that a2 doesn't.
Is there a built in function for this type of operation?
(p.s: I did write a solution for it, I am simply curious if someone already made a more crafted code)
Here is my solution:
a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])
rows.in.a1.that.are.not.in.a2 <- function(a1,a2)
{
开发者_如何学运维a1.vec <- apply(a1, 1, paste, collapse = "")
a2.vec <- apply(a2, 1, paste, collapse = "")
a1.without.a2.rows <- a1[!a1.vec %in% a2.vec,]
return(a1.without.a2.rows)
}
rows.in.a1.that.are.not.in.a2(a1,a2)
sqldf
provides a nice solution
a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])
require(sqldf)
a1NotIna2 <- sqldf('SELECT * FROM a1 EXCEPT SELECT * FROM a2')
And the rows which are in both data frames:
a1Ina2 <- sqldf('SELECT * FROM a1 INTERSECT SELECT * FROM a2')
The new version of dplyr
has a function, anti_join
, for exactly these kinds of comparisons
require(dplyr)
anti_join(a1,a2)
And semi_join
to filter rows in a1
that are also in a2
semi_join(a1,a2)
In dplyr:
setdiff(a1,a2)
Basically, setdiff(bigFrame, smallFrame)
gets you the extra records in the first table.
In the SQLverse this is called a
For good descriptions of all join options and set subjects, this is one of the best summaries I've seen put together to date: http://www.vertabelo.com/blog/technical-articles/sql-joins
But back to this question - here are the results for the setdiff()
code when using the OP's data:
> a1
a b
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
> a2
a b
1 1 a
2 2 b
3 3 c
> setdiff(a1,a2)
a b
1 4 d
2 5 e
Or even anti_join(a1,a2)
will get you the same results.
For more info: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
This doesn't answer your question directly, but it will give you the elements that are in common. This can be done with Paul Murrell's package compare
:
library(compare)
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])
comparison <- compare(a1,a2,allowAll=TRUE)
comparison$tM
# a b
#1 1 a
#2 2 b
#3 3 c
The function compare
gives you a lot of flexibility in terms of what kind of comparisons are allowed (e.g. changing order of elements of each vector, changing order and names of variables, shortening variables, changing case of strings). From this, you should be able to figure out what was missing from one or the other. For example (this is not very elegant):
difference <-
data.frame(lapply(1:ncol(a1),function(i)setdiff(a1[,i],comparison$tM[,i])))
colnames(difference) <- colnames(a1)
difference
# a b
#1 4 d
#2 5 e
It is certainly not efficient for this particular purpose, but what I often do in these situations is to insert indicator variables in each data.frame and then merge:
a1$included_a1 <- TRUE
a2$included_a2 <- TRUE
res <- merge(a1, a2, all=TRUE)
missing values in included_a1 will note which rows are missing in a1. similarly for a2.
One problem with your solution is that the column orders must match. Another problem is that it is easy to imagine situations where the rows are coded as the same when in fact are different. The advantage of using merge is that you get for free all error checking that is necessary for a good solution.
I wrote a package (https://github.com/alexsanjoseph/compareDF) since I had the same issue.
> df1 <- data.frame(a = 1:5, b=letters[1:5], row = 1:5)
> df2 <- data.frame(a = 1:3, b=letters[1:3], row = 1:3)
> df_compare = compare_df(df1, df2, "row")
> df_compare$comparison_df
row chng_type a b
1 4 + 4 d
2 5 + 5 e
A more complicated example:
library(compareDF)
df1 = data.frame(id1 = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
"Hornet 4 Drive", "Duster 360", "Merc 240D"),
id2 = c("Maz", "Maz", "Dat", "Hor", "Dus", "Mer"),
hp = c(110, 110, 181, 110, 245, 62),
cyl = c(6, 6, 4, 6, 8, 4),
qsec = c(16.46, 17.02, 33.00, 19.44, 15.84, 20.00))
df2 = data.frame(id1 = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
"Hornet 4 Drive", " Hornet Sportabout", "Valiant"),
id2 = c("Maz", "Maz", "Dat", "Hor", "Dus", "Val"),
hp = c(110, 110, 93, 110, 175, 105),
cyl = c(6, 6, 4, 6, 8, 6),
qsec = c(16.46, 17.02, 18.61, 19.44, 17.02, 20.22))
> df_compare$comparison_df
grp chng_type id1 id2 hp cyl qsec
1 1 - Hornet Sportabout Dus 175 8 17.02
2 2 + Datsun 710 Dat 181 4 33.00
3 2 - Datsun 710 Dat 93 4 18.61
4 3 + Duster 360 Dus 245 8 15.84
5 7 + Merc 240D Mer 62 4 20.00
6 8 - Valiant Val 105 6 20.22
The package also has an html_output command for quick checking
df_compare$html_output
You could use the daff
package (which wraps the daff.js
library using the V8
package):
library(daff)
diff_data(data_ref = a2,
data = a1)
produces the following difference object:
Daff Comparison: ‘a2’ vs. ‘a1’
First 6 and last 6 patch lines:
@@ a b
1 ... ... ...
2 3 c
3 +++ 4 d
4 +++ 5 e
5 ... ... ...
6 ... ... ...
7 3 c
8 +++ 4 d
9 +++ 5 e
The tabular diff format is described here and should be pretty self-explanatory. The lines with +++
in the first column @@
are the ones which are new in a1
and not present in a2
.
The difference object can be used to patch_data()
, to store the difference for documentation purposes using write_diff()
or to visualize the difference using render_diff()
:
render_diff(
diff_data(data_ref = a2,
data = a1)
)
generates a neat HTML output:
Using diffobj
package:
library(diffobj)
diffPrint(a1, a2)
diffObj(a1, a2)
I adapted the merge
function to get this functionality. On larger dataframes it uses less memory than the full merge solution. And I can play with the names of the key columns.
Another solution is to use the library prob
.
# Derived from src/library/base/R/merge.R
# Part of the R package, http://www.R-project.org
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# A copy of the GNU General Public License is available at
# http://www.r-project.org/Licenses/
XinY <-
function(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by,
notin = FALSE, incomparables = NULL,
...)
{
fix.by <- function(by, df)
{
## fix up 'by' to be a valid set of cols by number: 0 is row.names
if(is.null(by)) by <- numeric(0L)
by <- as.vector(by)
nc <- ncol(df)
if(is.character(by))
by <- match(by, c("row.names", names(df))) - 1L
else if(is.numeric(by)) {
if(any(by < 0L) || any(by > nc))
stop("'by' must match numbers of columns")
} else if(is.logical(by)) {
if(length(by) != nc) stop("'by' must match number of columns")
by <- seq_along(by)[by]
} else stop("'by' must specify column(s) as numbers, names or logical")
if(any(is.na(by))) stop("'by' must specify valid column(s)")
unique(by)
}
nx <- nrow(x <- as.data.frame(x)); ny <- nrow(y <- as.data.frame(y))
by.x <- fix.by(by.x, x)
by.y <- fix.by(by.y, y)
if((l.b <- length(by.x)) != length(by.y))
stop("'by.x' and 'by.y' specify different numbers of columns")
if(l.b == 0L) {
## was: stop("no columns to match on")
## returns x
x
}
else {
if(any(by.x == 0L)) {
x <- cbind(Row.names = I(row.names(x)), x)
by.x <- by.x + 1L
}
if(any(by.y == 0L)) {
y <- cbind(Row.names = I(row.names(y)), y)
by.y <- by.y + 1L
}
## create keys from 'by' columns:
if(l.b == 1L) { # (be faster)
bx <- x[, by.x]; if(is.factor(bx)) bx <- as.character(bx)
by <- y[, by.y]; if(is.factor(by)) by <- as.character(by)
} else {
## Do these together for consistency in as.character.
## Use same set of names.
bx <- x[, by.x, drop=FALSE]; by <- y[, by.y, drop=FALSE]
names(bx) <- names(by) <- paste("V", seq_len(ncol(bx)), sep="")
bz <- do.call("paste", c(rbind(bx, by), sep = "\r"))
bx <- bz[seq_len(nx)]
by <- bz[nx + seq_len(ny)]
}
comm <- match(bx, by, 0L)
if (notin) {
res <- x[comm == 0,]
} else {
res <- x[comm > 0,]
}
}
## avoid a copy
## row.names(res) <- NULL
attr(res, "row.names") <- .set_row_names(nrow(res))
res
}
XnotinY <-
function(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by,
notin = TRUE, incomparables = NULL,
...)
{
XinY(x,y,by,by.x,by.y,notin,incomparables)
}
Your example data does not have any duplicates, but your solution handle them automatically. This means that potentially some of the answers won't match to results of your function in case of duplicates.
Here is my solution which address duplicates the same way as yours. It also scales great!
a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])
rows.in.a1.that.are.not.in.a2 <- function(a1,a2)
{
a1.vec <- apply(a1, 1, paste, collapse = "")
a2.vec <- apply(a2, 1, paste, collapse = "")
a1.without.a2.rows <- a1[!a1.vec %in% a2.vec,]
return(a1.without.a2.rows)
}
library(data.table)
setDT(a1)
setDT(a2)
# no duplicates - as in example code
r <- fsetdiff(a1, a2)
all.equal(r, rows.in.a1.that.are.not.in.a2(a1,a2))
#[1] TRUE
# handling duplicates - make some duplicates
a1 <- rbind(a1, a1, a1)
a2 <- rbind(a2, a2, a2)
r <- fsetdiff(a1, a2, all = TRUE)
all.equal(r, rows.in.a1.that.are.not.in.a2(a1,a2))
#[1] TRUE
It needs data.table 1.9.8+
Maybe it is too simplistic, but I used this solution and I find it very useful when I have a primary key that I can use to compare data sets. Hope it can help.
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])
different.names <- (!a1$a %in% a2$a)
not.in.a2 <- a1[different.names,]
Using subset
:
missing<-subset(a1, !(a %in% a2$a))
Yet another solution based on match_df in plyr. Here's plyr's match_df:
match_df <- function (x, y, on = NULL)
{
if (is.null(on)) {
on <- intersect(names(x), names(y))
message("Matching on: ", paste(on, collapse = ", "))
}
keys <- join.keys(x, y, on)
x[keys$x %in% keys$y, , drop = FALSE]
}
We can modify it to negate:
library(plyr)
negate_match_df <- function (x, y, on = NULL)
{
if (is.null(on)) {
on <- intersect(names(x), names(y))
message("Matching on: ", paste(on, collapse = ", "))
}
keys <- join.keys(x, y, on)
x[!(keys$x %in% keys$y), , drop = FALSE]
}
Then:
diff <- negate_match_df(a1,a2)
The following code uses both data.table
and fastmatch
for increased speed.
library("data.table")
library("fastmatch")
a1 <- setDT(data.frame(a = 1:5, b=letters[1:5]))
a2 <- setDT(data.frame(a = 1:3, b=letters[1:3]))
compare_rows <- a1$a %fin% a2$a
# the %fin% function comes from the `fastmatch` package
added_rows <- a1[which(compare_rows == FALSE)]
added_rows
# a b
# 1: 4 d
# 2: 5 e
Really fast comparison, to get count of differences. Using specific column name.
colname = "CreatedDate" # specify column name
index <- match(colname, names(source_df)) # get index name for column name
sel <- source_df[, index] == target_df[, index] # get differences, gives you dataframe with TRUE and FALSE values
table(sel)["FALSE"] # count of differences
table(sel)["TRUE"] # count of matches
For complete dataframe, do not provide column or index name
sel <- source_df[, ] == target_df[, ] # gives you dataframe with TRUE and FALSE values
table(sel)["FALSE"] # count of differences
table(sel)["TRUE"] # count of matches
精彩评论