Find duplicated rows (based on 2 columns) in Data Frame in R
I have a data frame in R which looks like:
| RIC | Dat开发者_运维知识库e | Open |
|--------|---------------------|--------|
| S1A.PA | 2011-06-30 20:00:00 | 23.7 |
| ABC.PA | 2011-07-03 20:00:00 | 24.31 |
| EFG.PA | 2011-07-04 20:00:00 | 24.495 |
| S1A.PA | 2011-07-05 20:00:00 | 24.23 |
I want to know if there's any duplicates regarding to the combination of RIC and Date. Is there a function for that in R?
You can always try simply passing those first two columns to the function duplicated
:
duplicated(dat[,1:2])
assuming your data frame is called dat
. For more information, we can consult the help files for the duplicated
function by typing ?duplicated
at the console. This will provide the following sentences:
Determines which elements of a vector or data frame are duplicates of elements with smaller subscripts, and returns a logical vector indicating which elements (rows) are duplicates.
So duplicated
returns a logical vector, which we can then use to extract a subset of dat
:
ind <- duplicated(dat[,1:2])
dat[ind,]
or you can skip the separate assignment step and simply use:
dat[duplicated(dat[,1:2]),]
dplyr is so much nicer for this sort of thing:
library(dplyr)
yourDataFrame %>%
distinct(RIC, Date, .keep_all = TRUE)
(the ".keep_all is optional. if not used, it will return only the deduped 2 columns. when used, it returns the deduped whole data frame)
Here's a dplyr
option for tagging duplicates based on two (or more) columns. In this case ric
and date
:
df <- data_frame(ric = c('S1A.PA', 'ABC.PA', 'EFG.PA', 'S1A.PA', 'ABC.PA', 'EFG.PA'),
date = c('2011-06-30 20:00:00', '2011-07-03 20:00:00', '2011-07-04 20:00:00', '2011-07-05 20:00:00', '2011-07-03 20:00:00', '2011-07-04 20:00:00'),
open = c(23.7, 24.31, 24.495, 24.23, 24.31, 24.495))
df %>%
group_by(ric, date) %>%
mutate(dupe = n()>1)
# A tibble: 6 x 4
# Groups: ric, date [4]
ric date open dupe
<chr> <chr> <dbl> <lgl>
1 S1A.PA 2011-06-30 20:00:00 23.7 FALSE
2 ABC.PA 2011-07-03 20:00:00 24.3 TRUE
3 EFG.PA 2011-07-04 20:00:00 24.5 TRUE
4 S1A.PA 2011-07-05 20:00:00 24.2 FALSE
5 ABC.PA 2011-07-03 20:00:00 24.3 TRUE
6 EFG.PA 2011-07-04 20:00:00 24.5 TRUE
Easy way to get the information you want is to use dplyr
.
yourDF %>%
group_by(RIC, Date) %>%
mutate(num_dups = n(),
dup_id = row_number()) %>%
ungroup() %>%
mutate(is_duplicated = dup_id > 1)
Using this:
num_dups
tells you how many times that particular combo is duplicateddup_id
tells you which duplicate number that particular row is (e.g. 1st, 2nd, or 3rd, etc)is_duplicated
gives you an easy condition you can filter on later to remove all the duplicate rows (e.g.filter(!is_duplicated)
), though you could also usedup_id
for this (e.g.filter(dup_id == 1)
)
If you want to remove duplicate records based on values of Columns Date and State in dataset data.frame:
#Indexes of the duplicate rows that will be removed:
duplicate_indexes <- which(duplicated(dataset[c('Date', 'State')]),)
duplicate_indexes
#new_uniq will contain unique dataset without the duplicates.
new_uniq <- dataset[!duplicated(dataset[c('Date', 'State')]),]
View(new_uniq)
I think what you're looking for is a way to return a data frame of the duplicated rows in the same format as your original data. There is probably a more elegant way to do this but this works:
dup <- data.frame(as.numeric(duplicated(df$var))) #creates df with binary var for duplicated rows
colnames(dup) <- c("dup") #renames column for simplicity
df2 <- cbind(df, dup) #bind to original df
df3 <- subset(df2, dup == 1) #subsets df using binary var for duplicated`
Found quite a masterful idea posted by Steve Lianouglou that helps solve this problem with the great advantage of indexing the repetitions:
If you generate a hash
column concatenating both your columns for which you want to check duplicates, you can then use dplyr::n()
together with seq
to give an index to each duplicate occurrence as follows
dat %>% mutate(hash = str_c(RIC, Date)) %>%
group_by(hash) %>%
mutate(duplication_id = seq(n()) %>%
ungroup ()
Your column duplication_id
tells you how many identical rows (same row values for both columns) are there in your table above the one indexed. I used this to remove duplicate Ids.
精彩评论