R equivalent of SELECT DISTINCT on two or more fields/variables
Say I have a dataframe df with two or more columns, is there an easy way to use unique()
or other R function to create a subset of unique combinations of two or more columns?
I know I can use sqldf()
and write an easy "SELECT DISTINCT var1, var2, ... va开发者_运维技巧rN"
query, but I am looking for an R way of doing this.
It occurred to me to try ftable coerced to a dataframe and use the field names, but I also get the cross tabulations of combinations that don't exist in the dataset:
uniques <- as.data.frame(ftable(df$var1, df$var2))
unique
works on data.frame
so unique(df[c("var1","var2")])
should be what you want.
Another option is distinct
from dplyr
package:
df %>% distinct(var1, var2) # or distinct(df, var1, var2)
Note:
For older versions of dplyr (< 0.5.0, 2016-06-24) distinct
required additional step
df %>% select(var1, var2) %>% distinct
(or oldish way distinct(select(df, var1, var2))
).
@Marek's answer is obviously correct, but may be outdated. The current dplyr
version (0.7.4) allows for an even simpler code:
Simply use:
df %>% distinct(var1, var2)
If you want to keep all columns, add
df %>% distinct(var1, var2, .keep_all = TRUE)
To KEEP all other variables in df use this:
unique_rows <- !duplicated(df[c("var1","var2")])
unique.df <- df[unique_rows,]
Another less recommended method is using row.names() #(see David's comment below):
unique_rows <- row.names(unique(df[c("var1","var2")]))
unique.df <- df[unique_rows,]
In addition to answers above, the data.table version:
setDT(df)
unique_dt = unique(df, by = c('var1', 'var2'))
精彩评论