开发者

R equivalent of SELECT DISTINCT on two or more fields/variables

Say I have a dataframe df with two or more columns, is there an easy way to use unique() or other R function to create a subset of unique combinations of two or more columns?

I know I can use sqldf() and write an easy "SELECT DISTINCT var1, var2, ... va开发者_运维技巧rN" query, but I am looking for an R way of doing this.

It occurred to me to try ftable coerced to a dataframe and use the field names, but I also get the cross tabulations of combinations that don't exist in the dataset:

uniques <- as.data.frame(ftable(df$var1, df$var2))


unique works on data.frame so unique(df[c("var1","var2")]) should be what you want.

Another option is distinct from dplyr package:

df %>% distinct(var1, var2) # or distinct(df, var1, var2)

Note:

For older versions of dplyr (< 0.5.0, 2016-06-24) distinct required additional step

df %>% select(var1, var2) %>% distinct

(or oldish way distinct(select(df, var1, var2))).


@Marek's answer is obviously correct, but may be outdated. The current dplyrversion (0.7.4) allows for an even simpler code:

Simply use:

df %>% distinct(var1, var2)

If you want to keep all columns, add

df %>% distinct(var1, var2, .keep_all = TRUE)


To KEEP all other variables in df use this:

unique_rows <- !duplicated(df[c("var1","var2")])

unique.df <- df[unique_rows,]

Another less recommended method is using row.names() #(see David's comment below):

unique_rows <- row.names(unique(df[c("var1","var2")]))

unique.df <- df[unique_rows,]


In addition to answers above, the data.table version:

setDT(df)

unique_dt = unique(df, by = c('var1', 'var2'))
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜