How to test whether a vector contains repetitive elements?
How do you test whether a vector contains repetitive elements in R?
I think I found the answer. Use the duplicated() function:
a=c(3,5,7,2,7,9)
b=1:10
any(duplicated(a)) # TRUE
any(duplicated(b)) # FALSE
Also try rle(x) to find the lengths of runs of identical values in x.
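As a minimal sketch of the rle() suggestion (the vector x and the name has_consecutive_repeat are made up for illustration), a run length greater than 1 means a value repeats back to back. Note that this only sees adjacent repeats, not repeats elsewhere in the vector:

```r
# Assumed example vector: 5 appears twice in a row, then once more later.
x <- c(1, 2, 5, 5, 7, 5)

runs <- rle(x)
runs$lengths  # 1 1 2 1 1
runs$values   # 1 2 5 7 5

# Any run longer than 1 is a consecutive repeat.
has_consecutive_repeat <- any(runs$lengths > 1)
has_consecutive_repeat  # TRUE
```

The trailing 5 forms its own run of length 1, so rle() alone will not flag non-adjacent duplicates.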
If you're looking for consecutive repeats, you can use diff. A zero in the result marks two equal neighbouring elements.
a <- 1:10
b <- c(1:5, 5, 7, 8, 9, 10)
diff(a)
diff(b)
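To turn the diff() output into a yes/no answer, one option (a sketch, not the only way) is to test the differences against zero explicitly. This reuses the a and b defined above:

```r
a <- 1:10
b <- c(1:5, 5, 7, 8, 9, 10)

# TRUE if any element equals its immediate predecessor.
any(diff(a) == 0)    # FALSE
any(diff(b) == 0)    # TRUE

# Position of the zero difference: b[5] and b[6] are equal.
which(diff(b) == 0)  # 5
```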
Or, to detect repeats anywhere in the vector, compare its length with the number of unique values (TRUE means no repeats):
length(a) == length(unique(a))
length(b) == length(unique(b))
Check this:
> all(diff(c(1,2,3)))
[1] TRUE
Warning message:
In all(diff(c(1, 2, 3))) : coercing argument of type 'double' to logical
> all(diff(sort(c(1,2,4,2,3))))
[1] FALSE
Warning message:
In all(diff(sort(c(1, 2, 4, 2, 3)))) : coercing argument of type 'double' to logical
You can get rid of the warnings by making the comparison explicit, for example all(diff(x) != 0), or by casting with as.logical().
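The warnings come from all() coercing the numeric differences to logical. A warning-free variant might look like the sketch below, where all_distinct is a hypothetical helper name, not a base R function:

```r
# Hypothetical helper: TRUE when no value occurs more than once.
# sort() brings equal values next to each other (and drops NAs by default),
# so any duplicate shows up as a zero difference.
all_distinct <- function(x) all(diff(sort(x)) != 0)

all_distinct(c(1, 2, 3))        # TRUE
all_distinct(c(1, 2, 4, 2, 3))  # FALSE
```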
As mentioned in the comment section by Hadley: anyDuplicated will be a bit faster for very long vectors, since it can terminate when it finds the first duplicate.
Example
a=c(3,5,7,2,7,9)
b=1:10
anyDuplicated(a) != 0L # TRUE
anyDuplicated(b) != 0L # FALSE
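The comparison with 0L is needed because anyDuplicated() does not return a logical: it returns the index of the first duplicated element, or 0 if there is none. A short illustration with the same vectors:

```r
a <- c(3, 5, 7, 2, 7, 9)
b <- 1:10

anyDuplicated(a)  # 5: a[5] repeats the 7 first seen at a[3]
anyDuplicated(b)  # 0: no duplicates
```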
Benchmark with 1 million observations:
set.seed(2011)
x <- sample(1e7, size = 1e6, replace = TRUE)
bench::mark(
ZNN = any(duplicated(x)),
RL = length(x) != length(unique(x)),
BUA = !all(diff(sort(x))),
AD = anyDuplicated(x) != 0L
)
# A tibble: 4 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 ZNN 64.62ms 70.04ms 11.5 11.8MB 0 8 0 693ms <lgl [1]> <df[,3] [2 x 3]> <bch:tm> <tibble [8 x 3]>
2 RL 66.95ms 70.67ms 12.5 15.4MB 0 7 0 561ms <lgl [1]> <df[,3] [3 x 3]> <bch:tm> <tibble [7 x 3]>
3 BUA 84.66ms 87.79ms 10.6 42MB 3.54 3 1 283ms <lgl [1]> <df[,3] [11 x 3]> <bch:tm> <tibble [4 x 3]>
4 AD 2.45ms 2.87ms 314. 8MB 5.98 105 2 335ms <lgl [1]> <df[,3] [1 x 3]> <bch:tm> <tibble [107 x 3]>
Benchmark with 100 observations:
set.seed(2011)
x <- sample(1e7, size = 100, replace = TRUE)
bench::mark(
ZNN = any(duplicated(x)),
RL = length(x) != length(unique(x)),
BUA = !all(diff(sort(x))),
AD = anyDuplicated(x) != 0L
)
# A tibble: 4 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 ZNN 7.14us 8.93us 60429. 1.48KB 6.04 9999 1 165.5ms <lgl [1]> <df[,3] [2 x 3]> <bch:tm> <tibble [10,000 x 3]>
2 RL 8.03us 9.37us 83754. 1.92KB 0 10000 0 119.4ms <lgl [1]> <df[,3] [3 x 3]> <bch:tm> <tibble [10,000 x 3]>
3 BUA 54.89us 61.58us 8317. 4.83KB 6.74 3701 3 445ms <lgl [1]> <df[,3] [11 x 3]> <bch:tm> <tibble [3,704 x 3]>
4 AD 5.8us 6.69us 123838. 1.05KB 0 10000 0 80.8ms <lgl [1]> <df[,3] [1 x 3]> <bch:tm> <tibble [10,000 x 3]>