Fastest way to detect if vector has at least 1 NA?

2023-03-17 05:48

What is the fastest way to detect if a vector has at least 1 NA in R? I've been using:

sum( is.na( data ) ) > 0

But that requires examining each element, coercion, and the sum function.

As of R 3.1.0 anyNA() is the way to do this. On atomic vectors this will stop after the first NA instead of going through the entire vector as would be the case with any(is.na()). Additionally, this avoids creating an intermediate logical vector with is.na that is immediately discarded. Borrowing Joran's example:

x <- y <- runif(1e7)
x[1e4] <- NA
y[1e7] <- NA
microbenchmark::microbenchmark(any(is.na(x)), anyNA(x), any(is.na(y)), anyNA(y), times=10)
# Unit: microseconds
#           expr        min         lq        mean      median         uq
#  any(is.na(x))  13444.674  13509.454  21191.9025  13639.3065  13917.592
#       anyNA(x)      6.840     13.187     13.5283     14.1705     14.774
#  any(is.na(y)) 165030.942 168258.159 178954.6499 169966.1440 197591.168
#       anyNA(y)   7193.784   7285.107   7694.1785   7497.9265   7865.064

Notice how it is substantially faster even when we modify the last value of the vector; this is in part because of the avoidance of the intermediate logical vector.
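To make the semantics concrete, here is a small sketch on toy vectors (all made up for illustration). On atomic vectors anyNA() should give the same answer as any(is.na()), and NaN counts as missing because is.na(NaN) is TRUE.

```r
# Toy vectors, chosen only for illustration.
with_na    <- c(1.5, NA, 3)
without_na <- c(1.5, 2, 3)
with_nan   <- c(1.5, NaN)
chars      <- c("a", NA_character_)

anyNA(with_na)       # TRUE
anyNA(without_na)    # FALSE
anyNA(with_nan)      # TRUE, because is.na(NaN) is TRUE
anyNA(chars)         # TRUE
anyNA(numeric(0))    # FALSE, an empty vector has no NA

# Same answer as the older idiom, without the intermediate logical vector.
identical(anyNA(with_na), any(is.na(with_na)))   # TRUE
```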

I'm thinking:

any(is.na(data))

should be slightly faster.

We mention this in some of our Rcpp presentations and actually have some benchmarks which show a pretty large gain from embedded C++ with Rcpp over the R solution because

- a vectorised R solution still computes every single element of the vector expression;
- if your goal is just to satisfy any(), you can abort after the first match, which is what our Rcpp sugar solution does (sugar is, in essence, some C++ template magic to make C++ expressions look more like R expressions; see this vignette for more).

So by getting a compiled specialised solution to work, we do indeed get a fast solution. I should add that while I have not compared this to the solutions offered in this SO question here, I am reasonably confident about the performance.
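A minimal sketch of the sugar approach described above, assuming the Rcpp package and a working C++ toolchain are installed. The function name any_na_sugar is hypothetical, and this is not the benchmark code from the presentations; is_true(), any() and is_na() are Rcpp sugar functions.

```r
# Hypothetical helper, compiled only if Rcpp is available.
if (requireNamespace("Rcpp", quietly = TRUE)) {
  Rcpp::cppFunction('
    bool any_na_sugar(NumericVector x) {
      // any() yields a lazy sugar expression; is_true() evaluates it
      return is_true(any(is_na(x)));
    }
  ')

  x <- runif(1e6)
  x[10] <- NA
  any_na_sugar(x)   # TRUE
}
```

Timings would have to be measured on your own machine; none are claimed here.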

Edit: The Rcpp package also contains examples in the directory sugarPerformance. They show a speedup of several thousand-fold for 'sugar-can-abort-soon' over 'R-computes-full-vector-expression' for any(), but I should add that that case does not involve is.na() but a simple boolean expression.

One could write a for loop stopping at NA, but the system.time then depends on where the NA is... (if there is none, it takes looooong)

set.seed(1234)
x <- sample(c(1:5, NA), 100000000, replace = TRUE)

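nacount <- function(x) {
  # Scan element by element and stop at the first NA.
  # seq_along() also handles a zero-length x, unlike 1:length(x).
  for (i in seq_along(x)) {
    if (is.na(x[i])) {
      print(TRUE)
      break
    }
  }
}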

system.time(
  nacount(x)
)
[1] TRUE
       user      system     elapsed
       0.14        0.04        0.18 

system.time(
  any(is.na(x))
) 
       user      system     elapsed
       0.28        0.08        0.37 

system.time(
  sum(is.na(x)) > 0
)
       user      system     elapsed
       0.45        0.07        0.53
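If the loop is meant to be used as a test rather than printed, a variant that returns its answer may be more convenient. This is only a sketch; has_na_loop is a hypothetical name, and no timings are claimed for it.

```r
# Hypothetical variant of the loop above that returns TRUE/FALSE.
has_na_loop <- function(x) {
  for (i in seq_along(x)) {
    if (is.na(x[i])) return(TRUE)   # stop at the first NA
  }
  FALSE                             # reached only if no NA was found
}

has_na_loop(c(1, NA, 3))   # TRUE
has_na_loop(1:5)           # FALSE
has_na_loop(integer(0))    # FALSE, the loop body never runs
```

Using seq_along(x) rather than 1:length(x) matters for the empty case: 1:0 would iterate over 1 and 0, and indexing an empty vector at position 1 gives NA.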

Here are some actual times from my (slow) machine for some of the various methods discussed so far:

x <- runif(1e7)
x[1e4] <- NA

system.time(sum(is.na(x)) > 0)
   user  system elapsed 
  0.065   0.001   0.065 

system.time(any(is.na(x)))  
   user  system elapsed 
  0.035   0.000   0.034

system.time(match(NA,x)) 
  user  system elapsed 
 1.824   0.112   1.918

system.time(NA %in% x) 
  user  system elapsed 
 1.828   0.115   1.925 

system.time(which(is.na(x) == TRUE))
  user  system elapsed 
 0.099   0.029   0.127

It's not surprising that match and %in% are similar, since %in% is implemented using match.
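To illustrate the relationship: in base R, %in% is defined essentially as match(x, table, nomatch = 0L) > 0L, so the two calls do the same work. A small sketch on a toy vector (made up for illustration):

```r
x <- c(1, 2, NA, 4)   # toy vector for illustration

match(NA, x)                      # 3, the position of the first NA
NA %in% x                         # TRUE
match(NA, x, nomatch = 0L) > 0L   # TRUE, the same test spelled out via match()
```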

You can try:

d <- c(1,2,3,NA,5,3)

which(is.na(d) == TRUE, arr.ind=TRUE)
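Since is.na() already returns a logical vector, the comparison with TRUE is not needed, and arr.ind only makes a difference for arrays. A short sketch on the same vector, where na_pos is a name introduced here for illustration:

```r
d <- c(1, 2, 3, NA, 5, 3)

na_pos <- which(is.na(d))   # positions of the missing values
na_pos                      # 4
length(na_pos) > 0          # TRUE, so there is at least one NA
anyNA(d)                    # TRUE, the direct yes/no answer
```

which() is the better fit when the positions are needed; for a plain yes/no check it still scans the whole vector.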
