
Counting vectors with NA included

By accident, I found that R counts the elements of a vector containing NAs in an interesting way:

> temp <- c(NA,NA,NA,1) # 4 items
> length(temp[temp>1])
[1] 3

> temp <- c(NA,NA,1) # 3 items
> length(temp[temp>1])
[1] 2

At first I assumed R would collapse all the NAs into a single NA, but this is not the case.

Can anyone explain? Thanks.


You were expecting the comparison to produce only TRUEs and FALSEs (and here only FALSEs), but a logical vector can also contain NAs, and each NA used as an index returns an NA element. If you were hoping for a length-zero result, you have at least three other choices:

> temp <- c(NA,NA,NA,1) # 4 items
>  length(temp[ which(temp>1) ] )
[1] 0

> temp <- c(NA,NA,NA,1) # 4 items
>  length(subset( temp, temp>1) )
[1] 0

> temp <- c(NA,NA,NA,1) # 4 items
>  length( temp[ !is.na(temp) & temp>1 ] )
[1] 0

You will find the last form in much of the internal code of well-established functions. I happen to think the first version is more economical and easier to read, but R Core seems to disagree. I have been advised several times on R-help not to wrap which() around logical expressions. I remain unconvinced. It is correct, however, that one should not combine which() with negative indexing.
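As a rough sketch of why the which() form gives the shorter result: which() returns only the positions where the condition is TRUE, so NA comparisons drop out before indexing happens. The vector `x` below is made-up illustration data, not taken from the question.

```r
x <- c(NA, 5, NA, 1)      # hypothetical example data
cond <- x > 1             # NA TRUE NA FALSE
idx <- which(cond)        # positions of TRUE only; NA entries are dropped

with_na <- x[cond]        # NA 5 NA: each NA in the index yields an NA element
without_na <- x[idx]      # 5

length(with_na)           # 3
length(without_na)        # 1
```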

EDIT: The reason not to use the "minus which" construct (negative indexing with which()) is that when no item passes the which() test, and you would therefore expect every item to be returned, it returns an empty vector instead:

temp <- c(1, 2, 3, 4, NA)
temp[!(temp > 5)]
# [1]  1  2  3  4 NA    as expected
temp[-which(temp > 5)]
# numeric(0)            not as expected
temp[!(temp > 5) & !is.na(temp)]
# [1] 1 2 3 4           a correct way to handle negation
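Breaking the "minus which" case into steps may make the empty result less surprising. This is only a sketch; the variable names `drop_idx`, `result`, and `guarded` are introduced here for illustration, and the length check at the end is one possible guard rather than a recommended idiom.

```r
temp <- c(1, 2, 3, 4, NA)

drop_idx <- which(temp > 5)   # nothing is TRUE, so this is integer(0)
length(drop_idx)              # 0

# Negating integer(0) is still integer(0), and indexing with a
# zero-length index selects nothing rather than "everything except nothing".
result <- temp[-drop_idx]
length(result)                # 0

# One possible guard (an assumption, not from the answer above):
# only apply negative indexing when there is something to drop.
guarded <- if (length(drop_idx) > 0) temp[-drop_idx] else temp
length(guarded)               # 5
```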

I admit that the notion that NA's should select NA elements seems a bit odd, but it is rooted in the history of S and therefore R. There is a section in ?"[" about "NA's in indexing". The rationale is that each NA as an index should return an unknown result, i.e. another NA.
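The rule can be seen directly with a small made-up vector (`x` and the result names are illustrative only): an NA in a logical or numeric index produces an NA in the result, and a bare logical NA is recycled to the full length of the vector.

```r
x <- c(10, 20, 30)                   # hypothetical example data

by_logical <- x[c(TRUE, NA, FALSE)]  # 10 NA
by_position <- x[c(1, NA)]           # 10 NA
by_bare_na <- x[NA]                  # NA NA NA (logical NA is recycled)
```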


If you break down each command and look at the output, it's more enlightening:

> tmp = c(NA, NA, 1)
> tmp > 1
[1]    NA    NA FALSE
> tmp[tmp > 1]
[1] NA NA

So, when we next evaluate length(tmp[tmp > 1]), it's as if we're executing length(c(NA, NA)). It is fine to have a vector full of NAs: it has a fixed length, as if we'd created it via NA * vector(length = 2), which is different from NA * vector(length = 3).


You can use 'sum':

> tmp <- c(NA, NA, NA, 3)
> sum(tmp > 1)
[1] NA
> sum(tmp > 1, na.rm=TRUE)
[1] 1

A bit of explanation: 'sum' expects numbers but 'tmp > 1' is logical. So it is automatically coerced to be numeric: TRUE => 1; FALSE => 0; NA => NA.
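The coercion can be made visible with as.integer(), which is used here only to illustrate what sum() sees. The names `cond`, `n_true`, and `n_na` are introduced for this sketch.

```r
tmp <- c(NA, NA, NA, 3)
cond <- tmp > 1                     # NA NA NA TRUE

as_numbers <- as.integer(cond)      # NA NA NA 1

n_true <- sum(cond, na.rm = TRUE)   # 1: counts only the TRUE values
n_na <- sum(is.na(cond))            # 3: counts the comparisons that were NA
```

Counting the NAs separately with is.na() can be a useful sanity check, since na.rm = TRUE silently discards them.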

I don't think there is anything precisely like this in 'The R Inferno' but this is definitely the sort of question that it is aimed at. http://www.burns-stat.com/pages/Tutor/R_inferno.pdf
