Dealing with missing values when calculating correlations

I have a huge matrix with a lot of missing values. I want to get the correlations between the variables.

1. Is the solution

cor(na.omit(matrix))

better than below?

cor(matrix, use = "pairwise.complete.obs")

I have already selected only variables having more than 20% of missing values.

2. Which method makes the most sense?


I would vote for the second option. It sounds like you have a fair amount of missing data, so you should also be looking for a sensible multiple imputation strategy to fill in the gaps. See Harrell's text "Regression Modeling Strategies" for a wealth of guidance on how to do this properly.
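To see why the choice matters with this much missing data, here is a minimal sketch on hypothetical toy data (the matrix m and its column names are made up for illustration). It only counts how many rows each option actually uses; it is not a recommendation of either one.

```r
# Toy data (hypothetical): 10 rows, 4 variables, NAs in different rows per column
set.seed(42)
m <- matrix(rnorm(40), nrow = 10, ncol = 4,
            dimnames = list(NULL, c("v1", "v2", "v3", "v4")))
m[1:2, "v1"] <- NA
m[3:4, "v2"] <- NA
m[5:6, "v3"] <- NA
m[7:8, "v4"] <- NA

# Option 1: listwise deletion keeps only rows with no NA at all
n_listwise <- nrow(na.omit(m))        # 2 rows survive (rows 9 and 10)

# Option 2: pairwise deletion keeps, per pair, the rows where both are observed
observed   <- (!is.na(m)) * 1         # 1 = observed, 0 = missing
n_pairwise <- crossprod(observed)     # 6 rows for every off-diagonal pair

cor_listwise <- cor(na.omit(m))
cor_pairwise <- cor(m, use = "pairwise.complete.obs")
```

With NAs scattered across rows, na.omit() can throw away almost everything, while the pairwise option bases each coefficient on a different subset of rows.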


I think the second option makes more sense.

You might consider using the rcorr function in the Hmisc package.

It is very fast, and only includes pairwise complete observations. The returned object contains three matrices:

  1. the correlation coefficients
  2. the number of observations used for each correlation
  3. a p-value for each correlation

This means that you can ignore correlation values based on a small number of observations (whatever that threshold is for you) or based on the p-value.

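library(Hmisc)
x <- matrix(runif(100), nrow = 10, ncol = 10)
x[x > 0.5] <- NA               # knock out roughly half of the values
result <- rcorr(x)             # list with r (correlations), n (counts), P (p-values)
result$r[result$n < 5] <- 0    # ignore correlations based on fewer than five observations
result$r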
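A variant of the same thresholding idea, sketched in base R so it runs without Hmisc: mark under-supported correlations as NA rather than 0, since a 0 can be misread as "no correlation". The data, the column names p, q, s and the threshold of 3 are all assumptions chosen for illustration.

```r
# Hypothetical data: column s has only two observed values
x <- cbind(p = c(1, 2, 3, 4, 5, 6),
           q = c(2, 1, 4, 3, 6, 5),
           s = c(NA, NA, NA, NA, 1, 2))

r <- cor(x, use = "pairwise.complete.obs")   # pairwise-complete correlations
n <- crossprod((!is.na(x)) * 1)              # observations behind each coefficient

min_obs <- 3                                 # illustrative threshold
r[n < min_obs] <- NA                         # flag as unknown instead of zero
r
```

Every entry involving s is now NA, while the p/q correlation, which rests on all six rows, is kept.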


For future readers, the post "Pairwise-complete correlation considered dangerous" may be valuable. It argues that cor(matrix, use = "pairwise.complete.obs") is dangerous and suggests alternatives such as use = "complete.obs".
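One well-known risk of pairwise deletion, which may or may not be the exact argument of that post, is that each coefficient comes from a different subset of rows, so the assembled matrix need not be a valid correlation matrix (it can fail to be positive semi-definite). A contrived sketch, with made-up vectors x, y and z:

```r
# Contrived data: each pair of variables overlaps on a different block of rows
x <- c(1, 2, 3, NA, NA, NA, 1, 2, 3)
y <- c(1, 2, 3, 1, 2, 3, NA, NA, NA)
z <- c(NA, NA, NA, 1, 2, 3, 3, 2, 1)
m <- cbind(x, y, z)

r <- cor(m, use = "pairwise.complete.obs")
# x-y = 1 and y-z = 1, yet x-z = -1: no single complete data set could produce this

ev <- eigen(r, symmetric = TRUE)$values
min(ev) < 0   # TRUE: a negative eigenvalue, so r is not positive semi-definite
```

Anything downstream that assumes a proper correlation matrix (PCA, factor analysis, a Cholesky decomposition) can misbehave on such a result.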


Try the WGCNA package. With its default settings, the base R function cor returns NA for any pair that involves missing values, and some other packages, such as ppcor, do not handle NA in your data either. You need to get rid of the NAs or set some options. The WGCNA package handles the missing values and also provides statistics such as p-values for the calculated correlations.

library(WGCNA)
varX <- seq(from = 1, to = 10, length.out = 10)
varY <- seq(from = 20, to = 50, length.out = 10)
varZ <- rnorm(10)

varZ[c(1, 5, 7)] <- NA    # introduce three missing values

mat <- cbind(varX, varY, varZ)

corAndPvalue(mat, method = 'spearman')
$cor
     varX varY varZ
varX  1.0  1.0  0.5
varY  1.0  1.0  0.5
varZ  0.5  0.5  1.0

$p
             varX         varY         varZ
varX 1.063504e-62 1.063504e-62 2.531700e-01
varY 1.063504e-62 1.063504e-62 2.531700e-01
varZ 2.531700e-01 2.531700e-01 1.411089e-39

$Z
          varX      varY      varZ
varX 51.953682 51.953682  1.228286
varY 51.953682 51.953682  1.228286
varZ  1.228286  1.228286 41.072992

$t
             varX         varY         varZ
varX 1.342177e+08 1.342177e+08 1.290994e+00
varY 1.342177e+08 1.342177e+08 1.290994e+00
varZ 1.290994e+00 1.290994e+00 1.061084e+08

$nObs
     varX varY varZ
varX   10   10    7
varY   10   10    7
varZ    7    7    7