Dealing with missing values for correlations calculation
I have a huge matrix with a lot of missing values, and I want to compute the correlations between its variables.
1. Is the solution
cor(na.omit(matrix))
better than the following?
cor(matrix, use = "pairwise.complete.obs")
I have already selected only variables having more than 20% of missing values.
2. Which method makes the most sense?
I would vote for the second option. It sounds like you have a fair amount of missing data, so you may want to look for a sensible multiple imputation strategy to fill in the gaps. See Harrell's text "Regression Modeling Strategies" for a wealth of guidance on how to do this properly.
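To make the difference between the two options concrete, here is a minimal sketch on a toy matrix. The data and the names m, complete_rows and n_ab are made up for illustration; only cor, na.omit and complete.cases come from base R.

```r
# Toy matrix (hypothetical data): 6 rows, 3 variables, NAs scattered around.
m <- cbind(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, 1, 4, 3, NA, 5),
  c = c(NA, 3, 1, NA, 2, 6)
)

# Option 1: listwise deletion. Only rows without any NA survive.
complete_rows <- na.omit(m)
nrow(complete_rows)              # rows left for every correlation
cor_listwise <- cor(complete_rows)

# Option 2: pairwise deletion. Each pair of columns uses every row
# in which both of those two columns are observed.
cor_pairwise <- cor(m, use = "pairwise.complete.obs")

# Rows available for the (a, b) pair alone under pairwise deletion.
n_ab <- sum(complete.cases(m[, c("a", "b")]))

cor_listwise
cor_pairwise
```

With many variables and scattered NAs, listwise deletion can discard most of the rows, while pairwise deletion keeps more data per pair at the cost of each entry being based on a different subset of rows.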
I think the second option makes more sense.
You might consider using the rcorr function in the Hmisc package.
It is very fast and uses only pairwise complete observations. The returned object contains three matrices:
- r, the correlation coefficients
- n, the number of observations used for each correlation
- P, the p-value for each correlation
This means that you can ignore correlation values based on a small number of observations (whatever that threshold is for you) or based on the p-value.
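library(Hmisc)
x <- matrix(runif(100), nrow = 10, ncol = 10)  # random 10 x 10 matrix
x[x > 0.5] <- NA                               # turn values above 0.5 into NAs
result <- rcorr(x)                             # pairwise-complete correlations
result$r[result$n < 5] <- 0                    # ignore fewer than five observations
result$r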
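If Hmisc is not available, the same masking idea can be sketched in base R. This is only an illustration under assumptions: the data, the threshold min_n and the variable names are hypothetical, and the entries are set to NA rather than 0 so that a masked value cannot be mistaken for a genuine zero correlation.

```r
# Hypothetical data: 8 rows, 3 variables with different amounts of missingness.
m <- cbind(
  a = c(1, 2, 3, 4, 5, 6, 7, 8),
  b = c(2, 1, 4, 3, 6, 5, 7, NA),
  c = c(NA, NA, NA, NA, NA, 1, 3, 2)
)

# Pairwise-complete correlations from base R.
r <- cor(m, use = "pairwise.complete.obs")

# Number of rows where both variables of each pair are observed:
# (1 - is.na(m)) is a 0/1 indicator matrix, so its cross-product counts overlaps.
n <- crossprod(1 - is.na(m))

# Mask correlations supported by fewer than min_n observations.
# Note that this also masks a diagonal entry when the variable itself
# has fewer than min_n observed values.
min_n <- 5   # assumed threshold, pick whatever suits your data
r[n < min_n] <- NA

n
r
```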
For future readers, the article "Pairwise-complete correlation considered dangerous" may be valuable. It argues that cor(matrix, use = "pairwise.complete.obs") is dangerous and suggests alternatives such as use = "complete.obs".
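One well-known hazard of pairwise deletion (not necessarily the only point the article makes) is that the result need not be a valid correlation matrix, because each entry is computed from a different subset of rows. The following sketch uses a deliberately contrived, hypothetical data set to show this:

```r
# Contrived data: every pair of variables overlaps on a different block of rows.
x <- c(1, 2, 3, NA, NA, NA, 1, 2, 3)
y <- c(1, 2, 3, 1, 2, 3, NA, NA, NA)
z <- c(NA, NA, NA, 1, 2, 3, 3, 2, 1)
m <- cbind(x, y, z)

# cor(x, y) = 1 and cor(y, z) = 1, yet cor(x, z) = -1:
# no single complete data set could produce these three values together.
r <- cor(m, use = "pairwise.complete.obs")
r

# A valid correlation matrix has no negative eigenvalues; this one does.
ev <- eigen(r, symmetric = TRUE, only.values = TRUE)$values
ev
min(ev) < 0
```

A matrix like this can break downstream methods that assume positive semi-definiteness, such as PCA or anything that inverts the correlation matrix.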
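Try the WGCNA package. With its default settings, the base R function cor returns NA for any pair of variables that contains missing values, and some other packages such as ppcor fail with an error if you have NAs in your data. You need to get rid of the NAs or set the appropriate options (for cor, the use argument). The WGCNA package handles missing values and also provides some statistics, such as p-values, for the calculated correlations.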
library(WGCNA)
varX <- seq(from=1, to=10, length=10)
varY <- seq(from=20, to=50, length=10)
varZ <- rnorm(10)
varZ[c(1,5,7)] <- NA
mat <- cbind(varX, varY, varZ)
corAndPvalue(mat, method='spearman')
$cor
     varX varY varZ
varX  1.0  1.0  0.5
varY  1.0  1.0  0.5
varZ  0.5  0.5  1.0
$p
             varX         varY         varZ
varX 1.063504e-62 1.063504e-62 2.531700e-01
varY 1.063504e-62 1.063504e-62 2.531700e-01
varZ 2.531700e-01 2.531700e-01 1.411089e-39
$Z
          varX      varY      varZ
varX 51.953682 51.953682  1.228286
varY 51.953682 51.953682  1.228286
varZ  1.228286  1.228286 41.072992
$t
             varX         varY         varZ
varX 1.342177e+08 1.342177e+08 1.290994e+00
varY 1.342177e+08 1.342177e+08 1.290994e+00
varZ 1.290994e+00 1.290994e+00 1.061084e+08
$nObs
     varX varY varZ
varX   10   10    7
varY   10   10    7
varZ    7    7    7
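For a single pair of variables, a p-value can also be obtained without extra packages. As a rough sketch with made-up data, base R's cor.test drops incomplete pairs on its own before testing:

```r
# Hypothetical data: varZ has three missing values.
varX <- 1:10
varZ <- c(NA, 4, 1, 7, NA, 3, NA, 9, 6, 8)

# cor.test keeps only the rows where both vectors are observed.
ct <- cor.test(varX, varZ, method = "pearson")

ct$estimate    # correlation computed on the 7 complete pairs
ct$p.value     # p-value of the test against zero correlation
ct$parameter   # degrees of freedom: complete pairs minus 2

# The estimate agrees with cor() restricted to complete observations.
all.equal(unname(ct$estimate), cor(varX, varZ, use = "complete.obs"))
```

This only covers one pair at a time, so for a whole matrix a function such as corAndPvalue or rcorr is more convenient.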