Encountered invalid value when I use pearsonr
Maybe I made a mistake. If so, I am sorry to ask this.
I want to calculate Pearson's correlation coefficient using scipy's pearsonr function.
from scipy.stats import pearsonr
X = [4, 4, 4, 4, 4, 4]
Y = [4, 5, 5, 4, 4, 4]
pearsonr(X, Y)
I get the warning below:
RuntimeWarning: invalid value encountered in double_scalars
The reason for the warning is that E[X] = 4 (the expected value of X is 4), so every deviation of X from its mean is zero.
I looked at the code of the pearsonr function in scipy/stats/stats.py. Part of the function is as follows.
mx = x.mean()  # which is 4
my = y.mean()  # not necessary
xm, ym = x - mx, y - my  # xm = [0 0 0 0 0 0]
r_num = n * (np.add.reduce(xm * ym))  # r_num = 0, because xm*ym is a 1x6 zero vector
r_den = n * np.sqrt(ss(xm) * ss(ym))  # r_den = 0
r = (r_num / r_den)  # invalid value encountered in double_scalars
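To see the warning in isolation, here is a minimal sketch of just that final division. It assumes NumPy's usual floating-point error handling (set explicitly below); the names r_num and r_den mirror the excerpt, but the snippet is only an illustration, not SciPy's code. The exact wording of the warning differs between NumPy versions.

```python
import warnings
import numpy as np

# Both numerator and denominator are zero when X is constant.
r_num = np.float64(0.0)
r_den = np.float64(0.0)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    with np.errstate(invalid="warn"):
        r = r_num / r_den  # 0.0 / 0.0

print(r)
print([str(w.message) for w in caught])
```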
At the end, pearsonr returns (nan, 1.0). Should pearsonr return (0, 1.0)?
I think that if a vector has the same value in every row/column, the covariance should be zero. Thus Pearson's correlation coefficient should also be zero, by the definition of PCC.
Pearson's correlation coefficient between two variables is defined as the covariance of the two variables divided by the product of their standard deviations.
Is this a bug, or where am I making a mistake?
Pearson's correlation coefficient between two variables is defined as the covariance of the two variables divided by the product of their standard deviations.
So it's the covariance over
- the standard deviation of [4, 5, 5, 4, 4, 4] times
- the standard deviation of [4, 4, 4, 4, 4, 4].
The standard deviation of [4, 4, 4, 4, 4, 4] is zero.
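As a quick check of those quantities, here is a small sketch using NumPy. It uses the population convention (dividing by n) for both the covariance and the standard deviations, which is an assumption made for the illustration; the zero values come out the same under the sample convention.

```python
import numpy as np

X = np.array([4, 4, 4, 4, 4, 4], dtype=float)
Y = np.array([4, 5, 5, 4, 4, 4], dtype=float)

# Covariance: mean of the products of deviations from the means.
cov_xy = np.mean((X - X.mean()) * (Y - Y.mean()))

# Standard deviations of each variable.
std_x = X.std()
std_y = Y.std()

print(cov_xy, std_x, std_y)
```

The covariance and the standard deviation of X are both zero, while the standard deviation of Y is not.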
So it's the covariance over
- the standard deviation of [4, 5, 5, 4, 4, 4] times
- zero.
So it's the covariance over zero.
Anything divided by zero is undefined, which pearsonr reports as nan. The value of the covariance is irrelevant.
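If you would rather handle the constant-input case explicitly than rely on the warning, one possible approach is sketched below. The helper name pearson_or_nan is hypothetical and not part of SciPy; it computes only the coefficient (no p-value) and returns nan whenever either input has zero spread.

```python
import math
import numpy as np

def pearson_or_nan(x, y):
    """Hypothetical helper: Pearson's r, or nan if either input is constant."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xm = x - x.mean()
    ym = y - y.mean()
    # The denominator is zero exactly when one of the inputs has no spread.
    denom = math.sqrt(np.sum(xm * xm) * np.sum(ym * ym))
    if denom == 0.0:
        return float("nan")  # correlation is undefined, not zero
    return float(np.sum(xm * ym) / denom)

r_const = pearson_or_nan([4, 4, 4, 4, 4, 4], [4, 5, 5, 4, 4, 4])
r_linear = pearson_or_nan([1, 2, 3, 4], [2, 4, 6, 8])
print(r_const, r_linear)
```

Returning nan rather than 0 keeps "undefined" distinct from "uncorrelated", which matches the reasoning above.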