Chi Square Test using Frequencies, Bins, CDF, Python
I am trying to write a chi square goodness-of-fit test for the Beta distribution from scratch, without using any external functions. The code below reports a p-value of '1' for the fit, even though kstest from scipy.stats returns zero. The data is normally distributed, so my function should also return zero.
import numpy as np
from scipy.stats import chi2
from scipy.stats import beta
from scipy.stats import kstest
from scipy.stats import norm
preds = norm.rvs(5,2,size=200)
preds.sort()
bin_size = 30
bins = np.linspace(0,10,bin_size)
counts = np.digitize(preds, bins)
mean = 5
var = 2
sum = 0
for i in range(len(bins)-1):
    p = beta.cdf(bins[i+1], mean, var) - beta.cdf(bins[i], mean, var)
    freq = len(counts[counts==i]) / float(len(counts))
    sum = sum + ((freq - p)**2)/p
dof = len(counts)-2
pval = 1 - chi2.cdf(sum, dof)
print(pval)
In the code, I create bins, measure frequencies based on the bins, calculate the expected frequency of each bin using the Beta distribution CDF, and sum the terms to get the X^2 test statistic.
The kstest call is
print(kstest(preds, 'beta', [mean, var]))
What am I doing wrong here?
Thanks,
I don't think your answer to your own question is correct, and there are several problems in your code.
First, in your implementation the dof calculated as len(counts)-2 is the same thing as len(preds)-2, so changing that doesn't make any difference.
Second, to do a Chi^2 test on the parameter fit, you need to construct bins that are MECE (mutually exclusive and collectively exhaustive): no two bins overlap, and together they span all possible values of X. However, by setting up your bins with bins = np.linspace(0,10,bin_size), you forced the rightmost bin to stop at 10, while the Gaussian distribution spans -inf to inf. So there is a chance that some of the random numbers you generated fall above 10 (or below 0).
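One way to get bins that cover the whole real line is to treat the chosen edges as interior edges and let the two outer bins run to -inf and +inf. The sketch below is only an illustration of that idea; the helper name count_in_exhaustive_bins, the seed, and the choice of 9 edges are assumptions, not part of the original code.

```python
import numpy as np

def count_in_exhaustive_bins(data, interior_edges):
    """Count observations in bins that together cover the whole real line.

    k interior edges define k + 1 bins:
    (-inf, e0), [e0, e1), ..., [e_{k-1}, +inf).
    """
    interior_edges = np.asarray(interior_edges, dtype=float)
    # np.digitize returns 0 for values below the first edge and
    # len(interior_edges) for values at or above the last edge,
    # so no observation is dropped.
    idx = np.digitize(data, interior_edges)
    return np.bincount(idx, minlength=len(interior_edges) + 1)

rng = np.random.default_rng(0)            # arbitrary seed, for reproducibility
sample = rng.normal(5, 2, size=200)       # same shape of data as in the question
observed = count_in_exhaustive_bins(sample, np.linspace(0, 10, 9))
print(observed.sum())                     # every observation lands in exactly one bin
```

Because the outer bins are open-ended, the counts always add up to the sample size, whatever values the generator produces.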
But that might be less of a problem than this one: the expected count in each bin is conventionally required to be at least 5. With your binning (30 edges for 200 samples), many bins will almost always hold fewer than 5 counts, and some will hold 0. A bin whose expected value p is 0 leads to a division by zero in the subsequent sum calculation, and that could give a rejection no matter whether the fit is good or bad. I think that's why you get a 0 after changing the dof to len(preds)-2: you just happen to have at least one such bin.
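A common way to keep every bin well populated is to place the edges at quantiles of the hypothesised distribution, so that each bin has the same expected count. The sketch below assumes the Beta(5, 2) hypothesis from the question and an illustrative choice of 10 bins; neither the bin count nor the variable names come from the original post.

```python
import numpy as np
from scipy.stats import beta

n_samples = 200   # sample size used in the question
n_bins = 10       # illustrative choice: 200 / 10 = 20 expected counts per bin

# Hypothesised distribution from the question: Beta(5, 2).
dist = beta(5, 2)

# Interior edges at the 10%, 20%, ..., 90% quantiles of the hypothesis.
interior_edges = dist.ppf(np.arange(1, n_bins) / n_bins)

# Probability mass of each bin; the two outer bins run to the ends of the support.
cdf_at_edges = np.concatenate(([0.0], dist.cdf(interior_edges), [1.0]))
bin_probs = np.diff(cdf_at_edges)
expected_counts = n_samples * bin_probs

print(expected_counts.min() >= 5)   # the rule-of-thumb check on expected counts
```

With equal-probability bins the expected count is simply n_samples / n_bins, so the minimum of 5 translates into at most n_samples / 5 bins.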
Another problem is the calculation of Chi^2. I think you should use the actual counts in each bin rather than frequencies:
p = beta.cdf(bins[i+1], mean, var) - beta.cdf(bins[i], mean, var)
p = p*200
freq = len(counts[counts==i])
sum = sum + ((freq - p)**2)/p
So both p and freq are the number of counts in each category, rather than relative frequencies. But I am not entirely sure about this.
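To illustrate why counts and frequencies give different numbers: Pearson's statistic is the sum of (observed - expected)^2 / expected over the bins, computed on counts, and the same sum computed on relative frequencies is smaller by a factor of the sample size n. The sketch below uses made-up counts, and the helper name chi_square_statistic is hypothetical.

```python
import numpy as np

def chi_square_statistic(observed, expected):
    """Pearson's statistic: sum over bins of (O - E)^2 / E."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return float(np.sum((observed - expected) ** 2 / expected))

# Made-up example: 60 observations in 3 bins, 20 expected in each.
observed = np.array([10, 20, 30])
expected = np.array([20.0, 20.0, 20.0])
n = observed.sum()

stat_counts = chi_square_statistic(observed, expected)
stat_freqs = chi_square_statistic(observed / n, expected / n)

# The frequency-based sum has to be multiplied by n to match the count-based one.
print(stat_counts, n * stat_freqs)
```

So with 200 samples, a statistic computed from relative frequencies is 200 times too small, which pushes the p-value towards 1.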
Finally, the dof is defined as number of bins - number of fitted parameters (here 2) - 1. So if you have 10 bins, dof = 10 - 2 - 1 = 7. In your code it is 200 - 2 = 198. A chi^2 distribution with such a large dof is extremely flat, which means you need an extremely large chi^2 value to reject the fit. That's the reason you get 1 using your code.
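The effect of the dof on the p-value can be seen by evaluating the same statistic against both choices. In the sketch below the statistic value 30.0 is made up purely for illustration; chi2.sf is the survival function, equivalent to 1 - chi2.cdf.

```python
from scipy.stats import chi2

statistic = 30.0                     # made-up test statistic, for illustration only

n_bins, n_params = 10, 2
dof_bins = n_bins - n_params - 1     # 7: bins - fitted parameters - 1
dof_samples = 200 - 2                # 198: the sample-size-based value from the question

p_bins = chi2.sf(statistic, dof_bins)
p_samples = chi2.sf(statistic, dof_samples)

# The same statistic is strong evidence against the fit with 7 dof,
# and no evidence at all with 198 dof.
print(p_bins, p_samples)
```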
The problem was with the DOF definition:
dof = len(preds)-2
is the correct choice. Also, I had to reduce the bin size to 15 in order to get a consistent '0' result. Chi^2 tests are known to be sensitive to bin size.