Chi Square Test using Frequencies, Bins, CDF, Python

2023-01-21 23:35 Q&A author:

I am trying to write a chi-square goodness-of-fit test for the Beta distribution from scratch, without relying on a ready-made test function. The code below reports a p-value of 1 (a fit), even though kstest from scipy.stats returns zero. The data are normally distributed, so my function should also return zero.

import numpy as np
from scipy.stats import chi2
from scipy.stats import beta
from scipy.stats import kstest
from scipy.stats import norm

preds = norm.rvs(5,2,size=200)
preds.sort()

bin_size = 30
bins = np.linspace(0,10,bin_size)
counts = np.digitize(preds, bins)
mean = 5
var = 2

sum = 0
for i in range(len(bins)-1):
    p = beta.cdf(bins[i+1], mean, var) - beta.cdf(bins[i], mean, var)  
    freq = len(counts[counts==i]) / float(len(counts))    
    sum = sum + ((freq - p)**2)/p

dof = len(counts)-2
pval = 1 - chi2.cdf(sum, dof)
print(pval)

In the code, I create bins, measure frequencies based on the bins, calculate expected frequency using Beta distribution CDF, and sum it up resulting in the X^2 test statistic.

The kstest call is

print(kstest(preds, 'beta', [mean, var]))

What am I doing wrong here?

Thanks,

I don't think your answer to your own question is correct, and there are a number of problems in your code.

First, in your implementation np.digitize returns one bin index per sample, so len(counts) equals len(preds), and the dof computed with len(counts)-2 is the same as len(preds)-2. Changing one to the other makes no difference.
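To make the point about np.digitize concrete, here is a small sketch with made-up sample values and edges (not the data from the question). It also shows that, with the default right=False, the bin between bins[i] and bins[i+1] is reported as index i+1, which matters when you compare against a loop variable that starts at 0.

```python
import numpy as np

# Toy values chosen only for illustration.
preds = np.array([0.5, 1.5, 1.7, 2.2, 9.9])
bins = np.linspace(0, 10, 11)          # edges 0, 1, ..., 10 -> 10 bins

idx = np.digitize(preds, bins)         # one bin index per sample
print(len(idx) == len(preds))          # True: same length as the data
print(idx)                             # [ 1  2  2  3 10]

# Per-bin counts: index 0 means "below bins[0]", index len(bins) means
# "at or above bins[-1]", so the in-range bins are indices 1..len(bins)-1.
per_bin = np.bincount(idx, minlength=len(bins) + 1)[1:len(bins)]
print(per_bin)                         # [1 2 1 0 0 0 0 0 0 1]
```

The number of bins is len(per_bin), not len(idx), and that is the quantity the degrees of freedom should be based on.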

Second, to run a Chi^2 test on a fitted distribution, you need bins that are MECE (mutually exclusive and collectively exhaustive): no two bins overlap, and together they span all possible values of X. By setting up your bins with bins = np.linspace(0,10,bin_size), you force the rightmost bin to stop at 10, while the Gaussian distribution spans -inf to inf. So there is a chance that some of the random numbers you generate fall above 10 (or below 0).
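One way to get bins that cover every possible value is to make the two outer bins open-ended. The sketch below assumes, purely for illustration, that the hypothesized distribution is N(5, 2) and that the interior edges sit at 1, 2, ..., 9; the seed and variable names are my own choices.

```python
import numpy as np
from scipy.stats import norm

# Interior edges are an arbitrary choice; the outer edges are infinite,
# so the bins are exhaustive for a variable on the whole real line.
interior = np.linspace(1, 9, 9)                        # 1, 2, ..., 9
edges = np.concatenate(([-np.inf], interior, [np.inf]))

# Probability of each bin under the hypothesized N(5, 2).
probs = np.diff(norm.cdf(edges, loc=5, scale=2))
print(len(probs))       # 10 bins
print(probs.sum())      # 1 up to rounding, because nothing is left out

# Observed counts: searchsorted maps each sample to a bin index 0..9,
# including samples below 1 or above 9.
rng = np.random.default_rng(0)
data = rng.normal(5, 2, size=200)
observed = np.bincount(np.searchsorted(interior, data, side="right"),
                       minlength=len(probs))
print(observed.sum())   # 200: every sample lands in exactly one bin
```

Because every sample is counted and the bin probabilities sum to 1, the observed and expected totals agree, which the Chi^2 statistic implicitly assumes.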

But that might be less of a problem than this one: the expected count in each bin is conventionally required to be at least 5. With your method of counting (here with 30 bin edges), many bins will almost always hold fewer than 5 samples, and some will hold none. A bin whose expected value p is 0 leads to a division by zero in the subsequent sum calculation, and the resulting infinity gives a rejection whether the fit is good or bad. I think that's why you get a 0 after changing the dof to len(preds)-2: you just happen to have at least one such bin.
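A quick way to check the "at least 5 per bin" rule is to compute the expected counts before running the test. The sketch below compares the equal-width layout from the question with equal-probability bins built from quantiles; it assumes a hypothesized N(5, 2) and 10 bins, both of which are illustrative choices rather than anything prescribed above.

```python
import numpy as np
from scipy.stats import norm

n = 200  # sample size used in the question

# Equal-width layout from the question: 30 edges on [0, 10], i.e. 29 bins.
width_edges = np.linspace(0, 10, 30)
width_expected = n * np.diff(norm.cdf(width_edges, loc=5, scale=2))

# Equal-probability layout: interior edges at the 10%, ..., 90% quantiles
# of the hypothesized distribution, with open-ended outer bins.
n_bins = 10
interior = norm.ppf(np.arange(1, n_bins) / n_bins, loc=5, scale=2)
prob_edges = np.concatenate(([-np.inf], interior, [np.inf]))
prob_expected = n * np.diff(norm.cdf(prob_edges, loc=5, scale=2))

print((width_expected < 5).sum(), "of", width_expected.size,
      "equal-width bins have an expected count below 5")
print(prob_expected.round(2))   # every bin expects n / n_bins samples
```

Equal-probability bins keep every expected count at n / n_bins, so the rule of thumb reduces to choosing n_bins no larger than n / 5.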

Another problem is the calculation of Chi^2. I think you should not use relative frequencies, but the actual counts in each bin:

p = beta.cdf(bins[i+1], mean, var) - beta.cdf(bins[i], mean, var)  
p = p*200
freq = len(counts[counts==i])    
sum = sum + ((freq - p)**2)/p

So both p and freq are the number of counts in each category, rather than relative frequencies. But I am not entirely sure about this.
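The difference between the two versions can be checked numerically. In the sketch below the observed counts and bin probabilities are toy numbers invented for illustration; the comparison against scipy.stats.chisquare is only a cross-check of the count-based formula.

```python
import numpy as np
from scipy.stats import chisquare

# Toy example: 100 samples in 5 bins (made-up numbers).
observed = np.array([12, 18, 25, 24, 21])
probs = np.array([0.1, 0.2, 0.3, 0.2, 0.2])
n = observed.sum()
expected = n * probs

# Pearson statistic from counts.
stat_counts = np.sum((observed - expected) ** 2 / expected)

# Same formula applied to relative frequencies.
freq = observed / n
stat_freq = np.sum((freq - probs) ** 2 / probs)

print(stat_counts)                      # count-based statistic
print(stat_freq * n)                    # frequency version rescaled by n
print(chisquare(observed, expected).statistic)
```

The frequency-based sum is smaller than the count-based statistic by exactly a factor of n, so it has to be multiplied by the sample size before it is compared with a Chi^2 distribution.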

Finally, the dof is defined as the number of bins - the number of fitted parameters (here 2) - 1. So if you have 10 bins, dof = 10 - 2 - 1 = 7. In your code it is 200 - 2 = 198. A chi^2 distribution with such a large dof is spread out around a mean of 198, which means you need an extremely large chi^2 value to reject the fit. That's the reason you get 1 with your code.
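Putting the points above together, a minimal sketch of the whole test might look like this. The helper chi_square_gof is hypothetical (it is not from the question or from SciPy), the hypothesized distribution is assumed to be N(5, 2), and the bins are 10 equal-probability bins with open-ended outer edges.

```python
import numpy as np
from scipy.stats import chi2, norm

def chi_square_gof(data, interior_edges, cdf, n_fitted_params):
    """Pearson chi-square test on bins covering the whole real line.

    Hypothetical helper: `cdf` is the hypothesized CDF, `interior_edges`
    are the finite bin edges; -inf and +inf are added as outer edges.
    """
    data = np.asarray(data)
    edges = np.concatenate(([-np.inf], interior_edges, [np.inf]))
    expected = data.size * np.diff(cdf(edges))          # expected counts
    observed = np.bincount(
        np.searchsorted(interior_edges, data, side="right"),
        minlength=expected.size)                        # observed counts
    stat = np.sum((observed - expected) ** 2 / expected)
    dof = expected.size - n_fitted_params - 1           # bins - params - 1
    return stat, dof, chi2.sf(stat, dof)

rng = np.random.default_rng(0)
data = rng.normal(5, 2, size=200)
interior = norm.ppf(np.arange(1, 10) / 10, loc=5, scale=2)
stat, dof, pval = chi_square_gof(
    data, interior, lambda x: norm.cdf(x, loc=5, scale=2), n_fitted_params=2)
print(stat, dof, pval)

# Effect of the dof alone: the same statistic of 30 is a clear rejection
# with 7 degrees of freedom and no evidence at all with 198.
print(chi2.sf(30, 7), chi2.sf(30, 198))
```

chi2.sf(stat, dof) is the same quantity as 1 - chi2.cdf(stat, dof) but keeps more precision for very small p-values.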

The problem was with the DOF definition:

dof = len(preds)-2

is the correct choice. Also, I had to reduce the bin size to 15 in order to get a consistent '0' result. Chi^2 tests are known to be sensitive to bin size.
