开发者

Binomial test in Python for very large numbers

I need to do a binomial test in Pytho开发者_开发知识库n that allows calculation for 'n' numbers of the order of 10000.

I have implemented a quick binomial_test function using scipy.misc.comb, however, it is pretty much limited around n = 1000, I guess because it reaches the biggest representable number while computing factorials or the combinatorial itself. Here is my function:

from scipy.misc import comb
def binomial_test(n, k):
    """Calculate binomial probability
    """
    p = comb(n, k) * 0.5**k * 0.5**(n-k)
    return p

How could I use a native python (or numpy, scipy...) function in order to calculate that binomial probability? If possible, I need scipy 0.7.2 compatible code.

Many thanks!


Edited to add this comment: please note that, as Daniel Stutzbach mentions, the "binomial test" is probably not what the original poster was asking for (though he did use this expression). He seems to be asking for the probability density function of a binomial distribution, which is not what I'm suggesting below.

Have you tried scipy.stats.binom_test?

rbp@apfelstrudel ~$ python
Python 2.6.2 (r262:71600, Apr 16 2009, 09:17:39) 
[GCC 4.0.1 (Apple Computer, Inc. build 5250)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from scipy import stats
>>> print stats.binom_test.__doc__

    Perform a test that the probability of success is p.

    This is an exact, two-sided test of the null hypothesis
    that the probability of success in a Bernoulli experiment
    is `p`.

    Parameters
    ----------
    x : integer or array_like
        the number of successes, or if x has length 2, it is the
        number of successes and the number of failures.
    n : integer
        the number of trials.  This is ignored if x gives both the
        number of successes and failures
    p : float, optional
        The hypothesized probability of success.  0 <= p <= 1. The
        default value is p = 0.5

    Returns
    -------
    p-value : float
        The p-value of the hypothesis test

    References
    ----------
    .. [1] http://en.wikipedia.org/wiki/Binomial_test


>>> stats.binom_test(500, 10000)
4.9406564584124654e-324

Small edit to add documentation link: http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom_test.html#scipy.stats.binom_test

BTW: works on scipy 0.7.2, as well as on current 0.8 dev.


Any solution that looks like comb(n, k) * 0.5**k * 0.5**(n-k) isn't going to work for large n. On most (all?) platforms, the smallest value a Python float can store is around 2**-1022. For large n-k or large k, the right-hand side will get rounded to 0. Likewise, comb(n, k) can grow so large that it will not fit in a float.

A more robust approach is to compute the probability density function as the difference between two consecutive points in the cumulative distribution function, which can be computed using the regularized incomplete beta function (look in SciPy's "special functions" package). Mathematically:

pdf(p, n, k) = cdf(p, n, k) - cdf(p, n, k-1)

Another option is to use the Normal approximation, which is quite accurate for large n. If speed is a concern, this is probably the way to go:

from math import *

def normal_pdf(x, m, v):
    return 1.0/sqrt(2*pi*v) * exp(-(x-m)**2/(2*v))

def binomial_pdf(p, n, k):
    if n < 100:
        return comb(n, k) * p**k * p**(n-k)  # Fall back to your current method
    return normal_pdf(k, n*p, n*p*(1.0-p))

I haven't tested the code, but that should give you the general idea.


GMPY also supports extended precision floating point calculations. For example:

>>> from gmpy import *
>>>
>>> def f(n,k,p,prec=256):
...     return mpf(comb(n,k),prec) * mpf(p,prec)**k * mpf(1-p,prec)**(n-k)
...
>>> print(f(1000,500,0.5))
0.0252250181783608019068416887621024545529410193921696384762532089115753731615931
>>>

I specified a floating point precision of 256 bits. By the way, source forge version is way out of date. The current version is maintained at code.google.com and supports Python 3.x. (Disclaimer: I'm the current maintainer of gmpy.)

casevh


I would look into the GNU Multi-Precision package (gmpy), which allows you to perform arbitrary precision calculations: you could probably do:

comb(n, k, exact=1)/2**k/2**(n-k)

but with the long integers of gmpy.

In fact, if you use exact integer computations, you can easily reach n=10000 for the combinations part; for this, you must use:

comb(n, k, exact=1)

instead of the floating point approximation comb(n, k), which overflows.

However, as the Original Poster noted, the returned (long) integer may be too long to be multiplied by a float!

Furthermore, one quickly runs into another problem: 0.5**1000=9.3…e-302 is already very close to the float underflow…

In summary: if you really need precise results for all k for n~10,000, you need to use a different approach than the formula from the original post, which suffers from the limitations of double precision floating point arithmetics. Using gmpy as indicated above could be a solution (not tested!).


Not specifically a Python solution, but if you can deal with small fractional errors, you might try using Stirling's approximation for n!:

comb(n, k) = n!/(k! * (n-k)!), where n! is approximately sqrt(2*Pin)(n/e)^n for large n.

For n>1000 the fractional errors should be very small.

For the probability calculation with large n, use logarithms for intermediate results:

log p = log(comb(n, k)) - n * log(2)

p = exp(log(p))


#  This imports the array function form numpy

from numpy import array

    # the following defines the factorial function to be used in the binomial commands/
# n+1 is used in the range to include the nth term

def factorial (n):
    f=1
    for x in range(1,n+1):
        f=f*(x)
    return f

# The follwong calculates the binomial coefficients for given values of n & k 
def binomial (n,k):
    b=1
    b=(factorial(n)/(factorial(k)*factorial(n-k)))
    return int(b)

# the following lines define the pascal triangle , and print it out for 20 rows./
# in order to include nth term, the n +1 term needs to be in the range. The commands/
# append the next binomial coeficiant to a raw first and then append rows to the triangle/
# and prints a 20 row size pascal triangle
def pascal(T):
    triangle=[]
    for n in range(T):
        r=[]
        for k in range(n+1):
            r.append(binomial(n,k))
        triangle.append(r)
    return triangle

for r in pascal(20):
    print((r))
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜