Efficient item binning algorithm (itertools/numpy)
I think this is a common combinatorics problem, but I can't seem to find a name for it开发者_开发技巧 or any material about it. I am doing this in Python and numpy, but if there is a fast matrix method for this, I can probably translate.
Basically, given n items, I need to generate all ways to put them in m bins. As an example, binning 4 items into 3 bins would give something like [(4, 0, 0), (3, 1, 0), (3, 0, 1), (2, 2, 0), (2, 1, 1), ...]
. This is a product with a fixed total.
Implementing this with itertools is straightforward.
import itertools
def fixed_total_product(bins, num_items):
""" Return iterator of all item binning possibilities. """
return itertools.ifilter(lambda combo: sum(combo) == num_items,
itertools.product(xrange(num_items + 1), repeat=bins))
Unfortunately, I think doing subsequent calculations in loops will be inefficient. Working with this as a 2D numpy array would be faster later on, but I can't figure out an efficient way to build an array with this. I could iterate over the ifilter result, building a list of the possibilities, and use this to construct the array, but that seems like a huge waste.
I'm guessing the best way to do this is to build everything "the numpy way", but I'm not sure how to do this. There is a speedy product implementation on stackoverflow: Using numpy to build an array of all combinations of two arrays. I'm guessing you can modify this only to output products with the right sum. The size of the array should be ((m-1) + n) choose n, since there are m-1 bin boundaries.
Any ideas? Benchmarks much appreciated, but not required.
Based on the recursion C(n, k) = C(n - 1, k) + C(n - 1, k - 1), memoized, using numpy operations.
import numpy as np
def binnings(n, k, cache={}):
if n == 0:
return np.zeros((1, k))
if k == 0:
return np.empty((0, 0))
args = (n, k)
if args in cache:
return cache[args]
a = binnings(n - 1, k, cache)
a1 = a + (np.arange(k) == 0)
b = binnings(n, k - 1, cache)
b1 = np.hstack((np.zeros((b.shape[0], 1)), b))
result = np.vstack((a1, b1))
cache[args] = result
return result
if __name__ == '__main__':
import timeit
print timeit.timeit('binnings(20, 5, {})', setup='from __main__ import binnings', number=1)
Time in seconds for (20, 5):
0.0129251480103
There probably is a faster way using a few different tricks in numpy. numpy.indices
is where you want to start. It's essentially the equivalent of itertools.product
, once you combine it with rollaxis
. Sven Marnach's answer in this question is an excellent example of this (There's a minor error in his last example, however, which is what you want to use. It should be numpy.indices((len(some_list) + 1), * some_length...
)
However, for something like this, it's likely to be more readable using itertools.
numpy.fromiter
is a bit faster than explicitly converting things to a list, especially if you give it a count of the number of items in the iterator. The main advantage is that is uses considerably less memory, which can be very helpful in the case of large iterators.
Here are some examples using both the numpy.indices
trick and various ways of converting the iterator to a numpy array:
import itertools
import numpy as np
import scipy.special
def fixed_total_product(bins, num_items):
return itertools.ifilter(lambda combo: sum(combo) == num_items,
itertools.product(xrange(num_items + 1), repeat=bins))
def fixed_total_product_fromiter(bins, num_items):
size = scipy.special.binom(bins - 1 + num_items, num_items)
combinations = fixed_total_product(bins, num_items)
indicies = (idx for row in combinations for idx in row)
arr = np.fromiter(indicies, count=bins * int(size), dtype=np.int)
return arr.reshape(-1, bins)
def fixed_total_product_fromlist(bins, num_items):
return np.array(list(fixed_total_product(bins, num_items)), dtype=np.int)
def fixed_total_product_numpy(bins, num_items):
arr = np.rollaxis(np.indices((num_items + 1,) * bins), 0, bins + 1)
arr = arr.reshape(-1, bins)
arr = np.arange(num_items + 1)[arr]
sums = arr.sum(axis=1)
return arr[sums == num_items]
m, n = 5, 20
if __name__ == '__main__':
import timeit
list_time = timeit.timeit('fixed_total_product_fromlist(m, n)',
setup='from __main__ import fixed_total_product_fromlist, m, n',
number=1)
iter_time = timeit.timeit('fixed_total_product_fromiter(m, n)',
setup='from __main__ import fixed_total_product_fromiter, m, n',
number=1)
numpy_time = timeit.timeit('fixed_total_product_numpy(m, n)',
setup='from __main__ import fixed_total_product_numpy, m, n',
number=1)
print 'All combinations of {0} items drawn from a set of {1} items...'.format(m,n)
print ' Converting to a list and then an array: {0} sec'.format(list_time)
print ' Using fromiter: {0} sec'.format(iter_time)
print ' Using numpy.indices: {0} sec'.format(numpy_time)
As for timing:
All combinations of 5 items drawn from a set of 20 items...
Converting to a list and then an array: 2.75901389122 sec
Using fromiter: 2.10619592667 sec
Using numpy.indices: 1.44955015182 sec
You'll notice that none of them are particularly fast.
You're using a brute-force algorithm (generate all possible combinations and then filter them), and I'm just copying it in the numpy-based example here.
There is probably a much more efficient way to do this! However, I'm not a CS or math person by any means, so I don't know if there's a well-known algorithm to do this without generating all possible combinations first...
Good luck, at any rate!
精彩评论