开发者

Python/Numpy: Convert list of bools to unsigned int

  1. What is the fastest (or most "Pythonic") way to convert

    x = [False, False, True, True]
    

    into 12? (If there is such a way.)

  2. What if x were instead a numpy.array of bools? Is there a special command for that?

开发者_高级运维

I have a large m-by-n array of booleans, where each n-element row represents a single low-dimensional hash of a high-dimensional feature vector. (In the example above, n = 4.) I would like to know the answer in order to compress my data as much as possible. Thank you.


Edit: Thank you for the responses! Using the following test code,

t = 0
for iter in range(500):
    B = scipy.signbit(scipy.randn(1000,20))
    for b in B:
        t0 = time.clock()
        # test code here
        t1 = time.clock()
        t += (t1-t0)
print t

...here were the runtimes on my Thinkpad laptop:

  • My answer: 4.26 sec
  • Sven Marnach 1: 7.88
  • Emil H: 8.51
  • Sven Marnach 2: 8.72
  • delnan: 10.14
  • liori: 53.49

Of course, I welcome any independent tests that may confirm or refute my data!


Edit: In my answer below, changing int(j) to simply j still works, but runs six times as slow! Then perhaps the other answers would become faster if the bool was casted using int. But I'm too lazy to test everything again.


Edit: liori posted results of independent tests here.


Taking various ideas from various other answers, here's another way to do it:

sum(1<<i for i, b in enumerate(x) if b)

It is quite fast in my tests - right up with the numpy method for large number of bits even though it overflows like crazy. I used liori's testing module for testing. Steve's method, with the change I suggested, is just barely faster. However, if a lot of these sorts of conversions need to be done at a time (and with not too many bits), I'm betting that numpy will be faster.


Most Pythonic might be this:

sum(2**i*b for i, b in enumerate(x))

It's hard to tell if it is also the fastest.

In numpy I would use

numpy.sum(2**numpy.arange(len(x))*x)

but this won't be faster for small arrays x, and it won't work for big arrays x since machine size integers are used instead of Pythons arbitrary precision ints.


reduce(lambda a,b:2*a+b, reversed(x))

You could get rid of reversed() if you had least significant bit at the end of array. This works with numpy.array too, and doesn't need enumerate(). From my tests seem to be faster too: no need to use exponentiation.


My initial attempt, just for reference:

def bool2int(x):
    y = 0
    for i,j in enumerate(x):
        if j: y += int(j)<<i
    return y


An elegant, pythonic, always-working way is this:

def powers(x):
    """yield powers of x, starting from x**0 forever"""
    power = 1
    while True:
        yield power
        power *= x

def bools_to_int(bools):
    # in Python 2, use itertools.izip!
    return sum(int(place) * place_weight for place_weight, place in 
               zip(powers(2), bools))

Note that you can get rid of powers (by enumerate and squaring in the comprehension, as other answers do) - but maybe it's clearer this way.


Something like this?

>>> x = [False, False, True, True]
>>> sum([int(y[1])*2**y[0] for y in enumerate(x)])
12

You can convert a numpy array to a regular list using a list() cast.

>>> a = numpy.array([1,2,3,4])
>>> a
array([1, 2, 3, 4])
>>> list(a)
[1, 2, 3, 4]


If you have a matrix, you probably want to do it like this:

#precompute powers of two
vals = 2.**np.arange(20)

B = ....
compressed = np.dot(B, vals) # matrix multiplication.

np.dot should be faster than any loop in Python. Much faster.


I was trying ipython %timeit and it seems that doing the following is faster:

y = 0
for i,j in enumerate(x):
    if j: y += 1<<i

In addition, if your boolean vector is a numpy.ndarray, converting it to python array x.tolist() and running the same seems to work faster in this case. It's all marginal, but consistent as well as, at these speeds, marginals add up well.


numpy has the packbits function for this. It also supports operations along axes:

In [3]: B = scipy.signbit(scipy.randn(1000,8)).astype("i1")

In [3]: B[0]
Out[3]: array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)

In [4]: np.packbits(B[0])
Out[4]: array([68], dtype=uint8)

In [5]: %timeit np.packbits(B, axis=1)
10000 loops, best of 3: 37 µs per loop

it works for int8 sizes for larger sizes you have to shift and or

In [8]: x # multiple of 8
Out[8]: array([1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1], dtype=int8)

In [9]: r = np.packbits(x).astype(np.int32); r
Out[9]: array([171, 129], dtype=uint8)

In [10]: r[0] << 8 | r[1] 
Out[10]: 33237

In [11]: sum(1<<i for i, b in enumerate(x[::-1]) if b)
Out[11]: 33237

if x is no multiple of 8 you have to pad in zeros


If you're willing to add another extension to the mix, I added pack() and unpack() to the development branch of gmpy. My tests show it may be 2x or 3x faster.

>>> import gmpy2
>>> gmpy2.pack([0,0,1,1],1)
mpz(12)
>>> gmpy2.unpack(12,1)
[mpz(0), mpz(0), mpz(1), mpz(1)]

Disclaimer: The development version is called gmpy2 and can co-exist with the stable version. It is still in alpha phase but will hopefully become beta in a few weeks. You need to have both GMP and MPFR libraries installed. The source is available at http://code.google.com/p/gmpy/source/checkout

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜