
Consecutive, Overlapping Subsets of Array (NumPy, Python)

I have a NumPy array [1,2,3,4,5,6,7,8,9,10,11,12,13,14] and want to have an array structured like [[1,2,3,4], [2,3,4,5], [3,4,5,6], ..., [11,12,13,14]].

Sure, this is possible by looping over the large array and appending slices of length four to a new array, but I'm curious whether there is some secret 'magic' Python method that does just this :)


You should use stride_tricks. When I first saw this, the word 'magic' did spring to mind. It's simple and is by far the fastest method.

>>> as_strided = numpy.lib.stride_tricks.as_strided
>>> a = numpy.arange(1,15)
>>> a
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
>>> b = as_strided(a, (11,4), a.strides*2)
>>> b
array([[ 1,  2,  3,  4],
       [ 2,  3,  4,  5],
       [ 3,  4,  5,  6],
       [ 4,  5,  6,  7],
       [ 5,  6,  7,  8],
       [ 6,  7,  8,  9],
       [ 7,  8,  9, 10],
       [ 8,  9, 10, 11],
       [ 9, 10, 11, 12],
       [10, 11, 12, 13],
       [11, 12, 13, 14]])

Be aware that b is just a different view of the values in a, not a copy. Do a .copy() on b if you plan to modify it.
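
For example, continuing the session above, a write through b shows up in a and in every other window that overlaps it:

>>> b[0, 1] = 99
>>> a[1]
99
>>> b[1, 0]
99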

I saw this at a SciPy conference. Here are the slides for more explanation.
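
If you use this a lot, it's easy to wrap the same trick in a small helper that derives the window count from the array length (a quick sketch; the name windows is mine):

import numpy as np
from numpy.lib.stride_tricks import as_strided

def windows(a, width):
    # number of complete windows in a 1-D array
    n = a.shape[0] - width + 1
    # repeat a's element stride for the row axis, so each row starts
    # one element after the previous one -- still a view, no copying
    return as_strided(a, shape=(n, width), strides=a.strides * 2)

windows(np.arange(1, 15), 4) reproduces the 11x4 array above. On NumPy 1.20 and later, numpy.lib.stride_tricks.sliding_window_view(a, 4) gives the same kind of view with the bounds checked for you.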


The fastest way seems to be to preallocate the array, given as option 7 right at the bottom of this answer.

>>> import numpy as np
>>> A=np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14])
>>> A
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
>>> np.array(zip(A,A[1:],A[2:],A[3:]))
array([[ 1,  2,  3,  4],
       [ 2,  3,  4,  5],
       [ 3,  4,  5,  6],
       [ 4,  5,  6,  7],
       [ 5,  6,  7,  8],
       [ 6,  7,  8,  9],
       [ 7,  8,  9, 10],
       [ 8,  9, 10, 11],
       [ 9, 10, 11, 12],
       [10, 11, 12, 13],
       [11, 12, 13, 14]])
>>> 

You can easily adapt this to do it for variable chunk size.

>>> n=5
>>> np.array(zip(*(A[i:] for i in range(n))))
array([[ 1,  2,  3,  4,  5],
       [ 2,  3,  4,  5,  6],
       [ 3,  4,  5,  6,  7],
       [ 4,  5,  6,  7,  8],
       [ 5,  6,  7,  8,  9],
       [ 6,  7,  8,  9, 10],
       [ 7,  8,  9, 10, 11],
       [ 8,  9, 10, 11, 12],
       [ 9, 10, 11, 12, 13],
       [10, 11, 12, 13, 14]])

You may wish to compare performance between this and using itertools.islice.

>>> from itertools import islice
>>> n=4
>>> np.array(zip(*[islice(A,i,None) for i in range(n)]))
array([[ 1,  2,  3,  4],
       [ 2,  3,  4,  5],
       [ 3,  4,  5,  6],
       [ 4,  5,  6,  7],
       [ 5,  6,  7,  8],
       [ 6,  7,  8,  9],
       [ 7,  8,  9, 10],
       [ 8,  9, 10, 11],
       [ 9, 10, 11, 12],
       [10, 11, 12, 13],
       [11, 12, 13, 14]])

My timing results:

1. timeit np.array(zip(A,A[1:],A[2:],A[3:]))
10000 loops, best of 3: 92.9 us per loop

2. timeit np.array(zip(*(A[i:] for i in range(4))))
10000 loops, best of 3: 101 us per loop

3. timeit np.array(zip(*[islice(A,i,None) for i in range(4)]))
10000 loops, best of 3: 101 us per loop

4. timeit numpy.array([ A[i:i+4] for i in range(len(A)-3) ])
10000 loops, best of 3: 37.8 us per loop

5. timeit numpy.array(list(chunks(A, 4)))
10000 loops, best of 3: 43.2 us per loop

6. timeit numpy.array(byN(A, 4))
10000 loops, best of 3: 100 us per loop

# Does preallocation of the array help? (11 is from len(A)+1-4)
7. timeit B=np.zeros(shape=(11, 4),dtype=np.int32)
100000 loops, best of 3: 2.19 us per loop
   timeit for i in range(4):B[:,i]=A[i:11+i]
10000 loops, best of 3: 20.9 us per loop
total 23.1us per loop

As len(A) increases (to 20000), options 4 and 5 converge to the same speed (44 ms), options 1, 2, 3 and 6 remain about 3 times slower (135 ms), and option 7 is much faster (1.36 ms).
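
For reference, option 7 wrapped up as a function might look like this (a sketch; the name windows_prealloc and its arguments are mine, not from the timings above):

import numpy as np

def windows_prealloc(A, n):
    rows = len(A) - n + 1
    B = np.zeros((rows, n), dtype=A.dtype)
    # fill column by column: the Python loop runs n times,
    # not len(A) times, which is why it scales so well
    for i in range(n):
        B[:, i] = A[i:rows + i]
    return B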


Quick&dirty solution:

>>> a = numpy.arange(1,15)
>>> numpy.array([ a[i:i+4] for i in range(len(a)-3) ])
array([[ 1,  2,  3,  4],
       [ 2,  3,  4,  5],
       [ 3,  4,  5,  6],
       [ 4,  5,  6,  7],
       [ 5,  6,  7,  8],
       [ 6,  7,  8,  9],
       [ 7,  8,  9, 10],
       [ 8,  9, 10, 11],
       [ 9, 10, 11, 12],
       [10, 11, 12, 13],
       [11, 12, 13, 14]])


Using itertools, and assuming Python 2.6:

import itertools

def byN(iterable, N):
    # make N independent copies of the iterator...
    itrs = itertools.tee(iter(iterable), N)
    # ...and advance the n-th copy by n elements, so zipping them
    # pairs each item with the N-1 items that follow it
    for n in range(N):
        for i in range(n):
            next(itrs[n], None)
    return zip(*itrs)

aby4 = numpy.array(byN(thearray, 4))


Broadcast!

from numpy import ogrid
def stretch(N=5, M=15):
    # column of window starts broadcast against a row of offsets;
    # only M-N+1 complete windows fit in 1..M
    x, y = ogrid[0:M - N + 1, 0:N]
    return x + y + 1

Note that ogrid gives things like:

>>> ogrid[0:5, 0:5]
[array([[0],
       [1],
       [2],
       [3],
       [4]]),
 array([[0, 1, 2, 3, 4]])]

Let's compare with another solution given here:

def zipping(N=5,M=15):
    A = numpy.arange(1, M+1)
    return numpy.array(zip(*(A[i:] for i in range(N))))

Comparison (Python 2.6, 32-bit, 1 GB RAM) gives:

>>> %timeit stretch(5,15)
10000 loops, best of 3: 61.2 us per loop

>>> %timeit zipping(5,15)
10000 loops, best of 3: 72.5 us per loop

>>> %timeit stretch(5,1e3)
10000 loops, best of 3: 128 us per loop

>>> %timeit zipping(5,1e3)
100 loops, best of 3: 4.25 ms per loop

The speed-up (over 30-fold at M=1000) only grows with M: zipping scales with the array length, while the broadcast version barely changes.
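
Note that stretch builds the window values directly, which only works because the data happen to be the consecutive integers 1..M. For an arbitrary array you can use the same broadcast grid as a fancy index instead (my adaptation, not part of the original answer; it returns a copy, unlike the as_strided view):

from numpy import ogrid

def stretch_indexed(A, N=4):
    # column of window starts broadcast against a row of offsets
    x, y = ogrid[0:len(A) - N + 1, 0:N]
    return A[x + y]

stretch_indexed(numpy.arange(1, 15)) reproduces the 11x4 array from the question.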


I know of no Python stdlib function that quite does that. It's easy enough to do. Here is a generator that basically does it:

def chunks(sequence, length):
    for index in xrange(0, len(sequence) - length + 1):
        yield sequence[index:index + length]

You could use it like so

>>> import numpy
>>> a = numpy.arange(1, 15)
>>> a
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
>>> numpy.array(list(chunks(a, 4)))
array([[ 1,  2,  3,  4],
       [ 2,  3,  4,  5],
       [ 3,  4,  5,  6],
       [ 4,  5,  6,  7],
       [ 5,  6,  7,  8],
       [ 6,  7,  8,  9],
       [ 7,  8,  9, 10],
       [ 8,  9, 10, 11],
       [ 9, 10, 11, 12],
       [10, 11, 12, 13],
       [11, 12, 13, 14]])

The only odd thing about this code is that I called list on the result of chunks(a, 4), because numpy.array does not accept an arbitrary iterable such as the generator chunks returns. If you just want to iterate over the chunks, you don't need to bother; if you really need the result in an array, you can do it this way or in some more efficient way.
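
For instance, each chunk is an ordinary NumPy slice (a view of a), so iterating over them creates no copies:

>>> [chunk.mean() for chunk in chunks(a, 4)][:3]
[2.5, 3.5, 4.5]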


The efficient NumPy way to do this is given here; it is somewhat too long to reproduce in this answer. It boils down to using some stride tricks and is much faster than itertools for largish window sizes. For example, using a method essentially the same as Alex Martelli's:

In [15]: from itertools import tee

In [16]: def windowed(sequence, length):
    seqs = tee(sequence, length)
    # advance the i-th copy of the iterator by i elements
    [ seq.next() for i, seq in enumerate(seqs) for j in xrange(i) ]
    return zip(*seqs)

We get:

In [19]: data = numpy.random.randint(0, 2, 1000000)

In [20]: %timeit windowed(data, 2)
100000 loops, best of 3: 6.62 us per loop
In [21]: %timeit windowed(data, 10)
10000 loops, best of 3: 29.3 us per loop
In [22]: %timeit windowed(data, 100)
1000 loops, best of 3: 1.41 ms per loop
In [23]: %timeit segment_axis(data, 2, 1)
10000 loops, best of 3: 30.1 us per loop
In [24]: %timeit segment_axis(data, 10, 9)
10000 loops, best of 3: 30.2 us per loop
In [25]: %timeit segment_axis(data, 100, 99)
10000 loops, best of 3: 30.5 us per loop
