Pythonic way to create a numpy array from a list of numpy arrays

2022-12-16 21:57 问答作者：

I generate a list of one dimensional numpy arrays in a loop and later convert this list to a 2d numpy array. 开发者_开发百科I would've preallocated a 2d numpy array if i knew the number of items ahead of time, but I don't, therefore I put everything in a list.

The mock up is below:

>>> list_of_arrays = map(lambda x: x*ones(2), range(5))
>>> list_of_arrays
[array([ 0.,  0.]), array([ 1.,  1.]), array([ 2.,  2.]), array([ 3.,  3.]), array([ 4.,  4.])]
>>> arr = array(list_of_arrays)
>>> arr
array([[ 0.,  0.],
       [ 1.,  1.],
       [ 2.,  2.],
       [ 3.,  3.],
       [ 4.,  4.]])

My question is the following:

Is there a better way (performancewise) to go about the task of collecting sequential numerical data (in my case numpy arrays) than putting them in a list and then making a numpy.array out of it (I am creating a new obj and copying the data)? Is there an "expandable" matrix data structure available in a well tested module?

A typical size of my 2d matrix would be between 100x10 and 5000x10 floats

EDIT: In this example i'm using map, but in my actual application I have a for loop

Convenient way, using numpy.concatenate. I believe it's also faster, than @unutbu's answer:

In [32]: import numpy as np 

In [33]: list_of_arrays = list(map(lambda x: x * np.ones(2), range(5)))

In [34]: list_of_arrays
Out[34]: 
[array([ 0.,  0.]),
 array([ 1.,  1.]),
 array([ 2.,  2.]),
 array([ 3.,  3.]),
 array([ 4.,  4.])]

In [37]: shape = list(list_of_arrays[0].shape)

In [38]: shape
Out[38]: [2]

In [39]: shape[:0] = [len(list_of_arrays)]

In [40]: shape
Out[40]: [5, 2]

In [41]: arr = np.concatenate(list_of_arrays).reshape(shape)

In [42]: arr
Out[42]: 
array([[ 0.,  0.],
       [ 1.,  1.],
       [ 2.,  2.],
       [ 3.,  3.],
       [ 4.,  4.]])

Suppose you know that the final array arr will never be larger than 5000x10. Then you could pre-allocate an array of maximum size, populate it with data as you go through the loop, and then use arr.resize to cut it down to the discovered size after exiting the loop.

The tests below suggest doing so will be slightly faster than constructing intermediate python lists no matter what the ultimate size of the array is.

Also, arr.resize de-allocates the unused memory, so the final (though maybe not the intermediate) memory footprint is smaller than what is used by python_lists_to_array.

This shows numpy_all_the_way is faster:

% python -mtimeit -s"import test" "test.numpy_all_the_way(100)"
100 loops, best of 3: 1.78 msec per loop
% python -mtimeit -s"import test" "test.numpy_all_the_way(1000)"
100 loops, best of 3: 18.1 msec per loop
% python -mtimeit -s"import test" "test.numpy_all_the_way(5000)"
10 loops, best of 3: 90.4 msec per loop

% python -mtimeit -s"import test" "test.python_lists_to_array(100)"
1000 loops, best of 3: 1.97 msec per loop
% python -mtimeit -s"import test" "test.python_lists_to_array(1000)"
10 loops, best of 3: 20.3 msec per loop
% python -mtimeit -s"import test" "test.python_lists_to_array(5000)"
10 loops, best of 3: 101 msec per loop

This shows numpy_all_the_way uses less memory:

% test.py
Initial memory usage: 19788
After python_lists_to_array: 20976
After numpy_all_the_way: 20348

test.py:

import numpy as np
import os


def memory_usage():
    pid = os.getpid()
    return next(line for line in open('/proc/%s/status' % pid).read().splitlines()
                if line.startswith('VmSize')).split()[-2]

N, M = 5000, 10


def python_lists_to_array(k):
    list_of_arrays = list(map(lambda x: x * np.ones(M), range(k)))
    arr = np.array(list_of_arrays)
    return arr


def numpy_all_the_way(k):
    arr = np.empty((N, M))
    for x in range(k):
        arr[x] = x * np.ones(M)
    arr.resize((k, M))
    return arr

if __name__ == '__main__':
    print('Initial memory usage: %s' % memory_usage())
    arr = python_lists_to_array(5000)
    print('After python_lists_to_array: %s' % memory_usage())
    arr = numpy_all_the_way(5000)
    print('After numpy_all_the_way: %s' % memory_usage())

Even simpler than @Gill Bates' answer, here is an one line code:

np.stack(list_of_arrays, axis=0)

What you are doing is the standard way. A property of numpy arrays is that they need contiguous memory. The only possibility of "holes" that I can think of is possible with the strides member of PyArrayObject, but that doesn't affect the discussion here. Since numpy arrays have contiguous memory and are "preallocated", adding a new row/column means allocating new memory, copying data, and then freeing the old memory. If you do that a lot, it is not very efficient.

One case where someone might not want to create a list and then convert it to a numpy array in the end is when the list contains a lot of numbers: a numpy array of numbers takes much less space than a native Python list of numbers (since the native Python list stores Python objects). For your typical array sizes, I don't think that is an issue.

When you create your final array from a list of arrays, you are copying all the data to a new location for the new (2-d in your example) array. This is still much more efficient than having a numpy array and doing next = numpy.vstack((next, new_row)) every time you get new data. vstack() will copy all the data for every "row".

There was a thread on numpy-discussion mailing list some time ago which discussed the possibility of adding a new numpy array type that allows efficient extending/appending. It seems there was significant interest in this at that time, although I don't know if something came out of it. You might want to look at that thread.

I would say that what you're doing is very Pythonic, and efficient, so unless you really need something else (more space efficiency, maybe?), you should be okay. That is how I create my numpy arrays when I don't know the number of elements in the array in the beginning.

I'll add my own version of ~unutbu's answer. Similar to numpy_all_the way, but you dynamically resize if you have an index error. I thought it would have been a little faster for small data sets, but it's a little slower - the bounds checking slows things down too much.

initial_guess = 1000

def my_numpy_all_the_way(k):
    arr=np.empty((initial_guess,M))
    for x,row in enumerate(make_test_data(k)):
        try:
            arr[x]=row
        except IndexError:
            arr.resize((arr.shape[0]*2, arr.shape[1]))
            arr[x]=row
    arr.resize((k,M))
    return arr

Even simpler @fnjn answer

np.vstack(list_of_arrays)

继续阅读：arrays numpy performance python scipy

Pythonic way to create a numpy array from a list of numpy arrays

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？