
Numpy.mean, amin, amax, std huge returns

I am struggling with large numpy arrays. Here is the scenario: I am working with 300 MB - 950 MB images and using GDAL to read them as numpy arrays. Reading in an array uses exactly as much memory as one would expect, i.e. 250 MB for a 250 MB image, and so on.

My problem occurs when I use numpy to get the mean, min, max, or standard deviation. In main() I open the image and read the array (type ndarray). I then call the following function on the 2D array to get the standard deviation:

def get_array_std(input_array):
    array_standard_deviation = numpy.std(input_array, copy=False)
    return array_standard_deviation

Here I constantly get memory errors (on a 6 GB machine). From the documentation it looks like numpy returns an ndarray with the same shape and dtype as my input, thereby doubling the in-memory size.

Using:

print type(array_standard_deviation)

Returns:

numpy.float64

Additionally, using:

print array_standard_deviation

Returns a float standard deviation, as one would expect. Is numpy reading the array in again to perform this calculation? Would I be better off iterating through the array and performing the calculations manually? How about working with a flattened array?

I have tried placing each statistics call (numpy.amin(), numpy.amax(), numpy.std(), numpy.mean()) into its own function so that the large array would go out of scope, but no luck there. I have also tried casting the return value to another type, but no joy.
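As a concrete version of the "iterate through the array manually" idea above, here is a minimal sketch of a chunked calculation (assuming the 2D array from GDAL is already in memory): it walks the flattened array in fixed-size chunks, accumulating min, max and sum in one pass and the squared deviations from the mean in a second pass, so it never allocates a full-size temporary.

import numpy as np

def chunked_stats(arr, chunksize=1000000):
    # Min, max, mean and (population) std of a large array, computed chunk
    # by chunk so that no array-sized temporary is created.
    flat = arr.ravel()            # a view, not a copy, for a contiguous array
    n = flat.size

    total = 0.0
    amin = np.inf
    amax = -np.inf
    for start in range(0, n, chunksize):
        chunk = flat[start:start + chunksize]
        total += chunk.sum()
        amin = min(amin, chunk.min())
        amax = max(amax, chunk.max())
    mean = total / n

    # Second pass: sum of squared deviations from the mean.
    sq_dev = 0.0
    for start in range(0, n, chunksize):
        chunk = flat[start:start + chunksize]
        sq_dev += np.sum((chunk - mean) ** 2)

    return amin, amax, mean, np.sqrt(sq_dev / n)

The chunk size is arbitrary; anything large enough to keep the Python-level loop overhead negligible will do.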


Numpy does a "naive" reduce operation for std, which is quite memory-inefficient. Look here for a better implementation: http://luispedro.org/software/ncreduce
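To make that concrete, this is roughly what the naive formulation looks like (an illustration only, not numpy's actual source): each expression materializes a temporary the same size as the input, which is where the extra memory goes.

import numpy as np

def naive_std(a):
    # `a - a.mean()` allocates a full-size temporary, and squaring it
    # allocates another, so peak memory is a small multiple of the array size.
    deviations = a - a.mean()
    return np.sqrt(np.mean(deviations ** 2))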


Don't know if this is helpful, but does using the array method resolve the issue? i.e.

input_array.std()

instead of

numpy.std(input_array)

The problem you describe doesn't make a whole lot of sense to me; I work with large arrays often but don't encounter errors with simple tasks like these. Is there anything else you're doing that might end up passing copies of the arrays instead of references?


Are you sure this is a problem with all of the statistics functions you're trying, or is it just np.std?

I've tried the following method to reproduce this:

  1. Start ipython -cs 0, import numpy as np
  2. q = rand(5600,16000), giving me a nice large test array.
  3. Watch memory usage externally during np.mean(q), np.amin(q), np.amax(q), np.std(q)

Of these, np.std is significantly slower: most of the functions take 0.2 seconds on my computer, whereas std takes 2.3 seconds. I can't reproduce the exact memory error you see, but my memory usage stays mostly constant while running everything except std; it doubles when I run std, and then drops back to the initial amount.
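If you want to watch this from inside Python rather than externally, here is one rough way to do it (a sketch assuming a Unix system; resource.getrusage reports the process's peak resident set size, in kilobytes on Linux and bytes on macOS):

import resource
import numpy as np

def peak_rss_mb():
    # Peak resident set size so far (assumes ru_maxrss is in KB, as on Linux).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

q = np.random.rand(5600, 16000)   # roughly 680 MB of float64

# The peak only ever grows, so run np.std last to see its extra allocation.
for fn in (np.mean, np.amin, np.amax, np.std):
    before = peak_rss_mb()
    fn(q)
    print(fn.__name__, 'grew peak RSS by about', peak_rss_mb() - before, 'MB')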

I've written the following modified std, which operates on chunks of a given number of elements (I'm using 100000):

def chunked_std(A, chunksize):
    # Standard deviation of A computed chunk by chunk, so no full-size
    # temporary array is allocated.
    Aflat = A.ravel()
    Amean = A.mean()
    Alen = len(Aflat)

    # Chunk boundaries: 0, chunksize, 2*chunksize, ..., Alen
    bounds = np.concatenate((np.arange(0, Alen, chunksize), [Alen]))

    # Accumulate the sum of squared deviations one chunk at a time.
    sq_dev = sum(np.sum(np.abs(Aflat[x:y] - Amean) ** 2)
                 for x, y in zip(bounds[:-1], bounds[1:]))
    return np.sqrt(sq_dev / Alen)

This seems to reduce memory usage significantly, while also being about twice as fast as normal np.std for me. There are probably more elegant ways of writing such a function, but this seems to work.
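For example, a quick sanity check against np.std, using a random test array like the one above:

import numpy as np

q = np.random.rand(5600, 16000)

# The two results should agree to within floating-point rounding.
print(np.std(q))
print(chunked_std(q, 100000))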
