Filling gaps in a numpy array

2023-02-22 08:43 问答作者：

I just wan开发者_运维问答t to interpolate, in the simplest possible terms, a 3D dataset. Linear interpolation, nearest neighbour, all that would suffice (this is to start off some algorithm, so no accurate estimate is required).

In new scipy versions, things like griddata would be useful, but currently I only have scipy 0.8. So I have a "cube" (data[:,:,:], (NixNjxNk)) array, and an array of flags (flags[:,:,:,], True or False) of the same size. I want to interpolate my data for the elements of data where the corresponding element of flag is False, using eg the nearest valid datapoint in data, or some linear combination of "close by" points.

There can be large gaps in the dataset in at least two dimensions. Other than coding a full-blown nearest neighbour algorithm using kdtrees or similar, I can't really find a generic, N-dimensional nearest-neighbour interpolator.

Using scipy.ndimage, your problem can be solved with nearest neighbor interpolation in 2 lines :

from scipy import ndimage as nd

indices = nd.distance_transform_edt(invalid_cell_mask, return_distances=False, return_indices=True)
data = data[tuple(ind)]

Now, in the form of a function:

import numpy as np
from scipy import ndimage as nd

def fill(data, invalid=None):
    """
    Replace the value of invalid 'data' cells (indicated by 'invalid') 
    by the value of the nearest valid data cell

    Input:
        data:    numpy array of any dimension
        invalid: a binary array of same shape as 'data'. 
                 data value are replaced where invalid is True
                 If None (default), use: invalid  = np.isnan(data)

    Output: 
        Return a filled array. 
    """    
    if invalid is None: invalid = np.isnan(data)

    ind = nd.distance_transform_edt(invalid, 
                                    return_distances=False, 
                                    return_indices=True)
    return data[tuple(ind)]

Exemple of use:

def test_fill(s,d):
     # s is size of one dimension, d is the number of dimension
    data = np.arange(s**d).reshape((s,)*d)
    seed = np.zeros(data.shape,dtype=bool)
    seed.flat[np.random.randint(0,seed.size,int(data.size/20**d))] = True

    return fill(data,-seed), seed

import matplotlib.pyplot as plt
data,seed  = test_fill(500,2)
data[nd.binary_dilation(seed,iterations=2)] = 0   # draw (dilated) seeds in black
plt.imshow(np.mod(data,42))                       # show cluster

result:

Filling gaps in a numpy array

You can set up a crystal-growth-style algorithm shifting a view alternately along each axis, replacing only data that is flagged with a False but has a True neighbor. This gives a "nearest-neighbor"-like result (but not in Euclidean or Manhattan distance -- I think it might be nearest-neighbor if you are counting pixels, counting all connecting pixels with common corners) This should be fairly efficient with NumPy as it iterates over only axis and convergence iterations, not small slices of the data.

Crude, fast and stable. I think that's what you were after:

import numpy as np
# -- setup --
shape = (10,10,10)
dim = len(shape)
data = np.random.random(shape)
flag = np.zeros(shape, dtype=bool)
t_ct = int(data.size/5)
flag.flat[np.random.randint(0, flag.size, t_ct)] = True
# True flags the data
# -- end setup --

slcs = [slice(None)]*dim

while np.any(~flag): # as long as there are any False's in flag
    for i in range(dim): # do each axis
        # make slices to shift view one element along the axis
        slcs1 = slcs[:]
        slcs2 = slcs[:]
        slcs1[i] = slice(0, -1)
        slcs2[i] = slice(1, None)

        # replace from the right
        repmask = np.logical_and(~flag[slcs1], flag[slcs2])
        data[slcs1][repmask] = data[slcs2][repmask]
        flag[slcs1][repmask] = True

        # replace from the left
        repmask = np.logical_and(~flag[slcs2], flag[slcs1])
        data[slcs2][repmask] = data[slcs1][repmask]
        flag[slcs2][repmask] = True

For good measure, here's a visualization (2D) of the zones seeded by the data originally flagged True.

Filling gaps in a numpy array

Some time ago I wrote this script for my PhD: https://github.com/Technariumas/Inpainting

An example: http://blog.technariumas.lt/post/117630308826/healing-holes-in-python-arrays

Slow, but does the work. Gaussian kernel is the best choice, just check size/sigma values.

You may try to tackle your problem like:

# main ideas described in very high level pseudo code
choose suitable base kernel shape and type (gaussian?)
while true
    loop over your array (moving average manner)
        adapt your base kernel to current sparsity pattern
        set current value based on adapted kernel
    break if converged

This actually can be implemented quite a straightforward manner (especially if performance is not a top concern).

Obviously this is just heuristics and you need to do some experiments with your actual data to find proper adaptation scheme. When seeing kernel adaptation as kernel reweighing, you may like to do it based on how the values have been propagated. For example your weights for original supports are 1 and they decay related on which iteration they emerged.

Also the determination of when this process has actually converged may be tricky one. Depending on the application it may be reasonable eventually to leave some 'gap regions' remain 'unfilled'.

Update: Here is a very simple implementation along the lines *) described above:

from numpy import any, asarray as asa, isnan, NaN, ones, seterr
from numpy.lib.stride_tricks import as_strided as ast
from scipy.stats import nanmean

def _a2t(a):
    """Array to tuple."""
    return tuple(a.tolist())

def _view(D, shape, strides):
    """View of flattened neighbourhood of D."""
    V= ast(D, shape= shape, strides= strides)
    return V.reshape(V.shape[:len(D.shape)]+ (-1,))

def filler(A, n_shape, n_iter= 49):
    """Fill in NaNs from mean calculated from neighbour."""
    # boundary conditions
    D= NaN* ones(_a2t(asa(A.shape)+ asa(n_shape)- 1), dtype= A.dtype)
    slc= tuple([slice(n/ 2, -(n/ 2)) for n in n_shape])
    D[slc]= A

    # neighbourhood
    shape= _a2t(asa(D.shape)- asa(n_shape)+ 1)+ n_shape
    strides= D.strides* 2

    # iterate until no NaNs, but not more than n iterations
    old= seterr(invalid= 'ignore')
    for k in xrange(n_iter):
        M= isnan(D[slc])
        if not any(M): break
        D[slc][M]= nanmean(_view(D, shape, strides), -1)[M]
    seterr(**old)
    A[:]= D[slc]

And a simple demonstration of the filler(.) on action, would be something like:

In []: x= ones((3, 6, 99))
In []: x.sum(-1)
Out[]:
array([[ 99.,  99.,  99.,  99.,  99.,  99.],
       [ 99.,  99.,  99.,  99.,  99.,  99.],
       [ 99.,  99.,  99.,  99.,  99.,  99.]])
In []: x= NaN* x
In []: x[1, 2, 3]= 1
In []: x.sum(-1)
Out[]:
array([[ nan,  nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan,  nan]])
In []: filler(x, (3, 3, 5))
In []: x.sum(-1)
Out[]:
array([[ 99.,  99.,  99.,  99.,  99.,  99.],
       [ 99.,  99.,  99.,  99.,  99.,  99.],
       [ 99.,  99.,  99.,  99.,  99.,  99.]])

*) So here the nanmean(.) is just used to demonstrate the idea of the adaptation process. Based on this demonstration, it should be quite straightforward to implement a more complex adaptation and decaying weighing scheme. Also note that, no attention is paid to actual execution performance, but it still should be good (with reasonable input shapes).

Maybe what you are looking for is a machine learning algorithm, like a neural network or a support vector machine.

You may check this page, which has some links to SVM packages for python: http://web.media.mit.edu/~stefie10/technical/pythonml.html

继续阅读：interpolation numpy python scipy

Filling gaps in a numpy array

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？