Filling gaps in a numpy array
I just want to interpolate, in the simplest possible terms, a 3D dataset. Linear interpolation, nearest neighbour, all that would suffice (this is to start off some algorithm, so no accurate estimate is required).
In new scipy versions, things like griddata would be useful, but currently I only have scipy 0.8. So I have a "cube" array (data[:,:,:], of shape Ni x Nj x Nk) and an array of flags (flags[:,:,:], True or False) of the same size. I want to interpolate my data for the elements of data where the corresponding element of flags is False, using e.g. the nearest valid datapoint in data, or some linear combination of "close by" points.
There can be large gaps in the dataset in at least two dimensions. Other than coding a full-blown nearest neighbour algorithm using kdtrees or similar, I can't really find a generic, N-dimensional nearest-neighbour interpolator.
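For readers who do have a newer SciPy, here is a minimal sketch of the griddata route mentioned above. The helper name fill_with_griddata and the toy cube are made up for illustration. It assumes scipy.interpolate.griddata is available and that the flagged-True cells are the valid ones.

```python
import numpy as np
from scipy.interpolate import griddata

def fill_with_griddata(data, flags, method="nearest"):
    """Fill cells where flags is False from the cells where flags is True."""
    valid_pts = np.argwhere(flags)     # (n_valid, ndim) integer coordinates
    gap_pts = np.argwhere(~flags)      # (n_gaps, ndim), same C order as boolean indexing
    filled = data.astype(float)        # astype returns a copy
    filled[~flags] = griddata(valid_pts, data[flags], gap_pts, method=method)
    return filled

# Toy cube: values are a linear function of the indices, with one gap in the middle.
data = np.arange(27, dtype=float).reshape(3, 3, 3)
flags = np.ones(data.shape, dtype=bool)
flags[1, 1, 1] = False
corrupted = np.where(flags, data, -1.0)   # pretend the gap holds garbage
filled = fill_with_griddata(corrupted, flags, method="linear")
```

With method="linear", gaps outside the convex hull of the valid points come back as NaN, so method="nearest" is the safer default when whole edges of the cube are missing. Building the triangulation can also be slow for large cubes.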
Using scipy.ndimage, your problem can be solved with nearest-neighbor interpolation in 2 lines (after the import):
from scipy import ndimage as nd
ind = nd.distance_transform_edt(invalid_cell_mask, return_distances=False, return_indices=True)
data = data[tuple(ind)]
Now, in the form of a function:
import numpy as np
from scipy import ndimage as nd

def fill(data, invalid=None):
    """
    Replace the value of invalid 'data' cells (indicated by 'invalid')
    by the value of the nearest valid data cell

    Input:
        data:    numpy array of any dimension
        invalid: a binary array of same shape as 'data'.
                 data values are replaced where invalid is True
                 If None (default), use: invalid = np.isnan(data)

    Output:
        Return a filled array.
    """
    if invalid is None: invalid = np.isnan(data)

    ind = nd.distance_transform_edt(invalid,
                                    return_distances=False,
                                    return_indices=True)
    return data[tuple(ind)]
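To make the indexing trick concrete, here is a small hand-checkable sketch (the toy array is made up for illustration). With return_indices=True, distance_transform_edt returns an array of shape (ndim,) + data.shape that holds, for every cell, the coordinates of the nearest cell where the mask is False, so indexing data with it pulls the nearest valid value into each gap.

```python
import numpy as np
from scipy import ndimage as nd

data = np.array([[1.0,    np.nan, np.nan, 4.0],
                 [np.nan, np.nan, np.nan, np.nan],
                 [7.0,    np.nan, np.nan, np.nan]])
invalid = np.isnan(data)

# ind[0] holds row indices and ind[1] column indices of the nearest valid cell
ind = nd.distance_transform_edt(invalid, return_distances=False, return_indices=True)
filled = data[tuple(ind)]

print(ind.shape)   # one index plane per dimension of data
print(filled)
```

Cells that are equidistant from two valid cells, such as filled[1, 0] here, get whichever neighbour the transform happens to pick, so ties should not be relied upon.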
Example of use:
def test_fill(s, d):
    # s is the size of each dimension, d is the number of dimensions
    data = np.arange(s**d).reshape((s,)*d)
    seed = np.zeros(data.shape, dtype=bool)
    seed.flat[np.random.randint(0, seed.size, int(data.size/20**d))] = True
    # cells that are not seeds are the invalid ones (use ~, not -, on boolean arrays)
    return fill(data, ~seed), seed

import matplotlib.pyplot as plt
data, seed = test_fill(500, 2)
data[nd.binary_dilation(seed, iterations=2)] = 0   # draw (dilated) seeds in black
plt.imshow(np.mod(data, 42))                       # show clusters
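Result: (figure not preserved; it showed the filled array modulo 42, with the dilated seeds drawn in black)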
You can set up a crystal-growth-style algorithm that shifts a view alternately along each axis, replacing only data that is flagged False but has a True neighbor. This gives a "nearest-neighbor"-like result (though not in Euclidean or Manhattan distance; I think it might be nearest-neighbor if you count pixels, treating all pixels that share a corner as connected). This should be fairly efficient with NumPy, as it iterates only over axes and convergence iterations, not over small slices of the data.
Crude, fast and stable. I think that's what you were after:
import numpy as np

# -- setup --
shape = (10, 10, 10)
dim = len(shape)
data = np.random.random(shape)
flag = np.zeros(shape, dtype=bool)
t_ct = int(data.size/5)
flag.flat[np.random.randint(0, flag.size, t_ct)] = True
# True flags the data
# -- end setup --

slcs = [slice(None)]*dim

while np.any(~flag):  # as long as there are any False's in flag
    for i in range(dim):  # do each axis
        # make slices to shift view one element along the axis
        slcs1 = slcs[:]
        slcs2 = slcs[:]
        slcs1[i] = slice(0, -1)
        slcs2[i] = slice(1, None)
        # index with tuples; current NumPy rejects a list of slices
        s1, s2 = tuple(slcs1), tuple(slcs2)

        # replace from the right
        repmask = np.logical_and(~flag[s1], flag[s2])
        data[s1][repmask] = data[s2][repmask]
        flag[s1][repmask] = True

        # replace from the left
        repmask = np.logical_and(~flag[s2], flag[s1])
        data[s2][repmask] = data[s1][repmask]
        flag[s2][repmask] = True
For good measure, here's a visualization (2D) of the zones seeded by the data originally flagged True.
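The same growth loop can be wrapped in a function. The sketch below makes a few assumptions of its own: the name grow_fill and the max_sweeps parameter are hypothetical, the inputs are copied rather than modified in place, and an explicit check is added because the loop above would never terminate if no cell is flagged True.

```python
import numpy as np

def grow_fill(data, flag, max_sweeps=None):
    """Fill cells where flag is False by growing outward from the True cells."""
    data = np.array(data, copy=True)
    flag = np.array(flag, dtype=bool, copy=True)
    if not flag.any():
        raise ValueError("no valid cells to grow from")
    sweeps = 0
    while not flag.all():
        if max_sweeps is not None and sweeps >= max_sweeps:
            break
        for axis in range(data.ndim):
            lo = [slice(None)] * data.ndim
            hi = [slice(None)] * data.ndim
            lo[axis] = slice(0, -1)      # every cell except the last along this axis
            hi[axis] = slice(1, None)    # the same cells shifted by one
            lo, hi = tuple(lo), tuple(hi)
            # unflagged cells take the value of their higher-index neighbour
            rep = ~flag[lo] & flag[hi]
            data[lo][rep] = data[hi][rep]
            flag[lo][rep] = True
            # unflagged cells take the value of their lower-index neighbour
            rep = ~flag[hi] & flag[lo]
            data[hi][rep] = data[lo][rep]
            flag[hi][rep] = True
        sweeps += 1
    return data, flag

# usage on a random cube with roughly one valid cell in five
rng = np.random.default_rng(0)
cube = rng.random((10, 10, 10))
valid = rng.random(cube.shape) < 0.2
filled, done = grow_fill(cube, valid)
```

Because each pass handles the higher-index neighbour first, a gap cell that sits between two growing fronts takes its value from that side; this is one reason the result is only nearest-neighbor-like.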
Some time ago I wrote this script for my PhD: https://github.com/Technariumas/Inpainting
An example: http://blog.technariumas.lt/post/117630308826/healing-holes-in-python-arrays
Slow, but does the work. Gaussian kernel is the best choice, just check size/sigma values.
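As a rough illustration of Gaussian-kernel hole filling, here is a sketch of normalized convolution with scipy.ndimage.gaussian_filter. This is not the code from the linked repository. The function name gaussian_fill and the default sigma are assumptions for the example.

```python
import numpy as np
from scipy import ndimage as nd

def gaussian_fill(data, valid, sigma=2.0):
    """Fill invalid cells with a Gaussian-weighted mean of the valid cells."""
    valid = np.asarray(valid, dtype=bool)
    weights = valid.astype(float)
    values = np.where(valid, data, 0.0)   # zero out the gaps (this also hides NaNs)
    # blur the values and the mask with the same kernel, then divide
    num = nd.gaussian_filter(values, sigma, mode="constant")
    den = nd.gaussian_filter(weights, sigma, mode="constant")
    filled = np.array(data, dtype=float)
    gaps = ~valid & (den > 0)             # gaps the kernel could actually reach
    filled[gaps] = num[gaps] / den[gaps]
    return filled

# usage: a constant field with about half of the cells missing
rng = np.random.default_rng(0)
field = np.full((20, 20), 3.0)
valid = rng.random(field.shape) > 0.5
field[~valid] = np.nan
out = gaussian_fill(field, valid, sigma=2.0)
```

gaussian_filter truncates its kernel at a few sigma, so gaps farther than that from any valid cell are left unchanged; a larger sigma reaches further but smooths more, which is why the size/sigma values need checking.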
You may try to tackle your problem like:
# main ideas described in very high level pseudo code
choose suitable base kernel shape and type (gaussian?)
while true
    loop over your array (moving average manner)
        adapt your base kernel to current sparsity pattern
        set current value based on adapted kernel
    break if converged
This can actually be implemented in quite a straightforward manner (especially if performance is not a top concern).
Obviously this is just a heuristic, and you need to do some experiments with your actual data to find a proper adaptation scheme. If you view kernel adaptation as kernel reweighting, you may like to base it on how the values have been propagated. For example, the weights for the original supports are 1, and the weights of filled-in values decay according to the iteration in which they emerged.
Determining when this process has actually converged may also be tricky. Depending on the application, it may eventually be reasonable to leave some 'gap regions' unfilled.
Update: Here is a very simple implementation along the lines *) described above:
# numpy.nanmean stands in for scipy.stats.nanmean, which newer SciPy releases no longer provide
from numpy import any, asarray as asa, isnan, nan as NaN, nanmean, ones, seterr
from numpy.lib.stride_tricks import as_strided as ast

def _a2t(a):
    """Array to tuple."""
    return tuple(a.tolist())

def _view(D, shape, strides):
    """View of flattened neighbourhood of D."""
    V= ast(D, shape= shape, strides= strides)
    return V.reshape(V.shape[:len(D.shape)]+ (-1,))

def filler(A, n_shape, n_iter= 49):
    """Fill in NaNs from mean calculated from neighbour."""
    # boundary conditions: pad A with NaNs so every cell has a full neighbourhood
    D= NaN* ones(_a2t(asa(A.shape)+ asa(n_shape)- 1), dtype= A.dtype)
    slc= tuple([slice(n// 2, -(n// 2)) for n in n_shape])  # assumes odd sizes >= 3
    D[slc]= A

    # neighbourhood
    shape= _a2t(asa(D.shape)- asa(n_shape)+ 1)+ n_shape
    strides= D.strides* 2

    # iterate until no NaNs, but not more than n iterations
    old= seterr(invalid= 'ignore')
    for k in range(n_iter):
        M= isnan(D[slc])
        if not any(M): break
        D[slc][M]= nanmean(_view(D, shape, strides), -1)[M]
    seterr(**old)
    A[:]= D[slc]
And a simple demonstration of filler(.) in action would be something like:
In []: x= ones((3, 6, 99))
In []: x.sum(-1)
Out[]:
array([[ 99., 99., 99., 99., 99., 99.],
[ 99., 99., 99., 99., 99., 99.],
[ 99., 99., 99., 99., 99., 99.]])
In []: x= NaN* x
In []: x[1, 2, 3]= 1
In []: x.sum(-1)
Out[]:
array([[ nan, nan, nan, nan, nan, nan],
[ nan, nan, nan, nan, nan, nan],
[ nan, nan, nan, nan, nan, nan]])
In []: filler(x, (3, 3, 5))
In []: x.sum(-1)
Out[]:
array([[ 99., 99., 99., 99., 99., 99.],
[ 99., 99., 99., 99., 99., 99.],
[ 99., 99., 99., 99., 99., 99.]])
*) So here nanmean(.) is just used to demonstrate the idea of the adaptation process. Based on this demonstration, it should be quite straightforward to implement a more complex adaptation and a decaying weighting scheme. Also note that no attention has been paid to actual execution performance, but it should still be good (with reasonable input shapes).
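One possible shape for such a decaying weighting scheme is sketched below. It is only an illustration under assumptions of my own: the function name decaying_fill, the box-shaped neighbourhood and the parameters size and decay are hypothetical, and scipy.ndimage.convolve is used in place of the strided view. Original values keep weight 1, and a cell filled in iteration k gets weight decay**k, so values that have propagated far contribute less to later means.

```python
import numpy as np
from scipy import ndimage as nd

def decaying_fill(A, size=3, decay=0.5, n_iter=50):
    """Fill NaNs with a weighted neighbourhood mean; later fills weigh less."""
    known = ~np.isnan(A)
    vals = np.where(known, A, 0.0)
    w = known.astype(float)                  # original supports get weight 1
    kernel = np.ones((size,) * A.ndim)       # box neighbourhood
    for k in range(1, n_iter + 1):
        empty = w == 0
        if not empty.any():
            break
        # weighted sum of values and sum of weights in each neighbourhood
        num = nd.convolve(vals * w, kernel, mode="constant")
        den = nd.convolve(w, kernel, mode="constant")
        new = empty & (den > 0)              # gaps with at least one weighted neighbour
        vals[new] = num[new] / den[new]
        w[new] = decay ** k                  # weight shrinks with the iteration number
    out = vals.copy()
    out[w == 0] = np.nan                     # cells never reached stay NaN
    return out

# usage: same kind of input as the filler(.) demonstration, on a smaller array
x = np.full((3, 6, 20), np.nan)
x[1, 2, 3] = 1.0
y = decaying_fill(x)
```

Unlike filler(.), this version returns a new array and leaves unreachable cells as NaN, which matches the remark that some gap regions may reasonably stay unfilled.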
Maybe what you are looking for is a machine learning algorithm, like a neural network or a support vector machine.
You may check this page, which has some links to SVM packages for python: http://web.media.mit.edu/~stefie10/technical/pythonml.html