Matching two grids for data analysis, is there a good algorithm for my problem?

I'd like to compare two differently spaced datasets in Python. I always want to find the closest (nearest-neighbour) match and regrid the data; see this example:

Dataset A:

ALTITUDE[m]   VALUE
1.            a1
2.            a2
3.            a3
4.            a4

Dataset B:

ALTITUDE[m]   VALUE
0.7           b1
0.9           b2
1.7           b3
2.            b4
2.4           b5
2.9           b6
3.1           b7
3.2           b8
3.9           b9
4.1           b10

The ai and bi values are doubles, but some entries are NaN.

I'd like to transform dataset B to the altitude grid of dataset A, but since dataset A contains fewer altitude levels than dataset B, I'd like to average them:

ALTITUDE[m]   VALUE
1.            median(b1,b2)
2.            median(b3,b4,b5)
3.            median(b6,b7,b8)
4.            median(b9,b10)

i.e. the closest altitude levels have been found and averaged over.
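(For reference, this grouping step can be sketched with numpy's searchsorted, assuming both altitude grids are sorted; the b values below are placeholders standing in for b1..b10:)

```python
import numpy as np

alt_a = np.array([1., 2., 3., 4.])
alt_b = np.array([.7, .9, 1.7, 2., 2.4, 2.9, 3.1, 3.2, 3.9, 4.1])
val_b = np.array([1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])  # placeholders for b1..b10

# midpoints between consecutive A levels are the boundaries between groups
mid = (alt_a[:-1] + alt_a[1:]) / 2      # [1.5, 2.5, 3.5]
group = np.searchsorted(mid, alt_b)     # index of the nearest A level for each B altitude

# median of the B values assigned to each A level
regridded = [np.median(val_b[group == i]) for i in range(len(alt_a))]
# corresponds to [median(b1,b2), median(b3,b4,b5), median(b6,b7,b8), median(b9,b10)]
```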

Conversely, if I want to match dataset A to the grid of dataset B, dataset A should look like this (nearest neighbour):

ALTITUDE[m]   VALUE
0.7           a1
0.9           a1
1.7           a2
2.            a2
2.4           a2
2.9           a3
3.1           a3
3.2           a3
3.9           a4
4.1           a4

Maybe this even has a name (I imagine it being a common problem), but I don't know it and thus cannot search for it. I believe there is an efficient way of doing this, apart from the obvious solution of coding it myself (but I'm afraid that wouldn't be efficient and I'd introduce many bugs).

Preferably using numpy.

EDIT: Thanks to all four contributors for your input. I learned a bit, and I apologize for not asking very clearly; I was myself still in the process of understanding the problem. Your answers pointed me towards interp1d, and this answer allowed me to adapt it to my needs. I will post the result shortly. I can accept only one answer, but any of them would do.


Two assumptions: 1: You are not looking for the nearest neighbour, but for all altitudes within some range. So, let's say for a1 you want all bn that are within 0.5 of a1 (giving you b1 and b2, as per your example). I would define 'nearest neighbour' differently.

2: You don't count NaN in your medians (numpy's plain median propagates NaN, since NaN sorts to the end of an array per IEEE conventions, which seems odd here). As per your suggestion, we thus use a NaN-ignoring median (nanmedian).

I would do the following:

import numpy as np

A_Alt = np.array([1., 2., 3., 4.])
A_Val = np.array([.33, .5, .6, .8])
B_Alt = np.array([.7, .9, 1.7, 2., 2.4, 2.9, 3.1, 3.2, 3.9, 4.1])
B_Val = np.array([.3, np.nan, .8, .6, .7, .4, .3, np.nan, .99, 1.3])

rng = .5  # half-width of the window; don't shadow the builtin `range`

B_Agrid = [np.nanmedian(B_Val[np.abs(B_Alt - k) < rng]) for k in A_Alt]
A_Bgrid = [np.nanmedian(A_Val[np.abs(A_Alt - k) < rng]) for k in B_Alt]

We find all indices where the distance of B_Alt to k in A_Alt is less than a specified range. Then we take the median of those B_Val. The same works for A_Bgrid with the results as requested.

==Edit==

Different assumption as to your nearest neighbours: Let's take the nearest neighbour to be the entry (or entries in case of a tie) with the smallest absolute altitude difference while not having nan as a value. N.B. these results do not match your example as b1 would not be the nearest neighbour of a1 on account of b2 being closer.

Under this assumption, the following code should work:

import numpy as np

A_Alt = np.array([1., 2., 3., 4.])
A_Val = np.array([.33, .5, .6, .8])
B_Alt = np.array([.7, .9, 1.7, 2., 2.4, 2.9, 3.1, 3.2, 3.9, 4.1])
B_Val = np.array([.3, np.nan, .8, .6, .7, .4, .3, np.nan, .99, 1.3])

def ReGridMedian(AltIn, ValIn, AltOut):
    part = np.isfinite(ValIn)               # drop NaN values up front
    AltIn, ValIn = AltIn[part], ValIn[part]
    result = []
    for k in AltOut:
        d = np.abs(AltIn - k)
        # all indices tied for the minimum distance, within machine precision
        tied = d - d.min() < 3 * np.finfo(d.dtype).eps
        result.append(np.median(ValIn[tied]))
    return result

B_Agrid = ReGridMedian(B_Alt, B_Val, A_Alt)
A_Bgrid = ReGridMedian(A_Alt, A_Val, B_Alt)

B_Agrid = ReGridMedian(B_Alt, B_Val, A_Alt)    
A_Bgrid = ReGridMedian(A_Alt, A_Val, B_Alt)

I hacked together something that checks if two values are identical within machine precision, but I assume there's a better way of doing that. In any case, we first filter all values that are not nan, then find the closest match, then check for duplicate minima, then get the median of those values.
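As an aside, numpy's isclose provides a ready-made tolerance test that could replace the hand-rolled eps comparison; a small sketch with illustrative distances:

```python
import numpy as np

# distances from a target altitude of 0.8; the first two differ only by rounding error
d = np.abs(np.array([.7, .9, 1.7]) - .8)

exact = np.flatnonzero(d == d.min())           # exact equality misses the tie
ties = np.flatnonzero(np.isclose(d, d.min()))  # tolerance test finds both candidates
```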

====

Does this cover your question, or are my assumptions incorrect?


Have a look at numpy.interp:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.interp.html

(EDIT: numpy.interp only provides linear interpolation, which, evidently, is not what the OP is looking for. Instead, use a scipy method such as interp1d with kind='nearest'.)

http://docs.scipy.org/doc/scipy/reference/interpolate.html

What it sounds like you want to do is use the altitude points of one data set to interpolate the values of the other. This can be done pretty easily with either the numpy method or one of the scipy interpolation methods.
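For instance, here is a minimal sketch of the scipy route for regridding dataset A onto B's altitudes (nearest neighbour; out-of-range points are clamped to the edge values via a fill_value tuple, which assumes scipy >= 0.17):

```python
import numpy as np
from scipy.interpolate import interp1d

a_alt = np.array([1., 2., 3., 4.])
a_val = np.array([.33, .5, .6, .8])
b_alt = np.array([.7, .9, 1.7, 2., 2.4, 2.9, 3.1, 3.2, 3.9, 4.1])

# nearest-neighbour lookup; points outside [1, 4] get the edge values
f = interp1d(a_alt, a_val, kind='nearest',
             bounds_error=False, fill_value=(a_val[0], a_val[-1]))
print(f(b_alt))  # .33 .33 .5 .5 .5 .6 .6 .6 .8 .8
```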


Here's one way:

import numpy as np

def regrid_op(x, y, xi, op=np.median):
    x, y, xi = np.atleast_1d(x, y, xi)
    if (x.ndim, y.ndim, xi.ndim) != (1, 1, 1):
        raise ValueError("works only for 1D data")

    yi = np.zeros(xi.shape, dtype=y.dtype)
    yi.fill(np.nan)

    # sort data
    j = np.argsort(x)
    x = x[j]
    y = y[j]

    # group items by nearest neighbour
    x0s = np.r_[xi, np.inf]
    xc = .5*(x0s[:-1] + x0s[1:])

    j0 = 0
    for i, j1 in enumerate(np.searchsorted(x, xc)):
        print("x =", xi[i], ", y =", y[j0:j1])  # print some debug info
        yi[i] = op(y[j0:j1])
        j0 = j1

    return yi

xi = np.array([1, 2, 3, 4])
x = np.array([0.7, 0.9, 1.7, 2.0, 2.4, 2.9, 3.1, 3.2, 3.9, 4.1])
y = np.array([1,   2,   3,   4,   5,   6,   7,   8,   9,   10.])

print(regrid_op(x, y, xi))

I don't see a way to vectorize the loop over the items of the xi array, so this is efficient only as long as the number of points in grid A is not too large.

EDIT: This also assumes that the points in xi are sorted.


This is not exactly the answer you were looking for, but this is my 50c answer...

A = {1: 'a1', 2: 'a2', 3: 'a3', 4: 'a4'}
B = {0.7: 'b1', 0.9: 'b2', 1.7: 'b3', 2: 'b4', 2.4: 'b5'}

C = {}  # result

# find the altitude in A that is closest to altB
def findAltitude(altB, A):
    toto = sorted(((alt - altB) ** 2, alt) for alt in A)
    return toto[0][1]

# iterate over each altitude of B
for altB, valueB in B.items():
    altC = findAltitude(altB, A)
    C.setdefault(altC, []).append(valueB)

# then do the median operation
# for altC, valueC in C.items():
#     C[altC] = median(valueC)  # where median is your median function

print(C)

It is NOT the best solution at all (especially if you have a lot of values), but it was the fastest to write...

In fact, it depends on how your data are stored. A dictionary is not the best choice.

It is more interesting/clever to use the fact that your altitudes are sorted. You should provide more details on how your data are stored (numpy arrays?).

==== Edit ====

I still do not know how your data are stored, but let's try something more "clever", based on the fact that your altitudes are sorted.

import numpy as np

# append val to C: extend the last group if alt matches, otherwise start a new group
def addto(C, val, alt):
    if C and C[-1][0] == alt:
        C[-1][1].append(val)
    else:
        C.append((alt, [val]))


# values
A_Alt = np.array([1., 2., 3., 4.])
A_Val = np.array([.33, .5, .6, .8])
B_Alt = np.array([.7, .9, 1.7, 2., 2.4, 2.9, 3.1, 3.2, 3.9, 4.1])
B_Val = np.array([.3, np.nan, .8, .6, .7, .4, .3, np.nan, .99, 1.3])

# intermediate list of tuples (altitude, list_of_values)
C = []

# iterator on A
Aa = iter(A_Alt)
ainf = next(Aa)
asup = next(Aa)  # first two values of A_Alt

# regrid
for altB, valB in zip(B_Alt, B_Val):
    # advance until ainf and asup in A_Alt are such that ainf <= altB <= asup
    while asup < altB:
        try:
            ainf, asup = asup, next(Aa)
        except StopIteration:
            break

    # assign to the closer of the two surrounding altitudes
    if abs(ainf - altB) <= abs(asup - altB):
        addto(C, valB, ainf)
    else:
        addto(C, valB, asup)

# do the median, ignoring NaN
res = [np.nanmedian(k[1]) for k in C]

print(res)

The idea is then to iterate over the two vectors/lists of altitudes, and for each altitude of B, find the two altitudes of A that surround it. Then, it is easy to find the closest...

This is less readable than Daan's solution, but it should be more efficient (linear in the size of your data).

You just need to adapt it if your data are stored differently.


One way to cover the second case (Grid B to A, i.e. from few altitudes to many altitudes) is this:

Extrapolation function (from here)

import numpy as np
from scipy.interpolate import interp1d

def extrap1d(interpolator):
    xs = interpolator.x
    ys = interpolator.y

    def pointwise(x):
        if x < xs[0]:
            return ys[0]       # clamp below the grid
        elif x > xs[-1]:
            return ys[-1]      # clamp above the grid
        else:
            return interpolator(x)

    def ufunclike(xs):
        return np.array([pointwise(x) for x in np.asarray(xs)])

    return ufunclike

Values

A_Alt = np.array([1., 2., 3., 4.])
A_Val = np.array([.33, .5, .6, .8])
B_Alt = np.array([.7, .9, 1.7, 2., 2.4, 2.9, 3.1, 3.2, 3.9, 4.1])

Actual regridding:

f_i = interp1d(A_Alt, A_Val, kind='nearest')
f_x = extrap1d(f_i)

f_x(B_Alt)

Output:

array([ 0.33,  0.33,  0.5 ,  0.5 ,  0.5 ,  0.6 ,  0.6 ,  0.6 ,  0.8 ,  0.8 ])