Condensed matrix function to find pairs

2023-02-17 16:59

For a set of observations:

[a1,a2,a3,a4,a5]

their pairwise distances

d=[[0,a12,a13,a14,a15],
   [a21,0,a23,a24,a25],
   [a31,a32,0,a34,a35],
   [a41,a42,a43,0,a45],
   [a51,a52,a53,a54,0]]

are given in condensed matrix form (the upper triangle of the above, as computed by scipy.spatial.distance.pdist):

c=[a12,a13,a14,a15,a23,a24,a25,a34,a35,a45]

The question is: given an index into the condensed matrix, is there a function f (preferably in Python) that quickly tells me which two observations were used to calculate that entry?

f(c,0)=(1,2)
f(c,5)=(2,4)
f(c,9)=(4,5)
...

I have tried some solutions but none worth mentioning :(

The formula for an index of the condensed matrix is

index = d * (d - 1) / 2 - (d - i) * (d - i - 1) / 2 + j - i - 1

Where i is the row index, j is the column index (both zero-based, with i < j), and d is the row length of the original (d X d) upper triangular matrix.
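As a quick sanity check of the formula, the sketch below enumerates every pair (i, j) with i < j for d = 5 and confirms that the formula yields 0 through 9 in order. The helper name condensed_index is made up for this illustration; it is not a scipy function.

```python
from itertools import combinations

def condensed_index(d, i, j):
    """Condensed index of the pair (i, j), zero-based, with i < j < d."""
    # Both products are always even, so integer division is exact here.
    return d * (d - 1) // 2 - (d - i) * (d - i - 1) // 2 + j - i - 1

d = 5
pairs = list(combinations(range(d), 2))  # (0, 1), (0, 2), ..., (3, 4) in row-major order
indices = [condensed_index(d, i, j) for i, j in pairs]
print(indices)
```

The pairs come out in the same order as the entries of c in the question, so the list printed is simply the positions 0 to 9.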

Consider the case when the index refers to the leftmost, non-zero entry of some row in the original matrix. For all the leftmost indices,

j == i + 1

index = d * (d - 1) / 2 - (d - i) * (d - i - 1) / 2 + i + 1 - i - 1
index = d * (d - 1) / 2 - (d - i) * (d - i - 1) / 2

With some algebra, we can rewrite this as

i ** 2 + (1 - (2 * d)) * i + 2 * index == 0
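
The algebra behind that step, written out. The symbol k is only a shorthand for index introduced here to keep the lines short:

```latex
\begin{aligned}
2k &= d(d-1) - (d-i)(d-i-1) \\
   &= (d^2 - d) - (d^2 - 2di + i^2 - d + i) \\
   &= 2di - i^2 - i \\
\Longrightarrow\quad 0 &= i^2 + (1 - 2d)\,i + 2k
\end{aligned}
```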

Then we can use the quadratic formula to find the roots of the equation. Both roots are non-negative here, and the one we care about is the smaller one (the root with the minus sign in front of the square root), since only it falls within the valid range of row indices.

If this index does correspond to the leftmost, non-zero cell, then we get a non-negative integer as the solution, and it is the row number. Then, finding the column number is just arithmetic.

j = index - d * (d - 1) / 2 + (d - i) * (d - i - 1)/ 2 + i + 1

If the index does not correspond to the leftmost, non-zero cell, then we will not find an integer root, but we can take the floor of that smaller root as the row number.

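import math

def row_col_from_condensed_index(d,index):
    # Floor the smaller root of i**2 + b*i + 2*index == 0 to get the row.
    b = 1 - (2 * d)
    i = int((-b - math.sqrt(b ** 2 - 8 * index)) // 2)
    # The product i * (b + i + 2) is always even, so // is exact.
    j = index + i * (b + i + 2) // 2 + 1
    return (i,j)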
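
The square root above is taken in floating point, which may lose precision for very large d. As a hedged alternative (not part of the original answer), the sketch below does the same computation with math.isqrt (Python 3.8+), so every step stays in exact integers. The name row_col_exact and the range check are additions made for this example.

```python
import math

def row_col_exact(d, index):
    """Zero-based (row, col) of a condensed index, computed without floats."""
    if not 0 <= index < d * (d - 1) // 2:
        raise IndexError("condensed index out of range")
    b = 1 - 2 * d
    disc = b * b - 8 * index
    root = math.isqrt(disc)   # floor of the exact square root
    if root * root != disc:
        root += 1             # round up, so the division below floors the smaller root
    i = (-b - root) // 2
    j = index + i * (b + i + 2) // 2 + 1
    return i, j

d = 5
print([row_col_exact(d, k) for k in range(d * (d - 1) // 2)])
```

Rounding the square root up before subtracting is what makes the final integer division equal to the floor of the smaller root.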

If you don't know d, you can work it out from the length of the condensed matrix.

((d - 1) * d) / 2 == len(condensed_matrix)
d = int((1 + math.sqrt(1 + 8 * len(condensed_matrix))) // 2)

You may find numpy.triu_indices useful. For example:

In []: ti= triu_indices(5, 1)
In []: r, c= ti[0][5], ti[1][5]
In []: r, c
Out[]: (1, 3)

Note that the indices start from 0. You can adjust that as you like, for example:

In []: def f(n, c):
   ..:     n= int(ceil(sqrt(2* n)))
   ..:     ti= triu_indices(n, 1)
   ..:     return ti[0][c]+ 1, ti[1][c]+ 1
   ..:
In []: f(len(c), 5)
Out[]: (2, 4)
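
If NumPy is not at hand, the same lookup table can be built with the standard library: itertools.combinations(range(n), 2) yields the pairs in the same row-major order as triu_indices(n, 1). This is only a sketch for small n, since it materialises all n*(n-1)/2 pairs; the name pair_from_index and the one_based flag are made up for this example.

```python
from itertools import combinations

def pair_from_index(n, k, one_based=True):
    """Return the k-th pair of observations among n, in condensed order."""
    pairs = list(combinations(range(n), 2))  # same order as the condensed matrix
    i, j = pairs[k]
    return (i + 1, j + 1) if one_based else (i, j)

print(pair_from_index(5, 0), pair_from_index(5, 5), pair_from_index(5, 9))
```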

Clearly, the function f you are searching for needs a second argument: the dimension of the matrix, in your case 5.

First Try:

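def f(dim,i):
  # d: number of entries in the current row; s: condensed index where the next row starts
  d = dim-1 ; s = d
  while i>=s:
    d-=1 ; s+=d
  row = dim-d
  # 1-based (row, column); s-d is the condensed index of the row's first entry
  return (row, row+1+i-(s-d))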

To complete the list of answers to this question: A fast, vectorized version of fgreggs answer (as suggested by David Marx) could look like this:

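import numpy as np

def vec_row_col(d,i):
    # i may be a single condensed index or an array of them.
    i = np.array(i)
    b = 1 - 2 * d
    x = np.floor((-b - np.sqrt(b**2 - 8*i))/2).astype(int)
    y = (i + x*(b + x + 2)/2 + 1).astype(int)
    if i.shape:
        return list(zip(x,y))  # one (row, col) pair per index
    else:
        return (x,y)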

I needed to do these calculations for huge arrays, and the speedup as compared to the un-vectorized version (https://stackoverflow.com/a/14839010/3631440) is (as usual) quite impressive (using IPython %timeit):

import numpy as np
from scipy.spatial import distance

test = np.random.rand(1000,1000)
condense = distance.pdist(test)
sample = np.random.randint(0,len(condense), 1000)

%timeit res = vec_row_col(1000, sample)
10000 loops, best of 3: 156 µs per loop

res = []
%timeit for i in sample: res.append(row_col_from_condensed_index(1000, i))
100 loops, best of 3: 5.87 ms per loop

That's about 37 times faster in this example!

This is in addition to the answer provided by phynfo and your comment. It does not feel like a clean design to me to infer the dimension of the matrix from the length of the compressed matrix. That said, here is how you can compute it:

from math import sqrt, ceil

for i in range(1,10):
   thelen = (i * (i+1)) // 2
   thedim = sqrt(2*thelen + ceil(sqrt(2*thelen)))
   print("compressed array of length %d has dimension %d" % (thelen, thedim))

The argument to the outer square root should always be a square integer, but sqrt returns a floating point number, so some care is needed when using this.
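
One way to take that care, sketched here as an assumption rather than as part of the answer above, is to avoid floats altogether with math.isqrt (Python 3.8+) and to check that the result really reproduces the given length. The function name dim_from_condensed_length is hypothetical.

```python
import math

def dim_from_condensed_length(length):
    """Return d such that d * (d - 1) // 2 == length, or raise ValueError."""
    # Solve d**2 - d - 2*length == 0 with an exact integer square root.
    d = (1 + math.isqrt(1 + 8 * length)) // 2
    if d * (d - 1) // 2 != length:
        raise ValueError(f"{length} is not a valid condensed matrix length")
    return d

print(dim_from_condensed_length(10))
```

The explicit check rejects lengths such as 11 that cannot come from any square distance matrix, instead of silently rounding to a nearby dimension.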

Here's another solution:

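import numpy as np
from scipy.spatial.distance import squareform

def f(c,n):
    # Mark entry n of a zeroed condensed vector, expand it to square form,
    # and read off the rows of the two nonzero cells.
    tt = np.zeros_like(c)
    tt[n] = 1
    return tuple(np.nonzero(squareform(tt))[0])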

To improve efficiency, use numpy.triu_indices:

import numpy

def PdistIndices(n,I):
    '''Return the (row, column) pairs for the pdist result indices I.'''
    idx = numpy.array(numpy.triu_indices(n,1)).T[I]
    return idx

Here I is an array of condensed indices.

However, a better solution is to implement an optimized brute-force search, for example in Fortran:

function PdistIndices(n,indices,m) result(IJ)
    !IJ = {} indices for pdist[python] selected results[indices]
    implicit none
    integer:: i,j,m,n,k,w,indices(0:m-1),IJ(0:m-1,2)
    logical:: finished
    k = 0; w = 0; finished = .false.
    do i=0,n-2
        do j=i+1,n-1
            if (k==indices(w)) then
                IJ(w,:) = [i,j]
                w = w+1
                if (w==m) then
                    finished = .true.
                    exit
                endif
            endif
            k = k+1
        enddo
        if (finished) then
            exit
        endif
    enddo
end function

then compile using F2PY and enjoy unbeatable performance. ;)
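
For readers who do not use Fortran, here is roughly the same single-pass scan in Python. It is meant only to show the logic, not to match the compiled speed. Like the Fortran routine, it walks the pairs in condensed order and therefore assumes that indices is sorted in ascending order without duplicates; the name pdist_indices_scan is made up here.

```python
def pdist_indices_scan(n, indices):
    """Return [(i, j), ...] for each condensed index in `indices` (sorted ascending)."""
    result = []
    if not indices:
        return result
    k = 0  # running condensed index
    w = 0  # position in `indices`
    for i in range(n - 1):
        for j in range(i + 1, n):
            if k == indices[w]:
                result.append((i, j))
                w += 1
                if w == len(indices):
                    return result  # all requested indices found
            k += 1
    return result

print(pdist_indices_scan(5, [0, 5, 9]))
```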
