Condensed matrix function to find pairs
For a set of observations:
[a1,a2,a3,a4,a5]
their pairwise distances
d=[[0,a12,a13,a14,a15]
[a21,0,a23,a24,a25]
[a31,a32,0,a34,a35]
[a41,a42,a43,0,a45]
[a51,a52,a53,a54,0]]
are given in a condensed matrix form (the upper triangle of the above, as calculated by scipy.spatial.distance.pdist):
c=[a12,a13,a14,a15,a23,a24,a25,a34,a35,a45]
The question is: given an index into the condensed matrix, is there a function f (preferably in Python) that quickly returns the two observations that were used to calculate that entry?
f(c,0)=(1,2)
f(c,5)=(2,4)
f(c,9)=(4,5)
...
I have tried some solutions but none worth mentioning :(
The formula for an index of the condensed matrix is
index = d * (d - 1) / 2 - (d - i) * (d - i - 1) / 2 + j - i - 1
where i is the row index, j is the column index (both 0-based), and d is the row length of the original (d x d) upper triangular matrix.
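As a quick check of the formula above, here is a minimal sketch of the forward mapping from a 0-based (i, j) pair with i < j to its condensed index. The function name condensed_index is only illustrative, and integer division is used because both products are always even.

```python
def condensed_index(d, i, j):
    """Condensed index of the 0-based pair (i, j), i < j, among d observations."""
    if not 0 <= i < j < d:
        raise ValueError("expected 0 <= i < j < d")
    # Entries before row i, plus the offset of column j inside row i.
    return d * (d - 1) // 2 - (d - i) * (d - i - 1) // 2 + j - i - 1


# The three examples from the question, shifted to 0-based indices.
print(condensed_index(5, 0, 1))  # a12
print(condensed_index(5, 1, 3))  # a24
print(condensed_index(5, 3, 4))  # a45
```

For d = 5 these three calls give 0, 5 and 9, which matches the positions of a12, a24 and a45 in c.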
Consider the case when the index refers to the leftmost, non-zero entry of some row in the original matrix. For all the leftmost indices,
j == i + 1
so
index = d * (d - 1) / 2 - (d - i) * (d - i - 1) / 2 + i + 1 - i - 1
index = d * (d - 1) / 2 - (d - i) * (d - i - 1) / 2
With some algebra, we can rewrite this as
i ** 2 + (1 - (2 * d)) * i + 2 * index == 0
Then we can use the quadratic formula to find the roots of the equation. Both roots are positive; the one we care about is the smaller root, (-b - sqrt(b ** 2 - 8 * index)) / 2 with b = 1 - 2 * d, because it is the only one that falls in the valid row range 0 <= i <= d - 2.
If this index does correspond to the leftmost, non-zero cell, then that root is a non-negative integer that corresponds to the row number. Then, finding the column number is just arithmetic.
j = index - d * (d - 1) / 2 + (d - i) * (d - i - 1)/ 2 + i + 1
If the index does not correspond to the leftmost, non-zero cell, then the root will not be an integer, but we can take the floor of that root as the row number.
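import math

def row_col_from_condensed_index(d, index):
    # Solve i**2 + b*i + 2*index == 0 and floor the smaller root to get the row.
    b = 1 - (2 * d)
    i = int((-b - math.sqrt(b ** 2 - 8 * index)) // 2)
    # The column follows from the index formula; i * (b + i + 2) is always even.
    j = index + i * (b + i + 2) // 2 + 1
    return (i, j)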
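For very large d the floating-point square root could in principle round the wrong way near a row boundary. The following is a sketch of an integer-only variant of the same idea, assuming Python 3.8+ for math.isqrt; the name row_col_exact is made up for illustration and is not part of the answer above.

```python
import math
from itertools import combinations


def row_col_exact(d, index):
    """0-based (row, col) for a condensed index, using integer arithmetic only."""
    b = 1 - 2 * d
    disc = b * b - 8 * index
    root = math.isqrt(disc)
    if root * root != disc:
        root += 1  # round the square root up, so the division below floors correctly
    i = (-b - root) // 2
    j = index + i * (b + i + 2) // 2 + 1
    return i, j


# Round-trip check against the row-major order of the upper triangle.
pairs = list(combinations(range(5), 2))
print(all(row_col_exact(5, k) == pairs[k] for k in range(len(pairs))))
```

Rounding the square root up before subtracting is equivalent to flooring the real-valued root, so the result should agree with the float version wherever that one is accurate.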
If you don't know d, you can figure it out from the length of the condensed matrix.
((d - 1) * d) / 2 == len(condensed_matrix)
d = int((1 + math.sqrt(1 + 8 * len(condensed_matrix))) // 2)
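A small sketch of the same computation with a validity check, assuming Python 3.8+ for math.isqrt. The helper name dim_from_condensed_length is hypothetical; it rejects lengths that cannot come from a d x d distance matrix.

```python
import math


def dim_from_condensed_length(n):
    """Number of observations d such that d * (d - 1) / 2 == n."""
    d = (1 + math.isqrt(1 + 8 * n)) // 2
    if d * (d - 1) // 2 != n:
        raise ValueError("%d is not a valid condensed matrix length" % n)
    return d


print(dim_from_condensed_length(10))
```

For the ten-element c in the question this gives 5.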
You may find numpy.triu_indices useful. For example:
In []: ti= triu_indices(5, 1)
In []: r, c= ti[0][5], ti[1][5]
In []: r, c
Out[]: (1, 3)
Just note that the indices start from 0. You may adjust them as you like, for example:
In []: def f(n, c):
   ..:     n = int(ceil(sqrt(2 * n)))
   ..:     ti = triu_indices(n, 1)
   ..:     return ti[0][c] + 1, ti[1][c] + 1
   ..:
In []: f(len(c), 5)
Out[]: (2, 4)
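If NumPy is not at hand, the same row-major ordering of the upper triangle can be reproduced with itertools.combinations from the standard library. This is only a sketch with a made-up function name, and it walks the pairs one by one, so it is meant for illustration rather than speed.

```python
from itertools import combinations, islice


def pair_from_condensed_index(n_obs, k):
    """1-based pair of observations for condensed index k among n_obs observations."""
    # combinations() yields (1, 2), (1, 3), ..., in the same order as the condensed matrix.
    return next(islice(combinations(range(1, n_obs + 1), 2), k, None))


print(pair_from_condensed_index(5, 5))
```

With five observations, index 5 maps to (2, 4), the same pair as the triu_indices version above.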
Clearly, the function f you are searching for needs a second argument: the dimension of the matrix, in your case 5.
First Try:
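def f(dim, i):
    d = dim - 1  # number of entries in the current row of the upper triangle
    s = d        # number of condensed entries up to the end of the current row
    while i >= s:  # advance one row at a time until index i falls inside it
        d -= 1
        s += d
    return (dim - d, dim + 1 + i - s)  # 1-based (row, column)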
To complete the list of answers to this question: a fast, vectorized version of fgregg's answer (as suggested by David Marx) could look like this:
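import numpy as np

def vec_row_col(d, i):
    i = np.array(i)
    b = 1 - 2 * d
    # Floor of the smaller quadratic root gives the row; the column follows from it.
    x = np.floor((-b - np.sqrt(b**2 - 8*i)) / 2).astype(int)
    y = (i + x*(b + x + 2)/2 + 1).astype(int)
    if i.shape:
        # list() so that Python 3 returns the pairs rather than a lazy iterator
        return list(zip(x, y))
    else:
        return (x, y)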
I needed to do these calculations for huge arrays, and the speedup as compared to the un-vectorized version (https://stackoverflow.com/a/14839010/3631440) is (as usual) quite impressive (using IPython %timeit):
import numpy as np
from scipy.spatial import distance
test = np.random.rand(1000,1000)
condense = distance.pdist(test)
sample = np.random.randint(0,len(condense), 1000)
%timeit res = vec_row_col(1000, sample)
10000 loops, best of 3: 156 µs per loop
res = []
%timeit for i in sample: res.append(row_col_from_condensed_index(1000, i))
100 loops, best of 3: 5.87 ms per loop
That's about 37 times faster in this example!
This is in addition to the answer provided by phynfo and your comment. It does not feel like a clean design to me to infer the dimension of the matrix from the length of the compressed matrix. That said, here is how you can compute it:
from math import sqrt, ceil

for i in range(1, 10):
    thelen = (i * (i + 1)) // 2
    thedim = sqrt(2 * thelen + ceil(sqrt(2 * thelen)))
    print("compressed array of length %d has dimension %d" % (thelen, thedim))
The argument to the outer square root should always be a perfect square, but sqrt returns a floating-point number, so some care is needed when using this.
Here's another solution:
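import numpy as np
from scipy.spatial.distance import squareform

def f(c, n):
    # Mark entry n in an all-zero condensed vector, expand it to square form,
    # and read off the rows that hold the two symmetric non-zero entries.
    tt = np.zeros_like(c)
    tt[n] = 1
    return tuple(np.nonzero(squareform(tt))[0])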
To improve the efficiency using numpy.triu_indices, use this:
import numpy

def PdistIndices(n, I):
    '''idx = {} indices for pdist results'''
    idx = numpy.array(numpy.triu_indices(n, 1)).T[I]
    return idx
Here I is an array of indices into the condensed matrix.
However, a better solution is to implement an optimized brute-force search, say, in Fortran:
function PdistIndices(n,indices,m) result(IJ)
    !IJ = {} indices for pdist[python] selected results[indices]
    implicit none
    integer:: i,j,m,n,k,w,indices(0:m-1),IJ(0:m-1,2)
    logical:: finished
    k = 0; w = 0; finished = .false.
    do i=0,n-2
        do j=i+1,n-1
            if (k==indices(w)) then
                IJ(w,:) = [i,j]
                w = w+1
                if (w==m) then
                    finished = .true.
                    exit
                endif
            endif
            k = k+1
        enddo
        if (finished) then
            exit
        endif
    enddo
end function
Then compile it using F2PY and enjoy unbeatable performance. ;)