Combinatorial optimisation of a distance metric
I have a set of trajectories, made up of points along the trajectory, with coordinates associated with each point. I store these in a 3d array (trajectory, point, param). I want to find the set of r trajectories that have the maximum accumulated distance over the possible pairwise combinations of these trajectories. My first attempt, which I think is working, looks like this:
max_dist = 0
for h in itertools.combinations ( xrange(num_traj), r):
    accum = 0.  # accumulated distance over all pairs in this r-subset
    for (m,l) in itertools.combinations (h, 2):
        for ( i, j ) in itertools.izip ( range(k), range(k) ):
            A = [ (my_mat[m, i, z] - my_mat[l, j, z])**2
                  for z in xrange(k) ]
            A = numpy.array( numpy.sqrt (A) ).sum()
            accum += A
    if max_dist < accum:
        max_dist = accum
        selected_trajectories = h
This takes forever, as num_traj can be around 500-1000, and r can be around 5-20. k is arbitrary, but can typically be up to 50.
Trying to be super-clever, I have put everything into two nested list comprehensions, making heavy use of itertools:
chunk = [[ numpy.sqrt((my_mat[m, i, :] - my_mat[l, j, :])**2).sum()
           for ((m,l), i, j) in
           itertools.product( itertools.combinations(h, 2), range(k), range(k) )]
         for h in itertools.combinations(range(num_traj), r) ]
Apart from being quite unreadable (!!!), it is also taking a long time. Can anyone suggest any ways to improve on this?
Rather than recalculate the distance between each pair of trajectories on demand, you can start by calculating the distance between all pairs of trajectories. You can store those in a dictionary and look them up as needed. This way your inner loop for (i, j) ... will be replaced with a constant-time lookup.
You can also ditch the square root in the distance calculation... the subset with the maximal sum will generally also have the maximal squared sum, although that only yields a constant speedup anyway.
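For example (a minimal sketch; it assumes my_mat and num_traj from the question, and computes the same sum of per-coordinate absolute differences that the question's inner loops produce):

import itertools
import numpy

# one-time precomputation of all C(num_traj, 2) pairwise distances
pair_dist = {}
for m, l in itertools.combinations(xrange(num_traj), 2):
    # equivalent to the question's i/z loops, vectorized over the whole trajectory
    pair_dist[m, l] = numpy.sqrt((my_mat[m] - my_mat[l])**2).sum()

# inside the main combinations loop, each pair is then a constant-time lookup:
# accum = sum(pair_dist[pair] for pair in itertools.combinations(h, 2))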
Here are a few points of interest and suggestions in addition to what everyone else has mentioned. (By the way, mathmike's suggestion of generating a look-up table of all pair distances is one that you should put in place immediately: it replaces the O(k^2) inner-loop distance computation for every pair with a constant-time lookup.)
First, the lines
for ( i, j ) in itertools.izip ( range(k), range(k) ):
    A = [ (my_mat[m, i, z] - my_mat[l, j, z])**2
          for z in xrange(k) ]
can be replaced with
for i in xrange(k):
    A = [ (my_mat[m, i, z] - my_mat[l, i, z])**2
          for z in xrange(k) ]
because i and j are always the same in every loop. There's no need to use izip at all here.
Second, regarding the line
A = numpy.array( numpy.sqrt (A) ).sum()
Are you sure this is how you want to calculate it? Possibly so, but it struck me as odd, because if this were more of a Euclidean distance between vectors, then the line would be:
A = numpy.sqrt (numpy.array( A ).sum())
or just
A = numpy.sqrt(sum(A))
because I would think that converting A to a numpy array just to use numpy's sum function would be slower than the built-in Python sum function, but I could be wrong. Also, if it truly is a Euclidean distance that you want, then you will be doing fewer sqrts this way.
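Either way, the per-pair computation can be vectorized over the whole trajectory at once (a sketch, assuming my_mat has the (trajectory, point, param) layout from the question):

diff = my_mat[m] - my_mat[l]                          # (point, param) differences for one pair
as_posted = numpy.abs(diff).sum()                     # what the question's loops compute
euclidean = numpy.sqrt((diff**2).sum(axis=1)).sum()   # per-point Euclidean distance, summed over points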
Third, do you realize how many potential combinations you may be trying to iterate over? For the worst case with num_traj = 1000 and r = 20, that is C(1000, 20), roughly 3.4E41 combinations. That's quite intractable with your current method. Even for the best case of num_traj = 500 and r = 5, that's C(500, 5), roughly 2.6E11 combinations: a lot, but not impossible. This is the real problem you're having here, because once you take mathmike's advice, the first two points that I have mentioned aren't very important.
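Those counts are easy to check (a small standalone helper; n_choose_r is a name introduced here):

from math import factorial

def n_choose_r(n, r):
    # number of r-element subsets of n items
    return factorial(n) // (factorial(r) * factorial(n - r))

print n_choose_r(1000, 20)   # ~ 3.4e41
print n_choose_r(500, 5)     # 255244687600 ~ 2.6e11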
What can you do then? You will need to be a little more clever. It isn't clear to me yet what would be a good method to use for this; I'm guessing that you will need to make the algorithm heuristic in some way. One thought I had was to try a dynamic programming sort of approach with a heuristic. For each trajectory, you could find the sum or mean of its distances to every other trajectory and use this as a fitness measure. Some of the trajectories with the lowest fitness measures could be dropped before moving on to trios. You could then do the same thing with trios: find the sum or mean of the accumulated distances for all trios (among the remaining trajectories) that each trajectory is involved with, and use that as the fitness measure to decide which ones to drop before moving on to foursomes. It doesn't guarantee the optimal solution, but it should be quite good, and I believe it greatly lowers the time complexity of the solution.
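A minimal sketch of that first culling stage (assuming dist is a precomputed (num_traj, num_traj) array of pairwise distances, as in mathmike's answer, and num_keep is a tuning parameter introduced here for illustration):

import numpy

fitness = dist.sum(axis=0)                       # accumulated distance of each trajectory to all others
survivors = numpy.argsort(fitness)[-num_keep:]   # keep the num_keep fittest, drop the rest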
It's likely to take forever anyhow, as your algorithm takes about O( C(N, r) * r^2 ), where C(N, r) is "N choose r". For smaller r (or N) this might be okay, but if you absolutely need to find the max, as opposed to using an approximation heuristic, you should try branch and bound with different bounding strategies. This might work for smaller r, and it saves you quite a bit on unnecessary recalculations.
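For concreteness, here is a minimal branch-and-bound sketch with the simplest valid bound: every pair not yet counted is optimistically scored at the global maximum distance. It assumes dist is the precomputed symmetric pairwise-distance matrix; a tighter bound would prune far more.

import numpy

def max_dispersion_bnb(dist, r):
    """ exact max accumulated pairwise distance over r-subsets;
        only feasible for modest N and r """
    N = dist.shape[0]
    d_max = dist.max()
    best = [-numpy.inf, None]   # best weight and subset found so far

    def recurse(start, chosen, weight):
        need = r - len(chosen)
        if need == 0:
            if weight > best[0]:
                best[0], best[1] = weight, tuple(chosen)
            return
        # optimistic bound: every not-yet-counted pair scores d_max
        pairs_left = need * len(chosen) + need * (need - 1) // 2
        if weight + pairs_left * d_max <= best[0]:
            return              # prune this whole subtree
        for p in xrange(start, N - need + 1):
            w_p = sum(dist[p, c] for c in chosen)
            recurse(p + 1, chosen + [p], weight + w_p)

    recurse(0, [], 0.0)
    return best[0], best[1]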
This sounds like a "weighted clique" problem: find, e.g., r = 5 people in a network with maximum compatibility, i.e. max sum of the C(5,2) pairwise weights. Googling "weighted clique" algorithm -"clique percolation" gives about 3k hits.
BUT I would go with Justin Peel's method, because it's understandable and controllable (take the n2 best pairs, from them the best n3 triples ... adjust n2, n3 ... to easily trade off runtime / quality of results). Added 18may: a cut at an implementation follows.
@Jose, it would be interesting to see what nbest[] sequence works for you.
#!/usr/bin/env python
""" cliq.py: grow high-weight 2 3 4 5-cliques, taking nbest at each stage
    weight ab = dist[a,b] -- a symmetric numpy array, diag << 0
    weight abc, abcd ... = sum weight all pairs
    C[2] = [ (dist[j,k], (j,k)) ... ]  nbest[2] pairs
    C[3] = [ (cliqwt(j,k,l), (j,k,l)) ... ]  nbest[3] triples
    ...
    run time ~ N * (N + nbest[2] + nbest[3] ...)
    keywords: weighted-clique heuristic python
"""
# cf "graph clustering algorithm"

from __future__ import division
import numpy as np

__version__ = "denis 18may 2010"
me = __file__.split('/') [-1]

def cliqdistances( cliq, dist ):
    return sorted( [dist[j,k] for j in cliq for k in cliq if j < k], reverse=True )

def maxarray2( a, n ):
    """ -> max n [ (a[j,k], (j,k)) ...]  j <= k, a symmetric """
    jkflat = np.argsort( a, axis=None )[:-2*n:-1]
    jks = [np.unravel_index( jk, a.shape ) for jk in jkflat]
    return [(a[j,k], (j,k)) for j,k in jks if j <= k] [:n]

def _str( iter, fmt="%.2g" ):
    return " ".join( fmt % x for x in iter )

#...............................................................................
def maxweightcliques( dist, nbest, r, verbose=10 ):
    N = len(dist)  # so the nested growcliqs doesn't rely on a global N

    def cliqwt( cliq, p ):
        return sum( dist[c,p] for c in cliq )  # << 0 if p in c

    def growcliqs( cliqs, nbest ):
        """ [(cliqweight, n-cliq) ...] -> nbest [(cliqweight, n+1 cliq) ...] """
        # heapq the nbest ? here just gen all N * |cliqs|, sort
        all = []
        dups = set()
        for w, c in cliqs:
            for p in xrange(N):
                # fast gen [sorted c+p ...] with small sorted c ?
                cp = c + [p]
                cp.sort()
                tup = tuple(cp)
                if tup in dups: continue
                dups.add( tup )
                all.append( (w + cliqwt(c, p), cp ))
        all.sort( reverse=True )
        if verbose:
            print "growcliqs: %s" % _str( w for w,c in all[:verbose] ) ,
            print " best: %s" % _str( cliqdistances( all[0][1], dist )[:10])
        return all[:nbest]

    np.fill_diagonal( dist, -1e10 )  # so cliqwt( c, p in c ) << 0
    C = (r+1) * [(0, None)]  # [(cliqweight, cliq-tuple) ...]
        # C[1] = [(0, (p,)) for p in xrange(N)]
    C[2] = [(w, list(pair)) for w, pair in maxarray2( dist, nbest[2] )]
    for j in range( 3, r+1 ):
        C[j] = growcliqs( C[j-1], nbest[j] )
    return C

#...............................................................................
if __name__ == "__main__":
    import sys

    N = 100
    r = 5  # max clique size
    nbest = 10
    verbose = 0
    seed = 1
    exec "\n".join( sys.argv[1:] )  # N= ...
    np.random.seed(seed)
    nbest = [0, 0, N//2] + (r - 2) * [nbest]  # ?
    print "%s  N=%d  r=%d  nbest=%s" % (me, N, r, nbest)

    # random graphs w cluster parameters ?
    dist = np.random.exponential( 1, (N,N) )
    dist = (dist + dist.T) / 2
    for j in range( 0, N, r ):
        dist[j:j+r, j:j+r] += 2  # see if we get r in a row
    # dist = np.ones( (N,N) )

    cliqs = maxweightcliques( dist, nbest, r, verbose )[-1]  # [ (wt, cliq) ... ]

    print "Clique weight,  clique,  distances within clique"
    print 50 * "-"
    for w,c in cliqs:
        print "%5.3g  %s  %s" % (
            w, _str( c, fmt="%d" ), _str( cliqdistances( c, dist )[:10]))
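With the defaults this runs on a random 100-node graph; since each command-line argument is exec'd as a Python assignment, parameters can be overridden like python cliq.py N=200 r=6 verbose=1.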