Combinatorial optimisation of a distance metric
I have a set of trajectories, made up of points along the trajectory, with coordinates associated with each point. I store these in a 3d array (trajectory, point, param). I want to find the set of r trajectories that have the maximum accumulated distance over the possible pairwise combinations of these trajectories. My first attempt, which I think is working, looks like this:
max_dist = 0
for h in itertools.combinations ( xrange(num_traj), r):
    accum = 0.  # accumulated distance over all pairs in this r-subset
    for (m,l) in itertools.combinations (h, 2):
        for ( i, j ) in itertools.izip ( range(k), range(k) ):
            A = [ (my_mat[m, i, z] - my_mat[l, j, z])**2
                  for z in xrange(k) ]
            A = numpy.array( numpy.sqrt (A) ).sum()
            accum += A
    if max_dist < accum:
        max_dist = accum
        selected_trajectories = h
This takes forever, as num_traj can be around 500-1000, and r can be around 5-20. k is arbitrary, but can typically be up to 50.
Trying to be super-clever, I have put everything into two nested list comprehensions, making heavy use of itertools:
chunk = [[ numpy.sqrt((my_mat[m, i, :] - my_mat[l, j, :])**2).sum()
           for ((m,l), i, j) in
           itertools.product( itertools.combinations(h, 2), range(k), range(k) )]
         for h in itertools.combinations(range(num_traj), r) ]
Apart from being quite unreadable (!!!), it is also taking a long time. Can anyone suggest any ways to improve on this?
Rather than recalculate the distance between each pair of trajectories on demand, you can start by calculating the distance between all pairs of trajectories. You can store those in a dictionary and look them up as needed. This way your inner loop for (i, j) ... will be replaced with a constant-time lookup.
You can also ditch the square root in the distance calculation... the subset with the maximal sum will generally also have the maximal squared sum, although that only yields a constant speedup anyway.
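For example (a minimal sketch; it assumes my_mat and num_traj from the question, and computes the same sum of per-coordinate absolute differences that the question's inner loops produce):

import itertools
import numpy

# one-time precomputation of all C(num_traj, 2) pairwise distances
pair_dist = {}
for m, l in itertools.combinations(xrange(num_traj), 2):
    # equivalent to the question's i/z loops, vectorized over the whole trajectory
    pair_dist[m, l] = numpy.sqrt((my_mat[m] - my_mat[l])**2).sum()

# inside the main combinations loop, each pair is then a constant-time lookup:
# accum = sum(pair_dist[pair] for pair in itertools.combinations(h, 2))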
Here are a few points of interest and suggestions in addition to what everyone else has mentioned. (By the way, mathmike's suggestion of generating a look-up table of all pair distances is one that you should put in place immediately: it replaces the O(k^2) inner-loop distance computation for every pair with a constant-time lookup.)
First, the lines
for ( i, j ) in itertools.izip ( range(k), range(k) ):
    A = [ (my_mat[m, i, z] - my_mat[l, j, z])**2
          for z in xrange(k) ]
can be replaced with
for i in xrange(k):
    A = [ (my_mat[m, i, z] - my_mat[l, i, z])**2
          for z in xrange(k) ]
because i and j are always the same in every loop. There's no need to use izip at all here.
Second, regarding the line
A = numpy.array( numpy.sqrt (A) ).sum()
Are you sure this is how you want to calculate it? Possibly so, but it struck me as odd, because if this were more of a Euclidean distance between vectors, then the line would be:
A = numpy.sqrt (numpy.array( A ).sum())
or just
A = numpy.sqrt(sum(A))
because I would think that converting A to a numpy array just to use numpy's sum function would be slower than the built-in Python sum function, but I could be wrong. Also, if it truly is a Euclidean distance that you want, then you will be doing fewer sqrts this way.
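Either way, the per-pair computation can be vectorized over the whole trajectory at once (a sketch, assuming my_mat has the (trajectory, point, param) layout from the question):

diff = my_mat[m] - my_mat[l]                          # (point, param) differences for one pair
as_posted = numpy.abs(diff).sum()                     # what the question's loops compute
euclidean = numpy.sqrt((diff**2).sum(axis=1)).sum()   # per-point Euclidean distance, summed over points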
Third, do you realize how many potential combinations you may be trying to iterate over? For the worst case with num_traj = 1000 and r = 20, that is C(1000, 20), roughly 3.4E41 combinations. That's quite intractable with your current method. Even for the best case of num_traj = 500 and r = 5, that's C(500, 5), roughly 2.6E11 combinations: a lot, but not impossible. This is the real problem you're having here, because once you take mathmike's advice, the first two points that I have mentioned aren't very important.
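Those counts are easy to check (a small standalone helper; n_choose_r is a name introduced here):

from math import factorial

def n_choose_r(n, r):
    # number of r-element subsets of n items
    return factorial(n) // (factorial(r) * factorial(n - r))

print n_choose_r(1000, 20)   # ~ 3.4e41
print n_choose_r(500, 5)     # 255244687600 ~ 2.6e11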
What can you do then? You will need to be a little more clever. It isn't clear to me yet what would be a good method to use for this; I'm guessing that you will need to make the algorithm heuristic in some way. One thought I had was to try a dynamic programming sort of approach with a heuristic. For each trajectory, you could find the sum or mean of its distances to every other trajectory and use this as a fitness measure. Some of the trajectories with the lowest fitness measures could be dropped before moving on to trios. You could then do the same thing with trios: find the sum or mean of the accumulated distances for all trios (among the remaining trajectories) that each trajectory is involved with, and use that as the fitness measure to decide which ones to drop before moving on to foursomes. It doesn't guarantee the optimal solution, but it should be quite good, and I believe it greatly lowers the time complexity of the solution.
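A minimal sketch of that first culling stage (assuming dist is a precomputed (num_traj, num_traj) array of pairwise distances, as in mathmike's answer, and num_keep is a tuning parameter introduced here for illustration):

import numpy

fitness = dist.sum(axis=0)                       # accumulated distance of each trajectory to all others
survivors = numpy.argsort(fitness)[-num_keep:]   # keep the num_keep fittest, drop the rest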
It's likely to take forever anyhow, as your algorithm takes about O( C(N, r) * r^2 ), where C(N, r) is "N choose r". For smaller r (or N) this might be okay, but if you absolutely need to find the max, as opposed to using an approximation heuristic, you should try branch and bound with different bounding strategies. This might work for smaller r, and it saves you quite a bit on unnecessary recalculations.
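For concreteness, here is a minimal branch-and-bound sketch with the simplest valid bound: every pair not yet counted is optimistically scored at the global maximum distance. It assumes dist is the precomputed symmetric pairwise-distance matrix; a tighter bound would prune far more.

import numpy

def max_dispersion_bnb(dist, r):
    """ exact max accumulated pairwise distance over r-subsets;
        only feasible for modest N and r """
    N = dist.shape[0]
    d_max = dist.max()
    best = [-numpy.inf, None]   # best weight and subset found so far

    def recurse(start, chosen, weight):
        need = r - len(chosen)
        if need == 0:
            if weight > best[0]:
                best[0], best[1] = weight, tuple(chosen)
            return
        # optimistic bound: every not-yet-counted pair scores d_max
        pairs_left = need * len(chosen) + need * (need - 1) // 2
        if weight + pairs_left * d_max <= best[0]:
            return              # prune this whole subtree
        for p in xrange(start, N - need + 1):
            w_p = sum(dist[p, c] for c in chosen)
            recurse(p + 1, chosen + [p], weight + w_p)

    recurse(0, [], 0.0)
    return best[0], best[1]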
This sounds like a "weighted clique" problem: find, e.g., r = 5 people in a network with maximum compatibility, i.e. max sum of the C(5,2) pairwise weights. Googling "weighted clique" algorithm -"clique percolation" gives about 3k hits.
BUT I would go with Justin Peel's method, because it's understandable and controllable (take the n2 best pairs, from them the best n3 triples ... adjust n2, n3 ... to easily trade off runtime / quality of results). Added 18may: a cut at an implementation follows.
@Jose, it would be interesting to see what nbest[] sequence works for you.
#!/usr/bin/env python
""" cliq.py: grow high-weight 2 3 4 5-cliques, taking nbest at each stage
    weight ab = dist[a,b] -- a symmetric numpy array, diag << 0
    weight abc, abcd ... = sum weight all pairs
    C[2] = [ (dist[j,k], (j,k)) ... ]  nbest[2] pairs
    C[3] = [ (cliqwt(j,k,l), (j,k,l)) ... ]  nbest[3] triples
    ...
    run time ~ N * (N + nbest[2] + nbest[3] ...)
    keywords: weighted-clique heuristic python
"""
# cf "graph clustering algorithm"

from __future__ import division
import numpy as np

__version__ = "denis 18may 2010"
me = __file__.split('/') [-1]

def cliqdistances( cliq, dist ):
    return sorted( [dist[j,k] for j in cliq for k in cliq if j < k], reverse=True )

def maxarray2( a, n ):
    """ -> max n [ (a[j,k], (j,k)) ...]  j <= k, a symmetric """
    jkflat = np.argsort( a, axis=None )[:-2*n:-1]
    jks = [np.unravel_index( jk, a.shape ) for jk in jkflat]
    return [(a[j,k], (j,k)) for j,k in jks if j <= k] [:n]

def _str( iter, fmt="%.2g" ):
    return " ".join( fmt % x for x in iter )

#...............................................................................
def maxweightcliques( dist, nbest, r, verbose=10 ):
    N = len(dist)  # so the nested growcliqs doesn't rely on a global N

    def cliqwt( cliq, p ):
        return sum( dist[c,p] for c in cliq )  # << 0 if p in c

    def growcliqs( cliqs, nbest ):
        """ [(cliqweight, n-cliq) ...] -> nbest [(cliqweight, n+1 cliq) ...] """
        # heapq the nbest ? here just gen all N * |cliqs|, sort
        all = []
        dups = set()
        for w, c in cliqs:
            for p in xrange(N):
                # fast gen [sorted c+p ...] with small sorted c ?
                cp = c + [p]
                cp.sort()
                tup = tuple(cp)
                if tup in dups: continue
                dups.add( tup )
                all.append( (w + cliqwt(c, p), cp ))
        all.sort( reverse=True )
        if verbose:
            print "growcliqs: %s" % _str( w for w,c in all[:verbose] ) ,
            print " best: %s" % _str( cliqdistances( all[0][1], dist )[:10])
        return all[:nbest]

    np.fill_diagonal( dist, -1e10 )  # so cliqwt( c, p in c ) << 0
    C = (r+1) * [(0, None)]  # [(cliqweight, cliq-tuple) ...]
        # C[1] = [(0, (p,)) for p in xrange(N)]
    C[2] = [(w, list(pair)) for w, pair in maxarray2( dist, nbest[2] )]
    for j in range( 3, r+1 ):
        C[j] = growcliqs( C[j-1], nbest[j] )
    return C

#...............................................................................
if __name__ == "__main__":
    import sys

    N = 100
    r = 5  # max clique size
    nbest = 10
    verbose = 0
    seed = 1
    exec "\n".join( sys.argv[1:] )  # N= ...
    np.random.seed(seed)
    nbest = [0, 0, N//2] + (r - 2) * [nbest]  # ?
    print "%s  N=%d  r=%d  nbest=%s" % (me, N, r, nbest)

    # random graphs w cluster parameters ?
    dist = np.random.exponential( 1, (N,N) )
    dist = (dist + dist.T) / 2
    for j in range( 0, N, r ):
        dist[j:j+r, j:j+r] += 2  # see if we get r in a row
    # dist = np.ones( (N,N) )

    cliqs = maxweightcliques( dist, nbest, r, verbose )[-1]  # [ (wt, cliq) ... ]

    print "Clique weight,  clique,  distances within clique"
    print 50 * "-"
    for w,c in cliqs:
        print "%5.3g  %s  %s" % (
            w, _str( c, fmt="%d" ), _str( cliqdistances( c, dist )[:10]))
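With the defaults this runs on a random 100-node graph; since each command-line argument is exec'd as a Python assignment, parameters can be overridden like python cliq.py N=200 r=6 verbose=1.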