开发者

How to find a pair (or trio or more) of objects that TOGETHER are most similar to a given criteria?

The objects would consist of various data but we'd only be using the numeric data for the algorithm, which should simplify things a little bit.

I'll explain my problem with an example. Let's say we have a list of 200+ basketball players and their average game stats. Let's say I want to find a player with stats most similar to another player. That's doable in a reasonable amount of time.

What I'm struggling to figure out is how I can find two players that when combined have stats more similar to a certain single basketball player than all other pairs of basketball players possible. For example:

I'm looking for a pair of players that when combined have average stats of:

The algorithm should return Player A and Player B that combined are closer to those average stats than any other two players.

Is this doable in a timely manner, especially considering there would be between 9 to 11 different numeric values to consider for each basketball player? And the list of basketball players would be at least 200. And on top of all that, is it possible to scale this problem to not only finding pairs of players, but trios or more?

I'm working on a personal project and would love some insight on how to tackle this issue. Thank you!


Here is a proposed algorithm.

Call M the ideal average that you want to get close to.

For every player P:

  • Find player Q = Q(P) such that (P+Q)/2 is closest to M;
  • Note d(P) the distance between (P+Q)/2 and M

Return pair (P, Q(P)) that minimises d(P)

This algorithm will be efficient, provided that given a player P, the operation "find player Q such that (P+Q)/2 is closest to M" is efficient.

Note that the two following operations are equivalent:

  • Find player Q such that (P+Q)/2 is closest to M;
  • Find player Q which is closest to M + (M - P).

Geometrically speaking, it is easy to understand this equivalence. The point R = M+(M-P) is the symmetric of P in the central symmetry around point M; in other words, R is the point such that M is exactly the middle of segment PR.

Now, given point R = M + (M - P), the operation "find player Q which is closest to R" can be done efficiently if the points are stored in a data structure specially designed to make this operation efficient. One such data structure is the k-d tree.

Here is a piece of code in python using scikit-learn's sklearn.neighbors.KDTree:

from sklearn.neighbors import KDTree
from numpy import argmin

# X should be numpy array with
#   one row per player
#   one column per feature
def find_bestavg_pair(X, m):
    tree = KDTree(X, leaf_size=5)    # KDTree makes it easy to find nearest neighbour of a point
    # REMOVE leaf_size=5 IF NUMBER OF PLAYERS IS LARGE
    distances, best_indices = tree.query(m + (m-X), k=1)
    i = argmin(distances)
    return i, best_indices[i].item()

Testing on random data:

from numpy.random import random

n_players = 100
n_features = 4

players = random((n_players, n_features))
m = random(n_features)

print('Players:')
print(players)
print('Ideal average:')
print('    ', m)

i_p, i_q = find_bestavg_pair(players, m)
p, q = players[i_p], players[i_q]

print()
print('Found best pair:')
print('    ', p)
print('    ', q)
print('Average:')
print('    ', (p+q)/2)
print('Distance from average to ideal average:')
print('    ', norm(m - (p+q)/2))

Output:

Players:
[[0.92374607 0.17773757 0.86935844 0.23778054]
 [0.66040649 0.24189426 0.28202681 0.29389967]
 [0.5913366  0.67835115 0.68202081 0.34955092]
 ...
 [0.89292432 0.25801491 0.4732747  0.39460182]
 [0.97629877 0.32180046 0.32901506 0.64657499]]
Ideal average:
     [0.8075199  0.41172765 0.94341118 0.87773564]

Found best pair:
     [0.9056168  0.72582579 0.97190164 0.77675535]
     [0.56901592 0.06824099 0.97213458 0.81913477]
Average:
     [0.73731636 0.39703339 0.97201811 0.79794506]
Distance from average to ideal average:
     0.11103762079607322

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜