python statistical analysis
Given 15 players - 2 Goalkeepers, 5 defenders, 5 midfielders and 3 strikers, and the fact that each has a value and a score, I want to calculate the highest scoring team for the money I have. Each team must consist of 1 GK then a formation e.g. 4:4:2, 4:3:3 etc. I started with sample data such as this
player role points cost
I then did the following to evaluate all combinations
read each line into a list (for each role) then use itertools in a nested run to get all combinations
if line[1] == "G": G.append(line[0])
if line[1] == "D": D.append(line[0])
if line[1] == "M": M.append(line[0])
if line[1] == "S": S.append(line[0])
for gk in itertools.combinations(G,1):
for de in itertools.combinations(D,4):
for mi in itertools.combinations(M,4):
for st in itertools.combinations(S,2):
开发者_如何学Goteams[str(count)]= " ".join(gk)+" "+" ".join(de)+" "+" ".join(mi)+" "+" ".join(st)
count +=1
Having got the teams, I calculate their points value, and the team cost. If it's lower than the threshold, I print it.
But if I now make this 20 goalkeepers, 150 defenders, 150 midfielders and 100 strikers, I understandably get out of memory. What could I do to perform this analysis? Is it a generator rather than a recursive function that I need?Many thanks
You might be able to solve this problem with recursion. The following shows the basic outline, but ignores details like a team being composed of a certain number of certain types of players.
players=[{'name':'A','score':5,'cost':10},
{'name':'B','score':10,'cost':3},
{'name':'C','score':6,'cost':8}]
def player_cost(player):
return player['cost']
def player_score(player):
return player['score']
def total_score(players):
return sum(player['score'] for player in players)
def finance_team_recurse(budget, available_players):
affordable_players=[]
for player in available_players:
if player_cost(player)<=budget:
# Since we've ordered available players, the first player appended
# will be the one with the highest score.
affordable_players.append(player)
result=[]
if affordable_players:
candidate_player=affordable_players[0]
other_players=affordable_players[1:]
# if you include candidate_player on your team
team_with_candidate=finance_team_recurse(budget-player_cost(candidate_player),
other_players)
team_with_candidate.append(candidate_player)
score_of_team_with_candidate=total_score(team_with_candidate)
if score_of_team_with_candidate>total_score(other_players):
result=team_with_candidate
else:
# if you exclude candidate_player from your team
team_without_candidate=finance_team_recurse(budget, other_players)
score_of_team_without_candidate=total_score(team_without_candidate)
if score_of_team_with_candidate>score_of_team_without_candidate:
result=team_with_candidate
else:
result=team_without_candidate
return result
def finance_team(budget, available_players):
tmp=available_players[:]
# Sort so player with highest score is first. (Greedy algorithm?)
tmp.sort(key=player_score, reverse=True)
return finance_team_recurse(budget,tmp)
print(finance_team(20,players))
# [{'score': 6, 'cost': 8, 'name': 'C'}, {'score': 10, 'cost': 3, 'name': 'B'}]
20 choose 1 = 20 combinations
150 choose 4 = 20260275 combinations
100 choose 2 = 4950 combinations
So there are a total of 20*20260275*20260275*4950 = 40637395564486875000L
items in the teams
dict. That takes a lot of memory.
for gk in itertools.combinations(G,1):
for de in itertools.combinations(D,4):
for mi in itertools.combinations(M,4):
for st in itertools.combinations(S,2):
#Don't collect the results into a dict.
#That's what's killing you (memory-wise).
#Just compute the cost and
#Just print the result here.
PS. 40637395564486875000L is on the order of 10**19
. Assuming your program can process 10**6
combinations per second, it will take about 1.3 millions years for the program to complete...
Functions and generators help a lot:
def make_teams(G, D, M, S):
""" returns all possible teams """
for gk in itertools.combinations(G,1):
for de in itertools.combinations(D,4):
for mi in itertools.combinations(M,4):
for st in itertools.combinations(S,2):
yield gk, de, mi, st
def get_cost( team ):
return sum( member.cost for member in team )
def good_teams( min_score=0):
for team in make_teams(G, D, M, S):
if get_cost( team ) > min_score:
yield team
for team in good_teams(min_score=100):
print team
It still generates all possible combinations, so you'll probably run out of time now, instead of memory.
What you're doing seems like a variation of the knapsack problem - you can do better than try all possible combination, but not much better.
One way to get a good solution fast is to sort the players by their score per money. You should get the top scoring teams first, but there are no guarantees that you get the best possible solution. Wikipedia calls this the "Greedy approximation algorithm".
def score_per_cost( player ):
return player.score / player.cost
def sorted_combinations(seq, n):
return itertools.combinations(
sorted(seq, key=score_per_cost, reverse=True),n)
def make_teams(G, D, M, S):
""" returns all possible teams """
for gk in sorted_combinations(G,1):
for de in sorted_combinations(D,4):
for mi in sorted_combinations(M,4):
for st in sorted_combinations(S,2):
yield gk, de, mi, st
def get_cost( team ):
return sum( member.cost for member in team )
def top_teams(n):
return itertools.islice(make_teams(G, D, M, S),n)
for team in top_teams(100):
print team
I'll leave adding the "cost per team < threshold" requirement to the reader (hint: it's one line in make_teams
:p).
精彩评论