
Predicting game outcomes at a point in time

Hi, I am trying to use historical data I hold in an MSSQL database to predict the outcome of soccer games (win, tie or loss for the home team) at any point in a match, based on the minutes played and the scoreline.

What I had envisaged as output was something like what FanGraphs does for baseball: http://www.fangraphs.com/scoreboard.aspx?date=2010-11-01

although with two lines, as there are three rather than two possible outcomes.

From the data and the existing tables I can create game records like this:

Time  TeamID  Venue  MatchID  Result
   6  TOT     H      5        W
  27  ASV     A      5        W
  58  ASV     A      5        W
  66  TOT     H      5        W
  77  TOT     H      5        W

So on the graph for this game, the home team TOT would start with the win line at around 45% (based on the historical probability of a home win). It would spike when they score their goal, dip significantly after ASV score twice, probably be above 90% when they score to go 3-2 up, and then rise gently to 100% at the closing 90-minute mark.

So I want to go through the 7500 games I have data on and, based on them, establish for every minute of a 90-minute game the chances of a win, tie or loss for the home team.

For instance, in the simplest situation, after 1 minute of play, 44 of the home teams had actually scored: 33 of them went on to win, 6 tied and 5 lost. The corresponding case where the away team scored has been 9 wins, 8 ties and 23 losses for the home team. However, I am having trouble getting my head around how to get the scoreline for each of the 90 minutes and compare it with the final result. (Only one goal can be scored in any specific minute.)

TIA for any help


There will always be things you can add to the model, but the first thing I would do is, for each game, pull out the score at each minute, and assume that the probability of winning doesn't depend on when the goal was scored, but depends on what the score is now.

So now you would have 90 data points per game (shown here at 10-minute intervals for brevity).

game1: 
    Minute:   0    10    20    30    40    50    60    70    80    90
    Score :[0 0] [0 0] [0 1] [0 1] [1 1] [1 1] [1 1] [2 1] [2 1] [3 1]
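
As a rough sketch of that step, here is one way to expand a list of goal events into a score for every minute. The function name `score_by_minute` and the `(minute, team)` tuple layout are my own assumptions for illustration; the events are the ones from MatchID 5 in the question.

```python
def score_by_minute(goals, home_team, minutes=90):
    """Return timeline[m] = (home_goals, away_goals) for m = 0..minutes.

    goals: iterable of (minute, team_id) tuples for a single match.
    Assumes at most one goal per minute, as stated in the question.
    """
    goal_minutes = {minute: team for minute, team in goals}
    home = away = 0
    timeline = []
    for minute in range(minutes + 1):
        team = goal_minutes.get(minute)
        if team is not None:
            if team == home_team:
                home += 1
            else:
                away += 1
        timeline.append((home, away))
    return timeline


# Goal events for MatchID 5 from the question: (minute, team)
match5 = [(6, "TOT"), (27, "ASV"), (58, "ASV"), (66, "TOT"), (77, "TOT")]
timeline = score_by_minute(match5, home_team="TOT")
print(timeline[26], timeline[27], timeline[90])  # (1, 0) (1, 1) (3, 2)
```

The same expansion could probably be done in SQL by joining the goal records against a table of minutes, but doing it in application code keeps the logic easy to check.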

The next thing I would do is, for each minute slice, add up the number of wins, losses, and draws over all games, for each configuration of scores.

So each entry in that table might correspond to something like this:

@minute 27, for score {home:5, away:2} : {homeWins: 9, draws:1, homeLosses:0}

You might want to try using the difference in score instead of the actual score values, which pools more games into each entry.
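
A minimal sketch of that tally, keyed on (minute, goal difference). It assumes each game is already available as a per-minute list of (home, away) scores plus a final result from the home team's point of view; `tally_outcomes`, the 'W'/'D'/'L' codes and the tiny three-minute toy games are all made up for illustration.

```python
from collections import defaultdict


def tally_outcomes(games):
    """Count final results for every (minute, goal_diff) situation.

    games: iterable of (timeline, result), where timeline[m] = (home, away)
    and result is 'W', 'D' or 'L' for the home team.
    """
    counts = defaultdict(lambda: {"W": 0, "D": 0, "L": 0})
    for timeline, result in games:
        for minute, (home, away) in enumerate(timeline):
            counts[(minute, home - away)][result] += 1
    return counts


# Toy data: three very short "games", minutes 0..2 only.
toy_games = [
    ([(0, 0), (1, 0), (1, 0)], "W"),
    ([(0, 0), (1, 0), (1, 1)], "D"),
    ([(0, 0), (0, 1), (0, 1)], "L"),
]
counts = tally_outcomes(toy_games)
print(counts[(1, 1)])  # {'W': 1, 'D': 1, 'L': 0}
```

Swapping the key to `(minute, home, away)` gives the per-scoreline version instead, at the cost of far fewer games behind each entry.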

Either way, once you have the data formatted that way, getting a reasonable solution is easy.

If a game is on, and it's minute 27, and the score is {home:5, away:2}, the maximum-likelihood (MLE) estimate is 90% wins, 10% draws, 0% losses (according to the example table entry above).

So you can see already how it will help to include Laplace smoothing: adding +1 to the final values of each of those win/lose/draw counters. This way, if you've never seen a loss in this exact situation, you don't say it's impossible. Impossible is a very strong word (look up the beta or Dirichlet distributions for background).
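
To make the +1 concrete, here is a small sketch using the example entry (9 wins, 1 draw, 0 losses). The function name `smoothed_probs` and the `alpha` parameter are my own; `alpha=1` corresponds to the +1 described above.

```python
def smoothed_probs(wins, draws, losses, alpha=1.0):
    """Add-alpha (Laplace) smoothed estimate of (P(win), P(draw), P(loss))."""
    total = wins + draws + losses + 3 * alpha
    return (
        (wins + alpha) / total,
        (draws + alpha) / total,
        (losses + alpha) / total,
    )


# Raw MLE for 9/1/0 would be (0.9, 0.1, 0.0); smoothing gives 10/13, 2/13, 1/13.
p_seen = smoothed_probs(9, 1, 0)
print(p_seen)  # roughly (0.77, 0.15, 0.08)

# A situation never seen before falls back to equal thirds.
p_unseen = smoothed_probs(0, 0, 0)
print(p_unseen)
```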

The obvious problem with this approach is that if you've never seen a particular score combination before it will predict (33%,33%,33%), which is obviously wrong in some cases.

The simplest fix would be to enforce a rule like "leading by 6 points is at least as good as leading by 5 points". It's ugly, but it's a start.

To avoid that sort of special-case logic you could try averaging that approach with a Monte Carlo approximation.

The simplest of those approaches is to say: over all my data I expect about a 1 in 30 chance of a goal by each team in each minute of play -> simulate the game 10000 times from the current point, count the number of wins/losses/draws and you're done.
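
Something along these lines, as a sketch only. The 1-in-30 rate is the rough figure from above, and the function name, the `seed` argument and the simplification that both teams draw independently each minute are my assumptions (so this toy version does allow both teams to score in the same minute).

```python
import random


def simulate_outcomes(home, away, minute, p_goal=1 / 30,
                      n_sims=10_000, total_minutes=90, seed=None):
    """Estimate (P(win), P(draw), P(loss)) for the home team by simulation."""
    rng = random.Random(seed)
    wins = draws = losses = 0
    for _ in range(n_sims):
        h, a = home, away
        # Play out each remaining minute with a fixed scoring chance per team.
        for _ in range(minute, total_minutes):
            if rng.random() < p_goal:
                h += 1
            if rng.random() < p_goal:
                a += 1
        if h > a:
            wins += 1
        elif h == a:
            draws += 1
        else:
            losses += 1
    return wins / n_sims, draws / n_sims, losses / n_sims


# Home side 3-2 up at minute 77, as in the question's example match.
print(simulate_outcomes(3, 2, 77, seed=0))
```

In practice you would estimate the per-minute rates from your own 7500 games, probably separately for home and away sides.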

If that's too random, or too processor-intensive, switch to Markov chains.
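
A sketch of what the Markov chain version might look like: the state is the goal difference, and each minute it moves up by one, down by one, or stays put. This gives exact probabilities under the model instead of a random estimate. The function name and the fixed 1-in-30 rates are assumptions for illustration; here at most one goal is scored per minute, matching the question.

```python
def markov_outcomes(goal_diff, minute, p_home=1 / 30, p_away=1 / 30,
                    total_minutes=90):
    """Exact (P(win), P(draw), P(loss)) for the home team under a simple
    per-minute Markov chain on the goal difference."""
    dist = {goal_diff: 1.0}  # probability of each goal difference right now
    steps = ((1, p_home), (-1, p_away), (0, 1.0 - p_home - p_away))
    for _ in range(minute, total_minutes):
        nxt = {}
        for diff, prob in dist.items():
            for step, p in steps:
                nxt[diff + step] = nxt.get(diff + step, 0.0) + prob * p
        dist = nxt
    win = sum(p for d, p in dist.items() if d > 0)
    draw = dist.get(0, 0.0)
    loss = sum(p for d, p in dist.items() if d < 0)
    return win, draw, loss


# One goal up with a single minute left: win 29/30, draw 1/30, loss 0.
print(markov_outcomes(1, 89))
```

The per-minute rates could themselves depend on the minute or the current score if the data supports it; the loop structure stays the same.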
