Date: 06/08/07
Abstract:
This report discusses using a machine learning approach (neural networks) to predict
the outcomes of 2-team matches, specifically applied to the English Premier League
2006-07 Soccer matches. The data obtained from the past matches in the league are
used to make better predictions for the future matches.
Introduction:
This report explores the use of various machine learning techniques for predicting the
outcomes of matches in which 2 teams play against each other. There are 3 possible
outcomes: a win, a loss or a draw. The soccer matches of the English Premier League
played during the 2006-07 season were chosen as the domain.
The league consists of 20 teams playing against each other over a span of close to one
year. Each team plays twice against every other team in the league, once at home
(the home team's ground) and once away (the opponent team's ground). 3 points are
awarded for a win, 1 point each for a draw and no points for a loss. The league does
not have any knockout matches, and the winner of the league is the team that has
amassed the most points by the end of the season.
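The schedule and points rules above fix the size of the season. The following is a minimal sketch of that arithmetic; the function names are chosen only for illustration and are not part of the report.

```python
from itertools import permutations

TEAMS = 20  # number of teams in the league, as stated above

def season_fixtures(n_teams):
    """Every ordered (home, away) pair is one match, so each pair meets twice."""
    return list(permutations(range(n_teams), 2))

def points(goals_for, goals_against):
    """League points for one team in one match: 3 for a win, 1 for a draw, 0 for a loss."""
    if goals_for > goals_against:
        return 3
    if goals_for == goals_against:
        return 1
    return 0

fixtures = season_fixtures(TEAMS)
matches_per_team = sum(1 for home, away in fixtures if 0 in (home, away))

print(len(fixtures))       # 380 matches in the season (20 * 19)
print(matches_per_team)    # 38 matches per team (19 home, 19 away)
```

The total of 380 matches is the size of the dataset used in the experiments.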
Soccer matches, or for that matter any competitive matches, are inherently
unpredictable in nature. The way a team plays on a given day depends on a
number of features like the attacking, defensive and mid-field abilities of the team
(team ratings), the capabilities of the individual players (player ratings) and even their
psychology going into the game or the cheering received from the crowd. But on any
given day, it is difficult to predict the outcome of a game played between 2 evenly
matched teams. This applies even more to football, because it is such a low-scoring
game (with only 2 to 3 goals scored by any team on average) and a moment of
brilliance or stupidity can cost the team a match. Hence, it is a great challenge to pick
good features and be able to predict the outcome of the game with good accuracy
(close to that of the pundits).
The prediction results, if found to be good, can be used in match betting (on online
websites like TopTipper).
Learning Algorithm:
Neural networks have been used as the primary learning algorithm for learning the
patterns in the data. An artificial neural network is modelled on the way the neurons in
the human body pass/suppress signals. Weights are associated with each of the inputs
(26 features in our case) and an activation function is applied to their weighted sum
(including a bias term) to get a new set of inputs. We can have multiple such hidden
layers before reaching a final layer with the required classification (win/draw/loss in our
case). Using back-propagation, the weights are adjusted in the direction that reduces
the mean-squared error on the training data.
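The forward pass described above (weighted sum, bias, activation, then a final layer with one output per class) can be sketched as follows. This is only an illustration under stated assumptions: the sigmoid activation, the 5 hidden units, the random weights and the made-up feature vector are not taken from the report, and the back-propagation training step is omitted.

```python
import math
import random

N_FEATURES = 26   # input features per match, as stated in the text
N_HIDDEN = 5      # hidden units; an illustrative choice
N_CLASSES = 3     # win / draw / loss

def sigmoid(z):
    """Assumed activation function; the report does not name one."""
    return 1.0 / (1.0 + math.exp(-z))

def init_layer(n_in, n_out, rng):
    """Small random weights; the last entry of each row is the bias weight."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]

def forward(layer, inputs):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    return [sigmoid(sum(w * x for w, x in zip(row[:-1], inputs)) + row[-1])
            for row in layer]

def predict(hidden, output, features):
    """Return the index of the strongest output unit and all output scores."""
    scores = forward(output, forward(hidden, features))
    return scores.index(max(scores)), scores

rng = random.Random(0)
hidden = init_layer(N_FEATURES, N_HIDDEN, rng)
output = init_layer(N_HIDDEN, N_CLASSES, rng)
features = [rng.random() for _ in range(N_FEATURES)]  # made-up feature vector
label, scores = predict(hidden, output, features)
```

With untrained weights the predicted label is arbitrary; training would adjust the rows of `hidden` and `output` to reduce the error on the training matches.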
This type of network topology is well-suited for our domain, because we have multiple
discrete features which need to be assigned different weights according to their
individual contribution to the solution. The use of multiple hidden layers facilitates
capturing complex relations between the input features and the output classes (win,
draw or loss). Neural networks are a state-of-the-art learning algorithm which works well in
such domains, and hence they were chosen as the primary learning algorithm. Also, since the
dataset is not too big (380 matches with 26 features each), it is realistic to train the
neural network in reasonable time.
The ensemble methods Bagging and Boosting are applied over the neural networks to
try to achieve better results. Other learning algorithms (Decision Trees, k-Nearest
Neighbours and Naive Bayes) have also been used to evaluate the data.
The Weka package was used for running the various learning algorithms on the data.
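The idea behind Bagging can be sketched in a few lines: train several models on bootstrap samples of the data and let them vote. This is a hedged illustration, not the Weka implementation used in the report. To keep it short, the base learner is a 1-nearest-neighbour stand-in rather than a neural network, and the data are made up.

```python
import random
from collections import Counter

def train_1nn(sample):
    """Stand-in base learner (NOT a neural network): 1-nearest-neighbour."""
    def classify(x):
        nearest = min(sample,
                      key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)))
        return nearest[1]
    return classify

def bagging(data, n_models, rng, train=train_1nn):
    """Train each model on a bootstrap sample drawn with replacement."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]
        models.append(train(sample))
    return models

def vote(models, x):
    """Majority vote across the ensemble."""
    return Counter(model(x) for model in models).most_common(1)[0][0]

# Made-up two-feature examples, each labelled with a match outcome.
data = [((0.9, 0.1), "win"), ((0.8, 0.2), "win"), ((0.5, 0.5), "draw"),
        ((0.4, 0.6), "draw"), ((0.1, 0.9), "loss"), ((0.2, 0.8), "loss")]
models = bagging(data, n_models=30, rng=random.Random(1))
prediction = vote(models, (0.85, 0.15))
```

Boosting differs in that later models are trained with more weight on the examples earlier models got wrong, rather than on independent bootstrap samples.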
Experimental Results:
The processed data from the 380 soccer matches of the English Premier League
2006-07 season was tested using various machine learning algorithms (as
listed in Table-1) with different tuning parameters. 10-fold cross-validation was used
for all the experiments.
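For reference, 10-fold cross-validation over 380 matches can be sketched as below. This is a plain (unstratified) split written for illustration; the procedure Weka applies may differ in detail, for example by stratifying on the class, and the function names are hypothetical.

```python
import random

def k_fold_indices(n_items, k, seed=0):
    """Shuffle the item indices and deal them into k near-equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def train_test_splits(n_items, k, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    folds = k_fold_indices(n_items, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(train_test_splits(380, 10))
print([len(test) for _, test in splits])  # ten folds of 38 matches each
```

Each match is used for testing exactly once and for training nine times, so the reported accuracy covers all 380 matches.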
Table-1

| #  | Learning Algorithm                       | % Correct Predictions | % Incorrect | # Correct | # Incorrect | Algorithm parameters (tuning) |
|----|------------------------------------------|-----------------------|-------------|-----------|-------------|-------------------------------|
| 1  | Neural Networks                          | 60.79                 | 39.21       | 231       | 149         | Hidden layers - 1; Training Time - 1,000 |
| 2a | Neural Networks                          | 63.42                 | 36.58       | 241       | 139         | Hidden layers - 5; Training Time - 1,000 |
| 2b | Neural Networks                          | 63.42                 | 36.58       | 241       | 139         | Hidden layers - 5; Training Time - 10,000 |
| 2c | Neural Networks + Bagging                | 65.53                 | 34.47       | 249       | 131         | Hidden layers - 5; Training Time - 1,000; Bagging: numIterations 30 |
| 2d | Neural Networks + Boosting (AdaboostM1)  | 64.74                 | 35.26       | 246       | 134         | Hidden layers - 5; Training Time - 1,000; Boosting: maxIterations 30, weightThreshold - 1,000 |
| 3  | Neural Networks                          | 62.63                 | 37.37       | 238       | 142         | Hidden layers - 10; Training Time - 1,000 |
| 4  | Neural Networks                          | 63.16                 | 36.84       | 240       | 140         | Hidden layers - 20; Training Time - 1,000 |
| 5  | Neural Networks                          | 62.89                 | 37.11       | 239       | 141         | Hidden layers - 40; Training Time - 1,000 |
| 6  | Decision Trees (J48)                     | 55.26                 | 44.74       | 210       | 170         | Unpruned True |
| 7  | k-Nearest Neighbours (IBk)               | 55.00                 | 45.00       | 209       | 171         | KNN - 1; Cross-validate True |
| 8  | k-Nearest Neighbours (IBk)               | 57.37                 | 42.63       | 218       | 162         | KNN - 5; Cross-validate True |
| 9  | k-Nearest Neighbours (IBk)               | 62.37                 | 37.63       | 237       | 143         | KNN - 10; Cross-validate True |
| 10 | k-Nearest Neighbours (IBk)               | 62.11                 | 37.89       | 236       | 144         | KNN - 20; Cross-validate True |
| 11 | Naive Bayes Simple                       | 52.11                 | 47.89       | 198       | 182         | - |
Conclusions:
Though soccer and other sports matches are said to be unpredictable, our
experiments show that past history can be used, up to a point, to predict
the outcomes of future matches. The important issue is extracting appropriate
features from the available data.
Some features like team rankings have been intentionally left out of the set of
features, so that there is no bias for any team to start with and learning can
improve as data from more matches is observed. Also, keeping certain features
distinct, like goals scored/conceded at home/away, instead of
aggregating them, extracted more information from such data for use in
our predictions. These considerations might have contributed to getting better
results than McCabe, who used the same learning algorithm on similar data.
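The difference between distinct and aggregated goal features can be sketched as follows. The match tuple layout and the feature names here are hypothetical, since the report does not list its 26 features, and the results are made up.

```python
from collections import defaultdict

def split_goal_features(matches):
    """Per-team goal totals, kept separate for home and away games."""
    stats = defaultdict(lambda: {"scored_home": 0, "conceded_home": 0,
                                 "scored_away": 0, "conceded_away": 0})
    for home, away, home_goals, away_goals in matches:
        stats[home]["scored_home"] += home_goals
        stats[home]["conceded_home"] += away_goals
        stats[away]["scored_away"] += away_goals
        stats[away]["conceded_away"] += home_goals
    return dict(stats)

# Made-up results: (home team, away team, home goals, away goals)
past = [("A", "B", 2, 0), ("B", "A", 1, 1), ("A", "C", 0, 3)]
features = split_goal_features(past)
print(features["A"])
# {'scored_home': 2, 'conceded_home': 3, 'scored_away': 1, 'conceded_away': 1}
```

An aggregated version would report only 3 goals scored and 4 conceded for team A, which hides the difference between its home and away record.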
It can be observed that the results of sports matches are not entirely dependent on
past data (which would otherwise make the game very boring) and are far less
predictable than tasks like character/face recognition, where the percentage accuracy
can be in the high 90s. Hence, a prediction accuracy close to 65% can be considered
quite good, considering that soccer matches have 3 possible outcomes (win,
loss and draw) and hence an accuracy of about 33% in the random case.
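The comparison with random guessing is simple arithmetic. The sketch below uses the best count from Table-1 (249 of 380 matches, Neural Networks + Bagging) and assumes the random case means guessing uniformly among the 3 outcomes.

```python
TOTAL_MATCHES = 380
BEST_CORRECT = 249   # Neural Networks + Bagging, from Table-1

accuracy = BEST_CORRECT / TOTAL_MATCHES
baseline = 1 / 3     # uniform random guess among win, draw and loss

print(round(100 * accuracy, 2))       # 65.53
print(round(100 * baseline, 2))       # 33.33
print(round(accuracy / baseline, 2))  # 1.97
```

On these numbers the best model is right roughly twice as often as a uniform random guess.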
Future Work:
Certain important qualitative data, like the rankings of the players in each of the teams,
provide valuable insights into the strengths of the 2 teams playing a match. According
to the experts, any missing or injured lead players have a major impact on the outcome of
a game. Though quantifying the importance of players is
subjective, the ratings of the players can give a fairly useful picture. Head-to-head
statistics (for the last n games) of the teams going into a game are an important feature as
well. Incorporating all these features can be a good future direction for gaining better
accuracy in predicting the outcomes of the games.
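As one possible reading of the head-to-head feature, the sketch below counts a team's wins, draws and losses in its last n meetings with a given opponent. The tuple layout, the function name and the results are hypothetical, and the match list is assumed to be in chronological order.

```python
def head_to_head(matches, team, opponent, n):
    """Record of `team` in its last n meetings with `opponent`.

    `matches` holds (home, away, home_goals, away_goals) tuples and is
    assumed to be sorted from oldest to newest.
    """
    meetings = [m for m in matches if {m[0], m[1]} == {team, opponent}]
    recent = meetings[-n:] if n > 0 else []
    record = {"win": 0, "draw": 0, "loss": 0}
    for home, away, home_goals, away_goals in recent:
        if home == team:
            own, other = home_goals, away_goals
        else:
            own, other = away_goals, home_goals
        if own > other:
            record["win"] += 1
        elif own == other:
            record["draw"] += 1
        else:
            record["loss"] += 1
    return record

# Made-up results, oldest first.
past = [("A", "B", 2, 0), ("B", "A", 1, 1), ("A", "C", 0, 3), ("B", "A", 3, 1)]
print(head_to_head(past, "A", "B", 2))  # {'win': 0, 'draw': 1, 'loss': 1}
```

The three counts could be added to the feature vector of each match alongside the existing features.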
The same algorithm, with some changes in the feature selection, can be used to predict
the outcomes of matches in other sports like Football, Cricket etc. The more
well-behaved the games, the more accurate our predictions will be.
References:
1. Alan McCabe, "An Artificially Intelligent Sports Tipper," in Proceedings: 15th
Australian Joint Conference on Artificial Intelligence (2002)
2. A. P. Rotshtein, M. Posner and A. B. Rakityanskaya, "Football Predictions Based on a
Fuzzy Model with Genetic and Neural Tuning," Cybernetics and Systems Analysis
Journal (2005)