
Submitted by: Radha-Krishna Balla

Date: 06/08/07

Soccer Match Result Prediction using Neural Networks


(CS534 Project Report)

Abstract:
This report discusses using a machine learning approach (neural networks) to predict
the outcomes of 2-team matches, specifically applied to the English Premier League
2006-07 Soccer matches. The data obtained from the past matches in the league are
used to make better predictions for the future matches.

Introduction:
This report explores the use of machine learning techniques for predicting the
outcomes of matches in which 2 teams play against each other. There are 3 possible
outcomes: a win, a loss or a draw. The soccer matches of the English Premier League
played during the 2006-07 season were chosen as the domain.
The league consists of 20 teams playing against each other over a span of close to one
year. Each team plays every other team in the league twice, once at home
(the home team's ground) and once away (the opponent's ground). 3 points are
awarded for a win, 1 point to each team for a draw and no points for a loss. The league does
not have any knockout matches, and the winner is the team that has amassed the
most points by the end of the season.
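The league format above can be checked with a little arithmetic. The following is a minimal sketch; the function names and the example win/draw/loss record are illustrative assumptions, not part of the original project.

```python
from itertools import permutations


def fixture_count(num_teams: int) -> int:
    """Count matches when every ordered (home, away) pair of teams meets once."""
    return sum(1 for _ in permutations(range(num_teams), 2))


def league_points(wins: int, draws: int, losses: int) -> int:
    """Points under the 3/1/0 scheme described above."""
    return 3 * wins + 1 * draws + 0 * losses


total_matches = fixture_count(20)      # 20 teams, home and away
example_points = league_points(28, 5, 5)  # hypothetical 38-game record
print(total_matches, example_points)
```

With 20 teams this gives 20 x 19 = 380 matches, which is the dataset size used later in the report.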
Soccer matches, or for that matter any competitive matches, are
inherently unpredictable. The way a team plays on a given day depends on a
number of factors such as the attacking, defensive and mid-field abilities of the team
(team ratings), the capabilities of the individual players (player ratings) and even their
psychology going into the game or the support received from the crowd. Even so, on any
given day it is difficult to predict the outcome of a game played between 2 evenly
matched teams. This applies all the more to football, because it is such a low-scoring
game (typically only 2 to 3 goals are scored in a match) and a moment of
brilliance or stupidity can cost a team the match. Hence, it is a great challenge to pick
good features and to predict the outcome of a game with good accuracy
(close to that of the pundits).
The predictions, if found to be good, can be used in match betting (on
websites such as TopTipper).

Feature selection and Data preprocessing:


The data for the soccer matches played in the English Premier League during the 2006-07 season was extracted from the BBC Sport website.
Most of the time in this project was spent on pre-processing the data. The
data available on the website was parsed and loaded into an SQL Server database.
Queries were written to extract meaningful features from this data. The following
features were extracted: MatchesPlayed, LeaguePoints, Home_Won,
Home_Drawn, Home_Lost, Home_GoalsScored, Home_GoalsConceded, Away_Won,
Away_Drawn, Away_Lost, Away_GoalsScored, Away_GoalsConceded,
RecentForm_Points.
These 13 features were extracted for both the home team and the away team of
every match (26 features per match), based on the statistics available up to that point
in time. The data for home and away games was treated separately, because previous
league seasons show that results differ markedly depending on which team plays at home,
and hence the home factor is very important in predicting the outcome of any match.
The last feature (RecentForm_Points) is obtained by summing the points earned in
the team's last 6 games, which indicates the form of the team. It is a common belief
that the current form of a team has a high impact on the way it performs in its
next match.
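The idea of computing each feature only from matches played before the one being predicted can be sketched as follows. This is a simplified illustration in Python rather than the SQL queries used in the project; it covers only three of the features, and the input format (a date-ordered list of home, away, home goals, away goals) is an assumption.

```python
from collections import defaultdict, deque

POINTS = {"W": 3, "D": 1, "L": 0}


def new_team_state():
    # "recent" keeps only the points from the last 6 games (RecentForm_Points).
    return {"MatchesPlayed": 0, "LeaguePoints": 0, "recent": deque(maxlen=6)}


def snapshot(state):
    """Copy a team's statistics as they stand before the match is played."""
    return {
        "MatchesPlayed": state["MatchesPlayed"],
        "LeaguePoints": state["LeaguePoints"],
        "RecentForm_Points": sum(state["recent"]),
    }


def pre_match_features(matches):
    """Return (home_features, away_features, home_result) for each match."""
    teams = defaultdict(new_team_state)
    rows = []
    for home, away, home_goals, away_goals in matches:
        # Features are taken first, so they never include the current match.
        home_feats, away_feats = snapshot(teams[home]), snapshot(teams[away])
        result = "W" if home_goals > away_goals else "D" if home_goals == away_goals else "L"
        rows.append((home_feats, away_feats, result))
        # Only now update both teams with the outcome of this match.
        away_result = {"W": "L", "D": "D", "L": "W"}[result]
        for team, res in ((home, result), (away, away_result)):
            teams[team]["MatchesPlayed"] += 1
            teams[team]["LeaguePoints"] += POINTS[res]
            teams[team]["recent"].append(POINTS[res])
    return rows


# Hypothetical fixtures, for illustration only.
rows = pre_match_features([("A", "B", 2, 0), ("B", "A", 1, 1), ("A", "C", 0, 1)])
```

The remaining features (home/away wins, draws, losses and goals) would follow the same pattern, with separate counters for home and away games.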

Learning Algorithm:
Neural networks were used as the primary learning algorithm for learning the
patterns in the data. An artificial neural network is modelled on the way neurons in
the human body pass or suppress signals. A weight is associated with each of the inputs
(26 features in our case), and an activation function is applied to their weighted sum
(including a bias term) to produce the inputs to the next layer. There can be multiple such hidden
layers before reaching a final layer with the required classification (win/draw/loss in our
case). Using back-propagation, the weights are adjusted in the direction that reduces the
mean-squared error on the training data.
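The weighted-sum-plus-activation step described above can be illustrated with a small forward pass. This is only a sketch under stated assumptions: the weights are random, the inputs are placeholders rather than real match data, a sigmoid activation is assumed, and the network shape (26 inputs, one hidden layer of 5 units, 3 outputs) is one possible reading of the settings used later. Back-propagation itself is not shown.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """One fully connected layer: sigmoid(weighted sum + bias) for each unit."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]


def init_layer(n_in, n_out, rng):
    """Small random weights and biases (illustrative initialisation)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]
    return weights, biases


def forward(features, params):
    """Pass the features through every layer in turn."""
    activations = features
    for weights, biases in params:
        activations = layer(activations, weights, biases)
    return activations


def mean_squared_error(outputs, targets):
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(targets)


CLASSES = ["win", "draw", "loss"]
rng = random.Random(0)
params = [init_layer(26, 5, rng), init_layer(5, 3, rng)]  # assumed shape
features = [rng.random() for _ in range(26)]              # placeholder inputs
outputs = forward(features, params)
prediction = CLASSES[outputs.index(max(outputs))]
error = mean_squared_error(outputs, [1.0, 0.0, 0.0])      # target: home win
```

Training would repeatedly adjust the weights in `params` to reduce `error` over all training matches.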

This type of network topology is well-suited to our domain, because we have multiple
discrete features which need to be assigned different weights according to their
individual contribution to the solution. The use of multiple hidden layers facilitates
capturing complex relations between the input features and the output classes (win,
draw or loss). Neural networks are a state-of-the-art learning algorithm which works well in
such domains, and hence they were chosen as the primary learning algorithm. Also, since the
dataset is not too big (380 matches with 26 features each), it is realistic to train the
neural network in reasonable time.
The ensemble methods Bagging and Boosting were applied over the neural networks to
try to achieve better results. Other learning algorithms (Decision Trees, k-Nearest
Neighbours and Naive Bayes) were also used to evaluate the data.
The Weka package was used for running the various learning algorithms on the data.
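As an illustration of the bagging idea mentioned above, the sketch below trains several models on bootstrap resamples of the data and combines them by majority vote. It is not the Weka implementation used in the project: the base learner is a toy 1-nearest-neighbour classifier standing in for the neural network, and the data points are made up.

```python
import random
from collections import Counter


def bootstrap(data, rng):
    """Draw len(data) examples from data with replacement."""
    return [rng.choice(data) for _ in data]


def bagging_fit(data, train, num_iterations, rng):
    """Train one model per bootstrap sample."""
    return [train(bootstrap(data, rng)) for _ in range(num_iterations)]


def bagging_predict(models, x):
    """Majority vote over the individual model predictions."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


def train_1nn(sample):
    """Toy base learner: predict the label of the closest training example."""
    def predict(x):
        nearest = min(
            sample,
            key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)),
        )
        return nearest[1]
    return predict


# Made-up two-feature examples, for illustration only.
data = [
    ((0.0, 0.1), "loss"), ((0.1, 0.0), "loss"),
    ((0.9, 1.0), "win"), ((1.0, 0.9), "win"),
]
models = bagging_fit(data, train_1nn, num_iterations=30, rng=random.Random(0))
vote = bagging_predict(models, (0.95, 0.95))
```

Boosting differs in that each new model is trained with more weight on the examples the previous models got wrong, rather than on an independent resample.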

Experimental Results:
The processed data for the 380 soccer matches of the English Premier League
2006-07 season was evaluated using the various machine learning algorithms (as
listed in Table-1) and with different tuning parameters. 10-fold cross-validation was used
for all the experiments.
#    Learning Algorithm                        % Correct  % Incorrect  # Correct  # Incorrect  Algorithm parameters (tuning)
1    Neural Networks                           60.79      39.21        231        149          Hidden layers - 1; Training Time - 1,000
2    Neural Networks                           63.42      36.58        241        139          Hidden layers - 5; Training Time - 1,000
2b   Neural Networks                           63.42      36.58        241        139          Hidden layers - 5; Training Time - 10,000
2c   Neural Networks + Bagging                 65.53      34.47        249        131          Hidden layers - 5; Training Time - 1,000; Bagging: numIterations 30
2d   Neural Networks + Boosting (AdaboostM1)   64.74      35.26        246        134          Hidden layers - 5; Training Time - 1,000; Boosting: maxIterations 30, weightThreshold - 1,000
3    Neural Networks                           62.63      37.37        238        142          Hidden layers - 10; Training Time - 1,000
4    Neural Networks                           63.16      36.84        240        140          Hidden layers - 20; Training Time - 1,000
5    Neural Networks                           62.89      37.11        239        141          Hidden layers - 40; Training Time - 1,000
6    Decision Trees (J48)                      55.26      44.74        210        170          Unpruned True
7    k-Nearest Neighbours (IBk)                55.00      45.00        209        171          KNN - 1; Cross-validate True
8    k-Nearest Neighbours (IBk)                57.37      42.63        218        162          KNN - 5; Cross-validate True
9    k-Nearest Neighbours (IBk)                62.37      37.63        237        143          KNN - 10; Cross-validate True
10   k-Nearest Neighbours (IBk)                62.11      37.89        236        144          KNN - 20; Cross-validate True
11   Naive Bayes Simple                        52.11      47.89        198        182          -

Table-1: Prediction accuracy on Soccer matches
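The 10-fold cross-validation behind the figures in Table-1 can be sketched as follows. This is a plain, unstratified split written for illustration; the function name and seed are assumptions, and the tool used for the experiments may partition the data differently (for example by stratifying on the class).

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(380, 10))
fold_sizes = [(len(train), len(test)) for train, test in splits]
```

With 380 matches, each fold tests on 38 matches and trains on the other 342, so every match is used for testing exactly once.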


Among the single (non-ensemble) models, the Neural Networks algorithm with 5 hidden layers (#2)
produced the best result (63.42%). Increasing the number of hidden layers (#3, #4 and #5) did not
produce any gain in prediction accuracy. Increasing the training time from
1,000 to 10,000 (#2b) also did not achieve any performance gain, and training any
further might cause overfitting of the data.
Applying ensemble methods such as bagging and boosting (#2c and #2d) to this
algorithm was observed to increase the accuracy by 1 to 2 percentage points. The
maximum accuracy, 65.53%, was achieved using Neural Networks with
Bagging.
The non-ensemble methods took close to one minute to train the neural networks,
whereas bagging took close to 30 minutes and boosting took 31 minutes. Since the
dataset is not too big, it was possible to train the neural networks in reasonable time.
This dataset is the same as the one used by McCabe (1), who used the Neural
Networks algorithm but with a somewhat different set of features. McCabe reported an
accuracy of 53.2% (202 correct out of 380 games). The best accuracy achieved during
our experiments was 65.53% (249 correct out of 380 games), which is roughly 12
percentage points higher than that of (1).
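The percentages quoted above follow directly from the counts of correct predictions. A minimal check, with a helper name chosen here for illustration:

```python
def accuracy(correct, total=380):
    """Percentage of correct predictions, rounded to two decimals."""
    return round(100.0 * correct / total, 2)


best = accuracy(249)        # Neural Networks + Bagging (#2c)
single = accuracy(241)      # Neural Networks, 5 hidden layers (#2)
mccabe = accuracy(202)      # count reported for McCabe (1)
gain_over_mccabe = round(best - mccabe, 2)
gain_from_bagging = round(best - single, 2)
```

The 202-of-380 count works out to 53.16%, consistent with the 53.2% quoted above to one decimal place.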

Conclusions:

Though soccer and other sports matches are said to be unpredictable, our
experiments show that past history can be used, up to a point, to predict
the outcomes of future matches. The important issue is extracting appropriate
features from the available data.
Some features, such as team rankings, were intentionally left out of the feature
set so that no team starts with a built-in bias and learning improves as
data from more matches is observed. Also, keeping certain features distinct,
such as goals scored/conceded at home and away, instead of
aggregating them, allowed more information to be extracted from the data for use in
our predictions. These considerations might have contributed to obtaining better
results than McCabe, who used the same learning algorithm on similar data.
It can be observed that the results of sports matches are not entirely determined by
past data (which would otherwise make the game very boring) and are far less
predictable than tasks such as character/face recognition, where accuracy
can be in the high 90s. Hence, a prediction accuracy close to 65% can be considered
quite good, given that soccer matches have 3 possible outcomes (win,
loss and draw) and random guessing would therefore achieve an accuracy of about 33%.

Future Work:
Certain important qualitative data, such as the ratings of the players in each of the teams,
provide valuable insights into the strengths of the 2 teams playing a match. According to
the experts, missing or injured key players have a major impact on the outcome of
a game. Though quantifying the importance of players is
subjective, the ratings of the players can give a fairly useful picture. Head-to-head
statistics (for the last n games) of the teams going into a game are an important feature as
well. Incorporating all these features would be a good future direction for gaining better
accuracy in predicting the outcomes of the games.
The same algorithm, with some changes in the feature selection, can be used to predict
the outcomes of matches in other sports leagues, such as football or cricket. The more
well-behaved the games, the more accurate our predictions will be.

References:
1. Alan McCabe, "An Artificially Intelligent Sports Tipper," in Proceedings: 15th
Australian Joint Conference on Artificial Intelligence (2002)
2. A. P. Rotshtein, M. Posner and A. B. Rakityanskaya, "Football Predictions Based on a
Fuzzy Model with Genetic and Neural Tuning," Cybernetics and Systems Analysis
Journal (2005)
