
Result Prediction for European Football Games

Xiaowei Liang
A53204220
Apt 312, 9450 Gilman Drive
La Jolla, California 92092
xil568@eng.ucsd.edu

Zhuodi Liu
A53214122
Apt 615, 9450 Gilman Drive
La Jolla, California 92092
zhl384@eng.ucsd.edu

Rongqi Yan
A53203966
Apt 1106, One Miramar Street
La Jolla, California 43017-6221
roy007@eng.ucsd.edu

KEYWORDS
Classification, Logistic Regression

1 INTRODUCTION
As one of the most popular sports in the world, football attracts huge attention and occupies a great share of the gaming industry. Predicting the result of a match is a difficult task, as it involves many circumstances: ability, chance and also luck! Even the most famous bookmakers make many wrong predictions. In this paper, we aim to build a model that gives comparatively accurate predictions of game results based on data from previous matches and relevant analysis. Our model can handle large amounts of data and beats the bookmakers on the game result classification task.
Figure 1: Probability of match result
2 DATABASE AND TASK
2.1 Database Introduction
Our data comes from a public database called the European Soccer Database. It contains information on more than 25,000 matches and more than 10,000 players. There are 7 tables in this database: Country, League, Match, Player, Player Attributes, Team, and Team Attributes.
2.1.1 Country and League. In these tables, all of the 9 countries
and 9 leagues where match data were collected are recorded.
2.1.2 Match. The main information of the database is in the Match table. Each row contains detailed data for one match: the country and league, both teams, the starting players, basic technical statistics, betting odds and the game result. It contains only the ids of teams and players, which can be used to join with the other tables.
Figure 2: Probability of match result in different leagues
2.1.3 Player and Player Attributes. These tables contain personal information for more than 10,000 players. All of this information comes from FIFA and includes, but is not restricted to: birthday, overall rating, detailed ability ratings, etc.

2.1.4 Team and Team Attributes. These tables contain technical information for several hundred teams, which can be used to reveal the ability of each team.

2.2 Database Exploration


We explore the data-set in the following aspects.

2.2.1 Label Information. Since our task is to predict the result of each match, the label for a match is one of 'win', 'draw' and 'defeat'. We generate this label by comparing the home-goal and away-goal fields in the database, as sketched below. As shown in Figure 1, the home team wins around 46% of all 25,000+ matches, draws 29% and is defeated in 25%.
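A minimal sketch of this label generation, assuming the Match table has been loaded into a pandas DataFrame; the column names home_team_goal and away_team_goal follow the public database schema:

```python
import pandas as pd

def match_label(row: pd.Series) -> str:
    # Label each match from the home team's perspective.
    if row["home_team_goal"] > row["away_team_goal"]:
        return "win"
    if row["home_team_goal"] == row["away_team_goal"]:
        return "draw"
    return "defeat"

# matches = pd.read_sql("SELECT * FROM Match", connection)
# matches["label"] = matches.apply(match_label, axis=1)
```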
2.2.2 League Difference. In different leagues, match results may vary with factors such as coach and game schedule. In Figure 2, we show how match results differ across leagues.

2.2.3 Odd. Bookmakers hire expert data analysts to generate odds for matches, so odds are a useful feature for judging the abilities of both teams. In Figure 3, we show the odds distribution for different match results. Evidently, games won by the home team generally come with high odds, while games won by the away team come with low odds.

Figure 3: Odd distribution by probability

Figure 4: Player ability distribution

2.2.4 Starting Player Ability. The main factor determining the ability of a team is the ability of its eleven starting players. To judge player ability, we take each player's overall rating from the Player Attributes table, which comes from FIFA data. In Figure 4, we show the spread of ability across different players.

2.2.5 Team Technical Ability. The best players cannot guarantee the best team without a suitable team build. Thus, we add team technical indicators that capture the tactical ability of a team. We measure it in seven aspects: build-up play speed, build-up play passing, chance creation passing, chance creation crossing, chance creation shooting, defence pressure and defence aggression. We show these data in Figure 5.

Figure 5: Team ability distribution

2.2.6 Results of Last Several Matches. The results of the last several matches can reveal the current competitive condition of a team, and the history of results between the two teams can reveal their relative strength. We calculate the average goals and average points won by both teams over their last several matches, and their average goals in the matches where they met before. We show the corresponding data in the following figure.

Figure 6: Average rest time for teams in different leagues

2.2.7 Rest Time after Last Match. When a team faces a hard match schedule, it can hardly perform at its normal level in a match. Even the best team may struggle with tiredness. In Figure 6, we show the average rest time for teams in each league.

2.3 Task

We aim to predict the result of a match using features extracted from the database. The prediction takes one of three labels: win, draw and defeat. We evaluate models by the percentage of correct predictions.

2.4 Related Work

On Kaggle, there are many discussions and analyses of this football match dataset. In general, they can be separated into two categories: match outcome prediction and player ability prediction. For match outcome prediction, which is also the focus of this paper, many classifiers such as Logistic Regression and Random Forest are compared against each other to find the best classification approach. Many different features are considered, including betting information, starting players' overall abilities, recent match results, etc. There is also some work trying to predict the goals or scores of matches. The other category, player ability prediction, tries to predict
whether a football player can be a good player or not. In "Can you be a good football player?", the author analyzes players' FIFA ratings according to different leagues, players' height and weight, position, age and different technical abilities such as crossing and finishing. The author then uses a Decision Tree to classify good players, defined as those whose FIFA rating is over 70, and finally achieves 0.668 accuracy.

Some Kaggle users also present interesting findings on this dataset. In "The fans stay loyal, the players move on", the author explores player transfers among different teams in different leagues, presents several graphs of transfers at big football clubs like Manchester United and Arsenal, and lists some players who change teams frequently. Some research concentrates on home advantage and concludes that the home team has an obvious advantage over the away team, which matches our intuition. In "The Most Predictable League", the author uses B365 betting information to calculate the entropy of each league and finds that the Spanish league (La Liga) is the most predictable and the French league is the least predictable. The main reason he finds is that the match results of the two giants of La Liga, FC Barcelona and Real Madrid, are very predictable.
3 MODEL
We select some features of each match from the database, as illustrated in the last section, and use PCA to extract a smaller number of uncorrelated features from them. With these features, we train a classifier based on logistic regression to predict the result of the match.

3.1 Selected Features

3.1.1 Feature of League. We use a one-hot encoding to express this feature. The dimension of the feature is 9: the entry corresponding to the league the match comes from equals 1 and all the others are 0, as sketched below.
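A minimal sketch of this encoding; the helper name and the league-id list are illustrative, not part of the database:

```python
import numpy as np

def league_feature(league_id: int, all_league_ids: list) -> np.ndarray:
    # One-hot vector over the 9 leagues in the database.
    vec = np.zeros(len(all_league_ids))
    vec[all_league_ids.index(league_id)] = 1.0
    return vec
```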
3.1.2 Feature of Odd. We use a 6-dimensional feature to express the odds information: the home-win odd, draw odd and away-win odd from two main bookmakers, BET365 and BW. Note that when generating features we convert the odds into probabilities instead of using the raw odd numbers. This operation eliminates the interference caused by the different profit margins of different bookmakers.
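The paper does not spell out the conversion; a common approach, assumed here, is to invert the decimal odds and renormalise, which removes the bookmaker's margin (the overround):

```python
import numpy as np

def odds_to_probs(home_odd: float, draw_odd: float, away_odd: float) -> np.ndarray:
    # Inverse decimal odds sum to slightly more than 1 (the bookmaker's
    # margin); renormalising removes the margin, making bookmakers with
    # different profit levels comparable.
    raw = 1.0 / np.array([home_odd, draw_odd, away_odd])
    return raw / raw.sum()

# odds_to_probs(1.73, 3.40, 5.00)  ->  approx. [0.54, 0.27, 0.19]
```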
3.1.3 Feature of Players. This is a 22-dimensional feature containing the overall ability of all starting players of both teams.

3.1.4 Feature of Team. We measure the ability of the teams with a 7-dimensional feature. Each entry is the home-team ability minus the away-team ability for the corresponding team attribute introduced in the last section (see the sketch below).
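A sketch of how these two features might be assembled; the attribute column names follow the Team Attributes table of the public database but should be treated as assumptions:

```python
import numpy as np

# Attribute names as they appear in the Team Attributes table (assumed).
TEAM_ATTRS = ["buildUpPlaySpeed", "buildUpPlayPassing",
              "chanceCreationPassing", "chanceCreationCrossing",
              "chanceCreationShooting", "defencePressure",
              "defenceAggression"]

def player_feature(home_ratings: list, away_ratings: list) -> np.ndarray:
    # 22 dims: overall FIFA ratings of the 11 home and 11 away starters.
    return np.array(home_ratings + away_ratings, dtype=float)

def team_feature(home: dict, away: dict) -> np.ndarray:
    # 7 dims: home-minus-away difference for each team attribute.
    return np.array([home[a] - away[a] for a in TEAM_ATTRS], dtype=float)
```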
3.1.5 Feature of Last Matches. This is a 6-dimensional feature: four entries are the average goals and average points won by each team over its last 10 matches, and the other two consider the two teams' record in matches played directly between them (see the sketch below).
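One possible way to compute the per-team averages, assuming the matches table is sorted by date and uses the database's column names; this is an illustrative sketch, not necessarily the exact procedure we used:

```python
import pandas as pd

def recent_form(matches: pd.DataFrame, team_id: int, date, n: int = 10):
    # 'matches' is assumed sorted by date in ascending order.
    past = matches[(matches["date"] < date)
                   & ((matches["home_team_api_id"] == team_id)
                      | (matches["away_team_api_id"] == team_id))].tail(n)
    goals, points = 0.0, 0.0
    for _, m in past.iterrows():
        at_home = m["home_team_api_id"] == team_id
        gf = m["home_team_goal"] if at_home else m["away_team_goal"]
        ga = m["away_team_goal"] if at_home else m["home_team_goal"]
        goals += gf
        points += 3 if gf > ga else (1 if gf == ga else 0)
    k = max(len(past), 1)
    return goals / k, points / k   # average goals, average points
```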
3.1.6 Feature of Rest Time. This is a one-dimensional feature giving the difference in rest time between the two teams.

3.2 Principal Component Analysis

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables or the number of observations. The transformation is defined so that the first principal component has the largest possible variance (that is, it accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors form an uncorrelated orthogonal basis. PCA is sensitive to the relative scaling of the original variables.

Our feature vector has fifty-three dimensions, which is relatively large, and some features, such as the odds from BET365 and BW, or player ability and team ability, may be correlated with each other. Thus, we first apply PCA to the original features to generate low-dimensional, uncorrelated features, which makes the model easier and quicker to train. In our experiment, we keep the top 10 principal components and use them as the new features.
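A sketch of this step with scikit-learn; standardising before PCA is our assumption, motivated by the scaling sensitivity noted above:

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardise the 53 raw features, then project onto the top 10
# principal components.
reducer = make_pipeline(StandardScaler(), PCA(n_components=10))
# X_reduced = reducer.fit_transform(X)   # X: (n_matches, 53) feature matrix
```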
3.3 Logistic Regression

Logistic regression is a regression model that uses the sigmoid function to compute a prediction for a data point and map it to a binary classification. It was developed by the statistician David Cox in 1958 [4] and is used to estimate the probability of a binary response based on one or more predictor variables. It is widely used in machine learning, medicine and the social sciences. Logistic regression takes the probability distribution into account, performs well, and is still cheap to train compared with more complex models. Thus, we choose logistic regression as the main method of this approach.

In our experiment, because the prediction task has three labels, we use a multi-class logistic regression classifier, which generalizes logistic regression to multi-class problems. We adopt the one-vs-rest scheme: for each label, we train a logistic regression model that decides between using and not using this label, and we then choose the most probable label for the data point. Since we have three labels, we in fact train three logistic regression models behind this multi-class classifier.

The model can be trained with gradient-descent methods; in our experiment, we use the 'liblinear' solver in scikit-learn.
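A minimal sketch of the classifier described above; hyperparameters other than the solver are left at scikit-learn defaults, since the paper does not report them:

```python
from sklearn.linear_model import LogisticRegression

# With the 'liblinear' solver, multi-class fitting is one-vs-rest:
# one binary model per label ('win', 'draw', 'defeat').
clf = LogisticRegression(solver="liblinear")
# clf.fit(X_reduced, y)                 # X_reduced: PCA features (Section 3.2)
# predictions = clf.predict(X_reduced_test)
```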
4 EXPERIMENTS

4.1 Compared Methods [1]

4.1.1 Baseline. The dataset contains the gambling odds that several gambling companies assign to each match. So if we naively predict the result of a football game from the odd values alone, we obtain an accuracy that serves as the baseline for our models, which is about 46.04%.
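A sketch of this baseline: predict the outcome with the lowest odd, i.e. the one the bookmaker itself considers most likely (the helper name is illustrative):

```python
import numpy as np

LABELS = np.array(["win", "draw", "defeat"])

def bookmaker_prediction(home_odd: float, draw_odd: float, away_odd: float) -> str:
    # The lowest decimal odd corresponds to the highest implied probability.
    return LABELS[np.argmin([home_odd, draw_odd, away_odd])]
```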

4.1.2 Random Forest. A Random Forest can be taken as an ensemble of decision trees; it uses averaging to improve predictive accuracy and control over-fitting. Random forests are good to use at a first stage, when you don't know the underlying model, or when you want to build a decent model in a short time, because they have very few parameters to tune and can be used quite efficiently with default parameter settings. By adjusting the number of trees used in the model (as sketched below), we found that the best performance of 53.50% was obtained when the number of trees was set to 200.
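A sketch of the tree-count sweep, assuming the feature matrix X and labels y are prepared as in Section 3; the candidate counts and 5-fold cross-validation are illustrative assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def sweep_forest(X, y, tree_counts=(50, 100, 200, 400)):
    # Mean cross-validated accuracy per forest size; in our runs the best
    # result (about 53.50%) came from 200 trees.
    return {n: cross_val_score(RandomForestClassifier(n_estimators=n),
                               X, y, cv=5).mean()
            for n in tree_counts}
```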

4.1.3 AdaBoost. The AdaBoost algorithm trains a sequence of learners, such as decision trees, on repeatedly modified versions of the data, then combines the predictions of all learners through a weighted majority sum to produce the final prediction. At first, we set equal weights for every training example; in each successive boosting iteration the weights are individually modified: training examples that were incorrectly predicted by the boosted model have their weights increased, whereas the weights are decreased for those that were predicted correctly. AdaBoost is a precise classifier, but its predictions are affected by noise in the dataset and it is more time-consuming. In our experiment, when the maximum number of estimators of AdaBoost is set to 150, we get the best prediction accuracy of 51.04%.
4.1.4 K-Nearest Neighbour. The K-Nearest Neighbours algorithm is an instance-based method used for classification and regression that works by calculating distances between points; its output is a class membership. An object is classified by a majority vote of its neighbours, the object being assigned to the class most common among its k nearest neighbours. There is no explicit training phase, or it is very minimal, which means training is fast, but all the training data is needed during the testing phase. We ran the K-NN model while varying the number of neighbours and found that the best accuracy of 43.47% was obtained when the number of neighbours was equal to 20.

4.1.5 Gaussian Naive Bayes. Bayes' theorem [3] is based on conditional probability. The naive Bayes classifier assumes that all features are independent of each other, even if the features depend on each other or on the existence of other features. Gaussian Naive Bayes is a special type of naive Bayes algorithm used when the features take continuous values; it further assumes that all the features follow a Gaussian, i.e. normal, distribution. Our Gaussian Naive Bayes model gives an accuracy of 47.55%.

4.1.6 Support Vector Machine. The Support Vector Machine is a supervised machine learning technique used for both classification and regression problems [2]. The linear SVM can be extended to a nonlinear classifier by first using a kernel function to map the input pattern into a higher-dimensional space. The nonlinear SVM classifier so obtained is linear in terms of the transformed data but nonlinear in terms of the original data. The best accuracy of the SVM is 53.40%.
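The classifiers of Sections 4.1.3 to 4.1.6 can be compared with a single loop, sketched below. The hyperparameters shown are the ones reported above; the SVM kernel and the 5-fold cross-validation are assumptions, since the paper does not state them:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

MODELS = {
    "AdaBoost (150 estimators)": AdaBoostClassifier(n_estimators=150),
    "K-NN (k = 20)": KNeighborsClassifier(n_neighbors=20),
    "Gaussian Naive Bayes": GaussianNB(),
    "SVM (RBF kernel, assumed)": SVC(kernel="rbf"),
}

def compare_models(X, y):
    # Mean cross-validated accuracy for each compared classifier.
    return {name: cross_val_score(model, X, y, cv=5).mean()
            for name, model in MODELS.items()}
```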
Figure 7: Prediction Probability

4.2 Result

After comparing the accuracy of the different models [5], we get the best performance when using logistic regression. After analyzing and comparing the features of the different models, we identify some disadvantages of the other models on this problem. The predictions of the random forest are somewhat unbalanced: because the rates of the 'Win', 'Draw' and 'Defeat' labels in the dataset are unbalanced, the classifier tends to assign more weight to the larger class. The performance of the AdaBoost classifier is limited because the dataset contains some noise, such as the FIFA-derived player ability data. K-Nearest Neighbours assumes that every feature in the dataset has equal weight, which causes some error when making predictions. Gaussian Naive Bayes did not produce satisfactory results because some features in the dataset may not be independent of each other, and our dataset is high-dimensional. For the support vector machine, as we have a large dataset and our problem is likely to be nonlinear, it is hard to find a suitable kernel function for the SVM, even though the accuracy of the SVM is close to that of logistic regression.

5 CONCLUSIONS AND FURTHER WORK

In this paper, we present an efficient multi-class classifier to predict the results of European football matches. Firstly, we explore the database and generate useful features for training the model. We take the features of league difference, odds, player ability, team ability, results of the last several matches and rest time after the last match into account, and appropriately modify these features to make them suitable for training. Secondly, we use PCA to compress the data into a low-dimensional scheme. Finally, we train a multi-class logistic regression classifier to predict the match results. Our method outperforms the other models and gets satisfying results: we beat the bookmakers by about 9 percent. In fact, betting according to our predictions, spending one dollar on each match, we could earn about 3,360 dollars after 6,000 attempts.
Due to time limits, we could not implement all of our ideas in the experiments, so there are still several aspects in which we can improve our prediction. Firstly, we can take the rest time of individual players into account: some essential players may rest by rotation so they can perform
better in hard schedules. Secondly, the ability of the coach, the weather, and the formation could be added as features that influence the result of a match. Thirdly, in our experiment we consider the ability of each player independently; however, in a real match players should be judged against the players in the corresponding positions on the other team (for example, one striker versus the defenders of the other team). Fourthly, we could use data from Football Manager instead of FIFA, which may be more detailed and more precise. Besides, we could make predictions on the second leg of two-legged knockout ties, which are more attractive in football.
However, even if we design an extraordinary classifier, we can never predict a miracle like the one FC Barcelona produced in the Champions League quarter-final.

REFERENCES
[1] Thomas G Dietterich. 2000. Ensemble methods in machine learning. In Interna-
tional workshop on multiple classifier systems. Springer, 1–15.
[2] Josip Hucaljuk and Alen Rakipović. 2011. Predicting football scores using ma-
chine learning techniques. In MIPRO, 2011 Proceedings of the 34th International
Convention. IEEE, 1623–1627.
[3] Anito Joseph, Norman E Fenton, and Martin Neil. 2006. Predicting football results
using Bayesian nets and other machine learning techniques. Knowledge-Based
Systems 19, 7 (2006), 544–553.
[4] David S Stoller. 1958. Some queuing problems in machine maintenance. Naval
Research Logistics (NRL) 5, 1 (1958), 83–87.
[5] Ian H Witten, Eibe Frank, Mark A Hall, and Christopher J Pal. 2016. Data Mining:
Practical machine learning tools and techniques. Morgan Kaufmann.
