KEYWORDS
Classification, Logistic Regression
1 INTRODUCTION
As one of the most popular sports in the world, football attracts
huge attention and accounts for a large share of the gaming industry.
Predicting the result of a single match is difficult because it
depends on many factors: ability, chance and also luck! Even the most
famous bookmakers make many wrong predictions. In this paper, we aim
to build a model that gives comparatively accurate predictions of
game results based on data from previous matches and relevant
analysis. Our model can handle large amounts of data and can beat
bookmakers on the game result classification task.
Figure 1: Probability of match result
2 DATABASE AND TASK
2.1 Database Introduction
Our data comes from a public database called the European Soccer
Database. It contains information on more than 25,000 matches and
more than 10,000 players. There are 7 tables in this database:
Country, League, Match, Player, Player Attributes, Team, and Team
Attributes.
2.1.1 Country and League. These tables record the 9 countries and
9 leagues from which match data were collected.
2.1.2 Match. The main information of the database is in the Match
table. Each row contains detailed data for one match: country and
league, both teams, starting players, basic technical statistics,
betting odds and game result. Teams and players appear only as ids,
which can be used to join with the other tables.
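As a hedged illustration of how ids in the Match table can be resolved through another table, the sketch below builds a tiny in-memory SQLite database and joins it. The column names (team_id, home_team_id, and so on) and the two toy teams are assumptions made for this example and are not taken from the real database schema.

```python
import sqlite3

# Hypothetical miniature schema; the real tables have many more columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Team (team_id INTEGER PRIMARY KEY, team_name TEXT);
    CREATE TABLE "Match" (
        match_id INTEGER PRIMARY KEY,
        home_team_id INTEGER,
        away_team_id INTEGER,
        home_goals INTEGER,
        away_goals INTEGER
    );
    INSERT INTO Team VALUES (1, 'Team A'), (2, 'Team B');
    INSERT INTO "Match" VALUES (10, 1, 2, 2, 1);
""")

# Resolve both team ids of each match to names by joining Team twice.
rows = conn.execute("""
    SELECT m.match_id, h.team_name, a.team_name, m.home_goals, m.away_goals
    FROM "Match" AS m
    JOIN Team AS h ON h.team_id = m.home_team_id
    JOIN Team AS a ON a.team_id = m.away_team_id
""").fetchall()
conn.close()
```

The same pattern would apply to player ids and the Player tables.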
Figure 2: Probability of match result in different leagues
2.1.3 Player and Player Attributes. These tables contain personal
information for more than 10,000 players. All of this information
comes from FIFA and includes, but is not restricted to: birthday,
overall rating, detailed ability, etc.
2.1.4 Team and Team Attributes. These tables contain team technique
information for several hundred teams. This information can be used
to reveal the ability of every team.
defeat 25% respectively.
2.2.2 League Difference. In different leagues, match results may
vary with factors such as coach, game schedule, etc. In Figure 2, we
show the difference in match results by league.
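A per-league comparison of match results could be computed along the following lines. This is only a sketch: the row layout (league, home goals, away goals), the toy rows, and the choice to label results from the home team's point of view are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

LABELS = ("win", "draw", "defeat")  # assumed: from the home team's point of view

def outcome(home_goals, away_goals):
    """Label a match result for the home team."""
    if home_goals > away_goals:
        return "win"
    if home_goals == away_goals:
        return "draw"
    return "defeat"

def outcome_rates_by_league(matches):
    """Map each league to the fraction of wins, draws and defeats."""
    counts = defaultdict(Counter)
    for league, home_goals, away_goals in matches:
        counts[league][outcome(home_goals, away_goals)] += 1
    rates = {}
    for league, counter in counts.items():
        total = sum(counter.values())
        rates[league] = {label: counter[label] / total for label in LABELS}
    return rates

# Toy rows invented for illustration; they are not from the database.
toy_matches = [
    ("League A", 2, 1), ("League A", 0, 0), ("League A", 1, 3), ("League A", 3, 0),
    ("League B", 1, 1), ("League B", 0, 2),
]
rates = outcome_rates_by_league(toy_matches)
```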
2.2.4 Starting Player Ability. The main factor determining the
ability of a team is the ability of its eleven starting players. To
judge the ability of players, we take the overall rating of each
player from the Player Attributes table, which comes from FIFA data.
In Figure 4, we show the variance in ability across players.
2.2.5 Team Technical Ability. The best players cannot ensure the
best team without a suitable team build. Thus, we add some team
technical indicators to show the tactical ability of the team. We
measure it in seven aspects: build-up play speed, build-up play
passing, chance creation passing, chance creation crossing, chance
creation shooting, defense pressure and defense aggression. We show
these data in Figure 5.
2.2.6 Results of Last Several Matches. Results of the last several
matches can reveal the competing condition of a team, and history
results between two teams can reveal . We calculate the average goals
per match and average points won for both teams in their last several
matches, and their average goals in the matches in which they
previously met. The following figure shows the corresponding data.
2.2.7 Rest Time after Last Match. When a team faces a hard match
schedule, it can hardly perform at its normal level. Even the best
team may struggle when tired. In Figure 6, we show the average rest
time for teams in different leagues.
2.3 Task
We aim to predict the result of a match using the useful features we
get from the database. The prediction takes one of three labels: win,
draw and defeat. We evaluate models by the percentage of correct
predictions.
2.4 Related Work
On Kaggle, there are many discussions and analyses of the football
match dataset. In general, they can be separated into two categories:
match outcome prediction and player ability prediction. For match
outcome prediction, which is also what we focus on in this paper,
many classifiers such as Logistic Regression and Random Forest are
compared with each other to find the best classification approach.
Many different features are considered, including betting
information, starting players' overall abilities, recent match
results, etc. There is also some work that tries to predict goals or
scores of matches. The other category, player ability prediction,
tries to predict
whether a football player can be a good player or not. In "Can you
be a good football player?", the author analyzes players' FIFA
ratings according to different leagues, players' height and weight,
position, age and different technical abilities such as crossing and
finishing. Then the author uses a Decision Tree to classify good
players, defined as those whose FIFA rating is over 70. The author
finally achieves 0.668 accuracy with the Decision Tree.
Some users on Kaggle also present interesting findings on this
dataset. In "The fans stay loyal, the players move on", the author
explores player transfers among different teams in different leagues.
He also presents several graphs of player transfers at some big
football teams like Manchester United and Arsenal. He lists some
players who change teams frequently as well. Some research
concentrates on home advantage analysis and concludes that the home
team has an obvious advantage over the away team, which matches our
intuition. In "The Most Predictable League", the author uses B365
betting information to calculate the entropy of each league and finds
that the Spanish league (La Liga) is the most predictable league and
the French league is the least predictable one. The main reason he
finds is that the match results of the two giants of La Liga, FC
Barcelona and Real Madrid, are very predictable.
3 MODEL
We select some match features from the database, as illustrated in
the last section, and use PCA to extract a smaller number of
uncorrelated features from them. With these features, we train a
classifier based on logistic regression to predict the result of a
match.
3.1 Selected Features
3.1.1 Feature of League. We use one-hot encoding to express this
feature. The dimension of the feature is 9. The entry for the league
that the match is from equals 1 and the others are 0.
3.1.2 Feature of Odd. We use a 6-dimensional feature to express odds
information: the home win odds, draw odds and away win odds from two
main bookmakers, BET365 and BW. Note that we convert the odds into
probabilities when generating features instead of using the original
odds values. This operation eliminates the interference caused by the
different profit margins of different bookmakers.
3.1.3 Feature of Players. This is a 22-dimensional feature, which
contains the overall ability of all starting players of both teams.
3.1.4 Feature of Team. We measure the ability of the team in a
7-dimensional feature. Each entry is the difference between the home
team and the away team in the corresponding team ability illustrated
in the last section.
3.1.5 Feature of Last Matches. This is a 6-dimensional feature, four
of which are the average goals and average points won for both teams
in their last 10 matches, and the other two of which reflect their
record in matches played directly between these two teams.
3.1.6 Feature of Rest Time. This is a feature with only one
dimension that shows the rest time difference between the two teams.
3.2 Principal Component Analysis
Principal component analysis (PCA) is a statistical procedure that
uses an orthogonal transformation to convert a set of observations of
possibly correlated variables into a set of values of linearly
uncorrelated variables called principal components. The number of
principal components is less than or equal to the number of original
variables or the number of observations. This transformation is
defined in such a way that the first principal component has the
largest possible variance (that is, accounts for as much of the
variability in the data as possible), and each succeeding component
in turn has the highest variance possible under the constraint that
it is orthogonal to the preceding components. The resulting vectors
are an uncorrelated orthogonal basis set. PCA is sensitive to the
relative scaling of the original variables.
The dimension of our feature vector is fifty-three, which is
relatively large. In addition, we have features such as the odds from
BET365 and BW, or player ability and team ability, which may be
correlated with each other. Thus, we first apply PCA to our original
features to generate low-dimensional, uncorrelated features, which
makes the model easier and quicker to build. In our experiment, we
select the top 10 principal components of the features and use them
as new features.
3.3 Logistic Regression
Logistic regression is a regression model that uses the sigmoid
function to compute a prediction for a data point and maps it to a
binary classification. Logistic regression was developed by
statistician David Cox in 1958[4], and it is used to estimate the
probability of a binary response based on one or more predictor
variables. It is widely used in machine learning, medical fields and
social science. Logistic regression takes probability distribution
into account. It has good performance and is still cheap to train
compared with some complex models. Thus, we choose logistic
regression as our main method in this approach.
In our experiment, since we aim to do a prediction task with three
labels, we use a multi-class logistic regression classifier.
Multi-class logistic regression is a classification method that
generalizes logistic regression to multi-class problems. We use the
one-vs-rest scheme to solve the multi-class problem. That is, for
each label, we train a logistic regression model that decides between
this label and not this label. After that, we choose the most
probable label for the data point. Since we have three labels in our
prediction, we in fact train three logistic regression models behind
this multi-class logistic regression model.
We can use gradient descent methods to train the model. In our
experiment, we use the 'liblinear' solver in scikit-learn to train
the model.
4 EXPERIMENTS
4.1 Compared methods
[1]
4.1.1 Baseline. The dataset contains the gambling odds of several
gambling companies assigned to each football team. So if we just
naively predict the result of a football game according to the odds
values, we get an accuracy of about 46.04%, which serves as the
baseline for the model.
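The odds-based baseline, together with the odds-to-probability conversion of Section 3.1.2, can be sketched as follows. This is an illustrative reading rather than the exact procedure used in the experiments: it assumes decimal odds, assumes that dividing by the sum of the inverse odds is how the bookmaker margin is removed, and assumes labels are given from the home team's point of view. The toy odds and results are invented.

```python
LABELS = ("win", "draw", "defeat")  # assumed: from the home team's point of view

def implied_probabilities(home_odds, draw_odds, away_odds):
    """Convert decimal odds to probabilities that sum to 1.

    Assumption: 1/odds is the raw implied probability, and dividing by
    the sum of the three raw values removes the bookmaker's margin.
    """
    raw = [1.0 / home_odds, 1.0 / draw_odds, 1.0 / away_odds]
    total = sum(raw)
    return [p / total for p in raw]

def baseline_predict(home_odds, draw_odds, away_odds):
    """Predict the outcome the bookmaker considers most likely."""
    probs = implied_probabilities(home_odds, draw_odds, away_odds)
    return LABELS[probs.index(max(probs))]

def accuracy(predictions, truths):
    """Fraction of correct predictions, the metric of Section 2.3."""
    correct = sum(p == t for p, t in zip(predictions, truths))
    return correct / len(truths)

# Toy odds and results invented for illustration only.
toy_odds = [(1.5, 4.0, 6.0), (3.0, 3.2, 2.3), (2.0, 3.4, 3.6), (2.6, 3.1, 2.7)]
toy_results = ["win", "defeat", "draw", "win"]
toy_predictions = [baseline_predict(*odds) for odds in toy_odds]
toy_accuracy = accuracy(toy_predictions, toy_results)
```

Because the draw rarely has the lowest odds, a baseline of this kind almost never predicts a draw.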
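For comparison with the baseline, the model of Section 3 (PCA down to 10 components, then one-vs-rest logistic regression) can be sketched in numpy as below. This is a minimal sketch under stated assumptions: it trains with plain batch gradient descent rather than the 'liblinear' solver, the learning rate and iteration count are arbitrary, and the data is synthetic stand-in data with 53 features and 3 labels, not the match features themselves.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the feature mean and the top principal directions of X."""
    mean = X.mean(axis=0)
    # Rows of vt are orthonormal directions sorted by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_transform(X, mean, components):
    """Project centered data onto the principal directions."""
    return (X - mean) @ components.T

def sigmoid(z):
    # Clipping avoids overflow in exp for very large |z|.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def add_bias(X):
    return np.hstack([X, np.ones((X.shape[0], 1))])

def fit_one_vs_rest(X, y, n_classes, lr=0.05, n_iter=500):
    """Train one binary logistic regression per label by gradient descent."""
    Xb = add_bias(X)
    W = np.zeros((n_classes, Xb.shape[1]))
    for k in range(n_classes):
        target = (y == k).astype(float)  # 1 for "this label", 0 for the rest
        for _ in range(n_iter):
            grad = Xb.T @ (sigmoid(Xb @ W[k]) - target) / len(y)
            W[k] -= lr * grad
    return W

def predict(X, W):
    # The sigmoid is monotonic, so the largest linear score is also the
    # most probable label.
    return np.argmax(add_bias(X) @ W.T, axis=1)

# Synthetic stand-in data: 300 samples, 53 features, 3 labels.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=300)
centers = rng.normal(size=(3, 53))
X = centers[y] + rng.normal(size=(300, 53))

mean, components = pca_fit(X, n_components=10)
Z = pca_transform(X, mean, components)
W = fit_one_vs_rest(Z, y, n_classes=3)
train_accuracy = float(np.mean(predict(Z, W) == y))
```

Since PCA is sensitive to the relative scaling of the variables, real features on different scales would likely need standardizing before the projection; the synthetic features here already share one scale.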
REFERENCES
[1] Thomas G Dietterich. 2000. Ensemble methods in machine learning. In Interna-
tional workshop on multiple classifier systems. Springer, 1–15.
[2] Josip Hucaljuk and Alen Rakipović. 2011. Predicting football scores using ma-
chine learning techniques. In MIPRO, 2011 Proceedings of the 34th International
Convention. IEEE, 1623–1627.
[3] Anito Joseph, Norman E Fenton, and Martin Neil. 2006. Predicting football results
using Bayesian nets and other machine learning techniques. Knowledge-Based
Systems 19, 7 (2006), 544–553.
[4] David S Stoller. 1958. Some queuing problems in machine maintenance. Naval
Research Logistics (NRL) 5, 1 (1958), 83–87.
[5] Ian H Witten, Eibe Frank, Mark A Hall, and Christopher J Pal. 2016. Data Mining:
Practical machine learning tools and techniques. Morgan Kaufmann.