Modeling basketball match scores through team-specific strength factors
Mateo Restrepo Mejía, NEAS Regression Analysis Course (Summer 2013)
Abstract
We develop several simple regression models of basketball match scores. To evaluate the models we use them to fit the scores of the 140 basketball games played among the 15 teams in the Big East conference during the 2011-2012 NCAA basketball season. We test several effects for statistical significance, such as local vs. visitor, the effect of time within the season, and the time and team interaction.
Keywords: basketball, linear regression, game score
1. Introduction
In this paper, we develop several simple regression models with the aim of predicting the score of a basketball match given only the identities of the teams that played it and the date it was played. The basic assumption underlying all models is that a team's strength at any moment in time can be summarized by one or two numbers and that the number of points scored by team $a$ when playing against team $b$ will be given by the difference between $a$'s and $b$'s strengths, plus an intercept.
The methodology is similar to, although perhaps much simpler than, that employed by USA Today's sports statistician Jeff Sagarin [Sagarin], in that the expected score difference of a match depends on the teams that play it only through a single numeric summary (what Sagarin calls a ranking). We shall start with a simple model in which the strength of a team is summarized in a single number. We shall build on this model by incorporating a local-vs-visitor effect, which is expected to bump a team's score when it plays at home. Later we will explore a bigger model in which each team is described by two numbers: an attack strength and a defense strength. Finally, we will explore the possible effect that time and time-team interaction have on the scores. We shall verify that the residuals produced by our linear fits are in fact normally distributed and present no real outliers.
Email address: mateini@gmail.com (NEAS Regression Analysis Course (Summer 2013))
Preprint submitted to Elsevier, September 23, 2013
2. Model descriptions
In order to precisely describe the models, we introduce a bit of notation and terminology. We shall focus on games played among $n_T$ teams belonging to the same conference. We let $g$ be the number of games and index the games by $i$. For our purposes game $G_i$ will be described by the tuple
$$G_i = (l(i), v(i), T_i, L_i, V_i),$$
where:
$l(i) \in \{1, \dots, n_T\}$ is the index of the team playing as local in game $G_i$.
$v(i) \in \{1, \dots, n_T\}$ is the index of the team playing as visitor in game $G_i$.
$T_i$ is the time the game was played, measured as a (generally fractional) number of months since an arbitrarily chosen time origin.
$L_i$ is the number of points scored by the local team, $l_i$ (shorthand for $l(i)$).
$V_i$ is the number of points scored by the visitor team, $v_i$ (shorthand for $v(i)$).
All our models will try to predict both scores, $L_i$ and $V_i$, for each game, from the values of the independent variables $l_i$, $v_i$ and $T_i$. In order to do this, we see each game as two separate games: game $i$, in which team $l(i)$ (as attacker) scores $L_i$ points against team $v(i)$ (defender), and another mirror game, which we shall index with $g+i$, in which team $v(i)$ (as attacker) scores $V_i$ points against team $l(i)$ (defender). Thus the response variable will have $2g$ values
$$Y_i := L_i, \qquad Y_{g+i} := V_i. \tag{1}$$
The total number of data points that we will consider is thus $n = 2g$. It shall prove convenient to extend the definition of $l(i)$ and $v(i)$ to the range $\{g+1, g+2, \dots, 2g\}$ as follows:
$$l(g+i) := v(i), \qquad v(g+i) := l(i), \tag{2}$$
reflecting the fact that for the mirror games, $g+1, \dots, 2g$, the roles of the teams are reversed. With this notation we are ready to introduce the models.
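The doubling of the data set described above can be sketched in a few lines of Python. This is only an illustration: the `Game` type, the `expand_games` helper and the toy scores are hypothetical names and values introduced here, not part of the original analysis.

```python
from typing import NamedTuple


class Game(NamedTuple):
    local: int        # l(i), team index in 1..n_T
    visitor: int      # v(i), team index in 1..n_T
    time: float       # T_i, months since the chosen origin
    local_pts: int    # L_i
    visitor_pts: int  # V_i


def expand_games(games):
    """Return (attacker, defender, time, points) rows for i = 1..2g.

    Rows 1..g are the original games (local team attacks); rows g+1..2g are
    the mirror games, with the roles of the two teams swapped as in (2).
    """
    direct = [(gm.local, gm.visitor, gm.time, gm.local_pts) for gm in games]
    mirror = [(gm.visitor, gm.local, gm.time, gm.visitor_pts) for gm in games]
    return direct + mirror


# Hypothetical toy data, not the Big East scores.
toy_games = [Game(1, 2, 0.5, 70, 65), Game(3, 1, 1.2, 58, 61)]
rows = expand_games(toy_games)
for row in rows:
    print(row)
```

With $g = 2$ toy games the sketch produces $n = 2g = 4$ response rows, the last two being the mirror games.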
2.1. Model 0: a single strength-factor per team

The very first model posits that team $m$ has a playing strength described by a single number, $\beta_m$, and that the number of points scored by local team $l$ playing against visitor team $v$ will be
$$L = \mu + \beta_l - \beta_v + \epsilon, \tag{3}$$
where $\mu$ is an intercept constant and $\epsilon$ is a normally distributed error. This simple model implies also that the number of points scored by the visitor team will be
$$V = \mu + \beta_v - \beta_l + \epsilon', \tag{4}$$
where $\epsilon'$ is another normally distributed error, independent of $\epsilon$. The possible effect of time is neglected by this model; this will be addressed in later models. As written in equations (3) and (4) the model is of course over-determined: only differences of strengths enter, so adding a common constant to every $\beta_m$ leaves both equations unchanged. This will be addressed shortly.
In order to put the model equations (3) and (4) in the standard form used for linear regression we introduce $n_T - 1$ dummy regressor variables $B_m$, $m \in \{1, \dots, n_T - 1\}$, defined as
$$B_{im} := \begin{cases} 1 & \text{if } l(i) = m \ \&\ m < n_T \\ -1 & \text{if } v(i) = m \ \&\ m < n_T \\ 0 & \text{otherwise.} \end{cases}$$
We take advantage of the notational convention (2) to make the last definition hold for $i \in \{1, \dots, 2g\}$. The model formula is thus
$$Y_i = \mu + \sum_{m=1}^{n_T - 1} \beta_m B_{im} + \epsilon_i. \tag{5}$$
Hence, by virtue of the definition (1) of $Y_i$ for mirror games, the formula effectively replaces both formulas (3) and (4). Notice that the over-determination present in the former formulas is taken care of in the latter by using a set of $n_T - 1$ dummy regressors, which effectively makes the last team, $n_T$, the base team.
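As a minimal sketch of how the dummy regressors encode Model 0, the following Python builds one row of the $B$ matrix and evaluates the non-random part of the model formula. The function names, the three-team league and the strength values are hypothetical, chosen only for illustration; the symbols `mu` and `beta` stand for the intercept and the team strengths.

```python
def strength_dummies(attacker, defender, n_teams):
    """One row B_{i,1..n_T-1}: +1 for the attacker, -1 for the defender.

    The last team (index n_teams) is the base team and gets no column.
    """
    row = [0] * (n_teams - 1)
    if attacker < n_teams:
        row[attacker - 1] = 1
    if defender < n_teams:
        row[defender - 1] = -1
    return row


def predict_model0(mu, beta, attacker, defender):
    """Expected points: mu + sum_m beta_m * B_im (error term omitted)."""
    b_row = strength_dummies(attacker, defender, len(beta) + 1)
    return mu + sum(b * x for b, x in zip(beta, b_row))


mu = 65.0
beta = [4.0, -2.0]  # hypothetical strengths of teams 1 and 2; team 3 is the base

pred_1_vs_2 = predict_model0(mu, beta, 1, 2)  # mu + beta_1 - beta_2
pred_2_vs_3 = predict_model0(mu, beta, 2, 3)  # mu + beta_2 (base team has strength 0)
print(pred_1_vs_2, pred_2_vs_3)
```

Because the base team has no column, its strength is implicitly fixed at zero, which is what removes the redundancy in the strength parameters.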
2.2. Model 0a: local vs. visitor effect

This model is a slight extension of Model 0 of the previous section that tries to capture the local vs. visitor effect on the score, i.e. it includes an extra coefficient $\gamma$ that will add to the score of a team when it is playing at home. Equations (3) and (4) are modified to
$$L = \mu + \gamma + \beta_l - \beta_v + \epsilon, \qquad V = \mu + \beta_v - \beta_l + \epsilon'.$$
To formulate this in matrix form, we introduce an extra dummy regressor
$$C_i := \begin{cases} 1 & \text{if } i \le g \\ 0 & \text{otherwise} \end{cases}$$
and thus we get
$$Y_i = \mu + \gamma C_i + \sum_{m=1}^{n_T - 1} \beta_m B_{im} + \epsilon_i. \tag{6}$$
2.3. Model 1: attack strength vs. defense strength.

This model effectively doubles the number of (non-intercept) coefficients in Model 0. It models the strength of team $m$ by means of two separate numbers: its attack strength $\alpha_m$ and its defense strength $\delta_m$. The predicted scores for a game between $l$ and $v$ would then be
$$L = \mu + \alpha_l - \delta_v + \epsilon, \qquad V = \mu + \alpha_v - \delta_l + \epsilon'.$$
Since there are two separate sets of coefficients we require two sets of dummy variables:
$$A_{im} := \begin{cases} 1 & \text{if } l(i) = m \ \&\ m < n_T \\ 0 & \text{otherwise} \end{cases}$$
and
$$D_{im} := \begin{cases} -1 & \text{if } v(i) = m \ \&\ m < n_T \\ 0 & \text{otherwise.} \end{cases}$$
The model formula is thus
$$Y_i = \mu + \sum_{m=1}^{n_T - 1} \alpha_m A_{im} + \sum_{m=1}^{n_T - 1} \delta_m D_{im} + \epsilon_i.$$
Notice that this model effectively subsumes Model 0, by making $\alpha_m = \delta_m = \beta_m$. In a later section we shall test the hypothesis that $\alpha_m = \delta_m$ for all $m$.
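The claim that Model 1 subsumes Model 0 can be checked mechanically. The sketch below assumes the sign convention in which the defense dummy is $-1$ for the defending team (so that a larger defense strength lowers the opponent's score); under that assumption the attack and defense rows add up to the single-strength row of Model 0, so equal attack and defense coefficients reproduce Model 0 exactly. All function names and the four-team league are hypothetical.

```python
def attack_dummies(attacker, n_teams):
    """Row A_{i,1..n_T-1}: +1 in the attacker's column (none for the base team)."""
    row = [0] * (n_teams - 1)
    if attacker < n_teams:
        row[attacker - 1] = 1
    return row


def defense_dummies(defender, n_teams):
    """Row D_{i,1..n_T-1}: -1 in the defender's column (assumed sign convention)."""
    row = [0] * (n_teams - 1)
    if defender < n_teams:
        row[defender - 1] = -1
    return row


def strength_dummies(attacker, defender, n_teams):
    """Model 0 row B_{i,1..n_T-1}: +1 attacker, -1 defender."""
    row = [0] * (n_teams - 1)
    if attacker < n_teams:
        row[attacker - 1] = 1
    if defender < n_teams:
        row[defender - 1] = -1
    return row


n_teams = 4  # hypothetical small league
pairs = [(a, d) for a in range(1, n_teams + 1)
         for d in range(1, n_teams + 1) if a != d]

# A_im + D_im should equal B_im for every attacker/defender pairing.
combined = [[x + y for x, y in zip(attack_dummies(a, n_teams),
                                   defense_dummies(d, n_teams))]
            for a, d in pairs]
model0_rows = [strength_dummies(a, d, n_teams) for a, d in pairs]
print(combined == model0_rows)
```

The nesting of the two models is what allows them to be compared with an incremental F-test.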
2.4. Model 1a: Model 1 + local-vs-visitor effect

We add the local vs. visitor effect to Model 1 in the same way that we obtained Model 0a from Model 0. The resulting model is
$$Y_i = \mu + \gamma C_i + \sum_{m=1}^{n_T - 1} \alpha_m A_{im} + \sum_{m=1}^{n_T - 1} \delta_m D_{im} + \epsilon_i.$$

| Team | Index |
|---|---|
| cincinnati | 1 |
| connecticut | 2 |
| depaul | 3 |
| georgetown | 4 |
| louisville | 5 |
| marquette | 6 |
| notre-dame | 7 |
| pittsburgh | 8 |
| providence | 9 |
| rutgers | 10 |
| seton-hall | 11 |
| south-fla | 12 |
| st-johns-ny | 13 |
| syracuse | 14 |
| villanova | 15 |

Table 1: Teams in the Big-East conference and their indices in models.
2.5. Model 2: Model 1a + global effect of time

We enhance the previous model by an extra term that attempts to capture the effect that time in the season has over the score. The model formula is
$$Y_i = \mu + \gamma C_i + \sum_{m=1}^{n_T - 1} (\alpha_m A_{im} + \delta_m D_{im}) + \tau T_i + \epsilon_i,$$
where $T_i$ is the time in months measured from a judiciously chosen origin.
2.6. Model 2a: Model 1a + time-team interactions

This shall be our last model. It is an enhancement of Model 1a that allows interactions of the time factor with the team strength factors, allowing for the possibility that the attack/defense strength of a team varies linearly throughout the season:
$$Y_i = \mu + \gamma C_i + \sum_{m=1}^{n_T - 1} (\alpha_m A_{im} + \delta_m D_{im}) + \tau T_i + \sum_{m=1}^{n_T - 1} \left( \tau^A_m A_{im} T_i + \tau^D_m D_{im} T_i \right) + \epsilon_i.$$
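To keep track of how quickly the models grow, the number of non-intercept regressors $k$ of each variant can be counted from its ingredients. The following is a small sketch for a 15-team conference; the helper name and the flag layout are hypothetical.

```python
def count_regressors(n_teams, two_strengths, home, time, interactions):
    """Number of non-intercept regressors k for one of the model variants."""
    per_set = n_teams - 1                      # one dummy per non-base team
    k = per_set * (2 if two_strengths else 1)  # A and D sets, or the single B set
    k += 1 if home else 0                      # home dummy C
    k += 1 if time else 0                      # global time term T
    k += 2 * per_set if interactions else 0    # A*T and D*T interaction columns
    return k


# Flags: (two_strengths, home, time, interactions)
MODEL_FLAGS = {
    "mod.0":  (False, False, False, False),
    "mod.0a": (False, True,  False, False),
    "mod.1":  (True,  False, False, False),
    "mod.1a": (True,  True,  False, False),
    "mod.2":  (True,  True,  True,  False),
    "mod.2a": (True,  True,  True,  True),
}

counts = {name: count_regressors(15, *flags) for name, flags in MODEL_FLAGS.items()}
for name, k in counts.items():
    print(f"{name}: k = {k}")
```

The interaction terms of Model 2a nearly double the coefficient count relative to Model 2, which is the cost weighed against its gain in fit.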
3. Data Fits
To test the models developed in the preceding section, we shall fit them to the scores of 140 games played among the 15 teams belonging to the Big-East conference during the 2012 season (first game was played on 2011/12/27 and last game on 2012/03/10). In the notation of last section, we have $n_T = 15$, $g = 140$, and the total number of data points to be fit is $n = 2g = 280$.

| Model | $R^2$ | adjusted $R^2$ | $k$ | $n-k-1$ | $F$-statistic | $p$-value |
|---|---|---|---|---|---|---|
| mod.0 | 0.09721 | 0.04952 | 14 | 265 | 2.038 | 0.01556 |
| mod.0a | 0.1207 | 0.07069 | 15 | 264 | 2.415 | 0.002653 |
| mod.1 | 0.416 | 0.3508 | 28 | 251 | 6.384 | 0 |
| mod.1a | 0.4394 | 0.3744 | 29 | 250 | 6.757 | 0 |
| mod.2* | 0.4449 | 0.3781 | 30 | 249 | 6.653 | 0 |
| mod.2a | 0.5112 | 0.3829 | 58 | 221 | 3.984 | 6.517e-14 |

Table 2: Summary of model-wide statistics for models 0 through 2a
Table 1 shows a list of all 15 teams in the Big-East conference and the corresponding indices into the regressor matrices. As explained above, to avoid over-determination there are only 14 columns in each matrix, and team 15 (Villanova) plays the role of the base team. Table 2 shows a summary of the model-wide fit statistics for each of the models. We see that models 1, 1a, 2 and 2a yield very similar values of adjusted $R^2$.
It is clear from this table that allowing each team to be described by two strength numbers (attack and defense) as in models 1x and 2x, as opposed to a single number (models 0x), provides for a very significant improvement in the $R^2$, namely around 30 percentage points. By comparing mod.0a to mod.0 and mod.1a to mod.1 we see that allowing for a correction based on whether the scoring team is local or visitor yields a modest improvement in the adjusted $R^2$. In the next section we shall test the significance of this improvement. Finally, we observe that there is only a very minor improvement in the adjusted $R^2$ (from 0.3781 to 0.3829) obtained by including the interaction of time with the team strength factors (model 2 vs. model 2a). This minor improvement comes at the expense of almost doubling the number of coefficients and making the total number of coefficients a sizable fraction of the total number of points. In view of this, and in the interest of parsimony, we shall designate model 2 as our best model. If we had a larger number of games, played by the same teams over many different seasons, we would probably revisit this issue. In any case, we shall look at an incremental $F$-test comparing models 2 and 2a in the next section.
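The adjusted $R^2$ and overall $F$-statistic columns of Table 2 follow from $R^2$, $n$ and $k$ through the standard formulas $\bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-k-1}$ and $F = \frac{R^2/k}{(1-R^2)/(n-k-1)}$. The sketch below recomputes them from the rounded $R^2$ values in the table, so agreement is only expected up to rounding; the function names are illustrative.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k non-intercept regressors and n points."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def overall_f(r2, n, k):
    """Overall F-statistic testing all k non-intercept coefficients at once."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


N_POINTS = 280

# (R^2, k) as reported in Table 2.
TABLE2 = {
    "mod.0":  (0.09721, 14),
    "mod.0a": (0.1207, 15),
    "mod.1":  (0.416, 28),
    "mod.1a": (0.4394, 29),
    "mod.2":  (0.4449, 30),
    "mod.2a": (0.5112, 58),
}

recomputed = {
    name: (adjusted_r2(r2, N_POINTS, k), overall_f(r2, N_POINTS, k))
    for name, (r2, k) in TABLE2.items()
}
for name, (adj, f_stat) in recomputed.items():
    print(f"{name}: adjusted R^2 = {adj:.4f}, F = {f_stat:.3f}")
```

The penalty factor $(n-1)/(n-k-1)$ grows from about 1.12 for mod.2 to about 1.26 for mod.2a, which is why the large gain in raw $R^2$ from the interaction terms almost disappears after adjustment.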
Table 3 shows the results of fitting the data to Model 2. We see that most coefficients are significant to the 10% level (.) and many of them are significant to the 5% (*) and even to the 1% (**) level. In particular, the coefficient of $C$ (local vs. visitor effect) is significant with $p = 0.0013$. The coefficient of $T$, for the (global) time effect, falls just short of significance at the 10% level (with $p = 0.1162$).
3.1. Analysis of residuals

Figures 1 and 2 show a histogram and a qq-plot against normal quantiles of the residuals of model 2. The qq-plot shows that the assumption of normality is justified. Figure 3 shows a plot of studentized residuals (as obtained from function studres from the MASS package in R) vs. the fitted values for model 2. There are 15 studentized residuals with values outside of $[-2, 2]$, which is in line with expectations.

| Coefficient | Estimate | Std. Error | t-value | p-value | Significance |
|---|---|---|---|---|---|
| (Intercept) | 78.11 | 3.268 | 23.9 | 2.09e-66 | *** |
| C | 3.615 | 1.115 | 3.243 | 0.001345 | ** |
| A1 | -5.325 | 3.01 | -1.769 | 0.07805 | . |
| A2 | -5.146 | 3.046 | -1.69 | 0.09237 | . |
| A3 | 0.9915 | 3.082 | 0.3217 | 0.7479 | |
| A4 | -7.719 | 3.033 | -2.545 | 0.01153 | * |
| A5 | -5.137 | 2.964 | -1.733 | 0.08428 | . |
| A6 | 3.966 | 3.089 | 1.284 | 0.2003 | |
| A7 | -8.492 | 3.068 | -2.768 | 0.006068 | ** |
| A8 | -5.94 | 3.072 | -1.933 | 0.05432 | . |
| A9 | -3.699 | 3.074 | -1.203 | 0.23 | |
| A10 | -8.982 | 3.136 | -2.864 | 0.004537 | ** |
| A11 | -7.199 | 3.036 | -2.371 | 0.0185 | * |
| A12 | -14.03 | 3.063 | -4.58 | 7.365e-06 | *** |
| A13 | -5.939 | 3.079 | -1.929 | 0.05485 | . |
| A14 | -0.5268 | 3.042 | -0.1732 | 0.8627 | |
| D1 | 11.01 | 3.01 | 3.659 | 0.0003093 | *** |
| D2 | 8.656 | 3.046 | 2.842 | 0.004852 | ** |
| D3 | -6.855 | 3.082 | -2.224 | 0.02701 | * |
| D4 | 14.89 | 3.033 | 4.911 | 1.642e-06 | *** |
| D5 | 11.61 | 2.964 | 3.917 | 0.000116 | *** |
| D6 | 3.936 | 3.089 | 1.274 | 0.2037 | |
| D7 | 12.53 | 3.068 | 4.083 | 6.001e-05 | *** |
| D8 | 6.111 | 3.072 | 1.989 | 0.04781 | * |
| D9 | 0.5986 | 3.074 | 0.1947 | 0.8458 | |
| D10 | 7.959 | 3.136 | 2.538 | 0.01175 | * |
| D11 | 9.332 | 3.036 | 3.073 | 0.002352 | ** |
| D12 | 17.92 | 3.063 | 5.849 | 1.549e-08 | *** |
| D13 | -0.6289 | 3.079 | -0.2043 | 0.8383 | |
| D14 | 12.15 | 3.042 | 3.994 | 8.547e-05 | *** |
| T | -1.195 | 0.7583 | -1.576 | 0.1162 | |

Table 3: Fitted coefficients for Model 2

[Figure 1: Histogram of residuals for mod.2 (x-axis: residuals(mod.2); y-axis: Frequency).]
[Figure 2: qq-plot of residuals against normal quantiles (x-axis: Theoretical Quantiles; y-axis: Sample Quantiles).]
[Figure 3: Studentized residuals against fitted values, with lowess line and $[-2, +2]$ range (x-axis: mod2$fitted.values; y-axis: mod2.studres).]
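The statement that 15 studentized residuals outside $[-2, 2]$ is in line with expectations can be made concrete with a rough calculation. The sketch below assumes, as an approximation, that the studentized residuals behave like independent standard normal draws (with roughly 250 residual degrees of freedom the $t$ reference distribution is very close to normal); the variable names are illustrative.

```python
import math

N_POINTS = 280
OBSERVED_OUTSIDE = 15

# P(|Z| > 2) for a standard normal variable Z.
p_outside = math.erfc(2.0 / math.sqrt(2.0))

# Expected count and binomial standard deviation under the approximation.
expected_outside = N_POINTS * p_outside
sd_outside = math.sqrt(N_POINTS * p_outside * (1.0 - p_outside))

# How many standard deviations the observed count lies from the expectation.
z_score = (OBSERVED_OUTSIDE - expected_outside) / sd_outside

print(f"P(|Z| > 2) = {p_outside:.4f}")
print(f"expected count = {expected_outside:.2f} +/- {sd_outside:.2f}")
print(f"observed 15 is {z_score:.2f} sd from the expectation")
```

Under these assumptions roughly 13 residuals are expected to fall outside the band, so an observed count of 15 lies well within one standard deviation of the expectation.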
4. Hypothesis tests
In this section we shall apply incremental $F$-tests to a few hypotheses, giving further support to some of the observations made in the last section. Table 4 shows the regression sum of squares and the corresponding degrees of freedom for the models introduced in Section 2, plus a few others that we will need to compute incremental $F$-ratios.
Table 5 shows the results of applying the incremental $F$-test to 4 different hypotheses. Following Fox's prescription, in all cases the RSS used in the denominator of the ratio was that of the full model 2a (with $280 - 58 - 1 = 221$ degrees of freedom). As we had anticipated, there is strong support for the fact that attack and defense strengths are different, $\alpha_m \neq \delta_m$, which explains why models 1x and 2x are so much better than models 0x. There is also strong support for including the local/visitor effect. The global time effect is borderline insignificant, as we had inferred from the $t$-statistic of the $T$ coefficient in the previous section. Finally, with a rather large $p$-value of 0.378, we fail to reject the hypothesis that interaction between team and time is 0, which further justifies the choice of model 2 as the best model.
| Model | Terms | Regression SS | df |
|---|---|---|---|
| mod.0 | B | 3754.42 | 15 |
| mod.0a | C + B | 4659.84 | 16 |
| mod.1 | A + D | 16064.2 | 29 |
| mod.1a | C + A + D | 16969.6 | 30 |
| mod.2 | C + A + D + T | 17183.5 | 31 |
| mod.2a | C + A + D + T * A + T * D | 19741.1 | 59 |
| mod.0BC | C + B + T | 5125.01 | 17 |
| mod.ADT | A + D + T | 16278.1 | 30 |

Table 4: AOV table for Models from Section 2 and a few others
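| $H_0$ | Description | Null model | Alternative model | $q$ | inc. $F$-ratio | $p$-value |
|---|---|---|---|---|---|---|
| $\alpha_m = \delta_m$ for all $m$ | separate attack/defense strengths | mod.0BC | mod.2 | 14 | 10.08 | 0 |
| $\gamma = 0$ (coefficient of $C$) | local/visitor effect | mod.ADT | mod.2 | 1 | 10.6 | 0.001309 |
| $\tau = 0$ (coefficient of $T$) | global time effect | mod.1a | mod.2 | 1 | 2.504 | 0.115 |
| $\tau^A_m = 0$, $\tau^D_m = 0$ for all $m$ | team-time interaction | mod.2 | mod.2a | 28 | 1.069 | 0.378 |

Table 5: Incremental $F$-ratio tests for a few hypotheses.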
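The incremental $F$-ratios in Table 5 can be approximately recomputed from the regression sums of squares in Table 4. The sketch below makes one assumption that is not stated in the text: since the residual sum of squares of model 2a is not tabulated, it is recovered from the reported $R^2$ of mod.2a (0.5112) via $\mathrm{RSS} = \mathrm{RegSS}\,(1 - R^2)/R^2$. It uses the 221 residual degrees of freedom listed for mod.2a in Table 2. Function and variable names are illustrative.

```python
N_POINTS = 280

# Regression sums of squares from Table 4.
REG_SS = {
    "mod.1a": 16969.6,
    "mod.2": 17183.5,
    "mod.2a": 19741.1,
    "mod.0BC": 5125.01,
    "mod.ADT": 16278.1,
}

# Assumption: residual SS of the full model 2a recovered from its reported R^2.
R2_FULL = 0.5112
K_FULL = 58
rss_full = REG_SS["mod.2a"] * (1.0 - R2_FULL) / R2_FULL
df_full = N_POINTS - K_FULL - 1  # 221
error_variance = rss_full / df_full


def incremental_f(null_model, alt_model, q):
    """Incremental F-ratio for q extra coefficients, scaled by the full-model variance."""
    return ((REG_SS[alt_model] - REG_SS[null_model]) / q) / error_variance


# (description, null model, alternative model, q) as in Table 5.
TESTS = [
    ("separate attack/defense strengths", "mod.0BC", "mod.2", 14),
    ("local/visitor effect", "mod.ADT", "mod.2", 1),
    ("global time effect", "mod.1a", "mod.2", 1),
    ("team-time interaction", "mod.2", "mod.2a", 28),
]

f_ratios = [incremental_f(null, alt, q) for _, null, alt, q in TESTS]
for (label, _, _, _), f_ratio in zip(TESTS, f_ratios):
    print(f"{label}: F = {f_ratio:.3f}")
```

The $p$-values would additionally require the $F$ distribution with $(q, 221)$ degrees of freedom, which the Python standard library does not provide and which is therefore left out of this sketch.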
Acknowledgments
The author would like to warmly thank his ex-colleague James X. Frohnhofer for providing all of the raw data used for this project. This data includes match scores and dates, team-conference mapping and much more.
References
[Sagarin] USA Today. Jeff Sagarin computer ratings, at USA Today. http://usatoday30.usatoday.com/sports/sagarin.htm
