Abstract
We develop several simple regression models of basketball match scores. To evaluate the models we use them to fit the scores of the 140 basketball games played among the 15 teams in the Big East conference during the 2011-2012 NCAA basketball season. We test several effects for statistical significance, such as local vs. visitor, the effect of time within the season, and the time and team interaction.
Keywords:
1. Introduction
In this paper, we develop several simple regression models with the aim of predicting the score of a basketball match given only the identities of the teams that played it and the date it was played. The basic assumption underlying all models is that a team's strength at any moment in time can be summarized by one or two numbers and that the number of points scored by team $a$ against team $b$ depends on the teams only through these numbers. Our approach is similar to, and perhaps much simpler than, that employed by USA Today's sports statistician Jeff Sagarin [Sagarin], in that the expected score difference of a match depends on the teams that play it only through a single numeric summary (what Sagarin calls a ranking).

We shall start with a simple model in which the strength of a team is summarized in a single number. We shall build on this model by incorporating a local-vs-visitor effect, which is expected to bump a team's score when it plays at home. Later we will explore a bigger model in which each team is described by two numbers: an attack strength and a defense strength. Finally, we will explore the possible effect that time and time-team interaction have on the scores. We shall verify that the residuals produced by our linear fits are in fact normally distributed and present no real outliers.
Email address: mateini@gmail.com
2. Model descriptions
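In order to precisely describe the models, we introduce a bit of notation and terminology. We shall focus on games played among $n_T$ teams belonging to the same conference. We let $g$ be the number of games and index the games by $i$. For our purposes game $G_i$ will be described by the tuple

$$G_i = (l(i), v(i), T_i, L_i, V_i),$$

where:

$l(i)$ is the index of the local team in game $G_i$.
$v(i)$ is the index of the visitor team in game $G_i$.
$T_i$ is the time at which the game was played, measured as the number of months since an arbitrarily chosen time origin.
$L_i$ is the number of points scored by the local team, $l(i)$.
$V_i$ is the number of points scored by the visitor team, $v(i)$.

From this, we see that each game yields two observations: one, indexed by $i$, in which team $l(i)$ (attacker) scores $L_i$ points against team $v(i)$ (defender), and another, indexed by $g+i$, in which team $v(i)$ (attacker) scores $V_i$ points against team $l(i)$ (defender). We thus have $n = 2g$ values

$$Y_i := L_i, \qquad Y_{g+i} := V_i, \qquad i \in \{1, \dots, g\}. \tag{1}$$

It shall prove convenient to extend the definition of $l$ and $v$ to $\{g+1, g+2, \dots, 2g\}$ as follows:

$$l(g+i) := v(i), \qquad v(g+i) := l(i), \tag{2}$$

that is, for the second observation of each game the roles of the two teams are reversed. With this notation we are ready to introduce the models.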
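The bookkeeping described above, in which every game contributes two observations with the roles of attacker and defender swapped, can be sketched in a few lines of Python. This is only an illustration: the `Game` record, the function name and the toy scores are hypothetical, and the sketch assumes that the second observation of a game inherits the time of that game.

```python
from typing import NamedTuple


class Game(NamedTuple):
    local: int        # l(i): index of the local team (1-based)
    visitor: int      # v(i): index of the visitor team (1-based)
    months: float     # T_i: months since the chosen time origin
    local_pts: int    # L_i: points scored by the local team
    visitor_pts: int  # V_i: points scored by the visitor team


def stack_observations(games):
    """Turn g games into n = 2g rows (attacker, defender, months, Y).

    Rows 1..g hold the local team's score; rows g+1..2g hold the
    visitor's score with attacker and defender swapped.
    """
    first = [(gm.local, gm.visitor, gm.months, gm.local_pts) for gm in games]
    second = [(gm.visitor, gm.local, gm.months, gm.visitor_pts) for gm in games]
    return first + second


# Hypothetical toy data, not taken from the paper.
games = [Game(1, 2, 0.0, 70, 65), Game(3, 1, 0.5, 58, 61)]
obs = stack_observations(games)
for row in obs:
    print(row)
```

With $g = 2$ toy games the list `obs` has $2g = 4$ rows, and row $g + i$ mirrors row $i$.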
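In the simplest model (model 0), the number of points scored by the local team is

$$L = \mu + \alpha_l - \alpha_v + \epsilon, \tag{3}$$

where $\alpha_m$ denotes the strength of team $m$ and $\epsilon$ is a normally distributed error. This simple model implies also that the number of points scored by the visitor team will be

$$V = \mu + \alpha_v - \alpha_l + \epsilon', \tag{4}$$

where $\epsilon'$ is another normally distributed error, independent of $\epsilon$. The possible effect of time is neglected by this model. This will be addressed in later models. As written in equations (3) and (4) the model is of course over-determined. This will be addressed shortly. In order to put the model equations (3) and (4) in the standard form used for linear regression we introduce $n_T - 1$ dummy variables $B_m$, $m \in \{1, \dots, n_T - 1\}$, defined as

$$B_{im} := \begin{cases} 1 & \text{if } l(i) = m \\ -1 & \text{if } v(i) = m \\ 0 & \text{otherwise.} \end{cases}$$

We take advantage of the notational convention (2) to make the last definition hold for $i \in \{1, \dots, 2g\}$. The formula

$$Y_i = \mu + \sum_{m=1}^{n_T - 1} \alpha_m B_{im} + \epsilon_i \tag{5}$$

effectively replaces both formulas (3) and (4). Notice that the over-determination present in the former formulas is taken care of in the latter by using a set of only $n_T - 1$ dummy variables: no coefficient is fitted for the last team, with index $n_T$, which plays the role of the base team.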
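As an illustration of the dummy variables used by the single-strength model, the sketch below builds the matrix with entries $B_{im}$ for a handful of (attacker, defender) pairs. It assumes the convention that the entry is $+1$ for the scoring team and $-1$ for the defending team, and that the team with the highest index is the base team and gets no column. The function name and the toy pairs are hypothetical.

```python
def strength_dummies(pairs, n_teams):
    """Return rows B[i] with n_teams - 1 columns.

    B[i][m-1] is +1 if team m attacks in observation i, -1 if team m
    defends, and 0 otherwise. The base team (index n_teams) has no
    column (assumed convention).
    """
    rows = []
    for attacker, defender in pairs:
        row = [0] * (n_teams - 1)
        if attacker < n_teams:
            row[attacker - 1] = 1
        if defender < n_teams:
            row[defender - 1] = -1
        rows.append(row)
    return rows


# Hypothetical (attacker, defender) pairs for 3 teams; team 3 is the base team.
pairs = [(1, 2), (3, 1), (2, 1), (1, 3)]
B = strength_dummies(pairs, n_teams=3)
for row in B:
    print(row)
```

Observations involving the base team have a single non-zero entry, which is how dropping one column removes the redundancy among the strengths.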
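Model 0a adds a local-vs-visitor effect $\gamma$ to the score of the local team:

$$L = \mu + \gamma + \alpha_l - \alpha_v + \epsilon, \qquad V = \mu + \alpha_v - \alpha_l + \epsilon'.$$

To write this in regression form we introduce the indicator

$$C_i := \begin{cases} 1 & \text{if } i \le g \\ 0 & \text{otherwise} \end{cases}$$

and thus we get

$$Y_i = \mu + \gamma C_i + \sum_{m=1}^{n_T - 1} \alpha_m B_{im} + \epsilon_i. \tag{6}$$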
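In model 1 each team $m$ is described by two numbers, an attack strength $\alpha_m$ and a defense strength $\delta_m$:

$$L = \mu + \alpha_l - \delta_v + \epsilon, \qquad V = \mu + \alpha_v - \delta_l + \epsilon'.$$

Since there are two separate sets of coefficients we require two sets of dummy variables:

$$A_{im} := \begin{cases} 1 & \text{if } l(i) = m \\ 0 & \text{otherwise} \end{cases}$$

and

$$D_{im} := \begin{cases} -1 & \text{if } v(i) = m \\ 0 & \text{otherwise.} \end{cases}$$

The model formula is thus

$$Y_i = \mu + \sum_{m=1}^{n_T - 1} \alpha_m A_{im} + \sum_{m=1}^{n_T - 1} \delta_m D_{im} + \epsilon_i.$$

Since $B_{im} = A_{im} + D_{im}$, model 0 is recovered as the special case in which $\alpha_m = \delta_m$ for all $m$.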
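The relation between the single-strength and the attack/defense parameterizations can be checked numerically. The sketch below builds the two sets of dummies and verifies that their sum reproduces the single-strength dummies, so that equal attack and defense coefficients collapse the larger model into the smaller one. It assumes the sign convention in which the defense dummy is $-1$ for the defending team; the function name and the toy pairs are hypothetical.

```python
def attack_defense_dummies(pairs, n_teams):
    """Return (A, D) with n_teams - 1 columns each.

    A[i][m-1] = 1 if team m attacks in observation i, else 0.
    D[i][m-1] = -1 if team m defends in observation i, else 0
    (assumed sign convention). The base team has no column.
    """
    A, D = [], []
    for attacker, defender in pairs:
        A.append([1 if attacker == m else 0 for m in range(1, n_teams)])
        D.append([-1 if defender == m else 0 for m in range(1, n_teams)])
    return A, D


# Hypothetical (attacker, defender) pairs for 3 teams; team 3 is the base team.
pairs = [(1, 2), (3, 1), (2, 1), (1, 3)]
A, D = attack_defense_dummies(pairs, n_teams=3)

# Element-wise sum: equals the single-strength dummy matrix.
B = [[a + d for a, d in zip(row_a, row_d)] for row_a, row_d in zip(A, D)]
print(B)
```

Because the sum of the two matrices equals the single-strength design, testing whether the two sets of coefficients coincide is a test between nested linear models.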
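Adding the local-vs-visitor effect to this model gives model 1a:

$$Y_i = \mu + \gamma C_i + \sum_{m=1}^{n_T - 1} \alpha_m A_{im} + \sum_{m=1}^{n_T - 1} \delta_m D_{im} + \epsilon_i.$$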
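Model 2 adds a global time effect $\tau$:

$$Y_i = \mu + \gamma C_i + \sum_{m=1}^{n_T - 1} (\alpha_m A_{im} + \delta_m D_{im}) + \tau T_i + \epsilon_i,$$

where $T_i$ is the time at which the game was played. Finally, model 2a also includes the interaction between time and team:

$$Y_i = \mu + \gamma C_i + \sum_{m=1}^{n_T - 1} (\alpha_m A_{im} + \delta_m D_{im}) + \tau T_i + \sum_{m=1}^{n_T - 1} \left( \theta^A_m A_{im} T_i + \theta^D_m D_{im} T_i \right) + \epsilon_i.$$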
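To make the size of the largest model concrete, the sketch below assembles one row of its design matrix: an intercept, the local indicator, the attack and defense dummies, the time, and the products of each dummy with time. The column ordering, the function name and the example arguments are assumptions made for illustration, as is the $-1$ sign of the defense dummy.

```python
def model_2a_row(attacker, defender, is_local, months, n_teams):
    """One design-matrix row: [1, C, A_1.., D_1.., T, A_1*T.., D_1*T..]."""
    a = [1 if attacker == m else 0 for m in range(1, n_teams)]
    d = [-1 if defender == m else 0 for m in range(1, n_teams)]  # assumed sign
    main = [1, 1 if is_local else 0] + a + d + [months]
    interactions = [x * months for x in a] + [x * months for x in d]
    return main + interactions


# Hypothetical observation: team 4 scoring as visitor against team 12,
# 1.5 months after the time origin, in a 15-team conference.
row = model_2a_row(attacker=4, defender=12, is_local=False, months=1.5, n_teams=15)
k = len(row) - 1  # regressors, excluding the intercept
print(k)
```

For 15 teams this gives 58 regressors besides the intercept, against only 280 observations, which is the reason for the parsimony concerns discussed in the data fits.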
3. Data Fits
To test the models developed in the preceding section, we shall fit them to the scores of the 140 games played among the 15 teams of the Big East conference during the 2012 season (the first game was played on 2011/12/27 and the last game on 2012/03/10). In the notation of the last section, we have $g = 140$, $n_T = 15$, $n = 2g = 280$.
               mod.0     mod.0a    mod.1     mod.1a    mod.2*    mod.2a
$R^2$          0.09721   0.1207    0.416     0.4394    0.4449    0.5112
$\bar{R}^2$    0.04952   0.07069   0.3508    0.3744    0.3781    0.3829
$k$            14        15        28        29        30        58
$n - k - 1$    265       264       251       250       249       221
$F$-statistic  2.038     2.415     6.384     6.757     6.653     3.984
$p$-value      0.01556   0.002653  0         0         0         6.517e-14

Table 2: Model-wide fit statistics for each of the models
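The adjusted $R^2$ and the overall $F$-statistic in Table 2 follow from $R^2$, the number of observations $n = 280$ and the number of regressors $k$ through the standard formulas $\bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-k-1}$ and $F = \frac{R^2 / k}{(1 - R^2)/(n - k - 1)}$. The sketch below recomputes them for the six models. The function names and the dictionary keys are chosen here for illustration; small differences in the last digit are to be expected because the tabulated $R^2$ values are rounded.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a linear model with k regressors plus an intercept."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def overall_f(r2, n, k):
    """F-statistic for the hypothesis that all k slope coefficients are 0."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


n = 280  # two observations per game, 140 games
# (R^2, k) pairs as reported in Table 2.
table2 = {
    "0": (0.09721, 14),
    "0a": (0.1207, 15),
    "1": (0.416, 28),
    "1a": (0.4394, 29),
    "2": (0.4449, 30),
    "2a": (0.5112, 58),
}

results = {
    name: (adjusted_r2(r2, n, k), overall_f(r2, n, k))
    for name, (r2, k) in table2.items()
}
for name, (adj, f) in results.items():
    print(f"model {name:2s}  adjusted R^2 = {adj:.4f}  F = {f:.3f}")
```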
Table 1 shows a list of all 15 teams in the Big East conference and the corresponding indices into the regressor matrices. As explained above, to avoid over-determination there are only 14 columns in each matrix, and team 15 (Villanova) plays the role of the base team. Table 2 shows a summary of the model-wide fit statistics for each of the models. We see that models 1, 1a, 2 and 2a yield very similar values of adjusted $R^2$.
It is clear from this table that allowing each team to be described by two strength numbers (attack and defense) as in models 1x and 2x, as opposed to a single number (models 0x), provides a very significant improvement in the fit. Looking at the adjusted $R^2$, we see that allowing for a correction based on whether the scoring team is local or visitor yields a modest improvement. In the next section we shall test the significance of this improvement. Finally, we observe that there is only a very minor improvement in the adjusted $R^2$ (from 0.3781 to 0.3829) when the time-team interaction factors are included (model 2 vs. model 2a). This minor improvement comes at the expense of almost doubling the number of coefficients and making the total number of coefficients a sizable fraction of the total number of data points. In view of this, and in the interest of parsimony, we shall designate model 2 as our best model. If we had a larger number of games, played by the same teams over many different seasons, we would probably revisit this issue. In any case, we shall look at an incremental $F$-test for the interaction terms in the next section.
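Table 3 shows the results of fitting the data to Model 2. We see that most coefficients are significant to the 10% level (.) and many of them are significant to the 5% (*) or 1% (**) level. The coefficient of $C$ (local vs. visitor effect) is significant, with a $p$-value of 0.0013. The (global) time effect is borderline insignificant at the 10% level, with a $p$-value of 0.1162 for the $t$-test.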
              Estimate   Std. Error   t-value   p-value
(Intercept)   78.11      3.268        23.9      2.09e-66
C             3.615      1.115        3.243     0.001345
A1            -5.325     3.01         -1.769    0.07805
A2            -5.146     3.046        -1.69     0.09237
A3            0.9915     3.082        0.3217    0.7479
A4            -7.719     3.033        -2.545    0.01153
A5            -5.137     2.964        -1.733    0.08428
A6            3.966      3.089        1.284     0.2003
A7            -8.492     3.068        -2.768    0.006068
A8            -5.94      3.072        -1.933    0.05432
A9            -3.699     3.074        -1.203    0.23
A10           -8.982     3.136        -2.864    0.004537
A11           -7.199     3.036        -2.371    0.0185
A12           -14.03     3.063        -4.58     7.365e-06
A13           -5.939     3.079        -1.929    0.05485
A14           -0.5268    3.042        -0.1732   0.8627
D1            11.01      3.01         3.659     0.0003093
D2            8.656      3.046        2.842     0.004852
D3            -6.855     3.082        -2.224    0.02701
D4            14.89      3.033        4.911     1.642e-06
D5            11.61      2.964        3.917     0.000116
D6            3.936      3.089        1.274     0.2037
D7            12.53      3.068        4.083     6.001e-05
D8            6.111      3.072        1.989     0.04781
D9            0.5986     3.074        0.1947    0.8458
D10           7.959      3.136        2.538     0.01175
D11           9.332      3.036        3.073     0.002352
D12           17.92      3.063        5.849     1.549e-08
D13           -0.6289    3.079        -0.2043   0.8383
D14           12.15      3.042        3.994     8.547e-05
T             -1.195     0.7583       -1.576    0.1162

Table 3: Coefficient estimates for Model 2
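Each $t$-value in the coefficient table is simply the estimate divided by its standard error. As a quick consistency check, the sketch below recomputes it for three of the reported rows: the leading row (taken here to be the intercept, an assumption since its label did not survive), the local indicator `C` and the time regressor `T`. The dictionary layout and function name are illustrative.

```python
# (estimate, standard error, reported t-value) for three rows of the table.
rows = {
    "intercept": (78.11, 3.268, 23.9),
    "C": (3.615, 1.115, 3.243),
    "T": (-1.195, 0.7583, -1.576),
}


def t_value(estimate, std_error):
    """t-statistic for the hypothesis that a single coefficient is 0."""
    return estimate / std_error


computed = {name: t_value(est, se) for name, (est, se, _) in rows.items()}
for name, (_, _, reported) in rows.items():
    print(f"{name:10s} t = {computed[name]:8.3f}  (reported {reported})")
```

The recomputed values agree with the reported ones up to rounding of the tabulated estimates and standard errors.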
[Figure: Histogram of residuals(mod.2). Horizontal axis: residuals(mod.2); vertical axis: Frequency.]

[Figure: Normal Q-Q plot of the residuals. Horizontal axis: Theoretical Quantiles; vertical axis: Sample Quantiles.]

[Figure: Studentized residuals (mod2.studres) against fitted values (mod2$fitted.values), discussed relative to the $[-2, +2]$ range.]
4. Hypothesis tests
In this section we shall apply incremental $F$-tests to give further support to some of the observations made in the last section. Table 4 shows the regression sum of squares and the corresponding degrees of freedom for the models introduced in Section 2, plus a few others that we will need to compute incremental $F$-ratios.

Table 5 shows the results of applying the incremental $F$-test to each hypothesis. Following Fox's prescription, in all cases the denominator of the ratio was that of the full model 2a (221 residual degrees of freedom). As we had anticipated, there is strong support for the fact that attack and defense strengths are different, $\alpha_m \neq \delta_m$, which explains why models 1x and 2x are so much better than models 0x. There is also strong support for including the local/visitor effect. The global time effect is borderline insignificant, as we had inferred from the $t$-statistic of the previous section. Finally, with a rather large $p$-value of 0.378 we cannot reject the hypothesis that the interaction between team and time is 0, which further justifies the choice of model 2 as the best model.
Model   mod.0   mod.0a   mod.1   mod.1a   mod.2   mod.2a   (other)   (other)
df      15      16       29      30       31      59       17        30

Table 4: AOV table for Models from Section 2 and a few others
$H_0$                                 Description                          Model    $q$   inc. $F$-ratio   $p$-value
$\alpha_m = \delta_m$                 separate attack/defense strengths    mod.2a   14    10.08            0
$\gamma = 0$                          local/visitor effect                 mod.2a   1     10.6             0.001309
$\tau = 0$                            global time effect                   mod.2a   1     2.504            0.115
$\theta^A_m = 0,\ \theta^D_m = 0$     team-time interaction                mod.2a   28    1.069            0.378

Table 5: Incremental $F$-tests
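The incremental $F$-ratio for the team-time interaction can be approximately recovered from the $R^2$ values reported in the data fits, since the regression sum of squares is $R^2$ times the total sum of squares and the total cancels in the ratio. The sketch below assumes the usual form $F = \frac{(R^2_{\text{full}} - R^2_{\text{restricted}})/q}{(1 - R^2_{\text{2a}})/(n - k_{\text{2a}} - 1)}$, with the denominator taken from the largest model. The function name is illustrative, and the result is only expected to match the tabulated value up to rounding of the inputs.

```python
def incremental_f(r2_full, r2_restricted, q, r2_largest, resid_df_largest):
    """Incremental F-ratio with the error variance taken from the largest model."""
    numerator = (r2_full - r2_restricted) / q
    denominator = (1.0 - r2_largest) / resid_df_largest
    return numerator / denominator


# Model 2 (restricted) vs. model 2a (full): 28 interaction coefficients,
# 221 residual degrees of freedom in model 2a.
f_interaction = incremental_f(
    r2_full=0.5112,
    r2_restricted=0.4449,
    q=28,
    r2_largest=0.5112,
    resid_df_largest=221,
)
print(f"incremental F for team-time interaction: {f_interaction:.3f}")
```

A ratio this close to 1 means the 28 interaction terms explain about as much variance per degree of freedom as pure noise would, which is consistent with the large $p$-value reported for this hypothesis.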
Acknowledgments
The author would like to warmly thank his ex-colleague James X. Frohnhofer for providing all of the raw data used for this project. This data includes match scores and dates, team-conference mapping and much more.
References
[Sagarin] USA Today. Jeff Sagarin computer ratings, at USA Today. http://usatoday30.usatoday.com/sports/sagarin.htm