
Modeling basketball match scores through team-specific strength factors

Mateo Restrepo Mejía, NEAS Regression Analysis Course (Summer 2013)

Abstract
We develop several simple regression models of basketball match scores. To evaluate the models we use them to fit the scores of the 140 basketball games played among the 15 teams in the Big East conference during the 2011-2012 NCAA basketball season. We test several effects for statistical significance, such as local vs. visitor, the effect of time within the season, and the time and team interaction.

Keywords: basketball, linear regression, game score

1. Introduction

In this paper, we develop several simple regression models with the aim of predicting the score of a basketball match given only the identities of the teams that played it and the date it was played. The basic assumption underlying all models is that a team's strength at any moment in time can be summarized by one or two numbers and that the number of points scored by team $a$ when playing against team $b$ will be given by the difference between $a$'s and $b$'s strengths, plus an intercept. The methodology is similar to, although perhaps much simpler than, that employed by USA Today's sports statistician Jeff Sagarin [Sagarin], in that the expected score difference of a match depends on the teams that play it only through a single numeric summary (what Sagarin calls a rating).

We shall start with a simple model in which the strength of a team is summarized in a single number. We shall build on this model by incorporating a local-vs-visitor effect that is expected to bump a team's score when it plays at home. Later we will explore a bigger model in which each team is described by two numbers: an attack strength and a defense strength. Finally, we will explore the possible effect that time and time-team interaction have on the scores. We shall verify that the residuals produced by our linear fits are approximately normally distributed and present no real outliers.

Email address: mateini@gmail.com (NEAS Regression Analysis Course (Summer 2013))

Preprint submitted to Elsevier, September 23, 2013

2. Model descriptions
In order to precisely describe the models, we introduce a bit of notation and terminology. We shall focus on games played among $n_T$ teams belonging to the same conference. We let $g$ be the number of games and index the games by $i$. For our purposes game $G_i$ will be described by the tuple

$$G_i = (l(i), v(i), T_i, L_i, V_i),$$

where:

$l(i) \in \{1, \dots, n_T\}$ is the index of the team playing as local in game $G_i$.

$v(i) \in \{1, \dots, n_T\}$ is the index of the team playing as visitor in game $G_i$.

$T_i$ is the time the game was played, measured as a (generally fractional) number of months since an arbitrarily chosen time origin.

$L_i$ is the number of points scored by the local team, $l_i$.

$V_i$ is the number of points scored by the visitor team, $v_i$.
All our models will try to predict both scores, $L_i$ and $V_i$, for each game, from the values of the independent variables $l_i$, $v_i$ and $T_i$. In order to do this, we see each game as two separate games: game $i$, in which team $l(i)$ (as attacker) scores $L_i$ points against team $v(i)$ (defender), and another mirror game, which we shall index with $g + i$, in which team $v(i)$ (as attacker) scores $V_i$ points against team $l(i)$ (defender). Thus the response variable will have $2g$ values

$$Y_i := L_i, \qquad Y_{g+i} := V_i. \tag{1}$$

The total number of data points that we will consider is thus $n = 2g$. It shall prove convenient to extend the definition of $l(i)$ and $v(i)$ to the range $\{g + 1, g + 2, \dots, 2g\}$ as follows:

$$l(g + i) := v(i), \qquad v(g + i) := l(i), \tag{2}$$

reflecting the fact that for the mirror games, $g + 1, \dots, 2g$, the roles of the teams are reversed. With this notation we are ready to introduce the models.
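The doubling of the data described by definitions (1) and (2) can be sketched in a few lines of Python. The three games below are invented toy data, not part of the data set used in this paper, and the function name `double_games` is ours; Python lists are 0-based, so list position $i - 1$ holds the entry for game $i$.

```python
# Hypothetical toy data: 3 games among 3 teams (NOT the paper's data set).
# Each game is (local team index, visitor team index, local points, visitor points).
games = [(1, 2, 70, 65), (2, 3, 58, 61), (3, 1, 80, 77)]
g = len(games)


def double_games(games):
    """Stack each game with its mirror game, following definitions (1) and (2).

    Returns three lists of length 2g: attacker indices l, defender indices v,
    and the response Y (points scored by the attacker).
    """
    locals_ = [loc for loc, vis, L, V in games]
    visitors = [vis for loc, vis, L, V in games]
    l = locals_ + visitors                       # l(g + i) := v(i)
    v = visitors + locals_                       # v(g + i) := l(i)
    Y = [L for loc, vis, L, V in games] + [V for loc, vis, L, V in games]
    return l, v, Y


l, v, Y = double_games(games)
n = len(Y)  # n = 2g data points
```

With this layout, rows $1, \dots, g$ are the original games seen from the local team's side and rows $g + 1, \dots, 2g$ are the mirror games.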


2.1. Model 0: a single strength-factor per team


The very first model posits that team $m$ has a playing strength described by a single number, $\beta_m$, and that the number of points scored by local team $l$ playing against visitor team $v$ will be

$$L = \mu + \beta_l - \beta_v + \epsilon, \tag{3}$$

where $\mu$ is an intercept constant and $\epsilon$ is a normally distributed error. This simple model implies also that the number of points scored by the visitor team will be

$$V = \mu + \beta_v - \beta_l + \epsilon', \tag{4}$$

where $\epsilon'$ is another normally distributed error, independent of $\epsilon$. The possible effect of time is neglected by this model. This will be addressed in later models. As written in equations (3) and (4), the model is of course over-determined: adding the same constant to every $\beta_m$ leaves both predicted scores unchanged. This will be addressed shortly.

In order to put the model equations (3) and (4) in the standard form used for linear regression we introduce $n_T - 1$ dummy regressor variables $B_m$, $m \in \{1, \dots, n_T - 1\}$, defined as

$$B_{im} := \begin{cases} 1 & \text{if } l(i) = m \text{ and } m < n_T, \\ -1 & \text{if } v(i) = m \text{ and } m < n_T, \\ 0 & \text{otherwise.} \end{cases}$$

We take advantage of the notational convention (2) to make the last definition hold for $i \in \{1, \dots, 2g\}$.
The model formula is thus

$$Y_i = \mu + \sum_{m=1}^{n_T - 1} \beta_m B_{im} + \epsilon_i. \tag{5}$$

Hence, by virtue of the definition (1) of $Y_i$ for mirror games, the formula effectively replaces both formulas (3) and (4). Notice that the over-determination present in the former formulas is taken care of in the latter by using a set of $n_T - 1$ dummy regressors, which effectively makes the last team, $n_T$, the base team.
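As an illustration of how the dummy regressors of Model 0 could be assembled, the following Python sketch builds the matrix $B$ row by row. The function name `model0_dummies` and the toy indices are ours, chosen for illustration only; the coding (+1 for the attacking team, -1 for the defending team, no column for the base team) is the one described above.

```python
def model0_dummies(l, v, n_teams):
    """Return the n x (n_teams - 1) dummy matrix B of Model 0 as a list of rows.

    l[i] and v[i] are the 1-based indices of the attacking and defending team
    in row i (mirror games included). Team n_teams is the base team and
    therefore gets no column.
    """
    rows = []
    for attacker, defender in zip(l, v):
        row = [0] * (n_teams - 1)
        if attacker < n_teams:
            row[attacker - 1] = 1
        if defender < n_teams:
            row[defender - 1] = -1
        rows.append(row)
    return rows


# Hypothetical doubled data for 3 games among 3 teams (rows 4-6 are mirror games).
l = [1, 2, 3, 2, 3, 1]
v = [2, 3, 1, 1, 2, 3]
B = model0_dummies(l, v, n_teams=3)
```

Each mirror-game row of `B` is the negative of the row of the original game, which is how the single formula (5) reproduces both (3) and (4).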

2.2. Model 0a: local vs. visitor effect


This model is a slight extension of Model 0 of the previous section that tries to capture the local vs. visitor effect on the score, i.e. it includes an extra coefficient $\gamma$ that will add to the score of a team when it is playing at home. Equations (3) and (4) are modified to

$$L = \mu + \gamma + \beta_l - \beta_v + \epsilon$$
$$V = \mu + \beta_v - \beta_l + \epsilon'$$

To formulate this in matrix form, we introduce an extra dummy regressor

$$C_i := \begin{cases} 1 & \text{if } i \le g, \\ 0 & \text{otherwise,} \end{cases}$$

and thus we get

$$Y_i = \mu + \gamma C_i + \sum_{m=1}^{n_T - 1} \beta_m B_{im} + \epsilon_i. \tag{6}$$

2.3. Model 1: attack strength vs. defense strength.


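This model effectively doubles the number of (non-intercept) coefficients in Model 0. It models the strength of team $m$ by means of two separate numbers: its attack strength $\alpha_m$ and its defense strength $\delta_m$. The predicted scores for a game between $l$ and $v$ would then be

$$L = \mu + \alpha_l - \delta_v + \epsilon$$
$$V = \mu + \alpha_v - \delta_l + \epsilon'.$$

Since there are two separate sets of coefficients we require two sets of dummy variables:

$$A_{im} := \begin{cases} 1 & \text{if } l(i) = m \text{ and } m < n_T, \\ 0 & \text{otherwise,} \end{cases}$$

and

$$D_{im} := \begin{cases} -1 & \text{if } v(i) = m \text{ and } m < n_T, \\ 0 & \text{otherwise.} \end{cases}$$

The model formula is thus

$$Y_i = \mu + \sum_{m=1}^{n_T - 1} \alpha_m A_{im} + \sum_{m=1}^{n_T - 1} \delta_m D_{im} + \epsilon_i.$$

Notice that this model effectively subsumes Model 0, by making $\alpha_m = \delta_m = \beta_m$. In a later section we shall test the hypothesis that $\alpha_m = \delta_m$ for all $m$.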
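The claim that Model 1 subsumes Model 0 can be checked mechanically. The sketch below assumes that the defense dummy is coded as -1 for the defending team, in parallel with the coding of $B$ in Model 0; under that assumption the Model 0 regressors are the element-wise sum of the attack and defense regressors, so forcing equal attack and defense coefficients collapses Model 1 to Model 0. The helper names and the toy indices are ours.

```python
def attack_dummies(l, n_teams):
    """A[i][m-1] = 1 if team m attacks in row i; the base team has no column."""
    return [[1 if att == m else 0 for m in range(1, n_teams)] for att in l]


def defense_dummies(v, n_teams):
    """D[i][m-1] = -1 if team m defends in row i (ASSUMED sign convention)."""
    return [[-1 if dfn == m else 0 for m in range(1, n_teams)] for dfn in v]


def model0_dummies(l, v, n_teams):
    """B[i][m-1] = +1 for the attacker, -1 for the defender, 0 otherwise."""
    return [[(1 if att == m else 0) - (1 if dfn == m else 0)
             for m in range(1, n_teams)] for att, dfn in zip(l, v)]


# Hypothetical doubled data for 3 games among 3 teams.
l = [1, 2, 3, 2, 3, 1]
v = [2, 3, 1, 1, 2, 3]
A = attack_dummies(l, 3)
D = defense_dummies(v, 3)
B = model0_dummies(l, v, 3)

# Element-wise sum of the attack and defense regressors.
A_plus_D = [[a + d for a, d in zip(ra, rd)] for ra, rd in zip(A, D)]
```

Here `A_plus_D` coincides with `B`, which is the matrix form of the constraint linking the two models.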

2.4. Model 1a: Model 1 + local-vs-visitor effect


The title says it all: we add the local vs. visitor effect to Model 1 in the same way that we obtained Model 0a from Model 0. The resulting model is

$$Y_i = \mu + \gamma C_i + \sum_{m=1}^{n_T - 1} \alpha_m A_{im} + \sum_{m=1}^{n_T - 1} \delta_m D_{im} + \epsilon_i.$$



| Team | Index |
|---|---|
| cincinnati | 1 |
| connecticut | 2 |
| depaul | 3 |
| georgetown | 4 |
| louisville | 5 |
| marquette | 6 |
| notre-dame | 7 |
| pittsburgh | 8 |
| providence | 9 |
| rutgers | 10 |
| seton-hall | 11 |
| south-fla | 12 |
| st-johns-ny | 13 |
| syracuse | 14 |
| villanova | 15 |

Table 1: Teams in the Big-East conference and their indices in models.

2.5. Model 2: Model 1a + global effect of time


We enhance the previous model by an extra term that attempts to capture the effect that time in the season has on the score. The model formula is

$$Y_i = \mu + \gamma C_i + \sum_{m=1}^{n_T - 1} \left( \alpha_m A_{im} + \delta_m D_{im} \right) + \tau T_i + \epsilon_i,$$

where $T_i$ is the time in months measured from a judiciously chosen origin.

2.6. Model 2a: Model 1a + time-team interactions


This shall be our last model. It is an enhancement of Model 1a that allows interactions of the time factor with the team strength factors, allowing for the possibility that the attack/defense strength of a team varies linearly throughout the season:

$$Y_i = \mu + \gamma C_i + \sum_{m=1}^{n_T - 1} \left( \alpha_m A_{im} + \delta_m D_{im} \right) + \tau T_i + \sum_{m=1}^{n_T - 1} \left( \tau^A_m A_{im} T_i + \tau^D_m D_{im} T_i \right) + \epsilon_i.$$
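To make the size of the largest model concrete, the following sketch assembles one row of a design matrix for Model 2a and counts its columns. The function name `model2a_row`, the column ordering and the -1 coding of the defense dummy are our assumptions for illustration; the example game (team 4 at home against team 3, 1.5 months after the time origin) is invented.

```python
def model2a_row(attacker, defender, is_local, t, n_teams):
    """One row of a Model 2a design matrix (hypothetical column ordering).

    Columns: intercept, C, A_1..A_{nT-1}, D_1..D_{nT-1}, T,
             A_m * T for each m, D_m * T for each m.
    The defense dummy is ASSUMED to be coded as -1 for the defending team.
    """
    A = [1 if attacker == m else 0 for m in range(1, n_teams)]
    D = [-1 if defender == m else 0 for m in range(1, n_teams)]
    C = 1 if is_local else 0
    return [1, C] + A + D + [t] + [a * t for a in A] + [d * t for d in D]


# Invented example: team 4 at home against team 3, 1.5 months after the origin.
row = model2a_row(attacker=4, defender=3, is_local=True, t=1.5, n_teams=15)
k = len(row) - 1  # number of non-intercept regressors
```

For 15 teams this gives `k = 58` non-intercept regressors, which agrees with the value of $k$ reported for mod.2a in Table 2.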

3. Data Fits
To test the models developed in the preceding section, we shall fit them to the scores of 140 games played among the 15 teams belonging to the Big-East conference during the 2011-2012 season (the first game was played on 2011/12/27 and the last game on 2012/03/10). In the notation of the last section, we have $g = 140$ and $n_T = 15$, and the total number of data points to be fit is $n = 2g = 280$.



| Model | $R^2$ | adjusted $R^2$ | $k$ | $n - k - 1$ | $F$-statistic | p-value |
|---|---|---|---|---|---|---|
| mod.0 | 0.09721 | 0.04952 | 14 | 265 | 2.038 | 0.01556 |
| mod.0a | 0.1207 | 0.07069 | 15 | 264 | 2.415 | 0.002653 |
| mod.1 | 0.416 | 0.3508 | 28 | 251 | 6.384 | 0 |
| mod.1a | 0.4394 | 0.3744 | 29 | 250 | 6.757 | 0 |
| mod.2* | 0.4449 | 0.3781 | 30 | 249 | 6.653 | 0 |
| mod.2a | 0.5112 | 0.3829 | 58 | 221 | 3.984 | 6.517e-14 |

Table 2: Summary of model-wide statistics for models 0 through 2a
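The adjusted $R^2$ and $F$-statistic columns of Table 2 follow from $R^2$, $n = 280$ and $k$ through the standard formulas. The short sketch below recomputes them from the rounded $R^2$ values of the table, so small discrepancies in the last digit are to be expected; the function names are ours.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k non-intercept regressors and n points."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def overall_f(r2, n, k):
    """Overall F-statistic testing that all k non-intercept coefficients are 0."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


n = 280
# (R^2, k) pairs copied from Table 2.
table2 = {
    "mod.0": (0.09721, 14),
    "mod.0a": (0.1207, 15),
    "mod.1": (0.416, 28),
    "mod.1a": (0.4394, 29),
    "mod.2": (0.4449, 30),
    "mod.2a": (0.5112, 58),
}

recomputed = {
    name: (adjusted_r2(r2, n, k), overall_f(r2, n, k))
    for name, (r2, k) in table2.items()
}
for name, (adj, f) in recomputed.items():
    print(f"{name:7s} adjusted R^2 = {adj:.4f}   F = {f:.3f}")
```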

Table 1 shows a list of all 15 teams in the Big-East conference and the corresponding indices into the regressor matrices. As explained above, to avoid over-determination there are only 14 columns in each matrix, and team 15 (Villanova) plays the role of the base team.

Table 2 shows a summary of the model-wide fit statistics for each of the models. We see that models 1, 1a, 2 and 2a yield very similar values of adjusted $R^2$. It is clear from this table that allowing each team to be described by two strength numbers (attack and defense) as in models 1x and 2x, as opposed to a single number (models 0x), provides a very significant improvement in the $R^2$, namely around 30 percentage points. By comparing mod.0a to mod.0 and mod.1a to mod.1 we see that allowing for a correction based on whether the scoring team is local or visitor yields a modest improvement in the adjusted $R^2$. In the next section we shall test the significance of this improvement.

Finally, we observe that there is only a very minor improvement in the adjusted $R^2$ (about 0.5 percentage points, from 0.3781 to 0.3829) obtained by including the interaction of time with the team strength factors (model 2 vs. model 2a). This minor improvement comes at the expense of almost doubling the number of coefficients and making the total number of coefficients a sizable fraction of the total number of points. In view of this, and in the interest of parsimony, we shall designate model 2 as our best model. If we had a larger number of games, played by the same teams over many different seasons, we would probably revisit this issue. In any case, we shall look at an incremental $F$-test comparing models 2 and 2a in the next section.

Table 3 shows the results of fitting the data to Model 2. We see that most coefficients are significant at the 10% level (.) and many of them are significant at the 5% (*) and even the 1% (**) level. In particular, the coefficient $\gamma$ of $C$ (local vs. visitor effect) is significant with $p = 0.0013$. The coefficient $\tau$ of $T$ (global time effect) falls just short of significance at the 10% level (with $p = 0.1162$).
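As a worked illustration of how the fitted coefficients of Table 3 translate into a predicted score, the sketch below evaluates the Model 2 formula for one invented matchup. It rests on two assumptions that we flag explicitly: the defense coefficient of the defending team enters with a minus sign (defense dummy coded as -1), and the time origin, which is not stated in the text, is taken as $t = 0$ for the example. The function name `predict` is ours, and the base team (index 15) is given zero attack and defense coefficients.

```python
# Estimates copied from Table 3 (Model 2).
INTERCEPT = 78.11
HOME = 3.615    # coefficient of C (local vs. visitor effect)
TIME = -1.195   # coefficient of T (points per month)
# Attack and defense coefficients for teams 1..14; team 15 (base team) is 0.
A = [-5.325, -5.146, 0.9915, -7.719, -5.137, 3.966, -8.492, -5.94,
     -3.699, -8.982, -7.199, -14.03, -5.939, -0.5268, 0.0]
D = [11.01, 8.656, -6.855, 14.89, 11.61, 3.936, 12.53, 6.111,
     0.5986, 7.959, 9.332, 17.92, -0.6289, 12.15, 0.0]


def predict(attacker, defender, attacker_is_local, t):
    """Predicted points for `attacker` (1-based team index) against `defender`.

    ASSUMPTION: the defender's coefficient is subtracted, i.e. the defense
    dummy is coded as -1 for the defending team.
    """
    home = HOME if attacker_is_local else 0.0
    return INTERCEPT + home + A[attacker - 1] - D[defender - 1] + TIME * t


# Invented matchup: georgetown (4) hosting depaul (3) at the time origin.
local_points = predict(4, 3, attacker_is_local=True, t=0.0)
visitor_points = predict(3, 4, attacker_is_local=False, t=0.0)
```

Under these assumptions the example gives roughly 80.9 points for the local team and 64.2 for the visitor.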

3.1. Analysis of residuals


Figures 1 and 2 show a histogram and a qq-plot against normal quantiles of the residuals of model 2. The qq-plot shows that the assumption of normality is justified. Figure 3 shows a plot of studentized residuals (as obtained from the function studres from the MASS package in R) vs. the fitted values for model 2. There are 15 studentized residuals with values outside of $[-2, 2]$, which is in line with expectations: under normality about 5% of the 280 residuals, i.e. roughly 13, should fall outside this range.

| Coefficient | Estimate | Std. Error | t-value | p-value | Significance |
|---|---|---|---|---|---|
| (intercept) | 78.11 | 3.268 | 23.9 | 2.09e-66 | *** |
| C | 3.615 | 1.115 | 3.243 | 0.001345 | ** |
| A1 | -5.325 | 3.01 | -1.769 | 0.07805 | . |
| A2 | -5.146 | 3.046 | -1.69 | 0.09237 | . |
| A3 | 0.9915 | 3.082 | 0.3217 | 0.7479 | |
| A4 | -7.719 | 3.033 | -2.545 | 0.01153 | * |
| A5 | -5.137 | 2.964 | -1.733 | 0.08428 | . |
| A6 | 3.966 | 3.089 | 1.284 | 0.2003 | |
| A7 | -8.492 | 3.068 | -2.768 | 0.006068 | ** |
| A8 | -5.94 | 3.072 | -1.933 | 0.05432 | . |
| A9 | -3.699 | 3.074 | -1.203 | 0.23 | |
| A10 | -8.982 | 3.136 | -2.864 | 0.004537 | ** |
| A11 | -7.199 | 3.036 | -2.371 | 0.0185 | * |
| A12 | -14.03 | 3.063 | -4.58 | 7.365e-06 | *** |
| A13 | -5.939 | 3.079 | -1.929 | 0.05485 | . |
| A14 | -0.5268 | 3.042 | -0.1732 | 0.8627 | |
| D1 | 11.01 | 3.01 | 3.659 | 0.0003093 | *** |
| D2 | 8.656 | 3.046 | 2.842 | 0.004852 | ** |
| D3 | -6.855 | 3.082 | -2.224 | 0.02701 | * |
| D4 | 14.89 | 3.033 | 4.911 | 1.642e-06 | *** |
| D5 | 11.61 | 2.964 | 3.917 | 0.000116 | *** |
| D6 | 3.936 | 3.089 | 1.274 | 0.2037 | |
| D7 | 12.53 | 3.068 | 4.083 | 6.001e-05 | *** |
| D8 | 6.111 | 3.072 | 1.989 | 0.04781 | * |
| D9 | 0.5986 | 3.074 | 0.1947 | 0.8458 | |
| D10 | 7.959 | 3.136 | 2.538 | 0.01175 | * |
| D11 | 9.332 | 3.036 | 3.073 | 0.002352 | ** |
| D12 | 17.92 | 3.063 | 5.849 | 1.549e-08 | *** |
| D13 | -0.6289 | 3.079 | -0.2043 | 0.8383 | |
| D14 | 12.15 | 3.042 | 3.994 | 8.547e-05 | *** |
| T | -1.195 | 0.7583 | -1.576 | 0.1162 | |

Table 3: Fitted coefficients for Model 2

Figure 1: Histogram of residuals for mod.2 (x-axis: residuals(mod.2); y-axis: Frequency)

Figure 2: qq-plot of residuals against normal quantiles (x-axis: Theoretical Quantiles; y-axis: Sample Quantiles)

Figure 3: Studentized residuals against fitted values (x-axis: mod2$fitted.values; y-axis: mod2.studres), with Lowess line and $[-2, +2]$ range.
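The statement that 15 residuals outside of the $\pm 2$ band is unremarkable can be checked with a back-of-the-envelope calculation. The sketch below approximates the studentized residuals as independent standard normal variables, which is our simplifying assumption (they actually follow a $t$ distribution with many degrees of freedom and are only approximately independent), and computes the expected count and its binomial standard deviation.

```python
import math


def two_sided_normal_tail(z):
    """P(|Z| > z) for a standard normal Z, via the complementary error function."""
    return math.erfc(z / math.sqrt(2.0))


n = 280                                   # number of data points (2g)
p_outside = two_sided_normal_tail(2.0)    # probability of falling outside [-2, 2]
expected_outside = n * p_outside
# Binomial standard deviation of the count, ASSUMING independent residuals.
sd_outside = math.sqrt(n * p_outside * (1.0 - p_outside))
observed = 15
z_score = (observed - expected_outside) / sd_outside
```

The observed count of 15 lies well within one standard deviation of the expected value of about 12.7.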

4. Hypothesis tests
In this section we shall apply incremental $F$-tests to a few hypotheses, giving further support to some of the observations made in the last section. Table 4 shows the regression sum of squares and the corresponding degrees of freedom for the models introduced in Section 2, plus a few others that we will need to compute incremental $F$-ratios.

Table 5 shows the results of applying the incremental $F$-test to 4 different hypotheses. Following Fox's prescription, in all cases the residual sum of squares (RSS) used in the denominator of the ratio was that of the full model 2a (with $280 - 58 - 1 = 221$ degrees of freedom, as in Table 2). As we had anticipated, there is strong support for the fact that attack and defense strengths are different, $\alpha_m \neq \delta_m$, which explains why models 1+ are so much better than models 0x. There is also strong support for including the local/visitor effect. The global time effect is borderline insignificant, as we had inferred from the $t$-statistic of the $\tau$ coefficient in the previous section. Finally, with a rather large p-value of 0.378, we fail to reject the hypothesis that the interaction between team and time is 0, which further justifies the choice of model 2 as the best model.

| Model | Terms | Regression SS | df |
|---|---|---|---|
| mod.0 | B | 3754.42 | 15 |
| mod.0a | C + B | 4659.84 | 16 |
| mod.1 | A + D | 16064.2 | 29 |
| mod.1a | C + A + D | 16969.6 | 30 |
| mod.2 | C + A + D + T | 17183.5 | 31 |
| mod.2a | C + A + D + T * A + T * D | 19741.1 | 59 |
| mod.0BC | C + B + T | 5125.01 | 17 |
| mod.ADT | A + D + T | 16278.1 | 30 |

Table 4: AOV table for Models from Section 2 and a few others

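| $H_0$ | Description | Null model | Alternative model | $q$ | inc. $F$-ratio | p-value |
|---|---|---|---|---|---|---|
| $\alpha_m = \delta_m$ | separate attack/defense strengths | mod.0BC | mod.2 | 14 | 10.08 | 0 |
| $\gamma = 0$ (coefficient of $C$) | local/visitor effect | mod.ADT | mod.2 | 1 | 10.6 | 0.001309 |
| $\tau = 0$ (coefficient of $T$) | global time effect | mod.1a | mod.2 | 1 | 2.504 | 0.115 |
| $\tau^A_m = 0$, $\tau^D_m = 0$ | team-time interaction | mod.2 | mod.2a | 28 | 1.069 | 0.378 |

Table 5: Incremental $F$-ratio tests for a few hypotheses.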
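The incremental $F$-ratios of Table 5 can be reproduced from the regression sums of squares of Table 4. The residual sum of squares of the full model 2a is not listed in the tables, so the sketch below recovers it from the rounded $R^2$ of Table 2, which is our assumption and limits the agreement to about three significant digits. The residual degrees of freedom (221) are the $n - k - 1$ value given for mod.2a in Table 2; the function name `incremental_f` is ours.

```python
# Regression sums of squares copied from Table 4.
reg_ss = {
    "mod.0BC": 5125.01,
    "mod.ADT": 16278.1,
    "mod.1a": 16969.6,
    "mod.2": 17183.5,
    "mod.2a": 19741.1,
}
R2_FULL = 0.5112     # R^2 of mod.2a from Table 2 (rounded)
DF_RES_FULL = 221    # n - k - 1 of mod.2a from Table 2

# ASSUMPTION: recover the residual SS of the full model from its R^2,
# using RSS = RegSS * (1 - R^2) / R^2.
rss_full = reg_ss["mod.2a"] * (1.0 - R2_FULL) / R2_FULL
mse_full = rss_full / DF_RES_FULL


def incremental_f(null_model, alt_model, q):
    """Incremental F-ratio for q restrictions, with the full-model error variance."""
    return ((reg_ss[alt_model] - reg_ss[null_model]) / q) / mse_full


# (null model, alternative model, q) for the four hypotheses of Table 5.
tests = {
    "separate attack/defense strengths": ("mod.0BC", "mod.2", 14),
    "local/visitor effect": ("mod.ADT", "mod.2", 1),
    "global time effect": ("mod.1a", "mod.2", 1),
    "team-time interaction": ("mod.2", "mod.2a", 28),
}
f_ratios = {name: incremental_f(*args) for name, args in tests.items()}
for name, f in f_ratios.items():
    print(f"{name:35s} F = {f:.3f}")
```

Turning these ratios into p-values additionally requires the $F$ distribution with $(q, 221)$ degrees of freedom, which is left out of this sketch to keep it free of external dependencies.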

Acknowledgments
The author would like to warmly thank his ex-colleague James X. Frohnhofer for providing all of the raw data used for this project. These data include match scores and dates, team-conference mapping and much more.

References
[Sagarin] USA Today. Jeff Sagarin computer ratings, at USA Today. http://usatoday30.usatoday.com/sports/sagarin.htm
