
IX.

Issues in Model Building


A. Using Fs and R2s to Judge the Quality
of the Regression Model
Consider the difference between these two terms:
Statistically significant difference - a sample result for which an outcome (disparity with the hypothesized value) at least as extreme is unlikely if the null hypothesis is true.
Meaningful difference - a sample result for which an outcome (disparity with the hypothesized value) has some important business, managerial, or scientific implications.

It is possible for
a large-sample study to have statistically significant
results that do not reflect a meaningful change or
difference
or, conversely,
a small-sample study to yield a meaningful change or
difference (from the hypothesized value) that is not
statistically significant.

How/why are these outcomes possible?


Statistical significance is highly dependent on sample
size - the power of a statistical test (i.e., its ability to
detect a given difference between the sample result and
the hypothesized value as statistically significant)
increases as the sample size increases and decreases as
the sample size decreases, ceteris paribus.
Thus the likelihood that a given difference between the
sample result and the hypothesized value will be
significant also increases as the sample size increases
and decreases as the sample size decreases, ceteris
paribus.
Therefore, a study based on a large sample will find
smaller differences between sample results and the
hypothesized value to be significant (i.e., will have
greater power).
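The dependence of significance on sample size can be sketched numerically. The following Python snippet (an illustration, not part of the original notes) computes the two-sided p-value of a one-sample z-test for a fixed, small difference of 0.1 standard deviations as n grows:

```python
from math import erfc, sqrt

def z_test_p_value(effect_sd, n):
    """Two-sided p-value for a one-sample z-test of a mean shift of
    effect_sd standard deviations, given a sample of size n."""
    z = effect_sd * sqrt(n)          # the test statistic grows with sqrt(n)
    return erfc(z / sqrt(2))         # equals 2 * (1 - Phi(z))

# The same 0.1-sd difference moves from "not significant" to "significant"
# purely because the sample grows.
for n in (25, 100, 400, 1600):
    print(n, round(z_test_p_value(0.1, n), 4))
```

The fixed 0.1-standard-deviation disparity is not significant at n = 25 but is overwhelmingly significant at n = 1600, ceteris paribus.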

Consider the relationship between the coefficient of
determination R2 and the calculated F-statistic for the
hypothesis β1 = ⋯ = βp-1 = 0:

R2 = ν1F / (ν1F + ν2)

where ν1 and ν2 are the degrees of freedom of MSREG and
s2, respectively. This implies a critical value of R2 at
significance level α, and is related directly to R2's beta
distribution.
Obviously a relatively small model (i.e., ν1 is small
relative to ν2) with a small but significant (at some
predetermined level of significance α) F-statistic can
coincide with a rather small R2.
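To see the point numerically, here is a quick Python check (an illustration; ν1 = 2, ν2 = 200 and F = 3.04 are chosen by hand, with 3.04 near the α = 0.05 critical value for those degrees of freedom):

```python
def r2_from_f(f, nu1, nu2):
    """R2 implied by the overall F-statistic: R2 = nu1*F / (nu1*F + nu2)."""
    return nu1 * f / (nu1 * f + nu2)

# A small model (nu1 = 2) tested against a large error df (nu2 = 200):
# an F near its 5% critical value corresponds to a tiny R2.
print(round(r2_from_f(3.04, 2, 200), 4))   # 0.0295

# By contrast, with nu1 = 3 and nu2 = 6 (the age/income example), the same
# critical-value F of 4.76 corresponds to a much larger R2.
print(round(r2_from_f(4.76, 3, 6), 4))     # 0.7041
```

So a "significant" overall F can coexist with an R2 under 3% when the error degrees of freedom are large relative to the model's.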

So how do we assess the value of a regression model?


Do not consider either the coefficient of determination
R2 or the calculated F-statistic for the hypothesis
β1 = ⋯ = βp-1 = 0 in isolation
Do strongly consider the consistency of the results
with the theory that led to the study
Box & Wetz (1964, 1973) suggest
the range of values in the X-space over which
values of the response variable Y are to be
predicted should be larger than the
corresponding prediction errors
a model should generate an R2 that is a multiple
of the critical value of the F-statistic with ν1 and
ν2 degrees of freedom at a predetermined level of
significance α.

The Box & Wetz approach (with Draper & Smith's
suggested rule-of-thumb multiple of 4.0) applied to the
age/income data:
The critical value of the F-statistic with ν1=3 and ν2=6
degrees of freedom at a predetermined level of
significance α=0.05 is F3,6,0.05=4.76.
4(4.76)=19.04
The coefficient of determination that corresponds to
4F3,6,0.05=19.04 is

R2 = ν1(4F3,6,0.05) / (ν1(4F3,6,0.05) + ν2) = 3(19.04) / (3(19.04) + 6) = 0.904943

The actual value of the coefficient of determination for
this model is R2 = 0.9245 > 0.904943 - by the Box & Wetz
standard, the model does explain enough variation in
the response variable to be useful!

Note that Draper & Smith actually argue strongly for use
of a (more conservative) rule-of-thumb multiple of 10.0;
applied to the age/income data:
The critical value of the F-statistic with ν1=3 and ν2=6
degrees of freedom at a predetermined level of
significance α=0.05 is F3,6,0.05=4.76.
10(4.76)=47.60
The coefficient of determination that corresponds to
10F3,6,0.05=47.60 is

R2 = ν1(10F3,6,0.05) / (ν1(10F3,6,0.05) + ν2) = 3(47.60) / (3(47.60) + 6) = 0.959677

The actual value of the coefficient of determination for
this model is R2 = 0.9245 < 0.959677 - by this more
conservative standard, the model does not explain
enough variation in the response variable to be useful!
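The two threshold computations can be reproduced in a few lines of Python (shown here only as a cross-check of the arithmetic, not as part of the SAS routine):

```python
def box_wetz_r2(f_crit, nu1, nu2, multiple):
    """R2 threshold implied by requiring F >= multiple * F_crit."""
    mf = multiple * f_crit
    return nu1 * mf / (nu1 * mf + nu2)

f_crit = 4.76                                    # F(3, 6) critical value, alpha = 0.05
print(round(box_wetz_r2(f_crit, 3, 6, 4), 6))    # 0.904943
print(round(box_wetz_r2(f_crit, 3, 6, 10), 6))   # 0.959677
print(0.9245 > box_wetz_r2(f_crit, 3, 6, 4))     # True: passes the 4.0 rule
print(0.9245 > box_wetz_r2(f_crit, 3, 6, 10))    # False: fails the 10.0 rule
```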

A simpler, approximately equivalent alternative to the
Box & Wetz approach is to compute the ratio

[max(Ŷi) - min(Ŷi)] / sqrt(ps2/n)

As a rule of thumb, this ratio should exceed 4.0.
For the age/income data, we have

[7.9893242 - (-6.1871484)] / sqrt(4(28.2983)/10) = 4.213646 > 4.0

So the model does explain sufficient variation in the
response variable to be useful.
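A Python cross-check of the ratio computed above (the inputs are taken from the SAS output in this section: p = 4 parameters, s2 = 28.2983, n = 10, and extreme values 7.9893242 and -6.1871484):

```python
from math import sqrt

def range_ratio(y_max, y_min, p, s2, n):
    """(max - min) / sqrt(p * s2 / n); the rule of thumb asks for > 4.0."""
    return (y_max - y_min) / sqrt(p * s2 / n)

ratio = range_ratio(7.9893242, -6.1871484, 4, 28.2983, 10)
print(round(ratio, 6))   # about 4.2136
print(ratio > 4.0)       # True: the model passes the rule of thumb
```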

What is the problem with the Box & Wetz approach?


It does not take the context of the model into
consideration (in some contexts we cannot expect a large
R2, while in others we should).

(a clumsy) SAS Example: The Box & Wetz Test


DATA salary;
INPUT name $ 1-9 age 10-11 income 13-14 yrseduc 16 yrsinjob 18-19;
dfn=3; dfd=6; alpha=0.05; pct=1-alpha; FCrit=FINV(pct,dfn,dfd,0);
BoxWetz4=(dfn*(4*FCrit))/((dfn*(4*FCrit))+dfd);
BoxWetz10=(dfn*(10*FCrit))/((dfn*(10*FCrit))+dfd);
CARDS;
Jackson 25 21 4 2
.
.
.
.
.
.
.
.
.
Sanford 47 42 4 18
;
PROC PRINT;
VAR BoxWetz4 BoxWetz10;
RUN;
PROC REG;
MODEL income=age yrseduc yrsinjob;
OUTPUT OUT=regdata PREDICTED=yhat RESIDUAL=error;
RUN;
PROC MEANS DATA=regdata;
VAR error;
RUN;

SAS Output: Box & Wetz Test


Analysis of Age/Income Data

Obs    BoxWetz4    BoxWetz10
  1    0.90489     0.95965
  2    0.90489     0.95965
  3    0.90489     0.95965
  4    0.90489     0.95965
  5    0.90489     0.95965
  6    0.90489     0.95965
  7    0.90489     0.95965
  8    0.90489     0.95965
  9    0.90489     0.95965
 10    0.90489     0.95965

SAS Output: Box & Wetz Test


Analysis of Age/Income Data
Using PROC REG to Generate Predicted & Residual Values
The REG Procedure
Model: MODEL1
Dependent Variable: income Annual Income (in $1,000's)
Analysis of Variance
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3        2078.31013      692.77004      24.48    0.0009
Error               6         169.78987       28.29831
Corrected Total     9        2248.10000

Root MSE          5.31962    R-Square    0.9245
Dependent Mean   43.30000    Adj R-Sq    0.8867
Coeff Var        12.28549

Parameter Estimates
Variable    Label                                DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept   Intercept                             1             -24.56753           9.27983      -2.65      0.0382
age         Age in Years                          1               1.00791           0.71229       1.42      0.2068
yrseduc     Years of Post-Secondary Education     1               6.30900           2.78567       2.26      0.0641
yrsinjob    Years in Current Position             1               0.00815           1.05914       0.01      0.9941

SAS Output: Box & Wetz Test


Analysis of Age/Income Data
Using PROC REG to Generate Predicted & Residual Values
The MEANS Procedure
Analysis Variable : error Residual
Variable     N            Mean      Std Dev       Minimum      Maximum
error       10    1.110223E-14    4.3434481   -6.1871484    7.9893242

B. Common Model Building Strategies &


Algorithms
As noted earlier, the average variance of Ŷi is pσ2/n.
Thus, when constructing a regression model (deciding
what independent variables to include in the model), we
must balance two tensions:
Reduce the estimate s2 of σ2 as much as possible by
including many independent variables in the model
Reduce p as much as possible by including few
independent variables in the model.
As always, we must also be mindful of overfitting the
model!

Many automated approaches for deciding which
independent variables to include in a regression model
exist:
1. All Possible Regressions - if we have r independent
variables (including transformations and interactions
of the original p-1 independent variables) under
consideration, there are 2^r possible models; each of
these regressions is assessed with regard to some
criterion. The most frequently used criteria include:
coefficient of determination R2
residual mean square s2
Mallows' statistic Cp
This is unwieldy, cumbersome, and unwarranted
(particularly when r is large), and leads to overfitting.

2. Best Subset Regression - generate summary
statistics from the best k regressions that include 1,
2, 3, ... independent variables (including
transformations and interactions of the original p-1
independent variables) under consideration; this
leaves fewer than rk (usually far fewer than 2^r)
possible models. Again, the most frequently used
criteria include:
coefficient of determination R2
residual mean square s2
Mallows' statistic Cp
This is less unwieldy, cumbersome, and unwarranted
(particularly when r is large), but still leads to
overfitting.
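Since the example data set is tiny, the all-possible-regressions idea can be sketched in pure Python (no SAS required). The data below are transcribed from the salary example; under the assumption the transcription is faithful, the best-subset R2 values reproduce the SELECTION=rsquare BEST=1 output shown later (0.6556 for age alone, 0.9245 for age and yrseduc).

```python
from itertools import combinations

# Age/income data transcribed from the salary example (names omitted).
age      = [25, 47, 35, 62, 41, 33, 33, 41, 33, 47]
yrseduc  = [ 4,  6,  4,  4,  7,  2,  4,  5,  4,  4]
yrsinjob = [ 2, 11,  8, 27,  6, 11,  8, 12, 12, 18]
income   = [21, 57, 44, 65, 64, 23, 31, 50, 36, 42]
preds = {"age": age, "yrseduc": yrseduc, "yrsinjob": yrsinjob}

def solve(A, b):
    """Solve Ax = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def r_squared(xcols, y):
    """R2 of an OLS fit (with intercept) via the normal equations."""
    n = len(y)
    X = [[1.0] + [col[i] for col in xcols] for i in range(n)]
    k = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
           for a in range(k)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    coef = solve(XtX, Xty)
    yhat = [sum(c * x for c, x in zip(coef, X[i])) for i in range(n)]
    ybar = sum(y) / n
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return 1 - sse / sst

# Best subset of each size among the 2^3 - 1 = 7 non-empty subsets.
for size in (1, 2, 3):
    best = max(combinations(preds, size),
               key=lambda s: r_squared([preds[v] for v in s], income))
    print(size, sorted(best),
          round(r_squared([preds[v] for v in best], income), 4))
```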

SAS Example: All Possible Regressions and Best
Subsets Regression
DATA salary;
INPUT name $ 1-9 age 10-11 income 13-14 yrseduc 16 yrsinjob 18-19;
LABEL name='Last Name'
age='Age in Years'
income='Annual Income (in $1,000''s)'
yrseduc='Years of Post-Secondary Education'
yrsinjob='Years in Current Position';
CARDS;
Jackson  25 21 4 2
Ross     47 57 6 11
Bright   35 44 4 8
Standifer62 65 4 27
Clark    41 64 7 6
Snyder   33 23 2 11
Roediger 33 31 4 8
Simon    41 50 5 12
Lewis    33 36 4 12
Sanford  47 42 4 18
;
PROC REG;
MODEL income=age yrseduc yrsinjob/SELECTION=rsquare BEST=1;
RUN;

(Omit the BEST=1 option to run all possible regressions.)

Number of Observations Read    10
Number of Observations Used    10

Number in
Model      R-Square    Variables in Model
1          0.6556      age
----------------------------------------------
2          0.9245      age yrseduc
----------------------------------------------
3          0.9245      age yrseduc yrsinjob

Note you can use SAS code to exert some control over
the model fitting process:
To use Mallows' Cp statistic as the model selection
criterion:
PROC REG;
MODEL income=age yrseduc yrsinjob/SELECTION=cp;
RUN;

To use the Adjusted R2 as the model selection criterion:


PROC REG;
MODEL income=age yrseduc yrsinjob/SELECTION=adjrsq;
RUN;

To include the first k independent variables from your


model statement in each fitted model:
PROC REG;
MODEL income=age yrseduc yrsinjob/SELECTION=rsquare INCLUDE=k;
RUN;

To include at least k independent variables from your


model statement in each fitted model:
PROC REG;
MODEL income=age yrseduc yrsinjob/SELECTION=rsquare START=k;
RUN;

To include no more than k independent variables from


your model statement in each fitted model:
PROC REG;
MODEL income=age yrseduc yrsinjob/SELECTION=rsquare STOP=k;
RUN;

To print the estimated regression coefficients for each


fitted model:
PROC REG;
MODEL income=age yrseduc yrsinjob/SELECTION=rsquare B;
RUN;

3. Iterative Procedures - algorithms that systematically
apply some criterion to assess whether to add
independent variables to or remove independent
variables from an existing regression.
Forward Selection (initialize with no independent
variables in the model)
a. A partial-F test is executed for each independent
variable not included in the current (reduced)
model
b. If the smallest p-value corresponding to any of
these partial-F tests is less than some preselected
significance level αenter, the associated
independent variable is added to the model
c. The algorithm stops when the smallest p-value
corresponding to the partial-F tests for the
independent variables not in the model is greater
than some preselected significance level αenter

Backward Elimination (initialize with all
independent variables in the model)
a. A partial-F test is executed for each
independent variable included in the current
(full) model
b. If the largest p-value corresponding to any of
these partial-F tests is greater than some
preselected significance level αstay, the
associated independent variable is removed
from the model
c. The algorithm stops when the largest p-value
corresponding to any of these partial-F tests
for the independent variables in the model is
less than some preselected significance level
αstay

Stepwise Regression (initialize with no
independent variables in the model)
a. A partial-F test is executed for each
independent variable not included in the
current (reduced) model
b. If the smallest p-value corresponding to any
of these partial-F tests is less than some
preselected significance level αenter, the
associated independent variable is added to
the model
c. A partial-F test is then executed for each
independent variable now included in the
current (full) model
d. If the largest p-value corresponding to any of
these partial-F tests is greater than some
preselected significance level αstay, the
associated independent variable is removed
from the model
e. The algorithm continues to iterate until
the largest p-value corresponding to any of
these partial-F tests for the independent
variables in the model is less than some
preselected significance level αstay
and
the smallest p-value corresponding to the
partial-F tests for the independent
variables not in the model is greater than
some preselected significance level αenter

Concerns and notes on iterative model-building
algorithms:
One must carefully consider the choice of the
significance levels αenter and αstay:
Usually αenter is set equal to αstay
Occasionally αenter is set smaller than αstay (why?)
Never is αenter set larger than αstay (why not?)
Smaller αenter and αstay result in more selective
models/larger αenter and αstay result in less
selective models
Some software packages use a fixed value of F no matter
what the change in degrees of freedom (this can
usually be set by the user)
The phrase F-remove is often used to refer to the
partial-F value.
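As a sketch of how such an algorithm operates, here is a minimal pure-Python forward selection on the salary data using a fixed F-to-enter threshold (the 4.0 threshold is an arbitrary illustrative choice; SAS works with p-value thresholds SLE/SLS instead). With these data it selects age and then yrseduc, matching the stepwise output below.

```python
# Salary example data, transcribed from this section.
age      = [25, 47, 35, 62, 41, 33, 33, 41, 33, 47]
yrseduc  = [ 4,  6,  4,  4,  7,  2,  4,  5,  4,  4]
yrsinjob = [ 2, 11,  8, 27,  6, 11,  8, 12, 12, 18]
income   = [21, 57, 44, 65, 64, 23, 31, 50, 36, 42]
preds = {"age": age, "yrseduc": yrseduc, "yrsinjob": yrsinjob}

def sse(xcols, y):
    """Residual sum of squares of an OLS fit (with intercept)."""
    n = len(y)
    X = [[1.0] + [c[i] for c in xcols] for i in range(n)]
    k = len(X[0])
    M = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
         + [sum(X[i][a] * y[i] for i in range(n))] for a in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(k):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    coef = [M[i][k] / M[i][i] for i in range(k)]
    return sum((y[i] - sum(b * x for b, x in zip(coef, X[i]))) ** 2
               for i in range(n))

def forward_select(f_enter=4.0):
    """Add, at each step, the variable with the largest partial F,
    provided that F exceeds the fixed F-to-enter threshold."""
    chosen, n = [], len(income)
    while True:
        cur = sse([preds[v] for v in chosen], income)
        best, best_f = None, f_enter
        for v in preds:
            if v in chosen:
                continue
            full = sse([preds[u] for u in chosen + [v]], income)
            p_full = len(chosen) + 2           # parameters incl. intercept
            f = (cur - full) / (full / (n - p_full))   # partial F, 1 num. df
            if f > best_f:
                best, best_f = v, f
        if best is None:
            return chosen
        chosen.append(best)

print(forward_select())   # age enters first, then yrseduc; yrsinjob never does
```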

If X'X is not singular (or too near-singular), the
residual error variance s2 is a reliable estimate of σ2 in
an asymptotic sense with respect to the number of
relevant independent variables in the model
Automatic model fitting algorithms are reputedly
very slow (especially for large r), but this difficulty
has diminished greatly as computer power has
increased
Automated model fitting algorithms can generate an
overwhelming amount of information (consider the
potential task of evaluating every model generated
for homogeneity of variances, normality, fit,
independence, multicollinearity, outliers, leverage
points, etc.)
As always, we must also be mindful of overfitting the
model!

SAS Example: Stepwise Regression


DATA salary;
INPUT name $ 1-9 age 10-11 income 13-14 yrseduc 16 yrsinjob 18-19;
LABEL name='Last Name'
age='Age in Years'
income='Annual Income (in $1,000''s)'
yrseduc='Years of Post-Secondary Education'
yrsinjob='Years in Current Position';
CARDS;
Jackson  25 21 4 2
Ross     47 57 6 11
Bright   35 44 4 8
Standifer62 65 4 27
Clark    41 64 7 6
Snyder   33 23 2 11
Roediger 33 31 4 8
Simon    41 50 5 12
Lewis    33 36 4 12
Sanford  47 42 4 18
;
PROC REG;
MODEL income=age yrseduc yrsinjob/SELECTION=stepwise SLE=0.1 SLS=0.1;
RUN;

(SLE sets the entry significance level αenter, default 0.15;
SLS sets the stay significance level αstay, default 0.15.)

The REG Procedure
Model: MODEL1
Dependent Variable: income Annual Income (in $1,000's)

Number of Observations Read    10
Number of Observations Used    10

Stepwise Selection: Step 1
Variable age Entered: R-Square = 0.6556 and C(p) = 21.3587

Analysis of Variance
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1        1473.89410     1473.89410      15.23    0.0045
Error               8         774.20590       96.77574
Corrected Total     9        2248.10000

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept             -5.38425          12.85697      16.97223       0.18    0.6864
age                    1.22630           0.31423    1473.89410      15.23    0.0045

Bounds on condition number: 1, 1

Stepwise Selection: Step 2
Variable yrseduc Entered: R-Square = 0.9245 and C(p) = 2.0001

Analysis of Variance
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2        2078.30845     1039.15423      42.84    0.0001
Error               7         169.79155       24.25594
Corrected Total     9        2248.10000

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept            -24.60236           7.50022     260.98995      10.76    0.0135
age                    1.01323           0.16300     937.19626      38.64    0.0004
yrseduc                6.29030           1.26012     604.41435      24.92    0.0016

Bounds on condition number: 1.0736, 4.2945

All variables left in the model are significant at the 0.1000 level.
No other variable met the 0.1000 significance level for entry into the model.

Summary of Stepwise Selection

        Variable   Variable                                       Number    Partial     Model
Step    Entered    Removed    Label                               Vars In   R-Square    R-Square    C(p)       F Value    Pr > F
1       age                   Age in Years                        1         0.6556      0.6556      21.3587    15.23      0.0045
2       yrseduc               Years of Post-Secondary Education   2         0.2689      0.9245      2.0001     24.92      0.0016

What if we used an extremely liberal αstay?

DATA salary;
INPUT name $ 1-9 age 10-11 income 13-14 yrseduc 16 yrsinjob 18-19;
LABEL name='Last Name'
age='Age in Years'
income='Annual Income (in $1,000''s)'
yrseduc='Years of Post-Secondary Education'
yrsinjob='Years in Current Position';
CARDS;
Jackson  25 21 4 2
Ross     47 57 6 11
Bright   35 44 4 8
Standifer62 65 4 27
Clark    41 64 7 6
Snyder   33 23 2 11
Roediger 33 31 4 8
Simon    41 50 5 12
Lewis    33 36 4 12
Sanford  47 42 4 18
;
PROC REG;
MODEL income=age yrseduc yrsinjob/SELECTION=stepwise SLE=0.1 SLS=0.999999;
RUN;

(Stepwise Selection: Steps 1 and 2 proceed exactly as in the previous run - age enters, then yrseduc.)

Stepwise Selection: Step 3
Variable yrsinjob Entered: R-Square = 0.9245 and C(p) = 4.0000

Analysis of Variance
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3        2078.31013      692.77004      24.48    0.0009
Error               6         169.78987       28.29831
Corrected Total     9        2248.10000

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept            -24.56753           9.27983     198.33685       7.01    0.0382
age                    1.00791           0.71229      56.66155       2.00    0.2068
yrseduc                6.30900           2.78567     145.15177       5.13    0.0641
yrsinjob               0.00815           1.05914       0.00168       0.00    0.9941

Bounds on condition number: 17.572, 117.17

Stepwise Selection: Step 4
Variable yrsinjob Removed: R-Square = 0.9245 and C(p) = 2.0001

Analysis of Variance
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2        2078.30845     1039.15423      42.84    0.0001
Error               7         169.79155       24.25594
Corrected Total     9        2248.10000

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept            -24.60236           7.50022     260.98995      10.76    0.0135
age                    1.01323           0.16300     937.19626      38.64    0.0004
yrseduc                6.29030           1.26012     604.41435      24.92    0.0016

Bounds on condition number: 1.0736, 4.2945

All variables left in the model are significant at the 0.1000 level.
The stepwise method terminated because the next variable to be entered was just
removed.

Summary of Stepwise Selection

        Variable   Variable                                        Number    Partial     Model
Step    Entered    Removed     Label                               Vars In   R-Square    R-Square    C(p)       F Value    Pr > F
1       age                    Age in Years                        1         0.6556      0.6556      21.3587    15.23      0.0045
2       yrseduc                Years of Post-Secondary Education   2         0.2689      0.9245      2.0001     24.92      0.0016
3       yrsinjob               Years in Current Position           3         0.0000      0.9245      4.0000     0.00       0.9941
4                  yrsinjob    Years in Current Position           2         0.0000      0.9245      2.0001     0.00       0.9941

To use the forward selection algorithm (SLE sets αenter;
the default is 0.50):

PROC REG;
MODEL income=age yrseduc yrsinjob/SELECTION=f SLE=0.15;
RUN;

To use the backward elimination algorithm (SLS sets
αstay; the default is 0.10):

PROC REG;
MODEL income=age yrseduc yrsinjob/SELECTION=b SLS=0.15;
RUN;

The INCLUDE, START, and STOP options can also be


used with the STEPWISE, FORWARD, and BACKWARD
selection algorithms.

SAS offers two additional forward selection-based


algorithms:
Maximum R2 Improvement
Use forward selection to find a two-independent
variable model
Examine each pair of independent variables not
included in the current model; choose the pair
that generates the maximum increase in R2 over
the current (bivariate) model
Continue the forward selection process to look for
a third variable to add to the new bivariate model
(if none exists, the algorithm terminates)
Examine each trio of independent variables not
included in the current model; choose the trio that
generates the maximum increase in R2 over the
current (trivariate) model
Continue until the algorithm terminates

Minimum R2 Improvement
Use forward selection to find a two-independent
variable model
Examine each pair of independent variables not
included in the current model; choose the pair
that generates the minimum increase in R2 over
the current (bivariate) model
Continue the forward selection process to look for
a third variable to add to the new bivariate model
(if none exists, the algorithm terminates)
Examine each trio of independent variables not
included in the current model; choose the trio that
generates the minimum increase in R2 over the
current (trivariate) model
Continue until the algorithm terminates
Min R2 will give consideration to a greater variety of
models than Max R2.

To use the Maximum R2 selection algorithm:


PROC REG;
MODEL income=age yrseduc yrsinjob/SELECTION=maxr;
RUN;

To use the Minimum R2 selection algorithm:

PROC REG;
MODEL income=age yrseduc yrsinjob/SELECTION=minr;
RUN;

The INCLUDE, START, and STOP options can also be


used with the MAXR and MINR selection algorithms.
Note that the INCLUDE option overrides the START and
STOP options for each of these algorithms in SAS.

C. External Model Validation

The coefficient of determination (R2) for sample data is
naturally biased to be larger than the coefficient of
determination that would be achieved over the
population - why?
1. Suppose the population model is Ŷ = Xβ
2. If, for every possible sample of size n, we fit
Ŷ = Xb where b = β (i.e., the population model),
the mean of all possible sample R2s is the population
model R2
3. However, for each possible sample of size n, we
actually fit either
Ŷ = Xb where b = β (i.e., the population model)
or
Ŷ = Xb where b ≠ β (i.e., some other model)
where some model other than the population model
is selected when it produces a larger R2 than the
population model for that sample
4. If some model other than the population model is
selected for at least one sample, the mean sample R2
will exceed the R2 of the population model (i.e., is
positively biased).
This bias (and all overfitting) results from fitting the
model and evaluating the fit on the same data!
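The selection-bias argument can be made concrete with a small Monte Carlo sketch in Python (an illustration, not part of the original notes): the response is pure noise, so every candidate predictor has population R2 = 0, yet always keeping the best-fitting candidate inflates the average sample R2.

```python
import random

random.seed(1)

def r2(x, y):
    """Sample R2 of a simple linear regression of y on x."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxy = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    sxx = sum((a - xb) ** 2 for a in x)
    syy = sum((b - yb) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

n, trials, candidates = 20, 500, 5
fixed_r2, best_r2 = 0.0, 0.0
for _ in range(trials):
    y = [random.gauss(0, 1) for _ in range(n)]            # pure noise response
    xs = [[random.gauss(0, 1) for _ in range(n)] for _ in range(candidates)]
    r2s = [r2(x, y) for x in xs]
    fixed_r2 += r2s[0] / trials     # always report the same candidate
    best_r2 += max(r2s) / trials    # report the best-fitting candidate
print(round(fixed_r2, 3), round(best_r2, 3))  # the "best" mean R2 is inflated
```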

This also explains why residuals do not give an
indication of how well the model will estimate/predict
values of the dependent variable for new data.
One way to overcome this problem is to partition the
data and
use only a subset of the entire data set when fitting a
model
assess the performance of the fitted model on the
remaining data that were not used to fit the model
This is the concept behind the class of model evaluation
methods called cross-validation.

In cross-validation, the performance of the final
regression model is assessed over data not used to fit the
model. This is generally done in one of several ways:
Prediction Coefficient of Determination - calculate the
coefficient of determination for the fitted model on
data not used to fit the model. This is usually denoted
R2prediction.
Mean Square Prediction - calculate the mean squared
error for the fitted model on data not used to fit the
model. This is usually denoted MSEprediction.
Use of other performance measures such as the Akaike
Information Criterion (AIC) or Bayesian Information
Criterion (BIC) (sometimes referred to as the Schwarz
Information Criterion or SIC)
The resulting model assessment statistics are frequently
referred to as measures of generalization error.

Partitioning of the data set into i) the subset used to
fit the model and ii) the subset used to assess the fit of
the model is generally done in one of two ways:
Holdout Sample Cross-Validation (a.k.a. Data
Splitting or Split Sampling) - separate data sets are
used to fit the model (the modeling sample of size
nmodel) and assess the model's fit (the holdout sample of
size nholdout). Statistics used to assess the fit of the
model on the holdout sample include

R2prediction = 1 - [ Σ (yi - ŷi)2 ] / [ Σ (yi - ȳholdout)2 ]

where both sums run over the holdout observations
i = nmodel+1, ..., nmodel+nholdout, and ŷi is the estimated
value for the ith holdout observation using the
regression fitted on the model sample,

or

MSEprediction = [ Σ (yi - ŷi)2 ] / (nholdout - p)

with the sum again over the holdout observations.
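As a worked example of these two formulas (computed in Python for illustration), take the five holdout observations and their predictions from the SAS output later in this section. Note that the SAS routine shown there computes its "rsquare" and "verror" from sample variances of the errors, so its printed values differ slightly from the textbook formulas applied here.

```python
# Holdout observations and predictions, from the example's SAS output.
y_holdout = [21, 36, 50, 44, 57]
y_hat     = [24.9549, 33.1696, 48.1678, 35.2232, 61.1124]
p = 3                                   # parameters: intercept, age, yrseduc

n_holdout = len(y_holdout)
y_bar = sum(y_holdout) / n_holdout
sse = sum((y - yh) ** 2 for y, yh in zip(y_holdout, y_hat))
sst = sum((y - y_bar) ** 2 for y in y_holdout)

r2_prediction = 1 - sse / sst           # textbook R2 for prediction
mse_prediction = sse / (n_holdout - p)  # textbook MSE for prediction
print(round(r2_prediction, 4), round(mse_prediction, 2))
```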

Issues/concerns in Holdout Sample Cross-Validation:
Extremely simple and intuitively pleasing
Model selection is very simple (choose the model
with the best holdout sample performance)
How do you divide the data among the mutually
exclusive and collectively exhaustive samples?
equally?
nmodel > nholdout?
nmodel < nholdout?
Montgomery, Peck, & Vining (2001) suggest a
minimum of nholdout ≥ 15 or 20
Does not use the data efficiently - both nmodel
and nholdout are much smaller than n
If we have a small data set, our model and/or
holdout sample may not be representative of the
population - thus, there are two potential
sources of bias, and the holdout sample estimator
of fit generally has high variance
Cross-sectional data are usually randomly
partitioned
Time series data generally use the latest/most
recent observations as the holdout sample (why?)

A SAS Holdout Sample Routine


DATA salary;
INPUT name $ 1-9 age 10-11 income 13-14 yrseduc 16 yrsinjob 18-19;
seed=1575764;
ranhold=RANUNI(seed);
pctanalysis=0.5;
mergenum=1;
LABEL name='Last Name'
age='Age in Years'
income='Annual Income (in $1,000''s)'
yrseduc='Years of Post-Secondary Education'
yrsinjob='Years in Current Position';
CARDS;
Jackson 25 21 4 2
.
.
.
.
.
.
.
.
.
.
.
.
;
PROC MEANS NOPRINT;
VAR age;
OUTPUT OUT=nobs n=sampsize;
ID mergenum;
RUN;

DATA salary;
MERGE salary nobs;
BY mergenum;
DROP _TYPE_ _FREQ_ mergenum;
RUN;
PROC SORT DATA=salary;
BY ranhold;
RUN;
DATA salary;
SET salary;
nanalysis = pctanalysis*sampsize;
IF _N_<=nanalysis THEN DO;
analysis = 1;
yanalysis = income;
END;
IF analysis ne 1 THEN DO;
yholdout = income;
END;
RUN;
PROC REG;
MODEL yanalysis=age yrseduc;
OUTPUT OUT=new P=predict;
TITLE "REGRESSION EXECUTED ON MODEL DATA";
RUN;

DATA model;
SET new;
IF analysis ne 1 THEN DELETE;
error=yanalysis-predict;
DROP analysis yholdout;
RUN;
PROC PRINT DATA=model;
VAR name yanalysis predict error age yrseduc yrsinjob;
TITLE "MODEL DATA AND PREDICTED INCOMES";
RUN;
PROC MEANS DATA=model;
VAR error yanalysis;
OUTPUT OUT=mdiagnostics VAR=verror vyanalysis;
TITLE "SUMMARY STATISTICS FOR THE MODEL DATA";
RUN;
DATA mdiagnostics;
SET mdiagnostics;
rsquare=1-verror/vyanalysis;
DROP _TYPE_ _FREQ_;
RUN;
PROC PRINT;
TITLE "MORE SUMMARY STATISTICS FOR THE MODEL DATA";
RUN;

DATA holdout;
SET new;
IF analysis=1 THEN DELETE;
error=yholdout-predict;
DROP analysis yanalysis;
RUN;
PROC PRINT DATA=holdout;
VAR name yholdout predict error age yrseduc yrsinjob;
TITLE "HOLDOUT DATA AND PREDICTED INCOMES";
RUN;
PROC MEANS DATA=holdout;
VAR error yholdout;
OUTPUT OUT=hdiagnostics VAR=verror vyholdout;
TITLE "SUMMARY STATISTICS FOR THE HOLDOUT DATA";
RUN;
DATA hdiagnostics;
SET hdiagnostics;
rsquare=1-verror/vyholdout;
LABEL rsquare="R-SQUARE PREDICTION FOR THE HOLDOUT SAMPLE"
verror="MSE PREDICTION FOR THE HOLDOUT SAMPLE";
DROP _TYPE_ _FREQ_;
RUN;
PROC PRINT;
VAR rsquare verror;
TITLE "MORE SUMMARY STATISTICS FOR THE HOLDOUT DATA";
RUN;

The REG Procedure
Model: MODEL1
Dependent Variable: yanalysis

Number of Observations Read                   10
Number of Observations Used                    5
Number of Observations with Missing Values     5

Analysis of Variance
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2        1393.65758      696.82879      24.74    0.0389
Error               2          56.34242       28.17121
Corrected Total     4        1450.00000

Root MSE          5.30766    R-Square    0.9611
Dependent Mean   45.00000    Adj R-Sq    0.9223
Coeff Var        11.79479

Parameter Estimates
Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1             -27.85033          10.90428      -2.55      0.1252
age           1               1.02684           0.22311       4.60      0.0441
yrseduc       1               6.78357           1.50291       4.51      0.0457

MODEL DATA AND PREDICTED INCOMES

Obs    name         yanalysis    predict    error       age    yrseduc    yrsinjob
1      Roediger     31           33.1696    -2.16955    33     4          8
2      Standifer    65           62.9478     2.05218    62     4          27
3      Clark        64           61.7349     2.26505    41     7          6
4      Sanford      42           47.5453    -5.54527    47     4          18
5      Snyder       23           19.6024     3.39758    33     2          11

SUMMARY STATISTICS FOR THE MODEL DATA

The MEANS Procedure
Variable     N            Mean       Std Dev       Minimum       Maximum
error        5    4.973799E-15     3.7530793    -5.5452663     3.3975814
yanalysis    5      45.0000000    19.0394328    23.0000000    65.0000000

MORE SUMMARY STATISTICS FOR THE MODEL DATA

Obs    verror     vyanalysis    rsquare
1      14.0856    362.5         0.96114

HOLDOUT DATA AND PREDICTED INCOMES

Obs    name       yholdout    predict    error       age    yrseduc    yrsinjob
1      Jackson    21          24.9549    -3.95486    25     4          2
2      Lewis      36          33.1696     2.83045    33     4          12
3      Simon      50          48.1678     1.83219    41     5          12
4      Bright     44          35.2232     8.77677    35     4          8
5      Ross       57          61.1124    -4.11240    47     6          11

SUMMARY STATISTICS FOR THE HOLDOUT DATA

The MEANS Procedure
Variable    N           Mean       Std Dev        Minimum       Maximum
error       5      1.0744305     5.3661170    -4.1123996     8.7767746
yholdout    5     41.6000000    13.8672276    21.0000000    57.0000000

MORE SUMMARY STATISTICS FOR THE HOLDOUT DATA

Obs    rsquare    verror
1      0.85026    28.7952

So the model

Predicted Income = -27.85 + 1.03(Age) + 6.78(YearsEduc)

explains 85% of the variation in the holdout values of
Income (R2prediction = 0.85026) and has an MSEprediction of
28.7952.
Note that the model generated values of R2 = 0.9611 and
MSE = 28.17121 - what do the discrepancies between
these sets of values say about the model?

Resampling
jackknife (leave-one-out) cross-validation
(invented by Quenouille and later developed by
Tukey):
o fit the model n times (omitting a different
observation each time)
o use only the omitted observation to assess the
model fitted when it was omitted, using
whatever error criterion interests you
o summarize the error criterion over the n
subsets

v-fold cross-validation (developed by Geisser,
Stone, and Wahba):
o divide the data into v mutually exclusive and
collectively exhaustive subsets of
(approximately) equal size
o fit the model v times (omitting a different one
of the v subsets each time)
o use only the omitted subset to assess the
model fitted when it was omitted, using
whatever error criterion interests you
o summarize the error criterion over the v
subsets

Note that:
v-fold cross-validation was developed as a
compromise between holdout sample and jackknife
cross-validation methods
If v = n, then v-fold cross-validation is the jackknife
Evidence suggests v-fold cross-validation is markedly
superior to holdout sample cross-validation for small
data sets (Goutte, 1997, "Note on Free Lunches and
Cross-Validation," Neural Computation, 9, 1211-1215,
ftp://eivind.imm.dtu.dk/dist/1997/goutte.nflcv.ps.gz)
A value of 10 for v is popular for estimating
generalization error
A small change in the data can cause a large change in
the model selected under jackknife/leave-one-out
cross-validation (Breiman, 1996, "Heuristics of
Instability and Stabilization in Model Selection,"
Annals of Statistics, 24, 2350-2383)
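A minimal pure-Python sketch of v-fold cross-validation, applied to a simple income-on-age regression with the example data. Fold membership is assigned round-robin for reproducibility; real applications usually randomize it. Setting v = n gives the jackknife/leave-one-out case.

```python
# Salary example data, transcribed from this section.
age    = [25, 47, 35, 62, 41, 33, 33, 41, 33, 47]
income = [21, 57, 44, 65, 64, 23, 31, 50, 36, 42]

def ols_line(x, y):
    """Least-squares slope and intercept for one predictor."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxy = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
    sxx = sum((xi - xb) ** 2 for xi in x)
    b1 = sxy / sxx
    return b1, yb - b1 * xb

def vfold_mse(x, y, v):
    """v-fold cross-validated mean squared prediction error."""
    n = len(x)
    # Mutually exclusive, collectively exhaustive folds (round-robin).
    folds = [[i for i in range(n) if i % v == f] for f in range(v)]
    sq, cnt = 0.0, 0
    for hold in folds:
        train = [i for i in range(n) if i not in hold]
        b1, b0 = ols_line([x[i] for i in train], [y[i] for i in train])
        for i in hold:  # score only the held-out observations
            sq += (y[i] - (b0 + b1 * x[i])) ** 2
            cnt += 1
    return sq / cnt

print(vfold_mse(age, income, 5))    # 5-fold CV estimate of prediction MSE
print(vfold_mse(age, income, 10))   # v = n: the jackknife / leave-one-out
```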

For choosing subsets of independent variables in
linear regression, Breiman and Spector, 1992
("Submodel Selection and Evaluation in Regression:
The X-Random Case," International Statistical Review,
60, 291-319) found 10-fold and 5-fold cross-validation
to work better than leave-one-out
For an insightful early discussion of the limitations of
cross-validation methods, see Stone, 1977
("Asymptotics for and against Cross-Validation,"
Biometrika, 64, 29-35)
For an insightful later discussion of the limitations of
cross-validation methods, see Efron & Tibshirani, 1997
("Improvements on Cross-Validation: The .632+
Bootstrap Method," Journal of the American Statistical
Association, 92, 548-560)

Ultimately, do not forget Occam's Razor - the simplest
explanation is probably the most likely (a quality often
strived for in science).
Parsimony is a particularly relevant issue in model
building/selection, where the modeler must make a
compromise between
model bias (the difference between the estimated
value and true unknown value of a parameter)
and
variance (the precision of these estimates)
where
a model with too many variables will have low
precision
a model with too few variables will be biased
(Burnham and Anderson 2002).

SOME questions you should be able to answer:


1. What is the difference between a statistically significant
result and a contextually meaningful result? Why should
we be concerned with this distinction? Under what
circumstances can a statistical analysis yield one but not the
other?
2. What is the relationship between the coefficient of
determination R2 and the calculated F-statistic for the
hypothesis β1 = ⋯ = βp-1 = 0? Under what conditions
can this F-test be significant while the coefficient of
determination R2 is relatively small? Under what conditions
can this F-test be insignificant while the coefficient of
determination R2 is relatively large?

3. What considerations should be made when evaluating the
overall model? How is the Box & Wetz procedure used?
What are the limitations of this approach? On what
statistical theory is this approach based? How can SAS be
used to implement this approach?
4. What are the most common automated algorithms for
finding a regression? What is the rationale of each of these
approaches? What are the risks of using one of these
approaches?
5. Why is overfitting the model to the sample data a problem?
Why does overfitting occur? How is external model
validation used to assess/prevent overfitting? How does
cross-validation work? How can cross-validation be used to
assess/prevent overfitting? How does resampling work?
How can resampling be used to assess/prevent overfitting?
