
IX.

Issues in Model Building


A. Using Fs and R2s to Judge the Quality
of the Regression Model
Consider the difference between these two terms:
Statistically significant difference - a sample result for which an outcome (disparity with the hypothesized value) at least as extreme is unlikely if the null hypothesis is true.
Meaningful difference - a sample result for which an outcome (disparity with the hypothesized value) has some important business, managerial, or scientific implications.

It is possible for
a large-sample study to have statistically significant
results that do not reflect a meaningful change or
difference
or, conversely,
a small-sample study to yield a meaningful change or
difference (from the hypothesized value) that is not
statistically significant.

How/why are these outcomes possible?


Statistical significance is highly dependent on sample
size - the power of a statistical test (i.e., its ability to
detect a given difference between the sample result and
the hypothesized value as statistically significant)
increases as the sample size increases and decreases as
the sample size decreases, ceteris paribus.
Thus the likelihood that a given difference between the
sample result and the hypothesized value will be
significant also increases as the sample size increases
and decreases as the sample size decreases, ceteris
paribus.
Therefore, a study based on a large sample will find
smaller differences between sample results and the
hypothesized value to be significant (i.e., will have
greater power).
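The dependence of significance on sample size can be sketched numerically. The following Python snippet (an illustration, not part of the original notes) computes the two-sided p-value of a one-sample z-test for a fixed, small difference of 0.1 standard deviations as n grows:

```python
from math import erfc, sqrt

def z_test_p_value(effect_sd, n):
    """Two-sided p-value for a one-sample z-test of a mean shift of
    effect_sd standard deviations, given a sample of size n."""
    z = effect_sd * sqrt(n)          # the test statistic grows with sqrt(n)
    return erfc(z / sqrt(2))         # equals 2 * (1 - Phi(z))

# The same 0.1-sd difference moves from "not significant" to "significant"
# purely because the sample grows.
for n in (25, 100, 400, 1600):
    print(n, round(z_test_p_value(0.1, n), 4))
```

The fixed 0.1-standard-deviation disparity is not significant at n = 25 but is overwhelmingly significant at n = 1600, ceteris paribus.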

Consider the relationship between the coefficient of
determination R2 and the calculated F-statistic for the
hypothesis β1 = ⋯ = βp-1 = 0:

R2 = ν1F / (ν1F + ν2)

where ν1 and ν2 are the degrees of freedom of MSREG and
s2, respectively. This implies a critical value of R2 at
significance level α, and is related directly to R2's beta
distribution.
Obviously a relatively small model (i.e., ν1 is small
relative to ν2) with a small but significant (at some
predetermined level of significance α) F-statistic can
coincide with a rather small R2.
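To see the point numerically, here is a quick Python check (an illustration; ν1 = 2, ν2 = 200 and F = 3.04 are chosen by hand, with 3.04 near the α = 0.05 critical value for those degrees of freedom):

```python
def r2_from_f(f, nu1, nu2):
    """R2 implied by the overall F-statistic: R2 = nu1*F / (nu1*F + nu2)."""
    return nu1 * f / (nu1 * f + nu2)

# A small model (nu1 = 2) tested against a large error df (nu2 = 200):
# an F near its 5% critical value corresponds to a tiny R2.
print(round(r2_from_f(3.04, 2, 200), 4))   # 0.0295

# By contrast, with nu1 = 3 and nu2 = 6 (the age/income example), the same
# critical-value F of 4.76 corresponds to a much larger R2.
print(round(r2_from_f(4.76, 3, 6), 4))     # 0.7041
```

So a "significant" overall F can coexist with an R2 under 3% when the error degrees of freedom are large relative to the model's.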

So how do we assess the value of a regression model?


Do not consider either the coefficient of determination
R2 or the calculated F-statistic for the hypothesis
β1 = ⋯ = βp-1 = 0 in isolation
Do strongly consider the consistency of the results
with the theory that led to the study
Box & Wetz (1964, 1973) suggest
the range of values in the X-space over which
values of the response variable Y are to be
predicted should be larger than the
corresponding prediction errors
a model should generate an R2 that is a multiple
of the critical value of the F-statistic with ν1 and
ν2 degrees of freedom at a predetermined level of
significance α.

The Box & Wetz approach (with Draper & Smith's
suggested rule-of-thumb multiple of 4.0) applied to the
age/income data:
The critical value of the F-statistic with ν1=3 and ν2=6
degrees of freedom at a predetermined level of
significance α=0.05 is F3,6,0.05=4.76.
4(4.76)=19.04
The coefficient of determination that corresponds to
4F3,6,0.05=19.04 is

R2 = ν1(4F3,6,0.05) / (ν1(4F3,6,0.05) + ν2) = 3(19.04) / (3(19.04) + 6) = 0.904943

The actual value of the coefficient of determination for
this model is R2 = 0.9245 > 0.904943 - by the Box & Wetz
standard, the model does explain enough variation in
the response variable to be useful!

Note that Draper & Smith actually argue strongly for use
of a (more conservative) rule-of-thumb multiple of 10.0;
applied to the age/income data:
The critical value of the F-statistic with ν1=3 and ν2=6
degrees of freedom at a predetermined level of
significance α=0.05 is F3,6,0.05=4.76.
10(4.76)=47.60
The coefficient of determination that corresponds to
10F3,6,0.05=47.60 is

R2 = ν1(10F3,6,0.05) / (ν1(10F3,6,0.05) + ν2) = 3(47.60) / (3(47.60) + 6) = 0.959677

The actual value of the coefficient of determination for
this model is R2 = 0.9245 < 0.959677 - by this more
conservative standard, the model does not explain
enough variation in the response variable to be useful!
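The two threshold computations can be reproduced in a few lines of Python (shown here only as a cross-check of the arithmetic, not as part of the SAS routine):

```python
def box_wetz_r2(f_crit, nu1, nu2, multiple):
    """R2 threshold implied by requiring F >= multiple * F_crit."""
    mf = multiple * f_crit
    return nu1 * mf / (nu1 * mf + nu2)

f_crit = 4.76                                    # F(3, 6) critical value, alpha = 0.05
print(round(box_wetz_r2(f_crit, 3, 6, 4), 6))    # 0.904943
print(round(box_wetz_r2(f_crit, 3, 6, 10), 6))   # 0.959677
print(0.9245 > box_wetz_r2(f_crit, 3, 6, 4))     # True: passes the 4.0 rule
print(0.9245 > box_wetz_r2(f_crit, 3, 6, 10))    # False: fails the 10.0 rule
```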

A simpler, approximately equivalent alternative to the
Box & Wetz approach is to compute the ratio

[max(Ŷi) - min(Ŷi)] / sqrt(ps2/n)

As a rule of thumb, this ratio should exceed 4.0.
For the age/income data, we have

[7.9893242 - (-6.1871484)] / sqrt(4(28.2983)/10) = 4.213646 > 4.0

So the model does explain sufficient variation in the
response variable to be useful.
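A Python cross-check of the ratio computed above (the inputs are taken from the SAS output in this section: p = 4 parameters, s2 = 28.2983, n = 10, and extreme values 7.9893242 and -6.1871484):

```python
from math import sqrt

def range_ratio(y_max, y_min, p, s2, n):
    """(max - min) / sqrt(p * s2 / n); the rule of thumb asks for > 4.0."""
    return (y_max - y_min) / sqrt(p * s2 / n)

ratio = range_ratio(7.9893242, -6.1871484, 4, 28.2983, 10)
print(round(ratio, 6))   # about 4.2136
print(ratio > 4.0)       # True: the model passes the rule of thumb
```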

What is the problem with the Box & Wetz approach?


It does not take the context of the model into
consideration (in some contexts we cannot expect a large
R2, while in others we should).

(a clumsy) SAS Example: The Box & Wetz Test


DATA salary;
INPUT name $ 1-9 age 10-11 income 13-14 yrseduc 16 yrsinjob 18-19;
dfn=3; dfd=6; alpha=0.05; pct=1-alpha; FCrit=FINV(pct,dfn,dfd,0);
BoxWetz4=(dfn*(4*FCrit))/((dfn*(4*FCrit))+dfd);
BoxWetz10=(dfn*(10*FCrit))/((dfn*(10*FCrit))+dfd);
CARDS;
Jackson 25 21 4 2
.
.
.
.
.
.
.
.
.
Sanford 47 42 4 18
;
PROC PRINT;
VAR BoxWetz4 BoxWetz10;
RUN;
PROC REG;
MODEL income=age yrseduc yrsinjob;
OUTPUT OUT=regdata PREDICTED=yhat RESIDUAL=error;
RUN;
PROC MEANS DATA=regdata;
VAR error;
RUN;

SAS Output: Box & Wetz Test


Analysis of Age/Income Data

Obs    BoxWetz4    BoxWetz10
  1    0.90489     0.95965
  2    0.90489     0.95965
  3    0.90489     0.95965
  4    0.90489     0.95965
  5    0.90489     0.95965
  6    0.90489     0.95965
  7    0.90489     0.95965
  8    0.90489     0.95965
  9    0.90489     0.95965
 10    0.90489     0.95965

SAS Output: Box & Wetz Test


Analysis of Age/Income Data
Using PROC REG to Generate Predicted & Residual Values
The REG Procedure
Model: MODEL1
Dependent Variable: income Annual Income (in $1,000's)
Analysis of Variance
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3        2078.31013      692.77004      24.48    0.0009
Error               6         169.78987       28.29831
Corrected Total     9        2248.10000

Root MSE          5.31962    R-Square    0.9245
Dependent Mean   43.30000    Adj R-Sq    0.8867
Coeff Var        12.28549

Parameter Estimates
Variable    Label                                DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept   Intercept                             1             -24.56753           9.27983      -2.65      0.0382
age         Age in Years                          1               1.00791           0.71229       1.42      0.2068
yrseduc     Years of Post-Secondary Education     1               6.30900           2.78567       2.26      0.0641
yrsinjob    Years in Current Position             1               0.00815           1.05914       0.01      0.9941

SAS Output: Box & Wetz Test


Analysis of Age/Income Data
Using PROC REG to Generate Predicted & Residual Values
The MEANS Procedure
Analysis Variable : error Residual
Variable     N            Mean      Std Dev       Minimum      Maximum
error       10    1.110223E-14    4.3434481   -6.1871484    7.9893242

B. Common Model Building Strategies &


Algorithms
As noted earlier, the average variance of Ŷi is pσ2/n.
Thus, when constructing a regression model (deciding
what independent variables to include in the model), we
must balance two tensions:
Reduce the estimate s2 of σ2 as much as possible by
including many independent variables in the model
Reduce p as much as possible by including few
independent variables in the model.
As always, we must also be mindful of overfitting the
model!

Many automated approaches for deciding which
independent variables to include in a regression model
exist:
1. All Possible Regressions - if we have r independent
variables (including transformations and interactions
of the original p-1 independent variables) under
consideration, there are 2^r possible models; each of
these regressions is assessed with regard to some
criterion. The most frequently used criteria include:
coefficient of determination R2
residual mean square s2
Mallows' statistic Cp
This is unwieldy, cumbersome, and unwarranted
(particularly when r is large), and leads to overfitting.

2. Best Subset Regression - generate summary
statistics from the best k regressions that include 1,
2, 3, ... independent variables (including
transformations and interactions of the original p-1
independent variables) under consideration; this
leaves fewer than rk (usually far fewer than 2^r)
possible models. Again, the most frequently used
criteria include:
coefficient of determination R2
residual mean square s2
Mallows' statistic Cp
This is less unwieldy, cumbersome, and unwarranted
(particularly when r is large), but still leads to
overfitting.
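Since the example data set is tiny, the all-possible-regressions idea can be sketched in pure Python (no SAS required). The data below are transcribed from the salary example; under the assumption the transcription is faithful, the best-subset R2 values reproduce the SELECTION=rsquare BEST=1 output shown later (0.6556 for age alone, 0.9245 for age and yrseduc).

```python
from itertools import combinations

# Age/income data transcribed from the salary example (names omitted).
age      = [25, 47, 35, 62, 41, 33, 33, 41, 33, 47]
yrseduc  = [ 4,  6,  4,  4,  7,  2,  4,  5,  4,  4]
yrsinjob = [ 2, 11,  8, 27,  6, 11,  8, 12, 12, 18]
income   = [21, 57, 44, 65, 64, 23, 31, 50, 36, 42]
preds = {"age": age, "yrseduc": yrseduc, "yrsinjob": yrsinjob}

def solve(A, b):
    """Solve Ax = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def r_squared(xcols, y):
    """R2 of an OLS fit (with intercept) via the normal equations."""
    n = len(y)
    X = [[1.0] + [col[i] for col in xcols] for i in range(n)]
    k = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
           for a in range(k)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    coef = solve(XtX, Xty)
    yhat = [sum(c * x for c, x in zip(coef, X[i])) for i in range(n)]
    ybar = sum(y) / n
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return 1 - sse / sst

# Best subset of each size among the 2^3 - 1 = 7 non-empty subsets.
for size in (1, 2, 3):
    best = max(combinations(preds, size),
               key=lambda s: r_squared([preds[v] for v in s], income))
    print(size, sorted(best),
          round(r_squared([preds[v] for v in best], income), 4))
```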

SAS Example: All Possible Regressions and Best
Subsets Regression
DATA salary;
INPUT name $ 1-9 age 10-11 income 13-14 yrseduc 16 yrsinjob 18-19;
LABEL name='Last Name'
age='Age in Years'
income='Annual Income (in $1,000''s)'
yrseduc='Years of Post-Secondary Education'
yrsinjob='Years in Current Position';
CARDS;
Jackson  25 21 4 2
Ross     47 57 6 11
Bright   35 44 4 8
Standifer62 65 4 27
Clark    41 64 7 6
Snyder   33 23 2 11
Roediger 33 31 4 8
Simon    41 50 5 12
Lewis    33 36 4 12
Sanford  47 42 4 18
;
PROC REG;
MODEL income=age yrseduc yrsinjob/SELECTION=rsquare BEST=1;
RUN;

(Omit the BEST=1 option to run all possible regressions.)

Number of Observations Read    10
Number of Observations Used    10

Number in
Model      R-Square    Variables in Model
1          0.6556      age
----------------------------------------------
2          0.9245      age yrseduc
----------------------------------------------
3          0.9245      age yrseduc yrsinjob

Note you can use SAS code to exert some control over
the model fitting process:
To use Mallows' Cp statistic as the model selection
criterion:
PROC REG;
MODEL income=age yrseduc yrsinjob/SELECTION=cp;
RUN;

To use the Adjusted R2 as the model selection criterion:


PROC REG;
MODEL income=age yrseduc yrsinjob/SELECTION=adjrsq;
RUN;

To include the first k independent variables from your


model statement in each fitted model:
PROC REG;
MODEL income=age yrseduc yrsinjob/SELECTION=rsquare INCLUDE=k;
RUN;

To include at least k independent variables from your


model statement in each fitted model:
PROC REG;
MODEL income=age yrseduc yrsinjob/SELECTION=rsquare START=k;
RUN;

To include no more than k independent variables from


your model statement in each fitted model:
PROC REG;
MODEL income=age yrseduc yrsinjob/SELECTION=rsquare STOP=k;
RUN;

To print the estimated regression coefficients for each


fitted model:
PROC REG;
MODEL income=age yrseduc yrsinjob/SELECTION=rsquare B;
RUN;

3. Iterative Procedures - algorithms that systematically
apply some criterion to assess whether to add
independent variables to or remove independent
variables from an existing regression.
Forward Selection (initialize with no independent
variables in the model)
a. A partial-F test is executed for each independent
variable not included in the current (reduced)
model
b. If the smallest p-value corresponding to any of
these partial-F tests is less than some preselected
significance level αenter, the associated
independent variable is added to the model
c. The algorithm stops when the smallest p-value
corresponding to the partial-F tests for the
independent variables not in the model is greater
than some preselected significance level αenter

Backward Elimination (initialize with all
independent variables in the model)
a. A partial-F test is executed for each
independent variable included in the current
(full) model
b. If the largest p-value corresponding to any of
these partial-F tests is greater than some
preselected significance level αstay, the
associated independent variable is removed
from the model
c. The algorithm stops when the largest p-value
corresponding to any of these partial-F tests
for the independent variables in the model is
less than some preselected significance level
αstay

Stepwise Regression (initialize with no
independent variables in the model)
a. A partial-F test is executed for each
independent variable not included in the
current (reduced) model
b. If the smallest p-value corresponding to any
of these partial-F tests is less than some
preselected significance level αenter, the
associated independent variable is added to
the model
c. A partial-F test is then executed for each
independent variable now included in the
current (full) model
d. If the largest p-value corresponding to any of
these partial-F tests is greater than some
preselected significance level αstay, the
associated independent variable is removed
from the model
e. The algorithm continues to iterate until
the largest p-value corresponding to any of
these partial-F tests for the independent
variables in the model is less than some
preselected significance level αstay
and
the smallest p-value corresponding to the
partial-F tests for the independent
variables not in the model is greater than
some preselected significance level αenter

Concerns and notes on iterative model-building
algorithms:
One must carefully consider the choice of the
significance levels αenter and αstay:
Usually αenter is set equal to αstay
Occasionally αenter is set smaller than αstay (why?)
Never is αenter set larger than αstay (why not?)
Smaller αenter and αstay result in more selective
models/larger αenter and αstay result in less
selective models
Some software packages use a fixed value of F no matter
what the change in degrees of freedom (this can
usually be set by the user)
The phrase F-remove is often used to refer to the
partial-F value.
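As a sketch of how such an algorithm operates, here is a minimal pure-Python forward selection on the salary data using a fixed F-to-enter threshold (the 4.0 threshold is an arbitrary illustrative choice; SAS works with p-value thresholds SLE/SLS instead). With these data it selects age and then yrseduc, matching the stepwise output below.

```python
# Salary example data, transcribed from this section.
age      = [25, 47, 35, 62, 41, 33, 33, 41, 33, 47]
yrseduc  = [ 4,  6,  4,  4,  7,  2,  4,  5,  4,  4]
yrsinjob = [ 2, 11,  8, 27,  6, 11,  8, 12, 12, 18]
income   = [21, 57, 44, 65, 64, 23, 31, 50, 36, 42]
preds = {"age": age, "yrseduc": yrseduc, "yrsinjob": yrsinjob}

def sse(xcols, y):
    """Residual sum of squares of an OLS fit (with intercept)."""
    n = len(y)
    X = [[1.0] + [c[i] for c in xcols] for i in range(n)]
    k = len(X[0])
    M = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
         + [sum(X[i][a] * y[i] for i in range(n))] for a in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(k):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    coef = [M[i][k] / M[i][i] for i in range(k)]
    return sum((y[i] - sum(b * x for b, x in zip(coef, X[i]))) ** 2
               for i in range(n))

def forward_select(f_enter=4.0):
    """Add, at each step, the variable with the largest partial F,
    provided that F exceeds the fixed F-to-enter threshold."""
    chosen, n = [], len(income)
    while True:
        cur = sse([preds[v] for v in chosen], income)
        best, best_f = None, f_enter
        for v in preds:
            if v in chosen:
                continue
            full = sse([preds[u] for u in chosen + [v]], income)
            p_full = len(chosen) + 2           # parameters incl. intercept
            f = (cur - full) / (full / (n - p_full))   # partial F, 1 num. df
            if f > best_f:
                best, best_f = v, f
        if best is None:
            return chosen
        chosen.append(best)

print(forward_select())   # age enters first, then yrseduc; yrsinjob never does
```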

If X'X is not singular (or too near-singular), the
residual error variance s2 is a reliable estimate of σ2 in
an asymptotic sense with respect to the number of
relevant independent variables in the model
Automatic model fitting algorithms are reputedly
very slow (especially for large r), but this difficulty
has diminished greatly as computer power has
increased
Automated model fitting algorithms can generate an
overwhelming amount of information (consider the
potential task of evaluating every model generated
for homogeneity of variances, normality, fit,
independence, multicollinearity, outliers, leverage
points, etc.)
As always, we must also be mindful of overfitting the
model!

SAS Example: Stepwise Regression


DATA salary;
INPUT name $ 1-9 age 10-11 income 13-14 yrseduc 16 yrsinjob 18-19;
LABEL name='Last Name'
age='Age in Years'
income='Annual Income (in $1,000''s)'
yrseduc='Years of Post-Secondary Education'
yrsinjob='Years in Current Position';
CARDS;
Jackson  25 21 4 2
Ross     47 57 6 11
Bright   35 44 4 8
Standifer62 65 4 27
Clark    41 64 7 6
Snyder   33 23 2 11
Roediger 33 31 4 8
Simon    41 50 5 12
Lewis    33 36 4 12
Sanford  47 42 4 18
;
PROC REG;
MODEL income=age yrseduc yrsinjob/SELECTION=stepwise SLE=0.1 SLS=0.1;
RUN;

(SLE sets the entry significance level αenter, default 0.15;
SLS sets the stay significance level αstay, default 0.15.)

The REG Procedure
Model: MODEL1
Dependent Variable: income Annual Income (in $1,000's)

Number of Observations Read    10
Number of Observations Used    10

Stepwise Selection: Step 1
Variable age Entered: R-Square = 0.6556 and C(p) = 21.3587

Analysis of Variance
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1        1473.89410     1473.89410      15.23    0.0045
Error               8         774.20590       96.77574
Corrected Total     9        2248.10000

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept             -5.38425          12.85697      16.97223       0.18    0.6864
age                    1.22630           0.31423    1473.89410      15.23    0.0045

Bounds on condition number: 1, 1

Stepwise Selection: Step 2
Variable yrseduc Entered: R-Square = 0.9245 and C(p) = 2.0001

Analysis of Variance
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2        2078.30845     1039.15423      42.84    0.0001
Error               7         169.79155       24.25594
Corrected Total     9        2248.10000

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept            -24.60236           7.50022     260.98995      10.76    0.0135
age                    1.01323           0.16300     937.19626      38.64    0.0004
yrseduc                6.29030           1.26012     604.41435      24.92    0.0016

Bounds on condition number: 1.0736, 4.2945

All variables left in the model are significant at the 0.1000 level.
No other variable met the 0.1000 significance level for entry into the model.

Summary of Stepwise Selection

        Variable   Variable                                       Number    Partial     Model
Step    Entered    Removed    Label                               Vars In   R-Square    R-Square    C(p)       F Value    Pr > F
1       age                   Age in Years                        1         0.6556      0.6556      21.3587    15.23      0.0045
2       yrseduc               Years of Post-Secondary Education   2         0.2689      0.9245      2.0001     24.92      0.0016

What if we used an extremely liberal αstay?

DATA salary;
INPUT name $ 1-9 age 10-11 income 13-14 yrseduc 16 yrsinjob 18-19;
LABEL name='Last Name'
age='Age in Years'
income='Annual Income (in $1,000''s)'
yrseduc='Years of Post-Secondary Education'
yrsinjob='Years in Current Position';
CARDS;
Jackson  25 21 4 2
Ross     47 57 6 11
Bright   35 44 4 8
Standifer62 65 4 27
Clark    41 64 7 6
Snyder   33 23 2 11
Roediger 33 31 4 8
Simon    41 50 5 12
Lewis    33 36 4 12
Sanford  47 42 4 18
;
PROC REG;
MODEL income=age yrseduc yrsinjob/SELECTION=stepwise SLE=0.1 SLS=0.999999;
RUN;

(Stepwise Selection: Steps 1 and 2 proceed exactly as in the previous run - age enters, then yrseduc.)

Stepwise Selection: Step 3
Variable yrsinjob Entered: R-Square = 0.9245 and C(p) = 4.0000

Analysis of Variance
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3        2078.31013      692.77004      24.48    0.0009
Error               6         169.78987       28.29831
Corrected Total     9        2248.10000

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept            -24.56753           9.27983     198.33685       7.01    0.0382
age                    1.00791           0.71229      56.66155       2.00    0.2068
yrseduc                6.30900           2.78567     145.15177       5.13    0.0641
yrsinjob               0.00815           1.05914       0.00168       0.00    0.9941

Bounds on condition number: 17.572, 117.17

Stepwise Selection: Step 4
Variable yrsinjob Removed: R-Square = 0.9245 and C(p) = 2.0001

Analysis of Variance
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2        2078.30845     1039.15423      42.84    0.0001
Error               7         169.79155       24.25594
Corrected Total     9        2248.10000

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept            -24.60236           7.50022     260.98995      10.76    0.0135
age                    1.01323           0.16300     937.19626      38.64    0.0004
yrseduc                6.29030           1.26012     604.41435      24.92    0.0016

Bounds on condition number: 1.0736, 4.2945

All variables left in the model are significant at the 0.1000 level.
The stepwise method terminated because the next variable to be entered was just
removed.

Summary of Stepwise Selection

        Variable   Variable                                        Number    Partial     Model
Step    Entered    Removed     Label                               Vars In   R-Square    R-Square    C(p)       F Value    Pr > F
1       age                    Age in Years                        1         0.6556      0.6556      21.3587    15.23      0.0045
2       yrseduc                Years of Post-Secondary Education   2         0.2689      0.9245      2.0001     24.92      0.0016
3       yrsinjob               Years in Current Position           3         0.0000      0.9245      4.0000     0.00       0.9941
4                  yrsinjob    Years in Current Position           2         0.0000      0.9245      2.0001     0.00       0.9941

To use the forward selection algorithm (SLE sets αenter;
the default is 0.50):

PROC REG;
MODEL income=age yrseduc yrsinjob/SELECTION=f SLE=0.15;
RUN;

To use the backward elimination algorithm (SLS sets
αstay; the default is 0.10):

PROC REG;
MODEL income=age yrseduc yrsinjob/SELECTION=b SLS=0.15;
RUN;

The INCLUDE, START, and STOP options can also be


used with the STEPWISE, FORWARD, and BACKWARD
selection algorithms.

SAS offers two additional forward selection-based


algorithms:
Maximum R2 Improvement
Use forward selection to find a two-independent
variable model
Examine each pair of independent variables not
included in the current model; choose the pair
that generates the maximum increase in R2 over
the current (bivariate) model
Continue the forward selection process to look for
a third variable to add to the new bivariate model
(if none exists, the algorithm terminates)
Examine each trio of independent variables not
included in the current model; choose the trio that
generates the maximum increase in R2 over the
current (trivariate) model
Continue until the algorithm terminates

Minimum R2 Improvement
Use forward selection to find a two-independent
variable model
Examine each pair of independent variables not
included in the current model; choose the pair
that generates the minimum increase in R2 over
the current (bivariate) model
Continue the forward selection process to look for
a third variable to add to the new bivariate model
(if none exists, the algorithm terminates)
Examine each trio of independent variables not
included in the current model; choose the trio that
generates the minimum increase in R2 over the
current (trivariate) model
Continue until the algorithm terminates
Min R2 will give consideration to a greater variety of
models than Max R2.

To use the Maximum R2 selection algorithm:


PROC REG;
MODEL income=age yrseduc yrsinjob/SELECTION=maxr;
RUN;

To use the Minimum R2 selection algorithm:

PROC REG;
MODEL income=age yrseduc yrsinjob/SELECTION=minr;
RUN;

The INCLUDE, START, and STOP options can also be


used with the MAXR and MINR selection algorithms.
Note that the INCLUDE option overrides the START and
STOP options for each of these algorithms in SAS.

C. External Model Validation

The coefficient of determination (R2) for sample data is
naturally biased to be larger than the coefficient of
determination that would be achieved over the
population - why?
1. Suppose the population model is Ŷ = Xβ
2. If, for every possible sample of size n, we fit
Ŷ = Xb where b = β (i.e., the population model),
the mean of all possible sample R2s is the population
model R2
3. However, for each possible sample of size n, we
actually fit either
Ŷ = Xb where b = β (i.e., the population model)
or
Ŷ = Xb where b ≠ β (i.e., some other model)
where some model other than the population model
is selected when it produces a larger R2 than the
population model for that sample
4. If some model other than the population model is
selected for at least one sample, the mean sample R2
will exceed the R2 of the population model (i.e., is
positively biased).
This bias (and all overfitting) results from fitting the
model and evaluating the fit on the same data!
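The selection-bias argument can be made concrete with a small Monte Carlo sketch in Python (an illustration, not part of the original notes): the response is pure noise, so every candidate predictor has population R2 = 0, yet always keeping the best-fitting candidate inflates the average sample R2.

```python
import random

random.seed(1)

def r2(x, y):
    """Sample R2 of a simple linear regression of y on x."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxy = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    sxx = sum((a - xb) ** 2 for a in x)
    syy = sum((b - yb) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

n, trials, candidates = 20, 500, 5
fixed_r2, best_r2 = 0.0, 0.0
for _ in range(trials):
    y = [random.gauss(0, 1) for _ in range(n)]            # pure noise response
    xs = [[random.gauss(0, 1) for _ in range(n)] for _ in range(candidates)]
    r2s = [r2(x, y) for x in xs]
    fixed_r2 += r2s[0] / trials     # always report the same candidate
    best_r2 += max(r2s) / trials    # report the best-fitting candidate
print(round(fixed_r2, 3), round(best_r2, 3))  # the "best" mean R2 is inflated
```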

This also explains why residuals do not give an
indication of how well the model will estimate/predict
values of the dependent variable for new data.
One way to overcome this problem is to partition the
data and
use only a subset of the entire data set when fitting a
model
assess the performance of the fitted model on the
remaining data that were not used to fit the model
This is the concept behind the class of model evaluation
methods called cross-validation.

In cross-validation, the performance of the final
regression model is assessed over data not used to fit the
model. This is generally done in one of several ways:
Prediction Coefficient of Determination - calculate the
coefficient of determination for the fitted model on
data not used to fit the model. This is usually denoted
R2prediction.
Mean Square Prediction - calculate the mean squared
error for the fitted model on data not used to fit the
model. This is usually denoted MSEprediction.
Use of other performance measures such as the Akaike
Information Criterion (AIC) or Bayesian Information
Criterion (BIC) (sometimes referred to as the Schwarz
Information Criterion or SIC)
The resulting model assessment statistics are frequently
referred to as measures of generalization error.

Partitioning of the data set into i) the subset used to
fit the model and ii) the subset used to assess the fit of
the model is generally done in one of two ways:
Holdout Sample Cross-Validation (a.k.a. Data
Splitting or Split Sampling) - separate data sets are
used to fit the model (the modeling sample of size
nmodel) and assess the model's fit (the holdout sample of
size nholdout). Statistics used to assess the fit of the
model on the holdout sample include

R2prediction = 1 - [ Σ (yi - ŷi)2 ] / [ Σ (yi - ȳholdout)2 ]

where both sums run over the holdout observations
i = nmodel+1, ..., nmodel+nholdout, and ŷi is the estimated
value for the ith holdout observation using the
regression fitted on the model sample,

or

MSEprediction = [ Σ (yi - ŷi)2 ] / (nholdout - p)

with the sum again over the holdout observations.
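As a worked example of these two formulas (computed in Python for illustration), take the five holdout observations and their predictions from the SAS output later in this section. Note that the SAS routine shown there computes its "rsquare" and "verror" from sample variances of the errors, so its printed values differ slightly from the textbook formulas applied here.

```python
# Holdout observations and predictions, from the example's SAS output.
y_holdout = [21, 36, 50, 44, 57]
y_hat     = [24.9549, 33.1696, 48.1678, 35.2232, 61.1124]
p = 3                                   # parameters: intercept, age, yrseduc

n_holdout = len(y_holdout)
y_bar = sum(y_holdout) / n_holdout
sse = sum((y - yh) ** 2 for y, yh in zip(y_holdout, y_hat))
sst = sum((y - y_bar) ** 2 for y in y_holdout)

r2_prediction = 1 - sse / sst           # textbook R2 for prediction
mse_prediction = sse / (n_holdout - p)  # textbook MSE for prediction
print(round(r2_prediction, 4), round(mse_prediction, 2))
```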

Issues/concerns in Holdout Sample Cross-Validation:
Extremely simple and intuitively pleasing
Model selection is very simple (choose the model
with the best holdout sample performance)
How do you divide the data among the mutually
exclusive and collectively exhaustive samples?
equally?
nmodel > nholdout?
nmodel < nholdout?
Montgomery, Peck, & Vining (2001) suggest a
minimum of nholdout ≥ 15 or 20
Does not use the data efficiently - both nmodel
and nholdout are much smaller than n
If we have a small data set, our model and/or
holdout sample may not be representative of the
population - thus, there are two potential
sources of bias, and the holdout sample estimator
of fit generally has high variance
Cross-sectional data are usually randomly
partitioned
Time series data generally use the latest/most
recent observations as the holdout sample (why?)

A SAS Holdout Sample Routine


DATA salary;
INPUT name $ 1-9 age 10-11 income 13-14 yrseduc 16 yrsinjob 18-19;
seed=1575764;
ranhold=RANUNI(seed);
pctanalysis=0.5;
mergenum=1;
LABEL name='Last Name'
age='Age in Years'
income='Annual Income (in $1,000''s)'
yrseduc='Years of Post-Secondary Education'
yrsinjob='Years in Current Position';
CARDS;
Jackson 25 21 4 2
.
.
.
.
.
.
.
.
.
.
.
.
;
PROC MEANS NOPRINT;
VAR age;
OUTPUT OUT=nobs n=sampsize;
ID mergenum;
RUN;

DATA salary;
MERGE salary nobs;
BY mergenum;
DROP _TYPE_ _FREQ_ mergenum;
RUN;
PROC SORT DATA=salary;
BY ranhold;
RUN;
DATA salary;
SET salary;
nanalysis = pctanalysis*sampsize;
IF _N_<=nanalysis THEN DO;
analysis = 1;
yanalysis = income;
END;
IF analysis ne 1 THEN DO;
yholdout = income;
END;
RUN;
PROC REG;
MODEL yanalysis=age yrseduc;
OUTPUT OUT=new P=predict;
TITLE "REGRESSION EXECUTED ON MODEL DATA";
RUN;

DATA model;
SET new;
IF analysis ne 1 THEN DELETE;
error=yanalysis-predict;
DROP analysis yholdout;
RUN;
PROC PRINT DATA=model;
VAR name yanalysis predict error age yrseduc yrsinjob;
TITLE "MODEL DATA AND PREDICTED INCOMES";
RUN;
PROC MEANS DATA=model;
VAR error yanalysis;
OUTPUT OUT=mdiagnostics VAR=verror vyanalysis;
TITLE "SUMMARY STATISTICS FOR THE MODEL DATA";
RUN;
DATA mdiagnostics;
SET mdiagnostics;
rsquare=1-verror/vyanalysis;
DROP _TYPE_ _FREQ_;
RUN;
PROC PRINT;
TITLE "MORE SUMMARY STATISTICS FOR THE MODEL DATA";
RUN;

DATA holdout;
SET new;
IF analysis=1 THEN DELETE;
error=yholdout-predict;
DROP analysis yanalysis;
RUN;
PROC PRINT DATA=holdout;
VAR name yholdout predict error age yrseduc yrsinjob;
TITLE "HOLDOUT DATA AND PREDICTED INCOMES";
RUN;
PROC MEANS DATA=holdout;
VAR error yholdout;
OUTPUT OUT=hdiagnostics VAR=verror vyholdout;
TITLE "SUMMARY STATISTICS FOR THE HOLDOUT DATA";
RUN;
DATA hdiagnostics;
SET hdiagnostics;
rsquare=1-verror/vyholdout;
LABEL rsquare="R-SQUARE PREDICTION FOR THE HOLDOUT SAMPLE"
verror="MSE PREDICTION FOR THE HOLDOUT SAMPLE";
DROP _TYPE_ _FREQ_;
RUN;
PROC PRINT;
VAR rsquare verror;
TITLE "MORE SUMMARY STATISTICS FOR THE HOLDOUT DATA";
RUN;

The REG Procedure
Model: MODEL1
Dependent Variable: yanalysis

Number of Observations Read                   10
Number of Observations Used                    5
Number of Observations with Missing Values     5

Analysis of Variance
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2        1393.65758      696.82879      24.74    0.0389
Error               2          56.34242       28.17121
Corrected Total     4        1450.00000

Root MSE          5.30766    R-Square    0.9611
Dependent Mean   45.00000    Adj R-Sq    0.9223
Coeff Var        11.79479

Parameter Estimates
Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1             -27.85033          10.90428      -2.55      0.1252
age           1               1.02684           0.22311       4.60      0.0441
yrseduc       1               6.78357           1.50291       4.51      0.0457

MODEL DATA AND PREDICTED INCOMES

Obs    name         yanalysis    predict    error       age    yrseduc    yrsinjob
1      Roediger     31           33.1696    -2.16955    33     4          8
2      Standifer    65           62.9478     2.05218    62     4          27
3      Clark        64           61.7349     2.26505    41     7          6
4      Sanford      42           47.5453    -5.54527    47     4          18
5      Snyder       23           19.6024     3.39758    33     2          11

SUMMARY STATISTICS FOR THE MODEL DATA

The MEANS Procedure
Variable     N            Mean       Std Dev       Minimum       Maximum
error        5    4.973799E-15     3.7530793    -5.5452663     3.3975814
yanalysis    5      45.0000000    19.0394328    23.0000000    65.0000000

MORE SUMMARY STATISTICS FOR THE MODEL DATA

Obs    verror     vyanalysis    rsquare
1      14.0856    362.5         0.96114

HOLDOUT DATA AND PREDICTED INCOMES

Obs    name       yholdout    predict    error       age    yrseduc    yrsinjob
1      Jackson    21          24.9549    -3.95486    25     4          2
2      Lewis      36          33.1696     2.83045    33     4          12
3      Simon      50          48.1678     1.83219    41     5          12
4      Bright     44          35.2232     8.77677    35     4          8
5      Ross       57          61.1124    -4.11240    47     6          11

SUMMARY STATISTICS FOR THE HOLDOUT DATA

The MEANS Procedure
Variable    N           Mean       Std Dev        Minimum       Maximum
error       5      1.0744305     5.3661170    -4.1123996     8.7767746
yholdout    5     41.6000000    13.8672276    21.0000000    57.0000000

MORE SUMMARY STATISTICS FOR THE HOLDOUT DATA

Obs    rsquare    verror
1      0.85026    28.7952

So the model

Predicted Income = -27.85 + 1.03(Age) + 6.78(YearsEduc)

explains 85% of the variation in the holdout values of
Income (R2prediction = 0.85026) and has an MSEprediction of
28.7952.
Note that the model generated values of R2 = 0.9611 and
MSE = 28.17121 - what do the discrepancies between
these sets of values say about the model?

Resampling
jackknife (leave-one-out) cross-validation
(invented by Quenouille and later developed by
Tukey):
o fit the model n times (omitting a different
observation each time)
o use only the omitted observation to assess the
model fitted when it was omitted, using
whatever error criterion interests you
o summarize the error criterion over the n
subsets

v-fold cross-validation (developed by Geisser,
Stone, and Wahba):
o divide the data into v mutually exclusive and
collectively exhaustive subsets of
(approximately) equal size
o fit the model v times (omitting a different one
of the v subsets each time)
o use only the omitted subset to assess the
model fitted when it was omitted, using
whatever error criterion interests you
o summarize the error criterion over the v
subsets

Note that:
v-fold cross-validation was developed as a
compromise between holdout sample and jackknife
cross-validation methods
If v = n, then v-fold cross-validation is the jackknife
Evidence suggests v-fold cross-validation is markedly
superior to holdout sample cross-validation for small
data sets (Goutte, 1997, "Note on Free Lunches and
Cross-Validation," Neural Computation, 9, 1211-1215,
ftp://eivind.imm.dtu.dk/dist/1997/goutte.nflcv.ps.gz)
A value of 10 for v is popular for estimating
generalization error
A small change in the data can cause a large change in
the model selected under jackknife/leave-one-out
cross-validation (Breiman, 1996, "Heuristics of
Instability and Stabilization in Model Selection,"
Annals of Statistics, 24, 2350-2383)
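A minimal pure-Python sketch of v-fold cross-validation, applied to a simple income-on-age regression with the example data. Fold membership is assigned round-robin for reproducibility; real applications usually randomize it. Setting v = n gives the jackknife/leave-one-out case.

```python
# Salary example data, transcribed from this section.
age    = [25, 47, 35, 62, 41, 33, 33, 41, 33, 47]
income = [21, 57, 44, 65, 64, 23, 31, 50, 36, 42]

def ols_line(x, y):
    """Least-squares slope and intercept for one predictor."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxy = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
    sxx = sum((xi - xb) ** 2 for xi in x)
    b1 = sxy / sxx
    return b1, yb - b1 * xb

def vfold_mse(x, y, v):
    """v-fold cross-validated mean squared prediction error."""
    n = len(x)
    # Mutually exclusive, collectively exhaustive folds (round-robin).
    folds = [[i for i in range(n) if i % v == f] for f in range(v)]
    sq, cnt = 0.0, 0
    for hold in folds:
        train = [i for i in range(n) if i not in hold]
        b1, b0 = ols_line([x[i] for i in train], [y[i] for i in train])
        for i in hold:  # score only the held-out observations
            sq += (y[i] - (b0 + b1 * x[i])) ** 2
            cnt += 1
    return sq / cnt

print(vfold_mse(age, income, 5))    # 5-fold CV estimate of prediction MSE
print(vfold_mse(age, income, 10))   # v = n: the jackknife / leave-one-out
```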

For choosing subsets of independent variables in
linear regression, Breiman and Spector, 1992
("Submodel Selection and Evaluation in Regression:
The X-Random Case," International Statistical Review,
60, 291-319) found 10-fold and 5-fold cross-validation
to work better than leave-one-out
For an insightful early discussion of the limitations of
cross-validation methods, see Stone, 1977
("Asymptotics for and against Cross-Validation,"
Biometrika, 64, 29-35)
For an insightful later discussion of the limitations of
cross-validation methods, see Efron & Tibshirani, 1997
("Improvements on Cross-Validation: The .632+
Bootstrap Method," Journal of the American Statistical
Association, 92, 548-560)

Ultimately, do not forget Occam's Razor - the simplest
explanation is probably the most likely (a quality often
strived for in science).
Parsimony is a particularly relevant issue in model
building/selection, where the modeler must make a
compromise between
model bias (the difference between the estimated
value and true unknown value of a parameter)
and
variance (the precision of these estimates)
where
a model with too many variables will have low
precision
a model with too few variables will be biased
(Burnham and Anderson 2002).

SOME questions you should be able to answer:


1. What is the difference between a statistically significant
result and a contextually meaningful result? Why should
we be concerned with this distinction? Under what
circumstances can a statistical analysis yield one but not the
other?
2. What is the relationship between the coefficient of
determination R2 and the calculated F-statistic for the
hypothesis β1 = ⋯ = βp-1 = 0? Under what conditions
can this F-test be significant while the coefficient of
determination R2 is relatively small? Under what conditions
can this F-test be insignificant while the coefficient of
determination R2 is relatively large?

3. What considerations should be made when evaluating the
overall model? How is the Box & Wetz procedure used?
What are the limitations of this approach? On what
statistical theory is this approach based? How can SAS be
used to implement this approach?
4. What are the most common automated algorithms for
finding a regression? What is the rationale of each of these
approaches? What are the risks of using one of these
approaches?
5. Why is overfitting the model to the sample data a problem?
Why does overfitting occur? How is external model
validation used to assess/prevent overfitting? How does
cross-validation work? How can cross-validation be used to
assess/prevent overfitting? How does resampling work?
How can resampling be used to assess/prevent overfitting?
