It is possible for
o a large-sample study to have statistically significant results that do
  not reflect a meaningful change or difference; or, conversely,
o a small-sample study to yield a meaningful change or difference (from
  the hypothesized value) that is not statistically significant.
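The first point can be made concrete with a small numeric sketch (pure Python; the data are artificial and chosen so the arithmetic is exact): a practically trivial mean shift of 0.01 becomes "statistically significant" once the sample is large enough.

```python
import math

def one_sample_t(data, mu0=0.0):
    """t statistic for H0: population mean == mu0."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / (n - 1)
    return (mean - mu0) / math.sqrt(var / n)

# Artificial data: values alternate 1.01 and -0.99, so the sample mean is
# exactly 0.01 -- a shift few analysts would call meaningful.
small = [1.01, -0.99] * 50        # n = 100
large = [1.01, -0.99] * 20000     # n = 40,000

print(one_sample_t(small))  # about 0.1  -> not significant
print(one_sample_t(large))  # about 2.0  -> beyond the 1.96 cutoff
```

The effect size is identical in both samples; only the sample size changed.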
The critical value of $R^2$ at significance level $\alpha$ is related
directly, through $R^2$'s beta distribution, to the critical value of the
F statistic:

$$R^2_{\nu_1,\nu_2,\alpha} = \frac{\nu_1 F_{\nu_1,\nu_2,\alpha}}{\nu_1 F_{\nu_1,\nu_2,\alpha} + \nu_2}$$
Note that Draper & Smith actually argue strongly for use of a (more
conservative) rule-of-thumb multiple of 10.0. When applied to the
age/income data:

The critical value of the F statistic with $\nu_1 = 3$ and $\nu_2 = 6$
degrees of freedom at a predetermined level of significance
$\alpha = 0.05$ is $F_{3,6,0.05} = 4.76$, and $10(4.76) = 47.60$.

The coefficient of determination that corresponds to
$10F_{3,6,0.05} = 47.60$ is

$$R^2 = \frac{\nu_1 (10F_{\nu_1,\nu_2,\alpha})}{\nu_1 (10F_{\nu_1,\nu_2,\alpha}) + \nu_2} = \frac{3(47.60)}{3(47.60) + 6} = 0.959677$$
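The arithmetic above is easy to verify (pure Python; the tabled value F(3,6,0.05) = 4.76 and the multiple of 10 are taken from the text):

```python
def critical_r2(f_crit, nu1, nu2, multiple=1.0):
    """R-square corresponding to a (scaled) critical F value:
    R2 = nu1*m*F / (nu1*m*F + nu2)."""
    mf = multiple * f_crit
    return nu1 * mf / (nu1 * mf + nu2)

# Draper & Smith's conservative multiple of 10 with F(3,6,0.05) = 4.76:
print(round(critical_r2(4.76, 3, 6, multiple=10.0), 6))  # 0.959677
```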
i.e.,

$$\frac{\max \hat{Y}_i - \min \hat{Y}_i}{\sqrt{p s^2 / n}}$$

As a rule of thumb, this ratio should exceed 4.0. For the age/income
data, we have

$$\frac{7.9893242 - (-6.1871484)}{\sqrt{4(28.2983)/10}} \approx \frac{14.1765}{3.3644} \approx 4.21$$
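One plausible reading of the ratio above divides the range by the square root of p s²/n (the exact form of the denominator is an assumption made here); with p = 4 parameters, s² = 28.2983, and n = 10 from the text, a pure-Python check:

```python
import math

def box_wetz_ratio(y_max, y_min, p, mse, n):
    """Range divided by sqrt(p * s^2 / n); rule of thumb: exceed 4.0.
    (The denominator's exact form is an assumption -- see lead-in.)"""
    return (y_max - y_min) / math.sqrt(p * mse / n)

# Values quoted in the text for the age/income data:
print(round(box_wetz_ratio(7.9893242, -6.1871484, 4, 28.2983, 10), 2))  # 4.21
```

Under this reading the ratio just clears the 4.0 rule of thumb.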
Obs    BoxWetz4    BoxWetz10
  1     0.90489      0.95965
  2     0.90489      0.95965
  3     0.90489      0.95965
  4     0.90489      0.95965
  5     0.90489      0.95965
  6     0.90489      0.95965
  7     0.90489      0.95965
  8     0.90489      0.95965
  9     0.90489      0.95965
 10     0.90489      0.95965

Root MSE          5.31962    R-Square    0.9245
Dependent Mean   43.30000    Adj R-Sq    0.8867
Coeff Var        12.28549

(Overall model F test: Pr > F = 0.0009)

                         Parameter Estimates

                                          Parameter    Standard
Variable   Label                    DF     Estimate       Error    t Value    Pr > |t|
Intercept  Intercept                 1    -24.56753     9.27983      -2.65      0.0382
age        Age in Years              1      1.00791     0.71229       1.42      0.2068
yrseduc    Years of Post-Secon       1      6.30900     2.78567       2.26      0.0641
yrsinjob   Years in Current Po       1      0.00815     1.05914       0.01      0.9941

 N            Mean       Std Dev        Minimum        Maximum
10    1.110223E-14     4.3434481     -6.1871484      7.9893242
Number of Observations Read    10
Number of Observations Used    10

Number in
  Model      R-Square    Variables in Model
      1        0.6556    age
      2        0.9245    age yrseduc
      3        0.9245    age yrseduc yrsinjob
Note you can use SAS code to exert some control over
the model fitting process:
To use Mallows Cp Statistic as the model selection
criterion:
PROC REG;
MODEL income=age yrseduc yrsinjob/SELECTION=cp;
RUN;
Step 1: Variable age Entered

                      Analysis of Variance

                             Sum of          Mean
Source            DF        Squares        Square    F Value    Pr > F
Model              1     1473.89410    1473.89410      15.23    0.0045
Error              8      774.20590      96.77574
Corrected Total    9     2248.10000

              Parameter     Standard
Variable       Estimate        Error    Type II SS    F Value    Pr > F
Intercept      -5.38425     12.85697      16.97223       0.18    0.6864
age             1.22630      0.31423    1473.89410      15.23    0.0045
Step 2: Variable yrseduc Entered

                      Analysis of Variance

                             Sum of          Mean
Source            DF        Squares        Square    F Value    Pr > F
Model              2     2078.30845    1039.15423      42.84    0.0001
Error              7      169.79155      24.25594
Corrected Total    9     2248.10000

              Parameter     Standard
Variable       Estimate        Error    Type II SS    F Value    Pr > F
Intercept     -24.60236      7.50022     260.98995      10.76    0.0135
age             1.01323      0.16300     937.19626      38.64    0.0004
yrseduc         6.29030      1.26012     604.41435      24.92    0.0016
All variables left in the model are significant at the 0.1000 level.
No other variable met the 0.1000 significance level for entry into the model.

        Variable    Variable       Model
Step    Entered     Removed     R-Square       C(p)    F Value
   1    age                       0.6556    21.3587      15.23
   2    yrseduc                   0.9245     2.0001      24.92
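The C(p) values in the summary can be reproduced from the ANOVA tables using Mallows' formula Cp = SSE_p / s²_full − (n − 2p), where s²_full is the MSE of the full three-variable model (28.29831 here) and p counts the fitted parameters including the intercept. A quick pure-Python check:

```python
def mallows_cp(sse_p, mse_full, n, p):
    """Mallows' Cp = SSE_p / s^2_full - (n - 2p), with p = number of
    fitted parameters (including the intercept)."""
    return sse_p / mse_full - (n - 2 * p)

MSE_FULL = 28.29831  # MSE of the full three-variable model
N = 10               # observations in the age/income data

# Step 1 (age only): SSE = 774.20590, p = 2
print(round(mallows_cp(774.20590, MSE_FULL, N, 2), 4))  # 21.3587
# Step 2 (age, yrseduc): SSE = 169.79155, p = 3
print(round(mallows_cp(169.79155, MSE_FULL, N, 3), 4))  # 2.0001
```

A Cp close to p (as in step 2) indicates a subset model with little bias relative to the full model.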
Step 1: Variable age Entered

                      Analysis of Variance

                             Sum of          Mean
Source            DF        Squares        Square    F Value    Pr > F
Model              1     1473.89410    1473.89410      15.23    0.0045
Error              8      774.20590      96.77574
Corrected Total    9     2248.10000

              Parameter     Standard
Variable       Estimate        Error    Type II SS    F Value    Pr > F
Intercept      -5.38425     12.85697      16.97223       0.18    0.6864
age             1.22630      0.31423    1473.89410      15.23    0.0045
Step 2: Variable yrseduc Entered

                      Analysis of Variance

                             Sum of          Mean
Source            DF        Squares        Square    F Value    Pr > F
Model              2     2078.30845    1039.15423      42.84    0.0001
Error              7      169.79155      24.25594
Corrected Total    9     2248.10000

              Parameter     Standard
Variable       Estimate        Error    Type II SS    F Value    Pr > F
Intercept     -24.60236      7.50022     260.98995      10.76    0.0135
age             1.01323      0.16300     937.19626      38.64    0.0004
yrseduc         6.29030      1.26012     604.41435      24.92    0.0016
Step 3: Variable yrsinjob Entered

                      Analysis of Variance

                             Sum of          Mean
Source            DF        Squares        Square    F Value    Pr > F
Model              3     2078.31013     692.77004      24.48    0.0009
Error              6      169.78987      28.29831
Corrected Total    9     2248.10000

              Parameter     Standard
Variable       Estimate        Error    Type II SS    F Value    Pr > F
Intercept     -24.56753      9.27983     198.33685       7.01    0.0382
age             1.00791      0.71229      56.66155       2.00    0.2068
yrseduc         6.30900      2.78567     145.15177       5.13    0.0641
yrsinjob        0.00815      1.05914       0.00168       0.00    0.9941
Step 4: Variable yrsinjob Removed

                      Analysis of Variance

                             Sum of          Mean
Source            DF        Squares        Square    F Value    Pr > F
Model              2     2078.30845    1039.15423      42.84    0.0001
Error              7      169.79155      24.25594
Corrected Total    9     2248.10000

              Parameter     Standard
Variable       Estimate        Error    Type II SS    F Value    Pr > F
Intercept     -24.60236      7.50022     260.98995      10.76    0.0135
age             1.01323      0.16300     937.19626      38.64    0.0004
yrseduc         6.29030      1.26012     604.41435      24.92    0.0016
All variables left in the model are significant at the 0.1000 level.
The stepwise method terminated because the next variable to be entered
was just removed.

        Variable    Variable
Step    Entered     Removed
   1    age
   2    yrseduc
   3    yrsinjob
   4                yrsinjob
To use forward selection as the model selection method (SLE= sets the
significance level for entry):
PROC REG;
MODEL income=age yrseduc yrsinjob/SELECTION=f SLE=0.15;
RUN;
Minimum R2 Improvement
o Use forward selection to find a two-independent-variable model
o Examine each pair of independent variables not included in the current
  model; choose the pair that generates the minimum increase in R2 over
  the current (bivariate) model
o Continue the forward selection process to look for a third variable to
  add to the new bivariate model (if none exists, the algorithm
  terminates)
o Examine each trio of independent variables not included in the current
  model; choose the trio that generates the minimum increase in R2 over
  the current (trivariate) model
o Continue until the algorithm terminates
Min R2 gives consideration to a greater variety of models than Max R2
does.
$$R^2_{\text{prediction}} = 1 - \frac{\displaystyle\sum_{i=n_{\text{model}}+1}^{n_{\text{model}}+n_{\text{holdout}}} (y_i - \hat{y}_i)^2}{\displaystyle\sum_{i=n_{\text{model}}+1}^{n_{\text{model}}+n_{\text{holdout}}} (y_i - \bar{y}_{\text{holdout}})^2}$$

or

$$MSE_{\text{prediction}} = \frac{\displaystyle\sum_{i=n_{\text{model}}+1}^{n_{\text{model}}+n_{\text{holdout}}} (y_i - \hat{y}_i)^2}{n_{\text{holdout}} - p}$$
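Applying these definitions to the holdout incomes and predictions from the age/income example gives the following (pure Python). Note that the SAS program in these notes computes its rsquare from sample variances (1-verror/vyholdout), which centers the errors around their own mean first, so its 0.85026 differs slightly from the raw sum-of-squares version here.

```python
# Holdout incomes and predictions from the age/income example.
y_hold = [21, 36, 50, 44, 57]
y_pred = [24.9549, 33.1696, 48.1678, 35.2232, 61.1124]
p = 3  # parameters in the fitted model (intercept, age, yrseduc)

ybar = sum(y_hold) / len(y_hold)
sse = sum((y - yh) ** 2 for y, yh in zip(y_hold, y_pred))
sst = sum((y - ybar) ** 2 for y in y_hold)

r2_prediction = 1 - sse / sst
mse_prediction = sse / (len(y_hold) - p)

print(round(r2_prediction, 4))   # 0.8428
print(round(mse_prediction, 2))  # 60.48
```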
DATA salary;
/* merge the summary data set nobs (which carries sampsize) onto
   every observation */
MERGE salary nobs;
BY mergenum;
DROP _TYPE_ _FREQ_ mergenum;
RUN;
PROC SORT DATA=salary;
/* sort on the random number ranhold to randomize the
   analysis/holdout split */
BY ranhold;
RUN;
DATA salary;
SET salary;
/* the first pctanalysis*100% of the randomized observations form the
   analysis sample; the remainder form the holdout sample */
nanalysis = pctanalysis*sampsize;
IF _N_<=nanalysis THEN DO;
analysis = 1;
yanalysis = income;
END;
IF analysis ne 1 THEN DO;
yholdout = income;
END;
RUN;
PROC REG;
MODEL yanalysis=age yrseduc;
OUTPUT OUT=new P=predict;
TITLE "REGRESSION EXECUTED ON MODEL DATA";
RUN;
DATA model;
SET new;
IF analysis ne 1 THEN DELETE;
error=yanalysis-predict;
DROP analysis yholdout;
RUN;
PROC PRINT DATA=model;
VAR name yanalysis predict error age yrseduc yrsinjob;
TITLE "MODEL DATA AND PREDICTED INCOMES";
RUN;
PROC MEANS DATA=model;
VAR error yanalysis;
OUTPUT OUT=mdiagnostics VAR=verror vyanalysis;
TITLE "SUMMARY STATISTICS FOR THE MODEL DATA";
RUN;
DATA mdiagnostics;
SET mdiagnostics;
rsquare=1-verror/vyanalysis;
DROP _TYPE_ _FREQ_;
RUN;
PROC PRINT;
TITLE "MORE SUMMARY STATISTICS FOR THE MODEL DATA";
RUN;
DATA holdout;
SET new;
IF analysis=1 THEN DELETE;
error=yholdout-predict;
DROP analysis yanalysis;
RUN;
PROC PRINT DATA=holdout;
VAR name yholdout predict error age yrseduc yrsinjob;
TITLE "HOLDOUT DATA AND PREDICTED INCOMES";
RUN;
PROC MEANS DATA=holdout;
VAR error yholdout;
OUTPUT OUT=hdiagnostics VAR=verror vyholdout;
TITLE "SUMMARY STATISTICS FOR THE HOLDOUT DATA";
RUN;
DATA hdiagnostics;
SET hdiagnostics;
rsquare=1-verror/vyholdout;
LABEL rsquare="R-SQUARE PREDICTION FOR THE HOLDOUT SAMPLE"
verror="MSE PREDICTION FOR THE HOLDOUT SAMPLE";
DROP _TYPE_ _FREQ_;
RUN;
PROC PRINT;
VAR rsquare verror;
TITLE "MORE SUMMARY STATISTICS FOR THE HOLDOUT DATA";
RUN;
                      Analysis of Variance

                             Sum of          Mean
Source            DF        Squares        Square    F Value    Pr > F
Model              2     1393.65758     696.82879      24.74    0.0389
Error              2       56.34242      28.17121
Corrected Total    4     1450.00000

Root MSE          5.30766    R-Square    0.9611
Dependent Mean   45.00000    Adj R-Sq    0.9223
Coeff Var        11.79479

Number of Observations Read                    10
Number of Observations Used                     5
Number of Observations with Missing Values      5

                    Parameter Estimates

                    Parameter     Standard
Variable      DF     Estimate        Error    t Value    Pr > |t|
Intercept      1    -27.85033     10.90428      -2.55      0.1252
age            1      1.02684      0.22311       4.60      0.0441
yrseduc        1      6.78357      1.50291       4.51      0.0457
name         yanalysis    predict      error
Roediger            31    33.1696    -2.1696
Standifer           65    62.9478     2.0522
Clark               64    61.7349     2.2651
Sanford             42    47.5453    -5.5453
Snyder              23    19.6024     3.3976

Variable      N            Mean        Std Dev        Minimum        Maximum
error         5    4.973799E-15      3.7530793     -5.5452663      3.3975814
yanalysis     5      45.0000000     19.0394328     23.0000000     65.0000000

 verror    vyanalysis    rsquare
14.0856         362.5    0.96114
name       yholdout    predict       error    age    yrseduc    yrsinjob
Jackson          21    24.9549    -3.95486     25          4           2
Lewis            36    33.1696     2.83045     33          4          12
Simon            50    48.1678     1.83219     41          5          12
Bright           44    35.2232     8.77677     35          4           8
Ross             57    61.1124    -4.11240     47          6          11

Variable     N           Mean       Std Dev        Minimum        Maximum
error        5      1.0744305     5.3661170     -4.1123996      8.7767746
yholdout     5     41.6000000    13.8672276     21.0000000     57.0000000

rsquare     verror
0.85026    28.7952
So the model ŷ = -27.85033 + 1.02684(age) + 6.78357(yrseduc), fitted to
the analysis sample, predicts the holdout incomes quite well
(rsquare = 0.85026).
Resampling
jackknife (leave-one-out) cross-validation
(invented by Quenouille and later developed by
Tukey);
o fit the model n times (omitting a different observation each time)
o use the omitted observation to assess the model that was fitted
  without it, using whatever error criterion interests you
o summarize the error criterion over the n subsets
Note that:
v-fold cross-validation was developed as a
compromise between holdout sample and jackknife
cross validation methods
If v = n, then v-fold cross-validation is the jackknife
Evidence suggests v-fold cross-validation is markedly
superior to holdout sample cross-validation for small
data sets (Goutte, 1997, "Note on Free Lunches and
Cross-Validation," Neural Computation, 9, 1211-1215,
ftp://eivind.imm.dtu.dk/dist/1997/goutte.nflcv.ps.gz)
A value of 10 for v is popular for estimating generalization error
A small change in the data can cause a large change in
the model selected under the jackknife/leave-one-out
cross-validation (Breiman, 1996, "Heuristics of
Instability and Stabilization in Model Selection,"
Annals of Statistics, 24, 2350-2383)
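The partitioning logic behind the jackknife and v-fold schemes is just index bookkeeping; a minimal pure-Python sketch (the helper name kfold_splits is hypothetical, and real applications would shuffle the indices first):

```python
def kfold_splits(n, v):
    """Partition indices 0..n-1 into v folds; yield (train, test) pairs.
    With v == n this is the jackknife (leave-one-out)."""
    indices = list(range(n))
    folds = [indices[i::v] for i in range(v)]  # round-robin fold assignment
    for i in range(v):
        test = folds[i]
        test_set = set(test)
        train = [j for j in indices if j not in test_set]
        yield train, test

# Each observation lands in exactly one test fold:
all_test = [j for _, test in kfold_splits(10, 5) for j in test]
print(sorted(all_test))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# v = n reproduces leave-one-out: every test fold holds one observation.
print(all(len(test) == 1 for _, test in kfold_splits(10, 10)))  # True
```

Each model is fitted on the train indices and scored only on its test fold, and the chosen error criterion is then averaged over the v folds.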