
1

Regression Analysis
with SAS
Robert A. Yaffee, Ph.D.
Statistics, Mapping, and Social Science
Group
Academic Computing Services
ITS
NYU
251 Mercer Street
New York, NY 10012
Office: 75 Third Avenue, SB
p. 212.998.3402
Email: yaffee@nyu.edu
2
Outline
1. Conceptualization of Regression Analysis
2. Plotting the data
3. Linear Regression Theory
1. Derivation of the intercept
2. Derivation of the slope
3. Multiple linear Regression
4. Significance tests
5. Assumptions
6. Diagnostic Tests of the assumptions
7. Assessment of robustness of model
4. Interaction model
1. Conceptualization
2. Path diagram
3. Program syntax
5. Note on polynomial regression
6. Model Building Strategies
7. Robust Alternatives
1. WLS
2. White Estimators
3. Median Regression Proc Robustreg
3
Regression Analysis
Have a clear notion of what you can and
cannot do with regression analysis
Conceptualization
A Path Model of a Regression
Analysis

Path Diagram of A Linear Regression
Analysis
[Path diagram: X1, X2, and X3 each point to Y, with an error term also pointing to Y.]

Y_i = k + b_1 X_1 + b_2 X_2 + b_3 X_3 + e_i
4
Hypothesis Testing
For example, hypothesis 1: X is
statistically significantly related to
Y.
The relationship is positive (as X
increases, Y increases) or negative
(as X increases, Y decreases).
The magnitude of the relationship is
small, medium, or large.
If the magnitude is small, then a unit
change in x is associated with a
small change in Y.



5
Plotting the Data: Testing
for Functional form
1. The relationship between the Y
and each X needs to be
examined.
2. One needs to ascertain whether
it is linear or not
3. If it is not linear, can it be
transformed into linearity?
4. If so, then that may be the
easiest recourse.
5. If not, then perhaps linear
regression is not appropriate
6
I. Exploring the functional form of the relationship.
It is necessary to plot the dependent variable against each of the independent
variables.
1. The purpose is to determine the functional form of the relationship. If
the relationship is linear, then the distribution of data points on the graph will
have the approximate appearance of a straight line. There may be some
dispersion about the line. For the most part, the distribution can be
approximated by a straight line.
2. If the distribution has a curved form, the question arises as to which
functional form actually approximates that curve.
3. If it is some sort of curve, it may be possible to transform the relationship
into a straight line. If it can be, then this transformation of the original
variable may be the easiest recourse.
4. To ascertain the best functional form of these individual relationships,
the analyst may wish to run a segmented regression over portions
of each independent variable.
5. The analyst may also test a number of functional forms to see which provides the
best fit, referring to the R² statistic for each functional form and selecting
the one with the maximum.
6. To demonstrate how this procedure works with the data provided, we may show an
example of curve estimation with a polynomial of order three.
The real data are distributed as a cubic.
a. We may plot the data with the low-resolution plot procedure,
PROC PLOT. The command syntax for PROC PLOT follows:
PROC PLOT;
PLOT Y*X;
RUN;
b. If we are using high-resolution graphics with
PROC GPLOT, we have much more control over the appearance
of the output. The command syntax for PROC GPLOT looks
like the following:
PROC GPLOT;
PLOT Y*X;
RUN;
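As an illustration of that extra control, here is a minimal sketch; the SYMBOL and AXIS settings are illustrative choices, not part of the original slides, and the data set plot1 is the one generated on the next page:

/* Illustrative GPLOT: plotting symbol, joined line, and labeled axes */
symbol1 value=dot interpol=join color=black;
axis1 label=('X');
axis2 label=(a=90 'Y');
proc gplot data=plot1;
   plot y*x / haxis=axis1 vaxis=axis2;
run;
quit;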
7
Graphical Exploration
The generation of the data is done with a do loop:
data plot1;
  do x = 1 to 30;
    y = x**3;
    output;
  end;
run;

proc plot;
  plot y*x;
run;
We examine the plot to look for functional form,
outlier patterns, and other peculiarities.
8
Graphical Plot
The output from this plot appears in Figure 1.

Figure 1. A plot of the nonlinear relationship between the dependent
variable Y and the independent variable X.
9
Testing Functional
Form with Curve Fitting
1b. Curve Fitting
1. What is done here is that the data are subjected
to a number of tests of functional form.
Application of the data to such a process
presumes either a linear relationship or a
relationship that can be transformed into a linear
one.
2. From a regression analysis of the
relationship, an R² is generated. This is
the square of the multiple correlation
coefficient between the dependent variable
and the independent variables in the model.
This R² is the proportion of variance of
the dependent variable explained by the
functional form. The higher the R², the
closer the functional form is to the actual
relationship inherent in the data.
3. The program to set up the curve fitting may be
found in figure 3 below.

10
We may fit functional
forms
data curves;
  a = 1;                         /* illustrative constants for the generated curves */
  b = 1.05;
  do x = 1 to 100;
    liny     = a + x;
    quady    = a + x**2;
    cubicy   = a + x**3;
    lnx      = log(x);
    invx     = 1/x;
    expx     = exp(x);
    compound = a*b**x;
    power    = a*x**b;
    sshapex  = exp(a + x**(-1));
    growth   = exp(a + b*x);
    output;
  end;
run;

proc print;
run;

Then run different models against the observed data. For example, with
quadx defined as the squared predictor (quadx = x**2):

proc reg;
  model y = quadx;
run;

Use the transformation that yields the highest R².


11
2. Command syntax for Curve Fitting
The SAS programming syntax to set up the test of the data against
the functional forms may be found in figure 4.
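Since figure 4 is not reproduced here, the following is a minimal sketch of what such a program might look like. The input data set and variable names (mydata, y, x) are assumptions; the OUTEST= data set bbb and the _MODEL_ and _RSQ_ variables match the proc print shown on the next page.

data curves2;                      /* mydata is an assumed data set containing y and x */
   set mydata;
   quadx  = x**2;                  /* candidate transformations of x */
   cubicx = x**3;
   lnx    = log(x);
run;

proc reg data=curves2 outest=bbb rsquare;   /* RSQUARE writes _RSQ_ to the OUTEST= data set */
   linear:    model y = x;
   quadratic: model y = quadx;
   cubic:     model y = cubicx;
   logarithm: model y = lnx;
run;

proc print label data=bbb;
   var _model_ _rsq_;
run;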
12
Curve Fitting Output &
Interpretation
Output and interpretation
The statements

proc print label data=bbb;
  var _model_ _p_ _rsq_;
run;

produce the output in figure 5 on the next page.

While all three forms have significant components, it can be seen that the
R² approaches 1.00 only with the cubic and power functional forms. The
regression coefficient for this curve equals 1.00, and we have identified the
functional form of the regression analysis. The cubic functional form is
the third power of the function, so it is not surprising that the coefficient
of determination (R²) for that will be 1.00 also.
The nice aspect of curve fitting is that it provides a clear, objective
criterion against which to assess functional form.
13
14
With the aid of this tool, one may obtain a better idea as to which
transformation to apply in order to render a nonlinear relationship
amenable to linear regression, in the event of apparent nonlinearity.
Linear Regression Analysis
A. The General Linear Model
B. Derivation of the SS and ANOVA
C. Derivation of the coefficients: a and b
  1. Bivariate Case
  2. Multivariate Case: a, b1 and b2
D. The significance tests
E. The Prediction interval and its derivation
F. The Assumptions of Regression Analysis
G. Testing those assumptions
H. Robustness of the Model in the face of their violation
I. Fixes:
  1. Weighted Least Squares estimation for heteroskedasticity
  2. Autoregression for autocorrelation
  3. Nonlinear regression for nonlinearity
15
Fixes-continued
4. Nonparametric regression for bivariate cases where the assumptions don't
hold
5. Logistic or Probit regression for dichotomous dependent variables
6. Poisson Regression for Count or Rare event data
16
The General Linear
Model
A. The General Linear Model
The General Linear Model includes regression, ANOVA, and ANCOVA. The main
difference is the level of measurement of the independent variables.
1. Regression: IVs are continuous
2. ANOVA: IVs are discrete
3. ANCOVA: IVs are both discrete and continuous
B. Decomposition of the Sums of Squares
If we look at the graph of the regression line, we may geometrically
represent the error, the regression effect, and the total effect.
The graphical representation below depicts the sums of squares decomposition of
the equation:
SS total = SS regression + SS error
If we divide each sum of squares by its respective df, we obtain the mean squares (MS):
SS total/(n - 1), SS regression/k, SS error/(n - k - 1)
where n = sample size and k = # of independent vars in the model.
Since an MS is a variance, these are the total variance, the regression
variance, and the error variance. Dividing the SS decomposition through by the
total SS shows that the proportions of variance explained by the regression
and by the error sum to 1.
F = Regression Variance
---------------------------
Error Variance
17
Decomposition of Effects

[Figure: scatter of Y on X with the fitted regression line y-hat = a + b x,
showing for one case the three vertical distances labeled below.]

y_i - \hat{y}_i = error
\hat{y}_i - \bar{y} = regression effect
y_i - \bar{y} = total effect
18
Decomposition of the
sum of squares

(Y_i - \bar{Y}) = (Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y})        per case i
total effect = error effect + regression (model) effect

(Y_i - \bar{Y})^2 = [(Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y})]^2        per case i

Summing over cases (the cross-product term sums to zero), for the data set:

\sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 + \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2
19
Decomposition of the sum
of squares
Total SS = Model SS + Error SS

and if we divide each sum of squares by its degrees of freedom, we obtain
the variance decomposition: the total variance, the model variance, and the
error variance:

\frac{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}{n-1}, \quad
\frac{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2}{k}, \quad
\frac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{n-k-1}
20
F test for significance and
R² for magnitude of effect

R² = Model variance / Total variance

F test for model significance = Model Variance / Error Variance:

F_{(k,\,n-k-1)} = \frac{R^2/k}{(1-R^2)/(n-k-1)}
               = \frac{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 / k}{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 / (n-k-1)}
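As a quick worked illustration with made-up numbers: if R² = 0.50 with k = 2 predictors and n = 23 cases, then F = (0.50/2) / (0.50/20) = 0.25/0.025 = 10 on (2, 20) degrees of freedom.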
21
ANOVA tests the significance of the
Regression Model
22
Derivation of the Intercept
y_i = a + b x_i + e_i

e_i = y_i - a - b x_i

\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} y_i - n a - b \sum_{i=1}^{n} x_i

Because by definition \sum_{i=1}^{n} e_i = 0:

0 = \sum_{i=1}^{n} y_i - n a - b \sum_{i=1}^{n} x_i

n a = \sum_{i=1}^{n} y_i - b \sum_{i=1}^{n} x_i

a = \bar{y} - b \bar{x}
23
Derivation of the
Regression Coefficient
Given y_i = a + b x_i + e_i:

e_i = y_i - a - b x_i

\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2

Minimizing, set the derivative with respect to b equal to zero:

\frac{\partial \sum e_i^2}{\partial b} = -2 \sum_{i=1}^{n} x_i (y_i - a - b x_i) = 0

\sum_{i=1}^{n} x_i y_i = a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^2

With x_i and y_i expressed as deviations from their means, \sum x_i = 0, so

\sum_{i=1}^{n} x_i y_i = b \sum_{i=1}^{n} x_i^2

b = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}
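As a small made-up numerical illustration: suppose, in deviation form, Σ x_i y_i = 30 and Σ x_i² = 20, with means x̄ = 4 and ȳ = 10. Then b = 30/20 = 1.5 and a = ȳ − b x̄ = 10 − 1.5(4) = 4, so the fitted line is ŷ = 4 + 1.5x.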
24
The Multiple
Regression Equation
We proceed to the derivation of its
components:
The intercept: a
The regression parameters, b1 and b2


Y_i = a + b_1 x_{1i} + b_2 x_{2i} + e_i
25
The Multiple
Regression Formula




If we recall that the formula for
the correlation coefficient can
be expressed as follows:
26
r = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2 \; \sum_{i=1}^{n} y_i^2}}

where x_i = X_i - \bar{X} and y_i = Y_i - \bar{Y},

and that the bivariate regression coefficient is

b_j = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2},

then it can be seen that the regression coefficient b is a function of r:

b_j = r \, \frac{sd_y}{sd_x}
27
Substituting the partial correlations for r gives the two-predictor
regression coefficients:

b_{yx_1 \cdot x_2} = \frac{r_{yx_1} - r_{yx_2} r_{x_1 x_2}}{1 - r_{x_1 x_2}^2} \, \frac{sd_y}{sd_{x_1}}        (6)

b_{yx_2 \cdot x_1} = \frac{r_{yx_2} - r_{yx_1} r_{x_1 x_2}}{1 - r_{x_1 x_2}^2} \, \frac{sd_y}{sd_{x_2}}        (7)

It is also easy to extend the bivariate intercept to the multivariate case as follows:

a = \bar{Y} - b_1 \bar{x}_1 - b_2 \bar{x}_2        (8)
28
Significance Tests for the
Regression Coefficients
1. We find the significance of the
parameter estimates by using the
F or t test.
2. The R² is the proportion of
variance explained.
3. Adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1)
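For instance, with made-up values R² = 0.60, n = 30, and p = 3 predictors: Adjusted R² = 1 − (0.40)(29/26) ≈ 0.55, slightly below the unadjusted R².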

29
F and t tests for
significance of the overall
model

F = \frac{\text{Model variance}}{\text{Error variance}}
  = \frac{R^2 / p}{(1 - R^2)/(n - p - 1)}

where
p = number of parameters
n = sample size

t = \sqrt{F}

t = \frac{\sqrt{n - 2}\; r}{\sqrt{1 - r^2}}
30
Significance tests
If we are using a Type II sum of
squares, we are dealing with
the Ballantine (the Venn-diagram
representation of shared variance):
DV variance explained = a + b
31
Significance tests
T tests for statistical significance
t = \frac{a - 0}{se_a}

t = \frac{b - 0}{se_b}
32
Significance tests

Standard error of the intercept:

SE_a = \sqrt{ \frac{\sum_i (Y_i - \hat{Y}_i)^2}{n - 2} \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2} \right] }

Standard error of the regression coefficient:

SE_b = \frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}

where \hat{\sigma} = std dev of the residuals = \sqrt{ \frac{\sum_i e_i^2}{n - 2} }
33
SAS Regression
Command Syntax
34
SAS Regression syntax
proc reg simple data=regdat;
  model y = x1 x2 x3 / spec dw corrb collin r dffits influence;
  output out=resdat p=pred r=resid rstudent=rstud;
run;

data check;
  set resdat;
run;

proc univariate normal plot;
  var resid;
  title 'Check of Normality of Residuals';
run;

proc freq data=resdat;
  tables rstud;
  title 'Check for Outliers';
run;

proc arima data=resdat;
  identify var=resid;
  title 'Check of Autocorrelation of the Errors';
run;

35
SAS Regression Output
& Interpretation
36
More Simple Statistics
37
Omnibus ANOVA
Statistics
38
Parameter Estimates
39
Assumptions of the Linear
Regression Model
1. Linear Functional form
2. Fixed independent variables
3. Independent observations
4. Representative sample and proper
specification of the model (no
omitted variables)
5. Normality of the residuals or errors
6. Equality of variance of the errors
(homogeneity of residual variance)
7. No multicollinearity
8. No autocorrelation of the errors
9. No outlier distortion
40
Explanation of the
Assumptions
1. Linear functional form
  1. Does not detect curvilinear relationships
2. Independent observations
  1. Representative samples
  2. Autocorrelation inflates the t, r, and F statistics and
     warps the significance tests
3. Normality of the residuals
  1. Permits proper significance testing
4. Equality of variance
  1. Heteroskedasticity precludes generalization and
     external validity
  2. This also warps the significance tests
5. Multicollinearity prevents proper parameter
   estimation. It may also preclude computation of the
   parameter estimates completely if it is serious enough.
6. Outlier distortion may bias the results: if outliers
   have high influence and the sample is not large
   enough, then they may seriously bias the parameter
   estimates.
41
Diagnostic Tests for the
Regression Assumptions
1. Linearity tests: Regression curve fitting
  1. No level shifts: One regime
2. Independence of observations: Runs test
3. Normality of the residuals: Shapiro-Wilk or
   Kolmogorov-Smirnov test
4. Homogeneity of variance of the residuals: White's
   General Specification test
5. No autocorrelation of residuals: Durbin-Watson or
   ACF or PACF of residuals
6. Multicollinearity: Correlation matrix of the independent
   variables; condition index or condition number
7. No serious outlier influence: tests of additive outliers:
   pulse dummies.
  1. Plot residuals and look for high leverage of residuals
  2. Lists of standardized residuals
  3. Lists of studentized residuals
  4. Cook's distance or leverage statistics
42
Explanation of
Diagnostics
1. Plots show linearity or
nonlinearity of relationship
2. Correlation matrix shows
whether the independent
variables are collinear and
correlated.
3. Representative sample is done
with probability sampling


43
Explanation of
Diagnostics
Tests for normality of the residuals.
The residuals are saved and then
subjected to either of:
Kolmogorov-Smirnov test: compares the
empirical cumulative distribution of the
residuals against the theoretical
cumulative normal distribution.
Shapiro-Wilk test

proc reg;
  model y = x1 x2;
  output out=resdat r=resid p=pred;
run;

data check;
  set resdat;
run;

proc univariate normal plot;
  var resid;
  title 'Test of Normality of Residuals';
run;
44
Test for Homogeneity of
variance
White's General Test

proc reg;
  model y = x1 x2 / spec;
run;

The SPEC option tests whether the squared residuals can be explained by
the regressors, their squares, and their cross-products:

e_i^2 = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1^2 + b_4 x_2^2 + b_5 x_1 x_2

Chow test

F_{chow} = \frac{(R_{ur}^2 - R_r^2)/(p_{ur} - p_r)}{(1 - R_{ur}^2)/(n - p_{ur})}
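A minimal sketch of running that auxiliary regression by hand follows; the data set and variable names (resdat, resid, x1, x2) follow the earlier OUTPUT example, and this is an illustration rather than the slides' own code:

/* Square the saved residuals and build the squares and cross-product */
data white;
  set resdat;               /* resdat assumed to hold resid, x1, x2 */
  esq  = resid**2;
  x1sq = x1*x1;
  x2sq = x2*x2;
  x1x2 = x1*x2;
run;

/* Regress the squared residuals on the terms of White's auxiliary model; */
/* n times the R-square of this regression is the White test statistic.  */
proc reg data=white;
  model esq = x1 x2 x1sq x2sq x1x2;
run;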
45
Weighted Least
squares
A fix for heteroskedasticity is
Weighted least squares or WLS
There are two ways to do this.
If the error variance is proportional
to the square of the level of y, form
the weight as the reciprocal of that
variance, e.g., wt = 1/y**2;

proc reg;
  weight wt;
  model y = x;
run;
46
Collinearity Diagnostics
Tolerance = 1 - R²  (the R² from regressing that predictor on the
other independent variables); small tolerances imply problems

Variance Inflation Factor (VIF) = 1 / Tolerance

Small intercorrelations among the independent variables mean VIF ≈ 1

VIF > 10 signifies problems
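In PROC REG these diagnostics can be requested directly on the MODEL statement; a minimal sketch using the data set and variable names from the earlier regression syntax:

proc reg data=regdat;
   model y = x1 x2 x3 / vif tol collin;   /* VIF, tolerance, and condition indices */
run;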
47
More Collinearity
Diagnostics
Watch for eigenvalues much
greater than 1.
Condition number = maximum
eigenvalue / minimum
eigenvalue.
If condition numbers are between
100 and 1000, there is moderate
to strong collinearity.

condition index = \sqrt{k}, where k = condition number

If the condition index > 30, then there is strong collinearity.
48
Collinearity Diagnostics
49
Outlier Diagnostics
1. Residuals
  1. The actual value minus the predicted
     value. This is otherwise known as the
     error.
2. Studentized residuals
  1. The residuals divided by their
     standard errors computed without the
     ith observation.
3. Leverage, called the hat diagonal
  1. This is the measure of influence of
     each observation.
4. Cook's Distance
  1. The change in the statistics that
     results from deleting the observation.
     Watch this if it is much greater than
     1.0.

50
Outlier detection
Outlier detection involves determining
whether the residual
(error = actual - predicted) is an
extreme negative or positive value.
We may plot the residuals versus
the fitted values to determine which
errors are large, after running the
regression, for example with a
PROC GPLOT of resid*pred from the
output data set shown earlier.

51
Create Standardized
Residuals
A standardized residual is one
divided by its standard deviation.

resid_{standardized} = \frac{y_i - \hat{y}_i}{s}

where s = std dev of the residuals
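A minimal sketch of forming standardized residuals in SAS from the residuals saved earlier; the data set and variable names follow the earlier OUTPUT example, and the names sdout, sresid, and zresid are illustrative:

proc means data=resdat noprint;        /* standard deviation of the residuals */
   var resid;
   output out=sdout std=sresid;
run;

data standard;
   if _n_ = 1 then set sdout;          /* bring in sresid once; retained for all rows */
   set resdat;
   zresid = resid / sresid;            /* residual divided by its standard deviation */
run;

proc print data=standard;
   var resid zresid;
run;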
52
Limits of Standardized
Residuals
If the standardized residuals
have values beyond +3.5
or -3.5, they are outliers.
If the absolute values are all less
than 3.5, then
there are no outliers.
While outliers by themselves
only distort mean prediction
when the sample size is small
enough, it is important to
gauge the influence of outliers.
53
Outlier Influence
Suppose we had a different
data set with two outliers.
We tabulate the standardized
residuals and obtain the
following output:

54
Outlier a does not distort
and outlier b does.
55
Studentized Residuals
Alternatively, we could form
studentized residuals. These are
distributed as a t distribution with
df=n-p-1, though they are not
quite independent. Therefore, we
can approximately determine if
they are statistically significant or
not.
Belsley et al. (1980)
recommended the use of
studentized residuals.

56
Studentized Residual
e_i^{s} = \frac{e_i}{s_{(i)} \sqrt{1 - h_i}}

where
e_i^{s} = studentized residual
s_{(i)} = standard deviation where the ith obs is deleted
h_i = leverage statistic
These are useful in estimating the statistical significance
of a particular observation, for which a dummy variable
indicator is formed. The t value of the studentized residual
will indicate whether or not that observation is a significant
outlier.
In SAS, studentized (deleted) residuals are requested with the
RSTUDENT= keyword on the OUTPUT statement of PROC REG, e.g.:
output out=resdat rstudent=rstud;
57
Influence of Outliers
1. Leverage is measured by the
diagonal components of the hat
matrix.
2. The hat matrix comes from the
formula for the regression of Y.

\hat{Y} = X(X'X)^{-1}X'Y

where X(X'X)^{-1}X' = the hat matrix, H.

Therefore,

\hat{Y} = HY
58
Leverage and the Hat
matrix
1. The hat matrix transforms Y into the
predicted scores.
2. The diagonals of the hat matrix indicate
which values will be outliers or not.
3. The diagonals are therefore measures of
leverage.
4. Leverage is bounded by two limits: 1/n and
1. The closer the leverage is to unity, the
more leverage the value has.
5. The trace of the hat matrix = the number of
variables in the model.
6. When the leverage > 2p/n, then there is high
leverage according to Belsley et al. (1980),
cited in Long, J.F., Modern Methods of
Data Analysis (p. 262). For smaller samples,
Velleman and Welsch (1981) suggested that
3p/n is the criterion.
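A minimal sketch of saving the hat diagonals and flagging cases by the 2p/n rule; the H= keyword on the OUTPUT statement saves the leverage, and the p and n values shown are placeholders:

proc reg data=regdat;
   model y = x1 x2 x3;
   output out=levout h=lev;            /* h= saves the hat diagonal (leverage) */
run;

data highlev;                          /* keep cases with leverage above 2p/n */
   set levout;
   if lev > 2*4/100;                   /* placeholder values: p = 4 parameters, n = 100 */
run;

proc print data=highlev;
run;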
59
Cooks D
1. Another measure of influence.
2. This is a popular one. The
formula for it is:

Cook's D_i = \frac{h_i \, e_i^2}{p \, s^2 (1 - h_i)^2}

Cook and Weisberg (1982) suggested that values of
D that exceed the 50th percentile of the F distribution (df = p, n - p)
are large.
60
Using Cooks D in SAS
Cook's D is requested with the /R option on the MODEL statement
(or saved with COOKD= on the OUTPUT statement).
Finding the influential outliers: list the cases where Cook's D > 4/n.
Belsley suggests 4/(n-k-1) as a cutoff.
61
Graphical Exploration of
Outlier Influence
The two influential outliers can be found easily here
in the upper right.
One can plot the leverage against the standardized
residual to see if the outlier is problematic
62
DFbeta
One can use the DFbetas to
ascertain the magnitude of
influence that an observation has
on a particular parameter estimate
if that observation is deleted.

DFbeta_{j(i)} = \frac{b_j - b_{j(i)}}{\sqrt{\sum u_j^2}\,\sqrt{1 - h_i}}

where u_j = residuals of the
regression of x_j on the remaining x's.
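In PROC REG the DFBETAS (along with DFFITS and the hat diagonals) are printed by the INFLUENCE option that appeared in the earlier regression syntax; a minimal sketch:

proc reg data=regdat;
   model y = x1 x2 x3 / influence;     /* prints hat diagonals, DFFITS, and DFBETAS */
run;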
63
Alternatives to Violations
of Assumptions
1. Nonlinearity: Transform to
linearity if there is nonlinearity, or
run a nonlinear regression.
2. Nonnormality: Run a least
absolute deviations regression or a
median regression.
3. Heteroskedasticity: weighted
least squares regression or the White
estimator. One can use PROC
ROBUSTREG to down-weight the
effect of outliers in the estimation.
4. Autocorrelation: Newey-West
estimators or an autoregression model.
5. Multicollinearity: principal components
regression or ridge regression or
proxy variables.

64
The Interaction model
The product vector x1*x2 is an
interaction term.
This is the joint effect over and above
the main effects (x1 and x2).
The main effects must be in the model
for the interaction to be properly
specified, regardless of whether the
main effects remain statistically
significant.
Y_i = a + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2 + e_i
65
Path Diagram of an
Interaction model
Interaction Analysis

[Path diagram: X1 -> Y (path A), X2 -> Y (path B), X1*X2 -> Y (path C),
with an error term also pointing to Y.]

Y = K + A*X1 + B*X2 + C*X1*X2 + error
66
SAS Regression
Interaction Model Syntax
data one;
  input y x1 x2 x3 x4;
  lincrossprodx1x2=x1*x2;
  x1sq=x1*x1;
  time+1;
datalines;
112 113 114 39 10
322 230 310 43 23
323 340 250 33 33
112 122 125 144 45
99 100 89 55 34
14 13 10 249 40
40 34 98 39 30
30 32 34 40 40
90 80 93 50 50
89 90 91 60 44
120 130 43 100 34
444 432 430 20 44
;
run;
proc print;
run;
proc reg;
  model y= x1 x2 lincrossprodx1x2 / r collin stb spec influence;
  output out=resdat r=resid p=pred stdr=rstd student=rstud cookd=cooksd;
run;
67
SAS Regression
Diagnostic syntax
data outck;
set resdat;
degfree=7;
if cooksd > 4/7; /* if cd > p/(n-p-1)
where p = # parms, n=sample size */
proc freq; tables rstud;
title 'Outlier Indicator';
run;
axis1 label=(a=90 'Cooks D Influence Stat');
proc gplot data=resdat;
plot cooksd * rstud/vaxis=axis1;
title 'Leverage and the Outlier';
run;

68
Regression Model test of
linear interaction
69
SAS Polynomial
Regression Syntax
proc reg data=one;
model y= x1 x1sq;
title 'The Polynomial Regression';
run;


70
Model Building Strategies
Specific to General: Cohen
and Cohen
General to Specific: Hendry
and Richard
Extreme Bounds Analysis: E.
Leamer
F. Harrell, Jr.'s approach:
Regression Modeling
Strategies (Springer, 2001):
loess and splines for model fitting, data
reduction, missing-data imputation,
validation, and simplification.
71
Goodness of Fit indices
R² = 1 - SSE/SST
(explained variance divided by
total variance)
AIC = -2LL + 2p
SC = -2LL + p*log(n)

Nested models can be
compared on the basis of
these criteria to determine
which model is best.
72
Robust Regression
1. This procedure permits 4 types of
robust regression: M, least
trimmed squares (LTS), S, and MM
estimation.
2. These methods down-weight the
outliers.
3. M estimation (as used by Huber, with
the median absolute deviation as the
scale estimate) is performed with:
4. proc robustreg data=stack;
     model y = x1 x2 x3 / diagnostics
     leverage; id x1; test x3; run;
5. Estimation is done by IRLS (iteratively
reweighted least squares).
