Multiple Linear Regression and the General Linear Model
Outline
1. Introduction to Multiple Linear Regression
2. Statistical Inference
3. Topics in Regression Modeling
4. Example
5. Variable Selection Methods
6. Regression Diagnostics and Strategy for Building a Model
1. Introduction to Multiple Linear Regression
Multiple Linear Regression
• Regression analysis is a statistical methodology for estimating the relationship of a response variable to a set of predictor variables
• Multiple linear regression extends the simple linear regression model to the case of two or more predictor variables
Example:
A multiple regression analysis might show us that the demand for a product varies directly with changes in the demographic characteristics (age, income) of a market area.
Historical Background
• Francis Galton started using the term regression in his biology research
• Karl Pearson and Udny Yule extended Galton’s work to the statistical
context
• Legendre and Gauss developed the method of least squares used in
regression analysis
• Ronald Fisher developed the maximum likelihood method used in the related statistical inference (tests of the significance of regression, etc.)
History
• Carl Friedrich Gauss (1777/4/30 – 1855/4/23) developed the fundamentals of least-squares analysis in 1795, at the age of eighteen. He published Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientum in 1809, and in 1821 a further development, Theoria combinationis observationum erroribus minimis obnoxiae, which includes the Gauss–Markov theorem.
• Adrien-Marie Legendre (1752/9/18 – 1833/1/10) published in 1805 an article named Nouvelles méthodes pour la détermination des orbites des comètes, in which he introduced the Method of Least Squares to the world.
• Francis Galton (1822/2/16 – 1911/1/17) found in his research that tall parents usually have shorter children, and vice versa: human height has the tendency to regress to its mean. He was the first person to use the word "regression" for this phenomenon.
• Karl Pearson, regarded as the founder of Biostatistics, published the first article on the earliest form of regression. Pearson and Ronald Aylmer Fisher both developed regression theory after Galton.
The random errors ε_i are assumed to be independent r.v.'s with mean 0 and variance σ².
Thus the Y_i are independent r.v.'s with mean μ_i and variance σ², where
  μ_i = E(Y_i) = β_0 + β_1 x_i1 + … + β_k x_ik
Fitting the model
• The least squares (LS) method is used to fit the model
  Y_i = β_0 + β_1 x_i1 + … + β_k x_ik + ε_i,  i = 1, 2, …, n
• Specifically, LS provides estimates of the unknown model parameters by minimizing the sum of squared differences between the observed values y_i and the corresponding points on the fitted surface with the same x's:
  Q = Σ_{i=1}^{n} [y_i − (β_0 + β_1 x_i1 + β_2 x_i2 + … + β_k x_ik)]²
• The resulting solutions β̂_0, β̂_1, …, β̂_k are the least squares (LS) estimators of β_0, β_1, …, β_k, respectively.
The multiple correlation coefficient:
  R = +√R²
• Only the positive square root is used.
• R is a measure of the strength of the association between the predictors (x's) and the one response variable Y.
Multiple Regression Model in Matrix Notation
The multiple regression model can be represented in a compact form using matrix notation.
Let:
  Y = (Y_1, Y_2, …, Y_n)′,  y = (y_1, y_2, …, y_n)′,  ε = (ε_1, ε_2, …, ε_n)′
be the n × 1 vectors of the r.v.'s Y_i, their observed values y_i, and the random errors ε_i, respectively, for all n observations.
Let:
  X = [ 1  x_11 … x_1k
        1  x_21 … x_2k
        ⋮    ⋮        ⋮
        1  x_n1 … x_nk ]
be the n × (k + 1) matrix of the values of the predictor variables for all n observations (the first column corresponds to the constant term β_0).
Let:
  β = (β_0, β_1, …, β_k)′ and β̂ = (β̂_0, β̂_1, …, β̂_k)′
be the (k + 1) × 1 vectors of unknown model parameters and their LS estimates, respectively.
The model can then be written compactly as:
  Y = Xβ + ε
• The simultaneous linear equations whose solution yields the LS estimates:
  X′X β̂ = X′y
• If the inverse of the matrix X′X exists, then the solution is given by:
  β̂ = (X′X)⁻¹ X′y
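The matrix solution above can be sketched in a few lines of NumPy. This is a minimal illustration, not from the slides: the data are simulated, and the normal equations X′X β̂ = X′y are solved directly.

```python
import numpy as np

# Simulated data, purely for illustration: n = 50 observations, k = 2 predictors.
rng = np.random.default_rng(0)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # first column: constant term
beta_true = np.array([1.0, 2.0, -3.0])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Solve the normal equations X'X beta_hat = X'y.
# np.linalg.solve avoids forming the explicit inverse of X'X.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With this little noise, `beta_hat` lands very close to the true coefficient vector (1, 2, −3).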
2. Statistical Inference
Determining the statistical significance of the predictor variables:
• For statistical inferences, we need the assumption that the ε_i are i.i.d. N(0, σ²) (i.i.d. means independent & identically distributed).
• Under this assumption, each LS estimator β̂_j is normally distributed with mean β_j and variance σ²·v_jj, where v_jj is the jth diagonal entry of the matrix V = (X′X)⁻¹.
Deriving a pivotal quantity for the inference on β_j
• Recall that β̂_j ~ N(β_j, σ² v_jj).
• The unbiased estimator of the unknown error variance σ² is given by
  S² = SSE / [n − (k + 1)] = Σ e_i² / [n − (k + 1)] = MSE, with n − (k + 1) d.o.f.
• We also know that W = [n − (k + 1)] S² / σ² = SSE / σ² ~ χ²_{n−(k+1)}, and that W is independent of β̂_j.
• The resulting pivotal quantity is T = (β̂_j − β_j) / SE(β̂_j) ~ t_{n−(k+1)}, which yields the (1 − α) confidence interval:
  β̂_j ± t_{α/2, n−(k+1)} SE(β̂_j),  where SE(β̂_j) = s√v_jj
Derivation of the Hypothesis Test for β_j at the significance level α
Hypotheses:
  H_0: β_j = 0
  H_a: β_j ≠ 0
Test statistic: T = β̂_j / (S√v_jj) ~ t_{n−(k+1)} under H_0.
Now consider the overall test:
  H_0: β_j = 0 for all j, 1 ≤ j ≤ k
  H_a: β_j ≠ 0 for at least one j, 1 ≤ j ≤ k
When H_0 is true, the test statistic F_0 = MSR / MSE ~ f_{k, n−(k+1)}.
• An alternative and equivalent way to make a decision for a statistical test is through the p-value, defined as the probability, computed assuming H_0 is true, of obtaining a test statistic at least as extreme as the one observed.
• For the inference on the mean response μ* = x*′β at a given point x*, the pivotal quantity is
  T = (Ŷ* − μ*) / (s√(x*′V x*)) ~ t_{n−(k+1)}
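The t- and F-tests above can be sketched numerically. This is a hedged illustration on simulated data (the coefficients and sample size are made up, not from the slides): each β̂_j gets a t statistic β̂_j / SE(β̂_j), and the overall significance of the regression is judged by F_0 = MSR/MSE.

```python
import numpy as np
from scipy import stats

# Simulated data: intercept 0.5, one strong predictor (1.5), one null predictor (0.0).
rng = np.random.default_rng(1)
n, k = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([0.5, 1.5, 0.0]) + rng.normal(size=n)

V = np.linalg.inv(X.T @ X)                  # V = (X'X)^{-1}
beta_hat = V @ X.T @ y
resid = y - X @ beta_hat
dof = n - (k + 1)
s2 = resid @ resid / dof                    # S^2 = MSE = SSE / (n - (k+1))
se = np.sqrt(s2 * np.diag(V))               # SE(beta_hat_j) = s * sqrt(v_jj)
t_stats = beta_hat / se                     # t statistic for H0: beta_j = 0
p_values = 2 * stats.t.sf(np.abs(t_stats), dof)

# Overall F-test: F0 = MSR / MSE ~ f_{k, n-(k+1)} under H0
sst = np.sum((y - y.mean()) ** 2)
msr = (sst - resid @ resid) / k
F0 = msr / s2
p_F = stats.f.sf(F0, k, dof)
```

Here the p-value for the strong predictor and for the overall F-test come out tiny, while the null predictor's p-value is typically large.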
3.1 Multicollinearity
How does multicollinearity cause difficulties?
1. β̂ is the solution to the equation X′X β̂ = X′y; thus X′X must be invertible in order for β̂ to be unique and computable.
2. When the predictors are nearly collinear, the matrix V = (X′X)⁻¹ has very large elements. Therefore Var(β̂_j) = σ² v_jj is inflated.
Measures of Multicollinearity
1. The correlation matrix R.
   Easy, but it can't reflect linear relationships among more than two variables.
2. The determinant of R can be used as a measure of the singularity of X′X.
3. Variance Inflation Factors (VIF): the diagonal elements of R⁻¹. VIF > 10 is regarded as unacceptable.
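Measure 3 can be sketched directly: the VIFs are the diagonal of the inverse of the predictors' correlation matrix. The data below are simulated for illustration, with x1 and x2 nearly collinear and x3 unrelated.

```python
import numpy as np

# Simulated predictors: x2 is x1 plus a little noise, x3 is independent.
rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)                    # unrelated predictor

# VIFs = diagonal elements of R^{-1}, R = correlation matrix of the predictors
R = np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False)
vif = np.diag(np.linalg.inv(R))
```

Here the VIFs of x1 and x2 blow up far past the VIF > 10 threshold, while the VIF of x3 stays near 1.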
3.2 Polynomial Regression
A special case of a linear model:
  y = β_0 + β_1 x + … + β_k x^k + ε
Problems:
1. The powers of x, i.e., x, x², …, x^k, tend to be highly correlated.
2. If k is large, the magnitudes of these powers tend to vary over a rather wide range.
So, keeping k ≤ 3 is a good idea, and almost never use k > 5.
Solutions
1. Centering the x-variable:
  y = β*_0 + β*_1 (x − x̄) + … + β*_k (x − x̄)^k + ε
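The effect of centering can be checked numerically. A minimal sketch on simulated data (the range 10–12 is an arbitrary choice to put x far from zero): x and x² are nearly perfectly correlated before centering, and nearly uncorrelated after.

```python
import numpy as np

# x far from zero: x and x^2 are then nearly collinear.
rng = np.random.default_rng(3)
x = rng.uniform(10, 12, size=200)
corr_raw = np.corrcoef(x, x ** 2)[0, 1]

# Centering the x-variable removes most of that correlation.
xc = x - x.mean()
corr_centered = np.corrcoef(xc, xc ** 2)[0, 1]
```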
Why don’t we just use c indicator variables:
x1 , x2 ,..., xc ?
Example of the dummy variables
For instance, suppose we have four years of quarterly sales data for a certain brand of soda. How can we model the time trend by fitting a multiple regression equation?
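One hypothetical coding for this example (the trend term and baseline choice are illustrative assumptions, not from the slides): the c = 4 quarters are coded with c − 1 = 3 indicator variables, with quarter 4 as the baseline, plus a linear time trend over the 16 quarters.

```python
import numpy as np

quarters = np.tile([1, 2, 3, 4], 4)        # four years of quarterly data
t = np.arange(1, 17)                       # time index 1..16
d1 = (quarters == 1).astype(float)         # indicator for quarter 1
d2 = (quarters == 2).astype(float)         # indicator for quarter 2
d3 = (quarters == 3).astype(float)         # indicator for quarter 3
X = np.column_stack([np.ones(16), t, d1, d2, d3])   # 16 x 5 design matrix
```

Using all c = 4 indicators together with the constant column would make the columns of X linearly dependent (the four indicators sum to the intercept column), so X′X would be singular; that is why only c − 1 indicators are used.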
3. Once the dummy variables are included, the
resulting regression model is referred to as a
“General Linear Model”.
Y = height of child
x1 = height of father
x2 = height of mother
x3 = gender of child
Example
In matrix notation:
  Y = Xβ + ε
  β̂ = (X′X)⁻¹ X′y
Example
The first rows of the design matrix and response (constant term, father's height, mother's height, gender, child's height):

  1  Father  Mother  Gender | Child
  1  78.5    67      1      | 73.2
  1  78.5    67      1      | 69.2
  1  78.5    67      0      | 69
  1  78.5    67      0      | 69
  1  75      64      1      | 71
  1  75      64      0      | 68
  …  …       …       …      | …
Example
Important calculations:
  SSE = Σ (y_i − ŷ_i)² = 4149.162
  SST = Σ (y_i − ȳ)² = 11,515.060
  ŷ_i = X_i β̂ is the predicted height of each child given a set of predictor variables
  r² = 1 − SSE/SST = 0.6397
  MSE = SSE / [n − (k + 1)] = 4.64
Example
Are these values significantly different from zero?
  H_0: β_j = 0
  H_a: β_j ≠ 0
Reject H_0 if
  |t_j| = |β̂_j / SE(β̂_j)| > t_{n−(k+1), α/2} = t_{894, 0.025} = 2.245

  V = (X′X)⁻¹ =
    [ 1.63     0.0118     0.0125     0.00596
      0.0118   0.000184   0.0000143  0.0000223
      0.0125   0.0000143  0.000211   0.0000327
      0.00596  0.0000223  0.0000327  0.00447  ]

  SE(β̂_j) = √(MSE · v_jj)
Example
[Selected SAS output: β-estimates, standard errors, and t-values; the Analysis of Variance table (DF, Sum of Squares, Mean Square, F Value, Pr > F); and the Parameter Estimates table (DF, Estimate, Standard Error, t Value, Pr > |t|). The variance inflation factors are:]

Variable    Label      DF  Variance Inflation
Intercept   Intercept  1   0
father      father     1   1.00607
mother      mother     1   1.00660
gender                 1   1.00188
EXAMPLE
By Gary Bedford & Christine Vendikos
5. Variable Selection Methods
A. Stepwise Regression
Variable selection methods
(1) Why do we need to select the variables?
Stepwise Regression
(p − 1)-variable model:
  Y_i = β_0 + β_1 x_i,1 + … + β_{p−1} x_i,p−1 + ε_i
p-variable model:
  Y_i = β_0 + β_1 x_i,1 + … + β_{p−1} x_i,p−1 + β_p x_i,p + ε_i
Partial F-test
Hypothesis test:
  H_0p: β_p = 0
  H_1p: β_p ≠ 0
F statistic:
  F_p = [(SSE_{p−1} − SSE_p) / 1] / {SSE_p / [n − (p + 1)]};  reject H_0p when F_p > f_{1, n−(p+1), α}
Equivalent t-test:
  t_p = β̂_p / SE(β̂_p);  reject H_0p if |t_p| > t_{n−(p+1), α/2}
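The partial F-test above can be sketched numerically. This is an illustration on simulated data (the variable names and coefficients are made up): the (p − 1)-variable model drops x_p, and F_p compares the two SSEs.

```python
import numpy as np
from scipy import stats

def sse(X, y):
    """SSE of the least-squares fit of y on the columns of X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    return r @ r

# Simulated data where x2 genuinely contributes.
rng = np.random.default_rng(4)
n = 60
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1 + 2 * x1 + 1.5 * x2 + rng.normal(size=n)

X_reduced = np.column_stack([np.ones(n), x1])     # (p-1)-variable model
X_full = np.column_stack([np.ones(n), x1, x2])    # p-variable model
p = 2

# F_p = (SSE_{p-1} - SSE_p) / (SSE_p / [n - (p+1)]) ~ f_{1, n-(p+1)} under H_0p
Fp = (sse(X_reduced, y) - sse(X_full, y)) / (sse(X_full, y) / (n - (p + 1)))
p_value = stats.f.sf(Fp, 1, n - (p + 1))
```

Since x2 really belongs in the model, F_p is large and H_0p is rejected at any usual α.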
Partial correlation coefficients
5. Variable Selection Methods
A. Stepwise Regression: SAS Example
Example 11.5 (T&D pg. 416), 11.9 (T&D pg. 431)
The following table shows data on the heat evolved in calories during the hardening of
cement on a per gram basis (y) along with the percentages of four ingredients: tricalcium
aluminate (x1), tricalcium silicate (x2), tetracalcium alumino ferrite (x3), and dicalcium
silicate (x4).
No. X1 X2 X3 X4 Y
1 7 26 6 60 78.5
2 1 29 15 52 74.3
3 11 56 8 20 104.3
4 11 31 8 47 87.6
5 7 52 6 33 95.9
6 11 55 9 22 109.2
7 3 71 17 6 102.7
8 1 31 22 44 72.5
9 2 54 18 22 93.1
10 21 47 4 26 115.9
11 1 40 23 34 83.8
12 11 66 9 12 113.3
13 10 68 8 12 109.4
SAS Program (stepwise variable selection is used)
data example115;
input x1 x2 x3 x4 y;
datalines;
7 26 6 60 78.5
1 29 15 52 74.3
11 56 8 20 104.3
11 31 8 47 87.6
7 52 6 33 95.9
11 55 9 22 109.2
3 71 17 6 102.7
1 31 22 44 72.5
2 54 18 22 93.1
21 47 4 26 115.9
1 40 23 34 83.8
11 66 9 12 113.3
10 68 8 12 109.4
;
run;
proc reg data=example115;
model y = x1 x2 x3 x4 /selection=stepwise;
run;
Selected SAS output
[Parameter estimates table: Variable, Parameter Estimate, Standard Error, Type II SS, F Value, Pr > F]
SAS Output (cont)
All variables left in the model are significant at the 0.1500 level.
No other variable met the 0.1500 significance level for entry into the model.
5. Variable Selection Methods
B. Best Subsets Regression
Best Subsets Regression
Best Subsets Regression
Optimality Criteria
• r²_p-Criterion:
  r²_p = SSR_p / SST = 1 − SSE_p / SST
• Adjusted r²_p-Criterion:
  r²_adj,p = 1 − MSE_p / MST
• C_p-Criterion (recommended for its ease of computation and its ability to judge the predictive power of a model). The criterion is based on the standardized total mean square error of prediction,
  Γ_p = (1/σ²) Σ_{i=1}^{n} [E(Ŷ_ip) − E(Y_i)]²,
estimated by
  C_p = SSE_p / σ̂² + 2(p + 1) − n
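The C_p computation above can be sketched as follows. This is an illustration on simulated data (three candidate predictors, only the first of which matters), with σ̂² taken as the MSE of the full model:

```python
import numpy as np

def sse(X, y):
    """SSE of the least-squares fit of y on the columns of X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    return r @ r

rng = np.random.default_rng(5)
n = 80
Z = rng.normal(size=(n, 3))                # three candidate predictors
y = 1 + 2 * Z[:, 0] + rng.normal(size=n)   # only the first predictor matters
ones = np.ones((n, 1))

# sigma_hat^2 = MSE of the full model (intercept + all 3 predictors)
sigma2_hat = sse(np.hstack([ones, Z]), y) / (n - 4)

def cp(cols):
    """C_p = SSE_p / sigma_hat^2 + 2(p + 1) - n for the subset `cols`."""
    p = len(cols)
    return sse(np.hstack([ones, Z[:, cols]]), y) / sigma2_hat + 2 * (p + 1) - n
```

For the full model, C_p = (n − 4) + 2·4 − n = 4 = p + 1 by construction; a good subset is one whose C_p is close to p + 1, while subsets missing an important predictor (e.g. dropping the first) show a much larger C_p.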
Best Subsets Regression
Algorithm
Note that our problem is to find the minimum of a given criterion function. Two approaches:
• Use the stepwise regression algorithm, replacing the partial F criterion with another criterion such as C_p.
• Enumerate all possible cases and find the minimum of the criterion function.
Other possibilities?
Best Subsets Regression & SAS
proc reg data=example115;
model y = x1 x2 x3 x4 /selection=ADJRSQ;
run;
For the selection option, SAS has implemented 9 methods
in total. For best subset method, we have the following
options:
Maximum R2 Improvement (MAXR)
Minimum R2 Improvement (MINR)
R2 Selection (RSQUARE)
Adjusted R2 Selection (ADJRSQ)
Mallows' Cp Selection (CP)
6. Building A Multiple
Regression Model
Steps and Strategy
Modeling is an iterative process. Several cycles of the steps may be needed before arriving at the final model.
The basic process consists of seven steps.
Get started and Follow the Steps

Step III
Step IV
Divide the data:
• Training set: for building the model
• Test set: for checking the model
How to divide?
• Large sample: half and half
• Small sample: make the size of the training set > 16
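The half-and-half split for a large sample can be sketched in a couple of lines (a minimal illustration; n = 100 is an arbitrary choice):

```python
import numpy as np

# Randomly permute the observation indices, then split half-and-half:
# the first half for the training set, the second half for the test set.
rng = np.random.default_rng(6)
n = 100
idx = rng.permutation(n)
train_idx, test_idx = idx[: n // 2], idx[n // 2 :]
```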
Step V

Step VI

Step VII

Regression Diagnostics (Step VI)

Linear Regression Assumptions
Residual Plot for Functional Form (Linearity)
[Plots of residuals e versus X]
Residual Plot for Equal Variance
[Plots of standardized residuals SR versus X; a fan-shaped pattern indicates unequal variance]
Standardized residuals are typically used (the residual divided by the standard error of prediction).
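The standardized residuals used in these plots can be sketched numerically. One common definition (an assumption here, consistent with "residual divided by standard error of prediction") is SR_i = e_i / (s·√(1 − h_ii)), where h_ii is the ith diagonal element of the hat matrix H = X(X′X)⁻¹X′; the data below are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
e = y - H @ y                              # residuals
s = np.sqrt(e @ e / (n - 2))               # sqrt(MSE), one predictor (k = 1)
sr = e / (s * np.sqrt(1 - np.diag(H)))     # standardized residuals
```

Plotting `sr` against X (or against the fitted values) gives the diagnostic plots sketched above; when the model assumptions hold, the points scatter randomly within roughly ±3.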
Residual Plot for Independence
[Plots of standardized residuals SR versus X]
Questions?
www.ams.sunysb.edu/~zhu
zhu@ams.sunysb.edu