Slide
© 2008 Thomson South-Western. All Rights Reserved
Slides by John Loucks, updated by Spiros Velianitis
Chapter 15
Multiple Regression
Multiple Regression Model
Least Squares Method
Multiple Coefficient of Determination
Model Assumptions
Testing for Significance
Using the Estimated Regression Equation
for Estimation and Prediction
Qualitative Independent Variables
Residual Analysis
Logistic Regression
Multiple Regression Model

The equation that describes how the dependent variable y is related to the independent variables x1, x2, . . . , xp and an error term is:

y = β0 + β1x1 + β2x2 + . . . + βpxp + ε

where:
β0, β1, β2, . . . , βp are the parameters, and
ε is a random variable called the error term
Multiple Regression Equation

The equation that describes how the mean value of y is related to x1, x2, . . . , xp is:

E(y) = β0 + β1x1 + β2x2 + . . . + βpxp
Estimated Multiple Regression Equation

A simple random sample is used to compute sample statistics b0, b1, b2, . . . , bp that are used as the point estimators of the parameters β0, β1, β2, . . . , βp.

ŷ = b0 + b1x1 + b2x2 + . . . + bpxp
Estimation Process

Multiple Regression Model
y = β0 + β1x1 + β2x2 + . . . + βpxp + ε

Multiple Regression Equation
E(y) = β0 + β1x1 + β2x2 + . . . + βpxp

Unknown parameters are β0, β1, β2, . . . , βp

Sample Data:
x1  x2  . . .  xp   y
.   .          .    .
.   .          .    .

Estimated Multiple Regression Equation
ŷ = b0 + b1x1 + b2x2 + . . . + bpxp

Sample statistics are b0, b1, b2, . . . , bp

b0, b1, b2, . . . , bp provide estimates of β0, β1, β2, . . . , βp
Least Squares Method

Least Squares Criterion

min Σ(yi − ŷi)²

where yi is the observed value and ŷi is the estimated value of the dependent variable for the ith observation.
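The least squares criterion above can be sketched numerically. This is a minimal illustration with made-up data (not the textbook example), using NumPy's least squares solver to minimize Σ(yi − ŷi)²:

```python
import numpy as np

# Hypothetical sample data (illustration only, not the textbook example):
# two independent variables x1, x2 and dependent variable y.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y  = np.array([3.1, 3.9, 7.2, 7.8, 10.1])

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones_like(x1), x1, x2])

# lstsq finds b0, b1, b2 that minimize sum((y_i - yhat_i)^2).
b = np.linalg.lstsq(X, y, rcond=None)[0]
yhat = X @ b
print("coefficients b0, b1, b2:", b)
print("SSE:", np.sum((y - yhat) ** 2))
```

The same coefficients fall out of the normal equations (XᵀX)b = Xᵀy; `lstsq` is simply the numerically stable way to solve them.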
Multiple Coefficient of Determination

Excel's ANOVA Output

ANOVA
            df   SS         MS         F          Significance F
Regression   2   500.3285   250.1643   42.76013   2.32774E-07
Residual    17    99.45697    5.85041
Total       19   599.7855

SSR = 500.3285 (Regression SS); SST = 599.7855 (Total SS)
Multiple Coefficient of Determination

R² = SSR/SST

R² = 500.3285/599.7855 = .83418
Adjusted Multiple Coefficient of Determination

Ra² = 1 − (1 − R²)(n − 1)/(n − p − 1)

Ra² = 1 − (1 − .834179)(20 − 1)/(20 − 2 − 1) = .814671
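Plugging the ANOVA values into these two formulas can be checked with a short script (values taken from the Excel output slide above):

```python
# SSR and SST from the Excel ANOVA output; n = 20 observations, p = 2
# independent variables, as in the example.
SSR, SST = 500.3285, 599.7855
n, p = 20, 2

R2 = SSR / SST
R2_adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)
print(round(R2, 5), round(R2_adj, 5))  # approx. 0.83418 and 0.81467
```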
Assumptions About the Error Term ε

The error ε is a random variable with mean of zero.

The variance of ε, denoted by σ², is the same for all values of the independent variables.

The values of ε are independent.

The error ε is a normally distributed random variable reflecting the deviation between the y value and the expected value of y given by β0 + β1x1 + β2x2 + . . . + βpxp.
Testing for Significance

In simple linear regression, the F and t tests provide the same conclusion.

In multiple regression, the F and t tests have different purposes.
Testing for Significance: F Test
The F test is referred to as the test for overall
significance.
The F test is used to determine whether a significant
relationship exists between the dependent variable
and the set of all the independent variables.
Testing for Significance: t Test

If the F test shows an overall significance, the t test is used to determine whether each of the individual independent variables is significant.

A separate t test is conducted for each of the independent variables in the model.

We refer to each of these t tests as a test for individual significance.
Testing for Significance: F Test

Hypotheses:
H0: β1 = β2 = . . . = βp = 0
Ha: One or more of the parameters is not equal to zero.

Test Statistic:
F = MSR/MSE

Rejection Rule:
Reject H0 if p-value < α or if F > Fα, where Fα is based on an F distribution with p d.f. in the numerator and n − p − 1 d.f. in the denominator.
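As a sketch, the F statistic and its p-value from the ANOVA slide can be reproduced with SciPy (assuming SciPy is available; MSR, MSE, p = 2, and n = 20 come from the example):

```python
from scipy import stats

# ANOVA values from the Excel output slide: p = 2 predictors, n = 20.
MSR, MSE = 250.1643, 5.85041
p, n = 2, 20

F = MSR / MSE
p_value = stats.f.sf(F, p, n - p - 1)  # upper-tail area of F(p, n-p-1)
print(round(F, 4), p_value)  # approx. 42.7601 and 2.3e-07
```

Since the p-value is far below any usual α, H0: β1 = β2 = 0 is rejected, matching the slide's "Significance F" of 2.32774E-07.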
Testing for Significance: t Test

Hypotheses:
H0: βi = 0
Ha: βi ≠ 0

Test Statistic:
t = bi / s_bi

Rejection Rule:
Reject H0 if p-value < α or if t < −tα/2 or t > tα/2, where tα/2 is based on a t distribution with n − p − 1 degrees of freedom.
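A minimal check of one such t test, using the Experience coefficient from the Excel output shown later in the deck (b = 1.14758, standard error 0.2976, in the three-variable model where n − p − 1 = 16), assuming SciPy is available:

```python
from scipy import stats

# Coefficient and standard error for Experience from the Excel output
# slide (three-variable model: n = 20, p = 3, so 16 d.f.).
b1, s_b1 = 1.14758, 0.2976
df = 16

t = b1 / s_b1
p_value = 2 * stats.t.sf(abs(t), df)  # two-tailed p-value
print(round(t, 4), round(p_value, 4))  # approx. 3.8561 and 0.0014
```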
Testing for Significance: Multicollinearity

The term multicollinearity refers to the correlation among the independent variables.

When the independent variables are highly correlated (say, |r| > .7), it is not possible to determine the separate effect of any particular independent variable on the dependent variable.
Testing for Significance: Multicollinearity
Every attempt should be made to avoid including
independent variables that are highly correlated.
If the estimated regression equation is to be used only
for predictive purposes, multicollinearity is usually
not a serious problem.
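A quick way to screen for the |r| > .7 situation is to compute pairwise correlations among the predictors. This sketch uses simulated data (not the textbook example) where x2 is deliberately built to be highly correlated with x1:

```python
import numpy as np

# Simulated predictors (illustration only): x2 is nearly a copy of x1,
# x3 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=100)  # highly correlated with x1
x3 = rng.normal(size=100)                   # unrelated to x1

r12 = np.corrcoef(x1, x2)[0, 1]
r13 = np.corrcoef(x1, x3)[0, 1]
print(round(r12, 3), round(r13, 3))
# r12 is close to 1, well above the |r| > .7 rule of thumb; r13 is near 0.
```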
Using the Estimated Regression Equation for Estimation and Prediction

The procedures for estimating the mean value of y and predicting an individual value of y in multiple regression are similar to those in simple regression.

We substitute the given values of x1, x2, . . . , xp into the estimated regression equation and use the corresponding value of ŷ as the point estimate.
Using the Estimated Regression Equation for Estimation and Prediction

The formulas required to develop interval estimates for the mean value of y and for an individual value of y are beyond the scope of the textbook.

Software packages for multiple regression will often provide these interval estimates.
Qualitative Independent Variables

In many situations we must work with qualitative independent variables such as gender (male, female), method of payment (cash, check, credit card), etc.

For example, x2 might represent gender where x2 = 0 indicates male and x2 = 1 indicates female.

In this case, x2 is called a dummy or indicator variable.
Qualitative Independent Variables

Example: Programmer Salary Survey

As an extension of the problem involving the computer programmer salary survey, suppose that management also believes that the annual salary is related to whether the individual has a graduate degree in computer science or information systems.

The years of experience, the score on the programmer aptitude test, whether the individual has a relevant graduate degree, and the annual salary ($1000) for each of the sampled 20 programmers are shown on the next slide.
Qualitative Independent Variables

Exper.  Score  Degr.  Salary      Exper.  Score  Degr.  Salary
4        78    No     24.0        9        88    Yes    38.0
7       100    Yes    43.0        2        73    No     26.6
1        86    No     23.7        10       75    Yes    36.2
5        82    Yes    34.3        5        81    No     31.6
8        86    Yes    35.8        6        74    No     29.0
10       84    Yes    38.0        8        87    Yes    34.0
0        75    No     22.2        4        79    No     30.1
1        80    No     23.1        6        94    Yes    33.9
6        83    No     30.0        3        70    No     28.2
6        91    Yes    33.0        3        89    No     30.0
Estimated Regression Equation

ŷ = b0 + b1x1 + b2x2 + b3x3

where:
y = annual salary ($1000)
x1 = years of experience
x2 = score on programmer aptitude test
x3 = 0 if individual does not have a graduate degree
     1 if individual does have a graduate degree

x3 is a dummy variable
Qualitative Independent Variables

Excel's Regression Equation Output (columns F-I are not shown):

              Coeffic.   Std. Err.  t Stat   P-value
Intercept     7.94485    7.3808     1.0764   0.2977
Experience    1.14758    0.2976     3.8561   0.0014
Test Score    0.19694    0.0899     2.1905   0.04364
Grad. Degr.   2.28042    1.98661    1.1479   0.26789   <- Not significant
More Complex Qualitative Variables

If a qualitative variable has k levels, k − 1 dummy variables are required, with each dummy variable being coded as 0 or 1.

For example, a variable with levels A, B, and C could be represented by x1 and x2 values of (0, 0) for A, (1, 0) for B, and (0, 1) for C.

Care must be taken in defining and interpreting the dummy variables.
More Complex Qualitative Variables

For example, a variable indicating level of education could be represented by x1 and x2 values as follows:

Highest Degree   x1   x2
Bachelor's        0    0
Master's          1    0
Ph.D.             0    1
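The k − 1 coding in the table above can be produced mechanically with pandas, as a sketch; with drop_first=True, the alphabetically first level (Bachelor's) becomes the (0, 0) baseline:

```python
import pandas as pd

# A 3-level qualitative variable coded with k - 1 = 2 dummies.
df = pd.DataFrame({"degree": ["Bachelor's", "Master's", "Ph.D.", "Master's"]})

# drop_first=True keeps k - 1 columns; "Bachelor's" is the baseline (0, 0),
# matching the table above.
dummies = pd.get_dummies(df["degree"], drop_first=True, dtype=int)
print(dummies)
```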
Residual Analysis

For simple linear regression the residual plot against ŷ and the residual plot against x provide the same information.

In multiple regression analysis it is preferable to use the residual plot against ŷ to determine if the model assumptions are satisfied.
Standardized Residual Plot Against ŷ

Standardized residuals are frequently used in residual plots for purposes of:
identifying outliers (typically, standardized residuals < −2 or > +2), and
providing insight about the assumption that the error term ε has a normal distribution.

The computation of the standardized residuals in multiple regression analysis is too complex to be done by hand, but Excel's Regression tool can be used.
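Though tedious by hand, the computation is short in code. This sketch uses one common textbook definition (residual divided by s√(1 − hᵢᵢ), where hᵢᵢ is the leverage from the hat matrix and s = √MSE), applied to the first six observations of the salary data for brevity; Excel's "Standard Residuals" column uses a slightly different formula, so the values need not match exactly:

```python
import numpy as np

# First six observations of the two-variable salary example
# (experience, test score, salary); illustration only.
X = np.column_stack([
    np.ones(6),
    [4, 7, 1, 5, 8, 10],         # x1: years of experience
    [78, 100, 86, 82, 86, 84],   # x2: aptitude test score
])
y = np.array([24.0, 43.0, 23.7, 34.3, 35.8, 38.0])

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b                                   # residuals
H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix
n, k = X.shape
s = np.sqrt(e @ e / (n - k))                    # sqrt(MSE)
std_resid = e / (s * np.sqrt(1 - np.diag(H)))   # standardized residuals
print(np.round(std_resid, 3))
```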
Standardized Residual Plot Against ŷ

Excel Value Worksheet (rows 37-51 are not shown):

RESIDUAL OUTPUT
Observation   Predicted Y    Residuals      Standard Residuals
1             27.89626052    -3.89626052    -1.771706896
2             37.95204323     5.047956775    2.295406016
3             26.02901122    -2.32901122    -1.059047572
4             32.11201403     2.187985973    0.994920596
5             36.34250715    -0.54250715    -0.246688757
Standardized Residual Plot Against ŷ

Excel's Standardized Residual Plot

[Scatter plot of standard residuals (vertical axis, −2 to 3) against predicted salary (horizontal axis, 0 to 50); the point with a standardized residual above +2 is flagged as an outlier.]
Logistic Regression
Logistic regression can be used to model situations in
which the dependent variable, y, may only assume
two discrete values, such as 0 and 1.
In many ways logistic regression is like ordinary
regression. It requires a dependent variable, y, and
one or more independent variables.
The ordinary multiple regression model is not
applicable.
Logistic Regression

Logistic Regression Equation

E(y) = e^(β0 + β1x1 + β2x2 + . . . + βpxp) / (1 + e^(β0 + β1x1 + β2x2 + . . . + βpxp))

Estimated Logistic Regression Equation

ŷ = e^(b0 + b1x1 + b2x2 + . . . + bpxp) / (1 + e^(b0 + b1x1 + b2x2 + . . . + bpxp))
Logistic Regression

Using the Estimated Logistic Regression Equation

For annual spending of $2000 (x1 = 2) without the credit card (x2 = 0):

ŷ = e^(−2.1464 + 0.3416(2) + 1.0987(0)) / (1 + e^(−2.1464 + 0.3416(2) + 1.0987(0))) = 0.1880

For annual spending of $2000 (x1 = 2) with the credit card (x2 = 1):

ŷ = e^(−2.1464 + 0.3416(2) + 1.0987(1)) / (1 + e^(−2.1464 + 0.3416(2) + 1.0987(1))) = 0.4099
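The two probabilities above (note the intercept is −2.1464) can be verified with a few lines of Python:

```python
import math

# Estimated logistic regression equation from the slide:
# yhat = e^g / (1 + e^g), with g = -2.1464 + 0.3416*x1 + 1.0987*x2.
def p_hat(x1, x2):
    g = -2.1464 + 0.3416 * x1 + 1.0987 * x2
    return math.exp(g) / (1 + math.exp(g))

print(round(p_hat(2, 0), 4))  # 0.188  (spends $2000, no credit card)
print(round(p_hat(2, 1), 4))  # 0.4099 (spends $2000, has credit card)
```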
Logistic Regression

Testing for Significance

Hypotheses:
H0: β1 = β2 = 0
Ha: One or both of the parameters is not equal to zero.

Test Statistic:
z = bi / s_bi

Rejection Rule:
Reject H0 if p-value < α
Logistic Regression

Testing for Significance

Conclusions:

For independent variable x1: z = 2.66 and the p-value = .008. Hence, β1 ≠ 0. In other words, x1 is statistically significant.

For independent variable x2: z = 2.47 and the p-value = .013. Hence, β2 ≠ 0. In other words, x2 is also statistically significant.
Logistic Regression

With logistic regression it is difficult to interpret the relationship between the variables because the equation is not linear, so we use the concept called the odds ratio.

The odds in favor of an event occurring is defined as the probability the event will occur divided by the probability the event will not occur.

Odds in Favor of an Event Occurring:

odds = P(y = 1 | x1, x2, . . . , xp) / P(y = 0 | x1, x2, . . . , xp)
     = P(y = 1 | x1, x2, . . . , xp) / (1 − P(y = 1 | x1, x2, . . . , xp))

Odds Ratio:

Odds Ratio = odds1 / odds0
Logistic Regression

Estimated Probabilities

Annual Spending   $1000   $2000   $3000   $4000   $5000   $6000   $7000
Credit Card Yes   0.3305  0.4099  0.4943  0.5790  0.6593  0.7314  0.7931
Credit Card No    0.1413  0.1880  0.2457  0.3143  0.3921  0.4758  0.5609

(The $2000 values, 0.4099 and 0.1880, were computed earlier.)
Logistic Regression

Comparing Odds

Suppose we want to compare the odds of making a $200 purchase for customers who spend $2000 annually and have a Simmons credit card to the odds of making a $200 purchase for customers who spend $2000 annually and do not have a Simmons credit card.

Estimate of odds1 = .4099 / (1 − .4099) = .6946

Estimate of odds0 = .1880 / (1 − .1880) = .2315

Estimate of odds ratio = .6946 / .2315 = 3.00
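The odds calculations above can be sketched as:

```python
# Probabilities from the estimated-probabilities slide ($2000 spending).
p1, p0 = 0.4099, 0.1880

odds1 = p1 / (1 - p1)      # customers with the Simmons card
odds0 = p0 / (1 - p0)      # customers without the card
odds_ratio = odds1 / odds0

print(round(odds1, 4), round(odds0, 4), round(odds_ratio, 2))
# approx. 0.6946, 0.2315, 3.0
```

The result is consistent with the dummy-variable coefficient 1.0987 in the estimated equation, since in logistic regression the odds ratio for a one-unit change equals e raised to the coefficient, and e^1.0987 ≈ 3.00.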