
ACE INSTITUTE OF MANAGEMENT

MASTER OF BUSINESS ADMINISTRATION (MBAE)


Semester: III
Credits: 2
Course Name: DATA ANALYSIS AND DECISION MODELING
Effective Date: April, 2011
Class Schedule: Wednesday and Thursday
Time: (6:00-9:00 P.M.)
CORRELATION ANALYSIS
PURPOSE OF CORRELATION ANALYSIS
The population correlation coefficient ρ (rho) is used
to measure the strength of the relationship between the variables.
The sample correlation coefficient r is an estimate of ρ
and is used to measure the strength of the linear
relationship in the sample observations.
CORRELATION
Mutual relationship between two or more variables

Variables under consideration are said to be correlated if a
change in one variable tends to be accompanied by a change in
another variable

Example: height and weight of persons
weight and blood pressure
price and supply
demand for a commodity and its price
sales of a company and Earning per Share or Price-Earning Ratio of its
stock
income and house value

We are interested in knowing what kind of relationship exists and
what the degree (strength) of the relationship between the
variables is

TYPES OF CORRELATION
Positive and Negative
Simple correlation
Partial correlation
Multiple correlation
Linear and Non-linear

MEASUREMENT OF CORRELATION
SCATTER DIAGRAM METHOD:
A scatter plot is a graph of the ordered pairs (x, y) of numbers
consisting of the independent variable, x, and the dependent
variable, y.

KARL PEARSON'S COEFFICIENT OF CORRELATION

RANK METHOD
for finding the coefficient of correlation for qualitative
characteristics such as beauty, intelligence, honesty, etc.
SCATTER DIAGRAM METHOD
The scatter shows the joint variation among the pairs
of values and gives an idea about the degree and
direction of the relationship between the variables x
and y

The greater the scatter of points over the graph, the
weaker the relationship between the variables

If all the points lie on a straight line, there is either
perfect positive or perfect negative correlation

The nearer the points are to the straight line, the
higher the degree of correlation; the farther the points
are from the straight line, the lower the degree of correlation

If the points are widely scattered and no trend is
revealed, the variables may be uncorrelated

The method does not provide an exact measure of the extent
of the relationship between the variables

GRAPHICAL EXPLORATION:

SCATTER PLOT (THE COLLECTION OF DOTS CORRESPONDING TO (x_i, y_i))
PERFECT POSITIVE CORRELATION
[Scatter plot of y against x: all points lie exactly on an upward-sloping line]
r = 1
PERFECT NEGATIVE CORRELATION
[Scatter plot of y against x: all points lie exactly on a downward-sloping line]
r = -1
EXAMPLES OF R VALUES:
EXAMPLE
The independent variable in
this example is the
number of hours studied.
The grade the student
receives is the dependent
variable.
The grade a student
receives depends upon the
number of hours he or she
studies.
Are these two variables
related?

Student   Hours studied   % Grade
A         6               82
B         2               63
C         1               57
D         5               88
E         3               68
F         2               75
SCATTER PLOT
The independent variable is plotted on the horizontal x-
axis. The dependent variable is plotted on the vertical
y-axis.
[Scatter plot: Hours Studied (x-axis, 0 to 7) vs. Grade (%) (y-axis, 0 to 100)]
RANGE OF CORRELATION COEFFICIENT
In case of an exact
positive linear
relationship the value
of r is +1.
In case of a strong
positive linear
relationship, the value
of r will be close to +1.
[Scatter plot: Correlation = +1, Dependent variable vs. Independent variable]
RANGE OF CORRELATION COEFFICIENT
In case of an exact
negative linear
relationship the
value of r is -1.
In case of a strong
negative linear
relationship, the
value of r will be
close to -1.
[Scatter plot: Correlation = -1, Dependent variable vs. Independent variable]
RANGE OF CORRELATION COEFFICIENT
In case of a weak
relationship the value
of r will be close to 0,
i.e. absence of a linear
relationship.
A low or zero value
of r means that the
relationship is not
linear, but there could
be another type of
relationship.
[Scatter plot: Correlation = 0, Dependent variable vs. Independent variable]
x     y
1     0
0     1
-1    0
0     -1
These four points satisfy x^2 + y^2 = 1: a perfect (nonlinear)
relationship, yet r = 0.
RANGE OF CORRELATION COEFFICIENT
In case of a nonlinear
relationship the value
of r will be close to 0.
[Scatter plot: Correlation = 0, curved pattern of Dependent variable vs.
Independent variable]
KARL PEARSON CORRELATION COEFFICIENT
ρ (rho) for population values,
r for sample values,
usually denoted by r(x,y), or r_xy, or
simply r.
r is a numerical measure of the linear relationship
between the variables.

(PEARSON PRODUCT-MOMENT)
SAMPLE CORRELATION

r = Cov(x, y) / sqrt(Var(x) Var(y)) = S_xy / sqrt(S_xx S_yy)
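The sample formula above can be sketched in a few lines of Python; as an illustration it is applied to the hours-studied/grade data from the earlier scatter-plot example (the value ~0.86 is computed here, not stated in the notes):

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation: r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hours studied vs. % grade from the earlier example
hours = [6, 2, 1, 5, 3, 2]
grade = [82, 63, 57, 88, 68, 75]
r = pearson_r(hours, grade)  # about 0.86: a strong positive correlation
```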
FEATURES OF ρ AND r
Unit free
Range between -1 and 1
The closer to -1, the stronger the negative
linear relationship
The closer to 1, the stronger the positive
linear relationship
The closer to 0, the weaker the linear
relationship
EXAMPLE:
Numbers of weeks    Speed gain
(in the program)    (words per minute)
3                   86
5                   118
2                   49
8                   193
6                   164
9                   232
r = 0.991
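The stated r can be checked with a short script (data re-typed here; the 8-week observation is taken as 193 wpm, which is the value consistent with r = 0.991):

```python
import math

weeks = [3, 5, 2, 8, 6, 9]
gain = [86, 118, 49, 193, 164, 232]
n = len(weeks)
mx, my = sum(weeks) / n, sum(gain) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(weeks, gain))
sxx = sum((a - mx) ** 2 for a in weeks)
syy = sum((b - my) ** 2 for b in gain)
r = sxy / math.sqrt(sxx * syy)  # rounds to 0.991
```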
EXAMPLE:
COMPUTE COEFFICIENT OF CORRELATION
X     Y
6     9
2     11
10    ?
4     8
8     7
Arithmetic means of the X- and Y-series are 6 and 8
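One way to work this exercise: the given mean of Y pins down the missing value, after which r follows from the usual formula. A sketch (the value 5 and r ≈ -0.92 are derived here, not stated in the notes):

```python
import math

# Mean of Y is given as 8 over 5 observations, so sum(Y) = 40.
x = [6, 2, 10, 4, 8]
y_known = [9, 11, 8, 7]          # Y values excluding the missing one
missing = 5 * 8 - sum(y_known)   # = 40 - 35 = 5
y = [9, 11, missing, 8, 7]       # restore original order

mx, my = sum(x) / 5, sum(y) / 5
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
r = sxy / math.sqrt(sum((a - mx) ** 2 for a in x)
                    * sum((b - my) ** 2 for b in y))
# r is about -0.92: a strong negative correlation
```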
EXAMPLE
THE FOLLOWING DATA PERTAIN TO THE DEMAND FOR A
PRODUCT (IN THOUSANDS OF UNITS) AND ITS PRICE (IN RS.)
CHARGED IN FIVE DIFFERENT AREAS:
Price x     Demand y
20          22
16          41
10          141
11          89
14          56
Draw a scatter diagram
Calculate the coefficient of correlation.
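For part (b) of this exercise, a minimal check (the value ≈ -0.91 is computed here, not stated in the notes):

```python
import math

price = [20, 16, 10, 11, 14]
demand = [22, 41, 141, 89, 56]
n = len(price)
mx, my = sum(price) / n, sum(demand) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(price, demand))
r = sxy / math.sqrt(sum((a - mx) ** 2 for a in price)
                    * sum((b - my) ** 2 for b in demand))
# r is about -0.91: demand falls strongly as price rises
```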
EXAMPLE
THE ANNUAL LABOR WELFARE FUND (LAKHS OF RUPEES) AND THE
CORRESPONDING ANNUAL PRODUCTION (IN CRORES OF RUPEES) FOR THE
PAST 8 YEARS OF A COMPANY ARE PRESENTED BELOW.
Year    Welfare fund x    Production y
1       8                 18
2       10                28
3       12                35
4       14                45
5       16                50
6       18                70
7       20                85
8       22                95
Draw a scatter diagram
Calculate the coefficient of correlation between the annual labor welfare fund and the
corresponding annual production. Also test the significance of the correlation
coefficient at a significance level of 0.05
HYPOTHESIS TESTING
Null hypothesis: ρ = 0 (two variables are not associated)
Alternative hypothesis: ρ ≠ 0 (two variables are associated)
Level of significance α = 0.05
Test statistic

t = r sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom

Decision: if the null hypothesis is rejected, there is a
relationship between the two variables.
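Applied to the welfare-fund exercise above (data as tabulated), the test statistic works out roughly as follows; the numeric values are computed here, not taken from the notes:

```python
import math

fund = [8, 10, 12, 14, 16, 18, 20, 22]      # welfare fund (lakhs)
prod = [18, 28, 35, 45, 50, 70, 85, 95]     # production (crores)
n = len(fund)
mx, my = sum(fund) / n, sum(prod) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(fund, prod))
r = sxy / math.sqrt(sum((a - mx) ** 2 for a in fund)
                    * sum((b - my) ** 2 for b in prod))
t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
# r is about 0.99 and t about 15.4, far beyond the two-tail critical
# value t(0.025, 6) = 2.447, so H0: rho = 0 is rejected
```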
t - TEST FOR CORRELATION
Hypotheses
H0: ρ = 0 (No Correlation)
H1: ρ ≠ 0 (Correlation)
Test Statistic

t = r / sqrt((1 - r^2) / (n - 2)), with d.f. = n - 2

where

r = Σ(X_i - X̄)(Y_i - Ȳ) / sqrt(Σ(X_i - X̄)^2 Σ(Y_i - Ȳ)^2)
HYPOTHESIS TESTING (FISHER'S z TRANSFORMATION)
Null hypothesis: ρ = 0 (two variables are not associated)
Alternative hypothesis: ρ ≠ 0 (two variables are associated)
Level of significance α = 0.05
Test statistic

z = (1/2) ln((1 + r) / (1 - r))

Z = (z - ζ) / (1 / sqrt(n - 3)) ~ N(0, 1)

where ζ = (1/2) ln((1 + ρ) / (1 - ρ)) under the null hypothesis.

Decision: if the null hypothesis is rejected, there is a
relationship between the two variables.
EXAMPLE
COEFFICIENT OF CORRELATION BASED ON A SAMPLE OF SIZE
18 WAS COMPUTED TO BE 0.32. CAN WE CONCLUDE THAT THE
VARIABLES ARE CORRELATED AT SIGNIFICANCE LEVELS OF A) 0.05 B) 0.01?
Null hypothesis: ρ = 0 (two variables are not associated)
Alternative hypothesis: ρ > 0 (one-tail test)
Alternative hypothesis: ρ ≠ 0 (two-tail test)
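Using Fisher's z transformation for this example (math.atanh is exactly (1/2) ln((1+r)/(1-r))); the conclusion below is computed here, not stated in the notes:

```python
import math

# r = 0.32, n = 18, H0: rho = 0
r, n = 0.32, 18
z = math.atanh(r)            # Fisher's z of the sample r
Z = z * math.sqrt(n - 3)     # standard normal under H0
# Z is about 1.28, below 1.645 (one tail, 5%) and 1.96 (two tail, 5%),
# so H0 cannot be rejected at 0.05, and hence not at 0.01 either
```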
EXAMPLE
COEFFICIENT OF CORRELATION BASED ON A SAMPLE OF SIZE
24 WAS COMPUTED TO BE 0.75. CAN WE CONCLUDE THAT ρ
EXCEEDS 0.60 AT SIGNIFICANCE LEVELS OF A) 0.05 B) 0.01?
Null hypothesis: ρ = 0.60
Alternative hypothesis: ρ > 0.60 (one-tail test)
Alternative hypothesis: ρ ≠ 0.60 (two-tail test)
CONFIDENCE INTERVAL FOR ρ

z - z_{α/2} / sqrt(n - 3) < ζ < z + z_{α/2} / sqrt(n - 3)

where z = (1/2) ln((1 + r) / (1 - r)) and ζ is the Fisher transform of ρ.
EXAMPLE:
IF r = 0.7 FOR THE MATHEMATICS AND STATISTICS GRADES OF 30
STUDENTS, CONSTRUCT A 95% CONFIDENCE INTERVAL FOR THE
POPULATION CORRELATION COEFFICIENT.

r = 0.70, n = 30, and z_{0.025} = 1.96
z corresponding to r = 0.70 from the table is 0.867

95% confidence interval for the population correlation
coefficient:

0.867 - 1.96 / sqrt(27) < ζ < 0.867 + 1.96 / sqrt(27)

Transforming the resulting z-interval back to the r scale gives
0.45 < ρ < 0.85
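The same interval in a few lines of Python (math.atanh/math.tanh do the transform and its inverse):

```python
import math

r, n, z_crit = 0.70, 30, 1.96
z = math.atanh(r)                       # 0.867, the tabled value
half = z_crit / math.sqrt(n - 3)
lo, hi = math.tanh(z - half), math.tanh(z + half)
# transforming back gives 0.45 < rho < 0.85
```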
construct 95% confidence interval for the
population correlation coefficient when
a) r = 0.72, n = 30
b) r = 0.35, n = 40
c) r = -0.87, n = 35,
d) r = 0.16, n = 42,
construct 99% confidence interval for the
population correlation coefficient when
a) r = 0.72, n = 30
b) r = 0.35, n = 40
c) r = -0.87, n = 35,
d) r = 0.16, n = 42,
STRENGTH VS. SIGNIFICANCE OF THE
CORRELATION:
The significance, given by the P-value, depends on the
statistical evidence. When the P-value is small, we conclude
that the correlation exists.

The strength, given by the r value, is meaningful only if
it is supported by statistical significance.

R^2 = 12.70%
means that the variables in the model explain
about 12.70% of the total variation in the dependent variable.
SAMPLE OF OBSERVATIONS FROM VARIOUS r
VALUES
[Five scatter plots of Y against X, illustrating r = -1, r = -.6, r = 0,
r = .6, and r = 1]
EXAMPLE: PRODUCE STORES
From Excel Printout:
Regression Statistics
Multiple R          0.9705572
R Square            0.94198129
Adjusted R Square   0.93037754
Standard Error      611.751517
Observations        7
(Multiple R is the sample correlation r.)
Is there any
evidence of a linear
relationship between
the Annual Sales of a
store and its Square
Footage at the .05 level
of significance?
H0: ρ = 0 (No association)
H1: ρ ≠ 0 (Association)
α = .05
df = 7 - 2 = 5
Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760
EXAMPLE: PRODUCE STORES SOLUTION
Critical Value(s): t = ±2.5706 (.025 rejection region in each tail)
Test Statistic:

t = r / sqrt((1 - r^2) / (n - 2)) = .9706 / sqrt((1 - .9420) / 5) = 9.0099

Decision: Reject H0
Conclusion: There is evidence of a linear
relationship at the 5% level of
significance.
The value of the t statistic is
exactly the same as the t statistic
value for the test on the slope
coefficient.
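The whole calculation can be reproduced from the raw store data (r and t are recomputed here rather than copied from the printout):

```python
import math

feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]
n = len(feet)
mx, my = sum(feet) / n, sum(sales) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(feet, sales))
r = sxy / math.sqrt(sum((a - mx) ** 2 for a in feet)
                    * sum((b - my) ** 2 for b in sales))
t = r / math.sqrt((1 - r * r) / (n - 2))
# r is about .9706 and t about 9.01, well past t(0.025, 5) = 2.5706
```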
SIMPLE REGRESSION
TOPICS
Introduction
Types of Regression Models
Determining the Simple Linear Regression
Equation
Interpretation of regression coefficients
INTRODUCTION
Decisions based on forecast
Relationship between variables between what is
known and what is to be estimated
e.g. relationship between annual sales and size of store
e.g. relationship between annual profits and investment
in R&D
Regression and Correlation Analyses
Determine nature and strength of relationship
Simple Regression Model develops relationship
between a response variable and ONE explanatory
variable (independent variable)
Simple Regression Analysis determines degree
to which variables are related, how best the model
describes the relationship
PURPOSE OF REGRESSION ANALYSIS
Regression Analysis is Used Primarily to Model
Causality and Provide Prediction
Predict the values of a dependent (response) variable
based on values of at least one independent
(explanatory) variable e.g. predict annual sales based
on expenditure in advertising
Explain the effect of the independent variables on the
dependent variable

Positive Linear Relationship
Negative Linear Relationship
Relationship NOT Linear
No Relationship
TYPES OF RELATIONSHIPS
SIMPLE LINEAR REGRESSION MODEL
Relationship Between Variables is Described by
a Linear Function
The Change of One Variable Causes the Other
Variable to Change
A Dependency of One Variable on the Other


SIMPLE LINEAR REGRESSION MODEL (continued)
Population regression line is a straight line that
describes the dependence of the average
value (conditional mean) of one variable on the
other:

μ_{Y|X} = β0 + β1 X   (conditional mean)

Y_i = β0 + β1 X_i + ε_i

β0 = population Y intercept
β1 = population slope coefficient
ε_i = random error
Y = dependent (response) variable
X = independent (explanatory) variable
SIMPLE LINEAR REGRESSION MODEL (continued)

Y_i = β0 + β1 X_i + ε_i

[Figure: the observed value of Y at X_i lies a random error ε_i above or
below the conditional mean μ_{Y|X} = β0 + β1 X_i]
LINEAR REGRESSION EQUATION
Sample regression line provides an estimate of
the population regression line as well as a
predicted value of Y.

Y_i = b0 + b1 X_i + e_i

b0 = sample Y intercept, b1 = sample slope coefficient, e_i = residual

Ŷ_i = b0 + b1 X_i
Simple Regression Equation
(Fitted Regression Line, Predicted Value)
LINEAR REGRESSION EQUATION (continued)
b0 and b1 are obtained by finding the values of b0
and b1 that minimize the sum of the squared
residuals (Least Squares Method):

Σ e_i^2 = Σ (Y_i - Ŷ_i)^2

b0 provides an estimate of β0
b1 provides an estimate of β1
LEAST SQUARES METHOD

b1 = (Σxy - (Σx)(Σy)/n) / (Σx^2 - (Σx)^2/n)
b0 = Ȳ - b1 X̄
Ȳ = Σy/n
X̄ = Σx/n
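The least squares formulas above translate directly into code; here they are applied to the hours-studied/grade data from the correlation section (the fitted values ~55.7 and ~5.19 are computed here, not stated in the notes):

```python
def least_squares(x, y):
    """b0 and b1 from the least squares formulas above."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    b1 = ((sum(a * b for a, b in zip(x, y)) - sx * sy / n)
          / (sum(a * a for a in x) - sx * sx / n))
    b0 = sy / n - b1 * sx / n
    return b0, b1

# Hours-studied vs. grade data from the correlation section
b0, b1 = least_squares([6, 2, 1, 5, 3, 2], [82, 63, 57, 88, 68, 75])
# roughly: grade = 55.7 + 5.19 * hours
```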
LINEAR REGRESSION EQUATION (continued)
[Figure: observed value Y_i at X_i; population line μ_{Y|X} = β0 + β1 X
with error ε_i; fitted line Ŷ_i = b0 + b1 X_i with residual e_i]
INTERPRETATION OF THE SLOPE AND
INTERCEPT
β0 = E(Y | X = 0) is the average value of Y when the value of
X is zero.
β1 = ΔE(Y | X) / ΔX measures the change in the average
value of Y as a result of a one-unit change in X.
LINEAR REGRESSION EQUATION: EXAMPLE
You wish to examine
the linear dependency
of the annual sales of
produce stores on their
sizes in square
footage. Sample data
for 7 stores were
obtained. Find the
equation of the straight
line that fits the data
best.
Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760
SCATTER DIAGRAM: EXAMPLE
Excel Output:
[Scatter plot: Annual Sales ($000), 0 to 12,000, against Square Feet, 0 to 6,000]
SIMPLE LINEAR REGRESSION EQUATION:
EXAMPLE
From Excel Printout:
Coefficients
Intercept      1636.414726
X Variable 1   1.486633657

Ŷ_i = b0 + b1 X_i = 1636.415 + 1.487 X_i
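The Excel coefficients can be verified from the raw data with the least squares formulas:

```python
feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]
n = len(feet)
sx, sy = sum(feet), sum(sales)
b1 = ((sum(a * b for a, b in zip(feet, sales)) - sx * sy / n)
      / (sum(a * a for a in feet) - sx * sx / n))
b0 = sy / n - b1 * sx / n
# matches the printout: b0 = 1636.4147..., b1 = 1.4866337...
```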
GRAPH OF THE SIMPLE LINEAR REGRESSION
EQUATION: EXAMPLE
[Scatter plot of Annual Sales ($000), 0 to 12,000, against Square Feet,
0 to 6,000, with the fitted regression line]
INTERPRETATION OF RESULTS: EXAMPLE

Ŷ_i = 1636.415 + 1.487 X_i

The slope of 1.487 means that for each increase of one unit
in X, we predict the average of Y to increase by an
estimated 1.487 units.
The equation estimates that for each increase of 1
square foot in the size of the store, the expected annual
sales are predicted to increase by $1,487.
TOPICS
Measures of Variation
Coefficient of Determination
Coefficient of Correlation
MEASURES OF VARIATION:
THE SUM OF SQUARES
SST = SSR + SSE
(Total Sample Variability = Explained Variability + Unexplained Variability)
To examine the ability of the independent variable to
predict the dependent variable
MEASURES OF VARIATION:
THE SUM OF SQUARES (continued)
SST = Total Sum of Squares
Measures the variation of the Y_i values around their
mean, Ȳ
SSR = Regression Sum of Squares
Explained variation attributable to the relationship
between X and Y, between predicted value and mean
value
SSE = Error Sum of Squares
Variation attributable to factors other than the
relationship between X and Y, between observed value
and predicted value
MEASURES OF VARIATION:
THE SUM OF SQUARES (continued)

SST = Σ(Y_i - Ȳ)^2 = ΣY_i^2 - (ΣY_i)^2 / n
SSR = Σ(Ŷ_i - Ȳ)^2 = b0 ΣY_i + b1 ΣX_i Y_i - (ΣY_i)^2 / n
SSE = Σ(Y_i - Ŷ_i)^2 = ΣY_i^2 - b0 ΣY_i - b1 ΣX_i Y_i
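The three sums of squares for the produce-stores data can be computed directly from the definitions; they reproduce the Excel ANOVA values and satisfy SST = SSR + SSE:

```python
feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]
n = len(feet)
sx, sy = sum(feet), sum(sales)
b1 = ((sum(a * b for a, b in zip(feet, sales)) - sx * sy / n)
      / (sum(a * a for a in feet) - sx * sx / n))
b0 = sy / n - b1 * sx / n
my = sy / n
yhat = [b0 + b1 * x for x in feet]
sst = sum((y - my) ** 2 for y in sales)
ssr = sum((yh - my) ** 2 for yh in yhat)
sse = sum((y - yh) ** 2 for y, yh in zip(sales, yhat))
# sst ~ 32251655.7, ssr ~ 30380456.1, sse ~ 1871199.6
```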
MEASURES OF VARIATION:
THE SUM OF SQUARES (continued)
[Figure: at X_i, SST = Σ(Y_i - Ȳ)^2 splits into SSE = Σ(Y_i - Ŷ_i)^2
around the fitted line and SSR = Σ(Ŷ_i - Ȳ)^2 between the fitted line
and Ȳ]
MEASURES OF VARIATION
THE SUM OF SQUARES: EXAMPLE
Excel Output for Produce Stores:
ANOVA
            df   SS            MS           F            Significance F
Regression  1    30380456.12   30380456.1   81.1790902   0.000281201
Residual    5    1871199.595   374239.919
Total       6    32251655.71
Degrees of freedom: regression (explained) df = 1, error (residual) df = 5,
total df = 6. The SS column gives SSR, SSE, and SST.
THE COEFFICIENT OF DETERMINATION

r^2 = SSR / SST = Regression Sum of Squares / Total Sum of Squares

Measures the proportion of variation in Y that is
explained by the independent variable X in the
regression model
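Plugging in the ANOVA values from the produce-stores example recovers the "R Square" figure from the Excel printout:

```python
ssr, sst = 30380456.12, 32251655.71   # from the ANOVA table above
r2 = ssr / sst
# about 0.942: 94.2% of the variation in annual sales is
# explained by square footage
```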
COEFFICIENTS OF DETERMINATION (r^2)
AND CORRELATION (r)
[Four scatter plots of Y against X with fitted lines Ŷ_i = b0 + b1 X_i:
r^2 = 1, r = +1;  r^2 = 1, r = -1;  r^2 = .81, r = +0.9;  r^2 = 0, r = 0]
TOPICS
Standard Error of Estimate
Assumptions of Simple Linear Regression
Model
Residual Analysis

STANDARD ERROR OF ESTIMATE
The standard deviation of the variation of
observations around the regression equation:

S_YX = sqrt(SSE / (n - 2)) = sqrt(Σ(Y_i - Ŷ_i)^2 / (n - 2))
Finding the Standard Error of Estimate

S_YX = sqrt((ΣY^2 - b0 ΣY - b1 ΣXY) / (n - 2))

X = values of the independent variable
Y = values of the dependent variable
b0 = Y-intercept
b1 = slope of the estimating equation
n = number of data points
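Using the SSE from the produce-stores ANOVA output, the standard error of estimate reproduces the "Standard Error" in the Excel printout:

```python
import math

sse, n = 1871199.595, 7   # SSE and sample size from the produce-stores example
s_yx = math.sqrt(sse / (n - 2))
# about 611.75, the "Standard Error" in the Excel printout
```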
INFERENCE ABOUT THE SLOPE:
t - TEST
t Test for a Population Slope
Is there a linear dependency of Y on X?
Null and Alternative Hypotheses
H0: β1 = 0 (No Linear Dependency)
H1: β1 ≠ 0 (Linear Dependency)
Test Statistic

t = (b1 - β1) / S_b1, where S_b1 = S_YX / sqrt(Σ(X_i - X̄)^2)

d.f. = n - 2
EXAMPLE: PRODUCE STORE
Data for 7 Stores:
Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760
Estimated Regression Equation:
Ŷ_i = 1636.415 + 1.487 X_i
The slope of this model is 1.487.
Does Square Footage Affect Annual Sales?
INFERENCES ABOUT THE SLOPE:
t TEST EXAMPLE
H0: β1 = 0
H1: β1 ≠ 0
α = .05
df = 7 - 2 = 5
Critical Value(s): t = ±2.5706 (.025 rejection region in each tail)
Test Statistic: t = 9.0099
Decision: Reject H0
Conclusion: There is evidence that
square footage affects
annual sales.
INFERENCES ABOUT THE SLOPE:
F TEST
F Test for a Population Slope
Is there a linear dependency of Y on X?
Null and Alternative Hypotheses
H0: β1 = 0 (No Linear Dependency)
H1: β1 ≠ 0 (Linear Dependency)
Test Statistic

F = (SSR / 1) / (SSE / (n - 2))

Numerator d.f. = 1, denominator d.f. = n - 2
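For the produce-stores example the F statistic reproduces the ANOVA table's F, and (for simple regression) it equals the square of the slope's t statistic:

```python
ssr, sse, n = 30380456.12, 1871199.595, 7   # from the ANOVA table
f = (ssr / 1) / (sse / (n - 2))
# about 81.18, and equal to t^2 = 9.0099^2
```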
INFERENCES ABOUT THE SLOPE:
CONFIDENCE INTERVAL EXAMPLE
Confidence Interval Estimate of the Slope:

b1 ± t_{n-2} S_b1

At the 95% level of confidence, the confidence interval for the
slope is (1.062, 1.911). It does not include 0.
Conclusion: There is a significant linear dependency of
annual sales on the size of the store.
ESTIMATION OF MEAN VALUES
Confidence Interval Estimate for μ_{Y|X=X_i}:
The Mean of Y given a particular X_i

Ŷ_i ± t_{n-2} S_YX sqrt(1/n + (X_i - X̄)^2 / Σ(X_i - X̄)^2)

t value from the table with df = n - 2
S_YX = standard error of the estimate
The size of the interval varies according to the distance
of X_i from the mean, X̄.
PREDICTION OF INDIVIDUAL VALUES
Prediction Interval for an Individual Response
Y_i at a Particular X_i

Ŷ_i ± t_{n-2} S_YX sqrt(1 + 1/n + (X_i - X̄)^2 / Σ(X_i - X̄)^2)

The addition of 1 increases the width of the interval from that for the
mean of Y.
EXAMPLE: PRODUCE STORES
Data for 7 Stores:
Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760
Regression Equation Obtained:
Ŷ_i = 1636.415 + 1.487 X_i
Consider a store with 2000 square feet.
ESTIMATION OF MEAN VALUES: EXAMPLE
Find the 95% confidence interval for the average
annual sales for stores of 2,000 square feet.
Predicted Sales: Ŷ_i = 1636.415 + 1.487(2000) = 4610.45 ($000)
X̄ = 2350.29, S_YX = 611.75, t_{n-2} = t_5 = 2.5706

Confidence Interval Estimate for μ_{Y|X=X_i}:

Ŷ_i ± t_{n-2} S_YX sqrt(1/n + (X_i - X̄)^2 / Σ(X_i - X̄)^2)
= 4610.45 ± 612.66
PREDICTION INTERVAL FOR Y: EXAMPLE
Find the 95% prediction interval for the annual sales of
one particular store of 2,000 square feet.
Predicted Sales: Ŷ_i = 1636.415 + 1.487(2000) = 4610.45 ($000)
X̄ = 2350.29, S_YX = 611.75, t_{n-2} = t_5 = 2.5706

Prediction Interval for an Individual Y at X = X_i:

Ŷ_i ± t_{n-2} S_YX sqrt(1 + 1/n + (X_i - X̄)^2 / Σ(X_i - X̄)^2)
= 4610.45 ± 1687.68
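Both interval half-widths can be reproduced from the raw store data (small rounding differences from the slides' tabled figures are expected):

```python
import math

feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]
n, t5, x0 = len(feet), 2.5706, 2000
sx = sum(feet); mx = sx / n
b1 = ((sum(a * b for a, b in zip(feet, sales)) - sx * sum(sales) / n)
      / (sum(a * a for a in feet) - sx * sx / n))
b0 = sum(sales) / n - b1 * mx
yhat = b0 + b1 * x0                              # about 4610
sxx = sum((a - mx) ** 2 for a in feet)
s_yx = math.sqrt(sum((y - (b0 + b1 * x)) ** 2
                     for x, y in zip(feet, sales)) / (n - 2))
d = (x0 - mx) ** 2 / sxx
ci_half = t5 * s_yx * math.sqrt(1 / n + d)       # mean response, about 612.7
pi_half = t5 * s_yx * math.sqrt(1 + 1 / n + d)   # single store, about 1687.7
```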
MULTIPLE REGRESSION
TOPICS
The Multiple Regression Model
Residual Analysis
Coefficient of Multiple Determination

THE MULTIPLE REGRESSION MODEL
Relationship between 1 dependent & 2 or more
independent variables is a linear function:

Y_i = β0 + β1 X_1i + β2 X_2i + ... + βk X_ki + ε_i

β0 = population Y-intercept; β1, ..., βk = population slopes;
ε_i = random error; Y = dependent (response) variable;
X_1, ..., X_k = independent (explanatory) variables.
MULTIPLE REGRESSION EQUATION
The coefficients of the multiple regression model are estimated using
sample data.
Multiple regression equation with k independent variables:

Ŷ_i = b0 + b1 X_1i + b2 X_2i + ... + bk X_ki

b0 = estimated intercept; b1, ..., bk = estimated slope coefficients;
Ŷ = estimated (or predicted) value of Y.
Example with two independent variables:

Ŷ = b0 + b1 X_1 + b2 X_2   [a plane in (X_1, X_2, Y) space]
INTERPRETATION OF ESTIMATED
COEFFICIENTS
Slope (b_i)
Estimates that the average value of Y changes by
b_i for each 1-unit increase in X_i, holding all other
variables constant
Example: If b_1 = -2, then fuel oil usage (Y) is
expected to decrease by an estimated 2 gallons for
each 1-degree increase in temperature (X_1), given
the inches of insulation (X_2)
Y-Intercept (b_0)
The estimated average value of Y when all X_i = 0
MULTIPLE REGRESSION MODEL: EXAMPLE
Develop a model for estimating
heating oil used for a single-family
home in the month of January based
on average temperature and amount
of insulation in inches.
Oil (Gal)   Temp (°F)   Insulation (in)
275.30      40          3
363.80      27          3
164.30      40          10
40.80       73          6
94.30       64          6
230.90      34          6
366.70      9           6
300.60      8           10
237.80      23          10
121.40      63          3
31.40       65          10
203.50      41          6
441.10      21          3
323.00      38          3
52.50       58          10
MULTIPLE REGRESSION EQUATION: EXAMPLE
Excel Output:
Coefficients
Intercept      562.1510092
X Variable 1   -5.436580588
X Variable 2   -20.01232067

Ŷ_i = 562.151 - 5.437 X_1i - 20.012 X_2i

For each degree increase in temperature,
the estimated average amount of heating
oil used is decreased by 5.437 gallons,
holding insulation constant.
For each increase in one inch of
insulation, the estimated average use
of heating oil is decreased by 20.012
gallons, holding temperature
constant.
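The same coefficients can be recovered without Excel by solving the least squares normal equations X'X b = X'y. A sketch using the heating-oil data and a small hand-rolled 3x3 Gaussian elimination (no external libraries assumed):

```python
def solve3(A, b):
    # Gaussian elimination with partial pivoting for a 3x3 system
    n = 3
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

oil = [275.30, 363.80, 164.30, 40.80, 94.30, 230.90, 366.70, 300.60,
       237.80, 121.40, 31.40, 203.50, 441.10, 323.00, 52.50]
temp = [40, 27, 40, 73, 64, 34, 9, 8, 23, 63, 65, 41, 21, 38, 58]
insul = [3, 3, 10, 6, 6, 6, 6, 10, 10, 3, 10, 6, 3, 3, 10]
n = len(oil)
# Normal equations X'X b = X'y with predictors [1, temp, insul]
sti = sum(t * i for t, i in zip(temp, insul))
A = [[n, sum(temp), sum(insul)],
     [sum(temp), sum(t * t for t in temp), sti],
     [sum(insul), sti, sum(i * i for i in insul)]]
rhs = [sum(oil),
       sum(t * o for t, o in zip(temp, oil)),
       sum(i * o for i, o in zip(insul, oil))]
b0, b1, b2 = solve3(A, rhs)   # close to 562.151, -5.437, -20.012
```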
STANDARD ERROR OF ESTIMATE FOR MULTIPLE
REGRESSION
The standard error of estimate of the dependent variable
Y on the independent variables:

s_e = sqrt(Σ(Y_i - Ŷ_i)^2 / (n - k - 1))
COEFFICIENT OF MULTIPLE DETERMINATION
Proportion of Total Variation in Y Explained by All X
Variables Taken Together:

r^2_{Y.12...k} = SSR / SST = Explained Variation / Total Variation

Never Decreases When a New X Variable is Added
to the Model
COEFFICIENT OF MULTIPLE DETERMINATION
Regression Statistics
Multiple R          0.72213
R Square            0.52148
Adjusted R Square   0.44172
Standard Error      47.46341
Observations        15
ANOVA
            df   SS          MS          F         Significance F
Regression  2    29460.027   14730.013   6.53861   0.01201
Residual    12   27033.306   2252.776
Total       14   56493.333

              Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%
Intercept     306.52619      114.25389        2.68285    0.01993   57.58835    555.46404
Price         -24.97509      10.83213         -2.30565   0.03979   -48.57626   -1.37392
Advertising   74.13096       25.96732         2.85478    0.01449   17.55303    130.70888

R^2 = SSR / SST = 29460.0 / 56493.3 = .52148
52.1% of the variation in pie sales is
explained by the variation in price
and advertising
ADJUSTED COEFFICIENT OF MULTIPLE
DETERMINATION
Adding additional variables will necessarily reduce
the SSE and increase r^2. To account for this, the
adjusted coefficient of determination is given by

r^2_adj = 1 - (1 - r^2_{Y.12...k}) (n - 1) / (n - k - 1)

Proportion of Variation in Y Explained by All X
Variables Adjusted for the Number of X Variables
Used and Sample Size
Penalizes Excessive Use of Independent Variables
Smaller than r^2_{Y.12...k}
Useful in Comparing among Models having different
explanatory variables
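Applied to the pie-sales output above (r^2 = .52148, n = 15 observations, k = 2 predictors), the adjustment formula reproduces the Excel "Adjusted R Square":

```python
def adjusted_r2(r2, n, k):
    """r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

adj = adjusted_r2(0.52148, 15, 2)   # about .4417
```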
ADJUSTED R^2
Regression Statistics
Multiple R          0.72213
R Square            0.52148
Adjusted R Square   0.44172
Standard Error      47.46341
Observations        15
[ANOVA and coefficient tables as on the previous slide]

r^2_adj = .44172
44.2% of the variation in pie sales is explained by the
variation in price and advertising, taking into account
the sample size and number of independent variables
COEFFICIENT OF MULTIPLE DETERMINATION
Excel Output:
Regression Statistics
Multiple R          0.982654757
R Square            0.965610371
Adjusted R Square   0.959878766
Standard Error      26.01378323
Observations        15

r^2_{Y.12} = SSR / SST
Adjusted r^2 reflects the number of explanatory
variables and sample size, and is smaller than r^2.

INTERPRETATION OF COEFFICIENT OF
MULTIPLE DETERMINATION
r^2_{Y.12} = SSR / SST = .9656
96.56% of the total variation in heating oil can be
explained by temperature and amount of insulation.
r^2_adj = .9599
95.99% of the total fluctuation in heating oil can be
explained by temperature and amount of insulation
after adjusting for the number of explanatory
variables and sample size.
USING THE REGRESSION EQUATION TO
MAKE PREDICTIONS
Predict the amount of heating oil used for a
home if the average temperature is 30° and the
insulation is 6 inches.

Ŷ_i = 562.151 - 5.437 X_1i - 20.012 X_2i
    = 562.151 - 5.437(30) - 20.012(6)
    = 278.969

The predicted heating oil used is 278.97 gallons.
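The prediction is a direct plug-in of the fitted coefficients:

```python
b0, b1, b2 = 562.151, -5.437, -20.012   # fitted coefficients from the example
oil = b0 + b1 * 30 + b2 * 6             # temperature = 30, insulation = 6
# 278.969 gallons
```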
RESIDUAL PLOTS
Residuals vs. Ŷ
Residuals vs. X_1
Residuals vs. X_2
Residuals vs. Time
May have autocorrelation
RESIDUAL PLOTS: EXAMPLE
[Insulation Residual Plot: residuals vs. insulation, 0 to 12 — No
Discernible Pattern]
[Temperature Residual Plot: residuals, -60 to 60, vs. temperature, 0 to 80
— Maybe some non-linear relationship]
TESTING FOR OVERALL SIGNIFICANCE
Shows if there is a Linear Relationship between all of
the X Variables together and Y
Use F Test Statistic
Hypotheses:
H0: β1 = β2 = ... = βk = 0 (No linear relationship)
H1: At least one βi ≠ 0 (At least one independent variable
affects Y)
The Null Hypothesis is a Very Strong Statement
The Null Hypothesis is Almost Always Rejected
TESTING FOR OVERALL SIGNIFICANCE
(continued)
Test Statistic:

F = MSR / MSE = (SSR(all) / k) / MSE(all)

where F has k numerator and (n - k - 1)
denominator degrees of freedom
TEST FOR OVERALL SIGNIFICANCE
EXCEL OUTPUT: EXAMPLE
ANOVA
            df   SS         MS         F          Significance F
Regression  2    228014.6   114007.3   168.4712   1.65411E-09
Residual    12   8120.603   676.7169
Total       14   236135.2
k = 2, the number of explanatory variables; total df = n - 1 = 14.
Test statistic F = MSR / MSE; Significance F is the p-value.
TEST FOR OVERALL SIGNIFICANCE
EXAMPLE SOLUTION
H0: β1 = β2 = ... = βk = 0
H1: At least one βi ≠ 0
α = .05
df = 2 and 12
Critical Value: F = 3.89
Test Statistic: F = 168.47 (Excel Output)
Decision: Reject H0 at α = 0.05
Conclusion: There is evidence that at
least one independent
variable affects Y
TEST FOR SIGNIFICANCE:
INDIVIDUAL VARIABLES
Shows if There is a Linear Relationship Between
the Variable X_i and Y
Use t Test Statistic
Hypotheses:
H0: β_i = 0 (No linear relationship)
H1: β_i ≠ 0 (Linear relationship between X_i and Y)
t TEST STATISTIC OUTPUT: EXAMPLE
              Coefficients    Standard Error   t Stat
Intercept     562.1510092     21.09310433      26.65093769
X Variable 1  -5.436580588    0.336216167      -16.16989642
X Variable 2  -20.01232067    2.342505227      -8.543127434

t = b_i / S_bi
t test statistic for X_1 (Temperature): -16.1699
t test statistic for X_2 (Insulation): -8.5431
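Each t Stat in the printout is simply the coefficient divided by its standard error:

```python
coef = {"Temperature": (-5.436580588, 0.336216167),
        "Insulation": (-20.01232067, 2.342505227)}
t_stats = {name: b / s for name, (b, s) in coef.items()}
# Temperature: about -16.17; Insulation: about -8.54
```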
t TEST: EXAMPLE SOLUTION
Does temperature have a significant effect on
monthly consumption of heating oil? Test at α = 0.05.
H0: β1 = 0
H1: β1 ≠ 0
df = 12
Critical Values: t = ±2.1788 (.025 rejection region in each tail)
t Test Statistic = -16.1699
Decision: Reject H0 at α = 0.05
Conclusion: There is evidence of a
significant effect of
temperature on oil
consumption.
CONFIDENCE INTERVAL ESTIMATE
FOR THE SLOPE
Confidence interval for the population slope β_i:

b_i ± t_{n-k-1} S_bi, where t has (n - k - 1) d.f.

Example: Form a 95% confidence interval for the effect of changes in
price (X_1) on pie sales, holding constant the effects of advertising:
-24.975 ± (2.1788)(10.832): so the interval is (-48.576, -1.374)
              Coefficients   Standard Error
Intercept     306.52619      114.25389
Price         -24.97509      10.83213
Advertising   74.13096       25.96732
Here, t has (15 - 2 - 1) = 12 d.f.
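The interval endpoints follow from the coefficient and its standard error:

```python
b, s, t12 = -24.97509, 10.83213, 2.1788   # price coefficient, its SE, t(0.025, 12)
lo, hi = b - t12 * s, b + t12 * s
# about (-48.58, -1.37); the interval excludes 0, so price matters
```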
ASSUMPTIONS OF REGRESSION
Linearity
The relationship between X and Y is linear
Independence of Errors
Error values are statistically independent
Normality of Error
Error values are normally distributed for any given
value of X
Equal Variance (also called homoscedasticity)
The probability distribution of the errors has constant
variance

L.I.N.E.
VARIATION OF ERRORS AROUND THE
REGRESSION LINE
Y values are normally distributed
around the regression line.
For each X value, the spread or
variance around the regression line is
the same.
[Figure: identical normal error distributions f(ε) at X_1 and X_2 around
the sample regression line]
PURPOSES OF RESIDUAL ANALYSIS
Examine for linearity assumption
Examine for constant variance for all levels of X
Evaluate normal distribution assumption

GRAPHICAL ANALYSIS OF RESIDUALS
Can plot residuals vs. X
Can create histogram of residuals to check for
normality

RESIDUAL ANALYSIS
The residual for observation i, e_i, is the difference between
its observed and predicted value:

e_i = Y_i - Ŷ_i

Check the assumptions of regression by examining the
residuals
Examine for Linearity assumption
Evaluate Independence assumption
Evaluate Normal distribution assumption
Examine Equal variance for all levels of X
Graphical Analysis of Residuals
Can plot residuals vs. X
RESIDUAL ANALYSIS FOR LINEARITY
[Two pairs of plots: a curved Y-vs-x pattern whose residuals form a
U-shape (Not Linear), and a straight-line Y-vs-x pattern whose residuals
scatter randomly about zero (Linear)]

RESIDUAL ANALYSIS FOR INDEPENDENCE
[Residuals-vs-X plots: a systematic, cyclical pattern (Not Independent)
versus a random scatter (Independent)]

CHECKING FOR NORMALITY
Examine the Stem-and-Leaf Display of the Residuals
Examine the Box-and-Whisker Plot of the Residuals
Examine the Histogram of the Residuals
Construct a Normal Probability Plot of the Residuals

RESIDUAL ANALYSIS FOR EQUAL VARIANCE
[Residuals-vs-x plots: a fan-shaped spread that grows with x (Unequal
variance) versus a uniform band (Equal variance)]

LINEAR REGRESSION EXAMPLE: EXCEL RESIDUAL
OUTPUT
[House Price Model Residual Plot: residuals, -60 to 80, vs. Square Feet,
0 to 3000]
RESIDUAL OUTPUT
      Predicted House Price   Residuals
1     251.92316               -6.923162
2     273.87671               38.12329
3     284.85348               -5.853484
4     304.06284               3.937162
5     218.99284               -19.99284
6     268.38832               -49.38832
7     356.20251               48.79749
8     367.17929               -43.17929
9     254.6674                64.33264
10    284.85348               -29.85348
Does not appear to violate
any regression assumptions
AUTOCORRELATION

One of the assumptions of the regression model is that the errors
ε_i and ε_j, associated with the ith and jth observations, are
uncorrelated

Autocorrelation is correlation of the errors (residuals) over
time







DURBIN-WATSON TEST FOR AR(1) AUTOCORRELATION
The standard test statistic for autocorrelation of the AR(1) type is the
Durbin-Watson d statistic, computed from the residuals as

d = Σ_{t=2}^{T} (e_t - e_{t-1})^2 / Σ_{t=1}^{T} e_t^2

Most regression applications calculate it automatically and present it as
one of the standard regression diagnostics.
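The statistic is easy to compute directly; the two residual series below are hypothetical illustrations (not from the text) of the patterns the next slides describe:

```python
def durbin_watson(e):
    """d = sum_{t=2..T} (e_t - e_{t-1})^2 / sum_{t=1..T} e_t^2."""
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    den = sum(x * x for x in e)
    return num / den

runs = [1, 1, 1, -1, -1, -1]         # long runs of one sign: positive autocorrelation
alternating = [1, -1, 1, -1, 1, -1]  # sign flips every step: negative autocorrelation
d_pos = durbin_watson(runs)          # 4/6, well below 2
d_neg = durbin_watson(alternating)   # 20/6, well above 2
```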







DURBIN-WATSON TEST FOR AR(1) AUTOCORRELATION
In large samples it can be shown that d tends to 2 - 2ρ, where ρ is the
parameter in the AR(1) relationship u_t = ρ u_{t-1} + ε_t.







DURBIN-WATSON TEST FOR AR(1) AUTOCORRELATION
In large samples, d tends to 2 - 2ρ.
No autocorrelation: if there is no autocorrelation, ρ is 0 and d should be
distributed randomly around 2.







DURBIN-WATSON TEST FOR AR(1) AUTOCORRELATION
Severe positive autocorrelation: if there is severe positive
autocorrelation, ρ will be near 1 and d will be near 0.







DURBIN-WATSON TEST FOR AR(1) AUTOCORRELATION
Severe negative autocorrelation: likewise, if there is severe negative
autocorrelation, ρ will be near -1 and d will be near 4.









DURBIN-WATSON TEST FOR AR(1) AUTOCORRELATION
Thus d behaves as illustrated:
[Scale from 0 to 4: positive autocorrelation near 0, no autocorrelation
around 2, negative autocorrelation near 4]









To perform the Durbin-Watson test, we define critical values of d. The null
hypothesis is H0: ρ = 0 (no autocorrelation). If d lies between these
critical values, we do not reject the null hypothesis.









The critical values, at any significance level, depend on the number of
observations in the sample and the number of explanatory variables.









Unfortunately, they also depend on the actual data for the explanatory
variables in the sample, and thus vary from sample to sample.









However, Durbin and Watson determined upper and lower bounds, dU and dL,
for the critical values, and these are presented in standard tables.









If d is less than dL, it must also be less than the critical value of d for
positive autocorrelation, and so we would reject the null hypothesis and
conclude that there is positive autocorrelation.









If d is above dU, it must also be above the critical value of d, and so we
would not reject the null hypothesis. (Of course, if it were above 2, we
should consider testing for negative autocorrelation instead.)









If d lies between dL and dU, we cannot tell whether it is above or below
the critical value, and so the test is indeterminate.









Here are dL and dU for 45 observations and two explanatory variables, at
the 5% significance level:

dL = 1.43, dU = 1.62 (n = 45, k = 3, 5% level)









There are similar bounds for the critical value in the case of negative
autocorrelation. They are not given in the standard tables because negative
autocorrelation is uncommon, but they are easy to calculate because they
are located symmetrically to the right of 2:

4 − dU = 2.38, 4 − dL = 2.57 (n = 45, k = 3, 5% level)









So if d < 1.43, we reject the null hypothesis and conclude that there is
positive autocorrelation.









If 1.43 < d < 1.62, the test is indeterminate and we do not come to any
conclusion.









If 1.62 < d < 2.38, we do not reject the null hypothesis of no
autocorrelation.









If 2.38 < d < 2.57, we do not come to any conclusion.









If d > 2.57, we conclude that there is significant negative autocorrelation.
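The decision rule built up over the last few slides can be collected into a single function. A sketch (the function name is mine), taking the tabulated dL and dU and mirroring them about 2 for the negative-autocorrelation region:

```python
def dw_decision(d, d_lower, d_upper):
    """Classify a Durbin-Watson statistic using the tabulated bounds
    dL and dU; the negative-autocorrelation bounds are their mirror
    images about 2, i.e. 4 - dU and 4 - dL."""
    if d < d_lower:
        return "reject H0: positive autocorrelation"
    if d <= d_upper:
        return "indeterminate"
    if d < 4 - d_upper:
        return "do not reject H0"
    if d <= 4 - d_lower:
        return "indeterminate"
    return "reject H0: negative autocorrelation"

# The n = 45, k = 3, 5% bounds from the slides:
print(dw_decision(0.63, 1.43, 1.62))  # reject H0: positive autocorrelation
print(dw_decision(2.00, 1.43, 1.62))  # do not reject H0
```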









Here are the bounds for the critical values for the 1% test, again with 45
observations and two explanatory variables:

dL = 1.24, dU = 1.42, 4 − dU = 2.58, 4 − dL = 2.76 (n = 45, k = 3, 1% level)







Here is a plot of the residuals from a logarithmic regression of expenditure
on housing services on income and the relative price of housing services.
The residuals exhibit strong positive autocorrelation.

[Figure: residuals plotted against time, 1959–2003; vertical axis from
−0.04 to 0.04]

============================================================
Dependent Variable: LGHOUS
Method: Least Squares
Sample: 1959 2003
Included observations: 45
============================================================
Variable Coefficient Std. Error t-Statistic Prob.
============================================================
C 0.005625 0.167903 0.033501 0.9734
LGDPI 1.031918 0.006649 155.1976 0.0000
LGPRHOUS -0.483421 0.041780 -11.57056 0.0000
============================================================
R-squared 0.998583 Mean dependent var 6.359334
Adjusted R-squared 0.998515 S.D. dependent var 0.437527
S.E. of regression 0.016859 Akaike info criter-5.263574
Sum squared resid 0.011937 Schwarz criterion -5.143130
Log likelihood 121.4304 F-statistic 14797.05
Durbin-Watson stat 0.633113 Prob(F-statistic) 0.000000
============================================================
The d statistic is very low (0.633), below dL = 1.24 for the 1% significance
test (n = 45, k = 3), so we would reject the null hypothesis of no
autocorrelation and conclude that there is positive autocorrelation.