
Correlation and Regression

Dr. Seema Sharma

Correlation

The correlation coefficient was originally proposed by
Karl Pearson, so it is also known as the Pearson
correlation coefficient. It is also referred to as simple
correlation, bivariate correlation, or merely the
correlation coefficient.

The correlation, r, summarizes the strength
of association between two metric (interval- or
ratio-scaled) variables, say X and Y.

Product Moment Correlation


From a sample of n observations, X and Y, the
correlation, r, can be calculated as:

r = \frac{n \sum XY - \sum X \sum Y}{\sqrt{n \sum X^2 - (\sum X)^2}\;\sqrt{n \sum Y^2 - (\sum Y)^2}}
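
As a quick check, this sum-based formula can be evaluated directly in code. The sketch below is only an illustration and assumes two equal-length numeric lists x and y (hypothetical data):

import math

def pearson_r(x, y):
    # Product moment correlation from the raw-sum formula above
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    num = n * sum_xy - sum_x * sum_y
    den = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
    return num / den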

Product Moment Correlation

r varies between -1.0 and +1.0.


The correlation coefficient between two variables will
be the same regardless of their underlying units of
measurement.

Testing the Significance of Correlation

When correlation is computed for a population rather
than a sample, it is denoted by ρ, the Greek letter rho.
The coefficient r is an estimator of ρ.
The statistical significance of the relationship
between two variables measured by r can be
conveniently tested. The hypotheses are:

H0: ρ = 0
H1: ρ ≠ 0
α = 0.05

Testing the Significance of Correlation


The test statistic is:

t = r \sqrt{\frac{n-2}{1-r^2}}

which has a t distribution with n - 2 degrees of freedom.


For example, for a correlation coefficient r = 0.9361 with n = 12:

t = 0.9361 \sqrt{\frac{12-2}{1-(0.9361)^2}} = 8.414

and the degrees of freedom = 12 − 2 = 10. From the
t distribution table, the critical value of t for a two-tailed test and
α = 0.05 is 2.228.

Hence, the null hypothesis of no


relationship between X and Y is rejected.
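
This decision can be reproduced numerically. The following sketch assumes the r and n from the example above and uses scipy only for the critical value:

from scipy import stats

r, n = 0.9361, 12
t_stat = r * ((n - 2) / (1 - r ** 2)) ** 0.5   # about 8.41
t_crit = stats.t.ppf(1 - 0.05 / 2, n - 2)      # about 2.228 for 10 d.f.
reject_h0 = abs(t_stat) > t_crit               # True, so rho = 0 is rejected
print(t_stat, t_crit, reject_h0)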

Partial Correlation
A partial correlation coefficient measures the
association between two variables after controlling for,
or adjusting for, the effects of one or more additional
variables.

r_{xy \cdot z} = \frac{r_{xy} - r_{xz}\, r_{yz}}{\sqrt{1 - r_{xz}^2}\;\sqrt{1 - r_{yz}^2}}

Partial correlations have an order associated with


them. The order indicates how many variables are
being adjusted or controlled.
The simple correlation coefficient, r, has a zero order,
as it does not control for any additional
variables while measuring the association between
two variables.
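
For a first-order partial correlation, the formula above can be applied directly once the three simple correlations are known. A minimal sketch, assuming r_xy, r_xz and r_yz have already been computed:

import math

def partial_corr(r_xy, r_xz, r_yz):
    # First-order partial correlation r_xy.z, controlling for z
    return (r_xy - r_xz * r_yz) / (math.sqrt(1 - r_xz ** 2) * math.sqrt(1 - r_yz ** 2))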

SPSS Output for Simple Correlation

SPSS Output for Partial Correlation

Regression Analysis
Regression analysis examines associative relationships
between a metric dependent variable and one or more
independent variables in the following ways:

Determine whether the independent variables explain a


significant variation in the dependent variable: whether a
relationship exists.
Determine how much of the variation in the dependent
variable can be explained by the independent variables:
strength of the relationship.

Determine the structure or form of the relationship: the


mathematical equation relating the independent and
dependent variables.
Predict the values of the dependent variable.

Regression Analysis

Regression Line: Line of Best Fit

Ordinary Least Squares (OLS) Method

Regression Line: minimizes the sum of the squared
vertical deviations (e_t) of each point from the
regression line.

Population Linear Regression


The population regression line is a straight line that
describes the dependence of the average value
(conditional mean) of one variable on the other.

Population model:  Y_i = β_0 + β_1 X_i + ε_i

where
β_0 = population Y-intercept
β_1 = population slope coefficient
Y_i = dependent (response) variable
X_i = independent (explanatory) variable
ε_i = random error

Population regression line (expected mean):  E(Y_i) = β_0 + β_1 X_i

Ordinary Least Squares (OLS)


Sample model:

Y_t = a + b X_t + e_t

Fitted (estimated) line:  Ŷ_t = a + b X_t

Residual:  e_t = Y_t − Ŷ_t

Ordinary Least Squares (OLS)


Objective: Determine the slope and
intercept that minimize the sum of
the squared errors.
\sum_{t=1}^{n} e_t^2 = \sum_{t=1}^{n} (Y_t - \hat{Y}_t)^2 = \sum_{t=1}^{n} (Y_t - a - b X_t)^2

Ordinary Least Squares (OLS)


b = \frac{\sum_{t=1}^{n} (X_t - \bar{X})(Y_t - \bar{Y})}{\sum_{t=1}^{n} (X_t - \bar{X})^2}

a = \bar{Y} - b\bar{X}

OR

b = \frac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)^2}

Ordinary Least Squares (OLS)


Estimation Example
Time t | X_t | Y_t | X_t − X̄ | Y_t − Ȳ | (X_t − X̄)(Y_t − Ȳ) | (X_t − X̄)²
   1   | 10  | 44  |   −2    |   −6    |         12          |     4
   2   |  9  | 40  |   −3    |  −10    |         30          |     9
   3   | 11  | 42  |   −1    |   −8    |          8          |     1
   4   | 12  | 46  |    0    |   −4    |          0          |     0
   5   | 11  | 48  |   −1    |   −2    |          2          |     1
   6   | 12  | 52  |    0    |    2    |          0          |     0
   7   | 13  | 54  |    1    |    4    |          4          |     1
   8   | 13  | 58  |    1    |    8    |          8          |     1
   9   | 14  | 56  |    2    |    6    |         12          |     4
  10   | 15  | 60  |    3    |   10    |         30          |     9
 Sum   | 120 | 500 |         |         |        106          |    30

n = 10
X̄ = ΣX_t / n = 120/10 = 12
Ȳ = ΣY_t / n = 500/10 = 50
Σ(X_t − X̄)² = 30
Σ(X_t − X̄)(Y_t − Ȳ) = 106

Ordinary Least Squares (OLS)


Estimation Example
n = 10,  ΣX_t = 120,  ΣY_t = 500

X̄ = ΣX_t / n = 120/10 = 12
Ȳ = ΣY_t / n = 500/10 = 50

Σ(X_t − X̄)² = 30
Σ(X_t − X̄)(Y_t − Ȳ) = 106

b = 106 / 30 = 3.533
a = 50 − (3.533)(12) = 7.60

Ŷ = 7.60 + 3.533X
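
The same estimates are easy to verify in code. The sketch below simply re-applies the formulas to the ten (X_t, Y_t) pairs from the example table:

x = [10, 9, 11, 12, 11, 12, 13, 13, 14, 15]
y = [44, 40, 42, 46, 48, 52, 54, 58, 56, 60]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n                           # 12 and 50
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # 106
sxx = sum((xi - x_bar) ** 2 for xi in x)                        # 30

b = sxy / sxx             # 3.533
a = y_bar - b * x_bar     # 7.60
print(f"Y_hat = {a:.2f} + {b:.3f} X")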

Tests of Significance of Regression Coefficient


Sampling Distribution of b: E(b), S.E.(b)

Test for Significance


The statistical significance of the linear relationship
between X and Y may be tested by examining the
hypotheses:
H0: β1 = 0
H1: β1 ≠ 0
α = 0.05

Under H0, the test statistic is

t = b / SE_b,  with d.f. = n − 2

where SE_b denotes the standard deviation of b and is called
the standard error.

Tests of Significance
Example Calculation
Time t | X_t | Y_t |  Ŷ_t  | e_t = Y_t − Ŷ_t | e_t² = (Y_t − Ŷ_t)²
   1   | 10  | 44  | 42.90 |      1.10       |       1.2100
   2   |  9  | 40  | 39.37 |      0.63       |       0.3969
   3   | 11  | 42  | 46.43 |     −4.43       |      19.6249
   4   | 12  | 46  | 49.96 |     −3.96       |      15.6816
   5   | 11  | 48  | 46.43 |      1.57       |       2.4649
   6   | 12  | 52  | 49.96 |      2.04       |       4.1616
   7   | 13  | 54  | 53.49 |      0.51       |       0.2601
   8   | 13  | 58  | 53.49 |      4.51       |      20.3401
   9   | 14  | 56  | 57.02 |     −1.02       |       1.0404
  10   | 15  | 60  | 60.55 |     −0.55       |       0.3025
 Sum   |     |     |       |                 |      65.4830

Σ(Y_t − Ŷ_t)² = 65.4830
Σ(X_t − X̄)² = 30

Tests of Significance
Example Calculation
Σ(Y_t − Ŷ_t)² = 65.4830
Σ(X_t − X̄)² = 30

s_b = \sqrt{\frac{\sum (Y_t - \hat{Y}_t)^2}{(n - k)\,\sum (X_t - \bar{X})^2}} = \sqrt{\frac{65.4830}{(10-2)(30)}} = 0.52

Tests of Significance
Calculation of the t Statistic
t = b / s_b = 3.53 / 0.52 = 6.79

Degrees of freedom = (n − k) = (10 − 2) = 8

Critical value of t at the 5% level = 2.306

Ŷ = 7.60 + 3.533X
Since 6.79 > 2.306, b is a significant regression
coefficient, which implies that X is a significant
explanatory variable for Y.
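
The standard error and t statistic can also be reproduced directly from the example data; this is only an illustrative sketch using the fitted values a = 7.60 and b = 3.533 from the earlier slide:

x = [10, 9, 11, 12, 11, 12, 13, 13, 14, 15]
y = [44, 40, 42, 46, 48, 52, 54, 58, 56, 60]
n, k = len(x), 2                     # k = number of estimated parameters
a, b = 7.60, 3.533                   # intercept and slope of the fitted line

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))   # about 65.48
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)                      # 30

se_b = (sse / ((n - k) * sxx)) ** 0.5                         # about 0.52
t_stat = b / se_b                                             # about 6.8
print(round(se_b, 2), round(t_stat, 2))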

Test of Significance of R2
Decomposition of Sum of Squares
Total Variation = Explained Variation + Unexplained Variation

\sum (Y_t - \bar{Y})^2 = \sum (\hat{Y}_t - \bar{Y})^2 + \sum (Y_t - \hat{Y}_t)^2

Test of Significance
Coefficient of Determination
R^2 = \frac{\text{Explained Variation}}{\text{Total Variation}} = \frac{\sum (\hat{Y}_t - \bar{Y})^2}{\sum (Y_t - \bar{Y})^2}

R^2 = \frac{373.84}{440.00} = 0.85

Coefficient of Correlation

r = \sqrt{R^2}  (with the sign of b),  −1 ≤ r ≤ 1

r = \sqrt{0.85} = 0.92

Significance of Coefficient of Determination


Another, equivalent test for examining the significance of the linear
relationship between X and Y (significance of b) is the test for the
significance of the coefficient of determination. The hypotheses in this
case are:
H0: R² = 0
H1: R² > 0
α = 0.05

Under H0, the appropriate test statistic is the F statistic:

F = \frac{SSR/(k-1)}{SSE/(n-k)}

which has an F distribution with k − 1 and n − k degrees of freedom
(here 1 and n − 2).

ANOVA Table

Source     | Sum of Squares | D.F.  | Mean Square
Regression | SSR            | k − 1 | MSR = SSR / (k − 1)
Error      | SSE            | n − k | MSE = SSE / (n − k)
Total      | SST            | n − 1 |

F = MSR / MSE

If F ≤ F_(k−1, n−k), H0 is accepted (no significant regression);
otherwise the regression is significant.
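
The whole decomposition can be checked numerically for the earlier example; the sketch below again assumes the fitted line Ŷ = 7.60 + 3.533X:

x = [10, 9, 11, 12, 11, 12, 13, 13, 14, 15]
y = [44, 40, 42, 46, 48, 52, 54, 58, 56, 60]
n, k = len(x), 2
a, b = 7.60, 3.533

y_bar = sum(y) / n
y_hat = [a + b * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation (440)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained variation (about 65.5)
ssr = sst - sse                                        # explained variation (about 374.5)

r_squared = ssr / sst                                  # about 0.85
f_stat = (ssr / (k - 1)) / (sse / (n - k))             # compared with the F(1, 8) critical value
print(round(r_squared, 2), round(f_stat, 1))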

Measure of Variation: The Sum of Squares


SST = SSR + SSE

SST (d.f. n − 1): total sample variability
SSR (d.f. k − 1): explained variability
SSE (d.f. n − k): unexplained variability

Total sum of squares:       SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2

Regression sum of squares:  SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2

Error sum of squares:       SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2

SPSS Output for Simple Regression

Multiple Regression
The general form of the multiple regression model
is as follows:

Y = β0 + β1X1 + β2X2 + β3X3 + . . . + βkXk + e


which is estimated by the following equation:

Ŷ = b0 + b1X1 + b2X2 + b3X3 + . . . + bkXk


As before, the coefficient b0 represents the intercept,
but the b's are now the partial regression coefficients.

The Multiple Regression Model


Relationship between 1 dependent & 2 or more
independent variables is a linear function
Population model:

Y_i = β0 + β1X1i + β2X2i + . . . + βkXki + εi

where β0 is the population Y-intercept, β1, . . . , βk are the population
slopes, and εi is the random error.

Sample model:

Y_i = b0 + b1X1i + b2X2i + . . . + bkXki + ei

where Y_i is the dependent (response) variable, X1i, . . . , Xki are the
independent (explanatory) variables for the sample, and ei is the residual.

Multiple Regression Analysis

Adjusted Coefficient of Determination

\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k}
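
This adjustment is a one-line computation; a minimal sketch, assuming R², n and k are already known:

def adjusted_r2(r2, n, k):
    # Adjusted coefficient of determination: penalizes additional predictors
    return 1 - (1 - r2) * (n - 1) / (n - k)

# e.g. adjusted_r2(0.85, 10, 2) is about 0.83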

Multiple Regression Analysis

Analysis of Variance and F Statistic


F = \frac{\text{Explained Variation}/(k-1)}{\text{Unexplained Variation}/(n-k)} = \frac{R^2/(k-1)}{(1-R^2)/(n-k)}

Significance Testing of Overall Regression


H0 : R2 = 0
This is equivalent to the following null hypothesis:

H0: β1 = β2 = β3 = . . . = βk = 0
The overall test can be conducted by using an F statistic:

F = \frac{R^2/(k-1)}{(1-R^2)/(n-k)}

which has an F distribution with k − 1 and n − k degrees of freedom.

SPSS Output for Multiple Regression

Stepwise Regression
The purpose of stepwise regression is to select, from a large
number of predictor variables, a small subset of variables that
account for most of the variation in the dependent or criterion
variable. In this procedure, the predictor variables enter or are
removed from the regression equation one at a time. There are
several approaches to stepwise regression.

Forward inclusion. Initially, there are no predictor variables


in the regression equation. Predictor variables are entered one
at a time, only if they meet certain criteria specified in terms of
F ratio. The order in which the variables are included is based
on the contribution to the explained variance.

Backward elimination. Initially, all the predictor variables


are included in the regression equation. Predictors are then
removed one at a time based on the F ratio for removal.

Multicollinearity

Multicollinearity arises when the intercorrelations


among the predictors are very high. It can be detected
with the variance inflation factor (VIF): a value above 5
indicates a problem, so VIF should remain at most 5. VIF
is the reciprocal of tolerance.

Variance Inflation Factor (VIF)

The Variance Inflation Factor measures how much the


variance of the regression coefficients is inflated by
multicollinearity problems. If VIF equals 1, the predictor is
uncorrelated with the other independent measures. VIF values
somewhat above 1 indicate some association between predictor
variables, but generally not enough to cause problems. A
maximum acceptable VIF value would be 5.0; anything higher
would indicate a problem with multicollinearity.
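
One common way to obtain VIF values is to regress each predictor on all the others and apply VIF_j = 1 / (1 − R_j²). The sketch below is only illustrative and assumes a numeric matrix X whose rows are observations and whose columns are the predictors:

import numpy as np

def vif(X):
    # VIF for each column of X: regress X_j on the remaining columns
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    result = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1 - (resid @ resid) / ((target - target.mean()) ** 2).sum()
        result.append(1.0 / (1.0 - r2))
    return result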

Multicollinearity

A simple procedure for adjusting for multicollinearity


consists of using only one of the variables in a highly
correlated set of variables.
The sample size can be increased as a possible solution.
Drop the variable that is causing multicollinearity.

Alternatively, the set of independent variables can be


transformed into a new set of predictors that are
mutually independent by using techniques such as
a) Taking first order differences
b) Taking logarithms of the data
c) Principal components analysis

Regression with Dummy Variables


Category    | Original Variable Code | Dummy Variable Code
            |                        |  D1   D2   D3
Boom        |           1            |   1    0    0
Recession   |           2            |   0    1    0
Depression  |           3            |   0    0    1
Normal      |           4            |   0    0    0

Ŷ_i = a + b1D1 + b2D2 + b3D3 + b4X1 + b5X2


X1: Adv. Exp.
X2: R&D Expenses
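
A categorical variable with four levels therefore needs three dummies, with "Normal" serving as the base level (all dummies 0). A small sketch of this coding using pandas; the column name condition is hypothetical:

import pandas as pd

df = pd.DataFrame({"condition": ["Boom", "Recession", "Depression", "Normal"]})

# Keep three dummies and treat "Normal" as the base category
dummies = pd.get_dummies(df["condition"])[["Boom", "Recession", "Depression"]].astype(int)
dummies.columns = ["D1", "D2", "D3"]
print(pd.concat([df, dummies], axis=1))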

SPSS Windows
The CORRELATE program computes Pearson product moment correlations
and partial correlations with significance levels. Univariate statistics,
covariance, and cross-product deviations may also be requested.
Significance levels are included in the output. To select these procedures
using SPSS for Windows click:

Analyze>Correlate>Bivariate
Analyze>Correlate>Partial
Scatterplots can be obtained by clicking:
Graphs>Scatter>Simple>Define
REGRESSION calculates bivariate and multiple regression equations,
associated statistics, and plots. It allows for an easy examination of
residuals. This procedure can be run by clicking:
Analyze>Regression>Linear
