
Correlation and Regression

Dr. Seema Sharma

Correlation

The correlation coefficient was originally proposed by
Karl Pearson, so it is also known as the Pearson
correlation coefficient. It is also referred to as simple
correlation, bivariate correlation, or merely the
correlation coefficient.

The correlation, r, summarizes the strength
of association between two metric (interval- or
ratio-scaled) variables, say X and Y.

Product Moment Correlation


From a sample of n observations, X and Y, the
correlation, r, can be calculated as:

r = \frac{n \sum XY - \sum X \sum Y}{\sqrt{n \sum X^2 - (\sum X)^2}\;\sqrt{n \sum Y^2 - (\sum Y)^2}}
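
As a quick check, this sum-based formula can be evaluated directly in code. The sketch below is only an illustration and assumes two equal-length numeric lists x and y (hypothetical data):

import math

def pearson_r(x, y):
    # Product moment correlation from the raw-sum formula above
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    num = n * sum_xy - sum_x * sum_y
    den = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
    return num / den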

Product Moment Correlation

r varies between -1.0 and +1.0.


The correlation coefficient between two variables will
be the same regardless of their underlying units of
measurement.

Testing the Significance of Correlation

When correlation is computed for a population rather
than a sample, it is denoted by ρ, the Greek letter rho.
The coefficient r is an estimator of ρ.
The statistical significance of the relationship
between two variables measured by r can be
conveniently tested. The hypotheses are:

H0: ρ = 0
H1: ρ ≠ 0
α = 0.05

Testing the Significance of Correlation


The test statistic is:

t = r \sqrt{\frac{n-2}{1-r^2}}

which has a t distribution with n - 2 degrees of freedom.


For example, for a correlation coefficient r = 0.9361 with n = 12:

t = 0.9361 \sqrt{\frac{12-2}{1-(0.9361)^2}} = 8.414

and the degrees of freedom = 12 − 2 = 10. From the
t distribution table, the critical value of t for a two-tailed test and
α = 0.05 is 2.228.

Hence, the null hypothesis of no


relationship between X and Y is rejected.
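
This decision can be reproduced numerically. The following sketch assumes the r and n from the example above and uses scipy only for the critical value:

from scipy import stats

r, n = 0.9361, 12
t_stat = r * ((n - 2) / (1 - r ** 2)) ** 0.5   # about 8.41
t_crit = stats.t.ppf(1 - 0.05 / 2, n - 2)      # about 2.228 for 10 d.f.
reject_h0 = abs(t_stat) > t_crit               # True, so rho = 0 is rejected
print(t_stat, t_crit, reject_h0)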

Partial Correlation
A partial correlation coefficient measures the
association between two variables after controlling for,
or adjusting for, the effects of one or more additional
variables.

r_{xy \cdot z} = \frac{r_{xy} - r_{xz}\, r_{yz}}{\sqrt{1 - r_{xz}^2}\;\sqrt{1 - r_{yz}^2}}

Partial correlations have an order associated with


them. The order indicates how many variables are
being adjusted or controlled.
The simple correlation coefficient, r, has a zero order,
as it does not control for any additional
variables while measuring the association between
two variables.
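
For a first-order partial correlation, the formula above can be applied directly once the three simple correlations are known. A minimal sketch, assuming r_xy, r_xz and r_yz have already been computed:

import math

def partial_corr(r_xy, r_xz, r_yz):
    # First-order partial correlation r_xy.z, controlling for z
    return (r_xy - r_xz * r_yz) / (math.sqrt(1 - r_xz ** 2) * math.sqrt(1 - r_yz ** 2))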

SPSS Output for Simple Correlation

SPSS Output for Partial Correlation

Regression Analysis
Regression analysis examines associative relationships
between a metric dependent variable and one or more
independent variables in the following ways:

Determine whether the independent variables explain a


significant variation in the dependent variable: whether a
relationship exists.
Determine how much of the variation in the dependent
variable can be explained by the independent variables:
strength of the relationship.

Determine the structure or form of the relationship: the


mathematical equation relating the independent and
dependent variables.
Predict the values of the dependent variable.

Regression Analysis

Regression Line: Line of Best Fit

Ordinary Least Squares (OLS) Method

Regression Line: minimizes the sum of the squared
vertical deviations (e_t) of each point from the
regression line.

Population Linear Regression


The population regression line is a straight line that
describes the dependence of the average value
(conditional mean) of one variable on the other.

Population model:  Y_i = β_0 + β_1 X_i + ε_i

where
β_0 = population Y-intercept
β_1 = population slope coefficient
Y_i = dependent (response) variable
X_i = independent (explanatory) variable
ε_i = random error

Population regression line (expected mean):  E(Y_i) = β_0 + β_1 X_i

Ordinary Least Squares (OLS)


Sample model:

Y_t = a + b X_t + e_t

Fitted (estimated) line:  Ŷ_t = a + b X_t

Residual:  e_t = Y_t − Ŷ_t

Ordinary Least Squares (OLS)


Objective: Determine the slope and
intercept that minimize the sum of
the squared errors.
\sum_{t=1}^{n} e_t^2 = \sum_{t=1}^{n} (Y_t - \hat{Y}_t)^2 = \sum_{t=1}^{n} (Y_t - a - b X_t)^2

Ordinary Least Squares (OLS)


b = \frac{\sum_{t=1}^{n} (X_t - \bar{X})(Y_t - \bar{Y})}{\sum_{t=1}^{n} (X_t - \bar{X})^2}

a = \bar{Y} - b\bar{X}

OR

b = \frac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)^2}

Ordinary Least Squares (OLS)


Estimation Example
Time t | X_t | Y_t | X_t − X̄ | Y_t − Ȳ | (X_t − X̄)(Y_t − Ȳ) | (X_t − X̄)²
   1   | 10  | 44  |   −2    |   −6    |         12          |     4
   2   |  9  | 40  |   −3    |  −10    |         30          |     9
   3   | 11  | 42  |   −1    |   −8    |          8          |     1
   4   | 12  | 46  |    0    |   −4    |          0          |     0
   5   | 11  | 48  |   −1    |   −2    |          2          |     1
   6   | 12  | 52  |    0    |    2    |          0          |     0
   7   | 13  | 54  |    1    |    4    |          4          |     1
   8   | 13  | 58  |    1    |    8    |          8          |     1
   9   | 14  | 56  |    2    |    6    |         12          |     4
  10   | 15  | 60  |    3    |   10    |         30          |     9
 Sum   | 120 | 500 |         |         |        106          |    30

n = 10
X̄ = ΣX_t / n = 120/10 = 12
Ȳ = ΣY_t / n = 500/10 = 50
Σ(X_t − X̄)² = 30
Σ(X_t − X̄)(Y_t − Ȳ) = 106

Ordinary Least Squares (OLS)


Estimation Example
n = 10,  ΣX_t = 120,  ΣY_t = 500

X̄ = ΣX_t / n = 120/10 = 12
Ȳ = ΣY_t / n = 500/10 = 50

Σ(X_t − X̄)² = 30
Σ(X_t − X̄)(Y_t − Ȳ) = 106

b = 106 / 30 = 3.533
a = 50 − (3.533)(12) = 7.60

Ŷ = 7.60 + 3.533X
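
The same estimates are easy to verify in code. The sketch below simply re-applies the formulas to the ten (X_t, Y_t) pairs from the example table:

x = [10, 9, 11, 12, 11, 12, 13, 13, 14, 15]
y = [44, 40, 42, 46, 48, 52, 54, 58, 56, 60]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n                           # 12 and 50
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # 106
sxx = sum((xi - x_bar) ** 2 for xi in x)                        # 30

b = sxy / sxx             # 3.533
a = y_bar - b * x_bar     # 7.60
print(f"Y_hat = {a:.2f} + {b:.3f} X")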

Tests of Significance of Regression Coefficient


Sampling Distribution of b: E(b), S.E.(b)

Test for Significance


The statistical significance of the linear relationship
between X and Y may be tested by examining the
hypotheses:
H0: β1 = 0
H1: β1 ≠ 0
α = 0.05

Under H0, the test statistic is

t = b / SE_b,  with d.f. = n − 2

where SE_b denotes the standard deviation of b and is called
the standard error.

Tests of Significance
Example Calculation
Time t | X_t | Y_t |  Ŷ_t  | e_t = Y_t − Ŷ_t | e_t² = (Y_t − Ŷ_t)²
   1   | 10  | 44  | 42.90 |      1.10       |       1.2100
   2   |  9  | 40  | 39.37 |      0.63       |       0.3969
   3   | 11  | 42  | 46.43 |     −4.43       |      19.6249
   4   | 12  | 46  | 49.96 |     −3.96       |      15.6816
   5   | 11  | 48  | 46.43 |      1.57       |       2.4649
   6   | 12  | 52  | 49.96 |      2.04       |       4.1616
   7   | 13  | 54  | 53.49 |      0.51       |       0.2601
   8   | 13  | 58  | 53.49 |      4.51       |      20.3401
   9   | 14  | 56  | 57.02 |     −1.02       |       1.0404
  10   | 15  | 60  | 60.55 |     −0.55       |       0.3025
 Sum   |     |     |       |                 |      65.4830

Σ(Y_t − Ŷ_t)² = 65.4830
Σ(X_t − X̄)² = 30

Tests of Significance
Example Calculation
Σ(Y_t − Ŷ_t)² = 65.4830
Σ(X_t − X̄)² = 30

s_b = \sqrt{\frac{\sum (Y_t - \hat{Y}_t)^2}{(n - k)\,\sum (X_t - \bar{X})^2}} = \sqrt{\frac{65.4830}{(10-2)(30)}} = 0.52

Tests of Significance
Calculation of the t Statistic
t = b / s_b = 3.53 / 0.52 = 6.79

Degrees of freedom = (n − k) = (10 − 2) = 8

Critical value of t at the 5% level = 2.306

Ŷ = 7.60 + 3.533X
Since 6.79 > 2.306, b is a significant regression
coefficient, which implies that X is a significant
explanatory variable for Y.
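
The standard error and t statistic can also be reproduced directly from the example data; this is only an illustrative sketch using the fitted values a = 7.60 and b = 3.533 from the earlier slide:

x = [10, 9, 11, 12, 11, 12, 13, 13, 14, 15]
y = [44, 40, 42, 46, 48, 52, 54, 58, 56, 60]
n, k = len(x), 2                     # k = number of estimated parameters
a, b = 7.60, 3.533                   # intercept and slope of the fitted line

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))   # about 65.48
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)                      # 30

se_b = (sse / ((n - k) * sxx)) ** 0.5                         # about 0.52
t_stat = b / se_b                                             # about 6.8
print(round(se_b, 2), round(t_stat, 2))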

Test of Significance of R2
Decomposition of Sum of Squares
Total Variation = Explained Variation + Unexplained Variation

\sum (Y_t - \bar{Y})^2 = \sum (\hat{Y}_t - \bar{Y})^2 + \sum (Y_t - \hat{Y}_t)^2

Test of Significance
Coefficient of Determination
R^2 = \frac{\text{Explained Variation}}{\text{Total Variation}} = \frac{\sum (\hat{Y}_t - \bar{Y})^2}{\sum (Y_t - \bar{Y})^2}

R^2 = \frac{373.84}{440.00} = 0.85

Coefficient of Correlation

r = \sqrt{R^2}  (with the sign of b),  −1 ≤ r ≤ 1

r = \sqrt{0.85} = 0.92

Significance of Coefficient of Determination


Another, equivalent test for examining the significance of the linear
relationship between X and Y (significance of b) is the test for the
significance of the coefficient of determination. The hypotheses in this
case are:
H0: R² = 0
H1: R² > 0
α = 0.05

Under H0, the appropriate test statistic is the F statistic:

F = \frac{SSR/(k-1)}{SSE/(n-k)}

which has an F distribution with k − 1 and n − k degrees of freedom
(here 1 and n − 2).

ANOVA Table

Source     | Sum of Squares | D.F.  | Mean Square
Regression | SSR            | k − 1 | MSR = SSR / (k − 1)
Error      | SSE            | n − k | MSE = SSE / (n − k)
Total      | SST            | n − 1 |

F = MSR / MSE

If F ≤ F_(k−1, n−k), H0 is accepted (no significant regression);
otherwise the regression is significant.
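
The whole decomposition can be checked numerically for the earlier example; the sketch below again assumes the fitted line Ŷ = 7.60 + 3.533X:

x = [10, 9, 11, 12, 11, 12, 13, 13, 14, 15]
y = [44, 40, 42, 46, 48, 52, 54, 58, 56, 60]
n, k = len(x), 2
a, b = 7.60, 3.533

y_bar = sum(y) / n
y_hat = [a + b * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation (440)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained variation (about 65.5)
ssr = sst - sse                                        # explained variation (about 374.5)

r_squared = ssr / sst                                  # about 0.85
f_stat = (ssr / (k - 1)) / (sse / (n - k))             # compared with the F(1, 8) critical value
print(round(r_squared, 2), round(f_stat, 1))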

Measure of Variation: The Sum of Squares


SST = SSR + SSE

SST (d.f. n − 1): total sample variability
SSR (d.f. k − 1): explained variability
SSE (d.f. n − k): unexplained variability

Total sum of squares:       SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2

Regression sum of squares:  SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2

Error sum of squares:       SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2

SPSS Output for Simple Regression

Multiple Regression
The general form of the multiple regression model
is as follows:

Y = β0 + β1X1 + β2X2 + β3X3 + . . . + βkXk + e


which is estimated by the following equation:

Ŷ = b0 + b1X1 + b2X2 + b3X3 + . . . + bkXk


As before, the coefficient b0 represents the intercept,
but the b's are now the partial regression coefficients.

The Multiple Regression Model


Relationship between 1 dependent & 2 or more
independent variables is a linear function
Population model:

Y_i = β0 + β1X1i + β2X2i + . . . + βkXki + εi

where β0 is the population Y-intercept, β1, . . . , βk are the population
slopes, and εi is the random error.

Sample model:

Y_i = b0 + b1X1i + b2X2i + . . . + bkXki + ei

where Y_i is the dependent (response) variable, X1i, . . . , Xki are the
independent (explanatory) variables for the sample, and ei is the residual.

Multiple Regression Analysis

Adjusted Coefficient of Determination

\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k}
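
This adjustment is a one-line computation; a minimal sketch, assuming R², n and k are already known:

def adjusted_r2(r2, n, k):
    # Adjusted coefficient of determination: penalizes additional predictors
    return 1 - (1 - r2) * (n - 1) / (n - k)

# e.g. adjusted_r2(0.85, 10, 2) is about 0.83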

Multiple Regression Analysis

Analysis of Variance and F Statistic


F = \frac{\text{Explained Variation}/(k-1)}{\text{Unexplained Variation}/(n-k)} = \frac{R^2/(k-1)}{(1-R^2)/(n-k)}

Significance Testing of Overall Regression


H0 : R2 = 0
This is equivalent to the following null hypothesis:

H0: β1 = β2 = β3 = . . . = βk = 0
The overall test can be conducted by using an F statistic:

F = \frac{R^2/(k-1)}{(1-R^2)/(n-k)}

which has an F distribution with k − 1 and n − k degrees of freedom.

SPSS Output for Multiple Regression

Stepwise Regression
The purpose of stepwise regression is to select, from a large
number of predictor variables, a small subset of variables that
account for most of the variation in the dependent or criterion
variable. In this procedure, the predictor variables enter or are
removed from the regression equation one at a time. There are
several approaches to stepwise regression.

Forward inclusion. Initially, there are no predictor variables


in the regression equation. Predictor variables are entered one
at a time, only if they meet certain criteria specified in terms of
F ratio. The order in which the variables are included is based
on the contribution to the explained variance.

Backward elimination. Initially, all the predictor variables


are included in the regression equation. Predictors are then
removed one at a time based on the F ratio for removal.

Multicollinearity

Multicollinearity arises when the intercorrelations


among the predictors are very high. It can be detected
with the variance inflation factor (VIF): a value above 5
indicates a problem, so VIF should remain at most 5. VIF
is the reciprocal of tolerance.

Variance Inflation Factor (VIF)

The Variance Inflation Factor measures how much the


variance of the regression coefficients is inflated by
multicollinearity problems. If VIF equals 1, the predictor is
uncorrelated with the other independent measures. VIF values
somewhat above 1 indicate some association between predictor
variables, but generally not enough to cause problems. A
maximum acceptable VIF value would be 5.0; anything higher
would indicate a problem with multicollinearity.
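
One common way to obtain VIF values is to regress each predictor on all the others and apply VIF_j = 1 / (1 − R_j²). The sketch below is only illustrative and assumes a numeric matrix X whose rows are observations and whose columns are the predictors:

import numpy as np

def vif(X):
    # VIF for each column of X: regress X_j on the remaining columns
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    result = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1 - (resid @ resid) / ((target - target.mean()) ** 2).sum()
        result.append(1.0 / (1.0 - r2))
    return result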

Multicollinearity

A simple procedure for adjusting for multicollinearity


consists of using only one of the variables in a highly
correlated set of variables.
The sample size can be increased as a possible solution.
Drop the variable that is causing multicollinearity.

Alternatively, the set of independent variables can be


transformed into a new set of predictors that are
mutually independent by using techniques such as
a) Taking first order differences
b) Taking logarithms of the data
c) Principal components analysis

Regression with Dummy Variables


Category    | Original Variable Code | Dummy Variable Code
            |                        |  D1   D2   D3
Boom        |           1            |   1    0    0
Recession   |           2            |   0    1    0
Depression  |           3            |   0    0    1
Normal      |           4            |   0    0    0

Ŷ_i = a + b1D1 + b2D2 + b3D3 + b4X1 + b5X2


X1: Adv. Exp.
X2: R&D Expenses
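
A categorical variable with four levels therefore needs three dummies, with "Normal" serving as the base level (all dummies 0). A small sketch of this coding using pandas; the column name condition is hypothetical:

import pandas as pd

df = pd.DataFrame({"condition": ["Boom", "Recession", "Depression", "Normal"]})

# Keep three dummies and treat "Normal" as the base category
dummies = pd.get_dummies(df["condition"])[["Boom", "Recession", "Depression"]].astype(int)
dummies.columns = ["D1", "D2", "D3"]
print(pd.concat([df, dummies], axis=1))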

SPSS Windows
The CORRELATE program computes Pearson product moment correlations
and partial correlations with significance levels. Univariate statistics,
covariance, and cross-product deviations may also be requested.
Significance levels are included in the output. To select these procedures
using SPSS for Windows click:

Analyze>Correlate>Bivariate
Analyze>Correlate>Partial
Scatterplots can be obtained by clicking:
Graphs>Scatter>Simple>Define
REGRESSION calculates bivariate and multiple regression equations,
associated statistics, and plots. It allows for an easy examination of
residuals. This procedure can be run by clicking:
Analyze>Regression>Linear
