
Correlation & Regression

Prof. G.R.C. Nair
Correlation Analysis

• Correlation analysis is a statistical technique used to measure the strength of the association between two variables.
• It is useful in business forecasting, where a known association between variables helps anticipate future scenarios.
Scatter Diagram

• The dependent variable is the variable being predicted or estimated.
• The independent variable provides the basis for estimation; it is the estimator.
• A scatter diagram is a chart that portrays the relationship between the two variables.
This scatter plot locates pairs of observations, with advertising expenditures on the x-axis and sales on the y-axis. Larger (smaller) values of sales tend to be associated with larger (smaller) values of advertising.

[Figure: Scatterplot of Advertising Expenditures (X) and Sales (Y). X-axis: Advertising, 0 to 50; Y-axis: Sales, 0 to 140.]
Direct Linear

• The scatter of points tends to be distributed around a positively sloped straight line.
• The pairs of values of advertising expenditures and sales are not located exactly on a straight line.
• The scatter plot reveals a more or less strong tendency rather than a precise linear relationship.
• The line represents the nature of the relationship on average.
Inverse Linear

[Figure: Y plotted against X]

Direct Nonlinear

[Figure: Y plotted against X]
• No association / No correlation

• Correlated?

[Figure: Y plotted against X]
Perfect Negative Correlation

[Figure: scatter plot of Y against X, both axes 0 to 10]
Perfect Positive Correlation

[Figure: scatter plot of Y against X, both axes 0 to 10]
Zero Correlation

[Figure: scatter plot of Y against X, both axes 0 to 10]
Strong Positive Correlation

[Figure: scatter plot of Y against X, both axes 0 to 10]
Nature of Correlation

Correlation can be

• Positive or Negative
• Linear or Nonlinear
• Perfect / Strong / Weak

Coefficient of Correlation, r

• Karl Pearson's coefficient of correlation (r) is a measure of the strength of the linear relationship between two variables.
• It requires interval or ratio-scaled data.
• It can range from -1.00 to +1.00.
• Values of -1.00 or +1.00 indicate perfect correlation; values close to either limit indicate strong correlation.
• Values close to 0.0 indicate weak correlation.
• Negative values indicate an inverse relationship and positive values indicate a direct relationship.
Formula for r

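We calculate the coefficient of correlation from the following formulae, where $s_X$ and $s_Y$ are the sample standard deviations of X and Y:

$$r = \frac{\mathrm{Cov}(X,Y)}{s_X\, s_Y}, \qquad \mathrm{Cov}(X,Y) = \frac{\sum (X-\bar{X})(Y-\bar{Y})}{n-1}$$

$$r = \frac{\sum (X-\bar{X})(Y-\bar{Y})}{\sqrt{\sum (X-\bar{X})^2 \,\sum (Y-\bar{Y})^2}}$$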
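As a rough illustration of the two formulae for r, the sketch below computes the coefficient both ways and checks that they agree. The function names and the five data pairs are made up for illustration; they do not come from the slides.

```python
import math


def pearson_r(xs, ys):
    """r from the sums of deviations: S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


def pearson_r_from_cov(xs, ys):
    """r as Cov(X, Y) / (s_x * s_y), using n - 1 in every denominator."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return cov / (s_x * s_y)


# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r_sums = pearson_r(xs, ys)
r_cov = pearson_r_from_cov(xs, ys)
print(f"r (deviation sums) = {r_sums:.4f}")
print(f"r (covariance)     = {r_cov:.4f}")
```

The (n - 1) factors cancel between the covariance and the two standard deviations, which is why both forms give the same value.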
Coefficient of Determination

• The coefficient of determination ($r^2$) is the proportion of the total variation in the dependent variable (Y) that is explained or accounted for (not necessarily caused) by the variation in the independent variable (X).
• It is the square of the coefficient of correlation and ranges from 0 to 1.
• It does not give any information on the direction of the relationship between the variables.
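The "proportion explained" reading can be written out as follows. This is the standard identity for a least-squares straight line rather than something stated on the slide; $y'$ denotes the value predicted by the regression line, the notation used in the regression slides. The numerical line uses the r = 0.614 reported in Example 1.

```latex
r^2 \;=\; \frac{\text{explained variation}}{\text{total variation}}
    \;=\; 1 - \frac{\sum (Y - y')^2}{\sum (Y - \bar{Y})^2}

% Example 1: r = 0.614
r^2 = (0.614)^2 \approx 0.377
```

On that reading, roughly 38% of the variation in textbook price in Example 1 is accounted for by the number of pages.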
Rank Correlation
• Charles Spearman's rank correlation coefficient (R) measures the degree of correlation between two qualitative variables, such as honesty, beauty, talent for singing or a gift for dancing, which cannot be measured directly. The items are instead ranked serially, and the correlation between the ranks is calculated as

$$R = 1 - \frac{6 \sum D^2}{N(N^2-1)}$$

where D is the difference between the two ranks given to the same item and N is the number of items ranked.
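A small worked case may help fix the rank formula. The ranks below are hypothetical (five items ranked by two judges) and are not taken from the slides.

```latex
% Hypothetical ranks given by two judges to five items
% Item:      1   2   3   4   5
% Judge 1:   1   2   3   4   5
% Judge 2:   2   1   4   3   5
% D:        -1   1  -1   1   0
\sum D^2 = 1 + 1 + 1 + 1 + 0 = 4, \qquad N = 5

R = 1 - \frac{6 \times 4}{5\,(5^2 - 1)} = 1 - \frac{24}{120} = 0.8
```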
Regression

• In regression analysis we use the independent variable (x) to estimate the dependent variable (y).
• When the relationship between the variables is linear, it is called linear regression.
• Both variables must be at least interval scale.
• The least squares criterion is used to determine the equation.
Least Squares Regression

The linear regression equation is y' = a + bx, where:
• y' is the average predicted value of the dependent variable for any value of x.
• a is the Y-intercept: the estimated value of y when x = 0.
• b, the regression coefficient, is the slope of the line: the average change in y for each change of one unit in x.
Regression Equation

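• The least squares principle is used to obtain a and b. It leads to two equations:

$$\sum Y = na + b\sum X$$

$$\sum XY = a\sum X + b\sum X^2$$

• Solving them for a and b gives:

$$b = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}, \qquad a = \frac{\sum Y}{n} - b\,\frac{\sum X}{n}$$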
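The step from the two least-squares equations to the closed forms for b and a is not shown on the slide. One way to fill it in, using only elimination, is sketched here.

```latex
% Multiply the first equation by \sum X and the second by n:
(\sum X)(\sum Y) = na\sum X + b\,(\sum X)^2
n\sum XY         = na\sum X + nb\sum X^2

% Subtract the first line from the second; the na\sum X terms cancel:
n\sum XY - (\sum X)(\sum Y) = b\left[\,n\sum X^2 - (\sum X)^2\right]
\;\Rightarrow\;
b = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}

% Divide the first original equation by n and rearrange:
a = \frac{\sum Y}{n} - b\,\frac{\sum X}{n} = \bar{Y} - b\,\bar{X}
```

The last line also shows that the fitted line always passes through the point of means $(\bar{X}, \bar{Y})$.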
Example -1

• Dan Ireland, the student body president at Toledo State University, is concerned about the cost of textbooks to students. He believes there is a relationship between the number of pages in a text and the selling price of the book. To provide insight into the problem, he selects a sample of eight textbooks currently on sale in the bookstore. Draw a scatter diagram and compute the correlation coefficient.
Book                     Pages (X)   Price in $ (Y)   X - X̄   Y - Ȳ
Intro to History            500            84
Basic Algebra               700            75
Intro to Psychology         800            99
Intro to Sociology          600            72
Bus. Management             400            69
Intro to Biology            500            81
Fund. of Jazz               600            63
Principles of Nursing       800            93
Total                   ΣX = 4,900     ΣY = 636

Ans: r = 0.614
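The two deviation columns in the table are left blank on the slide. The sketch below fills them in from the eight (pages, price) pairs and then applies the deviation-sum formula for r; the variable names are illustrative.

```python
import math

# (pages, price in $) for the eight textbooks in Example 1
books = {
    "Intro to History": (500, 84),
    "Basic Algebra": (700, 75),
    "Intro to Psychology": (800, 99),
    "Intro to Sociology": (600, 72),
    "Bus. Management": (400, 69),
    "Intro to Biology": (500, 81),
    "Fund. of Jazz": (600, 63),
    "Principles of Nursing": (800, 93),
}

pages = [x for x, _ in books.values()]
prices = [y for _, y in books.values()]
n = len(pages)

x_bar = sum(pages) / n
y_bar = sum(prices) / n

# Deviations from the means: the two empty columns of the table
dev_x = [x - x_bar for x in pages]
dev_y = [y - y_bar for y in prices]

sum_dxdy = sum(dx * dy for dx, dy in zip(dev_x, dev_y))
sum_dx2 = sum(dx ** 2 for dx in dev_x)
sum_dy2 = sum(dy ** 2 for dy in dev_y)

r = sum_dxdy / math.sqrt(sum_dx2 * sum_dy2)

for (title, (x, y)), dx, dy in zip(books.items(), dev_x, dev_y):
    print(f"{title:<22}{x:>5}{y:>5}{dx:>9.1f}{dy:>7.1f}")
print(f"mean pages = {x_bar}, mean price = {y_bar}")
print(f"r = {r:.3f}")
```

The result rounds to the 0.614 quoted on the slide.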

[Figure: Scatter Diagram of Number of Pages and Selling Price of Text. X-axis: Pages, 400 to 800; Y-axis: Price ($), 60 to 100.]
Example 1 (contd.)

Develop a regression equation for the information given in Example 1 that can be used to estimate the selling price based on the number of pages.

$$b = \frac{8(397{,}200) - (4{,}900)(636)}{8(3{,}150{,}000) - (4{,}900)^2} = 0.05143$$

$$a = \frac{636}{8} - 0.05143\left(\frac{4{,}900}{8}\right) = 48.0$$
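The arithmetic for a and b can be checked with a few lines of code. This is only a sketch: the sums are the ones shown in the slide's working, while the variable names and the helper `predict_price` are illustrative.

```python
# Sums taken from the slide's working for Example 1
n = 8
sum_x = 4_900        # total pages
sum_y = 636          # total price ($)
sum_xy = 397_200     # sum of pages * price
sum_x2 = 3_150_000   # sum of pages squared

# Least-squares slope and intercept
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n


def predict_price(pages):
    """Estimated selling price ($) for a book with the given page count."""
    return a + b * pages


print(f"b = {b:.5f}")
print(f"a = {a:.1f}")
print(f"estimated price of an 800-page book = {predict_price(800):.2f}")
```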
Example 1 (contd.)

The regression equation is:

Y' = 48.0 + 0.05143X

• The line crosses the Y-axis at $48: by the equation, a book with no pages would cost $48.
• The slope of the line is 0.05143: each additional page adds about 5 cents to the price.
Example 1 (contd.)

We can use the regression equation to estimate values of Y.

• Estimate the selling price of an 800-page book.

Y' = 48.0 + 0.05143X = 48.0 + 0.05143(800) = 89.14
Example 2/HW

• The marks given by two judges to the contestants of a beauty contest are shown below. Find the correlation between the tastes of the two judges.

Contestant   A   B   C   D   E   F   G   H   I   J
Judge X     52  53  42  60  45  41  37  38  25  27
Judge Y     65  68  43  38  77  48  35  30  25  50

• Ans: R = 0.5394 (Spearman's rank correlation)
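The quoted answer is reproduced when the marks are converted to ranks and the rank correlation formula R = 1 - 6ΣD²/(N(N² - 1)) is applied. A minimal sketch, assuming rank 1 goes to the highest mark and that there are no tied marks (true for this data); the function names are illustrative.

```python
def ranks_descending(scores):
    """Rank 1 for the highest score. Assumes all scores are distinct."""
    ordered = sorted(scores, reverse=True)
    return [ordered.index(s) + 1 for s in scores]


def spearman_r(xs, ys):
    """Rank correlation R = 1 - 6 * sum(D^2) / (N * (N^2 - 1))."""
    rank_x = ranks_descending(xs)
    rank_y = ranks_descending(ys)
    n = len(xs)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Marks from Example 2, contestants A to J
judge_x = [52, 53, 42, 60, 45, 41, 37, 38, 25, 27]
judge_y = [65, 68, 43, 38, 77, 48, 35, 30, 25, 50]

R = spearman_r(judge_x, judge_y)
print("ranks by Judge X:", ranks_descending(judge_x))
print("ranks by Judge Y:", ranks_descending(judge_y))
print(f"R = {R:.4f}")
```

Here ΣD² = 76 and N = 10, so R = 1 - 456/990.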

Assumptions

• For each value of x, there is a group of y values.
• These y values are normally distributed.
• The means of these normal distributions of y values all lie on the straight line of regression.
• The standard deviations of these normal distributions are equal.
Standard Error

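• The standard error of estimate (S.E) measures the scatter of the actual y values about the regression line:

$$S.E = \sqrt{\frac{\sum (Y - y')^2}{n-2}}$$

where y' is the value estimated by the regression equation and Y is the corresponding actual value.
• Equivalently, in terms of sums:

$$S.E = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{n-2}}$$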
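To see that the two expressions for S.E agree, the sketch below evaluates both on the textbook data of Example 1. The slides do not report a standard error for that example, so the printed figure is a computed illustration, not a quoted answer; variable names are illustrative.

```python
import math

# Example 1 data: pages (X) and price in $ (Y)
pages = [500, 700, 800, 600, 400, 500, 600, 800]
prices = [84, 75, 99, 72, 69, 81, 63, 93]
n = len(pages)

sum_x = sum(pages)
sum_y = sum(prices)
sum_xy = sum(x * y for x, y in zip(pages, prices))
sum_x2 = sum(x * x for x in pages)
sum_y2 = sum(y * y for y in prices)

# Least-squares coefficients
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

# Form 1: from the residuals Y - y'
residual_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(pages, prices))
se_residual = math.sqrt(residual_ss / (n - 2))

# Form 2: shortcut using only the sums
se_shortcut = math.sqrt((sum_y2 - a * sum_y - b * sum_xy) / (n - 2))

print(f"S.E from residuals = {se_residual:.2f}")
print(f"S.E from sums      = {se_shortcut:.2f}")
```

The shortcut form is convenient for hand calculation because it reuses sums already needed for a and b.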
Confidence Interval

• The higher the standard error, the lower the reliability of the predicted value of y.
• A confidence interval for y' at a given value of x can be constructed as y' ± z × S.E or, for small samples, y' ± t × S.E with n - 2 degrees of freedom.
Significance testing

• If the sample regression coefficient b is to be used for the whole population, its significance may be tested.
• The standard error of b is $S_b = \dfrac{S.E}{\sqrt{\sum x^2 - n\bar{x}^2}}$.
• For the t test, $t = (b - B)/S_b$ with n - 2 degrees of freedom, where B is the population regression coefficient.
• H0: B = 0, i.e., no linear relationship in the population. H1: B ≠ 0 (two-tailed), or B > 0 or B < 0 (one-tailed).
• A confidence interval for B can also be constructed as b ± t × S_b.
Example 3

• Estimate the relationship between sales (Rs lakh) and advertising expense (Rs lakh). Find the 95% confidence interval for sales when the advertising expense is Rs 7 lakh. Test whether advertising has a positive impact on sales at the 5% significance level.

Sales (Y)   3  15   6  20   9  25
Advt (X)    1   2   3   4   5   6

• Ans: X̄ = 3.5, Ȳ = 13, a = 2.4, b = 3.03
• y' = 2.4 + 3.03X. When X = 7, y' = 23.6.
• (a = 2.4 is the estimated sales without any advertising. For every additional rupee of advertising, expect sales to rise by about Rs 3.)
• $S.E = \sqrt{(\sum Y^2 - a\sum Y - b\sum XY)/(n-2)} = 7.1$
• The two-tailed t value at 5% for 4 d.f. is 2.776.
• 95% confidence interval = 23.6 ± 2.776 × 7.1 = 3.9 to 43.3
• H0: B = 0, H1: B > 0
• $S_b = S.E / \sqrt{\sum x^2 - n\bar{x}^2} = 1.7$
• t = (b - 0)/S_b = 1.785
• Since this is less than the one-tailed critical t at 4 d.f., 2.132, we cannot conclude at the 5% significance level (95% confidence level) that advertising has a positive impact on sales.
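The whole of Example 3 can be traced in code as a check on the hand calculation. In this sketch the two critical t values are copied from the slide rather than computed, and the variable names are illustrative.

```python
import math

# Example 3 data (both in Rs lakh)
advt = [1, 2, 3, 4, 5, 6]
sales = [3, 15, 6, 20, 9, 25]
n = len(advt)

x_bar = sum(advt) / n
y_bar = sum(sales) / n

# Sum(x^2) - n * x_bar^2 and the matching cross-product term
s_xx = sum(x * x for x in advt) - n * x_bar ** 2
s_xy = sum(x * y for x, y in zip(advt, sales)) - n * x_bar * y_bar

b = s_xy / s_xx
a = y_bar - b * x_bar

# Standard error of estimate with n - 2 degrees of freedom
residual_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(advt, sales))
se = math.sqrt(residual_ss / (n - 2))

# Critical t values for 4 d.f., copied from the slide (not computed here)
T_TWO_TAIL_5PCT = 2.776
T_ONE_TAIL_5PCT = 2.132

# 95% interval for sales at an ad expense of 7, using y' +/- t * S.E
y_hat = a + b * 7
half_width = T_TWO_TAIL_5PCT * se
ci = (y_hat - half_width, y_hat + half_width)

# One-tailed test of H0: B = 0 against H1: B > 0
s_b = se / math.sqrt(s_xx)
t_stat = (b - 0) / s_b
reject_h0 = t_stat > T_ONE_TAIL_5PCT

print(f"a = {a:.1f}, b = {b:.2f}, y'(7) = {y_hat:.1f}")
print(f"S.E = {se:.1f}, 95% CI = {ci[0]:.1f} to {ci[1]:.1f}")
print(f"S_b = {s_b:.1f}, t = {t_stat:.3f}, reject H0: {reject_h0}")
```

Working with unrounded intermediate values gives the same rounded results as the slide: the interval is 3.9 to 43.3 and t is about 1.785, below the critical value.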
Multiple Regression

• A variable may depend on more than one independent variable.
• E.g., the yield of grain depends on rainfall, fertilizer used, etc.
• $Y' = a + b_1X_1 + b_2X_2$ (a plane in a three-dimensional graph)
• Or, more generally, $Y' = a + b_1X_1 + b_2X_2 + b_3X_3 + b_4X_4 + \dots$
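The slides do not say how the coefficients of a multiple regression are obtained. One common approach, sketched here as an assumption, extends the least squares idea to two predictors by solving the corresponding normal equations. The data are hypothetical, generated exactly from Y = 1 + 2·X1 + 3·X2 so the fit can be checked, and all function names are illustrative.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Bring the largest remaining entry in this column to the pivot row
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution
    solution = [0.0] * n
    for i in reversed(range(n)):
        known = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - known) / aug[i][i]
    return solution


def fit_two_predictors(x1, x2, y):
    """Least-squares a, b1, b2 for Y' = a + b1*X1 + b2*X2."""
    rows = [[1.0, u, v] for u, v in zip(x1, x2)]  # columns: 1, X1, X2
    # Normal equations: (A^T A) coef = A^T y
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    aty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    return solve_linear_system(ata, aty)


# Hypothetical data, for illustration only
rain = [1, 2, 3, 4, 5, 6]
fertilizer = [2, 1, 4, 3, 6, 5]
grain_yield = [1 + 2 * r + 3 * f for r, f in zip(rain, fertilizer)]

a, b1, b2 = fit_two_predictors(rain, fertilizer, grain_yield)
print(f"a = {a:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

Because the hypothetical yields lie exactly on the plane, the fit recovers a, b1 and b2 up to floating-point rounding; real data would leave residuals.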

Example 4/ HW

• A professor felt that the hours students spend on homework and the marks they get are correlated. Test this with the given data.

Student    A   B   C   D    E   F   G   H   I   J
Hours     45  30  90  60  105  65  90  80  55  75
Mark      40  35  75  65   90  50  90  80  45  65

• Predict the mark of a student who spends 95 hours.
• Obtain a 95% confidence interval for the mark.
HW / Assignment
• IIMM Page 521,23, 42,79

• 2009 Terminal, Part B, Q. 1a & 1b
• 2007 Terminal (make-up), Part C, Q. 5a & 5b
• 2007 Terminal, Part C, Q. 6b