
QBM101 Module 4 - CHAPTER 21(19 Pages)


Chapter 21: Simple Linear Regression and Correlation

Regression analysis is used to predict the value of one variable (the dependent variable) on the
basis of other variables (the independent variables).
This technique may be the most commonly used statistical procedure because almost all
companies and government institutions forecast variables. Examples include weather
forecasting, stock market analysis, sales prediction, crop prediction, sports
prediction and oil price prediction.

Predictions are made in all areas.

Some predictions are more accurate than others because of the strength of the
relationship: the stronger the relationship between the variables, the more
accurate the prediction. For example, predicting temperature in degrees Fahrenheit (F)
from degrees Celsius (C) using the equation \( F = 32 + \frac{9}{5}C \) is 100% accurate
because this is an area of pure science, where the relationship is exact.

Regression analysis provides a best-fit mathematical equation for the values of
the two variables using the method of least squares.

Correlation analysis measures the strength of the relationship between the variables.

21.1 Dependent and independent variables
The dependent variable (Y) is the variable being predicted or estimated. It is also
referred to as the response variable.

The independent variable (X) provides the basis for estimation. It is also
referred to as the predictor variable.

21.2 Regression analysis model of the population

The regression equation is \( Y = \beta_0 + \beta_1 X + \varepsilon \), where:

Y is the dependent or response variable. For any given value of X, the line
\( \beta_0 + \beta_1 X \) gives the average (expected) value of Y.


X is called the independent or predictor variable. It provides the basis for estimation.
o \( \beta_0 \) is the Y-intercept, or the expected Y value when X = 0.
o \( \beta_1 \) is the slope of the line, or the average change in Y for each one-unit change in X.
o \( \varepsilon \) (epsilon) is the random error term.
Regression equation of the sample:

\[ \hat{y} = b_0 + b_1 x \]

where

\[ b_1 = \frac{SS_{xy}}{SS_x} \quad \text{and} \quad b_0 = \bar{y} - b_1 \bar{x} \]

21.3 Assumptions underlying simple linear regression

The four assumptions of regression (known by the acronym LINE) are as follows:

Linearity

Independence of errors

Normality of errors

Equal variance (also called homoscedasticity)

21.3.1 Linearity
The relationship between the variables is linear.
21.3.2 Independence of errors
The error variables are independent of one another.
21.3.3 Normality of errors
The error variable \( \varepsilon \) is normally distributed at each value of x.
21.3.4 Equal variance (homoscedasticity)
The variance of the error variable is constant for all values of X.


Example 21.1
A consultant was employed to study the relationship between annual sales and annual
advertising expenditure of business firms in order to build a model to predict annual
sales based on annual advertising expenditures.
A simple regression analysis of the relationship between the annual sales($ million) and
annual advertising expenditure($ thousand) of a random sample of 30 firms is shown
below.
Raw data:

No.   Annual advertising expenditure ($000)   Annual sales ($ million)
 1    22                                      4.5
 2    28                                      5.1
 3    31                                      5.3
 4    31                                      5.4
 5    35                                      5.9
 6    43                                      6
 7    43                                      6.5
 8    48                                      6.6
 9    43                                      6.6
10    49                                      6.8
11    56                                      6.9
12    52                                      7
13    57                                      7
14    58                                      7.5
15    61                                      8.7
16    60                                      8.9
17    62                                      9.2
18    66                                      9.5
19    64                                      9.6
20    69                                      10
21    67                                      10.2
22    72                                      10.4
23    75                                      10.5
24    78                                      10.8
25    77                                      11
26    82                                      11.2
27    81                                      11.5
28    83                                      11.7
29    85                                      12
30    89                                      12.4

a. Determine the least squares regression equation to predict annual sales based
   on annual advertising expenditures.

b. Calculate the standard error of estimate.

c. Calculate the coefficient of determination.

To calculate the answers to a, b and c, first work out the table of values below:

x       y        x^2       y^2        xy
22      4.5      484       20.25      99.00
28      5.1      784       26.01      142.80
31      5.3      961       28.09      164.30
31      5.4      961       29.16      167.40
35      5.9      1225      34.81      206.50
43      6        1849      36.00      258.00
43      6.5      1849      42.25      279.50
48      6.6      2304      43.56      316.80
43      6.6      1849      43.56      283.80
49      6.8      2401      46.24      333.20
56      6.9      3136      47.61      386.40
52      7        2704      49.00      364.00
57      7        3249      49.00      399.00
58      7.5      3364      56.25      435.00
61      8.7      3721      75.69      530.70
60      8.9      3600      79.21      534.00
62      9.2      3844      84.64      570.40
66      9.5      4356      90.25      627.00
64      9.6      4096      92.16      614.40
69      10       4761      100.00     690.00
67      10.2     4489      104.04     683.40
72      10.4     5184      108.16     748.80
75      10.5     5625      110.25     787.50
78      10.8     6084      116.64     842.40
77      11       5929      121.00     847.00
82      11.2     6724      125.44     918.40
81      11.5     6561      132.25     931.50
83      11.7     6889      136.89     971.10
85      12       7225      144.00     1020.00
89      12.4     7921      153.76     1103.60
Total:
1767    254.7    114129    2326.17    16255.90

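\( \sum x = 1767 \), \( \sum y = 254.7 \), \( \sum xy = 16255.90 \), \( \sum x^2 = 114129 \), \( \sum y^2 = 2326.17 \)

From the above, the following are calculated:

\[ SS_x = \sum x^2 - \frac{(\sum x)^2}{n} = 114129 - \frac{1767^2}{30} = 10052.70 \]

\[ SS_y = \sum y^2 - \frac{(\sum y)^2}{n} = 2326.17 - \frac{254.7^2}{30} = 163.767 \]

\[ SS_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 16255.9 - \frac{(1767)(254.7)}{30} = 1254.07 \]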
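The sums and sums of squares above can be checked with a short script. The following is a minimal Python sketch, assuming the 30 data pairs of Example 21.1 are typed in as two lists; the variable names are illustrative only and are not part of the original notes.

```python
# Data of Example 21.1: advertising expenditure ($000) and annual sales ($ million).
x = [22, 28, 31, 31, 35, 43, 43, 48, 43, 49, 56, 52, 57, 58, 61,
     60, 62, 66, 64, 69, 67, 72, 75, 78, 77, 82, 81, 83, 85, 89]
y = [4.5, 5.1, 5.3, 5.4, 5.9, 6, 6.5, 6.6, 6.6, 6.8, 6.9, 7, 7, 7.5, 8.7,
     8.9, 9.2, 9.5, 9.6, 10, 10.2, 10.4, 10.5, 10.8, 11, 11.2, 11.5, 11.7, 12, 12.4]
n = len(x)

# Column totals of the working table.
sum_x = sum(x)
sum_y = sum(y)
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Sums of squares, using the same shortcut formulas as the notes.
ss_x = sum_x2 - sum_x ** 2 / n
ss_y = sum_y2 - sum_y ** 2 / n
ss_xy = sum_xy - sum_x * sum_y / n

print(round(ss_x, 2), round(ss_y, 3), round(ss_xy, 2))
```

Working from the raw lists rather than the printed totals also guards against a copying error in any single row of the table.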
a. The required regression equation is \( \hat{y} = b_0 + b_1 x \), where

\[ b_1 = \frac{SS_{xy}}{SS_x} = \frac{1254.07}{10052.7} = 0.124749569 \approx 0.12475 \]

and

\[ b_0 = \bar{y} - b_1 \bar{x} = \frac{254.7}{30} - 0.124749569 \times \frac{1767}{30} = 8.49 - 7.347749614 = 1.142250386 \]

Therefore the least squares regression equation is

\[ \hat{y} = 1.14225 + 0.12475x \]

where x is annual advertising expenditure in $000
and y is annual sales in $ million.
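As a check on the arithmetic for the slope and intercept, here is a minimal Python sketch that starts from the totals and sums of squares quoted in the notes. The variable names are illustrative assumptions. Note that the mean of y is 254.7/30 = 8.49, which gives an intercept of about 1.14225, the same value that the EXCEL output in Example 21.2 reports.

```python
# Totals and sums of squares quoted in Example 21.1.
n = 30
sum_x, sum_y = 1767, 254.7
ss_x, ss_xy = 10052.70, 1254.07

b1 = ss_xy / ss_x                  # slope
b0 = sum_y / n - b1 * sum_x / n    # intercept = y-bar minus b1 times x-bar

print(f"y-hat = {b0:.5f} + {b1:.5f} x")
```

Rounded to five decimal places, the slope is 0.12475 and the intercept is 1.14225.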
b. The standard error of estimate measures the variation of the actual values of
   y about the regression line. It is written as

\[ S_e = \sqrt{\frac{SSE}{n-2}} \]

where

\[ SSE = SS_y - \frac{SS_{xy}^2}{SS_x} = 163.767 - \frac{1254.07^2}{10052.7} = 163.767 - 156.445 = 7.322 \]

Therefore

\[ S_e = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{7.322}{30-2}} = 0.5114 \]

in $ million.

c. The coefficient of determination is

\[ R^2 = \frac{SS_{xy}^2}{SS_x \, SS_y} = \frac{1254.07^2}{(10052.70)(163.767)} = \frac{1572691.565}{1646300.521} = 0.9553 \]

Unlike the standard error of estimate, which is measured in the units of y,
the coefficient of determination has no units of measurement. It measures the
proportion of the variation in y that is explained by the variation in x.
The value of the coefficient ranges from 0 to 1.
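The standard error of estimate and the coefficient of determination both follow from the same three sums of squares. The sketch below is a minimal Python illustration using the values quoted in Example 21.1; the variable names are assumptions made for this example.

```python
import math

# Sums of squares quoted in Example 21.1.
n = 30
ss_x, ss_y, ss_xy = 10052.70, 163.767, 1254.07

ss_regression = ss_xy ** 2 / ss_x    # variation in y explained by x
sse = ss_y - ss_regression           # unexplained (residual) variation
s_e = math.sqrt(sse / (n - 2))       # standard error of estimate, in units of y
r_squared = ss_regression / ss_y     # unitless proportion between 0 and 1

print(round(sse, 3), round(s_e, 4), round(r_squared, 4))
```

Because the explained and unexplained variation add up to the total variation in y, the same coefficient can also be obtained as 1 - SSE / SS_y.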
21.4 Interpretation of computer output
The above calculations are tedious to do by hand. The EXCEL Data Analysis
tool can generate the same outputs almost instantly. Students are therefore
encouraged to learn to use it, although they will not be tested on using the program.
The emphasis is on testing students' interpretation of the outputs generated by
EXCEL.
To illustrate, the following outputs are generated by EXCEL:
1. Scatter plot
2. Summary output
3. ANOVA table
4. Residual plot
5. Histogram of the residuals
6. Residual output


Example 21.2
A consultant was employed to study the relationship between annual sales and annual
advertising expenditure of business firms in order to build a model to predict annual sales
based on annual advertising expenditure. A simple linear regression analysis of the
relationship between the sales ($ millions) and advertisement expenditure ($ thousands)
of a random sample of 30 firms was performed using EXCEL. The summary output and
charts for this analysis follow.

[Figure: Scatter plot of annual sales ($ million) against annual advertising expenditure ($'000)]

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.977388491
R Square             0.955288263
Adjusted R Square    0.953691415
Standard Error       0.511381429
Observations         30

ANOVA
              df    SS             MS          F          Significance F
Regression    1     156.444693     156.4447    598.2338   1.94824E-20
Residual      28    7.322307042    0.261511
Total         29    163.767

                                         Coefficients   Standard Error   t Stat     P-value    Lower 95%     Upper 95%
Intercept                                1.142250341    0.314587143      3.63095    0.00112    0.497847798   1.786652883
Annual advertising expenditure ($'000)   0.12474957     0.005100392      24.45882   1.95E-20   0.11430189    0.13519725
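Several entries of the summary output are linked by simple arithmetic, which is useful when a printed value is missing. The following is a minimal Python sketch of those links, using the figures from the output above; the variable names are illustrative assumptions.

```python
import math

# Values read from the ANOVA and coefficient tables of the EXCEL output.
ss_regression, ss_residual = 156.444693, 7.322307042
df_regression, df_residual = 1, 28
slope_t_stat = 24.45882

ms_regression = ss_regression / df_regression              # MS = SS / df
ms_residual = ss_residual / df_residual
f_stat = ms_regression / ms_residual                       # F = MS regression / MS residual
r_squared = ss_regression / (ss_regression + ss_residual)  # R Square = SS regression / SS total
standard_error = math.sqrt(ms_residual)                    # standard error of estimate

print(round(ms_residual, 6), round(f_stat, 2), round(r_squared, 4), round(standard_error, 4))
```

In simple linear regression with one independent variable, F also equals the square of the slope's t Stat, which gives a second way to recover either value.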


[Figure: Residual plot of residuals against annual advertising expenditure ($'000)]

[Figure: Histogram of residuals (frequency against residuals)]

RESIDUAL OUTPUT
a. Interpret the scatter plot, identifying the independent and dependent variables.

In the scatter plot, annual advertising expenditure in $000 is the independent variable
and annual sales in $ million is the dependent variable.
There is a positive linear relationship between advertising expenditure and sales:
when advertising expenditure increases, sales are expected to increase, and
vice versa.
b. Write down the regression equation and interpret the slope coefficient.

The regression equation is \( \hat{y} = 1.14225 + 0.12475x \),

where x is the annual advertising expenditure in $000
and y is the annual sales in $ million.
1.14225 is the y-intercept and
0.12475 is the slope coefficient, which means that when annual advertising expenditure
increases by $1,000, annual sales are expected to increase by $0.12475 million (that is, $124,750).


c. Estimate the annual sales for the following annual advertising expenditures and comment
on the reliability of the estimates:
i. annual advertising expenditure of $60,000
ii. annual advertising expenditure of $100,000

i. \( \hat{y} = 1.14225 + 0.12475(60) = 8.62725 \) ($ million)
The estimate is quite reliable because $60,000 is within the range of the data.

ii. \( \hat{y} = 1.14225 + 0.12475(100) = 13.61725 \) ($ million)
The estimate is not reliable because $100,000 is outside the range of the data.
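The reliability comments above amount to checking whether the chosen x value lies inside the range of the sample. Here is a minimal Python sketch of that check; the function name `predict_sales` and the constants are hypothetical, with the range limits of 22 and 89 taken from the smallest and largest expenditures in the raw data of Example 21.1.

```python
B0, B1 = 1.14225, 0.12475   # fitted coefficients from the EXCEL output
X_MIN, X_MAX = 22, 89       # smallest and largest expenditure in the sample ($000)

def predict_sales(advertising_000):
    """Return (predicted sales in $ million, True if x lies inside the sample range)."""
    y_hat = B0 + B1 * advertising_000
    within_range = X_MIN <= advertising_000 <= X_MAX
    return y_hat, within_range

for x_value in (60, 100):
    y_hat, reliable = predict_sales(x_value)
    note = "within the data range" if reliable else "outside the data range, treat with caution"
    print(f"x = {x_value}: y-hat = {y_hat:.5f} ({note})")
```

A prediction outside the sample range is an extrapolation: the fitted line may not describe the relationship there, because no data were observed in that region.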

d. What is the value of the coefficient of determination? Interpret its meaning.
The coefficient of determination is R Square. If it is not given, it can be calculated
either as (Multiple R) squared, i.e. \( 0.977388491^2 = 0.955288262 \),
or as

\[ R^2 = \frac{SS_{Regression}}{SS_{Total}} = \frac{156.444693}{163.767} = 0.955288262 \approx 0.9553 \]

It means that approximately 95.53% of the variation in annual sales is
explained by the variation in annual advertising expenditure. The remaining 4.47% of the
variation in annual sales is not explained by the variation in annual advertising
expenditure. Therefore the above regression model is useful for predicting sales from
advertising expenditure.
e. What is the value of the standard error of estimate? Interpret its meaning.
The standard error of estimate, \( S_e \), is $0.511381429 million.
It measures the variation of the actual values of y about the regression line.

f. What is the value of the coefficient of correlation? Interpret its meaning.
The coefficient of correlation ranges from -1 to +1. In the above question the coefficient
of correlation is 0.977388491.
It means that there is a high degree of positive linear correlation between annual
advertising expenditure and annual sales.
Note: Multiple R gives only the magnitude of the correlation coefficient, because it is
always reported as a positive figure. We need to decide separately whether the correlation
is positive or negative: it has the same sign as the slope coefficient. In the above
example, since the slope coefficient is positive, the correlation coefficient is also positive.
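The rule in the note can be written in one line: take the square root of R Square and attach the sign of the slope. The sketch below is a minimal Python illustration using the values from the EXCEL output; the variable names are assumptions.

```python
import math

r_squared = 0.955288263   # R Square from the EXCEL output
slope = 0.12474957        # slope coefficient from the EXCEL output

# Multiple R is the positive square root; the slope supplies the sign.
r = math.copysign(math.sqrt(r_squared), slope)
print(round(r, 6))
```

With a negative slope, the same expression would return a negative correlation coefficient of the same magnitude.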

g. Test whether there is a significant linear relationship between annual advertising
expenditure and annual sales at the 5% level of significance.

\( H_0: \beta_1 = 0 \) (there is no significant linear relationship between advertising expenditure and sales)
\( H_1: \beta_1 \neq 0 \) (there is a significant linear relationship between advertising expenditure and sales)

Test statistic:

\[ t = \frac{b_1 - \beta_1}{S_{b_1}} \]

p-value = 1.95E-20 = \( 1.95 \times 10^{-20} \)

Since the p-value (approximately 0) < 0.05, reject \( H_0 \).
There is a significant linear relationship between annual advertising expenditure and
annual sales at the 5% significance level.

h. Set up the 95% confidence interval estimate of the slope coefficient between annual
advertising expenditure and annual sales.

\[ CI(\beta_1) = b_1 \pm t_{0.025,\,28} \, S_{b_1} = 0.12475 \pm 2.048(0.0051) = 0.12475 \pm 0.01044 \]

The 95% confidence interval for the population slope is \( 0.114302 < \beta_1 < 0.135197 \).
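The t statistic and the confidence interval can both be computed from the slope and its standard error. The following minimal Python sketch assumes the critical value 2.048 is read from a t-table (28 degrees of freedom, 2.5% in each tail), as in the notes; the variable names are illustrative.

```python
b1 = 0.12474957        # slope coefficient from the EXCEL output
se_b1 = 0.005100392    # standard error of the slope from the EXCEL output
t_critical = 2.048     # t-table value for 28 degrees of freedom, 2.5% in each tail

t_stat = (b1 - 0) / se_b1            # test statistic under H0: beta1 = 0
margin = t_critical * se_b1          # half-width of the 95% confidence interval
lower, upper = b1 - margin, b1 + margin
reject_h0 = abs(t_stat) > t_critical # critical-value version of the p-value rule

print(round(t_stat, 4), round(lower, 4), round(upper, 4), reject_h0)
```

The interval limits agree with the Lower 95% and Upper 95% columns of the EXCEL output to about four decimal places; small differences beyond that arise because EXCEL uses an unrounded critical value. Since the interval does not contain 0, it leads to the same conclusion as the test in part g.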
i. Which diagrams can be used to check the assumptions of normality and constant
variance of the error variable?
To determine whether the error variable is normally distributed, we examine the
histogram of the residuals.
The histogram given is not bell-shaped, and therefore the assumption of
normality of the error variable has been violated.
To evaluate the condition of constant variance of the error variable, we examine
the residual plot. The residual plot given indicates that, as x increases, the
residuals follow a pattern of increasing and decreasing values.
Therefore the assumption of constant variance of the error variable (homoscedasticity) has
been violated; that is, there is heteroscedasticity.
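Both diagnostic charts are drawn from the residuals, that is, the actual y values minus the fitted values. As a minimal sketch of how those residuals could be obtained outside EXCEL, the Python code below refits the line to the data of Example 21.1 and lists each residual next to its x value; the variable names are assumptions, and plotting is left out.

```python
# Data of Example 21.1: advertising expenditure ($000) and annual sales ($ million).
x = [22, 28, 31, 31, 35, 43, 43, 48, 43, 49, 56, 52, 57, 58, 61,
     60, 62, 66, 64, 69, 67, 72, 75, 78, 77, 82, 81, 83, 85, 89]
y = [4.5, 5.1, 5.3, 5.4, 5.9, 6, 6.5, 6.6, 6.6, 6.8, 6.9, 7, 7, 7.5, 8.7,
     8.9, 9.2, 9.5, 9.6, 10, 10.2, 10.4, 10.5, 10.8, 11, 11.2, 11.5, 11.7, 12, 12.4]
n = len(x)

# Least squares fit in deviation form (equivalent to SS_xy / SS_x).
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

# Residual = actual y minus predicted y.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

for xi, e in zip(x, residuals):
    print(f"{xi:3d} {e:8.4f}")
```

Plotting the residuals against x gives the residual plot, and grouping them into class intervals gives the histogram. Least squares residuals always sum to zero, so that property is a quick check on the calculation rather than evidence about the assumptions.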


Example 21.3
A study was conducted to study the relationship between the marks scored in the statistics final
examination and the marks scored in the accounting final examination. Data were collected from a
random sample of 20 students with the following results.
Final examination marks scored in statistics and accounting

Observation   Statistics   Accounting
 1            10           13
 2            15           12
 3            18           22
 4            24           25
 5            35           30
 6            38           36
 7            45           48
 8            48           44
 9            50           54
10            55           50
11            65           62
12            68           66
13            71           69
14            76           74
15            82           85
16            85           87
17            88           86
18            89           92
19            92           90
20            94           96

a. i. Calculate the regression coefficients and hence write down the regression equation to predict
      accounting marks.

   ii. Calculate the standard error of estimate, indicating the units of measurement.

   iii. Calculate the coefficient of correlation.

b. MS EXCEL was used to generate the following linear regression outputs and appropriate charts.


SUMMARY OUTPUT

Regression Statistics
Multiple R           A
R Square             0.9874
Adjusted R Square    0.9868
Standard Error       3.1845
Observations         20

ANOVA
              df    SS             MS             F          Significance F
Regression    1     14404.41397    14404.41397    1420.429   1.40343E-18
Residual      B     182.53603      10.14089
Total         19    14586.95

              Coefficients   Standard Error   t Stat     P-value
Intercept     -0.2935        1.6799           -0.1747    0.8632
Statistics    0.9990         0.0265           C          1.40343E-18

RESIDUAL OUTPUT

Observation   Predicted Accounting   Residuals
 1            9.69664                3.30336
 2            14.69172               -2.69172
 3            17.68876               4.31124
 4            23.68286               1.31714
 5            34.67204               -4.67204
 6            D                      -1.66909
 7            44.66220               3.33780
 8            47.65925               -3.65925
 9            49.65728               4.34272
10            54.65236               E
11            64.64252               -2.64252
12            67.63957               -1.63957
13            70.63662               -1.63662
14            75.63170               -1.63170
15            81.62580               3.37420
16            84.62285               2.37715
17            87.61989               -1.61989
18            88.61891               3.38109
19            91.61596               -1.61596
20            93.61399               2.38601


[Figure: Scatter diagram of accounting marks against statistics marks]

i. Interpret the scatter diagram, indicating the dependent and the independent variable.
Statistics mark is the independent variable and accounting mark is the dependent variable.
There is a positive linear relationship between statistics and accounting marks: when the
statistics mark increases, the accounting mark is also expected to increase.
ii. Use the output provided to write down the regression equation.

\( \hat{y} = -0.2935 + 0.999x \)
where x is the statistics mark and y is the accounting mark.

iii. Interpret the slope coefficient.
0.999 is the slope coefficient, which means that when the statistics mark increases by 1, the
accounting mark is expected to increase by 0.999 (approximately 1).
iv. Find the missing values of A, B, C, D and E in the given computer outputs.
A = \( \sqrt{0.9874} = 0.9937 \)
B = 19 - 1 = 18
C = \( \frac{0.999}{0.0265} = 37.70 \)
D = \( \hat{y} = -0.2935 + 0.999(38) = 37.6685 \)
OR
D = \( \hat{y} = y - \text{residual} = 36 - (-1.66909) = 36 + 1.66909 = 37.66909 \)
E = residual = \( y - \hat{y} = 50 - 54.65236 = -4.65236 \)
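Each missing value follows from a relationship between entries that are printed. The following minimal Python sketch collects those relationships for Example 21.3; the variable names are illustrative assumptions, and the inputs are copied from the output tables.

```python
import math

# Printed values from the summary and residual outputs of Example 21.3.
r_square = 0.9874
df_total, df_regression = 19, 1
intercept, slope, se_slope = -0.2935, 0.9990, 0.0265

A = math.sqrt(r_square)            # Multiple R is the square root of R Square
B = df_total - df_regression       # residual degrees of freedom
C = slope / se_slope               # t Stat = coefficient / standard error
D = 36 - (-1.66909)                # predicted = actual - residual (observation 6)
D_from_equation = intercept + slope * 38   # same value from the rounded equation
E = 50 - 54.65236                  # residual = actual - predicted (observation 10)

print(round(A, 4), B, round(C, 2), round(D, 5), round(E, 5))
```

The two routes to D differ slightly in the third decimal place because the printed coefficients are rounded to four decimal places, whereas the residual output was produced from unrounded coefficients.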

v. What is the value of the coefficient of determination? Interpret its meaning.

The coefficient of determination, R Square, is 0.9874.
It means that approximately 98.74% of the variation in accounting marks is explained by the
variation in statistics marks.
Therefore 1.26% of the variation in accounting marks is not explained by the variation in
statistics marks.
vi. Can the assumptions of constant variance of the error variable and normality of the
distribution of the error variable be checked from the outputs given above? If so, check whether the
assumption(s) is (are) satisfied.
The assumption of normality of the error variable is checked by looking at the histogram of
the residuals. Since this question does not provide the histogram, the assumption of normality
of the error variable cannot be checked.
The assumption of constant variance of the error variable is checked by looking at the
residual plot. In the residual plot, as the statistics mark increases, the residuals show no
pattern. Therefore the assumption of constant variance of the error variable has not been
violated.
vii. At the 1% level of significance, is there evidence of a linear relationship between the final
examination marks in statistics and accounting?
\( H_0: \beta_1 = 0 \) (there is no significant linear relationship between statistics and accounting marks)
\( H_1: \beta_1 \neq 0 \) (there is a significant linear relationship between statistics and accounting marks)

Test statistic:

\[ t = \frac{b_1 - \beta_1}{S_{b_1}} \]

p-value = 1.40343E-18 = \( 1.40343 \times 10^{-18} \approx 0 < 0.01 \), so reject \( H_0 \).

Therefore there is a significant linear relationship between statistics and accounting marks.


Exercise 1
In a small fishing town the daily catches are sold locally. Recently, the fishermen have
complained about price fluctuations and reduced catches, and have requested the
government to introduce a minimum fish price. It was suspected that fluctuations in fish
prices were related to fish catches. A statistician was asked to study the relationship
between daily prices and daily catches in the fishing town. A random sample of 30 weeks
was selected, and the price of fish ($) and the daily catch in kilograms were
recorded.
The prices range from a low of $3.00 to a high of $17.50 per kg.
The daily catches range from a low of 300 kg to a high of 1,000 kg.
The sample data were analyzed using EXCEL, and the summary output and appropriate
charts are provided below. However, because of a printer malfunction,
some of the data values are missing; they are indicated as A, B, C and D.
SUMMARY OUTPUT

Regression Statistics
Multiple R           A
R Square             0.9646
Adjusted R Square    0.9634
Standard Error       0.8426
Observations         30

              df    SS          MS          F    Significance F
Regression          541.9942    541.9942         7.32626E-22
Residual      28    19.8808     0.7100
Total         29    561.875

                            Coefficients   Standard Error   t Stat     P-value
Intercept                   24.5698        0.5406           45.4453    8.8279E-28
Average Daily Catch (kg.)   -0.0222        0.0008                      7.326E-22

[Figure: Scatter plot showing the relationship between average daily catch (kg.) and price ($)]


[Figure: Residual plot of residuals against average daily catch (kg.)]

[Figure: Histogram of residuals (frequency against residuals)]

Use the EXCEL output provided to answer the following questions.

(a) Find the missing values of A, B, C, and D in the given computer output.

(b) Interpret the scatter plot and identify the dependent and the independent variables.
    The independent variable is average daily catch and the dependent variable is price.
    The scatter plot shows a negative linear relationship between daily
    catch and price: when the daily catch increases, the price is expected to decrease.

(c) Write down the regression equation.

    where x is daily catch and y is the price.

(d) Interpret the slope coefficient.

(e) What is the value of the coefficient of determination? Interpret its meaning.
    The coefficient of determination is 0.9646.


    It means that 96.46% of the variation in price is explained by the variation in daily
    catch. Therefore 3.54% of the variation remains unexplained.

(f) Predict the price for a given day with a daily catch of 850 kg. Is your estimate
    reliable?

(g) Is there a significant linear relationship between daily catch and price at the 5%
    significance level?

(h) Which graph is used to check the assumption of constant variance of the error
    variable? Is there any evidence that this assumption has been violated?

(i) Which graph is used to check the assumption that the error variable is
    normally distributed? Comment on whether this assumption has been violated.

Exercise 2
A real estate company would like to establish a model to predict the monthly
rent (RM) of apartments in a selected city based on their size, measured in square feet (sq. ft.).
A random sample of 15 apartments in the city was selected, and the
monthly rent in RM and the size in square feet were recorded.
MS EXCEL was used to produce the following outputs, with some missing
figures labeled a to e.
Observation   Monthly Rent (RM)   Size (square feet)
 1            1200                850
 2            1700                1450
 3            1200                1085
 4            1500                1232
 5            850                 718
 6            1700                1485
 7            1500                1136
 8            900                 500
 9            650                 300
10            1150                956
11            1400                1100
12            1500                1285
13            2200                1985
14            1800                1800
15            1400                1400


SUMMARY OUTPUT

Regression Statistics
Multiple R           0.9656
R Square             a
Adjusted R Square    0.9271
Standard Error       108.4298
Observations         15

ANOVA
              df    SS          MS          F         Significance F
Regression    1     2106492     2106492     179.169   5.57758E-09
Residual      b     152841.2    11757.01
Total         14    2259333

                      Coefficients   Standard Error   t Stat   P-value    Lower 95%   Upper 95%
Intercept             390.56         78.81            4.96     0.0003     220.2969    560.8177
Size (square feet)    0.8559         0.0639           c        5.58E-09   0.7178      0.9940

RESIDUAL OUTPUT

Observation   Predicted Monthly Rent (RM)   Residuals
 1            1118.071                      81.92885
 2            1631.61                       68.38965
 3            1319.207                      -119.207
 4            1445.024                      54.97556
 5            1005.093                      -155.093
 6            1661.567                      d
 7            1362.858                      137.1418
 8            818.5066                      81.49339
 9            647.3269                      2.67312
10            e                             -58.7964
11            1332.046                      67.95418
12            1490.387                      9.61293
13            2089.516                      110.4839
14            1931.175                      -131.175
15            1588.815                      -188.815


i. Write down the regression equation and interpret the slope coefficient.
ii. What is the value of R Square (4 decimal places), marked a in the regression
statistics? Explain what it means.
iii. What are the other missing values, marked b to e (2 decimal places)?
iv. What is the value of the coefficient of correlation? Explain what it
measures.
v. At the 5% level of significance, is there a significant linear relationship between
the size of the apartments and the rent?
vi. Which chart or diagram can be used to check whether the assumption of
homoscedasticity has been violated, and what is your conclusion?
vii. Estimate the monthly rent for apartments with a size of
a) 2,000 sq. ft. and b) 2,500 sq. ft., and comment on the reliability of your
estimates.
