Some predictions are more accurate than others because of the strength of the
relationship: the stronger the relationship between the variables, the more
accurate the prediction. For example, predicting temperature in degrees Fahrenheit
from degrees Celsius using the equation $F = 32 + \frac{9}{5}C$ is 100% accurate
because this is an area of pure science.
Regression analysis provides a best-fit mathematical equation for the values of
the two variables using the method of least squares.
21.1
The independent variable (X) provides the basis for estimation. It is also
referred to as the predictor variable.
21.2
Y is called the dependent or response variable. $\hat{y}$ is the average
predicted (estimated) value of Y for any given value of X.
21.3
$\hat{y} = b_0 + b_1 x$
where
$b_1 = \frac{SS_{xy}}{SS_x}$
and
$b_0 = \bar{y} - b_1 \bar{x}$
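As a rough illustration of the formulas above, the following Python sketch computes b0 and b1 from the sums of squares. The function name `least_squares_fit` and the sample Celsius values are assumptions made for this illustration; they do not come from the notes. Because the Fahrenheit values are generated from an exact linear rule, the fit should recover an intercept of about 32 and a slope of about 1.8.

```python
def least_squares_fit(x, y):
    """Return (b0, b1) for y-hat = b0 + b1*x using b1 = SSxy/SSx."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    ss_x = sum(v * v for v in x) - sum_x ** 2 / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    b1 = ss_xy / ss_x
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


# Hypothetical sample: exact Celsius-to-Fahrenheit conversion, F = 32 + (9/5)C
celsius = [0, 10, 20, 37, 100]
fahrenheit = [32 + 9 / 5 * c for c in celsius]

b0, b1 = least_squares_fit(celsius, fahrenheit)
print(round(b0, 6), round(b1, 6))
```

With real data the points do not lie exactly on a line, so the fitted line is only the best fit in the least-squares sense.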
Linearity
Independence of Errors
Normality of Errors
21.3.1 Linearity
The relationship between the variables is linear.
21.3.2 Independence of Errors
The error variables are independent of one another.
21.3.3 Normality of Errors
The error variable is normally distributed.
Example 21.1
A consultant was employed to study the relationship between annual sales and annual
advertising expenditure of business firms in order to build a model to predict annual
sales based on annual advertising expenditures.
A simple regression analysis of the relationship between the annual sales($ million) and
annual advertising expenditure($ thousand) of a random sample of 30 firms is shown
below.
Raw Data:
NO   Annual Advertising Expenditure ($ thousand)   Annual Sales ($ million)
1    22    4.5
2    28    5.1
3    31    5.3
4    31    5.4
5    35    5.9
6    43    6
7    43    6.5
8    48    6.6
9    43    6.6
10   49    6.8
11   56    6.9
12   52    7
13   57    7
14   58    7.5
15   61    8.7
16   60    8.9
17   62    9.2
18   66    9.5
19   64    9.6
20   69    10
21   67    10.2
22   72    10.4
23   75    10.5
24   78    10.8
25   77    11
26   82    11.2
27   81    11.5
28   83    11.7
29   85    12
30   89    12.4
a.
Determine the least-squares regression equation to predict annual sales based
on annual advertising expenditure.
b.
c.
To answer parts a, b and c, first work out the table of values below:
x      y      x²       y²        xy
22     4.5    484      20.25     99.00
28     5.1    784      26.01     142.80
31     5.3    961      28.09     164.30
31     5.4    961      29.16     167.40
35     5.9    1225     34.81     206.50
43     6      1849     36.00     258.00
43     6.5    1849     42.25     279.50
48     6.6    2304     43.56     316.80
43     6.6    1849     43.56     283.80
49     6.8    2401     46.24     333.20
56     6.9    3136     47.61     386.40
52     7      2704     49.00     364.00
57     7      3249     49.00     399.00
58     7.5    3364     56.25     435.00
61     8.7    3721     75.69     530.70
60     8.9    3600     79.21     534.00
62     9.2    3844     84.64     570.40
66     9.5    4356     90.25     627.00
64     9.6    4096     92.16     614.40
69     10     4761     100.00    690.00
67     10.2   4489     104.04    683.40
72     10.4   5184     108.16    748.80
75     10.5   5625     110.25    787.50
78     10.8   6084     116.64    842.40
77     11     5929     121.00    847.00
82     11.2   6724     125.44    918.40
81     11.5   6561     132.25    931.50
83     11.7   6889     136.89    971.10
85     12     7225     144.00    1020.00
89     12.4   7921     153.76    1103.60
Total:
1767   254.7  114129   2326.17   16255.90
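$SS_x = \sum x^2 - \frac{(\sum x)^2}{n} = 114129 - \frac{1767^2}{30} = 10052.70$
$SS_y = \sum y^2 - \frac{(\sum y)^2}{n} = 2326.17 - \frac{254.7^2}{30} = 163.767$
$SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 16255.9 - \frac{(1767)(254.7)}{30} = 1254.07$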
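The sums of squares can also be checked directly from the raw data. The short Python sketch below does this for the 30 firms in Example 21.1; the variable names are chosen for this illustration only.

```python
# Example 21.1 data: x = advertising ($ thousand), y = sales ($ million)
x = [22, 28, 31, 31, 35, 43, 43, 48, 43, 49, 56, 52, 57, 58, 61,
     60, 62, 66, 64, 69, 67, 72, 75, 78, 77, 82, 81, 83, 85, 89]
y = [4.5, 5.1, 5.3, 5.4, 5.9, 6, 6.5, 6.6, 6.6, 6.8, 6.9, 7, 7, 7.5, 8.7,
     8.9, 9.2, 9.5, 9.6, 10, 10.2, 10.4, 10.5, 10.8, 11, 11.2, 11.5, 11.7,
     12, 12.4]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v * v for v in x)              # sum of x squared
sum_y2 = sum(v * v for v in y)              # sum of y squared
sum_xy = sum(a * b for a, b in zip(x, y))   # sum of x times y

ss_x = sum_x2 - sum_x ** 2 / n
ss_y = sum_y2 - sum_y ** 2 / n
ss_xy = sum_xy - sum_x * sum_y / n

print(round(ss_x, 2), round(ss_y, 3), round(ss_xy, 2))
```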
a.
$\hat{y} = b_0 + b_1 x$
where
$b_1 = \frac{SS_{xy}}{SS_x} = \frac{1254.07}{10052.7} = 0.124749569 \approx 0.12475$
and
$b_0 = \bar{y} - b_1 \bar{x} = \frac{254.7}{30} - 0.124749569\left(\frac{1767}{30}\right) = 1.1422504$
$\hat{y} = 1.1422504 + 0.12475x$
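A quick arithmetic check of the coefficients, using only the totals and sums of squares computed above, might look like the following Python sketch. With these inputs the intercept works out to about 1.14225, which agrees with the Excel intercept reported in Example 21.2.

```python
n = 30
sum_x, sum_y = 1767, 254.7          # totals from the table of values
ss_x, ss_xy = 10052.70, 1254.07     # sums of squares from the working above

b1 = ss_xy / ss_x                   # slope
b0 = sum_y / n - b1 * sum_x / n     # intercept = y-bar - b1 * x-bar

print(f"y-hat = {b0:.5f} + {b1:.5f} x")
```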
b.
$S_e = \sqrt{\frac{SSE}{n-2}}$
where
$SSE = SS_y - \frac{SS_{xy}^2}{SS_x} = 163.767 - \frac{1254.07^2}{10052.7} = 7.322$
$S_e = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{7.322}{30-2}} = 0.5114$ (in $ million)
c.
Coefficient of determination:
$R^2 = \frac{SS_{xy}^2}{SS_x \, SS_y} = \frac{1254.07^2}{(10052.70)(163.767)} = \frac{1572691.565}{1646300.521} = 0.9553$
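The standard error of estimate and the coefficient of determination follow from the same three sums of squares. The Python sketch below repeats the arithmetic and also confirms that the ratio form of R squared matches the equivalent form 1 - SSE/SSy; the variable names are illustrative.

```python
import math

n = 30
ss_x, ss_y, ss_xy = 10052.70, 163.767, 1254.07

sse = ss_y - ss_xy ** 2 / ss_x            # unexplained (error) sum of squares
s_e = math.sqrt(sse / (n - 2))            # standard error of estimate
r_squared = ss_xy ** 2 / (ss_x * ss_y)    # coefficient of determination
r_squared_alt = 1 - sse / ss_y            # equivalent form

print(round(sse, 3), round(s_e, 4), round(r_squared, 4))
```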
1. Scatter Plot
2. Summary Output
3. ANOVA Table
4. Residual Plot
5.
6. Residual output
Example 21.2
A consultant was employed to study the relationship between annual sales and annual
advertising expenditure of business firms in order to build a model to predict annual sales
based on annual advertising expenditure. A simple linear regression analysis of the
relationship between the sales ($ millions) and advertisement expenditure ($ thousands)
of a random sample of 30 firms was performed using EXCEL. The summary output and
charts for this analysis follow.
[Scatter plot: Annual Sales ($ millions) against annual advertising expenditure]
SUMMARY OUTPUT
Regression Statistics
Multiple R           0.977388491
R Square             0.955288263
Adjusted R Square    0.953691415
Standard Error       0.511381429
Observations         30

ANOVA
             Df   SS            MS         F          Significance F
Regression   1    156.444693    156.4447   598.2338   1.94824E-20
Residual     28   7.322307042   0.261511
Total        29   163.767

                                        Coefficients   Standard Error   t Stat     P-value    Lower 95%     Upper 95%
Intercept                               1.142250341    0.314587143      3.63095    0.00112    0.497847798   1.786652883
Annual advertising expenditure($'000)   0.12474957     0.005100392      24.45882   1.95E-20   0.11430189    0.13519725
[Residual plot: residuals against x]
[Histogram of residuals (frequency)]
RESIDUAL OUTPUT
a
Interpret the scatter plot identifying the independent and dependent variables
In the scatter plot, annual advertising expenditure in $'000 is the independent variable
and annual sales in $ million is the dependent variable.
There is a positive linear relationship between advertising expenditure and sales,
meaning that when advertising expenditure increases, sales are expected to increase, and
vice versa.
b
Write down the regression equation and interpret the slope coefficient
$\hat{y} = 1.14225 + 0.12475x$
i.
ii.
d.
What is the value of the coefficient of determination and interpret its meaning
The coefficient of determination is the R Square. If it is not given, it can be calculated
either as (Multiple R)², i.e. $0.977388491^2 = 0.955288262$,
or as
$R^2 = \frac{SS_{Regression}}{SS_{Total}} = \frac{156.444693}{163.767} = 0.955288262 \approx 0.9553$
It means that approximately 95.53% of the variation in annual sales is explained
by variation in annual advertising expenditure. The remaining 4.47% of the
variation in annual sales is not explained by variation in annual advertising
expenditure. Therefore the above regression model is useful for predicting sales
from advertising expenditure.
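Both routes to the coefficient of determination can be checked against the Excel output with a few lines of Python. This is only an arithmetic sketch using the values printed in the summary output; the variable names are illustrative.

```python
multiple_r = 0.977388491
ss_regression = 156.444693
ss_total = 163.767

r2_from_multiple_r = multiple_r ** 2
r2_from_anova = ss_regression / ss_total
unexplained_share = 1 - r2_from_anova     # share of variation not explained

print(round(r2_from_multiple_r, 4), round(r2_from_anova, 4),
      round(unexplained_share, 4))
```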
e.
What is the value of the standard error of estimate and interpret its meaning?
The standard error of estimate, $S_e$, is 0.511381429 ($ million).
It measures the fluctuation of the actual values of y about the regression line.
f.
What is the value of the coefficient of correlation and interpret its meaning:
The coefficient of correlation ranges from -1 to +1. In the above question the coefficient
of correlation is 0.977388491.
It means that there is a high degree of positive linear correlation between annual
advertising expenditure and annual sales.
Note: Multiple R is not itself the correlation coefficient, because Multiple R is always
given as a positive figure. We need to decide whether the correlation is positive or
negative by looking at the sign of the slope coefficient. In the above example, since the
slope coefficient is positive, the correlation coefficient is also positive.
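The rule in the note, take the magnitude from Multiple R (the square root of R Square) and the sign from the slope, can be sketched in Python as follows. The negative slope in the last line is a hypothetical value used only to show the sign flipping.

```python
import math

r_square = 0.955288263
slope = 0.12474957

multiple_r = math.sqrt(r_square)          # always non-negative
r = math.copysign(multiple_r, slope)      # correlation takes the slope's sign
print(round(r, 6))

# Hypothetical negative slope: same magnitude, negative correlation
r_negative = math.copysign(multiple_r, -0.5)
```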
g.
Test whether there is a significant linear relationship between annual advertising
expenditure and annual sales at the 5% level of significance.
Test statistic:
$t = \frac{b_1 - \beta_1}{S_{b_1}}$
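A minimal sketch of this test in Python is given below. It assumes the usual null hypothesis that the population slope is zero (no linear relationship) and a two-tailed critical value of 2.048 for 28 degrees of freedom at the 5% level; the slope and its standard error are taken from the Excel output.

```python
b1 = 0.12474957           # sample slope from the summary output
se_b1 = 0.005100392       # standard error of the slope
beta1_null = 0.0          # assumed H0: population slope is zero
t_critical = 2.048        # two-tailed, 5% level, 28 degrees of freedom

t_stat = (b1 - beta1_null) / se_b1
reject_h0 = abs(t_stat) > t_critical

print(round(t_stat, 3), reject_h0)
```

The computed statistic matches the t Stat reported for the slope in the summary output, and it is far larger than the critical value, so under these assumptions the null hypothesis of no linear relationship is rejected.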
h.
$CI(\beta_1) = b_1 \pm t_{0.05/2,\,28}\,(S_{b_1})$
$= 0.12475 \pm 2.048(0.0051)$
$= 0.12475 \pm 0.01044 = (0.114302,\ 0.135197)$
The 95% CI for the population slope is $0.114302 < \beta_1 < 0.135197$.
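The interval can be reproduced with the rounded inputs shown in the working. The Python sketch below is illustrative only; because it uses rounded values, its limits agree with the Excel limits to about four decimal places rather than exactly.

```python
b1 = 0.12475          # slope, rounded as in the working
se_b1 = 0.0051        # standard error of the slope, rounded
t_critical = 2.048    # two-tailed 5% critical value with 28 degrees of freedom

margin = t_critical * se_b1
lower, upper = b1 - margin, b1 + margin

print(round(margin, 5), round(lower, 4), round(upper, 4))
```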
i.
Which diagrams can be used to check the assumption of normality of the error
variable and the assumption of constant variance of the error variable?
To determine whether the error variable is normally distributed, we have to examine the
histogram of the residuals.
The histogram given is not bell-shaped, and therefore the assumption of
normality of the error variable has been violated.
To evaluate the condition of constant variance of the error variable, we have to examine
the residual plot. The residual plot given indicates that, with increasing values of x, the
residuals follow a pattern of increasing and decreasing values.
Therefore the assumption of constant variance of the error variable (homoscedasticity) has
been violated; that is, there is heteroscedasticity.
Example 21.3
A study was conducted to study the relationship between the marks scored in the statistics final
examination and the marks scored in the accounting final examination. Data were collected from a
random sample of 20 students with the following results.
Final examination marks scored in statistics and accounting
Observation   Statistics   Accounting
1             10           13
2             15           12
3             18           22
4             24           25
5             35           30
6             38           36
7             45           48
8             48           44
9             50           54
10            55           50
11            65           62
12            68           66
13            71           69
14            76           74
15            82           85
16            85           87
17            88           86
18            89           92
19            92           90
20            94           96
a.i.
Calculate the regression coefficients and hence write down the regression equation to predict
accounting marks.
ii.
iii.
b.
MS EXCEL was used to generate the following linear regression outputs and appropriate charts.
SUMMARY OUTPUT
Regression Statistics
Multiple R           A
R Square             0.9874
Adjusted R Square    0.9868
Standard Error       3.1845
Observations         20

ANOVA
             Df   SS            MS            F          Significance F
Regression   1    14404.41397   14404.41397   1420.429   1.40343E-18
Residual     B    182.53603     10.14089
Total        19   14586.95

             Coefficients   Standard Error   t Stat    P-value
Intercept    -0.2935        1.6799           -0.1747   0.8632
Statistics   0.9990         0.0265           C         1.40343E-18

RESIDUAL OUTPUT
Observation   Predicted Accounting   Residuals
1             9.69664                3.30336
2             14.69172               -2.69172
3             17.68876               4.31124
4             23.68286               1.31714
5             34.67204               -4.67204
6             D                      -1.66909
7             44.66220               3.33780
8             47.65925               -3.65925
9             49.65728               4.34272
10            54.65236               E
11            64.64252               -2.64252
12            67.63957               -1.63957
13            70.63662               -1.63662
14            75.63170               -1.63170
15            81.62580               3.37420
16            84.62285               2.37715
17            87.61989               -1.61989
18            88.61891               3.38109
19            91.61596               -1.61596
20            93.61399               2.38601
[Scatter plot: Accounting Marks against Statistics Marks]
i. Interpret the scatter diagram, indicating the dependent and the independent variable.
The statistics mark is the independent variable and the accounting mark is the dependent variable.
There is a positive linear relationship between statistics and accounting marks, meaning that when
the statistics mark increases, the accounting mark is expected to increase as well.
ii. Use the output provided to write down the regression equation?
$\hat{y} = -0.2935 + 0.999x$
where x is the statistics marks and y is the accounting marks
C = $\frac{0.999}{0.0265} = 37.70$
D =
OR
E = Residual = $y - \hat{y} = 50 - 54.65236 = -4.65236$
Test statistic,
$t = \frac{b_1 - \beta_1}{S_{b_1}}$
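The missing entries A to E in the output for Example 21.3 can be recovered with a few lines of arithmetic. The Python sketch below shows one possible route, using only figures printed in the output; the variable names are illustrative, and the two ways of obtaining D differ slightly because the printed coefficients are rounded.

```python
import math

n = 20
b0, b1, se_b1 = -0.2935, 0.9990, 0.0265    # coefficients from the output

a_multiple_r = math.sqrt(0.9874)   # A: square root of R Square (slope is positive)
b_residual_df = n - 2              # B: residual degrees of freedom
c_t_stat = b1 / se_b1              # C: t Stat for the slope

# D: predicted accounting mark for observation 6 (statistics = 38, accounting = 36)
d_from_equation = b0 + b1 * 38
d_from_residual = 36 - (-1.66909)  # actual value minus its residual

# E: residual for observation 10 (accounting = 50, predicted = 54.65236)
e_residual = 50 - 54.65236

print(round(a_multiple_r, 4), b_residual_df, round(c_t_stat, 2),
      round(d_from_residual, 5), round(e_residual, 5))
```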
Exercise 1
In a small fishing town the daily catches were sold locally. Recently, the fishermen have
complained about price fluctuations and reduced catches and hence requested the
government to introduce a minimum fish price. It was suspected that fluctuations in fish
prices were related to fish catches. A statistician was asked to study the relationship
between daily prices and daily catches in the fishing town. A random sample of 30 weeks
was selected, and the prices of fish ($) and the daily catches in kilograms were
recorded.
The prices range from a low of $3.00 to a high of $17.50 per kg.
The daily catches range from a low of 300 kg to a high of 1,000 kg.
The sample data were analyzed using EXCEL, and the summary output and appropriate
charts were generated and are provided below. However, because of a printer malfunction,
some of the data values are missing; they are indicated as A, B, C and D.
SUMMARY OUTPUT
Regression Statistics
Multiple R           A
R Square             0.9646
Adjusted R Square    0.9634
Standard Error       0.8426
Observations         30

             df   SS         MS         F    Significance F
Regression        541.9942   541.9942        7.32626E-22
Residual     28   19.8808    0.7100
Total        29   561.875

                           Coefficients   Standard Error   t Stat    P-value
Intercept                  24.5698        0.5406           45.4453   8.8279E-28
Average Daily Catch(kg.)   -0.0222        0.0008                     7.326E-22

[Scatter plot: Prices($) against Average Daily Catch (kg.)]
[Residual plot: Residuals against Average Daily Catch (kg.)]
[Histogram of residuals: Frequency against Residuals]
(a)
(b)
Interpret the scatter plot and identify the dependent and the independent variables.
Average daily catch is the independent variable and price is the dependent variable.
The scatter plot shows that there is a negative linear relationship between daily
catch and price. It means that when the daily catch increases, the price is expected to decrease.
(c)
(d)
(e)
It means that 96.46% of the variation in price is explained by variation in daily
catch. Therefore 3.54% of the variation in price remains unexplained.
(f)
Predict the price for a given day with a daily catch of 850 kg. Is your estimate
reliable?
(g)
Is there any linear relationship between daily catch and price at the 5%
significance level?
(h)
Which graph is used to check the assumption of constant variance of the error
variable? Is there any evidence that this assumption has been violated?
(i)
Which graph is used to check the assumption that the error variable must be
normally distributed? Comment on whether this assumption has been violated.
Exercise 2
A real estate company would like to establish a model to predict the monthly
rent (RM) of apartments in a selected city based on their size, measured in
square feet (sq. ft.).
A random sample of 15 apartments in the selected city was taken, and the
monthly rent in RM and the size in square feet were recorded.
MS EXCEL was used to produce the following charts and diagrams, with some missing
figures labeled a to e.
Observation   Monthly Rent(RM)   Size(square feet)
1             1200               850
2             1700               1450
3             1200               1085
4             1500               1232
5             850                718
6             1700               1485
7             1500               1136
8             900                500
9             650                300
10            1150               956
11            1400               1100
12            1500               1285
13            2200               1985
14            1800               1800
15            1400               1400
SUMMARY OUTPUT
Regression Statistics
Multiple R           0.9656
R Square             a
Adjusted R Square    0.9271
Standard Error       108.4298
Observations         15

ANOVA
             Df   SS         MS         F         Significance F
Regression   1    2106492    2106492    179.169   5.57758E-09
Residual     b    152841.2   11757.01
Total        14   2259333

                    Coefficients   Standard Error   t Stat   P-value    Lower 95%   Upper 95%
Intercept           390.56         78.81            4.96     0.0003     220.2969    560.8177
Size(square feet)   0.8559         0.0639           c        5.58E-09   0.7178      0.9940

RESIDUAL OUTPUT
Observation   Predicted Monthly Rent(RM)   Residuals
1             1118.071                     81.92885
2             1631.61                      68.38965
3             1319.207                     -119.207
4             1445.024                     54.97556
5             1005.093                     -155.093
6             1661.567                     d
7             1362.858                     137.1418
8             818.5066                     81.49339
9             647.3269                     2.67312
10            e                            -58.7964
11            1332.046                     67.95418
12            1490.387                     9.61293
13            2089.516                     110.4839
14            1931.175                     -131.175
15            1588.815                     -188.815
i. Write down the regression equation and interpret the slope coefficient.
ii. What is the value of R square(4 decimal places) marked a in the regression
statistics and explain what it means.
iii. What are the values of the other missing values marked b to e ( 2 decimal places)?
iv. What is the value of the coefficient of correlation, and what does it
measure?
v. At the 5% level of significance, is there any significant linear relationship between
size of the apartments and rent?
vi. Which chart or diagram can be used to check whether the assumption of
homoscedasticity has been violated and what is your conclusion?
vii. Estimate the monthly rent for apartments with sizes of
a) 2,000 sq. ft. and b) 2,500 sq. ft., and comment on the reliability of your
estimates.