Microsoft Excel
1. Before you start
You can compute summary statistics for a data set by using the appropriate Excel
functions, such as:
=AVERAGE(B4:B23), which yields the average heating cost for all the houses;
=MIN(B4:B23), which yields the minimum heating cost for all the houses;
=MAX(B4:B23), which yields the maximum heating cost for all the houses;
=STDEV(B4:B23), which yields the sample standard deviation of the heating costs;
etc.
     A      B             C                    D                    E    F
1    HEATING COST DATA
2
3    House  Heating Cost  Minimum Temperature  Insulation (inches)  Age  Windows
4    1      250           35                   3                    6    10
5    2      360           29                   4                    10   1
6    3      165           36                   7                    3    9
7    4      43            60                   6                    9    8
8    5      92            65                   5                    6    8
9    6      200           30                   5                    5    9
10   7      355           10                   6                    7    14
11   8      290           7                    10                   10   9
12   9      230           21                   9                    11   11
13   10     120           55                   2                    5    9
14   11     73            54                   12                   4    11
15   12     205           48                   5                    1    10
16   13     400           20                   5                    15   12
17   14     320           39                   4                    7    10
18   15     72            60                   8                    6    8
19   16     272           20                   5                    8    10
20   17     94            58                   7                    3    10
21   18     190           40                   8                    11   11
22   19     235           27                   9                    8    14
23   20     139           30                   7                    5    9
If you enter, for instance, =AVERAGE(B4:B23) in cell B24, the average heating cost
will be displayed in that cell. If you then select the cell and drag its fill
handle across the adjacent cells, the formula is copied automatically and displays
the average temperature, insulation, age and number of windows.
The resulting spreadsheet is shown below (you may need to reformat the cells
and column widths).
[Table: summary statistics for Heating Cost, Minimum Temperature, Insulation (inches), Age and Windows]
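For readers who want to cross-check the spreadsheet results outside Excel, the summary statistics can be sketched in Python. This is only an illustration: the variable names are assumptions, and the values are the Heating Cost column (B4:B23) typed in by hand.

```python
from statistics import mean, stdev

# Heating Cost column (cells B4:B23), copied by hand from the data table.
heating_cost = [250, 360, 165, 43, 92, 200, 355, 290, 230, 120,
                73, 205, 400, 320, 72, 272, 94, 190, 235, 139]

summary = {
    "average": mean(heating_cost),   # counterpart of the average function
    "minimum": min(heating_cost),    # counterpart of the minimum function
    "maximum": max(heating_cost),    # counterpart of the maximum function
    "stdev": stdev(heating_cost),    # sample standard deviation (divides by n - 1)
}

for name, value in summary.items():
    print(f"{name:8s} {value:8.2f}")
```

Repeating the same calls on the other columns corresponds to dragging the formula across the adjacent cells.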
3. Correlation Analysis
You can compute correlation statistics for a data set by using the following
function: =CORREL(B4:B23;C4:C23), which yields the correlation between the
heating cost and the minimum outside temperature (depending on your regional
settings, the argument separator may be a comma rather than a semicolon). A
correlation coefficient indicates the level of linear association between a pair
of variables. In this case, the correlation between the heating cost and the
minimum outside temperature is -0.81, implying a rather strong negative
correlation: if the outside temperature is low, the heating cost is high, and
vice versa.
Notice the high correlation between heating cost and the minimum temperature
(negative) and the age of the heating installation (positive). Also notice the
sometimes high correlations between the explanatory variables themselves.
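As a cross-check on the correlation coefficient, here is a minimal Python sketch of the Pearson correlation between heating cost and minimum temperature. The function name pearson and the variable names are assumptions made for this illustration; the data are columns B and C of the spreadsheet.

```python
from math import sqrt

heating_cost = [250, 360, 165, 43, 92, 200, 355, 290, 230, 120,
                73, 205, 400, 320, 72, 272, 94, 190, 235, 139]
temperature = [35, 29, 36, 60, 65, 30, 10, 7, 21, 55,
               54, 48, 20, 39, 60, 20, 58, 40, 27, 30]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

r = pearson(heating_cost, temperature)
print(f"correlation(cost, temperature) = {r:.2f}")
```

Calling pearson on other pairs of columns gives the remaining entries of a correlation matrix.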
3. Scatter Plots
Scatter plots are of great help in identifying the strength, nature and direction of
relationships between variables. In particular, they can highlight non-linear
relationships, which will not necessarily be apparent from the correlation values.
Since the observed correlation, -0.81, between the heating cost and the
minimum outside temperature suggests a strong (linear) relationship, let us
examine their scatter plot:
Select the data range B3:C23 (using the mouse);
Select Insert\Scatter Plot and then the first available type;
Specify the Chart title as Cost & Temperature, Value (X) Axis as
Temperature, and Value (Y) Axis as Cost.
[Figure: scatter plot "Cost & Temperature" of heating cost against minimum temperature]
The scatter plot confirms the rather strong, linear relationship between heating
cost and temperature, with heating cost declining as the temperature increases.
Similar scatter plots can be examined for other pairs of variables.
4. Simple Linear Regression Analysis
A regression analysis estimates the linear equation that best fits a set of data,
in the sense that it minimises the sum of the squared residuals. Let us perform a
regression analysis of heating cost as a function of the temperature, i.e.
Cost = a + b(Temperature) + e
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.81
R Square             0.66
Adjusted R Square    0.64
Standard Error       63.55
Observations         20

ANOVA
             df   SS       MS       F       Significance F
Regression   1    140215   140215   34.72   0.00
Residual     18   72701    4039
Total        19   212916

                      Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept             388.80         34.24            11.35    0.00      316.86      460.74
Minimum Temperature   -4.93          0.84             -5.89    0.00      -6.69       -3.17
The slope, -4.93, has a t-value of -5.89 (in absolute terms bigger than 2) and a
very small p-value (smaller than our significance level of 5%). The coefficient
related to the temperature variable is therefore significantly different from zero,
which can also be seen from the 95% confidence interval [-6.69; -3.17], which does
not include zero. We may conclude that there is a significant effect of temperature
on heating cost.
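The slope, its t-value and its confidence interval can be reproduced with a short least-squares sketch in Python. This is an illustration under stated assumptions, not the Regression tool's own procedure: the function name simple_ols is hypothetical, and the critical value 2.101 (Student's t, 18 degrees of freedom, two-sided 95%) is hard-coded because the standard library offers no t quantile function.

```python
from math import sqrt

temperature = [35, 29, 36, 60, 65, 30, 10, 7, 21, 55,
               54, 48, 20, 39, 60, 20, 58, 40, 27, 30]
heating_cost = [250, 360, 165, 43, 92, 200, 355, 290, 230, 120,
                73, 205, 400, 320, 72, 272, 94, 190, 235, 139]

# Assumption: two-sided 95% critical value of Student's t with n - 2 = 18
# degrees of freedom, taken from a t table.
T_CRIT_18 = 2.101

def simple_ols(xs, ys):
    """Least-squares fit of y = intercept + slope * x with standard errors."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    std_error = sqrt(ss_res / (n - 2))   # "Standard Error" of the regression
    slope_se = std_error / sqrt(sxx)     # standard error of the slope
    return intercept, slope, std_error, slope_se

intercept, slope, std_error, slope_se = simple_ols(temperature, heating_cost)
t_stat = slope / slope_se
ci_low = slope - T_CRIT_18 * slope_se
ci_high = slope + T_CRIT_18 * slope_se

print(f"intercept {intercept:.2f}, slope {slope:.2f}, t {t_stat:.2f}")
print(f"95% interval for the slope: [{ci_low:.2f}; {ci_high:.2f}]")
```

An interval that excludes zero leads to the same conclusion as a p-value below 5%.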
The regression model is able to explain 64% of the variability in heating cost in
terms of differences in outside temperature (Adjusted R2). The standard error of
the forecasts is 63.55, implying that if we want to make a prediction with roughly
95% confidence, we should subtract and add 127.10 (2*63.55) to the prediction to
obtain an approximate interval. For instance, for an outside temperature of 50, the
predicted heating cost is 388.80 - 4.93*50 = 142.30, and we expect the actual cost
to lie in the region of [142.30-127.10; 142.30+127.10] = [15.20; 269.40].
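The arithmetic of this rule of thumb can be written out as a small Python sketch. It uses the rounded coefficients from the regression output; the function name predict_with_band and its width parameter are assumptions made for the illustration.

```python
# Rounded values taken from the regression output above.
INTERCEPT = 388.80
SLOPE = -4.93
STD_ERROR = 63.55

def predict_with_band(temperature, width=2.0):
    """Point prediction plus or minus `width` standard errors (rough 95% band)."""
    point = INTERCEPT + SLOPE * temperature
    margin = width * STD_ERROR
    return point, point - margin, point + margin

point, low, high = predict_with_band(50)
print(f"prediction {point:.2f}, interval [{low:.2f}; {high:.2f}]")
```

The band has the same width for every temperature, which is a simplification: it ignores the extra uncertainty of predictions made far from the average temperature.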
The Regression tool also displays several charts (you may have to move them to
make them visible):
The Line Fit Plot (see below) shows actual costs and predicted costs, plotted
for different values of temperature. This plot is identical to the scatter plot of
cost and temperature we constructed earlier, with the predicted points
superimposed. The regression line is shown as points rather than as a line.
This can be changed by double-clicking the estimated points, and selecting
Patterns\Line\Automatic.
The Residual Plot shows the forecast errors versus temperature. If this plot
exhibits an obvious pattern, it would suggest that the model is ill-specified.
Ideally, the residuals should be random. Residual plots are also useful for
spotting outliers, i.e. data points much further from the regression line than
the others.
[Figure: Line Fit Plot, actual and predicted Heating Cost versus Minimum Temperature]
[Figure: Residual Plot, Residuals versus Minimum Temperature]
5. Multiple Linear Regression Analysis
The Adjusted R2 has increased from 64% to 75%, indicating that we are now able
to explain more of the variability in heating cost. The standard error of the
predictions has also decreased, from 63.55 to 52.72, enabling more accurate
predictions. However, although the coefficients related to the temperature and
the insulation are found to be significantly different from zero, those related
to the age of the installation and the number of windows are not. Therefore,
these variables should be excluded from the model and the model re-analysed. The
residual plots and line fit plots also need to be examined.
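The re-analysed model, with temperature and insulation as the only explanatory variables, can be sketched in Python by solving the two normal equations directly. This is an illustrative cross-check rather than the Regression tool's own algorithm; the helper names centred_sum and fit_two_predictors are assumptions.

```python
from math import sqrt

heating_cost = [250, 360, 165, 43, 92, 200, 355, 290, 230, 120,
                73, 205, 400, 320, 72, 272, 94, 190, 235, 139]
temperature = [35, 29, 36, 60, 65, 30, 10, 7, 21, 55,
               54, 48, 20, 39, 60, 20, 58, 40, 27, 30]
insulation = [3, 4, 7, 6, 5, 5, 6, 10, 9, 2,
              12, 5, 5, 4, 8, 5, 7, 8, 9, 7]

def centred_sum(us, vs):
    """Sum of products of deviations from the respective means."""
    n = len(us)
    mean_u, mean_v = sum(us) / n, sum(vs) / n
    return sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))

def fit_two_predictors(y, x1, x2):
    """Least squares for y = b0 + b1*x1 + b2*x2 via the 2x2 normal equations."""
    n = len(y)
    s11, s22, s12 = centred_sum(x1, x1), centred_sum(x2, x2), centred_sum(x1, x2)
    s1y, s2y, syy = centred_sum(x1, y), centred_sum(x2, y), centred_sum(y, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = sum(y) / n - b1 * sum(x1) / n - b2 * sum(x2) / n
    ss_reg = b1 * s1y + b2 * s2y          # regression sum of squares
    ss_res = syy - ss_reg                 # residual sum of squares
    r_square = ss_reg / syy
    adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - 3)
    std_error = sqrt(ss_res / (n - 3))    # n - 3: two slopes plus the intercept
    return b0, b1, b2, r_square, adj_r_square, std_error

b0, b1, b2, r_square, adj_r_square, std_error = fit_two_predictors(
    heating_cost, temperature, insulation)
print(f"Cost = {b0:.2f} + {b1:.2f}*Temperature + {b2:.2f}*Insulation")
print(f"R Square {r_square:.2f}, Adjusted R Square {adj_r_square:.2f}, "
      f"Standard Error {std_error:.2f}")
```

With more than two explanatory variables the same idea needs a general linear solver, which is what the Regression tool provides.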
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.90
R Square             0.80
Adjusted R Square    0.75
Standard Error       52.72
Observations         20

ANOVA
             df   SS       MS      F       Significance F
Regression   4    171227   42807   15.40   0.00
Residual     15   41689    2779
Total        19   212916

[Figure: residual plot]
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.88
R Square             0.78
Adjusted R Square    0.75
Standard Error       52.98
Observations         20

ANOVA
             df   SS       MS      F       Significance F
Regression   2    165195   82597   29.42   0.00
Residual     17   47721    2807
Total        19   212916

[Figure: residual plot]
6. Non-Linear Regression Analysis
In some cases a linear model is not suitable for modelling the relationship
between two variables. Let us have a look at another example: General
Public Electric (GPE). GPE operates 11 thermal power stations of basically
the same design. We will investigate the relationship between the cost
efficiency (pence per Kilowatt-hour) of the electricity generating plants, as a
function of their generating capacity (Megawatts installed). The object of
the exercise is to model the economy of scale effect which allows larger
plants to generate electricity at lower marginal cost per unit. In practice,
this analysis might be part of a larger exercise in which economy of scale
would be one of a variety of factors which would be taken into account in
deciding between alternative development plans. A more accurate
understanding of the relative efficiency of different size plants would make
it easier to balance this factor against capital investment costs,
environmental factors, construction time, etc. The data can be found in the
Excel workbook GPE.xls on the course Web site.
The spreadsheet, scatter plot and regression results are shown below.
     A      B         C
1    General Public Electric
2
3    Plant  Capacity  Cost per Unit
4    1      525       1.80
5    2      555       1.70
6    3      600       1.60
7    4      610       1.58
8    5      700       1.35
9    6      990       1.20
10   7      1100      1.13
11   8      1450      0.95
12   9      1950      0.85
13   10     1950      0.84
14   11     2400      0.75
[Figure: scatter plot of Cost per Unit versus Capacity]

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.94
R Square             0.88
Adjusted R Square    0.86
Standard Error       0.14
Observations         11

ANOVA
             df   SS     MS     F       Significance F
Regression   1    1.26   1.26   64.05   0.00
Residual     9    0.18   0.02
Total        10   1.43

[Figure: residual plot]
The model, namely
Cost = a + b(Capacity) + e
seems reasonable because of the high t-statistic related to the Capacity variable
and the high Adjusted R2 (86%), implying that there is a significant
relationship between capacity and cost, and that we are able to explain a
lot of the variability in costs purely by examining the capacity. Also, the
result is logical in the sense that we indeed observe an economy of scale
effect: cost decreases as capacity increases. For every additional Megawatt of
generating capacity, the unit cost decreases by 0.00053 pence per Kilowatt-hour.
However, if we examine the line fit plot and the residual plot (see below), we
observe the following:
the line does not fit the data perfectly: it slightly underestimates the cost
for small capacity values, overestimates it for medium capacity values and
again underestimates it for high capacity values;
the residual plot clearly exhibits a pattern: the errors are positive for small
capacity values, negative for medium capacity values and again positive for
high capacity values.
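The sign pattern described above can be checked numerically with a small Python sketch that fits the straight line and lists the residuals in order of capacity. The function name fit_line is an assumption for this illustration; the data are the Capacity and Cost per Unit columns of the GPE spreadsheet.

```python
capacity = [525, 555, 600, 610, 700, 990, 1100, 1450, 1950, 1950, 2400]
cost = [1.80, 1.70, 1.60, 1.58, 1.35, 1.20, 1.13, 0.95, 0.85, 0.84, 0.75]

def fit_line(xs, ys):
    """Least-squares intercept and slope of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

intercept, slope = fit_line(capacity, cost)

# Residual = actual cost minus the cost predicted by the straight line.
residuals = [y - (intercept + slope * x) for x, y in zip(capacity, cost)]

for x, res in zip(capacity, residuals):
    sign = "+" if res > 0 else "-"
    print(f"capacity {x:5d}  residual {res:+.3f}  {sign}")
```

A run of same-signed residuals at each end with the opposite sign in the middle is the kind of pattern that points to a curved relationship.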
[Figure: Line Fit Plot, actual and predicted Cost per Unit versus Capacity]
[Figure: Capacity Residual Plot, Residuals versus Capacity]
This indicates that the model is ill-specified, and more specifically that we have
been trying to fit a line to data which exhibits a non-linear relationship. We
therefore should look for a more suitable specification of the model.
Let us try to regress cost on the reciprocal of the plant capacity, i.e.
Cost = a + b(1/Capacity) + e
In order to transform the capacity data in our model, we add another column. In
cell D3, we enter the title 1/Capacity. In cell D4, we enter the formula =1/B4.
The value 0.0019 should appear (= 1/525). We select cell D4 and drag the
handle down to fill the entire column with the transformed capacity data. Now,
we can run another regression analysis, using the new column as the
explanatory variable.
Below, you will also find a plot of the estimated costs as a function of Capacity,
revealing the non-linear nature of the estimated relationship. In order to draw
such a graph, add another column in your spreadsheet with the cost predictions,
computed using the regression coefficients and the explanatory variables data.
Then draw a scatter plot of the predictions as well as the actual cost per unit data
versus capacity. Again, the predictions will be displayed as points rather than a
line, but this can be changed by double-clicking the points and selecting a
different format.
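The transformation and the extra prediction column can also be sketched in Python, as a cross-check on the spreadsheet steps. The names fit_line, inv_capacity and prediction are assumptions for this illustration; the sketch fits cost on Capacity and on 1/Capacity and compares the two R Square values.

```python
capacity = [525, 555, 600, 610, 700, 990, 1100, 1450, 1950, 1950, 2400]
cost = [1.80, 1.70, 1.60, 1.58, 1.35, 1.20, 1.13, 0.95, 0.85, 0.84, 0.75]

def fit_line(xs, ys):
    """Least-squares intercept, slope and R Square of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope, sxy ** 2 / (sxx * syy)

# New column D: the reciprocal of the capacity, as in the formula =1/B4.
inv_capacity = [1 / c for c in capacity]

_, _, r2_linear = fit_line(capacity, cost)
a, b, r2_reciprocal = fit_line(inv_capacity, cost)

# Prediction column: Cost = a + b * (1 / Capacity).
prediction = [a + b * u for u in inv_capacity]

print(f"R Square, linear model:     {r2_linear:.2f}")
print(f"R Square, reciprocal model: {r2_reciprocal:.2f}")
for c, actual, pred in zip(capacity, cost, prediction):
    print(f"capacity {c:5d}  actual {actual:.2f}  predicted {pred:.2f}")
```

Plotting prediction against capacity, rather than against inv_capacity, is what reveals the curved shape of the fitted relationship.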
SUMMARY OUTPUT

Regression Statistics
Multiple R           1.00
R Square             0.99
Adjusted R Square    0.99
Standard Error       0.04
Observations         11

ANOVA
             df   SS            MS            F             Significance F
Regression   1    1.418077648   1.418077648   957.9913963   1.87988E-10
Residual     9    0.013322352   0.001480261
Total        10   1.4314

[Figure: Line Fit Plot, Cost per Unit versus 1/Capacity]
[Figure: 1/Capacity Residual Plot, Residuals versus 1/Capacity]
[Figure: General Public Electric spreadsheet with the added 1/Capacity and Prediction columns, and a chart of actual and predicted cost versus capacity]