
Introductory Econometrics

Contents
1 Review of simple regression
  1.1 The Sample Regression Function
  1.2 Interpretation of regression as prediction
  1.3 Regression in Eviews
  1.4 Goodness of fit
  1.5 Derivations
    1.5.1 Summation notation
    1.5.2 Derivation of OLS
    1.5.3 Properties of predictions and residuals

2 Statistical Inference and the Population Regression Function
  2.1 Simple random sample
  2.2 Population distributions and parameters
  2.3 Population vs Sample
  2.4 Conditional Expectation
  2.5 The Population Regression Function
  2.6 Statistical Properties of OLS
    2.6.1 Properties of Expectations
    2.6.2 Unbiasedness
    2.6.3 Variance
    2.6.4 Asymptotic normality
  2.7 Summary

3 Hypothesis Testing and Confidence Intervals
  3.1 Hypothesis testing
    3.1.1 The null hypothesis
    3.1.2 The alternative hypothesis
    3.1.3 The null distribution
    3.1.4 The alternative distribution
    3.1.5 Decision rules and the significance level
    3.1.6 The t test: theory
    3.1.7 The t test: two sided example
    3.1.8 The t test: one sided example
    3.1.9 p-values
    3.1.10 Testing other null hypotheses
  3.2 Confidence intervals
  3.3 Prediction intervals
    3.3.1 Derivations

4 Multiple Regression
  4.1 Population Regression Function
  4.2 Sample Regression Function and OLS
  4.3 Example: house price modelling
  4.4 Statistical Inference
  4.5 Applications to house price regression
  4.6 Joint hypothesis tests
  4.7 Multicollinearity
    4.7.1 Perfect multicollinearity
    4.7.2 Imperfect multicollinearity

5 Dummy Variables
  5.1 Estimating two means
  5.2 Estimating several means
  5.3 Dummy variables in general regressions
    5.3.1 Dummies for intercepts
    5.3.2 Dummies for slopes

6 Some non-linear functional forms
  6.1 Quadratic regression
    6.1.1 Example: wages and work experience
  6.2 Regression with logs: explanatory variable
    6.2.1 Example: wages and work experience
  6.3 Regression with logs: dependent variable
    6.3.1 Example: modelling the log of wages
    6.3.2 Choosing between levels and logs for the dependent variable
  6.4 Practical summary of functional forms

7 Comparing regressions
  7.1 Adjusted R2
  7.2 Information criteria
  7.3 Adjusted R2 as an IC

8 Functional form

9 Regression and Causality
  9.1 Notation
  9.2 Regression for prediction
  9.3 Omitted variables
  9.4 Simultaneity
  9.5 Sample selection

10 Regression with Time Series
  10.1 Dynamic regressions
    10.1.1 Finite Distributed Lag model
    10.1.2 Autoregressive Distributed Lag model
    10.1.3 Forecasting
    10.1.4 Application
  10.2 OLS estimation
    10.2.1 Bias
    10.2.2 A general theory for time series regression
  10.3 Checking weak dependence
  10.4 Model specification
  10.5 Interpretation
    10.5.1 Interpretation of FDL models
    10.5.2 Interpretation of ARDL models

11 Regression in matrix notation
  11.1 Definitions
  11.2 Addition and Subtraction
  11.3 Multiplication
  11.4 The PRF
  11.5 Matrix Inverse
  11.6 OLS in matrix notation
    11.6.1 Proof
  11.7 Unbiasedness of OLS
  11.8 Time series regressions

1 Review of simple regression

1.1 The Sample Regression Function

Regression is the primary statistical tool used in econometrics to understand the relationship between variables. To illustrate, consider the dataset introduced in Example 2.3 of Wooldridge, which relates the salary paid to corporate chief executive officers to the return on equity achieved by their firms. Data are available for 209 firms. The idea is to examine whether the salaries paid to CEOs are related to the earnings of their firms, and specifically whether firms with higher incomes reward their CEOs with higher salaries. A scatter plot of the possible relationship is shown in Figure 1, which reveals the possibility of higher returns on equity corresponding to higher CEO salaries, but with some apparently very high salaries for a small number of CEOs also included (these are known as outliers, to be discussed later).
A regression line can be fit to this data using the method of Ordinary Least Squares (OLS), as shown in Figure 2. The OLS method works as follows. The dependent variable for the regression is denoted $y_i$, where the subscript $i$ refers to the number of the observation for $i = 1, \ldots, n$. In the example we have $n = 209$ and $y_i$ corresponds to the CEO salary for each of the 209 firms. The explanatory variable, or regressor, is denoted $x_i$ for $i = 1, \ldots, n$ and corresponds to the Return on Equity for each of the 209 firms. The data are shown in Table 1. The first observation in the dataset is $y_1 = 1095$ and $x_1 = 14.10$, meaning that the CEO of the first firm earns $1,095,000 and the firm's Return on Equity is 14.10%. The second observation is $y_2 = 1001$ and $x_2 = 10.90$, and so on, with the last observation being $y_{209} = 626$ and $x_{209} = 14.40$.
The regression line is a linear function of $x_i$ that is used to calculate a prediction of $y_i$, denoted $\hat{y}_i$. This regression line is expressed

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i, \qquad i = 1, \ldots, n. \qquad (1)$$

This is called the Sample Regression Function (SRF). The hat on top of any quantity implies that it is a prediction or an estimate that is calculated from the data. The method of OLS is used to calculate $\hat{\beta}_0$ and $\hat{\beta}_1$, respectively the intercept and the slope of the regression line. The prediction errors, or regression residuals, are denoted

$$\hat{u}_i = y_i - \hat{y}_i, \qquad i = 1, \ldots, n, \qquad (2)$$

and OLS chooses the values of $\hat{\beta}_0$ and $\hat{\beta}_1$ such that the overall residuals $(\hat{u}_1, \ldots, \hat{u}_n)$ are minimised, in the sense that the Sum of Squared Residuals (SSR)

$$SSR = \sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

is as small as possible. This is the sense in which the OLS regression line is known as the line of best fit.
The formulae for $\hat{\beta}_0$ and $\hat{\beta}_1$ are given by

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad (3)$$

and

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \qquad (4)$$

where $\bar{y}$ and $\bar{x}$ are the sample means

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$

The derivations are given below.


For the CEO salary data, the coefficients of the regression line can be calculated to be $\hat{\beta}_0 = 963.191$ and $\hat{\beta}_1 = 18.501$, so the regression line can be written

$$\hat{y}_i = 963.191 + 18.501\, x_i,$$

or equivalently using the names of the variables:

$$\widehat{salary}_i = 963.191 + 18.501\, RoE_i.$$

The interpretation of this regression line is that it gives a prediction of CEO salary in terms of the return on equity of the firm. For example, for the first firm the predicted salary on the basis of return on equity is

$$\hat{y}_1 = 963.191 + 18.501 \times 14.10 = 1224.1,$$

or $1,224,100, and the residual is

$$\hat{u}_1 = y_1 - \hat{y}_1 = 1095 - 1224.1 = -129.1,$$

or -$129,100. That is, the CEO of the first company in the dataset is earning $129,100 less than predicted by the firm's return on equity. Table 2 gives some of the values of $\hat{y}_i$ and $\hat{u}_i$ corresponding to the values of $y_i$ and $x_i$ given in Table 1.
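The OLS calculations above are simple enough to reproduce directly. The following sketch (an illustration only, not part of the original notes) implements equations (3) and (4) in Python; the roe and salary arrays named in the usage comment are hypothetical stand-ins for the 209 observations, with which the coefficients 963.191 and 18.501 would be reproduced.

```python
# Illustrative sketch, assuming numpy is available; the data arrays are hypothetical.
import numpy as np

def ols_simple(x, y):
    """Return the OLS intercept and slope from equations (4) and (3)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # (3)
    beta0_hat = y.mean() - beta1_hat * x.mean()                                        # (4)
    return beta0_hat, beta1_hat

# Hypothetical usage with the CEO data loaded into arrays `roe` and `salary`:
#   b0, b1 = ols_simple(roe, salary)   # approximately 963.191 and 18.501
#   y_hat_1 = b0 + b1 * 14.10          # prediction (1) for the first firm, about 1224.1
#   u_hat_1 = 1095 - y_hat_1           # residual (2), about -129.1
```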

Figure 1: Scatter plot of CEO salaries against Return on Equity

Figure 2: CEO salaries vs Return on Equity with OLS regression line

Table 1: Data on CEO salaries and Return on Equity

Observation (i)   Salary (y_i)   Return on Equity (x_i)
1                 1095           14.10
2                 1001           10.90
3                 1122           23.50
...               ...            ...
208               555            13.70
209               626            14.40

Table 2: CEO salaries and Return on Equity, with regression predictions and residuals

Observation (i)   Salary (y_i)   Return on Equity (x_i)   Predicted Salary (y_hat_i)   Residual (u_hat_i)
1                 1095           14.10                    1224.1                       -129.1
2                 1001           10.90                    1164.9                       -163.9
3                 1122           23.50                    1398.0                       -276.0
...               ...            ...                      ...                          ...
208               555            13.70                    1216.7                       -661.7
209               626            14.40                    1229.6                       -603.6

1.2 Interpretation of regression as prediction

The interpretation of the regression coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$ relies on the interpretation of regression as giving predictions for $y_i$ using $x_i$. For a general regression equation

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i,$$

the interpretation of $\hat{\beta}_0$ is that it is the predicted value of $y_i$ when $x_i = 0$. It depends on the application whether $x_i = 0$ is practically relevant. In the CEO salary example, a firm with zero return on equity (i.e. net income of zero) is predicted to have a CEO with a salary of $963,191. Such a prediction has some value in this case because it is possible for a firm to have zero net income in a particular year, and the data contain observations where the return on equity is quite close to zero. As a different example, if we had a regression of individual wages on age of the form $\widehat{wage}_i = \hat{\beta}_0 + \hat{\beta}_1\, age_i$, it would make no practical sense to predict the wage of an individual of age zero! In this case the intercept coefficient $\hat{\beta}_0$ does not have a natural interpretation.

The slope coefficient $\hat{\beta}_1$ measures the change in the predicted value $\hat{y}_i$ that would follow from a one unit increase in the regressor $x_i$. The predicted value of $y_i$ when the regressor takes the value $x_i$ is $\hat{\beta}_0 + \hat{\beta}_1 x_i$, while the predicted value when the regressor takes the value $x_i + 1$ is $\hat{\beta}_0 + \hat{\beta}_1 (x_i + 1)$. The change in the prediction based on this one unit increase in $x_i$ is therefore $\hat{\beta}_1$. In the CEO salary example, an increase of 1% in a firm's return on equity corresponds to a predicted increase of 18.501 ($18,501) in CEO salary. This quantifies how increases in firm income change our prediction for CEO salary. Econometrics is especially concerned with the estimation and interpretation of such slope coefficients.

1.3 Regression in Eviews

Eviews is statistical software designed specifically for econometric analysis. Data can be read in from Excel files and then easily analysed using OLS regression. The steps to carry out the CEO salary analysis in the previous section are presented here.

Figure 3: Excel spreadsheet for CEO salary data


Figure 3 shows part of an Excel spreadsheet containing the CEO salary data. The variable names are in the first row, followed by the observations for each variable. To open this file in Eviews, go to File - Open - Foreign Data as Workfile... as shown in Figure 4, and select the Excel file in the subsequent window. On opening the file, the dialog boxes in Figures 5, 6 and 7 can often be left unchanged. The first specifies the range of the data within the workfile (in this case the first two columns of Sheet 1), the second specifies that the variable names are contained in the first row of the spreadsheet, and the third specifies that a new workfile be created in Eviews to contain the data. For simple data sets such as this, the defaults in these dialog boxes will be correct. More involved data sets will be considered later. On clicking Finish in the final dialog box, the new workfile is displayed in Eviews; see Figure 8.

The Range of the workfile specifies the total number of observations available for analysis, in this case 209. The Sample of the workfile specifies which observations are currently being used for analysis, and this defaults to the full range of the workfile unless otherwise specified. There are four objects displayed in the workfile: c, resid, roe and salary. The first two of these will be present in any workfile. The c and resid objects contain the coefficient values and residuals from the most recently estimated regression. The objects roe and salary contain the data on those two variables. For example, double clicking on salary gives the object view shown in Figure 9, where the observations can be seen. Many other views are possible, but a common and important first step is to obtain some graphical and statistical summaries by selecting View - Descriptive Statistics & Tests - Histogram and Stats as shown in Figure 10. This results in Figure 11, where the histogram gives an idea of the distribution of the variable and the descriptive statistics provide measures of central tendency (mean, median), dispersion (maximum, minimum, standard deviation) and other measures. The mean CEO salary is $1,281,120 while the median is $1,039,000. The substantial difference between these two statistics arises because there are at least three very large salaries that are very influential on the mean, but not the median. These observations were also evident in the scatter plot in Figure 1.

Figure 4: To open an Excel file in Eviews...

Figure 5: ... specify the range of the data in the spreadsheet...

Figure 6: ... specify that the first header line contains the variable names (salary and ROE)...

Figure 7: ... and specify that a new undated workfile be created.

Figure 8: New Eviews workfile for CEO salary data

Figure 9: Contents of the salary object


The same descriptive statistics can be obtained for the Return on Equity variable.

The scatter plots in Figures 1 and 2 can be obtained by selecting Quick - Graph... as shown in Figure 12, entering roe salary into the resulting Series List box as shown in Figure 13, and then specifying a Scatter with regression line (if desired) as shown in Figure 14. The result is Figure 2.

The regression equation itself can be computed by selecting Quick - Estimate Equation... as shown in Figure 15, and then specifying the equation as shown in Figure 16. The dependent variable (salary) for the regression equation goes first, the c refers to the intercept of the equation $\hat{\beta}_0$, and then comes the explanatory variable (roe). The results of the regression calculation are shown in Figure 17. In particular the values of the intercept $\hat{\beta}_0 = 963.1913$ and the slope coefficient on RoE $\hat{\beta}_1 = 18.50119$ can be read from the Coefficient column of the tabulated results. The equation can be named as shown in Figure 18, which means that it will appear as an object in the workfile and can be saved for future reference.

To obtain the predicted values $\widehat{salary}_i$ for the regression, click on the Forecast button and enter a new variable name in the Forecast name box, say salary_hat, as shown in Figure 19. A new object called salary_hat is created in the workfile, and double clicking on it reveals the computed predictions, the first three of which correspond to the values given in Table 2 for $\hat{y}_i$.

To obtain the residuals $\hat{u}_i$ for the regression, select Proc - Make Residual Series in the equation window as shown in Figure 20 and name the new residuals object as shown in Figure 21. The resulting residuals for the CEO salary regression are shown in Figure 22, the first three of which correspond to the values given in Table 2 for $\hat{u}_i$.

Figure 10: Obtaining descriptive statistics

Figure 11: Descriptive statistics and histogram for the CEO salaries (Series: SALARY, 209 observations; Mean 1281.120, Median 1039.000, Maximum 14822.00, Minimum 223.0000, Std. Dev. 1372.345, Skewness 6.854923, Kurtosis 60.54128, Jarque-Bera 30470.10, Probability 0.000000)

Figure 12: Selecting Quick - Graph...

Figure 13: Variables to plot, roe first because it goes on the x-axis.

Figure 14: Selecting a scatter plot with regression line.

Figure 15: Selecting Quick - Estimate Equation...

Figure 16: Specifying a regression of CEO salary on an intercept and Return on Equity

Figure 17: Regression results for CEO salary on Return on Equity

Figure 18: Naming an equation to keep it in the workfile.

Figure 19: Use the Forecast procedure to calculate predicted values from the regression.

Figure 20: Make residuals object from a regression

Figure 21: Name the residuals series u_hat

Figure 22: The residuals from the CEO salary regression.

1.4 Goodness of fit

The equation (2) that defines the regression residuals can be written

$$y_i = \hat{y}_i + \hat{u}_i, \qquad (5)$$

which states that the regression decomposes each observation into a prediction ($\hat{y}_i$) that is a function of $x_i$, and the residual $\hat{u}_i$. Let $\widehat{var}(y_i)$ denote the sample variance of $y_1, \ldots, y_n$:

$$\widehat{var}(y_i) = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2,$$

and similarly $\widehat{var}(\hat{y}_i)$ and $\widehat{var}(\hat{u}_i)$ are the sample variances of $\hat{y}_1, \ldots, \hat{y}_n$ and $\hat{u}_1, \ldots, \hat{u}_n$. Some simple algebra (in section 1.5.3 below) shows that

$$\widehat{var}(y_i) = \widehat{var}(\hat{y}_i) + \widehat{var}(\hat{u}_i). \qquad (6)$$

(Note that (6) does not follow automatically from (5) and requires the additional property that $\sum_{i=1}^{n} \hat{y}_i \hat{u}_i = 0$.) Equation (6) shows that the variation in $y_i$ can be decomposed into the sum of the variation in the regression predictions $\hat{y}_i$ and the variation in the residuals $\hat{u}_i$. The variation of the regression predictions is referred to as the variation in $y_i$ that is explained by the regression. A common descriptive statistic is

$$R^2 = \frac{\widehat{var}(\hat{y}_i)}{\widehat{var}(y_i)},$$

which measures the goodness of fit of a regression as the proportion of variation in the dependent variable that is explained by the variation in $x_i$. The $R^2$ is known as the coefficient of determination and lies between 0 and 1. The closer $R^2$ is to one, the better the regression is said to fit. Note that this is just one criterion by which to evaluate the quality of a regression, and others will be given during the course.

It is common, as in Wooldridge, to express the $R^2$ definitions and algebra in terms of sums of squares rather than sample variances. Equation (6) can be written

$$\frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2 = \frac{1}{n-1}\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \frac{1}{n-1}\sum_{i=1}^{n} \hat{u}_i^2,$$

where use is made of $\sum_{i=1}^{n} \hat{u}_i = 0$, which in turn implies $\frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{n}\sum_{i=1}^{n} \hat{y}_i$ (see section 1.5.3 for the derivation). Cancelling the $1/(n-1)$ gives

$$SST = SSE + SSR,$$

where

$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 \quad \text{(total sum of squares)}$$

$$SSE = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \quad \text{(explained sum of squares)}$$

$$SSR = \sum_{i=1}^{n} \hat{u}_i^2 \quad \text{(residual sum of squares).}$$

In this case $R^2$ can equivalently be defined

$$R^2 = \frac{SSE}{SST}.$$

The $R^2$ for the CEO salary regression in Figure 17 is 0.0132, so that just 1.32% of the variation in CEO salaries is explained by the Return on Equity of the firm. This low $R^2$ (i.e. close to zero) need not imply the regression is useless, but it does imply that CEO salaries are determined by other important factors besides just the profitability of the firm.
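As a quick check of the algebra above, the decomposition SST = SSE + SSR and the resulting $R^2$ can be computed in a few lines. This is an illustrative sketch (not part of the notes); any dependent variable and its OLS fitted values can be passed in.

```python
# Illustrative sketch using numpy; y and y_hat are any arrays of observations and OLS predictions.
import numpy as np

def goodness_of_fit(y, y_hat):
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
    sse = np.sum((y_hat - y.mean()) ** 2)   # explained sum of squares
    ssr = np.sum((y - y_hat) ** 2)          # residual sum of squares
    assert np.isclose(sst, sse + ssr)       # SST = SSE + SSR holds for OLS predictions
    return sse / sst                        # R^2 = SSE / SST

# Hypothetical usage with the CEO regression: goodness_of_fit(salary, 963.191 + 18.501 * roe)
# would return approximately 0.0132.
```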
Some intuition for what $R^2$ is measuring can be found in Figures 23 and 24, which show two hypothetical regressions with $R^2 = 0.185$ and $R^2 = 0.820$ respectively. The data in Figure 24 are less dispersed around the regression line, so that changes in $x_i$ more precisely predict changes in $y_{2,i}$ than in $y_{1,i}$. There is more variation in $y_{1,i}$ that is left unexplained by the regression.

Figure 25 gives one example of how $R^2$ does not always provide a foolproof measure of the quality of a regression. The regression in Figure 25 has $R^2 = 0.975$, very close to the maximum possible value of one, but the scatter plot clearly reveals that the regression does not explain an important feature of the relationship between $y_{3,i}$ and $x_i$: there is some curvature or non-linearity that is not captured by the regression. A high $R^2$ is a nice property for a regression to have, but is neither necessary nor sufficient for a regression to be useful.

1.5 Derivations

1.5.1 Summation notation

It will be necessary to know some simple properties of summation operators to follow the derivations. The summation operator is defined by

$$\sum_{i=1}^{n} a_i = a_1 + a_2 + \ldots + a_n.$$

Figure 23: $R^2 = 0.185$

Figure 24: $R^2 = 0.820$

Figure 25: $R^2 = 0.975$


It follows that

$$\sum_{i=1}^{n} (a_i + b_i) = \sum_{i=1}^{n} a_i + \sum_{i=1}^{n} b_i.$$

If $c$ is a constant (i.e. does not vary with $i$) then

$$\sum_{i=1}^{n} c = \underbrace{c + c + \ldots + c}_{n \text{ times}} = nc.$$

Similarly

$$\sum_{i=1}^{n} c\, a_i = c \sum_{i=1}^{n} a_i,$$

which is an extension of taking $c$ outside the brackets in $(ca_1 + ca_2) = c(a_1 + a_2)$.

The sample mean of $a_1, \ldots, a_n$ is

$$\bar{a} = \frac{1}{n}\sum_{i=1}^{n} a_i,$$

and then

$$\sum_{i=1}^{n} (a_i - \bar{a}) = \sum_{i=1}^{n} a_i - n\bar{a} = n\bar{a} - n\bar{a} = 0, \qquad (7)$$

so that the sum of deviations around the sample mean is always exactly zero.

The sum of squares of $a_i$ around the sample mean $\bar{a}$ can be expressed

$$\sum_{i=1}^{n} (a_i - \bar{a})^2 = \sum_{i=1}^{n} \left( a_i^2 - 2 a_i \bar{a} + \bar{a}^2 \right) = \sum_{i=1}^{n} a_i^2 - 2\bar{a}\sum_{i=1}^{n} a_i + \sum_{i=1}^{n} \bar{a}^2 = \sum_{i=1}^{n} a_i^2 - 2n\bar{a}^2 + n\bar{a}^2 = \sum_{i=1}^{n} a_i^2 - n\bar{a}^2.$$

1.5.2 Derivation of OLS

Consider a prediction for $y_i$ of the form

$$\tilde{y}_i = b_0 + b_1 x_i,$$

where $b_0$ and $b_1$ could be any coefficients. The residuals from this predictor are

$$\tilde{u}_i = y_i - \tilde{y}_i = y_i - b_0 - b_1 x_i.$$

The idea of OLS is to choose the values of $b_0$ and $b_1$ that minimise the sum of squared residuals

$$SSR(b_0, b_1) = \sum_{i=1}^{n} \tilde{u}_i^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2.$$

The minimisation can be done using calculus. The first derivatives of $SSR(b_0, b_1)$ with respect to $b_0$ and $b_1$ are

$$\frac{\partial SSR(b_0, b_1)}{\partial b_0} = -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i),$$

$$\frac{\partial SSR(b_0, b_1)}{\partial b_1} = -2 \sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i).$$

Setting these first derivatives to zero at the desired estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ gives the first order conditions

$$\sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) = 0, \qquad (8)$$

$$\sum_{i=1}^{n} x_i \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) = 0. \qquad (9)$$

(See equations (2.14) and (2.15) of Wooldridge, who takes a different approach to arrive at these equations.)

The first equation can be written

$$\sum_{i=1}^{n} y_i - n\hat{\beta}_0 - \hat{\beta}_1 \sum_{i=1}^{n} x_i = 0,$$

which is equivalent (after dividing both sides by $n$) to

$$\bar{y} - \hat{\beta}_0 - \hat{\beta}_1 \bar{x} = 0.$$

Solving for $\hat{\beta}_0$ gives (4).

Substituting this expression for $\hat{\beta}_0$ into the second equation gives

$$\sum_{i=1}^{n} x_i \left( y_i - \bar{y} - \hat{\beta}_1 (x_i - \bar{x}) \right) = 0,$$

or

$$\sum_{i=1}^{n} x_i (y_i - \bar{y}) - \hat{\beta}_1 \sum_{i=1}^{n} x_i (x_i - \bar{x}) = 0.$$

Notice that

$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i (y_i - \bar{y}),$$

and similarly

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i (x_i - \bar{x}),$$

so the first order condition for $\hat{\beta}_1$ can be written

$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) - \hat{\beta}_1 \sum_{i=1}^{n} (x_i - \bar{x})^2 = 0,$$

which leads to (3).
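The calculus above can also be verified symbolically. The following sketch (an illustration only, not part of the notes) solves the first order conditions (8) and (9) for a small $n = 3$ case and confirms that the solution matches the formulae (3) and (4).

```python
# Illustrative sketch using sympy for a symbolic check with n = 3 observations.
import sympy as sp

n = 3
b0, b1 = sp.symbols("b0 b1")
x = sp.symbols("x1:4")          # symbolic regressors x1, x2, x3
y = sp.symbols("y1:4")          # symbolic dependent variables y1, y2, y3

SSR = sum((y[i] - b0 - b1 * x[i]) ** 2 for i in range(n))

# First order conditions (8) and (9): set both partial derivatives to zero and solve.
sol = sp.solve([sp.diff(SSR, b0), sp.diff(SSR, b1)], [b0, b1], dict=True)[0]

xbar = sum(x) / n
ybar = sum(y) / n
beta1 = sum((x[i] - xbar) * (y[i] - ybar) for i in range(n)) / sum((x[i] - xbar) ** 2 for i in range(n))
beta0 = ybar - beta1 * xbar

print(sp.simplify(sol[b1] - beta1))   # 0, so the solution matches (3)
print(sp.simplify(sol[b0] - beta0))   # 0, so the solution matches (4)
```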


1.5.3 Properties of predictions and residuals

The OLS residuals

$$\hat{u}_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$$

satisfy

$$\sum_{i=1}^{n} \hat{u}_i = 0, \qquad (10)$$

because of (8), and hence $\bar{\hat{u}} = 0$. Similarly because of (9) the residuals satisfy

$$\sum_{i=1}^{n} x_i \hat{u}_i = 0. \qquad (11)$$

From these two it follows that

$$\sum_{i=1}^{n} \hat{y}_i = \sum_{i=1}^{n} (y_i - \hat{u}_i) = \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \hat{u}_i = \sum_{i=1}^{n} y_i, \qquad (12)$$

so the OLS predictions $\hat{y}_i$ have the same sum, and hence the same sample mean, as the original dependent variable $y_i$. Also

$$\sum_{i=1}^{n} \hat{y}_i \hat{u}_i = \hat{\beta}_0 \sum_{i=1}^{n} \hat{u}_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i \hat{u}_i = 0. \qquad (13)$$

Now consider the total sum of squares

$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i + \hat{y}_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{u}_i + \hat{y}_i - \bar{y})^2$$

$$= \sum_{i=1}^{n} \hat{u}_i^2 + 2\sum_{i=1}^{n} \hat{u}_i (\hat{y}_i - \bar{y}) + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$

$$= \sum_{i=1}^{n} \hat{u}_i^2 + 2\sum_{i=1}^{n} \hat{u}_i \hat{y}_i - 2\bar{y}\sum_{i=1}^{n} \hat{u}_i + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$

$$= SSR + SSE.$$

The last step uses (13) and (10) to cancel the middle two terms. It also uses (12) to identify $SSE = \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$. Also (10) implies that $\bar{\hat{u}} = 0$, so that $SSR = \sum_{i=1}^{n} \hat{u}_i^2$ does not require the subtraction of $\bar{\hat{u}}$. Dividing this equality through by $n-1$ gives (6).
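These properties are easy to confirm numerically for any fitted regression. The sketch below (illustrative only, using arbitrary simulated data) checks (10), (11), (13) and the SST = SSE + SSR decomposition.

```python
# Illustrative numerical check of the residual properties (10), (11), (13) and SST = SSE + SSR.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
y = 5 + 0.5 * x + rng.normal(0, 1, size=100)     # any data will do

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x
u_hat = y - y_hat

print(np.sum(u_hat))          # (10): essentially zero (up to rounding error)
print(np.sum(x * u_hat))      # (11): essentially zero
print(np.sum(y_hat * u_hat))  # (13): essentially zero
print(np.isclose(np.sum((y - y.mean()) ** 2),
                 np.sum((y_hat - y.mean()) ** 2) + np.sum(u_hat ** 2)))  # SST = SSE + SSR
```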

2 Statistical Inference and the Population Regression Function

The review of regression in section 1 suggests that regression is a useful tool for summarising the relationship between two observed variables and for calculating predictions for one variable based on observations on the other. In econometrics we want to do more than this. We want to use the information contained in a sample to carry out inductive inference (statistical inference) on the underlying population from which the sample was drawn. For example, we want to take the sample of 209 CEO salaries in section 1 as being representative of the salaries of CEOs in the population of all firms. In practice it is necessary to be very careful about the definition of this population. This dataset, taken from the American textbook of Wooldridge, would best be taken as being representative of only US firms, rather than all firms in the world, or all firms in OECD countries. In fact the population may be US publicly listed firms, since firms unlisted on the stock market may have quite different processes for executive salaries. Nevertheless, with the population carefully defined, the idea of statistical inference is to make statistical statements about that population, not only the sample that has been observed.

2.1 Simple random sample

Suppose there is a well defined population in which we are interested, e.g. the population of publicly listed firms in the US. A simple random sample is one in which each firm in the population has an equal probability of being included in the sample. Moreover each firm in the sample is chosen independently of all the others. That is, the probability of inclusion or exclusion of one firm in the sample does not depend on the inclusion or exclusion of any other firm.

For each firm included in the sample, we take one or more measurements of interest (e.g. CEO salary and the firm's Return on Equity). Mathematically these are represented as the random variables $y_i$ and $x_i$ for $i = 1, \ldots, n$, where $n$ is the sample size. The concept of a random variable reflects the idea that the values taken in the sample would have been different if a different random sample had been drawn. In the observed sample we had $y_1 = 1095$, $y_2 = 1001$, etc., but if another simple random sample had been drawn then different firms would (most likely) have been chosen and the $y_i$ values would have been different. That is, random variables take different values if a different sample is drawn.

In the population of all firms, there is a distribution of CEO salaries. The random variables $y_1, y_2, \ldots$ are independent random drawings from this distribution. That is, each of $y_1, y_2, \ldots$ is independent of the others, and all are drawn from the same underlying distribution. They are therefore called independent and identically distributed random variables, abbreviated as i.i.d.

There are many other sampling schemes that may arise in practice, some of which will be introduced later. Our initial discussion of regression modelling will be confined to cross-sectional data drawn as a simple random sample, i.e. confined to i.i.d. random variables.

2.2 Population distributions and parameters

The population distribution of interest will be characterised by certain parameters of interest. For example, the distribution of CEO salaries for the population of publicly listed firms in the US will have some population mean that could be denoted $\mu$. The population mean is defined mathematically as the expected value of the population distribution. This expected value is the weighted average of all possible values in the population, with weights given by the probability distribution denoted $f(y)$, i.e. $\mu = \int y f(y)\, dy$. (The evaluation of such integrals will not be required here, but see Appendix B.3 of Wooldridge for some more details.) Since each of the random variables $y_1, y_2, \ldots$ represents a random drawing from the population distribution, each of them also has a mean of $\mu$. This is written

$$E(y_i) = \mu, \qquad i = 1, 2, \ldots. \qquad (14)$$

Similarly the population distribution of CEO salaries has some variance that could be denoted $\sigma^2$, which is defined in terms of $y_i$ as

$$\sigma^2 = var(y_i) = E\left[ (y_i - \mu)^2 \right], \qquad i = 1, 2, \ldots.$$

(Or, in terms of integrals, as $\sigma^2 = \int (y - \mu)^2 f(y)\, dy$.)

2.3 Population vs Sample

It will be important throughout to be clear on the distinction between the population and the sample. The population is too large, or too unwieldy, or simply impossible to fully observe and measure. Therefore a quantity such as the population mean $\mu = E(y_i)$ is also impossible to observe. Instead we take a sample, which is a subset of the population, and attempt to estimate the population mean based on that sample. An obvious (but not the only) statistic to use to estimate $\mu$ is the sample mean $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$. A sample statistic such as $\bar{y}$ is observable, e.g. $\bar{y} = 1281.12$ for the CEO salary data (see Figure 11).

It is vital at all times to keep clear the distinction between an unobservable population parameter like $\mu = E(y_i)$, about which we wish to learn, and an observable sample statistic $\bar{y}$ that we use to estimate $\mu$. More generally we want to use $\bar{y}$ (and perhaps other statistics) to draw statistical inferences about $\mu$.

2.4 Conditional Expectation

In econometrics we are nearly always interested in at least two random variables in a population, e.g. $y_i$ for CEO salary and $x_i$ for Return on Equity, and the relationships between them. Of central interest in econometrics is the conditional distribution of $y_i$ given $x_i$. That is, rather than being interested in the distribution of CEO salaries in isolation (the so-called marginal distribution of $y_i$), we are interested in how the distribution of CEO salaries changes as the Return on Equity of the firm changes. For regression analysis, the fundamental population quantity of interest is the conditional expectation of $y_i$ given $x_i$, which is denoted by the function

$$E(y_i \mid x_i) = \mu(x_i). \qquad (15)$$

(As outlined in Appendix B.4 of Wooldridge, this conditional expectation is defined as $\mu(x) = \int y f_{Y|X}(y \mid x)\, dy$, where $f_{Y|X}$ is the conditional distribution of $y_i$ given $x_i$.) Much of econometrics is devoted to estimating conditional expectations functions.

The idea is that $E(y_i \mid x_i)$ provides the prediction of $y_i$ corresponding to a given value of $x_i$ (i.e. the value of $y_i$ that we would expect given some value of $x_i$). For example, $\mu(10)$ is the population mean of CEO salary for a firm with Return on Equity of 10%. This conditional mean will be different (perhaps lower?) than $\mu(20)$, which is the population mean of CEO salary for a firm with Return on Equity of 20%. If the population mean of $y_i$ changes when we change the value of $x_i$, there is a potentially interesting relationship between $y_i$ and $x_i$ to explore.

Consider the difference between the unconditional mean $\mu = E(y_i)$ given in (14) and the conditional mean $\mu(x_i) = E(y_i \mid x_i)$ given in (15). These are different population quantities with different uses. The unconditional mean $\mu$ provides an overall measure of central tendency for the distribution of $y_i$ but provides no information on the relationship between $y_i$ and $x_i$. The conditional mean $\mu(x_i)$, by contrast, describes how the predicted/mean value of $y_i$ changes with $x_i$. For example, $\mu$ is of interest if we want to investigate the overall average level of CEO salaries (perhaps to compare them to other occupations, say), while $\mu(x_i)$ is of interest if we want to start to try to understand what factors may help explain the level of CEO salaries.

Note also that $\mu$ is, by definition, a single number. On the other hand $\mu(x_i)$ is a function, that is, it is able to take different values for different values of $x_i$.

2.5 The Population Regression Function

The Population Regression Function (PRF) is, by definition, the conditional expectations function (15). In a simple regression analysis, it is assumed that this function is linear, i.e.

$$E(y_i \mid x_i) = \beta_0 + \beta_1 x_i. \qquad (16)$$

This linearity assumption need not always be true and is discussed more later. This PRF specifies the conditional mean of $y_i$ in the population for any value of $x_i$. It specifies one important aspect of the relationship between $y_i$ and $x_i$.

Statistical inference in regression models is about using sample information to learn about $E(y_i \mid x_i)$, which in the case of (16) amounts to learning about $\beta_0$ and $\beta_1$. Consider the SRF introduced in (1), restated here:

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i. \qquad (17)$$

The idea is that $\hat{\beta}_0$ and $\hat{\beta}_1$ are the sample OLS estimators that we calculate to estimate the unobserved population coefficients $\beta_0$ and $\beta_1$. Then for any $x_i$ we can use the sample predicted value $\hat{y}_i$ to estimate the conditional expectation $E(y_i \mid x_i)$.

2.6 Statistical Properties of OLS

An important question is whether the OLS SRF (17) provides a good estimator of the PRF (16) in some sense. In this section we address this question assuming that

A1: $(y_i, x_i)_{i=1}^{n}$ are i.i.d. random variables (i.e. from a simple random sample);

A2: the linear form (16) of the PRF is correct.

Estimators in statistics (such as a sample mean $\bar{y}$ or regression coefficients $\hat{\beta}_0, \hat{\beta}_1$) can be considered to be random variables since they are functions of the random variables that represent the data. For example the sample mean $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ is a random variable because it is defined in terms of the random variables $y_1, \ldots, y_n$. That is, if a different random sample had been drawn for $y_1, \ldots, y_n$ then a different value for $\bar{y}$ would be obtained. The distribution of an estimator is called the sampling distribution of the estimator. The statistical properties of an estimator are derived from its sampling distribution.
2.6.1 Properties of Expectations

The properties of a sampling distribution are often defined in terms of its mean and variance and other similar quantities. To work these out, it is necessary to use some simple properties of the expectations operator $E$ and the conditional expectations operator, summarised here.

Suppose $z_1, \ldots, z_n$ are i.i.d. random variables and $c_1, \ldots, c_n$ are non-random. Then

E1: $E\left( \sum_{i=1}^{n} c_i z_i \right) = \sum_{i=1}^{n} c_i E(z_i)$

E2: $var\left( \sum_{i=1}^{n} c_i z_i \right) = \sum_{i=1}^{n} c_i^2 var(z_i)$

E3: $E(c_i) = c_i$, $var(c_i) = 0$.

Property E1 continues to hold if the $z_i$ are not i.i.d. (for example, if they are correlated with each other) but Property E2 does not continue to hold if the $z_i$ are correlated. Recall from Assumption A1 that, at least for now, we are assuming that the random variables $y_i$ and $x_i$ are each i.i.d. across $i$. Property E3 simply states that the expectation of a constant ($c_i$) is itself, and that a constant has no variation.
In view of the definition of the PRF (16), conditional expectations are fundamental to regression analysis. It turns out to be useful to be able to work with not only $E(y_i \mid x_i)$ but also $E(y_i \mid x_1, \ldots, x_n)$, which is the conditional expectation of $y_i$ given information on the explanatory variables for all observations, not only observation $i$. The reason for this becomes clear in the following section. Under Assumption A1,

$$E(y_i \mid x_i) = E(y_i \mid x_1, \ldots, x_n). \qquad (18)$$

This can be proven formally, but the intuition is simply that under independent sampling, information in the explanatory variables $x_j$ for $j \neq i$ is not informative about $y_i$, since $(y_i, x_i)$ and $(y_j, x_j)$ are independent for all $j \neq i$. That is, knowing $x_j$ for $j \neq i$ does not change our prediction of $y_i$. For example, our prediction of the CEO salary for firm 1 is not improved by knowing the Return on Equity of any other firms; it is assumed to be explained only by the performance of firm 1. That is,

$$E(salary_i \mid RoE_i) = E(salary_i \mid RoE_1, \ldots, RoE_n).$$

Equation (18) is reasonable under Assumption A1, but not in other sampling situations such as the time series data considered later.

The conditional variance of a random variable is a measure of its conditional dispersion around its conditional mean. For example

$$var(y_i \mid x_i) = E\left[ \left( y_i - E(y_i \mid x_i) \right)^2 \mid x_i \right].$$

(Compare this to the unconditional variance $var(y_i) = E\left[ (y_i - E(y_i))^2 \right]$.) The conditional variance of $y_i$ is the variation in $y_i$ that remains when $x_i$ is given a fixed value. The unconditional variance of $y_i$ is the overall variation in $y_i$, averaged across all $x_i$ values. It follows that, on average, $var(y_i \mid x_i) \leq var(y_i)$. If $y_i$ and $x_i$ are independent then $var(y_i \mid x_i) = var(y_i)$. It is frequently the case in practice that $var(y_i \mid x_i)$ varies in important ways with $x_i$. For example, it may be that CEO salaries are more highly variable for more profitable firms than for less profitable firms. Or, if $y_i$ is wages and $x_i$ is individual age, then it is likely that the variation in wages across individuals becomes greater as age increases. If $var(y_i \mid x_i)$ varies with $x_i$ then this is called heteroskedasticity. If $var(y_i \mid x_i)$ is constant across $x_i$ then this is called homoskedasticity.
Under Assumption A1, the conditional expectations operator has properties similar to E1 and E2. Suppose $c_1, \ldots, c_n$ are either non-random or functions of $x_1, \ldots, x_n$ only (i.e. not functions of $y_1, \ldots, y_n$). Then

CE1: $E\left( \sum_{i=1}^{n} c_i y_i \mid x_1, \ldots, x_n \right) = \sum_{i=1}^{n} c_i E(y_i \mid x_i)$

CE2: $var\left( \sum_{i=1}^{n} c_i y_i \mid x_1, \ldots, x_n \right) = \sum_{i=1}^{n} c_i^2 var(y_i \mid x_i)$

Without i.i.d. sampling (e.g. time series), CE1 would continue to hold in the form $E\left( \sum_{i=1}^{n} c_i y_i \mid x_1, \ldots, x_n \right) = \sum_{i=1}^{n} c_i E(y_i \mid x_1, \ldots, x_n)$, while CE2 would generally not be true.
The final very useful property of conditional expectations is the Law of Iterated Expectations:

LIE: For any random variables $z$ and $x$, $E[z] = E[E(z \mid x)]$.

The LIE may appear odd at first but is very useful and has some intuition. Leaving aside the regression context, let $z$ represent the outcome from a roll of a die, i.e. a number from $1, 2, \ldots, 6$. The expected value of this random variable is $E(z) = \frac{1}{6}(1 + 2 + \ldots + 6) = 3.5$, since the probability of each possible outcome is $\frac{1}{6}$. Now suppose we define another random variable $x$ that takes the value 0 if $z$ is even and 1 if $z$ is odd. That is, $x = 0$ if $z = 2, 4, 6$ and $x = 1$ if $z = 1, 3, 5$, so that $\Pr(x = 0) = \frac{1}{2}$ and $\Pr(x = 1) = \frac{1}{2}$. It should be clear that $E(z \mid x = 0) = 4$ and $E(z \mid x = 1) = 3$, which illustrates the idea that conditional expectations can take different values (4 or 3) when the conditioning variables take different values (0 or 1). The expected value of the random variable $E(z \mid x)$ is taken as an average over the possible $x$ values, that is, $E[E(z \mid x)] = \frac{1}{2}(4 + 3) = 3.5$, since the probability of each possible outcome of $E(z \mid x)$ is $\frac{1}{2}$. This illustrates the LIE, i.e. $E(z) = E[E(z \mid x)] = 3.5$. While $E[E(z \mid x)]$ may appear more complicated than $E(z)$, it frequently turns out to be easier to work with.

The LIE also has a version in variances:

LIEvar: $var(z) = E[var(z \mid x)] + var[E(z \mid x)]$.

This shows that the variance of a random variable can be decomposed into its average conditional variance given $x$ and the variance of the regression function on $x$.
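The die example and the variance decomposition can be checked by simulation. The sketch below (illustrative only, not from the notes) draws a large number of die rolls and compares the two sides of the LIE and LIEvar properties.

```python
# Illustrative simulation check of LIE and LIEvar for the die example.
import numpy as np

rng = np.random.default_rng(0)
z = rng.integers(1, 7, size=1_000_000)       # die rolls, values 1..6
x = (z % 2 != 0).astype(int)                 # x = 0 if z is even, 1 if z is odd

E_z = z.mean()                                               # E(z), close to 3.5
E_z_given_x = np.array([z[x == 0].mean(), z[x == 1].mean()]) # close to 4 and 3
lie = E_z_given_x.mean()     # E[E(z|x)] with Pr(x=0) = Pr(x=1) = 1/2, close to 3.5

var_within = np.array([z[x == 0].var(), z[x == 1].var()]).mean()  # E[var(z|x)]
var_between = E_z_given_x.var()                                   # var[E(z|x)]
print(E_z, lie)                           # LIE: both approximately 3.5
print(z.var(), var_within + var_between)  # LIEvar: both approximately 35/12
```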
2.6.2 Unbiasedness

An estimator is defined to be unbiased if the mean of its sampling distribution is equal to the true value of the parameter being estimated. If $\hat{\theta}$ is any estimator of a parameter $\theta$, it is unbiased if $E(\hat{\theta}) = \theta$. The idea is that an unbiased estimator is one that does not systematically under-estimate or over-estimate the true value $\theta$. Some samples from the population will give values of $\hat{\theta}$ below $\theta$ and some samples will give values of $\hat{\theta}$ above $\theta$, and these differences average out. In practice we only get to observe a single value of $\hat{\theta}$ of course, and this single value may differ from $\theta$ by being too large or too small. It is only on average over all possible samples that the estimator gives $\theta$. So unbiasedness is a desirable property for a statistical estimator, although not one that occurs very often. However, in linear regression models there are situations where the OLS estimator can be shown to be unbiased. We consider the unbiasedness of the sample mean first, and then the OLS estimator of the slope coefficient in a simple regression.
Let $\mu_y = E(y_i)$ denote the population mean of the i.i.d. random variables $y_1, \ldots, y_n$. Then

$$E(\bar{y}) = E\left( \frac{1}{n}\sum_{i=1}^{n} y_i \right) = \frac{1}{n}\sum_{i=1}^{n} E(y_i) = \frac{1}{n}\sum_{i=1}^{n} \mu_y = \mu_y, \qquad (19)$$

where the second step uses Property E1 above, and this shows that the sample mean is an unbiased estimator of the population mean.
Under Assumptions A1 and A2 above, the OLS estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ can be shown to be unbiased. Just $\hat{\beta}_1$ is considered here. First recall the property of zero sums around sample means (7), which implies

$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0, \qquad (20)$$

and

$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} (x_i - \bar{x})\, y_i, \qquad (21)$$

and similarly

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} (x_i - \bar{x})\, x_i. \qquad (22)$$
Using (21) allows $\hat{\beta}_1$ in (3) to be written

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})\, y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \sum_{i=1}^{n} \frac{(x_i - \bar{x})}{\sum_{j=1}^{n} (x_j - \bar{x})^2}\, y_i = \sum_{i=1}^{n} a_{n,i}\, y_i. \qquad (23)$$

This shows that $\hat{\beta}_1$ is a weighted sum of $y_1, \ldots, y_n$, with the weight on each observation $y_i$ being given by

$$a_{n,i} = \frac{(x_i - \bar{x})}{\sum_{j=1}^{n} (x_j - \bar{x})^2}, \qquad (24)$$

which for each $i$ depends on all of $x_1, \ldots, x_n$ (hence the subscript $n$ included in the $a_{n,i}$ notation). Now use the LIE to write

$$E(\hat{\beta}_1 \mid x_1, \ldots, x_n) = \sum_{i=1}^{n} a_{n,i}\, E(y_i \mid x_1, \ldots, x_n) = \sum_{i=1}^{n} a_{n,i} (\beta_0 + \beta_1 x_i) = \beta_0 \sum_{i=1}^{n} a_{n,i} + \beta_1 \sum_{i=1}^{n} a_{n,i} x_i, \qquad (25)$$

where the second equality uses (18), which holds under Assumption A1. Using (20) gives

$$\sum_{i=1}^{n} a_{n,i} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = 0,$$

and using (22) gives

$$\sum_{i=1}^{n} a_{n,i} x_i = \frac{\sum_{i=1}^{n} (x_i - \bar{x})\, x_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = 1.$$

Substituting these into (25) gives

$$E(\hat{\beta}_1 \mid x_1, \ldots, x_n) = \beta_1,$$

and hence, applying the LIE,

$$E(\hat{\beta}_1) = E\left[ E(\hat{\beta}_1 \mid x_1, \ldots, x_n) \right] = E[\beta_1] = \beta_1. \qquad (26)$$

This shows that $\hat{\beta}_1$ is an unbiased estimator of $\beta_1$.
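Unbiasedness can be illustrated with a small Monte Carlo experiment. In the sketch below (illustrative only; the true coefficients and error distribution are chosen arbitrarily), many samples are drawn from a population with PRF $E(y_i \mid x_i) = 1 + 2 x_i$, and the average of the OLS slope estimates across samples is close to the true value 2.

```python
# Illustrative Monte Carlo: the average of beta1_hat over many samples is close to beta1.
import numpy as np

rng = np.random.default_rng(42)
beta0, beta1, n, reps = 1.0, 2.0, 50, 10_000
estimates = np.empty(reps)

for r in range(reps):
    x = rng.uniform(0, 10, size=n)
    y = beta0 + beta1 * x + rng.normal(0, 3, size=n)   # data generated from the PRF plus noise
    estimates[r] = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

print(estimates.mean())   # close to 2.0: no systematic over- or under-estimation
```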

2.6.3 Variance

The variance of an estimator measures how dispersed values of the estimator can be around the
mean. In general it is preferred for an estimator to have a small variance, implying that it tends
not to produce estimates very far from its mean. This is especially so for an unbiased estimator,
for which a small variance implies the distribution of the estimator is closely concentrated around
the true population value of the parameter of interest.
For the sample mean, consider again the i.i.d. random variables $y_1, \ldots, y_n$, each with population mean $\mu_y = E(y_i)$ and population variance $\sigma_y^2$. Then

$$var(\bar{y}) = var\left( \frac{1}{n}\sum_{i=1}^{n} y_i \right) = \frac{1}{n^2}\sum_{i=1}^{n} var(y_i) = \frac{\sigma_y^2}{n}, \qquad (27)$$

the second equality following from Property E2. This formula shows what factors influence the precision of the sample mean: the variance $\sigma_y^2$ and the sample size $n$. Specifically, having a population with a small variance $\sigma_y^2$ leads to a more precise estimator $\bar{y}$ of $\mu_y$, which makes intuitive sense. Similarly intuitively, a larger sample size $n$ implies a smaller variance of $\bar{y}$, implying that more precise estimates are obtained from larger sample sizes.
Now consider the variance of the OLS slope estimator $\hat{\beta}_1$. Using Property LIEvar above, the variance of $\hat{\beta}_1$ can be expressed

$$var(\hat{\beta}_1) = E\left[ var(\hat{\beta}_1 \mid x_1, \ldots, x_n) \right] + var\left[ E(\hat{\beta}_1 \mid x_1, \ldots, x_n) \right] = E\left[ var(\hat{\beta}_1 \mid x_1, \ldots, x_n) \right] + var[\beta_1] = E\left[ var(\hat{\beta}_1 \mid x_1, \ldots, x_n) \right],$$

where (25) is used to get the second equality and then Property E3 (the variance of a constant is zero) to get the third. The conditional variance of $\hat{\beta}_1$ given $x_1, \ldots, x_n$ is

$$var(\hat{\beta}_1 \mid x_1, \ldots, x_n) = \sum_{i=1}^{n} a_{n,i}^2\, var(y_i \mid x_i) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2\, var(y_i \mid x_i)}{\left( \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^2},$$

using Property CE2 to obtain the first equality and then substituting for $a_{n,i}$ to obtain the second. This implies

$$var(\hat{\beta}_1) = E\left[ \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2\, var(y_i \mid x_i)}{\left( \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^2} \right], \qquad (28)$$

which is a fairly complicated formula that doesn't shed a lot of light on the properties of $\hat{\beta}_1$, but it does have later practical use when we talk about hypothesis testing.
A simplification of the variance occurs under homoskedasticity, that is when $var(y_i \mid x_i) = \sigma^2$ for every $i$. If the conditional variance is constant then

$$var(\hat{\beta}_1) = E\left[ \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right] = \frac{\sigma^2}{(n-1)\, s_x^2}, \qquad (29)$$

where

$$s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

is the usual sample variance of the explanatory variable $x_i$. Formula (29) is simple enough to understand what factors in a regression influence the precision of $\hat{\beta}_1$. The variance will be small for small values of $\sigma^2$ and large values of $n-1$ and $s_x^2$. This implies practically that slope coefficients can be precisely estimated in situations where the sample size is large, where the regressor $x_i$ is highly variable, and where the dependent variable $y_i$ has small variation around the regression function (i.e. small $\sigma^2$).
2.6.4 Asymptotic normality

Having discussed the mean and variance of a sampling distribution, it is also possible to consider the entire sampling distribution. This becomes important when we discuss hypothesis testing.

First consider the sample mean of some i.i.d. random variables $y_1, \ldots, y_n$ with mean $\mu_y$ and variance $\sigma_y^2$. Recall from (19) and (27) that the sample mean $\bar{y}$ has mean $E(\bar{y}) = \mu_y$ and variance $var(\bar{y}) = \sigma_y^2 / n$. In general the sampling distribution of $\bar{y}$ is not known, but in the special case where each $y_i$ is known to be normally distributed, it also follows that $\bar{y}$ is normally distributed. That is, if $y_i \sim \text{i.i.d.}\ N(\mu_y, \sigma_y^2)$ then

$$\bar{y} \sim N\left( \mu_y, \frac{\sigma_y^2}{n} \right). \qquad (30)$$

If the distribution of $y_i$ is not normal, then the distribution of $\bar{y}$ is also not normal. In econometrics it is very rare to know that each $y_i$ is normally distributed, so it would appear that (30) has only theoretical interest. However, there is a powerful result in probability called the Central Limit Theorem which states that even if $y_i$ is not normally distributed, the sample mean $\bar{y}$ can still be taken to be approximately normally distributed, with the approximation generally working better for larger values of $n$. Technically we say that $\bar{y}$ converges to a normal distribution as $n \to \infty$, or that $\bar{y}$ is asymptotically normal, and we will write this in the form

$$\bar{y} \stackrel{a}{\sim} N\left( \mu_y, \frac{\sigma_y^2}{n} \right), \qquad (31)$$

Figure 26: The Gamma distribution with parameters b = r = 2


with the a denoted the fact that the normal distribution for y is asymptotic (i.e. as n ! 1)
or more simply is approximate.
The proof of the Central Limit Theorem goes beyond our scope, but it can be illustrated using simulated data. Suppose that $y_1, \ldots, y_n$ are i.i.d. random variables with a Gamma distribution as shown in Figure 26. The mean of this distribution is $\mu_y = 4$. The details of the Gamma distribution are not important for this discussion, although it is a well-known distribution for modelling certain types of data in econometrics. For example, the skewed shape of the distribution can make it suitable for income distribution modelling, in which many people or households make low to moderate incomes and a relative few make high to very high incomes. Clearly this Gamma distribution is very different in shape from a normal distribution! We can use Eviews to draw a sample of size $n$ from this Gamma distribution and to compute $\bar{y}$. Repeating this many times builds up a picture of the sampling distribution of $\bar{y}$ for the given $n$. The results of doing this are given in Figures 27-31.

Figure 27 shows the simulated sampling distribution of $\bar{y}$ when $n = 5$. The skewness of the population distribution of $y_i$ in Figure 26 remains evident in the distribution of $\bar{y}$ in Figure 27, but to a reduced extent. The approximation (31), which is meant to hold for large $n$, does not work very well for $n = 5$. As $n$ increases, however, through $n = 10, 20, 40, 80$ in Figures 28-31, it is clear that the sampling distribution of $\bar{y}$ becomes more and more like a normal distribution, even though the underlying distribution of $y_i$ is very far from being normal. This is the Central Limit Theorem at work and is why, for reasonable sample sizes, we are prepared to rely on an approximate distribution such as (31) to carry out statistical inference.

Two other features of the sampling distributions in Figures 27-31 are worth noting. Firstly, the mean of each sampling distribution is known to be $\mu_y = 4$ because $\bar{y}$ is unbiased for every $n$. Secondly, the variance of the sampling distribution becomes smaller as $n$ increases because $var(\bar{y}) = \sigma_y^2 / n$. That is, the sampling distribution becomes more concentrated around $\mu_y = 4$ as $n$ increases (note carefully the scale on the horizontal axis changing as $n$ increases).
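The simulation behind Figures 27-31 is easy to reproduce outside Eviews. The sketch below (illustrative only; it assumes the shape/scale parametrisation Gamma(2, 2), which matches the stated mean of 4) draws repeated samples, computes $\bar{y}$ for each, and shows the spread and skewness of the sampling distribution shrinking as $n$ grows.

```python
# Illustrative reproduction of the sampling-distribution experiment in Figures 27-31.
import numpy as np

rng = np.random.default_rng(1)
reps = 10_000
for n in (5, 10, 20, 40, 80):
    samples = rng.gamma(shape=2.0, scale=2.0, size=(reps, n))   # skewed population with mean 4
    ybars = samples.mean(axis=1)                                # one sample mean per replication
    skew = np.mean((ybars - ybars.mean()) ** 3) / ybars.std() ** 3
    # The mean stays near 4 (unbiasedness), the standard deviation shrinks like sigma_y/sqrt(n),
    # and the skewness moves towards 0 as the normal approximation (31) takes hold.
    print(n, round(ybars.mean(), 3), round(ybars.std(), 3), round(skew, 3))
```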
The same principle applies to the regression coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$. Each can be shown to be asymptotically normal because of the Central Limit Theorem. For $\hat{\beta}_1$, the Central Limit Theorem
Figure 27: Sampling distribution of $\bar{y}$ with n = 5 observations from the Gamma(2, 2) distribution.

Figure 28: Sampling distribution of $\bar{y}$ with n = 10 observations from the Gamma(2, 2) distribution.

Figure 29: Sampling distribution of $\bar{y}$ with n = 20 observations from the Gamma(2, 2) distribution.

Figure 30: Sampling distribution of $\bar{y}$ with n = 40 observations from the Gamma(2, 2) distribution.

Figure 31: Sampling distribution of $\bar{y}$ with n = 80 observations from the Gamma(2, 2) distribution.
applies to the sum (23) and gives the approximate distribution

$$\hat{\beta}_1 \stackrel{a}{\sim} N\left( \beta_1, \omega_{1,n}^2 \right), \qquad (32)$$

where in general

$$\omega_{1,n}^2 = var(\hat{\beta}_1) = E\left[ \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2\, var(y_i \mid x_i)}{\left( \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^2} \right], \qquad (33)$$

as given in (28). Under homoskedasticity, this simplifies to

$$\omega_{1,n}^2 = \frac{\sigma^2}{(n-1)\, s_x^2}, \qquad (34)$$

as shown in (29).

2.7 Summary

In introductory econometrics the topic of statistical inference and its theory is typically the most difficult to grasp, both in its concepts and its formulae. What follows is a summary of the important concepts of this section.

Populations and Samples

- Statistical inference is the process of attempting to learn about some characteristics of a population based on a sample drawn from that population.
- The most straightforward sampling approach is a simple random sample, in which every element in the population has an equal chance of being included in the sample.

Mean and variance in the Population and the Sample

- Population characteristics such as means and variances are defined as the expectations $\mu_y = E(y_i)$ and $\sigma_y^2 = E\left[ (y_i - \mu_y)^2 \right]$.
- Sample estimators of means and variances are defined as the sums $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ and $s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2$.

Regression in the Population and the Sample

- The Population Regression Function (PRF) is defined in terms of the conditional expectations operator: $E(y_i \mid x_i) = \beta_0 + \beta_1 x_i$.
- The Sample Regression Function (SRF) is defined in terms of the OLS regression line $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$.

Statistical properties

- Under simple random sampling,

$$\bar{y} \stackrel{a}{\sim} N\left( \mu_y, \sigma_y^2 / n \right)$$

and

$$\hat{\beta}_1 \stackrel{a}{\sim} N\left( \beta_1, \omega_{1,n}^2 \right),$$

where in general

$$\omega_{1,n}^2 = E\left[ \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2\, var(y_i \mid x_i)}{\left( \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^2} \right],$$

or under homoskedasticity

$$\omega_{1,n}^2 = \frac{\sigma^2}{(n-1)\, s_x^2}.$$

3 Hypothesis Testing and Confidence Intervals

The idea of statistical inference is that we use the observable sample information summarised by the SRF

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$

to make inferences about the unobservable PRF

$$E(y_i \mid x_i) = \beta_0 + \beta_1 x_i.$$

For example, in the CEO salary regression in Figure 17, we take $\hat{\beta}_1 = 18.50$ to be the point estimate of the unknown population coefficient $\beta_1$. This point estimate is very useful, but on its own it doesn't communicate the uncertainty that is implicit in having taken a sample of just $n = 209$ firms from all firms in the population. If we had taken a different sample of firms, we would have obtained a different value for $\hat{\beta}_1$. This uncertainty is summarised in the sampling distribution of $\hat{\beta}_1$ in equation (32), which quantifies (approximately) the entire distribution of estimates $\hat{\beta}_1$ that could have been obtained by taking different samples from the underlying population. The techniques of hypothesis testing and confidence intervals provide ways of making probabilistic statements about $\beta_1$ that are more informative, and more honest about the statistical uncertainty, than a simple point estimate.

3.1 Hypothesis testing

3.1.1 The null hypothesis

The approach in hypothesis testing is to specify a null hypothesis about a particular value of a population parameter (say $\beta_0$ or $\beta_1$) and then to investigate whether the observed data provide evidence for the rejection of this hypothesis. For example, in the CEO salary regression, we might specify a null hypothesis that firm profitability has no predictive power for CEO salary. In the PRF

$$E(Salary_i \mid RoE_i) = \beta_0 + \beta_1 RoE_i, \qquad (35)$$

the null hypothesis would be expressed

$$H_0: \beta_1 = 0. \qquad (36)$$

If the null hypothesis were true then $E(Salary_i \mid RoE_i) = \beta_0$, which states that average CEO salaries are constant ($\beta_0$) across all levels of firm profitability.

Note that the hypothesis is expressed in terms of the population parameter $\beta_1$, not the sample estimate $\hat{\beta}_1$. Since we know that $\hat{\beta}_1 = 18.50$, it would be nonsense to investigate whether $\hat{\beta}_1 = 0$... it isn't! Instead we are interested in testing whether $\hat{\beta}_1 = 18.50$ differs sufficiently from zero that we can conclude that $\beta_1$ also differs from zero, albeit with some level of uncertainty that acknowledges the sampling variability inherent in $\hat{\beta}_1$.
3.1.2 The alternative hypothesis

After specifying the null hypothesis, the next requirement is the alternative hypothesis. The alternative hypothesis is specified as an inequality, as opposed to the null hypothesis which is an equality. In the case of a null hypothesis specified as (36), the alternative hypothesis would be one of the following three possibilities:
H1: β1 ≠ 0  or  H1: β1 > 0  or  H1: β1 < 0,
depending on the practical context. The alternative H1: β1 ≠ 0 is called a two-sided alternative (falling on both sides of the null hypothesis) while H1: β1 > 0 and H1: β1 < 0 are called one-sided alternatives. A one-sided alternative would be specified in situations where the only reasonable or interesting deviations from the null hypothesis lie on one side. In the case of the null hypothesis H0: β1 = 0 in (35), we might specify H1: β1 > 0 if the only interest were in the hypothesis that profitable firms reward their CEOs with higher salaries. However there is also a possibility that some less profitable firms might try to improve their fortunes by attempting to attract proven CEOs with offers of a higher salary. With two conflicting stories like this, the sign of the possible relationship would be unclear and we would specify H1: β1 ≠ 0. One very important point is that we must not use the sign of the sample estimate to specify the alternative hypothesis: the hypothesis testing methodology requires that the hypotheses be specified before looking at any sample information. The hypotheses must be specified on the basis of the practical questions of interest. Both one and two sided testing will be discussed below.
3.1.3 The null distribution

The idea in hypothesis testing is to make a decision whether or not to reject H0 in favour of H1 on the basis of the evidence in the data. For testing (36), an approach to making this decision can be based on the sampling distribution (32). Specifically, if H0: β1 = 0 is true then
β̂1 ~ᵃ N(0, ω²_{1,n}),
where ω²_{1,n} is given in (33), or (34) in the special case where homoskedasticity can be assumed. This sampling distribution can also be written
β̂1 / ω_{1,n} ~ᵃ N(0, 1),
which is useful because the distribution on the right hand side is now a very well known one, the standard normal distribution, for which derivations and computations are relatively straightforward. However this expression is not yet usable in practice because ω_{1,n} depends on population expectations (i.e. it contains an E and a var) and is not observable. It can, however, be estimated using
ω̂_{1,n} = sqrt( Σ_{i=1}^n (x_i − x̄)² û²_i ) / Σ_{i=1}^n (x_i − x̄)²,   (37)
which is obtained from (33) by dropping the outside expectation, replacing var(y_i|x_i) by the squared residuals û²_i, and then taking the square root to turn ω̂²_{1,n} into ω̂_{1,n}. This quantity ω̂_{1,n} is called the standard error of β̂1. It can then be shown (using derivations beyond our scope) that
β̂1 / ω̂_{1,n} ~ᵃ N(0, 1).
That is, replacing the unknown standard deviation ω_{1,n} with the observable standard error ω̂_{1,n} does not change the approximate distribution of β̂1. However, often a practically better approximation is provided by
β̂1 / ω̂_{1,n} ~ᵃ t_{n−2},   (38)
where t_{n−2} denotes the t distribution with n − 2 degrees of freedom. For large values of n the t_{n−2} and N(0, 1) distributions are almost indistinguishable (indeed lim_{n→∞} t_{n−2} = N(0, 1)) but for smaller n using (38) can often give a more accurate approximation. Equation (38) provides a practically usable approximate null distribution for β̂1 (it's called the null distribution because recall we imposed the null hypothesis to obtain β̂1 ~ᵃ N(0, ω²_{1,n}) in the first step above).
If it is known that the conditional distribution of y_i given x_i is homoskedastic (i.e. that var(y_i|x_i) is constant) then (34) can be used to justify the alternative estimator
ω̂_{1,n} = sqrt( σ̂² / Σ_{i=1}^n (x_i − x̄)² ),   (39)
where
σ̂² = (1/(n−2)) Σ_{i=1}^n û²_i
is the sample variance of the OLS residuals. In small samples the standard error estimated using (39) may be more precise than that estimated by (37), provided the assumption of homoskedasticity is correct. If the assumption of homoskedasticity is incorrect, however, the standard error in (39) is not valid. In econometrics the standard error in (37) is referred to as "White's" standard error while (39) is referred to as the "OLS" standard error. Modern econometric practice is to favour the robustness of (37), and we will generally follow that practice.
If it is known that the conditional distribution of y_i given x_i is both homoskedastic and normally distributed (written y_i|x_i ~ N(β0 + β1 x_i, σ²)) then the null distribution (38) with ω̂_{1,n} given in (39) is exact, no longer an approximation. This is a beautiful theoretical result, but since it is very rarely known that y_i|x_i ~ N(β0 + β1 x_i, σ²) in practice, we should acknowledge that (38) is an approximation.
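As a rough numerical illustration of the difference between (37) and (39), the following Python sketch computes both standard errors for a single simulated sample (the simulated data, coefficients and sample size are assumptions made only for this illustration).

    import numpy as np

    rng = np.random.default_rng(1)
    n = 200
    x = rng.normal(size=n)
    y = 1.0 + 2.0 * x + rng.normal(size=n) * (1 + 0.5 * np.abs(x))  # heteroskedastic errors
    xd = x - x.mean()
    b1 = (xd @ (y - y.mean())) / (xd @ xd)            # OLS slope
    b0 = y.mean() - b1 * x.mean()
    u = y - b0 - b1 * x                               # OLS residuals
    se_white = np.sqrt(np.sum(xd**2 * u**2)) / np.sum(xd**2)   # equation (37)
    sigma2 = np.sum(u**2) / (n - 2)                   # residual variance
    se_ols = np.sqrt(sigma2 / np.sum(xd**2))          # equation (39)
    print(se_white, se_ols)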
3.1.4 The alternative distribution

If H0: β1 = 0 is not true then the approximate sampling distribution (32) can be written
β̂1 ~ᵃ β1 + N(0, ω²_{1,n}),
which is informal notation that represents a normal distribution with a constant β1 added to it (which is identical to a N(β1, ω²_{1,n}) distribution). Then repeating the steps leading to (38) gives
β̂1 / ω_{1,n} ~ᵃ β1/ω_{1,n} + N(0, 1),
and then replacing ω_{1,n} by the standard error ω̂_{1,n} gives
β̂1 / ω̂_{1,n} ~ᵃ β1/ω_{1,n} + t_{n−2}.   (40)
This equation says that if the null hypothesis is false then the distribution of the ratio β̂1/ω̂_{1,n} is no longer approximately t_{n−2}, but instead is t_{n−2} with a constant β1/ω_{1,n} added to it. That is, the distribution is shifted (either positively or negatively depending on the sign of β1) relative to the t_{n−2} distribution. The difference between (38) under the null and (40) under the alternative provides the basis for the hypothesis test.

3.1.5 Decision rules and the significance level

In hypothesis testing we either reject H0 or do not reject H0 (we don't accept hypotheses, more on this soon). A hypothesis test requires a decision rule that specifies when H0 is to be rejected. Because we have only partial information, i.e. a random sample rather than the entire population, there is some probability that any decision we make will be incorrect. That is, there is a chance we might reject H0 when H0 is in fact true, which is called a Type I error. There is also a chance that we might not reject H0 when H0 is in fact false, which is called a Type II error. The four possibilities are summarised in this table.

                                Truth in the population
  Decision                      H0 true          H0 false
  Reject H0                     Type I error     Correct
  Do not reject H0              Correct          Type II error

Clearly we would like a hypothesis test to minimise the probabilities of both Type I and II errors, but there is no unique way of doing this. The convention is to set the significance level of the hypothesis test to a small fixed probability α, which specifies the probability of a Type I error. The most common choice is α = 0.05, although α = 0.01 and α = 0.10 are sometimes used.
3.1.6 The t test - theory

The t statistic for testing H0: β1 = 0 is
t = β̂1 / ω̂_{1,n}.   (41)
From (38) we know that t ~ᵃ t_{n−2} if H0 is true, while from (40) we know that t is shifted away from the t_{n−2} distribution if H0 is false. First consider testing H0: β1 = 0 against the one-sided alternative H1: β1 > 0, implying the interesting deviations from the null hypothesis induce a positive shift of t away from the t_{n−2} distribution. We will therefore define a decision rule based on t that states that H0 is rejected if t takes a larger value than would be thought reasonable from the t_{n−2} distribution. The way we formalise the statement "t takes a larger value than would be thought reasonable from the t_{n−2} distribution" is to use the significance level. The decision rule is defined to reject H0 if t takes a larger value than a critical value c_α, which is defined by the probability
Pr(t_{n−2} > c_α) = α
for significance level α. The distribution of t under H0 is t_{n−2}, so the value of c_α can be computed from the t_{n−2} distribution, as shown graphically in Figure 32 for α = 0.05 and n − 2 = 30. The critical value in this case is c_{0.05} = 1.697, which can be found in Table G.2 of Wooldridge (p.833) or computed in Eviews.
For testing H0: β1 = 0 against H1: β1 < 0, the procedure is essentially a mirror image. The decision rule is to reject H0 if t takes a smaller value than the critical value, which is shown in Figure 33. If c_α is the α-significance critical value for testing against H1: β1 > 0, then −c_α is the α-significance critical value for testing against H1: β1 < 0. That is
Pr(t_{n−2} < −c_α) = α.
The critical value for α = 0.05 and n − 2 = 30 is therefore simply −c_{0.05} = −1.697.
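Critical values such as these need not be read from a printed table; any statistics package will compute them. A minimal sketch using Python's scipy (an assumption of this illustration only; the notes themselves use Eviews or the Wooldridge tables):

    from scipy import stats

    df = 30
    c_upper = stats.t.ppf(0.95, df)    # one-sided 5% critical value, approx 1.697
    c_two = stats.t.ppf(0.975, df)     # two-sided 5% critical value, approx 2.042
    print(c_upper, c_two)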
For testing H0: β1 = 0 against H1: β1 ≠ 0, the potentially interesting deviations from the null hypothesis might induce either a positive or negative shift of t away from the t_{n−2} distribution.

Figure 32: The t_{n−2} distribution with α = 0.05 critical value for testing H0: β1 = 0 against H1: β1 > 0.

Therefore we need to check in either direction. That is, we will reject H0 if either t takes a larger value than considered reasonable for the t_{n−2} distribution, or a smaller value. The decision rule is to reject H0 if t > c_{α/2} or t < −c_{α/2}, which can be expressed more simply as |t| > c_{α/2}, where c_{α/2} satisfies
Pr(t_{n−2} > c_{α/2}) = α/2,
or equivalently
Pr(|t_{n−2}| > c_{α/2}) = α.
The critical value for α = 0.05 and n − 2 = 30 is c_{α/2} = 2.042.

3.1.7 The t test - two sided example

Every hypothesis test needs to specify the following elements:
1. The null hypothesis H0.
2. The alternative hypothesis H1.
3. A significance level α.
4. A test statistic (in this case t, but we will see others soon).
5. A decision rule that states when H0 is rejected.
6. The decision, and its interpretation.
Consider the CEO salary regression, which has PRF
E(Salary_i | RoE_i) = β0 + β1 RoE_i,   (42)

Figure 33: The t_{n−2} distribution with α = 0.05 critical value for testing H0: β1 = 0 against H1: β1 < 0.

Figure 34: The t_{n−2} distribution with α = 0.05 critical value for testing H0: β1 = 0 against H1: β1 ≠ 0.

Figure 35: Choosing to use White standard errors that allow for heteroskedasticity
and the hypotheses H0: β1 = 0 and H1: β1 ≠ 0, so that we are interested in either positive or negative deviations from the null hypothesis, i.e. any role for firm profitability in predicting CEO salaries, whether positively or negatively. We will choose α = 0.05, which is the default choice unless specified otherwise.
The test statistic will be the t statistic given in (41). This statistic can be computed in Eviews using either of (37) or (39), with the default choice being (39) which imposes the homoskedasticity assumption. This assumption can frequently be violated in practice, and can be tested for, but we will play it safe for now and use the (37) version of ω̂_{1,n} which allows for heteroskedasticity. This requires an additional option to be changed in Eviews. When specifying the regression in Eviews in Figure 16, click on the Options tab to reveal the options shown in Figure 35, and select White for the coefficient covariance matrix as shown. The resulting regression is shown in Figure 36, with the selection of the appropriate White standard errors highlighted. We now have enough information to carry out the hypothesis test. The details are as follows.
1. H0: β1 = 0
2. H1: β1 ≠ 0
3. Significance level: α = 0.05
4. Test statistic: t = 2.71
5. Reject H0 if |t| > c_{0.025} = 1.980
6. H0 is rejected, so Return on Equity is a significant predictor for CEO Salary.
The critical value of c_{0.025} = 1.980 is found from the table of critical values on p.833 of Wooldridge, reproduced in Figure 37. For this regression with n = 209, the relevant t distribution has n − 2 = 207 degrees of freedom. This many degrees of freedom is not included in the table, so we choose the closest degrees of freedom that is less than this number, i.e. 120. The test is two-sided with significance level of α = 0.05, so the critical value of c_{0.025} = 1.980 can be read from the third column of critical values in the table.
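The same regression with White standard errors can be reproduced outside Eviews. A hedged sketch in Python with statsmodels follows; the simulated salary and RoE arrays below are placeholders standing in for the real dataset, so the printed numbers will not match Figure 36.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    roe = rng.normal(17, 8, size=209)                          # placeholder RoE data
    salary = 960 + 18 * roe + rng.normal(0, 1300, size=209)    # placeholder salaries
    X = sm.add_constant(roe)                                   # intercept plus regressor
    res = sm.OLS(salary, X).fit(cov_type="HC0")                # HC0 requests White standard errors
    print(res.summary())                                       # t statistics and p-values use White SEs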


Figure 36: CEO salary regression with White standard errors


3.1.8 The t test - one sided example

The assessment for ETC2410/ETC3440 in semester two of 2013 consisted of 40% assignments during the semester and a 60% final exam. Descriptive statistics for these marks, both expressed as percentages, are shown in Figures 38 and 39. It may be of interest to investigate how well assignment marks earned during the semester predict final exam marks. In particular, we would expect that those students who do better on assignments during the semester will go on to also do better on their final exams. The scatter plot in Figure 40 shows that such a relationship potentially does exist in the data, so we will carry out a formal hypothesis test in a regression.
The PRF has the form
E(exam_i | asgnmt_i) = β0 + β1 asgnmt_i,   (43)
and we will test H0: β1 = 0 (that assignment marks have no predictive power for exam marks) against the one-sided alternative H1: β1 > 0 (that higher assignment marks predict higher exam marks). The estimates are given in Figure 41, in which the SRF is
exam̂_i = 23.763 + 0.548 asgnmt_i
          (5.360)  (0.095)
The numbers in parentheses below the coefficients are the standard errors. This is a common way of reporting an estimated regression equation, since it provides sufficient information for the reader to carry out some inference themselves if they wish. The hypothesis test of interest proceeds as follows.
1. H0: β1 = 0
2. H1: β1 > 0
3. Significance level: α = 0.05
4. Test statistic: t = 5.766

Figure 37: Critical values from the t distribution from Wooldridge


[Histogram of ASGNMT, 118 observations: mean 59.67, median 61.92, maximum 83.63, minimum 19.25, std. dev. 13.27, skewness -0.74, kurtosis 3.34, Jarque-Bera 11.30 (p = 0.0035).]
Figure 38: Assignment marks for ETC2410 / ETC3440 in semester two of 2013.
[Histogram of EXAM, 118 observations: mean 56.49, median 56.62, maximum 93.25, minimum 0.00, std. dev. 16.74, skewness -0.79, kurtosis 4.71, Jarque-Bera 26.66 (p = 0.000002).]
Figure 39: Exam marks for ETC2410 / ETC3440 in semester two of 2013.
5. Reject H0 if t > c_{0.05} = 1.662
6. H0 is rejected, so there is evidence that higher assignment marks predict significantly higher final exam marks.
The critical value in this case is found from the table in Figure 37 using 90 degrees of freedom (n − 2 = 116 in this case) and the column corresponding to the α = 0.05 level of significance for a one-sided test.
3.1.9 p-values

A convenient alternative way to express a decision rule for a hypothesis test is to use p-values rather than critical values, where they are available.
First consider testing H0: β1 = 0 against H1: β1 > 0. The critical value for this t test is c_{0.05} as shown in Figure 32. Recall that c_{0.05} is defined to satisfy Pr(t_{n−2} > c_{0.05}) = 0.05, which means that the area under the t_{n−2} distribution to the right of c_{0.05} is 0.05. Any value of the test statistic t that falls above c_{0.05} leads to a rejection of the null hypothesis, and the area under the t_{n−2} distribution to the right of such a value of t must be less than 0.05. So instead of defining a decision

Figure 40: Scatter plot of exam marks against assignment marks.

Figure 41: Regression of exam marks on assignment marks


rule in terms of t > c_{0.05}, we could equivalently define the decision in terms of Pr(t_{n−2} > t) < 0.05. That is, the decision rules "reject H0 if t > c_{0.05}" and "reject H0 if Pr(t_{n−2} > t) < 0.05" yield identical tests. Similarly if we are testing H0: β1 = 0 against H1: β1 < 0, the decision rules "reject H0 if t < −c_{0.05}" and "reject H0 if Pr(t_{n−2} < t) < 0.05" yield identical tests.
For the two sided problem H0: β1 = 0 against H1: β1 ≠ 0, the decision rule is to reject H0 if |t| > c_{0.025}. Recall that c_{0.025} is defined to satisfy Pr(t_{n−2} > c_{0.025}) = 0.025, see Figure 34. The condition |t| > c_{0.025} therefore implies that Pr(t_{n−2} > |t|) < 0.025, because |t| is further out into the tail of the t_{n−2} distribution than c_{0.025}. Multiplying this inequality by 2 gives 2 Pr(t_{n−2} > |t|) < 0.05, so the critical value decision rule "reject H0 if |t| > c_{0.025}" is equivalent to "reject H0 if 2 Pr(t_{n−2} > |t|) < 0.05".
It is conventional in econometrics and statistics (and in Eviews!) to define the p-value for a regression t statistic as
p = 2 Pr(t_{n−2} > |t|).   (44)
Therefore the decision rule for testing H0: β1 = 0 against H1: β1 ≠ 0 is
reject H0 if p < 0.05,
where p is the value printed out by Eviews under the "Prob." column of the regression output. The two-sided test of the significance of RoE_i in the model (42) can be re-expressed in terms of p-values as follows.
1. H0: β1 = 0
2. H1: β1 ≠ 0
3. Significance level: α = 0.05
4. Test statistic: p = 0.0073
5. Reject H0 if p < 0.05
6. H0 is rejected, so Return on Equity is a significant predictor for CEO Salary.
The p-value in item 4 is read directly from the regression output in Figure 36. Clearly having a p-value available makes the hypothesis test more convenient to carry out because it is not necessary to look up or compute a critical value. The vast majority of hypothesis tests computed in modern econometrics and statistics software are accompanied by a p-value for easy testing.
For testing against one-sided alternative hypotheses, a small modification is required. In the introductory discussion it was shown that the decision rule for testing H0: β1 = 0 against H1: β1 > 0 is to reject H0 if Pr(t_{n−2} > t) < 0.05. If t ≥ 0 then Pr(t_{n−2} > t) = Pr(t_{n−2} > |t|) = p/2 using (44). On the other hand if t < 0 then Pr(t_{n−2} > t) = 1 − Pr(t_{n−2} > |t|) > 0.5 (by the symmetry of the t_{n−2} distribution) so H0 will never be rejected if t < 0. This makes intuitive sense since t < 0 can only occur if β̂1 < 0, and an estimate of β̂1 < 0 cannot provide evidence to reject H0: β1 = 0 in favour of H1: β1 > 0. So the decision rule for testing H0: β1 = 0 against H1: β1 > 0 is to reject H0 if t > 0 and p/2 < 0.05, or more simply
reject H0 if t > 0 and p < 0.10.
That is, to carry out a one-sided test at the 5% level of significance, the comparison of the p-value is made with 0.10 not 0.05. The reason is that the p-value provided by Eviews is (44), which is for testing against two-sided alternatives.
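For completeness, the p-value arithmetic described above can be reproduced directly from the t distribution. A small sketch (Python/scipy, used purely as an illustration, with the CEO salary example's t statistic and degrees of freedom):

    from scipy import stats

    t_stat, df = 2.71, 207
    p_two = 2 * stats.t.sf(abs(t_stat), df)                 # two-sided p-value as in (44)
    p_one = p_two / 2 if t_stat > 0 else 1 - p_two / 2      # one-sided p-value for H1: beta1 > 0
    print(p_two, p_one)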


The decision rule for testing H0: β1 = 0 against H1: β1 < 0 is a mirror image of the upper-tailed version, that is
reject H0 if t < 0 and p < 0.10,
so that the null is rejected only for negative estimates of β1 whose p-value is less than 0.10.
The one-sided test of the significance of assignment marks in (43) can therefore be re-expressed as follows.
1. H0: β1 = 0
2. H1: β1 > 0
3. Significance level: α = 0.05
4. Test statistic: t = 5.766, p = 0.0000
5. Reject H0 if t > 0 and p < 0.10
6. H0 is rejected, so there is evidence that higher assignment marks predict significantly higher final exam marks.
The outcome of a hypothesis test carried out using critical values or p-values will always be the same; the choice comes down to one of convenience. p-values are usually more convenient and more commonly used in practice, and we will generally rely on them from now on.
3.1.10 Testing other null hypotheses

By far the most common hypotheses tested in regression models have the null in the form H0: β1 = 0. However there are other null hypotheses that can also be of interest. For example, in the exam marks application we might want to test whether an extra 1% gained on assignment marks predicts an extra 1% gained on the final exam. In the regression model (43), this would translate to a null hypothesis of the form H0: β1 = 1.
In general, consider testing a null hypothesis of the form H0: β1 = b1, where b1 is a specified number (e.g. 0, 1, etc). The t statistic for testing this null hypothesis is
t = (β̂1 − b1) / ω̂_{1,n}.   (45)
Obviously this reduces to (41) when b1 = 0. The decision rules presented above remain unchanged, both for critical values and p-values.
To illustrate, consider testing H0: β1 = 1 against H1: β1 ≠ 1 in the exam regression (43). From the results in Figure 41 we can calculate
t = (0.548 − 1) / 0.095 = −4.758.   (46)
The hypothesis test using critical values can then proceed as follows.
1. H0: β1 = 1
2. H1: β1 ≠ 1
3. Significance level: α = 0.05
4. Test statistic: t = −4.758
5. Reject H0 if |t| > c_{0.025} = 1.987.
6. H0 is rejected, so the predicted change in final exam scores corresponding to a 1% higher assignment score is significantly different from 1%.
Note that a two-sided alternative was used in this case because there is no prior expectation, before the analysis, of whether the coefficient should be greater than one or less than one. Having estimated the regression it appears the coefficient is less than one, but we must not use that information to formulate the alternative hypothesis.
In order to avoid re-calculating the t statistic manually as we did above, and also in order to obtain convenient p-values, the regression model can be re-estimated in a form that makes testing a null hypothesis H0: β1 = b1 very easy. Suppose in general we have a PRF of the form
E(y_i | x_i) = β0 + β1 x_i.
Subtracting b1 x_i from both sides gives
E(y_i − b1 x_i | x_i) = β0 + (β1 − b1) x_i.
The null hypothesis H0: β1 = b1 in the original regression of y_i on x_i can therefore be equivalently re-expressed as the hypothesis that the slope coefficient (β1 − b1) is zero in the regression of (y_i − b1 x_i) on x_i.
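The general t statistic (45) is also easy to compute by hand from reported output. A brief sketch using the estimates and standard error reported in Figure 41 (Python/scipy for illustration; the degrees of freedom n − 2 = 116 come from this example):

    from scipy import stats

    b1_hat, se, b1_null, df = 0.548, 0.095, 1.0, 116
    t_stat = (b1_hat - b1_null) / se            # equation (45): about -4.758
    p_two = 2 * stats.t.sf(abs(t_stat), df)     # two-sided p-value
    print(t_stat, p_two)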
For testing H0: β1 = 1 against H1: β1 ≠ 1 in the exam regression (43), the PRF is re-written
E(exam_i − asgnmt_i | asgnmt_i) = β0 + (β1 − 1) asgnmt_i.
The results from regressing (exam_i − asgnmt_i) on asgnmt_i are given in Figure 42, which shows an estimated slope coefficient of β̂1 − 1 = −0.452 with t = −4.746. (This latter t statistic differs from (46) only because of the rounding error induced by the calculation in (46) being carried out using three decimal places in numerator and denominator. Without this rounding error, the two would be identical.) The hypothesis test in terms of p-values then proceeds as follows.
1. H0: β1 = 1
2. H1: β1 ≠ 1
3. Significance level: α = 0.05
4. Test statistic: p = 0.0000
5. Reject H0 if p < 0.05
6. H0 is rejected, so the predicted change in final exam scores corresponding to a 1% higher assignment score is significantly different from 1%.
The same conclusions will always be found from this approach (re-specifying the regression) and the previous approach that manually computes the t statistic and uses a critical value. It will usually be more convenient in practice to re-specify the regression and use the p-value that is then automatically provided.


Figure 42: Regression of (exam_i − asgnmt_i) on asgnmt_i

3.2 Confidence intervals

Confidence intervals provide an alternative method for summarising the uncertainty due to sampling in coefficient estimates. A confidence interval is a pair of numbers that form an interval within which the true value of the parameter is contained with a pre-specified probability. This probability, called the confidence level, is typically chosen to be 1 − α, where α is the usual significance level used in hypothesis tests. So, for a regression coefficient β1, the aim is to find numbers β1,L and β1,U such that
Pr(β1,L ≤ β1 ≤ β1,U) = 1 − α.   (47)
The derivation of the confidence interval follows from hypothesis tests of the form H0: β1 = b1 against H1: β1 ≠ b1. If we imagine testing these hypotheses for all possible values of b1, the confidence interval is formed by those values of b1 for which H0: β1 = b1 is not rejected using a two-sided t test with significance level α. To show where this leads, for any b1 the null hypothesis H0: β1 = b1 is not rejected if the t statistic in (45) satisfies |t| ≤ c_{α/2}, which implies
−c_{α/2} ≤ (β̂1 − b1)/ω̂_{1,n} ≤ c_{α/2}.
These two inequalities can be re-arranged to give
β̂1 − c_{α/2} ω̂_{1,n} ≤ b1 ≤ β̂1 + c_{α/2} ω̂_{1,n}.   (48)
That is, H0: β1 = b1 will not be rejected for all b1 in the interval
[β1,L, β1,U] = [β̂1 − c_{α/2} ω̂_{1,n}, β̂1 + c_{α/2} ω̂_{1,n}],   (49)
which is the desired confidence interval. It has the desired level because when b1 is the true value of the parameter, the null hypothesis H0: β1 = b1 is rejected with probability α (this is the definition of the significance level of the test), which implies that it is not rejected with probability 1 − α. Therefore the true value β1 is included in the confidence interval (49) with probability 1 − α, as required.
To illustrate, consider a confidence interval for the slope coefficient in the salary PRF (42). From the results in Figure 36 we see that β̂1 = 18.501 and ω̂_{1,n} = 6.829. The critical value for a two-sided t test with significance level α = 0.05 is c_{0.025} = 1.980. The 95% confidence interval for β1 is therefore
[β1,L, β1,U] = [18.501 − 1.980 × 6.829, 18.501 + 1.980 × 6.829] = [4.980, 32.022].   (50)
The interpretation of this interval is that it contains the true value of β1 with probability 95%. (In fact this probability of 95% is an approximation because the distribution of t in (38) on which it is based is also approximate. In practice though we usually just talk about a 95% confidence interval, rather than an "approximate" or "asymptotic" 95% confidence interval.) The 95% confidence interval (or "interval estimate") of the coefficient implies that an increase of 1% in a firm's Return on Equity predicts an increase in CEO salary of between $4,980 and $32,022.
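The calculation in (50) is a one-line computation once the point estimate, standard error and critical value are known. A small sketch (Python/scipy, illustrative only):

    from scipy import stats

    b1_hat, se, df, alpha = 18.501, 6.829, 207, 0.05
    c = stats.t.ppf(1 - alpha / 2, df)          # two-sided critical value for 207 df
    ci = (b1_hat - c * se, b1_hat + c * se)     # roughly [5.04, 31.96]; (50) used the tabulated c = 1.980
    print(ci)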
A confidence interval provides a convenient and informative way to report the findings of a regression. The mid-point of the interval is the point estimate β̂1, while its width represents how much uncertainty there is about the estimate. A narrow confidence interval implies the sample has provided a precise estimate of the coefficient. The width of the confidence interval is determined by the standard error ω̂_{1,n}, so a small standard error implies a precise estimate and a narrow confidence interval.
From a hypothesis testing perspective, the confidence interval provides a nice summary of all the null hypotheses that would not be rejected by a two-sided t test (those values within the interval) and all of the null hypotheses that would be rejected (those values outside the interval). Clearly this is much more informative than simply reporting a coefficient estimate and whether or not it is significantly different from zero (which does happen sometimes...). A confidence interval that does not include zero, such as (50) above, immediately conveys the information that the coefficient estimate is significantly different from zero, but it contains much more information as well.
These ideas also emphasise why in a hypothesis test we never claim to accept H0, we only say that we do not reject H0. Consider the confidence interval [β1,L, β1,U] = [4.980, 32.022] constructed above. This implies that H0: β1 = b1 would not be rejected for all b1 between 4.980 and 32.022. It would be illogical to say that we accept H0: β1 = 5 and H0: β1 = 10 and H0: β1 = 25 and so on; we cannot accept that β1 is equal to several different values at once! Instead we say that the sample does not provide sufficient evidence to reject those values at the specified level of significance.

3.3 Prediction intervals

Suppose we want to make a prediction for y_i for a particular fixed value x of x_i. For example, to predict average CEO salary for Return on Equity of x = 15%, or final exam marks for an assignment mark of x = 75%. The prediction is given by
ŷ(x) = β̂0 + β̂1 x,   (51)
and this can be taken as an estimator of the true value
μ_y(x) = E(y_i | x_i = x) = β0 + β1 x.

Just like a confidence interval for the population parameter β1, a prediction interval can be calculated for the population conditional mean E(y_i | x_i = x), i.e. an interval [μ_y,L(x), μ_y,U(x)] such that
Pr(μ_y,L(x) ≤ μ_y(x) ≤ μ_y,U(x)) = 1 − α;
compare to (47) for β1.
The distribution of ŷ(x) as an estimator of μ_y(x) can be derived to be
ŷ(x) ~ᵃ N(μ_y(x), ω²_{n,x}),
where
ω²_{n,x} = E[ Σ_{i=1}^n (1/n + (x − x̄) a_{n,i})² var(y_i | x_1, ..., x_n) ],
and a_{n,i} was given in (24). This leads to the prediction interval
[μ_y,L(x), μ_y,U(x)] = [ŷ(x) − c_{α/2} ω̂_{n,x}, ŷ(x) + c_{α/2} ω̂_{n,x}],   (52)
where
ω̂²_{n,x} = Σ_{i=1}^n (1/n + (x − x̄) a_{n,i})² û²_i.
Fortunately there is a convenient way to calculate ω̂²_{n,x} without dealing with the formula. If we take the usual SRF ŷ_i = β̂0 + β̂1 x_i and subtract the prediction formula at x given by (51), we obtain
ŷ_i = ŷ(x) + β̂1 (x_i − x).
This shows that an OLS regression of y_i on an intercept and (x_i − x) will provide an intercept that corresponds to ŷ(x), and then the ω̂_{n,x} required for the prediction interval is simply the standard error on this estimate.
As an example, consider making a prediction for the average final exam mark for an assignment mark of x = 75%. A regression in Eviews specified as exam c (asgnmt-75) will produce an intercept corresponding to ŷ(75). The Eviews output is shown in Figure 43. The prediction is ŷ(75) = 64.90%, with standard error ω̂_{n,x} = 2.34. The 95% prediction interval based on (52) is therefore
[μ_y,L(75), μ_y,U(75)] = [64.90 − 1.987 × 2.34, 64.90 + 1.987 × 2.34] = [60.25, 69.55],
where c_{0.025} = 1.987 is obtained from the t distribution table with 90 degrees of freedom (the closest to n − 2 = 116 in this example). The interpretation of this interval is that it contains the population conditional mean μ_y(75) = E(exam_i | asgnmt_i = 75) with probability of 95%.
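The same re-centring trick is easy to reproduce outside Eviews. A hedged sketch in Python with statsmodels; the simulated marks below are only a stand-in for the real assignment and exam data, so the printed interval will not match the numbers above.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    asgnmt = rng.uniform(20, 85, size=118)                     # placeholder assignment marks
    exam = 24 + 0.55 * asgnmt + rng.normal(0, 13, size=118)    # placeholder exam marks
    X = sm.add_constant(asgnmt - 75)        # regress on an intercept and (asgnmt - 75)
    res = sm.OLS(exam, X).fit()
    pred = res.params[0]                    # intercept = predicted exam mark at asgnmt = 75
    se = res.bse[0]                         # its standard error feeds the prediction interval
    print(pred - 1.98 * se, pred + 1.98 * se)   # approximate 95% prediction interval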
3.3.1 Derivations

These derivations of the distribution of the prediction follow easily from the preceding derivations we did for ȳ and β̂1, but this subsection is not required for the course.
Figure 43: Predicting final exam mark for an assignment mark of 75%

First recall the representation
β̂1 = Σ_{i=1}^n a_{n,i} y_i,
where
a_{n,i} = (x_i − x̄) / Σ_{i=1}^n (x_i − x̄)²,
in which Σ_{i=1}^n a_{n,i} = 0 and Σ_{i=1}^n a_{n,i} x_i = 1. This can be used to give a representation for β̂0:
β̂0 = ȳ − β̂1 x̄ = Σ_{i=1}^n (1/n − x̄ a_{n,i}) y_i,
and substituting for β̂0 and β̂1 into (51) gives
ŷ(x) = Σ_{i=1}^n (1/n + (x − x̄) a_{n,i}) y_i,
which shows that ŷ(x) is a weighted sum of y_1, ..., y_n. Its mean and variance can therefore be derived in the same way as we did for β̂1.
The mean of ŷ(x) is
E[ŷ(x)] = E[ Σ_{i=1}^n (1/n + (x − x̄) a_{n,i}) E(y_i | x_1, ..., x_n) ]   (by the LIE)
        = E[ Σ_{i=1}^n (1/n + (x − x̄) a_{n,i}) (β0 + β1 x_i) ]   (substituting the PRF)
        = E[ β0 + β1 x̄ + β1 (x − x̄) ]   (using Σ_{i=1}^n a_{n,i} = 0 and Σ_{i=1}^n a_{n,i} x_i = 1)
        = β0 + β1 x
        = μ_y(x),
so ŷ(x) is an unbiased estimator of μ_y(x).

The variance is
ω²_{n,x} = var(ŷ(x)) = E[ Σ_{i=1}^n (1/n + (x − x̄) a_{n,i})² var(y_i | x_1, ..., x_n) ],
using LIE-var, which can be estimated by
ω̂²_{n,x} = Σ_{i=1}^n (1/n + (x − x̄) a_{n,i})² û²_i,
where the û_i are the OLS residuals.
The approximate normality of ŷ(x) follows from the Central Limit Theorem.

4 Multiple Regression

An extremely useful feature of regression modelling is that it easily allows for the inclusion of
more than one explanatory variable. This is very useful for interpreting the roles of individual
explanatory variables and potentially for improving predictions. The techniques for OLS estimation and inference that we have discussed for simple regression extend straightforwardly to
the multiple regression setting. The models and methods will be discussed here, with formulae
postponed until the section on matrix notation for regression.

4.1 Population Regression Function

A linear PRF with multiple explanatory variables x_{1,i}, ..., x_{k,i} takes the form
E(y_i | x_{1,i}, ..., x_{k,i}) = β0 + β1 x_{1,i} + ... + βk x_{k,i}.   (53)
That is, the population conditional mean of y_i given x_{1,i}, ..., x_{k,i} is specified as a weighted sum of x_{1,i}, ..., x_{k,i}.
The interpretation of the coefficients β1, ..., βk is similar to that in a simple regression, with an important qualification. To interpret β1, consider the predicted value of y_i with x_{1,i} increased by one unit and with x_{2,i}, ..., x_{k,i} unchanged:
E(y_i | x_{1,i} + 1, ..., x_{k,i}) = β0 + β1 (x_{1,i} + 1) + ... + βk x_{k,i}.
Then
E(y_i | x_{1,i} + 1, ..., x_{k,i}) − E(y_i | x_{1,i}, ..., x_{k,i}) = β1,
so that we interpret β1 as the change in the prediction of y_i corresponding to a one unit increase in x_{1,i}, holding x_{2,i}, ..., x_{k,i} constant. This aspect of holding all of the other explanatory variables constant leads to the regression coefficient being called a marginal effect or partial effect. In general, for any j = 1, ..., k, the parameter βj is the change in the predicted value of y_i corresponding to a one unit increase in x_{j,i}, holding x_{h,i} constant for all h ≠ j.
The intercept β0 only has a meaningful interpretation if it makes sense for all of x_{1,i}, ..., x_{k,i} to take the value zero. In that case β0 is the predicted value of y_i when x_{1,i} = ... = x_{k,i} = 0.

4.2 Sample Regression Function and OLS

The SRF that estimates the PRF in (53) is
ŷ_i = β̂0 + β̂1 x_{1,i} + ... + β̂k x_{k,i},   (54)
where β̂0, β̂1, ..., β̂k are the values that minimise the sum of squared residuals
SSR(b0, b1, ..., bk) = Σ_{i=1}^n (y_i − b0 − b1 x_{1,i} − ... − bk x_{k,i})².
The separate formulae for β̂0, β̂1, ..., β̂k are messy and omitted for now, but can easily be expressed in matrix notation later. The OLS residuals are denoted
û_i = y_i − ŷ_i = y_i − β̂0 − β̂1 x_{1,i} − ... − β̂k x_{k,i}.
The R² for the regression is
R² = SSE / SST = Σ_{i=1}^n (ŷ_i − ȳ)² / Σ_{i=1}^n (y_i − ȳ)²,
which has the same derivation, properties and interpretation as the R² in a simple regression. That is, R² measures the proportion of the variance in y_i explained by the regression.
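Although the coefficient formulae are postponed until the matrix notation section, the minimisation of SSR and the R² calculation can be sketched numerically. The following Python fragment (simulated data with two regressors, illustrative only) fits the SRF by least squares and computes R² exactly as defined above.

    import numpy as np

    rng = np.random.default_rng(4)
    n = 100
    x1, x2 = rng.normal(size=n), rng.normal(size=n)
    y = 1 + 2 * x1 - 0.5 * x2 + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])          # intercept and two regressors
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimises the sum of squared residuals
    y_hat = X @ beta_hat
    r2 = np.sum((y_hat - y.mean())**2) / np.sum((y - y.mean())**2)   # SSE/SST
    print(beta_hat, r2)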

4.3 Example: house price modelling

An example data set from Chapter 4 of Wooldridge contains the following data on house prices and explanatory variables.
price : selling price of the house ($000)
assess : assessed value prior to sale ($000)
lotsize : size of the block in square feet
sqrft : size of the house in square feet
bdrms : number of bedrooms
The histogram and descriptive statistics for the dependent variable price are shown in Figure 44. It may be expected that increases in each of the explanatory variables assess, lotsize, sqrft, bdrms would predict an increase in the selling price of a house. The PRF in this case is
E(price_i | assess_i, lotsize_i, sqrft_i, bdrms_i) = β0 + β1 assess_i + β2 lotsize_i + β3 sqrft_i + β4 bdrms_i.   (55)

The specification of a multiple regression in Eviews simply involves a list of variables as shown in Figure 45, with the dependent variable price first, followed by the explanatory variables. The results are shown in Figure 46. The SRF can be written
pricê_i = 38.89 + 0.908 assess_i + 0.000587 lotsize_i − 0.000517 sqrft_i + 11.60 bdrms_i
          (23.77)  (0.119)          (0.000210)           (0.0174)           (5.55)
n = 88, R² = 0.83
The intercept of β̂0 = 38.89 has no meaningful interpretation since none of the explanatory variables would reasonably take the value zero. The slope coefficients are interpreted as follows.
1. β̂1 = 0.908 : an increase in assessed value of a house of $1,000 predicts an increase in the sale price of $908, holding lot size, house size and number of bedrooms fixed. That is, the coefficient measures the effect of variations in assessed value for a house of particular size. It is therefore capturing variations in other aspects that affect the price of the house besides its size, for example, its kitchen and bathroom quality, its suburb, proximity to transport, shops, schools and major roads, architectural style, renovated or not, and so on.

[Histogram of PRICE, 88 observations: mean 293.55, median 265.50, maximum 725.00, minimum 111.00, std. dev. 102.71, skewness 2.00, kurtosis 8.39, Jarque-Bera 165.28 (p = 0.000).]
Figure 44: Histogram and descriptive statistics of house price data


2. β̂2 = 0.000587 : each extra square foot of lot size predicts an increase in sale price of 58.7 cents, holding the other explanatory variables fixed. The interpretation could equivalently be expressed as saying that an extra 1000 square feet of lot size predicts an increase in sale price of $587, which may make the magnitudes more relevant. Note that this coefficient measures the effect of lot size on average sale price while holding house size and bedrooms fixed. That is, it measures the effect of a larger lot for a house of a given size. It does not measure the effect of a larger lot size with a larger house on it. It isolates the effect of lot size alone.
3. β̂3 = −0.000517 : each extra square foot of house size predicts a decrease in sale price of 51.7 cents, holding the other explanatory variables fixed. This finding is highly counterintuitive, but when we look at t tests in this regression it will be seen that the coefficient is not significantly different from zero, so this interpretation can be ignored.
4. β̂4 = 11.60 : each extra bedroom in a house predicts an increase in sale price of $11,600. Note that this interpretation holds house size constant, so it is specifically measuring the effect of number of bedrooms, not overall size of house. Generally these two variables would be positively related (a correlation of 0.53 in this sample) but the multiple regression allows their effects to be estimated separately.
The regression has an R² of 83% and so explains a high proportion of the variation in selling prices of houses in this sample.

4.4 Statistical Inference

The derivations of OLS properties in multiple regression are simple in matrix notation, but messy otherwise. For now they are simply stated. If (y_i, x_i), i = 1, ..., n, are i.i.d. and the PRF is given by (53), each OLS coefficient β̂j for j = 0, 1, ..., k is unbiased and satisfies
β̂j ~ᵃ N(βj, ω²_{j,n}),   (56)
where ω²_{j,n} is a variance that depends on the conditional variance var(y_i | x_{1,i}, ..., x_{k,i}). The implications are the same as in the simple regression.

Figure 45: Specifying the multiple regression for house prices in Eviews

Figure 46: Results for house price multiple regression

To carry out a hypothesis test of H0: βj = bj for any given bj the t statistic is
t = (β̂j − bj) / ω̂_{j,n},
where ω̂_{j,n} is the standard error of β̂j that is computed to estimate ω_{j,n}. As in simple regressions, the computation can be done imposing homoskedasticity (OLS standard errors) or allowing for heteroskedasticity (White's standard errors). The approximate null distribution of this statistic can be derived from (56) and is given by
t ~ᵃ t_{n−k−1},
which is the t distribution with n − k − 1 degrees of freedom. The degrees of freedom in a multiple regression is the sample size less the number of regression coefficients estimated. The decision rules for a hypothesis test at the α = 0.05 significance level are summarised in the following table, in which c_{0.025} and c_{0.05} are critical values from the t_{n−k−1} distribution.

  Rejection rule for H0: βj = bj
  Alternative        Critical value       p value
  H1: βj ≠ bj        |t| > c_{0.025}      p < 0.05
  H1: βj > bj        t > c_{0.05}         t > 0 and p < 0.10
  H1: βj < bj        t < −c_{0.05}        t < 0 and p < 0.10

A 95% confidence interval for a parameter βj is given by
[βj,L, βj,U] = [β̂j − c_{0.025} ω̂_{j,n}, β̂j + c_{0.025} ω̂_{j,n}],
where again c_{0.025} is the critical value for the t_{n−k−1} distribution.


To make a prediction from a multiple regression, values x1, ..., xk need to be specified for the explanatory variables. Then
ŷ(x1, ..., xk) = β̂0 + β̂1 x1 + ... + β̂k xk.
Subtracting this from (54) and rearranging gives
ŷ_i = ŷ(x1, ..., xk) + β̂1 (x_{1,i} − x1) + ... + β̂k (x_{k,i} − xk).   (57)
That is, ŷ(x1, ..., xk) can be calculated as the intercept in a regression of y_i on an intercept and (x_{1,i} − x1), ..., (x_{k,i} − xk).

4.5 Applications to house price regression

The significance of the regression coefficients reported in Figure 46 can be tested using t tests. Here is the test of whether increased lot size predicts increased sale price.
1. H0: β2 = 0
2. H1: β2 > 0
3. α = 0.05
4. Test statistic : t = 2.80, p = 0.0064
5. Decision rule : reject H0 if t > 0 and p < 0.10
6. Reject H0, so increased lot size does predict increased selling price, holding the other three regressors fixed.
The same analysis shows that assessed value and number of bedrooms also predict increased selling price. However the house size (sqrft), with p value of 0.9764, is not significant; the implication is that once we control for the size of the block the house is on and the number of bedrooms, the overall size of the house itself has no further predictive power for the selling price.
It may be of interest to test the null hypothesis H0: β1 = 1. Under this null we could take the assessed value as being an unbiased predictor¹ of the sale price, in the sense that changes in the assessed value would be matched one-for-one by changes in the predictor of the sale price. The test is as follows.
1. H0: β1 = 1
2. H1: β1 ≠ 1
3. α = 0.05
4. Test statistic : t = (0.908 − 1)/0.119 = −0.773
5. Decision rule : reject H0 if |t| > c_{0.025} = 2.000
6. Do not reject H0, so there is no evidence to suggest that assessed value is not an unbiased predictor of the sale price.
A 95% confidence interval can be constructed for β4 in order to give an interval estimate of the contribution to the selling price of each bedroom. The calculation is
[β4,L, β4,U] = [β̂4 − c_{0.025} ω̂_{4,n}, β̂4 + c_{0.025} ω̂_{4,n}]
             = [11.602 − 2.000 × 5.552, 11.602 + 2.000 × 5.552]
             = [0.498, 22.706],
which states that the predicted increase in selling value corresponding to an extra bedroom lies in the interval [$498, $22,706] with confidence level of 95%.
Suppose we want to predict the selling price of a four bedroom house with assessed value of $350,000, lot size of 6000 square feet, and house size of 2000 square feet. Following (57), we specify the SRF
pricê_i = pricê(350, 6000, 2000, 4) + β̂1 (assess_i − 350) + β̂2 (lotsize_i − 6000) + β̂3 (sqrft_i − 2000) + β̂4 (bdrms_i − 4).
The specification of this SRF in Eviews is shown in Figure 47, with results in Figure 48. This gives pricê(350, 6000, 2000, 4) = 327.913, or a predicted selling price of $327,913. The 95% prediction interval is
[pricê(350, 6000, 2000, 4) − c_{0.025} ω̂_{n,x}, pricê(350, 6000, 2000, 4) + c_{0.025} ω̂_{n,x}]
  = [327.913 − 2.000 × 8.035, 327.913 + 2.000 × 8.035]
  = [311.843, 343.983],
so the predicted selling price lies within $311,843 and $343,983 with confidence level of 95%.
¹ "Unbiased" has a different meaning in prediction, as opposed to statistical estimation.

Figure 47: Specification of the regression for prediction of the selling price of a four bedroom house with assessed value of $350,000, lot size of 6000 square feet, house size of 2000 square feet

Figure 48: OLS regression for predicting the selling price of a four bedroom house with assessed value of $350,000, lot size of 6000 square feet, house size of 2000 square feet

4.6 Joint hypothesis tests

In multiple regressions it can be interesting to test hypotheses about more than one coefficient at a time. The most common example is to jointly test that all slope coefficients are equal to zero, which implies that none of the explanatory variables have any predictive power for the dependent variable. The null hypothesis in (53) takes the form
H0: β1 = ... = βk = 0,
i.e. all k slope coefficients are set to zero. The alternative hypothesis is
H1: at least one of β1, ..., βk not equal to 0,
which covers the possibilities that one or some or all of the slope coefficients are not zero. The alternative hypothesis implies that the regression provides some explanatory power for y_i. The most common way of testing this null is using an F test. The test statistic is
F = [(SSR0 − SSR1)/k] / [SSR1/(n − k − 1)],
where SSR0 is the sum of squared residuals from the SRF under H0:
ỹ_i = β̃0,
and SSR1 is the sum of squared residuals from the SRF under H1:
ŷ_i = β̂0 + β̂1 x_{1,i} + ... + β̂k x_{k,i}.
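Given the two sums of squared residuals, the F statistic and its p-value follow directly from the formula above. A brief sketch (Python/scipy; the SSR values shown are hypothetical, chosen only to illustrate the arithmetic):

    from scipy import stats

    ssr0, ssr1 = 950_000.0, 160_000.0    # hypothetical restricted and unrestricted SSRs
    n, k = 88, 4
    F = ((ssr0 - ssr1) / k) / (ssr1 / (n - k - 1))
    p = stats.f.sf(F, k, n - k - 1)      # p-value from the F(k, n-k-1) distribution
    print(F, p)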
The null distribution of the F statistic is an F_{k,n−k−1} distribution; that is, an F distribution with k and n − k − 1 degrees of freedom. Tables of critical values are provided for this distribution in Wooldridge, but Eviews provides convenient p-values. For the regression results for house prices in Figure 46, the F statistic is reported (F = 100.7409) along with its p-value (p = 0.0000). The test proceeds as follows.
1. H0: β1 = β2 = β3 = β4 = 0
2. H1: at least one of β1, β2, β3, β4 not equal to zero
3. α = 0.05
4. Test statistic : p = 0.0000
5. Decision rule : reject H0 if p < 0.05.
6. Reject H0, at least one of the regressors has significant explanatory power for housing prices.
This F test is very convenient and hence popular for this hypothesis, but unfortunately is not valid in the presence of heteroskedasticity.
Just as White's standard errors can be used to construct a t test that is valid in the presence of heteroskedasticity, there is a modification of the F test to allow for heteroskedasticity. The formula is not easily expressed without matrix notation, but the implementation in Eviews is straightforward. To test H0: β1 = β2 = β3 = β4 = 0 in (55), make sure the regression has been estimated using White's standard errors, and then select View -- Coefficient Diagnostics -- Wald Test - Coefficient Restrictions... as shown in Figure 49. In the subsequent dialogue box shown in Figure 50, specify the null hypothesis for the test. Eviews uses the syntax

Figure 49: Selecting a Wald test in a regression equation.

Figure 50: Specifying the null hypothesis for a Wald test
Figure 51: Results of the Wald F test on the house price regression.
c(1), c(2), ... corresponding to our regression coefficients β0, β1, ..., so the null hypothesis β1 = β2 = β3 = β4 = 0 is entered as shown in the Figure. The results of the test are shown in Figure 51. The heteroskedasticity-robust F statistic is F = 55.90, with p = 0.0000. The presentation and outcome of the test is therefore unchanged from that given preceding this paragraph, but at least now we know that the result is still valid even if there is heteroskedasticity.
It is possible to test other joint hypotheses as well. For example, in the housing regression (55) we might be interested in testing H0: β2 = β3 = β4 = 0, which would imply that the assessed value of the house fully takes into account all of the information about the size of the house and its block. That is, under H0 the PRF would be
E(price_i | assess_i, lotsize_i, sqrft_i, bdrms_i) = β0 + β1 assess_i,
which states that once we have the assessor's valuation, there is no extra explanatory power in the block size, house size, or number of bedrooms. This could be taken as a test of the efficiency of the assessor's valuation. The Wald test is carried out in Eviews following the same steps as in Figures 49 and 50, except that the hypothesis is now entered as only c(3)=0, c(4)=0, c(5)=0. The results are given in Figure 52, resulting in the following hypothesis test.
1. H0: β2 = β3 = β4 = 0
2. H1: at least one of β2, β3, β4 not equal to zero
3. α = 0.05
4. Test statistic : p = 0.0013
5. Decision rule : reject H0 if p < 0.05.
6. Reject H0, so the size of the house and block have significant predictive power for house prices in addition to that in the assessed value. This is evidence that the assessed value does not efficiently capture all information about the pricing of the house.
The question might be raised: why do we need this joint test of β2, β3, β4 when we can already see from the individual t tests that β̂2 and β̂4 are significantly different from zero? The answer to this lies in the significance level α. In hypothesis testing, we aim to make a decision about a hypothesis with probability of Type I error equal to α. If we do three separate t tests to test a single hypothesis about three coefficients then each of these t tests has a significance level of α, so the three of them together have a significance level that will be greater than α. Intuitively there are three opportunities for this procedure to make a Type I error instead of just one. So in order to test a hypothesis about three coefficients and keep the significance level controlled at α, it is necessary to do a single test (a Wald test) and not three separate tests.

Figure 52: Wald test of H0: β2 = β3 = β4 = 0 in the house price regression

As a final example of a joint test, we can combine the hypotheses about the unbiasedness and efficiency of the assessed value of the house as a predictor of the selling price. We can test the null hypothesis
H0: β1 = 1 and β0 = β2 = β3 = β4 = 0,
under which the PRF reduces to
E(price_i | assess_i, lotsize_i, sqrft_i, bdrms_i) = assess_i,
which states that the assessor's value is an unbiased predictor of the selling price (adjusts one-for-one and is not systematically too high or low) and is efficient in the sense of capturing all of the size characteristics of the house. The alternative hypothesis is
H1: β1 ≠ 1 and/or at least one of β0, β2, β3, β4 not equal to zero.
The Wald test in Eviews is carried out by specifying the null hypothesis as in Figure 53 to give the results in Figure 54. The hypothesis test is therefore done as follows.
1. H0: β1 = 1, β0 = β2 = β3 = β4 = 0
2. H1: β1 ≠ 1 and/or at least one of β0, β2, β3, β4 not equal to zero
3. α = 0.05
4. Test statistic : p = 0.0000
5. Decision rule : reject H0 if p < 0.05.
6. Reject H0, so that joint unbiasedness and efficiency of the assessor's value as a predictor of selling price is rejected.

4.7 Multicollinearity

In multiple regression there are potentially issues associated with the degree of correlation between
the explanatory variables. These go under the general heading of multicollinearity, which refers
to linear relationships among the explanatory variables.
4.7.1 Perfect multicollinearity

Perfect multicollinearity means that there is an exact linear relationship among some or all of the
explanatory variables. This makes the computation of the OLS estimator impossible and Eviews
will return an error message if this is attempted.


Figure 53: Specifying a Wald test of H0: β1 = 1 and β0 = β2 = β3 = β4 = 0

Figure 54: Wald test of H0: β1 = 1 and β0 = β2 = β3 = β4 = 0 in the house price equation

A version of the problem can occur in a simple regression if there is no variation in the explanatory variable x_i. Suppose we want to estimate the PRF
E(y_i | x_i) = β0 + β1 x_i,
but we have an unfortunate (or ill-designed) sample in which x_1 = x_2 = ... = x_n = c for some constant c. For example, we might want to regress y_i = wage_i on x_i = age_i, but in our sample every individual is the same age. Without variation in age in our sample, we can't expect to be able to estimate the effect of variations in age on wages. In the formula for the OLS estimator β̂1,
β̂1 = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)²,
note that x_i = c for every i implies that x̄ = c and hence that x_i − x̄ = 0 for every i, so that
β̂1 = 0/0,
which is undefined. Even in more complicated multiple regression cases, perfect multicollinearity induces this sort of divide by zero problem for the OLS estimator.
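The "divide by zero" problem is easy to see numerically. A minimal Python sketch (illustrative only, with made-up numbers) shows that with no variation in x the OLS denominator is exactly zero, and more generally that an exact linear relationship among regressors makes the regressor matrix rank deficient:

    import numpy as np

    x = np.full(10, 40.0)                      # every observation has the same age
    print(np.sum((x - x.mean())**2))           # denominator of the OLS slope: 0.0

    lotsize = np.array([6000., 8200., 5500., 9000., 7100., 6400., 9900., 5800., 7500., 8600.])
    sqrft = np.array([2000., 2600., 1800., 3100., 2200., 2300., 3400., 1700., 2500., 2900.])
    garden = lotsize - sqrft                   # exact linear combination of the other two
    X = np.column_stack([np.ones(10), lotsize, sqrft, garden])
    print(np.linalg.matrix_rank(X))            # rank 3 < 4 columns: perfect multicollinearity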
To illustrate in the house price example, suppose we wanted to include the size of the garden as a possible predictor of selling price. The size of the garden can be taken to be
garden_i = lotsize_i − sqrft_i,
that is, that part of the block not taken up by the house. The PRF
E(price_i | assess_i, lotsize_i, sqrft_i, bdrms_i) = β0 + β1 assess_i + β2 lotsize_i + β3 sqrft_i + β4 bdrms_i + β5 garden_i

Figure 55: Generating the garden variable


is then subject to the perfect multicollinearity problem because of the perfect linear relationship
between gardeni , lotsizei and sqrfti . To see what happens in Eviews if we try to estimate this
regression, we rst generate the gardeni variable as in Figure 55 and then specify the regression
as in Figure 56. Attempting to estimate this equation gives the error message shown in Figure
57.
Perfect multicollinearity can easily be xed by removing one or more explanatory variables
until the problem disappears. In the example, any one of gardeni , lotsizei or sqrfti can be removed
from the regression. The choice of which to drop depends on the practical interpretation of the
variables in each case. In this example, it might be argued that the most natural variable to
drop is lotsizei , since in the original PRF (55) the interpretation of 2 is really capturing garden
size anyway. Recall that 2 measures the change in predicted selling price for a one square foot
increase in lot size, holding all the other explanatory variables constant. A one square foot increase
in lot size holding house size constant must be a one square foot increase garden size, so the clarity
of the practical interpretation of the model could be improved by including gardeni instead of
lotsizei . The results are shown in Figure 58. Nearly all of the results are identical, except for the
coe cient on sqrfti , which can be explained by substituting lotsizei = sqrfti + gardeni into (55)
to obtain
E (pricei jassessi ; lotsizei ; sqrfti ; bdrmsi ) =

1 assessi

2 (sqrfti

+ gardeni ) +

1 assessi

2 gardeni

+(

3 sqrfti

3 ) sqrfti

4 bdrmsi

4 bdrmsi .

This show the PRF with gardeni included is the same as that with lotsizei included except that
the coe cient on sqrfti is changed to 2 + 3 . Therefore the coe cient on sqrfti in Figure 58 (i.e.
0.0000693) is the sum of the coe cients on lotsizei (i.e. 0.000587) and sqrfti (i.e. 0:000517) in
Figure 46. The other coe cient estimates and goodness of t are unchanged so overall the two
regressions statistically equivalent and the choice can be made on the grounds of which choice of
variables is more meaningfully interpretable.


Figure 56: Attempting to estimate the house price regression with garden_i included.

Figure 57: The Eviews error message when there is perfect multicollinearity.

Figure 58: House price regression with garden size instead of lot size.
4.7.2 Imperfect multicollinearity

Imperfect multicollinearity is a situation where some or all of the regressors are highly correlated with each other, but not with the correlation of 1 that would come with an exact linear relationship implying perfect multicollinearity. Imperfect multicollinearity does not invalidate any assumptions of OLS estimation, so computation of the estimator can proceed and its unbiasedness and distributional properties hold. The issue with imperfect multicollinearity is that the standard errors of the estimated regression coefficients can be quite large as a result, implying the estimates are not very precise and hence confidence intervals will be quite wide. One symptom of imperfect multicollinearity is a regression whose coefficients are insignificant according to the individual t tests (because of the large standard errors) but are significant according to the joint F test (or its Wald heteroskedasticity-consistent variant). More details are given in Wooldridge pp. 94-97 for the homoskedastic case.

5 Dummy Variables

A dummy variable (or indicator variable) can be used to include qualitative or categorical variables
in a regression. In the simplest case this refers to variables for which there are two categories,
for example an individual can be categorised as male/female or employed/unemployed or have
some/no private health insurance and so on. The inclusion of such characteristics in regression
models can be extremely informative.

5.1 Estimating two means

Consider the CEO salary data again. The sample of n = 209 CEOs has been drawn from several industries, one of which is summarised as "Utility", which includes firms in the transport and utilities industries (utilities includes electricity, gas and water firms). We can define a dummy

[Histogram of UTILITY, 209 observations: mean 0.172, median 0, maximum 1, minimum 0, std. dev. 0.379.]
Figure 59: Histogram of the utilities dummy variable


variable, or indicator variable, as
utilityi =

1; if rm i is in transport or utilities
0; if rm i is in any other industry

A histogram of this variable is shown in Figure 59, where it can be seen the variable is taking
only values 0 or 1. There are 36 rms in the sample in
transport and utilities, with 173 in other
36
1 P209
= 0:1722, as shown in
industries. The mean of this variable is therefore 209 i=1 utilitiesi = 209
the Figure.
Consider the simple regression
E (salaryi jutilityi ) =

1 utilityi :

(58)

If firm i is not in either transport or utilities then utility_i = 0, giving
E(salary_i | utility_i = 0) = β0,
so that β0 is the population mean of CEO salaries across all industries except transport and utilities. If firm i is in either transport or utilities then utility_i = 1, giving
E(salary_i | utility_i = 1) = β0 + β1,
so that the population mean of CEO salaries in the transport and utilities industries is β0 + β1. Therefore β1 measures the difference between average salaries in transport and utilities versus all other industries. We can therefore use (58) to estimate the mean salaries for these two industry groups and also test for differences between them.
Figure 60 shows the results of an OLS regression of CEO salary on an intercept and utility_i. The SRF is

\widehat{salary}_i = 1396.225 − 668.523 utility_i
                     (112.402)   (12.229)

implying β̂_0 = 1396.225 and β̂_1 = −668.523. The estimated average CEO salary across all industries other than transport and utilities is therefore $1,396,225, while the estimated average CEO salary in transport and utilities is $1,396,225 − $668,523 = $727,702. A test of the significance of β̂_1 is a test of whether average salaries in the transport and utilities industries differ from the others.

Figure 60: SRF of CEO salary on an intercept and utility dummy variable
1. H_0: β_1 = 0 (average CEO salaries across all industries are equal)
2. H_1: β_1 ≠ 0 (average CEO salaries are different in transport and utilities)
3. α = 0.05
4. Test statistic: p = 0.0000
5. Decision rule: reject H_0 if p < 0.05
6. Reject H_0, so average CEO salaries are significantly different in transport and utilities compared to other industries.

5.2 Estimating several means

Dummy variables can be used to estimate several means (or differences between means) at once. In the CEO salary dataset the firms are classified into four industries: utilities/transport, financial, industrial and consumer products. The additional dummy variables are

finance_i  = 1 if firm i is in the finance industry, 0 if firm i is in any other industry
indus_i    = 1 if firm i is in industrial production, 0 if firm i is in any other industry
consprod_i = 1 if firm i is in consumer products, 0 if firm i is in any other industry

Each firm in the sample falls into one of these four categories. The following PRF can be specified:

E(salary_i | utility_i, finance_i, indus_i) = β_0 + β_1 utility_i + β_2 finance_i + β_3 indus_i.    (59)

The implied mean for CEO salaries in the consumer products industry is found from setting utility_i = finance_i = indus_i = 0, giving

E(salary_i | utility_i = 0, finance_i = 0, indus_i = 0) = β_0.

The average salaries for the other three industries are defined relative to the consumer products industry (the base category in this case). For utilities/transport we have utility_i = 1 and finance_i = indus_i = 0, so

E(salary_i | utility_i = 1, finance_i = 0, indus_i = 0) = β_0 + β_1.

For the finance industry we have finance_i = 1 and utility_i = indus_i = 0, so

E(salary_i | utility_i = 0, finance_i = 1, indus_i = 0) = β_0 + β_2.

For industrial production we have indus_i = 1 and finance_i = utility_i = 0, so

E(salary_i | utility_i = 0, finance_i = 0, indus_i = 1) = β_0 + β_3.

We do not also include the consprod_i dummy variable in the PRF because this would cause a perfect multicollinearity problem: since each firm in the sample is categorised as one (and only one) of utilities, financial, industrial or consumer products, we have the perfect linear relationship

utility_i + finance_i + indus_i + consprod_i = 1,

where 1 is the regressor for an intercept term. Therefore a PRF

E(salary_i | utility_i, finance_i, indus_i, consprod_i) = β_0 + β_1 utility_i + β_2 finance_i + β_3 indus_i + β_4 consprod_i

has an exact linear relationship among its regressors and therefore has perfect multicollinearity and cannot be estimated. One of the five explanatory variables (the intercept and the four dummies) needs to be omitted in order for the PRF to be estimated. In (59) we chose to omit consprod_i, but any one of the other regressors could have been omitted, including the intercept.
The SRF corresponding to (59) is shown in Figure 61. The estimated average salary for CEOs in the consumer products industry is β̂_0 = 1722.417, or $1,722,417. The estimated average salary for CEOs in the finance industry is β̂_0 + β̂_2 = 1722.417 − 377.5036 = 1344.913, or $1,344,913. The finance dummy variable is not significant at the 5% level (p = 0.2473, so that H_0: β_2 = 0 would not be rejected against H_1: β_2 ≠ 0), so there is no evidence of a significant difference between CEO salaries in the financial and consumer products industries. The interpretations for the utilities and industrial dummies follow similarly, with average CEO salaries in utilities differing significantly (p = 0.0009) from the consumer products industry, while those in industrial production do not differ significantly from consumer products (only just, with p = 0.0519).
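The same calculation can be scripted. The sketch below is illustrative only (the file name ceosal.csv and the industry labels utility, finance, indus, consprod are assumptions for the example): regressing salary on an intercept and three of the four industry dummies returns the base-category mean as the intercept and the differences from it as the slopes.

```python
# Sketch: estimating several industry means via dummy variables (hypothetical names).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("ceosal.csv")                       # assumed columns: salary, industry
dummies = pd.get_dummies(df["industry"]).astype(int) # one 0/1 column per industry label
df = df.join(dummies)

# Omit the consprod dummy (and keep the intercept) to avoid perfect multicollinearity.
fit = smf.ols("salary ~ utility + finance + indus", data=df).fit()
print(fit.params)   # Intercept = mean salary in the base category (consumer products);
                    # each slope = that industry's mean minus the base-category mean
```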

5.3 Dummy variables in general regressions

Dummy variables are useful for more than just estimating means, they can be used in more general
regression models as well. The dataset cochlear.wf1 contains observations on n = 91 severely
hearing impaired children who have received Cochlear Implants (CIs) to enable some form of
hearing. Some children have a single CI in one ear (a unilateral CI) while others have received
two CIs, one in each ear (bilateral CIs). It is believed that bilateral CIs provide an advantage
to children in real world listening and learning situations because they allow better directional
recognition of sounds and voices and also better hearing in background noise. However, CIs are
expensive (approximately $25,000 per implant), which must be borne by public or private health
insurance or the families themselves. Also the implantation of a CI involves damage to the inner
ear that then rules out the use of any newly discovered surgical procedure or device in the future
that might deliver improved performance. This background provides motivation for why it is
important to be able to detect and quantify improvements in listening and language that children
can achieve through the use of either one or two CIs.

Figure 61: SRF for average CEO salary classified by industry


The datafile contains outcomes for young children (ages 5-8) with either unilateral or bilateral CIs on the standardised² Peabody Picture Vocabulary Test (PPVT). The histogram and descriptive statistics for this dependent variable are shown in Figure 62. The datafile also contains the dummy variable bilat_i, which takes the value 1 if child i has bilateral CIs and the value 0 if they have unilateral CIs. The PRF

E(PPVT_i | bilat_i) = β_0 + β_1 bilat_i

allows a test of the difference of means between the bilateral and unilateral outcomes. The SRF in Figure 63 shows that β̂_0 = 85.21 is the estimated average score for unilateral children, while β̂_0 + β̂_1 = 85.21 + 9.36 = 94.57 is the estimated average score for bilateral children. The null hypothesis H_0: β_1 = 0 is rejected against H_1: β_1 ≠ 0 at the 5% level of significance (p = 0.0045), so there is a significant difference in outcomes between bilateral and unilateral children.

² The average score in the normal-hearing population is standardised to be 100.
5.3.1 Dummies for intercepts

There is also clinical experience that children should not be made to wait too long to receive their CIs. There is a window early in life when the young brain needs to receive sound and language inputs in order to develop the capacity to hear and understand language. Delaying the CIs can result in developmental delays that are very difficult to catch up later. A PRF to analyse this question is

E(PPVT_i | bilat_i, ageCI1_i, ageCI2_i) = β_0 + β_1 bilat_i + β_2 ageCI1_i + β_3 ageCI2_i,    (60)

where ageCI1_i and ageCI2_i are the respective ages in years when the first and second CIs were switched on. (For children with only a unilateral CI, ageCI2_i = 0.) Histograms and descriptive statistics for these ages are shown in Figures 64 and 65 and the SRF is shown in Figure 66. It has the form

\widehat{ppvt}_i = 92.77 + 16.22 bilat_i − 3.81 ageCI1_i − 3.13 ageCI2_i
                   (4.64)   (5.91)         (1.86)          (1.53)

A way to interpret this SRF is to think of it as containing two different SRFs: one for unilateral children (bilat_i = 0 and ageCI2_i = 0),

\widehat{ppvt}_i = 92.77 − 3.81 ageCI1_i,

and one for bilateral children (bilat_i = 1),

\widehat{ppvt}_i = 108.99 − 3.81 ageCI1_i − 3.13 ageCI2_i.

The role of the bilat_i dummy variable in this SRF is to allow the regression to have different intercepts for unilateral and bilateral children; in this case the intercept for bilateral children is higher (108.99 vs 92.77), reflecting their higher average scores relative to unilateral children. The bilat_i dummy variable is significant at the 5% level (p = 0.0074), so the difference between the regression lines is a statistically significant one.
To interpret its practical significance, suppose we compare the difference between the predicted outcomes for a unilateral child with ageCI1 = a_1 against a bilateral child also with ageCI1 = a_1 and who received their second CI at age 2 (the average age of second implant being 2.16 years). The unilateral prediction would be

\widehat{ppvt}(0, a_1, 0) = 92.77 − 3.81 a_1

and the bilateral prediction would be

\widehat{ppvt}(1, a_1, 2) = 108.99 − 3.81 a_1 − 3.13 × 2 = 102.73 − 3.81 a_1.

The difference between these two is the predicted difference due to the bilateral CI:

\widehat{ppvt}(1, a_1, 2) − \widehat{ppvt}(0, a_1, 0) = 102.73 − 92.77 = 9.96.    (61)

Relative to the average standardised score of 100, the bilateral child is predicted to score approximately 10% better on the PPVT language test. A method of computing this difference and its standard error is to recognise that

\widehat{ppvt}(1, a_1, 2) − \widehat{ppvt}(0, a_1, 0) = β̂_1 + 2 β̂_3,

and that β_1 + 2 β_3 can be directly estimated by re-parameterising the PRF as

E(PPVT_i | bilat_i, ageCI1_i, ageCI2_i) = β_0 + (β_1 + 2 β_3) bilat_i + β_2 ageCI1_i + β_3 (ageCI2_i − 2 bilat_i).

That is, the coefficient on bilat_i in a regression of PPVT_i on an intercept, bilat_i, ageCI1_i and (ageCI2_i − 2 bilat_i) delivers the estimated effect of a second CI at age 2 versus sticking with a unilateral CI. The results of this regression are shown in Figure 67, where it can be seen that β̂_1 + 2 β̂_3 = 9.95 (which differs from (61) only because of rounding) with standard error 3.85. The 95% confidence interval for this effect is therefore

[9.95 − 2.00 × 3.85, 9.95 + 2.00 × 3.85] = [2.25, 17.65].
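The re-parameterisation trick is easy to automate outside Eviews as well. The following is an illustrative sketch (the file cochlear.csv and column names ppvt, bilat, age_ci1, age_ci2 are assumptions): the coefficient on bilat in the re-specified regression is the estimate of β_1 + 2β_3, and its standard error and confidence interval come straight from the usual OLS output.

```python
# Sketch: standard error of a linear combination via re-parameterisation (hypothetical names).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cochlear.csv")                 # assumed columns: ppvt, bilat, age_ci1, age_ci2
df["age_ci2_adj"] = df["age_ci2"] - 2 * df["bilat"]

fit = smf.ols("ppvt ~ bilat + age_ci1 + age_ci2_adj", data=df).fit()
print(fit.params["bilat"])                       # estimate of beta1 + 2*beta3
print(fit.bse["bilat"])                          # its standard error
print(fit.conf_int().loc["bilat"])               # 95% confidence interval
```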


Figure 62: Histogram of language scores for children with Cochlear Implants (Series: PPVT, 91 observations; mean 92.10, median 94.00, minimum 59, maximum 139, std. dev. 16.43).

Figure 63: SRF for PPVT scores on bilateral CI dummy variable


Figure 64: Histogram and descriptive statistics for age at first CI (Series: AGE_CI1, 91 observations; mean 1.53, median 1.43, minimum 0.34, maximum 3.79, std. dev. 0.82).

Figure 65: Histogram and descriptive statistics for age at second CI (Series: AGE_CI2, 91 observations; mean 2.16, median 2.02, minimum 0.00, maximum 5.82, std. dev. 1.75).


Figure 66: SRF for PPVT language scores on bilateral CIs and implant ages.

Figure 67: SRF for the marginal effect of a second CI at age 2


5.3.2 Dummies for slopes

Dummy variables can be used to allow slope coefficients to change for different categories of observations as well. For example we could specify

E(PPVT_i | bilat_i, ageCI1_i, ageCI2_i) = β_0 + β_1 bilat_i + β_2 (bilat_i × ageCI1_i) + β_3 (unilat_i × ageCI1_i) + β_4 ageCI2_i,    (62)

where unilat_i = 1 − bilat_i is a dummy variable that takes the value 1 for a child with a unilateral CI and 0 for a child with a bilateral CI. Regressors that involve products of explanatory variables like this are often called interactions. In this case the effect of the bilateral CI is being allowed to interact with the age of first implant. The PRF for a unilateral child is then

E(PPVT_i | bilat_i = 0, ageCI1_i, ageCI2_i) = β_0 + β_3 ageCI1_i,

(recalling that ageCI2_i = 0 for unilateral children), while for a bilateral child it is

E(PPVT_i | bilat_i = 1, ageCI1_i, ageCI2_i) = (β_0 + β_1) + β_2 ageCI1_i + β_4 ageCI2_i.
The unilateral and bilateral PRFs therefore have potentially different intercepts and slope coefficients on ageCI1_i. The statistical significance of these differences can be tested.
The SRF is shown in Figure 68. The slope coefficients on ageCI1_i are different for unilateral and bilateral children, and in fact the slope is only significant (p = 0.0299) for bilateral children. A year of delay in the first CI predicts a fall in the PPVT outcome of 5.43 points for bilateral children but only 0.62 points for unilateral children. The prediction equations are

\widehat{ppvt}_i = 86.43 − 0.62 ageCI1_i − 2.76 ageCI2_i

for a unilateral child (for whom ageCI2_i = 0) and

\widehat{ppvt}_i = (86.43 + 23.69) − 5.43 ageCI1_i − 2.76 ageCI2_i = 110.12 − 5.43 ageCI1_i − 2.76 ageCI2_i

for a bilateral child.


A Wald test can be used to test H0 : 2 = 3 against H1 : 2 6= 3 . Under H0 the slope
coe cients on ageCI1i are equal for unilateral and bilateral children and the PRF would simplify
back to (60). The results of the Wald test (specied as c(3)=c(4)) are shown in Figure 69.
The details are
1. H0 :

2. H1 :

6=

3.

in (62)

= 0:05

4. Test statistic : F statistic p-value = 0:1639:


5. Decision rule : reject H0 if p < 0:05:
6. Do not reject H0 , the PRF could be simplied back to (60).


Figure 68: SRF of regression for language outcomes with bilateral interactions

Figure 69: Wald test for equality of ageCI1i coefficients


6 Some non-linear functional forms

In all cases so far we have assumed that the PRF is a linear function of the explanatory variables. Non-linearities can be introduced in an endless variety of ways; the interactions involving dummy variables in the previous section were one step in this direction. Here we look at some common examples of non-linear regression models that can be handled using the methods of OLS estimation and inference considered so far.

6.1 Quadratic regression

Some non-linearity can be introduced into a PRF by including the square of an explanatory variable. Consider a simple regression

E(y_i | x_i) = β_0 + β_1 x_i + β_2 x_i².    (63)

It no longer makes sense to interpret β_1 as the change in E(y_i | x_i) from a unit increase in x_i, since a unit increase in x_i necessarily increases x_i² as well. The term β_1 x_i + β_2 x_i² needs to be interpreted as a whole. The sign of β_2 dictates the shape of the parabola, a positive sign giving a convex shape (a valley) and a negative sign giving a concave shape (a hill). The general approach to regression interpretation is to consider the difference between the conditional mean at some value x,

E(y_i | x_i = x) = β_0 + β_1 x + β_2 x²,    (64)

and the conditional mean at x + 1,

E(y_i | x_i = x + 1) = β_0 + β_1 (x + 1) + β_2 (x + 1)².

This gives an expression for the change in the conditional mean that results from a one unit increase in x_i:

E(y_i | x_i = x + 1) − E(y_i | x_i = x) = β_1 + β_2 (2x + 1).    (65)

For a linear regression (i.e. one without the x_i² term) the expression is E(y_i | x_i = x + 1) − E(y_i | x_i = x) = β_1. The effect of the quadratic term is to introduce β_2 (2x + 1) into the marginal effect of x_i. An important difference from the linear model is that this marginal effect now depends on x, implying that the effect of increasing x_i by one unit now depends on what value of x_i we start from. The following application will illustrate the sense of this property. Sometimes the derivative of the regression function with respect to x is used as an approximation to the marginal effect shown in (65). The derivative is

dE(y_i | x_i = x)/dx = β_1 + 2 β_2 x,

which differs from (65) by β_2. Estimated values of β_2 are often small, so that the derivative is close to (65). We will focus on (65) as the exact marginal effect of a unit change in x_i.
For a given value x, the estimation of E(y_i | x_i = x) and its standard error follows by subtracting (64) from (63) to obtain

E(y_i | x_i) = E(y_i | x_i = x) + β_1 (x_i − x) + β_2 (x_i² − x²),

so that computing this prediction involves a regression of y_i on an intercept, (x_i − x) and (x_i² − x²), and taking the intercept estimate and its standard error. The approach is identical to that for linear models. The estimation of the marginal effect in (65) for a given value x can proceed by re-arranging (63) to give

E(y_i | x_i) = β_0 + (β_1 + β_2 (2x + 1)) x_i + β_2 (x_i² − (2x + 1) x_i).

That is, the marginal effect (65) for a given x is estimated from a regression of y_i on an intercept, x_i and (x_i² − (2x + 1) x_i), and taking the coefficient on x_i and its standard error.
A quadratic function has a turning point, either a maximum (for β_2 < 0) or a minimum (for β_2 > 0). The location of this turning point is found by setting the derivative equal to zero:

dE(y_i | x_i = x)/dx = 0  implies  x = −β_1 / (2 β_2).

This is the point at which the effect of x_i on E(y_i | x_i) changes from being positive to negative (for β_2 < 0) or negative to positive (for β_2 > 0). The estimated turning point is therefore x̂ = −β̂_1 / (2 β̂_2), where β̂_1 and β̂_2 are the usual OLS estimators of the PRF (63).
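The marginal effect (65) and the turning point are simple functions of the OLS estimates. The sketch below is illustrative only (generic variable names, no specific dataset from these notes is assumed):

```python
# Sketch: marginal effect and turning point for a quadratic SRF y-hat = b0 + b1*x + b2*x^2.
import numpy as np
import statsmodels.api as sm

def quadratic_fit(y, x):
    X = sm.add_constant(np.column_stack([x, x**2]))
    return sm.OLS(y, X).fit()

def marginal_effect(fit, x):
    b1, b2 = fit.params[1], fit.params[2]
    return b1 + b2 * (2 * x + 1)      # exact effect of a one-unit increase, as in (65)

def turning_point(fit):
    b1, b2 = fit.params[1], fit.params[2]
    return -b1 / (2 * b2)             # maximum if b2 < 0, minimum if b2 > 0
```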

6.1.1 Example: wages and work experience

A quadratic term is commonly included when modelling wages in terms of labour market experience. The workfile wages.wf1 contains data on n = 1260 individuals with their wages ($/hour) and various potential explanatory variables. A straightforward linear PRF would take the form

E(wage_i | female_i, educ_i, exper_i) = β_0 + β_1 female_i + β_2 educ_i + β_3 exper_i,

where educ_i is years of education, exper_i is years of labour force experience and female_i is a dummy variable that takes the value 1 if individual i is female and 0 otherwise. The SRF is shown in Figure 70. Each of the slope coefficients is significant at the 5% level and each has an interpretable sign and magnitude. An extra year of education increases the estimated conditional mean of wages by $0.45, an extra year of experience increases the estimated conditional mean of wages by $0.08, and the average wage of females is $2.57 below that of males with the same levels of education and experience.
However, experience is generally not modelled in a linear form in a wage equation like this. The idea is that the initial years of work experience involve the greatest learning and greatest increases in productivity for an employee, resulting in the greatest increases in wages at that time. As experience increases, the rate of growth in productivity, and hence wages, slows. This effect can be captured by including a quadratic term in the PRF:

E(wage_i | female_i, educ_i, exper_i) = β_0 + β_1 female_i + β_2 educ_i + β_3 exper_i + β_4 exper_i².    (66)

The interpretations of the female_i and educ_i variables in this model are unchanged, but the interpretation of experience must be altered as shown above. The coefficients β_3 and β_4 are no longer individually interpretable because it is impossible to make a one year increase in exper_i while holding exper_i² fixed (or vice versa). Instead the marginal effect on E(wage_i | female_i, educ_i, exper_i) of one extra year of work experience follows from (65):

E(wage_i | female_i, educ_i, exper_i + 1) − E(wage_i | female_i, educ_i, exper_i) = β_3 + β_4 (2 exper_i + 1).    (67)

The effect on average wages of an extra year of work experience depends on the amount of experience obtained so far. For an individual with one year of work experience, a second year of work experience will change the expected wage by β_3 + 3 β_4 dollars per hour. For an individual with 20 years of work experience, the next year of experience will change the expected wage by β_3 + 41 β_4 dollars per hour.
The SRF for (66) is shown in Figure 71, with β̂_3 = 0.2527 and β̂_4 = −0.0039. The quadratic term in experience is significant at the 5% level, so it adds some explanatory power for wages. Figure 72 gives a graphical representation of the contribution of the experience variables to the wage equation, given by β̂_3 exper_i + β̂_4 exper_i² plotted over the range of observed values of exper_i. Also shown for comparison is the linear term in experience from the SRF in Figure 70, given by 0.0847 exper_i. The quadratic function has a positive slope for all levels of experience from one year up to the turning point exper_i = −β̂_3 / (2 β̂_4) = 0.2527 / (2 × 0.0039) = 32.40 years. After that the quadratic has a negative slope. The implication is that extra experience increases average wages, at a decreasing rate, until experience reaches about 32.4 years. After that, extra work experience has a negative effect on average wages. The same information is also displayed in Figure 73, which graphs the effect of an extra year of work experience on average wages. The effect in the linear case is forced to be constant for all levels of experience at β̂_3 = 0.0847, while the effect in the quadratic case is β̂_3 + β̂_4 (2 exper_i + 1) = 0.2527 − 0.0039 (2 exper_i + 1). Again this shows that an extra year of experience raises average wages until experience reaches 32.4 years, at which point the effect crosses the x-axis and implies decreases in average wages.
Prediction in a quadratic regression works in the same way as in a linear model. Suppose we want to calculate the average wage for a female with 15 years of education and 10 years of work experience. Figure 74 shows the regression for this purpose, re-specified in terms of (female_i − 1), (educ_i − 15), (exper_i − 10) and (exper_i² − 10²). The resulting prediction is \widehat{wage}(1, 15, 10) = $5.08 with a standard error of 0.23 that can be used to compute the 95% prediction interval

[5.0827 − 1.980 × 0.2261, 5.0827 + 1.980 × 0.2261] = [$4.63, $5.53].

Suppose we also want a 95% confidence interval for the effect of an extra year of work experience on average wages for an individual with these characteristics. The desired effect is (67) with experience set to 10 years, i.e. β_3 + 21 β_4. The PRF (66) can be re-written as

E(wage_i | female_i, educ_i, exper_i) = β_0 + β_1 female_i + β_2 educ_i + (β_3 + 21 β_4) exper_i + β_4 (exper_i² − 21 exper_i),

so that a regression of wage_i on an intercept, female_i, educ_i, exper_i and (exper_i² − 21 exper_i) will provide the desired coefficient on exper_i. The SRF is shown in Figure 75, which shows that β̂_3 + 21 β̂_4 = 0.17 with 95% confidence interval computed as

[0.1707 − 1.980 × 0.0154, 0.1707 + 1.980 × 0.0154] = [$0.14, $0.20].

6.2 Regression with logs: explanatory variable

It is common practice to work with variables in logs rather than their original levels. Consider a PRF

E(y_i | x_i) = β_0 + β_1 log x_i.    (68)

Taking logs is only possible when x_i takes values greater than zero, so this specification is not always available. It can be used for a positive variable like years of work experience though. The interpretation of the effect of x_i in this PRF needs to be derived. Following the same general approach as in the quadratic model, we consider a fixed value x and compare

E(y_i | x_i = x) = β_0 + β_1 log x    (69)

Figure 70: Linear SRF for wages

Figure 71: SRF for wages with quadratic in experience


Figure 72: Quadratic and linear in experience components of SRFs (plotted against EXPER).

Figure 73: Effects of an extra year of work experience on average wages (linear and quadratic specifications, plotted against EXPER).


Figure 74: SRF to predict the wages for females with 15 years of education and 10 years of work
experience

Figure 75: SRF to estimate the effect on average wages of an extra year of experience for an individual with 10 years of experience


and

E(y_i | x_i = x + 1) = β_0 + β_1 log(x + 1).

The effect on the conditional mean of y_i of a one unit increase in x_i is therefore

E(y_i | x_i = x + 1) − E(y_i | x_i = x) = β_1 (log(x + 1) − log x) = β_1 log(1 + 1/x).    (70)

For a fixed value x this can be estimated by re-arranging (68) as

E(y_i | x_i) = β_0 + [β_1 log(1 + 1/x)] × [log x_i / log(1 + 1/x)],

so the desired marginal effect (70) for a given value x is estimated as the slope coefficient from a regression of y_i on an intercept and log x_i / log(1 + 1/x).
An alternative and very common interpretation of this PRF is to consider a 1% increase in x_i rather than a one unit increase. That is, instead of comparing E(y_i | x_i) at x_i = x and x_i = x + 1, we compare it at x_i = x and x_i = 1.01x. This gives

E(y_i | x_i = 1.01x) − E(y_i | x_i = x) = β_1 (log(1.01x) − log x)
                                        = β_1 (log 1.01 + log x − log x)
                                        = β_1 log(1.01)
                                        ≈ β_1 / 100,

where the last step uses log 1.01 = 0.00995 ≈ 0.01 = 1/100. Therefore a 1% increase in x_i results in a change of approximately β_1/100 in E(y_i | x_i). This interpretation is common because the result does not depend on x, so it gives β_1 a convenient interpretation without reference to some starting value x.
Prediction is carried out in the usual way by subtracting (69) from (68) to obtain

E(y_i | x_i) = E(y_i | x_i = x) + β_1 (log x_i − log x),

so the estimate of E(y_i | x_i = x) is the estimated intercept in a regression of y_i on an intercept and (log x_i − log x).
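The two interpretations of β_1 in (68) reduce to one-line calculations. A small sketch (generic names; it assumes an estimate b1 of β_1 is already available from the fitted regression):

```python
# Sketch: marginal effects implied by E(y|x) = b0 + b1*log(x).
import numpy as np

def unit_increase_effect(b1, x):
    # change in E(y|x) when x increases by one unit, as in (70)
    return b1 * np.log(1 + 1 / x)

def one_percent_effect(b1):
    # approximate change in E(y|x) when x increases by 1%
    return b1 / 100
```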
6.2.1 Example: wages and work experience

In the context of wages and work experience, using the log of work experience provides an alternative non-linear specification to a quadratic. Consider the PRF

E(wage_i | female_i, educ_i, exper_i) = β_0 + β_1 female_i + β_2 educ_i + β_3 log(exper_i).    (71)

Including experience in logs instead of linearly allows the effect of an extra year of experience on average wages to decrease as experience increases. The effect of an extra year of experience (holding female_i and educ_i constant) on the conditional mean of wages is obtained from (70) to be

β_3 log(1 + 1/exper_i).    (72)

Alternatively, an increase of 1% in work experience increases average wages by approximately β_3/100 dollars per hour.
The SRF for (71) is shown in Figure 76, and the experience component is illustrated in Figure 77, with the quadratic component from Figure 71 for comparison. Both functional forms capture the initially larger gains to work experience at the beginning of a career and the reduction of those gains as experience grows. The difference is that the log specification does not imply a negative effect of experience at any level, instead showing a continuing gradual increase in wages with experience at all levels. Figure 78 shows the estimated effect on average wages of an extra year of work experience, comparing the results from the log and quadratic models. The biggest differences occur at the ends of the distribution of experience, where data is most sparse, so choosing between the two specifications will not be simple. For now we consider each as providing a reasonable approximation to the role of experience and return later to the topics of model comparison and selection.
The average wage can again be estimated for a female with 15 years of education and 10 years of work experience, see Figure 79. This gives \widehat{wage}(1, 15, 10) = $5.27 with 95% prediction interval

[5.2728 − 1.980 × 0.2231, 5.2728 + 1.980 × 0.2231] = [$4.83, $5.71].

This interval mostly overlaps that from the quadratic model (i.e. [$4.63, $5.53]), so the predictions from the two models are very similar for this type of individual. We can expect them to differ more for very small or very large values of experience.
To estimate the effect of an extra year of experience for this individual, which from (72) would be

β_3 log(1 + 1/10) = β_3 log(1.1),

consider the re-specified PRF

E(wage_i | female_i, educ_i, exper_i) = β_0 + β_1 female_i + β_2 educ_i + θ_3 [log(exper_i) / log(1.1)],

where θ_3 = β_3 log(1.1). The results are shown in Figure 80, from which we find the estimated increase in average wages is $0.12 with 95% confidence interval

[0.1246 − 1.980 × 0.0099, 0.1246 + 1.980 × 0.0099] = [$0.11, $0.14].

This interval does not overlap that constructed with the quadratic model, so the two models make different predictions in this case.
For any values of the explanatory variables (i.e. regardless of work experience) an increase
of work experience by 1% increases the average wage by approximately $0.013. In this case a
one year change in experience is the more natural unit to consider, but in other applications a
percentage change is very natural.

6.3 Regression with logs: dependent variable

It is standard in econometrics to model a variable like wages in log form rather than in levels. There is no definite rule for choosing between logs and levels, but variables like wages or incomes that are positive (logs only apply to positive numbers) and generally highly positively skewed and non-normal are rendered less skewed and closer to normal by a log transformation. Figures 81 (wages in levels) and 82 (wages in logs) illustrate the point. This can lessen the impact of the small number of very large incomes. Also the approximate normal, t and F distributions that rely on the Central Limit Theorem will tend to work better for more symmetrically distributed data.

Figure 76: SRF for wages with log of work experience

Figure 77: Comparison of experience in quadratic and log form in the wage equations (plotted against EXPER).


Figure 78: Estimated effects of a year of extra work experience on average wages for the log and quadratic models (plotted against EXPER).

Figure 79: Prediction for a female with 15 years education and 10 years of work experience


Figure 80: Estimating the effect of an extra year of work experience for an individual with 10 years of work experience

Figure 81: Histogram of wages (Series: WAGE, 1260 observations; mean 6.31, median 5.30, minimum 1.02, maximum 77.72, std. dev. 4.66, skewness 4.81, kurtosis 54.01).


Figure 82: Histogram of log of wages (Series: LWAGE, 1260 observations; mean 1.66, median 1.67, minimum 0.02, maximum 4.35, std. dev. 0.59, skewness 0.08, kurtosis 3.43).


Consider the simple regression

E(log y_i | x_i) = β_0 + β_1 x_i.    (73)

The interpretation of this regression in terms of log y_i is simple. However we are rarely interested in log y_i for practical purposes; we are interested in y_i. So we want to work out the implications of this model in log y_i for y_i itself. That is, we would like to deduce an expression for E(y_i | x_i), but the fundamental difficulty is that E(log y_i | x_i) ≠ log E(y_i | x_i). The log is a non-linear function, so it cannot be interchanged with the expectations operator. In fact we know from Jensen's inequality that E(log y_i | x_i) < log E(y_i | x_i). Instead we write (73) as

log y_i = β_0 + β_1 x_i + (log y_i − E(log y_i | x_i)) = β_0 + β_1 x_i + u_i,    (74)

where

u_i = log y_i − E(log y_i | x_i).

Taking the exponential of both sides of (74) gives

y_i = exp(β_0 + β_1 x_i) exp(u_i).    (75)

Making any progress with the interpretation of this model requires the assumption that u_i is independent of x_i. This is a difficult assumption to interpret or test, but it tends to be made in practice without any discussion, so we will do so here. Under this assumption, taking the conditional expectation of both sides of (75) gives

E(y_i | x_i) = exp(β_0 + β_1 x_i) E[exp(u_i) | x_i]
             = exp(β_0 + β_1 x_i) E[exp(u_i)]
             = α_0 exp(β_0 + β_1 x_i),

where α_0 = E[exp(u_i)]. Now

E(y_i | x_i = x) = α_0 exp(β_0 + β_1 x)    (76)

and

E(y_i | x_i = x + 1) = α_0 exp(β_0 + β_1 (x + 1))
                     = α_0 exp(β_0 + β_1 x) exp(β_1)
                     = E(y_i | x_i = x) exp(β_1),

so

E(y_i | x_i = x + 1) − E(y_i | x_i = x) = E(y_i | x_i = x) (exp(β_1) − 1).

This is frequently expressed as the percentage change

[E(y_i | x_i = x + 1) − E(y_i | x_i = x)] / E(y_i | x_i = x) × 100% = (exp(β_1) − 1) × 100%.    (77)

That is, a one unit increase in x_i produces a (exp(β_1) − 1) × 100% change in E(y_i | x_i).
It is common to approximate exp(β_1) − 1 by β_1 (an approximation that works best for small β_1), so that the interpretation of the model becomes

[E(y_i | x_i = x + 1) − E(y_i | x_i = x)] / E(y_i | x_i = x) × 100% ≈ β_1 × 100%.    (78)

That is, a one unit increase in x_i produces an approximate β_1 × 100% change in E(y_i | x_i). The convenience of using β_1 instead of having to compute (exp(β_1) − 1) means this approximate interpretation is more often used in practice. The approximation can also be derived using calculus:

d/dx E(y_i | x_i = x) = α_0 exp(β_0 + β_1 x) β_1 = E(y_i | x_i = x) β_1,

which implies

[1 / E(y_i | x_i = x)] d/dx E(y_i | x_i = x) × 100% = β_1 × 100%,

each side of which is an approximation to each side of (77). We will proceed with (78) as the interpretation of (73).
Estimation of E(y_i | x_i) is more difficult in a model that is expressed in terms of log y_i. It is straightforward to obtain the SRF

\widehat{log y}_i = β̂_0 + β̂_1 x_i

from an OLS regression of log y_i on an intercept and x_i. The prediction equation

\widehat{log y}(x) = β̂_0 + β̂_1 x

provides the right way to estimate E(log y_i | x_i = x). It is then tempting to use

exp(\widehat{log y}(x)) = exp(β̂_0 + β̂_1 x)

to estimate E(y_i | x_i = x), but this would not be right because it omits the α_0 term in the correct expression (76). Since α_0 is necessarily greater than one (because E(log y_i | x_i) < log E(y_i | x_i)), using ŷ(x) = exp(β̂_0 + β̂_1 x) will systematically under-estimate E(y_i | x_i = x), i.e. it will be negatively biased. An estimator of α_0 is required to correct this bias. Since α_0 = E[exp(u_i)] (the population mean of exp(u_i)), a natural estimator is the sample mean

α̂_0 = (1/n) Σ_{i=1}^{n} exp(û_i),

where

û_i = log y_i − β̂_0 − β̂_1 x_i

are the usual OLS residuals from the SRF corresponding to (73). The prediction equation

ŷ(x) = α̂_0 exp(β̂_0 + β̂_1 x)

can then be used.

Figure 83: SRF for log(wage_i)
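The correction factor α̂_0 is easy to compute from the residuals of the log regression. The following is an illustrative sketch of the formulas above (generic names; it is not the Eviews procedure used in these notes):

```python
# Sketch: predicting y in levels from a regression estimated in log(y),
# including the correction alpha0_hat = mean(exp(residuals)).
import numpy as np
import statsmodels.api as sm

def fit_log_model(y, X):
    X = sm.add_constant(X)
    return sm.OLS(np.log(y), X).fit()

def predict_levels(fit, X_new_with_const):
    # X_new_with_const must contain the same constant column used in estimation
    alpha0_hat = np.exp(fit.resid).mean()        # estimator of alpha0 = E[exp(u)]
    return alpha0_hat * np.exp(fit.predict(X_new_with_const))
```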
6.3.1 Example: modelling the log of wages

A common base specification for a wage equation takes the form of the PRF

E(log(wage_i) | female_i, educ_i, exper_i) = β_0 + β_1 female_i + β_2 educ_i + β_3 exper_i + β_4 exper_i².

The interpretation of this PRF is simple in terms of log(wage_i), but it is generally wage_i which is the quantity of interest. Using (77), an extra year of education increases the average wage by (exp(β_2) − 1) × 100%. The SRF for this model is given in Figure 83, in which β̂_2 = 0.0679. The estimated effect of an extra year of education in this regression is to increase the average wage by (exp(β̂_2) − 1) × 100% = (exp(0.0679) − 1) × 100% = 7.03%. The approximation (78) gives the effect as 6.79%, which is practically very similar. Moreover the standard error for this latter estimate β̂_2 is immediately available for computing a confidence interval, whereas computing a standard error for exp(β̂_2) − 1 goes beyond our scope.
The interpretation of the effect of work experience on wages involves a non-linear transformation of both variables. The general form is

E(log y_i | x_i) = β_0 + β_1 x_i + β_2 x_i²,

which can be written, following the steps leading to (76) above, as

E(y_i | x_i) = α_0 exp(β_0 + β_1 x_i + β_2 x_i²).

Now to find the marginal effect of x_i we take

E(y_i | x_i = x) = α_0 exp(β_0 + β_1 x + β_2 x²)

and

E(y_i | x_i = x + 1) = α_0 exp(β_0 + β_1 (x + 1) + β_2 (x + 1)²)
                     = α_0 exp(β_0 + β_1 x + β_2 x²) exp(β_1 + β_2 (2x + 1))
                     = E(y_i | x_i = x) exp(β_1 + β_2 (2x + 1)).

Thus the marginal effect can be expressed as

[E(y_i | x_i = x + 1) − E(y_i | x_i = x)] / E(y_i | x_i = x) × 100% = (exp(β_1 + β_2 (2x + 1)) − 1) × 100%
                                                                    ≈ (β_1 + β_2 (2x + 1)) × 100%.

This shows that the interpretation of a quadratic regression with a logged dependent variable simply combines the two elements of each of the transformations. The marginal effect of the quadratic regression is (β_1 + β_2 (2x + 1)) as before, but the presence of the logged dependent variable means that this effect needs to be interpreted as a percentage change in E(y_i | x_i), rather than an absolute change in E(y_i | x_i).
To compute the marginal effect of an extra year of experience for an individual with x years of experience, we can re-arrange the PRF to give

E(log wage_i | female_i, educ_i, exper_i) = β_0 + β_1 female_i + β_2 educ_i + (β_3 + β_4 (2x + 1)) exper_i + β_4 (exper_i² − (2x + 1) exper_i),

so the required marginal effect is the coefficient on exper_i in a regression of log(wage_i) on an intercept, female_i, educ_i, exper_i and (exper_i² − (2x + 1) exper_i). For individuals with 10 years of experience, the SRF is shown in Figure 84. The interpretation of the estimate is that an extra year of work experience increases average wages by approximately 2.73%. A confidence interval for this is obtained from

[0.0273 − 1.980 × 0.00238, 0.0273 + 1.980 × 0.00238] = [0.0226, 0.0320],

or [2.26%, 3.20%].
6.3.2 Choosing between levels and logs for the dependent variable

Comparing the SRFs in Figures 71 and 83 shows that the model in log wages has a much higher R², which would appear to suggest it is superior. However, models with different dependent variables cannot be compared using R². Instead, the predictions from one of the models need to be transformed to match those of the other model to allow a valid comparison.
Suppose we want to compare

E(y_i | x_i) = β_0 + β_1 x_i

and

E(log y_i | x_i) = β_0 + β_1 x_i.
Figure 84: SRF for computing the marginal effect of one extra year of experience on wages for an individual with 10 years of experience

Let

ŷ_i = β̂_0 + β̂_1 x_i    (79)

denote the usual SRF for the levels of y_i. For the log of y_i, write the SRF as

\widehat{log y}_i = β̃_0 + β̃_1 x_i,

where β̃_0 and β̃_1 are the usual OLS estimators from log y_i on an intercept and x_i, the different notation only being used to distinguish them from β̂_0 and β̂_1 in the levels SRF. Now transform the fitted values for log y_i into fitted values for y_i, denoted

ỹ_i = exp(β̃_0 + β̃_1 x_i).

These fitted values are not unbiased estimators of E(y_i | x_i) since they omit consideration of α_0 in (76), but this turns out not to matter for the R² comparison. The comparison is made by computing the R² from a regression of y_i on an intercept and ỹ_i, and then comparing this with the R² from (79). Whichever is larger suggests whether y_i should be logged or not. Note that this comparison is valid only when the two regressions (for y_i and log y_i) contain the same explanatory variables.
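The comparison can be scripted directly from these steps. A sketch (generic names, assuming the levels and log models use the same explanatory variables):

```python
# Sketch: comparing a levels model and a log model for the same dependent variable.
import numpy as np
import statsmodels.api as sm

def compare_levels_vs_logs(y, X):
    X = sm.add_constant(X)
    fit_levels = sm.OLS(y, X).fit()
    fit_logs = sm.OLS(np.log(y), X).fit()
    y_tilde = np.exp(fit_logs.fittedvalues)          # log-model fitted values back in levels
    aux = sm.OLS(y, sm.add_constant(y_tilde)).fit()  # R^2 for y implied by the log model
    return fit_levels.rsquared, aux.rsquared         # whichever is larger is preferred
```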
This procedure is made more convenient in Eviews because it offers the option of computing fitted values for y_i directly from a regression estimated in log y_i. In the regression for log wage_i, choose Proc - Forecast... as shown in Figure 85, and then ensure that the fitted values are obtained for wage, and not log(wage), as shown in Figure 86. This creates a variable called wagef in the workfile (its name can be changed if desired). Figure 87 shows the regression used to compute R² = 0.211, which is the percentage of variation in wage_i explained by the regression for log(wage_i). This R² is slightly higher than that for the regression for wage_i, implying the log transformation for wages is to be preferred in this case.

Figure 85: Choosing to calculate fitted values

Figure 86: Choosing to calculate fitted values for wage, not log(wage)


Figure 87: Regression to compute R2 for wagei from the log wagei regression

6.4 Practical summary of functional forms

Define the general notation μ_y(x) = E(y_i | x_i = x). The following gives a summary of the functional forms and their marginal effects and prediction equations.
Linear
PRF: E(y_i | x_i) = β_0 + β_1 x_i
SRF: ŷ_i = β̂_0 + β̂_1 x_i
Marginal effect: μ_y(x + 1) − μ_y(x) = β_1
Prediction of μ_y(x): E(y_i | x_i) = μ_y(x) + β_1 (x_i − x)

Quadratic x
PRF: E(y_i | x_i) = β_0 + β_1 x_i + β_2 x_i²
SRF: ŷ_i = β̂_0 + β̂_1 x_i + β̂_2 x_i²
Marginal effect: μ_y(x + 1) − μ_y(x) = β_1 + β_2 (2x + 1)
Marginal effect estimation: E(y_i | x_i) = β_0 + (β_1 + β_2 (2x + 1)) x_i + β_2 (x_i² − (2x + 1) x_i)
Prediction of μ_y(x): E(y_i | x_i) = μ_y(x) + β_1 (x_i − x) + β_2 (x_i² − x²)
Turning point: x = −β_1 / (2 β_2)

Log x
PRF: E(y_i | x_i) = β_0 + β_1 log x_i
SRF: ŷ_i = β̂_0 + β̂_1 log x_i
Marginal effect: μ_y(1.01x) − μ_y(x) ≈ β_1 / 100
Prediction of μ_y(x): E(y_i | x_i) = μ_y(x) + β_1 log(x_i / x)

Log y
PRF: E(log y_i | x_i) = β_0 + β_1 x_i
SRF: \widehat{log y}_i = β̂_0 + β̂_1 x_i
Marginal effect: [μ_y(x + 1) − μ_y(x)] / μ_y(x) × 100% ≈ β_1 × 100%
Prediction of log μ_y(x): E(log y_i | x_i) = log μ_y(x) + β_1 (x_i − x)
Prediction of μ_y(x): μ̂_y(x) = α̂_0 exp(\widehat{log y}(x)), with α̂_0 = (1/n) Σ_{i=1}^{n} exp(log y_i − \widehat{log y}_i)
R² for y_i: R² from an SRF of y_i on an intercept and exp(\widehat{log y}_i)

Log y + quadratic x
PRF: E(log y_i | x_i) = β_0 + β_1 x_i + β_2 x_i²
SRF: \widehat{log y}_i = β̂_0 + β̂_1 x_i + β̂_2 x_i²
Marginal effect: [μ_y(x + 1) − μ_y(x)] / μ_y(x) × 100% ≈ (β_1 + β_2 (2x + 1)) × 100%
Marginal effect estimation: E(log y_i | x_i) = β_0 + (β_1 + β_2 (2x + 1)) x_i + β_2 (x_i² − (2x + 1) x_i)
Prediction of log μ_y(x): E(log y_i | x_i) = log μ_y(x) + β_1 (x_i − x) + β_2 (x_i² − x²)
Prediction of μ_y(x): μ̂_y(x) = α̂_0 exp(\widehat{log y}(x)), with α̂_0 = (1/n) Σ_{i=1}^{n} exp(log y_i − \widehat{log y}_i)
R² for y_i: R² from an SRF of y_i on an intercept and exp(\widehat{log y}_i)

7 Comparing regressions

Comparing the fit of different regressions for the same dependent variable can be done in many different ways; there is not one correct approach. Four statistics will be discussed here for this purpose. First note, however, that while R² is a useful descriptive statistic for a single regression, it has only very limited use for comparing different regressions: it can only be used for comparing regressions with the same number of explanatory variables. The problem with R² is that it will never decrease when a new explanatory variable is added to a regression, no matter how little explanatory power it has. So comparing regressions with R² will always end up giving preference to the largest model. The four closely related statistics given here do not have this problem and should be used for regression comparison.

7.1 Adjusted R²

For any SRF

ŷ_i = β̂_0 + β̂_1 x_{1,i} + ... + β̂_k x_{k,i},

recall the definitions

SST = Σ_{i=1}^{n} (y_i − ȳ)²   (total sum of squares)
SSE = Σ_{i=1}^{n} (ŷ_i − ȳ)²   (explained sum of squares)
SSR = Σ_{i=1}^{n} û_i²          (residual sum of squares),

which satisfy

SST = SSE + SSR.

The R² is defined as

R² = SSE / SST = (SST − SSR) / SST = 1 − [SSR / (n − 1)] / [SST / (n − 1)].

The adjusted R², denoted R̄², is defined as

R̄² = 1 − [SSR / (n − k − 1)] / [SST / (n − 1)],

the adjustment being the inclusion of the degrees of freedom as the divisor of SSR in the numerator. A result of this change is that R̄² may decrease if an explanatory variable with little predictive power is added to a regression, so it is a legitimate strategy to compare regressions with different numbers of explanatory variables based on R̄² (as long as they have the same dependent variable). Another result of the change is that R̄² ≥ 0 need not always hold as it does for R². A negative R̄² is a sign of a regression with very little overall explanatory power.

7.2 Information criteria

There are three closely related information criteria that can be used for comparisons of regression models: the Akaike, Schwarz and Hannan-Quinn criteria. They have the general form

IC = log(SSR / n) + (k + 1) p / n,

where p is a penalty term taking the values

Akaike: p = 2
Schwarz: p = log n
Hannan-Quinn: p = 2 log log n

A regression is preferred to another if it has a smaller IC, whichever of the three is used.
The problem with having four different criteria for model comparison is that it is unclear which to rely on. All four methods are widely used in practice and each of them is derived from different principles and has different desirable (and undesirable) properties. In order, the R̄² is most inclined to prefer the larger of two regression models, followed by the Akaike IC, the Hannan-Quinn IC and then the Schwarz IC, which is most likely of the four to prefer the smaller of two regression models. We will rely on the Akaike criterion in this subject.
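The three criteria are simple functions of SSR, n and k. A sketch of their computation in the form used here (note that the AIC and BIC values reported by statistical software are often based on the log-likelihood and differ from these by constants, which does not affect model rankings):

```python
# Sketch: information criteria IC = log(SSR/n) + (k+1)*p/n.
import numpy as np

def info_criteria(ssr, n, k):
    # ssr: residual sum of squares; n: sample size; k: number of slope coefficients
    base = np.log(ssr / n)
    return {
        "Akaike":       base + (k + 1) * 2 / n,
        "Schwarz":      base + (k + 1) * np.log(n) / n,
        "Hannan-Quinn": base + (k + 1) * 2 * np.log(np.log(n)) / n,
    }

# For a fitted statsmodels OLS result `fit`: info_criteria(fit.ssr, int(fit.nobs), int(fit.df_model))
```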

7.3 Adjusted R² as an IC

The R̄² at first sight appears quite different from the other three ICs, but in fact it is very closely related. Choosing a model with a larger value of R̄² is identical to choosing a model with a smaller value of log(1 − R̄²), and

log(1 − R̄²) = log( [SSR / (n − k − 1)] / [SST / (n − 1)] )
            = log(SSR / n) + log( n / (n − k − 1) ) − log( SST / (n − 1) )
            = log(SSR / n) − log( 1 − (k + 1)/n ) − log( SST / (n − 1) )
            ≈ log(SSR / n) + (k + 1)/n − log( SST / (n − 1) ),

using the approximation log(1 − a) ≈ −a for small a. The last term does not depend on which regressors are included. Therefore choosing a regression with larger R̄² is (almost) equivalent to choosing a regression with smaller

log(SSR / n) + (k + 1)/n,

implying p = 1.

8 Functional form

An additional issue that can occur is one of incorrect functional form. This has implications for the estimation of a conditional mean, whether or not causal inference is of interest. In general suppose the true conditional expectation is

E(y_i | x_i) = β_0 + β_1 x_i + g(x_i)

but a linear SRF

ŷ_i = β̂_0 + β̂_1 x_i

is estimated. Recall the slope coefficient β̂_1 has the representation

β̂_1 = Σ_{i=1}^{n} a_{n,i} y_i,

where

a_{n,i} = (x_i − x̄) / Σ_{i=1}^{n} (x_i − x̄)²,

and Σ_{i=1}^{n} a_{n,i} = 0 and Σ_{i=1}^{n} a_{n,i} x_i = 1. Then

E[β̂_1] = E[ Σ_{i=1}^{n} a_{n,i} E(y_i | x_i) ]
        = E[ β_0 Σ_{i=1}^{n} a_{n,i} + β_1 Σ_{i=1}^{n} a_{n,i} x_i + Σ_{i=1}^{n} a_{n,i} g(x_i) ]
        = β_1 + E[ Σ_{i=1}^{n} a_{n,i} g(x_i) ]
        ≠ β_1,

where Σ_{i=1}^{n} a_{n,i} g(x_i) has the interpretation of being the slope coefficient from a regression of g(x_i) on x_i.
The general conclusion from this is that a misspecified functional form results in biased estimates of the conditional mean E(y_i | x_i). This is a different problem from omitted variables, which do not bias conditional mean estimates, although they will generally bias estimates of causal effects if these are of interest.

9 Regression and Causality

A regression is a statistical model for the conditional mean of a dependent variable given some explanatory variables. To take a simple example, the PRF

E(wage_i | educ_i) = β_0 + β_1 educ_i    (80)

measures how the average wage changes with different values of education. With β_1 > 0, we would find the average wage of individuals with 15 years of education is higher than the average wage of individuals with 12 years of education, and the difference between these two averages would be 3 β_1.
It is common in practice to want to take the interpretation of a regression further and to claim a causal relationship. For example, that an individual who undertakes a university degree (hence increasing their years of education from 12 to 15) can expect to increase their wages by 3 β_1 as a result of this extra education. This causal statement is a much stronger interpretation of (80) than simply saying that higher educated individuals have higher average wages, and is far more difficult to justify. Much research at the frontier of econometrics focusses on if and how different statistical models might be given causal interpretations. It is generally necessary to go beyond statistical arguments to a clear understanding of the nature of the practical question and the way that the data has been obtained.
In order to give (80) a causal interpretation, it is necessary that an individual's wages be determined in a manner that satisfies the mathematical relationship

wage_i = β_0 + β_1 educ_i + u_i,    (81)

where u_i is the disturbance term that captures all of the other factors that cause wages besides education, and it is necessary that this disturbance term satisfy

E(u_i | educ_i) = 0.    (82)

Taking the conditional expectation of both sides of (81) given educ_i and applying (82) gives (80). It is necessary that both (81) and (82) hold in order for the regression (80) to be given the interpretation that an extra year of education causes an individual's wage to rise by β_1. Sometimes this interpretation may be possible, but there are many ways in which (81) and especially (82) may be violated, even though (80) may be a valid representation of the conditional mean of wages. Note that (82) requires that education have no explanatory power for any of the factors that make up the disturbance term u_i, a requirement that can be very difficult to satisfy in practice.

9.1 Notation

One aspect of the notation here differs from that of the textbook. A regression is a statistical model of a conditional expectation, and so for our purposes is always represented explicitly as a conditional expectation as in (80). In Wooldridge and other textbooks, it is common to also represent a regression in the form (81), as well as sometimes in the form (80). In these notes the notation (81) will be reserved for an equation representing how the dependent variable is caused. This causal equation may or may not correspond to a regression equation, as we will now discuss.
To be clear, a regression model represents the conditional mean of the dependent variable and is therefore written in terms of that conditional mean (e.g. E(wage_i | educ_i)). A causal equation represents how the dependent variable itself is determined and is therefore written in terms of that dependent variable (e.g. wage_i). The regression model always measures the conditional mean, but if the regression model and the causal equation happen to coincide then the regression can also be given a causal interpretation.

9.2 Regression for prediction

Before discussing causal interpretations further, it should be noted that many regressions are not meant to be causal in the first place. Regressions for prediction / forecasting are a leading example. Consider the PRF for final exam marks

E(exam_i | asgnmt_i) = β_0 + β_1 asgnmt_i.    (83)

This provides a statistical model for how predicted final exam marks vary with assignment marks. It may have interest for both students and teachers in summarising the relationship between on-course assessment and the final exam. It is clearly not a causal regression though. Assignment marks do not cause exam marks. A better causal story would be that both assignment and exam marks are caused by some combination of study during the semester (including lecture and tutorial participation, reading and revision and so on) and pre-existing ability (extent of previous exposure to statistics, general intelligence and so on). A highly stylised causal model of this might be

exam_i = α_0 + α_1 study_i + α_2 ability_i + u_i
asgnmt_i = δ_0 + δ_1 study_i + δ_2 ability_i + v_i,

where u_i and v_i represent the disturbances capturing all the other causal factors that influence individual marks. Presumably all of α_1, α_2, δ_1, δ_2 are positive, so that the causal model generates a positive statistical relationship between assignment and exam marks, and this statistical relationship is captured by (83). So estimates of (83) may be useful for estimating predicted final exam marks, but they do not attempt to uncover any causal factors that produce either of those marks in the first place. Regression (83) is an example of the saying that correlation need not imply causation.
This discussion reveals one way in which an attempt at causal modelling may fail. A regression model E(y_i | x_i) = β_0 + β_1 x_i may be specified in the belief that x_i causes y_i, when the true story is that some other factor z_i causes both y_i and x_i and produces a purely statistical relationship between them.

9.3 Omitted variables

Omitted explanatory variables are a common reason that regression models fail to measure causal effects. The case of wages and education is famous for this problem in econometrics. Suppose wages are truly caused by

wage_i = β_0 + β_1 educ_i + β_2 ability_i + u_i,    (84)

where

E(u_i | educ_i, ability_i) = 0.    (85)

This is a highly simplified model of wages, but it is sufficient for this discussion. Natural ability is a difficult concept involving intelligence of various sorts, persistence, resilience and other such factors. Numerical measurement of natural ability is probably impossible and wage regressions do not contain this variable in practice. Nevertheless, ability is surely an important causal factor for an individual's productivity, and hence their wages, implying β_2 > 0 in (84).
In addition, more able individuals will generally obtain higher levels of education, since they can use their ability to qualify for higher education opportunities and also will benefit more from taking up such opportunities. We might therefore expect to find a statistical relationship between education and ability of the form

E(ability_i | educ_i) = γ_0 + γ_1 educ_i,    (86)

with γ_1 > 0. This education / ability relationship may or may not be causal, or causation may run in the opposite direction, but it doesn't matter for the discussion of interpretation of (84).
Now suppose we specify a PRF of the form

E(wage_i | educ_i) = δ_0 + δ_1 educ_i,    (87)

not including ability. The omission of ability does not introduce a problem for the SRF as an estimator of this PRF (it remains unbiased, with asymptotically normal coefficients and t statistics and so on), so the estimation of the conditional mean of wages given education is correct. The question is whether δ_1 measures the causal effect of education on wages, i.e. whether δ_1 = β_1 in (84)?
To answer this requires an extension of the LIE E[y] = E[E(y | x)]. A more general version is

E[y | z] = E[ E(y | x, z) | z ].

This has exactly the same structure as the basic LIE, but each of the expectations has z as an additional conditioning variable. In the current context this extended LIE can be used to write

E(wage_i | educ_i) = E[ E(wage_i | educ_i, ability_i) | educ_i ].    (88)

Taking the conditional expectation of (84) given educ_i and ability_i and applying (85) gives

E(wage_i | educ_i, ability_i) = β_0 + β_1 educ_i + β_2 ability_i,

and substituting this into (88) gives

E(wage_i | educ_i) = E[ β_0 + β_1 educ_i + β_2 ability_i | educ_i ]
                   = β_0 + β_1 educ_i + β_2 E(ability_i | educ_i).

Substituting (86) then gives

E(wage_i | educ_i) = β_0 + β_1 educ_i + β_2 (γ_0 + γ_1 educ_i)
                   = (β_0 + β_2 γ_0) + (β_1 + β_2 γ_1) educ_i.

Comparison with (87) reveals the relationship

δ_1 = β_1 + β_2 γ_1.

That is, δ_1 does not measure the causal effect β_1. Instead it measures a mixture of coefficients from both (84) and (86). The fact that the SRF for (87) estimates δ_1 and not β_1 is generally referred to as omitted variable bias. It is not bias in the statistical sense, since δ̂_1 is unbiased for δ_1 regardless. The so-called bias is not an estimator property, but is really the fact that the model (87) does not match the causal mechanism (84) and therefore has different parameter values.
In this case it is plausible to think that β_2 > 0 and γ_1 > 0, which implies δ_1 > β_1. That is, the regression (87) will over-state the causal effect of education on wages. For some intuition for this, imagine comparing average wages between two groups of individuals, the first group with 12 years of education, the second group with 15 years of education. The average wage for the second group will be higher (by 3 δ_1). But this difference is due to two factors: the second group has extra education, but will also consist of individuals of generally higher ability. So the average wage difference between the two groups is due to both education and ability differences, not education alone. Attributing the entire average wage difference to education is an error because the comparison fails to control for ability differences.
Note that omitted variables would not be a problem for causal estimation if γ_1 = 0. (It is assumed that β_2 ≠ 0 for this discussion, otherwise ability_i would be irrelevant anyway and could be safely omitted.) That is, if the included explanatory variable has no explanatory power for the omitted variable, there will be no omitted variable bias, i.e. δ_1 = β_1.
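A small simulation illustrates the relationship δ_1 = β_1 + β_2 γ_1. The sketch below is illustrative only (all numbers are invented): the short regression of wages on education alone recovers δ_1, while the long regression that also includes ability recovers the causal β_1.

```python
# Simulation sketch of omitted-variable bias: short regression estimates delta1 = beta1 + beta2*gamma1.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 5000
ability = rng.normal(size=n)
educ = 12 + 2 * ability + rng.normal(size=n)                  # ability predicts education
wage = 5 + 1.0 * educ + 3.0 * ability + rng.normal(size=n)    # causal: beta1 = 1, beta2 = 3

gamma1 = sm.OLS(ability, sm.add_constant(educ)).fit().params[1]   # slope of ability on educ
short = sm.OLS(wage, sm.add_constant(educ)).fit()
long = sm.OLS(wage, sm.add_constant(np.column_stack([educ, ability]))).fit()

print(short.params[1], 1.0 + 3.0 * gamma1)   # short-regression slope is close to beta1 + beta2*gamma1
print(long.params[1])                        # long-regression slope is close to the causal beta1 = 1
```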


9.4 Simultaneity

Another problem with causal interpretations of regression models arises when the causality between two variables runs in both directions. That is, there is causality from the explanatory variable to the dependent variable of the regression, but also causality in the other direction, from the dependent variable to the explanatory variable. In this case we say the variables are simultaneously determined.
Consider the CEO salary example, where firm profitability (as measured by Return on Equity) was used as an explanatory variable. It was found that average CEO salary varied with firm profitability. This is not the same thing, however, as saying that the level of CEO salary is caused by firm profitability. This may be true, or it may be that highly paid CEOs are more competent and cause firms to be more highly profitable, or a mixture of the two effects. If firms determine their CEO's salary on the basis of their profitability, and highly paid CEOs also cause higher profits, we would say the two outcomes are simultaneously determined. This might be represented in equation form as

salary_i = β_0 + β_1 roe_i + u_i    (89)
roe_i = γ_0 + γ_1 salary_i + v_i,    (90)

where u_i represents the other factors that determine CEO salary and v_i represents all the other factors that determine the firm's return on equity. In order for each of these equations to be given some sort of statistical interpretation, it is necessary to say something about u_i and v_i. In the first equation we would like to assume that E(u_i | roe_i) = 0, while in the second E(v_i | salary_i) = 0. These assumptions would allow each of these equations to be given regression representations. Unfortunately neither assumption is possible when there is simultaneity. For example, E(u_i | roe_i) = 0 implies that u_i and roe_i must be uncorrelated, but the simultaneous structure of the equations dictates that any factor that causes the CEO's salary must then also be a factor causing Return on Equity, because of salary's presence in the second equation. This can be made explicit by substituting the equation for salary_i into the equation for roe_i and re-arranging to give

roe_i = (γ_0 + γ_1 β_0) / (1 − γ_1 β_1) + [γ_1 / (1 − γ_1 β_1)] u_i + [1 / (1 − γ_1 β_1)] v_i.

This correlation between roe_i and u_i implies that E(u_i | roe_i) = 0 is not possible. Therefore the PRF

E(salary_i | roe_i) = δ_0 + δ_1 roe_i

does not have the same parameters as the causal equation (89), i.e. δ_1 ≠ β_1. The PRF provides a representation of the conditional mean of CEO salary given Return on Equity, and an unbiased estimate is provided by the SRF, but the conditional mean differs from the causal equation because of the simultaneity.

9.5 Sample selection

Sample selection problems can result in differences between the parameters of a PRF and the underlying causal mechanism. The problem arises when a simple random sample is not available, and instead the sample is chosen at least partly based on the dependent variable itself, or some other factor correlated with the dependent variable.
In Tutorial 5 it was found that a firm's CEO salary was a positive predictor of the risk of the firm's stock. Suppose there is a causal relationship

|return_i| = β_0 + β_1 salary_i + u_i,    (91)

with β_1 > 0 implying that higher CEO salaries cause higher risk in the stocks (greater magnitude movements in share price, either positive or negative). Further suppose for this story that

E(u_i | salary_i) = 0,

so that

E(|return_i| | salary_i) = β_0 + β_1 salary_i.    (92)

However, the risks undertaken by some highly paid CEOs may be been so large and gone so
wrong that their rms went bankrupt. Such rms with very large negative returns may therefore
be excluded from the sample (if, for example, their bankruptcy resulted in them being removed
from a database of currently trading rms). To make the story simple, suppose we only observed
rms for whom returni > 90 say, such that rms that lost more than 90% of their value went
bankrupt and were excluded for the database. (This 90% gure is just made up for this story,
rm bankruptcy is more complicated in practice of course!) In that case our regression model for
E (jreturni j jsalaryi ) is in fact a regression model for E (jreturni j jsalaryi ; returni > 90). That is,
if rms with returni
90 are unavailable for our sample, our regression model is really
E (jreturni j jsalaryi ; returni >

90) =

1 salaryi :

(93)

However E (jreturni j jsalaryi ; returni > 90)


E (jreturni j jsalaryi ), because the latter averages
over some larger absolute returns that are excluded from the former. The main point is that the
PRF (93) based on the available sample would not match the PRF (92) derived from the causal
equation (91) for all rms, so the coe cients in (93) would dier from the causal coe cients in
(91).
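The effect of this kind of selection can also be illustrated by simulation. The sketch below (all numbers are invented for illustration) generates returns whose magnitude increases with salary, drops the "bankrupt" firms with return_i ≤ −90, and compares the OLS slope in the full and selected samples.

import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Invented causal model: |return| = 20 + 10*salary + u, with a random sign
salary = rng.uniform(0, 10, size=n)           # salary in arbitrary units (made up)
u = rng.normal(scale=5, size=n)
abs_ret = np.clip(20 + 10 * salary + u, 0, None)
ret = abs_ret * rng.choice([-1, 1], size=n)   # the sign of the return is random

def ols_slope(x, y):
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.solve(X.T @ X, X.T @ y)[1]

full_slope = ols_slope(salary, np.abs(ret))

keep = ret > -90                              # firms losing more than 90% drop out
sel_slope = ols_slope(salary[keep], np.abs(ret)[keep])

print(f"slope, full sample:     {full_slope:.2f}")   # close to the causal 10
print(f"slope, selected sample: {sel_slope:.2f}")    # pulled away from 10 by the selection

Because the dropped observations are disproportionately the large negative returns of highly paid CEOs, the selected-sample slope no longer estimates the causal β₁.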

10 Regression with Time Series

Time series data differs in important respects from cross-sectional data. Time series data on a variable is collected over a period of time, as opposed to a cross-section which is collected at (at least approximately) a single point in time. Examples of time series data include observations on a share price or market index recorded each minute or each day or at any other frequency, or exchange rates measured similarly, or macroeconomic variables like price inflation or GDP growth that are measured monthly or quarterly, and so on. This time series aspect introduces different features to the data compared to a cross section. Firstly, the observations are ordered, meaning that there is a natural ordering in time that does not apply to cross sections. When we take a simple random sample of individuals or firms or countries there is no single order of observations that is naturally imposed (although they can of course be ordered according to any criterion we wish after they are collected).

Statistically, a very interesting feature of time series data is that there is generally some form of temporal dependence that is interesting to model. Temporal dependence means there is statistical dependence (i.e. correlation or predictability) between observations at different points in time. For example, there may be information in today's stock prices that is useful to predict movements in prices tomorrow, or information in this month's inflation figure about next month's inflation or interest rates or GDP growth, and so on. Modelling this dependence over time is of great interest both for forecasting / prediction purposes and also for attempts at causal modelling with time series. The dependence also means that the theory underlying regression using OLS is different, because the i.i.d. assumption is generally no longer applicable. That is, time series data cannot be collected using a simple random sample.

Variables with time series data are generally denoted y_t and x_t rather than y_i and x_i. The difference is purely convention, but it helps to remind us which type of data is in use for a particular model. Following Wooldridge, it will be useful to begin by denoting the dependent variable as y_t and an explanatory variable as z_t. The reason for the switch to z_t instead of x_t will become clear, but it isn't very important. A simple static regression with time series then looks like

    E(y_t | z_t) = β₀ + β₁ z_t,    (94)

which has the same structure as a cross-sectional regression, but needs different theoretical underpinnings without the structure of an i.i.d. sample. We will not pursue this, but instead discuss some more interesting time series models that are used both for forecasting and causal modelling.

10.1 Dynamic regressions

A dynamic regression is one that models the ways in which the relationships between variables can evolve over time. There are many ways of doing this, but just two of the most popular approaches will be covered here.

10.1.1 Finite Distributed Lag model

In specifying a regression model, the concept of conditioning is obviously fundamental. A regression model is a model of a conditional mean. In time series analysis it becomes important to be clear about exactly what is being conditioned on in any regression model. It is often of most practical interest to condition not just on z_t as in (94), but also on its past values as well. That is, a time series regression model is often specified conditional on all values of the explanatory variable that are observable at time t. The conditional expectation is written E(y_t | z_t, z_{t-1}, ..., z_1). The idea is that previous values of z_t might also be useful for explaining y_t. A regression of the form

    E(y_t | z_t, z_{t-1}, ..., z_1) = α₀ + δ₀ z_t + δ₁ z_{t-1} + ... + δ_q z_{t-q}    (95)

is often used. A variable of the form z_{t-j} (for any j > 0) is called a lag of z_t. The regression (95) is called a Finite Distributed Lag (FDL) model (the "Finite" part not being used by all authors). The number of lags q to include in this model can be determined on the basis of the sample size (using few lags if few observations are available), the frequency of the data (sometimes q = 12 for monthly data, q = 5 for daily data, etc.) or, most commonly, on the basis of statistical analysis of the model to see what value of q seems most appropriate for explaining y_t.

An FDL model captures the idea that the full effect of a change in z_t on the mean of y_t may not occur immediately, but may take several time periods. For example, a central bank may raise official interest rates in order to attempt to reduce the level of inflation, but it is well known that there are lags in adjustments in the economy such that interest rate changes take some months (as many as 12-18 months) for their effects to be fully felt. A very simple FDL model with monthly data to capture this idea would take the form

    E(inf_t | r_t, r_{t-1}, ..., r_1) = α₀ + δ₀ r_t + δ₁ r_{t-1} + ... + δ₁₂ r_{t-12},    (96)

which allows for current inflation (inf_t) to be explained by interest rate changes that were made up to 12 months ago. FDL models are typically used for policy analysis questions such as these.
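To make the mechanics concrete, the following Python sketch (simulated data; the coefficient values and lag length are chosen purely for illustration) builds the lagged regressors for a small FDL model and estimates it by OLS, dropping the first q observations for which the lags are unavailable.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, q = 200, 2

# Simulated explanatory variable and dependent variable (illustrative only)
z = rng.normal(size=n)
y = 1.0 + 0.5 * z + 0.3 * np.roll(z, 1) + 0.1 * np.roll(z, 2) + rng.normal(size=n)

# Build the matrix [z_t, z_{t-1}, ..., z_{t-q}]; the first q rows are dropped
# because the lags are not observed there.
lags = np.column_stack([z[q - j : n - j] for j in range(q + 1)])
X = sm.add_constant(lags)
fit = sm.OLS(y[q:], X).fit()

print(fit.params)                                   # estimates of (alpha_0, delta_0, delta_1, delta_2)
print("long run multiplier estimate:", fit.params[1:].sum())

The same construction carries over to any lag length q: the only change is how many initial observations are lost when the lagged columns are built.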
10.1.2 Autoregressive Distributed Lag model

FDL models can be extended to allow for lags of the dependent variable. The conditioning set for the regression is extended to cover not only present and past explanatory variables, but also past values of the dependent variable. The model is

    E(y_t | z_t, y_{t-1}, z_{t-1}, ..., y_1, z_1) = α₀ + ρ₁ y_{t-1} + ... + ρ_p y_{t-p} + δ₀ z_t + δ₁ z_{t-1} + ... + δ_q z_{t-q},    (97)

so that past values of y_t are permitted to have explanatory power for y_t. This is an additional way to introduce a concept of lagged effects or inertia into a model of a dynamic situation. Model (97) is called an Autoregressive Distributed Lag (ARDL) model. It is a flexible way of capturing dynamic effects, but requires more effort to interpret than the FDL model.
10.1.3 Forecasting

A small variation on the ARDL model is often used for forecasting. Forecasting is the attempt to predict a variable in the future. In the simplest case, a forecast is made one time period into the future. Regressions such as (95) and (97) are not useful for forecasting because they contain z_t as an explanatory variable for y_t, which means a forecast for the future value of y_t is being expressed in terms of the future value of z_t. A forecasting model needs to remove any variables at time t from its set of conditioning variables, and may take a form such as

    E(y_t | y_{t-1}, z_{t-1}, ..., y_1, z_1) = α₀ + ρ₁ y_{t-1} + ... + ρ_p y_{t-p} + δ₁ z_{t-1} + ... + δ_q z_{t-q}.    (98)

In this model the forecast of y_t (i.e. E(y_t | y_{t-1}, z_{t-1}, ..., y_1, z_1)) is expressed purely in terms of variables that are available in the previous time period t − 1.
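As an illustration, a one-step-ahead forecast from a fitted model of the form (98) is just the fitted conditional mean evaluated at the most recently observed values. A minimal Python sketch, assuming a model with one lag of y and one lag of z and using simulated data (all numbers invented for illustration):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 300

# Simulated data with some temporal dependence (illustrative only)
z = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 + 0.4 * y[t - 1] + 0.3 * z[t - 1] + rng.normal(scale=0.5)

# Regress y_t on an intercept, y_{t-1} and z_{t-1} (t = 2, ..., n)
X = sm.add_constant(np.column_stack([y[:-1], z[:-1]]))
fit = sm.OLS(y[1:], X).fit()

# One-step-ahead forecast of the next period's y uses the last observed y and z
a0, rho1, d1 = fit.params
forecast = a0 + rho1 * y[-1] + d1 * z[-1]
print("forecast of next period's y:", forecast)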
10.1.4 Application

Especially since the GFC, there has been considerable discussion in economics about the various possible effects of government debt on economic growth. The "austerity" story is that government debt crowds out private sector economic activity and undermines confidence, and hence prolongs the recession, giving rise to calls to cut government spending and hence government debt. The "fiscal stimulus" story is that at a time of recession the government should spend more than they otherwise might (going into further debt if necessary) in an effort to stimulate the economy and end the recession, leaving the task of then reducing the debt to when strong economic growth has resumed.

We will look at a simple dynamic model relating government debt and economic growth in Australia using annual data from 1971 to 2012 on the real GDP growth rate per year (from the RBA) and on the net government debt as a percentage of GDP (from the Australian government budget papers). Time series plots are shown in Figure 88 and the debt/GDP plot that follows it. Observe that the evolution of the debt/GDP ratio is much smoother than that of GDP growth, a fact that will inform our regression modelling later.

A time series regression model for the question of interest is the ARDL PRF

    E(growth_t | debt_t, growth_{t-1}, debt_{t-1}, ...) = α₀ + ρ₁ growth_{t-1} + ... + ρ_p growth_{t-p} + δ₀ debt_t + δ₁ debt_{t-1} + ... + δ_q debt_{t-q}.

The debt variables in this model can be used to measure the effect of government debt on economic growth. The inclusion of the lagged dependent variables is a simple way for the model to allow for the other dynamics in the economy.

Figure 88: Annual real GDP growth, Australia

Figure: Net government debt as a percentage of GDP

10.2 OLS estimation

The algebra of OLS estimation in these regressions is almost identical to that for cross-sectional regressions, with one exception. The presence of lags in these regressions requires adjustments to be made at the start of the sample. To illustrate, suppose we have observations for t = 1, ..., n, and specify the first-order FDL model

    E(y_t | z_t, z_{t-1}, ..., z_1) = α₀ + δ₀ z_t + δ₁ z_{t-1}.

This equation has a problem for t = 1 because it involves the variable z_{t-1} = z₀ on the right hand side, and this is unavailable. Strictly speaking, we should write down these models as applying only to values of t for which the variables are available. For example

    E(y_t | z_t, z_{t-1}, ..., z_1) = α₀ + δ₀ z_t + δ₁ z_{t-1},    t = 2, ..., n,

or

    E(y_t | z_t, z_{t-1}, ..., z_1) = α₀ + δ₀ z_t + δ₁ z_{t-1} + ... + δ_q z_{t-q},    t = q + 1, ..., n,

or

    E(y_t | z_t, y_{t-1}, z_{t-1}, ..., y_1, z_1) = α₀ + ρ₁ y_{t-1} + ... + ρ_p y_{t-p} + δ₀ z_t + δ₁ z_{t-1} + ... + δ_q z_{t-q},    t = max(p, q) + 1, ..., n,

and so on.

OLS estimation therefore does not use all n observations, it only uses those observations for which the regression is well-defined for the sample. In the preceding examples, OLS will use respectively n − 1, n − q and n − max(p, q) observations.

The theory for time series regressions is more difficult than for i.i.d. regressions. Only an outline of some practically important points is given here.
10.2.1 Bias

Unbiasedness is more difficult to show in time series regressions, and often does not hold. Recall the unbiasedness proof for an i.i.d. regression

    E(y_i | x_i) = β₀ + β₁ x_i,

in which the OLS estimator is written

    β̂₁ = Σᵢ (x_i − x̄) y_i / Σᵢ (x_i − x̄)² = Σᵢ a_{n,i} y_i,

and then the independence part of the i.i.d. conditions is used to deduce that

    E(y_i | x_i) = E(y_i | x_1, ..., x_n),    (99)

so that

    E[β̂₁] = E[ Σᵢ a_{n,i} E(y_i | x_1, ..., x_n) ]    (100)
           = E[ Σᵢ a_{n,i} (β₀ + β₁ x_i) ]    (101)
           = β₀ E[ Σᵢ a_{n,i} ] + β₁ E[ Σᵢ a_{n,i} x_i ]
           = β₁

(using Σᵢ a_{n,i} = 0 and Σᵢ a_{n,i} x_i = 1). The crucial condition is (99), since without that the step from (100) to (101) cannot happen.

Small biases.  Consider a simple time series regression

    E(y_t | x_t, ..., x_1) = α₀ + α₁ x_t,    (102)

where x_t might be an explanatory variable such as z_t in (95), or x_t might just be the lagged dependent variable x_t = y_{t-1}, in which case we would have the so-called AR(1) model

    E(y_t | y_{t-1}, y_{t-2}, ..., y_1) = α₀ + α₁ y_{t-1},    (103)

which is often used as a very simple forecasting model. Now the crucial condition, analogous to (99), that is required is

    E(y_t | x_t, ..., x_1) = E(y_t | x_n, ..., x_1).    (104)

If this is true then x_t is said to be a strictly exogenous regressor and the OLS estimator of α₁ is unbiased. However, in a time series setting without independence across time, (104) can easily fail.

The simplest situation in which (104) is certain to fail is in the AR(1) model. In that case we have

    E(y_t | y_{n-1}, ..., y_1) = y_t,

since y_t is included in the conditioning set y_{n-1}, ..., y_1. Thus E(y_t | y_{n-1}, ..., y_1) differs from E(y_t | y_{t-1}, y_{t-2}, ..., y_1) in (103) for all t = 2, ..., n − 1, implying that strict exogeneity does not hold in this model. The OLS estimator of an AR(1) model is biased, and more generally the OLS estimator of any model with a lagged dependent variable (e.g. any ARDL model) will also be biased. It turns out, however, that this bias is "small" in the sense that it does not arise from a misspecification of the model and disappears as the sample size grows. That is, in a reasonable sized sample we can expect the bias to be practically unimportant (just as in a reasonable sized sample we can treat the OLS coefficients and t statistics as being approximately normal and t distributed).
In (102) it is also possible for there to be bias if x_t is not a lagged dependent variable, depending on the nature of the relationship between x_t and y_t. If y_t has some explanatory value for future values of x_t (i.e. the leads x_{t+j} for some j > 0) then (104) will fail. For example, if y_t is correlated with x_{t+1} as well as x_t then we may have the relationship

    E(y_t | x_n, ..., x_1) = β₀ + β₁ x_t + β₂ x_{t+1},

which is not equal to (102). This latter relationship is usually not interesting from a practical perspective, since saying that y_t is explained by future values of x_t is useless for forecasting and is unlikely to be meaningful in causal modelling. The point is that we can use it to derive an expression for the coefficients in (102) by taking expectations conditional on x_t, ..., x_1 and applying the LIE:

    E(y_t | x_t, ..., x_1) = E[ E(y_t | x_n, ..., x_1) | x_t, ..., x_1 ]
                           = β₀ + β₁ x_t + β₂ E[x_{t+1} | x_t, ..., x_1].

For simplicity suppose that x_t has the AR(1) conditional mean

    E[x_{t+1} | x_t, ..., x_1] = θ₀ + θ₁ x_t,

so that

    E(y_t | x_t, ..., x_1) = β₀ + β₁ x_t + β₂ (θ₀ + θ₁ x_t)
                           = (β₀ + β₂ θ₀) + (β₁ + β₂ θ₁) x_t.

That is, in this situation the value of α₁ in (102) is given by (β₁ + β₂ θ₁).

Now the usual unbiasedness proof approach gives

    E[α̂₁] = E[ Σₜ a_{n,t} E(y_t | x_n, ..., x_1) ]
           = E[ Σₜ a_{n,t} (β₀ + β₁ x_t + β₂ x_{t+1}) ]
           = β₁ + β₂ E[ Σₜ a_{n,t} x_{t+1} ]    (using Σₜ a_{n,t} = 0 and Σₜ a_{n,t} x_t = 1)
           = β₁ + β₂ E[θ̂₁],

where θ̂₁ is the OLS estimator of θ₁ in the AR(1) model for x_t. As an OLS estimator of an AR(1) model, E[θ̂₁] ≠ θ₁, and this is the source of the bias in α̂₁. Again this bias is small, so that for reasonable sample sizes the bias in α̂₁ can be treated as unimportant.

Such explanatory power of y_t for future x_t is quite realistic. For example, in (96), current values of inflation may be useful for forecasting future interest rate movements because the central bank may set interest rates partly in response to observed inflation. Some bias may be present in the OLS estimation of the FDL model (96) as a result.
Large biases.  Biases can also arise from mis-specifying the conditional expectation. For example, suppose (102) is the assumed model, while the true conditional expectation is

    E(y_t | x_n, ..., x_1) = α₀ + δ₀ x_t + δ₁ x_{t-1},

with the form of the conditioning set implying that x_t is a strictly exogenous regressor. This looks a lot like an omitted variables problem (i.e. x_{t-1} is omitted in (102)) but the consequences of omitting x_{t-1} are more like a functional form misspecification. That is, the estimates of the conditional mean E(y_t | x_t, ..., x_1) can be biased by the omission of x_{t-1}. To see this, we take the same approach as in the analysis of functional form misspecification. The OLS estimator α̂₁ in the SRF

    ŷ_t = α̂₀ + α̂₁ x_t

can be written

    α̂₁ = Σₜ a_{n,t} y_t,

as usual, with Σₜ a_{n,t} = 0 and Σₜ a_{n,t} x_t = 1. Now

    E[α̂₁] = E[ Σₜ a_{n,t} E(y_t | x_n, ..., x_1) ]
           = E[ Σₜ a_{n,t} (α₀ + δ₀ x_t + δ₁ x_{t-1}) ]
           = δ₀ + δ₁ E[ Σₜ a_{n,t} x_{t-1} ]
           ≠ δ₀,

where Σₜ a_{n,t} x_{t-1} is the slope coefficient in a regression of x_{t-1} on an intercept and x_t. If there is temporal dependence in x_t then this regression coefficient will generally be non-zero, implying that α̂₁ is not an unbiased estimator of δ₀. This bias does not disappear with larger samples. In attempting to model E(y_t | x_t, ..., x_1) (i.e. in any FDL or ARDL model), it is necessary to have a method of choosing enough lags to include in the regression in order to avoid inducing biases in the estimates.
Summary of biases.  A time series regression is a conditional expectation E(y_t | x_t, x_{t-1}, ..., x_1). The explanatory variables x_t may include lagged dependent variables y_{t-1}, y_{t-2}, ... and/or other explanatory variables z_t, z_{t-1}, .... That is, E(y_t | x_t, x_{t-1}, ..., x_1) can include FDL, AR and ARDL models.

A large bias occurs if E(y_t | x_t, x_{t-1}, ..., x_1) is not correctly specified, i.e. if insufficient lags are included or if the incorrect functional form is specified. A large bias is one that does not disappear no matter how large the sample size, and is one we should try to avoid by careful specification.

If E(y_t | x_t, x_{t-1}, ..., x_1) is correctly specified then the OLS estimates of its parameters may still be subject to small biases arising from the temporal dependence in the variables. This bias is difficult to avoid (i.e. it arises even in well-specified models) but will disappear for larger samples and is usually not worried about in practical work.
10.2.2 A general theory for time series regression

A general theoretical result that underpins much of practical time series analysis is as follows. If

1. the true conditional expectation for the PRF is

    E(y_t | z_t, y_{t-1}, z_{t-1}, ..., y_1, z_1) = α₀ + ρ₁ y_{t-1} + ... + ρ_p y_{t-p} + δ₀ z_t + δ₁ z_{t-1} + ... + δ_q z_{t-q},

2. both y_t and z_t are weakly dependent,

then the parameters of the OLS SRF

    ŷ_t = α̂₀ + ρ̂₁ y_{t-1} + ... + ρ̂_p y_{t-p} + δ̂₀ z_t + δ̂₁ z_{t-1} + ... + δ̂_q z_{t-q}

are consistent and asymptotically normal estimators of the parameters of the PRF.

There are some new terms in this. A consistent estimator is asymptotically unbiased, so that any bias disappears as the sample size grows. That is, a consistent estimator may exhibit the small bias discussed above, but not large bias. An asymptotically normal estimator is one that obeys the Central Limit Theorem, just like in the cross sectional case. Then the OLS estimators are approximately normal and subsequent t and Wald tests are valid. The practical implication of this result is that we can use the OLS estimators (and resulting t and Wald tests) in just the same way as they are used in cross-sectional regressions.

There are two important conditions to be satisfied. The first is that sufficient lags have been included in the ARDL model to remove any large bias, as discussed above. The second condition is that y_t and z_t are weakly dependent, which is a new concept. A time series x_t is weakly dependent if any dependence between x_t and x_{t-h} decreases quickly to zero as h increases to infinity. An implication is that the correlation between x_t and x_{t-h} must quickly decrease to zero as h increases, which we will use to check for weak dependence. In a time series plot, a strongly dependent time series may exhibit a trend (a persistent upwards or downwards movement) and/or evolve very smoothly, and needs to be transformed before being included in a time series regression.

The practical steps for time series regression are therefore the following.

1. Check each variable for weak dependence and transform if necessary.

2. Choose an FDL/AR/ARDL specification with sufficient lags.

3. Carry out estimation and inference by OLS methods as usual.

This is not the final word on time series regression, and there are more complications that can arise, but this approach is often sufficient.

10.3 Checking weak dependence

Deciding whether or not a time series displays weak or strong dependence can be a difficult and inexact process. The first piece of evidence to check is the time series plot. A strongly dependent time series may display a trend or a very smooth plot, while a weakly dependent time series will be less smooth. Figure 88 and the plot of debt/GDP suggest that GDP growth is weakly dependent because its plot is not smooth at all, while the plot of debt/GDP is quite smooth and suggests strong dependence.

The other piece of evidence we will use is the correlogram. For a weakly dependent time series the correlation cor(x_t, x_{t-h}) will decrease quickly to zero as h increases, while for a strongly dependent time series cor(x_t, x_{t-h}) will decrease much more slowly. This is not a clear-cut criterion to apply, but it is often informative. To obtain the correlogram of a time series in Eviews, choose View - Correlogram... for that series as shown in Figure 89, and then select "Level" for now. The correlograms for growth and debt/GDP are shown in Figures 90 and 91. The relevant correlations are in the graph under the heading "Autocorrelation" and in the table under the heading "AC". The autocorrelations for growth are all quite small and support the graphical evidence that GDP growth is weakly dependent. The autocorrelations for debt/GDP are considerably larger and decrease much more slowly towards zero. This evidence, together with the time series plot, leads us to treat debt/GDP as strongly dependent.

If a variable is judged to be strongly dependent then the usual next step is to take its first difference in order to achieve weak dependence. The difference is defined as

    Δdebt_t = debt_t − debt_{t-1},

i.e. the amount by which the debt/GDP ratio changes from one year to the next. Usually one difference is sufficient, but occasionally differencing twice may be required. Eviews uses the letter D to generate a difference, as in D(DEBT_GDP). The time series plot of Δdebt_t is shown in Figure 92, where it can be seen to be substantially less smooth than its undifferenced version. The correlogram of Δdebt_t is shown in Figure 93, where the autocorrelations can be seen to decrease towards zero faster than for the undifferenced version. These pieces of evidence are sufficient for us to proceed using debt_t in its first-differenced form.
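Outside of Eviews, the same checks can be sketched in a few lines of Python. The code below (using simulated series purely for illustration) computes sample autocorrelations and a first difference, the two tools used above.

import numpy as np

def autocorr(x, max_lag=10):
    """Sample autocorrelations cor(x_t, x_{t-h}) for h = 1, ..., max_lag."""
    x = np.asarray(x, dtype=float)
    xd = x - x.mean()
    denom = (xd ** 2).sum()
    return np.array([(xd[h:] * xd[:-h]).sum() / denom for h in range(1, max_lag + 1)])

rng = np.random.default_rng(4)
n = 200
weak = rng.normal(size=n)               # white noise: weakly dependent
strong = np.cumsum(rng.normal(size=n))  # random walk: strongly dependent

print("weak series AC:  ", np.round(autocorr(weak, 5), 2))
print("strong series AC:", np.round(autocorr(strong, 5), 2))

# First differencing the strongly dependent series restores weak dependence
d_strong = np.diff(strong)
print("differenced AC:  ", np.round(autocorr(d_strong, 5), 2))

The autocorrelations of the white noise and differenced series die out almost immediately, while those of the random walk decay very slowly, mirroring the growth and debt/GDP correlograms.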

Figure 89: Choosing a correlogram view of a variable

Figure 90: Correlogram of growth

Figure 91: Correlogram of debt/GDP

Figure 92: First difference of debt/GDP (D(DEBT_GDP))

Figure 93: Correlogram of Δdebt_t

10.4 Model specification

To illustrate the specification and interpretation of these models, we will first select an FDL model and then an ARDL model. This is just for illustration; usually the selection process could cover all possibilities together.

For any model of E(y_t | z_t, y_{t-1}, z_{t-1}, ..., y_1, z_1), it is a necessary condition for correct specification that the residuals of the SRF not display significant temporal dependence, in particular no autocorrelation. If we define the prediction error of the PRF as

    e_t = y_t − E(y_t | z_t, y_{t-1}, z_{t-1}, ..., y_1, z_1),

then it must be the case that

    E(e_t | z_t, y_{t-1}, z_{t-1}, ..., y_1, z_1) = 0,

and hence E(e_t) = 0 by the LIE. Then

    cov(e_t, e_{t-j}) = E(e_t e_{t-j})
                      = E[ E(e_t e_{t-j} | z_t, y_{t-1}, z_{t-1}, ..., y_1, z_1) ]    (by the LIE)
                      = E[ E(e_t | z_t, y_{t-1}, z_{t-1}, ..., y_1, z_1) e_{t-j} ]
                      = 0,

so that e_t has no correlation with any lags of itself. The important step in this proof is from the second line to the third, where e_{t-j} is taken outside of the conditional expectation. This is possible because e_{t-j} is a function of y_{t-j}, z_{t-j}, y_{t-j-1}, z_{t-j-1}, ..., y_1, z_1, all of which are contained in the conditioning set z_t, y_{t-1}, z_{t-1}, ..., y_1, z_1 for any j > 0. The practical implication of this is that any evidence of autocorrelation in the residuals of the SRF, which would imply cov(e_t, e_{t-j}) ≠ 0, suggests that the PRF has been misspecified and requires additional lags. A convenient check for autocorrelation is provided by the residual correlogram in Eviews, see Figure 94.
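The Q-statistics reported in the Eviews correlogram are built from the residual autocorrelations (a Ljung-Box type statistic). A minimal Python sketch of that calculation, applied here to simulated residuals purely for illustration, is:

import numpy as np
from scipy.stats import chi2

def ljung_box(resid, max_lag):
    """Ljung-Box Q statistic and p-value for H0: no autocorrelation up to max_lag."""
    e = np.asarray(resid, dtype=float)
    n = len(e)
    ed = e - e.mean()
    denom = (ed ** 2).sum()
    rho = np.array([(ed[j:] * ed[:-j]).sum() / denom for j in range(1, max_lag + 1)])
    q = n * (n + 2) * np.sum(rho ** 2 / (n - np.arange(1, max_lag + 1)))
    return q, chi2.sf(q, df=max_lag)

rng = np.random.default_rng(5)
resid = rng.normal(size=40)        # stand-in for SRF residuals
q, p = ljung_box(resid, max_lag=4)
print(f"Q(4) = {q:.2f}, p-value = {p:.3f}")   # a large p-value gives no evidence of autocorrelation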
Figures 95-97 show the results of FDL regressions for growth including zero, one and two lags respectively. The residual correlograms are shown in Figures 98-100. The last two columns of the residual correlograms are useful for autocorrelation testing. The null hypothesis for the Q-stat in row r is that there is no correlation between e_t and e_{t-j} for all j = 1, ..., r, with the p value for the test in the last column. For example, in Figure 98 we could set out a test for correlation at lags 1-4 as follows.

Figure 94: Choosing a residual correlogram in an estimated equation


1. H0 : cov (et ; et

j)

= 0 for j = 1; 2; 3; 4

2. H0 : cov (et ; et

j)

6= 0 for one of more of j = 1; 2; 3; 4

3.

= 0:05

4. Test statistic : p = 0:662


5. Decision rule : reject H0 for p < 0:05
6. Do not reject H0 , so there is no evidence of autocorrelation at any lags less than or equal
to four.
It is generally unnecessary to set out the full test like this for an autocorrelation check. It is sufficient to look down the last column of p values, and if any of them are less than 0.05 then consider the model as being misspecified and move on to another specification that attempts to address the problem, for example by using more lags. In this case, there is no evidence of residual autocorrelation in any of the three FDL models.

Since all three models pass the autocorrelation test, we can compare them using their AIC values. The FDL model with a single lag has the smallest AIC value and so would be chosen from among these three models.

ARDL models are also specified for growth, see Figures 101, 103 and 105. These are, respectively, ARDL(1,0), ARDL(1,1) and ARDL(1,2) models, implying each has a single lagged dependent variable (the AR(1) part) and respectively 0, 1 and 2 lagged explanatory variables. The residual correlogram for the ARDL(1,0) model in Figure 102 shows a significant lag nine autocorrelation, so this model is excluded from further comparisons. The other two models pass the residual autocorrelation tests. The ARDL(1,1) model has a lower AIC than the ARDL(1,2) model and is therefore preferred. The ARDL(1,1) model is also preferred to the FDL(1) model in Figure 96 according to the AIC. Of the six models considered here, the ARDL(1,1) would therefore be the one preferred overall. We will interpret both the ARDL(1,1) and FDL(1) models for illustrative purposes.

Figure 95: FDL model for growth with no lags

Figure 96: FDL model for growth with one lag


Figure 97: FDL model for growth with two lags

Figure 98: Residual correlogram for the FDL model with zero lags


Figure 99: Residual correlogram for the FDL model with one lag

Figure 100: Residual correlogram for the FDL model with two lags


Figure 101: ARDL(1,0) model for growth

Figure 102: Residual correlogram for ARDL(1,0) model


Figure 103: ARDL(1,1) model for growth

Figure 104: Residual correlogram for ARDL(1,1) model


Figure 105: ARDL(1,2) model for growth

Figure 106: Residual correlogram for ARDL(1,2) model


10.5 Interpretation

The interpretation of models with lags is not quite as straightforward as in static regressions.

10.5.1 Interpretation of FDL models

Consider an FDL model

    E(y_t | x_t, x_{t-1}, ..., x_1) = α₀ + δ₀ x_t + δ₁ x_{t-1}.

The individual coefficients have similar interpretations to those in usual regressions. If x_t is increased by one unit then the conditional mean of y_t changes by δ₀ units, and in these regressions this is called the impact multiplier. If x_{t-1} is increased by one unit then the conditional mean of y_t changes by δ₁ units, so this change takes one time period before it occurs. The coefficient δ₁ is called the lag one multiplier. These interpretations carry over to longer lags in FDL models.

Now suppose x_{t-1} were increased by one unit and this increase were allowed to remain in time t as well. In that case both x_t and x_{t-1} have been increased by one unit, so the effect on the conditional mean of y_t is δ₀ + δ₁. This is called the long run multiplier. This joint interpretation of the coefficients (i.e. allowing both x_t and x_{t-1} to increase, rather than increasing one and holding the other constant) makes practical sense. If x_t were the official interest rate and y_t inflation, for example, the central bank would be interested to measure the effect on inflation if the interest rate were increased by 1% this month and the increase allowed to stay in place next month. In that context, the long run multiplier has more practical meaning than the lag one multiplier, which measures the effect of a 1% increase in the interest rate in one month that is then reversed the following month.

The long run multiplier can be estimated directly by re-writing the FDL regression as

    E(y_t | x_t, x_{t-1}, ..., x_1) = α₀ + (δ₀ + δ₁) x_t − δ₁ Δx_t.

That is, regressing y_t on an intercept, x_t and Δx_t will give a direct estimate of the long run multiplier (δ₀ + δ₁) as the coefficient on x_t, along with its standard error for t statistics and confidence intervals.
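A minimal Python sketch of this re-parameterisation trick, on simulated data (the coefficient values are invented for illustration), checks that the coefficient on x_t in the transformed regression equals the sum of the FDL coefficients:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 500
x = rng.normal(size=n)
# Invented FDL(1) data: y_t = 1 + 0.8*x_t - 0.3*x_{t-1} + noise
y = 1.0 + 0.8 * x - 0.3 * np.roll(x, 1) + rng.normal(scale=0.5, size=n)

xt, xlag, yt = x[1:], x[:-1], y[1:]          # drop t = 1 where the lag is missing

fdl = sm.OLS(yt, sm.add_constant(np.column_stack([xt, xlag]))).fit()
lrm_from_sum = fdl.params[1] + fdl.params[2]

# Transformed regression: y_t on x_t and the first difference of x_t
transformed = sm.OLS(yt, sm.add_constant(np.column_stack([xt, xt - xlag]))).fit()
lrm_direct = transformed.params[1]

print(f"sum of FDL coefficients: {lrm_from_sum:.3f}")
print(f"direct estimate:         {lrm_direct:.3f}   (s.e. {transformed.bse[1]:.3f})")

Because the two regressions contain exactly the same information, the two long run multiplier estimates coincide; the advantage of the transformed version is that its standard error is reported directly.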
The FDL(1) model in Figure 96 can be written

    ĝrowth_t = 3.198 − 0.586 Δdebt_t + 0.563 Δdebt_{t-1},    n = 40, R² = 0.333,

with standard errors 0.186, 0.153 and 0.161 respectively. The two slope coefficients are significant at the 5% level (i.e. have p < 0.05 on their t statistics). The impact multiplier is −0.586, so that a 1% increase in the rate of change of the debt/GDP ratio predicts a 0.586% fall in the growth rate. This is both statistically and economically significant and would be consistent with the "austerity" story. However, the lag one multiplier is +0.563, so that one period later the effect of the increase in the rate of change of the debt/GDP ratio is of opposite sign and approximately the same magnitude. The long run multiplier is −0.586 + 0.563 = −0.023%, so that an initial negative effect on growth is almost completely offset the following year by a positive effect on growth, with the net effect of the increase in debt on growth being very small. The transformed regression to estimate the long run multiplier directly involves a regression of growth on an intercept, Δdebt_t and Δ²debt_t (i.e. the second difference of debt), the results of which are shown in Figure 107. The long run multiplier estimate of −0.023% is insignificant, implying very little long run effect of government debt changes on predictions for economic growth, despite the significant effects in the short run (at lags 0 and 1).


Figure 107: Direct estimation of long run multiplier on the FDL model for growth
10.5.2 Interpretation of ARDL models

Interpretation of ARDL models is more complicated because the dynamic effects are formed by a mixture of the lagged dependent and lagged explanatory variables. For the purposes of the interpretations here, we will assume that z_t is strictly exogenous. This makes the derivations simpler and may be a reasonable assumption when z_t is a policy variable such as government debt or an official interest rate.

ARDL(1,0).  Consider first the ARDL(1,0) model

    E(y_t | y_{t-1}, ..., y_1, z_n, ..., z_1) = α₀ + ρ₁ y_{t-1} + δ₀ z_t.    (105)

The conditioning on all of z_1, ..., z_n, not just z_1, ..., z_t, is possible because of the assumption that z_t is strictly exogenous. The impact multiplier of a one unit increase in z_t on the conditional mean of y_t is δ₀. That is a standard interpretation.

Looking for effects at higher lags requires some derivations. First take (105) and lag it by one time period:

    E(y_{t-1} | y_{t-2}, ..., y_1, z_n, ..., z_1) = α₀ + ρ₁ y_{t-2} + δ₀ z_{t-1}.    (106)

Now take the expectation of both sides of (105) conditional on y_{t-2}, ..., y_1, z_n, ..., z_1 and use the LIE on the left hand side and (106) on the right hand side to obtain

    E(y_t | y_{t-2}, ..., y_1, z_n, ..., z_1) = α₀ + ρ₁ E(y_{t-1} | y_{t-2}, ..., y_1, z_n, ..., z_1) + δ₀ z_t
                                              = α₀ + ρ₁ (α₀ + ρ₁ y_{t-2} + δ₀ z_{t-1}) + δ₀ z_t
                                              = α₀ (1 + ρ₁) + ρ₁² y_{t-2} + ρ₁ δ₀ z_{t-1} + δ₀ z_t.

This representation shows that the lag one multiplier for a one unit increase in z_t is ρ₁ δ₀. To find the lag two multiplier, lagging (106) by another time period gives

    E(y_{t-2} | y_{t-3}, ..., y_1, z_n, ..., z_1) = α₀ + ρ₁ y_{t-3} + δ₀ z_{t-2},    (107)

and then taking the expectation conditional on y_{t-3}, ..., y_1, z_n, ..., z_1 and using (107) gives

    E(y_t | y_{t-3}, ..., y_1, z_n, ..., z_1) = α₀ (1 + ρ₁) + ρ₁² E(y_{t-2} | y_{t-3}, ..., y_1, z_n, ..., z_1) + ρ₁ δ₀ z_{t-1} + δ₀ z_t
                                              = α₀ (1 + ρ₁) + ρ₁² (α₀ + ρ₁ y_{t-3} + δ₀ z_{t-2}) + ρ₁ δ₀ z_{t-1} + δ₀ z_t
                                              = α₀ (1 + ρ₁ + ρ₁²) + ρ₁³ y_{t-3} + ρ₁² δ₀ z_{t-2} + ρ₁ δ₀ z_{t-1} + δ₀ z_t.

This shows that the lag two multiplier for a one unit increase in z_t is ρ₁² δ₀.

This process can be repeated as often as desired. The pattern is clearly that the lag j multiplier for a one unit increase in z_t is ρ₁^j δ₀. The long run multiplier is the sum over all of the individual lag j multipliers. It is both simple and conventional to sum over all j = 0, 1, 2, ... without upper limit, giving

    δ₀ + δ₀ ρ₁ + δ₀ ρ₁² + δ₀ ρ₁³ + ... = δ₀ / (1 − ρ₁),

which uses the geometric series Σ_{j=0}^∞ ρ₁^j = 1/(1 − ρ₁) for |ρ₁| < 1, the latter condition being satisfied when y_t is weakly dependent. This is the long run effect on the conditional mean of y_t of a permanent one unit increase in z_t.
ARDL(1,1).  The same approach to interpretation applies to the ARDL(1,1) model

    E(y_t | y_{t-1}, ..., y_1, z_n, ..., z_1) = α₀ + ρ₁ y_{t-1} + δ₀ z_t + δ₁ z_{t-1}.    (108)

Repeated lagging and taking conditional expectations as above leads to

    E(y_t | y_{t-j}, ..., y_1, z_n, ..., z_1) = α₀ (1 + ρ₁ + ... + ρ₁^{j-1}) + ρ₁^j y_{t-j}
        + δ₀ z_t + (ρ₁ δ₀ + δ₁) z_{t-1} + ρ₁ (ρ₁ δ₀ + δ₁) z_{t-2} + ρ₁² (ρ₁ δ₀ + δ₁) z_{t-3} + ...,

from which it can be seen that the lag j multiplier for a one unit increase in z_t is δ₀ for j = 0 and ρ₁^{j-1} (ρ₁ δ₀ + δ₁) for j > 0. The long run multiplier is therefore

    δ₀ + (ρ₁ δ₀ + δ₁) Σ_{j=1}^∞ ρ₁^{j-1} = δ₀ + (ρ₁ δ₀ + δ₁)/(1 − ρ₁) = (δ₀ + δ₁)/(1 − ρ₁).
Returning to the ARDL(1,1) model in Figure 103, the SRF can be written

    ĝrowth_t = 4.070 − 0.269 growth_{t-1} − 0.726 Δdebt_t + 0.617 Δdebt_{t-1},    n = 40, R² = 0.383,

with standard errors 0.470, 0.153, 0.181 and 0.176 respectively, giving

    ρ̂₁ = −0.269,  δ̂₀ = −0.726,  δ̂₁ = 0.617.

The multipliers are therefore

    Impact (lag 0):  δ̂₀ = −0.726
    Lag 1:  ρ̂₁ δ̂₀ + δ̂₁ = (−0.269)(−0.726) + 0.617 = 0.812
    Lag 2:  ρ̂₁ (ρ̂₁ δ̂₀ + δ̂₁) = (−0.269)(0.812) = −0.218
    Lag 3:  ρ̂₁² (ρ̂₁ δ̂₀ + δ̂₁) = (−0.269)(−0.218) = 0.059
    ...

The long run multiplier is

    (δ̂₀ + δ̂₁)/(1 − ρ̂₁) = (−0.726 + 0.617)/(1 − (−0.269)) = −0.086.

Figure 108: Wald test of the long run multiplier
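These multiplier calculations are easy to automate. A small Python sketch using the estimated coefficients above (taken from Figure 103) is:

import numpy as np

rho1, d0, d1 = -0.269, -0.726, 0.617   # ARDL(1,1) estimates from the text

def lag_multiplier(j):
    """Lag j multiplier for a one unit increase in z_t in an ARDL(1,1) model."""
    return d0 if j == 0 else rho1 ** (j - 1) * (rho1 * d0 + d1)

for j in range(4):
    print(f"lag {j} multiplier: {lag_multiplier(j):+.3f}")

long_run = (d0 + d1) / (1 - rho1)
print(f"long run multiplier: {long_run:+.3f}")
# Equivalently, the long run multiplier is the sum of all the lag multipliers:
print(f"sum of lag multipliers (first 50): {sum(lag_multiplier(j) for j in range(50)):+.3f}")

The output reproduces the values above: −0.726, 0.812, −0.218, 0.059 and a long run multiplier of −0.086.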

The evidence from this regression is that changes in government debt predict short run changes in economic growth, greatest in the first 2-3 years, but quickly decreasing to zero thereafter. Also, the effects tend to oscillate in sign so that they cancel out when added, producing a long run effect that is quite small and, according to the Wald test in Figure 108, statistically insignificant.

These results are about the predictions of economic growth on the basis of changes in government debt. All of the difficulties outlined above in making causal interpretations apply here. In particular, there are other variables besides government debt that may influence economic growth. Also there may be simultaneity between government debt and growth: for example, a slowdown in economic growth may increase government expenditures (unemployment benefits) and decrease tax receipts (reduced company and income taxes and GST because of reduced economic activity) and hence increase the government debt. So these results are informative about the dynamic structure of the conditional expectations that relate economic growth and government debt, but must be treated with very great caution for causal inference.

11 Regression in matrix notation

Linear regression is far neater to present using matrix notation.

11.1 Definitions

A matrix is a rectangular arrangement of numbers. For example

    A = ( 1  4
          2  5
          3  6 ).

The dimension of a matrix is denoted r × c, where r is the number of rows and c is the number of columns. The matrix A has dimension 3 × 2.

The individual elements of the matrix A can be denoted a_{i,j}, where i = 1, ..., r indexes the row and j = 1, ..., c indexes the column. So a_{2,1} = 2 and a_{3,2} = 6.

Two matrices are defined to be equal if they have the same dimensions and their individual elements are all equal.

A square matrix has the same number of columns as rows; that is, r = c.

A column vector, or simply a vector, is a matrix consisting of a single column. A row vector consists of a single row. For example, if we define

    B = ( 1 ),    C = ( 1  2  3 ),
        ( 2 )

then B is a (column) vector and C is a row vector. Their dimensions are 2 × 1 and 1 × 3 respectively. A scalar is a 1 × 1 matrix, that is, a single number.

The transpose of a matrix turns its columns into rows (equivalently, its rows into columns). The transpose of A is denoted A′ (though some denote the transpose as Aᵀ). For example,

    A′ = ( 1  2  3 ),    B′ = ( 1  2 ),    C′ = ( 1
         ( 4  5  6 )                              2
                                                  3 ).

Note that (A′)′ = A for any matrix, and that the transpose of a scalar is just the scalar (e.g. 2′ = 2). If A has dimension r × c then A′ has dimension c × r.

A square matrix M is symmetric if M = M′. For example,

    M = ( 1  4  6
          4  2  0
          6  0  3 )

is a symmetric matrix. If M has elements m_{i,j} then it is symmetric if m_{i,j} = m_{j,i} for all i and j.

The main diagonal of a square matrix consists of the elements running from the top left corner to the bottom right corner of the matrix, denoted diag(M). In the example,

    diag(M) = ( 1
                2
                3 ).

That is, diag(M) is the vector of elements m_{i,i} for all i. A symmetric matrix is symmetric about its main diagonal, meaning those elements below the main diagonal are reflected above the main diagonal.

11.2 Addition and Subtraction

Two matrices can be added or subtracted if they have the same dimensions; that is, if they are conformable for addition. Addition and subtraction are element-wise. For example, if

    D = ( 3   8
          2   9
          1  −2 ),

then A and D are conformable for addition and

    A + D = ( 4  12        A − D = ( −2  −4
              4  14                   0  −4
              4   4 ),                2   8 ).

It is not possible to add or subtract non-conformable matrices. For example, A + B and A + C are not defined.

11.3 Multiplication

If x is a scalar then the product Ax means each element of A is multiplied by x. For example, if x = 2 then

    Ax = ( 2   8
           4  10
           6  12 ).

Suppose we have two matrices A and B with respective dimensions r_A × c_A and r_B × c_B. The matrix product AB can be defined if c_A = r_B; that is, if the number of columns of A matches the number of rows of B. In this case, AB is a matrix of dimension r_A × c_B, with individual elements of the form

    (AB)_{i,j} = Σ_{k=1}^{c_A} a_{i,k} b_{k,j}.

For example, with A and B as defined above, we have c_A = r_B = 2, so the product AB is defined, and the result will have dimension r_A × c_B = 3 × 1:

    AB = ( 1·1 + 4·2        (  9
           2·1 + 5·2    =     12
           3·1 + 6·2 )        15 ).

Unlike with scalars, matrix multiplication is not commutative; that is, AB ≠ BA in general. In fact, AB may be defined but BA not defined. The current definitions of A and B illustrate this, since BA is not defined because B has one column and A has three rows. Even if AB and BA are both defined, they may not be the same dimension, and even if they are the same dimension, AB and BA will generally be different. For example, if

    E = ( 2  3 ),

then

    BE = ( 2  3        EB = 8.
           4  6 ),

The transpose of a matrix product satisfies (AB)′ = B′A′.
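These definitions map directly onto numpy arrays. A short Python check of the examples above:

import numpy as np

A = np.array([[1, 4], [2, 5], [3, 6]])
B = np.array([[1], [2]])          # 2x1 column vector
C = np.array([[1, 2, 3]])         # 1x3 row vector
E = np.array([[2, 3]])

print(A.T)                        # transpose A'
print(A @ B)                      # AB = [[9], [12], [15]]
print(B @ E)                      # BE, a 2x2 matrix
print(E @ B)                      # EB = [[8]], a 1x1 matrix (a scalar)
# A @ C would raise an error: the dimensions are not conformable.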

11.4 The PRF

A multiple regression model has the general form

    E(y_i | x_{1,i}, ..., x_{k,i}) = β₀ + β₁ x_{1,i} + ... + β_k x_{k,i}.

The right hand side of this can be compactly written in matrix form. Define the (k + 1) × 1 vectors

    x_i = ( 1           β = ( β₀
            x_{1,i}           β₁
            ...               ...
            x_{k,i} ),        β_k ).

Then

    x_i′β = β₀ + β₁ x_{1,i} + ... + β_k x_{k,i},

and the PRF can be written

    E(y_i | x_i) = x_i′β.

This representation is very useful for theoretical and computational purposes.

11.5 Matrix Inverse

The determinant of a square r × r matrix A is denoted |A|. If A is 2 × 2 with elements a₁₁, a₁₂, a₂₁, a₂₂ then

    |A| = a₁₁ a₂₂ − a₁₂ a₂₁.

Formulae for higher order matrices are more involved.

If |A| = 0 then A is called singular, while if |A| ≠ 0 then A is non-singular. For example, the matrix

    ( 1  2
      2  4 )

has determinant 1·4 − 2·2 = 0 and is singular, while

    (  1  2
      −2  4 )

has determinant 1·4 − 2·(−2) = 8 and is non-singular. A singular matrix satisfies Ac = 0 for some vector c ≠ 0, while a non-singular matrix has Ac ≠ 0 for all c ≠ 0. For example,

    A = ( 1  2 ),    c = (  2 ),    Ac = ( 0 ),
        ( 2  4 )         ( −1 )          ( 0 )

while if

    A = (  1  2
          −2  4 )

there is no vector c ≠ 0 such that Ac = 0.

The identity matrix is a square matrix of dimension r, denoted I_r, such that

    A I_r = I_r A = A

for any r × r matrix A. It has ones on the main diagonal and zeros elsewhere, that is

    I_r = ( 1  0  ...  0
            0  1  ...  0
            ...
            0  0  ...  1 ).

The inverse, if it exists, of an r × r square matrix A is an r × r matrix denoted A⁻¹ that satisfies

    A⁻¹ A = A A⁻¹ = I_r.

The inverse exists only if A is non-singular. If A is 2 × 2 then

    A⁻¹ = (1/|A|) (  a₂₂  −a₁₂
                    −a₂₁   a₁₁ ).

For example,

    (  1  2 )⁻¹  =  (1/8) ( 4  −2 )  =  ( 1/2  −1/4
      −2  4 )                ( 2   1 )       1/4   1/8 ),

but the inverse of

    ( 1  2
      2  4 )

does not exist.

In the linear system of equations

    Ax = b,

if A is non-singular then A⁻¹ exists. Multiplying the system on either side by A⁻¹ gives A⁻¹Ax = A⁻¹b, so the solution is

    x = A⁻¹ b.

11.6 OLS in matrix notation

For a PRF E(y_i | x_i) = x_i′β, the OLS estimator β̂ is the choice of vector b that minimises the sum of squared residuals

    SSR(b) = Σᵢ (y_i − x_i′b)².

This can be shown to be

    β̂ = ( Σᵢ x_i x_i′ )⁻¹ Σᵢ x_i y_i.

The matrix Σᵢ x_i x_i′ is a square (k + 1) × (k + 1) matrix. For it to be non-singular, there must be no vector c ≠ 0 such that

    Σᵢ x_i x_i′ c = 0.

For there to be such a vector c, Σᵢ x_i x_i′ c = 0 would require x_i′c = 0 for all i, which would imply a perfect linear relationship among the elements of the x_i vector, i.e. perfect multicollinearity. So the condition that there is no perfect multicollinearity implies that Σᵢ x_i x_i′ is non-singular and has an inverse, and hence that β̂ can be computed.
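The formula translates directly into code. A minimal Python sketch (simulated data, with coefficient values invented for illustration) computes β̂ by solving the normal equations (X′X)b = X′y, where X′X = Σᵢ x_i x_i′ and X′y = Σᵢ x_i y_i:

import numpy as np

rng = np.random.default_rng(7)
n, k = 500, 2

# Simulated regressors (with an intercept column) and dependent variable
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, 2.0, -0.5])        # invented "true" coefficients
y = X @ beta + rng.normal(size=n)

# OLS: solve (X'X) b = X'y rather than inverting X'X explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                           # close to (1, 2, -0.5)

# Fitted values and residuals
resid = y - X @ beta_hat
print("SSR:", resid @ resid)

Solving the linear system rather than forming the inverse is the standard numerical choice, but the result is the same β̂ given above.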
11.6.1 Proof

The OLS estimator can be derived using vector calculus, or it can be shown to minimise SSR(b) as follows. Write

    SSR(b) = Σᵢ ( y_i − x_i′β̂ − x_i′(b − β̂) )²
           = Σᵢ ( y_i − x_i′β̂ )² − 2 (b − β̂)′ Σᵢ x_i ( y_i − x_i′β̂ ) + (b − β̂)′ ( Σᵢ x_i x_i′ ) (b − β̂).

The first term is SSR(β̂), while the formula for β̂ can be used to show that the second term satisfies

    Σᵢ x_i ( y_i − x_i′β̂ ) = Σᵢ x_i y_i − ( Σᵢ x_i x_i′ ) β̂
                            = Σᵢ x_i y_i − ( Σᵢ x_i x_i′ ) ( Σᵢ x_i x_i′ )⁻¹ Σᵢ x_i y_i
                            = 0,

so

    SSR(b) = SSR(β̂) + (b − β̂)′ ( Σᵢ x_i x_i′ ) (b − β̂).

If b = β̂ then (b − β̂)′ ( Σᵢ x_i x_i′ ) (b − β̂) = 0. If b ≠ β̂ then writing c = b − β̂ gives

    (b − β̂)′ ( Σᵢ x_i x_i′ ) (b − β̂) = Σᵢ (c′x_i)² = Σᵢ z_i² > 0,

since z_i = c′x_i ≠ 0 when there is no perfect multicollinearity. Thus SSR(b) > SSR(β̂) when b ≠ β̂. This shows that SSR(b) is minimised by b = β̂.

11.7 Unbiasedness of OLS

Suppose (y_i, x_i) are i.i.d. for i = 1, ..., n and E(y_i | x_i) = x_i′β. Then the independence part of i.i.d. implies that

    E(y_i | x_i) = E(y_i | x_1, ..., x_n).

Then

    E[β̂] = E[ ( Σᵢ x_i x_i′ )⁻¹ Σᵢ x_i E(y_i | x_1, ..., x_n) ]
          = E[ ( Σᵢ x_i x_i′ )⁻¹ Σᵢ x_i E(y_i | x_i) ]
          = E[ ( Σᵢ x_i x_i′ )⁻¹ Σᵢ x_i x_i′ β ]
          = β,

showing the OLS estimator is unbiased. The proof is clearly far simpler when expressed in matrix notation.

11.8 Time series regressions

Matrix notation provides a convenient way to represent time series regressions. For example, the ARDL model

    E(y_t | z_t, y_{t-1}, z_{t-1}, ..., y_1, z_1) = α₀ + ρ₁ y_{t-1} + ... + ρ_p y_{t-p} + δ₀ z_t + δ₁ z_{t-1} + ... + δ_q z_{t-q}

can be written as

    E(y_t | x_t, ..., x_1) = x_t′β,

where

    x_t = ( 1              β = ( α₀
            y_{t-1}              ρ₁
            ...                  ...
            y_{t-p}              ρ_p
            z_t                  δ₀
            z_{t-1}              δ₁
            ...                  ...
            z_{t-q} ),           δ_q ).

The strict exogeneity condition is

    E(y_t | x_t, ..., x_1) = E(y_t | x_n, ..., x_1),

under which β̂ is exactly unbiased:

    E[β̂] = E[ ( Σₜ x_t x_t′ )⁻¹ Σₜ x_t E(y_t | x_n, ..., x_1) ]
          = E[ ( Σₜ x_t x_t′ )⁻¹ Σₜ x_t E(y_t | x_t, ..., x_1) ]
          = E[ ( Σₜ x_t x_t′ )⁻¹ Σₜ x_t x_t′ β ]
          = β.

Without the strict exogeneity condition, β̂ need only be asymptotically unbiased.
