Lecture Notes in Financial Econometrics (MBF, MSc course at UNISG)

Paul Söderlind¹

13 May 2010

¹ University of St. Gallen. Address: s/bf-HSG, Rosenbergstrasse 52, CH-9000 St. Gallen, Switzerland. E-mail: Paul.Soderlind@unisg.ch. Document name: FinEcmtAll.TeX

Contents

1 Review of Statistics
  1.1 Random Variables and Distributions
  1.2 Moments
  1.3 Distributions Commonly Used in Tests
  1.4 Normal Distribution of the Sample Mean as an Approximation
  A Statistical Tables

2 Least Squares Estimation
  2.1 Least Squares
  2.2 Hypothesis Testing
  2.3 Heteroskedasticity
  2.4 Autocorrelation
  A A Primer in Matrix Algebra
  A Statistical Tables

3 Maximum Likelihood Estimation
  3.1 Maximum Likelihood
  3.2 Key Properties of MLE
  3.3 Three Test Principles
  3.4 QMLE

4 Index Models
  4.1 The Inputs to a MV Analysis
  4.2 Single-Index Models
  4.3 Estimating Beta
  4.4 Multi-Index Models
  4.5 Principal Component Analysis
  4.6 Estimating Expected Returns
  4.7 Estimation on Subsamples
  4.8 Robust Estimation

5 Testing CAPM and Multifactor Models
  5.1 Market Model
  5.2 Several Factors
  5.3 Fama-MacBeth

6 Time Series Analysis
  6.1 Descriptive Statistics
  6.2 White Noise
  6.3 Autoregression (AR)
  6.4 Moving Average (MA)
  6.5 ARMA(p,q)
  6.6 VAR(p)
  6.7 Non-stationary Processes

7 Predicting Asset Returns
  7.1 Asset Prices, Random Walks, and the Efficient Market Hypothesis
  7.2 Autocorrelations
  7.3 Other Predictors and Methods
  7.4 Security Analysts
  7.5 Technical Analysis
  7.6 Spurious Regressions and In-Sample Overfit
  7.7 Empirical U.S. Evidence on Stock Return Predictability

8 ARCH and GARCH
  8.1 Heteroskedasticity
  8.2 ARCH Models
  8.3 GARCH Models
  8.4 Non-Linear Extensions
  8.5 (G)ARCH-M
  8.6 Multivariate (G)ARCH

9 Risk Measures
  9.1 Symmetric Dispersion Measures
  9.2 Downside Risk
  9.3 Empirical Return Distributions
  9.4 Threshold Exceedance

10 Risk Measures II
  10.1 Fitting a Mixture Normal Distribution to Data
  10.2 Recap of Univariate Distributions
  10.3 Beyond (Linear) Correlations
  10.4 Copulas
  10.5 Joint Tail Distribution

11 Option Pricing and Estimation of Continuous Time Processes
  11.1 The Black-Scholes Model
  11.2 Estimation of the Volatility of a Random Walk Process

12 Event Studies
  12.1 Basic Structure of Event Studies
  12.2 Models of Normal Returns
  12.3 Testing the Abnormal Return
  12.4 Quantitative Events

13 Kernel Density Estimation and Regression
  13.1 Non-Parametric Regression
  13.2 Examples of Non-Parametric Estimation
1 Review of Statistics

More advanced material is denoted by a star (*). It is not required reading.

1.1 Random Variables and Distributions

1.1.1 Distributions

A univariate distribution of a random variable x describes the probability of different values. If f(x) is the probability density function, then the probability that x is between A and B is calculated as the area under the density function from A to B

  Pr(A ≤ x < B) = ∫_A^B f(x) dx.  (1.1)

See Figure 1.1 for illustrations of normal (gaussian) distributions.

Remark 1.1 If x ∼ N(μ, σ²), then the probability density function is

  f(x) = (1/√(2πσ²)) exp(−(1/2)((x − μ)/σ)²).

This is a bell-shaped curve centered on the mean μ and where the standard deviation σ determines the "width" of the curve.

Figure 1.1: A few different normal distributions. [Panels: the N(0,2) distribution, with Pr(−2 < x ≤ −1) = 16% and Pr(0 < x ≤ 1) = 26%; the N(0,2) and N(1,2) distributions; the N(0,2) and N(0,1) distributions.]

A bivariate distribution of the random variables x and y contains the same information as the two respective univariate distributions, but also information on how x and y are related. Let h(x,y) be the joint density function, then the probability that x is between A and B and y is between C and D is calculated as the volume under the surface of the density function

  Pr(A ≤ x < B and C ≤ y < D) = ∫_A^B ∫_C^D h(x,y) dy dx.  (1.2)

A joint normal distribution is completely described by the means and the covariance matrix

  [x; y] ∼ N( [μ_x; μ_y], [σ_x², σ_xy; σ_xy, σ_y²] ),  (1.3)

where μ_x and μ_y denote the means of x and y, σ_x² and σ_y² denote the variances of x and y, and σ_xy denotes their covariance. Some alternative notations are used: E x for the mean, Std(x) for the standard deviation, Var(x) for the variance and Cov(x,y) for the covariance.

Clearly, if the covariance σ_xy is zero, then the variables are (linearly) unrelated to each other. Otherwise, information about x can help us to make a better guess of y. See Figure 1.2 for an example. The correlation of x and y is defined as

  ρ_xy = σ_xy / (σ_x σ_y).  (1.4)
If two random variables happen to be independent of each other, then the joint density function is just the product of the two univariate densities (here denoted f(x) and k(y))

  h(x,y) = f(x) k(y) if x and y are independent.  (1.5)

This is useful in many cases, for instance, when we construct likelihood functions for maximum likelihood estimation.

Figure 1.2: Density functions of univariate and bivariate normal distributions. [Panels: the N(0,1) distribution; a bivariate normal distribution with corr = 0.1; a bivariate normal distribution with corr = 0.8.]

1.1.2 Conditional Distributions

If h(x,y) is the joint density function and f(x) the (marginal) density function of x, then the conditional density function is

  g(y|x) = h(x,y)/f(x).  (1.6)

For the bivariate normal distribution (1.3) we have the distribution of y conditional on a given value of x as

  y|x ∼ N( μ_y + (σ_xy/σ_x²)(x − μ_x), σ_y² − σ_xy²/σ_x² ).  (1.7)

Notice that the conditional mean can be interpreted as the best guess of y given that we know x. Similarly, the conditional variance can be interpreted as the variance of the forecast error (using the conditional mean as the forecast). The conditional and marginal distribution coincide if y is uncorrelated with x. (This follows directly from combining (1.5) and (1.6).) Otherwise, the mean of the conditional distribution depends on x, and the variance is smaller than in the marginal distribution (we have more information). See Figure 1.3 for an illustration.

Figure 1.3: Density functions of normal distributions. [Panels: bivariate normal distributions with corr = 0.1 and corr = 0.8, and the corresponding conditional distributions of y for x = −0.8 and x = 0.]
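As a numerical check of (1.7), the following sketch (all parameter values are made up) simulates a large number of draws from a bivariate normal distribution and compares the mean and variance of y, among the draws where x is close to a chosen value, with the formula.

import numpy as np

# made-up parameters of a bivariate normal distribution
mu_x, mu_y = 0.0, 0.0
var_x, var_y, cov_xy = 2.0, 1.0, 0.5

rng = np.random.default_rng(seed=42)
draws = rng.multivariate_normal([mu_x, mu_y], [[var_x, cov_xy], [cov_xy, var_y]], size=1_000_000)
x, y = draws[:, 0], draws[:, 1]

x0 = 1.0                                        # condition on x being close to x0
pick = np.abs(x - x0) < 0.05
cond_mean = mu_y + cov_xy / var_x * (x0 - mu_x)  # conditional mean in (1.7)
cond_var = var_y - cov_xy**2 / var_x             # conditional variance in (1.7)
print(y[pick].mean(), cond_mean)
print(y[pick].var(), cond_var)

Both printed pairs should be close; the small gap comes from conditioning on a narrow band around x0 rather than on an exact value.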
1.1.3 Illustrating a Distribution

If we know the type of distribution (uniform, normal, etc) a variable has, then the best way of illustrating the distribution is to estimate its parameters (mean, variance and whatever more—see below) and then draw the density function.

In case we are not sure about which distribution to use, the first step is typically to draw a histogram: it shows the relative frequencies for different bins (intervals). For instance, it could show the relative frequencies of a variable x_t being in each of the following intervals: −0.5 to 0, 0 to 0.5 and 0.5 to 1.0. Clearly, the relative frequencies should sum to unity (or 100%), but they are sometimes normalized so the area under the histogram has an area of unity (as a distribution has).

See Figure 1.4 for an illustration.

Figure 1.4: Histogram of returns, the curve is a normal distribution with the same mean and standard deviation as the return series. [Panels: monthly excess returns (%) on small growth stocks (mean 0.33, std 8.40, skew 0.5, kurt 7.1, BJ 468.8) and on large value stocks (mean 0.64, std 4.76, skew −0.2, kurt 4.1, BJ 34.2); monthly data on two U.S. indices, 1957:1−2008:12, sample size 624.]

1.2 Moments

1.2.1 Mean and Standard Deviation

The mean and variance of a series are estimated as

  x̄ = Σ_{t=1}^T x_t/T and σ̂² = Σ_{t=1}^T (x_t − x̄)²/T.  (1.8)

The standard deviation (here denoted Std(x_t)), the square root of the variance, is the most common measure of volatility. (Sometimes we use T − 1 in the denominator of the sample variance instead of T.) See Figure 1.4 for an illustration.

A sample mean is normally distributed if x_t is normally distributed, x_t ∼ N(μ, σ²). The basic reason is that a linear combination of normally distributed variables is also normally distributed. However, a sample average is typically approximately normally distributed even if the variable is not (discussed below). If x_t is iid (independently and identically distributed), then the variance of a sample mean is

  Var(x̄) = σ²/T, if x_t is iid.  (1.9)

A sample average is (typically) unbiased, that is, the expected value of the sample average equals the population mean, that is,

  E x̄ = E x_t = μ.  (1.10)

Since sample averages are typically normally distributed in large samples (according to the central limit theorem), we thus have

  x̄ ∼ N(μ, σ²/T),  (1.11)

so we can construct a t-stat as

  t = (x̄ − μ)/(σ/√T),  (1.12)

which has an N(0,1) distribution.

Proof. (of (1.9)–(1.10)) To prove (1.9), notice that

  Var(x̄) = Var(Σ_{t=1}^T x_t/T)
          = Σ_{t=1}^T Var(x_t/T)
          = T Var(x_t)/T²
          = σ²/T.

The first equality is just a definition and the second equality follows from the assumption that x_t and x_s are independently distributed. This means, for instance, that Var(x_2 + x_3) = Var(x_2) + Var(x_3) since the covariance is zero. The third equality follows from
the assumption that x_t and x_s are identically distributed (so their variances are the same). The fourth equality is a trivial simplification.

To prove (1.10)

  E x̄ = E Σ_{t=1}^T x_t/T
       = Σ_{t=1}^T E x_t/T
       = E x_t.

The first equality is just a definition and the second equality is always true (the expectation of a sum is the sum of expectations), and the third equality follows from the assumption of identical distributions which implies identical expectations.

1.2.2 Skewness and Kurtosis

The skewness, kurtosis and Bera-Jarque test for normality are useful diagnostic tools. They are

  Test statistic                                            Distribution
  skewness    = (1/T) Σ_{t=1}^T ((x_t − μ)/σ)³              N(0, 6/T)
  kurtosis    = (1/T) Σ_{t=1}^T ((x_t − μ)/σ)⁴              N(3, 24/T)
  Bera-Jarque = (T/6) skewness² + (T/24)(kurtosis − 3)²     χ²₂.  (1.13)

This is implemented by using the estimated mean and standard deviation. The distributions stated on the right hand side of (1.13) are under the null hypothesis that x_t is iid N(μ, σ²). The "excess kurtosis" is defined as the kurtosis minus 3. The test statistic for the normality test (Bera-Jarque) can be compared with 4.6 or 6.0, which are the 10% and 5% critical values of a χ²₂ distribution.

Clearly, we can test the skewness and kurtosis by traditional t-stats as in

  t = skewness/√(6/T) and t = (kurtosis − 3)/√(24/T),  (1.14)

which both have N(0,1) distribution under the null hypothesis of a normal distribution.

See Figure 1.4 for an illustration.

1.2.3 Covariance and Correlation

The covariance of two variables (here x and y) is typically estimated as

  σ̂_xy = Σ_{t=1}^T (x_t − x̄)(y_t − ȳ)/T.  (1.15)

(Sometimes we use T − 1 in the denominator of the sample covariance instead of T.)

Figure 1.5: Example of correlations. [Panels: scatter plots of y against x with correlations of 0.9, 0, −0.9 and 0.]

The correlation of two variables is then estimated as

  ρ̂_xy = σ̂_xy/(σ̂_x σ̂_y),  (1.16)

where σ̂_x and σ̂_y are the estimated standard deviations. A correlation must be between −1 and 1. Note that covariance and correlation measure the degree of linear relation only. This is illustrated in Figure 1.5.
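The estimators in (1.8) and (1.13)–(1.16) are straightforward to compute. A minimal sketch in Python/NumPy, using a simulated series as a stand-in for return data (the distributional choices are mine, just for illustration):

import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.standard_t(df=5, size=500)        # stand-in for a return series (fat tails)
y = 0.5 * x + rng.normal(size=500)        # a second, related series
T = x.size

mean = x.sum() / T                                   # (1.8)
var = ((x - mean)**2).sum() / T
std = np.sqrt(var)

skewness = ((x - mean)**3 / std**3).mean()           # (1.13), using estimated mean and std
kurtosis = ((x - mean)**4 / std**4).mean()
bera_jarque = T/6 * skewness**2 + T/24 * (kurtosis - 3)**2

t_skew = skewness / np.sqrt(6/T)                     # (1.14)
t_kurt = (kurtosis - 3) / np.sqrt(24/T)

cov_xy = ((x - x.mean()) * (y - y.mean())).mean()    # (1.15)
corr_xy = cov_xy / (x.std() * y.std())               # (1.16)
print(mean, std, skewness, kurtosis, bera_jarque, t_skew, t_kurt, corr_xy)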
See Figure 1.6 for an empirical illustration.

Under the null hypothesis of no correlation—and if the data is approximately normally distributed, then

  ρ̂/√(1 − ρ̂²) ∼ N(0, 1/T),  (1.17)

so we can form a t-stat as

  t = √T ρ̂/√(1 − ρ̂²),  (1.18)

which has an N(0,1) distribution (in large samples).

Figure 1.6: Scatter plot of two different portfolio returns. [Monthly returns (%) on large value stocks plotted against small growth stocks, monthly data on two U.S. indices 1957:1−2008:12; correlation: 0.55.]

1.3 Distributions Commonly Used in Tests

1.3.1 Standard Normal Distribution, N(0,1)

Suppose the random variable x has a N(μ, σ²) distribution. Then, the test statistic has a standard normal distribution

  z = (x − μ)/σ ∼ N(0,1).  (1.19)

To see this, notice that x − μ has a mean of zero and that x/σ has a standard deviation of unity.

1.3.2 t-distribution

If we instead need to estimate σ to use in (1.19), then the test statistic has a t_n-distribution

  t = (x − μ)/σ̂ ∼ t_n,  (1.20)

where n denotes the "degrees of freedom," that is the number of observations minus the number of estimated parameters. For instance, if we have a sample with T data points and only estimate the mean, then n = T − 1.

The t-distribution has more probability mass in the tails: it gives a more "conservative" test (harder to reject the null hypothesis), but the difference vanishes as the degrees of freedom (sample size) increases. See Figure 1.7 for a comparison and Table A.1 for critical values.

Example 1.2 (t-distribution) If t = 2.0 and n = 50, then this is larger than the 10% critical value (but not the 5% critical value) for a 2-sided test in Table A.1.

1.3.3 Chi-square Distribution

If z ∼ N(0,1), then z² ∼ χ²₁, that is, z² has a chi-square distribution with one degree of freedom. This can be generalized in several ways. For instance, if x ∼ N(μ_x, σ_xx) and y ∼ N(μ_y, σ_yy) and they are uncorrelated, then [(x − μ_x)/σ_x]² + [(y − μ_y)/σ_y]² ∼ χ²₂.

More generally, we have

  v′Σ⁻¹v ∼ χ²_n, if the n×1 vector v ∼ N(0, Σ).  (1.21)
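A small numerical illustration of (1.19) and (1.21), with made-up numbers for x, v and Σ; the resulting statistics can be compared with the critical values in Tables A.1–A.2.

import numpy as np

# z-score as in (1.19): x drawn from N(mu, sigma^2)
mu, sigma = 2.0, 3.0
x = 3.5
z = (x - mu) / sigma                        # compare with 1.64/1.96 ("Normal" row of Table A.1)

# quadratic form as in (1.21): v is N(0, Sigma) under the null hypothesis
v = np.array([1.0, -2.0])                   # made-up 2x1 vector
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])              # made-up covariance matrix
stat = v @ np.linalg.inv(Sigma) @ v         # compare with chi2(2) critical values
print(z, stat)                              # chi2(2): 4.61 (10%), 5.99 (5%), see Table A.2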
Figure 1.7: Probability density functions. [Panels: N(0,1) and t(10) distributions (10% critical values 1.64 and 1.81); N(0,1) and t(50) distributions (1.64 and 1.68); chi-square(n) distributions for n = 2 and n = 5 (10% critical values 4.61 and 9.24); F(n,50) distributions for n = 2 and n = 5 (10% critical values 2.41 and 1.97).]

See Figure 1.7 for an illustration and Table A.2 for critical values.

Example 1.3 (χ²₂ distribution) Suppose x is a 2×1 vector

  [x₁; x₂] ∼ N( [4; 2], [5, 3; 3, 4] ).

If x₁ = 3 and x₂ = 5, then

  [3 − 4; 5 − 2]′ [5, 3; 3, 4]⁻¹ [3 − 4; 5 − 2] ≈ 6.1

has a χ²₂ distribution. Notice that 6.1 is higher than the 5% critical value (but not the 1% critical value) in Table A.2.

1.3.4 F-distribution

If we instead need to estimate Σ in (1.21), then

  v′Σ̂⁻¹v/J ∼ F_{n₁,n₂},  (1.22)

where F_{n₁,n₂} denotes an F-distribution with (n₁ = J, n₂) degrees of freedom. Similar to the t-distribution, n₂ is the number of observations minus the number of estimated parameters. See Figure 1.7 for an illustration and Tables A.3–A.4 for critical values.

1.4 Normal Distribution of the Sample Mean as an Approximation

In many cases, it is unreasonable to just assume that the variable is normally distributed. The nice thing with a sample mean (or sample average) is that it will still be normally distributed—at least approximately (in a reasonably large sample). This section gives a short summary of what happens to sample means as the sample size increases (often called "asymptotic theory").

Figure 1.8: Sampling distributions. [Panels: the distribution of the sample average and of √T × the sample average for T = 5, 25 and 100; the sample average is of z_t − 1, where z_t has a χ²(1) distribution.]

The law of large numbers (LLN) says that the sample mean converges to the true population mean as the sample size goes to infinity. This holds for a very large class of random variables, but there are exceptions. A sufficient (but not necessary) condition for this convergence is that the sample average is unbiased (as in (1.10)) and that the
variance goes to zero as the sample size goes to infinity (as in (1.9)). (This is also called convergence in mean square.) To see the LLN in action, see Figure 1.8.

The central limit theorem (CLT) says that √T x̄ converges in distribution to a normal distribution as the sample size increases. See Figure 1.8 for an illustration. This also holds for a large class of random variables—and it is a very useful result since it allows us to test hypotheses. Most estimators (including least squares and other methods) are effectively some kind of sample average, so the CLT can be applied.

A Statistical Tables

  n        Critical values
           10%    5%     1%
  10       1.81   2.23   3.17
  20       1.72   2.09   2.85
  30       1.70   2.04   2.75
  40       1.68   2.02   2.70
  50       1.68   2.01   2.68
  60       1.67   2.00   2.66
  70       1.67   1.99   2.65
  80       1.66   1.99   2.64
  90       1.66   1.99   2.63
  100      1.66   1.98   2.63
  Normal   1.64   1.96   2.58

Table A.1: Critical values (two-sided test) of t distribution (different degrees of freedom) and normal distribution.
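The LLN and CLT of Section 1.4 are easy to see in a simulation in the spirit of Figure 1.8. The sketch below keeps the figure's sample sizes and its χ²(1) variable; the number of simulations is an assumption for the illustration.

import numpy as np

rng = np.random.default_rng(seed=0)
n_sim = 10_000

for T in (5, 25, 100):
    z = rng.chisquare(df=1, size=(n_sim, T))
    xbar = (z - 1).mean(axis=1)          # sample averages of x_t = z_t - 1 (zero mean)
    print(T, xbar.std(), (np.sqrt(T) * xbar).std())

The standard deviation of the sample average shrinks as T grows (LLN), while the standard deviation of √T times the sample average settles near √2, the standard deviation of x_t itself (CLT).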
  n    Critical values
       10%     5%      1%
  1    2.71    3.84    6.63
  2    4.61    5.99    9.21
  3    6.25    7.81    11.34
  4    7.78    9.49    13.28
  5    9.24    11.07   15.09
  6    10.64   12.59   16.81
  7    12.02   14.07   18.48
  8    13.36   15.51   20.09
  9    14.68   16.92   21.67
  10   15.99   18.31   23.21

Table A.2: Critical values of chi-square distribution (different degrees of freedom, n).

  n1 \ n2   10     30     50     100    300    χ²(n1)/n1
  1         4.96   4.17   4.03   3.94   3.87   3.84
  2         4.10   3.32   3.18   3.09   3.03   3.00
  3         3.71   2.92   2.79   2.70   2.63   2.60
  4         3.48   2.69   2.56   2.46   2.40   2.37
  5         3.33   2.53   2.40   2.31   2.24   2.21
  6         3.22   2.42   2.29   2.19   2.13   2.10
  7         3.14   2.33   2.20   2.10   2.04   2.01
  8         3.07   2.27   2.13   2.03   1.97   1.94
  9         3.02   2.21   2.07   1.97   1.91   1.88
  10        2.98   2.16   2.03   1.93   1.86   1.83

Table A.3: 5% Critical values of Fn1,n2 distribution (different degrees of freedom).

  n1 \ n2   10     30     50     100    300    χ²(n1)/n1
  1         3.29   2.88   2.81   2.76   2.72   2.71
  2         2.92   2.49   2.41   2.36   2.32   2.30
  3         2.73   2.28   2.20   2.14   2.10   2.08
  4         2.61   2.14   2.06   2.00   1.96   1.94
  5         2.52   2.05   1.97   1.91   1.87   1.85
  6         2.46   1.98   1.90   1.83   1.79   1.77
  7         2.41   1.93   1.84   1.78   1.74   1.72
  8         2.38   1.88   1.80   1.73   1.69   1.67
  9         2.35   1.85   1.76   1.69   1.65   1.63
  10        2.32   1.82   1.73   1.66   1.62   1.60

Table A.4: 10% Critical values of Fn1,n2 distribution (different degrees of freedom).
2 Least Squares Estimation

Reference: Verbeek (2008) 2 and 4

More advanced material is denoted by a star (*). It is not required reading.

2.1 Least Squares

2.1.1 Simple Regression: Constant and One Regressor

The simplest regression model is

  y_t = β₀ + β₁x_t + u_t, where E u_t = 0 and Cov(x_t, u_t) = 0,  (2.1)

where we can observe (have data on) the dependent variable y_t and the regressor x_t but not the residual u_t. In principle, the residual should account for all the movements in y_t that we cannot explain (by x_t).

Note the two very important assumptions: (i) the mean of the residual is zero; and (ii) the residual is not correlated with the regressor, x_t. If the regressor summarizes all the useful information we have in order to describe y_t, then the assumptions imply that we have no way of making a more intelligent guess of u_t (even after having observed x_t) than that it will be zero.

Figure 2.1: Example of OLS. [Data points and the regression line for y = b₀ + b₁x + ε, with intercept (b₀) 2.0 and slope (b₁) 1.3.]

Suppose you do not know β₀ or β₁, and that you have a sample of data: y_t and x_t for t = 1,...,T. The LS estimator of β₀ and β₁ minimizes the loss function

  Σ_{t=1}^T (y_t − b₀ − b₁x_t)² = (y₁ − b₀ − b₁x₁)² + (y₂ − b₀ − b₁x₂)² + ...  (2.2)

by choosing b₀ and b₁ to make the loss function value as small as possible. The objective is thus to pick values of b₀ and b₁ in order to make the model fit the data as closely as possible—where close is taken to be a small variance of the unexplained part (the residual). See Figure 2.1 for an illustration.

Remark 2.1 (First order condition for minimizing a differentiable function). We want to find the value of b in the interval b_low ≤ b ≤ b_high, which makes the value of the differentiable function f(b) as small as possible. The answer is b_low, b_high, or the value of b where df(b)/db = 0. See Figure 2.2.

The first order conditions for a minimum are that the derivatives of this loss function with respect to b₀ and b₁ should be zero. Notice that

  ∂/∂b₀ (y_t − b₀ − b₁x_t)² = −2(y_t − b₀ − b₁x_t)·1  (2.3)
  ∂/∂b₁ (y_t − b₀ − b₁x_t)² = −2(y_t − b₀ − b₁x_t)·x_t.  (2.4)
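To see numerically that the minimum of (2.2) sits where the derivatives (2.3)–(2.4), summed over t, are zero, the following sketch compares a brute-force grid search of the loss function with the solution of the corresponding normal equations (derived next). The data are made up.

import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.normal(size=100)
y = 2.0 + 1.3 * x + rng.normal(size=100)       # made-up data with beta0 = 2, beta1 = 1.3

# grid search over (b0, b1) of the loss function (2.2)
b0_grid = np.linspace(0, 4, 201)
b1_grid = np.linspace(0, 3, 201)
loss = [((y - b0 - b1 * x)**2).sum() for b0 in b0_grid for b1 in b1_grid]
i = int(np.argmin(loss))
b0_min, b1_min = b0_grid[i // 201], b1_grid[i % 201]

# values where the summed first order conditions are zero (the normal equations)
X = np.column_stack([np.ones_like(x), x])
b_foc = np.linalg.solve(X.T @ X, X.T @ y)
print(b0_min, b1_min, b_foc)                   # the two approaches should agree closely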
Figure 2.2: Quadratic loss function. Subfigure a: 1 coefficient; Subfigure b: 2 coefficients. [Panels: 2b² plotted against b; 2b² + (c − 4)² plotted against b and c.]

Figure 2.3: Example of OLS estimation. [Panels: data (y: −1.5, −0.6, 2.1 at x: −1.0, 0.0, 1.0), the line 2x and the OLS fit with b₁ = 1.8, R² = 0.92, Std(b₁) = 0.30, together with the sum of squared errors as a function of b₁; the same for data y: −1.3, −1.0, 2.3, giving b₁ = 1.8, R² = 0.81, Std(b₁) = 0.50. The regression is y = b₀ + b₁x + u with b₀ = 0.]

Let (β̂₀, β̂₁) be the values of (b₀, b₁) where the first order conditions hold

  ∂/∂β₀ Σ_{t=1}^T (y_t − β̂₀ − β̂₁x_t)² = −2 Σ_{t=1}^T (y_t − β̂₀ − β̂₁x_t)·1 = 0  (2.5)
  ∂/∂β₁ Σ_{t=1}^T (y_t − β̂₀ − β̂₁x_t)² = −2 Σ_{t=1}^T (y_t − β̂₀ − β̂₁x_t)·x_t = 0,  (2.6)

which are two equations in two unknowns (β̂₀ and β̂₁), which must be solved simultaneously. These equations show that both the constant and x_t should be orthogonal to the fitted residuals, û_t = y_t − β̂₀ − β̂₁x_t. This is indeed a defining feature of LS and can be seen as the sample analogues of the assumptions in (2.1) that E u_t = 0 and Cov(x_t, u_t) = 0. To see this, note that (2.5) says that the sample average of û_t should be zero. Similarly, (2.6) says that the sample cross moment of û_t and x_t should also be zero, which implies that the sample covariance is zero as well since û_t has a zero sample mean.

Remark 2.2 Note that β_i is the true (unobservable) value which we estimate to be β̂_i. Whereas β_i is an unknown (deterministic) number, β̂_i is a random variable since it is calculated as a function of the random sample of y_t and x_t.

Remark 2.3 Least squares is only one of many possible ways to estimate regression coefficients. We will discuss other methods later on.

Remark 2.4 (Cross moments and covariance). A covariance is defined as

  Cov(x,y) = E[(x − E x)(y − E y)]
           = E(xy − x E y − y E x + E x E y)
           = E xy − E x E y − E y E x + E x E y
           = E xy − E x E y.

When x = y, then we get Var(x) = E x² − (E x)². These results hold for sample moments too.

When the means of y and x are zero, then we can disregard the constant. In this case,
Figure 2.4: Scatter plot against market return. [Panels: monthly excess returns (%) on HiTec and on Utils plotted against the excess return on the market, US data 1970:1−2008:12; the fitted lines have (α, β) = (−0.17, 1.31) for HiTec and (0.23, 0.53) for Utils.]

                 HiTec      Utils
  constant       -0.19      0.25
                 (-1.14)    (1.57)
  market return  1.32       0.52
                 (30.46)    (11.35)
  R2             0.74       0.32
  obs            463.00     463.00
  Autocorr (t)   -0.88      0.88
  White          9.14       20.38
  All slopes     342.15     149.79

Table 2.1: CAPM regressions, monthly returns, %, US data 1970:1-2008:7. Numbers in parentheses are t-stats. Autocorr is a N(0,1) test statistic (autocorrelation); White is a chi-square test statistic (heteroskedasticity), df = K(K+1)/2 - 1; All slopes is a chi-square test statistic (of all slope coeffs), df = K-1.

(2.6) with β̂₀ = 0 immediately gives

  Σ_{t=1}^T y_t x_t = β̂₁ Σ_{t=1}^T x_t x_t, or
  β̂₁ = (Σ_{t=1}^T y_t x_t/T) / (Σ_{t=1}^T x_t x_t/T).  (2.7)

In this case, the coefficient estimator is the sample covariance (recall: means are zero) of y_t and x_t, divided by the sample variance of the regressor x_t (this statement is actually true even if the means are not zero and a constant is included on the right hand side—just more tedious to show it).

See Table 2.1 and Figure 2.4 for illustrations.

2.1.2 Least Squares: Goodness of Fit

The quality of a regression model is often measured in terms of its ability to explain the movements of the dependent variable.

Let ŷ_t be the fitted (predicted) value of y_t. For instance, with (2.1) it would be ŷ_t = β̂₀ + β̂₁x_t. If a constant is included in the regression (or the means of y and x are zero), then a check of the goodness of fit of the model is given by

  R² = Corr(y_t, ŷ_t)².  (2.8)

This is the squared correlation of the actual and predicted value of y_t.

To understand this result, suppose that x_t has no explanatory power, so R² should be zero. How does that happen? Well, if x_t is uncorrelated with y_t, then the numerator in (2.7) is zero so β̂₁ = 0. As a consequence ŷ_t = β̂₀, which is a constant—and a constant is always uncorrelated with anything else (as correlations measure comovements around the means).

To get a bit more intuition for what R² represents, suppose the estimated coefficients equal the true coefficients, so ŷ_t = β₀ + β₁x_t. In this case,

  R² = Corr(β₀ + β₁x_t + u_t, β₀ + β₁x_t)²,

that is, the squared correlation of y_t with the systematic part of y_t. Clearly, if the model is perfect so u_t = 0, then R² = 1. In contrast, when there are no movements in the systematic part (β₁ = 0), then R² = 0.

See Figure 2.5 for an example.

2.1.3 Least Squares: Outliers

Since the loss function in (2.2) is quadratic, a few outliers can easily have a very large influence on the estimated coefficients. For instance, suppose the true model is
y_t = 0.75x_t + u_t, and that the residual is very large for some time period s. If the regression coefficient happened to be 0.75 (the true value, actually), the loss function value would be large due to the u_t² term. The loss function value will probably be lower if the coefficient is changed to pick up the y_s observation—even if this means that the errors for the other observations become larger (the sum of the square of many small errors can very well be less than the square of a single large error).

There is of course nothing sacred about the quadratic loss function. Instead of (2.2) one could, for instance, use a loss function in terms of the absolute value of the error, Σ_{t=1}^T |y_t − β₀ − β₁x_t|. This would produce the Least Absolute Deviation (LAD) estimator. It is typically less sensitive to outliers. This is illustrated in Figure 2.7. However, LS is by far the most popular choice. There are two main reasons: LS is very easy to compute and it is fairly straightforward to construct standard errors and confidence intervals for the estimator. (From an econometric point of view you may want to add that LS coincides with maximum likelihood when the errors are normally distributed.)

                 HiTec      Utils
  constant       0.17       -0.08
                 (1.21)     (-0.58)
  market return  1.10       0.72
                 (25.15)    (16.39)
  SMB            0.21       -0.16
                 (3.64)     (-2.95)
  HML            -0.62      0.57
                 (-8.62)    (9.04)
  R2             0.82       0.49
  obs            463.00     463.00
  Autocorr (t)   0.49       1.15
  White          70.75      32.88
  All slopes     377.42     228.97

Table 2.2: Fama-French regressions, monthly returns, %, US data 1970:1-2008:7. Numbers in parentheses are t-stats. Autocorr is a N(0,1) test statistic (autocorrelation); White is a chi-square test statistic (heteroskedasticity), df = K(K+1)/2 - 1; All slopes is a chi-square test statistic (of all slope coeffs), df = K-1.

Figure 2.5: Predicting US stock returns (various investment horizons) with the dividend-price ratio. [Panels: the slope (with a 90% confidence band) and the R² from regressions of returns on the lagged return and on E/P, for return horizons up to 60 months; US stock returns 1926:1−2008:12.]

Figure 2.6: Data and regression line from OLS. [OLS sensitivity to an outlier: data y: −1.125, −0.750, 1.750, 1.125 at x: −1.500, −1.000, 1.000, 1.500; three data points are on the line y = 0.75x, the fourth has a big error. OLS gives (0.25, 0.90), while the true line is (0.00, 0.75).]

Figure 2.7: Data and regression line from OLS and LAD. [OLS vs LAD for y = 0.75x + u, same data as in Figure 2.6; OLS gives (0.25, 0.90), while LAD gives (0.00, 0.75).]

2.1.4 The Distribution of β̂

Note that the estimated coefficients are random variables since they depend on which particular sample that has been "drawn." This means that we cannot be sure that the estimated coefficients are equal to the true coefficients (β₀ and β₁ in (2.1)). We can calculate an estimate of this uncertainty in the form of variances and covariances of β̂₀ and β̂₁. These can be used for testing hypotheses about the coefficients, for instance, that β₁ = 0.

To see where the uncertainty comes from consider the simple case in (2.7). Use (2.1)
to substitute for y_t (recall β₀ = 0)

  β̂₁ = (Σ_{t=1}^T x_t(β₁x_t + u_t)/T) / (Σ_{t=1}^T x_t x_t/T)
      = β₁ + (Σ_{t=1}^T x_t u_t/T) / (Σ_{t=1}^T x_t x_t/T),  (2.9)

so the OLS estimate, β̂₁, equals the true value, β₁, plus the sample covariance of x_t and u_t divided by the sample variance of x_t. One of the basic assumptions in (2.1) is that the covariance of the regressor and the residual is zero. This should hold in a very large sample (or else OLS cannot be used to estimate β₁), but in a small sample it may be different from zero. Since u_t is a random variable, β̂₁ is too. Only as the sample gets very large can we be (almost) sure that the second term in (2.9) vanishes.

Equation (2.9) will give different values of β̂ when we use different samples, that is different draws of the random variables u_t, x_t, and y_t. Since the true value, β, is a fixed constant, this distribution describes the uncertainty we should have about the true value after having obtained a specific estimated value.

The first conclusion from (2.9) is that, with u_t = 0 the estimate would always be perfect—and with large movements in u_t we will see large movements in β̂. The second conclusion is that a small sample (small T) will also lead to large random movements in β̂₁—in contrast to a large sample where the randomness in Σ_{t=1}^T x_t u_t/T is averaged out more effectively (should be zero in a large sample).

There are three main routes to learn more about the distribution of β̂: (i) set up a small "experiment" in the computer and simulate the distribution (Monte Carlo or bootstrap simulation); (ii) pretend that the regressors can be treated as fixed numbers and then assume something about the distribution of the residuals; or (iii) use the asymptotic (large sample) distribution as an approximation. The asymptotic distribution can often be derived, in contrast to the exact distribution in a sample of a given size. If the actual sample is large, then the asymptotic distribution may be a good approximation.

The simulation approach has the advantage of giving a precise answer—but the disadvantage of requiring a very precise question (must write computer code that is tailor made for the particular model we are looking at, including the specific parameter values). See Figure 2.9 for an example.
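A minimal Monte Carlo along the lines of route (i): simulate many samples from a known model, estimate β₁ on each, and look at the spread of the estimates. The parameter values below are made up and are not the exact setup behind Figure 2.9.

import numpy as np

rng = np.random.default_rng(seed=0)
beta1, T, n_sim = 0.75, 50, 5000

b_hat = np.empty(n_sim)
for i in range(n_sim):
    x = rng.normal(size=T)                      # new draws of regressor and residual each time
    u = rng.normal(size=T)
    y = beta1 * x + u                           # zero-mean model, as in (2.9)
    b_hat[i] = (x * y).sum() / (x * x).sum()    # the LS estimate (2.7)

print(b_hat.mean(), b_hat.std())                # mean close to 0.75; the std measures the uncertainty

Raising T shrinks the standard deviation of the estimates, in line with the discussion of (2.9).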
2.1.5 The Distribution of β̂ with Fixed Regressors

The assumption of fixed regressors makes a lot of sense in controlled experiments, where we actually can generate different samples with the same values of the regressors (the heat or whatever). It makes much less sense in econometrics. However, it is easy to derive results for this case—and those results happen to be very similar to what asymptotic theory gives.

Remark 2.5 (Linear combination of normally distributed variables.) If the random variables z_t and v_t are normally distributed, then a + bz_t + cv_t is too. To be precise, a + bz_t + cv_t ∼ N(a + bμ_z + cμ_v, b²σ_z² + c²σ_v² + 2bcσ_zv).

Suppose u_t ∼ N(0, σ²), then (2.9) shows that β̂₁ is normally distributed. The reason is that β̂₁ is just a constant (β₁) plus a linear combination of normally distributed residuals (with fixed regressors, x_t/Σ_{t=1}^T x_t x_t can be treated as constant). It is straightforward to see that the mean of this normal distribution is β₁ (the true value), since the rest is a linear combination of the residuals—and they all have a zero mean. Finding the variance of β̂₁ is just slightly more complicated. First, write (2.9) as

  β̂₁ = β₁ + (1/Σ_{t=1}^T x_t x_t)(x₁u₁ + x₂u₂ + ... + x_T u_T).  (2.10)

Second, remember that we treat x_t as fixed numbers ("constants"). Third, assume that the residuals are iid: they are uncorrelated with each other (independently distributed) and have the same variances (identically distributed). The variance of (2.10) is then

  Var(β̂₁) = (1/Σ_{t=1}^T x_t x_t) Var(x₁u₁ + x₂u₂ + ... + x_T u_T) (1/Σ_{t=1}^T x_t x_t)
           = (1/Σ_{t=1}^T x_t x_t) (x₁²σ₁² + x₂²σ₂² + ... + x_T²σ_T²) (1/Σ_{t=1}^T x_t x_t)
           = (1/Σ_{t=1}^T x_t x_t) (x₁²σ² + x₂²σ² + ... + x_T²σ²) (1/Σ_{t=1}^T x_t x_t)
           = (1/Σ_{t=1}^T x_t x_t) (Σ_{t=1}^T x_t x_t σ²) (1/Σ_{t=1}^T x_t x_t)
           = σ² / Σ_{t=1}^T x_t x_t.  (2.11)

The first line follows directly from (2.10), since β₁ is a constant. Notice that the two 1/Σ_{t=1}^T x_t x_t are kept separate in order to facilitate the comparison with the case of several regressors. The second line follows from assuming that the residuals are uncorrelated with each other (Cov(u_i, u_j) = 0 if i ≠ j), so all cross terms (x_i x_j Cov(u_i, u_j)) are zero. The third line follows from assuming that the variances are the same across observations (σ_i² = σ_j² = σ²). The fourth and fifth lines are just algebraic simplifications.

Notice that the denominator increases with the sample size while the numerator stays constant: a larger sample gives a smaller uncertainty about the estimate. Similarly, a lower volatility of the residuals (lower σ²) also gives a lower uncertainty about the estimate. See Figure 2.8.

Example 2.6 When the regressor is just a constant (equal to one), x_t = 1, then we have Σ_{t=1}^T x_t x_t′ = Σ_{t=1}^T 1·1′ = T, so Var(β̂) = σ²/T. (This is the classical expression for the variance of a sample mean.)

Example 2.7 When the regressor is a zero mean variable, then we have Σ_{t=1}^T x_t x_t′ = Var̂(x_t)·T, so Var(β̂) = σ²/[Var̂(x_t)·T]. The variance is increasing in σ², but decreasing in both T and Var̂(x_t). Why?

2.1.6 The Distribution of β̂: A Bit of Asymptotic Theory

A law of large numbers would (in most cases) say that both Σ_{t=1}^T x_t²/T and Σ_{t=1}^T x_t u_t/T in (2.9) converge to their expected values as T → ∞. The reason is that both are sample averages of random variables (clearly, both x_t² and x_t u_t are random variables). These expected values are Var(x_t) and Cov(x_t, u_t), respectively (recall both x_t and u_t have zero means). The key to show that β̂ is consistent is that Cov(x_t, u_t) = 0. This highlights the importance of using good theory to derive not only the systematic part of (2.1), but also in understanding the properties of the errors. For instance, when economic theory tells us that y_t and x_t affect each other (as prices and quantities typically do), then the errors are likely to be correlated with the regressors—and LS is inconsistent. One common way to get around that is to use an instrumental variables technique. Consistency is a feature
we want from most estimators, since it says that we would at least get it right if we had enough data.

Figure 2.8: Regressions: importance of error variance and variation of regressor. [Panels: data and regression lines for y = b₀ + b₁x + ε with b₀ and b₁ equal to 2.0 and 1.3: a baseline case, a case with a large error variance, a case with little variation in x, and a case with a small sample.]

Suppose that β̂ is consistent. Can we say anything more about the asymptotic distribution? Well, the distribution of β̂ converges to a spike with all the mass at β, but the distribution of √T(β̂ − β) will typically converge to a non-trivial normal distribution. To see why, note from (2.9) that we can write

  √T(β̂ − β) = (Σ_{t=1}^T x_t²/T)⁻¹ √T Σ_{t=1}^T x_t u_t/T.  (2.12)

The first term on the right hand side will typically converge to the inverse of Var(x_t), as discussed earlier. The second term is √T times a sample average (of the random variable x_t u_t) with a zero expected value, since we assumed that β̂ is consistent. Under weak conditions, a central limit theorem applies so √T times a sample average converges to a normal distribution. This shows that √T(β̂ − β) has an asymptotic normal distribution. It turns out that this is a property of many estimators, basically because most estimators are some kind of sample average. The properties of this distribution are quite similar to those that we derived by assuming that the regressors were fixed numbers.

Figure 2.9: Distribution of LS estimator when residuals have a non-normal distribution. [Panels: the distribution of the t-stat for T = 5 and T = 100 in the model R_t = 0.9f_t + ε_t, where ε_t = v_t − 2 and v_t has a χ²(2) distribution. For T = 5 and T = 100 the kurtosis of the t-stat is 71.9 and 3.1, the frequency of |t-stat| > 1.645 is 0.25 and 0.10, and the frequency of |t-stat| > 1.96 is 0.19 and 0.06. The last panel shows the N(0,1) and χ²(2)−2 probability density functions.]

2.1.7 Multiple Regression

All the previous results still hold in a multiple regression—with suitable reinterpretations of the notation.
Consider the linear model

  y_t = x_{1t}β₁ + x_{2t}β₂ + ... + x_{kt}β_k + u_t
      = x_t′β + u_t,  (2.13)

where y_t and u_t are scalars, x_t a k×1 vector, and β is a k×1 vector of the true coefficients (see Appendix A for a summary of matrix algebra). Least squares minimizes the sum of the squared fitted residuals

  Σ_{t=1}^T û_t² = Σ_{t=1}^T (y_t − x_t′β̂)²,  (2.14)

by choosing the vector β. The first order conditions are

  0_{k×1} = Σ_{t=1}^T x_t(y_t − x_t′β̂), or Σ_{t=1}^T x_t y_t = Σ_{t=1}^T x_t x_t′β̂,  (2.15)

which can be solved as

  β̂ = (Σ_{t=1}^T x_t x_t′)⁻¹ Σ_{t=1}^T x_t y_t.  (2.16)

Example 2.8 With 2 regressors (k = 2), (2.15) is

  [0; 0] = Σ_{t=1}^T [x_{1t}(y_t − x_{1t}β̂₁ − x_{2t}β̂₂); x_{2t}(y_t − x_{1t}β̂₁ − x_{2t}β̂₂)]

and (2.16) is

  [β̂₁; β̂₂] = (Σ_{t=1}^T [x_{1t}x_{1t}, x_{1t}x_{2t}; x_{2t}x_{1t}, x_{2t}x_{2t}])⁻¹ Σ_{t=1}^T [x_{1t}y_t; x_{2t}y_t].

2.2 Hypothesis Testing

2.2.1 Testing a Single Coefficient

We assume that the estimates are normally distributed. To be able to easily compare with printed tables of probabilities, we transform to a N(0,1) variable as

  t = (β̂ − β)/Std(β̂) ∼ N(0,1),  (2.17)

where Std(β̂) is the standard error (deviation) of β̂.

The logic of a hypothesis test is perhaps best described by an example. Suppose you want to test the hypothesis that β = 1. (Econometric programs often automatically report results for the null hypothesis that β = 0.) The steps are then as follows.

1. Construct the distribution under H0: set β = 1 in (2.17).

2. Would the test statistic (t) you get be very unusual under the H0 distribution? This depends on the alternative and on what tolerance you have towards unusual outcomes.

3. What is the alternative hypothesis? Suppose H1: β ≠ 1, so unusual is the same as a value of β̂ far from 1, that is, |t| is large.

4. Define how tolerant towards unusual outcomes you are. For instance, choose a cutoff value so that a test statistic beyond this would happen less than 10% (significance level) of the times in case your H0 was actually true. Since this is a two-sided test, use both tails, so the critical values are (−1.65 and 1.65). See Figure 2.10 for an example. The idea is that values beyond (−1.65 and 1.65) are unlikely to happen if your H0 is true, so if you get such a t-stat, then your H0 is probably false: you reject it. See Tables 2.1 and 2.2 for examples.

Clearly, a 5% significance level gives the critical values −1.96 and 1.96, which would be really unusual under H0. We sometimes compare with a t-distribution instead of a N(0,1), especially when the sample is short. For samples of more than 30–40 data points, the difference is trivial—see Table A.1. The p-value is a related concept. It is the lowest significance level at which we can reject the null hypothesis.

Example 2.9 Std(β̂) = 1.5, β̂ = 3 and β = 1 (under H0): t = (3 − 1)/1.5 ≈ 1.33, so we cannot reject H0 at the 10% significance level. Instead, if β̂ = 4, then t = (4 − 1)/1.5 = 2, so we could reject the null hypothesis.
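A sketch of (2.16)–(2.17) on simulated data (the data-generating coefficients and the hypothesized value below are assumptions for the illustration):

import numpy as np

rng = np.random.default_rng(seed=0)
T = 200
x1 = rng.normal(size=T)
x2 = rng.normal(size=T)
X = np.column_stack([np.ones(T), x1, x2])          # regressors, including a constant
y = 0.5 + 1.0 * x1 - 0.3 * x2 + rng.normal(size=T)

b = np.linalg.solve(X.T @ X, X.T @ y)              # (2.16)
u = y - X @ b                                      # fitted residuals
s2 = (u @ u) / T                                   # residual variance estimate
V = s2 * np.linalg.inv(X.T @ X)                    # classical covariance matrix of b (iid residuals)
std_b = np.sqrt(np.diag(V))

t_stat = (b[1] - 1.0) / std_b[1]                   # t-stat (2.17) for H0: coefficient on x1 equals 1
print(b, std_b, t_stat)                            # |t| > 1.65 rejects at the 10% level (two-sided)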
Figure 2.10: Density function of normal distribution. [The N(0,1) density with 5% in each tail beyond −1.65 and 1.65, and 2.5% in each tail beyond −1.96 and 1.96.]

2.2.2 Joint Test of Several Coefficients

A joint test of several coefficients is different from testing the coefficients one at a time. For instance, suppose your economic hypothesis is that β₁ = 1 and β₃ = 0. You could clearly test each coefficient individually (by a t-test), but that may give conflicting results. In addition, it does not use the information in the sample as effectively as possible. It might well be the case that we cannot reject any of the hypotheses (that β₁ = 1 and β₃ = 0), but that a joint test might be able to reject it.

Intuitively, a joint test is like exploiting the power of repeated sampling as illustrated by the following example. My null hypothesis might be that I am a better tennis player than my friend. After playing (and losing) once, I cannot reject the null—since pure randomness (wind, humidity,...) might have caused the result. The same is true for the second game (and loss)—if I treat the games as completely unrelated events. However, considering both games, the evidence against the null hypothesis is (unfortunately) much stronger.

A joint test makes use of the following remark.

Remark 2.10 (Chi-square distribution) If v is a zero mean normally distributed vector, then we have

  v′Σ⁻¹v ∼ χ²_n, if the n×1 vector v ∼ N(0, Σ).

As a special case, suppose the vector only has one element. Then, the quadratic form can be written [v/Std(v)]², which is the square of a t-statistic.

Figure 2.11: Density functions of χ² distributions with different degrees of freedom. [Panels: the χ²₁, χ²₂ and χ²₅ densities, with 10% critical values 2.71, 4.61 and 9.24.]

Example 2.11 (Quadratic form with a chi-square distribution) If the 2×1 vector v has the following normal distribution

  [v₁; v₂] ∼ N( [0; 0], [1, 0; 0, 2] ),

then the quadratic form

  [v₁; v₂]′ [1, 0; 0, 1/2] [v₁; v₂] = v₁² + v₂²/2

has a χ²₂ distribution.

For instance, suppose we have estimated a model with three coefficients and the null hypothesis is

  H0: β₁ = 1 and β₃ = 0.  (2.18)
It is convenient to write this in matrix form as

  [1, 0, 0; 0, 0, 1] [β₁; β₂; β₃] = [1; 0], or more generally  (2.19)

  Rβ = q,  (2.20)

where q has J (here 2) rows. Notice that the covariance matrix of these linear combinations is then

  Var(Rβ̂) = R V(β̂) R′,  (2.21)

where V(β̂) denotes the covariance matrix of the coefficients. Putting together these results we have the test statistic (a scalar)

  (Rβ̂ − q)′ [R V(β̂) R′]⁻¹ (Rβ̂ − q) ∼ χ²_J.  (2.22)

This test statistic is compared to the critical values of a χ²_J distribution—see Table A.2. (Alternatively, it can be put in the form of an F statistic, which is a small sample refinement.)

A particularly important case is the test of the joint hypothesis that all slope coefficients in the regression (that is, excluding the intercept) are zero. It can be shown that the test statistic for this hypothesis is

  TR² ∼ χ²_{#slopes}.  (2.23)

See Tables 2.1 and 2.2 for examples of this test.

Example 2.12 (Joint test) Suppose H0: β₁ = 0 and β₃ = 0; (β̂₁, β̂₂, β̂₃) = (2, 777, 3) and

  R = [1, 0, 0; 0, 0, 1] and V(β̂) = [4, 0, 0; 0, 33, 0; 0, 0, 1], so
  R V(β̂) R′ = [1, 0, 0; 0, 0, 1] [4, 0, 0; 0, 33, 0; 0, 0, 1] [1, 0; 0, 0; 0, 1] = [4, 0; 0, 1].

Then, (2.22) is

  ([1, 0, 0; 0, 0, 1][2; 777; 3] − [0; 0])′ [4, 0; 0, 1]⁻¹ ([1, 0, 0; 0, 0, 1][2; 777; 3] − [0; 0])
    = [2, 3] [0.25, 0; 0, 1] [2; 3] = 10,

which is higher than the 10% critical value of the χ²₂ distribution (which is 4.61).
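The arithmetic of Example 2.12 can be replicated with a few lines of NumPy; nothing here goes beyond the numbers already stated in the example.

import numpy as np

b_hat = np.array([2.0, 777.0, 3.0])            # estimated coefficients from Example 2.12
V = np.diag([4.0, 33.0, 1.0])                  # their covariance matrix
R = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])                # picks out beta1 and beta3
q = np.array([0.0, 0.0])                       # H0: beta1 = 0 and beta3 = 0

diff = R @ b_hat - q
stat = diff @ np.linalg.inv(R @ V @ R.T) @ diff   # the test statistic (2.22)
print(stat)                                       # 10.0 > 4.61, the 10% critical value of chi2(2)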
2.3 Heteroskedasticity

Suppose we have a regression model

  y_t = x_t′b + u_t, where E u_t = 0 and Cov(x_{it}, u_t) = 0.  (2.24)

In the standard case we assume that u_t is iid (independently and identically distributed), which rules out heteroskedasticity.

In case the residuals actually are heteroskedastic, least squares (LS) is nevertheless a useful estimator: it is still consistent (we get the correct values as the sample becomes really large)—and it is reasonably efficient (in terms of the variance of the estimates). However, the standard expression for the standard errors (of the coefficients) is (except in a special case, see below) not correct. This is illustrated in Figure 2.13.

Figure 2.12: Effect of heteroskedasticity on uncertainty about regression line. [Panels: scatter plots of y = 0.03 + 1.3x + u with iid residuals and with Var(residual) depending on x²; solid regression lines are based on all data, dashed lines exclude the crossed out data point.]

To test for heteroskedasticity, we can use White's test of heteroskedasticity. The null hypothesis is homoskedasticity, and the alternative hypothesis is the kind of heteroskedasticity which can be explained by the levels, squares, and cross products of the regressors—clearly a special form of heteroskedasticity. The reason for this specification is that if the squared residual is uncorrelated with these squared regressors, then the usual LS covariance matrix applies—even if the residuals have some other sort of heteroskedasticity (this is the special case mentioned before).

To implement White's test, let w_t be the squares and cross products of the regressors. The test is then to run a regression of squared fitted residuals on w_t

  û_t² = w_t′γ + v_t,  (2.25)

and to test if all the slope coefficients (not the intercept) in γ are zero. (This can be done by using the fact that TR² ∼ χ²_p, p = dim(w_t) − 1.)

Example 2.13 (White's test) If the regressors include (1, x_{1t}, x_{2t}), then w_t in (2.25) is the vector (1, x_{1t}, x_{2t}, x_{1t}², x_{1t}x_{2t}, x_{2t}²). (Notice that the cross product of (1, x_{1t}, x_{2t}) with 1 gives us the regressors in levels, not squares.)

There are two ways to handle heteroskedasticity in the residuals. First, we could use some other estimation method than LS that incorporates the structure of the heteroskedasticity. For instance, combining the regression model (2.24) with an ARCH structure of the residuals, and estimating the whole thing with maximum likelihood (MLE), is one way. As a by-product we get the correct standard errors—provided the assumed distribution (in the likelihood function) is correct. Second, we could stick to OLS, but use another expression for the variance of the coefficients: a heteroskedasticity consistent covariance matrix, among which "White's covariance matrix" is the most common.

Figure 2.13: Variance of OLS estimator, heteroskedastic errors. [The standard deviation of the LS estimate of b in y_t = a + bx_t + u_t according to the classical formula σ²(X′X)⁻¹, according to White's formula, and from simulations; the model is y_t = 0.9x_t + ε_t where ε_t ∼ N(0, h_t) with h_t = 0.5 exp(αx_t²), for different values of α.]

To understand the construction of White's covariance matrix, recall that the variance of β̂₁ is found from

  β̂₁ = β₁ + (1/Σ_{t=1}^T x_t x_t)(x₁u₁ + x₂u₂ + ... + x_T u_T).  (2.26)

Assuming that the residuals are uncorrelated gives

  Var(β̂₁) = (1/Σ_{t=1}^T x_t x_t) Var(x₁u₁ + x₂u₂ + ... + x_T u_T) (1/Σ_{t=1}^T x_t x_t)
           = (1/Σ_{t=1}^T x_t x_t) [x₁² Var(u₁) + x₂² Var(u₂) + ... + x_T² Var(u_T)] (1/Σ_{t=1}^T x_t x_t)
           = (1/Σ_{t=1}^T x_t x_t) (Σ_{t=1}^T x_t² σ_t²) (1/Σ_{t=1}^T x_t x_t).  (2.27)

This expression cannot be simplified further since σ_t is not constant—and also related to x_t². The idea of White's estimator is to estimate Σ_{t=1}^T x_t² σ_t² by Σ_{t=1}^T x_t x_t′ û_t² (which also allows for the case with several elements in x_t, that is, several regressors).

It is straightforward to show that the standard expression for the variance underestimates the true variance when there is a positive relation between x_t² and σ_t² (and vice versa). The intuition is that much of the precision (low variance of the estimates) of OLS comes from data points with extreme values of the regressors: think of a scatter plot and notice that the slope depends a lot on fitting the data points with very low and very high
values of the regressor. This nice property is destroyed if the data points with extreme values of the regressor also have lots of noise (high variance of the residual).

Remark 2.14 (Standard OLS vs. White's variance) If x_t² is not related to σ_t², then we could write the last term in (2.27) as

  Σ_{t=1}^T x_t² σ_t² = (1/T) Σ_{t=1}^T σ_t² Σ_{t=1}^T x_t²
                      = σ̄² Σ_{t=1}^T x_t²,

where σ̄² is the average variance, typically estimated as Σ_{t=1}^T û_t²/T. That is, it is the same as for standard OLS. Notice that

  Σ_{t=1}^T x_t² σ_t² > (1/T) Σ_{t=1}^T σ_t² Σ_{t=1}^T x_t²

if x_t² is positively related to σ_t² (and vice versa). For instance, with (x₁², x₂²) = (10, 1) and (σ₁², σ₂²) = (5, 2), Σ_{t=1}^T x_t² σ_t² = 10·5 + 1·2 = 52, while (1/T) Σ_{t=1}^T σ_t² Σ_{t=1}^T x_t² = (1/2)(5 + 2)(10 + 1) = 38.5.
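A sketch of White's covariance matrix, using the sums discussed around (2.27) in their several-regressor form; the simulated data (with residual variance rising in x²) are made up for the illustration.

import numpy as np

rng = np.random.default_rng(seed=0)
T = 500
x = rng.normal(size=T)
u = rng.normal(size=T) * np.sqrt(0.5 + x**2)    # heteroskedastic residuals
y = 0.9 * x + u

X = np.column_stack([np.ones(T), x])
b = np.linalg.solve(X.T @ X, X.T @ y)
uhat = y - X @ b

Sxx_inv = np.linalg.inv(X.T @ X)
V_classical = (uhat @ uhat) / T * Sxx_inv          # sigma^2 (X'X)^{-1}
S_white = (X * uhat[:, None]**2).T @ X             # sum of x_t x_t' uhat_t^2
V_white = Sxx_inv @ S_white @ Sxx_inv              # White's covariance matrix
print(np.sqrt(np.diag(V_classical)))
print(np.sqrt(np.diag(V_white)))

With this design the classical standard errors come out too small, in line with the discussion above.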
2.4 Autocorrelation

Autocorrelation of the residuals (Cov(u_t, u_{t−s}) ≠ 0) is also a violation of the iid assumptions underlying the standard expressions for the variance of β̂₁. In this case, LS is (typically) still consistent (exceptions: when the lagged dependent variable is a regressor), but the variances are (again) wrong. In particular, not even the first line of (2.27) is true, since the variance of the sum in (2.26) depends also on the covariance terms.

Figure 2.14: Effect of autocorrelation on uncertainty about regression line. [Panels: scatter plots of y = 0.03 + 1.3x + u with iid and with autocorrelated residuals; solid regression lines are based on all data, dashed lines on the late sample (high x values); the regressor is (strongly) autocorrelated, since it is an increasing series (−10, −9.9, ..., 10).]

There are several straightforward tests of autocorrelation—all based on using the fitted residuals. The null hypothesis is no autocorrelation. First, estimate the autocorrelations of the fitted residuals as

  ρ_s = Corr(û_t, û_{t−s}), s = 1,...,L.  (2.28)

Second, test the autocorrelation ρ_s by using the fact that √T ρ̂_s has a standard normal distribution (in large samples)

  √T ρ̂_s ∼ N(0,1).  (2.29)

An alternative for testing the first autocorrelation coefficient is the Durbin-Watson test. The test statistic is (approximately)

  DW ≈ 2 − 2ρ̂₁,  (2.30)

and the null hypothesis is rejected in favour of positive autocorrelation if DW < 1.5 or so (depending on the sample size and the number of regressors). To extend (2.29) to higher-order autocorrelation, use the Box-Pierce test

  Q_L = T Σ_{s=1}^L ρ̂_s² →d χ²_L.  (2.31)

If there is autocorrelation, then we can choose to estimate a fully specified model (including how the autocorrelation is generated) by MLE, or we can stick to OLS but apply an autocorrelation consistent covariance matrix—for instance, the "Newey-West covariance matrix."

To understand the Newey-West covariance matrix, consider the first line of (2.27). The middle term is a variance of a sum—which should (in principle) involve all covariance terms. For instance, with T = 3, we have

  Var(x₁u₁ + x₂u₂ + x₃u₃) = x₁²σ₁² + x₂²σ₂² + x₃²σ₃²
      + 2x₁x₂ Cov(u₁, u₂) + 2x₁x₃ Cov(u₁, u₃) + 2x₂x₃ Cov(u₂, u₃).  (2.32)

The idea of the Newey-West estimator is to first estimate all variances and a user-determined number of lags (covariances) and then sum them up as in (2.32). With only one lag the calculation is (with several regressors)

  Σ_{t=1}^T x_t x_t′ û_t² + Σ_{t=2}^T (x_t x_{t−1}′ + x_{t−1} x_t′) û_t û_{t−1}.

It is clear from this expression that what really counts is not so much the autocorrelation in u_t per se, but the autocorrelation of x_t u_t. If this is positive, then the standard expression underestimates the true variance of the estimated coefficients (and vice versa).

Figure 2.15: Autocorrelation of x_t u_t when u_t has autocorrelation ρ. [The model is y_t = 0.9x_t + ε_t, where ε_t = ρε_{t−1} + u_t, u_t is iid N(0,h) such that Std(ε_t) = 1, and x_t = κx_{t−1} + η_t; the autocorrelation of x_t u_t is shown for κ = −0.9, 0 and 0.9.]

Figure 2.16: Variance of OLS estimator, autocorrelated errors. [The standard deviation of the LS estimate of b in y_t = a + bx_t + u_t according to the classical formula σ²(X′X)⁻¹, according to Newey-West, and from simulations, for κ = −0.9, 0 and 0.9 and different values of ρ.]

Figure 2.17: Slope coefficient, LS vs Newey-West standard errors. [The slope from regressions of US stock returns on the lagged return, with two different 90% confidence bands (OLS and NW standard errors), for return horizons up to 60 months; US stock returns 1926:1−2008:12, overlapping data.]

Figure 2.18: US 12-month interest and average federal funds rate (next 12 months). [Scatter plot of the change in the next-year average federal funds rate against the change in the 12-month interest rate, overlapping US data 1970:1−2009:5 (expectations hypothesis); slope coefficient 0.62, Std (classical and Newey-West): 0.04 and 0.12.]
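A sketch of the Newey-West sum just described (the plain, unweighted version from the text; many software implementations additionally down-weight the covariance terms). The helper function and the simulated data are mine, for illustration only.

import numpy as np

def newey_west_cov(X, uhat, n_lags=1):
    """Covariance matrix of the LS coefficients with the Newey-West middle term."""
    Sxx_inv = np.linalg.inv(X.T @ X)
    S = (X * uhat[:, None]**2).T @ X                      # variance terms (as in White's estimator)
    for s in range(1, n_lags + 1):
        Gamma = (X[s:] * (uhat[s:] * uhat[:-s])[:, None]).T @ X[:-s]  # sum of x_t x_{t-s}' uhat_t uhat_{t-s}
        S += Gamma + Gamma.T                              # adds the x_{t-s} x_t' counterpart
    return Sxx_inv @ S @ Sxx_inv

# usage with made-up data: a persistent regressor and autocorrelated residuals
rng = np.random.default_rng(seed=0)
T = 500
x = rng.normal(size=T).cumsum() * 0.1
e = rng.normal(size=T)
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.5 * u[t-1] + e[t]
y = 1.0 + 0.9 * x + u
X = np.column_stack([np.ones(T), x])
b = np.linalg.solve(X.T @ X, X.T @ y)
uhat = y - X @ b
print(np.sqrt(np.diag(newey_west_cov(X, uhat, n_lags=1))))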
A A Primer in Matrix Algebra

Let c be a scalar and define the matrices

  x = [x₁; x₂], z = [z₁; z₂], A = [A₁₁, A₁₂; A₂₁, A₂₂], and B = [B₁₁, B₁₂; B₂₁, B₂₂].

Adding/subtracting a scalar to a matrix or multiplying a matrix by a scalar are both element by element

  [A₁₁, A₁₂; A₂₁, A₂₂] + c = [A₁₁ + c, A₁₂ + c; A₂₁ + c, A₂₂ + c]
  [A₁₁, A₁₂; A₂₁, A₂₂] c = [A₁₁c, A₁₂c; A₂₁c, A₂₂c].

Example A.1

  [1, 3; 3, 4] + 10 = [11, 13; 13, 14]
  [1, 3; 3, 4] 10 = [10, 30; 30, 40].

Matrix addition (or subtraction) is element by element

  A + B = [A₁₁, A₁₂; A₂₁, A₂₂] + [B₁₁, B₁₂; B₂₁, B₂₂] = [A₁₁ + B₁₁, A₁₂ + B₁₂; A₂₁ + B₂₁, A₂₂ + B₂₂].

Example A.2 (Matrix addition and subtraction)

  [10; 11] − [2; 5] = [8; 6]
  [1, 3; 3, 4] + [1, 2; 3, −2] = [2, 5; 6, 2].

To turn a column into a row vector, use the transpose operator like in x′

  x′ = [x₁; x₂]′ = [x₁, x₂].

Similarly, transposing a matrix is like flipping it around the main diagonal

  A′ = [A₁₁, A₁₂; A₂₁, A₂₂]′ = [A₁₁, A₂₁; A₁₂, A₂₂].

Example A.3 (Matrix transpose)

  [10; 11]′ = [10, 11]
  [1, 2, 3; 4, 5, 6]′ = [1, 4; 2, 5; 3, 6].

Matrix multiplication requires the two matrices to be conformable: the first matrix has as many columns as the second matrix has rows. Element ij of the result is the multiplication of the ith row of the first matrix with the jth column of the second matrix

  AB = [A₁₁, A₁₂; A₂₁, A₂₂][B₁₁, B₁₂; B₂₁, B₂₂] = [A₁₁B₁₁ + A₁₂B₂₁, A₁₁B₁₂ + A₁₂B₂₂; A₂₁B₁₁ + A₂₂B₂₁, A₂₁B₁₂ + A₂₂B₂₂].

Multiplying a square matrix A with a column vector z gives a column vector

  Az = [A₁₁, A₁₂; A₂₁, A₂₂][z₁; z₂] = [A₁₁z₁ + A₁₂z₂; A₂₁z₁ + A₂₂z₂].
Example A.4 (Matrix multiplication) Example A.6 (Matrix inverse) We have
" #" # " # " #" # " #
1 3 1 2 10 4 4=5 3=5 1 3 1 0
D D , so
3 4 3 2 15 2 3=5 1=5 3 4 0 1
" #" # " # " # 1 " #
1 3 2 17 1 3 4=5 3=5
D D :
3 4 5 26 3 4 3=5 1=5

For two column vectors x and z, the product x 0 z is called the inner product
" #
h i z
0 1
x z D x1 x2 D x1 z1 C x2 z2 ;
z2

and xz 0 the outer product


" # " #
0x1 h i x1 z1 x1 z2
xz D z1 z2 D :
x2 x2 z1 x2 z2

(Notice that xz does not work). If x is a column vector and A a square matrix, then the
product x 0 Ax is a quadratic form.

Example A.5 (Inner product, outer product and quadratic form )


" #0 " # " #
10 2 h i 2
D 10 11 D 75
11 5 5
" # " #0 " # " #
10 2 10 h i 20 50
D 2 5 D
11 5 11 22 55
" #0 " #" #
10 1 3 10
D 1244:
11 3 4 11

A matrix inverse is the closest we get to “dividing” by a matrix. The inverse of a


matrix A, denoted A 1 , is such that

1
AA D I and A 1 A D I;

where I is the identity matrix (ones along the diagonal, and zeroes elsewhere). The matrix
inverse is useful for solving systems of linear equations, y D Ax as x D A 1 y.

48 49
A Statistical Tables

n Critical values n Critical values


10% 5% 1% 10% 5% 1%
10 1.81 2.23 3.17 1 2.71 3.84 6.63
20 1.72 2.09 2.85 2 4.61 5.99 9.21
30 1.70 2.04 2.75 3 6.25 7.81 11.34
40 1.68 2.02 2.70 4 7.78 9.49 13.28
50 1.68 2.01 2.68 5 9.24 11.07 15.09
60 1.67 2.00 2.66 6 10.64 12.59 16.81
70 1.67 1.99 2.65 7 12.02 14.07 18.48
80 1.66 1.99 2.64 8 13.36 15.51 20.09
90 1.66 1.99 2.63 9 14.68 16.92 21.67
100 1.66 1.98 2.63 10 15.99 18.31 23.21
Normal 1.64 1.96 2.58

Table A.2: Critical values of chisquare distribution (different degrees of freedom, n).
Table A.1: Critical values (two-sided test) of t distribution (different degrees of freedom)
and normal distribution.

Bibliography
n1 n2 2n1 =n1
Verbeek, M., 2008, A guide to modern econometrics, Wiley, Chichester, 3rd edn.
10 30 50 100 300
1 4.96 4.17 4.03 3.94 3.87 3.84
2 4.10 3.32 3.18 3.09 3.03 3.00
3 3.71 2.92 2.79 2.70 2.63 2.60
4 3.48 2.69 2.56 2.46 2.40 2.37
5 3.33 2.53 2.40 2.31 2.24 2.21
6 3.22 2.42 2.29 2.19 2.13 2.10
7 3.14 2.33 2.20 2.10 2.04 2.01
8 3.07 2.27 2.13 2.03 1.97 1.94
9 3.02 2.21 2.07 1.97 1.91 1.88
10 2.98 2.16 2.03 1.93 1.86 1.83

Table A.3: 5% Critical values of Fn1;n2 distribution (different degrees of freedom).

50 51
.D..3

3 Maximum Likelihood Estimation


Reference: Verbeek (2008) 2 and 4
More advanced material is denoted by a star ( ). It is not required reading.

3.1 Maximum Likelihood

n1 n2 2n1 =n1 A different route to arrive at an estimator is to maximize the likelihood function.
10 30 50 100 300 To understand the principle of maximum likelihood estimation, consider the following
1 3.29 2.88 2.81 2.76 2.72 2.71 example.
2 2.92 2.49 2.41 2.36 2.32 2.30
3 2.73 2.28 2.20 2.14 2.10 2.08
4 2.61 2.14 2.06 2.00 1.96 1.94 3.1.1 Example: Estimating the Mean with ML
5 2.52 2.05 1.97 1.91 1.87 1.85
Suppose we know x t  N.;  2 /, but we don’t know the value of  (for now, assume
6 2.46 1.98 1.90 1.83 1.79 1.77
7 2.41 1.93 1.84 1.78 1.74 1.72 we know the variance). Since x t is a random variable, there is a probability of every
8 2.38 1.88 1.80 1.73 1.69 1.67 observation and the density function of x t is
9 2.35 1.85 1.76 1.69 1.65 1.63  
10 2.32 1.82 1.73 1.66 1.62 1.60 1 1 .x t /2
L D pdf .x t / D p exp ; (3.1)
2 2 2 2

Table A.4: 10% Critical values of Fn1;n2 distribution (different degrees of freedom). where L stands for “likelihood.” The basic idea of maximum likelihood estimation (MLE)
is to pick model parameters to make the observed data have the highest possible proba-
bility. Here this gives O D x t . This is the maximum likelihood estimator in this example.
What if there are T observations, x1 ; x2 ,...xT ? In the simplest case where xi and xj
are independent, then the joint pdf is just the product of the individual pdfs (for instance,
pdf.xi ; xj / D pdf.xi / pdf.xj /) so

L D pdf .x1 /  pdf .x2 /  :::  pdf .xN / (3.2)


  
1 .x1 /2 .x2 /2 .xN /2
D .2 2 / N=2 exp 2
C 2
C ::: C 2
(3.3)
2   

52 53
Take logs (log likelihood) we could assume that u t is iid N.0;  2 /. The probability density function of u t is
 
T 1   1 1 2 2
ln L D ln.2 2 / .x1 /2 C .x2 /2 C ::: C .xT /2 : (3.4) pdf .u t / D p exp u t = : (3.10)
2 2 2 2 2 2
The derivative with respect to  is Since the errors are independent, we get the joint pdf of the u1 ; u2 ; : : : ; uT by multiplying
@ ln L 1 the marginal pdfs of each of the errors
D 2 Œ.x1 / C .x2 / C ::: C .xT / : (3.5)
@ 
L D pdf .u1 /  pdf .u2 /  :::  pdf .uT /
To maximize the likelihood function find the value of O that makes @ ln L=@ D 0, which   
1 u21 u22 u2T
is the usual sample average D .2 2 / T =2 exp C C ::: C : (3.11)
2 2 2 2

O D .x1 C x2 C ::: C xT / =T: (3.6) Substitute y t x t ˇ for u t and take logs to get the log likelihood function of the sample
XT
Remark 3.1 (Coding the log likelihood function) Many software packages want just the ln L D ln L t , where (3.12)
t D1
likelihood contribution of data point t (not the ful) ln L t D 21 ln.2 2 / 21 2 .x t /2 . 1 1 1
ln L t D ln .2/ ln. 2 / .y t x t ˇ/2 = 2 : (3.13)
2 2 2
3.1.2 Example: Estimating the Variance with ML Suppose (for simplicity) that we happen to know the value of  2 . It is then clear that
To estimate the variance, use (3.4) and find the value  2 that makes @ ln L=@ 2 D 0 this likelihood function is maximized by minimizing the last term, which is proportional
to the sum of squared errors: LS is ML when the errors are iid normally distributed (but
@ ln L
0D only then). (This holds also when we do not know the value of  2 —just slightly messier
@ 2
T 1 1   to show it.) See Figure 3.1.
D 2 C .x1 /2 C .x2 /2 C ::: C .xT /2 ; (3.7) Maximum likelihood estimators have very nice properties, provided the basic distri-
2 2 2 2. 2 /2
butional assumptions are correct, that is, if we maximize the right likelihood function.
so
1  In that case, MLE is typically the most efficient/precise estimators (at least in very large
O 2 D .x1 /2 C .x2 /2 C ::: C .xT /2 : (3.8)
T samples). ML also provides a coherent framework for testing hypotheses (including the
Notice that we divide by T , not by T 1, so O 2 must be biased, but the bias disappears Wald, LM, and LR tests).
as T ! 1
Example 3.2 Consider the regression model yi D ˇ2 xi Cui , where we (happen to) know
3.1.3 MLE of a Regression that ui  N.0; 1/.Suppose we have the following data
h i h i h i h i
To apply this idea to a regression model y1 y2 y3 D 1:5 0:6 2:1 and x1 x2 x3 D 1 0 1 :

y t D ˇx t C u t ; (3.9)

54 55
Suppose .y1 ; x1 / D . 1:5; 1/. Try different values of ˇ2 on observation 1 Suppose ˇ2 D 2, then we get the following values for u t D y t 2x t and its square
2 3 2 3 2 3
ˇ2 u1 Density function value of u1 1:5 2  . 1/ 0:5 0:25
6 7 6 7 6 7
1:6 1:5 1:6  . 1/ D 0:1 0:40 4 0:6 2  0 5 D 4 0:65 with the square 40:365 with sum 0:62:
1:8 1:5 1:8  . 1/ D 0:3 0:38 2:1 21 0:1 0:01
2:0 1:5 2:0  . 1/ D 0:5 0:35
Now, suppose instead that ˇ2 D 1:8, then we get
Observation 1 favours ˇ D 1:6; see Figure 3.2. Do the same for observations 2 and 3: 2 3 2 3 2 3
1:5 1:8  . 1/ 0:3 0:09
6 7 6 7 6 7
ˇ2 u2 Density function value of u2 4 0:6 1:8  0 5 D 4 0:65 with the square 40:365 with sum 0:54:
1:6 0:6 1:6  0 D 0:6 0:33 2:1 1:8  1 0:3 0:09
1:8 0:6 1:8  0 D 0:6 0:33
The latter choice of ˇ2 will certainly give a larger value of the likelihood function (it is
2:0 0:6 2:0  0 D 0:6 0:33
actually the optimum). See Figure 3.1.
ˇ2 u3 Density function value of u3
1:6 2:1 1:6  1 D 0:5 0:35
3.2 Key Properties of MLE
1:8 2:1 1:8  1 D 0:3 0:38
2:0 2:1 2:0  1 D 0:1 0:40 No general results on small-sample properties of MLE: can be biased or not...
To sum up, observation 1 favours ˇ D 1:6, observation 2 is neutral, and observation 3 MLE have very nice asymptotic (large-sample) properties, provided we maximize the
favours ˇ D 2. The estimate is a “compromise” that maximixes the joint density (the right likelihood function. If so, then
product of the individual densities since the ui are independent)
1. MLE is consistent
ˇ2 pdf.u1 /  pdf.u2 /  pdf.u3 /
2. MLE is the most efficient/precise estimator, at least asymptotically (efficient =
1:6 0:40  0:33  0:35  0:047
smallest variance)
1:8 0:38  0:33  0:38  0:048
2:0 0:35  0:33  0:40  0:047 O are normally distributed,
3. MLE estimates ()
p
so 1:8 has the highest likelihood value of these three alternatives (it is actually the opti- T .O / !d N.0; V /; (3.14)
mum). See Figure 3.2. @ ln L
2
1
V D I./ with I. / D E =T: (3.15)
@ @
Example 3.3 Consider the simple regression where we happen to know that the intercept
(I. / is called the “information matrix”). The information matrix can also be
is zero, y t D ˇ1 x t C u t . Suppose we have the following data 2 log L

h i h i h i h i written I. / D E @ @@ t


, where ln L t is the log likelihood contribution of obser-
y1 y2 y3 D 1:5 0:6 2:1 and x1 x2 x3 D 1 0 1 : vation t.

4. ML also provides a coherent framework for testing hypotheses (including the Wald,
LM, and LR tests).

56 57
OLS, y = b + b x + u, b =0 Sum of squared errors Pdf value of u1 = −1.5 − β×(−1) Pdf value of u2 = −0.6 − β×0
0 1 0
10 0.4 0.4
2 y: −1.5 −0.6 2.1
x: −1.0 0.0 1.0 0.35 0.35
1
b1: 1.8 (OLS)
0 5 0.3 0.3
y

−1 Data
fitted
0.25 0.25
−2
0 0.2 0.2
−1 −0.5 0 0.5 1 0 1 2 3 4 1 1.5 2 2.5 3 1 1.5 2 2.5 3
x b1 β β

Log likelihood Pdf value of u3 = 2.1 − β×1 Pdf value of (u1,u2,u3)


0 0.4

0.35 0.04
−5
0.3
−10 0.02
0.25

−15 0.2 0
0 1 2 3 4 1 1.5 2 2.5 3 1 1.5 2 2.5 3
b1 β β

Figure 3.1: Example of OLS and ML estimation Figure 3.2: Example of OLS and ML estimation

3.2.1 Example of the Information Matrix This is the standard expression for the distribution of a sample average.

Differentiate (3.5) (and assume we know  2 ) to get


3.3 Three Test Principles
@2 ln L T
D : (3.16)
@@ 2 Wald test. Estimate  with MLE, check if O H0 is too large. Example: t-test and F-test
The information matrix is Likelihood ratio test. Estimate  with MLE as usual, estimate again by imposing the
H0 restrictions, test if ln L.O / ln L.“O with H0 restrictions”/ D 0. Example: compare
@2 ln L 1
I./ D E =T D 2 ; (3.17) the R2 from a model without and with a restriction that some coefficient equals 1/3
@@ 
Lagrange multiplier test. Estimate  under the H0 restrictions, check if @ ln L=@ D 0
which we combine with (3.14)–(3.15) to get for unconstrained model is true when evaluated at “O with H0 restrictions”
p
T .O / !d N.0;  2 / or O !d N.;  2 =T /: (3.18)

58 59
3.4 QMLE .D..3

A MLE based on the wrong likelihood function (distribution) may still be useful.
Suppose we use the likelihood function L and get estimates O by

@ ln L 4 Index Models
D0 (3.19)
@
Reference: Elton, Gruber, Brown, and Goetzmann (2007) 7–8, 10
If L is wrong, then we are maximizing the wrong thing. With some luck, we still get
reasonable (concistent) estimates.
4.1 The Inputs to a MV Analysis
Example 3.4 (LS and QMLE) In a linear regression, y t D x t0 ˇ C " t , the first order
O t D 0. To calculate the mean variance frontier we need to calculate both the expected return and
condition for MLE based on the assumption that " t  N.0;  2 / is ˙ tTD1 .y t x t0 ˇ/x
variance of different portfolios (based on n assets). With two assets (n D 2) the expected
This has an expected value of zero (at the true parameters), even if the shocks have a, say,
return and the variance of the portfolio are
t22 distribution (which would define the correct likelihood function).
" #
h i 
1
The example suggests that if E.Rp / D w1 w2
@ ln L 2
E D 0; (3.20) h
"
i 2 
#" #
@ 2 1 12 w1
P D w1 w2 : (4.1)
then the estimates are still consistent. We are doing quasi-MLE (or pseudo-MLE). 12 22 w2
p
With QMLE, T .O  / !d N.0; V /, but
   In this case we need information on 2 mean returns and 3 elements of the covariance
@ ln L t @ ln L t 0 matrix. Clearly, the covariance matrix can alternatively be expressed as
V D I./ 1 E I./ 1 (3.21)
@ @ " # " #
12 12 12 12 1 2
Practical implication: this is perhaps a “safer” way of constructing tests—since it is D ; (4.2)
12 22 12 1 2 22
less restrictive than assuming that we have the exactly correct likelihood function.
which involves two variances and one correlation (3 elements as before).
There are two main problems in estimating these parameters: the number of parame-
Bibliography ters increase very quickly as the number of assets increases and historical estimates have
Verbeek, M., 2008, A guide to modern econometrics, Wiley, Chichester, 3rd edn. proved to be somewhat unreliable for future periods.
To illustrate the first problem, notice that with n assets we need the following number
of parameters

Required number of estimates With 100 assets


i n 100
i i n 100
ij n.n 1/=2 4950

60 61
The numerics is not the problem as it is a matter of seconds to estimate a covariance CAPM regression: Ri−Rf = αi + βi×(Rm−Rf)+ ei
matrix of 100 return series. Instead, the problem is that most portfolio analysis uses 10
lots of judgemental “estimates.” These are necessary since there might be new assets 8 Intercept (αi) and slope (βi): 2.0 1.3
(no historical returns series are available) or there might be good reasons to believe that
6
old estimates are not valid anymore. To cut down on the number of parameters, it is
often assumed that returns follow some simple model. These notes will discuss so-called 4

Excess return asset i,%


single- and multi-index models. 2
The second problem comes from the empirical observations that estimates from his- 0
torical data are sometimes poor “forecasts” of future periods (which is what matters for
−2
portfolio choice). As an example, the correlation between two asset returns tends to be
more “average” than the historical estimate would suggest. −4

A simple (and often used) way to deal with this is to replace the historical correla- −6
tion with an average historical correlation. For instance, suppose there are three assets. −8 Data points
Then, estimate ij on historical data, but use the average estimate as the “forecast” of all Regression line
−10
correlations: −10 −5 0 5 10
2 3 2 3 Market excess return,%
1 12 13 1 N N
6 7 6 7
estimate 4 1 23 5 , calculate N D .O12 C O13 C O23 /=3, and use 4 1 N5 :
Figure 4.1: CAPM regression
1 1

ferent assets are uncorrelated. This means that all comovements of two assets (Ri and
4.2 Single-Index Models
Rj , say) are due to movements in the common “index” Rm . This is not at all guaranteed
The single-index model is a way to cut down on the number of parameters that we need by running LS regressions—just an assumption. It is likely to be false—but may be a
to estimate in order to construct the covariance matrix of assets. The model assumes that reasonable approximation in many cases. In any case, it simplifies the construction of the
the co-movement between assets is due to a single common influence (here denoted Rm ) covariance matrix of the assets enormously—as demonstrated below.

Ri D ˛i Cˇi Rm Cei , where E.ei / D 0, Cov .ei ; Rm / D 0, and Cov.ei ; ej / D 0: (4.3) Remark 4.1 (The market model) The market model is (4.3) without the assumption that
Cov.ei ; ej / D 0. This model does not simplify the calculation of a portfolio variance—but
The first two assumptions are the standard assumptions for using Least Squares: the resid- will turn out to be important when we want to test CAPM.
ual has a zero mean and is uncorrelated with the non-constant regressor. (Together they
If (4.3) is true, then the variance of asset i and the covariance of assets i and j are
imply that the residuals are orthogonal to both regressors, which is the standard assump-
tion in econometrics.) Hence, these two properties will be automatically satisfied if (4.3) i i D ˇi2 Var .Rm / C Var .ei / (4.4)
is estimated by Least Squares.
ij D ˇi ˇj Var .Rm / : (4.5)
See Figures 4.1 – 4.3 for illustrations.
The key point of the model, however, is the third assumption: the residuals for dif- Together, these equations show that we can calculate the whole covariance matrix by

62 63
Scatter plot against market return Scatter plot against market return HiTec Utils
30 30
US data constant -0.19 0.25
(-1.14) (1.57)
Excess return %, HiTec

1970:1−2008:12

Excess return %, Utils


20 20

10 10
market return 1.32 0.52
(30.46) (11.35)
0 0 R2 0.74 0.32
−10 −10 obs 463.00 463.00
α −0.17 α 0.23 Autocorr (t) -0.88 0.88
−20 1.31 −20 0.53
β β White 9.14 20.38
−30 −30 All slopes 342.15 149.79
−30 −20 −10 0 10 20 30 −30 −20 −10 0 10 20 30
Excess return %, market Excess return %, market
Table 4.1: CAPM regressions, monthly returns, %, US data 1970:1-2008:7. Numbers
in parentheses are t-stats. Autocorr is a N(0,1) test statistic (autocorrelation); White is a
Figure 4.2: Scatter plot against market return chi-square test statistic (heteroskedasticity), df = K(K+1)/2 - 1; All slopes is a chi-square
test statistic (of all slope coeffs), df = K-1
having just the variance of the index (to get Var .Rm /) and the output from n regressions
(to get ˇi and Var .ei / for each asset). This is, in many cases, much easier to obtain than
direct estimates of the covariance matrix. For instance, a new asset does not have a return
history, but it may be possible to make intelligent guesses about its beta and residual 4.3 Estimating Beta
variance (for instance, from knowing the industry and size of the firm).
See Figure 4.4 for an example. 4.3.1 Estimating Historical Beta: OLS and Other Approaches
Proof. (of (4.4)–(4.5) By using (4.3) and recalling that Cov.Rm ; ei / D 0 direct calcu- Least Squares (LS) is typically used to estimate ˛i , ˇi and Std.ei / in (4.3)—and the R2
lations give is used to assess the quality of the regression.

i i D Var .Ri / Remark 4.2 (R2 of market model) R2 of (4.3) measures the fraction of the variance (of
D Var .˛i C ˇi Rm C ei / Ri ) that is due to the systematic part of the regression, that is, relative importance of mar-
D Var .ˇi Rm / C Var .ei / C 2  0 ket risk as compared to idiosyncratic noise (1 R2 is the fraction due to the idiosyncratic
noise)
D ˇi2 Var .Rm / C Var .ei / :
Var.˛i C ˇi Rm / ˇ2 2
 R2 D D 2 2i m 2 :
Var.Ri / ˇi m C ei
Similarly, the covariance of assets i and j is (recalling also that Cov ei ; ej D 0)
 To assess the accuracy of historical betas, Blume and others estimate betas for non-
ij D Cov Ri ; Rj overlapping samples (periods)—and then compare the betas across samples. They find

D Cov ˛i C ˇi Rm C ei ; ˛j C ˇj Rm C ej that the correlation of betas across samples is moderate for individual assets, but relatively
D ˇi ˇj Var .Rm / C 0 high for diversified portfolios. It is also found that betas tend to “regress” towards one: an
extreme historical beta is likely to be followed by a beta that is closer to one. There are
D ˇi ˇj Var .Rm / :
several suggestions for how to deal with this problem.

64 65
US industry portfolios, β against the market, 1970:1−2008:12
1.5
Correlations, data Difference in correlations: data − model

1 0.5

0.5 0

0 −0.5
1 25 20 25 20
β

15 10 20 25 15 10 20 25
5 5 10 15 5 5 10 15
Portfolio Portfolio

25 FF US portfolios, 1957:1−2008:12
Index (factor): US market
0.5
NoDur Durbl Manuf Enrgy HiTec Telcm Shops Hlth Utils Other

Figure 4.4: Correlations of US portfolios


Figure 4.3: ˇs of US industry portfolios
mean. If we treat the variance of the LS estimator (ˇ1
2
) as known, then the Bayesian
To use Blume’s ad-hoc technique, let ˇOi1 be the estimate of ˇi from an early sample, estimator of beta is
and ˇOi 2 the estimate from a later sample. Then regress
b D .1 F /ˇOi1 C Fˇ0 , where
ˇOi2 D 0 C 1 ˇOi1 C i (4.6) 1=02
2
ˇ1
F D D 2 : (4.7)
1=02 2
C 1=ˇ1 2
0 C ˇ1
and use it for forecasting the beta for yet another sample. Blume found . O0 ; O1 / D
.0:343; 0:677/ in his sample. 4.3.2 Fundamental Betas
Other authors have suggested averaging the OLS estimate (ˇOi1 ) with some average
Another way to improve the forecasts of the beta over a future period is to bring in infor-
beta. For instance, .ˇOi1 C1/=2 (since the average beta must be unity) or .ˇOi1 C˙inD1 ˇOi1 =n/=2
mation about fundamental firm variables. This is particularly useful when there is little
(which will typically be similar since ˙inD1 ˇOi1 =n is likely to be close to one).
historical data on returns (for instance, because the asset was not traded before).
The Bayesian approach is another (more formal) way of adjusting the OLS estimate.
It is often found that betas are related to fundamental variables as follows (with signs
It also uses a weighted average of the OLS estimate, ˇOi1 , and some other number, ˇ0 ,
in parentheses indicating the effect on the beta): Dividend payout (-), Asset growth (+),
.1 F /ˇOi1 C Fˇ0 where F depends on the precision of the OLS estimator. The general
Leverage (+), Liquidity (-), Asset size (-), Earning variability (+), Earnings Beta (slope in
idea of a Bayesian approach (Greene (2003) 16) is to treat both Ri and ˇi as random. In
earnings regressed on economy wide earnings) (+). Such relations can be used to make
this case a Bayesian analysis could go as follows. First, suppose our prior beliefs (before
an educated guess about the beta of an asset without historical data on the returns—but
having data) about ˇi is that it is normally distributed, N.ˇ0 ; 02 /, where (ˇ0 ; 02 ) are some
with data on (at least some) of these fundamental variables.
numbers . Second, run a LS regression of (4.3). If the residuals are normally distributed,
so is the estimator—it is N.ˇOi1 ; ˇ12
/, where we have taken the point estimate to be the

66 67
4.4 Multi-Index Models of interpreting the results:

1. Let the first transformed index equal the original index, I1 D I1 . This would often
4.4.1 Overview
be the market return.
The multi-index model is just a multivariate extension of the single-index model (4.3)
2. Regress the second original index on the first transformed index, I2 D 0 C 1 I1 C
Ri D ai C  
bi1 I1 C  
bi2 I2 C  
bi3 I3 C    C ei , where (4.8) "2 . Then, let the second transformed index be the fitted residual, I2 D "O2 .

E.ei / D 0, Cov ei ; Ik D 0, and Cov.ei ; ej / D 0: 3. Regress the third original index on the first two transformed indices, I3 D 0 C
1 I1 C 2 I2 C "3 . Then, let I3 D "O3 .
As an example, there could be two indices: the stock market return and an interest rate.
An ad-hoc approach is to first try a single-index model and then test if the residuals are Recall that the fitted residual (from Least Squares) is always uncorrelated with the
approximately uncorrelated. If not, then adding a second index might give an acceptable regressor (by construction). In this case, this means that I2 is uncorrelated with I1 (step
approximation. 2) and that I3 is uncorrelated with both I1 and I2 (step 3). The correlation matrix is
It is often found that it takes several indices to get a reasonable approximation—but therefore 2 3
that a single-index model is equally good (or better) at “forecasting” the covariance over h i 1 0 0
6 7
a future period. This is much like the classical trade-off between in-sample fit (requires a Corr I1 I2 I3 D 4 1 05 : (4.10)
large model) and forecasting (often better with a small model). 1
The types of indices vary, but one common set captures the “business cycle” and This recursive approach also helps in interpreting the transformed indices. Suppose
includes things like the market return, interest rate (or some measure of the yield curve the first index is the market return and that the second original index is an interest rate.
slope), GDP growth, inflation, and so forth. Another common set of indices are industry The first transformed index (I1 ) is then clearly the market return. The second transformed
indices. index (I2 ) can then be interpreted as the interest rate minus the interest rate expected at the
It turns out (see below) that the calculations of the covariance matrix are much simpler current stock market return—that is, the part of the interest rate that cannot be explained
if the indices are transformed to be uncorrelated so we get the model by the stock market return.

Ri D ai C bi1 I1 C bi2 I2 C bi3 I3 C    C ei ; where (4.9) 4.4.3 Multi-Index Model after “Rotating” the Indices
E.ei / D 0, Cov .ei ; Ik / D 0, Cov.ei ; ej / D 0 (unless i D j /, and
To see why the transformed indices are very convenient for calculating the covariance
Cov.Ij ; Ik / D 0 (unless j D k). matrix, consider a two-index model. Then, (4.9) implies that the variance of asset i is

If this transformation of the indices is linear (and non-singular, so it is can be reversed if i i D Var .ai C bi1 I1 C bi2 I2 C ei /
we want to), then the fit of the regression is unchanged. 2
D bi1 Var .I1 / C bi2
2
Var .I2 / C Var .ei / : (4.11)

4.4.2 “Rotating” the Indices Similarly, the covariance of assets i and j is


There are several ways of transforming the indices to make them uncorrelated, but the fol- 
ij D Cov ai C bi1 I1 C bi2 I2 C ei ; aj C bj1 I1 C bj 2 I2 C ej
lowing regression approach is perhaps the simplest and may also give the best possibility
D bi1 bj1 Var .I1 / C bi 2 bj 2 Var .I2 / : (4.12)

68 69
eigenvector associated with the j th largest eigenvalue j (which equals Var.pcjt / D
Correlations, data Difference in correlations: data − model
wj0 ˙wj ).

1 0.5
Let the i th eigenvector be the i th column of the n  n matrix

0.5 0 W D Œ w1    wn : (4.13)

0 −0.5
25 20 25 20 We can then calculate the n  1 vector of principal components as
15 10 20 25 15 10 20 25
5 5 10 15 5 5 10 15
Portfolio Portfolio pc t D W 0 z t : (4.14)

Since the eigenvectors are ortogonal it can be shown that W 0 D W 1


, so the expression
25 FF US portfolios, 1957:1−2008:12 can be inverted as
Indices (factors): US market, SMB, HML
z t D Wpc t : (4.15)

This shows that the i th eigenvector (the i th column of W ) can be interpreted as the effect
Figure 4.5: Correlations of US portfolios
of the ith principal component on each of the elements in z t . However, the sign of column
See Figure 4.5 for an example. j of W can be changed without any effects (except that the pcjt also changes sign), so
we can always reinterpret a negative cofficient as a positive exposure (to pcjt ).

4.5 Principal Component Analysis Example 4.4 (PCA with 2 series) With two series we have
" #0 " # " #0 " #
Principal component analysis (PCA) can help us determine how many factors that are w11 z1t w12 z1t
pc1t D and pc2t D or
needed to explain a cross-section of asset returns. w21 z2t w22 z2t
" # " #0 " #
Let z t D R t RN t be an n  1 vector of demeaned returns with covariance matrix ˙. pc1t w11 w12 z1t
D and
The first principal component (pc1t ) is the (normalized) linear combinations of z t that pc2t w21 w22 z2t
" # " #" #
account for as much of the variability as possible—and its variance is denoted 1 . The z1t w11 w12 pc1t
j th (j  2) principal component (pcjt ) is similar (and its variance is denoted j ), except D :
z2t w21 w22 pc2t
that is must be uncorrelated with all lower principal components. Remark 4.3 gives a a
formal definition. For instance, w12 shows how pc2t affects z1t , while w22 shows how pc2t affects z2t .

Remark 4.5 (Data in matrices ) Transpose (4.14) to get pc t0 D z t0 W , where the dimen-
Remark 4.3 (Principal component analysis) Consider the zero mean N 1 vector z t with
sions are 1  n, 1  n and n  n respectively. If we form a T  n matrix of data Z by
covariance matrix ˙ . The first (sample) principal component is pc1t D w10 z t , where w1
putting z t in row t, then the T  N matrix of principal components can be calculated as
is the eigenvector associated with the largest eigenvalue (1 ) of ˙ . This value of w1
P C D ZW .
solves the problem maxw w 0 ˙w subject to the normalization w 0 w D 1. The eigenvalue
1 equals Var.pc1t / D w10 ˙ w1 . The j th principal component solves the same problem, Notice that (4.15) shows that all n data series in z t can be written in terms of the n prin-
but under the additional restriction that wi0 wj D 0 for all i < j . The solution is the cipal components. Since the principal components are uncorrelated (Cov.pcit ; pcjt / D

70 71
25 FF US portfolios, eigenvectors the earnings/price yield, and the book value–market value ratio. Still, the predictive power
0.6 is typically low.
1st (83.3%)
2nd (7.1%) Makridakis, Wheelwright, and Hyndman (1998) 10.1 show that there is little evidence
0.4 3rd (3.7%) that the average stock analyst beats (on average) the market (a passive index portfolio).
In fact, less than half of the analysts beat the market. However, there are analysts which
0.2
seem to outperform the market for some time, but the autocorrelation in over-performance
0 is weak. The evidence from mutual funds is similar. For them it is typically also found
that their portfolio weights do not anticipate price movements.
−0.2 It should be remembered that many analysts also are sales persons: either of a stock
(for instance, since the bank is underwriting an offering) or of trading services. It could
−0.4 well be that their objective function is quite different from minimizing the squared forecast
0 5 10 15 20 25
errors—or whatever we typically use in order to evaluate their performance. (The number
of litigations in the US after the technology boom/bust should serve as a strong reminder
Figure 4.6: Eigenvectors for US portfolio returns
of this.)

0/), we can think of the sum of the sum of their variances (˙inD1 i ) as the “total variation”
of the series in z t . In practice, it is common to report the relative importance of principal 4.7 Estimation on Subsamples
component j as
To capture time-variation in the regression coefficients, it is fairly common to run the
relative importance of pcj D j =˙inD1 i : (4.16)
regression
For instance, if it is found that the first two principal components account for 75% for the y t D x t0 b C " t (4.17)
total variation among many asset returns, then a two-factor model is likely to be a good
on a longer and longer data set (“recursive estimation”). In the standard recursive es-
approximation.
timation, the first estimation is done on the sample t D 1; 2; : : : ; ; while the second
estimation is done on t D 1; 2; : : : ; ;  C 1; and so forth until we use the entire sample
4.6 Estimating Expected Returns t D 1 : : : ; T . In the “backwards recursive estimate” we instead keep the end-point fixed
and use more and more of old data. That is, the first sample could be T ; : : : ; T ; the
The starting point for forming estimates of future mean excess returns is typically histor-
second T  1; : : : ; T ; and so forth.
ical excess returns. Excess returns are preferred to returns, since this avoids blurring the
Alterntively, a moving data window (“rolling samples”) could be used. In this case,
risk compensation (expected excess return) with long-run movements in inflation (and
the first sample is t D 1; 2; : : : ; ; but the second is on t D 2; : : : ; ;  C 1, that is, by
therefore interest rates). The expected excess return for the future period is typically
dropping one observation at the start when the sample is extended at the end. See Figure
formed as a judgmental adjustment of the historical excess return. Evidence suggest that
4.7 for an illustration.
the adjustments are hard to make.
An alternative is to apply an exponentially weighted moving average (EMA) esti-
It is typically hard to predict movements (around the mean) of asset returns, but a few
mator, which uses all data points since the beginning of the sample—but where recent
variables seem to have some predictive power, for instance, the slope of the yield curve,
observations carry larger weights. The weight for data in period t is T t where T is the

72 73
β of HiTech sector, recursive β of HiTech sector, backwards recursive 4.8 Robust Estimation
2 2
4.8.1 Robust Means, Variances and Correlations

1.5 1.5 Outliers and other extreme observations can have very decisive influence on the estimates
of the key statistics needed for financial analysis, including mean returns, variances, co-
variances and also regression coefficients.
1 1
1960 1980 2000 1960 1980 2000 The perhaps best way to solve these problems is to carefully analyse the data—and
end of sample start of sample
then decide which data points to exclude. Alternatively, robust estimators can be applied
instead of the traditional ones.
To estimate the mean, the sample average can be replaced by the median or a trimmed
β of HiTech sector, 5−year data window β of HiTech sector, EWMA estimate
2 2 mean (where the x% lowest and highest observations are excluded).
Similarly, to estimate the variance, the sample standard deviation can be replaced by
the interquartile range (the difference between the 75th and the 25th percentiles), divided
1.5 1.5
by 1:35
StdRobust D Œquantile.0:75/ quantile.0:25/=1:35; (4.20)
1 1
1960 1980 2000 0.9 0.92 0.94 0.96 0.98 1 or by the median absolute deviation
end of 5−year sample λ
StdRobust D median.jx t j/=0:675: (4.21)

Figure 4.7: Betas of US industry portfolios Both these would coincide with the standard deviation if data was indeed drawn from a
normal distribution without outliers.
latest observation and 0 <  < 1, where a smaller value of  means that old data carries A robust covariance can be calculated by using the identity
low weights. In practice, this means that we define
Cov.x; y/ D ŒVar.x C y/ Var.x y/=4 (4.22)
xQ t D x t T t
and yQ t D y t T t
(4.18)
and using a robust estimator of the variances—like the square of (4.20). A robust cor-
and then estimate relation is then created by dividing the robust covariance with the two robust standard
yQ t D xQ t0 b C " t : (4.19) deviations.
Notice that also the constant (in x t ) should be scaled in the same way. (Clearly, this See Figures 4.8–4.9 for empirical examples.
method is strongly related to the GLS approach used when residuals are heteroskedastic.
Also, the idea of down weighting old data is commonly used to estimate time-varying 4.8.2 Robust Regression Coefficients
volatility of returns as in the RISK metrics method.) See Figure 4.7 for an illustration. Reference: Amemiya (1985) 4.6
The least absolute deviations (LAD) estimator miminizes the sum of absolute residu-

74 75
e US industry portfolios, Std
US industry portfolios, ER
0.11 Monthly data 1947:1−2008:12 Monthly data 1947:1−2008:12
mean std
median 0.2 iqr/1.35
0.1
mean excess return

0.18
0.09

std
0.08 0.16

0.07 0.14

0.06 0.12

A B C D E F G H I J A B C D E F G H I J

Figure 4.8: Mean excess returns of US industry portfolios Figure 4.9: Volatility of US industry portfolios

als (rather than the squared residuals) Remark 4.7 (Algorithm for LAD) The LAD estimator can be written

T T
X
X ˇ ˇ
ˇOLAD D arg min ˇy t x t0 b ˇ (4.23) ˇOLAD D arg min w t uO t .b/2 , w t D 1= juO t .b/j ; with uO t .b/ D y t x t0 bO
b ˇ
t D1 t D1

This estimator involve non-linearities, but a simple iteration works nicely. It is typically so it is a weighted least squares where both y t and x t are multiplied by 1= juO t .b/j. It can
less sensitive to outliers. (There are also other ways to estimate robust regression coeffi- be shown that iterating on LS with the weights given by 1= juO t .b/j, where the residuals
cients.) This is illustrated in Figure 4.10. are from the previous iteration, converges very quickly to the LAD estimator.
See Figure 4.11 for an empirical example.
Some alternatives to LAD: lleast median squares (LMS), and least trimmed squares
If we assume that the median of the true residual, u t , is zero, then we (typically) have
(LTS) estimators which solve
p   XT
T .ˇOLAD ˇ0 / !d N 0; f .0/ 2 ˙xx1 =4 , where ˙xx D plim x t x t0 =T; (4.24)  
tD1 ˇOLMS D arg min median uO 2t , with uO t D y t x t0 bO (4.25)
ˇ
where f .0/ is the value of the pdf of the residual at zero. Unless we know this density h
X
function (or else we would probably have used MLE instead of LAD), we need to estimate ˇOLT S D arg min uO 2i , uO 21  uO 22  ::: and h  T: (4.26)
ˇ
it—for instance with a kernel density method. i D1

p Note that the LTS estimator in (4.26) minimizes of the sum of the h smallest squared
Example 4.6 (N.0;  2 /) When u t  N.0;  2 ), then f .0/ D 1= 2 2 , so the covari-
residuals.
ance matrix in (4.24) becomes  2 ˙xx1 =2. This is =2 times larger than when using
LS.

76 77
OLS vs LAD of y = 0.75*x + u
2
y: −1.125 −0.750 1.750 1.125
1.5
x: −1.500 −1.000 1.000 1.500
1

0.5

0
y

−0.5

−1
Data
−1.5 OLS (0.25 0.90)
LAD (0.00 0.75)
−2 US industry portfolios, β
−3 −2 −1 0 1 2 3 1.5
x Monthly data 1947:1−2008:12
OLS
LAD

Figure 4.10: Data and regression line from OLS and LAD

Bibliography 1

β
Amemiya, T., 1985, Advanced econometrics, Harvard University Press, Cambridge, Mas-
sachusetts.

Elton, E. J., M. J. Gruber, S. J. Brown, and W. N. Goetzmann, 2007, Modern portfolio 0.5
theory and investment analysis, John Wiley and Sons, 7th edn. A B C D E F G H I J

Greene, W. H., 2003, Econometric analysis, Prentice-Hall, Upper Saddle River, New
Figure 4.11: Betas of US industry portfolios
Jersey, 5th edn.

Makridakis, S., S. C. Wheelwright, and R. J. Hyndman, 1998, Forecasting: methods and


applications, Wiley, New York, 3rd edn.

78 79
.D..3 conditions, the t-statistic has an asymptotically normal distribution, that is

˛O i d
! N.0; 1/ under H0 W ˛i D 0: (5.4)
Std.˛O i /

5 Testing CAPM and Multifactor Models Note that this is the distribution under the null hypothesis that the true value of the inter-
cept is zero, that is, that CAPM is correct (in this respect, at least).
Reference: Elton, Gruber, Brown, and Goetzmann (2007) 15 The test assets are typically portfolios of firms with similar characteristics, for in-
More advanced material is denoted by a star ( ). It is not required reading. stance, small size or having their main operations in the retail industry. There are two
main reasons for testing the model on such portfolios: individual stocks are extremely
5.1 Market Model volatile and firms can change substantially over time (so the beta changes). Moreover,
it is of interest to see how the deviations from CAPM are related to firm characteristics
The basic implication of CAPM is that the expected excess return of an asset (ei ) is (size, industry, etc), since that can possibly suggest how the model needs to be changed.
linearly related to the expected excess return on the market portfolio (em ) according to The results from such tests vary with the test assets used. For US portfolios, CAPM
Cov .Ri ; Rm / seems to work reasonably well for some types of portfolios (for instance, portfolios based
ei D ˇi em , where ˇi D : (5.1) on firm size or industry), but much worse for other types of portfolios (for instance, port-
Var .Rm /
folios based on firm dividend yield or book value/market value ratio). Figure 5.1 shows
Let Riet D Ri t Rf t be the excess return on asset i in excess over the riskfree asset,
some results for US industry portfolios.
and let Rmt
e
be the excess return on the market portfolio. CAPM with a riskfree return
says that ˛i D 0 in n Critical values
10% 5% 1%
Riet D ˛i C bi Rmt
e
C "it , where E "it D 0 and Cov.Rmt
e
; "it / D 0: (5.2) 10 1.81 2.23 3.17
20 1.72 2.09 2.85
The two last conditions are automatically imposed by LS. Take expectations to get 30 1.70 2.04 2.75
40 1.68 2.02 2.70
  50 1.68 2.01 2.68
E Rite D ˛i C bi E Rmt
e
: (5.3)
60 1.67 2.00 2.66
Notice that the LS estimate of bi is the sample analogue to ˇi in (5.1). It is then clear that 70 1.67 1.99 2.65
80 1.66 1.99 2.64
CAPM implies that ˛i D 0, which is also what empirical tests of CAPM focus on. 90 1.66 1.99 2.63
This test of CAPM can be given two interpretations. If we assume that Rmt is the 100 1.66 1.98 2.63
correct benchmark (the tangency portfolio for which (5.1) is true by definition), then it Normal 1.64 1.96 2.58
is a test of whether asset Rit is correctly priced. This is typically the perspective in
performance analysis of mutual funds. Alternatively, if we assume that Rit is correctly Table 5.1: Critical values (two-sided test) of t distribution (different degrees of freedom)
priced, then it is a test of the mean-variance efficiency of Rmt . This is the perspective of and normal distribution.
CAPM tests.
The t-test of the null hypothesis that ˛i D 0 uses the fact that, under fairly mild

80 81
US industry portfolios, 1970:1−2008:12 US industry portfolios, 1970:1−2008:12 5.1.1 Interpretation of the CAPM Test
15 15
Instead of a t-test, we can use the equivalent chi-square test
Mean excess return

Mean excess return


10 10
D D ˛O i2 d
A A ! 21 under H0 : ˛i D 0: (5.5)
5 I F
H G
CJ 5 I FH C
G
J E
Var.˛O i /
E
B B
Excess market return: 4.5% It is quite straightforward to use the properties of minimum-variance frontiers (see
0 0
0 0.5 1 1.5 0 5 10 15 Gibbons, Ross, and Shanken (1989), and also MacKinlay (1995)) to show that the test
β Predicted mean excess return (with α=0)
statistic in (5.5) can be written

˛O i2 .SRc /2 .SRm /2
alpha pval StdErr CAPM
D ; (5.6)
all NaN 0.07 NaN Factor: US market Var.˛O i / Œ1 C .SRm /2 =T
A (NoDur) 3.50 0.02 9.05 α and StdErr are in annualized %
B (Durbl)
C (Manuf)
−2.10
0.69
0.29
0.50
12.27
6.34
where SRm is the Sharpe ratio of the market portfolio (as before) and SRc is the Sharpe
D (Enrgy) 4.69 0.05 14.81 ratio of the tangency portfolio when investment in both the market return and asset i is
E (HiTec) −1.98 0.31 12.24
F (Telcm) 1.26 0.49 11.42 possible. (Recall that the tangency portfolio is the portfolio with the highest possible
G (Shops) 1.18 0.45 9.88
H (Hlth ) 2.20 0.24 11.74 Sharpe ratio.) If the market portfolio has the same (squared) Sharpe ratio as the tangency
I (Utils) 2.80 0.15 11.88
J (Other) −0.10 0.93 7.03 portfolio of the mean-variance frontier of Rit and Rmt (so the market portfolio is mean-
variance efficient also when we take Ri t into account) then the test statistic, ˛O i2 = Var.˛O i /,
is zero—and CAPM is not rejected.
Figure 5.1: CAPM regressions on US industry indices Proof. ( Proof of (5.6)) From the CAPM regression (5.2) we have
" # " # " # " #
n Critical values Rite ˇi2 m2 C Var."i t / ˇi m2 ei ˛i C ˇi em
10% 5% 1% Cov e
D , and D :
Rmt ˇi m2 m2 em em
1 2.71 3.84 6.63
2 4.61 5.99 9.21 Suppose we use this information to construct a mean-variance frontier for both Ri t and
3 6.25 7.81 11.34
Rmt , and we find the tangency portfolio, with excess return Rct
e
. It is straightforward to
4 7.78 9.49 13.28
5 9.24 11.07 15.09 show that the square of the Sharpe ratio of the tangency portfolio is e0 ˙ 1 e , where
6 10.64 12.59 16.81 e is the vector of expected excess returns and ˙ is the covariance matrix. By using the
7 12.02 14.07 18.48 covariance matrix and mean vector above, we get that the squared Sharpe ratio for the
8 13.36 15.51 20.09
tangency portfolio, e0 ˙ 1 e , (using both Rit and Rmt ) is
9 14.68 16.92 21.67
10 15.99 18.31 23.21  2  e 2
ec ˛i2 m
D C ;
c Var."i t / m
Table 5.2: Critical values of chisquare distribution (different degrees of freedom, n).
which we can write as
˛i2
.SRc /2 D C .SRm /2 :
Var."it /

82 83
MV frontiers before and after (α=0) MV frontiers before and after (α=0.05) 5.1.2 Econometric Properties of the CAPM Test

0.1
Solid curves: 2 assets,
0.1
A common finding from Monte Carlo simulations is that these tests tend to reject a true
Dashed curves: 3 assets
null hypothesis too often when the critical values from the asymptotic distribution are
Mean

Mean
0.05 0.05
used: the actual small sample size of the test is thus larger than the asymptotic (or “nom-
inal”) size (see Campbell, Lo, and MacKinlay (1997) Table 5.1). The practical conse-
quence is that we should either used adjusted critical values (from Monte Carlo or boot-
0 0
0 0.05 0.1 0.15 0 0.05 0.1 0.15 strap simulations)—or more pragmatically, that we should only believe in strong rejec-
Std Std
tions of the null hypothesis.
The new asset has the abnormal return α
compared to the market (of 2 assets)
To study the power of the test (the frequency of rejections of a false null hypothesis)
MV frontiers before and after (α=−0.04) Means
we have to specify an alternative data generating process (for instance, how much extra
0.0800 0.0500 α + β(ERm−Rf)
Cov 0.0256 0.0000 0.0000
return in excess of that motivated by CAPM) and the size of the test (the critical value to
0.1
matrix 0.0000 0.0144 0.0000 use). Once that is done, it is typically found that these tests require a substantial deviation
0.0000 0.0000 0.0144
from CAPM and/or a long sample to get good power. The basic reason for this is that asset
Mean

0.05
Tang N=2 α=0 α=0.05 α=−0.04 returns are very volatile. For instance, suppose that the standard OLS assumptions (iid
portf 0.47 0.47 0.31 0.82
0.53 0.53 0.34 0.91
residuals that are independent of the market return) are correct. Then, it is straightforward
0
0 0.05 0.1 0.15
NaN 0.00 0.34 −0.73 to show that the variance of Jensen’s alpha is
Std " #
.em /2
Var.˛O i / D 1 C  Var."it /=T (5.7)
Var Rm e
Figure 5.2: Effect on MV frontier of adding assets
D Œ1 C .SRm /2  Var."it /=T; (5.8)
Combine this with (5.8) which shows that Var.˛O i / D Œ1 C .SRm /  Var."it /=T . 2
where SRm is the Sharpe ratio of the market portfolio. We see that the uncertainty about
This is illustrated in Figure 5.2 which shows the effect of adding an asset to the invest-
the alpha is high when the residual is volatile and when the sample is short, but also when
ment opportunity set. In this case, the new asset has a zero beta (since it is uncorrelated
the Sharpe ratio of the market is high. Note that a large market Sharpe ratio means that
with all original assets), but the same type of result holds for any new asset. The basic
the market asks for a high compensation for taking on risk. A bit uncertainty about how
point is that the market model tests if the new assets moves the location of the tangency
risky asset i is then translates in a large uncertainty about what the risk-adjusted return
portfolio. In general, we would expect that adding an asset to the investment opportunity
should be.
set would expand the mean-variance frontier (and it does) and that the tangency portfolio
changes accordingly. However, the tangency portfolio is not changed by adding an asset Example 5.1 Suppose we have monthly data with b ˛ i D 0:2% (that is, 0:2%  12 D 2:4%
p
with a zero intercept. The intuition is that such an asset has neutral performance com- per year), Std ."i t / D 3% (that is, 3%  12  10% per year) and a market Sharpe ratio
p
pared to the market portfolio (obeys the beta representation), so investors should stick to of 0:15 (that is, 0:15  12  0:5 per year). (This corresponds well to US CAPM
the market portfolio. regressions for industry portfolios.) A significance level of 10% requires a t-statistic (5.4)

84 85
of at least 1.65, so A quite different approach to study a cross-section of assets is to first perform a CAPM
0:2
p p  1:65 or T  626: regression (5.2) and then the following cross-sectional regression
1 C 0:152 3= T
T
X
We need a sample of at least 626 months (52 years)! With a sample of only 26 years (312
Rite =T D C ˇOi C ui ; (5.9)
months), the alpha needs to be almost 0.3% per month (3.6% per year) or the standard t D1
deviation of the residual just 2% (7% per year). Notice that cumulating a 0.3% return P
where TtD1 Rite =T is the (sample) average excess return on asset i . Notice that the es-
over 25 years means almost 2.5 times the initial value.
timated betas are used as regressors and that there are as many data points as there are
Proof. ( Proof of (5.8)) Consider the regression equation y t D x t0 b C " t . With iid assets (n).
errors that are independent of all regressors (also across observations), the LS estimator, There are severe econometric problems with this regression equation since the regres-
bOLs , is asymptotically distributed as sor contains measurement errors (it is only an uncertain estimate), which typically tend
p to bias the slope coefficient towards zero. To get the intuition for this bias, consider an
d
T .bOLs b/ ! N.0;  2 ˙xx1 /, where  2 D Var." t / and ˙xx D plim˙ tD1
T
x t x t0 =T: extremely noisy measurement of the regressor: it would be virtually uncorrelated with the
dependent variable (noise isn’t correlated with anything), so the estimated slope coeffi-
When the regressors are just a constant (equal to one) and one variable regressor, f t , so
cient would be close to zero.
x t D Œ1; f t 0 , then we have
If we could overcome this bias (and we can by being careful), then the testable im-
" # " #
P 1 PT 1 ft 1 E ft plications of CAPM is that D 0 and that  equals the average market excess return.
˙xx D E TtD1 x t x t0 =T D E D , so
T t D1 f t f t2 E f t E f t2 We also want (5.9) to have a high R2 —since it should be unity in a very large sample (if
" # " #
2 E f t2 E ft 2 Var.f t / C .E f t /2 E ft CAPM holds).
 2 ˙xx1 D D :
E f t2 .E f t /2 E ft 1 Var.f t / E ft 1
5.1.4 Several Assets: SURE Approach
(In the last line we use Var.f t / D E f t2 .E f t /2 :)
This section outlines how we can set up a formal test of CAPM when there are several
test assets.
5.1.3 Several Assets
For simplicity, suppose we have two test assets. Stack (5.2) for the two equations are
In most cases there are several (n) test assets, and we actually want to test if all the ˛i (for
e e
i D 1; 2; :::; n) are zero. Ideally we then want to take into account the correlation of the R1t D ˛1 C b1 Rmt C "1t ; (5.10)
different alphas. e
R2t D ˛2 C e
b2 Rmt C "2t (5.11)
While it is straightforward to construct such a test, it is also a bit messy. As a quick
way out, the following will work fairly well. First, test each asset individually. Second, where E "it D 0 and Cov.Rmt e
; "it / D 0. This is a system of seemingly unrelated regres-
form a few different portfolios of the test assets (equally weighted, value weighted) and sions (SURE)—with the same regressor (see, for instance, Wooldridge (2002) 7.7). In
test these portfolios. Although this does not deliver one single test statistic, it provides this case, the efficient estimator (GLS) is LS on each equation separately. Moreover, the
plenty of information to base a judgement on. For a more formal approach, see Section covariance matrix of the coefficients is particularly simple.
5.1.4. To see what the covariances of the coefficients are, write the regression equation for

86 87
asset 1 (5.10) on a traditional form To apply this, form the test static
" # " # " #0 " # 1 " #
1 ˛1 ˛O 1 11 12 ˛O 1
e
R1t D x t0 ˇ1 C "1t , where x t D e
; ˇ1 D ; (5.12) T Œ1 C .SRm /2  1
 22 : (5.17)
Rmt b1 ˛O 2 12 22 ˛O 2
and similarly for the second asset (and any further assets). This can also be transformed into an F test, which might have better small sample prop-
Define erties.
XT XT
˙O xx D x t x t0 =T , and O ij D "Oi t "Ojt =T; (5.13)
t D1 tD1

where "Oi t is the fitted residual of asset i . The key result is then that the (estimated) 5.1.5 Representative Results of the CAPM Test
asymptotic covariance matrix of the vectors ˇOi and ˇOj (for assets i and j ) is One of the more interesting studies is Fama and French (1993) (see also Fama and French
(1996)). They construct 25 stock portfolios according to two characteristics of the firm:
Cov.ˇOi ; ˇOj / D O ij ˙O xx1 =T: (5.14)
the size (by market capitalization) and the book-value-to-market-value ratio (BE/ME). In
(In many text books, this is written O ij .X 0 X/ 1 .) June each year, they sort the stocks according to size and BE/ME. They then form a 5  5
The null hypothesis in our two-asset case is matrix of portfolios, where portfolio ij belongs to the i th size quintile and the j th BE/ME
quintile. Tables 5.3–5.4 summarize some basic properties of these portfolios.
H0 W ˛1 D 0 and ˛2 D 0: (5.15)
Book value/Market value
In a large sample, the estimator is normally distributed (this follows from the fact that 1 2 3 4 5
the LS estimator is a form of sample average, so we can apply a central limit theorem). Size 1 3.9 9.5 10.6 12.6 15.0
Therefore, under the null hypothesis we have the following result. From (5.8) we know 2 3.5 7.6 10.1 10.3 11.1
3 4.1 7.8 8.1 9.6 11.4
that the upper left element of ˙xx1 =T equals Œ1 C .SRm /2 =T . Then
4 5.2 5.9 7.8 8.8 9.0
" # " # " # ! 5 4.3 5.9 6.1 5.8 7.7
˛O 1 0 11 12
N ; Œ1 C .SRm / =T (asymptotically).
2
(5.16)
˛O 2 0 12 22
Table 5.3: Mean excess returns (annualised %), US data 1957:1–2008:12. Size 1: smallest
In practice we use the sample moments for the covariance matrix. Notice that the zero 20% of the stocks, Size 5: largest 20% of the stocks. B/M 1: the 20% of the stocks with
means in (5.16) come from the null hypothesis: the distribution is (as usual) constructed the smallest ratio of book to market value (growth stocks). B/M 5: the 20% of the stocks
with the highest ratio of book to market value (value stocks).
by pretending that the null hypothesis is true. In practice we use the sample moments for
the covariance matrix. Notice that the zero means in (5.16) come from the null hypothesis:
They run a traditional CAPM regression on each of the 25 portfolios (monthly data
the distribution is (as usual) constructed by pretending that the null hypothesis is true.
1963–1991)—and then study if the expected excess returns are related to the betas as they
We can now construct a chi-square test by using the following fact.
should according to CAPM (recall that CAPM implies E Rite D ˇi  where  is the risk
Remark 5.2 If the n  1 vector y  N.0; ˝/, then y 0 ˝ 1
y  2n . premium (excess return) on the market portfolio).
However, it is found that there is almost no relation between E Rite and ˇi (there is
a cloud in the ˇi  E Riet space, see Cochrane (2001) 20.2, Figure 20.9). This is due
to the combination of two features of the data. First, within a BE/ME quintile, there is

88 89
Book value/Market value 18
1 2 3 4 5
Size 1 1.4 1.2 1.1 1.0 1.0 16
2 1.5 1.2 1.1 1.0 1.1
3 1.4 1.1 1.0 1.0 1.0 14

Mean excess return, %


4 1.3 1.1 1.0 1.0 1.0
5 1.1 1.0 0.9 0.9 0.9 12

10
Table 5.4: Beta against the market portfolio, US data 1957:1–2008:12. Size 1: smallest
20% of the stocks, Size 5: largest 20% of the stocks. B/M 1: the 20% of the stocks with
8
the smallest ratio of book to market value (growth stocks). B/M 5: the 20% of the stocks
with the highest ratio of book to market value (value stocks).
6 US data 1957:1−2008:12
25 FF portfolios (B/M and size)
a positive relation (across size quantiles) between E Rite and ˇi —as predicted by CAPM 4 p−value for test of model: 0.00
(see Cochrane (2001) 20.2, Figure 20.10). Second, within a size quintile there is a negative
4 6 8 10 12 14 16 18
relation (across BE/ME quantiles) between E Rite and ˇi —in stark contrast to CAPM (see Predicted mean excess return (CAPM), %
Cochrane (2001) 20.2, Figure 20.11).
Figure 5.1 shows some results for US industry portfolios and Figures 5.3–5.5 for US
Figure 5.3: CAPM, FF portfolios
size/book-to-market portfolios.
5.1.6 Representative Results on Mutual Fund Performance

Mutual fund evaluations (estimated α_i) typically find (i) on average neutral performance (or less: trading costs and fees); (ii) large funds might be worse; (iii) perhaps better performance on less liquid (less efficient?) markets; and (iv) there is very little persistence in performance: α_i for one sample does not predict α_i for subsequent samples (except for bad funds).

5.2 Several Factors

In multifactor models, (5.2) is still valid—provided we reinterpret b_i and R^e_mt as vectors, so b_i R^e_mt stands for b_io R^e_ot + b_ip R^e_pt + ...

R^e_it = α + b_io R^e_ot + b_ip R^e_pt + ... + ε_it.   (5.18)

In this case, (5.2) is a multiple regression, but the test (5.4) still has the same form (the standard deviation of the intercept will be different, though).

Fama and French (1993) also try a multi-factor model. They find that a three-factor model fits the 25 stock portfolios fairly well (two more factors are needed to also fit the seven bond portfolios that they use). The three factors are: the market return, the return on a portfolio of small stocks minus the return on a portfolio of big stocks (SMB), and the return on a portfolio with high BE/ME minus the return on a portfolio with low BE/ME (HML). This three-factor model is rejected at traditional significance levels, but it can still capture a fair amount of the variation of expected returns (see Cochrane (2001) 20.2, Figures 20.12–13).

Chen, Roll, and Ross (1986) use a number of macro variables as factors—along with traditional market indices. They find that industrial production and inflation surprises are priced factors, while the market index might not be.

Figure 5.6 shows some results for the Fama-French model on US industry portfolios and Figures 5.7–5.9 on the 25 Fama-French portfolios.
Figure 5.4: CAPM, FF portfolios. [Mean excess return (%) against predicted mean excess return (CAPM, %); lines connect portfolios of the same size; legend: 1 (small) to 5 (large).]

Figure 5.5: CAPM, FF portfolios. [Mean excess return (%) against predicted mean excess return (CAPM, %); lines connect portfolios of the same B/M; legend: 1 (low) to 5 (high).]
Figure 5.6: Fama-French regressions on US industry indices. [Mean excess return against predicted mean excess return; US industry portfolios, 1970:1–2008:12; Fama-French model with factors: US market, SMB (size), and HML (book-to-market); α and StdErr are in annualized %; joint test ("all"): pval 0.00. Per industry (alpha, pval, StdErr): A (NoDur) 2.03, 0.15, 8.71; B (Durbl) −5.91, 0.00, 11.14; C (Manuf) −0.47, 0.64, 6.09; D (Enrgy) 3.28, 0.16, 14.29; E (HiTec) 1.87, 0.27, 10.29; F (Telcm) 0.97, 0.61, 11.16; G (Shops) 0.14, 0.93, 9.73; H (Hlth) 4.76, 0.01, 11.02; I (Utils) −0.63, 0.72, 10.38; J (Other) −2.45, 0.02, 6.16.]

5.3 Fama-MacBeth

Reference: Cochrane (2001) 12.3; Campbell, Lo, and MacKinlay (1997) 5.8; Fama and MacBeth (1973)

The Fama and MacBeth (1973) approach is a bit different from the regression approaches discussed so far. The method has three steps, described below.

• First, estimate the betas β_i (i = 1, ..., n) from (5.2) (this is a time-series regression). This is often done on the whole sample—assuming the betas are constant. Sometimes, the betas are estimated separately for different subsamples (so we could let β̂_i carry a time subscript in the equations below).

• Second, run a cross-sectional regression for every t. That is, for period t, estimate λ_t from the cross-sectional (across the assets i = 1, ..., n) regression

R^e_it = λ'_t β̂_i + ε_it,   (5.19)
Figure 5.7: FF, FF portfolios. [Mean excess return (%) against predicted mean excess return (FF, %); US data 1957:1–2008:12; 25 FF portfolios (B/M and size); p-value for test of model: 0.00.]

Figure 5.8: FF, FF portfolios. [Mean excess return (%) against predicted mean excess return (FF, %); lines connect portfolios of the same size; legend: 1 (small) to 5 (large).]
where β̂_i are the regressors. (Note the difference to the traditional cross-sectional approach discussed in (5.9), where the second stage regression regressed E R^e_it on β̂_i, while the Fama-MacBeth approach runs one regression for every time period.)

• Third, estimate the time averages

ε̂_i = (1/T) Σ_{t=1}^T ε̂_it  for i = 1, ..., n (for every asset),   (5.20)

λ̂ = (1/T) Σ_{t=1}^T λ̂_t.   (5.21)

The second step, using β̂_i as regressors, creates an errors-in-variables problem since β̂_i are estimated, that is, measured with an error. The effect of this is typically to bias the estimator of λ_t towards zero (and any intercept, or mean of the residual, is biased upward). One way to minimize this problem, used by Fama and MacBeth (1973), is to let the assets be portfolios of assets, for which we can expect some of the individual noise in the first-step regressions to average out—and thereby make the measurement error in β̂_i smaller. If CAPM is true, then the return of an asset is a linear function of the market return and an error which should be uncorrelated with the errors of other assets—otherwise some factor is missing. If the portfolio consists of 20 assets with equal error variance in a CAPM regression, then we should expect the portfolio to have an error variance which is 1/20th as large.

We clearly want portfolios which have different betas, or else the second step regression (5.19) does not work. Fama and MacBeth (1973) choose to construct portfolios according to some initial estimate of asset-specific betas. Another way to deal with the errors-in-variables problem is to adjust the tests.
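As a concrete illustration of the three steps and of (5.19)–(5.21), here is a minimal NumPy sketch. It assumes the excess returns are in a T×n array `Re` and the market excess return in a T-vector `Rm` (both names are hypothetical); it is an illustration, not the code behind the tables or figures in these notes.

```python
import numpy as np

def fama_macbeth(Re, Rm):
    """Two-pass Fama-MacBeth: time-series betas, then one cross-sectional
    regression (without intercept, as in (5.19)) for every period."""
    T, n = Re.shape
    X = np.column_stack([np.ones(T), Rm])      # first pass: R_it^e = a_i + beta_i*R_mt^e + e_it
    betas = np.array([np.linalg.lstsq(X, Re[:, i], rcond=None)[0][1] for i in range(n)])

    b = betas.reshape(-1, 1)                   # second-pass regressor = estimated betas
    lam = np.empty(T)
    resid = np.empty((T, n))
    for t in range(T):                         # one cross-section per period t
        lam[t] = np.linalg.lstsq(b, Re[t], rcond=None)[0][0]
        resid[t] = Re[t] - lam[t] * betas

    lam_bar = lam.mean()                       # (5.21)
    eps_bar = resid.mean(axis=0)               # (5.20)
    # standard errors from the time variation of lam_t and eps_it (Fama-MacBeth suggestion)
    lam_std = lam.std(ddof=0) / np.sqrt(T)
    eps_std = resid.std(axis=0, ddof=0) / np.sqrt(T)
    return lam_bar, lam_std, eps_bar, eps_std
```

The standard errors in the last lines anticipate the variance formulas discussed next: they divide the time variation of λ̂_t and ε̂_it by T.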
We can test the model by studying if ε_i = 0 (recall from (5.20) that ε̂_i is the time average of the residual for asset i, ε̂_it), by forming a t-test ε̂_i / Std(ε̂_i). Fama and MacBeth (1973) suggest that the standard deviation should be found by studying the time-variation in ε̂_it. In particular, they suggest that the variance of ε̂_it (not ε̂_i) can be estimated by the (average) squared variation around its mean

Var(ε̂_it) = (1/T) Σ_{t=1}^T (ε̂_it − ε̂_i)².   (5.22)

Since ε̂_i is the sample average of ε̂_it, the variance of the former is the variance of the latter divided by T (the sample size)—provided ε̂_it is iid. That is,

Var(ε̂_i) = (1/T) Var(ε̂_it) = (1/T²) Σ_{t=1}^T (ε̂_it − ε̂_i)².   (5.23)

A similar argument leads to the variance of λ̂,

Var(λ̂) = (1/T²) Σ_{t=1}^T (λ̂_t − λ̂)².   (5.24)

Fama and MacBeth (1973) found, among other things, that the squared beta is not significant in the second step regression, nor is a measure of non-systematic risk.

Figure 5.9: FF, FF portfolios. [Mean excess return (%) against predicted mean excess return (FF, %); lines connect portfolios of the same B/M; legend: 1 (low) to 5 (high).]

Bibliography

Campbell, J. Y., A. W. Lo, and A. C. MacKinlay, 1997, The econometrics of financial markets, Princeton University Press, Princeton, New Jersey.

Chen, N.-F., R. Roll, and S. A. Ross, 1986, "Economic forces and the stock market," Journal of Business, 59, 383–403.

Cochrane, J. H., 2001, Asset pricing, Princeton University Press, Princeton, New Jersey.

Elton, E. J., M. J. Gruber, S. J. Brown, and W. N. Goetzmann, 2007, Modern portfolio theory and investment analysis, John Wiley and Sons, 7th edn.

Fama, E., and J. MacBeth, 1973, "Risk, return, and equilibrium: empirical tests," Journal of Political Economy, 71, 607–636.

Fama, E. F., and K. R. French, 1993, "Common risk factors in the returns on stocks and bonds," Journal of Financial Economics, 33, 3–56.

Fama, E. F., and K. R. French, 1996, "Multifactor explanations of asset pricing anomalies," Journal of Finance, 51, 55–84.

Gibbons, M., S. Ross, and J. Shanken, 1989, "A test of the efficiency of a given portfolio," Econometrica, 57, 1121–1152.

MacKinlay, C., 1995, "Multifactor models do not explain deviations from the CAPM," Journal of Financial Economics, 38, 3–28.

Wooldridge, J. M., 2002, Econometric analysis of cross section and panel data, MIT Press.
6 Time Series Analysis

Reference: Newbold (1995) 17 or Pindyck and Rubinfeld (1998) 13.5, 16.1–2, and 17.2
More advanced material is denoted by a star (*). It is not required reading.

6.1 Descriptive Statistics

The sth autocovariance of y_t is estimated by

Ĉov(y_t, y_{t−s}) = Σ_{t=1}^T (y_t − ȳ)(y_{t−s} − ȳ)/T, where ȳ = Σ_{t=1}^T y_t / T.   (6.1)

The conventions in time series analysis are that we use the same estimated (using all data) mean in both places and that we divide by T.

The sth autocorrelation is estimated as

ρ̂_s = Ĉov(y_t, y_{t−s}) / Ŝtd(y_t)².   (6.2)

Compared with a traditional estimate of a correlation, we here impose that the standard deviations of y_t and y_{t−s} are the same (which typically does not make much of a difference).

The sampling properties of ρ̂_s are complicated, but there are several useful large-sample results for Gaussian processes (these results typically carry over to processes which are similar to the Gaussian—a homoskedastic process with a finite 6th moment is typically enough, see Priestley (1981) 5.3 or Brockwell and Davis (1991) 7.2–7.3). When the true autocorrelations are all zero (not ρ_0, of course), then for any i and j different from zero

√T [ρ̂_i, ρ̂_j]' →d N( [0, 0]', [1 0; 0 1] ).   (6.3)

This result can be used to construct tests for both single autocorrelations (t-test or χ² test) and several autocorrelations at once (χ² test). In particular,

√T ρ̂_s →d N(0, 1),   (6.4)

so √T ρ̂_s can be used as a t-stat.

Example 6.1 (t-test) We want to test the hypothesis that ρ_1 = 0. Since the N(0,1) distribution has 5% of the probability mass below −1.65 and another 5% above 1.65, we can reject the null hypothesis at the 10% level if √T |ρ̂_1| > 1.65. With T = 100, we therefore need |ρ̂_1| > 1.65/√100 = 0.165 for rejection, and with T = 1000 we need |ρ̂_1| > 1.65/√1000 ≈ 0.052.

The Box-Pierce test follows directly from the result in (6.3), since it shows that √T ρ̂_i and √T ρ̂_j are iid N(0,1) variables. Therefore, the sum of their squares is distributed as a χ² variable. The test statistic typically used is

Q_L = T Σ_{s=1}^L ρ̂_s² →d χ²_L.   (6.5)

Example 6.2 (Box-Pierce) Let ρ̂_1 = 0.165, and T = 100, so Q_1 = 100 × 0.165² = 2.72. The 10% critical value of the χ²_1 distribution is 2.71, so the null hypothesis of no autocorrelation is rejected.

The choice of the lag order in (6.5), L, should be guided by theoretical considerations, but it may also be wise to try different values. There is clearly a trade-off: too few lags may miss a significant high-order autocorrelation, but too many lags can destroy the power of the test (as the test statistic is not affected much by increasing L, but the critical values increase).

The pth partial autocorrelation is discussed in Section 6.3.6.
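A small NumPy sketch of (6.1)–(6.2) and the Box-Pierce statistic (6.5), for a hypothetical series `y`; the conventions (same full-sample mean in both places, division by T) follow the text above.

```python
import numpy as np

def autocorr(y, L):
    """Sample autocorrelations rho_1,...,rho_L as in (6.1)-(6.2)."""
    y = np.asarray(y, dtype=float)
    T = y.size
    d = y - y.mean()
    gamma0 = (d @ d) / T
    return np.array([(d[s:] @ d[:-s]) / T / gamma0 for s in range(1, L + 1)])

def box_pierce(y, L):
    """Q_L = T * sum of squared autocorrelations, approx. chi-square(L) under
    the null of no autocorrelation, cf. (6.5)."""
    return len(y) * np.sum(autocorr(y, L) ** 2)

# in the spirit of Example 6.2: rho_1 = 0.165 and T = 100 give Q_1 = 2.72
```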
6.2 White Noise

The white noise process is the basic building block used in most other time series models. It is characterized by a zero mean, a constant variance, and no autocorrelation:

E ε_t = 0,
Var(ε_t) = σ², and
Cov(ε_{t−s}, ε_t) = 0 if s ≠ 0.   (6.6)

If, in addition, ε_t is normally distributed, then it is said to be Gaussian white noise. This process clearly cannot be forecasted.

To construct a variable that has a non-zero mean, we can form

y_t = μ + ε_t,   (6.7)

where μ is a constant. This process is most easily estimated by estimating the sample mean and variance in the usual way (as in (6.1) with s = 0).

6.3 Autoregression (AR)

6.3.1 AR(1)

In this section we study the first-order autoregressive process, AR(1), in some detail in order to understand the basic concepts of autoregressive processes. The process is assumed to have a zero mean (or to be demeaned, that is, an original variable minus its mean, for instance y_t = x_t − x̄)—but it is straightforward to put in any mean or trend.

An AR(1) is

y_t = a y_{t−1} + ε_t, with Var(ε_t) = σ²,   (6.8)

where ε_t is the white noise process in (6.6) which is uncorrelated with y_{t−1}. If −1 < a < 1, then the effect of a shock eventually dies out: y_t is stationary.

The AR(1) model can be estimated with OLS (since ε_t and y_{t−1} are uncorrelated) and the usual tools for testing the significance of coefficients and estimating the variance of the residual all apply.

The basic properties of an AR(1) process are (provided |a| < 1)

Var(y_t) = σ²/(1 − a²),   (6.9)
Corr(y_t, y_{t−s}) = a^s,   (6.10)

so the variance and autocorrelation are increasing in a (assuming a > 0).

See Figure 6.1 for an illustration.

Figure 6.1: Autocorrelations and partial autocorrelations. [Autocorrelations and partial autocorrelations (lags 0–10) of AR(1) and MA(1) processes with parameter 0.85 and −0.85.]

Remark 6.3 (Autocorrelation and autoregression). Notice that the OLS estimate of a in (6.8) is essentially the same as the sample autocorrelation coefficient in (6.2). This follows from the fact that the slope coefficient is Ĉov(y_t, y_{t−1})/V̂ar(y_{t−1}). The denominator can be a bit different since a few data points are left out in the OLS estimation, but the difference is likely to be small.

Example 6.4 With a = 0.85 and σ² = 0.5², we have Var(y_t) = 0.25/(1 − 0.85²) ≈ 0.9, which is much larger than the variance of the residual. (Why?)

If a = 1 in (6.8), then we get a random walk. It is clear from the previous analysis that a random walk is non-stationary—that is, the effect of a shock never dies out. This implies that the variance is infinite and that the standard tools for testing coefficients etc. are invalid. The solution is to study changes in y instead: y_t − y_{t−1}. In general, processes with the property that the effect of a shock never dies out are called non-stationary or unit root or integrated processes. Try to avoid them.

See Figure 6.2 for an example.
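To illustrate Remark 6.3 and Example 6.4, a short simulation sketch (NumPy; the sample size and seed are arbitrary choices, not from the notes): it simulates (6.8), estimates a by OLS, and compares it with the first sample autocorrelation.

```python
import numpy as np

rng = np.random.default_rng(0)
a, sigma, T = 0.85, 0.5, 1000

# simulate the AR(1) in (6.8): y_t = a*y_(t-1) + eps_t
y = np.zeros(T)
eps = rng.normal(0.0, sigma, T)
for t in range(1, T):
    y[t] = a * y[t - 1] + eps[t]

# OLS estimate of a (regress y_t on y_(t-1); no constant since the mean is zero)
a_ols = (y[1:] @ y[:-1]) / (y[:-1] @ y[:-1])

# first sample autocorrelation as in (6.2); close to a_ols, as noted in Remark 6.3
d = y - y.mean()
rho1 = (d[1:] @ d[:-1]) / (d @ d)

print(a_ols, rho1, y.var())   # y.var() should be near sigma**2/(1 - a**2), about 0.9
```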
Figure 6.2: Predicting US stock returns (various investment horizons) with lagged returns. [Regression Return = a + b × lagged Return: slope (with 90% confidence band) and R² against the return horizon (months); US stock returns 1926:1–2008:12.]

6.3.2 More on the Properties of an AR(1) Process

Solve (6.8) backwards by repeated substitution

y_t = a(a y_{t−2} + ε_{t−1}) + ε_t   (6.11)
    = a² y_{t−2} + a ε_{t−1} + ε_t   (6.12)
    ⋮   (6.13)
    = a^{K+1} y_{t−K−1} + Σ_{s=0}^K a^s ε_{t−s}.   (6.14)

The factor a^{K+1} y_{t−K−1} declines monotonically to zero if 0 < a < 1 as K increases, and declines in an oscillating fashion if −1 < a < 0. In either case, the AR(1) process is covariance stationary and we can then take the limit as K → ∞ to get

y_t = ε_t + a ε_{t−1} + a² ε_{t−2} + ...
    = Σ_{s=0}^∞ a^s ε_{t−s}.   (6.15)

Since ε_t is uncorrelated over time, y_{t−1} and ε_t are uncorrelated. We can therefore calculate the variance of y_t in (6.8) as the sum of the variances of the two components on the right hand side

Var(y_t) = Var(a y_{t−1}) + Var(ε_t)
         = a² Var(y_{t−1}) + Var(ε_t)
         = Var(ε_t)/(1 − a²), since Var(y_{t−1}) = Var(y_t).   (6.16)

In this calculation, we use the fact that Var(y_{t−1}) and Var(y_t) are equal. Formally, this follows from the fact that they are both linear functions of current and past ε_s terms (see (6.15)), which have the same variance over time (ε_t is assumed to be white noise).

Note from (6.16) that the variance of y_t is increasing in the absolute value of a, which is illustrated in Figure 6.3. The intuition is that a large |a| implies that a shock has an effect over many time periods and thereby creates movements (volatility) in y.

Similarly, the covariance of y_t and y_{t−1} is

Cov(y_t, y_{t−1}) = Cov(a y_{t−1} + ε_t, y_{t−1}) = a Cov(y_{t−1}, y_{t−1}) = a Var(y_t).   (6.17)

We can then calculate the first-order autocorrelation as

Corr(y_t, y_{t−1}) = Cov(y_t, y_{t−1}) / (Std(y_t) Std(y_{t−1})) = a.   (6.18)

It is straightforward to show that

Corr(y_t, y_{t−s}) = Corr(y_{t+s}, y_t) = a^s.   (6.19)

6.3.3 Forecasting with an AR(1)

Suppose we have estimated an AR(1). To simplify the exposition, we assume that we actually know a and Var(ε_t), which might be a reasonable approximation if they were estimated on a long sample.

We want to forecast y_{t+1} using information available in t. From (6.8) we get

y_{t+1} = a y_t + ε_{t+1}.   (6.20)
Figure 6.3: Properties of AR(1) process. [Forecasts with 90% confidence bands for initial values 3 and 0, forecasting horizons 0–10; AR(1) model: y_{t+1} = 0.85 y_t + ε_{t+1}, σ = 0.5.]

Since the best guess of ε_{t+1} is that it is zero, the best forecast and the associated forecast error are

E_t y_{t+1} = a y_t, and   (6.21)
y_{t+1} − E_t y_{t+1} = ε_{t+1} with variance σ².   (6.22)

We may also want to forecast y_{t+2} using the information in t. To do that, note that (6.8) gives

y_{t+2} = a y_{t+1} + ε_{t+2}
        = a(a y_t + ε_{t+1}) + ε_{t+2}
        = a² y_t + a ε_{t+1} + ε_{t+2}.   (6.23)

Since E_t ε_{t+1} and E_t ε_{t+2} are both zero, we get that

E_t y_{t+2} = a² y_t, and   (6.24)
y_{t+2} − E_t y_{t+2} = a ε_{t+1} + ε_{t+2} with variance a²σ² + σ².   (6.25)

More generally, we have

E_t y_{t+s} = a^s y_t,   (6.26)
Var(y_{t+s} − E_t y_{t+s}) = (1 + a² + a⁴ + ... + a^{2(s−1)}) σ²   (6.27)
                           = [(a^{2s} − 1)/(a² − 1)] σ².   (6.28)

Example 6.5 If y_t = 3, a = 0.85 and σ = 0.5, then the forecasts and the forecast error variances become

Horizon s    E_t y_{t+s}           Var(y_{t+s} − E_t y_{t+s})
1            0.85 × 3 = 2.55       0.25
2            0.85² × 3 = 2.17      (0.85² + 1) × 0.5² = 0.43
25           0.85²⁵ × 3 = 0.05     (0.85⁵⁰ − 1)/(0.85² − 1) × 0.5² = 0.90

Notice that the point forecasts converge towards zero and the variance of the forecast error converges to the unconditional variance (see Example 6.4).

If the shocks ε_t are normally distributed, then we can calculate 90% confidence intervals around the point forecasts in (6.21) and (6.24) as

90% confidence band of E_t y_{t+1}:  a y_t ± 1.65 × σ,   (6.29)
90% confidence band of E_t y_{t+2}:  a² y_t ± 1.65 × √(a²σ² + σ²).   (6.30)

(Recall that 90% of the probability mass is within the interval −1.65 to 1.65 in the N(0,1) distribution.) To get 95% confidence bands, replace 1.65 by 1.96. Figure 6.3 gives an example.

Example 6.6 Continuing Example 6.5, we get the following 90% confidence bands

Horizon s
1            2.55 ± 1.65 × √0.25 ≈ [1.7, 3.4]
2            2.17 ± 1.65 × √0.43 ≈ [1.1, 3.2]
25           0.05 ± 1.65 × √0.90 ≈ [−1.5, 1.6]

Remark 6.7 (White noise as special case of AR(1).) When a = 0 in (6.8), then the AR(1) collapses to a white noise process. The forecast is then a constant (zero) for all forecasting horizons, see (6.26), and the forecast error variance is also the same for all horizons, see (6.28).
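The forecasting formulas (6.26)–(6.30) are easy to put in code; the sketch below (a NumPy illustration, not part of the original notes) reproduces the numbers in Examples 6.5–6.6.

```python
import numpy as np

def ar1_forecast(y_t, a, sigma, horizons):
    """Point forecasts (6.26), forecast error variances (6.28) and 90% bands
    in the spirit of (6.29)-(6.30), for an AR(1) with known a and sigma."""
    out = []
    for s in horizons:
        point = a**s * y_t
        var = sigma**2 * (a**(2 * s) - 1) / (a**2 - 1)   # = sigma^2*(1 + a^2 + ... + a^(2(s-1)))
        band = 1.65 * np.sqrt(var)
        out.append((s, point, var, point - band, point + band))
    return out

# the numbers in Examples 6.5-6.6: y_t = 3, a = 0.85, sigma = 0.5
for s, point, var, lo, hi in ar1_forecast(3.0, 0.85, 0.5, [1, 2, 25]):
    print(f"s={s:2d}  forecast={point:5.2f}  var={var:4.2f}  90% band=[{lo:4.1f}, {hi:4.1f}]")
```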
6.3.4 Adding a Constant to the AR(1)

The discussion of the AR(1) worked with a zero-mean variable, but that was just for convenience (to make the equations shorter). One way to work with a variable x_t with a non-zero mean is to first estimate its sample mean x̄ and then let the y_t in the AR(1) model (6.8) be a demeaned variable, y_t = x_t − x̄.

To include a constant μ in the theoretical expressions, we just need to substitute x_t − μ for y_t everywhere. For instance, in (6.8) we would get

x_t − μ = a(x_{t−1} − μ) + ε_t, or
x_t = (1 − a)μ + a x_{t−1} + ε_t.   (6.31)

Estimation by LS will therefore give an intercept that equals (1 − a)μ and a slope coefficient that equals a.

6.3.5 AR(p)

The pth-order autoregressive process, AR(p), is a straightforward extension of the AR(1)

y_t = a_1 y_{t−1} + a_2 y_{t−2} + ... + a_p y_{t−p} + ε_t.   (6.32)

All the previous calculations can be made on this process as well—it is just a bit messier. This process can also be estimated with OLS since ε_t is uncorrelated with lags of y_t. Adding a constant is straightforward by substituting x_t − μ for y_t everywhere.

6.3.6 Partial Autocorrelations

The pth partial autocorrelation tries to measure the direct relation between y_t and y_{t−p}, where the indirect effects of y_{t−1}, ..., y_{t−p+1} are eliminated. For instance, if y_t is generated by an AR(1) model, then the 2nd autocorrelation is a², whereas the 2nd partial autocorrelation is zero. The partial autocorrelation is therefore a way to gauge how many lags are needed in an AR(p) model.

In practice, the first partial autocorrelation is estimated by a in an AR(1)

y_t = a y_{t−1} + ε_t.   (6.33)

The second partial autocorrelation is estimated by the second slope coefficient (a_2) in an AR(2)

y_t = a_1 y_{t−1} + a_2 y_{t−2} + ε_t,   (6.34)

and so forth. The general pattern is that the pth partial autocorrelation is estimated by the slope coefficient of the pth lag in an AR(p), where we let p go from 1, 2, 3, ...

See Figure 6.1 for an illustration.
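A minimal sketch of this estimation approach (NumPy), assuming a hypothetical, demeaned series `y`: the pth partial autocorrelation is read off as the last OLS slope of an AR(p).

```python
import numpy as np

def pacf(y, pmax):
    """Partial autocorrelations estimated as the coefficient on the p-th lag
    in an AR(p) fitted by OLS, for p = 1,...,pmax (the approach in (6.33)-(6.34))."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    out = []
    for p in range(1, pmax + 1):
        Y = y[p:]                                                    # left-hand side y_t
        X = np.column_stack([y[p - k:-k] for k in range(1, p + 1)])  # lags 1,...,p
        coefs = np.linalg.lstsq(X, Y, rcond=None)[0]
        out.append(coefs[-1])                                        # slope of the p-th lag
    return np.array(out)
```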
6.3.7 Forecasting with an AR(2)

As an example, consider making a forecast of y_{t+1} based on the information in t by using an AR(2)

y_{t+1} = a_1 y_t + a_2 y_{t−1} + ε_{t+1}.   (6.35)

This immediately gives the one-period point forecast

E_t y_{t+1} = a_1 y_t + a_2 y_{t−1}.   (6.36)

We can use (6.35) to write y_{t+2} as

y_{t+2} = a_1 y_{t+1} + a_2 y_t + ε_{t+2}
        = a_1 (a_1 y_t + a_2 y_{t−1} + ε_{t+1}) + a_2 y_t + ε_{t+2}
        = (a_1² + a_2) y_t + a_1 a_2 y_{t−1} + a_1 ε_{t+1} + ε_{t+2}.   (6.37)

Figure 6.4 gives an empirical example.

Figure 6.4: Forecasting with an AR(2). [Forecasts made in t−1 and t−2 versus actual values; model: AR(2) of US 4-quarter GDP growth, estimated on data for 1947–1994; slope coefficients: 1.26 and −0.51; R² is corr(forecast, actual)² for 1995–; y(t) and E(t−s)y(t) are plotted in t. Comparison of forecast errors from the autoregression (AR) and a random walk (RW), 1-quarter and 2-quarter: MSE(AR)/MSE(RW) 1.01 and 0.85; MAE(AR)/MAE(RW) 1.07 and 1.01; R2(AR)/R2(RW) 0.97 and 1.01.]

The expressions for the forecasts and forecast error variances quickly get somewhat messy—and even more so with an AR of higher order than two. There is a simple, and approximately correct, shortcut that can be taken. Note that both the one-period and two-period forecasts are linear functions of y_t and y_{t−1}. We could therefore estimate the following two equations with OLS

y_{t+1} = a_1 y_t + a_2 y_{t−1} + ε_{t+1},   (6.38)
y_{t+2} = b_1 y_t + b_2 y_{t−1} + v_{t+2}.   (6.39)

Clearly, (6.38) is the same as (6.35) and the estimated coefficients can therefore be used to make one-period forecasts, and the variance of ε_{t+1} is a good estimator of the variance of the one-period forecast error. The coefficients in (6.39) will be very similar to what we get by combining the a_1 and a_2 coefficients as in (6.37): b_1 will be similar to a_1² + a_2 and b_2 to a_1 a_2 (in an infinite sample they should be identical). Equation (6.39) can therefore be used to make two-period forecasts, and the variance of v_{t+2} can be taken to be the forecast error variance for this forecast.
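A short sketch of this shortcut for the AR(2) case in (6.38)–(6.39), assuming a hypothetical series `y` (NumPy); the series is demeaned inside the function, matching the zero-mean setup of the equations.

```python
import numpy as np

def direct_forecast_regressions(y):
    """Estimate (6.38)-(6.39) by OLS: separate regressions for the one- and
    two-period forecasts, both with y_t and y_(t-1) as regressors."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    X1 = np.column_stack([y[1:-1], y[:-2]])     # (y_t, y_(t-1)) for forecasting y_(t+1)
    a = np.linalg.lstsq(X1, y[2:], rcond=None)[0]
    X2 = np.column_stack([y[1:-2], y[:-3]])     # (y_t, y_(t-1)) for forecasting y_(t+2)
    b = np.linalg.lstsq(X2, y[3:], rcond=None)[0]
    return a, b   # b[0] should be close to a[0]**2 + a[1] and b[1] to a[0]*a[1], cf. (6.37)
```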
6.4 Moving Average (MA)

A qth-order moving average process is

y_t = ε_t + θ_1 ε_{t−1} + ... + θ_q ε_{t−q},   (6.40)

where the innovation ε_t is white noise (usually Gaussian). It is straightforward to add a constant to capture a non-zero mean.

Estimation of MA processes is typically done by setting up the likelihood function and then using some numerical method to maximize it; LS does not work at all since the right hand side variables are unobservable. This is one reason why MA models play a limited role in applied work. Moreover, most MA models can be well approximated by an AR model of low order.

The autocorrelations and partial autocorrelations (for different lags) can help us gauge if the time series looks more like an AR or an MA. In an AR(p) model, the autocorrelations decay to zero for long lags, while the (p+1)th partial autocorrelation (and beyond) goes abruptly to zero. The reverse is true for an MA model. See Figure 6.1 for an illustration.

6.5 ARMA(p,q)

Autoregressive-moving average models add a moving average structure to an AR model. For instance, an ARMA(2,1) could be

y_t = a_1 y_{t−1} + a_2 y_{t−2} + ε_t + θ_1 ε_{t−1},   (6.41)

where ε_t is white noise. This type of model is much harder to estimate than the autoregressive model (LS cannot be used). The appropriate specification of the model (number of lags of y_t and ε_t) is often unknown. The Box-Jenkins methodology is a set of guidelines for arriving at the correct specification: start with some model, study the autocorrelation structure of the fitted residuals, and then change the model. It is straightforward to add a constant to capture a non-zero mean.

Most ARMA models can be well approximated by an AR model—provided we add some extra lags. Since AR models are so simple to estimate, this approximation approach is often used.

Remark 6.8 In an ARMA model, both the autocorrelations and partial autocorrelations decay to zero for long lags.

6.6 VAR(p)

The vector autoregression is a multivariate version of an AR(1) process: we can think of y_t and ε_t in (6.32) as vectors and the a_i as matrices.

For instance, the VAR(1) of two variables (x_t and z_t) is (in matrix form)

[x_{t+1}]   [a_11  a_12] [x_t]   [ε_{x,t+1}]
[z_{t+1}] = [a_21  a_22] [z_t] + [ε_{z,t+1}],   (6.42)

or equivalently

x_{t+1} = a_11 x_t + a_12 z_t + ε_{x,t+1}, and   (6.43)
z_{t+1} = a_21 x_t + a_22 z_t + ε_{z,t+1}.   (6.44)

Both (6.43) and (6.44) are regression equations, which can be estimated with OLS (since ε_{x,t+1} and ε_{z,t+1} are uncorrelated with x_t and z_t).

With the information available in t, that is, information about x_t and z_t, (6.43) and (6.44) can be used to forecast one step ahead as

E_t x_{t+1} = a_11 x_t + a_12 z_t,   (6.45)
E_t z_{t+1} = a_21 x_t + a_22 z_t.   (6.46)

We also want to make a forecast of x_{t+2} based on the information in t. Clearly, it must be the case that

E_t x_{t+2} = a_11 E_t x_{t+1} + a_12 E_t z_{t+1},   (6.47)
E_t z_{t+2} = a_21 E_t x_{t+1} + a_22 E_t z_{t+1}.   (6.48)

We already have values for E_t x_{t+1} and E_t z_{t+1} from (6.45) and (6.46) which we can use. For instance, for E_t x_{t+2} we get

E_t x_{t+2} = a_11 (a_11 x_t + a_12 z_t) + a_12 (a_21 x_t + a_22 z_t)
            = (a_11² + a_12 a_21) x_t + (a_12 a_22 + a_11 a_12) z_t.   (6.49)

This has the same form as the one-period forecast in (6.45), but with other coefficients. Note that all we need to make the forecasts (for both t+1 and t+2) are the values in period t (x_t and z_t). This follows from the fact that (6.42) is a first-order system where the values of x_t and z_t summarize all relevant information about the future that is available in t.

The forecast uncertainty about the one-period forecast is simple: the forecast error is x_{t+1} − E_t x_{t+1} = ε_{x,t+1}. The two-period forecast error, x_{t+2} − E_t x_{t+2}, is a linear combination of ε_{x,t+1}, ε_{z,t+1}, and ε_{x,t+2}. The calculations of the forecast error variance (as well as of the forecasts themselves) quickly get messy. This is even more true when the VAR system is of a higher order.

As for the AR(p) model, a practical way to get around the problem with messy calculations is to estimate a separate model for each forecasting horizon. In a large sample, the difference between the two ways is trivial. For instance, suppose the correct model is the VAR(1) in (6.42) and that we want to forecast x one and two periods ahead. From (6.45) and (6.49) we see that the regression equations should be of the form

x_{t+1} = δ_1 x_t + δ_2 z_t + u_{t+1}, and   (6.50)
x_{t+2} = γ_1 x_t + γ_2 z_t + w_{t+2}.   (6.51)

With estimated coefficients (OLS can be used), it is straightforward to calculate forecasts and forecast error variances.

In a more general VAR(p) model we need to include p lags of both x and z in the regression equations (p = 1 in (6.50) and (6.51)).
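A minimal sketch of this approach for the bivariate VAR(1), assuming demeaned series `x` and `z` (hypothetical NumPy vectors); it illustrates (6.43)–(6.45) and the iterated two-step forecast (6.49), and is not a substitute for a full VAR package.

```python
import numpy as np

def var1_fit_and_forecast(x, z):
    """OLS on the two equations (6.43)-(6.44), then the forecasts (6.45) and (6.49)."""
    X = np.column_stack([x[:-1], z[:-1]])             # regressors x_t, z_t
    a1 = np.linalg.lstsq(X, x[1:], rcond=None)[0]     # a11, a12
    a2 = np.linalg.lstsq(X, z[1:], rcond=None)[0]     # a21, a22
    A = np.vstack([a1, a2])
    s = np.array([x[-1], z[-1]])                      # latest observation (x_t, z_t)
    f1 = A @ s                                        # one-step forecast of (x, z)
    f2 = A @ f1                                       # two-step forecast, cf. (6.49)
    return A, f1, f2
```

Estimating the horizon-specific regressions (6.50)–(6.51) directly would proceed exactly as in the AR(2) sketch above, with z_t added as a regressor.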
6.6.1 Granger Causality

If z_t can help predict future x, over and above what lags of x itself can, then z is said to Granger Cause x. This is a statistical notion of causality, and may not necessarily have much to do with economic causality (Christmas cards may Granger cause Christmas). In (6.50) z does Granger cause x if δ_2 ≠ 0, which can be tested with an F-test. More generally, there may be more lags of both x and z in the equation, so we need to test if all coefficients on different lags of z are zero.

6.7 Non-stationary Processes

6.7.1 Introduction

A trend-stationary process can be made stationary by subtracting a linear trend. The simplest example is

y_t = μ + βt + ε_t,   (6.52)

where ε_t is white noise.

A unit root process can be made stationary only by taking a difference. The simplest example is the random walk with drift

y_t = μ + y_{t−1} + ε_t,   (6.53)

where ε_t is white noise. The name "unit root process" comes from the fact that the largest eigenvalue of the canonical form (the VAR(1) form of the AR(p)) is one. Such a process is said to be integrated of order one (often denoted I(1)) and can be made stationary by taking first differences.

Example 6.9 (Non-stationary AR(2).) The process y_t = 1.5 y_{t−1} − 0.5 y_{t−2} + ε_t can be written

[y_t    ]   [1.5  −0.5] [y_{t−1}]   [ε_t]
[y_{t−1}] = [1      0 ] [y_{t−2}] + [0 ],

where the matrix has the eigenvalues 1 and 0.5 and is therefore non-stationary. Note that subtracting y_{t−1} from both sides gives y_t − y_{t−1} = 0.5(y_{t−1} − y_{t−2}) + ε_t, so the variable x_t = y_t − y_{t−1} is stationary.

The distinguishing feature of unit root processes is that the effect of a shock never vanishes. This is most easily seen for the random walk. Substitute repeatedly in (6.53) to get

y_t = μ + (μ + y_{t−2} + ε_{t−1}) + ε_t
    ⋮
    = tμ + y_0 + Σ_{s=1}^t ε_s.   (6.54)

The effect of ε_t never dies out: a non-zero value of ε_t gives a permanent shift of the level of y_t. This process is clearly non-stationary. A consequence of the permanent effect of a shock is that the variance of the conditional distribution grows without bound as the forecasting horizon is extended. For instance, for the random walk with drift, (6.54), the distribution conditional on the information in t = 0 is N(y_0 + tμ, tσ²) if the innovations are Gaussian. This means that the expected change is tμ and that the conditional variance grows linearly with the forecasting horizon. The unconditional variance is therefore infinite and the standard results on inference are not applicable.

In contrast, the conditional distribution from the trend-stationary model, (6.52), is N(μ + βt, σ²).

A process could have two unit roots (integrated of order 2: I(2)). In this case, we need to difference twice to make it stationary. Alternatively, a process can also be explosive, that is, have eigenvalues outside the unit circle. In this case, the impulse response function diverges.

Example 6.10 (Two unit roots.) Suppose y_t in Example 6.9 is actually the first difference of some other series, y_t = z_t − z_{t−1}. We then have

z_t − z_{t−1} = 1.5(z_{t−1} − z_{t−2}) − 0.5(z_{t−2} − z_{t−3}) + ε_t
z_t = 2.5 z_{t−1} − 2 z_{t−2} + 0.5 z_{t−3} + ε_t,

which is an AR(3) with the following canonical form

[z_t    ]   [2.5  −2  0.5] [z_{t−1}]   [ε_t]
[z_{t−1}] = [1     0   0 ] [z_{t−2}] + [0 ].
[z_{t−2}]   [0     1   0 ] [z_{t−3}]   [0 ]

The eigenvalues are 1, 1, and 0.5, so z_t has two unit roots (integrated of order 2: I(2), and needs to be differenced twice to become stationary).
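The eigenvalue claims in Examples 6.9–6.10 can be checked numerically; a small NumPy sketch:

```python
import numpy as np

# companion (canonical) form of the AR(2) in Example 6.9 and the AR(3) in Example 6.10
A2 = np.array([[1.5, -0.5],
               [1.0,  0.0]])
A3 = np.array([[2.5, -2.0, 0.5],
               [1.0,  0.0, 0.0],
               [0.0,  1.0, 0.0]])

print(np.linalg.eigvals(A2))   # 1.0 and 0.5: one unit root
print(np.linalg.eigvals(A3))   # 1.0, 1.0 and 0.5: two unit roots, so the series is I(2)
```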
Example 6.11 (Explosive AR(1).) Consider the process y_t = 1.5 y_{t−1} + ε_t. The eigenvalue is then outside the unit circle, so the process is explosive. This means that the impulse response to a shock to ε_t diverges (it is 1.5^s for s periods ahead).

6.7.2 Spurious Regressions

Strong trends often cause problems in econometric models where y_t is regressed on x_t. In essence, if no trend is included in the regression, then x_t will appear to be significant, just because it is a proxy for a trend. The same holds for unit root processes, even if they have no deterministic trends. However, the innovations accumulate and the series therefore tend to be trending in small samples. A warning sign of a spurious regression is when R² > DW statistic.

See Figure 6.5 for an empirical example and Figures 6.6–6.8 for a Monte Carlo simulation.

Figure 6.5: Example of a spurious regression. [Left: log deflator against log GDP, real GDP and deflator, US 1947Q1–2008Q4 (R²: 0.96, t-stat for slope: 75.89). Right: residual (%) from the regression deflator = a + b × gdp over time (autocorrelation t-stat and DW statistic: 15.62 and 0.01).]

Figure 6.6: Distribution of the LS estimator when y_t and x_t are independent AR(1) processes. [Distribution of b_LS for ρ = 0.2 and ρ = 1. Illustration of spurious regressions: y and x are uncorrelated AR(1) processes, y_t = ρ y_{t−1} + ε_t and x_t = ρ x_{t−1} + η_t, where ε_t and η_t are uncorrelated; b_LS is the LS estimate of b in y_t = a + b x_t + u_t, T = 200.]

Figure 6.7: Distribution of R² and autocorrelation of residuals. See Figure 6.6. [Distributions of R² and of the autocorrelation of the residuals, for ρ = 0.2 and ρ = 1.]

Figure 6.8: Distribution of t-statistics. See Figure 6.6. [Distribution of the t-statistic of the slope, for ρ = 0.2 and ρ = 1.]

For trend-stationary data, this problem is easily solved by detrending with a linear trend (before estimating, or by just adding a trend to the regression).

However, this is usually a poor method for unit root processes. What is needed is a first difference. For instance, a first difference of the random walk with drift is

Δy_t = y_t − y_{t−1} = μ + ε_t,   (6.55)

which is white noise (any finite difference, like y_t − y_{t−s}, will give a stationary series), so we could proceed by applying standard econometric tools to Δy_t.

One may then be tempted to try first-differencing all non-stationary series, since it may be hard to tell if they are unit root processes or just trend-stationary. For instance, a first difference of the trend-stationary process, (6.52), gives

y_t − y_{t−1} = β + ε_t − ε_{t−1}.   (6.56)

It is unclear if this is an improvement: the trend is gone, but the errors are now of MA(1) type (in fact, non-invertible, and therefore tricky, in particular for estimation).
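A minimal Monte Carlo sketch in the spirit of Figures 6.6–6.8 (NumPy; the sample size, number of simulations and seed are arbitrary choices): it simulates two independent AR(1) processes and records the t-statistic of the slope in the spurious regression.

```python
import numpy as np

rng = np.random.default_rng(0)
T, nsim, rho = 200, 2000, 1.0            # rho = 1 gives two independent random walks
tstats = np.empty(nsim)

for i in range(nsim):
    eps, eta = rng.normal(size=(2, T))
    y = np.zeros(T); x = np.zeros(T)
    for t in range(1, T):
        y[t] = rho * y[t - 1] + eps[t]
        x[t] = rho * x[t - 1] + eta[t]
    X = np.column_stack([np.ones(T), x])          # y_t = a + b*x_t + u_t
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    u = y - X @ coef
    s2 = (u @ u) / (T - 2)
    se_b = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    tstats[i] = coef[1] / se_b

# the series are independent, yet with rho = 1 the usual 5% test rejects far too often
print(np.mean(np.abs(tstats) > 1.96))
```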
6.7.3 Testing for a Unit Root

Suppose we run an OLS regression of

y_t = a y_{t−1} + ε_t,   (6.57)

where the true value of |a| < 1. The asymptotic distribution of the LS estimator is

√T (â − a) →d N(0, 1 − a²).   (6.58)

(The variance follows from the standard OLS formula where the variance of the estimator is σ²(X'X/T)⁻¹. Here plim X'X/T = Var(y_t), which we know is σ²/(1 − a²).)

It is well known (but not easy to show) that when a = 1, then â is biased towards zero in small samples. In addition, the asymptotic distribution is no longer (6.58). In fact, there is a discontinuity in the limiting distribution as we move from a stationary to a non-stationary variable. This, together with the small-sample bias, means that we have to use simulated critical values for testing the null hypothesis of a = 1 based on the OLS estimate from (6.57).

In practice, the approach is to run the regression (6.57) with a constant (and perhaps even a time trend), calculate the test statistic

DF = (â − 1)/Std(â),   (6.59)

and reject the null of non-stationarity if DF is less than the critical values published by Dickey and Fuller (−2.86 at the 5% level if the regression has a constant, and −3.41 if the regression includes a trend).

With more dynamics (to capture any serial correlation in ε_t in (6.57)), do an augmented Dickey-Fuller test (ADF)

y_t = δ + θ_1 y_{t−1} + θ_2 y_{t−2} + ε_{2t}, or
Δy_t = δ + (θ_1 + θ_2 − 1) y_{t−1} − θ_2 Δy_{t−1} + ε_{2t},   (6.60)

and test if θ_1 + θ_2 − 1 = 0 (against the alternative that it is below zero). The critical values are as for the DF test. If ε_{2t} is autocorrelated, then add further lags.

In principle, distinguishing between a stationary and a non-stationary series is very difficult (and impossible unless we restrict the class of processes, for instance, to an AR(2)), since any sample of a non-stationary process can be arbitrarily well approximated by some stationary process, and vice versa. The lesson to be learned, from a practical point of view, is that strong persistence in the data generating process (stationary or not) invalidates the usual results on inference. We are usually on safer ground to apply the unit root results in this case, even if the process is actually stationary.
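A minimal sketch of the ADF regression in (6.60), with a constant and a chosen number of lagged differences, for a hypothetical series `y` (NumPy); the returned t-ratio should be compared with the simulated Dickey-Fuller critical values, not with the normal ones.

```python
import numpy as np

def adf_stat(y, lags=1):
    """t-ratio on y_(t-1) in: dy_t = delta + gamma*y_(t-1) + sum_k phi_k*dy_(t-k) + e_t."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    Y = dy[lags:]
    X = [np.ones(Y.size), y[lags:-1]]          # constant and the lagged level
    for k in range(1, lags + 1):
        X.append(dy[lags - k:-k])              # lagged differences
    X = np.column_stack(X)
    coef = np.linalg.lstsq(X, Y, rcond=None)[0]
    u = Y - X @ coef
    s2 = (u @ u) / (Y.size - X.shape[1])
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X).diagonal())
    return coef[1] / se[1]                     # reject a unit root if well below -2.86 (5%)
```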
Bibliography

Brockwell, P. J., and R. A. Davis, 1991, Time series: theory and methods, Springer Verlag, New York, second edn.

Newbold, P., 1995, Statistics for business and economics, Prentice-Hall, 4th edn.

Pindyck, R. S., and D. L. Rubinfeld, 1998, Econometric models and economic forecasts, Irwin McGraw-Hill, Boston, Massachusetts, 4th edn.

Priestley, M. B., 1981, Spectral analysis and time series, Academic Press.

7 Predicting Asset Returns

Reference (medium): Elton, Gruber, Brown, and Goetzmann (2007) 17 (efficient markets) and 19 (earnings estimation)
Additional references: Campbell, Lo, and MacKinlay (1997) 2 and 7; Cochrane (2001) 20.1
More advanced material is denoted by a star (*). It is not required reading.

7.1 Asset Prices, Random Walks, and the Efficient Market Hypothesis

Let P_t be the price of an asset at the end of period t, after any dividend in t has been paid (an ex-dividend price). The gross return (1 + R_{t+1}, like 1.05) of holding an asset with dividends (per current share), D_{t+1}, between t and t+1 is then defined as

1 + R_{t+1} = (P_{t+1} + D_{t+1}) / P_t.   (7.1)

The dividend can, of course, be zero in a particular period, so this formulation encompasses the case of daily stock prices with annual dividend payment.

Remark 7.1 (Conditional expectations) The expected value of the random variable y_{t+1} conditional on the information set in t, E_t y_{t+1}, is the best guess of y_{t+1} using the information in t. Example: suppose y_{t+1} equals x_t + ε_{t+1}, where x_t is known in t, but all we know about ε_{t+1} in t is that it is a random variable with a zero mean and some (finite) variance. In this case, the best guess of y_{t+1} based on what we know in t is equal to x_t.

Take expectations of (7.1) based on the information set in t

1 + E_t R_{t+1} = (E_t P_{t+1} + E_t D_{t+1}) / P_t, or   (7.2)
P_t = (E_t P_{t+1} + E_t D_{t+1}) / (1 + E_t R_{t+1}).   (7.3)
This formulation is only a definition, but it will help us organize the discussion of how asset prices are determined.

This expected return, E_t R_{t+1}, is likely to be greater than a riskfree interest rate if the asset has positive systematic (non-diversifiable) risk. For instance, in a CAPM model this would manifest itself in a positive "beta." In an equilibrium setting, we can think of this as a "required return" needed for investors to hold this asset.

7.1.1 Different Versions of the Efficient Market Hypothesis

The efficient market hypothesis casts a long shadow on every attempt to forecast asset prices. In its simplest form it says that it is not possible to forecast asset prices, but there are several other forms with different implications. Before attempting to forecast financial markets, it is useful to take a look at the logic of the efficient market hypothesis. This will help us to organize the effort and to interpret the results.

A modern interpretation of the efficient market hypothesis (EMH) is that the information set used in forming the market expectations in (7.2) includes all public information. (This is the semi-strong form of the EMH since it says all public information; the strong form says all public and private information; and the weak form says all information in price and trading volume data.) The implication is that simple stock picking techniques are not likely to improve the portfolio performance, that is, to give abnormal returns. Instead, advanced (costly?) techniques are called for in order to gather more detailed information than that used in the market's assessment of the asset. Clearly, with a better forecast of the future return than that of the market there is plenty of scope for dynamic trading strategies. Note that this modern interpretation of the efficient market hypothesis does not rule out the possibility of forecastable prices or returns. It does rule out that abnormal returns can be achieved by stock picking techniques which rely on public information.

There are several different traditional interpretations of the EMH. Like the modern interpretation, they do not rule out the possibility of achieving abnormal returns by using better information than the rest of the market. However, they make stronger assumptions about whether prices or returns are forecastable. Typically one of the following is assumed to be unforecastable: price changes, returns, or returns in excess of a riskfree rate (interest rate). By unforecastable, it is meant that the best forecast (expected value conditional on available information) is a constant. Conversely, if it is found that there is some information in t that can predict returns R_{t+1}, then the market cannot price the asset as if E_t R_{t+1} is a constant—at least not if the market forms expectations rationally. We will now analyze the logic of each of the traditional interpretations.

If price changes are unforecastable, then E_t P_{t+1} − P_t equals a constant. Typically, this constant is taken to be zero, so P_t is a martingale. Use E_t P_{t+1} = P_t in (7.2)

E_t R_{t+1} = E_t D_{t+1} / P_t.   (7.4)

This says that the expected net return on the asset is the expected dividend divided by the current price. This is clearly implausible for daily data since it means that the expected return is zero for all days except those days when the asset pays a dividend (or rather, the day the asset goes ex dividend)—and then there is an enormous expected return for the one day when the dividend is paid. As a first step, we should probably refine the interpretation of the efficient market hypothesis to include the dividend so that E_t(P_{t+1} + D_{t+1}) = P_t. Using that in (7.2) gives 1 + E_t R_{t+1} = 1, which can only be satisfied if E_t R_{t+1} = 0, which seems very implausible for long investment horizons—although it is probably a reasonable approximation for short horizons (a week or less).

If returns are unforecastable, so E_t R_{t+1} = R (a constant), then (7.3) gives

P_t = (E_t P_{t+1} + E_t D_{t+1}) / (1 + R).   (7.5)

The main problem with this interpretation is that it looks at every asset separately and that outside options are not taken into account. For instance, if the nominal interest rate changes from 5% to 10%, why should the expected (required) return on a stock be unchanged? In fact, most asset pricing models suggest that the expected return E_t R_{t+1} equals the riskfree rate plus compensation for risk.

If excess returns are unforecastable, then the compensation (over the riskfree rate) for risk is constant. The risk compensation is, of course, already reflected in the current price P_t, so the issue is then if there is some information in t which is correlated with the risk compensation in P_{t+1}. Note that such forecastability does not necessarily imply an inefficient market or the presence of uninformed traders—it could equally well be due to movements in risk compensation driven by movements in uncertainty (option prices suggest that there are plenty of movements in uncertainty). If so, the forecastability cannot be used to generate abnormal returns (over the riskfree rate plus risk compensation). However, it could also be due to exploitable market inefficiencies. Alternatively, you may argue
that the market compensates for risk which you happen to be immune to—so you are interested in the return rather than the risk-adjusted return.

This discussion of the traditional efficient market hypothesis suggests that the most interesting hypotheses to test are if returns or excess returns are forecastable. In practice, the results for them are fairly similar since the movements in most asset returns are much greater than the movements in interest rates.

7.1.2 Martingales and Random Walks

Further reading: Cuthbertson (1996) 5.3

The accumulated wealth in a sequence of fair bets is expected to be unchanged. It is then said to be a martingale.

The time series x is a martingale with respect to an information set Ω_t if the expected value of x_{t+s} (s ≥ 1) conditional on the information set Ω_t equals x_t. (The information set Ω_t is often taken to be just the history of x: x_t, x_{t−1}, ...)

The time series x is a random walk if x_{t+1} = x_t + ε_{t+1}, where ε_t and ε_{t+s} are uncorrelated for all s ≠ 0, and E ε_t = 0. (There are other definitions which require that ε_t and ε_{t+s} have the same distribution.) A random walk is a martingale; the converse is not necessarily true.

Remark 7.2 (A martingale, but not a random walk). Suppose y_{t+1} = y_t u_{t+1}, where u_t and u_{t+s} are uncorrelated for all s ≠ 0, and E_t u_{t+1} = 1. This is a martingale, but not a random walk.

In any case, the martingale property implies that x_{t+s} = x_t + ε_{t+s}, where the expected value of ε_{t+s} based on Ω_t is zero. This is close enough to the random walk to motivate the random walk idea in most cases.

7.2 Autocorrelations

7.2.1 Autocorrelation Coefficients and the Box-Pierce Test

The autocovariances of the y_t process can be estimated as

γ̂_s = (1/T) Σ_{t=1+s}^T (y_t − ȳ)(y_{t−s} − ȳ),   (7.6)
with ȳ = (1/T) Σ_{t=1}^T y_t.   (7.7)

(We typically divide by T in (7.6) even if we have only T − s full observations to estimate γ_s from.) Autocorrelations are then estimated as

ρ̂_s = γ̂_s / γ̂_0.   (7.8)

The sampling properties of ρ̂_s are complicated, but there are several useful large-sample results for Gaussian processes (these results typically carry over to processes which are similar to the Gaussian—a homoskedastic process with a finite 6th moment is typically enough, see Priestley (1981) 5.3 or Brockwell and Davis (1991) 7.2–7.3). When the true autocorrelations are all zero (not ρ_0, of course), then for any i and j different from zero

√T [ρ̂_i, ρ̂_j]' →d N( [0, 0]', [1 0; 0 1] ).   (7.9)

This result can be used to construct tests for both single autocorrelations (t-test or χ² test) and several autocorrelations at once (χ² test).

Example 7.3 (t-test) We want to test the hypothesis that ρ_1 = 0. Since the N(0,1) distribution has 5% of the probability mass below −1.65 and another 5% above 1.65, we can reject the null hypothesis at the 10% level if √T |ρ̂_1| > 1.65. With T = 100, we therefore need |ρ̂_1| > 1.65/√100 = 0.165 for rejection, and with T = 1000 we need |ρ̂_1| > 1.65/√1000 ≈ 0.052.
The Box-Pierce test follows directly from the result in (7.9), since it shows that √T ρ̂_i and √T ρ̂_j are iid N(0,1) variables. Therefore, the sum of their squares is distributed as a χ² variable. The test statistic typically used is

Q_L = T Σ_{s=1}^L ρ̂_s² →d χ²_L.   (7.10)

Example 7.4 (Box-Pierce) Let ρ̂_1 = 0.165, and T = 100, so Q_1 = 100 × 0.165² = 2.72. The 10% critical value of the χ²_1 distribution is 2.71, so the null hypothesis of no autocorrelation is rejected.

The choice of the lag order in (7.10), L, should be guided by theoretical considerations, but it may also be wise to try different values. There is clearly a trade-off: too few lags may miss a significant high-order autocorrelation, but too many lags can destroy the power of the test (as the test statistic is not affected much by increasing L, but the critical values increase).

7.2.2 Autoregressions

An alternative way of testing autocorrelations is to estimate an AR model

y_t = c + a_1 y_{t−1} + a_2 y_{t−2} + ... + a_p y_{t−p} + ε_t,   (7.11)

and then test if all slope coefficients (a_1, a_2, ..., a_p) are zero with a χ² or F test. This approach is somewhat less general than the Box-Pierce test, but most stationary time series processes can be well approximated by an AR of relatively low order.

7.2.3 Long-Run Autoregressions

Consider an AR(1) of two-period sums (returns)

y_{t+1} + y_{t+2} = a + b_2 (y_{t−1} + y_t) + ε_{t+2}.   (7.12)

This can be estimated by LS on non-overlapping returns and b_2 = 0 can be tested using standard methods. Clearly, b_2 equals the first-order autocorrelation of two-period returns. This type of autoregression can be done also on 3-period returns (non-overlapping) and longer.

See Figure 7.4 for an illustration.

Inference about the slope coefficient in long-run autoregressions like (7.12) must be done with care. If only non-overlapping returns are used, the standard LS expression for the standard deviation of the autoregressive parameter is likely to be reasonable. This is not the case if overlapping returns are used.

Remark 7.5 (Overlapping returns*) Consider the two-period return, y_{t−1} + y_t. Two successive observations with non-overlapping returns are then

y_{t+1} + y_{t+2} = a + b_2 (y_{t−1} + y_t) + ε_{t+2}
y_{t+3} + y_{t+4} = a + b_2 (y_{t+1} + y_{t+2}) + ε_{t+4}.

Suppose that y_t is not autocorrelated, so the slope coefficient b_2 = 0. We can then write the residuals as

ε_{t+2} = y_{t+1} + y_{t+2} − a
ε_{t+4} = y_{t+3} + y_{t+4} − a,

which are uncorrelated. Compare this to the case where we use overlapping data. Two successive observations are then

y_{t+1} + y_{t+2} = a + b_2 (y_{t−1} + y_t) + ε_{t+2}
y_{t+2} + y_{t+3} = a + b_2 (y_t + y_{t+1}) + ε_{t+3}.

As before, b_2 = 0 if y_t has no autocorrelation, so the residuals become

ε_{t+2} = y_{t+1} + y_{t+2} − a
ε_{t+3} = y_{t+2} + y_{t+3} − a,

which are correlated since y_{t+2} shows up in both. This demonstrates that overlapping return data introduce autocorrelation of the residuals—which has to be handled in order to make correct inference.
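A small sketch of the non-overlapping version of (7.12), for a hypothetical return series `y` (NumPy): form non-overlapping q-period sums and regress each one on the previous one.

```python
import numpy as np

def longrun_autoregression(y, q=2):
    """AR(1) of non-overlapping q-period sums, as in (7.12) for q = 2."""
    y = np.asarray(y, dtype=float)
    T = (y.size // q) * q
    sums = y[:T].reshape(-1, q).sum(axis=1)        # non-overlapping q-period returns
    X = np.column_stack([np.ones(sums.size - 1), sums[:-1]])
    a, b = np.linalg.lstsq(X, sums[1:], rcond=None)[0]
    return b                                        # estimate of b_q in (7.12)
```

Because the sums do not overlap, the standard OLS standard error of b is a reasonable guide, as discussed in Remark 7.5.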
7.2.4 Autoregressions versus Autocorrelations

It is straightforward to see the relation between autocorrelations and the AR model when the AR model is the true process. This relation is given by the Yule-Walker equations.

For an AR(1), the autoregression coefficient is simply the first autocorrelation coefficient. For an AR(2), y_t = a_1 y_{t−1} + a_2 y_{t−2} + ε_t, we have

[Cov(y_t, y_t)    ]   [Cov(y_t, a_1 y_{t−1} + a_2 y_{t−2} + ε_t)    ]
[Cov(y_{t−1}, y_t)] = [Cov(y_{t−1}, a_1 y_{t−1} + a_2 y_{t−2} + ε_t)]
[Cov(y_{t−2}, y_t)]   [Cov(y_{t−2}, a_1 y_{t−1} + a_2 y_{t−2} + ε_t)]

                      [a_1 Cov(y_t, y_{t−1}) + a_2 Cov(y_t, y_{t−2}) + Cov(y_t, ε_t)]
                    = [a_1 Cov(y_{t−1}, y_{t−1}) + a_2 Cov(y_{t−1}, y_{t−2})        ], or
                      [a_1 Cov(y_{t−2}, y_{t−1}) + a_2 Cov(y_{t−2}, y_{t−2})        ]

[γ_0]   [a_1 γ_1 + a_2 γ_2 + Var(ε_t)]
[γ_1] = [a_1 γ_0 + a_2 γ_1          ].   (7.13)
[γ_2]   [a_1 γ_1 + a_2 γ_0          ]

To transform to autocorrelations, divide by γ_0. The last two equations are then

[ρ_1]   [a_1 + a_2 ρ_1]        [ρ_1]   [a_1/(1 − a_2)        ]
[ρ_2] = [a_1 ρ_1 + a_2]   or   [ρ_2] = [a_1²/(1 − a_2) + a_2].   (7.14)

If we know the parameters of the AR(2) model (a_1, a_2, and Var(ε_t)), then we can solve for the autocorrelations. Alternatively, if we know the autocorrelations, then we can solve for the autoregression coefficients. This demonstrates that testing if all the autocorrelations are zero is essentially the same as testing if all the autoregressive coefficients are zero. Note, however, that the transformation is non-linear, which may make a difference in small samples.

Figure 7.1: Time series properties of SMI. [SMI level versus a bill portfolio, and SMI daily excess returns (%); daily SMI data, 1988:7–2009:6. 1st-order autocorrelation of returns (daily, weekly, monthly): 0.02, −0.08, 0.05. 1st-order autocorrelation of absolute returns (daily, weekly, monthly): 0.29, 0.29, 0.17.]

Figure 7.2: Predictability of US stock returns. [Autocorrelations (lags 1–5) of daily and weekly excess returns and of absolute excess returns, with 90% confidence band around 0; S&P 500, 1979:1–2009:12.]
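A small sketch of the mapping in (7.14) between AR(2) coefficients and the first two autocorrelations, in both directions (NumPy):

```python
import numpy as np

def ar2_to_autocorr(a1, a2):
    """Map AR(2) coefficients to the first two autocorrelations, cf. (7.14)."""
    rho1 = a1 / (1 - a2)
    rho2 = a1 * rho1 + a2
    return rho1, rho2

def autocorr_to_ar2(rho1, rho2):
    """Invert (7.14): solve the two Yule-Walker equations for (a1, a2)."""
    A = np.array([[1.0, rho1],
                  [rho1, 1.0]])
    a1, a2 = np.linalg.solve(A, np.array([rho1, rho2]))
    return a1, a2
```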
7.2.5 Variance Ratios

A variance ratio is another way to measure predictability. It is defined as the variance of a q-period return divided by q times the variance of a 1-period return

VR_q = Var( Σ_{s=0}^{q−1} y_{t−s} ) / (q Var(y_t)).   (7.15)

To see that this is related to predictability, consider the 2-period variance ratio

VR_2 = Var(y_t + y_{t−1}) / (2 Var(y_t))   (7.16)
     = [Var(y_t) + Var(y_{t−1}) + 2 Cov(y_t, y_{t−1})] / (2 Var(y_t))
     = 1 + Cov(y_t, y_{t−1}) / Var(y_t)
     = 1 + ρ_1.   (7.17)

It is clear from (7.17) that if y_t is not serially correlated, then the variance ratio is unity; a value above one indicates positive serial correlation and a value below one indicates negative serial correlation. The same applies to longer horizons.

The estimation of VR_q is typically not done by replacing the population variances in (7.15) with the sample variances, since this would require using non-overlapping long returns—which wastes a lot of data points. For instance, if we have 24 years of data and we want to study the variance ratio for the 5-year horizon, then 4 years of data are wasted. Instead, we typically rely on a transformation of (7.15)

VR_q = Σ_{s=−(q−1)}^{q−1} (1 − |s|/q) ρ_s, or
VR_q = 1 + 2 Σ_{s=1}^{q−1} (1 − s/q) ρ_s.   (7.18)

To estimate VR_q, we first estimate the autocorrelation coefficients (using all available data points for each estimation) and then calculate (7.18).

Figure 7.3: Predictability of US stock returns, size deciles. [Autocorrelations (lags 1–5) of daily excess returns for the smallest, 5th and largest size deciles, with 90% confidence band around 0; US daily data 1979:1–2008:12.]

Figure 7.4: Predictability of US stock returns. [Regressions Return = a + b × lagged Return and Return = a + b × E/P: slope (with 90% confidence band) and R² against the return horizon (months); US stock returns 1926:1–2008:12.]
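A minimal sketch of this estimator, combining (7.18) with the asymptotic standard error implied by the sampling distribution discussed next, for a hypothetical return series `y` (NumPy):

```python
import numpy as np

def variance_ratio(y, q):
    """VR_q from the estimated autocorrelations, as in (7.18), plus the
    standard error under the null of no autocorrelation."""
    y = np.asarray(y, dtype=float)
    T = y.size
    d = y - y.mean()
    gamma0 = (d @ d) / T
    rho = np.array([(d[s:] @ d[:-s]) / T / gamma0 for s in range(1, q)])
    w = 1 - np.arange(1, q) / q                # weights (1 - s/q), s = 1,...,q-1
    vr = 1 + 2 * (w @ rho)
    se = np.sqrt(4 * np.sum(w**2) / T)         # from the asymptotic distribution of VR_q
    return vr, se
```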
Remark 7.6 (Sampling distribution of V̂R_q) Under the null hypothesis that there is no autocorrelation, (7.9) and (7.18) give

√T (V̂R_q − 1) →d N( 0, 4 Σ_{s=1}^{q−1} (1 − s/q)² ).   (7.19)

Example 7.7 (Sampling distributions of V̂R_2 and V̂R_3)

√T (V̂R_2 − 1) →d N(0, 1), or V̂R_2 →d N(1, 1/T), and
√T (V̂R_3 − 1) →d N(0, 20/9), or V̂R_3 →d N(1, (20/9)/T).

The results in CLM Table 2.5 and 2.6 (weekly CRSP stock index returns, early 1960s to mid 1990s) show variance ratios above one and increasing with the number of lags, q. The results for individual stocks in CLM Table 2.7 show variance ratios close to, or even below, unity. Cochrane Tables 20.5–6 report weak evidence for more mean reversion in multi-year returns (annual NYSE stock index, 1926 to mid 1990s).

See Figure 7.5 for an illustration.

Figure 7.5: Variance ratios, US excess stock returns. [Variance ratios (with 90% confidence bands) against the return horizon (months), for samples starting 1926 and 1957; US stock returns 1926:1–2008:12; confidence bands use the asymptotic sampling distribution of the variance ratio.]

7.3 Other Predictors and Methods

There are many other possible predictors of future stock returns. For instance, both the dividend-price ratio and nominal interest rates have been used to predict long-run returns, and lagged short-run returns on other assets have been used to predict short-run returns.

7.3.1 Lead-Lags

Stock indices have more positive autocorrelation than (most) individual stocks: there should therefore be fairly strong cross-autocorrelations across individual stocks. (See Campbell, Lo, and MacKinlay (1997) Tables 2.7 and 2.8.) Indeed, this is also what is found in US data, where weekly returns of large size stocks forecast weekly returns of small size stocks.

See Figures 7.6–7.7 for an illustration.

Figure 7.6: Cross-correlation across size deciles. [Correlations of the largest, 5th and smallest size decile returns with lags (1–5 days) of the smallest, 5th and largest decile returns; US size deciles, US daily data 1979:1–2008:12.]
Figure 7.7: Coefficients from multiple prediction regressions. [Multiple regressions of
the largest, 5th and smallest US size-decile returns on 1–5 day lags of their own return
and of the largest decile; US daily data 1979:1–2008:12.]

7.3.2 Dividend-Price Ratio as a Predictor

One of the most successful attempts to forecast long-run returns is a regression of future
returns on the current dividend-price ratio (here in logs)

  Σ_{s=1}^{q} r_{t+s} = α + β_q (d_t − p_t) + ε_{t+q}.   (7.20)

For instance, CLM Table 7.1 reports R² values from this regression which are close to
zero for monthly returns, but they increase to 0.4 for 4-year returns (US, value weighted
index, mid 1920s to mid 1990s).

See Figure 7.4 for an illustration.

7.3.3 Predictability but No Autocorrelation

The evidence for US stock returns is that long-run returns may perhaps be predicted by the
dividend-price ratio or interest rates, but that the long-run autocorrelations are weak (long-run
US stock returns appear to be "weak-form efficient" but not "semi-strong efficient").
This should remind us of the fact that predictability and autocorrelation need not be the
same thing: although autocorrelation implies predictability, we can have predictability
without autocorrelation.

7.3.4 Trading Strategies

Another way to measure predictability and to illustrate its economic importance is to
calculate the return of a dynamic trading strategy, and then measure the "performance"
of this strategy in relation to some benchmark portfolios. The trading strategy should, of
course, be based on the variable that is supposed to forecast returns.

A common way (since Jensen, updated in Huberman and Kandel (1987)) is to study
the performance of a portfolio by running the following regression

  R_{1t} − R_{ft} = α + β(R_{mt} − R_{ft}) + ε_t, with E ε_t = 0 and Cov(R_{mt} − R_{ft}, ε_t) = 0,   (7.21)

where R_{1t} − R_{ft} is the excess return on the portfolio being studied and R_{mt} − R_{ft} the
excess returns of a vector of benchmark portfolios (for instance, only the market portfolio
if we want to rely on CAPM; returns times conditional information if we want to allow
for time-variation in expected benchmark returns). Neutral performance (mean-variance
intersection, that is, that the tangency portfolio is unchanged and the two MV frontiers
intersect there) requires α = 0, which can be tested with a t test.

See Figure 7.8 for an illustration.
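A minimal sketch of how (7.21) could be estimated and the hypothesis α = 0 tested, using plain OLS with conventional standard errors and a single benchmark (the input arrays are hypothetical):

import numpy as np

def alpha_test(y, x):
    """OLS of portfolio excess returns y on a benchmark excess return x, as in (7.21).
    Returns alpha, beta and the t-statistic for alpha = 0 (classical std errors)."""
    y = np.asarray(y, float)
    X = np.column_stack([np.ones(len(y)), np.asarray(x, float)])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    u = y - X @ b
    s2 = u @ u / (len(y) - X.shape[1])           # residual variance
    cov = s2 * np.linalg.inv(X.T @ X)            # classical OLS covariance matrix
    return b[0], b[1], b[0] / np.sqrt(cov[0, 0])

# hypothetical use: alpha, beta, t_alpha = alpha_test(r_portfolio_excess, r_market_excess)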
Figure 7.8: Predictability of US stock returns, momentum strategy. [Excess return and
alpha plotted against the evaluation horizon (1–12 months); the "buy" ("sell") portfolio is
the 5 assets with the highest (lowest) return over the last 3 months; US data 1957:1–2009:9,
25 FF portfolios (B/M and size).]

7.4 Security Analysts

Reference: Makridakis, Wheelwright, and Hyndman (1998) 10.1 and Elton, Gruber,
Brown, and Goetzmann (2007) 19

7.4.1 Evidence on Analysts' Performance

Makridakis, Wheelwright, and Hyndman (1998) 10.1 shows that there is little evidence
that the average stock analyst beats (on average) the market (a passive index portfolio).
In fact, less than half of the analysts beat the market. However, there are analysts which
seem to outperform the market for some time, but the autocorrelation in over-performance
is weak. The evidence from mutual funds is similar. For them it is typically also found
that their portfolio weights do not anticipate price movements.

It should be remembered that many analysts also are sales persons: either of a stock
(for instance, since the bank is underwriting an offering) or of trading services. It could
well be that their objective function is quite different from minimizing the squared forecast
errors—or whatever we typically use in order to evaluate their performance. (The number
of litigations in the US after the technology boom/bust should serve as a strong reminder
of this.)

7.4.2 Do Security Analysts Overreact?

The paper by Bondt and Thaler (1990) compares the (semi-annual) forecasts (one- and
two-year time horizons) with actual changes in earnings per share (1976–1984) for several
hundred companies. The paper has regressions like

  actual change = α + β (forecasted change) + residual,

and then studies the estimates of the α and β coefficients. With rational expectations (and
a long enough sample), we should have α = 0 (no constant bias in forecasts) and β = 1
(proportionality, for instance no exaggeration).

The main result is that 0 < β < 1, so that the forecasted change tends to be too wild
in a systematic way: a forecasted change of 1% is (on average) followed by a less than
1% actual change in the same direction. This means that analysts in this sample tended
to be too extreme—to exaggerate both positive and negative news.

7.4.3 High-Frequency Trading Based on Recommendations from Stock Analysts

Barber, Lehavy, McNichols, and Trueman (2001) give a somewhat different picture.
They focus on the profitability of a trading strategy based on analysts' recommendations.
They use a huge data set (some 360,000 recommendations, US stocks) for the period
1985–1996. They sort stocks into five portfolios depending on the consensus (average)
recommendation—and redo the sorting every day (if a new recommendation is published).
They find that such a daily trading strategy gives an annual 4% abnormal return on the
portfolio of the most highly recommended stocks, and an annual −5% abnormal return on
the least favourably recommended stocks.

This strategy requires a lot of trading (a turnover of 400% annually), so trading costs
would typically reduce the abnormal return on the best portfolio to almost zero. A less
frequent rebalancing (weekly, monthly) gives a very small abnormal return for the best
stocks, but still a negative abnormal return for the worst stocks. Chance and Hemler
(2001) obtain similar results when studying the investment advice by 30 professional
"market timers."

7.4.4 The Characteristics of Individual Analysts' Forecasts in Europe

Bolliger (2001) studies the forecast accuracy (earnings per share) of European (13 countries)
analysts for the period 1988–1999. In all, some 100,000 forecasts are studied. It
is found that the forecast accuracy is positively related to how many times an analyst has
forecasted that firm and also (surprisingly) to how many firms he/she forecasts.
The accuracy is negatively related to the number of countries an analyst forecasts and also to the
size of the brokerage house he/she works for.

7.4.5 Bond Rating Agencies versus Stock Analysts

Ederington and Goh (1998) use data on all corporate bond rating changes by Moody's
between 1984 and 1990 and the corresponding earnings forecasts (by various stock analysts).

The idea of the paper by Ederington and Goh (1998) is to see if bond ratings drive
earnings forecasts (or vice versa), and if they affect stock returns (prices).

1. To see if stock returns are affected by rating changes, they first construct a "normal"
return by a market model:

  normal stock return_t = α + β × return on stock index_t,

where α and β are estimated on a normal time period (not including the rating
change). The abnormal return is then calculated as the actual return minus the
normal return. They then study how such abnormal returns behave, on average,
around the dates of rating changes. Note that "time" is then measured, individually
for each stock, as the distance from the day of rating change. The result is that there
are significant negative abnormal returns following downgrades, but zero abnormal
returns following upgrades.

2. They next turn to the question of whether bond ratings drive earnings forecasts or
vice versa. To do that, they first note that there are some predictable patterns in
revisions of earnings forecasts. They therefore fit a simple autoregressive model
of earnings forecasts, and construct a measure of earnings forecast revisions (surprises)
from the model. They then relate this surprise variable to the bond ratings.
In short, the results are the following:

  (a) both earnings forecasts and ratings react to the same information, but there is
  also a direct effect of rating changes, which differs between downgrades and
  upgrades.

  (b) downgrades: the ratings have a strong negative direct effect on the earnings
  forecasts; the returns react even quicker than analysts.

  (c) upgrades: the ratings have a small positive direct effect on the earnings
  forecasts; there is no effect on the returns.

A possible reason for why bond ratings could drive earnings forecasts and prices is
that bond rating firms typically have access to more inside information about firms than
stock analysts and investors.

A possible reason for the observed asymmetric response of returns to ratings is that
firms are quite happy to release positive news, but perhaps more reluctant to release bad
news. If so, then the information advantage of bond rating firms may be particularly large
after bad news. A downgrading would then reveal more new information than an upgrade.

The different reactions of the earnings forecasts and the returns are hard to reconcile.

7.4.6 International Differences in Analyst Forecast Properties

Ang and Ciccone (2001) study earnings forecasts for many firms in 42 countries over the
period 1988 to 1997. Some differences are found across countries: forecasters disagree
more and the forecast errors are larger in countries with low GDP growth, less accounting
disclosure, and less transparent family ownership structure.

However, the most robust finding is that forecasts for firms with losses are special:
forecasters disagree more, are more uncertain, and are more overoptimistic about such
firms.

7.4.7 Analysts and Industries

Boni and Womack (2006) study data on some 170,000 recommendations for a very
large number of U.S. companies for the period 1996–2002. Focusing on revisions of
recommendations, the paper shows that analysts are better at ranking firms within an
industry than ranking industries.

7.5 Technical Analysis

Main reference: Bodie, Kane, and Marcus (2002) 12.2; Neely (1997) (overview, foreign
exchange market)
Further reading: Murphy (1999) (practical, a believer's view); The Economist (1993)
(overview, the perspective of the early 1990s); Brock, Lakonishok, and LeBaron (1992)
(empirical, stock market); Lo, Mamaysky, and Wang (2000) (academic article on return
distributions for "technical portfolios")

7.5.1 General Idea of Technical Analysis

Technical analysis is typically a data mining exercise which looks for local trends or
systematic non-linear patterns. The basic idea is that markets are not instantaneously
efficient: prices react somewhat slowly and predictably to news. The logic is essentially
that an observed price move must be due to some news (exactly which one is not very
important) and that old patterns can tell us where the price will move in the near future.
This is an attempt to gather more detailed information than that used by the market as a
whole. In practice, technical analysis amounts to plotting different transformations
(for instance, a moving average) of prices—and to spotting known patterns. This section
summarizes some simple trading rules that are used.

7.5.2 Technical Analysis and Local Trends

Many trading rules rely on some kind of local trend which can be thought of as positive
autocorrelation in price movements (also called momentum¹).

A moving average rule is to buy if a short moving average (equally weighted or
exponentially weighted) goes above a long moving average. The idea is that this event
signals a new upward trend. Let S (L) be the lag order of a short (long) moving average,
with S < L, and let b be a bandwidth (perhaps 0.01). Then, an MA rule for period t could be

  buy in t       if MA_{t−1}(S) > MA_{t−1}(L)(1 + b)
  sell in t      if MA_{t−1}(S) < MA_{t−1}(L)(1 − b)        (7.22)
  no change      otherwise,

  where MA_{t−1}(S) = (p_{t−1} + ... + p_{t−S})/S.

The difference between the two moving averages is called an oscillator (or sometimes,
moving average convergence divergence²). A version of the moving average oscillator is
the relative strength index³, which is the ratio of the average price level on "up" days to
the average price on "down" days—during the last z (14 perhaps) days.

The trading range break-out rule typically amounts to buying when the price rises
above a previous peak (local maximum). The idea is that a previous peak is a resistance
level in the sense that some investors are willing to sell when the price reaches that value
(perhaps because they believe that prices cannot pass this level; clear risk of circular
reasoning or self-fulfilling prophecies; round numbers often play the role of resistance
levels). Once this artificial resistance level has been broken, the price can possibly rise
substantially. On the downside, a support level plays the same role: some investors are
willing to buy when the price reaches that value. To implement this, it is common to let
the resistance/support levels be proxied by the maximum and minimum values over a data
window of length S. With a bandwidth b (perhaps 0.01), the rule for period t could be

  buy in t       if P_t > M_{t−1}(1 + b)
  sell in t      if P_t < m_{t−1}(1 − b)        (7.23)
  no change      otherwise,

  where M_{t−1} = max(p_{t−1}, ..., p_{t−S}) and m_{t−1} = min(p_{t−1}, ..., p_{t−S}).

When the price is already trending up, then the trading range break-out rule may be
replaced by a channel rule, which works as follows. First, draw a trend line through
previous lows and a channel line through previous peaks. Extend these lines. If the price
moves above the channel (band) defined by these lines, then buy. A version of this is to
define the channel by a Bollinger band, which is ±2 standard deviations from a moving
data window around a moving average.

A head and shoulders pattern is a sequence of three peaks (left shoulder, head, right
shoulder), where the middle one (the head) is the highest, with two local lows in between
on approximately the same level (neck line). (Easier to draw than to explain in a thousand
words.) If the price subsequently goes below the neckline, then it is thought that a negative
trend has been initiated. (An inverse head and shoulders has the inverse pattern.)

Clearly, we can replace "buy" in the previous rules with something more aggressive,
for instance, replace a short position with a long.

¹ In physics, momentum equals the mass times speed.
² Yes, the rumour is true: the tribe of chartists is on the verge of developing their very own language.
³ Not to be confused with relative strength, which typically refers to the ratio of two different asset prices
(for instance, an equity compared to the market).
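A small sketch of the signal rules (7.22)–(7.23), assuming a plain price array and the bandwidth choices mentioned above; this only generates buy/sell/no-change signals, not portfolio returns:

import numpy as np

def ma_signals(p, S=3, L=25, b=0.01):
    """Moving-average crossing rule (7.22): +1 = buy, -1 = sell, 0 = no change.
    The signal for period t only uses prices up to t-1."""
    p = np.asarray(p, float)
    sig = np.zeros(p.size, dtype=int)
    for t in range(L, p.size):
        ma_S = p[t - S:t].mean()                  # MA_{t-1}(S)
        ma_L = p[t - L:t].mean()                  # MA_{t-1}(L)
        if ma_S > ma_L * (1 + b):
            sig[t] = 1
        elif ma_S < ma_L * (1 - b):
            sig[t] = -1
    return sig

def breakout_signals(p, S=25, b=0.01):
    """Trading range break-out rule (7.23), with max/min over the last S prices
    as proxies for the resistance and support levels."""
    p = np.asarray(p, float)
    sig = np.zeros(p.size, dtype=int)
    for t in range(S, p.size):
        M, m = p[t - S:t].max(), p[t - S:t].min()
        if p[t] > M * (1 + b):
            sig[t] = 1
        elif p[t] < m * (1 - b):
            sig[t] = -1
    return sig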
The trading volume is also often taken into account. If the trading volume of assets
with declining prices is high relative to the trading volume of assets with increasing prices,
then this is interpreted as a market with selling pressure. (The basic problem with this
interpretation is that there is a buyer for every seller, so we could equally well interpret
the situation as if there is buying pressure.)

7.5.3 Technical Analysis and Mean Reversion

If we instead believe in mean reversion of the prices, then we can essentially reverse
the previous trading rules: we would typically sell when the price is high. See Figures
7.9–7.10.

Some investors argue that markets show periods of mean reversion and then periods
with trends—and that both can be exploited. Clearly, the concept of support and resistance
levels (or more generally, a channel) is based on mean reversion between these points. A
new trend is then supposed to be initiated when the price breaks out of this band.

Figure 7.9: Examples of trading rules. [Inverted MA rule on the S&P 500: MA(3) and
MA(25) with bandwidth 0.01, plotted Jan–Apr 1999.]

Figure 7.10: Examples of trading rules. [Distribution of all daily returns (mean 0.03, std
1.17) and of returns after a buy signal (mean 0.06, std 1.73), a neutral signal (mean 0.03,
std 0.92) and a sell signal (mean 0.00, std 0.92) from the inverted MA rule.]

Figure 7.11: Examples of trading rules. [Cumulative performance of holding the SMI
index when MA(3) > MA(25), when P_t > max(P_{t−1},...,P_{t−5}), or when P_t/P_{t−7} > 1;
daily SMI data 1990–2010, weekly rebalancing between the index and the riskfree asset.]

7.6 Spurious Regressions and In-Sample Overfit

References: Ferson, Sarkissian, and Simin (2003), Goyal and Welch (2008), and Campbell
and Thompson (2008)

7.6.1 Spurious Regressions

Ferson, Sarkissian, and Simin (2003) argue that many prediction equations suffer from
"spurious regression" features—and that data mining tends to make things even worse.

Their simulation experiment is based on a simple model where the return predictions
are

  r_{t+1} = α + δ Z_t + v_{t+1},   (7.24)
where Z_t is a regressor (predictor). The true model is that returns follow the process

  r_{t+1} = μ + Z*_t + u_{t+1},   (7.25)

where the residual is white noise. In this equation, Z*_t represents movements in expected
returns. The predictors follow a diagonal VAR(1)

  [Z_t; Z*_t] = [ρ 0; 0 ρ*] [Z_{t−1}; Z*_{t−1}] + [ε_t; ε*_t], with Cov([ε_t; ε*_t]) = Σ.   (7.26)

In the case of a "pure spurious regression," the innovations to the predictors are uncorrelated
(Σ is diagonal). In this case, δ ought to be zero—and their simulations show that
the estimates are almost unbiased. Instead, there is a problem with the standard deviation
of the estimate of δ. If ρ* is high, then the returns will be autocorrelated.

Under the null hypothesis of δ = 0, this autocorrelation is loaded onto the residuals.
For that reason, the simulations use a Newey-West estimator of the covariance matrix
(with an automatic choice of lag order). This should, ideally, solve the problem with the
inference—but the simulations show that it doesn't: when Z_t is very autocorrelated (0.95
or higher) and reasonably important (so an R² from running (7.25), if we could, would be
0.05 or higher), then the 5% critical value (for a t-test of the hypothesis δ = 0) would be
2.7 (to be compared with the nominal value of 1.96). Since the point estimates are almost
unbiased, the interpretation is that the standard deviations are underestimated. In contrast,
with low autocorrelation and/or low importance of Z_t, the standard deviations are much
more in line with nominal values.

See Figures 7.12–7.13 for an illustration. They show that we need a combination of
autocorrelated residuals and an autocorrelated regressor to create a problem for the usual
LS formula for the standard deviation of a slope coefficient. When the autocorrelation is
very high, even the Newey-West estimator is likely to underestimate the true uncertainty.

Figure 7.12: Autocorrelation of x_t u_t when u_t has autocorrelation ρ. [Model: y_t = 0.9x_t + ε_t,
where ε_t = ρε_{t−1} + u_t, u_t is iid N(0,h) such that Std(ε_t) = 1, and x_t = κx_{t−1} + η_t;
b_LS is the LS estimate of b in y_t = a + bx_t + u_t; curves for κ = −0.9, 0 and 0.9.]

To study the interaction between spurious regressions and data mining, Ferson, Sarkissian,
and Simin (2003) let Z_t be chosen from a vector of L possible predictors—which all are
generated by a diagonal VAR(1) system as in (7.26) with uncorrelated errors. It is assumed
that the researchers choose Z_t by running L regressions, and then pick the one
with the highest R². When the autocorrelation (ρ*) is 0.15 and the researcher chooses
between L = 10 predictors, the simulated 5% critical value is 3.5. Since this does not
depend on the importance of Z_t, it is interpreted as a typical feature of "data mining,"
which is bad enough.

When the autocorrelation is 0.95, the importance of Z*_t starts to matter as well—
"spurious regressions" interact with the data mining to create extremely high simulated
critical values. A possible explanation is that the data mining exercise is likely to pick out
the most autocorrelated predictor, and that a highly autocorrelated predictor exacerbates
the spurious regression problem.
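A sketch of a Monte Carlo in the spirit of the model behind Figures 7.12–7.13 (not the authors' own code): an AR(1) regressor combined with AR(1) errors, recording the t-statistic of the slope based on the usual iid-based OLS standard error. Comparing the simulated 95th percentile of |t| with 1.96 shows how badly the nominal critical value can fail.

import numpy as np

def spurious_mc(T=200, rho=0.9, kappa=0.9, nsim=2000, seed=1):
    """Simulate y_t = 0.9 x_t + e_t with AR(1) errors (coefficient rho, Std(e_t) = 1)
    and an AR(1) regressor (coefficient kappa); return the t-stats of the slope
    around its true value, using the conventional OLS standard error."""
    rng = np.random.default_rng(seed)
    tstats = np.empty(nsim)
    for i in range(nsim):
        eta = rng.standard_normal(T)
        u = rng.standard_normal(T) * np.sqrt(1 - rho**2)   # so Std(e_t) = 1
        x = np.zeros(T)
        e = np.zeros(T)
        for t in range(1, T):
            x[t] = kappa * x[t - 1] + eta[t]
            e[t] = rho * e[t - 1] + u[t]
        y = 0.9 * x + e
        X = np.column_stack([np.ones(T), x])
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        res = y - X @ b
        s2 = res @ res / (T - 2)
        se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
        tstats[i] = (b[1] - 0.9) / se
    return tstats

# print(np.percentile(np.abs(spurious_mc()), 95))   # compare with 1.96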
Figure 7.13: Variance of OLS estimator, autocorrelated errors. [Std of the LS slope
estimate as a function of ρ for κ = −0.9, 0 and 0.9: the usual σ²(X'X)⁻¹ formula, the
Newey-West estimate, and the simulated (true) std.]

7.6.2 In-Sample versus Out-of-Sample Forecasting

Goyal and Welch (2008) find that the evidence of predictability of equity returns
disappears when out-of-sample forecasts are considered. Campbell and Thompson (2008)
claim that there is still some out-of-sample predictability, provided we put restrictions on
the estimated models.

Campbell and Thompson (2008) first report that only a few variables (earnings-price
ratio, T-bill rate and the inflation rate) have significant predictive power for one-month
stock returns in the full sample (1871–2003 or early 1920s–2003, depending on the predictor).

To gauge the out-of-sample predictability, they estimate the prediction equation using
data up to and including t−1, and then make a forecast for period t. The forecasting
performance of the equation is then compared with using the historical average as the
predictor. Notice that this historical average is also estimated on data up to and including
t−1, so it changes over time. Effectively, they are comparing the forecast performance
of two models estimated in a recursive way (long and longer sample): one model has just
an intercept, the other also has a predictor. The comparison is done in terms of the RMSE
and an "out-of-sample R²"

  R²_{OS} = 1 − Σ_{t=s}^{T} (r_t − r̂_t)² / Σ_{t=s}^{T} (r_t − r̄_t)²,   (7.27)

where s is the first period with an out-of-sample forecast, r̂_t is the forecast based on the
prediction model (estimated on data up to and including t−1) and r̄_t is the historical
average (also estimated on data up to and including t−1).

The evidence shows that the out-of-sample forecasting performance is very weak—as
claimed by Goyal and Welch (2008).

It is argued that forecasting equations can easily give strange results when they are
estimated on a small data set (as they are early in the sample). They therefore try different
restrictions: setting the slope coefficient to zero whenever the sign is "wrong," and setting
the prediction (or the historical average) to zero whenever the value is negative. This
improves the results a bit—although the predictive performance is still weak.

See Figure 7.14 for an illustration.
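The out-of-sample comparison in (7.27) can be sketched as follows, assuming a return series and a predictor observed one period earlier; both the prediction model and the historical mean are re-estimated recursively on data up to and including t−1:

import numpy as np

def oos_r2(r, z, s):
    """Out-of-sample R2 as in (7.27). r[t] is the period-t return, z[t] the predictor
    observed at the end of period t (so z[t-1] is used to forecast r[t])."""
    r, z = np.asarray(r, float), np.asarray(z, float)
    sse_model, sse_mean = 0.0, 0.0
    for t in range(s, r.size):
        X = np.column_stack([np.ones(t - 1), z[:t - 1]])   # z[0..t-2] explain r[1..t-1]
        b = np.linalg.lstsq(X, r[1:t], rcond=None)[0]
        r_hat = b[0] + b[1] * z[t - 1]                     # model forecast of r[t]
        r_bar = r[:t].mean()                               # recursive historical average
        sse_model += (r[t] - r_hat) ** 2
        sse_mean += (r[t] - r_bar) ** 2
    return 1.0 - sse_model / sse_mean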
Figure 7.14: Predictability of US stock returns, in-sample and out-of-sample. [RMSE of
an E/P regression and of max(E/P regression, 0) versus the recursive historical average
(MA), as a function of the length of the estimation window; US 1-year stock returns
1926:1–2008:12, out-of-sample forecasts made for 1957:1–2008:12; in-sample RMSE: 0.16.]

7.7 Empirical U.S. Evidence on Stock Return Predictability

The two most common methods for investigating the predictability of stock returns are
to calculate autocorrelations and to construct simple dynamic portfolios and see if they
outperform passive portfolios. The dynamic portfolio could, for instance, be a simple
filter rule that calls for rebalancing once a month by buying (selling) assets which have
increased (decreased) by more than x% the last month. If this portfolio outperforms a
passive portfolio, then this is evidence of some positive autocorrelation ("momentum")
on a one-month horizon. The following points summarize some evidence which seems to
hold for both returns and returns in excess of a riskfree rate (an interest rate).

1. The empirical evidence suggests some, but weak, positive autocorrelation in short
horizon returns (one day up to a month)—probably too little to trade on. The
autocorrelation is stronger for small than for large firms (perhaps no autocorrelation
at all for weekly or longer returns in large firms). This implies that equally
weighted stock indices have higher autocorrelations than value-weighted indices.
(See Campbell, Lo, and MacKinlay (1997) Table 2.4.)

2. Stock indices have more positive autocorrelation than (most) individual stocks:
there must be fairly strong cross-autocorrelations across individual stocks. (See
Campbell, Lo, and MacKinlay (1997) Tables 2.7 and 2.8.)

3. There seems to be negative autocorrelation of multi-year stock returns, for instance
in 5-year US returns for 1926–1985. It is unclear what drives this result, however.
It could well be an artifact of just a few extreme episodes (Great Depression).
Moreover, the estimates are very uncertain as there are very few (non-overlapping)
multi-year returns even in a long sample—the results could be just a fluke.

4. The aggregate stock market return, that is, the return on a value-weighted stock
index, seems to be forecastable on the medium horizon by various information
variables. In particular, future stock returns seem to be predictable by the current
dividend-price ratio and earnings-price ratio (positively, one to several years), or
by interest rate changes (negatively, up to a year). For instance, the coefficient
of determination (usually denoted R², but not to be confused with the return used
above) for predicting the two-year return on the US stock market by the current
dividend-price ratio is around 0.3 for the 1952–1994 sample. (See Campbell, Lo,
and MacKinlay (1997) Tables 7.1–2.) This evidence suggests that expected returns
may very well be time-varying and correlated with the business cycle.

5. Even if short-run returns, R_{t+1}, are fairly hard to forecast, it is often fairly easy
to forecast volatility as measured by |R_{t+1}| or R²_{t+1} (for instance, using ARCH
or GARCH models). For an example, see Bodie, Kane, and Marcus (2002) Figure 13.7.
This could possibly be used for dynamic trading strategies on options
(which directly price volatility). For instance, buying both a call and a put option (a
"straddle" or a "strangle") is a bet on a large price movement (in any direction).

6. It is sometimes found that stock prices behave differently in periods with high
volatility than in more normal periods. Granger (1992) reports that the forecasting
performance is sometimes improved by using different forecasting models for
these two regimes. A simple and straightforward way to estimate a model for periods
of normal volatility is to simply throw out data for volatile periods (and other
exceptional events).

7. It is important to assess forecasting models in terms of their out-of-sample forecasting
performance. Too many models seem to fit data in-sample, but most of them
fail in out-of-sample tests. Forecasting models are of no use if they cannot forecast.

8. There are also a number of strange patterns ("anomalies") like the small-firms-in-January
effect (high returns on these in the first part of January) and the book-to-market
effect (high returns on firms with high book/market value of the firm's equity).
Bibliography

Ang, J. S., and S. J. Ciccone, 2001, "International differences in analyst forecast properties,"
mimeo, Florida State University.

Barber, B., R. Lehavy, M. McNichols, and B. Trueman, 2001, "Can investors profit from
the prophets? Security analyst recommendations and stock returns," Journal of Finance,
56, 531–563.

Bodie, Z., A. Kane, and A. J. Marcus, 2002, Investments, McGraw-Hill/Irwin, Boston,
5th edn.

Bolliger, G., 2001, "The characteristics of individual analysts' forecasts in Europe,"
mimeo, University of Neuchatel.

Bondt, W. F. M. D., and R. H. Thaler, 1990, "Do security analysts overreact?," American
Economic Review, 80, 52–57.

Boni, L., and K. L. Womack, 2006, "Analysts, industries, and price momentum," Journal
of Financial and Quantitative Analysis, 41, 85–109.

Brock, W., J. Lakonishok, and B. LeBaron, 1992, "Simple technical trading rules and the
stochastic properties of stock returns," Journal of Finance, 47, 1731–1764.

Brockwell, P. J., and R. A. Davis, 1991, Time series: theory and methods, Springer Verlag,
New York, second edn.

Campbell, J. Y., A. W. Lo, and A. C. MacKinlay, 1997, The econometrics of financial
markets, Princeton University Press, Princeton, New Jersey.

Campbell, J. Y., and S. B. Thompson, 2008, "Predicting the equity premium out of sample:
can anything beat the historical average," Review of Financial Studies, 21, 1509–1531.

Chance, D. M., and M. L. Hemler, 2001, "The performance of professional market timers:
daily evidence from executed strategies," Journal of Financial Economics, 62, 377–411.

Cochrane, J. H., 2001, Asset pricing, Princeton University Press, Princeton, New Jersey.

Cuthbertson, K., 1996, Quantitative financial economics, Wiley, Chichester, England.

Ederington, L. H., and J. C. Goh, 1998, "Bond rating agencies and stock analysts: who
knows what when?," Journal of Financial and Quantitative Analysis, 33, 569–585.

Elton, E. J., M. J. Gruber, S. J. Brown, and W. N. Goetzmann, 2007, Modern portfolio
theory and investment analysis, John Wiley and Sons, 7th edn.

Ferson, W. E., S. Sarkissian, and T. T. Simin, 2003, "Spurious regressions in financial
economics," Journal of Finance, 57, 1393–1413.

Goyal, A., and I. Welch, 2008, "A comprehensive look at the empirical performance of
equity premium prediction," Review of Financial Studies, 21, 1455–1508.

Granger, C. W. J., 1992, "Forecasting stock market prices: lessons for forecasters,"
International Journal of Forecasting, 8, 3–13.

Huberman, G., and S. Kandel, 1987, "Mean-variance spanning," Journal of Finance, 42,
873–888.

Lo, A. W., H. Mamaysky, and J. Wang, 2000, "Foundations of technical analysis: computational
algorithms, statistical inference, and empirical implementation," Journal of Finance,
55, 1705–1765.

Makridakis, S., S. C. Wheelwright, and R. J. Hyndman, 1998, Forecasting: methods and
applications, Wiley, New York, 3rd edn.

Murphy, J. J., 1999, Technical analysis of the financial markets, New York Institute of
Finance.

Neely, C. J., 1997, "Technical analysis in the foreign exchange market: a layman's guide,"
Federal Reserve Bank of St. Louis Review.

Priestley, M. B., 1981, Spectral analysis and time series, Academic Press.

The Economist, 1993, "Frontiers of finance," pp. 5–20.
8 ARCH and GARCH

Reference: Bodie, Kane, and Marcus (2005) 13.4
Reference (advanced): Taylor (2005) 8–9; Verbeek (2004) 8; Campbell, Lo, and MacKinlay
(1997) 12; Franses and van Dijk (2000)

8.1 Heteroskedasticity

8.1.1 Descriptive Statistics of Heteroskedasticity

Time-variation in volatility (heteroskedasticity) is a common feature of macroeconomic
and financial data.

The perhaps most straightforward way to gauge heteroskedasticity is to estimate a
time-series of variances on "rolling samples." For a zero-mean variable, u_t, this could
mean

  σ²_t = (u²_{t−1} + u²_{t−2} + ... + u²_{t−q})/q,   (8.1)

where the latest q observations are used. Notice that σ²_t depends on lagged information,
and could therefore be thought of as the prediction (made in t−1) of the volatility in t.

See Figure 8.1 for an example.

Figure 8.1: Standard deviation for EUR/USD exchange rate changes. [Std across weekdays
and across hours of the day; 5-minute data on EUR/USD changes, 1998:1–2008:2;
sample size: 763488.]

Unfortunately, this method can produce quite abrupt changes in the estimate. An
alternative is to apply an exponential moving average (EMA) estimator of volatility, which
uses all data points since the beginning of the sample—but where recent observations
carry larger weights. The weight on u²_{t−s} is (1−λ)λ^{s−1} where 0 < λ < 1, so

  σ²_t = (1−λ)(u²_{t−1} + λu²_{t−2} + λ²u²_{t−3} + ...),   (8.2)

which can also be calculated in a recursive fashion as

  σ²_t = (1−λ)u²_{t−1} + λσ²_{t−1}.   (8.3)

The initial value (before the sample) could be assumed to be zero or (perhaps better) the
unconditional variance in a historical sample.

This method is commonly used by practitioners. For instance, RiskMetrics uses this
method with λ = 0.94 on daily data. Alternatively, λ can be chosen to minimize some
criterion function like Σ_{t=1}^{T} (u²_t − σ²_t)².

See Figure 8.2 for an example.
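A minimal sketch of the two volatility estimators (8.1) and (8.3), with illustrative function names; the RiskMetrics value λ = 0.94 is used as the default for the EMA:

import numpy as np

def rolling_var(u, q=20):
    """Rolling-sample variance (8.1): average of the q most recent squared values,
    so the estimate for period t uses u_{t-1},...,u_{t-q} only."""
    u2 = np.asarray(u, float) ** 2
    s2 = np.full(u2.size, np.nan)
    for t in range(q, u2.size):
        s2[t] = u2[t - q:t].mean()
    return s2

def ema_var(u, lam=0.94, s2_init=None):
    """EMA variance (8.3): s2_t = (1 - lam)*u_{t-1}^2 + lam*s2_{t-1},
    started from the full-sample variance unless another initial value is given."""
    u2 = np.asarray(u, float) ** 2
    s2 = np.empty(u2.size)
    s2[0] = u2.mean() if s2_init is None else s2_init
    for t in range(1, u2.size):
        s2[t] = (1 - lam) * u2[t - 1] + lam * s2[t - 1]
    return s2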
8.1.2 Heteroskedastic Residuals in a Regression

Suppose we have a regression model

  y_t = b_0 + x_{1t}b_1 + x_{2t}b_2 + ... + x_{kt}b_k + ε_t, where E ε_t = 0 and Cov(x_{it}, ε_t) = 0.   (8.4)

In the standard case we assume that ε_t is iid (independently and identically distributed),
which rules out heteroskedasticity.

In case the residuals actually are heteroskedastic, least squares (LS) is nevertheless a
useful estimator: it is still consistent (we get the correct values as the sample becomes
really large)—and it is reasonably efficient (in terms of the variance of the estimates).
However, the standard expression for the standard errors (of the coefficients) is (except in
a special case, see below) not correct. This is illustrated in Figure 8.4.

Figure 8.2: Conditional standard deviation, estimated by GARCH(1,1) model. [Annualized
GARCH std and EMA std (λ = 0.99 and λ = 0.9); S&P 500 daily data 1954:1–2009:6;
AR(1) of excess returns with GARCH(1,1) errors; AR(1) coef: 0.10; ARCH&GARCH
coefs: 0.08 0.92.]

Figure 8.3: Results for a univariate GARCH model. [Std of DAX from a GARCH(1,1)
on daily DAX returns 1991:1–2009:6; model σ²_t = α₀ + α₁u²_{t−1} + β₁σ²_{t−1}, u_t ∼ N(0, σ²_t),
with u_t the demeaned return; estimates (std err): α₀ 0.031 (0.005), α₁ 0.085 (0.008),
β₁ 0.898 (0.009).]

Figure 8.4: Variance of OLS estimator, heteroskedastic errors. [Std of the LS estimate of
b in y_t = a + bx_t + u_t as a function of α, when the true model is y_t = 0.9x_t + ε_t with
ε_t ∼ N(0, h_t) and h_t = 0.5exp(αx²_t): the usual σ²(X'X)⁻¹ formula, White's estimate,
and the simulated (true) std.]

There are two ways to handle this problem. First, we could use some other estimation
method than LS that incorporates the structure of the heteroskedasticity. For instance,
combining the regression model (8.4) with an ARCH structure of the residuals—and
estimating the whole thing with maximum likelihood (MLE)—is one way. As a by-product
we get the correct standard errors, provided, of course, the assumed distribution is
correct. Second, we could stick to OLS, but use another expression for the variance of the
coefficients: a "heteroskedasticity consistent covariance matrix," among which "White's
covariance matrix" is the most common.

To test for heteroskedasticity, we can use White's test of heteroskedasticity. The null
hypothesis is homoskedasticity, and the alternative hypothesis is the kind of heteroskedasticity
which can be explained by the levels, squares, and cross products of the regressors—
clearly a special form of heteroskedasticity. The reason for this specification is that if the
squared residual is uncorrelated with w_t, then the usual LS covariance matrix applies—
even if the residuals have some other sort of heteroskedasticity (this is the special case
mentioned before).

To implement White's test, let w_t be the squares and cross products of the regressors.
For instance, if the regressors include (1, x_{1t}, x_{2t}), then w_t is the vector
(1, x_{1t}, x_{2t}, x²_{1t}, x_{1t}x_{2t}, x²_{2t})—since (1, x_{1t}, x_{2t}) × 1 is (1, x_{1t}, x_{2t}) and 1 × 1 = 1.
The test is then to run a regression of squared fitted residuals on w_t

  ε̂²_t = w_t'γ + v_t,   (8.5)

and to test if all the slope coefficients (not the intercept) in γ are zero. (This can be done
by using the fact that TR² ∼ χ²_p, with p = dim(w_t) − 1.)
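A sketch of White's test as in (8.5), assuming the regressor matrix already contains a constant in its first column (so the products below automatically include the constant and the levels):

import numpy as np
from scipy import stats

def whites_test(y, X):
    """White's heteroskedasticity test: regress squared OLS residuals on the levels,
    squares and cross products of the regressors and use T*R^2 ~ chi2(p).
    X must include a constant as its first column."""
    y, X = np.asarray(y, float), np.asarray(X, float)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e2 = (y - X @ b) ** 2
    cols = [X[:, i] * X[:, j] for i in range(X.shape[1]) for j in range(i, X.shape[1])]
    W = np.column_stack(cols)              # contains 1, levels, squares, cross products
    g = np.linalg.lstsq(W, e2, rcond=None)[0]
    fit = W @ g
    R2 = 1 - np.sum((e2 - fit) ** 2) / np.sum((e2 - e2.mean()) ** 2)
    p = W.shape[1] - 1                     # number of slope coefficients
    TR2 = len(y) * R2
    return TR2, 1 - stats.chi2.cdf(TR2, p)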
8.1.3 Autoregressive Conditional Heteroskedasticity (ARCH)

Autoregressive heteroskedasticity is a special form of heteroskedasticity—and it is often
found in financial data which shows volatility clustering (calm spells, followed by volatile
spells, followed by...).

To test for ARCH features, Engle's test of ARCH is perhaps the most straightforward.
It amounts to running an AR(q) regression of the squared zero-mean variable (here
denoted u_t)

  u²_t = a_0 + a_1 u²_{t−1} + ... + a_q u²_{t−q} + v_t.   (8.6)

Under the null hypothesis of no ARCH effects, all slope coefficients are zero and the
R² of the regression is zero. (This can be tested by noting that, under the null hypothesis,
TR² ∼ χ²_q.) This test can also be applied to the fitted residuals from a regression like (8.4).
However, in this case, it is not obvious that ARCH effects make the standard expression
for the LS covariance matrix invalid—this is tested by White's test as in (8.5).
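Engle's test in (8.6) can be sketched as follows: an AR(q) regression of the squared series, with TR² compared to a χ²_q distribution (illustrative function name):

import numpy as np
from scipy import stats

def arch_test(u, q=5):
    """Engle's LM test for ARCH: regress u_t^2 on q of its own lags (8.6);
    under the null of no ARCH, T*R^2 is chi2 with q degrees of freedom."""
    u2 = np.asarray(u, float) ** 2
    T = u2.size - q
    y = u2[q:]
    X = np.column_stack([np.ones(T)] + [u2[q - s:-s] for s in range(1, q + 1)])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    R2 = 1 - e @ e / np.sum((y - y.mean()) ** 2)
    TR2 = T * R2
    return TR2, 1 - stats.chi2.cdf(TR2, q)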

8.2 ARCH Models

This section discusses the Autoregressive Conditional Heteroskedasticity (ARCH) model.
It is a model of how volatility depends on recent volatility.

There are two basic reasons for being interested in an ARCH model. First, if residuals
of the regression model (8.4) have ARCH features, then an ARCH model (that is, a
specification of exactly how the ARCH features are generated) can help us estimate the
regression model by maximum likelihood. Second, we may be interested in understanding
the ARCH features more carefully, for instance, as an input in a portfolio choice process
or option pricing.

8.2.1 Properties of ARCH(1)

In the ARCH(1) model the residual in the regression equation (8.4), or some other zero-mean
variable, can be written

  u_t ∼ N(0, σ²_t), with   (8.7)
  σ²_t = α₀ + α₁u²_{t−1}, where α₀ > 0 and 0 ≤ α₁ < 1.   (8.8)

The non-negativity restrictions on α₀ and α₁ are needed in order to guarantee σ²_t > 0.
The upper bound α₁ < 1 is needed in order to make the conditional variance stationary
(more later).

It is clear that the unconditional distribution of u_t is non-normal. While the conditional
distribution of u_t is N(0, σ²_t), the unconditional distribution of u_t is a mixture of
normal distributions with different (and random) variances. It can be shown that the result
is a distribution which has fatter tails than a normal distribution with the same variance
(excess kurtosis)—which is a common feature of financial data.

It is straightforward to show that the ARCH(1) model implies that we in period t can
forecast the future conditional variance (σ²_{t+s}) as

  E_t σ²_{t+s} = α₀/(1−α₁) + α₁^{s−1} [σ²_{t+1} − α₀/(1−α₁)] for s = 1, 2, ...   (8.9)

Notice that σ²_{t+1} is known in t. The conditional volatility behaves like an AR(1), and
0 ≤ α₁ < 1 is necessary to keep it positive and stationary.

See Figure 8.5 for an illustration of the fitted volatilities.

Figure 8.5: ARCH and GARCH estimates. [Annualized ARCH and GARCH std; S&P 500
daily data 1954:1–2009:6, AR(1) of excess returns with ARCH(1) or GARCH(1,1) errors;
AR(1) coef: 0.10; ARCH coef: 0.34; GARCH coefs: 0.08 0.92.]
8.2.2 Estimation of the ARCH(1) Model

The most common way to estimate the model is to assume that v_t ∼ iid N(0,1) and to
set up the likelihood function. The log likelihood is easily found, since the model is
conditionally Gaussian. It is

  ln L = Σ_{t=1}^{T} L_t, where L_t = −(1/2) ln(2π) − (1/2) ln σ²_t − (1/2) u²_t/σ²_t.   (8.10)

The estimates are found by maximizing the likelihood function (by choosing the parameters).
This is done by a numerical optimization routine, which should preferably impose
the constraints in (8.8).

If u_t is just a zero-mean variable (no regression equation), then this just amounts to
choosing the parameters (α₀ and α₁) in (8.8). Instead, if u_t is a residual from a regression
equation (8.4), then we need to choose both the regression coefficients (b_0,...,b_k)
in (8.4) and the parameters (α₀ and α₁) in (8.8). In either case, we need a starting value
of σ²_1 = α₀ + α₁u²_0. The most common approach is to use the first observation as a
"starting point," that is, we actually have a sample from (t =) 0 to T, but observation 0 is
only used to construct a starting value of σ²_1, and only observations 1 to T are used in the
calculation of the likelihood function value.

Remark 8.1 (Regression with ARCH(1) residuals) To estimate the full model (8.4) and
(8.8) by ML, we can do as follows.
First, guess values of the parameters b_0,...,b_k, α₀, and α₁. The guess of b_0,...,b_k can
be taken from an LS estimation of (8.4), and the guess of α₀ and α₁ from an LS estimation
of ε̂²_t = α₀ + α₁ε̂²_{t−1} + ε_t, where ε̂_t are the fitted residuals from the LS estimation of (8.4).
Second, loop over the sample (first t = 1, then t = 2, etc.) and calculate u_t = ε̂_t from
(8.4) and σ²_t from (8.8). Plug these numbers into (8.10) to find the likelihood value.
Third, make better guesses of the parameters and do the second step again. Repeat until
the likelihood value converges (at a maximum).

Remark 8.2 (Imposing parameter constraints on ARCH(1)) To impose the restrictions
in (8.8) when the previous remark is implemented, iterate over values of (b, α̃₀, α̃₁) and
let α₀ = α̃₀² and α₁ = exp(α̃₁)/[1 + exp(α̃₁)].

It is sometimes found that the standardized values of u_t, u_t/σ_t, still have too fat tails
compared with N(0,1). This would violate the assumption about a normal distribution in
(8.10). Estimation using other likelihood functions, for instance, for a t-distribution, can
then be used. Or the estimation can be interpreted as a quasi-ML (it is typically consistent,
but requires a different calculation of the covariance matrix of the parameters).

It is straightforward to add more lags to (8.8). For instance, an ARCH(p) would be

  σ²_t = α₀ + α₁u²_{t−1} + ... + α_p u²_{t−p}.   (8.11)

The form of the likelihood function is the same, except that we now need p starting values
and that the upper boundary constraint should now be Σ_{j=1}^{p} α_j ≤ 1.

8.3 GARCH Models

Instead of specifying an ARCH model with many lags, it is typically more convenient to
specify a low-order GARCH (Generalized ARCH) model. The GARCH(1,1) is a simple
and surprisingly general model, where the volatility follows

  σ²_t = α₀ + α₁u²_{t−1} + β₁σ²_{t−1}, with α₀ > 0; α₁, β₁ ≥ 0; and α₁ + β₁ < 1.   (8.12)

The non-negativity restrictions are needed in order to guarantee that σ²_t > 0 in all
periods. The upper bound α₁ + β₁ < 1 is needed in order to make σ²_t stationary and
therefore the unconditional variance finite.

Remark 8.3 The GARCH(1,1) has many similarities with the exponential moving average
estimator of volatility (8.3). The main differences are that the exponential moving average
does not have a constant and that its volatility is non-stationary (the coefficients sum to unity).

The GARCH(1,1) corresponds to an ARCH(∞) with geometrically declining weights,
which suggests that a GARCH(1,1) might be a reasonable approximation of a high-order
ARCH. Similarly, the GARCH(1,1) model implies that we in period t can forecast the
future conditional variance (σ²_{t+s}) as

  E_t σ²_{t+s} = α₀/(1−α₁−β₁) + (α₁+β₁)^{s−1} [σ²_{t+1} − α₀/(1−α₁−β₁)],   (8.13)

which is of the same form as for the ARCH model (8.9), but where the sum of α₁ and β₁
is like an AR(1) parameter.

To estimate the model consisting of (8.4) and (8.12) we can still use the likelihood
function (8.10) and do MLE (but we now have to choose a value of β₁ as well). We
typically create the starting value of u²_0 as in the ARCH(1) model, but this time we also
need a starting value of σ²_0. It is often recommended to use σ²_0 = Var(u_t).

Remark 8.4 (Imposing parameter constraints on GARCH(1,1)) To impose the restrictions
in (8.12), iterate over values of (b, α̃₀, α̃₁, β̃₁) and let α₀ = α̃₀², α₁ = exp(α̃₁)/[1 +
exp(α̃₁) + exp(β̃₁)], and β₁ = exp(β̃₁)/[1 + exp(α̃₁) + exp(β̃₁)].

See Figure 8.6 for evidence of how the residuals become more normally distributed
once the heteroskedasticity is handled.

Figure 8.6: QQ-plot of residuals. [QQ plots of AR(1) residuals and of normalized
AR(1)+GARCH residuals against an estimated N(μ,σ²); S&P 500 daily returns
1954:1–2009:6.]
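A sketch of Gaussian MLE of a GARCH(1,1) for a zero-mean (or demeaned) series, combining the likelihood (8.10) with the variance recursion (8.12) and the forecast (8.13). Simple parameter bounds are used instead of the transformation in Remark 8.4, and the stationarity restriction α₁ + β₁ < 1 is not enforced; this is only a sketch, not a full implementation:

import numpy as np
from scipy.optimize import minimize

def garch11_negloglik(par, u):
    """Minus the log likelihood (8.10) with the GARCH(1,1) variance (8.12);
    par = (alpha0, alpha1, beta1)."""
    a0, a1, b1 = par
    s2 = np.empty(u.size)
    s2[0] = u.var()                              # start from the unconditional variance
    for t in range(1, u.size):
        s2[t] = a0 + a1 * u[t - 1] ** 2 + b1 * s2[t - 1]
    return 0.5 * np.sum(np.log(2 * np.pi) + np.log(s2) + u ** 2 / s2)

def fit_garch11(u):
    """Numerical MLE; bounds keep the parameters non-negative (but do not impose
    alpha1 + beta1 < 1, which Remark 8.4 would guarantee)."""
    u = np.asarray(u, float)
    x0 = np.array([0.05 * u.var(), 0.1, 0.85])
    res = minimize(garch11_negloglik, x0, args=(u,), method="L-BFGS-B",
                   bounds=[(1e-8, None), (0.0, 1.0), (0.0, 1.0)])
    return res.x

def garch11_forecast(par, s2_next, s):
    """Forecast E_t sigma^2_{t+s} from (8.13), given sigma^2_{t+1}."""
    a0, a1, b1 = par
    s2_bar = a0 / (1 - a1 - b1)
    return s2_bar + (a1 + b1) ** (s - 1) * (s2_next - s2_bar)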
Remark 8.5 (Value at Risk) The value at risk (as a fraction of the investment) at the α
level (say, α = 0.95) is VaR_α = −cdf⁻¹(1−α), where cdf⁻¹() is the inverse of the cdf—so
cdf⁻¹(1−α) is the 1−α quantile of the return distribution. For instance, VaR_{0.95} = 0.08
says that there is only a 5% chance that the loss will be greater than 8% of the investment.
See Figure 8.7 for an illustration. When the return has an N(μ, σ²) distribution, then
VaR_{95%} = −(μ − 1.64σ). See Figure 8.8 for an example of time-varying VaR, based on a
GARCH model.

Figure 8.7: Value at risk. [Density of returns with VaR_{95%} = −(the 5% quantile) marked.]

Figure 8.8: Conditional volatility and VaR. [One-day 95% VaR and GARCH std, in
percent; S&P 500 daily data 1954:1–2009:6; also the unconditional normal fit of the
return distribution.]

8.4 Non-Linear Extensions

A very large number of extensions have been suggested. I summarize a few of them,
which can be estimated by using the likelihood function (8.10) to do a MLE.

An asymmetric GARCH (Glosten, Jagannathan, and Runkle (1993)) can be constructed as

  σ²_t = α₀ + α₁u²_{t−1} + β₁σ²_{t−1} + γ δ(u_{t−1} > 0) u²_{t−1}, where δ(q) = 1 if q is true and 0 otherwise.   (8.14)

This means that the effect of the shock u²_{t−1} is α₁ if the shock was negative and α₁ + γ
if the shock was positive. With γ < 0, volatility increases more in response to a negative
u_{t−1} ("bad news") than to a positive u_{t−1}.

The EGARCH (exponential GARCH, Nelson (1991)) sets

  ln σ²_t = α₀ + α₁ |u_{t−1}|/σ_{t−1} + β₁ ln σ²_{t−1} + γ u_{t−1}/σ_{t−1}.   (8.15)

Apart from being written in terms of the log (which is a smart trick to make σ²_t > 0 hold
without any restrictions on the parameters), this is an asymmetric model. The |u_{t−1}| term
is symmetric: both negative and positive values of u_{t−1} affect the volatility in the same
way.
The linear term in u_{t−1} modifies this to make the effect asymmetric. In particular,
if γ < 0, then the volatility increases more in response to a negative u_{t−1} ("bad news")
than to a positive u_{t−1}.

Hentschel (1995) estimates several models of this type, as well as a very general
formulation, on daily stock index data for 1926 to 1990 (some 17,000 observations). Most
standard models are rejected in favour of a model where σ_t depends on σ_{t−1} and
|u_{t−1} − b|^{3/2}.

8.5 (G)ARCH-M

It can make sense to let the conditional volatility enter the mean equation—for instance,
as a proxy for risk which may influence the expected return.

We modify the "mean equation" (8.4) to include the conditional variance σ²_t (taken
from any of the models for heteroskedasticity) as a regressor

  y_t = x_t'b + φσ²_t + u_t.   (8.16)

Note that σ²_t is predetermined, since it is a function of information in t−1. This model
can be estimated by using the likelihood function (8.10) to do MLE.

Remark 8.6 (Coding of (G)ARCH-M) We can use the same approach as in Remark 8.1,
except that we use (8.16) instead of (8.4) to calculate the residuals (and that we obviously
also need a guess of φ).

Example 8.7 (Theoretical motivation of GARCH-M) A mean variance investor solves

  max_α E R_p − σ²_p k/2, subject to
  R_p = αR_m + (1−α)R_f,

where R_m is the return on the risky asset (the market index) and R_f is the riskfree return.
The solution is

  α = (1/k) E(R_m − R_f)/σ²_m.

In equilibrium, this weight is one (since the net supply of bonds is zero), so we get

  E(R_m − R_f) = kσ²_m,

which says that the expected excess return is increasing in both the market volatility (σ²_m)
and risk aversion (k).

8.6 Multivariate (G)ARCH

8.6.1 Different Multivariate Models

This section gives a brief summary of some multivariate models of heteroskedasticity.
Suppose u_t is an n × 1 vector. For instance, u_t could be the residuals from n different
regressions or just n different demeaned return series.

We define the conditional (on the information set in t−1) covariance matrix of u_t as

  Σ_t = E_{t−1} u_t u_t'.   (8.17)
Remark 8.8 (The vech operator) vech(A) of a matrix A gives a vector with the elements
on and below the principal diagonal of A stacked on top of each other (column wise). For
instance, vech([a_11 a_12; a_21 a_22]) = [a_11; a_21; a_22].

It may seem as if a multivariate (matrix) version of the GARCH(1,1) model would
be simple, but it is not. The reason is that it would contain far too many parameters.
Although we only need to care about the unique elements of Σ_t, that is, vech(Σ_t), this
still gives very many parameters

  vech(Σ_t) = C + A vech(u_{t−1}u_{t−1}') + B vech(Σ_{t−1}).   (8.18)

For instance, with n = 2 we have

  [σ_{11,t}; σ_{21,t}; σ_{22,t}] = C + A [u²_{1,t−1}; u_{1,t−1}u_{2,t−1}; u²_{2,t−1}] + B [σ_{11,t−1}; σ_{21,t−1}; σ_{22,t−1}],   (8.19)

where C is 3 × 1, A is 3 × 3, and B is 3 × 3. This gives 21 parameters, which is already
hard to manage. We have to limit the number of parameters. We also have to find a
way to impose restrictions so Σ_t is positive definite (compare the restrictions of positive
coefficients in (8.12)).

The Diagonal Model

The diagonal model assumes that A and B are diagonal. This means that every element
of Σ_t follows a univariate process. With n = 2 we have

  [σ_{11,t}; σ_{21,t}; σ_{22,t}] = [c_1; c_2; c_3] + diag(a_1, a_2, a_3) [u²_{1,t−1}; u_{1,t−1}u_{2,t−1}; u²_{2,t−1}]
                                 + diag(b_1, b_2, b_3) [σ_{11,t−1}; σ_{21,t−1}; σ_{22,t−1}],   (8.20)

which gives 3 + 3 + 3 = 9 parameters (in C, A, and B, respectively). To make sure that
Σ_t is positive definite we have to impose further restrictions. The obvious drawback of
this model is that there is no spillover of volatility from one variable to another.

The Constant Correlation Model

The constant correlation model assumes that every variance follows a univariate GARCH
process and that the conditional correlations are constant. With n = 2 the covariance
matrix is

  [σ_{11,t} σ_{12,t}; σ_{12,t} σ_{22,t}] = [√σ_{11,t} 0; 0 √σ_{22,t}] [1 ρ_{12}; ρ_{12} 1] [√σ_{11,t} 0; 0 √σ_{22,t}],   (8.21)

and each of σ_{11,t} and σ_{22,t} follows a GARCH process. Assuming a GARCH(1,1) as in
(8.12) gives 7 parameters (2 × 3 GARCH parameters and one correlation), which is
convenient. The price is, of course, the assumption of no movements in the correlations. To
get a positive definite Σ_t, each individual GARCH model must generate a positive variance
(same restrictions as before), and all the estimated (constant) correlations must be
between −1 and 1.

Remark 8.9 (Estimating the constant correlation model) A quick (and dirty) method for
estimating is to first estimate the individual GARCH processes and then estimate the
correlation of the standardized residuals u_{1t}/√σ_{11,t} and u_{2t}/√σ_{22,t}.

By also specifying how the correlation can change over time, we get a dynamic
correlation model. It is slightly harder to estimate.

Figure 8.9: Results for multivariate GARCH models. [GARCH(1,1) std of DAX and FTSE
(annualized) and the correlation of FTSE 100 and DAX 30 from constant correlation (CC)
and dynamic (DCC) models; daily data 1991:1–2009:6, GARCH(1,1) of demeaned log
index changes.]
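A sketch of the quick method in Remark 8.9, assuming the two conditional variance series have already been fitted (for instance, by univariate GARCH models as in the estimation sketch after (8.13)):

import numpy as np

def constant_correlation(u1, u2, s2_1, s2_2):
    """Estimate the constant correlation by standardizing each series with its own
    fitted conditional std and correlating the standardized residuals."""
    z1 = np.asarray(u1, float) / np.sqrt(np.asarray(s2_1, float))
    z2 = np.asarray(u2, float) / np.sqrt(np.asarray(s2_2, float))
    return np.corrcoef(z1, z2)[0, 1]

# the implied conditional covariance in (8.21) is then rho12 * sqrt(s2_1[t] * s2_2[t])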


Bibliography

Bodie, Z., A. Kane, and A. J. Marcus, 2005, Investments, McGraw-Hill, Boston, 6th edn.

Campbell, J. Y., A. W. Lo, and A. C. MacKinlay, 1997, The econometrics of financial
markets, Princeton University Press, Princeton, New Jersey.

Franses, P. H., and D. van Dijk, 2000, Non-linear time series models in empirical finance,
Cambridge University Press.

Glosten, L. R., R. Jagannathan, and D. Runkle, 1993, "On the relation between the expected
value and the volatility of the nominal excess return on stocks," Journal of Finance,
48, 1779–1801.
Hentschel, L., 1995, "All in the family: nesting symmetric and asymmetric GARCH
models," Journal of Financial Economics, 39, 71–104.

Nelson, D. B., 1991, "Conditional heteroskedasticity in asset returns," Econometrica, 59,
347–370.

Taylor, S. J., 2005, Asset price dynamics, volatility, and prediction, Princeton University
Press.

Verbeek, M., 2004, A guide to modern econometrics, Wiley, Chichester, 2nd edn.

9 Risk Measures

Reference: Hull (2006) 18; McDonald (2006) 25; Fabozzi, Focardi, and Kolm (2006)
4–5; McNeil, Frey, and Embrechts (2005)

9.1 Symmetric Dispersion Measures

9.1.1 Mean Absolute Deviation

The variance (and standard deviation) is very sensitive to the tails of the distribution.
For instance, even if the standard normal distribution and a student-t distribution with
4 degrees of freedom look fairly similar, the latter has a variance that is twice as large
(recall: the variance of a t_n distribution is n/(n−2) for n > 2). This may or may not be
what the investor cares about. If not, the mean absolute deviation is an alternative. Let μ
be the mean; then the definition is

  mean absolute deviation = E|R − μ|.   (9.1)

This measure of dispersion is much less sensitive to the tails—essentially because it does
not involve squaring the variable.

Notice, however, that for a normally distributed return the mean absolute deviation
is proportional to the standard deviation—see Remark 9.1. Both measures will therefore
lead to the same portfolio choice (for a given mean return). In other cases, the portfolio
choice will be different (and perhaps complicated to perform since it is typically not easy
to calculate the mean absolute deviation of a portfolio).

Remark 9.1 (Mean absolute deviation of N(μ,σ²) and t_n) If R ∼ N(μ,σ²), then

  E|R − μ| = √(2/π) σ ≈ 0.8σ.

If R ∼ t_n, then E|R| = 2√n/[(n−1)B(n/2, 0.5)], where B is the beta function. For
n = 4, E|R| = 1, which is just 25% higher than for a N(0,1) distribution. In contrast,
the standard deviation is √2, which is 41% higher than for the N(0,1).
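A quick numerical check of Remark 9.1 by simulation (purely illustrative, not part of the notes):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)                 # N(0,1) draws
print(np.abs(x).mean(), np.sqrt(2 / np.pi))        # both approx 0.80

y = rng.standard_t(4, size=1_000_000)              # t with 4 degrees of freedom
print(np.abs(y).mean(), y.std())                   # approx 1.0 and sqrt(2) = 1.41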
9.1.2 Index Tracking Errors

Suppose instead that our task, as fund managers, say, is to track a benchmark portfolio
(returns R_b and portfolio weights w_b)—but we are allowed to make some deviations. For
instance, we are perhaps asked to track a certain index. The deviations, typically measured
in terms of the variance of the tracking errors for the returns, can be motivated by practical
considerations and by concerns about trading costs. If our portfolio has the weights w,
then the portfolio return is R_p = w'R, where R are the original assets. Similarly, the
benchmark portfolio (index) has the return R_b = w_b'R. If the variance of the tracking
error should be less than U, then we have the restriction

  Var(R_p − R_b) = (w − w_b)'Σ(w − w_b) ≤ U,   (9.2)

where Σ is the covariance matrix of the original assets. This type of restriction is fairly
easy to implement numerically in the portfolio choice model (the optimization problem).

9.2 Downside Risk

9.2.1 Value at Risk

The mean-variance framework is often criticized for failing to distinguish between downside
(considered to be risk) and upside (considered to be potential).

The 95% Value at Risk (VaR_{95%}) says that there is only a 5% chance that the return
(R) will be less than −VaR_{95%}

  Pr(R ≤ −VaR_α) = 1 − α.   (9.3)

See Figure 9.1.

Figure 9.1: Value at risk. [Density of returns with VaR_{95%} = −(the 5% quantile) marked.]

Example 9.2 (Quantile of a distribution) The 0.05 quantile is the value such that there is
only a 5% probability of a lower number, Pr(R ≤ quantile_{0.05}) = 0.05.

We can solve this expression for the VaR_α as

  VaR_α = −cdf_R⁻¹(1−α),   (9.4)

where cdf_R⁻¹() is the inverse cumulative distribution function of the returns, so cdf_R⁻¹(1−α)
is the 1−α quantile (or "critical value") of the return distribution. For instance, VaR_{95%}
is the negative of the 0.05 quantile of the return distribution. Notice that the return
distribution depends on the investment horizon, so a value at risk measure is typically
calculated for a stated investment period (for instance, one day).

If the return is normally distributed, R ∼ N(μ, σ²), and c_{1−α} is the 1−α quantile of
a N(0,1) distribution (for instance, −1.64 for 1−α = 1−0.95), then

  VaR_α = −(μ + c_{1−α}σ).   (9.5)

This is illustrated in Figure 9.2.

Notice that the value at risk for a normally distributed return is a strictly increasing
function of the standard deviation (and the variance). Minimizing the VaR at a given mean
return therefore gives the same solution (portfolio weights) as minimizing the variance at
the same given mean return. In other cases, the portfolio choice will be different (and
perhaps complicated to perform).
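A minimal sketch of (9.5), the VaR of a normally distributed return (the printed numbers correspond to μ = 8% and σ = 16%):

import numpy as np
from scipy import stats

def var_normal(mu, sigma, alpha=0.95):
    """Value at risk (9.5) for R ~ N(mu, sigma^2): VaR = -(mu + c_{1-alpha}*sigma)."""
    c = stats.norm.ppf(1 - alpha)          # e.g. approx -1.64 for alpha = 0.95
    return -(mu + c * sigma)

print(var_normal(0.08, 0.16, 0.95))        # approx 0.18
print(var_normal(0.08, 0.16, 0.975))       # approx 0.24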
Remark 9.3 (Critical values of N(μ,σ²)) If R ∼ N(μ,σ²), then there is a 5% probability
that R ≤ μ − 1.64σ, a 2.5% probability that R ≤ μ − 1.96σ, and a 1% probability
that R ≤ μ − 2.33σ.

Figure 9.2: Finding the critical value of a N(μ,σ²) distribution. [Density of N(0,1) with
the 5% quantile c = −1.64; density, cdf and inverse cdf of N(8,16²) with the 5% quantile
μ + cσ = −18.]

Example 9.4 (VaR with R ∼ N(μ,σ²)) If μ = 8% and σ = 16%, then VaR_{95%} =
−(0.08 − 1.64 × 0.16) ≈ 0.18; we are 95% sure that we will not lose more than 18% of the
investment, that is, VaR_{95%} = 0.18. Similarly, VaR_{97.5%} = −(0.08 − 1.96 × 0.16) ≈ 0.24.

Example 9.5 (VaR and regulation of bank capital) Bank regulations have used 3 times
the 99% VaR for 10-day returns as the required bank capital.

Remark 9.6 (Multi-period VaR) If the returns are iid, then a q-period return has the
mean qμ and variance qσ², where μ and σ² are the mean and variance of the one-period
returns respectively. If the mean is zero, then the q-day VaR is √q times the one-day VaR.

Figure 9.3: Conditional volatility and VaR. [One-day 95% VaR and GARCH std, in
percent; S&P 500 daily data 1954:1–2009:6; also the unconditional normal fit of the
return distribution.]

Remark 9.7 (VaR from a t-distribution) The assumption of normally distributed returns
rules out thick tails. As an alternative, suppose the normalized return has a t-distribution
with v degrees of freedom

  (R − μ)/s ∼ t_v.

Notice that s² is not the variance of R, since Var(R) = vs²/(v−2) (assuming v > 2,
so the variance is defined). In this case, (9.5) still holds, but with c_{1−α} calculated as
the 1−α quantile of a t_v distribution. In practice, for a given value of Var(R), the t
distribution gives a smaller value of the VaR than the normal distribution. The reason is
that the variance of a t-distribution is very high for low degrees of freedom.
property, the return distributions must be heavily skewed.) VaR and expected shortfall, R ~ N(0.08,0.162)
Figures 9.3–9.4 illustrate the VaR calculated from a time series model (to be precise, 3
VaR95% = −(µ−1.64σ) = 18%
−ES95% −VaR95%
a AR(1)+GARCH(1,1) model) for daily S&P returns. ES95% = −µ+σφ(−1.64)/0.05 = 25%
2.5

Backtesting VaR from GARCH(1,1), daily S&P 500 returns 2


0.1

0.09 1.5

0.08 1
Empirical Prob(R<VaR)

0.07
0.5
0.06
0
0.05 −40 0 40
R, %
0.04

0.03
Figure 9.5: Value at risk and expected shortfall
0.02
For a normally distributed return R  N.;  2 / we have
0.01 Daily S&P 500 returns, 1954:1−2009:6

.c1 ˛ /
0
0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 ES˛ D C ; (9.7)
1 ˛
Theoretical Prob(R<VaR)
where ./ is the pdf or a N.0; 1/ variable and where c1 ˛ is the 1 ˛ quantile of a N(0,1)
distribution (for instance, 1:64 for 1 ˛ D 0:05).
Figure 9.4: Backtesting VaR from a GARCH model, assuming normally distributed
Proof. (of (9.7)) If x  N.;  2 /, then E.xjx  b/ D  .b0 /=˚.b0 / where
shocks
b0 D .b /= and where ./ and ˚./ are the pdf and cdf of a N.0; 1/ variable
respectively. To apply this, use b D VaR˛ so b0 D c1 ˛ . Clearly, ˚.c1 ˛ / D 1 ˛ (by
9.2.2 Expected Shortfall definition of the 1 ˛ quantile). Multiply by 1.

The expected shortfall (also called conditional VaR) is the expected loss when the return Example 9.8 (ES) If  D 8% and  D 16%, the 95% expected shortfall is ES95% D
actually is below the VaR˛ , that is, 0:08 C 0:16. 1:64/=0:05  0:25 and the 97.5% expected shortfall is ES97:5% D
0:08 C 0:16. 1:96/=0:025  0:29.
ES˛ D E.RjR  VaR˛ /: (9.6)
Notice that the expected shortfall for a normally distributed return (9.7) is a strictly
This might be more informative than the VaR˛ , which is the minimum loss that will happen increasing function of the standard deviation (and the variance). Minimizing the expected
with a 1 ˛ probability. shortfall at a given mean return therefore gives the same solution (portfolio weights) as
minimizing the variance at the same given mean return. In other cases, the portfolio
choice will be different (and perhaps complicated to perform).

170 171
Probability density function (pdf) Contribution to variance In comparison with a variance
3 2 2
N(µ,σ ) pdf(x)(x−µ)
µ = 0.08 0.04 Var(x)=area p2 D E.Rp p /2 ; (9.9)
2
σ = 0.16
the target semivariance differs on two accounts: (i) it uses the target level h as a reference
0.02
1
point instead of the mean p : and (ii) only negative deviations from the reference point
0 0
are given any weight. See Figure 9.6 for an illustration (based on a normally distributed
−60 −40 −20 0 20 40 60 −60 −40 −20 0 20 40 60 variable).
x, % x, %

Std and mean


15
The markers for target semivariance (sv) indicate the std
Contribution to target semivariance Target semivariance as function of σ2 of the portfolio that minimizes the target semivariance
0.02 2
at the given mean return
pdf(x)min(x−0.02,0)2 N(0.08,σ )
0.04 target semivariance(x)=area 0.015 Target level: 0.02
10
MV (risky)

Mean, %
0.01
0.02 MV (risky&riskfree)
0.005 target sv (risky)
target sv (risky&riskfree)
0 0 5
−60 −40 −20 0 20 40 60 0 0.02 0.04 0.06
x, % σ2

0
0 5 10 15
Figure 9.6: Target semivariance as a function of mean and standard deviation for a Std, %
N(, 2 ) variable

9.2.3 Target Semivariance (Lower Partial 2nd Moment) Figure 9.7: Standard deviation and expected returns

Reference: Bawa and Lindenberg (1977) and Nantell and Price (1979) For a normally distributed variable, the target semivariance p .h/ is increasing in
Using the variance (or standard deviation) as a measure of portfolio risk (as a mean- the standard deviation (for a given mean)—see Remark 9.9. See also Figure 9.6 for an
variance investor does) fails to distinguish between the downside and upside. As an alter- illustration. This means that minimizing p .h/ at a given mean return gives the same
native, one could consider using a target semivariance (lower partial 2nd moment) instead. solution (portfolio weights) as minimizing p (or p2 ) at the same given mean return. As
It is defined as a result, with normally distributed returns, an investor who wants to minimize the lower
p .h/ D EŒmin.Rp h; 0/2 ; (9.8) partial 2nd moment (at a given mean return) is behaving just like a mean-variance investor.
where h is a “target level” chosen by the investor. In the subsequent analysis it will be set In other cases, the portfolio choice will be different (and perhaps complicated to perform).
Rh
equal to the riskfree rate. (It can clearly also be written p .h/ D 1 .Rp h/2 f .Rp /dRp , See Figure 9.7 for an illustration.
where f ./ is the pdf of the portfolio return.)
Remark 9.9 (Target semivariance calculation for normally distributed variable ) For an

172 173
N.;  2 / variable, target semivariance around the target level h is QQ plot of daily S&P 500 returns
6
p .h/ D  2 a.a/ C  2 .a2 C 1/˚.a/, where a D .h /=; 0.1th to 99.9th percentiles

4
where ./ and ˚./ are the pdf and cdf of a N.0; 1/ variable respectively. Notice that
p .h/ D  2 =2 for h D . See Figure 9.6 for a numerical illustration. It is straightfor- 2

Empirical quantiles
ward to show that
@p .h/
D 2˚.a/; 0
@
so the target semivariance is a strictly increasing function of the standard deviation.
−2

Daily returns, full Daily returns, zoomed in vertically −4


8000 25
Number of days

Number of days
20
6000 Daily S&P 500 returns, 1957:1−2009:6
−6
15
4000
10 −6 −4 −2 0 2 4 6
2000 5 Quantiles from estimated N(µ,σ2), %
0 0
−20 −10 0 10 −20 −10 0 10
Daily excess return, % Daily excess return, % Figure 9.9: Quantiles of daily S&P returns

9.3 Empirical Return Distributions


Daily returns, zoomed in horizontally
8000 Daily S&P 500 returns, 1957:1−2009:6
Are returns normally distributed? Mostly not, but it depends on the asset type and on the
Number of days

The solid line is an estimated normal distribution


6000 data frequency. Options returns typically have very non-normal distributions (in partic-
4000 ular, since the return is 100% on many expiration days). Stock returns are typically
2000 distinctly non-linear at short horizons, but can look somewhat normal at longer horizons.
To assess the normality of returns, the usual econometric techniques (Bera–Jarque
0
−3 −2 −1 0 1 2 3 and Kolmogorov-Smirnov tests) are useful, but a visual inspection of the histogram and a
Daily excess return, %
QQ-plot also give useful clues. See Figures 9.8–9.10 for illustrations.
There is one caveat to this way of studying data: it only provides evidence on the
Figure 9.8: Distribution of daily S&P returns unconditional distribution. For instance, nothing rules out the possibility that we could
estimate a model for time-varying volatility (for instance, a GARCH model) of the returns
and thus generate a description for how the VaR changes over time. However, data with
time varying volatility will typically not have an unconditional normal distribution.

174 175
QQ plot of daily returns QQ plot of weekly returns allows for a much heavier tail that suggested by a normal distribution. The generalized
10 Pareto (GP) distribution is often used. See Figure 9.11 for an illustration.
5
Empirical quantiles

Empirical quantiles
5
0 0

−5
−5
−10 generalized Pareto distribution
−6 −4 −2 0 2 4 6 −10 −5 0 5 10
Quantiles from N(µ,σ2), % Quantiles from N(µ,σ2), %
90% probability mass, unknown shape

QQ plot of monthly returns Circles denote 0.1th to 99.9th percentiles


Empirical quantiles

10 Daily S&P 500 returns, 1957:1−2009:6


u (threshold)
Loss
0

−10 Figure 9.11: Loss distribution


−20
−20 −10 0 10 Remark 9.10 (Cdf and pdf of the generalized Pareto distribution) The generalized Pareto
Quantiles from N(µ,σ2), % distribution is described by a scale parameter (ˇ > 0) and a shape parameter (). The
cdf (Pr.Z  z/, where Z is the random variable and z is a value) is
(
Figure 9.10: Distribution of S&P returns (different horizons) 1 .1 C z=ˇ/ 1= if  ¤ 0
G.z/ D
1 exp. z=ˇ/  D 0;
9.4 Threshold Exceedance
for 0  z and z  ˇ= in case  < 0. The pdf is therefore
Reference: McNeil, Frey, and Embrechts (2005) 7 (
In risk control, the focus is the distribution of losses beyond some threshold level
1
ˇ
.1 C z=ˇ/ 1= 1 if  ¤ 0
g.z/ D
(denoted u below). This has three direct implications. First, the object under study is the ˇ
1
exp. z=ˇ/  D 0:
loss The mean is defined (finite) if  < 1 and is then E.z/ D ˇ=.1 /. See Figure 9.12 for
X D R; (9.10) an illustration.
that is, the negative of the return. Second, the attention is on the probability of exceeding Consider the loss X (the negative of the return) and let u be a threshold. Assume
the threshold level (denoted 1 Pu ) below and how the distribution looks like beyond that the threshold exceedance (X u) has a generalized Pareto distribution. Let Pu be
the threshold. In contrast, the exact shape of the distribution below that point is typically probability of X  u. Then, the cdf of the loss for values greater than the threshold
disregarded (only the probability of being below the threshold, denoted Pu , is of inter- (Pr.X  x/ for x > u) can be written
est).Third, modelling the tail of the distribution is best done by using a distribution that
F .x/ D Pu C G.x u/.1 Pu /, for x > u; (9.11)

176 177
Pdf of generalized Pareto distribution (β = 0.15) Loss distributions, Pr(loss>12) = 10%
7
ξ=0 1 N(0.08,0.162)
6 ξ = 0.25 generalized Pareto (ξ=0.22,β=0.16)
ξ = 0.45 0.8
5
VaR ES
4 0.6
Normal dist 18.2 25.3
3 GP dist 24.5 48.4
0.4
2
0.2
1

0 0
0 0.1 0.2 0.3 0.4 0.5 15 20 25 30 35 40 45 50 55 60
Loss (−R), %

Figure 9.12: Generalized Pareto distributions


Figure 9.13: Comparison of a normal and a generalized Pareto distribution for the tail of
losses
where G.z/ is the cdf of the generalized Pareto distribution. Clearly, the pdf is
We can then write
f .x/ D g.x u/.1 Pu /, for x > u; (9.12)
Pr.X > x/ D Pr.X > u/ Pr.X > xjX > u/
where g.z/ is the pdf of the generalized Pareto distribution.
D Pr.X > u/ Pr.X u>x ujX u > 0/;
Remark 9.11 (Integrating the GP pdf) Notice that integrating the pdf in (9.12) from x D
u to infinity shows that the probability mass of X above u is 1 Pu . Since the probability where the second line comes from subtracting u from both X and x in Pr.X > xjX > u/.
mass below u is Pu , it adds up to unity (as it should). We assume that X u has a generalized Pareto distribution. Then, Pr.X u > x ujX
u > 0/ is the generalized Pareto distribution, 1 G.x u/.
It is often useful to use the cdf in (9.11) to calculate the tail probability Pr.X > x/,
The VaR˛ (say, ˛ D 0:95) is the ˛-th quantile of the loss distribution
which is
1 F .x/ D .1 Pu /Œ1 G.x u/; (9.13) VaR˛ D cdfX 1 .˛/; (9.14)
where G.z/ is the cdf of the generalized Pareto distribution.
where cdfX 1 ./ is the inverse cumulative distribution function of the losses, so cdfX 1 .˛/
Remark 9.12 (Direct derivation of (9.13) ) The probability that the random variable X is the a quantile of the loss distribution. For instance, VaR95% is the 0:95 quantile of the
exceeds a value x, conditional on X exceeding some threshold u can be written loss distribution. This clearly means that the probability of the loss to be less that VaR˛
equals ˛
Pr.X > x/
Pr.X > xjX > u/ D : Pr.X  VaR˛ / D ˛: (9.15)
Pr.X > u/
(Equivalently, the Pr.X >VaR˛ / D 1 ˛:)

178 179
Assuming ˛ is higher than Pu (so VaR˛  u), the cdf (9.11) of the losses gives Xt  for those observations where X t > 
8̂     PT (
< uC ˇ 1 ˛
1 if  ¤ 0 tD1 .X t /ı.X t > / 1 if q is true
 1 Pu O
e./ D PT ; where ı.q/ D (9.19)
VaR˛ D   , for ˛  Pu : (9.16) t D1 .X t > / 0 else.
:̂ u ˇ ln 1 Pu 1 ˛
D0
If it is found that e./
O is increasing (more or less) linearly with the threshold level (),
Proof. (of (9.16)) Set F .x/ D ˛ in (9.11) and use z D x u in the cdf from Remark then it is reasonable to model the tail of the distribution from that (or a somewhat lower)
9.10 and solve for x. point as a generalized Pareto distribution. The estimation of the parameters of the dis-
If we assume  < 1 (to make sure that the mean is finite), then straightforward inte- tribution ( and ˇ) is typically done by maximum likelihood. See Figure 9.15 for an
gration using (9.12) shows that the expected shortfall is illustration.

ES˛ D E.XjX  VaR˛ / Expected exeedance (loss minus threshold, v)


30
VaRa ˇ u
D C , for ˛ > Pu : (9.17)
1  1  25

Subtracting VaR˛ from both sides (and changing the notation from VaRa to ) gives the 20
expected exceedance of the loss over another threshold  > u
N(0.08,0.162)
15
generalized Pareto (ξ=0.22,β=0.16,u=12)
e./ D E .X jX > /
10
 ˇ u
D C , for  > u: (9.18)
1  1  5
Notice that u shows up in the expressions for VaR, ES and e./ because we assume that
0
the threshold exceedance (X u) has a generalized Pareto distribution. 15 20 25 30 35 40
threshold v, %

Remark 9.13 (Expected exceedance from a normal distribution) If X  N.;  2 /, then

.0 / Figure 9.14: Expected exceedance, normal and generalized Pareto distribution
E.X jX > / D  C  ; with 0 D . /=
1 ˚.0 /
Remark 9.14 (Log likelihood function of the loss distribution) Since we have assume
where ./ and ˚ are the pdf and cdf of a N.0; 1/ variable respectively.
that the threshold exceedance (X u) has a generalized Pareto distribution, Remark 9.10
As seen from (9.19) the expected exceedance of a generalized Pareto distribution with shows that the log likelihood for the observation of the loss above the threshold (X t > u)
 > 0 is increasing with the threshold level . This indicates that the tail of the distribution is
is very long. In contrast, a normal distribution would typically show a negative relation X
LD Lt
(see Figure 9.14 for an illustration). This provides a way of assessing which distribution
t st. X t >u
that best fits the tail of the historical histogram. In particular, the expected exceedance (
ln ˇ .1= C 1/ ln Œ1 C  .X t u/ =ˇ if  ¤ 0
over  is often compared with an empirical estimate of the same thing: the mean of ln L t D
ln ˇ .X t u/ =ˇ  D 0:

180 181
This allows us to estimate of  and ˇ by maximum likelihood. Typically, u is not estimated, Fabozzi, F. J., S. M. Focardi, and P. N. Kolm, 2006, Financial modeling of the equity
but imposed a priori (based on the expected exceedance). market, Wiley Finance.

Hull, J. C., 2006, Options, futures, and other derivatives, Prentice-Hall, Upper Saddle
Expected exceedance Estimated loss distribution
River, NJ, 6th edn.
loss minus threshold, v

(50th to 99th percentiles) u = 1.3, Pr(loss>u) = 6.6%


1.2 0.1
ξ = 0.25, β = 0.56
McDonald, R. L., 2006, Derivatives markets, Addison-Wesley, 2nd edn.
1
0.8
0.05 McNeil, A. J., R. Frey, and P. Embrechts, 2005, Quantitative risk management, Princeton
0.6 University Press.
0
0 0.5 1 1.5 2 2.5 1.5 2 2.5 3 3.5 4 Nantell, T. J., and B. Price, 1979, “An analytical comparison of variance and semivariance
threshold v, % Loss, %
capital market theories,” Journal of Financial and Quantitative Analysis, 14, 221–242.

QQ plot
Daily S&P 500 returns, 1957:1−2009:6
Empirical quantiles

(94th to 99th percentiles)


2.5

1.5
1.5 2 2.5
Quantiles from estimated GPD, %

Figure 9.15: Results from S&P 500 data

Example 9.15 (Estimation of the generalized Pareto distribution on S&P daily returns).
Figure 9.15 (upper left panel) shows that it may be reasonable to fit a GP distribution
with a threshold u D 1:3. The the upper right panel illustrates the estimated distribution,
while the lower left panel shows that the highest quantiles are well captured by estimated
distribution.

Bibliography
Bawa, V. S., and E. B. Lindenberg, 1977, “Capital market equilibrium in a mean-lower
partial moment framework,” Journal of Financial Economics, 5, 189–200.

182 183
.D..3 pdfs of normal distributions (also called mixture components)

f .x t I 1 ; 2 ; 12 ; 22 ; / D .1 /.x t I 1 ; 12 / C .x t I 2 ; 22 /; (10.1)

where .xI i ; i2 / is the pdf of a normal distribution with mean i and variance i2 . It
10 Risk Measures II
thus contains five parameters: the means and the variances of the two components and
More advanced material is denoted by a star ( ). It is not required reading. their relative weight ().
See Figures 10.1–10.4 for an illustration.

10.1 Fitting a Mixture Normal Distribution to Data Distribution of S&P500,1957:1−2009:6


0.6
Reference: Hastie, Tibshirani, and Friedman (2001) 8.5 Mixture pdf 1
pdf 1 pdf 2 Mixture pdf 2
mean −0.03 0.02 Total pdf
Distribution of of S&P500,1957:1−2009:6 0.5 std 2.05 0.66
weight 0.14 0.86
0.6 normal pdf
0.4

0.5
0.3

0.4
0.2

0.3
0.1

0.2
0
−5 −4 −3 −2 −1 0 1 2 3 4 5
0.1 Monthly excess return, %

0 Figure 10.2: Histogram of returns and a fitted mixture normal distribution


−5 −4 −3 −2 −1 0 1 2 3 4 5
Monthly excess return, %

Remark 10.1 (Estimation of the mixture normal pdf) With 2 mixture components, the log
Figure 10.1: Histogram of returns and a fitted normal distribution likelihood is just
XT
LL D ln f .x t I 1 ; 2 ; 12 ; 22 ; /;
t D1
A normal distribution often fits returns poorly. If we need a distribution, then a mixture
of two normals is typically much better, and still fairly simple. where f ./ is the pdf in (10.1) A numerical optimization method could be used to maximize
The pdf of this distribution is just a weighted average of two different (bell shaped) this likelihood function. However, this is tricky so an alternative approach is often used.
This is an iterative approach in three steps:
(1) Guess values of 1 ; 2 ; 12 ; 22 and . For instance, pick 1 D x1 , 2 D x2 , 12 D

184 185
QQ plot of daily S&P 500 returns QQ plot of daily S&P 500 returns
6 6
0.1th to 99.9th percentiles 0.1th to 99.9th percentiles

4 4

2 2
Empirical quantiles

Empirical quantiles
0 0

−2 −2

−4 −4

Daily S&P 500 returns, 1957:1−2009:6 Daily S&P 500 returns, 1957:1−2009:6
−6 −6

−6 −4 −2 0 2 4 6 −6 −4 −2 0 2 4 6
Quantiles from estimated N(µ,σ2), % Quantiles from estimated mixture normal, %

Figure 10.3: Quantiles of daily S&P returns Figure 10.4: Quantiles of daily S&P returns

22 D Var.x t / and  D 0:5. previous) iteration.


(2) Calculate

.x t I 2 ; 22 / 10.2 Recap of Univariate Distributions


t D for t D 1; : : : ; T:
.1 /.x t I 1 ; 12 / C .x t I 2 ; 22 /
The cdf (cumulative distribution function) measures the probability that the random vari-
(3) Calculate (in this order) able Xi is below or at some numerical value xi ,
PT PT
.1 t /x t 2 tD1 .1 t /.x t 1 /2 ui D Fi .xi / D Pr.Xi  xi /: (10.2)
1 D Pt D1 T
, 1 D PT ;
t D1 .1 t / t D1 .1 t /
PT PT
t xt 2 2 /2 For instance, with an N.0; 1/ distribution, F . 1:64/ D 0:05. Clearly, the cdf values
t D1 t .x t
2 D Pt D1 ,  D P , and
T 2 T are between (and including) 0 and 1. The distribution of Xi is often called the marginal
t D1 t tD1 t
XT distribution of Xi —to distinguish it from the joint distribution of Xi and Xj . (See below
D t =T .
t D1 for more information on joint distributions.)
Iterate over (2) and (3) until the parameter values converge. (This is an example of the so The pdf (probability density function) fi .xi / is the “height” of the distribution in the
called EM algorithm.) Notice that the calculation of i2 uses i from the same (not the

186 187
sense that the cdf F .xi / is the integral of the pdf from minus infinity to xi 10.3 Beyond (Linear) Correlations
Z xi
Fi .xi / D fi .s/ds: (10.3) Reference: Alexander (2008) 6, McNeil, Frey, and Embrechts (2005) 5
sD 1
The standard correlation (also called Pearson’s correlation) measures the linear rela-
(Conversely, the pdf is the derivative of the cdf, fi .xi / D @Fi .xi /=@xi .) The Gaussian tion between two variables, that is, to what extent one variable can be explained by a
pdf (the normal distribution) is bell shaped. linear function of the other variable (and a constant). That is adequate for most issues
in finance, but we sometimes need to go beyond the correlation—to capture non-linear
Remark 10.2 (Quantile of a distribution) The ˛ quantile of a distribution (˛ ) is the value
relations. It also turns out to be easier to calibrate/estimate copulas (see below) by using
of x such that there is a probability of ˛ of a lower value. We can solve for the quantile by
other measures of dependency.
inverting the cdf, ˛ D F .˛ / as ˛ D F 1 .˛/. For instance, the 5% quantile of a N.0; 1/
Spearman’s rank correlation (called Spearman’s rho) of two variables measures to
distribution is 1:64 D ˚ 1 .0:05/, where ˚ 1 ./ denotes the inverse of an N.0; 1/ cdf.
what degree their relation is monotonic. That is, if one variable tends to be high when the
See Figure 10.5 for an illustration.
other also is—without imposing the restriction that this relation must be linear.

Corr = 0.90 Corr = 0.03


Density of N(0,1) Density of N(8,162)
5% quantile is c = −1.64 3 2 2
5% quantile is µ + c*σ = −18
0.4
1 1
0.3 2 0 0

y
pdf

pdf

0.2 −1 −1
1
0.1 −2 ρ = 0.88, τ = 0.69 −2 ρ = 0.03, τ = 0.01

0 0 −5 0 5 −5 0 5
−3 c 0 3 −40 0 40 x x
x R

Corr = −0.88 Corr = 0.49


cdf of N(8,162) Inverse of cdf of N(8,162)
1 2 2
40 1 1
0 0

y
cdf

0.5 0
R

−1 −1

−40 −2 ρ = −0.84, τ = −0.65 −2 ρ = 1.00, τ = 1.00


0 −5 0 5 −5 0 5
−40 0 40 0 0.2 0.4 0.6 0.8 1 x x
R cdf

Figure 10.6: Illustration of correlation and rank correlation


Figure 10.5: Finding quantiles of a N(, 2 ) distribution
It is computed in two steps. First, the data is ranked from the smallest (rank 1) to the

188 189
largest (ranked T , where T is the sample size). Ties (when two or more observations have instance, with three data points (.x1 ; y1 /; .x2 ; y2 /; .x3 ; y3 /) we first calculate
the same values) are handled by averaging the ranks. The following illustrates this
Changes of x Changes of y
Data Rank x2 x1 y2 y1
(10.7)
2 2:5 x3 x1 y3 y1
10 4 (10.4) x3 x2 y3 y2 ;
3 1
which gives T .T 1/=2 (here 3) pairs. Then, we investigate if the pairs are concordant
2 2:5
(same sign of the change of x and y) or discordant (different signs) pairs
In the second step, simply estimate the correlation of the ranks of two variables
ij is concordant if .xj xi /.yj yi / > 0 (10.8)
Spearman’s  D CorrŒrank.x t /; rank.y t /: (10.5) ij is discordant if .xj xi /.yj yi / < 0:

Clearly, this correlation is between 1 and 1. (There is an alternative way of calculating Finally, we count the number of concordant (Tc ) and discordant (Td ) pairs and calculate
the rank correlation based on the difference of the ranks, d t Drank.x t / rank.y t /,  D Kendall’s tau as
1 6˙ tTD1 d t2 =.T 3 T /. It gives the same result if there are no tied ranks.) See Figure Tc Td
Kendall’s  D : (10.9)
T .T 1/=2
10.6 for an illustration.
The rank correlation can be tested by using the fact that under the null hypothesis the It can be shown that  
4T C 10
rank correlation is zero. We then get Kendall’s  !d N. 0; ; (10.10)
9T .T 1/
p so it is straightforward to test  by a t-test.
T 1O !d N.0; 1/: (10.6)
p Example 10.4 Kendall’s tau) Suppose the data is
(For samples of 20 to 40 observations, it is often recommended to use .T 2/=.1 O2 /O
which has an tT 2 distribution.) x y
2 7
Remark 10.3 (Spearman’s  for a distribution ) If we have specified the joint distribu- 10 9
tion of the random variables X and Y , then we can also calculate the implied Spearman’s 3 10:
 (sometimes only numerically) as CorrŒFX .X/; FY .Y / where FX .X / is the cdf of X and
We then get the following changes
FY .Y / of Y .
Changes of x Changes of y
Kendall’s rank correlation (called Kendall’s ) is similar, but is based on compar-
x2 x1 D 10 2 D 8 y2 y1 D 9 7 D 2 concordant
ing changes of x t (compared to x1 ; : : : x t 1 ) with the corresponding changes of y t . For
x3 x1 D 3 2 D 5 y3 y1 D 10 7 D 3 discordant
x3 x2 D 3 10 D 13 y3 y2 D 10 9 D 1; discordant.

Kendall’s tau is therefore


1 2 1
D D :
3.3 1/=2 3

190 191
If x and y actually has bivariate normal distribution with correlation , then it can be (Ox;˛ and Oy;˛ ) by simply sorting the data and then picking out the value of observation
shown that on average we have ˛T of this sorted list (do this individually for x and y). Then, calculate the estimate
(
Spearman’s rho =
6
arcsin.=2/   (10.11) O 1 XT 1 if x t  Ox;˛ and y t  Oy;˛
 G˛ D ı t ; where ı t D (10.14)
T t D1 0 otherwise.
2
Kendall’s tau D arcsin./: (10.12)
 See 10.8 for an illustration based on a joint normal distribution.
In this case, all three measures give similar messages (although the Kendall’s tau tends to
Pr(x<quantile,y<quantile), normal distribution Pr(x<quantile,y<quantile), normal distribution
be lower than the linear correlation and Spearman’s rho) This is illustrated in Figure 10.7. 100 10
Clearly, when data is not normally distributed, then these measures can give distinctly corr = 0.25
corr = 0.5
different answers.

%
50 5
ρ and τ as a function of the linear correlation (Gaussian distribution)
1
Spearman’s ρ 0 0
Kendall’s τ 0 20 40 60 80 100 0 2 4 6 8 10
0.8 quantile level,% quantile level,%

0.6
Figure 10.8: Probability of joint low returns, bivariate normal distribution
0.4

0.2 10.4 Copulas

0 Reference: McNeil, Frey, and Embrechts (2005), Alexander (2008) 6


0 0.2 0.4 0.6 0.8 1
correlation Portfolio choice and risk analysis depend crucially on the joint distribution of asset
returns. Emprical evidence suggest that many returns have non-normal distribution, es-
pecially when we focus on the tails. There are several ways of estimating complicated
Figure 10.7: Spearman’s rho and Kendall’s tau if data has a bivariate normal distribution
(non-normal) distributions: using copulas is one. This approach has the advantage that
it proceeds in two steps: first we estimate the marginal distribution of each returns sepa-
A joint ˛-quantile exceedance probability measures how often two random variables
rately, then we model the comovements by a copula.
(x and y, say) are both above their ˛ quantile. Similarly, we can also define the probability
that they are both below their ˛ quantile
10.4.1 Multivariate Distributions and Copulas
G˛ D Pr.x  x;˛ ; y  y;˛ /; (10.13) A joint cdf of two random variables (X1 and X2 ) is defined as

x;˛ and y;˛ are ˛-quantile of the x- and y-distribution respectively. F1;2 .x1 ; x2 / D Pr.X1  x1 and X2  x2 /: (10.15)
In practice, this can be estimated from data by first finding the empirical ˛-quantiles

192 193
This cdf is obtained by integrating the joint pdf f1;2 .x1 ; x2 / over both variables only. They therefore provide a way of calibrating/estimating the copula without having to
Z x1 Z x2 involve the marginal distributions directly.
F1;2 .x1 ; x2 / D f1;2 .s; t /dsdt: (10.16)
sD 1 tD 1 Example 10.6 (Independent X and Y ) If X and Y are independent, then we know that
(Conversely, the pdf is the mixed derivative of the cdf, f1;2 .x1 ; x2 / D @ F1;2 .x1 ; x2 /=@x1 @x2 .)
2 f1;2 .x1 ; x2 / D f1 .x1 /f2 .x2 /, so the copula density function is just a constant equal to
See Figure 10.9 for an illustration. one.

pdf of bivariate normal distribution, corr=0.8 cdf of bivariate normal distribution, corr=0.8 Remark 10.7 (Cdfs and copulas ) The joint cdf can be written as

F1;2 .x1 ; x2 / D C ŒF1 .x1 /; F2 .x2 /;


0.2 1

0.1 0.5 where C./ is the unique copula function. Taking derivatives gives (10.17) where
2 2
0 0 @2 C.u1 ; u2 /
2 0 2 0 c.u1 ; u2 / D :
0 0 @u1 @u2
y −2 −2 x y −2 −2 x
Notice the derivatives are with respect to ui D Fi .xi /, not xi . Conversely, integrating the
density over both u1 and u2 gives the copula function C./.
Figure 10.9: Bivariate normal distributions

Any pdf can also be written as 10.4.2 The Gaussian and Other Copula Densities

The Gaussian copula density function is


f1;2 .x1 ; x2 / D c.u1 ; u2 /f1 .x1 /f2 .x2 /; with ui D Fi .xi /; (10.17)
 2 2 
1  1 21 2 C 2 22
where c./ is a copula density function and ui D Fi .xi / is the cdf value as in (10.2). The c.u1 ; u2 / D p exp , with i D ˚ 1
.ui /; (10.18)
1 2 2.1 2 /
extension to three or more random variables is straightforward.
where ˚ 1 ./ is the inverse of an N.0; 1/ distribution. Notice that when using this func-
Remark 10.5 (Joint pdf and copula density, n variables) For n variables (10.17) gener- tion in (10.17) to construct the joint pdf, we have to first calculate the cdf values ui D
alizes to Fi .xi / and then the quantiles of those according to a standard normal distribution i D
˚ 1 .ui / D ˚ 1 ŒFi .xi /.
f1;2;:::;n .x1 ; x2 ; : : : ; xn / D c.u1 ; u2 ; : : : ; un /f1 .x1 /f2 .x2 / : : : fn .xn /; with ui D Fi .xi /;
It can be shown that assuming that the marginal pdfs (f1 .x1 / and f2 .x2 /) are normal
Equation (10.17) means that if we know the joint pdf f1;2 .x1 ; x2 /—and thus also the and then combining with the Gaussian copula density recovers a bivariate normal dis-
cdfs F1 .x1 / and F2 .x2 /—then we can figure out what the copula density function must tribution. However, the way we typically use copulas is to assume (and estimate) some
be. Alternatively, if we know the pdfs f1 .x1 / and f2 .x2 /—and thus also the cdfs F1 .x1 / other type of univariate distribution, for instance, with fat tails—and then combine with
and F2 .x2 /—and the copula function, then we can construct the joint distribution. (This a (Gaussian) copula density to create the joint distribution. See Figure 10.10 for an illus-
is called Sklar’s theorem.) This latter approach will turn out to be useful. tration.
The correlation of x1 and x2 depends on both the copula and the marginal distribu- A zero correlation ( D 0) makes the copula density (10.18) equal to unity—so the
tions. In contrast, both Spearman’s rho and Kendall’s tau are determined by the copula joint density is just the product of the marginal densities. A positive correlation makes the

194 195
copula density high when both x1 and x2 deviate from their means in the same direction. where ˛ ¤ 0. When ˛ > 0, then correlation on the downside is much higher than on the
The easiest way to calibrate a Gaussian copula is therefore to set upside (where it goes to zero as we move further out the tail).
See Figure 10.10 for an illustration.
 D Spearman’s rho, (10.19) For the Clayton copula we have
as suggested by (10.11). Kendall’s  D
˛
, so (10.21)
Alternatively, the  parameter can calibrated to give a joint probability of both x1 ˛C2
2
and x2 being lower than some quantile as to match data: see (10.14). The values of this ˛D : (10.22)
1 
probability (according to a copula) is easily calculated by finding the copula function
(essentially the cdf) corresponding to a copula density. Some results are given in remarks The easiest way to calibrate a Clayton copula is therefore to set ˛ according to (10.22).
below. See Figure 10.8 for results from a Gaussian copula. This figure shows that a Figure 10.11 illustrates how the probability of both variables to be below their respec-
higher correlation implies a larger probability that both variables are very low—but that tive quantiles depend on the ˛ parameter. These parameters are comparable to the those
the probabilities quickly become very small as we move towards lower quantiles (lower for the correlations in Figure 10.8 for the Gaussian copula, see (10.11)–(10.12). The figure
returns). are therefore comparable—and the main point is that Clayton’s copula gives probabilities
of joint low values (both variables being low) that do not decay as quickly as according to
Remark 10.8 (The Gaussian copula function ) The distribution function corresponding the Gaussian copulas. Intuitively, this means that the Clayton copula exhibits much higher
to the Gaussian copula density (10.18) is obtained by integrating over both u1 and u2 and “correlations” in the lower tail than the Gaussian copula does—although they imply the
the value is C.u1 ; u2 I / D" ˚# ."1 ; 2 /#!
where i is defined in (10.18) and ˚ is the bi- same overall correlation. That is, according to the Clayton copula more of the overall
0 1  correlation of data is driven by synchronized movements in the left tail. This could be
variate normal cdf for N ; . Most statistical software contains numerical
0  1 interpreted as if the correlation is higher in market crashes than during normal times.
returns for calculating this cdf.
Remark 10.10 (Multivariate Clayton copula density ) The Clayton copula density for n
Remark 10.9 (Multivariate Gaussian copula density ) The Gaussian copula density for variables is
n variables is     
1 1 0 Pn ˛ n 1=˛ Qn ˛ 1 Qn
c.u/ D p exp  .˙ 1 In / ; c.u/ D 1 nC i D1 ui i D1 ui i D1 Œ1 C .i 1/˛ :
j˙ j 2
where ˙ is the correlation matrix with determinant j˙j and  is a column matrix with Remark 10.11 (Clayton copula function ) The copula function (the cdf) corresponding
i D ˚ 1 .ui / as the i th element. to (10.20) is
C.u1 ; u2 / D . 1 C u1 ˛ C u2 ˛ / 1=˛ :
The Gaussian copula is useful, but it has the drawback that it is symmetric—so the
downside and the upside look the same. This is at odds with evidence from many financial The following steps summarize how the copula is used to construct the multivariate
markets that show higher correlations across assets in down markets. The Clayton copula distribution.
density is therefore an interesting alternative
1. Construct the marginal pdfs fi .xi / and thus also the marginal cdfs Fi .xi /. For in-
c.u1 ; u2 / D . 1 C u1 ˛ C u2 ˛ / 2 1=˛
.u1 u2 / ˛ 1
.1 C ˛/; (10.20) stance, this could be done by fitting a distribution with a fat tail. With this, calculate
the cdf values for the data ui D Fi .xi / as in (10.2).

196 197
Gaussian copula density, corr = −0.5 Gaussian copula density, corr = 0 (a) for the Gaussian copula (10.18)
i. assume (or estimate/calibrate) a correlation  to use in the Gaussian cop-
5 5
ula
ii. calculate i D ˚ 1
.ui /, where ˚ 1
./ is the inverse of a N.0; 1/ distribu-
0 0
2 2 tion
0 0
2 2 iii. combine to get the copula density value c.u1 ; u2 /
x2 −2 −2 x1
0 x2 −2 −2 x1
0
(b) for the Clayton copula (10.20)
i. assume (or estimate/calibrate) an ˛ to use in the Clayton copula (typically
Gaussian copula density, corr = 0.5 Clayton copula density, α = 0.5 (τ = 0.2) based on Kendall’s  as in (10.22))
ii. calculate the copula density value c.u1 ; u2 /
5 5
3. Combine the marginal pdfs and the copula density as in (10.17), f1;2 .x1 ; x2 / D
0 0 c.u1 ; u2 /f1 .x1 /f2 .x2 /, where ui D Fi .xi / is the cdf value according to the marginal
2 2 distribution of variable i .
0 0
2 2
x2 −2 −2 0 x2 −2 −2 0
x1 x1 See Figure 10.12 for an illustration.

Remark 10.12 (Tail Dependence ) The measure of lower tail dependence starts by find-
Figure 10.10: Copula densities (as functions of xi ) ing the probability that X1 is lower than its qth quantile (X1  F1 1 .q/) given that X2 is
Pr(x<quantile,y<quantile), Clayton Pr(x<quantile,y<quantile), Clayton
lower than its qth quantile (X2  F2 1 .q/)
100 10
α = 0.16 l D PrŒX1  F1 1 .q/jX2  F2 1 .q/;
α = 0.33
and then takes the limit as the quantile goes to zero
%

50 5

l D limq!0 PrŒX1  F1 1 .q/jX2  F2 1 .q/:


0 0
0 20 40 60 80 100 0 2 4 6 8 10
quantile level,% quantile level,% It can be shown that a Gaussian copula gives zero or very weak tail dependence,
unless the correlation is 1. It can also be shown that the lower tail dependence of the
Clayton copula is
Figure 10.11: Probability of joint low returns, Clayton copula
l D 2 1=˛ if ˛ > 0

2. Calculate the copula density as follows (for the Gaussian or Clayton copulas, re- and zero otherwise.
spectively):

198 199
Joint pdf, Gaussian copula, corr = −0.5 Joint pdf, Gaussian copula, corr = 0 really negative returns happens more often than the estimated normal distribution would
2 2 suggest. For that reason, the joint distribution is estimated by first fitting generalized
1 1 Pareto distributions to each of the series and then these are combined with a copula as in
(10.16) to generate the joint distribution. In particular, the Clayton copula seems to give
x2

x2
0 0
−1 −1 a long joint negative tail.
−2 −2
−2 −1 0 1 2 −2 −1 0 1 2 Prob(both returns < quantile) Prob(both returns < quantile)
x1 x1 100 5
Data
4
estimated N()
3

%
50
Joint pdf, Gaussian copula, corr = 0.5 Joint pdf, Clayton copula, α = 0.5 2
2 2 1
1 1 0 0
0 20 40 60 80 100 0 2 4 6 8 10
x2

x2

0 0 quantile level,% quantile level,%


−1 −1 US data 1979:1−2008:12
small value stocks and large value stocks
−2 −2
−2 −1 0 1 2 −2 −1 0 1 2
x1 x1
Figure 10.13: Probabilities of joint low returns

Figure 10.12: Contours of bivariate pdfs Expected exceedance, small value stocks Expected exceedance, large value stocks

loss minus threshold, v

loss minus threshold, v


1.5 (50th to 99th percentiles) 1.5
10.5 Joint Tail Distribution
1 1
The methods for estimating the (marginal, that is, for one variable at a time) distribution
of the lower tail can be combined with a copula to model the joint tail distribution. In 0.5 0.5
particular, combining the generalized Pareto distribution (GPD) with the Clayton copula
0 1 2 3 0 1 2 3
provides a flexible way. threshold v, % threshold v, %
This can be done by first modelling the loss (X t D R t ) beyond some threshold (u),
that is, the variable X t u with the GDP. To get a distribution of the return, we simply use
Figure 10.14: Expected exceedances
the fact that pdfR . z/ D pdfX .z/ for any value z. Then, in a second step we calibrate the
copula by using Kendall’s  for the subsample when both returns are less than u. Figures
10.13–10.16 provides an illustration.
Bibliography
Remark 10.13 Figure 10.13 suggests that the joint occurrence (of these two assets) of
Alexander, C., 2008, Market Risk Analysis: Practical Financial Econometrics, Wiley.

200 201
Loss distribution, small value stocks Loss distribution, large value stocks
u = 0.5, Pr(loss>u) = 20.0% u = 0.5, Pr(loss>u) = 27.0%
0.2 ξ = 0.16, β = 0.31 0.2 ξ = 0.17, β = 0.55
0.15 0.15
0.1 0.1
0.05 0.05
0 0
1 2 3 4 5 1 2 3 4 5
Loss, % Loss, %
Joint pdf, Gaussian copula Joint pdf, Clayton copula

large value stocks

large value stocks


−1 −1
Return distribution, small value stocks Return distribution, large value stocks
−2 −2

0.2 0.2 −3 −3
0.15 0.15 Spearman’s ρ = 0.32 Kendall’s τ = 0.22, α = 0.56
−4 −4
0.1 0.1 −4 −3 −2 −1 −4 −3 −2 −1
0.05 0.05 small value stocks small value stocks

0 0
−5 −4 −3 −2 −1 −5 −4 −3 −2 −1
Return, % Return, %
Joint pdf, independent copula
US data 1979:1−2008:12

large value stocks


−1
Figure 10.15: Estimation of marginal loss (and return) distributions
−2

Hastie, T., R. Tibshirani, and J. Friedman, 2001, The elements of statistical learning: data −3
mining, inference and prediction, Springer Verlag.
−4
−4 −3 −2 −1
McNeil, A. J., R. Frey, and P. Embrechts, 2005, Quantitative risk management, Princeton small value stocks
University Press.

Figure 10.16: Joint pdfs with different copulas

202 203
Pdf of N(0,1) Pdf of N(0,1)
0.4 0.4
Pr(x ≤ −1): 0.16 Pr(x ≤ 0.5): 0.69
0.3 0.3
11 Option Pricing and Estimation of Continuous Time 0.2 0.2

Processes 0.1 0.1

0 0
Reference: Hull (2006) 19, Elton, Gruber, Brown, and Goetzmann (2003) 22 or Bodie, −3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 3
Kane, and Marcus (2005) 21 x x

Reference (advanced): Taylor (2005) 13–14; Campbell, Lo, and MacKinlay (1997) 9;
Gourieroux and Jasiak (2001) 12–13
Cdf of N(0,1)
More advanced material is denoted by a star ( ). It is not required reading. 1

11.1 The Black-Scholes Model 0.5

11.1.1 The Black-Scholes Option Price Model


0
A European call option contract traded (contracted and paid) in t may stipulate that the −3 −2 −1 0 1 2 3
x
buyer of the contract has the right (not the obligation) to buy one unit of the underlying
asset (from the issuer of the option) in t C m at the strike price K. The option payoff
(in t C m) is clearly max .0; S t Cm K/ ;where S t Cm is the asset price, and K the strike Figure 11.2: Pdf and cdf of N(0,1)
price. See Figure 11.1 for the timing convention.
The Black-Scholes formula for a European call option price is where ˚./ is the cumulative distribution function for a variable distributed as N(0,1). For
 instance, ˚.2/ is the probability that y  2 — see Figure 11.2. In this equation, S0 is the
p ln.S t =K/ C r C  2 =2 m
C t D S t ˚.d1 / Ke rm ˚.d1  m/, where d1 D p : (11.1) price of the underlying asset in period t, and r is the continuously compounded interest
 m
rate (on an annualized basis).
The B-S formula can be derived from several stochastic processes of the underlying
asset price (discussed below), but they all imply that the distribution of log asset price in
t C m (conditional on the information in t) is normal with some mean ˛ (not important
-
t t Cm for the option price) and the variance m 2

European buy option, if S > K: pay ln S t Cm  N.˛; m 2 /: (11.2)


call option: agree on K, pay C K and get asset

Figure 11.1: Timing convention of option contract

204 205
BS call price as a function of volatility CBOE volatility index (VIX)
7 Kuwait Stock market crash LTCM/Russia pharma Higher rates Lehman

80 Gulf war Asia 9/11 Iraq sub prime


6 The base case has
S, K, m, y, and σ: 70
5
42 42 0.5 0.05 0.2
4 60

3 50

2 40
1
30
0
0 0.1 0.2 0.3 0.4 0.5 20
Standard deviation, σ
10
1990 1992 1994 1996 1998 2000 2002 2004 2006 2008 2010
Figure 11.3: Call option price, Black-Scholes model

Figure 11.4: CBOE VIX, summary measure of implied volatities (30 days) on US stock
11.1.2 Implied Volatility
markets
The pricing formula (11.1) contains only one unknown parameter: the standard deviation
Implied volatility, SMI option
 in the distribution of ln S t Cm , see (11.2). With data on the option price, spot price, 1
the interest rate, and the strike price, we can solve for standard deviation: the implied 2009−07−14
2009−08−14
volatility. This should not be thought of as an estimation of an unknown parameter— 0.8
rather as just a transformation of the option price. Notice that we can solve (by trial-and-
error or some numerical routine) for one implied volatility for each available strike price. 0.6

See Figure 11.3 for an illustration and Figure 11.4 for a time series of implied volatilities.
0.4
If the Black-Scholes formula is correct, that is, if the assumption in (11.2) is correct,
then these volatilities should be the same across strike prices. It is often found that the
0.2
implied volatility is a “smirk” (equity markets) or “smile” (FX markets) shaped function
of the strike price. One possible explanation for a smirk shape is that market participants 0
assign a higher probability to a dramatic drop in share prices than a normal distribution 3000 4000 5000 6000 7000 8000
Strike price
suggests. A possible explanation for a smile shape is that the (perceived) distribution of
the future asset price has relatively more probability mass in the tails (“fat tails”) than a
normal distribution has. See Figure 11.5 for an illustration. Figure 11.5: Implied volatilities of SMI options, selected dates

206 207
11.1.3 Brownian Motion without Mean Reversion: The Random Walk Remark 11.3 (iid random variable in discrete time.) Suppose x t has the constant mean 
and a variance  2 . Then E.x t Cx t 1 / D 2 and Var.x t Cx t 1 / D 2 2 C2 Cov.x t ; x t 1 /.
The basic assumption behind the B-S formula (11.1) is that the log price of the underlying
If x t is iid, then the covariance is zero, so Var.x t Cx t 1 / D 2 2 . Both mean and variance
asset, ln S t , follows a geometric Brownian motion—with or without mean reversion.
scale linearly with the horizon.
This section discusses the standard geometric Brownian motion without mean rever-
sion
11.1.4 Brownian Motion with Mean Reversion
d ln S t D dt C d W t ; (11.3)
The mean reverting Ornstein-Uhlenbeck process is
where d ln S t is the change in the log price over a very short time interval. On the right
hand side,  is the drift (typically expressed on annual basis), dt just indicates the change d ln S t D . ln S t /dt C d W t , with  > 0: (11.6)
in time,  is the standard deviation (per year), and d W t is a random component (Wiener
R1
process) that has an N.0; 1/ distribution if we cumulate d W t over a year ( 0 d W t  This process makes ln S t revert back to the mean , and the mean reversion is faster if 
N.0; 1/). By comparing (11.1) and (11.3) we notice that only the volatility (), not the is large. It is used in, for instance, the Vasicek model of interest rates.
drift (), show up in the option pricing formula. In essence, the drift is already accounted To estimate the parameters in (11.6) on real life data, we (once again) have to under-
for by the current spot price in the option pricing formula (as the spot price certainly stand what the model implies for discretely sampled data. It can be shown that it implies
depends on the expected drift of the asset price). a discrete time AR(1)

Remark 11.1 (Alternative stock price process.) If we instead of (11.3) assume the pro- ln S t D ˛ C  ln S t h C " t , with (11.7)
 
cess dS t D SQ t dt C S t d W t , then we get the same option price. The reason is that Itô’s De h
; ˛ D .1 /, and " t  N 0;  2 .1 2
 /=.2/ : (11.8)
lemma tells us that (11.3) implies this second process with Q D  C  2 =2. The difference
is only in terms of the drift, which does not show up (directly, at least) in the B-S formula. We know that the maximum likelihood estimator (MLE) of the discrete AR(1) is least
squares combined with the traditional estimator of the residual variance. MLE has the
Remark 11.2 ((11.3) as a limit of a discrete time process.) (11.3) can be thought of as further advantage of being invariant to parameter transformations, which here means that
the limit of the discrete time process ln S t ln S t h D h C " t (where " t is white noise) the MLE of ;  and  2 can be backed out from the LS estimates of ; ˛ and Var." t ) by
as the time interval h becomes very small. using (11.8).

We can only observe the value of the asset price at a limited number of times, so we Example 11.4 Suppose ;  and  2 are 2, 0, and 0.25 respectively—and the periods are
need to understand what (11.3) implies for discrete time intervals. It is straightforward to years (so one unit of time corresponds to a year) . Equations (11.7)–(11.8) then gives the
show that (11.3) implies that we have normally distributed changes and that the changes following AR(1) for weekly (h D 1=52) data
are uncorrelated (for non-overlapping data)
ln S t D 0:96 ln S t h C " t with Var." t /  0:24:
ln.S t =S t h/  N.h;  2 h/ (11.4)
CovŒln.S t =S t h /; ln.S t Ch =S t / D 0: (11.5)

Notice that both the drift and the variance scale linearly with the horizon h. The reason
is, or course, that the growth rates (even for the infinitesimal horizon) are iid.

208 209
Distribution of future asset price Distribution of future asset price Profit of straddle, call + put

0.06 K = 42 0.06 K = 47 Call(K)


Std = 0.14 Std = 0.14 Put(K)
0.04 0.04 Straddle

0.02 0.02

0 0
20 30 40 50 60 70 20 30 40 50 60 70
Asset price Asset price

0
Figure 11.6: Distribution of future stock price

11.2 Estimation of the Volatility of a Random Walk Process


K
11.2.1 Option Pricing and Volatility

Option pricing is basically about forecasting the volatility (until expiration of the option) Stock price

of the underlying asset. This is clear from the Black-Scholes model where the only un-
known parameter is the volatility. It is also true more generally—which can be seen in Figure 11.7: Profit of straddle portfolio
at least two ways. First, a higher volatility is good for an owner of a call option since -
it increases the upside potential (higher probabaility of a really good outcome), at the 0 h 2h T D nh
same time as the down side is protected. Second, a many option portfolios highlight how
volatility matters for the potential profits. For instance, a straddle (a long position in both
a call and a put at the same strike price) pays off if the price of the underlying asset moves -
a lot (in either direction) from the strike price, that is, when volatility is high. See Figures 0 h 2h T D nh
11.6–11.7 for illustrations.
This section discusses different ways of estimating the volatility. We will assume that
we have data for observations in  D 1; 2; ::; n. This could be days or weeks or whatever.
Let the time between  and  C 1 be h (years). The sample therefore stretches over Figure 11.8: Two different samplings with same time span T
T D nh periods (years). For instance, for daily data h D 1=365 (or possibly something
like 1=252 if only trading days are counted). Instead, with weekly data h D 1=52. See 11.2.2 Standard Approach
Figure 11.8 for an illustration.
We first estimate the variance for the sampling frequency we have, and then convert to the
annual frequency.
According to (11.4) the growth rates, ln.S t =S t h /, are iid over any sampling fre-

210 211
quency. To simplify the notation, let y D ln.S =S 1 / be the observed growth rates. where we have used the fact that the second moment equals the variance plus the squared
The classical estimator of the variance of an iid data series is mean (cf. (11.12)). Clearly, this relative exaggeration is zero if the mean is zero. The
Xn relative exaggeration is small if the maturity is small.
sO 2 D .y N 2 =n, where
y/ (11.9)
 D1
Xn If we have high frequency data on the asset price or the return, then we can choose
yN D y =n: (11.10)
 D1
which sampling frequency to use in (11.9)–(11.10). Recall that a sample with n obser-
(This is also the maximum likelihood estimator.) To annualize these numbers, use vations (where the length of time between the observations is h) covers T D nh periods
(years). It can be shown that the asymptotic variances (that is, the variances in a very
O 2 D sO 2 = h, and O D y=
N h: (11.11)
large sample) of the estimators of  and  2 in (11.9)–(11.11) are

Example 11.5 If .y;


N sO 2 / D .0:001; 0:03/ on daily data, then the annualized values are Var./
O D  2 =T and Var.O 2 / D 2 4 =n: (11.13)
.;  / D .0:001  250; 0:03  250/ D .0:25; 7:5/ if we assume 250 trading days per
2

year. Therefore, to get a precise estimator of the mean drift, , a sample that stretches over
a long period is crucial: it does not help to just sample more frequently. However, the
Notice that is can be quite important to subtract the mean drift, y.
N Recall that for any sampling frequency is crucial for getting a precise estimator of  2 , while a sample that
random variable, we have stretches over a long period is unimportant. For estimating the volatility (to use in the B-S
 2 D E.x 2 / 2 ; (11.12) model) we should therefore use high frequency data.
so a non-zero mean drives a wedge between the variance (which we want) and the second
moment (which we estimate if we assume yN D 0). 11.2.3 Exponential Moving Average

The traditional estimator is based on the assumption that volatility is constant—which is


Example 11.6 For the US stock market index excess return since WWII we have approx-
consistent with the assumptions of the B-S model. In reality, volatility is time varying.
imately a variance of 0:162 and a mean of 0:08. In this case, (11.12) becomes
A practical ad hoc approach to estimate time varying volatility is to modify (11.9)–
0:162 D E.x 2 / 0:082 , so E.x 2 /  0:182 : (11.10) so that recent observations carry larger weight. The exponential moving average
(EMA) model lets the weight for lag s be .1 /s where 0 <  < 1. If we assume that
Assuming that the drift is zero gives an estimate of the variance equal to 0:182 which is yN is the same in all periods, then we have
25% too high.
sO2 D Os2 1 C .1 / .y 1 N 2:
y/ (11.14)
Remark 11.7 ( Variance vs. second moment, the effect of the maturity) Suppose we are
interested in the variance over an m-period horizon, for instance, because we want to Clearly, a higher  means that old data plays a larger role—and at the limit as  goes
price an option that matures in t C m. How important is it then to use the variance (m 2 ) towards one, we have the traditional estimator. See Figure 11.10 for a comparison using
rather than the second moment? The relative error is daily US equity returns. This method is commonly used by practitioners. For instance,
the RISK Metrics is based on  D 0:94 on daily data. Alternatively,  can be chosen to
Second moment - variance m2 2 m2
D D 2 ; minimize some criterion function.
variance m 2 

Remark 11.8 (EMA with time-variation in the mean.) If we want also the mean to be

212 213
GARCH std, annualized EMA std, annualized, λ = 0.99 Std, EMA estimate, λ = 0.9 CBOE volatility index (VIX)

40 40 40 40

20 20 20 20

0 0 0 0
1980 1990 2000 2010 1980 1990 2000 2010 1990 1995 2000 2005 2010 1990 1995 2000 2005 2010
S&P 500, daily data 1954:1−2009:6

S&P 500 (daily) 1954:1−2009:6

AR(1) of excess returns EMA std, annualized, λ = 0.9 Figure 11.10: Different estimates of US equity market volatility
with GARCH(1,1) errors
40
AR(1) coef: 0.10 volatility as a function of the latest squared shock
ARCH&GARCH coefs: 0.08 0.92

20 s2 D ˛0 C ˛1 u2 1 ; (11.15)

0 where u is a zero-mean variable. The model requires ˛0 > 0 and 0  ˛1 < 1 to


1980 1990 2000 2010
guarantee that the volatility stays positive and finite. The variance reverts back to an
average variance (˛0 =.1 ˛1 /). The rate of mean reversion is ˛1 , that is, the variance
Figure 11.9: Different estimates of US equity market volatility behaves much like an AR(1) model with an autocorrelation parameter of ˛1 . The model
parameters are typically estimated by maximum likelihood. Higher-order ARCH models
time-varying, then we can use the estimator include further lags of the squared shocks (for instance, u2 2 ).
  Instead of using a high-order ARCH model, it is often convenient to use a first-order
sO2 D .1 / .y 1 yN /2 C  .y 2 yN /2 C 2 .y 3 yN /2 C : : :
generalized ARCH model, the GARCH(1,1) model. It adds a term that captures direct
yN D Œy 1 C y 2 C y 3 C : : : =. 1/: autoregression of the volatility

Notice that the mean is estimated as a traditional sample mean, using observations 1 to s2 D ˛0 C ˛1 u2 C ˇ1 s2 1 : (11.16)
1
 1. This guarantees that the variance will always be a non-negative number.
We require that ˛0 > 0, ˛1  0, ˇ1  0, and ˛1 C ˇ1 < 1 to guarantee that the
It should be noted, however, that the B-S formula does not allow for random volatility. volatility stays positive and finite. This is very similar to the EMA in (11.14), except that
the variance reverts back to the mean (˛0 =.1 ˛1 ˇ1 /). The rate of mean reversion is
11.2.4 Autoregressive Conditional Heteroskedasticity ˛1 C ˇ1 , that is, the variance behaves much like an AR(1) model with an autocorrelation
The model with Autoregressive Conditional Heteroskedasticity (ARCH) is a useful tool parameter of ˛1 C ˇ1 .
for estimating the properties of volatility clustering. The first-order ARCH expresses

214 215
11.2.5 Time-Variation in Volatility and the B-S Formula α , β and T (days) are 0.8 0.09 10 α , β and T (days) are 0.8 0.09 100
1 1 1 1
0.1 0.03
The ARCH and GARCH models imply that volatility is random, so they are (strictly N
speaking) not consistent with the B-S model. However, they are often combined with the Sim 0.02
B-S model to provide an approximate option price. See Figure 11.11 for a comparison of 0.05
the actual distribution of the log asset price at different horizons (10 and 100 days) when 0.01

the daily returns are generated by a GARCH model—and a normal distribution with the
0 0
same mean and variance. To be specific, the figure with a 10-day horizon shows the −20 −10 0 10 20 −40 −20 0 20 40
Return Return
distribution of ln S t Cm , as seen from period t, where ln S t Cm is calculated as

ln S t Cm D ln S t C u C u C1 C : : : C uC10 ; (11.17)
α1, β1 and T (days) are 0.09 0.8 10 α1, β1 and T (days) are 0.09 0.8 100
0.04
where each of the growth rates (u ) is drawn from an N.0; s2 )distribution where the
variance follows the GARCH(1,1) process in (11.16). 0.1
0.03

It is clear the normal distribution is a good approximation unless the horizon is short 0.02
and the ARCH component (˛1 u2 1 ) dominates the GARCH component (ˇ1 s2 1 ). Intu- 0.05
0.01
itively, we get (almost) a normal distribution in two cases. First, when the random part
0 0
of the volatility (˛1 u2 1 ) is relatively small compared to the non-random part (ˇ1 s2 1 ). −10 0 10 −40 −20 0 20 40
Return Return
For instance, if there is no random part at all, then we get exactly a normal distribution
(the sum of normally distributed variables is normally distributed—if all the variances are
deterministic). Second, the summing of many uncorrelated growth rates in (11.17) give Figure 11.11: Comparison of normal and simulated distribution of m-period returns
the same effect as taking an average: if we average sufficiently many components we get
a distribution that looks more and more similar to a normal distribution (this is the central Campbell, J. Y., A. W. Lo, and A. C. MacKinlay, 1997, The Econometrics of Financial
Remark 11.9 A time-varying, but non-random volatility could be consistent with (11.2): if ln S_{t+m} is the sum (integral) of normally distributed changes with known (but time-varying) variances, then this sum has a normal distribution (recall: if the random variables x and y are normally distributed, so is x + y). A random variance does not fit this case, since a variable with a random variance is not normally distributed.

Bibliography

Bodie, Z., A. Kane, and A. J. Marcus, 2005, Investments, McGraw-Hill, Boston, 6th edn.

Campbell, J. Y., A. W. Lo, and A. C. MacKinlay, 1997, The Econometrics of Financial Markets, Princeton University Press, Princeton, New Jersey.

Elton, E. J., M. J. Gruber, S. J. Brown, and W. N. Goetzmann, 2003, Modern Portfolio Theory and Investment Analysis, John Wiley and Sons, 6th edn.

Gourieroux, C., and J. Jasiak, 2001, Financial Econometrics: Problems, Models, and Methods, Princeton University Press.

Hull, J. C., 2006, Options, Futures, and Other Derivatives, Prentice-Hall, Upper Saddle River, NJ, 6th edn.

Taylor, S. J., 2005, Asset Price Dynamics, Volatility, and Prediction, Princeton University Press.
12 Event Studies

Reference: Bodie, Kane, and Marcus (2005) 12.3 or Copeland, Weston, and Shastri (2005) 11
Reference (advanced): Campbell, Lo, and MacKinlay (1997) 4

More advanced material is denoted by a star (*). It is not required reading.

12.1 Basic Structure of Event Studies

The idea of an event study is to study the effect (on stock prices or returns) of a special event by using a cross-section of such events. For instance, what is the effect of a stock split announcement on the share price? Other events could be debt issues, mergers and acquisitions, earnings announcements, or monetary policy moves.

The event is typically assumed to be a discrete variable. For instance, it could be a merger or not, or whether the monetary policy surprise was positive (a lower interest rate than expected) or not. The basic approach is then to study what happens to the returns of those assets that have such an event.

Only news should move the asset price, so it is often necessary to explicitly model the previous expectations to define the event. For earnings, the event is typically taken to be the earnings announcement minus (some average of) analysts' forecasts. Similarly, for monetary policy moves, the event could be specified as the interest rate decision minus previous forward rates (as a measure of previous expectations).

The abnormal return of asset i in period t is

u_{i,t} = R_{i,t} − R^{normal}_{i,t},   (12.1)

where R_{i,t} is the actual return and the last term is the normal return (which may differ across assets and time). The definition of the normal return is discussed in detail in Section 12.2. These returns could be nominal returns, but more likely (at least for slightly longer horizons) real returns or excess returns.

Suppose we have a sample of n such events ("assets"). To keep the notation (reasonably) simple, we "normalize" the time so period 0 is the time of the event. Clearly the actual calendar times of the events for assets i and j are likely to differ, but we shift the time line for each asset individually so the time of the event is normalized to zero for every asset. See Figure 12.1 for an illustration.

[Figure 12.1: Event days and windows. Event time (−1, 0, 1) around the event day, shown separately for firm 1 and firm 2.]

The (cross-sectional) average abnormal return at the event time (time 0) is

ū_0 = (u_{1,0} + u_{2,0} + … + u_{n,0})/n = Σ_{i=1}^{n} u_{i,0}/n.   (12.2)

To control for information leakage and slow price adjustment, the abnormal return is often calculated for some time before and after the event: the "event window" (often ±20 days or so). For lead s (that is, s periods after the event time 0), the cross-sectional average abnormal return is

ū_s = (u_{1,s} + u_{2,s} + … + u_{n,s})/n = Σ_{i=1}^{n} u_{i,s}/n.   (12.3)

For instance, ū_2 is the average abnormal return two days after the event, and ū_{−1} is for one day before the event.

The cumulative abnormal return (CAR) of asset i is simply the sum of the abnormal return in (12.1) over some period around the event. It is often calculated from the beginning of the event window. For instance, if the event window starts at −w, then the q-period (day?) car for asset i is

car_{i,q} = u_{i,−w} + u_{i,−w+1} + … + u_{i,−w+q−1}.   (12.4)

The cross-sectional average of the q-period car is

car̄_q = (car_{1,q} + car_{2,q} + … + car_{n,q})/n = Σ_{i=1}^{n} car_{i,q}/n.   (12.5)

See Figure 12.2 for an empirical example.
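As a sketch of (12.2)–(12.5) in Python/NumPy, suppose we already have a matrix of abnormal returns with assets in rows and event time in columns; the numbers anticipate Example 12.1 below.

import numpy as np

# abnormal returns u[i, s]: n assets in rows, event days -1, 0, +1 in columns
# (the numbers anticipate Example 12.1 below)
u = np.array([[0.2, 1.0, 0.1],
              [-0.1, 2.0, 0.3]])

ubar = u.mean(axis=0)        # eq (12.2)-(12.3): average abnormal return per event day
car = u.cumsum(axis=1)       # eq (12.4): CAR of each asset from the start of the window
carbar = car.mean(axis=0)    # eq (12.5): cross-sectional average CAR

print("average abnormal returns (days -1, 0, +1):", ubar)     # [0.05 1.5  0.2 ]
print("average CARs             (days -1, 0, +1):", carbar)   # [0.05 1.55 1.75]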
[Figure 12.2: Event study of IPOs in Shanghai 2001–2004. Cumulative excess return (average) with a 90% confidence band; sample: 196 IPOs on the Shanghai Stock Exchange, 2001–2004; x-axis: days after IPO; y-axis: returns, %. (Data from Nou Lai.)]

Example 12.1 (Abnormal returns for ±1 day around event, two firms) Suppose there are two firms and the event window contains ±1 day around the event day, and that the abnormal returns (in percent) are

Time   Firm 1   Firm 2   Cross-sectional Average
 −1      0.2     −0.1       0.05
  0      1.0      2.0       1.5
  1      0.1      0.3       0.2

We have the following cumulative returns

Time   Firm 1   Firm 2   Cross-sectional Average
 −1      0.2     −0.1       0.05
  0      1.2      1.9       1.55
  1      1.3      2.2       1.75

12.2 Models of Normal Returns

This section summarizes the most common ways of calculating the normal return in (12.1). The parameters in these models are typically estimated on a recent sample, the "estimation window," that ends before the event window. See Figure 12.3 for an illustration. (When there is no return data before the event window (for instance, when the event is an IPO), then the estimation window can be after the event window.)

In this way, the estimated behaviour of the normal return should be unaffected by the event. It is almost always assumed that the event is exogenous in the sense that it is not due to the movements in the asset price during either the estimation window or the event window. This allows us to get a clean estimate of the normal return.

The constant mean return model assumes that the return of asset i fluctuates randomly around some mean μ_i

R_{i,t} = μ_i + ξ_{i,t} with E(ξ_{i,t}) = Cov(ξ_{i,t}, ξ_{i,t−s}) = 0.   (12.6)

This mean is estimated by the sample average (during the estimation window). The normal return in (12.1) is then the estimated mean, μ̂_i, so the abnormal return becomes ξ̂_{i,t}.

The market model is a linear regression of the return of asset i on the market return

R_{i,t} = α_i + β_i R_{m,t} + ε_{i,t} with E(ε_{i,t}) = Cov(ε_{i,t}, ε_{i,t−s}) = Cov(ε_{i,t}, R_{m,t}) = 0.   (12.7)

Notice that we typically do not impose the CAPM restrictions on the intercept in (12.7). The normal return in (12.1) is then calculated by combining the regression coefficients with the actual market return as α̂_i + β̂_i R_{m,t}, so the abnormal return becomes ε̂_{i,t}. When we restrict α_i = 0 and β_i = 1, then this approach is called the market-adjusted-return model. This is a particularly useful approach when there is no return data before the event, for instance, with an IPO.

Recently, the market model has increasingly been replaced by a multi-factor model which uses several regressors instead of only the market return. For instance, Fama and French (1993) argue that (12.7) needs to be augmented by a portfolio that captures the different returns of small and large firms and also by a portfolio that captures the different returns of firms with high and low book-to-market ratios.

Finally, another approach is to construct a normal return as the actual return on assets which are very similar to the asset with an event. For instance, if asset i is a small manufacturing firm (with an event), then the normal return could be calculated as the actual return for other small manufacturing firms (without events). In this case, the abnormal return becomes the difference between the actual return and the return on the matching portfolio. This type of matching portfolio is becoming increasingly popular.
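A minimal sketch of the market model approach in (12.7): estimate α_i and β_i by least squares over the estimation window, and use the fitted values as the normal return in the event window. The return series below are simulated placeholders; in practice they would be actual asset and market returns.

import numpy as np

rng = np.random.default_rng(2)

# placeholder data: 250-day estimation window and a 41-day event window
Rm_est = 0.0005 + 0.01 * rng.standard_normal(250)                  # market return
Ri_est = 0.0002 + 1.2 * Rm_est + 0.02 * rng.standard_normal(250)   # asset return
Rm_evt = 0.0005 + 0.01 * rng.standard_normal(41)
Ri_evt = 0.0002 + 1.2 * Rm_evt + 0.02 * rng.standard_normal(41)

# OLS of R_i on a constant and R_m over the estimation window, eq (12.7)
X = np.column_stack([np.ones(len(Rm_est)), Rm_est])
alpha, beta = np.linalg.lstsq(X, Ri_est, rcond=None)[0]

# abnormal returns in the event window: actual minus normal return
u = Ri_evt - (alpha + beta * Rm_evt)
print(f"alpha = {alpha:.5f}, beta = {beta:.3f}")
print("abnormal return on the event day (middle of the window):", u[len(u) // 2])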
[Figure 12.3: Event and estimation windows. Time line with the estimation window (for the normal return) preceding the event window.]

All the methods discussed here try to take into account the risk premium on the asset. It is captured by the mean in the constant mean model, the beta in the market model, and by the way the matching portfolio is constructed. However, sometimes there is no data in the estimation window. The typical approach is then to use the actual market return as the normal return—that is, to use (12.7) but assuming that α_i = 0 and β_i = 1. Clearly, this does not account for the risk premium on asset i, and is therefore a fairly rough guide.

Apart from accounting for the risk premium, does the choice of the model of the normal return matter a lot? Yes, but only if the model produces a higher coefficient of determination (R²) than competing models. In that case, the variance of the abnormal return is smaller for the market model, which makes the test more precise (see Section 12.3 for a discussion of how the variance of the abnormal return affects the variance of the test statistic). To illustrate this, consider the market model (12.7). Under the null hypothesis that the event has no effect on the return, the abnormal return would be just the residual in the regression (12.7). It has the variance (assuming we know the model parameters)

Var(u_{i,t}) = Var(ε_{i,t}) = (1 − R²) Var(R_{i,t}),   (12.8)

where R² is the coefficient of determination of the regression (12.7).

Proof. (of (12.8)) From (12.7) we have

Var(R_{i,t}) = β_i² Var(R_{m,t}) + Var(ε_{i,t}).

We therefore get

Var(ε_{i,t}) = Var(R_{i,t}) − β_i² Var(R_{m,t})
             = Var(R_{i,t}) − Cov(R_{i,t}, R_{m,t})² / Var(R_{m,t})
             = Var(R_{i,t}) − Corr(R_{i,t}, R_{m,t})² Var(R_{i,t})
             = (1 − R²) Var(R_{i,t}).

The second equality follows from the fact that β_i = Cov(R_{i,t}, R_{m,t})/Var(R_{m,t}), the third equality from multiplying and dividing the last term by Var(R_{i,t}) and using the definition of the correlation, and the fourth equality from the fact that the coefficient of determination in a simple regression equals the squared correlation of the dependent variable and the regressor.

This variance is crucial for testing the hypothesis of no abnormal returns: the smaller the variance, the easier it is to reject a false null hypothesis (see Section 12.3). The constant mean model has R² = 0, so the market model could potentially give a much smaller variance. If the market model has R² = 0.75, then the standard deviation of the abnormal return is only half that of the constant mean model. More realistically, R² might be 0.43 (or less), so the market model gives a 25% decrease in the standard deviation, which is not a whole lot. Experience with multi-factor models also suggests that they give relatively small improvements of the R² compared to the market model. For these reasons, and for reasons of convenience, the market model is still the dominating model of normal returns.

High frequency data can be very helpful, provided the time of the event is known. High frequency data effectively allows us to decrease the volatility of the abnormal return, since it filters out irrelevant (for the event study) shocks to the return while still capturing the effect of the event.

12.3 Testing the Abnormal Return

In testing if the abnormal return is different from zero, there are two sources of sampling uncertainty. First, the parameters of the normal return are uncertain. Second, even if we knew the normal return for sure, the actual returns are random variables—and they will always deviate from their population mean in any finite sample.
The first source of uncertainty is likely to be much smaller than the second—provided the estimation window is much longer than the event window. This is the typical situation, so the rest of the discussion will focus on the second source of uncertainty.

It is typically assumed that the abnormal returns are uncorrelated across time and across assets. The first assumption is motivated by the very low autocorrelation of returns. The second assumption makes a lot of sense if the events are not overlapping in time, so that the events of assets i and j happen at different (calendar) times. It can also be argued that the model for the normal return (for instance, a market model) should capture all common movements by the regressors — leaving the abnormal returns (the residuals) uncorrelated across firms. In contrast, if the events happen at the same time, the cross-correlation must be handled somehow. This is, for instance, the case if the events are macroeconomic announcements or monetary policy moves. An easy way to handle such synchronized (clustered) events is to form portfolios of those assets that share the event time—and then only use portfolios with non-overlapping events in the cross-sectional study. For the rest of this section we assume no autocorrelation or cross-correlation.

Let σ_i² = Var(u_{i,t}) be the variance of the abnormal return of asset i. The variance of the cross-sectional (across the n assets) average, ū_s in (12.3), is then

Var(ū_s) = (σ_1² + σ_2² + … + σ_n²)/n² = Σ_{i=1}^{n} σ_i²/n²,   (12.9)

since all covariances are assumed to be zero. In a large sample (where the asymptotic normality of a sample average starts to kick in), we can therefore use a t-test since

ū_s / Std(ū_s) →_d N(0, 1).   (12.10)

The cumulative abnormal return over q periods, car_{i,q}, can also be tested with a t-test. Since the returns are assumed to have no autocorrelation, the variance of car_{i,q} is

Var(car_{i,q}) = q σ_i².   (12.11)

This variance is increasing in q since we are considering cumulative returns (not the time average of returns).

The variance of the cross-sectional average of the q-period CARs is then (similarly to (12.9))

Var(car̄_q) = (qσ_1² + qσ_2² + … + qσ_n²)/n² = q Σ_{i=1}^{n} σ_i²/n²,   (12.12)

if the abnormal returns are uncorrelated across time and assets.

Figures 4.2a–b in Campbell, Lo, and MacKinlay (1997) provide a nice example of an event study (based on the effect of earnings announcements).

Example 12.2 (Variances of abnormal returns) If the standard deviations of the daily abnormal returns of the two firms in Example 12.1 are σ_1 = 0.1 and σ_2 = 0.2, then we have the following variances for the abnormal returns at different days

Time   Firm 1   Firm 2   Cross-sectional Average
 −1     0.1²     0.2²     (0.1² + 0.2²)/4
  0     0.1²     0.2²     (0.1² + 0.2²)/4
  1     0.1²     0.2²     (0.1² + 0.2²)/4

Similarly, the variances for the cumulative abnormal returns are

Time   Firm 1     Firm 2     Cross-sectional Average
 −1      0.1²       0.2²      (0.1² + 0.2²)/4
  0    2 × 0.1²   2 × 0.2²    2(0.1² + 0.2²)/4
  1    3 × 0.1²   3 × 0.2²    3(0.1² + 0.2²)/4

Example 12.3 (Tests of abnormal returns) By dividing the numbers in Example 12.1 by the square root of the numbers in Example 12.2 (that is, the standard deviations) we get the test statistics for the abnormal returns

Time   Firm 1   Firm 2   Cross-sectional Average
 −1       2      −0.5       0.4
  0      10      10        13.4
  1       1       1.5       1.8

Similarly, the test statistics for the cumulative abnormal returns are

Time   Firm 1   Firm 2   Cross-sectional Average
 −1      2       −0.5       0.4
  0      8.5      6.7       9.8
  1      7.5      6.4       9.0
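The following sketch reproduces the "Cross-sectional Average" test statistics of Example 12.3 from the abnormal returns in Example 12.1 and the standard deviations in Example 12.2, using (12.9) and (12.12).

import numpy as np

# abnormal returns from Example 12.1 (rows: firms; columns: days -1, 0, +1)
u = np.array([[0.2, 1.0, 0.1],
              [-0.1, 2.0, 0.3]])
sd = np.array([0.1, 0.2])            # daily std of the abnormal returns (Example 12.2)
n = len(sd)

ubar = u.mean(axis=0)                # average abnormal return per day
var_ubar = (sd**2).sum() / n**2      # eq (12.9)
print("t-stats, abnormal returns:  ", ubar / np.sqrt(var_ubar))      # approx [0.4 13.4 1.8]

carbar = u.cumsum(axis=1).mean(axis=0)      # cross-sectional average CARs
q = np.arange(1, u.shape[1] + 1)            # number of days summed: 1, 2, 3
var_carbar = q * (sd**2).sum() / n**2       # eq (12.12)
print("t-stats, cumulative returns:", carbar / np.sqrt(var_carbar))  # approx [0.4 9.8 9.0]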
12.4 Quantitative Events

Some events are not easily classified as discrete variables. For instance, the effect of a positive earnings surprise is likely to depend on how large the surprise is—not just on whether there was a positive surprise. This can be studied by regressing the abnormal return (typically the cumulative abnormal return) on the value of the event (x_i)

car_{i,q} = a + b x_i + ξ_i.   (12.13)

The slope coefficient is then a measure of how much the cumulative abnormal return reacts to a change of one unit of x_i.
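A sketch of (12.13): regress the cumulative abnormal returns on the size of the event with OLS. The data below are invented placeholders with a true slope of 2.

import numpy as np

rng = np.random.default_rng(3)

x = rng.standard_normal(100)                       # size of the event, invented data
car = 0.5 + 2.0 * x + rng.standard_normal(100)     # invented CARs with a true slope of 2

X = np.column_stack([np.ones_like(x), x])
a, b = np.linalg.lstsq(X, car, rcond=None)[0]      # OLS estimates of (a, b) in (12.13)
print(f"intercept a = {a:.2f}, slope b = {b:.2f}")  # b: effect on the CAR of one unit of x_i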
Bibliography

Bodie, Z., A. Kane, and A. J. Marcus, 2005, Investments, McGraw-Hill, Boston, 6th edn.

Campbell, J. Y., A. W. Lo, and A. C. MacKinlay, 1997, The Econometrics of Financial Markets, Princeton University Press, Princeton, New Jersey.

Copeland, T. E., J. F. Weston, and K. Shastri, 2005, Financial Theory and Corporate Policy, Pearson Education, 4th edn.

Fama, E. F., and K. R. French, 1993, "Common risk factors in the returns on stocks and bonds," Journal of Financial Economics, 33, 3–56.

13 Kernel Density Estimation and Regression

13.1 Non-Parametric Regression

Reference: Campbell, Lo, and MacKinlay (1997) 12.3; Härdle (1990); Pagan and Ullah (1999); Mittelhammer, Judge, and Miller (2000) 21

13.1.1 Simple Kernel Regression

Non-parametric regressions are used when we are unwilling to impose a parametric form on the regression equation—and we have a lot of data.

Let the scalars y_t and x_t be related as

y_t = b(x_t) + ε_t,   ε_t is iid and E ε_t = Cov[b(x_t), ε_t] = 0,   (13.1)

where b(·) is a possibly non-linear function.

Suppose the sample had 3 observations (say, t = 3, 27, and 99) with exactly the same value of x_t, say 1.9. A natural way of estimating b(x) at x = 1.9 would then be to average over these 3 observations, as we can expect the average of the error terms to be close to zero (iid and zero mean).

Unfortunately, we seldom have repeated observations of this type. Instead, we may try to approximate the value of b(x) (x is a single value, 1.9, say) by averaging over observations where x_t is close to x. The general form of this type of estimator is

b̂(x) = [w(x_1 − x)y_1 + w(x_2 − x)y_2 + … + w(x_T − x)y_T] / [w(x_1 − x) + w(x_2 − x) + … + w(x_T − x)]
      = Σ_{t=1}^{T} w(x_t − x)y_t / Σ_{t=1}^{T} w(x_t − x),   (13.2)

where w(x_t − x)/Σ_{t=1}^{T} w(x_t − x) is the weight given to observation t. The function w(x_i − x) is positive and (weakly) decreasing in the distance between x_i and x. Note that the denominator makes the weights sum to unity. The basic assumption behind (13.2) is that the b(x) function is smooth, so local (around x) averaging makes sense.
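A small sketch of the general estimator (13.2), with the weight function w(·) left as a plug-in argument; the three data points are the same as in Example 13.1 further below.

import numpy as np

def bhat(x, xdata, ydata, w):
    """Local weighted average of y at the point x, eq (13.2).

    w maps the distances x_t - x to (non-negative) weights."""
    wt = w(xdata - x)
    return (wt * ydata).sum() / wt.sum()

xdata = np.array([1.5, 2.0, 2.5])     # the data points of Example 13.1 below
ydata = np.array([5.0, 4.0, 3.5])

# one possible weight function: a normal pdf with h = 1 (the kernel used in (13.3))
w = lambda d: np.exp(-d**2 / 2) / np.sqrt(2 * np.pi)
print(bhat(1.9, xdata, ydata, w))     # roughly 4.19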
As an example of a w(·) function, it could give equal weight to the k values of x_t which are closest to x and zero weight to all other observations (this is the "k-nearest neighbor" estimator, see Härdle (1990) 3.2). As another example, the weight function could be defined so that it trades off the expected squared errors, E[y_t − b̂(x)]², and the expected squared acceleration, E[(d²b̂(x)/dx²)²]. This defines a cubic spline (and is often used in macroeconomics, where x_t = t, and is then called the Hodrick-Prescott filter).

A kernel regression uses a pdf as the weight function w(·). The pdf of N(0, h²) is commonly used, where the choice of h allows us to easily vary the relative weights of different observations. This weighting function is positive, so all observations get a positive weight, but the weights are highest for observations close to x and then taper off in a bell-shaped way. See Figure 13.1 for an illustration.

[Figure 13.1: Example of kernel regression with three data points. Panels show the data and the weights for b(1.7), b(1.9), and b(2.1); data on x: 1.5, 2.0, 2.5; data on y: 5.0, 4.0, 3.5; the weights come from an N(0, 0.25²) kernel; left y-axis: data, right y-axis: weights; ⊗ denotes the fitted b(x).]

A low value of h means that the weights taper off fast—the weight function is then a normal pdf with a low variance. With this particular kernel, we get the following estimator of b(x) at a point x

b̂(x) = Σ_{t=1}^{T} K_h(x_t − x)y_t / Σ_{t=1}^{T} K_h(x_t − x), where K_h(x_t − x) = exp[−(x_t − x)²/(2h²)] / (h√(2π)).   (13.3)

Note that K_h(x_t − x) corresponds to the weighting function w(x_t − x) in (13.2).

In practice, we have to estimate b̂(x) at a finite number of points x. This could, for instance, be 100 evenly spread points in the interval between the minimum and maximum values observed in the sample. See Figure 13.2 for an illustration. Special corrections might be needed if there are a lot of observations stacked close to the boundary of the support of x (see Härdle (1990) 4.4).

Example 13.1 Suppose the sample has three data points [x_1, x_2, x_3] = [1.5, 2, 2.5] and [y_1, y_2, y_3] = [5, 4, 3.5]. Consider the estimation of b(x) at x = 1.9. With h = 1, the numerator in (13.3) is

Σ_{t=1}^{T} K_h(x_t − x)y_t = (e^{−(1.5−1.9)²/2} × 5 + e^{−(2−1.9)²/2} × 4 + e^{−(2.5−1.9)²/2} × 3.5)/√(2π)
                            ≈ (0.92 × 5 + 1.0 × 4 + 0.84 × 3.5)/√(2π)
                            = 11.52/√(2π).

The denominator is

Σ_{t=1}^{T} K_h(x_t − x) = (e^{−(1.5−1.9)²/2} + e^{−(2−1.9)²/2} + e^{−(2.5−1.9)²/2})/√(2π)
                         ≈ 2.75/√(2π).

The estimate at x = 1.9 is therefore

b̂(1.9) ≈ 11.52/2.75 ≈ 4.19.

Kernel regressions are typically consistent, provided longer samples are accompanied by smaller values of h, so the weighting function becomes more and more local as the sample size increases.

See Figures 13.3–13.4 for an example. Note that the volatility is defined as the square of the drift minus the expected drift (from the same estimation method).
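A sketch of the kernel regression estimator (13.3), first checked against Example 13.1 (it gives b̂(1.9) ≈ 4.19) and then evaluated on a grid of 100 evenly spread points, as described above; the bandwidth for the grid evaluation is an illustrative choice.

import numpy as np

def kernel_reg(xgrid, xdata, ydata, h):
    """Kernel regression with an N(0, h^2) kernel, eq (13.3), at every point in xgrid."""
    d = xgrid[:, None] - xdata[None, :]                   # grid points x observations
    K = np.exp(-d**2 / (2 * h**2)) / (h * np.sqrt(2 * np.pi))
    return (K * ydata).sum(axis=1) / K.sum(axis=1)

xdata = np.array([1.5, 2.0, 2.5])
ydata = np.array([5.0, 4.0, 3.5])

# check against Example 13.1: b(1.9) is roughly 4.19 when h = 1
print(kernel_reg(np.array([1.9]), xdata, ydata, h=1.0))

# fitted curve on 100 evenly spread points between min(x) and max(x), here with h = 0.25
xgrid = np.linspace(xdata.min(), xdata.max(), 100)
bfit = kernel_reg(xgrid, xdata, ydata, h=0.25)
print(bfit[:5])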
[Figure 13.2: Example of kernel regression with three data points. Kernel regression fits of y on x for h = 0.25 and h = 0.2, together with the data.]

[Figure 13.3: Federal funds rate. Drift (Δy_{t+1}) and volatility (Vol_{t+1} = (Δy_{t+1} − fitted Δy_{t+1})²) versus the level y_t, estimated in bins; daily federal funds rates 1954:7–2009:6.]

13.1.2 Multivariate Kernel Regression

Suppose that y_t depends on two variables (x_t and z_t)

y_t = b(x_t, z_t) + ε_t,   ε_t is iid and E ε_t = 0.   (13.4)

This makes the estimation problem much harder since there are typically few observations in every bivariate bin (rectangle) of x and z. For instance, with as few as 20 intervals of each of x and z, we get 400 bins, so we need a large sample to have a reasonable number of observations in every bin.

In any case, the most common way to implement the kernel regressor is to let

b̂(x, z) = Σ_{t=1}^{T} K_{hx}(x_t − x)K_{hz}(z_t − z)y_t / Σ_{t=1}^{T} K_{hx}(x_t − x)K_{hz}(z_t − z),   (13.5)

where K_{hx}(x_t − x) and K_{hz}(z_t − z) are two kernels like in (13.3) and where we may allow hx and hz to be different (and depend on the variances of x_t and z_t). In this case, the weight of the observation (x_t, z_t) is proportional to K_{hx}(x_t − x)K_{hz}(z_t − z), which is high if both x_t and z_t are close to x and z respectively.

See Figure 13.4 for an example.

[Figure 13.4: Federal funds rate. Drift and volatility versus the level y_t from kernel regressions (small and large h) and from LS and LAD regressions; Vol_{t+1} = (Δy_{t+1} − fitted Δy_{t+1})²; daily federal funds rates 1954:7–2009:6.]
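A sketch of the bivariate estimator (13.5) with a product of N(0, h²) kernels; the normalizing constants of the kernels cancel in the ratio and are therefore omitted. The data are invented placeholders.

import numpy as np

def kernel_reg_2d(x, z, xdata, zdata, ydata, hx, hz):
    """Bivariate kernel regression with a product of N(0, h^2) kernels, eq (13.5)."""
    Kx = np.exp(-(xdata - x)**2 / (2 * hx**2))      # normalizing constants cancel in the ratio
    Kz = np.exp(-(zdata - z)**2 / (2 * hz**2))
    w = Kx * Kz                                     # weight of each observation (x_t, z_t)
    return (w * ydata).sum() / w.sum()

rng = np.random.default_rng(4)
xdata = rng.uniform(0, 10, 500)                     # invented data where y depends on x and z
zdata = rng.uniform(0, 10, 500)
ydata = np.sin(xdata) + 0.5 * zdata + 0.1 * rng.standard_normal(500)

print(kernel_reg_2d(5.0, 5.0, xdata, zdata, ydata, hx=0.5, hz=0.5))
# true value of b(5, 5) is sin(5) + 2.5, roughly 1.54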
13.2 Examples of Non-Parametric Estimation

13.2.1 A Model for the Short Interest Rate

Interest rate models are typically designed to describe the movements of the entire yield curve in terms of a small number of factors. For instance, a simple model assumes that the short interest rate, r_t, is a mean-reverting AR(1) process

r_t = ρ r_{t−1} + ε_t, where ε_t ~ N(0, σ²), so   (13.6)
r_t − r_{t−1} = (ρ − 1) r_{t−1} + ε_t,   (13.7)

and that all term premia are constant. This means that the drift is decreasing in the interest rate, but that the volatility is constant.

(The usual assumption is that the short interest rate follows an Ornstein-Uhlenbeck diffusion process, which implies the discrete time model in (13.6).) It can then be shown that all interest rates (for different maturities) are linear functions of short interest rates.

To capture more movements in the yield curve, models with richer dynamics are used. For instance, Cox, Ingersoll, and Ross (1985) construct a model which implies that the short interest rate follows an AR(1) as in (13.6) except that the variance is proportional to the interest rate level, so ε_t ~ N(0, r_{t−1}σ²).

Recently, non-parametric methods have been used to estimate how the drift and volatility are related to the interest rate level (see, for instance, Ait-Sahalia (1996)). Figures 13.3–13.4 give an example. Note that the volatility is defined as the square of the drift minus the expected drift (from the same estimation method).

13.2.2 Non-Parametric Option Pricing

There seem to be systematic deviations from the Black-Scholes model. For instance, implied volatilities are often higher for options far from the current spot (or forward) price—the volatility smile. This is sometimes interpreted as if the beliefs about the future log asset price put larger probabilities on very large movements than what is compatible with the normal distribution ("fat tails").

This has spurred many efforts to both describe the distribution of the underlying asset price and to amend the Black-Scholes formula by adding various adjustment terms. One strand of this literature uses non-parametric regressions to fit observed option prices to the variables that also show up in the Black-Scholes formula (spot price of underlying asset, strike price, time to expiry, interest rate, and dividends). For instance, Ait-Sahalia and Lo (1998) apply this to daily data for Jan 1993 to Dec 1993 on S&P 500 index options (14,000 observations). They find interesting patterns of the implied moments (mean, volatility, skewness, and kurtosis) as the time to expiry changes. In particular, the non-parametric estimates suggest that distributions for longer horizons have increasingly larger skewness and kurtosis. Whereas the distributions for short horizons are not too different from normal distributions, this is not true for longer horizons.

Bibliography

Ait-Sahalia, Y., 1996, "Testing Continuous-Time Models of the Spot Interest Rate," Review of Financial Studies, 9, 385–426.

Ait-Sahalia, Y., and A. W. Lo, 1998, "Nonparametric Estimation of State-Price Densities Implicit in Financial Asset Prices," Journal of Finance, 53, 499–547.

Campbell, J. Y., A. W. Lo, and A. C. MacKinlay, 1997, The Econometrics of Financial Markets, Princeton University Press, Princeton, New Jersey.

Cox, J. C., J. E. Ingersoll, and S. A. Ross, 1985, "A Theory of the Term Structure of Interest Rates," Econometrica, 53, 385–407.

Härdle, W., 1990, Applied Nonparametric Regression, Cambridge University Press, Cambridge.

Mittelhammer, R. C., G. J. Judge, and D. J. Miller, 2000, Econometric Foundations, Cambridge University Press, Cambridge.

Pagan, A., and A. Ullah, 1999, Nonparametric Econometrics, Cambridge University Press.