CONTENTS

1 Introduction

Part I: Basic Statistics
3.1 Statistical Models
3.2 Random Variables
3.3 Moments of Random Variables
3.4 Popular Distributions in Econometrics

Part II

Part III

16 Encompassing

17 ARCH Models
17.0.1 Practical Modelling Tips
17.1 Some ARCH Theory
17.2 Some Different Types of ARCH and GARCH Models
17.3 The Estimation of ARCH Models

18 Econometrics and Rational Expectations
18.0.1 Rational vs. Other Types of Expectations
18.0.2 Typical Errors in the Modeling of Expectations
18.0.3 Modeling Rational Expectations
18.0.4 Testing Rational Expectations

19 A Research Strategy

20 References
20.1 Appendix 1
20.2 Appendix III: Operators
20.2.1 The Expectations Operator
20.2.2 The Variance Operator
20.2.3 The Covariance Operator
20.2.4 The Sum Operator
20.2.5 The Plim Operator
20.2.6 The Lag and the Difference Operators
1. INTRODUCTION
Please respect that this is work in progress. It has never been my intention to write a commercial book, or a perfect textbook in time series econometrics. It is simply a collection of lectures in a popular form that can serve as a complement to ordinary textbooks and articles used in education. The parts dealing with tests for unit roots (order of integration) and cointegration are not well developed. These topics have a memo of their own, "A Guide to Testing for Unit Roots and Cointegration".

When I started to put these lecture notes together some years ago I decided on the title "Lectures in Modern Time Series Econometrics" because I thought that the contents were a bit "modern" compared to standard econometric textbooks. During the fall of 2010, as I started to update the notes, I thought that it was time to remove the word "modern" from the title. A quick look in Damodar Gujarati's textbook "Basic Econometrics" from 2009 convinced me to keep the word "modern" in the title. Gujarati's text on time series hasn't changed since the 1970s, even though time series econometrics has changed completely since the 70s. Thus, under these circumstances I see no reason to change the title, at least not yet.
There are four ways in which one can do time series econometrics. The first is to use the approach of the 1970s: view your time series model just like any linear regression, and impose a number of ad hoc restrictions that will hide all the problems you find. This is not a good approach. It is only found in old textbooks and never in today's research; you might only see it used in low-ranked scientific journals. Second, you can use theory to derive a time series model, and interesting parameters, that you then estimate with appropriate estimators. Examples of this are to derive utility functions, assume that agents have rational expectations, etc. This is a proper research strategy. However, it typically takes good data, and you need to be original in your approach, but you can get published in good journals. The third approach is simply to do a statistical description of the data series, in the form of a vector autoregressive system, or the reduced form of the vector error correction model. This system can be used for forecasting, for analysing relationships among data series, and for investigating the effects of unforeseen shocks such as drastic changes in energy prices, money supply, etc. The fourth way is to go beyond the vector autoregressive system and try to estimate structural parameters in the form of elasticities and policy intervention parameters. If you forget about the first method, the choice depends on the problem at hand and how you choose to formulate it. This book aims at telling you how to use methods three and four.

The basic thinking is that your data is the real world; theories are abstractions that we use to understand the real world. In applied econometric time series you should always strive to build well-defined statistical models, that is, models that are consistent with the data chosen. There is a complex statistical theory behind all this, which I will try to popularize in this book. I do not see this book as a substitute for an ordinary textbook. It is simply a complement.
We base decisions on some view of the economy, where we assume that certain events are linked to each other in more or less complex ways. Economists call this a model of the economy. We can describe the economy and the behavior of individuals in terms of multivariate stochastic processes. Decisions based on stochastic sequences play a central role in economics and in finance. Stochastic processes are the basis for our understanding of the behavior of economic agents and of how their behavior determines the future path of the economy. Most econometric textbooks deal with stochastic time series as a special application of the linear regression technique. Though this approach is acceptable for an introductory course in econometrics, it is unsatisfactory for students with a deeper interest in economics and finance. To understand the empirical and theoretical work in these areas, it is necessary to understand some of the basic philosophy behind stochastic time series.
This work is a work in progress. It is based on my lectures on Modern Economic Time Series Analysis at the Department of Economics, first at the University of Gothenburg and later at the University of Skövde and Linköping University in Sweden. The material is not ready for widespread distribution. This work, most likely, contains lots of errors; some are known by the author, and some are not yet detected. The different sections do not necessarily follow in a logical order. Therefore, I invite anyone who has opinions about this work to share them with me.

The first part of this work provides a repetition of some basic statistical concepts, which are necessary for understanding modern economic time series analysis. The motive for repeating these concepts is that they play a larger role in econometrics than many contemporary textbooks in econometrics indicate. Econometrics did not change much from the first edition of Johnston in the 60s until the revised version of Kmenta in the mid 80s. However, the critique against the use of econometrics delivered by Sims, Lucas, Leamer, Hendry and others, in combination with new insights into the behavior of non-stationary time series and the rapid development of computer technology, has revolutionized econometric modeling and resulted in an explosion of knowledge. The demands for writing a decent thesis, or a scientific paper, based on econometric methods have risen far beyond what one can learn in an introductory course in econometrics.
The differences between the groups were not big, but sufficiently big for those wanting to confirm their a priori beliefs that GM food is bad. A somewhat embarrassing detail, never reported in the media, is that rats in general do not like potatoes. As a consequence, both groups of rats in this study were suffering from starvation, which severely affected the test. It was not possible to determine if the difference between the two groups was caused by starvation or by GM food. Once the researcher conditioned on the effects of starvation, the difference became insignificant. This is an example of junk science: bad science getting a lot of media exposure because the results fit the interests of lobby groups and can be used to scare people.
The lesson for econometricians is obvious: if you come up with good results you get rewarded; bad results, on the other hand, can quickly be forgotten. The GM food example is an extreme case compared with most econometric work. Econometric research seldom gets such media coverage, though there are examples, such as claims that Sweden's economic growth is lower than that of other similar countries, or the assumed dynamic effects of a reduction of marginal taxes. There are significant results that depend on one single outlier. Once the outlier is removed, the significance is gone, and the whole story behind this particular book is also gone.

In these lectures we will argue that the only way to avoid junk econometrics is careful and systematic construction and testing of models. Basically, this is the modern econometric time series approach. Why is this modern, and why stress the idea of testing? The answers are simply that careers have been built on running junk econometric equations, and most people are unfamiliar with scientific methods in general and with the consequences of living in a world surrounded by random variables in particular.
Observed economic time series are not a random sample in the classical statistical sense, because the econometrician cannot control the sampling process of the variables. Variables like GDP, money, prices and dividends are given from history. To get a different sample we would have to re-run history, which of course is impossible. The way statistical theory deals with this situation is to reverse the approach taken in classical statistical analysis, and build a model that describes the behavior of the observed data. A model which achieves this is called a well defined statistical model; it can be understood as a parsimonious time invariant model, with white noise residuals, that makes sense from economic theory.

Finally, from the view of economics, the subject of statistics deals mainly with the estimation and inference of covariances only. The econometrician, however, must also give estimated parameters an economic interpretation. This problem cannot always be solved ex post, after the model has been estimated. When it comes to time series, economic theory is an integrated part of the modeling process. Given a well defined statistical model, estimated parameters should represent the behavior of economic agents. Many econometric studies fail because researchers assume that their estimates can be given an economic interpretation without considering the statistical properties of the model, or the simple fact that there is in general not a one-to-one correspondence between observed variables and the concepts defined in economic theory.¹
2.1 Programs

Here is a list of statistical software that you should be familiar with, please google (those recommended for time series are marked with *):

- *RATS and CATS in RATS, Regression Analysis of Time Series and Cointegrating Analysis of Time Series (www.estima.com)
- *PcGive - comes highly recommended. Included in the OxMetrics modules; see also Timberlake Consultants for more programs.
- *Gretl (free, GNU license, very good for students in econometrics)
- *JMulTi (free, for multivariate time series analysis, updated? The discussion forum is quite dead, www.jmulti.com)
- *EViews
- Gauss (good for simulation)
- STATA (used by the World Bank, good for microeconometrics, panel data, OK on time series)
- LIMDEP (mostly free with some editions of Greene's econometrics textbook?, you need to pay for duration models?)
- SAS - Statistical Analysis System (good for big data sets, but not time series, mainly medicine, "the calculus program for decision makers")
- Shazam

And more; some are very special programs for this and that, but I don't find them worth mentioning in this context.
1 For a recent discussion about the controversies in econometrics, see The Economic Journal, 1996.
There is a bunch of software that allows you to program your own models or use other people's modules:

- Matlab
- R (free, GNU license, connects with Gretl)
- Ox

You should also know about C, C++, and LaTeX to be a good econometrician. Please google.

For Data Envelopment Analysis (DEA) I recommend Tom Coelli's DEAP 2.1 or Paul W. Wilson's FEAR.
In some situations it is necessary to start from the very beginning. A time series is data ordered by time. A stochastic time series is a set of random variables ordered by time. Let Ỹ_it represent the stochastic variable Ỹ_i given at time t. Observations on this random variable are often indicated as y_it. In general terms, a stochastic time series is a series of random variables ordered by time. A series starting at time t = 1 and ending at time t = T, consisting of T different random variables, is written as {Ỹ_{1,1}, Ỹ_{2,2}, ..., Ỹ_{T,T}}. Of course, assuming that the series is built up by individual random variables, each with its own independent probability distribution, is a complex thought. But nothing in our definition of a stochastic time series rules out that the data is made up of completely different random variables. Sometimes, to understand and find solutions to practical problems, it will be necessary to go all the way back to the most basic assumptions.

Suppose we are given a time series consisting of yearly observations of interest rates, {6.6, 7.5, 5.9, 5.4, 5.5, 4.5, 4.3, 4.8}. The first question to ask is whether this is a stochastic series, in the sense that these numbers were generated by one stochastic process, or perhaps by several different stochastic processes. Further questions would be whether the process or processes are best represented as continuous or discrete, and whether the observations are independent or dependent. Quite often we will assume that the series is generated by the same identical stochastic process in discrete time. Based on these assumptions, the modelling process tries to find systematic historical patterns and cross-correlations with other variables in the data.
All time series methods aim at decomposing the series into separate parts in some way. The standard approach in time series analysis is to decompose as

y_t = T_{t,d} + S_{t,d} + C_{t,d} + I_t,

where T_{t,d} and S_{t,d} represent (deterministic) trend and seasonal components, C_{t,d} is a deterministic cyclical component and I_t is a process representing irregular factors. For time series econometrics this definition is limited, since the econometrician is highly interested in the irregular component. As an alternative, let {y_t} be a stochastic time series process, which is composed as

y_t = T_d + T_s + S_d + S_s + ỹ_t + e_t,   (2.1)

where the subscripts d and s denote deterministic and stochastic trend and seasonal components, ỹ_t is the stationary stochastic part of the process, and e_t is an irregular term.
In your first course in statistics you learned how to use descriptive statistics: the mean and the variance. Next you learned to calculate the mean and the variance from a sample that represents the whole underlying population. For the mean and the variance to work as a description of the underlying population, it is necessary to construct the sample in such a way that the difference between the sample mean and the true population mean is non-systematic, meaning that the difference between the sample mean and the population mean is unpredictable. This means that your estimated sample mean is a random variable with known characteristics.

The most important thing is to construct a sampling mechanism so that the mean calculated from the sample has the characteristics you want it to have. That is, the estimated mean should be unbiased, efficient and consistent. You learn about random variables, probabilities, distribution functions and frequency distributions.
Your first course in econometrics

"A theory should be as simple as possible, but not simpler" (Albert Einstein)

To be completed...

Random variables, OLS, minimizing the sum of squares, assumptions 1-5(6), understanding multiple regression, multicollinearity, properties of the OLS estimator, matrix algebra.

Tests and solutions for heteroscedasticity (cross-section) and autocorrelation (time series).

If you took a good course you should have learned the three golden rules: test, test, test; and learned about the properties of the OLS estimator.

Generalized least squares (GLS).

System estimation: demand and supply models.

Further extensions: panel data, Tobit, Heckit, discrete choice, probit/logit, duration.

Time series: distributed lag models, partial adjustment models, error correction models, lag structure, stationarity vs. non-stationarity, co-integration.

What you need to know, and what you probably do not know but should know.
OLS

Ordinary least squares is a common estimation method. Suppose there are two series {y_t, x_t} and the linear model

y_t = α + β x_t + ε_t.

Minimize the sum of squares over the sample t = 1, 2, ..., T,

S = Σ_{t=1}^T ε_t² = Σ_{t=1}^T (y_t − α − β x_t)².

Take the derivative of S with respect to α and β, set the expressions to zero, and solve for α̂ and β̂:

∂S/∂α = 0 and ∂S/∂β = 0,

which gives β̂ = s_xy/s_x² and α̂ = ȳ − β̂ x̄. The variance decomposition and the coefficient of determination follow as

TSS = ESS + RSS,
1 = ESS/TSS + RSS/TSS,
R² = 1 − RSS/TSS = ESS/TSS.
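As a minimal illustration, the closed-form OLS solution can be computed directly in Python; the data series and parameter values below are invented for the example.

# A minimal sketch of OLS for y_t = alpha + beta*x_t + e_t, using numpy.
# The series, sample size and "true" parameters are invented.
import numpy as np

rng = np.random.default_rng(42)
T = 200
x = rng.normal(0.0, 1.0, T)
eps = rng.normal(0.0, 0.5, T)
y = 1.0 + 2.0 * x + eps              # true alpha = 1, beta = 2

# Solving dS/dalpha = 0 and dS/dbeta = 0 gives the familiar closed forms.
beta_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha_hat = y.mean() - beta_hat * x.mean()

resid = y - alpha_hat - beta_hat * x
RSS = np.sum(resid ** 2)
TSS = np.sum((y - y.mean()) ** 2)
R2 = 1.0 - RSS / TSS                 # R^2 = 1 - RSS/TSS = ESS/TSS
print(alpha_hat, beta_hat, R2)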
Basic assumptions
1) E(ε_t) = 0 for all t
2) var(ε_t) = σ² for all t (homoscedasticity)
3) E(ε_t ε_s) = 0 for t ≠ s (no autocorrelation)
4) x_t is non-stochastic (fixed in repeated samples)
5) no exact linear relationship among the regressors
6) ε_t ~ NID(0, σ²)
Discuss these properties.

Properties: Gauss-Markov, BLUE.

Deviations from the assumptions: misspecification (adding an extra variable, forgetting a relevant variable), multicollinearity, errors-in-variables problems, homoscedasticity vs. heteroscedasticity, autocorrelation.
Part I
Basic Statistics
The distribution function of a random variable X̃ gives the probability of observing a value less than or equal to x,

F(x) = Prob(X̃ ≤ x) for −∞ < x < ∞.   (3.3)

Random variables (RVs) are also called stochastic variables, chance variables, or variates.
It follows that for any two constants a and b, with a < b, the probability that X̃ takes on a value on the interval from a to b is given by

F(b) − F(a) = ∫_{−∞}^{b} f(u) du − ∫_{−∞}^{a} f(u) du   (3.6)
            = ∫_{a}^{b} f(u) du.   (3.7)

For a discrete random variable the mathematical expectation is

E(X̃) = Σ_x x f(x),   (3.8)

where E is the expectation operator and f(x) is the value of its probability function at x. Thus, E(X̃) represents the mean of the discrete random variable X̃, or, in other words, the first moment of the random variable. For a continuous random variable X̃ the mathematical expectation is

E(X̃) = ∫_{−∞}^{∞} x f(x) dx,   (3.9)
where f(x) is the value of its probability density at x. The first moment can also be referred to as the location of the random variable. Location is a more generic concept than the first moment or the mean.

The term moment is used in situations where we are interested in the expected value of a function of a random variable, rather than the expectation of the specific variable itself. Say that we are interested in Ỹ, whose values are related to X̃ by the equation y = g(x). The expectation of Ỹ is equal to the expectation of g(x), since E(Ỹ) = E[g(X̃)]. In the continuous case this leads to

E(Ỹ) = E[g(X̃)] = ∫_{−∞}^{∞} g(x) f(x) dx.   (3.10)

Like density, the term moment, or moment about the origin, has its explanation in physics. (In physics the length of a lever arm is measured as the distance from the origin. Or, if we refer to the example with the rod above, the first moment around the mean would correspond to the horizontal center of gravity of the rod.) Reasoning from intuition, the mean can be seen as the midpoint of the limits of the density. The midpoint can be scaled in such a way that it becomes the origin of the x-axis.
The term moments of a random variable is a more general way of talking about the mean and variance of a variable. Setting g(x) equal to x^r, we get the r:th moment around the origin,

μ′_r = E(X̃^r) = Σ_x x^r f(x),   (3.11)

when X̃ is a discrete variable. In the continuous case we get

μ′_r = E(X̃^r) = ∫_{−∞}^{∞} x^r f(x) dx.   (3.12)

The first moment is nothing else than the mean, or the expected value of X̃. The second central moment is the variance. Higher moments give additional information about the distribution and density functions of random variables.

Now, defining g(X̃) = (X̃ − μ′_1)^r, we get what is called the r:th moment about the mean of the distribution of the random variable X̃. For r = 0, 1, 2, 3, ... we get for a discrete variable

μ_r = E[(X̃ − μ′_1)^r] = Σ_x (x − μ′_1)^r f(x),   (3.13)

and when X̃ is continuous

μ_r = E[(X̃ − μ′_1)^r] = ∫_{−∞}^{∞} (x − μ′_1)^r f(x) dx.   (3.14)
The second moment about the mean, also called the second central moment, is nothing else than the variance of g(x) = x,

var(X̃) = ∫_{−∞}^{∞} [x − E(X̃)]² f(x) dx   (3.15)
       = ∫_{−∞}^{∞} x² f(x) dx − [E(X̃)]²   (3.16)
       = E(X̃²) − [E(X̃)]²,   (3.17)

where f(x) is the value of the probability density function of the random variable X̃ at x. A more generic expression for the variance is dispersion. We can say that the second moment, or the variance, is a measure of dispersion, in the same way as the mean is a measure of location.
The third moment, r = 3, measures asymmetry around the mean, referred to as skewness. The normal distribution is symmetric around the mean: the likelihood of observing a value above or below the mean is the same for a normal distribution. For a right skewed distribution, the likelihood of observing a value higher than the mean is higher than observing a lower value. For a left skewed distribution, the likelihood of observing a value below the mean is higher than observing a value above the mean.

The fourth moment, referred to as kurtosis, measures the thickness of the tails of the distribution. A distribution with thicker tails than the normal is characterized by a higher likelihood of extreme events compared with the normal distribution. Higher moments give further information about the skewness, tails and the peak of the distribution. The fifth, the seventh moments, etc. give more information about the skewness. Even moments, above four, give further information about the thickness of the tails and the peak.
The density of the normal distribution is

f(x) = (2πσ²)^{−1/2} exp[ −(x − μ)² / (2σ²) ].

The normal distribution is characterized by the following: the distribution is symmetric around its mean, and it has only two moments, the mean and the variance, N(μ, σ²). The normal distribution can be standardised to have a mean of zero and a variance of unity (say (x − E(x))/σ_x), and is consequently called a standardised normal distribution, N(0, 1).

In addition, it follows that the first four moments, the mean, the variance, the skewness and the kurtosis, are E(X̃) = μ, Var(X̃) = σ², Sk(X̃) = 0, and Ku(X̃) = 3. There are random variables that are not normal by themselves but become normal if they are logged. The typical examples are stock prices and various macroeconomic variables. Let S_t be a stock price. The dollar return over a given interval, R_t = S_t − S_{t−1}, is not likely to be normally distributed, due to the simple fact that the stock price is rising over time, partly because investors demand a return on their investment, but mostly due to inflation. However, if you take the log of the stock price and calculate the per cent return (approximately) as r_t = ln S_t − ln S_{t−1}, the result is much closer to a normal distribution.
The sample moments can be used to test for normality. Let m_1, m_2, m_3 and m_4 represent the first four sample moments: the mean, the variance, the skewness and the kurtosis moments.² The test statistic is³

JB = T (m_3²/m_2³)/6 + T [(m_4/m_2²) − 3]²/24 + T [3m_1²/(2m_2) − m_1 m_3/m_2²] ~ χ²(2).

This test is known as the Jarque-Bera (JB) test and is the most common test for normality in regression analysis. The null hypothesis is that the series is normally distributed. The null of a normal distribution is rejected if the test statistic is significant. The fact that the test is only valid asymptotically means that we do not know the reason for a rejection in a limited sample. In a less than asymptotic sample, rejection of normality is often caused by outliers. If we think the most extreme value(s) in the sample are non-typical outliers, removing them from the calculation of the sample moments usually results in a non-significant JB test. Removing outliers is, however, ad hoc. It could be that these outliers are typical values of the true underlying distribution.
2 For these moments to be meaningful, the series must be stationary. Also, we would like {x_t} to be an independent process. Finally, notice that the estimators of the higher moments suggested here are not necessarily efficient estimators.

3 This test statistic is for a variable with a non-zero mean. If the variable is adjusted for its mean (say an estimated residual), the terms involving m_1 should be removed from the expression.
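As an illustration, here is a minimal Python sketch of the JB statistic for a mean-adjusted series (so the m_1 terms drop out, as the footnote suggests); the data series is invented for the example.

# A minimal sketch of the Jarque-Bera statistic from sample moments,
# for a mean-adjusted series. The data series is invented.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 500)

d = x - x.mean()
m2 = np.mean(d ** 2)                 # second central moment
m3 = np.mean(d ** 3)                 # third central moment (skewness part)
m4 = np.mean(d ** 4)                 # fourth central moment (kurtosis part)

T = len(x)
JB = T * (m3 ** 2 / m2 ** 3) / 6 + T * (m4 / m2 ** 2 - 3) ** 2 / 24
# Under the null of normality, JB is asymptotically chi-squared with 2 df;
# the 5% critical value is about 5.99.
print(JB)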
For independent random variables the joint density factors as

f(x_1, x_2, ..., x_n) = f(x_1) f(x_2) ⋯ f(x_n).   (3.20)

For independent random variables we can define the r:th product moment as

E(X̃_1^{r_1} X̃_2^{r_2} ⋯ X̃_n^{r_n}) = ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} x_1^{r_1} x_2^{r_2} ⋯ x_n^{r_n} f(x_1, x_2, ..., x_n) dx_1 dx_2 ⋯ dx_n   (3.21)
  = E(X̃_1^{r_1}) E(X̃_2^{r_2}) ⋯ E(X̃_n^{r_n}).   (3.23)
It follows from this result that the variance of a sum of independent random variables is merely the sum of the individual variances,

var(X̃_1 + X̃_2 + ... + X̃_n) = var(X̃_1) + var(X̃_2) + ... + var(X̃_n).   (3.24)

More generally, for a weighted sum of possibly dependent variables,

var( Σ_{i=1}^p a_i X̃_i ) = Σ_{i=1}^p Σ_{j=1}^p a_i a_j σ_ij.   (3.26)

These results hold for matrices as well. If we have Ỹ = AX̃ and Z̃ = BX̃, and the covariance matrix of X̃ is Σ, we have also that

cov(Ỹ, Ỹ) = A Σ A′,   (3.27)
cov(Z̃, Z̃) = B Σ B′,   (3.28)
cov(Ỹ, Z̃) = A Σ B′.   (3.29)
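These matrix results are easy to verify numerically. Below is a small Python sketch that checks cov(Ỹ, Z̃) = AΣB′ by simulation; the matrices A, B and Σ are invented for the illustration.

# A numerical check of cov(Y, Z) = A Sigma B' for Y = A X and Z = B X,
# using simulated draws; A, B and Sigma are invented.
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
A = np.array([[1.0, 1.0], [0.0, 2.0]])
B = np.array([[1.0, -1.0]])

X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=100_000)
Y = X @ A.T                          # rows of Y are A x
Z = X @ B.T                          # rows of Z are B x

C = np.cov(np.vstack([Y.T, Z.T]))    # 3x3: blocks cov(Y), cov(Y,Z), cov(Z)
print(C[:2, 2:])                     # simulated cov(Y, Z)
print(A @ Sigma @ B.T)               # theoretical A Sigma B'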
The conditional density of X̃_1, given X̃_2 and X̃_3, can be written as

φ(x_1 | x_2, x_3) = f(x_1, x_2, x_3) / g(x_2, x_3),   (3.30)

or

f(x_1, x_2, x_3) = φ(x_1 | x_2, x_3) g(x_2, x_3).   (3.31)

Of course, we can define a conditional density for various combinations of X̃_1, X̃_2 and X̃_3, like p(x_1, x_3 | x_2) or g(x_3 | x_1, x_2). And, instead of three different variables, we can talk about the density function for one random variable, say Ỹ_t, for which we have a sample of T observations. If all observations are independent we get

f(y_1, y_2, ..., y_T) = f(y_1) f(y_2) ⋯ f(y_T).   (3.32)

Like before, we can also look at conditional densities, like

f(y_t | y_1, y_2, ..., y_{t−1}),   (3.33)

which in this case would mean that y_t, the observation at time t, is dependent on all earlier observations on Ỹ_t.

It is seldom that we deal with independent variables when modeling economic time series. For example, a simple first order autoregressive model like y_t = α y_{t−1} + ε_t implies dependence between the observations. The same holds for all time series models. Despite this shortcoming, density functions with independent random variables are still good tools for describing time series modelling, because the results based on independent variables carry over to dependent variables in almost every case.
Two regressions can be formed from a pair of variables,

y = α + βx + ε,   (3.34)

and

x = α* + β*y + ε*.   (3.35)

Whether one chooses to condition y on x, or x on y, depends on the parameter of interest. In the following it is shown how these regression expressions are constructed from the correlation between x and y, and their first moments, by making use of the (bivariate) joint density function of x and y. (One can view this section as an exercise in using density functions.)
Without explicitly stating what the density function looks like, we will assume that we know the joint density function for the two random variables Ỹ and X̃, and want to estimate a set of parameters, α and β. Hence we have the joint density

D(y, x; θ),   (3.36)

and the conditional expectation of y given x is assumed to be linear,

E(y | x; θ) = α + βx.   (3.38)

The parameters in 3.38 can be estimated by using means, variances and covariances of the variables. Or, in other terms, by using some of the lower moments of the joint distribution of X̃ and Ỹ. Hence, the first step is to rewrite 3.38 in such a way that we can write α and β in terms of the means of X̃ and Ỹ.

Looking at the LHS of 3.38, it can be seen that a multiplication of the conditional density with the marginal density for X̃, g(x), leads to the joint density. Given the joint density we can choose to integrate out either x or y. In this case we choose to integrate over x. Thus we have, after multiplication,

∫ y D(y | x; θ) dy g(x) = α g(x) + βx g(x).   (3.39)

Integrating over x leads to, on the LHS,

∫∫ y D(y | x; θ) dy g(x) dx = ∫∫ y D(y, x | θ) dy dx = ∫ y D(y | θ) dy = E(y | θ) = μ_y.   (3.40)

On the RHS, integrating over x gives α plus β times the mean of X̃,

E(X̃) = μ_x,   (3.41)

so that

μ_y = α + β μ_x.   (3.42)

We now have one equation to solve for the two unknowns. Since we have used up the means, let us turn to the variances by multiplying both sides of 3.38 with x and performing the same operations again.
∫ xy D(y | x; θ) dy g(x) = αx g(x) + βx² g(x).   (3.43)

Integrate over x,

∫∫ xy D(y | x; θ) dy dx g(x) = α ∫ x g(x) dx + β ∫ x² g(x) dx.   (3.44)

The LHS is

∫∫ xy D(y, x | θ) dy dx = E(X̃Ỹ),   (3.45)

and the RHS is

α ∫ x g(x) dx + β ∫ x² g(x) dx = α E(X̃) + β E(X̃²),   (3.46)

so that

E(X̃Ỹ) = α E(X̃) + β E(X̃²).   (3.47)

Remembering the rules for the expectations operator, E(X̃Ỹ) = μ_x μ_y + σ_xy and E(X̃²) = μ_x² + σ_x², makes it possible to solve for α and β in terms of means and variances. From the first equation we get for α,

α = μ_y − β μ_x.   (3.48)

Substituting into 3.47 gives

μ_x μ_y + σ_xy = (μ_y − β μ_x) μ_x + β (μ_x² + σ_x²),
σ_xy = β σ_x²,   (3.49)

which gives

β = σ_xy / σ_x².   (3.50)

The two conditional expectations can then be written as

E(Ỹ | x; θ) = μ_y + (σ_xy/σ_x²)(x − μ_x),   (3.51)
E(X̃ | y; θ) = μ_x + (σ_yx/σ_y²)(y − μ_y).   (3.52)

We can now make use of the correlation coefficient and the β parameter in the linear regression. The correlation coefficient between X̃ and Ỹ is defined as

ρ = σ_xy / (σ_x σ_y), or σ_xy = ρ σ_x σ_y.   (3.53)

Substituting into 3.51 and 3.52 gives

E(Ỹ | x; θ) = μ_y + ρ (σ_y/σ_x)(x − μ_x),   (3.54)
E(X̃ | y; θ) = μ_x + ρ (σ_x/σ_y)(y − μ_y).   (3.55)
So, if the two variables are independent, their covariance is zero, and the correlation is also zero. In that case the conditional mean of each variable does not depend on the mean and variance of the other variable. The final message is that a non-zero correlation between two normal random variables results in a linear relationship between them. With a multivariate model, with more than two random variables, things are more complex.
The joint density of an independent sample is the product of the individual densities,

D(x_1, x_2, ..., x_T; θ) = Π_{t=1}^T f(x_t; θ), where θ = (θ_1, θ_2, ..., θ_k),   (4.1)

so we write

D(x; θ),   (4.2)
where (x; θ) indicates that it is the shape of the density, described by the parameters θ, which gives us the sample. If the density function describes a normal distribution, θ would consist of two parameters, the mean and the variance.

Now, suppose that we know the functional form of the density function. If we also have a sample of observations on X̃_t, we can ask which estimates of θ would be the most likely to find, given the functional form of the density and given the observations. Viewing the density in this way amounts to asking which values of θ maximize the value of the density function.

Formulating the estimation problem in this way leads to a restatement of the density function in terms of a likelihood function,

L(θ; x),   (4.3)

where the parameters are seen as a function of the sample. It is often convenient to work with the log of the likelihood instead, leading to the log likelihood

log L(θ; x) = l(θ; x).   (4.4)

What is left is to find the maximum of this function with respect to the parameters in θ. The maximum, if it exists, is found by solving the system of k simultaneous equations,

∂l(θ; x)/∂θ_i = 0, i = 1, 2, ..., k.   (4.5)

The vector of first derivatives of the log likelihood is called the score,

∂l(θ; x)/∂θ = S(θ).   (4.6)
So far we have not assigned any specific distribution to the density function. Let us assume a sample of T independent normal random variables {X̃_t}. The normal distribution is particularly easy to work with since it only requires two parameters to describe it. We want to estimate the first two moments, the mean μ and the variance σ², thus θ = (μ, σ²). The likelihood is

L(θ; x) = (2πσ²)^{−T/2} exp[ −(1/(2σ²)) Σ_{t=1}^T (x_t − μ)² ].   (4.9)

Taking logs of this expression yields

l(θ; x) = −(T/2) log 2π − (T/2) log σ² − (1/(2σ²)) Σ_{t=1}^T (x_t − μ)².   (4.10)

The first derivatives are

∂l/∂μ = (1/σ²) Σ_{t=1}^T (x_t − μ),   (4.11)

and

∂l/∂σ² = −T/(2σ²) + (1/(2σ⁴)) Σ_{t=1}^T (x_t − μ)².   (4.12)

Setting these to zero gives the first order conditions

Σ_{t=1}^T x_t − Tμ̂ = 0,   (4.13)

and

Σ_{t=1}^T (x_t − μ̂)² − Tσ̂² = 0.   (4.14)
t=1
^ 2x
and
T
1X
xt
T t=1
(4.15)
T
1X
(xt
T t=1
" T
#2
1 X
xt :
T t=1
T
1X 2
2
^x) =
x
T t=1 t
(4.16)
The second derivatives of the log likelihood form the Hessian,

D² l(θ; x) = [ ∂²l/∂μ²     ∂²l/∂μ∂σ²  ]
             [ ∂²l/∂σ²∂μ   ∂²l/∂(σ²)² ]
           = −(1/σ²) [ T                    (1/σ²) Σ(x_t − μ)              ]
                      [ (1/σ²) Σ(x_t − μ)   (1/σ⁴) Σ(x_t − μ)² − T/(2σ²) ].   (4.17)

Taking expectations, using E(x_t − μ) = 0 and E[Σ(x_t − μ)²] = Tσ², we get

−E[D² l(θ; x)] = [ T/σ̂²_x   0          ]
                 [ 0         T/(2σ̂⁴_x) ] = I(θ̂),   (4.18)

the information matrix.

The ML estimator of the mean is unbiased,

E(μ̂) = E[ (1/T) Σ_{t=1}^T X̃_t ] = (1/T) Σ_{t=1}^T E(X̃_t) = μ.   (4.19)

The ML estimator of the variance, however, is not.¹ Using (4.16),

E(σ̂²_x) = E[ (1/T) Σ_{t=1}^T X̃_t² − ( (1/T) Σ_{t=1}^T X̃_t )² ]
         = E(X̃_t²) − (1/T²) E[ Σ_{t=1}^T Σ_{s=1}^T X̃_t X̃_s ]
         = E(X̃_t²) − (1/T²) [ T E(X̃_t²) + T(T − 1) E(X̃_t X̃_s)_{t≠s} ]
         = ((T − 1)/T) { E(X̃_t²) − [E(X̃_t)]² } = ((T − 1)/T) σ²,   (4.20)

where the last step uses E(X̃_t X̃_s) = [E(X̃_t)]² for t ≠ s, by independence. The bias disappears as T grows.

1 The solution is given by (1/T) Σ_{t=1}^T [x_t − μ̂]² = (1/T) Σ x_t² − 2μ̂ (1/T) Σ x_t + μ̂² = (1/T) Σ x_t² − [ (1/T) Σ x_t ]².
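A quick Monte Carlo check of the bias in (4.20) can be written in a few lines of Python; the sample size and variance below are invented for the illustration.

# A Monte Carlo sketch of the bias in the ML variance estimator (4.16):
# E(sigma_hat^2) = ((T-1)/T) * sigma^2. Sample sizes are invented.
import numpy as np

rng = np.random.default_rng(7)
T, reps, sigma2 = 10, 200_000, 4.0

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, T))
s2_ml = np.mean((x - x.mean(axis=1, keepdims=True)) ** 2, axis=1)

print(s2_ml.mean())               # close to ((T-1)/T)*sigma2 = 3.6
print((T - 1) / T * sigma2)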
Consider next the linear regression model

y_t = β z_t + ε_t,   (4.22)

where Ỹ is a random variable, with observations {y_t}, and z_t is, for the time being, assumed to be a deterministic variable (this is not a necessary assumption). Instead of using the symbol x for observations on the random variable X̃, let us set x_t = ε_t, where ε_t ~ NID(0, σ²). Thus, we have formulated a linear regression model with a white noise residual. This linear equation can be rewritten as

ε_t = y_t − β z_t,   (4.23)

where the RHS is the function to be substituted for the single normal variable x_t used in the MLE example above. The algebra gets a bit more complicated, but the principal steps are the same.² The unknown parameters in this case are β and σ².

2 As a consequence of the more complex algebra, the computer algorithms for estimating the parameters will also get more complex. For the ordinary econometrician there are a lot of software packages that cover most of the cases.
The log likelihood becomes

l(β, σ²; y, z) = −(T/2) log 2π − (T/2) log σ² − (1/(2σ²)) Σ_{t=1}^T (y_t − β z_t)².   (4.24)

The last factor in this expression can be identified as the sum of squares function, S(β). In matrix form we have

S(β) = Σ_{t=1}^T (y_t − β z_t)² = (Y − Zβ)′(Y − Zβ),   (4.25)

and

l(β, σ²; y, z) = −(T/2) log 2π − (T/2) log σ² − (1/(2σ²)) (Y − Zβ)′(Y − Zβ).   (4.26)

Differentiating with respect to β yields

∂l/∂β = −(1/(2σ²)) [ −2Z′(Y − Zβ) ],   (4.27)

and setting this to zero gives

β̂ = (Z′Z)^{−1} (Z′Y).   (4.28)
Notice that the ML estimator of the linear regression model is identical to the OLS estimator. The variance estimate is

σ̂² = ε̂′ε̂ / T,   (4.29)

which, in contrast to the OLS estimate, is biased.

To obtain these estimates we did not have to make any direct assumptions about the distribution of y_t or z_t. The necessary and sufficient condition is that y_t conditional on z_t is normal, which means that y_t − β z_t = ε_t should follow a normal distribution. This is the reason why MLE is feasible even though y_t might be a dependent AR(p) process. In the AR(p) process the residual term is an independent normal random variable. The MLE is given by substitution of the independently distributed normal variable with the conditional mean of y_t.
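The equivalence is easy to verify numerically. Here is a minimal Python sketch, with invented data, checking that the ML formula (4.28) reproduces OLS and that the ML variance (4.29) divides by T:

# A sketch checking that the ML estimator (4.28) equals OLS, and that
# the ML variance (4.29) uses 1/T rather than a degrees-of-freedom
# corrected divisor. Data and the "true" beta are invented.
import numpy as np

rng = np.random.default_rng(3)
T = 100
z = rng.normal(size=(T, 1))
y = z @ np.array([1.5]) + rng.normal(0.0, 1.0, T)   # true beta = 1.5

beta_ml = np.linalg.solve(z.T @ z, z.T @ y)   # (Z'Z)^{-1} Z'Y, same as OLS
e = y - z @ beta_ml
sigma2_ml = (e @ e) / T                        # biased: divides by T
sigma2_unb = (e @ e) / (T - 1)                 # unbiased alternative
print(beta_ml, sigma2_ml, sigma2_unb)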
The above results can be extended to a vector of normal random variables. In this case we have a multivariate normal distribution, where the density is

D(X) = D(X_1, X_2, ..., X_T).   (4.30)

The random variables X̃ will have a mean vector μ and a covariance matrix Σ. The density function for the multivariate normal is

D(X) = [(2π)^{n/2} |Σ|^{1/2}]^{−1} exp[ −(1/2)(X − μ)′ Σ^{−1} (X − μ) ],   (4.31)

which can be expressed in the compact form X_t ~ N(μ, Σ).

With multivariate densities it is possible to handle systems of equations with stochastic variables, the typical case in econometrics. The bivariate normal is an often used device to derive models including two variables. Set X̃ = (X̃_1, X̃_2), and

Σ = [ σ_1²   σ_12 ]
    [ σ_21   σ_2² ], with |Σ| = σ_1² σ_2² (1 − ρ²),   (4.32)

where ρ is the correlation coefficient. As can be seen, |Σ| > 0 unless ρ² = 1. If σ_12 = σ_21 = 0, the two processes are independent and can be estimated individually.
The likelihood ratio (LR) test compares the likelihood of a restricted model with that of an unrestricted model,

λ = L̂_R / L̂_U.   (5.1)

This leads to the test statistic (−2 ln λ), which has a χ²(R) distribution, where R is the number of restrictions.

The Wald test compares (squared) estimated parameters with their variances. In a linear regression, if the residual is NID(0, σ²), then β̂ ~ N(β, var(β̂)), so (β̂ − β) ~ N(0, var(β̂)), and a standard t-test will tell if β is significant or not. More generally, if we have a vector of normally distributed random variables X̂ ~ N(μ, Σ), then we have

(x − μ)′ Σ^{−1} (x − μ) ~ χ²(J).   (5.2)
The LM test starts from a restricted model and tests if the restrictions are valid. Here restrictions should be understood as a general concept. A model is restricted if it assumes homoscedasticity, no autocorrelation, etc. The test is formulated as

LM = [ ∂ln L(θ̂_R)/∂θ ]′ [ I(θ̂_R) ]^{−1} [ ∂ln L(θ̂_R)/∂θ ].   (5.3)

The formula looks complex but is in many cases extremely easy to apply. Consider the LM test for p:th order autocorrelation in the residuals ε̂_t,

ε̂_t = ρ_1 ε̂_{t−1} + ρ_2 ε̂_{t−2} + ... + ρ_p ε̂_{t−p} + ν_t.   (5.4)

The LM test statistic for testing if the parameters ρ_1 to ρ_p are zero amounts to estimating the equation with OLS and calculating the test statistic TR², distributed as χ²(p) under the null of no autocorrelation. Similar tests can be formulated for testing various forms of heteroscedasticity.

Tests can often be formulated in such a way that they follow both χ² and F-distributions. In less than large samples the F-distribution is the better one to use. The general rule for choosing among tests based on the F or the χ² distribution is to use the F distribution, since it has better small sample properties.

If the information matrix is known (meaning that it is not necessary to estimate it), all three tests would lead to the same test statistic, regardless of the chosen distribution, χ² or F. If all three approaches lead to the same test statistic, we would have R_W = R_LR = R_LM. However, when the information matrix is estimated, we get the following relation between the tests: R_W ≥ R_LR ≥ R_LM.

Remember (1) that when dealing with limited samples the three tests might lead to different conclusions, and (2) that if the null is rejected the alternative can never be accepted. As a matter of principle, statistical tests only reject the null hypothesis. Rejection of the null does not lead to accepting the alternative hypothesis; it leads only to the formulation of a new null. As an example, in a test where the null hypothesis is homoscedasticity, the alternative is not necessarily heteroscedasticity. Tests are generally derived on the assumption that everything else is OK in the model. Thus, in this example, rejection of homoscedasticity could be caused by autocorrelation, non-normality, etc. The econometrician has to search for all possible alternatives.
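A minimal Python sketch of the TR² version of the LM test for residual autocorrelation might look as follows; the residual series is invented for the illustration.

# LM test for p-th order residual autocorrelation: regress e_t on its
# own lags e_{t-1},...,e_{t-p} and compute T*R^2 ~ chi2(p) under H0.
import numpy as np

rng = np.random.default_rng(5)
e = rng.normal(size=300)          # invented stand-in for estimated residuals

p = 4
Y = e[p:]
X = np.column_stack([e[p - j:-j] for j in range(1, p + 1)])
X = np.column_stack([np.ones(len(Y)), X])

b = np.linalg.lstsq(X, Y, rcond=None)[0]
resid = Y - X @ b
R2 = 1.0 - (resid @ resid) / ((Y - Y.mean()) @ (Y - Y.mean()))
LM = len(Y) * R2                  # compare with chi2(p); 5% cv for p=4 is 9.49
print(LM)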
Part II
An unofficial standard is to use I_t in discrete time settings and ℑ_t in continuous time settings. We can also say that the set {ℑ_t : t ≥ 0} is a filtration, representing an increasing family of sub-sigma-algebras on ℑ. Over time, outcomes of X̃_t, (x_1, x_2, ..., x_t), will be added to the increasing family of information sets. We refer to the observed process (x_1, x_2, ..., x_t) as adapted to the filtration ℑ_t. We can also say that if (x_1, x_2, ..., x_t) is an adapted process, then for the sequence {x_t}, X̃_t is a random variable with respect to (Ω, ℑ), and for each t the value of X̃_t is known as x_t.
A white noise process {ε_t} has a mean of zero,

E[ε_t] = 0,   (6.1)

its variance exists and is constant, σ², and there is no memory in the process, so the autocorrelation function is zero,

E[ε_t ε_t] = σ²,   (6.2)
E[ε_t ε_s] = 0 for t ≠ s.   (6.3)

In addition, the white noise process is supposed to follow a normal and independent distribution, ε_t ~ NID(0, σ²). A standardized white noise has a distribution like NID(0, 1). Dividing ε_t by σ gives (ε_t/σ) ~ NID(0, 1).

The independent normal distribution has some important characteristics. First, if we add normal random variables together, the sum will have a mean equal to the sum of the means of all variables. Thus, adding T standardized white noise variables together as z_T = Σ_{t=1}^T (ε_t/σ) forms a new variable with mean E(z_T) = E(ε_1/σ) + E(ε_2/σ) + ... + E(ε_T/σ) = (1/σ)[E(ε_1) + E(ε_2) + ... + E(ε_T)] = 0. Since each variable is independent, we have the variance as σ²_z = σ²_{z,1} + σ²_{z,2} + ... + σ²_{z,T} = 1 + 1 + ... + 1 = T. The random variable is distributed as z_T ~ NID(0, T), with a standard deviation given as √T. As the forecast horizon for z_T increases, a 95% forecast confidence interval also increases, with 1.96√T.
In the same way, we can define the distribution, mean and variance over subsets of time. If ε_t ~ N(0, 1) is defined for a period of one year, the variable will be distributed over six months as N(0, 1/2), with a standard deviation of 1/√2, and over three months the distribution is N(0, 1/4), with a standard deviation of 1/√4. For any fraction (1/τ) of the year, the distribution becomes NID(0, 1/τ) and the standard deviation 1/√τ. This property of the variable, following from the assumption of an independent distribution, is known as the Markov property. Given that x_0 is generated from an independent normal distribution N(μ, σ²), the future value of x at time 0 + T is distributed as N(μT, σ²T).
To sum up, it follows from the definition that a white noise process is not linearly predictable from its own past. The expected mean of a white noise, conditional on its history, is zero,

E[ε_t | ε_{t−1}, ε_{t−2}, ..., ε_1] = E[ε_t] = 0.   (6.4)

A stronger concept is a white noise innovation process, for which

E[ε_t | I_{t−1}] = 0,   (6.5)

where the information set I_t includes not only the history of ε_t, but also all other information which might be of importance for explaining this process. Stating that a series is a white noise innovation process, with respect to some information set I_t, is a stronger requirement than a white noise process. It is also a stronger statement than saying that ε_t is a martingale difference process, because we add the assumption of a normal distribution. The martingale and the martingale difference processes are defined in terms of their first moments only. Creating a residual process that is a white noise innovation term is a basic requirement in the modelling process.
The log of s_t is then distributed as ln s_t ~ N(μ, σ²). Given that S̃_t has a log normal distribution, it follows that the distance between ln S̃_t and ln S̃_{t+n} is distributed as ln S̃_{t+n} − ln S̃_t ~ N(μn, σ²n).
An autoregressive process of order p, AR(p), is written as

y_t = μ + a_1 y_{t−1} + ... + a_p y_{t−p} + ε_t,   (6.6)

or, using the lag operator,

A(L) y_t = μ + ε_t.   (6.7)

A moving average process of order q, MA(q), is written as

y_t = μ + ε_t + b_1 ε_{t−1} + ... + b_q ε_{t−q},   (6.8)

or

y_t = μ + B(L) ε_t,   (6.9)

where

ε_t ~ NID(0, σ²).   (6.10)
A random walk is defined by

E(x_{t+1} | x_t, x_{t−1}, x_{t−2}, ..., x_{t−n}) = E(x_{t+1} | x_t) = x_t,   (6.11)

where n might be equal to infinity. This definition does not rule out the case that there are other variables that can be correlated with x_t and thereby also predict x_{t+1}. We can also say that a random walk has an infinitely long memory. The mean is zero, and the variance and autocovariance are equal to var(X̃_t) = tσ² and Cov(X̃_t, X̃_{t−n}) = (t − n)σ².
To see this, note that

E(x_t) = E( Σ_{i=1}^t ε_i ) = 0,   (6.12)

var(x_t) = E(x_t²) = E[ ( Σ_{i=1}^t ε_i )² ] = Σ_{i=1}^t Σ_{j=1}^t E[ε_i ε_j] = tσ²,   (6.13)

and

cov(x_t, x_{t−1}) = E(x_t x_{t−1}) = E[ ( Σ_{i=1}^t ε_i ) ( Σ_{j=1}^{t−1} ε_j ) ] = Σ_{i=1}^t Σ_{j=1}^{t−1} E[ε_i ε_j] = (t − 1)σ².   (6.14)

The autocovariances for higher lag orders follow from this example. As can be seen, these are non-stationary moments, since both are dependent on time (t). It follows that the autocorrelation function looks like

ρ_n = [(t − n)/t]^{1/2}.   (6.15)

Backward substitution gives

x_t = x_0 + Σ_{i=1}^t ε_i.   (6.16)

Thus, a random walk is a sum of white noise errors from the beginning of the series (x_0). Hence, the value of today is dependent on shocks built up from the beginning of the series. All shocks in the past are still affecting the series today. Furthermore, all shocks are equally important. The process formed by Σ_{i=1}^t ε_i is called a stochastic trend. In contrast to a deterministic trend, the stochastic trend is changing its slope in a random way, period by period. Ex post, a stochastic trend might look like a deterministic trend. Thus, it is not really possible to determine whether a variable is driven by a stochastic or a deterministic trend, or a combination of both.
If we add a constant term to the model we get a random walk with a drift,

x_t = μ + x_{t−1} + ε_t,   (6.17)

where the constant μ represents the drift term. In this process x_t is driven by both a deterministic and a stochastic trend. If we perform the same backward substitution as above, we get

x_t = μt + Σ_{i=1}^t ε_i + x_0.   (6.18)
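The decomposition in (6.18) is easy to see in a simulation. A minimal Python sketch, with invented parameter values:

# A random walk with drift is a deterministic trend mu*t plus an
# accumulated sum of shocks (the stochastic trend); see (6.17)-(6.18).
import numpy as np

rng = np.random.default_rng(11)
T, mu, x0 = 200, 0.1, 0.0
eps = rng.normal(0.0, 1.0, T)

# recursive form: x_t = mu + x_{t-1} + eps_t
x = x0 + np.cumsum(mu + eps)

# backward-substituted form (6.18): x_t = mu*t + sum of eps + x0
t = np.arange(1, T + 1)
x_alt = mu * t + np.cumsum(eps) + x0
print(np.allclose(x, x_alt))      # True: the two forms coincide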
A martingale is a process whose expected future value, conditional on an information set I_t, equals today's value,¹

E[ X̃_{t+s} | I_t ] = x_t, for s > 0.   (6.21)

1 Alternatively, it is possible to define the information set at time t − 1, and write the definition as E[X̃_t | I_{t−1}] = X̃_{t−1}.
Given the information set, all information relevant for predicting X̃_{t+s} is contained in today's value of X̃_t. Thus, the best prediction of X̃_{t+1} is x_t, and the value of today is the best prediction for all periods in the future. The information set might include the history of X̃_t as well as all other information that might be of relevance for predicting X̃_{t+s}. The definition of a martingale is always relative, since we have the freedom of defining different information sets. If X̃_t is a martingale with respect to the information set I′_t, it might not be a martingale with respect to another information set I″_t, unless the two sets are identical.

We can now continue and define the martingale difference process as the expected difference between X̃_{t+s} and X̃_t,

E[ (X̃_{t+s} − X̃_t) | I_t ] = E(X̃_{t+s} | I_t) − x_t = 0.   (6.22)
If a process is a martingale difference process, changes in the process are unpredictable from the information set.

The sub-martingale and the super-martingale are two versions of martingale processes. A sub-martingale is defined as

E[ X̃_{t+s} | I_t ] ≥ x_t,

which says that, on average, the expected value is growing over time. A super-martingale is defined as

E[ X̃_{t+s} | I_t ] ≤ x_t,

which says that the expected value of X̃_{t+s} is given by X̃_t but is, on average, declining over time.
Martingales are well known in the financial literature. If the agents on a financial market use all relevant information to predict the yields of financial assets, the prices of these assets will, under certain special conditions, behave like martingales. The random walk hypothesis of asset prices does not come from finance theory; it is based on empirical observations, and is mainly a hypothesis about the empirical behavior of asset prices which lacks a theoretical foundation. A random walk process is a martingale, but it also includes statements about distributions. If we compare with the random walk, we have the model x_t = x_{t−1} + ε_t, where ε_t is a normally distributed white noise process. The latter is a stronger condition than assuming a martingale process. For a random walk with a drift, x_t = μ + x_{t−1} + ε_t, the variable is a sub-martingale, since the deterministic trend will increase the expectation over time, E(X̃_{t+1}) = μ + x_t. Let us now turn to finance theory. Theory suggests that the price of an asset (P_{t+1}) at time t + 1 is given by the price at t plus a risk-adjusted discount factor r. If we assume, for simplicity, that the discount factor is a constant, we get that P_{t+1} = (1 + r)P_t. Asset prices are therefore not driftless random walks, or martingales. The process described by theory is ln P_{t+1} = ln(1 + r) + ln P_t + ε_{t+1}, which is a sub-martingale given, in this case, a constant discount factor. If we would like to say that asset prices are martingales, we must either transform the price process according to [P_{t+1}/(1 + r)], or we must include the risk-adjusted discount factor in the information set.²

Thus, the expected value of an asset price is, by definition, E(P_{t+1}) = E(1 + r)P_t. If the discount factor (and risk) is a constant (g), we get E(P_{t+1}) = g + P_t, which is a random walk with drift.

2 It is obvious that we can transform a variable into a martingale by subtracting elements from the process by conditioning or direct calculation. In fact, most variables can be transformed into a martingale in this way. An alternative way of transforming a variable into a martingale is to transform its probability distribution. In this method you look for a probability distribution which is equivalent to the one generating the conditional expectations. This type of distribution is called an equivalent martingale distribution.
If the risk premium is a time-varying stochastic process, we might have something like

ε_t = ν_t (α_0 + α_1 ε_{t−1}²)^{1/2},   (6.23)

where ν_t is a white noise process. This is a first order ARCH(1) model (Auto Regressive Conditional Heteroscedasticity), which implies that a large shock to the series is likely to be followed by another large shock. In addition, it implies that the residuals are not independent of each other.
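A minimal Python simulation of such an ARCH(1) process, with invented parameter values, shows the volatility clustering directly:

# Simulate an ARCH(1) process: a large shock raises the conditional
# variance of the next shock. Parameter values are invented.
import numpy as np

rng = np.random.default_rng(13)
T, a0, a1 = 1000, 0.2, 0.7
nu = rng.normal(size=T)           # white noise
e = np.zeros(T)
for t in range(1, T):
    e[t] = nu[t] * np.sqrt(a0 + a1 * e[t - 1] ** 2)

# e_t is serially uncorrelated, but e_t^2 is autocorrelated:
def acf1(x):
    d = x - x.mean()
    return (d[1:] @ d[:-1]) / (d @ d)

print(acf1(e), acf1(e ** 2))      # first near 0, second clearly positive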
The conclusion is that we must be careful when reading articles which claim that the exchange rate, or some other variable, should be, or is, a random walk; often what the authors really mean is that the variable is a martingale, conditional on some information.

The martingale property is directly related to the efficient market hypothesis (EMH), which sets out the conditions under which changes in asset prices become unpredictable given different types of information.
A Markov process is defined by

Prob(X̃_{t+s} ≤ x_{t+s} | x_1, x_2, ..., x_t) = Prob(X̃_{t+s} ≤ x_{t+s} | x_t),   (6.24)

where s > 0. The expression says that all probability statements about future values of the random variable X̃_{t+s} are only dependent on the value the variable takes at time t, and do not depend on earlier realizations. By stating that a variable is a Markov process, we put a restriction on the memory of the process. The AR(1) model, and the random walk, are first-order Markov processes,

x_t = a_1 x_{t−1} + ε_t, where ε_t ~ NID(0, σ²).   (6.25)
Given that we know that ε_t is a white noise process (NID(0, σ²)) and can observe x_t, we know all there is to know about x_{t+1}, since x_t contains all information about the future. In practical terms, it is not necessary to work with the whole series, only a limited present. We can also say that the future of the process, given the present, is independent of the past. For a first order Markov process, the expected value of X̃_{t+1}, given all its possible present and historical values X̃_t, X̃_{t−1}, X̃_{t−2}, ..., can be expressed as

E[ X̃_{t+1} | X̃_t, X̃_{t−1}, X̃_{t−2}, ..., X̃_1 ] = E[ X̃_{t+1} | X̃_t ].   (6.26)
Thus, a first order Markov process is also a martingale. Typically, the value of X̃_t is known at time t as x_t. The Markov property is a very convenient property if we want to build theoretical models describing the continuous evolution of asset prices. We can focus on the value today, and generate future time series, irrespective of the past history of the process. Furthermore, at each period in the future we can easily determine an exact future value, which is the equilibrium price for that period.

The white noise process, as an example, is a Markov process. This follows from the fact that we assumed that each ε_t was independent from its own past, and future. One outcome of the assumption of a normal and independent process was that we could relatively easily form predictions and confidence intervals given only the value of ε_t today.
The definition of a Markov process can be extended to an m:th order Markov process, for which we have

E[ X̃_{t+1} | X̃_t, X̃_{t−1}, X̃_{t−2}, ..., X̃_1 ] = E[ X̃_{t+1} | X̃_t, X̃_{t−1}, ..., X̃_{t−m} ].   (6.27)
The further we look into the future, the larger the number of random changes becomes, and probability statements about future events get harder and harder.

A generalized (arithmetic) Brownian motion is written as

dx_t = μ dt + σ dW_t.   (6.28)
A Wiener process arises as the limit of a sum of independent increments,

W(t) = Σ_{i=1}^n w_{t_i}, as n → ∞.   (6.31)

An extension of this, if ε_t ~ NID(0, σ²), is that W_T = Σ_t ε_t/(σ√T) will also converge to a Wiener process. Thus, the sum of a standardized white noise will also converge to a standardized Wiener process. This result is crucial for the understanding of the distribution of a random walk and other unit root variables.
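To make this concrete, a small Python sketch of the scaled partial sums (sample size and variance invented):

# Scaled partial sums of a standardized white noise behave like a
# Wiener process as T grows; all values below are invented.
import numpy as np

rng = np.random.default_rng(17)
T, sigma = 10_000, 2.0
eps = rng.normal(0.0, sigma, T)

W = np.cumsum(eps) / (sigma * np.sqrt(T))   # W(r) evaluated at r = t/T
# Increments over disjoint intervals are independent N(0, dr):
print(W[T // 2] - W[T // 4])                # one draw from roughly N(0, 0.25)
print(np.var(np.diff(W)) * T)               # close to 1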
6.9.1 The Geometric Brownian Motion
The arithmetic Brownian motion is not well suited for asset prices, as their changes seldom display a normal distribution. The log of asset prices, and returns, is better described with a normal distribution. This takes us to the geometric Brownian motion,

dx_t / x_t = μ dt + σ dW_t.

What happens here is that we assume that ln x_t has a normal distribution, meaning that x_t follows a log normal distribution, and μ dt + σ dW_t follows a normal variable. Itô's lemma can be used to show that

d ln x_t = (μ − σ²/2) dt + σ dW_t.

The expected value of the geometric Brownian motion is E(dx_t/x_t) = μ dt, and the variance is Var(dx_t/x_t) = σ² dt.

There are several ways in which the model can be modified to better suit real world asset prices. One way is to introduce jumps in the process, so-called "jump diffusion models". This is done by adding a Poisson process to the geometric Brownian motion,

dx_t / x_t = μ dt + σ dW_t + U_t dN(λ),

where U_t is a normally distributed random variable and N_t represents a Poisson process with intensity λ, to account for jumps in the price process.

The random walk model is good for asset prices, but not for interest rates. The movements of interest rates are more bounded than asset prices. In this case the so-called Ornstein-Uhlenbeck process provides a more realistic description of the dynamics,

dr_t = α(b − r_t) dt + σ dW_t.

Thus, the idea behind the Ornstein-Uhlenbeck process is that it restricts the movements of the variable (r) to be mean reverting, or to stay in a band, around b, where b can be zero.
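For intuition, both processes can be simulated with a simple Euler discretization in Python; all parameter values below are invented for the illustration.

# Euler-scheme sketch of a geometric Brownian motion and an
# Ornstein-Uhlenbeck process; parameters are invented.
import numpy as np

rng = np.random.default_rng(19)
n, dt = 5000, 1.0 / 250            # e.g. daily steps over 20 "years"
dW = rng.normal(0.0, np.sqrt(dt), n)

mu, sigma = 0.08, 0.2
x = np.empty(n)
x[0] = 100.0
for t in range(1, n):
    x[t] = x[t - 1] * (1 + mu * dt + sigma * dW[t])   # dx/x = mu dt + s dW

alpha, b, s = 2.0, 0.05, 0.02
r = np.empty(n)
r[0] = 0.10
for t in range(1, n):
    r[t] = r[t - 1] + alpha * (b - r[t - 1]) * dt + s * dW[t]  # reverts to b
print(x[-1], r[-1])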
6.9.2 The Wiener Process
If X̃(t) is a Wiener process, 0 ≤ t < ∞, the series always starts at zero, X̃(0) = 0, and if t_0 ≤ t_1 ≤ t_2 ≤ ... ≤ t_n, then all increments of X̃(t_i) are independent. In terms of the density function we have

D[x(t_1) − x(t_0), x(t_2) − x(t_1), ..., x(t_n) − x(t_{n−1}) | t_0, t_1, ..., t_n]
= Π_{i=1}^n D[x(t_i) − x(t_{i−1}) | t_0, t_1, ..., t_n].   (6.32)-(6.33)

The increments have zero mean, and a variance that grows with the distance in time,

E[X̃(t) − X̃(s)] = 0,   (6.34)
var[X̃(t) − X̃(s)] = σ²(t − s),   (6.35)

where 0 ≤ s < t. Finally, since the increments are a martingale difference process, we can assume that these increments follow a normal distribution, so X̃(t) − X̃(s) ~ N[0, σ²(t − s)]. These assumptions lead to the density function

D[x(t)] = [2πσ²t_1]^{−1/2} exp[ −x_1²/(2σ²t_1) ] Π_{i=2}^n [2πσ²(t_i − t_{i−1})]^{−1/2} exp[ −(x_i − x_{i−1})²/(2σ²(t_i − t_{i−1})) ].   (6.36)
When σ² = 1, the process is called a standard Wiener process or standard Brownian motion. That the Brownian motion is quite special can be seen from this density function. The sample path is continuous, but it is not differentiable. (In physics this is explained as the motion of a particle which at no time has a velocity.)

Wiener processes are of interest in economics for many reasons. First, they offer a way of modeling uncertainty, especially in financial markets, where we sometimes have an almost continuous stream of observations. Secondly, many macroeconomic variables appear to be integrated or near integrated. The limiting distributions of such variables are known to be best described as functions of Wiener processes. In general we must assume that these distributions are non-standard.
To sum up, there are five important things to remember about the Brownian motion/Wiener process:

- It represents the continuous time (asymptotic) counterpart of random walks.
- It always starts at zero and is defined over 0 ≤ t < ∞.
- The increments, any change between two points, regardless of the length of the interval, are not predictable, are independent, and are distributed as N(0, (t − s)σ²), for 0 ≤ s < t.
- It is continuous over 0 ≤ t < ∞, but nowhere differentiable. The intuition behind this result is that the differential implies predictability, which would go against the previous condition.
- Finally, a function of a Brownian motion/Wiener process will behave like a Brownian motion/Wiener process.

The last characteristic is important, because most economic time series variables can be classified as random walks, integrated, or near-integrated processes. In practice this means that their variances, covariances etc. have distributions that are functionals of Brownian motions. Even in small samples, functionals of Brownian motions will better describe the distributions associated with economic variables that display tendencies of stochastic growth.
Flow variables are measured as totals over a period (for example, consumption over a quarter), representing the integral sum of these activities.

Usually, a discrete time variable is written with a time subscript (x_t), while continuous time variables are written as x(t). The continuous time approach has a number of benefits, but the cost and quality of the empirical results seldom motivate the continuous time approach. It is better to use discrete time approaches as an approximation to the underlying continuous time system. The cost of doing this simplification is small compared with the complexity of continuous time analysis. This should not be understood as a rejection of continuous time approaches. Continuous time is good for analyzing a number of well defined problems, like aggregation over time and individuals. In the end it should lead to a better understanding of adjustment speeds, stability conditions and interactions among economic time series, see Sjö (1990, 1995).¹ Thus, our interest is in analysing discrete time stochastic processes in the time domain.

A time series process is generally indicated with brackets, like {y_t}. In some situations it will be necessary to be more precise about the length of the process. Writing {y}₁^∞ indicates that the process starts at period one and continues infinitely. The process consists of random variables, because we can view each element in {y_t} as a random variable. Let the process go from the integer values 1 up to T. If necessary, to be exact, the first variable in the process can be written as y_{t1}, the second variable as y_{t2}, etc., up until y_{tT}. The distribution function of the process can then be written as F(y_{t1}, y_{t2}, ..., y_{tT}).
In some situations it is necessary to start from the very beginning. A time series is data ordered by time. A stochastic time series is a set of random variables ordered by time. Let Ỹ_{it} represent the stochastic variable Ỹ_i given at time t. Observations on this random variable are often indicated as y_{it}. In general terms a stochastic time series is a series of random variables ordered by time. A series starting at time t = 1 and ending at time t = T, consisting of T different random variables, is written as {Ỹ_{1,1}, Ỹ_{2,2}, ..., Ỹ_{T,T}}. Of course, assuming that the series is built up by individual random variables, each with its own independent probability distribution, is a complex thought. But nothing in our definition of a stochastic time series rules out that the data is made up of completely different random variables. Sometimes, to understand and find solutions to practical problems, it will be necessary to go all the way back to the most basic assumptions.
Suppose we are given a time series consisting of yearly observations of interest rates, {6.6, 7.5, 5.9, 5.4, 5.5, 4.5, 4.3, 4.8}. The first question to ask is whether this is a stochastic series in the sense that these numbers were generated by one stochastic process, or perhaps by several different stochastic processes. Further questions are whether the process or processes are best represented as continuous or discrete, and whether the observations are independent or dependent. Quite often we will assume that the series is generated by the same identical stochastic process in discrete time. Based on these assumptions the modelling process tries to find systematic historical patterns and cross-correlations with other variables in the data.
All time series methods aim at decomposing the series into separate parts in some way. The standard approach in time series analysis is to decompose as

y_t = T_{t,d} + S_{t,d} + C_{t,d} + I_t,   (7.1)
¹ We can also mention the different types of series that are used: stocks, flows and price variables. Stocks are variables that can be observed at a point in time, like the money stock or inventories. Flows are variables that can only be observed over some period, like consumption or GDP. In this context price variables include prices, interest rates and similar variables which can be observed at a market at a given point in time. Combining these variables into multivariate processes and constructing econometric models from observed variables in discrete time produces further problems, and in general they are quite difficult to solve without using continuous time methods. Usually, careful discrete time models will reduce the problems to a large extent.
The autocovariance at lag k is defined as

γ_k = cov(Ỹ_t, Ỹ_{t−k}) = E[Ỹ_t − E(Ỹ_t)][Ỹ_{t−k} − E(Ỹ_{t−k})].

Its sample counterpart, the k:th sample autocorrelation, is estimated as

ρ̂_k = [Σ_{t=k+1}^{T} (y_t − ȳ)(y_{t−k} − ȳ)] / [Σ_{t=1}^{T} (y_t − ȳ)²],

where T is the number of observations, and ȳ is the sample mean, ȳ = (1/T) Σ_{i=1}^{T} y_i.
In practical work, the standard assumption is a constant variance over the sample, so that var(y_t) = var(y_{t−k}). The sample autocorrelations are estimates of random variables; they are therefore associated with variances. Bartlett (1946) shows that the variance of the k:th sample autocorrelation is

var(ρ̂_k) = (1/T) [1 + 2 Σ_{j=1}^{k−1} ρ̂_j²].
Given the variance, and thus the standard deviation, of the estimated autocorrelation, it becomes possible to set up a significance test. Asymptotically this t-test has a normal distribution, with an expected value of zero under the null of no autocorrelation (no memory in the series). For a limited sample, a value of ρ̂_k larger than two times its standard error is considered significant.³
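As an illustration, the following minimal sketch computes sample autocorrelations and compares them with the two standard error band implied by Bartlett's formula; the AR(1) input series and all parameter values are hypothetical.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: an AR(1) series with a_1 = 0.7
T = 400
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.7 * y[t - 1] + rng.normal()

def sample_acf(x, max_lag):
    x = x - x.mean()
    denom = np.sum(x**2)
    return np.array([np.sum(x[k:] * x[:-k]) / denom
                     for k in range(1, max_lag + 1)])

rho = sample_acf(y, max_lag=10)

# Bartlett (1946): var(rho_k) = (1/T) * (1 + 2 * sum_{j<k} rho_j^2)
for k, r in enumerate(rho, start=1):
    se = np.sqrt((1 + 2 * np.sum(rho[:k - 1]**2)) / T)
    flag = "significant" if abs(r) > 2 * se else "not significant"
    print(f"lag {k}: rho = {r:6.3f}, 2*se = {2*se:5.3f} -> {flag}")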
The next question is how much autocorrelation is left between the observations at t and t − k (Ỹ_t and Ỹ_{t−k}) after we remove (condition on) the autocorrelation at the intermediate lags. Removing the autocorrelation means that we first calculate the mean of Ỹ_t conditional on all observations on Ỹ_{t−1} up to Ỹ_{t−(k−1)}; another way of expressing this is to say that we filter Ỹ_t from the influence of all lags of Ỹ_t between t − 1 and t − (k − 1). Using the expectations operator, we define the conditional mean as E{Ỹ_t | y_{t−1}, y_{t−2}, ..., y_{t−(k−1)}}. The partial autocorrelation is then the slope coefficient in a regression between Ỹ_t and Ỹ_{t−k}. This leads to the following definition of the partial autocorrelation function,

φ_k = cov(Ỹ_t, Ỹ_{t−k} | y_{t−1}, ..., y_{t−(k−1)}) / var(Ỹ_{t−k}).   (7.2)

In practice, the k:th partial autocorrelation can be estimated as the coefficient a_k in the regression⁴

y_t = a_1 y_{t−1} + a_2 y_{t−2} + ... + a_k y_{t−k} + e_t.   (7.3)

³ Standard practice is to calculate at most T/4 sample autocorrelations.
Together, the ACF and the PACF identify the order of an ARMA process:

Process      ACF                  PACF
AR(p)        Tails off            Cuts off at lag p
MA(q)        Cuts off at lag q    Tails off
ARMA(p,q)    Tails off            Tails off
⁴ Notice that in the regression, the parameters a_1, a_2, ..., a_{k−1} are not identical to φ_1, φ_2, ..., φ_{k−1}, due to the (possible) correlation between y_{t−1} and lower order lags like y_{t−2} etc. The regression formula only identifies the last coefficient, at lag k, as the PACF φ_k.
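The point of the footnote, that only the last coefficient of regression (7.3) estimates the PACF at lag k, can be checked numerically. A minimal sketch on an arbitrary simulated series, comparing the regression coefficient with the PACF reported by statsmodels:

import numpy as np
from statsmodels.tsa.stattools import pacf

rng = np.random.default_rng(1)
T = 500
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.6 * y[t - 1] + rng.normal()

k = 3  # look at the partial autocorrelation at lag 3

# Regress y_t on y_{t-1}, ..., y_{t-k}; the coefficient on y_{t-k} is the PACF
X = np.column_stack([y[k - i:T - i] for i in range(1, k + 1)])
coeffs, *_ = np.linalg.lstsq(X, y[k:], rcond=None)

print("a_k from the regression:", coeffs[-1])
print("PACF from statsmodels:  ", pacf(y, nlags=k, method="ols")[k])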
7.1.1 Stationary and Non-Stationary Processes

A fundamental issue when analyzing time series processes is whether they are stationary or not. As a first, general definition, we can say that a non-stationary series changes its behavior over time, such that the mean is changing over time. Many economic time series are non-stationary in the sense that they are growing over time, their estimated variances are also growing, and the covariance function never dies out. In other words, the calculation of the mean, autocovariance etc. depends on the time period we study, and inference becomes impossible. A stationary series, on the other hand, displays a behavior which is independent of the time period, and it becomes possible to test for significance. Non-stationarity must either be removed before modeling or included in the model. This requires that we know what type of non-stationarity we are dealing with.

The problem with non-stationarity is that a series can be non-stationary in an infinite number of ways. And, to make the problem even more complex, some types of non-stationarity will skew the distributions of the estimates such that inference based on standard distributions such as the t, the F or the χ² distributions is not only wrong but completely misleading. In order to model time series, we need to understand what non-stationarity is, how to estimate it and how to deal with it.
7.1.2 Weak Stationarity

Of the two concepts, weak stationarity is the practical one. Weak stationarity is defined in terms of the first two moments of the process, the mean and the variance. A process {x_t} is (weakly) stationary if (1) the mean is independent of time t,

E{x_t} = μ,

(2) the variance exists and is less than infinity,

var{x_t} = σ² < ∞,

and (3) the covariance between two observations depends only on the distance k between them,

cov(x_t, x_{t−k}) = γ_k.

Thus, the mean and the variance are constant over time, and the covariance between two values of the process is only a function of the distance between the two points.
A related concept is covariance stationarity: if the autocovariances go to zero as the distance between the two points increases, the series is said to be covariance stationary (or ergodic),

cov(x_t, x_{t−k}) → 0 as k → ∞.

This definition brings us to the concept of ergodicity, which can be understood as a weak form of average asymptotic independence. The most important, but not sufficient, condition for a series to be ergodic is

lim_{T→∞} T^{−1} Σ_{k=1}^{T} cov(x_t, x_{t−k}) = 0.
7.1.3 Strong Stationarity

Strong stationarity is defined in terms of the distribution function of {x_t}. Suppose a process that is ordered from observation 1 up to observation T. Each observation up to T can be thought of as a random variable. Hence we can write the first variable in the process as x_{t1}, the second variable x_{t2}, etc. up until x_{tT}. The distribution function for this process is F(x_{t1}, x_{t2}, ..., x_{tT}). Next, define the distribution function of {x_t} for another time interval, namely t + j, where j = 1, 2, ..., T. This leads to the distribution function F_j(x_{t+j1}, x_{t+j2}, ..., x_{t+jT}). Strong stationarity requires that the two distribution functions are identical, such that F(x_{t1}, x_{t2}, ..., x_{tT}) = F_j(x_{t+j1}, x_{t+j2}, ..., x_{t+jT}), meaning that the characteristics of the process are independent of time. We will get the same means, etc. independently of the time period we choose for our calculations. By letting j take different integer values we get j:th order strong stationarity. Thus, j = 1 leads to first order (strong) stationarity, etc.

Strong stationarity incorporates the definition of weak stationarity. But the practical problem is that it is difficult to work with distribution functions for continuous random variables, so strong stationarity is mainly a theoretical concept.
The autoregressive process of order p, AR(p), is

x_t = a_0 + a_1 x_{t−1} + ... + a_p x_{t−p} + e_t,

and the moving average process of order q, MA(q), is

x_t = e_t − b_1 e_{t−1} − ... − b_q e_{t−q},

where e_t is a white noise process. The combination of autoregressive and moving average processes gives the ARMA(p,q) model

x_t = a_0 + a_1 x_{t−1} + ... + a_p x_{t−p} + e_t − b_1 e_{t−1} − ... − b_q e_{t−q}.
7.1.4 Testing for Autocorrelation

The Box-Pierce statistic combines the first p sample autocorrelations of the estimated residuals,

ρ̂_k = Σ_{t=k+1}^{T} ε̂_t ε̂_{t−k} / Σ_{t=1}^{T} ε̂_t²,

into

BP = T Σ_{k=1}^{p} ρ̂_k².

Under the null of no autocorrelation this test statistic has a χ²(p) distribution. The Box-Pierce statistic is best suited for testing the residuals of an AR model. A modification, for ARMA and more general regression models, is the so-called Box-Ljung statistic,

BL = T(T + 2) Σ_{r=1}^{p} ρ̂_r² / (T − r).
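Both statistics are implemented in standard software. A minimal sketch, applying the Ljung-Box version to hypothetical residuals (here simulated white noise, so the null should not be rejected):

import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(2)
residuals = rng.normal(size=300)  # hypothetical model residuals

# Ljung-Box statistic and p-value for p = 1, ..., 10 autocorrelations
result = acorr_ljungbox(residuals, lags=10, return_df=True)
print(result)  # columns: lb_stat (the BL statistic) and lb_pvalue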
7.1.5 The Lag Operator

When dealing with time series and dynamic econometric models, the expressions are easier to handle with the backward shift operator (B) or the lag operator (L).⁵ The backward shift operator is the symbol most often used in statistical textbooks. Econometricians tend to use the lag operator more often. The first order lag operator is defined as

L x_t = x_{t−1},   (7.4)

and repeated application gives

L^n x_t = x_{t−n}.   (7.5)

The lag operator is an expression such that when it is multiplied with an observation at any given time, it will shift the observation one period backwards in time. In other words, the lag operator can be viewed as a time traveling device, which makes it possible to travel both forwards and backwards in time. A forward shift operator can be constructed along the same lines. Thus, moving forward n observations in the series from an observation at time t is done by L^{−n} x_t = x_{t+n}. The properties of the lag operator imply that we can write an autoregressive expression of order p (AR(p)) as

a_0 x_t + a_1 x_{t−1} + a_2 x_{t−2} + ... + a_p x_{t−p} = a_0 x_t + a_1 L x_t + a_2 L² x_t + ... + a_p L^p x_t ≡ A(L) x_t.   (7.6)

⁵ The practical difference between using the lag operator and the backward shift operator is that the lag operator also affects the conditional expectations operator E_t, which is of interest when working with economic theories dealing with expectations.
Notice that the lag operator can be moved across the equality sign. The AR(1) model, x_t = a_1 x_{t−1} + ε_t, can be written as (1 − a_1 L) x_t = ε_t, or A(L) x_t = ε_t, or x_t = [A(L)]^{−1} ε_t. If necessary, the lag length of the process can be indicated as A_p(L). An ARMA(p, q) process can be written compactly as

A_p(L) x_t = B_q(L) ε_t.   (7.7)

Skipping the indication of lag lengths for convenience, the ARMA model can be written as x_t = [A(L)]^{−1} B(L) ε_t or alternatively, depending on the context, as [B(L)]^{−1} A(L) x_t = ε_t. Thus, the lag operator works as any mathematical expression. However, whether or not moving the lag operator around results in a meaningful expression is associated with the principles of stationarity and invertibility, known as duality.
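A sketch of how [A(L)]^{−1} can be handled numerically: the coefficients of the inverse polynomial follow recursively from A(L) A(L)^{−1} = 1. The AR(1) value 0.5 is an arbitrary illustration.

import numpy as np

def invert_lag_polynomial(a, n_terms):
    """Given A(L) = a[0] + a[1]L + a[2]L^2 + ..., return the first n_terms
    coefficients c of C(L) = A(L)^{-1}, using A(L)C(L) = 1 term by term."""
    c = np.zeros(n_terms)
    c[0] = 1.0 / a[0]
    for j in range(1, n_terms):
        # the coefficient of L^j in A(L)C(L) must equal zero
        acc = sum(a[i] * c[j - i] for i in range(1, min(j, len(a) - 1) + 1))
        c[j] = -acc / a[0]
    return c

# A(L) = 1 - 0.5L, so A(L)^{-1} = 1 + 0.5L + 0.25L^2 + ...
print(invert_lag_polynomial([1.0, -0.5], 6))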
7.1.6 Generating Functions

The function A(L) is a convenient way of writing the sequence. More generally we can refer to any expression of the type A(L) as a generating function. This includes the mean operator, the variance and covariance operators etc. Generating functions summarize a lot of information about sequences in a compact way and are an important tool in time series analysis. Their main advantage is that they save time and make the expressions much simpler, since a number of mathematical operations can be applied directly to generating functions. As an example, given certain conditions concerning the sum Σ a_i, we can invert A(L), so that A(L)^{−1} A(L) = 1.
The generating function for the lag operator is

D(L) = Σ_{i=0}^{k} d_i L^i,   (7.8)

where the d_i are generated by some other function. The point here is that it is often easier to do manipulations on D(L) directly than on each individual element in the expression. In the example above, we would refer to A(L) x_t as the generating function of x_t.
A property of generating functions is that they are additive. If we have two series, a_i and b_i, with i = 0, 1, 2, ..., and define a third series as c_i = a_i + b_i, it then follows that

C(L) = A(L) + B(L).   (7.9)
Another property is that of convolution. Take the series a_i and b_i from above; a new series d_i can then be defined by

d_i = a_0 b_i + a_1 b_{i−1} + a_2 b_{i−2} + ... + a_i b_0 = Σ_{h=0}^{i} a_h b_{i−h},   (7.10)

which in terms of generating functions is

D(L) = A(L) B(L).   (7.11)

The results stated in this section should be compared with chapter 19, below, which shows how long-run multipliers, etc. can be derived from the lag operator.
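The convolution property is just polynomial multiplication, so it can be illustrated in one line with numpy; the two short coefficient sequences below are arbitrary.

import numpy as np

# A(L) = 1 + 0.5L and B(L) = 1 - 0.2L + 0.1L^2 (arbitrary sequences)
a = np.array([1.0, 0.5])
b = np.array([1.0, -0.2, 0.1])

# D(L) = A(L)B(L): the coefficients d_i are the convolution of a and b
d = np.convolve(a, b)
print(d)  # [1.0, 0.3, 0.0, 0.05]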
7.1.7 The Difference Operator

Given the definition of the lag operator (or the backward shift operator), the difference operator (Δ) is defined as

Δ = 1 − L,   (7.12)
which corresponds in continuous time to

Δ = 1 − e^{−D},   (7.13)

where D = d/dt.
Differences of higher order are denoted in the same way as for the lag operator. Thus for the second difference of x_t we write

Δ² x_t = (1 − L)² x_t = (1 − 2L + L²) x_t = x_t − 2x_{t−1} + x_{t−2}.   (7.14)
More generally, the d:th difference is

Δ^d x_t = (1 − L)^d x_t,

and the seasonal difference is

Δ_s x_t = (1 − L^s) x_t = x_t − x_{t−s}.

The subscript s indicates the interval over which we take the (seasonal) difference. If x_t is quarterly, setting s = 4 leads to the yearly changes in the series. This new series can then be differenced by using the difference operator,

Δ^d Δ_s x_t = Δ^d (1 − L^s) x_t.
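These operators map directly onto standard data-handling tools. A small sketch with a hypothetical quarterly series:

import numpy as np
import pandas as pd

idx = pd.period_range("2000Q1", periods=12, freq="Q")
x = pd.Series(np.arange(12, dtype=float) ** 2, index=idx)  # hypothetical data

dx = x.diff()            # first difference: (1 - L)x_t
d2x = x.diff().diff()    # second difference: (1 - L)^2 x_t
s4x = x.diff(4)          # seasonal difference, s = 4: (1 - L^4)x_t
ds4x = x.diff(4).diff()  # difference of the seasonal difference

print(pd.DataFrame({"x": x, "dx": dx, "d2x": d2x, "s4x": s4x, "ds4x": ds4x}))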
7.1.8 Filters
7.1.9 Stability and Roots

The solution of a dynamic model can be written as the sum of two parts,

y_t = y_p + y_c,   (7.15)

where y_p represents the particular solution, the long-run steady state equilibrium or the stationary long-run mean of y_t, and y_c represents the complementary solution, the deviation from the long-run steady state.
Dynamic stability requires that y_c vanishes as T → ∞. The roots of the polynomial A(L) tell us if this occurs. Given a change in ε_t, what will happen to y_{t+1}, y_{t+2}, ...? Will y_{t+1} explode, continue to grow for ever, or change temporarily until the process returns to the steady state equilibrium described by y_p? The roots are given by solving for the r:s in the following equation,

r^p + a_1 r^{p−1} + a_2 r^{p−2} + ... + a_p = 0.   (7.16)
This equation leads to the latent roots of the polynomial. The condition for stability, when using the latent roots, is that the roots should be less than unity, or that the roots should be inside the unit circle. Roots equal to unity, so-called unit roots, imply an ever-growing series (a stochastic trend); roots greater than unity imply an explosive process. Complex roots indicate that the adjustment is cyclical. Though not very likely, the process could follow an explosive cyclical path or display cyclical permanent shocks. If the process is stationary, following a shock, y_t will return to its stationary long-run mean. The case with one or several unit roots is of particular interest because it represents stochastic growth in a non-stationary variable. Series with one or more unit roots are also called integrated series. Many economic time series appear to have a unit root, or roots close to unity.
Using latent roots to define stability is common, but it is not the only way. Latent roots, or eigenvalues, are motivated by the fact that they are easier to work with when matrix algebra is used. An alternative way of defining stability is to solve for the roots (λ) in the following equation,

1 − a_1 λ − a_2 λ² − ... − a_p λ^p = 0,   (7.17)

where λ = 1/r. If the roots are greater than unity in absolute value, |λ| > 1, so that λ lies outside the unit circle, the process is stationary; if the roots are less than unity the process is explosive. The historical literature on time series uses both definitions; however, latent roots, or eigenvalues, are now the established standard.
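Both definitions are easy to check numerically. A sketch for an arbitrary AR(2), computing the latent roots from (7.16) and the inverse roots from (7.17):

import numpy as np

# AR(2): x_t = 1.2 x_{t-1} - 0.35 x_{t-2} + e_t (arbitrary coefficients)
a1, a2 = 1.2, -0.35

# Latent roots: r^2 - a1*r - a2 = 0; stability requires |r| < 1
latent = np.roots([1.0, -a1, -a2])
print("latent roots:", latent, "stable:", np.all(np.abs(latent) < 1))

# Alternative: roots of 1 - a1*z - a2*z^2 = 0 must lie outside the unit circle
alt = np.roots([-a2, -a1, 1.0])
print("inverse roots:", alt, "stable:", np.all(np.abs(alt) > 1))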
7.1.10 Fractional Integration
7.1.11 The Box-Jenkins Approach

The Box-Jenkins approach is a practical way of finding a suitable ARMA representation of a given time series. The steps are:
1) Identification. Determine (i) whether seasonal differencing is necessary to remove seasonal factors, (ii) the number of times the series needs to be differenced to achieve stationarity, and (iii) a suitable order of the ARMA process, by studying the ACF and the PACF.

2) Estimation. The identification step leads to (1) a stationary series and (2) a narrowed-down set of possible ARMA(p,q) processes to estimate. (Which method of estimation? Remember the problems with t-values!)

3) Testing. Test the estimated model(s) for white noise residuals, using the Box-Pierce test for autocorrelation. Among models with white noise residuals, pick the one with the smallest information criterion (AIC, BIC). (What are the differences among information criteria?)

This leads quickly to a forecast model, or a representation of the expectations generating mechanism that can be used in simple (rational) expectations modeling; a code sketch of these steps follows below.

The limitations of univariate ARIMA models should be kept in mind. Most economic problems are multivariate: variables depend on each other. Furthermore, the procedure is only aimed at finding a forecast model. To build an econometric model that can be used for inference, the demands for testing are higher.
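A minimal sketch of the estimation and testing steps with statsmodels, on a hypothetical simulated series; in real work the identification step would also involve inspecting the ACF and PACF.

import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(3)

# Hypothetical data: a simulated ARMA(1,1) series
T = 400
y = np.zeros(T)
e = rng.normal(size=T)
for t in range(1, T):
    y[t] = 0.6 * y[t - 1] + e[t] + 0.3 * e[t - 1]

# Estimate candidate models, check residuals, compare information criteria
for order in [(1, 0, 0), (0, 0, 1), (1, 0, 1), (2, 0, 2)]:
    res = ARIMA(y, order=order).fit()
    lb = acorr_ljungbox(res.resid, lags=[10], return_df=True)
    print(order, "AIC:", round(res.aic, 1), "BIC:", round(res.bic, 1),
          "Ljung-Box p:", round(float(lb["lb_pvalue"].iloc[0]), 3))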
7.1.12 Uniqueness of ARMA Models

The parameters of an ARMA model might not be unique. To see the conditions for uniqueness, decompose the polynomials of the ARMA process A(L) y_t = B(L) ε_t into their factors as

A(L) = ∏_{i=1}^{p} (1 − λ_i L),   (7.18)

and

B(L) = ∏_{j=1}^{q} (1 − θ_j L).   (7.19)

If a factor in A(L) equals a factor in B(L), it cancels from both sides, and the same process can be represented with fewer parameters.
There is a link between AR and MA models, as the presentation of the lag operator indicated. An AR process with an infinite number of lags can, under certain conditions, be rewritten as a finite MA process. In a similar way an infinite moving average process can be inverted to an autoregressive process of finite order.

These results have two practical implications. The first is that, in practical modelling, a long MA process can often be rewritten as a shorter AR process, and the other way around. The second implication is that the two processes are complementary to each other. The combination of AR and MA into ARMA will lead to relatively parsimonious models, meaning models with quite few parameters. In fact, it is quite uncommon to find ARMA models above the order p = 2 and q = 2.

The AR(1) process, y_t = a_1 y_{t−1} + ε_t, can be written as (1 − a_1 L) y_t = ε_t, and in the next step as y_t = (1 − a_1 L)^{−1} ε_t. The term (1 − a_1 L)^{−1} represents the sum of an infinite moving average process,

y_t = (1 − a_1 L)^{−1} ε_t = Σ_{i=0}^{∞} b_i ε_{t−i} = B(L) ε_t,

with b_i = a_1^i. This is thus an infinite MA process, but it is only stationary if the latent roots of A(L) are inside the unit circle. The latter has one interesting implication: it is often convenient to rewrite an AR or a VAR in moving average form and investigate the properties and consequences of non-stationarity from the MA representation. The conditions are similar, and actually more general, for multivariate processes, such that VAR(p) ⟺ VMA(q).
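The duality can be verified numerically; statsmodels provides both conversions. Note the sign convention: the lag polynomials are passed with their leading one and negated AR coefficients. All parameter values below are illustrative.

from statsmodels.tsa.arima_process import arma2ma, arma2ar

# AR(1) with a_1 = 0.6: (1 - 0.6L) y_t = e_t
ar = [1, -0.6]   # coefficients of A(L), including the leading 1
ma = [1]

# Its MA representation: the weights should be 0.6**i
print(arma2ma(ar, ma, lags=6))

# Conversely, an MA(1), y_t = (1 + 0.4L) e_t, inverts to an infinite AR
print(arma2ar([1], [1, 0.4], lags=6))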
7.2.2 The Wold Decomposition

According to the Wold decomposition theorem, any covariance stationary process can be written as an infinite moving average of white noise,

y_t = Σ_{j=0}^{∞} b_j e_{t−j},

where b_0 = 1, and e_t is stationary (white noise) such that Σ_{j=0}^{∞} b_j² < ∞, E(e_t) = 0, E(e_t²) = σ², and E(e_t e_{t−j}) = 0 for j ≠ 0.

The theorem has two implications. The first is that any series which appears to be covariance stationary can be modeled as an infinite MA process. Given the principle of duality, we can expect to find a finite autoregressive process as well. Since many economic time series are covariance stationary after first differencing, we expect ARMA models, as well as linear autoregressive distributed lag models, to work quite well for these series. The second implication is that we should be able to extract a white noise process out of any covariance stationary process. This leads to the conclusion that finding (or constructing) a white noise process in an empirical model is a basic necessity in the modeling process, because most economic time series are covariance stationary after differencing.
The presentation above has focused on the practical side of time series modelling, but time series can also be described and analysed theoretically. Consider the AR(1) model y_t = a_1 y_{t−1} + ε_t. The series y_t is generated by the parameter a_1, the white noise process ε_t and some initial value at the beginning of time, say y_0 at t = 0. Thus, given an initial value, a parameter a_1 and a random number generator that generates ε_t ∼ N(0, σ²), where we for simplicity can set σ² = 1, it becomes possible to generate realisations of y_t using Monte Carlo techniques. The different outcomes of the series y_t can then be used to estimate the distribution of â_1, to learn how to do inference in small and medium sized samples, and to understand the distributions as a_1 → 1.0.
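A sketch of such a Monte Carlo experiment; the sample size, number of replications and values of a_1 are all arbitrary design choices.

import numpy as np

rng = np.random.default_rng(4)

def ols_a1_distribution(a1, T=100, reps=5000):
    """Simulate AR(1) series and return the OLS estimates of a1."""
    est = np.empty(reps)
    for r in range(reps):
        y = np.zeros(T)
        e = rng.normal(size=T)       # epsilon_t ~ N(0, 1)
        for t in range(1, T):
            y[t] = a1 * y[t - 1] + e[t]
        est[r] = np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)
    return est

for a1 in (0.5, 0.9, 1.0):
    est = ols_a1_distribution(a1)
    print(f"a1 = {a1}: mean = {est.mean():.3f}, "
          f"bias = {est.mean() - a1:+.3f}, std = {est.std():.3f}")

The growing downward bias as a_1 approaches one illustrates why inference near the unit root requires non-standard distributions.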
We can also calculate the mean and the variance of y_t. The series y_t is not independent, since it is autoregressive. Therefore, the sample mean and variance of the observed y_t are not informative for describing the series. Instead, look at the mean of the AR(1) process in the form of the expected value: E(y_t) = E(a_1 y_{t−1}) + E(ε_t). Looking at the expression, the left hand side tells us that the right hand side represents the mean of y_t. The expected value of a white noise is by definition zero, so E(ε_t) = 0. Since a_1 is a given constant, we have for the other factor E(a_1 y_{t−1}) = a_1 E(y_{t−1}). To find an answer we need to substitute for the lags y_{t−1}, y_{t−2}, etc. With a constant term α in the process, repeated substitution gives

E(y_t) = α Σ_{i=0}^{∞} a_1^i = α / (1 − a_1).

The last step is simply an application of the solution to an infinite geometric series, which works in this case as long as the AR process is stationary, |a_1| < 1. It is important that you understand the use of the expectations operator in this example, because the technique is frequently used to derive a number of results. We could have reached the result in a simpler way by using the lag operator. Take expectations of (1 − a_1 L) y_t = α + ε_t. The lag operator is a deterministic factor, which is why the result is E(y_t) = α / (1 − a_1). Again, the left hand side is the sum of an infinite process. If there is no constant, α = 0, it follows immediately that E(y_t) = 0.
What is the variance of the process y_t? The answer is given by understanding that E(y_t y_t) = var(y_t) = σ². Thus, start from the AR(1) process and multiply both sides with y_t to get y_t y_t = a_1 y_t y_{t−1} + y_t ε_t. Next, take expectations of both sides, E(y_t y_t) = a_1 E(y_t y_{t−1}) + E(y_t ε_t), and substitute y_t y_{t−1} and y_t ε_t as (a_1 y_{t−1} + ε_t) y_{t−1} = a_1 y²_{t−1} + ε_t y_{t−1} and y_t ε_t = (a_1 y_{t−1} + ε_t) ε_t. From this we have a_1² E(y²_{t−1}) and a_1 E(ε_t y_{t−1}) + E(ε_t²). In the latter expression we have by definition that E(ε_t y_{t−1}) = 0 (recall the basic assumptions of OLS) and that E(ε_t²) = σ_ε². Putting the results together,

E(y_t y_t) = σ² = a_1² σ² + σ_ε²,

so that

σ² = σ_ε² / (1 − a_1²).

For the autocovariance at lag k we have, in the same way,

E(y_t y_{t−k}) = a_1 E(y_{t−1} y_{t−k}) + E(ε_t y_{t−k}),

and repeated substitution gives

γ_k = E(y_t y_{t−k}) = a_1^k σ_ε² / (1 − a_1²).

From this, the autocorrelation function is

ρ_k = γ_k / γ_0 = a_1^k.

From this expression it is obvious that the autocorrelation function of the AR(1) process dies out slowly as the lag length k increases.

Calculating the mean, variance, autocovariances and autocorrelations for AR(1), AR(2), MA(1) and MA(2) processes are standard exercises in time series courses, followed by an investigation of the unit root case a_1 = 1. (To be completed...)
7.3.1 Seasonality
7.3.2 Non-stationarity

(To be completed.) Differencing until stationarity is the standard Box-Jenkins approach, but it is a bit ad hoc. In econometrics the approach is to test first, and only reject the null of integration of order one in the case of strong evidence against it. Alternatives include linear deterministic trends, polynomial trends etc. These are dangerous: they invite spurious detrending under the maintained hypothesis of integrated variables.
7.4 Aggregation

The following section offers a brief discussion of the problems of aggregation. The interested reader is referred to the literature to learn more [Wei (1990) is a good textbook with many references on the subject; see also Sjöö (1990, ch. 4)]. Aggregation of series means aggregation over agents and markets, or aggregation over time. The stock of money, measured by (M3) at the end of the month, represents an aggregation over individuals. A series like aggregate consumption in the national accounts represents an aggregation over both individuals and time.
Aggregation over time is usually referred to as temporal aggregation. Money holdings is a stock variable which can be measured at any point in time. Temporal aggregation of a stock variable implies picking observations at larger intervals, using, say, a money series measured at the end of a quarter instead of at the end of each month. Consumption, on the other hand, is a flow variable; it cannot be measured at a point in time, only as the sum of consumption over a given period. Temporal aggregation in this case implies taking the sum of consumption over intervals. The distinction is of importance because the effects of temporal aggregation are different for stock and flow variables.
Aggregation, both over time and over individuals, can change the functional form of the distribution of the variables, and it can affect the residual variance and t-values. Exactly how aggregation changes a model varies from situation to situation. There are, however, some general conclusions regarding temporal aggregation which we will repeat in this section. In many situations there is little we can do about these problems, except working with continuous time models, and/or selecting series with a low degree of temporal aggregation. That the problem is hard to deal with is no excuse for forgetting or hiding it, as is done in many textbooks in econometrics. The area of aggregation is an interesting challenge for econometricians since it has not been explored as much as it deserves.
An interesting example of the consequences of aggregation is given in Christiano and Eichenbaum (1987). They show how one can get extremely different results by using discrete time models with yearly, quarterly and monthly data compared with a continuous time model. They tried to estimate the speed of adjustment in the stock of inventories in the U.S. national accounts. Using a continuous time model they estimated the average time for closing 95% of the gap between the desired and the actual stock of inventories to be 17 days. The discrete models predicted much longer adjustment times: using monthly data the result was 46 days, with quarterly data 7 months, and with yearly data 5 1/2 years!
Aggregation also becomes an important problem if we have a theory that describes the stochastic behavior of a variable which we would like to test with empirical data. There are many results, in macro and finance, that predict that series should follow a random walk, or be the outcome of a martingale process.
estimated strength of the relationship, and can therefore lead to wrong conclusions from Granger non-causality tests. For flow variables, on the other hand, temporal aggregation turns a one-direction causality into what will appear to be a two-sided causality. In this situation a clear warning is in place.
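The difference between aggregating stocks and flows can be illustrated in a few lines; the monthly series below are purely hypothetical.

import numpy as np

rng = np.random.default_rng(5)
n_months = 240

stock = np.cumsum(rng.normal(size=n_months))   # e.g. end-of-month money stock
flow = rng.normal(loc=1.0, size=n_months)      # e.g. monthly consumption

# Temporal aggregation to quarters (m = 3 months):
months = np.arange(n_months).reshape(-1, 3)
stock_q = stock[months[:, -1]]       # stocks: skip-sample end-of-quarter values
flow_q = flow[months].sum(axis=1)    # flows: sum over the quarter

def acf1(x):
    x = x - x.mean()
    return np.sum(x[1:] * x[:-1]) / np.sum(x**2)

# Aggregation changes the dynamics, e.g. of the differenced stock series
print("monthly d(stock) acf(1):  ", acf1(np.diff(stock)))
print("quarterly d(stock) acf(1):", acf1(np.diff(stock_q)))
print("monthly flow acf(1):      ", acf1(flow))
print("quarterly flow acf(1):    ", acf1(flow_q))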
Finally, we also look at the aggregation of two random variables, X̃_t and Ỹ_t. Suppose that they are two independent stationary processes with mean zero,

E[X̃_t | y_t] = E[Ỹ_t | x_t] = 0,   (7.22)

with autocovariance functions

cov(x_t, x_{t−k}) = γ_{x,k},   (7.23)

and

cov(y_t, y_{t−k}) = γ_{y,k}.   (7.24)

If the two series are added into

Z̃_t = X̃_t + Ỹ_t,   (7.25)

the autocovariances of the sum are

γ_{z,k} = γ_{x,k} + γ_{y,k}.   (7.26)

Suppose, furthermore, that

X̃_t ∼ ARMA(p_1, q_1),   (7.27)

and

Ỹ_t ∼ ARMA(p_2, q_2),   (7.28)

and

Z̃_t = X̃_t + Ỹ_t,   (7.29)

then

Z̃_t ∼ ARMA(x_1, x_2).   (7.30)
7. The rational distributed lag model, y_t = [B(L)/A(L)] x_t + ε_t.

8. The transfer function model, y_t = [B(L)/A(L)] x_t + [θ(L)/φ(L)] ε_t.
Notice that the transfer function is also a rational distributed lag, since it contains a ratio of two lag structures. Also, (7) and (8) can be viewed as distributed lag models since D(L) = [B(L)/A(L)]. Notice that rational distributed lag models require some information about B(L) to be workable.

Imposing restrictions on the lag structure B(L) in distributed lag models leads to further models:

9. Geometric lag structure (= Koyck), where B(L) is assumed to decline according to some exponential function.

10. Polynomial distributed lag (PDL) models, where B(L) declines according to some polynomial function, decided a priori (= Almon lags).

11. All other types of a priori restrictions on B(L) not covered by (9) and (10).⁷

12. The error correction model. This model embraces all of the above models as special cases. The following explains why this is so.
Introduction to Error Correction Models

Economic time series are often non-stationary; their means and variances change over time. The trend component in the data can either be deterministic or stochastic, or a combination of both. Fitting a deterministic trend assumes that the data series grows at a fixed rate each period. This is seldom a good way of characterizing trends in economic time series. Instead they are better described as containing stochastic trends with a drift. The series might be growing over time, but it is not possible to predict whether it grows or declines in the next period. Variables with stochastic trends can be made stationary by taking first differences. This type of variable is called integrated of order 1, where the order of integration is determined by the number of times the variable needs to be differenced before it becomes stationary.
A necessary condition for fitting trending data in an econometric model is that the variables share the same trend; otherwise there is no meaningful long-run relationship between them.⁸ Testing for co-integration is a way of testing if the data have a common trend, or if they tend to drift apart as time increases. The simplest way to test for cointegration is the so-called Engle and Granger two-step procedure. The test implies determining whether the data contain stochastic trends, and if so, testing if there are common trends. If x_t and y_t are two variables, with stochastic trends, that become stationary after first differencing, cointegration can be tested by running the following co-integrating regression,

y_t = α + β x_t + ν_t.   (7.31)

⁷ Restrictions are put on the lag process to make the estimation more efficient. A priori restrictions can be motivated by a limited sample and multicollinearity that affects the estimated standard errors of the individual lags. These types of restrictions are not used anymore. Today, it is recognized that it is more important to focus on information criteria, white noise residuals and building a well-defined statistical model, instead of imposing restrictions that might not be valid.

⁸ The exception is tests of the efficient market hypothesis, and related tests of rational expectations. See Appendix A in Sjöö and Sweeney (1998) and Sjöö (1998).
If both y_t and x_t are integrated variables of the same order, a necessary condition for a statistically meaningful long-run relationship is that the residual term (ν_t) is stationary. If that is the case, the error term from the regression can be seen as temporary deviations from the long-run, and α and β can be viewed as estimates of the long-run steady state relation between x and y.
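A minimal sketch of the two steps, on simulated cointegrated data; the data generating process and all parameter values are illustrative only.

import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(6)
T = 300

# Hypothetical DGP: x_t is a random walk, y_t = 1 + 0.5 x_t + stationary noise
x = np.cumsum(rng.normal(size=T))
y = 1.0 + 0.5 * x + rng.normal(scale=0.5, size=T)

# Step 1: the co-integrating regression y_t = alpha + beta*x_t + nu_t
ols = sm.OLS(y, sm.add_constant(x)).fit()
print("alpha, beta:", ols.params)

# Step 2: test the residuals for a unit root (note that the ordinary ADF
# critical values are not strictly valid for estimated residuals)
adf_stat, pvalue, *_ = adfuller(ols.resid, regression="n")
print("ADF on residuals:", adf_stat, pvalue)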
A general way of building a model of time series, without imposing ad hoc a priori restrictions, is the autoregressive distributed lag model. For two variables we have

A(L) y_t = B(L) x_t + ε_t,   (7.32)

where the lag polynomials are A(L) = Σ_{i=0}^{k} a_i L^i and B(L) = Σ_{i=0}^{k} b_i L^i. The first coefficient in A(L) is set to unity, a_0 = 1. The lag length is chosen such that the error term becomes a white noise process, ε_t ∼ NID(0, σ²). The long-run solution of this model is given by

y_t = β x_t + ε_t,   (7.33)
where β = B(1)/A(1). Without loss of generality we can use the difference operator, Δx_t = x_t − x_{t−1}, to rewrite the autoregressive distributed lag model as an error correction model,

Δy_t = Σ_{i=0}^{k} γ_i Δx_{t−i} + Σ_{i=1}^{k} δ_i Δy_{t−i} + π ECM_{t−1} + ε_t.   (7.34)
The conditional expectation of y_t given the model is

E_t{y_t} = ŷ_t = [A(L)]^{−1} B(L) x_t,   (8.1)

and the long-run solution, setting L = 1, is

ŷ = [B(1)/A(1)] x̄.   (8.2)
The total effect of a change in x_t is given by the sum of the coefficients in D(L) when L = 1. If there are m lags in D(L), the total multiplier is

D(1) = (δ_0 + δ_1 + ... + δ_m) = Σ_{j=0}^{m} δ_j.   (8.5)

The cumulative share up to lag j is

[Σ_{i=0}^{j} δ_i] / D(1),   (8.7)

such that it represents the share of the total multiplier up until the j:th lag.
The mean lag is defined as

[Σ_{j=0}^{m} j δ_j] / [Σ_{j=0}^{m} δ_j].   (8.8)

Notice that m could be equal to infinity if we have a stable model, with stationary variables, such that the infinite sum of δ_i converges to a constant in the long run.

The mean lag can be derived in a more sophisticated way, by differentiating D(L) with respect to L and then dividing by D(1). That is,
D(L) = δ_0 + δ_1 L + δ_2 L² + δ_3 L³ + ... + δ_s L^s,   (8.9)

and

D′(L) = δ_1 + 2δ_2 L + 3δ_3 L² + ... + s δ_s L^{s−1}.   (8.10)

Evaluating at L = 1, the mean lag is

D′(1)/D(1) = B′(1)/B(1) − A′(1)/A(1).   (8.11)
Finally we have the median lag, representing the number of periods required for 50% of the total effect to be achieved. The median lag is the smallest j which solves

[Σ_{i=0}^{j} δ_i] / D(1) = 0.50.   (8.12)
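Given estimates of the δ_j, the total, mean and median lags are simple to compute; the weights below are arbitrary illustrative numbers.

import numpy as np

# Hypothetical distributed lag coefficients delta_0, ..., delta_m
delta = np.array([0.4, 0.3, 0.15, 0.1, 0.05])

total = delta.sum()                                        # D(1), eq. (8.5)
mean_lag = (np.arange(len(delta)) * delta).sum() / total   # eq. (8.8)

# Median lag: smallest j with a cumulative share >= 0.5, eq. (8.12)
share = np.cumsum(delta) / total
median_lag = int(np.argmax(share >= 0.5))

print("total multiplier:", total)
print("mean lag:", mean_lag)
print("median lag:", median_lag)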
9. VECTOR AUTOREGRESSIVE MODELS

The extension of ARIMA modeling into a multivariate framework leads to Vector Autoregressive (VAR) models, Vector Moving Average (VMA) models and Vector Autoregressive Moving Average (VARMA) models. In economics, since most variables display autocorrelation and are cross-correlated, VAR models are an interesting choice for modeling economic systems. Vector models can be constructed using similar techniques as those for single variable ARIMA models. The autocorrelation and partial autocorrelation functions can be extended to display cross-correlations among the variables in the system. However, when modelling more than two variables, these cross autocorrelation and cross partial autocorrelation functions quickly turn into complex matrix expressions for each lag. Thus, the cross-correlation functions are not practical tools to work with.
The advantage of using VARs is that a VAR represents a statistical description of the economy. When using ARIMA on univariate series, in many situations the combination of AR and MA processes turns out to be an efficient way of finding a stochastic representation of a process. VAR models are usually effective in modeling multivariate systems, and can be used to make forecasts and dynamic simulations of different shocks to the system. These shocks can come from policy, from productivity, or basically anywhere in the economy, and the shocks can be assumed to be transitory or permanent. The main complicating factor is that in order to understand what shocks and simulations actually mean, it is necessary to identify the underlying economic relations among the variables. To make VAR models work for economic analysis it is necessary to impose some restrictions on the residual covariance matrix of the VAR. Thus, there is no free lunch here in terms of avoiding discussions of causality and simultaneity problems. It is necessary to point out the latter because in the beginning of the history of VAR models it seemed like VAR models could be used without economic theory, but that was built on a misunderstanding.

In econometrics the focus is on finding a parsimonious VAR representation with NID residuals.
Let x_t be a p-dimensional vector of stochastic time series variables, represented as the k:th order VAR model,

x_t = Σ_{i=1}^{k} A_i x_{t−i} + e_t,  or  A(L) x_t = e_t,   (9.1)

where A_i is the matrix of coefficients at lag number i, so that A(L) = I − A_1 L − ... − A_k L^k, and e_t is a vector of white noise residual terms. Notice that all variables across all equations have the same lag length (k). This is so because it makes it possible to estimate the system with OLS. If the lag order is allowed to vary, the VAR must be estimated with the seemingly unrelated regression method.
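A sketch of estimating such a system, equation by equation via OLS, using the VAR class in statsmodels; the bivariate data generating process is hypothetical.

import numpy as np
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(7)
T = 400

# Hypothetical bivariate system
data = np.zeros((T, 2))
for t in range(1, T):
    data[t, 0] = 0.5 * data[t-1, 0] + 0.2 * data[t-1, 1] + rng.normal()
    data[t, 1] = 0.3 * data[t-1, 1] + rng.normal()

model = VAR(data)
print(model.select_order(maxlags=8).summary())  # information criteria per lag

res = model.fit(1)   # same lag length in every equation, estimated by OLS
print(res.params)    # reduced form coefficient matrices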
A VAR model can be inverted into its VMA form as

x_t = Σ_{i=0}^{∞} C_i e_{t−i} = C(L) e_t.

The MA form is convenient for analysing the properties of a VAR and investigating the consequences of shocks to the system. Estimation, however, is usually done in the VAR form, and is straightforward since each equation can be estimated individually with OLS. The lag length (k) of the autoregressive process is chosen such that the estimated residual process, in combination with constants, trend, dummy variables and seasonals, becomes a white noise process in each equation. The idea is that the lag length is equal for all variables in all equations.
A second order VAR of dimension p with a constant can be written out as

(x_{1t}, x_{2t}, ..., x_{pt})′ = (a_{10}, a_{20}, ..., a_{p0})′ + A_1 (x_{1,t−1}, ..., x_{p,t−1})′ + A_2 (x_{1,t−2}, ..., x_{p,t−2})′ + (e_{1t}, e_{2t}, ..., e_{pt})′,

where A_1 and A_2 are p × p coefficient matrices with typical elements a_{ij}.
VAR models were strongly advocated by Sims (1980) as a response to what he described as "incredible restrictions" imposed on standard structural econometric models. Up until the mid 80s, empirical time series econometrics was dominated by the estimation of text-book equations. Researchers simply took an equation from theory, estimated it, and did not pay much attention to whether the model and the data actually fitted each other. Typically, dynamic lag structures were treated in a very ad hoc way. Sims argued that it would be better to find a statistical model which described the data series and their interaction as well as possible. Once the statistical model was there, it could be used to forecast and simulate the economy. In particular, it would according to Sims be possible to analyse the effects of various policy changes.

Sims' critique is related to the "Lucas critique". Lucas showed that in a world of rational expectations, it is not possible to interpret estimated parameters in structural econometric models as (deep) structural behavioral or policy parameters. Since agents base their behavior on plans building on forecasts of variables, not on historical outcomes of variables, the estimated parameters based on historical observations become a mixture of behavioral parameters and forecast generating parameters. Further, under rational expectations, econometric models could not be used to analyze policy changes, because a change in policy would by definition lead to a change in the parameters of the system. Sims therefore argued for VAR models as a statistical description of the economy, under given policy rules. The effects of surprise changes in policy variables could then be analysed in the reduced form.
VAR models represent the reduced form of an underlying structural model. This can be seen by starting from a general (but not necessarily identified) structural model, and rewriting it in reduced form. As an example, start from the bivariate model,

y_t = β_1 + a_{11} x_t + b_{11} y_{t−1} + b_{12} x_{t−1} + ε_{1t},   (9.2)

x_t = β_2 + a_{21} y_t + b_{21} y_{t−1} + b_{22} x_{t−1} + ε_{2t}.   (9.3)

Solving for y_t and x_t gives the reduced form,

y_t = π_1 + π_{11} y_{t−1} + π_{12} x_{t−1} + e_{1t},   (9.4)

x_t = π_2 + π_{21} y_{t−1} + π_{22} x_{t−1} + e_{2t}.   (9.5)
The equations form a bivariate VAR model of order one. The residuals of the VAR model (the reduced form) contain the residuals and the parameters (a_{11} and a_{21}) of the structural model. The reduced system can be estimated by applying OLS to each equation. The parameters of the VAR relate to the structural model as

π_{11} = (a_{11} b_{21} + b_{11}) / (1 − a_{11} a_{21}), etc.

Thus, the parameters of the VAR are complex functions of some underlying structural model, and as such they are on their own quite uninteresting for economic analysis. It is the lag structure, and sometimes the signs, that are more interesting. The two residuals in this VAR are

e_{1t} = (ε_{1t} + a_{11} ε_{2t}) / (1 − a_{11} a_{21}),   (9.6)

and

e_{2t} = (ε_{2t} + a_{21} ε_{1t}) / (1 − a_{21} a_{11}).   (9.7)

These residuals are both white noise, but they are correlated with each other whenever the coefficients a_{11} or a_{21} are different from zero.
The generalization of the structural system above, setting z_t = (y_t, x_t)′, is

B z_t = Γ_0 + Γ_1 z_{t−1} + ε_t,   (9.8)

where

B = [1, −a_{11}; −a_{21}, 1],  Γ_0 = (γ_{01}, γ_{02})′,  and  Γ_1 = [b_{11}, b_{12}; b_{21}, b_{22}].

If both sides of (9.8) are multiplied with B^{−1} the result is

z_t = B^{−1} Γ_0 + B^{−1} Γ_1 z_{t−1} + B^{−1} ε_t = Π_0 + Π_1 z_{t−1} + e_t.   (9.9)
Repeated substitution in (9.9) gives the moving average representation of the system,

z_t = Σ_{i=0}^{t} [c_{11,i}, c_{12,i}; c_{21,i}, c_{22,i}] (e_{1,t−i}, e_{2,t−i})′.   (9.10)
The problem in identifying the VAR and doing the impulse responses is that the covariance matrix is not diagonal. The Cholesky decomposition builds on the fact that any matrix P with the property that P P′ = Σ defines an orthogonal covariance matrix, such that e_t* = P^{−1} e_t has a diagonal covariance matrix, e_t* ∼ (0, I_N). The ordering of the equations determines the outcome, and the causal ordering of the residual shocks. With N = 3 there are 3! = 6 possible orderings and outcomes, which can be more or less different.³

³ Early VAR modelers did not recognize the need for orthogonalization. Thus papers from the first part of the 1980s must be read with some care.
Set up a recursive system. Instead of letting the computer do all of the job, you can set up the matrix B^{−1} so that the residuals form a recursive system, by deciding on an ordering of the equations that corresponds to the ordering and residual correlations created by the Cholesky decomposition. Thus, the residual in equation one is not affected by the other two (meaning that x_{1t} is not explained by x_{2t} or x_{3t}). The second residual is only affected by the first residual. And finally, the last (third) residual is affected by residuals one and two. Econometric programs often include Cholesky decomposition routines in combination with the analysis of VAR models. By changing the ordering of the equations it becomes possible to compare the effects of different recursive orderings of the variables. The problem is that we are drowning in output as the dimension of the VAR increases.
Structural VAR models (SVAR). If economic theory does not suggest a recursive ordering, use economic theory to impose restrictions on the B^{−1} matrix. This is called Structural Vector Autoregressive (SVAR) modeling. In practice the approach implies formulating a small structural (static) economic system for the residual process e_t. If y_t is a p-dimensional system, the error covariance matrix contains a total of p² parameters, leading to the estimation of p(p + 1)/2 parameters, equal to the number of restrictions necessary for the matrix B^{−1}. As an example, for a 3 variable system, a recursive error process could be set up as

e_{1t} = ν_{1t},
e_{2t} = c_{21} ν_{1t} + ν_{2t},
e_{3t} = c_{31} ν_{1t} + c_{32} ν_{2t} + ν_{3t},   (9.11)

or, as an alternative with restrictions suggested by theory,

e_{1t} = ν_{1t} + c_{13} ν_{3t},
e_{2t} = c_{21} ν_{1t} + ν_{2t},
e_{3t} = c_{31} ν_{1t} + ν_{3t}.   (9.12)
Only with orthogonalized residuals can one argue that the shock is unique, coming only from that particular variable. Without orthogonalization the shock can be a mixture of effects from different variables, and not a "clean" shock.
One controversy here is that it is up to the econometrician to identify and label the shocks as, for instance, demand or supply shocks. The basis for such labeling might not be strong. Further, by definition, the errors include not only structural relations but also everything that we do not know or understand about the system. For that reason it might be better to use economic theory to identify structural relations and build conventional econometric models instead, rather than trying to analyse what we do not understand. On the other hand, in a world of rational expectations where the expectations generating mechanism is unknown, or cannot be modelled, VAR models are the best we can do.
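A sketch of how the Cholesky decomposition orthogonalizes a residual covariance matrix; the covariance matrix below is made up for the example.

import numpy as np

# Hypothetical residual covariance matrix of a 3-variable VAR
Sigma = np.array([[1.00, 0.40, 0.20],
                  [0.40, 1.20, 0.30],
                  [0.20, 0.30, 0.90]])

P = np.linalg.cholesky(Sigma)   # lower triangular, with P @ P.T == Sigma

# Transforming the residuals by P^{-1} yields an identity covariance matrix:
# cov(P^{-1} e) = P^{-1} Sigma P^{-1}' = I
P_inv = np.linalg.inv(P)
print(np.round(P_inv @ Sigma @ P_inv.T, 10))

# The triangularity of P is what imposes the recursive (causal) ordering;
# reordering the variables changes P and hence the implied shocks.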
9.0.1 A Practical Modelling Strategy

First you think about your system: what is it that you want to explain, and how could it be modelled as a recursive system? Second, you estimate the equations by OLS, with the same lag length on all variables across the equations, to avoid using the SUR estimation technique. Third, you investigate outliers and shifts and put in the appropriate dummy variables. Fourth, you try to find a short lag structure with white noise residuals. Fifth, if you cannot fulfill 4), you minimize an information criterion; in this case AIC is not the best choice, use BIC or something else.
9.0.2 VAR Models with Integrated Variables

The orthogonalization of the residuals can offer some interesting intellectual challenges, especially in the SVAR approach. If the variables in the VAR are integrated variables which are also co-integrating, we are faced with some interesting problems. In the co-integrated VAR model there will be both stationary shocks and permanent shocks, and identifying these two types in the system is not always easy. If the VAR is of dimension p, there can be at most r co-integrating vectors, 0 ≤ r ≤ p, and p − r common stochastic trends. Juselius (2006) ("The Co-integrated VAR Model", Oxford University Press) shows how an identification of the structural MA model, and orthogonalization of the residuals, can be done both in terms of the short run and the long run of the system.
The VAR(2), with no constants, trends or other deterministic variables, will have the following VECM representation after finding r co-integrating vectors,

Δx_t = Γ Δx_{t−1} + Π x_{t−1} + ε_t,

which can be inverted to

x_t = C Σ_{i=1}^{t} ε_i + C*(L) ε_t + x_0,

where the first factor on the right hand side represents the stochastic trends in the system and the second factor represents the stationary part. The C matrix then represents all that is not stationary, and is related to the co-integration vectors as C = β⊥(α⊥′ Γ β⊥)^{−1} α⊥′.
Finally, remember that the data is the real world, while economic theories are constructions of the human mind (a quote from David Hendry). If you want to use a priori information of some kind you might miss what the data, the real world, is trying to tell you.
Part III
In the bivariate case, Granger causality is examined through the pair of regressions

y_t = Σ_{i=1}^{k} α_i y_{t−i} + Σ_{i=1}^{k} β_i x_{t−i} + e_t,   (9.13)

x_t = Σ_{i=1}^{k} γ_i x_{t−i} + Σ_{i=1}^{k} δ_i y_{t−i} + ν_t,   (9.14)

where e_t and ν_t are each assumed to be a white noise innovation process with respect to all relevant information for explaining the movements of x_t and y_t. This is an important issue which is often forgotten in applied work, where bivariate systems are the rule rather than the exception. Granger's basic definition of non-causality is based on the assumption that all factors relevant for predicting y_t are known. Let I_t represent all relevant information, both past and present, and let X_t be present and past observations on x_t, such that X_t = (x_t, x_{t−1}, x_{t−2}, ..., x_0); I_{t−1} and X_{t−1} represent past observations only. The variable x_t can then be said to Granger cause y_t if the mean square error (MSE) increases when y_t is regressed against the information set with X_{t−1} removed. In the bivariate case, this can be stated as

MSE(ŷ_t | I_{t−1}) < MSE[ŷ_t | (I_{t−1} − X_{t−1})].   (9.15)
Consider the system

y_t = β x_t + ε_{1t},
Δx_t = ε_{2t}.

If ε_{1t} and ε_{2t} are both stationary, it follows that x_t is I(1), and that y_t is I(0) if β = 0. On the other hand, if β ≠ 0, it follows that y_t is I(1). To estimate β it is required that y_t does not simultaneously influence x_t. If y_t or Δy_t is part of the x_t equation (and thus embedded in ε_{2t}), the result is that E(ε_{1t} ε_{2t}) ≠ 0, and we can write ε_{1t} = λ ε_{2t} + u_t, where for simplicity we assume that u_t ∼ N(0, σ²).

Now, if we estimate β with OLS, the outcome would be a biased estimate of β, since E(x_t ε_{1t}) = E[x_t (λ ε_{2t} + u_t)] ≠ 0, and we can no longer assume that x_t and ε_{1t} are independent. This is an example of lack of weak exogeneity. With the first model it is not possible to estimate the parameter of interest β; the outcome from OLS is a different and biased value.
10.1.1 Weak Exogeneity

Weak exogeneity spells out the conditions under which it is possible to obtain unbiased and efficient estimates. The definition is based on splitting the joint density function into a conditional density and a marginal density function,

D_1(y_t, z_t | Y_{t−1}, Z_{t−1}, θ_1) = D_2(y_t | z_t, Y_{t−1}, Z_{t−1}, θ_2) D_3(z_t | Y_{t−1}, Z_{t−1}, θ_3).   (10.1)
10.1.2 Strong Exogeneity

Strong exogeneity spells out the conditions for conditional forecasting and simulation of a model with non-modelled variables. The condition is weak exogeneity plus the requirement that the marginal model does not depend on the endogenous variable. Thus the marginal process must be

D_3(z_t | Y_{t−1}, Z_{t−1}, θ_3) = D_3(z_t | Z_{t−1}, θ_3).   (10.2)
10.1.3 Super Exogeneity

Super exogeneity determines the conditions for using the estimated parameters for policy decisions. The condition is weak exogeneity plus the requirement that the parameters of the conditional model are stable with respect to changes in the marginal model. For instance, if the money supply rule changes, the parameters of the marginal process will also change. If this also leads to changes in the parameters of the conditional model, the conditional model cannot be used to analyse the implications of policy changes. Thus, super exogeneity defines the situations in which the Lucas critique is not valid.
Consider the model

y_t = β_1 x_t + β_2 z_t + ε_t.

The estimated parameters of this model are analysed under the assumption that there is no correlation between the variables. The parameter β_1 is understood as the effect on y_t following a unit change in x_t while holding the other variable in the model (z_t) constant. In the same way β_2 measures the effect on y_t while x_t is held constant. Another way of expressing this is the following: E{y_t | z_t} = β_1 x_t and E{y_t | x_t} = β_2 z_t, which tells us that the effect of one parameter cannot be analysed in isolation from the rest of the model. The effect of z_t in the model is not on y_t in itself; it is on y_t conditional on x_t. Holding, say, x_t constant in the model, while z_t is free to vary, implies that we study the effect on y_t after "removing" the effects of x_t on y_t. If x_t and z_t are correlated it is not possible to keep one of them constant while the other is changing. This is the multicollinearity problem.

The statistical problem is best understood by looking at the OLS variance of β̂_2. The variance is

var(β̂_2) = σ² / [Σ(x_t − x̄)² (1 − ρ²_{xz})],

where ρ_{xz} is the correlation between x_t and z_t.
One way of reducing the collinearity is to reparameterize, for instance as

y_t = β_1 Δx_t + β_3 x_{t−1} + ε_t.   (10.3)

The transformation is just a reparameterization and does not affect the residual term. The parameter β_3 = β_1 + β_2, which gives the long run static solution of the model. Thus we get an estimate of the short run effect on y_t from β_1 and, at the same time, a direct estimate of the static long run solution from β_3. If the collinearity between x_t and x_{t−1} is high, it can be assumed to be quite small between Δx_t and x_{t−1}. Since our final interest in modelling economic time series is to find a well-defined statistical model which mimics the DGP of the variable(s), multicollinearity is not really a problem. We will therefore not deal with this topic any further.
this in advance. S/he must therefore set up the estimated model so that there is a meaningful alternative hypothesis to the stochastic trend (or unit root) hypothesis. A general alternative is to assume that y_t is driven by a combination of t and t².

It is therefore recommendable, if ε_t is white noise, to start with model c. If the t-value on ρ is significant according to the table in Fuller (1976), the null hypothesis of a unit root process is rejected. It follows then that the t-statistics for testing the significance of α and β follow standard distributions. But, as long as the unit root hypothesis (ρ = 0) cannot be rejected, both α and β must be assumed to follow non-standard distributions. Thus, under the hypothesis that ρ = 0, the appropriate distributions for α and β are found in Dickey and Fuller (1980).

In a limited sample it might be wise to compare the outcomes of both model c and model a.

The test is easily extended to higher order unit roots, simply by performing the test on differenced data series.

When will the test go wrong? First, if ε_t is not white noise. In principle, ε_t can be an ARIMA process. In the following, a number of models dealing with this situation are presented. If there is more than one unit root, then testing for one unit root is likely to be misleading. Hence a good testing strategy is to start by testing for two unit roots, which is done by applying the DF-test to the first difference of the series (Δy_t). If a unit root in Δy_t is rejected, one can continue with testing for one unit root, using the series in level form, y_t.
Δy_t = ρ y_{t−1} + Σ_{i=1}^{k} γ_i Δy_{t−i} + ε_t,   (11.1)

or

Δy_t = α + ρ y_{t−1} + Σ_{i=1}^{k} γ_i Δy_{t−i} + ε_t,   (11.2)

or

Δy_t = α + ρ y_{t−1} + β t + Σ_{i=1}^{k} γ_i Δy_{t−i} + ε_t.   (11.3)

The asymptotic test statistic is distributed as the DF-test, and the same recommendation applies to these equations: make sure there is a meaningful alternative hypothesis. Therefore start with the model including both a constant and a trend. The ADF test is better than the original DF-test since the augmentation leads to empirical white noise residuals. As for the DF-test, the ADF test must be set up in such a way that it has a meaningful alternative hypothesis, and higher order integration must be tested before the single unit root case.
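In statsmodels the three ADF variants correspond to the regression argument; a sketch on a hypothetical random walk:

import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(9)
y = np.cumsum(rng.normal(size=300))   # hypothetical I(1) series

# regression: "n" no constant (11.1), "c" constant (11.2), "ct" trend (11.3)
for reg in ("n", "c", "ct"):
    stat, pvalue, usedlag, *_ = adfuller(y, regression=reg, autolag="BIC")
    print(f"model {reg}: ADF = {stat:6.2f}, p = {pvalue:.3f}, lags = {usedlag}")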
The corrected (Phillips-Perron type) test statistic is

Z_t = (S/S_{Tl}) t_ρ − [T(S²_{Tl} − S²) std.er(ρ̂)] / (2 S_{Tl} s),   (11.4)

where

S² = T^{−1} Σ_{t=1}^{T} ε̂_t²,   (11.5)

and

S²_{Tl} = T^{−1} Σ_{t=1}^{T} ε̂_t² + 2T^{−1} Σ_{j=1}^{l} [1 − j/(l + 1)] Σ_{t=j+1}^{T} ε̂_t ε̂_{t−j}.   (11.6)
The KPSS test starts from the regression

y_t = α + βt + e_t,   (11.7)

and bases the test statistic on the partial sums of the residuals,

η = T^{−2} Σ_{t=1}^{T} S_t² / s²(k),   (11.8)

where

S_t = Σ_{i=1}^{t} ê_i,   (11.9)

and

s²(k) = T^{−1} Σ_{t=1}^{T} ê_t² + 2T^{−1} Σ_{s=1}^{k} w(s, k) Σ_{t=s+1}^{T} ê_t ê_{t−s}.   (11.10)

The critical values for the test are given in Kwiatkowski et al. (1992). A Bartlett type window, w(s, k) = 1 − [s/(k + 1)], is used to correct the estimated (sample) test statistic so that it corresponds to the simulated distribution, which is based on white noise residuals. The KPSS test appears to be powerful against the alternative of a fractionally integrated series. That is, a rejection of I(0) does not lead to I(1), as in most unit root tests, but rather to an I(d) process where 0 < d < 1. These types of series are called fractionally integrated. A high value of d implies a long memory process. In contrast to an integrated series, I(1) or I(2) etc., a fractionally integrated series is mean reverting; see Baillie and Bollerslev (1994).
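The KPSS test is also available in statsmodels; note that here the null is stationarity, the reverse of the ADF null. A sketch on two hypothetical series:

import numpy as np
from statsmodels.tsa.stattools import kpss

rng = np.random.default_rng(10)
stationary = rng.normal(size=300)
integrated = np.cumsum(rng.normal(size=300))

for name, series in [("stationary", stationary), ("integrated", integrated)]:
    stat, pvalue, lags, crit = kpss(series, regression="c", nlags="auto")
    print(f"{name}: KPSS = {stat:.3f}, p = {pvalue:.3f}")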
Model 1: y_t = α + βt + ε_{1t},   (11.11)

Model 2: y_t = α + βt + γt² + ε_{2t},   (11.12)

with a test statistic based on

(RSS_1 − RSS_2) / s²(k),   (11.13)

where RSS_1 and RSS_2 are the residual sums of squares from model 1 and 2 respectively, and s²(k) is as above.

We can conclude that among these tests, the ADF test is robust as long as the lag structure is correctly specified. The gains from correcting the estimated residual variance seem to be small.
(1 − L)^d y_t = α + Θ(L) ε_t,   (11.14)

which, if invertible, can be written as

y_t = Σ_{i=0}^{∞} θ_i ε_{t−i},   (11.15)

where ε_t ∼ iid(0, σ²), and Σ_{i=0}^{∞} θ_i² < ∞. If this series also belongs to the class of series which have an ARMA representation, the autocorrelation function will die out exponentially. For an I(1) series the autocorrelation function will display complete persistence; the theoretical autocorrelation function is unity for all lags.
Because the autocorrelation function of an ARMA process dies out exponentially, it can be said to have a relatively short memory compared with series whose autocorrelation functions do not die out as quickly. ARFIMA series therefore represent long memory time series. The ARFIMA model allows the autocorrelation coefficients to exhibit hyperbolic patterns. For d < 1 the series is mean reverting; for −0.5 < d < 0.5 the ARFIMA series is covariance stationary.

For a statistician who is describing the behavior of a time series, an ARFIMA model might offer a better representation than the more traditional ARMA model; see Diebold and Rudebusch (1989) and Sowell (1992). For an econometrician, however, the economic understanding is of equal importance. The standard question in most economic work is whether to use levels or percentage growth rates of the data, to construct models with known distributions. That means deciding whether series are I(0) or I(1). Fractional integration does not affect these problems. It becomes important when we ask specific questions about the type of long-run memory we are dealing with, like whether there is mean reversion in the forward premium, or the real exchange rate, or in asset prices etc. Thus, only when economic theory gives us a reason for testing something else than I(0) and I(1) is fractional integration of real interest.
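The fractional difference (1 − L)^d can be expanded by the binomial series. A minimal sketch computing the first weights for a hypothetical d, showing their slow, hyperbolic decay compared with the ordinary difference d = 1:

import numpy as np

def frac_diff_weights(d, n):
    """First n coefficients of (1 - L)^d via the recursion
    w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k."""
    w = np.empty(n)
    w[0] = 1.0
    for k in range(1, n):
        w[k] = -w[k - 1] * (d - k + 1) / k
    return w

print(frac_diff_weights(1.0, 5))   # [1, -1, 0, 0, 0]: the ordinary difference
print(frac_diff_weights(0.4, 5))   # weights die out only hyperbolically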
A deterministic linear trend model is

y_t = α + βt + ỹ_t,   (12.1)

and a polynomial trend model is

y_t = α + β_1 t + β_2 t² + ... + β_n t^n + ỹ_t.   (12.2)
Deterministic trends are seldom the best choice for economic time series. Instead the non-stationary behaviour is often better described with stochastic trends, which have no fixed trend that can be predicted from period to period. A random walk serves as the simplest example of a stochastic trend. Starting from the model

y_t = y_{t−1} + v_t, where v_t ∼ NID(0, σ²),   (12.3)

repeated substitution backwards leads to

y_t = y_0 + Σ_{i=1}^{t} v_i.   (12.4)

The expression shows how the random walk variable is made up of the sum of all historical white noise shocks to the series. The sum represents the stochastic trend. The variable is non-stationary, but we cannot predict how it changes, at least not by looking at the history of the series. (See also the discussion of random walks in the section about different stochastic processes above.) The stochastic trend term is removed by taking the first difference of the series. In the random walk case this implies that Δy_t = v_t is a stationary variable with constant mean and variance. Variables driven by stochastic trends are also called integrated variables, because the sum process represents the integrated property of these variables.
A generic representation is the combination of deterministic and stochastic trends,

y_t = α + βt + τ_t + ỹ_t,   (12.5)

where τ_t = τ_{t−1} + v_t, v_t is NID(0, σ²), βt is the deterministic trend and ỹ_t is a stationary process representing the stationary part of y_t. In this model, the stochastic trend is represented by Σ_{i=1}^{t} v_i.
An alternative trend representation is segmented deterministic trends, illustrated by the model

y_t = β_1 t_1 + β_2 t_2 + ... + β_k t_k + ỹ_t,   (12.6)

where t_1, t_2, etc. are deterministic trends for different periods, such as wars, or policy regimes such as exchange rate or monetary policy regimes. Segmented trends are an alternative to stochastic trends, see Perron (1989), but the problem is that the identification of these different trends might be ad hoc. Given a suitable choice of trends almost any empirical series can be made stationary, but are the different trends really picking up anything interesting that is not embraced by the assumption of stochastic trends, arising from innovations with permanent effects on the economy?
12.0.1

y_t = α + βx_t + ε_t.   (12.7)

12.0.2
The intuition here is that for two variables to form a meaningful long-run relationship, they must share the same trend. Otherwise they will be drifting away from each other as time elapses. Therefore, to build econometric models which make sense in the long run, we have to investigate the trend properties of the variables and determine the type of trend and whether variables are co-trending and co-integrating or not. In econometric work, trend properties refer to the properties of the sample and how to do inference. It is not a theoretical concept about how economic variables grow in the long run.
Once we have clarified the trend properties, it becomes possible to establish stationary relations and models, and econometric modeling can proceed as usual, and standard techniques for inference can be used.
Definitions:

Definition 1 A series with no deterministic component and which has a stationary and invertible autoregressive moving average (ARMA) representation after differencing d times, but which is not stationary after differencing (d − 1) times, is said to be integrated of order d, denoted x_t ~ I(d).

Definition 2 The components of the vector x_t are said to be co-integrated of order d, b, denoted x_t ~ CI(d, b), if (i) x_t is I(d) and (ii) there exists a non-zero vector β such that β′x_t ~ I(d − b), d ≥ b > 0. The vector β is called the co-integrating vector. (Adapted from Engle and Granger (1987).)
Remark 1 If x_t has more than two elements there can be more than one co-integrating vector β.

Remark 2 The order of integration of the vector x_t is determined by the element which has the highest order of integration. Thus, x_t can in principle contain variables integrated of different orders. A related definition concerns the error correction representation following from co-integration.
Definition 3 A vector time-series x_t has an error-correction representation if it can be expressed as A(L)(1 − L)x_t = −γz_{t−1} + ω_t, where ω_t is a stationary multivariate disturbance term, with A(0) = I, A(1) having only finite elements, z_t = β′x_t, and γ a non-zero vector. For the case where d = b = 1, and with co-integrating rank r, the Granger Representation Theorem holds. (Adapted from Banerjee et al. (1993).)

Remark 3 This definition and the Granger Representation Theorem (Engle and Granger, 1987) tell us that if there is co-integration then there is also an error correction representation, and there must be Granger causality in at least one direction.
12.0.3

Under the general null hypothesis of independent and integrated variables, estimated variances and test statistics do not follow standard distributions. Therefore the way ahead is to test for co-integration, and then try to formulate a regression model (or system) in terms of stationary variables only. Traditionally there are two approaches to testing for co-integration: residual based approaches and other approaches. The first type starts with the formulation of a co-integration regression, a regression model with integrated variables. Co-integration is then determined by investigating the residual(s) from that regression. The Engle and Granger two-step procedure and the Phillips–Ouliaris test are examples of this approach. The other approach is to start from some representation of a co-integrated system (VAR, VECM, etc.) and test for some specific characterization of co-integrated systems. Johansen's VECM approach, or tests for common trends, are examples.
The Engle and Granger two-step procedure is the easiest and most used residual based test. It is used because of its simplicity and ease of use, but it is not a good test. The two-step procedure starts with the estimation of the co-integrating regression. If y_t and x_t are two variables integrated of order one, the first step is to estimate the following OLS regression,

y_t = α + βx_t + z_t,   (12.8)

where the estimated residuals are ẑ_t. If the variables are co-integrating, ẑ_t will be I(0). The second step is to perform an Augmented Dickey–Fuller unit root test on the estimated residuals,
Δẑ_t = α + ρẑ_{t−1} + Σ_{i=1}^{k} γ_i Δẑ_{t−i} + ε_t.   (12.9)
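As an illustration of the two-step procedure, here is a sketch using simulated data and the statsmodels library (this example is ours, not part of the original text; note that statsmodels also provides coint(), which applies the correct Engle–Granger critical values):

import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(1)
x = np.cumsum(rng.normal(size=300))        # x_t ~ I(1)
y = 1.0 + 0.5 * x + rng.normal(size=300)   # y_t co-integrated with x_t

# Step 1: OLS estimation of the co-integrating regression (12.8).
step1 = sm.OLS(y, sm.add_constant(x)).fit()
z_hat = step1.resid

# Step 2: ADF-type test (12.9) on the residuals. The usual ADF tables
# do not apply to estimated residuals; Engle-Granger critical values
# must be used for a correct test.
adf_stat = adfuller(z_hat, regression="n")[0]
print(adf_stat)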
good when it does not hold. The dynamics of the two processes and their possible co-integrating relation are usually more complex.
Third, the test assumes that there is only one co-integrating vector. If we test for co-integration between two variables this is not a problem, because then there can be only one co-integrating vector. Suppose that we add another I(1) variable (u_t) to the co-integrating regression equation,

y_t = α + β_1 x_t + β_2 u_t + υ_t.   (12.10)

If y_t and x_t are co-integrating, they already form one linear combination (z_t) which is stationary. If u_t ~ I(1) is not co-integrating with the other variables, OLS will set β_2 to zero, and the estimated residual υ̂_t is I(0). This is why the test will only work if there is only one co-integrating vector among the variables. If y_t and x_t are not co-integrating, then adding u_t ~ I(1) might lead to a co-integrating relation. Thus, in this respect the test is limited, and testing must be done by creating logical chains of bi-variate co-integration hypotheses.
Other residual based tests try to solve at least the first problem by adjusting the test statistics in the second step, so that the criteria for testing the null are always fulfilled. Some approaches try to transform the co-integrating regression in such a way that the estimated parameters follow a standard normal distribution.
A better alternative for testing for co-integration among more than two variables is offered by Johansen's test. This test finds long-run steady-state, or co-integrating, relations in the VAR representation of a system. Let the VAR,

A_k(L)x_t = δD_t + ε_t,   (12.11)
represent the system. The VAR is a p-dimensional system, the variables are assumed to be integrated of order d, {x_t} ~ I(d), and D_t is a vector of deterministic variables (constants, dummies, seasonals and possibly trends), with δ the associated coefficient matrix. The residual process is normally distributed white noise, ε_t ~ NID(0, Σ). It is important to find the optimal lag length in the VAR and to have normally distributed white noise error terms, because the test uses a full information maximum likelihood (FIML) estimator. FIML estimators are notoriously sensitive to small samples and misspecification, which is why care must be taken in the formulation of the VAR. Once the VAR has been found, it can be rewritten in error correction form,
in error correction form,
xt =
xt
k
X
xt
Dt + "t
(12.12)
i=1
In practical use the problem is to formulate the VAR; the program will rewrite the VAR for the user automatically. Johansen's test builds on the result that if x_t is I(d), co-integration implies that there exist vectors β such that β′x_t ~ I(d − b). In a practical situation we will assume that x_t ~ I(1) and, if there is cointegration, β′x_t ~ I(0). If there is cointegration, the matrix Π must have reduced rank. The rank of Π indicates the number of independent rows in the matrix. Thus, if x_t is a p-dimensional process, the rank (r) of the Π matrix determines the number of co-integrating vectors (β), or the number of linear steady state relations among the variables in {x_t}. Zero rank (r = 0) implies no cointegrating vectors, full rank (r = p) means that all variables are stationary, while a reduced rank (0 < r < p) means cointegration and the existence of r co-integrating vectors among the variables.
The procedure is to estimate the eigenvalues of Π and determine their significance.¹
¹The test is called the Trace test; its use is explained in the separate memo "A Guide to testing for unit roots and cointegration".
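In practice the eigenvalues and the Trace test can be obtained from standard software. A minimal sketch with simulated data, using the coint_johansen function from statsmodels (our choice of tool, not one named in the text):

import numpy as np
from statsmodels.tsa.vector_ar.vecm import coint_johansen

rng = np.random.default_rng(2)
trend = np.cumsum(rng.normal(size=400))       # one common stochastic trend
x1 = trend + rng.normal(size=400)
x2 = 0.5 * trend + rng.normal(size=400)       # co-integrated with x1

# det_order=0 adds a constant; k_ar_diff is the number of lagged differences.
res = coint_johansen(np.column_stack([x1, x2]), det_order=0, k_ar_diff=2)
print(res.lr1)   # trace statistics for H0: r <= 0 and H0: r <= 1
print(res.cvt)   # 90/95/99 per cent critical values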
To see how the error correction form comes about, start from a bivariate VAR(2) in y_t and z_t,

y_t = a_11 y_{t−1} + a_12 y_{t−2} + a_13 z_{t−1} + a_14 z_{t−2} + e_1t,   (12.13)
z_t = a_21 y_{t−1} + a_22 y_{t−2} + a_23 z_{t−1} + a_24 z_{t−2} + e_2t.   (12.14)

Subtracting y_{t−1} from both sides of the first equation gives

Δy_t = (a_11 − 1)y_{t−1} + a_12 y_{t−2} + a_13 z_{t−1} + a_14 z_{t−2} + e_1t.

Since the equation was correctly specified from the beginning, it can be transformed as long as we do not do anything that affects the properties of the error term. Our aim is to split all lag terms into first differences and lagged levels in such a way that the model consists of one level lag at t − 1 for all variables plus first differences. We can do this by using the difference operator Δ = (1 − L), which can be used as Δy_t = y_t − y_{t−1}, or Δy_{t−1} = y_{t−1} − y_{t−2}. Referring to the operators we have L = (1 − Δ), or Ly_t = (y_t − Δy_t). If we apply this to all lags of lower order than t − 1, we get for t − 2 the following: y_{t−2} = y_{t−1} − Δy_{t−1}, and z_{t−2} = z_{t−1} − Δz_{t−1}. Substitute this into the equation to get

Δy_t = (a_11 − 1)y_{t−1} + a_12(y_{t−1} − Δy_{t−1}) + a_13 z_{t−1} + a_14(z_{t−1} − Δz_{t−1}) + e_1t
     = (a_11 + a_12 − 1)y_{t−1} + (a_13 + a_14)z_{t−1} − a_12 Δy_{t−1} − a_14 Δz_{t−1} + e_1t.

The same transformation applied to the second equation yields the corresponding expression for Δz_t, with (a_21 + a_22)y_{t−1} and (a_23 + a_24 − 1)z_{t−1} as the level terms. Stacking the two equations gives the vector error correction form

Δx_t = Πx_{t−1} + Σ_{i=1}^{k−1} Γ_i Δx_{t−i} + ε_t,   (12.15)

(here k = 2, so there is a single lagged difference term), where

Δx_t = (Δy_t, Δz_t)′,  x_{t−1} = (y_{t−1}, z_{t−1})′,  Π = [π_11 π_12; π_21 π_22],

and the elements of Π are the sums of the level coefficients above. If there is cointegration, Π has reduced rank and can be factored as Π = αβ′, with α = (α_11, α_21)′ the adjustment coefficients.
Consider a stationary process for the first differences of y_t,

Δy_t = C(L)ε_t,  ε_t ~ iid(0, σ²),   (13.2)

so that the level of the series can be written as

y_t = [1/(1 − L)]C(L)ε_t,   (13.3)

where 1/(1 − L) represents the sum of an infinite series. For a limited sample, we get approximately

y_t = y_0 + (1 + L + L² + ... + L^{t−1})C(L)ε_t.   (13.4)

Differencing the level expression returns the stationary representation,

Δy_t = (1 − L)y_t = C(L)ε_t.   (13.6)
Let us see what happens with the process in the future. From above we get the MA representation for some future period t + h,

y_{t+h} = y_0 + Σ_{i=1}^{t+h} [ Σ_{j=0}^{t+h−i} C_j ] ε_i   (13.7)

        = y_0 + Σ_{i=1}^{t} [ Σ_{j=0}^{t+h−i} C_j ] ε_i + Σ_{i=t+1}^{t+h} [ Σ_{j=0}^{t+h−i} C_j ] ε_i.   (13.8)
The forecasts are decomposed into what is known at time t, the first double sum, and what is going to happen between t and t + h. The latter is unknown at time t; therefore we have to form the conditional forecast of y_{t+h} at time t,

y_{t+h|t} = y_0 + Σ_{i=1}^{t} [ Σ_{j=0}^{t+h−i} C_j ] ε_i.

The effect of a shock today (at time t) on future periods is found by taking the derivative of the above expression with respect to a change in ε_t,

∂y_{t+h|t}/∂ε_t = Σ_{j=0}^{h} C_j → C(1) as h → ∞.   (13.9)

Thus, the long-run effect of a shock today can be expressed by the static long run solution of the MA representation of y_t (equal to the sum of the MA coefficients).
The persistence of a shock depends on the value of C(1). If C(1) happens to be 0, there is no long-run effect of today's shock. Otherwise we have three cases: C(1) greater than 0 but less than unity, C(1) = 1, or C(1) greater than unity. If C(1) is greater than 0 but less than unity, the shock will die out in the future. If C(1) = 1, the integrated variables (unit roots) case, a shock will be as important today as it is for all future periods. Finally, if C(1) is greater than one (explosive roots) the shock magnifies into the future, and we have an unstable system.
If the series are truly I(1), spectral analysis can be applied to measure the persistence of a shock exactly. The persistence of shocks has interesting implications for economic policy. If shocks are very persistent, or explosive in some cases, it might be a good policy to try to avoid negative shocks, but create positive shocks. In our stabilization policy example, this can be understood as saying that the authorities should be careful with deflationary policies, for instance, since they might result in high and persistent social costs; see Mankiw and Shapiro (198x) for a discussion of these issues.
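For a given specification, C(1) can be computed as the limit of the cumulative impulse responses. A small sketch, assuming for illustration that the differences follow an AR(1) with coefficient 0.5 (our own example, not from the text):

import numpy as np
from statsmodels.tsa.arima_process import arma2ma

# Differences follow (1 - 0.5L) dy_t = e_t, so C(L) = 1/(1 - 0.5L).
c = arma2ma(ar=np.array([1.0, -0.5]), ma=np.array([1.0]), lags=200)

# The cumulative impulse response converges to C(1) = 1/(1 - 0.5) = 2:
# a unit shock today permanently raises the level of y_t by about 2.
print(c.cumsum()[-1])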
In the following, the MA representation of systems of integrated processes is analysed. For this purpose let y_t be a vector of I(1) variables. Using the lag operator as L = 1 − (1 − L) = 1 − Δ,² Wold's decomposition theorem gives

y_t = C(L)ε_t = C(1)ε_t + ΔC*(L)ε_t.   (13.10)

If y_t is a vector of I(1) variables, then we know from above that if the matrix C(1) ≠ 0, any shock to the series has infinite effects on future levels of y_t. Let us consider a linear combination of these variables, β′y_t = z_t. Multiplication of the expression with β′ gives

β′y_t = z_t = β′C(1)ε_t + β′ΔC*(L)ε_t.   (13.11)

In general it is the case that when y_t is I(1), z_t is I(1) as well. Thus a linear combination of integrated variables will also be integrated, implying that β′C(1)
²This is the same as y_t = Δy_t + Ly_t = [(1 − L) + L]y_t.
is different from zero. Suppose, however, that there exists a matrix β such that β′C(1) = 0, which implies that when β′ is multiplied with y_t we get a stationary process, β′y_t ~ I(0).
As an example, consider private aggregate consumption and private aggregate (disposable) income. Both variables could be random walks, but what about the difference between these variables? Is it reasonable to assume that a linear combination of them could be driven by a stochastic trend, meaning that consumption would deviate permanently from income in the long run? The answer is no. In the long run it is not likely that a person consumes more than his/her income, nor is it likely that a person will save more and more. Thus, we have to think of situations where two variables cointegrate, that is, when a linear combination of I(1) series forms a new stationary series, integrated of order zero. (A formal definition of co-integration was given above.)
In terms of the C(1) matrix, common trends or co-integration implies that there exists a matrix β such that β′C(1) = 0, hence we get

β′y_t = z_t = β′ΔC*(L)ε_t.   (13.12)

Decomposing the level of the series accordingly,

y_t = y_0 + (1 + L + L² + ... + L^{t−1})[C(1) + (1 − L)C*(L)]ε_t   (13.13)
    = y_0 + C(1)(1 + L + L² + ... + L^{t−1})ε_t + C*(L)ε_t.   (13.14)

If we have cointegration, and therefore common trends, C(1) must be of reduced rank. The matrix C(1) can be thought of as consisting of two sub-matrices, such that C(1) = AJ, where J defines the common trends,

τ_t = (1 + L + L² + ... + L^{t−1})Jε_t,   (13.15)

so that the levels can be written as a common trends representation,

y_t = y_0 + Aτ_t + C*(L)ε_t.   (13.16)
14. A DEEPER LOOK AT JOHANSEN'S TEST

Start from a k-th order VAR for the p-dimensional vector y_t,

A(L)y_t = δD_t + ε_t,   (14.1)

which can be rewritten in error correction form as

Δy_t = Σ_{i=1}^{k−1} Γ_i Δy_{t−i} + Πy_{t−k} + δD_t + ε_t,   (14.2)

where

Γ_i = −(I − Σ_{j=1}^{i} A_j),   (14.3)

and

Π = −(I − Σ_{j=1}^{k} A_j) = −A(1).   (14.4)
Notice that in this example the system was rewritten such that the variables in levels (y_{t−k}) ended up at the k:th lag. As an alternative it is possible to rewrite the system such that the levels enter the expression at the first lag, followed by k − 1 lags of Δy_{t−i}. The two ways of rewriting the system are identical; the preferred form depends on one's preferences.
Since y_{t−k} is integrated of order one and Δy_t is stationary, it follows that there can be at most p − 1 steady state relationships between the non-stationary variables in y_t. Hence, p − 1 is the largest possible number of linearly independent rows in the Π-matrix. The latter is determined by the number of significant eigenvalues in the estimated matrix Π̂ = −Â(1). Let r be the rank of Π. Then rank(Π) = 0 implies that there are no combinations of the variables that lead to stationarity; in other words, there is no cointegration. If we have rank(Π) = p, the Π matrix is said to have full rank, and all variables in y_t must be stationary. Finally, reduced rank, 0 < r < p, means that there are r co-integrating vectors in the system. Once a reduced rank has been determined, the Π matrix can be written as Π = αβ′, where β′y_t represents the vectors of co-integrating relations, and α is a matrix of adjustment coefficients measuring the strength by which each co-integrating vector affects an element of Δy_t. Whatever the co-integrating vectors β′y_t are referred to as, the estimation proceeds by concentrating out the short-run dynamics:
Δy_t = Σ_{i=1}^{k−1} Γ̂_{1,i} Δy_{t−i} + δ̂_1 D_t + R_{1t},   (14.5)

y_{t−k} = Σ_{i=1}^{k−1} Γ̂_{2,i} Δy_{t−i} + δ̂_2 D_t + R_{kt}.   (14.6)
The system in 14.1 can now be written in terms of the residuals above as

R_{1t} = αβ′R_{kt} + e_t.   (14.7)

The vectors α and β can now be estimated by forming the product moment matrices S_11, S_kk and S_1k from the residuals R_{1t} and R_{kt},

S_ij = T^{−1} Σ_{t=1}^{T} R_it R′_jt,  i, j = 1, k.   (14.8)

For fixed β vectors, α is given by α̂(β) = S_1k β(β′S_kk β)^{−1}, and the sum of squares function is Σ̂(β) = S_11 − α̂(β)(β′S_kk β)α̂(β)′. Minimizing this sum of squares function leads to maximum likelihood estimates of α and β. The estimates of β are found by solving the eigenvalue problem,

| λS_kk − S_k1 S_11^{−1} S_1k | = 0,   (14.9)
where λ is a vector of eigenvalues. The solution leads to estimates of the eigenvalues (λ̂_1, λ̂_2, ..., λ̂_p) and the corresponding eigenvectors V̂ = (v̂_1, v̂_2, ..., v̂_p), normalized such that V̂′S_kk V̂ = I. The size of the eigenvalues (λ_i) tells us how much each linear combination of eigenvectors and variables, v̂′_i y_{t−k}, is correlated with the conditional process R_{1t} (Δy_t given Δy_{t−i} and D_t). The number of non-zero eigenvalues (r) determines the rank of Π and leads to the co-integrating vectors of the system, while the number of zero eigenvalues (p − r) defines the common trends in the system. These are the combinations of v̂′_i y_t that determine the directions in which the process is non-stationary. Given that 14.1 is a well-defined statistical model, it is possible to determine the distribution of the estimated eigenvalues under different assumptions about the number of co-integrating vectors in the model. The distributions of the eigenvalues depend not only on 14.1 being a well-defined statistical model, but also on the number of variables, the inclusion of constant terms in the co-integrating vectors and deterministic trends in the equations. Distributions for different models are tabulated in Johansen (1995).
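The mechanics of (14.5)–(14.9) can be sketched in a few lines of Python. This is an illustrative implementation of the eigenvalue problem only, not production code; for applied work, statsmodels' coint_johansen handles the deterministic terms and critical values.

import numpy as np
from scipy.linalg import eigh

def johansen_eigenvalues(y, k=2):
    # y: T x p array of levels; k: lag order of the VAR.
    dy = np.diff(y, axis=0)
    T = len(y) - k
    dY = dy[k - 1:]            # Delta y_t
    Yk = y[:T]                 # y_{t-k}, aligned with dY
    # Short-run regressors: Delta y_{t-1}, ..., Delta y_{t-k+1} plus a constant.
    X = np.column_stack([dy[k - 1 - i:-i] for i in range(1, k)] + [np.ones((T, 1))])
    M = np.eye(T) - X @ np.linalg.pinv(X)      # residual-maker matrix
    R1, Rk = M @ dY, M @ Yk                    # R_1t and R_kt in (14.5)-(14.6)
    S11, Skk, S1k = R1.T @ R1 / T, Rk.T @ Rk / T, R1.T @ Rk / T
    A = S1k.T @ np.linalg.inv(S11) @ S1k       # S_k1 S_11^{-1} S_1k
    lam = eigh(A, Skk, eigvals_only=True)      # solves (14.9)
    return np.sort(lam)[::-1]                  # eigenvalues, largest first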
The maximized log likelihood, conditional on the short run dynamics and the deterministic variables of the model, is

ln L = constant − (T/2) ln|S_11| − (T/2) Σ_{i=1}^{r} ln(1 − λ̂_i).   (14.10)

From this expression two likelihood ratio tests for determining the number of non-zero eigenvalues are formulated. The first test concerns the hypothesis that the number of eigenvalues is less than or equal to some given number (q), H_0: r ≤ q, against an unrestricted model where H_1: r ≤ p. The test is given by

−2 ln(Q; q | p) = −T Σ_{i=q+1}^{p} ln(1 − λ̂_i).   (14.11)

The second test concerns the hypothesis H_0: r ≤ q against H_1: r = q + 1, and is given by

−2 ln(Q; q | q + 1) = −T ln(1 − λ̂_{q+1}).   (14.12)
The moving average representation of the co-integrated system is

y_t = C(L)(ε_t + δD_t).   (14.14)

Decompose C(L) into permanent and transitory parts,

C(L) = C(1) + (1 − L)C*(L),   (14.15)

so that

y_t = C(1) Σ_{i=1}^{t} ε_i + C(1)δ Σ_{i=1}^{t} D_i + C*(L)(ε_t + δD_t).   (14.16)

The Granger representation theorem implies that C(1) has the form

C(1) = β_⊥(α′_⊥ Γβ_⊥)^{−1} α′_⊥,¹   (14.17)

where ⊥ denotes the orthogonal complement.
If the variables are I(2), the model is extended with a second difference term,

Δ²y_t = Πy_{t−1} + ΓΔy_{t−1} + Σ_{i=1}^{k−2} Ψ_i Δ²y_{t−i} + δD_t + ε_t,   (14.18)

where Γ = Σ_{i=1}^{k−1} Γ_i − I + Π, and Ψ_i = −Σ_{j=i+1}^{k−1} Γ_j, for i = 1, ..., k − 2.
If y_t is I(2) and Δy_t is I(1), a reduced rank condition for the Π matrix must be combined with a reduced rank condition for the matrix of first differences as well. Johansen (1991) shows that the condition for an I(2) process is

α′_⊥ Γβ_⊥ = φη′,   (14.19)

where φ and η are (p − r) × s matrices with rank s. With I(2) variables, β′y_t is I(1). To make these vectors stationary they have to be combined with the vectors of first differences (β′_⊥2 Δy_t) to form stationary processes, where β_⊥2 denotes the part of the orthogonal complement associated with the I(2) directions; the β_⊥2 vectors indicate which variables are I(2).
An I(2) model is estimated in a way similar to the I(1) model. Maximum likelihood estimation is feasible since the residual terms of an I(2) model can be assumed to be a Gaussian process. The first step is to perform a reduced rank regression for the I(1) model of Δy_t on y_{t−1}, corrected for the short run dynamics (Δy_{t−1}, ..., Δy_{t−k+1}) and the deterministic components (D_t). This leads to estimates of r̂, α̂ and β̂.
In the second step, given the estimates of r̂, α̂ and β̂, a reduced rank test is performed of α̂′_⊥ Δ²y_t on β̂′_⊥ Δy_{t−1}, corrected for Δ²y_{t−1}, ..., Δ²y_{t−k+2} and the constant terms. This leads to the estimates ŝ, φ̂ and η̂.
An I(2) process is harder to analyze in economic terms since the parameters and the test hypotheses have different interpretations. The tests concerning the β vectors are in general only valid for I(1) processes. It is, however, possible to form stationary relations by combining levels (β̂′y_t) with first difference expressions (β̂′_⊥2 Δy_t). The practical solution is to identify the I(2) terms and find ways of transforming them to I(1) relations. The transformation to an I(1) system can be done by taking first differences of I(2) variables or by taking ratios of variables: modeling the real money stock rather than the money stock and the price level separately.
¹An orthogonal vector is often indicated by the sign ⊥ attached to the original vector. The vector β_⊥ is the orthogonal complement of the vector β if β′_⊥β = 0.
Consider the linear regression model

y = Xβ + ε,   (15.1)

ε ~ NID(0, σ²I),   (15.2)

together with the condition that X has full column rank,

rank(X) = k,   (15.4)
which ensures that the inverse of (X′X) exists. Minimizing the sum of squared residuals leads to the following OLS estimator of β,

β̂ = (X′X)^{−1}(X′y) = β + (X′X)^{−1}(X′ε),   (15.5)

or, in terms of sample moments,

β̂ = β + [ (1/T) Σ_{t=1}^{T} x_t x′_t ]^{−1} [ (1/T) Σ_{t=1}^{T} x_t ε_t ].   (15.6)
The estimated parameter β̂ is equal to its true value plus an additional term. For the estimate to be unbiased, the last factor must be zero. If we assume that the x's are deterministic, the problem is relatively easy. A correct specification of the model, E(ε) = 0, leads to the result that β̂ is unbiased.
The parameter has the variance

Var(β̂) = E[(β̂ − β)(β̂ − β)′] = (X′X)^{−1}X′(σ²I)X(X′X)^{−1} = σ²(X′X)^{−1}.   (15.7)

Taking expectations of the estimator,

E(β̂) = β + E[(X′X)^{−1}(X′ε)] = β + (X′X)^{−1}E(X′ε),   (15.8)

which equals β when the errors satisfy

ε ~ NID(0, σ²I).   (15.9)
For a regression on a deterministic trend, the corresponding decomposition is

β̂ = β + [ (1/T) Σ_{t=1}^{T} t̃² ]^{−1} [ (1/T) Σ_{t=1}^{T} t̃ ε_t ].   (15.11)

Taking expectations leads to the result that β̂ is unbiased. The most important reason why this regression works well is that there is an additional t̃ variable in the denominator. As T goes to infinity the denominator grows faster than the numerator, so the ratio goes to zero much faster than otherwise.
With stochastic regressors, the same decomposition applies,

β̂ = β + [ Σ_{t=1}^{T} x_t x′_t ]^{−1} [ Σ_{t=1}^{T} x_t ε_t ].   (15.15)
The necessary conditions are that {x_t}_1^T is a stationary process and that {x_t}_1^T and {ε_t}_1^T are independent. The first condition means that we can view the covariance matrix (X′X) as fixed in repeated samples. In a time series perspective we cannot generally talk about repeated samples; instead we have to look at the sample moments as T → ∞. If x_t is a stationary variable then we can state that, as T → ∞, the covariance matrix will become a constant. This can be written as

(1/T) Σ_{t=1}^{T} x_t x′_t →_p Q.   (15.16)

The second condition is

(1/T) Σ_{t=1}^{T} x_t ε_t →_p 0,   (15.17)

or, alternatively,

plim (1/T)(X′ε) = 0.   (15.18)

The intuition behind this result is that, because ε_t is zero on average, we are multiplying x_t with zero. It follows then that the average of (x_t ε_t) will be zero.
The practical implication is that, given a sufficiently large sample, the OLS estimator will be unbiased, efficient and consistent even when the explanatory variables are stochastic. If ε_t ~ NID(0, σ²), we also have, conditional on the stochastic process {x_t}_1^T, that the estimated β is distributed as

β̂ | x_t ~ N[β, σ²(X′X)^{−1}],   (15.19)

and

(β̂ − β) | x_t ~ N(0, σ²(X′X)^{−1}).   (15.20)
The estimated variance can vary with t since Q_t varies with time. To establish that OLS is a consistent estimator, we need to establish that

[ (1/T) Σ_{t=1}^{T} x_t x′_t ]^{−1} = Q_T^{−1} →_p Q^{−1}.   (15.22)
The condition holds if {x_t}_1^T is covariance stationary: as T goes to infinity the estimate will converge in probability (→_p) to a constant. The second condition is that the sum Σ_{t=1}^{T} x_t ε_t converges in probability to zero, which takes place whenever x_t and ε_t are independent. The error process is iid, but not necessarily normal. Under the conditions given here, the central limit theorem is sufficient to establish that the sequence {Σ_{t=1}^{T} x_t ε_t} converges (weakly in distribution) to a normal distribution,

(1/T) Σ_{t=1}^{T} x_t ε_t →_d N(0, σ²Q),   (15.23)

so that (β̂_T − β) is asymptotically normally distributed,

(β̂_T − β) ~ N(0, σ²Q^{−1}),   (15.24)

where

(β̂_T − β) = [ (1/T) Σ_{t=1}^{T} x_t x′_t ]^{−1} [ (1/T) Σ_{t=1}^{T} x_t ε_t ].   (15.26)
Since (1/T) = (1/T^{1/2})(1/T^{1/2}), the CLT can be invoked by rewriting the expression as

T^{1/2}(β̂_T − β) = [ (1/T) Σ_{t=1}^{T} x_t x′_t ]^{−1} [ (1/T^{1/2}) Σ_{t=1}^{T} x_t ε_t ],   (15.27)

where the LHS and the second factor on the RHS correspond to the CLT. From that factor we get, as T goes to infinity,

(1/T^{1/2}) Σ_{t=1}^{T} x_t ε_t ⇒ N(0, σ²Q).   (15.28)

Moreover, we can also conclude that the rate of convergence is given by T^{1/2}. Dividing the RHS of the OLS estimator by (1/T^{1/2}) leaves (1/T^{1/2}) in the denominator, which then represents the speed by which the estimate β̂_T converges to its true value β.
Consider next the estimation of an autoregressive model,

y_t = φy_{t−1} + ε_t,   (15.29)

where ε_t ~ iid(0, σ²). (The estimation of AR(p) models follows from this example in a straightforward way.) The estimated φ is
"
^= T
T
X
yt
t=1
"
T
X
yt
1 yt
t=1
(15.30)
leading to
"
= T
T
X
yt
t=1
"
T
X
yt
1 t
t=1
(15.31)
This is similar to the stochastic regressor case, but here {y_{t−1}} and {ε_t} cannot be assumed to be independent, so E(y_{t−1}ε_t) need not factor as E(y_{t−1})E(ε_t), and φ̂ can be biased in a limited sample. The dependence can be explained as follows: ε_t is part of y_t, and y_t is through the AR(1) process correlated with y_{t+1}, so y_{t+1} is correlated with ε_t. The long-run covariance (lrcov) between y_{t−1} and ε_t is defined as

lrcov(y_{t−1}, ε_t) = T^{−1} Σ_{t=1}^{T} y_{t−1}ε_t + Σ_{k=1}^{∞} E(y_{t−1}ε_{t+k}) + Σ_{k=1}^{∞} E(y_{t−1+k}ε_t),   (15.32)

where the first term on the RHS is the sample estimate of the covariance, and the last two terms capture leads and lags in the cross correlation between y_{t−1} and ε_t. As long as y_t is covariance stationary and ε_t is iid, the sample estimate of the covariance will converge to its true long-run value.
This dependence from ε_t to y_{t+1} is not of major importance for estimation. Since (y_{t−1}ε_t) is still a martingale difference sequence with respect to the history of y_t and ε_t, we have that E{y_{t−1}ε_t | y_{t−2}, y_{t−3}, ..., ε_{t−1}, ε_{t−2}, ...} = 0, so it can be established in line with the CLT that
(1/T^{1/2}) Σ_{t=1}^{T} y_{t−1}ε_t ⇒ N(0, σ²Q).   (15.33)

Using the same assumptions and notation as above, the variance is given by E(y_{t−1}ε_t ε_t y_{t−1}) = E(ε²_t)E(y_{t−1}y_{t−1}) = σ²Q_t. These results are sufficient to establish that OLS is a consistent estimator, though not necessarily unbiased in a limited sample. It follows that the distribution of the estimated φ, and its rate of convergence, is as above. The results are the same for higher order stochastic difference models.
Things change when the residual process is serially correlated. Suppose that

y_t = φy_{t−1} + ε_t,  with ε_t = ρε_{t−1} + v_t,  v_t ~ iid(0, σ²_v).

The critical cross moment is now

Σ_{t=1}^{T} y_{t−1}ε_t = Σ_{t=1}^{T} [y_{t−1}(ρε_{t−1} + v_t)] = ρ Σ_{t=1}^{T} y_{t−1}ε_{t−1} + Σ_{t=1}^{T} y_{t−1}v_t.

Substituting ε_{t−1} = y_{t−1} − φy_{t−2} gives

Σ_{t=1}^{T} y_{t−1}ε_t = ρ Σ_{t=1}^{T} y²_{t−1} − ρφ Σ_{t=1}^{T} y_{t−1}y_{t−2} + Σ_{t=1}^{T} y_{t−1}v_t,   (15.35)

so that

E[ (1/T) Σ_{t=1}^{T} y_{t−1}ε_t ] = ρ var(y_t) − ρφ cov(y_{t−1}, y_{t−2}) + cov(y_{t−1}, v_t),   (15.36)
which establishes that the OLS estimator is biased and inconsistent. Only the last covariance term can be assumed to go to zero as T goes to infinity.
In this situation OLS is always inconsistent. Thus, the conclusion is that with a lagged dependent variable OLS is only feasible if there is no serial correlation in the residual. There are two solutions in this situation: to respecify the equation so that the serial correlation is removed from the residual process, or to turn to an iterative ML estimation of the model (y_t − φy_{t−1} − ρε_{t−1} = v_t). The latter specification implies common factor restrictions which, if not tested, are an ad hoc assumption. The approach was extremely popular in the late 70s and early 80s, when people used to rely on a priori assumptions in the form of adaptive expectations or costly adjustment, as examples, to derive their estimated models. Often economists started from a static formulation of the economic model and then added assumptions about expectations or adjustment costs. These assumptions could then lead to an infinite lag structure with white noise residuals. To estimate the model these authors called upon the so-called Koyck transformation to reduce the model to a first order autoregressive stochastic difference model, with an assumed first order serially correlated residual term.
Suppose that the sample only consists of two observations, x_1 and x_2. The joint density function for these two observations can be factorised as

D(x_1, x_2) = D_1(x_2 | x_1) D_2(x_1).   (15.38)
With three observations, the joint probability density function is equal to the density function of x_3, conditional on x_2 and x_1, multiplied by the conditional density for x_2, multiplied by the marginal density for x_1.
It follows that for a sample of T observations, the likelihood function can be written as

L(θ; x) = [ Π_{t=2}^{T} D(x_t | X_{t−1}; θ) ] f(x_1),   (15.41)
The problem with estimating the random walk model,

y_t = ρy_{t−1} + ε_t,  with ρ = 1,   (15.42)

is that the estimated ρ does not follow a normal distribution, not even asymptotically. The problem here is not inconsistency, but the nonstandard distribution of the estimated parameters. This is clearly established in Fuller (1976), where the results from simulating the empirical distribution of the random walk model are presented. Fuller generated data series from a driftless random walk model, and estimated the following models:
a) y_t = ρy_{t−1} + ε_t,
b) y_t = α + ρy_{t−1} + ε_t,
c) y_t = α + β(t − t̄) + ρy_{t−1} + ε_t,

where α is a constant and (t − t̄) a mean adjusted deterministic trend. These equations follow from the random walk model. The reason for setting up these three models is that the modeler will not know in practice that the data is generated by a driftless random walk. S/he will therefore add a constant (representing a deterministic growth trend in y_t), or a constant and a trend. The models are easy to understand: simply subtract y_{t−1} from both sides of the random walk model,

y_t − y_{t−1} = ρy_{t−1} − y_{t−1} + ε_t,   (15.43)

which leads to

Δy_t = (ρ − 1)y_{t−1} + ε_t = γy_{t−1} + ε_t.   (15.44)
The standard t-statistic for an infinitely large sample is, for a two-sided test of γ̂ ≠ 0, equal to 1.96 at the 5% level. However, according to the simulations of Dickey and Fuller, the appropriate value of the t-statistic in model (a) is 2.23 for an infinitely large sample. In an autoregressive model we know that the estimate of ρ is biased downward. Thus, the alternative hypothesis in models (a) to (c) is that γ is less than zero. The associated asymptotic t-value for an estimate from a normal distribution is therefore −1.65. Dickey and Fuller established that the asymptotic critical values for one-sided t-tests at the 5% level in models (a) to (c) are −1.95, −2.86 and −3.41 respectively. (See Fuller (1976), Table 3.2.1, page 373.) Notice that the critical values change depending on the parameters included in the empirical model. Also, the empirical distributions assume white noise residuals; if this is not the case, either the model or the test statistic must be adjusted.
Moreover, as long as ρ = 1 (or γ = 0) cannot be rejected, the estimated constant term in model (b), as well as the constant and the trend in model (c), also follow non-normal distributions. These cases are tabulated in Dickey and Fuller (1981). The consequence of ignoring the results of Dickey and Fuller is obvious: if using the standard tables, one will reject the null hypothesis of γ = 0 (ρ = 1.0) too many times. It follows that if you use standard t-tests you will end up modelling non-stationary series, which in turn takes you to the spurious regression problem. The alternative hypotheses for unit root tests are discussed in the following chapter.
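The non-normality is easy to reproduce by Monte Carlo. A hedged sketch of model (a), with simulation settings of our own choosing:

import numpy as np

rng = np.random.default_rng(3)
tstats = np.empty(5000)
for j in range(5000):
    y = np.cumsum(rng.normal(size=200))      # driftless random walk
    dy, ylag = np.diff(y), y[:-1]
    g = (ylag @ dy) / (ylag @ ylag)          # OLS estimate of gamma in model (a)
    resid = dy - g * ylag
    s2 = resid @ resid / (len(dy) - 1)
    tstats[j] = g / np.sqrt(s2 / (ylag @ ylag))

# The 5% quantile lies near the Dickey-Fuller value of about -1.95,
# not the -1.65 of the standard normal table.
print(np.percentile(tstats, 5))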
The explanation of why the t-statistic ends up being non-normally distributed can be introduced as follows. As T goes to infinity, the relative distance between y_t and y_{t−1} becomes smaller and smaller. Increasing the sample size implies that the random walk model approaches a continuous time random walk model. The asymptotic distribution of such a model is that of a Wiener process (or Brownian motion).
The OLS estimate is

(ρ̂ − 1.0) = [ Σ_{t=1}^{T} y²_{t−1} ]^{−1} [ Σ_{t=1}^{T} y_{t−1}ε_t ],   (15.45)

where, since y_t is driven by a stochastic trend, y_t = Σ_{i=1}^{t} ε_i, the sample moments of the two factors on the RHS will not converge to constants, but to random variables instead. These random variables will have a non-standard distribution, often called a Dickey–Fuller distribution. We can express this as

(ρ̂ − 1.0) ⇒ f[W(t)],   (15.46)

where W(t) indicates that the sample moment converges to a random variable which is a function of a Wiener process, and therefore distributed according to a non-standard distribution. If the residuals are white noise then we get the so-called Dickey–Fuller distributions.
The intuition behind this result is that an integrated variable has an infinite memory, so the correlation between y_{t−1} and ε_t does not disappear as T grows. The nonstandard distribution remains, and gets worse if we choose to regress two independent integrated variables against each other. Assume that x_t and y_t are two random walk variables, such that

y_t = y_{t−1} + ε_t,  and x_t = x_{t−1} + υ_t,   (15.47)

where both ε_t and υ_t are NID(0, σ²). In this case, β would equal zero in the model y_t = βx_t + ν_t.
The estimated t-value from this model, when y_t and x_t are independent random walks, should converge to zero. This is not what happens when y_t and x_t are integrated variables. In this case the empirical t-value grows with the sample size, so that |t| > 2.0 occurs far too often, leading to spurious correlation if a standard t-table at the 5% level is used to test for dependence between the variables. The problem can be described as follows. If β is zero, the residual term will be I(1), having the same sample moments as y_t. Since y_t is a random walk we know that the variance of ν_t will be time dependent and non-stationary as T goes to infinity. The sample estimate of σ²_ν is therefore not representative of the true long run variance of the y_t series. The OLS estimator gives
# 1"
#
"
T
T
X
1X
1
2
^= +
xt
xt t ;
(15.48)
T
T
t=1
t=1
) ) [B1 (t)]
[B2 (t)] ;
(15.49)
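The spurious regression effect is equally easy to demonstrate by simulation. A sketch with two independent random walks (settings are our own):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
reject = 0
for _ in range(2000):
    y = np.cumsum(rng.normal(size=200))      # two independent random walks
    x = np.cumsum(rng.normal(size=200))
    t_beta = sm.OLS(y, sm.add_constant(x)).fit().tvalues[1]
    if abs(t_beta) > 1.96:                   # nominal 5% two-sided test
        reject += 1

# The rejection frequency is far above 5%, although beta is truly zero.
print(reject / 2000)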
16. ENCOMPASSING
Often you will find that there are several alternative variables that you can put into a model; there might be several measures of income, capital or interest rates to choose from. Starting from a general model and reducing to a specific one, several models of the same dependent variable might display white noise innovation terms and stable parameters that all have signs and sizes in line with economic theory.
A typical example is given by Mankiw and Shapiro (1986), who argue that in a money demand equation private consumption is a better variable than income. Thus, we are faced with two empirical models of money demand.¹ The first model is
m_t = α_1 y_t + α_2 y_{t−1} + α_3 r_t + ε_t,   (16.1)

and the second,

m_t = β_1 c_t + β_2 c_{t−1} + β_3 r_t + υ_t.   (16.2)
Which of these models is the best one, given that both can be claimed to be good estimates of the data generating process? The better model is the one that explains more of the systematic variation of m_t and explains where other models go wrong. Thus, the better model will encompass the not-so-good models. The crucial factor is that y_t and c_t are two different variables, which leads to a non-nested test.
To understand the difference between nested and non-nested tests, set α_2 = 0. This is a nested test because it involves a restriction on the first model only. Now set α_1 = α_2 = 0; this is also a nested test, because it only reduces the information of model one. If β_1 = β_2 = 0, this is likewise a nested test of the second model. Thus, setting α_1 = α_2 = 0, or β_1 = β_2 = 0, yields only special cases of each model.
The problem that we would like to address here is whether to choose y_t or c_t as the scale variable in the money demand equation. This is a non-nested test because the test cannot be written as a restriction in terms of one model only.
The first thing to consider is that a stable model is better than an unstable one, so if only one of the models is stable, that is the one to choose. The next measure is to compare the residual variances and choose the model with the significantly smaller error variance.
However, variance domination is not really sufficient; PcGive therefore offers further tests that allow the comparison of Model one versus Model two, and vice versa. Thus, there are three possible outcomes: Model one is better, Model two is better, or there is no significant difference between the two models.
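The text refers to PcGive's encompassing tests; as an illustrative sketch of the same idea in Python, statsmodels provides the Davidson–MacKinnon J test for non-nested models (the data below are made up for the example):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import compare_j

rng = np.random.default_rng(5)
n = 200
y_inc = rng.normal(size=n)                        # income (hypothetical data)
c_con = 0.8 * y_inc + 0.3 * rng.normal(size=n)    # consumption, correlated with income
m = 1.0 + 0.5 * y_inc + 0.2 * rng.normal(size=n)  # money demand driven by income

model_y = sm.OLS(m, sm.add_constant(y_inc)).fit()   # scale variable: income
model_c = sm.OLS(m, sm.add_constant(c_con)).fit()   # scale variable: consumption

# J test in both directions; each call reports (statistic, p-value).
print(compare_j(model_y, model_c))
print(compare_j(model_c, model_y))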
¹For simplicity we assume that there is only one lag on income and consumption. This should not be seen as a restriction; the lag length can vary between the first and the second model.
17. ARCH MODELS

Consider the model

y_t = βx_t + ε_t,   (17.1)
ε_t ~ N(0, h_t),   (17.2)
h_t = ω + αε²_{t−1},   (17.3)

where the error variance depends on the lagged squared residual. The first equation is referred to as the mean equation and the second as the variance equation. Together they form an ARCH model; both equations must be estimated simultaneously. In the mean equation here, βx_t is simply an expression for the conditional mean of y_t. In a real situation this can be explanatory variables, an AR or ARIMA process. It will be understood that y_t is stationary and I(0); otherwise the variance will not exist.
This example is an ARCH model of order one, ARCH(1). ARCH models can be said to represent an ARMA process in the variance. The implication is that a high variance in period t − 1 will be followed by higher variances in periods t, t + 1, t + 2, etc. How long the shock persists depends, as in the ARMA model, on the size of
the parameters in combination with the lag lengths. A low variance period is likely to be followed by another low variance period, but a shock to the process and/or its variance will cause the variance to become higher before it settles down in the future. A consequence of an ARCH process is that the variance can be predicted; in other words, it is possible to predict whether future variances, and standard errors, will be large or small. This will improve forecasting in general and is a useful tool for the pricing of derivative instruments.
An ARCH(q) process is

y_t = βx_t + ε_t,  ε_t ~ D(0, h_t),   (17.4)

h_t = ω + α_1ε²_{t−1} + α_2ε²_{t−2} + ... + α_qε²_{t−q} = ω + Σ_{i=1}^{q} α_iε²_{t−i}.   (17.5)
The expression for the variance shows an autoregressive process in the variance of ε_t. The distribution of the residual term is deliberately left undetermined. In ARCH models normality is one option, but often the residual process will be non-normal, display thicker tails, and be leptokurtic. Thus, other distributions such as the Student t-distribution can be a better alternative.
The t-distribution is characterized by three moments: the mean, the variance and the degrees of freedom. In this case the residual process is ε_t ~ St(0, h_t, ν), where ν is a positive parameter that measures the relative importance of the peak in relation to the thickness of the tails. The Student t-distribution is a symmetrical distribution that contains the normal distribution as a special case, as ν → ∞.
The ARCH process can be detected by testing for ARCH and by inspecting the PACF and ACF of the estimated squared residuals ε̂²_t. As is the case for AR models, ARCH has a more general form, the Generalised ARCH (GARCH), which implies lagging the dependent variable h_t. A long lag structure in the ARCH process can be substituted with lagged dependent variables to create a shorter process, just as for ARMA processes. A GARCH(1,1) model is written as

y_t = βx_t + ε_t,  ε_t ~ D(0, h_t),   (17.6)

h_t = ω + αε²_{t−1} + δh_{t−1}.   (17.7)
The GARCH(1,1) process is a very typical process found in a number of empirical applications of ARCH processes. The convention is to indicate the length of the ARCH part with q, and to use the letter p to indicate the length of the lagged variance h_t. The same convention assigns α to the ARCH process and δ to the GARCH process. Usually ω is used for the constant, time independent, part of the variance instead of the α_0 that is sometimes used. For an asset market this type of process would imply that there are persistent periods when asset prices fluctuate relatively little compared with other periods where prices fluctuate more and for longer times. A general GARCH(q,p) process is

y_t = βx_t + ε_t,  ε_t ~ D(0, h_t),   (17.8)

h_t = ω + Σ_{i=1}^{q} α_iε²_{t−i} + Σ_{i=1}^{p} δ_ih_{t−i}.   (17.9)
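A minimal simulation of the GARCH(1,1) recursion in (17.7) makes the volatility clustering visible. The parameter values are our own choice, and estimation in practice is done by ML (for example via the third-party arch package):

import numpy as np

rng = np.random.default_rng(6)
T, omega, alpha, delta = 1000, 0.1, 0.1, 0.8    # alpha + delta < 1: finite variance
eps, h = np.zeros(T), np.zeros(T)
h[0] = omega / (1.0 - alpha - delta)            # start at the unconditional variance
for t in range(1, T):
    h[t] = omega + alpha * eps[t - 1] ** 2 + delta * h[t - 1]
    eps[t] = np.sqrt(h[t]) * rng.standard_normal()

# The sample variance is near the unconditional variance omega/(1-alpha-delta),
# while h_t wanders between quiet and turbulent spells.
print(eps.var(), omega / (1.0 - alpha - delta))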
ARCH and GARCH models cannot be estimated by OLS or standard regression programs. It is necessary to use an iterative system estimation method, because the model now consists of two equations: the mean equation and the variance equation.

17.0.1 Practical Modelling Tips
In practical modelling it is necessary to start with the mean equation. A correct specification of the mean equation is needed in order to get the variance process right. A stationary autoregressive process and relevant explanatory variables, and possibly seasonal and other dummies, must be included in the mean equation to get rid of autocorrelation and general misspecification. This is a relatively easy procedure for financial return series, which often are martingale processes.
Notice that ARCH and GARCH effects disappear with aggregation over time and low frequencies in recording data. Thus, ARCH/GARCH is typically never found for observation intervals longer than a month. Monthly data, or shorter intervals, are necessary for the modelling of ARCH/GARCH processes. Even if models estimated with quarterly or lower frequency data can display ARCH when testing the residuals, it is usually never possible to build ARCH/GARCH models with that type of data.
An ARCH process can be identified by testing for ARCH(q) structure in combination with using the ACFs and PACFs of the squared residuals from the mean equation. Estimate the mean equation, save the estimated residuals, square them, and inspect the ACFs and PACFs of these squared residuals to identify a preliminary lag order for the GARCH. However, this method is highly approximate regarding the order of q and p.
17.1 Some ARCH Theory

Start from an AR(1) process,

y_t = φy_{t−1} + ε_t,   (17.10)

and consider the one-step-ahead conditional expectation,

E_t(y_{t+1}) = φy_t,   (17.11)
which varies over time since y_t is a random variable. Now turn to the variance of y_{t+1},

Var(y_{t+1}) = Var(φy_t) + Var(ε_{t+1}).   (17.12)

This variance consists of two parts. First, we have the unconditional variance of y_{t+1}, which for an AR(1) is given by

Var(y_{t+1}) = σ²/(1 − φ²),   (17.13)

while the conditional variance is

Var_t(y_{t+1}) = σ².   (17.14)
We can see that while the conditional expectation of y_{t+1} depends on the information set I_t = {y_t}, both the conditional (Var_t) and the unconditional (Var) variances do not depend on I_t.
If we extend the forecasts k periods ahead we get, by repeated substitution,

y_{t+k} = φ^k y_t + Σ_{i=1}^{k} φ^{k−i} ε_{t+i}.   (17.15)

The first term is the conditional expectation of y_t k periods ahead. The second term is the forecast error. Hence, the conditional variance of y_t k periods ahead is equal to

Var_t(y_{t+k}) = σ² Σ_{i=1}^{k} φ^{2(k−i)}.   (17.16)
It can be seen that the forecast of y_{t+k} depends on the information at time t. The conditional variance, on the other hand, depends on the length of the forecast horizon (k periods into the future), but not on the information set. Nothing says that this conditional variance should be stable. Like the forecast of y_t, it could very well depend on available information as well, and therefore change over time.
So let us turn to the simplest case, where the errors follow an ARCH(1) model. We have the following model: y_t = φy_{t−1} + ε_t where ε_t ~ D(0, h_t), E(ε_t) = 0, E(ε_tε_{t−i}) = 0 for i ≠ 0, and h_t = ω + αε²_{t−1}.
The process is assumed to be stable, |φ| < 1, and since ε²_{t−1} is positive we must have ω > 0 and α ≥ 0. Notice that the errors are not autocorrelated, but at the same time they are not independent, since they are correlated in higher moments through the ARCH effect. Thus, we cannot assume that the errors really are normally distributed. If we choose to use the normal distribution as a basis for ML estimation, this is only an approximation. (As an alternative we could think of using the t-distribution, since the distribution of the errors tends to have fatter tails than the normal.) The conditional expectations of the mean and the variance of this process are E_t(y_{t+1} | y_t) = φy_t and Var_t(y_{t+1} | y_t) = h_{t+1} = ω + α(y_t − φy_{t−1})².
We can see that both depend on the available information at time t. Especially, it should be noticed that the conditional variance of y_{t+1} increases with both positive and negative shocks to y_t.
Extending the conditional variance expression k periods ahead, as above, we get

Var_t(y_{t+k} | y_t) = Σ_{i=1}^{k} φ^{2(k−i)} E_t(h_{t+i}).   (17.17)
To relate the conditional and unconditional variances, write the ARCH(1) variance as

h_t = ω + αε²_{t−1}.   (17.18)

Taking unconditional expectations, σ² = E(h_t) = ω + ασ², implies that

σ²(1 − α) = ω,   (17.19)

or

σ² = ω/(1 − α).   (17.20)

Substituting back into h_t gives

h_t = (1 − α)σ² + αε²_{t−1},   (17.21)

which is the relationship between the conditional and the unconditional variances of y_t. The expected value of h_t in any period t + i is

E(h_{t+i}) = σ² + α E[h_{t+i−1} − σ²].   (17.22)
Inserting this recursion into (17.17) gives

Var_t(y_{t+k}) = σ² Σ_{i=0}^{k−1} φ^{2i} + (h_{t+1} − σ²) Σ_{i=1}^{k} α^{i−1} φ^{2(k−i)}.   (17.23)

The first term on the RHS is the long run unconditional forecast variance of y_t. The second term represents the memory in the process, given by the presence of h_{t+1}. If α < 1 the influence of (h_{t+1} − σ²) will die out in the long run and the second term vanishes. Thus, for long-run forecasts it is only the unconditional forecast variance which is of importance. Under the assumption α < 1 the memory in the ARCH effect dies out. (Below we will relax this assumption, and allow for unit roots in the ARCH process.)
17.2 Some Different Types of ARCH and GARCH Models

1) ARCH(q):

h_t = α_0 + Σ_{i=1}^{q} α_iε²_{t−i} = α_0 + A(L)ε²_t.   (17.24)

This is the basic ARCH model from which we now introduce different effects.

2) GARCH(q,p): Generalized ARCH models. If q is large then it is possible to get a more parsimonious representation by adding lagged h_t to the model. This is like using ARMA instead of AR models. A GARCH(q,p) model is
h_t = α_0 + Σ_{i=1}^{q} α_iε²_{t−i} + Σ_{i=1}^{p} δ_ih_{t−i} = α_0 + A(L)ε²_t + B(L)h_t,   (17.25)
where p ≥ 0, q > 0, α_0 > 0, α_i ≥ 0, and δ_i ≥ 0. The sum of the estimated parameters, λ(1) = Σα_i + Σδ_i, shows the memory of the process. A value of λ(1) equal to unity indicates that shocks to the variance have permanent effects, like in a random walk model. High values of λ(1), but less than unity, indicate a long memory process: it takes a long time before shocks to the variance disappear.
If the roots of [1 − B(L)] = 0 are outside the unit circle, the process is invertible and

h_t = α_0[1 − B(1)]^{−1} + A(L)[1 − B(L)]^{−1}ε²_t   (17.26)

    = a + Σ_{i=1}^{∞} δ_iε²_{t−i} = a + D(L)ε²_t,   (17.27)

that is, an ARCH(∞) process.   (17.28)

Thus a GARCH model corresponds to an infinite-order ARCH model. Moreover, if the long run solution of the model, B(1), is less than one, the δ_i will decrease for all i > max(p, q).
GARCH models are standard tools, in particular for modeling foreign exchange rate and financial market data. Often the GARCH(1,1) is the preferred choice. GARCH captures some empirical observations quite well. The distribution of many financial series displays fatter tails than the standard normal distribution; GARCH models in combination with the assumption of a normal distribution of the residual can generate such distributions. However, many series, like foreign exchange rates, display both fatter tails and leptokurtosis (the peak of the distribution is higher than the normal). A GARCH process combined with the assumption that the errors follow the t-distribution can generate this type of observed data.
Before continuing with different ARCH models, we can now look at an alternative formulation of ARCH models which shows their similarity with ordinary time series models. Define the innovations in the conditional variance as

v_t = ε²_t − h_t.   (17.29)

Substituting h_t = ε²_t − v_t into the GARCH model gives

[1 − B(L)](ε²_t − v_t) = α_0 + A(L)ε²_t,   (17.30)

so that

[1 − B(L)]ε²_t = α_0 + A(L)ε²_t + [1 − B(L)]v_t,   (17.31)

and

[1 − A(L) − B(L)]ε²_t = α_0 + [1 − B(L)]v_t.   (17.32)

For a GARCH(1,1) this is

ε²_t = α_0 + (α_1 + δ_1)ε²_{t−1} − δ_1v_{t−1} + v_t,   (17.33)

an ARMA(1,1) process in ε²_t. If α_1 + δ_1 = 1, or Σ(α_i + δ_i) = 1 in the GARCH(q,p) model, we get what is called an integrated GARCH model.
5) ARCH-M (ARCH in mean). The conditional variance enters the mean equation,

y_t = βx_t + θh_t^{1/2} + ε_t,   (17.34)

h_t = α_0 + A(L)ε²_t.   (17.35)

There exist various ways of putting the variance back in the mean equation. The example above assumes that it is the standard error which is the interesting variable in the mean equation.
6) IGARCH. Integrated GARCH. When the coefficients sum to unity we get a model with extremely long memory, similar to the random walk model. Unlike the cases discussed earlier, the shocks to the variance will not die out: current information remains important for all future forecasts. We talk about an integrated variance and persistence in variance. A significant constant term in a GARCH process can be understood as mean reversion of the variance. If the variance is not mean-reverting, integrated GARCH is an alternative, which in a GARCH(1,1) process sets the constant to zero and restricts the two parameters to sum to unity.

7) EGARCH. Exponential GARCH and ARCH models (exponential due to logs of the variables in the GARCH model). These models have the interesting characteristic that they allow for different reactions to negative and positive shocks, a phenomenon observed in many financial markets. In the output, the first lagged residual indicates the effect of a positive shock, while the second lagged residual (in absolute terms) indicates the effect of a negative shock.

8) FIGARCH. Fractionally Integrated GARCH. This approach builds on the idea of fractional integration and allows for a slow hyperbolic rate of decay of the lagged squared innovations in the conditional variance function. See Baillie, Bollerslev and Mikkelsen (1996).

9) NGARCH and NARCH. Non-linear GARCH and ARCH models.

10) Common Volatility. Introduced by Engle and Isle 1989 (and 1993); allows you to test for common GARCH structure in different series.
17.3 The Estimation of ARCH Models

Under the assumption of conditional normality, the log likelihood function is

ln L = −(T/2) ln 2π − (1/2) Σ_{t=1}^{T} ln h_t − (1/2) Σ_{t=1}^{T} ε²_t/h_t.   (17.36)

Notice that there are two equations involved here, the mean equation and the variance equation. The process is correctly modelled only when both equations are correctly modelled.
To estimate ARCH and GARCH processes, non-standard algorithms are generally needed. If y_{t−i} is among the regressors, some iterative method is always required. (GAUSS, RATS and SAS provide such facilities.) There are also special programs which deal with ARCH, GARCH and multivariate ARCH. The research strategy is to begin by testing for ARCH, using standard test procedures. The following LM test for ARCH of order q is an example,

ε̂²_t = α_0 + α_1ε̂²_{t−1} + α_2ε̂²_{t−2} + ... + α_qε̂²_{t−q} + v_t,   (17.37)

where TR² ~ χ²(q). Notice that this requires that E(ε_t) = 0 and E(ε_tε_{t−i}) = 0 for i ≠ 0.
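The LM test is available off the shelf. A sketch using het_arch from statsmodels on an artificial ARCH(1) series (simulation settings are our own):

import numpy as np
from statsmodels.stats.diagnostic import het_arch

rng = np.random.default_rng(7)
e = rng.standard_normal(500)
for t in range(1, 500):                       # inject an ARCH(1) effect
    e[t] *= np.sqrt(0.2 + 0.7 * e[t - 1] ** 2)

# Regresses squared residuals on q lags; T*R^2 is chi2(q) under H0 of no ARCH.
lm_stat, lm_pval, _, _ = het_arch(e, nlags=4)
print(lm_stat, lm_pval)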
If ARCH is found, or suspected, use standard time series techniques to identify the process. The specification of an ARCH model can be tested by Lagrange multiplier tests or likelihood ratio tests. As in time series modeling, the Box–Ljung test on the estimated residuals from an ARCH equation serves as a misspecification test. ARCH type processes are seldom found in low frequency data; high frequency data is generally needed to observe these effects: daily, weekly, sometimes monthly data, but hardly ever quarterly or yearly data.
Finally, remember two things. First, ARCH effects imply thicker tails than the standard normal distribution. It is not obvious that the normal distribution should be used; on the other hand, there is no obvious alternative either. Often the normal distribution is the best approximation, unless there is some other information. One example of other information is that some series are leptokurtic, with a higher peak than the normal, in combination with fat tails. In that case the t-distribution might be an alternative. Thus, using the normal density function is often an approximation. Second, correct inference on ARCH effects builds upon a correct specification of the mean equation. Misspecification tests of the mean equation are therefore necessary.
18. ECONOMETRICS AND RATIONAL EXPECTATIONS

Consider a model where behavior is driven by the expected value of an explanatory variable,

y_t = βx^e_t + ε_t,   (18.1)

where x^e_t denotes the expectation of x_t.

18.0.1 Rational v.s. other Types of Expectations
behavior of economic agents. Expectations other than rational imply that agents might ignore information that would raise their utility. With anything other than rational expectations, agents are allowed to make systematic mistakes, implying that they ignore profit opportunities or that they are not, for some unexplained reason, maximizing their utility. Economic science has yet to identify such behavior in the real world. Rational expectations become an equilibrium condition in the sense that the difference between predictions and outcomes cannot be predicted. A model which allows for predictable differences between the expectations and the outcome is not complete without an economic explanation of what the difference means, and why it occurs.
The correct way to approach the modeling of expectations is to assume that agents form expectations so that they do not make systematic mistakes that reduce their welfare. Information used to predict the future will be collected and processed up to the point where the cost of gathering more information balances the value of additional information. Based on this type of behavior it might, as a special case, be optimal to use, say, today's value of a variable to predict all future values of that variable. But these are exceptions from the rule.
In general there is a catch-22 situation in the modeling of rational expectations behavior. If the econometrician finds from ex post data that agents are making systematic mistakes, this is no evidence against the rational expectations hypothesis. Instead, the empirical finding might be the result of conditioning on the wrong information set. Alternatively, the modeling of the expectation might be correct, and be an unbiased and efficient estimate of the expectation held at a certain point in time. This argument also includes situations where there is a small probability of an event with large consequences, such as devaluations, unpredicted changes in the monetary regime, wars, natural disasters, etc. To examine these situations generally requires further testing of the model, where the outcome will depend to a large extent on assumptions regarding the distributions of the processes, whether they are linear or non-linear, etc.
The discussion about other types of expectations brings us to the concepts of forward looking v.s. backward looking behavior. The difference can be explained as follows. Consumption based on forward looking behavior is determined on the basis of expected future income. Consumption based on actual (existing) income is backward looking. In practice there might not be a big difference: your present or recent income might be a good approximation of your future income. In some cases rational expectations might mean basing decisions on contingent rules, and revising these rules only when the cost of deviating from the optimal/desired consumption is too big (or when the alternative cost of being outside equilibrium is too high).
18.0.2 Typical Errors in the Modeling of Expectations

Without given values of the expected variable, there are two common types of mistakes in econometric models of expectations driven stochastic processes. The first mistake is to substitute x^e_t with the observed value x_t. This leads to an errors-in-variables problem, since x_t = x^e_t + v_t, where E(v_t) = 0. The errors-in-variables problem implies that β will not be estimated correctly; OLS is inconsistent for the estimation of the original parameter.
The second mistake is to model the process for x_t and substitute this process into 18.1. Assume that the variable x_t follows an AR(2) process, x_t = a_1x_{t−1} + a_2x_{t−2} + n_t, where n_t ~ NID(0, σ²). Substitution into 18.1 gives

y_t = βa_1x_{t−1} + βa_2x_{t−2} + e_t = γ_1x_{t−1} + γ_2x_{t−2} + e_t.   (18.2)
This estimated model also gives the wrong results if we are interested in estimating the (deep) behavioral parameter β. The variables x_{t−1} and x_{t−2} are not weakly exogenous for the parameter of interest (β) in this case. The estimated parameters will be a mixture of the deep behavioral parameter and the parameters of the expectations generating process (a_1 and a_2).
Not only are the estimates biased; policy conclusions based on this estimated model will also be misleading. If the parameters of the marginal model (a_1 and a_2) describe some policy reaction function, say a particular type of money supply rule, then changing this rule, i.e. changing a_1 and a_2, will also change γ_1 and γ_2. This is a typical example of when super exogeneity does not hold, and when an estimated model cannot be used to form policy recommendations.
What is the solution to this dilemma of estimating 'deep' behavioral parameters, in order to understand the working of the economy better?

1. One conclusion is that econometrics will not work. The problems of correctly specifying the expectations process, in combination with short samples, make it impossible to use econometrics to estimate deep parameters. A better alternative is to construct micro-based theoretical models and simulate these models (for example, using calibration techniques).

2. Sims' solution was to advocate VAR models, and avoid estimating deep parameters. VAR models can then be used to increase our understanding of the economy, and be used to simulate the consequences of unpredictable events, like monetary or fiscal policy shocks, in order to optimize policy.

3. Though the rational expectations critique (Lucas, Sims and others) seems to be devastating for structural econometric modeling, the critique has yet to be proven. In surprisingly many situations, policy changes appear to have small effects on estimated equations, e.g. the effects of the switch in monetary policy in the UK in the early 1980s.

4. Finally, the assumption of rational expectations provides a priori information that can be used to formulate an econometric model from the beginning. There are, in principle, three ways in which one can approach this problem: i) substitution, ii) system estimation based on the Full Information Maximum Likelihood (FIML) estimator, or iii) use of the Generalized Method of Moments (GMM) estimator.
Substitution means replacing the expected explanatory variable with an expectation. This expectation could either be a survey expectation or an expectation generated by a forecasting model, e.g. an ARIMA model. The FIML method can be said to build the econometric forecast into an estimated system. The GMM estimator builds on the assumption that the explanatory variables and the residuals are orthogonal to each other. Since rational expectations implies that the (rationally expected) explanatory variables are orthogonal to the residuals, the GMM estimator is well suited for rational expectations models. Because of this it is the preferred choice when it comes to estimating rational expectations models, especially in finance applications.
18.0.3 Modeling Rational Expectations

Replacing the expectation with a generated forecast x̂_t leads to the regression

y_t = βx̂_t + u_t,  where u_t = e_t − β(x̂_t − x^e_t) = e_t − β(v_t − v̂_t).   (18.3)
18.0.4
(To be completed)
Tests concerning given values of xet .
Given some values of the expectation process, there are three types of tests
that can be performed.
1. Test if the dierence between the expectation and the outcome is a martingale dierence process, conditional on assumptions regarding risk premiums.
2. Test for news. Under the assumption of rational expectations, the driven variable should only react to unpredictable events ("news") but not to events that can be predicted. These assumptions are directly testable as soon as we have a forecasting model for $x_t^e$ (a small numerical sketch follows this list).
3. Variance bounds tests. Again, given $x_t^e$, it follows that the variance of $y_t$ in equation 18.1 must be higher than the variance of $x_t^e$.
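As promised in point 2, here is a minimal sketch of the news/orthogonality idea, assuming an AR(2) forecasting model for $x_t$ (the coefficients, sample size and information set are all illustrative):

    import numpy as np

    rng = np.random.default_rng(1)
    T = 400
    x = np.zeros(T)
    for t in range(2, T):
        x[t] = 0.5 * x[t - 1] + 0.3 * x[t - 2] + rng.normal()

    # Fit the forecasting model (an AR(2) here) and form the implied
    # one-step-ahead expectation x_t^e.
    X = np.column_stack([x[1:-1], x[:-2]])
    a_hat, *_ = np.linalg.lstsq(X, x[2:], rcond=None)
    err = x[2:] - X @ a_hat                         # expectation errors

    # "News" regression: under rational expectations the errors should be
    # unpredictable from any information dated t-1 or earlier.
    Z = np.column_stack([np.ones(len(err) - 1), x[2:-1], err[:-1]])
    g, *_ = np.linalg.lstsq(Z, err[1:], rcond=None)
    resid = err[1:] - Z @ g
    r2 = 1.0 - resid.var() / err[1:].var()
    print("coefficients:", g, "R^2:", r2)           # all near zero under RE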
Encompassing tests
If a model that takes account of assumed rational expectations behavior is the correct model, it follows that this model should encompass other models which lack this feature. Thus, encompassing tests can be used to discriminate between models based on rational expectations and other models.
Tests of super exogeneity
19. A RESEARCH STRATEGY
If you reduce to a single equation (or a very limited system), can you motivate the weak exogeneity assumptions?
The reduced-form VECM gives you ideas about what a system might, and might not, look like through the estimated (significant) alpha values.
Is it possible to test for predictability in the VECM by looking at the estimated alpha values, and to argue for reductions of the system?
Of course, from the reduced-form VECM the logical step is to construct a simultaneous structural model based on testing the order and rank conditions in the model. However, this can be a bit of a challenge, especially if you are short of time. Furthermore, identification must be done on significant parameters (including lags), not on the underlying theoretical lag structure.
V. Set up the Error Correction Representation.
In the following we assume that you have chosen to continue with a single equation (a code sketch of the main steps follows this list).
- Use the results from Johansen's multivariate cointegration technique, then formulate an ECM model directly.
- Test for cointegration in the ADL representation of the model (the PcGive test). It is necessary to choose lag lengths long enough to get white noise residuals. Test if the residuals are $NID(0, \sigma^2)$, plus a RESET test if possible. Having white noise innovation error terms is a necessary condition.
If the innovations are not white noise:
- Add more lags.
- Did you forget something important? Study outliers. Use dummies and trends to get white noise, but remember that they should be motivated.
- Or continue to the least bad of all possible models, see above.
- Rethink the problem or stop. RESET test! (Perhaps you should try to condition on some other variable instead?)
When white noise is established:
- Is the equation in line with what you think can be an economically meaningful long-run equilibrium? Check the signs and sizes of the parameters.
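The following is the promised minimal numerical sketch of these steps, using a two-step Engle-Granger-style ECM on simulated data (the long-run coefficient 1.5 and adjustment speed 0.4 are hypothetical); it is a toy substitute for, not an implementation of, the Johansen or PcGive procedures:

    import numpy as np

    rng = np.random.default_rng(2)
    T = 300

    # x_t: random walk; y_t error-corrects towards 1.5*x_t with speed 0.4.
    x = np.cumsum(rng.normal(size=T))
    y = np.zeros(T)
    for t in range(1, T):
        y[t] = y[t - 1] - 0.4 * (y[t - 1] - 1.5 * x[t - 1]) + rng.normal()

    # Step 1: long-run relation y_t = beta*x_t (static regression).
    beta_hat = np.sum(x * y) / np.sum(x * x)
    ect = y - beta_hat * x                          # error-correction term

    # Step 2: the ECM. A clearly negative coefficient on the lagged ECT is
    # the error-correction evidence a PcGive-type test looks for.
    dy, dx = np.diff(y), np.diff(x)
    R = np.column_stack([np.ones(T - 1), ect[:-1], dx])
    b, *_ = np.linalg.lstsq(R, dy, rcond=None)
    print("alpha:", b[1], "long-run beta:", beta_hat)
    print("residual std:", (dy - R @ b).std())      # inspect for white noise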
20. REFERENCES
Anderson, T.W. (1971) The Statistical Analysis of Time Series, John Wiley & Sons, New York.
Anderson, T.W. (1984) An Introduction to Multivariate Statistical Analysis, John Wiley & Sons, New York.
Banerjee, A., J. Dolado, J.W.Galbraith and D.F. Hendry, (1993) Cointegration,
Error-Correction and the Econometric Analysis of Non-stationary Data, (Oxford
University Press, Oxford).
Baillie, Richard T. and Tim Bollerslev (1994), The Long Memory of the Forward Premium, Journal of International Money and Finance 13 (5), p. 565-571.
Baillie, Richard T., Tim Bollerslev and Hans Ole Mikkelsen (1996) Fractionally Integrated Generalized Autoregressive Conditional Heteroskedasticity, Journal of Econometrics 74, 3-30.
Banerjee, A., R.L. Lumsdaine and J.H. Stock (1992) Recursive and Sequential Tests of the Unit Root and Trend Break Hypothesis: Theory and International Evidence, Journal of Business and Economic Statistics ?.
Cheung, Y. and K. Lai (1993), Finite Sample Sizes of Johansen's Likelihood Ratio Tests for Cointegration, Oxford Bulletin of Economics and Statistics 55, p. 313-328.
Cheung, Y. and K. Lai (1995) A Search for Long Memory in International Stock Markets Returns, Journal of International Money and Finance 14 (4), p. 597-615.
Davidson, James (1994) Stochastic Limit Theory, Oxford University Press, Oxford.
Dickey, D. and W.A. Fuller (1979), Distribution of the Estimators for Autoregressive Time Series with a Unit Root, Journal of the American Statistical
Association 74.
Diebold, F.X. and G.D. Rudebusch (1989), Long Memory and Persistence in Aggregate Output, Journal of Monetary Economics 24 (September), p. 189-209.
Eatwell, J., M. Milgate and P. Newman, eds. (1990) Econometrics (Macmillan, London).
Eatwell, J., M. Milgate and P. Newman, eds. (1990) Time Series and Statistics (Macmillan, London).
Engle, Robert F. ed. (1995) ARCH Selected Readings, Oxford University Press,
Oxford.
Engle, R.F. and C.W.J. Granger, eds. (1991), Long-Run Economic Relationships. Readings in Cointegration, (Oxford University Press, Oxford).
Engle, R.F. and B.S. Yoo (1991) Cointegrated Economic Time Series: An Overview with New Results, in R.F. Engle and C.W.J. Granger, eds., Long-Run Economic Relationships. Readings in Cointegration (Oxford University Press, Oxford).
Ericsson, Neil R. and John S. Irons (1994) Testing Exogeneity, Oxford University Press, Oxford.
Fuller, Wayne A. (1996) Introduction to Statistical Time Series, John Wiley & Sons, New York.
Freund, J.E. (1972) Mathematical Statistics, 2nd ed. (Prentice-Hall, London).
Granger, C.W.J. and P. Newbold (1986), Forecasting Economic Time Series (Academic Press, San Diego).
20.1 APPENDIX 1
A1 Smoothing Time Series: Lag Windows
In the discussion about non-stationarity, different ways of removing the trend in a time series were shown. If the trend is removed from, say, GDP, we are left with swings in the data that can be identified as business cycles. In time series analysis such cycles are referred to as low frequency or periodic components. Applications of smoothing filters arise in empirical studies of real business cycles, and in modelling financial variables such as daily interest rates, where for example news about inflation and other variables occurs only at monthly intervals and might cause monthly cycles in the data.¹ Smoothing methods, of course, are closely related to spectral analysis. In this appendix we concentrate on two filters, or lag windows, which represent the best, or most commonly used, methods for time series in the time domain.
Start from a time series, $r_t$. What we are looking for is some weights $b_i$ such that the filtered series $x_t$ is free of low frequency components,

$x_t = \sum_{i=-k}^{+k} b_i r_{t+i}$.  (20.1)
In this formula the window is applied both backwards and forwards, implying a combination of backward and forward looking behavior. Whether this is a good or a bad thing depends entirely on the series at hand, and is left to the judgment of the econometrician. The alternative is to let the window end at time $i = 0$. The literature is filled with methods of calculating the weights $b_i$; in this appendix we will look at the two most commonly used: the Parzen window and the Tukey-Hanning window.
The Parzen window is calculated using the following weights,

$w_i = \begin{cases} 1 - 6(i/k)^2 + 6(|i|/k)^3, & |i| \le k/2, \\ 2(1 - |i|/k)^3, & k/2 \le |i| \le k, \\ 0, & |i| > k, \end{cases}$
where $k$ is the size of the lag window. The Parzen window tries to fit a third-degree polynomial to the original series.
An alternative is the so-called Tukey-Hanning window, calculated as,

$w_i = \begin{cases} \frac{1}{2}\left[1 + \cos(\pi i / k)\right], & |i| \le k, \\ 0, & |i| > k. \end{cases}$
Like the Parzen window, the weights need to be normalized. Under optimal conditions, that is, the correct identification of the underlying cycles, the difference between $x_t$ and $r_t$ will appear as normally distributed. The problem is to determine the bandwidth, the size of the window, $k$ in the formula above. Unfortunately there is no easy way to determine this in practice. Choosing the size of the lag window involves a choice between low bias in the mean and a high variance of the smoothed series: the larger the window, the smaller the variance but the higher the bias. In practice, make sure that the weights at the end of the window are close to zero, and then judge the best fit by comparing $x_t - r_t$. As a rule of thumb, choose a bandwidth equal to $N^{2/5}$, the number of observations ($N$) raised to the power of 2 over 5. An alternative rule is to set the bandwidth equal to $N^{1/4}$, or to make a decision based on the last significant autocorrelation. Since the choice of the window is always ad hoc in some sense, great care is needed if the smoothed series is going to be used to reveal correlations of great economic consequence.
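A small sketch of both windows and of the two-sided filter in equation (20.1); the bandwidth rule $k \approx N^{2/5}$ follows the text, while the simulated series and function names are illustrative:

    import numpy as np

    def parzen_weights(k):
        # Piecewise Parzen lag window, normalized to sum to one.
        i = np.arange(-k, k + 1)
        a = np.abs(i) / k
        w = np.where(a <= 0.5, 1 - 6 * a**2 + 6 * a**3, 2 * (1 - a) ** 3)
        return w / w.sum()

    def tukey_hanning_weights(k):
        # Tukey-Hanning window: 0.5*(1 + cos(pi*i/k)) for |i| <= k.
        i = np.arange(-k, k + 1)
        w = 0.5 * (1 + np.cos(np.pi * i / k))
        return w / w.sum()

    def smooth(r, k, window=parzen_weights):
        # Two-sided filter x_t = sum_i b_i r_{t+i} as in equation (20.1);
        # mode="same" leaves the endpoints only partially smoothed.
        return np.convolve(r, window(k), mode="same")

    rng = np.random.default_rng(3)
    N = 200
    r = np.cumsum(rng.normal(size=N))               # a trending toy series
    k = int(N ** (2 / 5))                           # rule-of-thumb bandwidth
    print("k =", k, "cycle component:", (r - smooth(r, k))[:3])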
APPENDIX II
Testing the Random Walk Hypothesis using the Variance Ratio Test.
For a random walk, $x_t = x_{t-1} + \varepsilon_t$, where $\varepsilon_t \sim NID(0, \sigma^2)$, we have that the variance is $\sigma^2 t$ and that the autocovariance function is $\mathrm{cov}(x_t, x_{t-k}) = (t - k)\sigma^2$. It follows that the variance of the one-period difference is $\sigma_1^2$, and that the variance of the $k$-period difference is $k\sigma_1^2$. Defining $\sigma_k^2 = \frac{1}{k}\,\mathrm{var}(x_t - x_{t-k})$, for a random walk the estimated variance ratio $VR(k) = \hat{\sigma}_k^2 / \hat{\sigma}_1^2$ should equal unity. The components can be estimated as
¹ To be clear, we are not saying that daily interest rates necessarily contain monthly cycles, only that this might be the case. One example is daily observations of the Swedish overnight interbank rate.
$\hat{\sigma}_1^2 = \frac{1}{T} \sum_{t=1}^{T} (x_t - x_{t-1} - \hat{\mu})^2$,  (20.2)

$\hat{\sigma}_k^2 = \frac{1}{k(T - k + 1)(1 - k/T)} \sum_{t=k}^{T} (x_t - x_{t-k} - k\hat{\mu})^2$,  (20.3)

where $\hat{\mu}$ is the estimated drift. Under homoskedasticity, using

$\phi(k) = \frac{2(2k - 1)(k - 1)}{3kT}$,  (20.4)

the standardized statistic is asymptotically standard normal,

$\left[VR(k) - 1\right]\left[\phi(k)\right]^{-1/2} \to_a N(0, 1)$.  (20.5)

Under heteroskedasticity the asymptotic variance is instead

$\phi^*(k) = \sum_{j=1}^{k-1} \left[\frac{2(k - j)}{k}\right]^2 \hat{\delta}(j)$,  (20.6)

where

$\hat{\delta}(j) = \frac{\sum_{t=j+1}^{T} (x_t - x_{t-1} - \hat{\mu})^2 (x_{t-j} - x_{t-j-1} - \hat{\mu})^2}{\left[\sum_{t=1}^{T} (x_t - x_{t-1} - \hat{\mu})^2\right]^2}$,  (20.7)

so that

$\left[VR(k) - 1\right]\left[\phi^*(k)\right]^{-1/2} \to_a N(0, 1)$.  (20.8)
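A compact sketch of the homoskedastic version of the test, following equations (20.2)-(20.5); the function name and the simulated random walk are illustrative:

    import numpy as np

    def variance_ratio(x, k):
        # Homoskedastic variance ratio test, following eqs (20.2)-(20.5).
        T = len(x) - 1
        mu = (x[-1] - x[0]) / T                     # estimated drift
        s1 = np.sum((np.diff(x) - mu) ** 2) / T     # sigma_1^2, eq (20.2)
        m = k * (T - k + 1) * (1 - k / T)           # scaling in eq (20.3)
        sk = np.sum((x[k:] - x[:-k] - k * mu) ** 2) / m
        vr = sk / s1
        phi = 2 * (2 * k - 1) * (k - 1) / (3 * k * T)   # eq (20.4)
        z = (vr - 1) / np.sqrt(phi)                 # asymptotically N(0,1)
        return vr, z

    rng = np.random.default_rng(4)
    x = np.cumsum(rng.normal(size=1000))            # a driftless random walk
    print(variance_ratio(x, k=4))                   # VR near 1, |z| small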
20.2 Appendix III Operators

This appendix presents a number of operators that can be applied to random variables and series of observations. These are the expectations operator, the variance operator, the covariance operator, the lag operator, the difference operator, and the sum operator. The formal proofs behind these operators are not given; instead, this appendix states the basic rules for using the operators.
All operators serve the basic purpose of simplifying the calculations and communication involving random variables. Take the expectations operator ($E$) as an example. Writing $E(x_t)$ means "I will calculate the mean (or the first moment) of the observations on the random variable $\tilde{X}$." But I am not telling exactly which specific estimator I would be using if I were to estimate the mean from empirical data, because in this context it is not important.
One important use of operators is in investigating the properties of estimators under different assumptions concerning the underlying process: for instance, the properties of the OLS estimator when the explanatory variables are stochastic, when the variables in the model are trending, etc.
20.2.1 The Expectations Operator
The first operator is the expectations operator. This is a linear operator and is therefore easy to apply, as shown by the following rules. In the following, let $c$ and $k$ be two non-random constants, let $\mu_i$ be the mean of variable $i$, and let $\sigma_{ij}$ be the covariance between variable $i$ and variable $j$. It follows that,

$E(c) = c$,

$E(c\tilde{X}) = cE(\tilde{X}) = c\mu_x$,

$E(k + c\tilde{X}) = k + cE(\tilde{X}) = k + c\mu_x$,

$E(\tilde{X} + \tilde{Y}) = E(\tilde{X}) + E(\tilde{Y}) = \mu_x + \mu_y$,

$E(\tilde{X}\tilde{Y}) = E(\tilde{X})E(\tilde{Y}) + \mathrm{cov}(\tilde{X}, \tilde{Y}) = \mu_x\mu_y + \sigma_{xy}$,

$E(\tilde{X}^2) = [E(\tilde{X})]^2 + \mathrm{var}(\tilde{X}) = \mu_x^2 + \sigma_x^2$.
The expectations operator is linear and straightforward to use, with one important exception: the expectation of a ratio. This exception matters because it represents a quite common problem. $E(\tilde{Y}/\tilde{X})$ is not equal to $E(\tilde{Y})/E(\tilde{X})$; the problem is that the numerator and the denominator are not necessarily independent. In this situation it is necessary to use the $\operatorname{plim}$ operator, or alternatively to let the number of observations go to infinity and use convergence in probability or distribution to analyze the outcome. In the derivation of the OLS estimator, the following transformation is often used when $\tilde{X}$ is viewed as given: $E[\tilde{Y}/\tilde{X}] = E[(1/\tilde{X})\tilde{Y}] = E(\tilde{W}\tilde{Y})$.
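A quick Monte Carlo illustration of why the ratio is the exception; the choice $\tilde{Y} = \tilde{X}^2$ is an arbitrary way of making numerator and denominator dependent:

    import numpy as np

    # E(Y/X) versus E(Y)/E(X) when Y and X are dependent.
    rng = np.random.default_rng(5)
    x = rng.uniform(1.0, 2.0, size=1_000_000)       # bounded away from zero
    y = x**2                                        # Y depends on X

    print("E(Y/X)    ~", np.mean(y / x))            # ~ E(X) = 1.5
    print("E(Y)/E(X) ~", np.mean(y) / np.mean(x))   # ~ (7/3)/1.5 = 1.556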
20.2.2 The Variance Operator
For the variance operator, $\mathrm{var}(\cdot)$ or $V(\cdot)$, we have the following rules,

$\mathrm{var}(c) = 0$,

$\mathrm{var}(c\tilde{X}) = c^2\,\mathrm{var}(\tilde{X}) = c^2\sigma_x^2$,

$\mathrm{var}(k + c\tilde{X}) = c^2\,\mathrm{var}(\tilde{X}) = c^2\sigma_x^2$,

$\mathrm{var}(\tilde{Y} + \tilde{X}) = \mathrm{var}(\tilde{Y}) + \mathrm{var}(\tilde{X}) + 2\,\mathrm{cov}(\tilde{Y}, \tilde{X}) = \sigma_y^2 + \sigma_x^2 + 2\sigma_{yx}$,

$\mathrm{var}(\tilde{Y} - \tilde{X}) = \mathrm{var}(\tilde{Y}) + \mathrm{var}(\tilde{X}) - 2\,\mathrm{cov}(\tilde{Y}, \tilde{X}) = \sigma_y^2 + \sigma_x^2 - 2\sigma_{yx}$.

20.2.3 The Covariance Operator
The covariance operator ($\mathrm{cov}$) has already been used above. It can be thought of as a generalization of the variance operator. Suppose we have two elements of $\tilde{X}$, call them $\tilde{X}_i$ and $\tilde{X}_j$. The elements can be two random variables in a multivariate process, or refer to observations at different times ($i$ and $j$) of the same univariate time series process. The covariance between $\tilde{X}_i$ and $\tilde{X}_j$ is

$\mathrm{cov}(\tilde{X}_i, \tilde{X}_j) = E\{[\tilde{X}_i - E(\tilde{X}_i)][\tilde{X}_j - E(\tilde{X}_j)]\} = \sigma_{ij}$,
[To be completed!]
The covariance matrix of a random variable $\tilde{X}$ with $p$ elements can be defined as,

$E\{[\tilde{X} - E(\tilde{X})][\tilde{X} - E(\tilde{X})]'\} = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{bmatrix}$,
where $\sigma_{ii} = \sigma_i^2$, the variance of the $i$:th element.
Like the expectations and the variance operators, there are some simple rules. If we add constants $a$ and $b$ to $\tilde{X}_i$ and $\tilde{X}_j$,

$\mathrm{cov}(\tilde{X}_i + a,\ \tilde{X}_j + b) = \mathrm{cov}(\tilde{X}_i,\ \tilde{X}_j)$.

If we multiply $\tilde{X}_i$ and $\tilde{X}_j$ by the constants $a$ and $b$ respectively, we get,

$\mathrm{cov}(a\tilde{X}_i,\ b\tilde{X}_j) = ab\,\mathrm{cov}(\tilde{X}_i,\ \tilde{X}_j)$.

The covariance operator is sometimes also written as $C(\cdot)$.
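A quick numerical check of the two rules (the constants and the data-generating process are arbitrary):

    import numpy as np

    rng = np.random.default_rng(6)
    xi = rng.normal(size=100_000)
    xj = 0.5 * xi + rng.normal(size=100_000)        # correlated with xi
    a, b = 3.0, -2.0

    c0 = np.cov(xi, xj)[0, 1]
    # Adding constants leaves the covariance unchanged:
    print(np.cov(xi + a, xj + b)[0, 1], "vs", c0)
    # Scaling by a and b scales the covariance by a*b:
    print(np.cov(a * xi, b * xj)[0, 1], "vs", a * b * c0)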
20.2.4 The Sum Operator

In the following, the sum operator is written as,

$\sum_{i=1}^{m} x_i$.  (20.9)
Some important rules deal with series of integer numbers, like a deterministic time trend $t = 1, 2, \ldots, T$. These are of interest when dealing with integrated variables and determining the order in probability, that is, the order of convergence, here indicated with $O(\cdot)$,

$\sum_{t=1}^{T} t = \frac{1}{2}T(T + 1) = O(T^2)$,  (20.11)

$\sum_{t=1}^{T} t^2 = \frac{1}{6}T(T + 1)(2T + 1) = O(T^3)$,  (20.12)

$\sum_{t=1}^{T} t^3 = 1^3 + 2^3 + \cdots + T^3 = \frac{1}{4}T^2(T + 1)^2 = O(T^4)$.  (20.13)
20.2.5 The Plim Operator

An estimator $\hat{\beta}$ converges in probability to the constant $\beta$ if, for arbitrarily small positive numbers $\varepsilon$ and $\delta$, there exists a sample size $n_0$ such that

$P[\,|\hat{\beta}_n - \beta| < \varepsilon\,] > 1 - \delta \quad \text{for } n > n_0$,  (20.14)

or, equivalently,

$\lim_{n \to \infty} P[\,|\hat{\beta}_n - \beta| < \varepsilon\,] = 1$.  (20.15)

This is written as

$\hat{\beta} \overset{p}{\to} \beta, \quad \text{or} \quad \operatorname{plim} \hat{\beta} = \beta$.  (20.16)
Probability limits are useful for examining the asymptotic properties of estimators of stationary processes. There are a few simple rules to follow,

$\operatorname{plim}(ax + by) = a\operatorname{plim}(x) + b\operatorname{plim}(y)$,  (20.17)

$\operatorname{plim}(xy) = \operatorname{plim}(x)\operatorname{plim}(y)$,  (20.18)

$\operatorname{plim}(x/y) = \operatorname{plim}(x)/\operatorname{plim}(y)$, provided $\operatorname{plim}(y) \neq 0$,  (20.19)

$\operatorname{plim}(x^{-1}) = [\operatorname{plim}(x)]^{-1}$,  (20.20)

$\operatorname{plim}(x^2) = [\operatorname{plim}(x)]^2$,  (20.21)

and, for matrices $A$ and $B$ with $\operatorname{plim}(A)$ invertible,

$\operatorname{plim}(A^{-1}) = [\operatorname{plim}(A)]^{-1}$,  (20.22)

$\operatorname{plim}(AB) = \operatorname{plim}(A)\operatorname{plim}(B)$.  (20.23)

These rules hold regardless of whether the variables are independent or not.
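A small simulation of the plim concept: the sample mean of iid draws with population mean 2 converges in probability to 2, and by the rules above its square and reciprocal converge to 4 and 0.5 (the distribution is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(7)
    for n in (100, 10_000, 1_000_000):
        m = rng.normal(loc=2.0, size=n).mean()      # sample mean, plim = 2
        # By the rules above: plim(m^2) = 4 and plim(1/m) = 0.5.
        print(f"n={n:>9}: mean={m:.4f} mean^2={m**2:.4f} 1/mean={1/m:.4f}")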
20.2.6 The Lag and the Difference Operators

The lag operator $L$ is defined as $L^n x_t = x_{t-n}$; for negative powers we have $L^{-n} x_t = x_{t+n}$. With the lag operator it becomes possible to write long lag structures in a simpler way.
From the lag operator follows the difference operator $\Delta = 1 - L$, such that

$\Delta x_t = x_t - x_{t-1}$,

or, rearranged, $x_t = \Delta x_t + x_{t-1}$. Differencing $d$ times is written as,

$\Delta^d x_t = (1 - L)^d x_t$.

Setting $d = 2$ we get,

$\Delta^2 x_t = (1 - L)^2 x_t = (1 - 2L + L^2) x_t = x_t - 2x_{t-1} + x_{t-2}$.
The letter $d$ indicates the number of differences, which can be an integer such as -2, -1, 0, 1 or 2. It is also possible to use real numbers, typically between -1.5 and +1.5. With non-integer differencing we arrive at fractional integration and so-called long memory series. If variables are expressed in logs, which is the typical thing in time series, the first difference will be a close approximation to the percentage growth rate.
The lag operator is sometimes called the backward shift operator and is then indicated with the symbol $B^n$. The difference operator, defined with the backward shift operator, is written as $\nabla^d = (1 - B)^d$. Econometricians use the terms lag operator and difference operator with the symbols above; time series statisticians often use the backward shift notation.
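A short numerical check of the $d = 2$ case (the series is arbitrary):

    import numpy as np

    x = np.array([1.0, 3.0, 6.0, 10.0, 15.0])
    d2 = np.diff(x, n=2)                            # (1 - L)^2 x_t
    manual = x[2:] - 2 * x[1:-1] + x[:-2]           # x_t - 2x_{t-1} + x_{t-2}
    print(d2, manual)                               # identical: [1. 1. 1.]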