Lecture 1
Catching up: two-variable relationships
Irene Mammi
irene.mammi@unive.it
outline
▶ References:
▶ Johnston, J. and J. DiNardo (1997), Econometric Methods, 4th Edition, McGraw-Hill, New York, Chapters 1 and 2.
examples of bivariate relationships
examples of bivariate relationships (cont.)
Figure 3: natural log of gasoline consumption vs natural log of price per gallon
examples of bivariate relationships (cont.)
Figure 4: natural log of gasoline consumption vs natural log of income per capita
examples of bivariate relationships (cont.)
▶ data come in the form of n pairs of observations (X_i, Y_i), i = 1, 2, . . . , n
▶ when n gets large, we can consider a bivariate frequency distribution; each row below reports conditional means over successive classes of the conditioning variable:

mean of height given chest (inches): 66.31  66.84  67.89  69.16  70.53
mean of chest given height (inches): 38.41  39.19  40.26  40.76  41.80
correlation coefficient
▶ define deviations from the sample means:
x_i = X_i − X̄,   y_i = Y_i − Ȳ
correlation coefficient (cont.)
▶ the sign of ∑_{i=1}^n x_i y_i indicates whether the scatter slopes upward or downward
▶ better to express the sum in average terms, giving the sample covariance:
Cov(X, Y) = ∑_{i=1}^n (X_i − X̄)(Y_i − Ȳ)/n = ∑_{i=1}^n x_i y_i / n
▶ dividing by the sample standard deviations gives the sample correlation coefficient:
r = Cov(X, Y) / (√Var(X) √Var(Y)) = ∑_{i=1}^n x_i y_i / (n s_X s_Y) = ∑_{i=1}^n x_i y_i / √(∑_{i=1}^n x_i² ∑_{i=1}^n y_i²)
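▶ a minimal numerical sketch of these formulas in Python (the data are made up for illustration); the manual computation is checked against numpy's built-in correlation:

import numpy as np

# illustrative data, not from the lecture
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x = X - X.mean()                      # deviations from sample means
y = Y - Y.mean()
cov_xy = (x * y).sum() / len(X)       # Cov(X, Y) = sum(x_i y_i)/n
r = (x * y).sum() / np.sqrt((x**2).sum() * (y**2).sum())

print(cov_xy, r)
print(np.corrcoef(X, Y)[0, 1])        # agrees with the manual r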
probability models for two variables
discrete bivariate probability distribution

                X_1   ...   X_i   ...   X_m   | marginal probability
  Y_1          p_11  ...   p_i1  ...   p_m1  |   p_.1
  ...           ...         ...         ...   |   ...
  Y_j          p_1j  ...   p_ij  ...   p_mj  |   p_.j
  ...           ...         ...         ...   |   ...
  Y_p          p_1p  ...   p_ip  ...   p_mp  |   p_.p
  marginal     p_1.  ...   p_i.  ...   p_m.  |    1
  probability
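▶ a short Python sketch of how marginals and conditionals follow from a joint table (the 3×2 joint distribution is hypothetical):

import numpy as np

# hypothetical joint distribution p_ij = P(X = X_i, Y = Y_j); rows index X
p = np.array([[0.10, 0.20],
              [0.30, 0.15],
              [0.15, 0.10]])

p_x = p.sum(axis=1)               # marginal probabilities p_i. = sum_j p_ij
p_y = p.sum(axis=0)               # marginal probabilities p_.j = sum_i p_ij
assert np.isclose(p.sum(), 1.0)   # probabilities sum to one

# conditional distribution of Y given X = X_1: p_1j / p_1.
print(p[0] / p_x[0])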
probability models for two variables (cont.)
▶ for the discrete bivariate distribution, the covariance is
Cov(X, Y) = E[(X − µ_X)(Y − µ_Y)] = ∑_i ∑_j (X_i − µ_X)(Y_j − µ_Y) p_ij
probability models for two variables (cont.)
conditional probabilities
▶ from the joint and marginal probabilities, the conditional probability of Y_j given X_i is p_ij / p_i.
probability models for two variables (cont.)
the bivariate normal distribution
▶ the most famous distribution for continuous variables is the bivariate normal
▶ when X and Y follow a bivariate normal distribution, the joint probability density function (pdf) is given by

f(x, y) = [1/(2π σ_X σ_Y √(1 − ρ²))] × exp{ −[1/(2(1 − ρ²))] [ ((x − µ_X)/σ_X)² − 2ρ((x − µ_X)/σ_X)((y − µ_Y)/σ_Y) + ((y − µ_Y)/σ_Y)² ] }

where x and y stand for the values taken by X and Y and ρ is the correlation coefficient between X and Y
▶ integrating over y gives the marginal distribution for X:

f(x) = [1/√(2π σ_X²)] exp[ −(1/2)((x − µ_X)/σ_X)² ],   −∞ < x < ∞

▶ the conditional distribution of Y given X = x is normal as well:

f(y|x) = f(x, y)/f(x) = [1/(√(2π) σ_{Y|X})] exp[ −(1/2)((y − µ_{Y|X})/σ_{Y|X})² ]

with conditional mean µ_{Y|X} = µ_Y + ρ(σ_Y/σ_X)(x − µ_X) and conditional variance

σ²_{Y|X} = σ_Y² (1 − ρ²)
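▶ a simulation sketch of the conditional-variance result σ²_{Y|X} = σ_Y²(1 − ρ²), with illustrative parameter values:

import numpy as np

rng = np.random.default_rng(0)
mu = [1.0, 2.0]
sigma_x, sigma_y, rho = 1.0, 2.0, 0.6
cov = [[sigma_x**2, rho * sigma_x * sigma_y],
       [rho * sigma_x * sigma_y, sigma_y**2]]
X, Y = rng.multivariate_normal(mu, cov, size=200_000).T

# residual variance of Y after the linear regression on X
beta = np.cov(X, Y, bias=True)[0, 1] / X.var()
resid = Y - Y.mean() - beta * (X - X.mean())
print(resid.var(), sigma_y**2 * (1 - rho**2))   # both close to 2.56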
the two variables linear regression model
a conditional model
▶ a conditional model specifies the expectation of Y given X as some function g of X:
E(Y|X) = g(X)
▶ the linear regression model takes this function to be linear:
E(Y|X) = α + βX
the two variables linear regression model (cont.)
least-squares estimators
▶ given fitted values Ŷ_i = a + bX_i, the residuals are
e_i = Y_i − Ŷ_i = Y_i − a − bX_i,   i = 1, 2, . . . , n
the two variables linear regression model (cont.)
▶ the least-squares principle chooses a and b to minimize the residual sum of squares, RSS = ∑_i e_i²
the two variables linear regression model (cont.)
▶ taking derivatives of RSS with respect to a and b and setting them to zero gives
∂(∑_i e_i²)/∂a = −2 ∑_i (Y_i − a − bX_i) = −2 ∑_i e_i = 0
∂(∑_i e_i²)/∂b = −2 ∑_i X_i (Y_i − a − bX_i) = −2 ∑_i X_i e_i = 0
▶ rearranging yields the normal equations
∑_i Y_i = na + b ∑_i X_i
∑_i X_i Y_i = a ∑_i X_i + b ∑_i X_i²
▶ solving them gives
b = ∑_i x_i y_i / ∑_i x_i²,   a = Ȳ − bX̄
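▶ a minimal Python sketch (made-up data) computing the LS coefficients from these formulas and checking that the residuals satisfy the normal equations:

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

x = X - X.mean()
y = Y - Y.mean()
b = (x * y).sum() / (x**2).sum()    # slope: sum x_i y_i / sum x_i^2
a = Y.mean() - b * X.mean()         # intercept: a = Ybar - b*Xbar

e = Y - a - b * X                   # residuals satisfy the normal equations:
print(e.sum(), (X * e).sum())       # both ~0 up to floating-point error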
the two variables linear regression model (cont.)
▶ an unbiased estimator of the disturbance variance σ² is
s² = ∑_i e_i² / (n − 2)
the two variables linear regression model (cont.)
▶ each observation splits into a fitted value and a residual, and likewise in deviations from Ȳ:
Y_i = Ŷ_i + e_i
Y_i − Ȳ = (Ŷ_i − Ȳ) + e_i
the two variables linear regression model (cont.)
▶ ∑_i (Y_i − Ȳ)² = TSS: total sum of squared deviations in Y
▶ ∑_i (Ŷ_i − Ȳ)² = ESS: explained sum of squares from the regression of Y on X
▶ ∑_i e_i² = RSS: residual, or unexplained, sum of squares from the regression of Y on X
▶ the previous decomposition gives TSS = ESS + RSS, which can be rewritten as
R² = ESS/TSS = 1 − RSS/TSS
▶ the closer R² is to 1, the closer the sample values Y_i lie to the fitted line
▶ if r is the sample correlation coefficient, then R² = r²
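▶ a quick numerical check of the decomposition and of R² = r² (illustrative data):

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([1.8, 4.2, 5.7, 8.3, 9.6])

b = np.cov(X, Y, bias=True)[0, 1] / X.var()
a = Y.mean() - b * X.mean()
Y_hat = a + b * X

TSS = ((Y - Y.mean())**2).sum()
ESS = ((Y_hat - Y.mean())**2).sum()
RSS = ((Y - Y_hat)**2).sum()

print(np.isclose(TSS, ESS + RSS))             # decomposition holds
print(ESS / TSS, np.corrcoef(X, Y)[0, 1]**2)  # R^2 equals r^2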
inference in the two variables least-squares model
properties of LS estimators
▶ the slope estimator can be written as
b = ∑_i w_i Y_i   where   w_i = (X_i − X̄) / ∑_i (X_i − X̄)²
so that the LS slope estimator is linear in the Y values
inference in the two variables least-squares model (cont.)
▶ substituting Y_i = α + βX_i + u_i and using the stochastic properties of u (note that ∑_i w_i = 0 and ∑_i w_i X_i = 1) we have
b = α(∑_i w_i) + β(∑_i w_i X_i) + ∑_i w_i u_i = β + ∑_i w_i u_i
from which
E(b) = β
that is, b is an unbiased estimator of β
▶ the variance is
var(b) = E[(b − β)²] = E[(∑_i w_i u_i)²]
which reduces to
var(b) = σ² / ∑_i (X_i − X̄)²
inference in the two variables least-squares model (cont.)
▶ the intercept estimator is likewise unbiased,
E(a) = α
with variance
var(a) = σ² [1/n + X̄² / ∑_i (X_i − X̄)²]
▶ the covariance of the two estimators is
cov(a, b) = −σ² X̄ / ∑_i (X_i − X̄)²
inference in the two variables least-squares model (cont.)
Gauss-Markov theorem
▶ the sampling variances of the LS estimators are the smallest that can be achieved by any linear unbiased estimator
▶ looking at estimators of β, let
b* = ∑_i c_i Y_i
unbiasedness for every β requires ∑_i c_i = 0 and ∑_i c_i X_i = 1, and among such weights the variance σ² ∑_i c_i² is minimized by c_i = w_i, i.e. by the LS estimator
inference in the two variables least-squares model (cont.)
inference procedures
▶ up to now, results only require the assumption that the u_i are i.i.d.(0, σ²)
▶ inference also requires the assumption of normality
▶ since linear combinations of normal variables are themselves normally distributed, the sampling distribution of (a, b) is bivariate normal
▶ thus
b ∼ N(β, σ² / ∑_i (X_i − X̄)²)
▶ the standard deviation of the sampling distribution is referred to as the standard error of b and denoted by se(b)
▶ the sampling distribution of the intercept term is
a ∼ N(α, σ² [1/n + X̄² / ∑_i (X_i − X̄)²])
inference in the two variables least-squares model (cont.)
▶ if σ² were known, we would also have
z = (b − β) / (σ/√(∑_i (X_i − X̄)²)) ∼ N(0, 1)
so a test of H0: β = β₀ could use
(b − β₀) / (σ/√(∑_i (X_i − X̄)²)) = (b − β₀)/se(b)
▶ σ² is unknown, but independently of b we have
∑_i e_i² / σ² ∼ χ²(n − 2)
inference in the two variables least-squares model (cont.)
▶ combining these results and replacing σ with s, we have
(b − β) / (s/√(∑_i (X_i − X̄)²)) ∼ t(n − 2)
▶ H0: β = β₀ would be rejected at the 5% level (two-tailed) if
|b − β₀| / (s/√(∑_i (X_i − X̄)²)) > t₀.₀₂₅(n − 2)
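▶ a sketch of the t-test for H0: β = 0 in Python (illustrative data; scipy supplies the t critical value):

import numpy as np
from scipy import stats

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.3, 3.1, 4.2, 4.0, 5.8, 6.1])
n = len(X)

x = X - X.mean()
b = (x * (Y - Y.mean())).sum() / (x**2).sum()
a = Y.mean() - b * X.mean()
e = Y - a - b * X
s2 = (e**2).sum() / (n - 2)          # unbiased estimate of sigma^2
se_b = np.sqrt(s2 / (x**2).sum())    # standard error of the slope

t_stat = b / se_b                    # (b - beta_0)/se(b) with beta_0 = 0
print(t_stat, stats.t.ppf(0.975, n - 2))   # reject H0 if |t| > critical value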
prediction in the two variables regression model
▶ the point prediction at X = X₀ is given by the regression value
Ŷ₀ = a + bX₀ = Ȳ + bx₀
where x₀ = X₀ − X̄
▶ the true value of Y for the prediction period or observation is
Y₀ = α + βX₀ + u₀
while averaging the sample relation gives
Ȳ = α + βX̄ + ū
▶ subtracting gives
Y₀ = Ȳ + βx₀ + u₀ − ū
▶ the prediction error is defined as e₀ = Y₀ − Ŷ₀, with variance
var(e₀) = σ² [1 + 1/n + x₀² / ∑_i (X_i − X̄)²]
▶ replacing σ² by s² gives
(Y₀ − Ŷ₀) / (s √(1 + 1/n + (X₀ − X̄)²/∑_i (X_i − X̄)²)) ∼ t(n − 2)
prediction in the two variables regression model (cont.)
▶ everything is known except Y₀, so a 95% confidence interval for Y₀ is
(a + bX₀) ± t₀.₀₂₅ s √(1 + 1/n + (X₀ − X̄)²/∑_i (X_i − X̄)²)
▶ predicting the expected value instead, note that
E(Y₀) = α + βX₀
is simply the point on the population regression line E(Y|X) = α + βX
▶ throughout, the disturbances are assumed to satisfy
E(u_i) = 0 for all i
E(u_i²) = σ² for all i
E(u_i u_j) = 0 for all i ≠ j
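▶ a Python sketch of the 95% prediction interval at a new point X₀ (made-up data):

import numpy as np
from scipy import stats

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.3, 3.1, 4.2, 4.0, 5.8, 6.1])
n, X0 = len(X), 7.0

x = X - X.mean()
b = (x * (Y - Y.mean())).sum() / (x**2).sum()
a = Y.mean() - b * X.mean()
e = Y - a - b * X
s = np.sqrt((e**2).sum() / (n - 2))

half = stats.t.ppf(0.975, n - 2) * s * np.sqrt(
    1 + 1/n + (X0 - X.mean())**2 / (x**2).sum())
print(a + b * X0 - half, a + b * X0 + half)   # interval bounds for Y_0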
time as a regressor
▶ many economic variables increase or decrease with time
▶ a linear trend relationship would be modeled as
Y = α + βT + u
which in first differences implies a constant change per period:
ΔY_t = β + (u_t − u_{t−1})
▶ constant growth at rate g instead gives
Y_t = Y₀(1 + g)^t
or equivalently
Y_t = Y₀ e^{βt}   or   ln Y_t = α + βt
▶ taking first differences gives
Δ ln Y_t = β = ln(1 + g) ≈ g

log-log transformation
Y_t = AX^β or ln Y_t = α + β ln X
▶ β represents the elasticity of Y with respect to X, ε = (dY/dX)(X/Y)

semilog transformation
ln Y = α + βX + u
▶ β = (1/Y)(dY/dX) represents the proportionate change in Y per unit change in X
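▶ a small sketch of the growth-rate result Δ ln Y_t = ln(1 + g) ≈ g, with an assumed growth rate of 3% and no noise for clarity:

import numpy as np

g, Y0, T = 0.03, 100.0, 40
t = np.arange(T)
Y = Y0 * (1 + g)**t            # Y_t = Y_0 (1+g)^t

dlnY = np.diff(np.log(Y))      # Delta ln Y_t = ln(1+g) for every t
print(dlnY.mean(), np.log(1 + g), g)   # 0.02956..., 0.02956..., 0.03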
lagged dependent variable as regressor
▶ consider a model with the lagged dependent variable as regressor, Y_t = α + βY_{t−1} + u_t; least squares gives the normal equations
∑ Y_t = na + b ∑ Y_{t−1}
∑ Y_t Y_{t−1} = a ∑ Y_{t−1} + b ∑ Y²_{t−1}
lagged dependent variable as regressor (cont.)
▶ by repeated substitution we obtain
Y₁ = α + βY₀ + u₁
Y₂ = α + β(α + βY₀ + u₁) + u₂ = α(1 + β) + β²Y₀ + (u₂ + βu₁)
and, in general,
Y_t = α(1 + β + β² + . . . + β^{t−1}) + β^t Y₀ + (u_t + βu_{t−1} + β²u_{t−2} + . . . + β^{t−1}u₁)
▶ hence Y_t is correlated with the current and all earlier disturbances:
E(Y_t u_t) = σ²
E(Y_t u_{t−1}) = βσ²
E(Y_t u_{t−2}) = β²σ²
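▶ a simulation sketch of these moments (assumed parameter values α = 0.5, β = 0.7, σ = 1), averaging across many replications:

import numpy as np

rng = np.random.default_rng(1)
alpha, beta, sigma, T, R = 0.5, 0.7, 1.0, 200, 20_000

u = rng.normal(0, sigma, (R, T))
Y = np.zeros((R, T))               # arbitrary initial value Y_0 = 0
for t in range(1, T):
    Y[:, t] = alpha + beta * Y[:, t - 1] + u[:, t]   # AR(1) recursion

print((Y[:, -1] * u[:, -1]).mean())        # ~ sigma^2 = 1.0
print((Y[:, -1] * u[:, -2]).mean())        # ~ beta*sigma^2 = 0.7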
intro to asymptotics
intro to asymptotics (cont.)
convergence in probability
▶ consider the sample mean x̄_n from n i.i.d. drawings with mean µ and variance σ²; then
E(x̄_n) = µ   and   var(x̄_n) = σ²/n
so that x̄_n is an unbiased estimator and its variance tends to zero as n increases
▶ the distribution of x̄_n becomes more and more concentrated in the neighborhood of µ as n increases
▶ define µ ± ε to be a neighborhood around µ: convergence in probability means that, for any ε > 0, lim_{n→∞} Pr(µ − ε < x̄_n < µ + ε) = 1
intro to asymptotics (cont.)
▶ the shorthand expression is
plim x̄_n = µ
▶ the sample mean is a consistent estimator of µ
intro to asymptotics (cont.)
convergence in distribution
▶ a sequence of random variables converges in distribution when the corresponding sequence of distribution functions tends to a limiting distribution function; by the central limit theorem this applies to the standardized sample mean √n(x̄_n − µ)/σ, whose limiting distribution is N(0, 1)
intro to asymptotics (cont.)
▶ in shorthand, the asymptotic (large-n) distribution of the sample mean is written
x̄_n ∼ᵃ N(µ, σ²/n)
intro to asymptotics (cont.)
autoregressive equation
▶ consider again LS estimation of
Y_t = α + βY_{t−1} + u_t
with normal equations
∑ Y_t = na + b ∑ Y_{t−1}
∑ Y_t Y_{t−1} = a ∑ Y_{t−1} + b ∑ Y²_{t−1}
▶ it can be proved that √n(a − α) and √n(b − β) have a bivariate normal limiting distribution with zero means and finite variances and covariances
▶ thus the LS estimators are consistent for α and β
▶ the application of LS formulae to the AR model has an asymptotic, or large-sample, justification
intro to asymptotics (cont.)
▶ this asymptotic justification rests on two conditions:
(1) the u_t are i.i.d. with zero mean and finite variance
(2) the Y_t series is stationary
stationary and nonstationary series
▶ consider again the AR(1) model
Y_t = α + βY_{t−1} + u_t
for which repeated substitution gave
Y_t = α(1 + β + β² + . . . + β^{t−1}) + β^t Y₀ + (u_t + βu_{t−1} + β²u_{t−2} + . . . + β^{t−1}u₁)
▶ assume that the process started a very long time ago, so that we can write
E(Y_t) = α(1 + β + β² + . . .)
which only exists if the infinite geometric series on the RHS has a limit
▶ the necessary and sufficient condition is
|β| < 1
stationary and nonstationary series (cont.)
▶ the expectation is then
E(Y_t) = µ = α/(1 − β)
▶ in deviations from the mean,
Y_t − µ = u_t + βu_{t−1} + β²u_{t−2} + . . .
so the variance is
var(Y) = σ_Y² = σ²/(1 − β²)
stationary and nonstationary series (cont.)
▶ the Y series has a constant unconditional variance, independent of time
▶ define the autocovariance as the covariance of Y with a lagged value of itself
▶ for the stationary AR(1) process the autocovariance at lag s is
γ_s = β^s σ_Y²,   s = 0, 1, 2, . . .
stationary and nonstationary series (cont.)
▶ nb: the autocovariances depend only on the lag length and are independent of t
▶ γ₀ is the variance; dividing the autocovariances by the variance gives the set of autocorrelation coefficients (or serial correlation coefficients), defined by
ρ_s = γ_s/γ₀ = β^s,   s = 0, 1, 2, . . .
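▶ a simulation sketch comparing sample autocorrelations of a stationary AR(1) with the theoretical β^s (assumed β = 0.8, α = 0 for simplicity):

import numpy as np

rng = np.random.default_rng(2)
beta, T = 0.8, 100_000

u = rng.normal(size=T)
Y = np.zeros(T)
for t in range(1, T):
    Y[t] = beta * Y[t - 1] + u[t]     # AR(1) with alpha = 0

y = Y - Y.mean()
for s in range(4):
    rho_s = (y[s:] * y[:T - s]).sum() / (y**2).sum()
    print(s, rho_s, beta**s)          # sample vs theoretical autocorrelation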
stationary and nonstationary series (cont.)
unit root
▶ when β = 1, the series has a unit root and becomes a random walk with drift; conditional on Y₀,
E(Y_t|Y₀) = αt + Y₀
▶ in the unit root case the unconditional mean and variance of Y do not exist: the conditional mean grows with t and the conditional variance, var(Y_t|Y₀) = tσ², grows without bound ⇒ the series is said to be nonstationary, and the asymptotic results do not hold
▶ when |β| > 1, the series exhibits explosive behavior
maximum likelihood estimation of the AR model
▶ if some assumptions are made about the specific form of the pdf for u, it is possible to derive maximum likelihood estimators (MLEs) of the parameters of the AR model
▶ MLEs are consistent, asymptotically normal and asymptotically efficient
▶ assume that the disturbances u_i are i.i.d. N(0, σ²), so that the pdf is
f(u_i) = (1/(σ√(2π))) e^{−u_i²/2σ²},   i = 1, 2, . . . , n
▶ given an arbitrary initial value Y₀, any observed set of sample values Y₁, Y₂, . . . , Y_n is generated by some set of u values
maximum likelihood estimation of the AR model (cont.)
▶ by independence, the probability of a set of u values is the product of the individual densities:
f(u₁, . . . , u_n) = ∏_{i=1}^n f(u_i)
▶ this density may be interpreted in two ways: (1) for given α, β, and σ² it indicates the probability of a set of sample outcomes; (2) it is a function of α, β, and σ², conditional on a set of sample outcomes
maximum likelihood estimation of the AR model (cont.)
▶ under interpretation (2), we refer to the density as the likelihood function L; its logarithm is denoted
ℓ = ln L
maximum likelihood estimation of the AR model (cont.)
▶ since ℓ is a monotonic transformation of L, the MLEs may equally be obtained by solving
∂ℓ/∂α = ∂ℓ/∂β = ∂ℓ/∂σ² = 0
▶ for the AR model, the log-likelihood (conditional on Y₀) is
ℓ = −(n/2) ln(2π) − (n/2) ln σ² − (1/2σ²) ∑_{t=1}^n (Y_t − α − βY_{t−1})²
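▶ a sketch of maximizing this conditional log-likelihood numerically (simulated data with assumed true values α = 0.5, β = 0.7, σ = 1; a general-purpose optimizer stands in for the analytical first-order conditions):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
alpha, beta, sigma, T = 0.5, 0.7, 1.0, 500
Y = np.zeros(T + 1)
for t in range(1, T + 1):
    Y[t] = alpha + beta * Y[t - 1] + rng.normal(0, sigma)

def neg_loglik(theta):
    a, b, s2 = theta
    if s2 <= 0:
        return np.inf                       # keep the variance positive
    e = Y[1:] - a - b * Y[:-1]              # residuals given (a, b)
    n = len(e)
    return 0.5 * n * np.log(2 * np.pi) + 0.5 * n * np.log(s2) \
        + (e**2).sum() / (2 * s2)           # minus the log-likelihood

res = minimize(neg_loglik, x0=[0.0, 0.0, 1.0], method="Nelder-Mead")
print(res.x)    # estimates close to (0.5, 0.7, 1.0)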
maximum likelihood estimation of the AR model (cont.)
properties of MLEs
▶ as noted above, MLEs are consistent, asymptotically normal and asymptotically efficient, which provides the large-sample justification for ML estimation of the AR model