
Notes on Asymptotic Theory: Convergence in Probability and Distribution


Introduction to Econometric Theory
Econ. 770
Jonathan B. Hill
Dept. of Economics
University of North Carolina - Chapel Hill
November 19, 2011

Introduction

Let $(\Omega, \mathcal{F}, P)$ be a probability space. Throughout $\theta$ is a parameter of interest like the mean, variance, correlation, or distribution parameters like Poisson $\lambda$, Binomial $p$, or exponential $\lambda$. Throughout $\{\hat\theta_n\}_{n=1}^{\infty}$ is a sequence of estimators of $\theta$ based on a sample of data $\{X_i\}_{i=1}^{n}$ with sample size $n \geq 1$. Assume $\hat\theta_n$ is $\mathcal{F}$-measurable for any $n$. Unless otherwise noted, assume the $X_i$'s have the same mean and variance: $X_i \sim (\mu, \sigma^2)$. If appropriate, we may have a bivariate sample $\{X_i, Y_i\}_{i=1}^{n}$ where $X_i \sim (\mu_x, \sigma_x^2)$ and $Y_i \sim (\mu_y, \sigma_y^2)$.
Examples include the sample mean, variance, or correlation:

$$\text{Sample Mean}: \quad \bar X_n := \frac{1}{n}\sum_{i=1}^{n} X_i$$

$$\text{Sample Variance \#1}: \quad s_n^2 := \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar X_n\right)^2$$

$$\text{Sample Variance \#2}: \quad \hat\sigma_n^2 := \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar X_n\right)^2$$

$$\text{Sample Correlation}: \quad \hat\rho_n := \frac{\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar X_n\right)\left(Y_i - \bar Y_n\right)}{\hat\sigma_{x,n}\,\hat\sigma_{y,n}}$$

Similarly, we may estimate a probability by using a sample relative frequency:

$$\hat P_n(a) := \frac{1}{n}\sum_{i=1}^{n} I\left(X_i \leq a\right), \quad \text{the sample percentage of } X_i \leq a$$

Notice $\hat P_n(a)$ estimates $P(X_i \leq a)$.
We will look at estimator properties: what $\hat\theta_n$ is on average for any sample size $n$; and what $\hat\theta_n$ becomes as the sample size $n$ grows. In every case above the estimator is a variant of a straight average (e.g. $\hat P_n(a) = \frac{1}{n}\sum_{i=1}^{n} I(X_i \leq a)$ is a straight average of $I(X_i \leq a)$), or a function of a straight average (e.g. $\hat\sigma_n := (\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar X_n)^2)^{1/2}$, the square root of the average $(X_i - \bar X_n)^2$). We therefore pay particular attention to the sample mean.
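
For concreteness, a minimal Python sketch computing each estimator above from one simulated sample (the simulated design here is my own choice, loosely borrowed from the examples later in these notes):

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed for reproducibility
n = 100
x = rng.normal(7, 20, size=n)           # X_i ~ N(7, 400)
y = 4.3 + 2 * x + rng.normal(0, 30, size=n)

xbar = x.mean()                                    # sample mean
s2 = ((x - xbar) ** 2).sum() / (n - 1)             # sample variance #1 (1/(n-1))
sig2_hat = ((x - xbar) ** 2).mean()                # sample variance #2 (1/n)
ybar = y.mean()
rho_hat = ((x - xbar) * (y - ybar)).mean() / np.sqrt(
    sig2_hat * ((y - ybar) ** 2).mean())           # sample correlation
p_hat = (x <= 7).mean()                            # relative frequency estimate of P(X_i <= 7)
```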

Unbiasedness

Defn. We say $\hat\theta_n$ is an unbiased estimator of $\theta$ if $E[\hat\theta_n] = \theta$. Define bias as

$$\mathcal{B}\left(\hat\theta_n\right) := E\left[\hat\theta_n\right] - \theta$$

An unbiased estimator has zero bias: $\mathcal{B}(\hat\theta_n) = 0$. If we had an infinite number of samples of size $n$, then the average estimate $\hat\theta_n$ across all samples would be $\theta$. An asymptotically unbiased estimator satisfies $\mathcal{B}(\hat\theta_n) \to 0$ as $n \to \infty$.

Claim (Weighted Average): Let $X_i$ have a common mean $\mu := E[X_i]$. Then the weighted average $\hat\mu_n := \sum_{i=1}^{n}\omega_i X_i$ is an unbiased estimator of $\mu := E[X_i]$ if $\sum_{i=1}^{n}\omega_i = 1$.

Proof:

$$E\left[\sum_{i=1}^{n}\omega_i X_i\right] = \sum_{i=1}^{n}\omega_i E\left[X_i\right] = \mu\sum_{i=1}^{n}\omega_i = \mu. \quad \text{QED}.$$

Corollary (Straight Average): The sample mean $\bar X_n := \frac{1}{n}\sum_{i=1}^{n} X_i$ is a weighted average with flat or uniform weights $\omega_i = 1/n$, hence trivially $\sum_{i=1}^{n}\omega_i = 1$, hence $E[\bar X_n] = \mu$.
The problem then arises as to which weighted average $\sum_{i=1}^{n}\omega_i X_i$ may be preferred in practice, since any with unit summed weights is unbiased. We will discuss the concept of efficiency below, but the minimum mean-squared-error unbiased estimator has uniform weights if $X_i \sim iid(\mu, \sigma^2)$. That is:

Claim (Sample Mean is Best): Let $X_i \sim iid(\mu, \sigma^2)$. Then $\bar X_n$ is the best linear unbiased estimator of $\mu$ (i.e. it is BLUE).

Proof: We want to solve

$$\min_{\omega_1,\dots,\omega_n} E\left(\sum_{i=1}^{n}\omega_i X_i - \mu\right)^2 \quad \text{subject to} \quad \sum_{i=1}^{n}\omega_i = 1$$

The Lagrangian is

$$\mathcal{L}\left(\omega, \lambda\right) := E\left(\sum_{i=1}^{n}\omega_i\left(X_i - \mu\right)\right)^2 + \lambda\left(1 - \sum_{i=1}^{n}\omega_i\right)$$

where by independence $E(\sum_{i=1}^{n}\omega_i(X_i - \mu))^2 = \sigma^2\sum_{i=1}^{n}\omega_i^2$, hence

$$\mathcal{L}\left(\omega, \lambda\right) := \sigma^2\sum_{i=1}^{n}\omega_i^2 + \lambda\left(1 - \sum_{i=1}^{n}\omega_i\right)$$

The first order conditions are

$$\frac{\partial}{\partial\omega_i}\mathcal{L}\left(\omega,\lambda\right) = 2\sigma^2\omega_i - \lambda = 0 \quad \text{and} \quad \frac{\partial}{\partial\lambda}\mathcal{L}\left(\omega,\lambda\right) = 1 - \sum_{i=1}^{n}\omega_i = 0$$

Therefore $\omega_i = \lambda/(2\sigma^2)$ is a constant that sums to $\sum_{i=1}^{n}\omega_i = 1$. Write $\omega_i = \lambda/(2\sigma^2) =: \omega$. Since $\sum_{i=1}^{n}\omega_i = \sum_{i=1}^{n}\omega = n\omega = 1$ it follows $\omega_i = \omega = 1/n$. QED.

Remark: As in many cases here and below, uncorrelatedness can be substituted for independence since the same proof applies: $E[X_iX_j] = E[X_i]E[X_j]$ for all $i \neq j$. We can also replace uncorrelatedness with a condition that restricts the total correlation across all $X_i$ and $X_j$ for $i \neq j$, but such generality is typically only exploited in time series settings (where $X_j$ is at a different time period).

Claim (Sample Variance): Let $X_i \sim iid(\mu, \sigma^2)$. The estimator $s_n^2$ is unbiased and $\hat\sigma_n^2$ is negatively biased but asymptotically unbiased.

Proof: Notice

$$\frac{n-1}{n}s_n^2 = \hat\sigma_n^2 = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar X_n\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(\left(X_i - \mu\right) - \left(\bar X_n - \mu\right)\right)^2$$

$$= \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \mu\right)^2 - 2\left(\bar X_n - \mu\right)\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \mu\right) + \left(\bar X_n - \mu\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \mu\right)^2 - \left(\bar X_n - \mu\right)^2$$

By the iid assumption and the fact that $\bar X_n$ is unbiased

$$E\left(\bar X_n - \mu\right)^2 = V\left[\bar X_n\right] = \frac{1}{n^2}\sum_{i=1}^{n}V\left[X_i\right] = \frac{1}{n^2}\,n\sigma^2 = \frac{\sigma^2}{n}$$

Further, by definition $\sigma^2 := E[(X_i - \mu)^2]$ hence

$$E\left[\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \mu\right)^2\right] = \frac{1}{n}\sum_{i=1}^{n}E\left[\left(X_i - \mu\right)^2\right] = \frac{1}{n}\sum_{i=1}^{n}\sigma^2 = \sigma^2$$

Therefore

$$\frac{n-1}{n}E\left[s_n^2\right] = E\left[\hat\sigma_n^2\right] = \sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n}\sigma^2$$

This implies each claim: $E[s_n^2] = \sigma^2$ ($s_n^2$ is unbiased), $E[\hat\sigma_n^2] = \sigma^2(n-1)/n < \sigma^2$ ($\hat\sigma_n^2$ is negatively biased), and $E[\hat\sigma_n^2] = \sigma^2(n-1)/n \to \sigma^2$ ($\hat\sigma_n^2$ is asymptotically unbiased). QED.
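
A quick simulation check of the claim, a minimal sketch (my own) with $\sigma^2 = 1$: the average of $s_n^2$ across many samples should be near 1, while the average of $\hat\sigma_n^2$ should be near $(n-1)/n$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 10, 200_000
x = rng.normal(0, 1, size=(reps, n))               # sigma^2 = 1
dev2 = (x - x.mean(axis=1, keepdims=True)) ** 2    # (X_i - X_bar)^2 per sample
print(dev2.sum(axis=1).mean() / (n - 1))           # E[s_n^2]:          approx 1.0
print(dev2.mean())                                 # E[sigma_hat_n^2]:  approx (n-1)/n = 0.9
```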

Example: We simulate 100 samples of $N(75, 4)$ with sample size $n = 20$. In Figure 1 we plot $\bar X_n$ for each sample. The simulation average of all $\bar X_n$'s is 74.983941 and the simulation variance of all $\bar X_n$'s is .21615195. In Figure 2 we plot $\hat\mu_n = \sum_{i=1}^{n}\omega_i X_i$ for each sample with weights $\omega_i = i/\sum_{j=1}^{n} j$. The simulation average of all $\hat\mu_n$'s is 74.982795 and the simulation variance of all $\hat\mu_n$'s is .30940776. Thus, both display the same property of unbiasedness, but $\bar X_n$ exhibits less dispersion across samples.

Figure 1: $\bar X_n$ across 100 samples [plot omitted]

Figure 2: $\hat\mu_n$ across 100 samples [plot omitted]
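
This experiment is easy to replicate; a minimal sketch, assuming (as reconstructed above) that the Figure 2 weights are $\omega_i = i/\sum_{j=1}^{n} j$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 20, 100
w = np.arange(1, n + 1) / np.arange(1, n + 1).sum()   # weights w_i = i / sum_j j

x = rng.normal(75, 2, size=(reps, n))                 # 100 samples of N(75, 4)
xbar = x.mean(axis=1)                                 # straight average per sample
muhat = x @ w                                         # weighted average per sample

print(xbar.mean(), xbar.var())     # approx 75 and sigma^2/n = 0.2
print(muhat.mean(), muhat.var())   # approx 75, but a larger variance (approx 0.26)
```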

Convergence in Mean-Square or $L_2$-Convergence

Defn. We say $\hat\theta_n \in \mathbb{R}$ converges to $\theta$ in mean-square if

$$\text{MSE}\left(\hat\theta_n\right) := E\left[\hat\theta_n - \theta\right]^2 \to 0$$

We also write $\hat\theta_n \xrightarrow{ms} \theta$ and $\hat\theta_n \to \theta$ in mean-square.

If $\hat\theta_n$ is unbiased for $\theta$ then

$$\text{MSE}\left(\hat\theta_n\right) = E\left[\hat\theta_n - E\left[\hat\theta_n\right]\right]^2 = V\left[\hat\theta_n\right]$$

Convergence in mean-square certainly does not require unbiasedness. In general, the MSE is

$$\text{MSE}\left(\hat\theta_n\right) = E\left[\hat\theta_n - \theta\right]^2 = E\left[\left(\hat\theta_n - E\left[\hat\theta_n\right]\right) + \left(E\left[\hat\theta_n\right] - \theta\right)\right]^2$$

$$= E\left[\hat\theta_n - E\left[\hat\theta_n\right]\right]^2 + \left(E\left[\hat\theta_n\right] - \theta\right)^2 + 2E\left[\hat\theta_n - E\left[\hat\theta_n\right]\right]\left(E\left[\hat\theta_n\right] - \theta\right)$$

$$= E\left[\hat\theta_n - E\left[\hat\theta_n\right]\right]^2 + \left(E\left[\hat\theta_n\right] - \theta\right)^2$$

since $E[\hat\theta_n] - \theta$ is just a constant and $E[\hat\theta_n - E[\hat\theta_n]] = E[\hat\theta_n] - E[\hat\theta_n] = 0$. Hence MSE is the variance plus bias squared:

$$\text{MSE}\left(\hat\theta_n\right) = E\left[\hat\theta_n - E\left[\hat\theta_n\right]\right]^2 + \left(E\left[\hat\theta_n\right] - \theta\right)^2 = V\left[\hat\theta_n\right] + \mathcal{B}\left(\hat\theta_n\right)^2$$
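
The decomposition can be verified numerically; a minimal sketch (my own), using $\hat\sigma_n^2$ as the estimator of $\sigma^2 = 1$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 10, 200_000
x = rng.normal(0, 1, size=(reps, n))
sig2_hat = ((x - x.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)

mse = ((sig2_hat - 1.0) ** 2).mean()                            # direct MSE vs sigma^2 = 1
var_plus_bias2 = sig2_hat.var() + (sig2_hat.mean() - 1.0) ** 2  # variance + bias^2
print(mse, var_plus_bias2)                                      # agree up to simulation noise
```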

If $\hat\theta_n \in \mathbb{R}^k$ then we write

$$\text{MSE}\left(\hat\theta_n\right) := E\left[\left(\hat\theta_n - \theta\right)\left(\hat\theta_n - \theta\right)'\right] \to 0$$

hence component-wise convergence. We may similarly write convergence in $L_2$-norm

$$E\left\|\hat\theta_n - \theta\right\|_2 \to 0 \quad \text{where } \left\|x\right\|_2 := \left(\sum_{i=1}^{k}x_i^2\right)^{1/2}$$

or convergence in matrix (spectral) norm:

$$E\left\|\left(\hat\theta_n - \theta\right)\left(\hat\theta_n - \theta\right)'\right\| \to 0 \quad \text{where } \left\|A\right\| \text{ is the square root of the largest eigenvalue of } A'A$$

Both imply convergence with respect to each element: $E[\hat\theta_{i,n} - \theta_i]^2 \to 0$.

Defn. We say $\hat\theta_n \in \mathbb{R}$ has the property of $L_p$-convergence, or convergence in $L_p$-norm, to $\theta$ if for $p > 0$

$$E\left|\hat\theta_n - \theta\right|^p \to 0$$

Clearly $L_2$-convergence and mean-square convergence are equivalent.
Claim (Sample Mean): Let $X_i \sim iid(\mu, \sigma^2)$. Then $\bar X_n \to \mu$ in mean square.

Proof: $E(\bar X_n - \mu)^2 = V[\bar X_n] = \sigma^2/n \to 0$. QED.

We only require uncorrelatedness since $V[\bar X_n] = \sigma^2/n$ still holds.

Claim (Sample Mean): Let $X_i \sim (\mu, \sigma^2)$ be uncorrelated. Then $\bar X_n \to \mu$ in mean square.

Proof: $E(\bar X_n - \mu)^2 = V[\bar X_n] = \sigma^2/n \to 0$. QED.

In fact, we only need all cross covariances to not be too large as the sample size grows.

Claim (Sample Mean): Let $X_i \sim (\mu, \sigma^2)$ satisfy $\frac{1}{n^2}\sum_{i \neq j}\text{Cov}(X_i, X_j) \to 0$. Then $\bar X_n \to \mu$ in mean square.

Proof: $E(\bar X_n - \mu)^2 = V[\bar X_n] = \frac{\sigma^2}{n} + \frac{2}{n^2}\sum_{i < j}\text{Cov}(X_i, X_j) \to 0$. QED.

Remark: In micro-economic contexts involving cross-sectional data this type of correlatedness is evidently rarely or never entertained. Typically we assume the $X_i$'s are uncorrelated. It is, however, profoundly popular in macroeconomic and finance contexts where data are time series. A very large class of time series random variables satisfies both $\text{Cov}(X_i, X_j) \neq 0$ $\forall i \neq j$ and $\frac{1}{n^2}\sum_{i \neq j}\text{Cov}(X_i, X_j) \to 0$, and therefore exhibits $\bar X_n \to \mu$ in mean square.

If $X_i \sim iid(\mu, \sigma^2)$ then $\bar X_n \to \mu$ in $L_p$-norm for any $p \in (1, 2]$, but proving the result for non-integer $p \in (1, 2)$ is quite a bit more difficult. There are many types of "maximal inequalities", however, that can be used to prove

$$E\left|\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \mu\right)\right|^p \leq \frac{K}{n^{p-1}}E\left|X_i - \mu\right|^p \quad \text{for } p \in (1, 2), \text{ where } K > 0 \text{ is a finite constant.}$$

Claim (Sample Mean): Let $X_i \sim (\mu, \sigma^2)$ be iid. Then $\bar X_n \to \mu$ in $L_p$-norm for any $p \in (1, 2)$.

Proof:

$$E\left|\bar X_n - \mu\right|^p = E\left|\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \mu\right)\right|^p \leq \frac{K}{n^{p-1}}E\left|X_i - \mu\right|^p \to 0$$

since $p > 1$. QED.

Example: We simulate $N(7, 400)$ with sample sizes $n = 5, 15, 25, \dots, 1000$. In Figure 3 we plot $\bar X_n$ and $V[\bar X_n] = 400/n$ over sample size $n$. Notice the high volatility for small $n$.

Figure 3: $\bar X_n$ and $V[\bar X_n]$ over sample size $n$ [plot omitted]
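
A minimal sketch replicating this experiment (my own code; only the design $N(7, 400)$ and the grid of sample sizes come from the example):

```python
import numpy as np

rng = np.random.default_rng(4)
ns = np.arange(5, 1001, 10)                              # n = 5, 15, 25, ..., 995
xbars = np.array([rng.normal(7, 20, size=n).mean() for n in ns])
var_xbar = 400 / ns                                      # V[X_bar] = sigma^2 / n
# plotting xbars against ns reproduces the Figure 3 pattern:
# wide swings for small n, settling near mu = 7 as V[X_bar] shrinks
```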

Convergence in Probability: WLLN

Defn. We say $\hat\theta_n$ converges in probability to $\theta$ if

$$\lim_{n\to\infty} P\left(\left|\hat\theta_n - \theta\right| > \varepsilon\right) = 0 \quad \forall \varepsilon > 0$$

We variously write

$$\hat\theta_n \xrightarrow{p} \theta \quad \text{and} \quad \text{plim}\,\hat\theta_n = \theta \tag{1}$$

and we say $\hat\theta_n$ is a consistent estimator of $\theta$.

Since probability convergence is convergence in the sequence $\{P(|\hat\theta_n - \theta| \leq \varepsilon)\}_{n=1}^{\infty}$, by the definition of a limit it follows for every $\delta > 0$ there exists $N \geq 0$ such that

$$P\left(\left|\hat\theta_n - \theta\right| \leq \varepsilon\right) \geq 1 - \delta \quad \forall n \geq N$$

That is, for a large enough sample size $\hat\theta_n$ is guaranteed to be as close to $\theta$ as we choose (i.e. the $\varepsilon$) with as great a probability as we choose (i.e. $1 - \delta$).
Claim (Law of Large Numbers = LLN): If $X_i \sim iid(\mu, \sigma^2)$ then $\bar X_n \xrightarrow{p} \mu$.

Proof: By Chebyshev's inequality and independence, for any $\varepsilon > 0$

$$P\left(\left|\bar X_n - \mu\right| > \varepsilon\right) \leq \frac{E\left(\bar X_n - \mu\right)^2}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2} \to 0. \quad \text{QED}$$
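
The Chebyshev bound itself can be eyeballed by simulation; a minimal sketch (mine, with $\varepsilon = 5$ chosen so the bound is nontrivial):

```python
import numpy as np

rng = np.random.default_rng(5)
eps, reps = 5.0, 10_000
for n in (10, 100, 1000):
    xbar = rng.normal(7, 20, size=(reps, n)).mean(axis=1)   # sigma^2 = 400
    freq = (np.abs(xbar - 7) > eps).mean()                  # P(|X_bar - mu| > eps)
    bound = 400 / (n * eps ** 2)                            # sigma^2 / (n eps^2)
    print(n, freq, bound)                                   # freq <= bound, both -> 0
```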

Remark 1: We call this a Weak Law of Large Numbers [WLLN] since convergence is in probability. A Strong LLN based on a stronger form of convergence is given below.

Remark 2: We only need uncorrelatedness to get $E(\bar X_n - \mu)^2 = \sigma^2/n \to 0$. The WLLN, however, extends to many forms of dependent random variables.

Remark 3: In the iid case we only need $E|X_i| < \infty$, although the proof is substantially more complicated. Even for non-iid data we typically only need $E|X_i|^{1+\delta} < \infty$ for infinitesimal $\delta > 0$ (pay close attention to scholarly articles you read, and to your own assumptions: usually far stronger assumptions are imposed than are actually required).
The weighted average $\sum_{i=1}^{n}\omega_{i,n}X_i$ is also consistent as long as the weights decay with the sample size. Thus we write the weight as $\omega_{i,n}$.

Claim: If $X_i \sim iid(\mu, \sigma^2)$ then $\sum_{i=1}^{n}\omega_{i,n}X_i \xrightarrow{p} \mu$ if $\sum_{i=1}^{n}\omega_{i,n} = 1$ and $\sum_{i=1}^{n}\omega_{i,n}^2 \to 0$.

Proof: By Chebyshev's inequality, independence and $\sum_{i=1}^{n}\omega_{i,n} = 1$, for any $\varepsilon > 0$

$$P\left(\left|\sum_{i=1}^{n}\omega_{i,n}X_i - \mu\right| > \varepsilon\right) \leq \frac{1}{\varepsilon^2}E\left(\sum_{i=1}^{n}\omega_{i,n}\left(X_i - \mu\right)\right)^2 = \frac{1}{\varepsilon^2}\sum_{i=1}^{n}\omega_{i,n}^2 E\left[\left(X_i - \mu\right)^2\right] = \frac{\sigma^2}{\varepsilon^2}\sum_{i=1}^{n}\omega_{i,n}^2 \to 0$$

which proves the claim. QED.

An example is $\bar X_n$ with $\omega_{i,n} = 1/n$, but also the weights $\omega_{i,n} = i/\sum_{j=1}^{n} j$ used in Figure 2.
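
For the Figure 2 weights both conditions are easy to check numerically, since $\sum_{i=1}^{n}\omega_{i,n}^2 = \frac{2(2n+1)}{3n(n+1)} \approx \frac{4}{3n} \to 0$; a minimal sketch:

```python
import numpy as np

for n in (10, 100, 1000, 10000):
    w = np.arange(1, n + 1) / np.arange(1, n + 1).sum()  # w_i = i / sum_j j
    print(n, w.sum(), (w ** 2).sum())   # sums to 1; sum of squares -> 0 like 4/(3n)
```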

Example: We simulate $N(75, 20)$ with sample sizes $n = 5, 15, 25, \dots, 10000$. In Figures 4 and 5 we plot $\bar X_n$ and $\hat\mu_n = \sum_{i=1}^{n}\omega_{i,n}X_i$ over sample size $n$. Notice the high volatility for small $n$.
Figure 4: $\bar X_n$ over sample size $n$ [plot omitted]

Figure 5: $\hat\mu_n$ over sample size $n$ [plot omitted]

Claim (Slutsky Theorem): Let $\hat\theta_n \in \mathbb{R}^k$. If $\hat\theta_n \xrightarrow{p} \theta$ and $g : \mathbb{R}^k \to \mathbb{R}^l$ is continuous (except possibly with countably many discontinuity points) then $g(\hat\theta_n) \xrightarrow{p} g(\theta)$.

Corollary: Let $\hat\theta_{i,n} \xrightarrow{p} \theta_i$, $i = 1, 2$. Then $\hat\theta_{1,n} + \hat\theta_{2,n} \xrightarrow{p} \theta_1 + \theta_2$, $\hat\theta_{1,n}\hat\theta_{2,n} \xrightarrow{p} \theta_1\theta_2$, and if $\theta_2 \neq 0$ and $\liminf_{n\to\infty}|\hat\theta_{2,n}| > 0$ then $\hat\theta_{1,n}/\hat\theta_{2,n} \xrightarrow{p} \theta_1/\theta_2$.

Claim: If $X_i \sim iid(\mu, \sigma^2)$ and $E[X_i^4] < \infty$ then $s_n^2 \xrightarrow{p} \sigma^2$.

Proof: Note

$$\frac{n-1}{n}s_n^2 = \hat\sigma_n^2 = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar X_n\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \mu\right)^2 - \left(\bar X_n - \mu\right)^2$$

By the LLN $\bar X_n \xrightarrow{p} \mu$, therefore by the Slutsky Theorem $(\bar X_n - \mu)^2 \xrightarrow{p} 0$. By $E[X_i^4] < \infty$ it follows $(X_i - \mu)^2$ is iid with a finite variance, hence it satisfies the LLN: $\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2 \xrightarrow{p} E[(X_i - \mu)^2] = \sigma^2$. Hence $\hat\sigma_n^2 \xrightarrow{p} \sigma^2$, and since $n/(n-1) \to 1$ the Slutsky Theorem gives $s_n^2 = \frac{n}{n-1}\hat\sigma_n^2 \xrightarrow{p} \sigma^2$. QED.
Claim: If $X_i \sim iid(\mu_x, \sigma_x^2)$ and $Y_i \sim iid(\mu_y, \sigma_y^2)$ and $E[X_i^2Y_i^2] < \infty$ then the sample correlation $\hat\rho_n \xrightarrow{p} \rho$, the population correlation.

Example: We simulate $X_i \sim N(7, 400)$ and $\epsilon_i \sim N(0, 900)$, and construct $Y_i = 4.3 + 2X_i + \epsilon_i$. The true correlation is

$$\rho = \frac{E\left[X_iY_i\right] - E\left[X_i\right]E\left[Y_i\right]}{\sqrt{V\left[X_i\right]}\sqrt{V\left[Y_i\right]}} = \frac{4.3\,E\left[X_i\right] + 2E\left[X_i^2\right] - 7\left(4.3 + 2 \times 7\right)}{20\sqrt{4 \times 400 + 900}}$$

$$= \frac{4.3 \times 7 + 2\left(400 + 7^2\right) - 7\left(4.3 + 2 \times 7\right)}{20\sqrt{4 \times 400 + 900}} = .8$$

We estimate the correlation for samples with size $n = 5, 15, 25, \dots, 10000$. Figure 6 demonstrates consistency and therefore the Slutsky Theorem.

Figure 6: Correlation $\hat\rho_n$ over sample size $n$ [plot omitted]
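
A minimal sketch replicating the Figure 6 experiment (the design $Y_i = 4.3 + 2X_i + \epsilon_i$ is taken from the example above):

```python
import numpy as np

rng = np.random.default_rng(6)
for n in (5, 50, 500, 5000, 10000):
    x = rng.normal(7, 20, size=n)                 # X_i ~ N(7, 400)
    e = rng.normal(0, 30, size=n)                 # eps_i ~ N(0, 900)
    y = 4.3 + 2 * x + e
    print(n, np.corrcoef(x, y)[0, 1])             # approaches rho = .8
```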

Almost Sure Convergence: SLLN

Defn. We say $\hat\theta_n$ converges almost surely to $\theta$ if

$$P\left(\lim_{n\to\infty}\hat\theta_n = \theta\right) = 1$$

This is identical to

$$\lim_{n\to\infty} P\left(\sup_{m \geq n}\left|\hat\theta_m - \theta\right| > \varepsilon\right) = 0 \quad \forall \varepsilon > 0$$

We variously write

$$\hat\theta_n \xrightarrow{a.s.} \theta \quad \text{and} \quad \hat\theta_n \to \theta \ a.s.$$

and we say $\hat\theta_n$ is strongly consistent for $\theta$.

We have the following relationships.

Claim: $\hat\theta_n \xrightarrow{ms} \theta$ implies $\hat\theta_n \xrightarrow{p} \theta$; $\hat\theta_n \xrightarrow{a.s.} \theta$ implies $\hat\theta_n \xrightarrow{p} \theta$.

Proof: $P(|\hat\theta_n - \theta| > \varepsilon) \leq \varepsilon^{-2}E(\hat\theta_n - \theta)^2$ by Chebyshev's inequality. If $E(\hat\theta_n - \theta)^2 \to 0$ (i.e. $\hat\theta_n \xrightarrow{ms} \theta$) then $P(|\hat\theta_n - \theta| > \varepsilon) \to 0$ where $\varepsilon > 0$ is arbitrary. Therefore $\hat\theta_n \xrightarrow{p} \theta$.

Next, $P(|\hat\theta_n - \theta| > \varepsilon) \leq P(\sup_{m \geq n}|\hat\theta_m - \theta| > \varepsilon)$ since $\sup_{m \geq n}|\hat\theta_m - \theta| \geq |\hat\theta_n - \theta|$. Therefore if $P(\sup_{m \geq n}|\hat\theta_m - \theta| > \varepsilon) \to 0$ $\forall \varepsilon > 0$ (i.e. $\hat\theta_n \xrightarrow{a.s.} \theta$) then $P(|\hat\theta_n - \theta| > \varepsilon) \to 0$ $\forall \varepsilon > 0$ (i.e. $\hat\theta_n \xrightarrow{p} \theta$). QED.

If $\hat\theta_n$ is bounded wp1 then $\hat\theta_n \xrightarrow{p} \theta$ if and only if $E[\hat\theta_n] \to \theta$, which is asymptotic unbiasedness (see Bierens). By the Slutsky Theorem $\hat\theta_n \xrightarrow{p} \theta$ implies $(\hat\theta_n - \theta)^2 \xrightarrow{p} 0$, hence $E[(\hat\theta_n - \theta)^2] \to 0$: convergence in probability implies convergence in mean-square. This proves the following (and gives almost sure convergence as the "strongest" form: the one that implies all the rest).

Claim (a.s. =⇒ i.p. =⇒ m.s.): Let $\hat\theta_n$ be bounded wp1: $P(|\hat\theta_n| \leq K) = 1$ for finite $K > 0$. Then $\hat\theta_n \xrightarrow{a.s.} \theta$ implies $\hat\theta_n \xrightarrow{p} \theta$, which implies asymptotic unbiasedness and $\hat\theta_n \xrightarrow{ms} \theta$.

Claim (Strong Law of Large Numbers = SLLN): If $X_i \sim iid(\mu, \sigma^2)$ then $\bar X_n \xrightarrow{a.s.} \mu$.

Remark: The Slutsky Theorem carries over to strong convergence.

Example:

Let $X_i \sim iid(\mu, \sigma^2)$ and define

$$\hat\theta_n := \frac{1}{1 + \left|\bar X_n\right|}$$

Then $P(|\hat\theta_n| \leq 1) = 1$. Moreover, under the iid assumption $\bar X_n \xrightarrow{a.s.} \mu$ by the SLLN, hence by the Slutsky Theorem

$$\hat\theta_n \xrightarrow{a.s.} \frac{1}{1 + \left|\mu\right|}$$

Therefore

$$\hat\theta_n \xrightarrow{p} \frac{1}{1 + \left|\mu\right|}$$

and $E[\hat\theta_n] \to \theta = 1/(1 + |\mu|)$ and

$$E\left[\hat\theta_n - \theta\right]^2 \to 0$$
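
A minimal sketch of this example along one simulated realization, assuming the transform $\hat\theta_n = 1/(1 + |\bar X_n|)$ as written above and an arbitrary $\mu = 3$:

```python
import numpy as np

rng = np.random.default_rng(7)
mu = 3.0
x = rng.normal(mu, 2, size=100_000)
xbar_path = np.cumsum(x) / np.arange(1, x.size + 1)   # X_bar_n along one realization
theta_path = 1 / (1 + np.abs(xbar_path))              # theta_hat_n along the path
print(theta_path[-1], 1 / (1 + abs(mu)))              # path settles at 1/(1+|mu|) = 0.25
```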

Convergence in Distribution: CLT

Defn. We say $\hat\theta_n$ converges in distribution to a distribution $F$, or to a random variable $Z$ with distribution $F$, if

$$\lim_{n\to\infty} P\left(\hat\theta_n \leq c\right) = F(c) \quad \text{for every } c \text{ on the support of } F.$$

Thus, while $\hat\theta_n$ may itself not be distributed $F$, asymptotically it is. We write

$$\hat\theta_n \xrightarrow{d} F \quad \text{or} \quad \hat\theta_n \xrightarrow{d} Z \text{ where } Z \sim F.$$

The notation $\hat\theta_n \xrightarrow{d} Z$ is a bit awkward, because $F$ characterizes infinitely many random variables. We are therefore saying there is some random draw $Z$ from $F$ that $\hat\theta_n$ is becoming. Which random draw is not specified.

Central Limit Theorem

By far the most famous result concerns the sample mean $\bar X_n$. Convergence of some estimator $\hat\theta_n$ in a monumentally large number of cases reduces to convergence of a sample mean of something, call it $Z_i$. This carries over to the sample correlation, regression model estimation methods like Ordinary Least Squares, GMM, and Maximum Likelihood, as well as non-parametric estimation, and on and on.

As usual, we limit ourselves to the iid case. The following substantially carries over to non-iid data, and based on a rarely cited obscure fact does not even require a finite variance (I challenge you to find a proof of this, or to ever discover any econometrics textbook that accurately states this).
Claim (Central Limit Theorem = CLT): If $X_i \sim iid(\mu, \sigma^2)$ then

$$Z_n := \sqrt{n}\,\frac{\bar X_n - \mu}{\sigma} \xrightarrow{d} N(0, 1)$$

Remark 1: This is famously cited as the Lindeberg-Lévy CLT. Historically, however, the proof arose in different camps sometime between 1910-1930 (covering Lindeberg, Lévy, Chebyshev, Markov and Lyapunov).

Remark 2: Notice by construction $Z_n := \sqrt{n}(\bar X_n - \mu)/\sigma$ is a standardized sample mean because $E[\bar X_n] = \mu$ by identical distributedness and $V[\bar X_n] = \sigma^2/n$ by independence and identical distributedness. Thus

$$Z_n := \sqrt{n}\,\frac{\bar X_n - \mu}{\sigma} = \frac{\bar X_n - E\left[\bar X_n\right]}{\sqrt{V\left[\bar X_n\right]}}$$

Therefore $Z_n$ has mean 0 and variance 1:

$$E\left[Z_n\right] = \sqrt{n}\,\frac{E\left[\bar X_n\right] - \mu}{\sigma} = 0 \qquad E\left[Z_n^2\right] = n\,\frac{E\left(\bar X_n - \mu\right)^2}{\sigma^2} = n\,\frac{\sigma^2/n}{\sigma^2} = 1$$

Thus, even as $n \to \infty$ the random variable $Z_n \sim (0, 1)$. Although this is a long way from proving $Z_n$ has a definable distribution, even in the limit, it does help to point out that the term $\sqrt{n} \to \infty$ is necessary to stabilize $Z_n$, for otherwise we simply have $\bar X_n - \mu \xrightarrow{p} 0$.

Remark 3: Asymptotically $Z_n := \sqrt{n}(\bar X_n - \mu)/\sigma$ has the standard normal density $(2\pi)^{-1/2}\exp\{-z^2/2\}$.
Proof: Define $Z_i := (X_i - \mu)/\sigma$, hence

$$Z_n = \sqrt{n}\,\frac{\bar X_n - \mu}{\sigma} = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} Z_i$$

We will show the characteristic function $E[e^{itZ_n}] \to e^{-t^2/2}$. The latter is the characteristic function of a standard normal, while characteristic functions and distributions have a unique correspondence: only standard normals have a characteristic function like $e^{-t^2/2}$.

By independence and identical distributedness

$$E\left[e^{itZ_n}\right] = E\left[\prod_{i=1}^{n} e^{itZ_i n^{-1/2}}\right] = \prod_{i=1}^{n} E\left[e^{itZ_i n^{-1/2}}\right] = \left(E\left[e^{itZ_1 n^{-1/2}}\right]\right)^n \tag{2}$$

Now expand $e^{itZ_1 n^{-1/2}}$ around $itZ_1 n^{-1/2} = 0$ by a second order Taylor expansion:

$$e^{itZ_1 n^{-1/2}} = 1 + \frac{itZ_1 n^{-1/2}}{1!} + \frac{\left(itZ_1 n^{-1/2}\right)^2}{2!} + r_n = 1 + \frac{itZ_1 n^{-1/2}}{1!} - \frac{1}{2!}\frac{t^2 Z_1^2}{n} + r_n$$

where $r_n$ is a remainder term that is a function of $itZ_1 n^{-1/2}$. Now take the expectations as in (2), and note $E[Z_1] = E[X_1 - \mu]/\sigma = 0$ and $E[Z_1^2] = E[(X_1 - \mu)^2]/\sigma^2 = \sigma^2/\sigma^2 = 1$:

$$E\left[e^{itZ_1 n^{-1/2}}\right] = 1 + \frac{it\,E\left[Z_1\right]n^{-1/2}}{1!} - \frac{1}{2!}\frac{t^2 E\left[Z_1^2\right]}{n} + E\left[r_n\right] = 1 - \frac{1}{2}\frac{t^2}{n} + \delta_n \quad \text{where } \delta_n := E\left[r_n\right]$$

Note $e^{itZ_1 n^{-1/2}}$ is a bounded random variable, in particular $|e^{itZ_1 n^{-1/2}}| \leq 1$ wp1 (see Bierens), so even if $Z_1$ does not have higher moments we know $|\delta_n| < \infty$. Further $\delta_n \to 0$ because $E[e^{itZ_1 n^{-1/2}}] \to 1$.

Now take the $n$-power in (2): by the Binomial expansion

$$\left(E\left[e^{itZ_1 n^{-1/2}}\right]\right)^n = \left(1 - \frac{1}{2}\frac{t^2}{n} + \delta_n\right)^n = \left(1 - \frac{1}{2}\frac{t^2}{n}\right)^n + \sum_{j=1}^{n}\binom{n}{j}\left(1 - \frac{1}{2}\frac{t^2}{n}\right)^{n-j}\delta_n^j$$

The first term satisfies

$$\left(1 - \frac{1}{2}\frac{t^2}{n}\right)^n \to e^{-t^2/2}$$

because the sequence $\{(1 + a/n)^n\}_{n \geq 1}$ converges: $(1 + a/n)^n \to e^a$ (simply put $a = -t^2/2$). For the second term notice for large enough $n$ we have $|1 - \frac{1}{2}t^2/n| \leq 1$, hence

$$\left|\sum_{j=1}^{n}\binom{n}{j}\left(1 - \frac{1}{2}\frac{t^2}{n}\right)^{n-j}\delta_n^j\right| \leq \sum_{j=1}^{n}\binom{n}{j}\left|\delta_n\right|^j = \left(1 + \left|\delta_n\right|\right)^n - 1$$

See Bierens for details that verify $(1 + |\delta_n|)^n - 1 \to 0$. QED.

Example (Bernoulli): The most striking way to demonstrate the CLT is to begin with the least normal of data, a Bernoulli random variable which is discrete and takes only two finite values, and show $\sqrt{n}(\bar X_n - \mu)/\sigma \xrightarrow{d} N(0, 1)$, a continuous random variable with infinite support.

We simulate $X_i \sim \text{Bernoulli}(.2)$ for $n = 5, 50, 500, 10000$ and compute

$$Z_n := \sqrt{n}\,\frac{\bar X_n - \mu}{\sigma} = \sqrt{n}\,\frac{\bar X_n - .2}{\sqrt{.2 \times .8}} = \sqrt{n}\,\frac{\bar X_n - .2}{.4}$$

In order to show the small sample distribution of $Z_n$ we need a sample of $Z_n$'s, so we repeat the simulation 1000 times. We plot the relative frequencies of the sample of $Z_n$'s for each $n$. Let $\{Z_{n,j}\}_{j=1}^{1000}$ be the simulated sample of $Z_n$'s. The relative frequencies are the percentage $\frac{1}{1000}\sum_{j=1}^{1000} I(c_k \leq Z_{n,j} < c_{k+1})$ for interval endpoints $c_k = [-5, -4.9, -4.8, \dots, 4.9, 5.0]$. See Figure 7. For the sake of comparison in Figure 8 we plot the relative frequencies for one sample of 1000 iid standard normal random variables $N(0, 1)$.

Another way to see how $Z_n$ becomes a standard normal random variable is to compute the quantile $q_n$ such that $P(Z_n \leq q_n) = .975$. A standard normal satisfies $P(Z \leq 1.96) = .975$. We call $q_n$ an empirical quantile since it is based on a simulated set of samples. We simulate 10,000 samples for each size $n = 5, 105, 205, \dots, 5005$ and compute $q_n$. See Figure 9. As $n$ increases $q_n \to 1.96$.
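
A minimal sketch of the quantile experiment, using the fact that $n\bar X_n$ is Binomial$(n, .2)$:

```python
import numpy as np

rng = np.random.default_rng(8)
mu, sig = 0.2, np.sqrt(0.2 * 0.8)
for n in (5, 105, 505, 1005, 5005):
    xbar = rng.binomial(n, 0.2, size=10_000) / n      # 10,000 sample means X_bar_n
    z = np.sqrt(n) * (xbar - mu) / sig                # standardized means Z_n
    print(n, np.quantile(z, 0.975))                   # empirical q_n -> 1.96
```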
Figure 7: Standardized Means for Bernoulli; panels show 1000 $Z_n$'s for $n$ = 5, 50, 500, 5000 [plots omitted]

Figure 8: Standard Normal [plot omitted]
Figure 9: Empirical Quantiles $q_n$ over sample size $n$ [plot omitted]
