
Panel Data, Stratified Samples, and More Efficient Estimators

Iraj Rahmani
Department of Economics
Michigan State University

Abstract

The goal of this paper is to develop more efficient estimators for panel data models
when the data set comes from stratified sampling. Two cases are considered. The first is
when the stratification is exogenous and the second is when the stratification is
endogenous. The simulation results show that, as expected, we obtain a more efficient
estimator by taking into account the correlation across time.

1. Introduction

Stratified sampling is commonly used in economics and other branches of the social
sciences. One of the most common forms of stratified sampling is known as standard
stratified sampling (SS sampling). In SS sampling the population is first divided into S
subpopulations, or strata, that are mutually exclusive and exhaustive. A simple random
sample of size N_s is then taken from each stratum. Standard stratified sampling is
convenient when the members of each stratum are easily recognizable before sampling.
Stratification can be exogenous or endogenous. Under exogenous stratification the
population is divided into strata on the basis of exogenous variables, while under
endogenous stratification the strata are determined by the dependent (left-hand-side)
variable. Depending on the type of stratification, the estimation methods can be different.
What happens if we ignore stratification? To answer this question we need to know how the
strata were defined in the first place. There is no real problem if stratification is based
on one or more exogenous variables, at least in specific cases: when exogenous
stratification is ignored, the estimators are still consistent and asymptotically normally
distributed, and the usual variance matrix estimators are consistent. The properties of
estimators for cross-section data obtained by exogenous stratified sampling have already
been studied extensively. For instance, DuMouchel and Duncan (1983) and Manski and
McFadden (1981) studied the linear and maximum likelihood cases, respectively.
The issue becomes problematic when the data set at hand is based on endogenous
stratification. In this case standard, un-weighted estimators are generally inconsistent.
Hausman and Wise (1981) consider this case and propose weighted least squares estimators
for the linear case. Jewell (1985) suggests a new weighting intended to improve efficiency.
Wooldridge (2001) extends the previous work by considering an M-estimator that
encompasses linear and nonlinear cases, together with its asymptotic properties, for
standard stratified samples.

2. The Model

For simplicity we start with a linear model:

\[ y_i = x_i \beta + u_i \]

Here $x_i$ is a $T \times K$ matrix of exogenous variables for observation $i$, and $u_i$
is a $T \times 1$ vector of error terms.

The next assumption is strict exogeneity, i.e.

\[ E(u_i \mid X) = 0 \]

We also assume that the error covariance matrix does not depend on the explanatory
variables or on the strata. We can write this assumption as

\[ \operatorname{var}(u_i \mid X, S_i) = \operatorname{var}(u_i) = \Omega \]

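This variance-covariance matrix can be assumed to have a specific form. First, let us
assume that the error term follows an AR(1) process, so that

\[
\Omega =
\begin{pmatrix}
1 & \rho & \rho^{2} & \cdots & \rho^{T-1} \\
\rho & 1 & \rho & \cdots & \rho^{T-2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\rho^{T-1} & \rho^{T-2} & \rho^{T-3} & \cdots & 1
\end{pmatrix}
\]

OLS:

\[
\widehat{\operatorname{var}}\left(\hat{\beta}_{OLS}\right)
= \hat{\sigma}^{2} \left( \sum_{i=1}^{N} x_i' x_i \right)^{-1}
\]

Weighted OLS:

\[
\hat{\beta}_{wOLS}
= \left( \sum_{s=1}^{S} \frac{Q_s}{H_s} \sum_{i=1}^{N_s} x_{si}' x_{si} \right)^{-1}
  \left( \sum_{s=1}^{S} \frac{Q_s}{H_s} \sum_{i=1}^{N_s} x_{si}' y_{si} \right)
\]

Weighted GLS:

\[
\hat{\beta}_{wGLS}
= \left( \sum_{s=1}^{S} \frac{Q_s}{H_s} \sum_{i=1}^{N_s} x_{si}' \Omega^{-1} x_{si} \right)^{-1}
  \left( \sum_{s=1}^{S} \frac{Q_s}{H_s} \sum_{i=1}^{N_s} x_{si}' \Omega^{-1} y_{si} \right)
\]

Un-weighted GLS:

\[
\hat{\beta}_{uwGLS}
= \left( \frac{1}{N} \sum_{i=1}^{N} x_i' \Omega^{-1} x_i \right)^{-1}
  \left( \frac{1}{N} \sum_{i=1}^{N} x_i' \Omega^{-1} y_i \right)
\]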
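
The estimators above can be sketched in a few lines of NumPy. This is a minimal illustration, not the author's code: the function names `ar1_omega` and `weighted_gls`, the array layout (N units by T periods by K regressors), and the choice to treat Ω as known are all assumptions. Passing the identity matrix for Ω gives weighted OLS, and passing weights equal to one gives the un-weighted GLS estimator.

```python
import numpy as np


def ar1_omega(rho: float, T: int) -> np.ndarray:
    """T x T AR(1) matrix whose (t, s) entry is rho ** |t - s|."""
    t = np.arange(T)
    return rho ** np.abs(t[:, None] - t[None, :])


def weighted_gls(X, Y, omega, weights):
    """Weighted GLS for a panel (hypothetical helper, for illustration only).

    X: (N, T, K) regressors, Y: (N, T) outcomes,
    omega: (T, T) error covariance, assumed known,
    weights: (N,) values of Q_s / H_s for the stratum of each unit.
    """
    omega_inv = np.linalg.inv(omega)
    # A = sum_i w_i x_i' Omega^{-1} x_i  (K x K)
    A = np.einsum("i,itk,ts,isl->kl", weights, X, omega_inv, X)
    # b = sum_i w_i x_i' Omega^{-1} y_i  (length K)
    b = np.einsum("i,itk,ts,is->k", weights, X, omega_inv, Y)
    return np.linalg.solve(A, b)


# Illustrative check on noise-free data: the estimator recovers beta exactly.
rng = np.random.default_rng(0)
N, T = 50, 3
X = np.stack([np.ones((N, T)), rng.normal(size=(N, T))], axis=2)  # intercept + one regressor
beta = np.array([0.0, 1.0])
Y = X @ beta
w = rng.uniform(0.5, 2.0, size=N)  # arbitrary positive weights
beta_hat = weighted_gls(X, Y, ar1_omega(0.5, T), w)
```

Solving the K x K system with `np.linalg.solve` avoids forming the inverse of the weighted moment matrix explicitly; only the small T x T matrix Ω is inverted.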

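\[
\operatorname{Avar}\left(\hat{\beta}_{wOLS}\right)
= \left[ E(x'x) \right]^{-1}
  \left[ \sum_{s=1}^{S} \frac{Q_s^{2}}{H_s} E\left[ x_i' \Omega x_i \mid x_i \in s \right] \right]
  \left[ E(x'x) \right]^{-1}
\]

\[
\operatorname{Avar}\left(\hat{\beta}_{wGLS}\right)
= \left[ E(x' \Omega^{-1} x) \right]^{-1}
  \left[ \sum_{s=1}^{S} \frac{Q_s^{2}}{H_s} E\left[ x_i' \Omega^{-1} x_i \mid x_i \in s \right] \right]
  \left[ E(x' \Omega^{-1} x) \right]^{-1}
\]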

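The estimates of the asymptotic variances are as follows.

Weighted OLS:

\[
\widehat{\operatorname{Avar}}\left(\hat{\beta}_{wOLS}\right)
= \frac{1}{N}
  \left[ \frac{1}{N} \sum_{i=1}^{N} x_i' x_i \right]^{-1}
  \left[ \frac{1}{N} \sum_{s=1}^{S} \frac{Q_s^{2}}{H_s^{2}} \sum_{i=1}^{N_s}
         \left( x_{si}' \Omega x_{si} - x_s' \Omega x_s \right) \right]
  \left[ \frac{1}{N} \sum_{i=1}^{N} x_i' x_i \right]^{-1}
\]

or

\[
\widehat{\operatorname{Avar}}\left(\hat{\beta}_{wOLS}\right)
= \frac{1}{N}
  \left[ \frac{1}{N} \sum_{i=1}^{N} \frac{Q_{s_i}}{H_{s_i}} x_i' x_i \right]^{-1}
  \left[ \frac{1}{N} \sum_{s=1}^{S} \frac{Q_s^{2}}{H_s^{2}} \sum_{i=1}^{N_s}
         \left( x_{si}' \Omega x_{si} - x_s' \Omega x_s \right) \right]
  \left[ \frac{1}{N} \sum_{i=1}^{N} \frac{Q_{s_i}}{H_{s_i}} x_i' x_i \right]^{-1}
\]

Weighted GLS:

\[
\widehat{\operatorname{Avar}}\left(\hat{\beta}_{wGLS}\right)
= \frac{1}{N}
  \left[ \frac{1}{N} \sum_{i=1}^{N} \frac{Q_{s_i}}{H_{s_i}} x_i' \Omega^{-1} x_i \right]^{-1}
  \left[ \frac{1}{N} \sum_{s=1}^{S} \frac{Q_s^{2}}{H_s^{2}} \sum_{i=1}^{N_s}
         \left( x_{si}' \Omega^{-1} x_{si} - x_s' \Omega^{-1} x_s \right) \right]
  \left[ \frac{1}{N} \sum_{i=1}^{N} \frac{Q_{s_i}}{H_{s_i}} x_i' \Omega^{-1} x_i \right]^{-1}
\]

or

\[
\widehat{\operatorname{Avar}}\left(\hat{\beta}_{wGLS}\right)
= \frac{1}{N}
  \left[ \frac{1}{N} \sum_{i=1}^{N} x_i' \Omega^{-1} x_i \right]^{-1}
  \left[ \frac{1}{N} \sum_{s=1}^{S} \frac{Q_s^{2}}{H_s^{2}} \sum_{i=1}^{N_s}
         \left( x_{si}' \Omega^{-1} x_{si} - x_s' \Omega^{-1} x_s \right) \right]
  \left[ \frac{1}{N} \sum_{i=1}^{N} x_i' \Omega^{-1} x_i \right]^{-1}
\]

Un-weighted GLS:

\[
\widehat{\operatorname{Avar}}\left(\hat{\beta}_{uwGLS}\right)
= \frac{1}{N}
  \left[ \frac{1}{N} \sum_{i=1}^{N} x_i' \Omega^{-1} x_i \right]^{-1}
  \left[ \frac{1}{N} \sum_{s=1}^{S} \sum_{i=1}^{N_s}
         \left( x_{si}' \Omega^{-1} x_{si} - x_s' \Omega^{-1} x_s \right) \right]
  \left[ \frac{1}{N} \sum_{i=1}^{N} x_i' \Omega^{-1} x_i \right]^{-1}
\]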

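In these formulas $Q_s / H_s$ are the weights, where $Q_s = P((x_i, y_i) \in s)$ and
$H_s = N_s / N$. For a single observation $i$, $Q_{s_i} / H_{s_i}$ denotes the weight of
the stratum to which that observation belongs.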
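
As a small sketch of how these weights could be computed from a sample, the helper below takes the stratum label of each sampled unit and the population shares $Q_s$. The name `stratum_weights`, the 0-based integer labels, and the assumption that the $Q_s$ are known are illustrative choices, not part of the paper.

```python
import numpy as np


def stratum_weights(strata, Q):
    """Return Q_s / H_s for each sampled unit (hypothetical helper).

    strata: (N,) integer stratum labels in 0..S-1 for the sampled units.
    Q: (S,) population shares of the strata, assumed known.
    """
    strata = np.asarray(strata)
    Q = np.asarray(Q, dtype=float)
    N_s = np.bincount(strata, minlength=Q.size)  # stratum sample sizes N_s
    H = N_s / strata.size                        # sample shares H_s = N_s / N
    return Q[strata] / H[strata]


# Two equally large strata in the population; stratum 0 is oversampled.
weights = stratum_weights([0, 0, 0, 1], Q=[0.5, 0.5])
```

Here the three units from the oversampled stratum each get weight 0.5 / 0.75 and the single unit from the other stratum gets weight 0.5 / 0.25, so oversampled observations are weighted down and the weights sum to the sample size.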

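Which one is smaller?

\[
\operatorname{Avar}\left(\hat{\beta}_{wOLS}\right)
= \left[ E(x'x) \right]^{-1}
  \left[ \sum_{s=1}^{S} \frac{Q_s^{2}}{H_s} E\left[ x_i' \Omega x_i \mid x_i \in s \right] \right]
  \left[ E(x'x) \right]^{-1}
\]

\[
\operatorname{Avar}\left(\hat{\beta}_{wGLS}\right)
= \left[ E(x' \Omega^{-1} x) \right]^{-1}
  \left[ \sum_{s=1}^{S} \frac{Q_s^{2}}{H_s} E\left[ x_i' \Omega^{-1} x_i \mid x_i \in s \right] \right]
  \left[ E(x' \Omega^{-1} x) \right]^{-1}
\]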

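We can rewrite these two equations as:

\[
\operatorname{Avar}\left(\hat{\beta}_{wOLS}\right)
= \left[ \sum_{r=1}^{S} Q_r E\left( x_i' x_i \mid x_i \in r \right) \right]^{-1}
  \left[ \sum_{s=1}^{S} \frac{Q_s^{2}}{H_s} E\left[ x_j' \Omega x_j \mid x_j \in s \right] \right]
  \left[ \sum_{t=1}^{S} Q_t E\left( x_k' x_k \mid x_k \in t \right) \right]^{-1}
\]

\[
\operatorname{Avar}\left(\hat{\beta}_{wGLS}\right)
= \left[ \sum_{r=1}^{S} Q_r E\left( x_i' \Omega^{-1} x_i \mid x_i \in r \right) \right]^{-1}
  \left[ \sum_{s=1}^{S} \frac{Q_s^{2}}{H_s} E\left[ x_j' \Omega^{-1} x_j \mid x_j \in s \right] \right]
  \left[ \sum_{t=1}^{S} Q_t E\left( x_k' \Omega^{-1} x_k \mid x_k \in t \right) \right]^{-1}
\]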

Expanding these sums gives $S^3$ combinations $(r, s, t)$, where $S$ is the number of
strata, and each term is a product of three factors. For the terms in which all three
factors belong to the same stratum, i.e. r = s = t, it is clear that the asymptotic variance
from weighted GLS is smaller than the corresponding part in weighted OLS.
There are S such terms in which all three parts come from the same stratum. The rest are
combinations of different strata, which makes the comparison of the two formulas difficult,
and in general the gain from weighted GLS is not guaranteed.
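
The rewriting of the two asymptotic variances in terms of within-stratum moments rests on the law of total expectation. Because the strata are mutually exclusive and exhaustive and $Q_r$ is the population probability of stratum $r$, each unconditional moment is a $Q$-weighted sum of conditional moments. As a short worked step under those definitions:

```latex
\begin{aligned}
E(x'x) &= \sum_{r=1}^{S} P(x_i \in r)\, E\left( x_i' x_i \mid x_i \in r \right)
        = \sum_{r=1}^{S} Q_r\, E\left( x_i' x_i \mid x_i \in r \right), \\
E(x' \Omega^{-1} x) &= \sum_{r=1}^{S} Q_r\, E\left( x_i' \Omega^{-1} x_i \mid x_i \in r \right).
\end{aligned}
```

With a single stratum ($S = 1$, $Q_1 = H_1 = 1$) the two expressions reduce to the familiar OLS sandwich $[E(x'x)]^{-1} E(x' \Omega x) [E(x'x)]^{-1}$ and the GLS variance $[E(x' \Omega^{-1} x)]^{-1}$, for which the GLS variance is no larger in the matrix sense. This is the comparison that carries over to the terms with r = s = t.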

3. Simulation Results

To keep the investigation simple we assume β0 = 0 and β1 = 1. We also assume that the
explanatory variable and the error term both have a standard normal distribution. The
simulation results are presented in Tables 1 to 6. The main findings can be summarized as
follows:

I. The results show that the gain from taking into account the correlation across
time is not considerable when rho is small, but the gain becomes large as the
correlation increases.
II. When the T dimension of the panel is smaller, the problem of inconsistency is
bigger, and the gain from considering the correlation is bigger too.
III. As expected, in the case of exogenous stratification the un-weighted GLS type
estimators are more efficient. Vice versa, when stratification is endogenous,
the weighted GLS estimators are more efficient.
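
The paper does not spell out the sampling design behind the tables, so the following is only a sketch of how such a Monte Carlo experiment could be set up. The stratification rule (stratum 1 when a unit's time-averaged y is positive), the population size, the stratum sample sizes, the seed, the number of replications, the use of a known Ω, and all function names are assumptions chosen for illustration. They are not the author's design, and the numbers this code produces are not comparable to Tables 1 to 6.

```python
import numpy as np


def ar1_omega(rho, T):
    """T x T matrix with (t, s) entry rho ** |t - s|."""
    t = np.arange(T)
    return rho ** np.abs(t[:, None] - t[None, :])


def weighted_gls(X, Y, omega, w):
    """beta = (sum w x' O^-1 x)^-1 (sum w x' O^-1 y); X is (N, T, K), Y is (N, T)."""
    omega_inv = np.linalg.inv(omega)
    A = np.einsum("i,itk,ts,isl->kl", w, X, omega_inv, X)
    b = np.einsum("i,itk,ts,is->k", w, X, omega_inv, Y)
    return np.linalg.solve(A, b)


def one_draw(rng, rho=0.5, T=2, n_pop=20000, n_per_stratum=(1500, 500)):
    """One replication under an ASSUMED endogenous stratification design."""
    omega = ar1_omega(rho, T)
    x = rng.normal(size=(n_pop, T))
    # Rows of z @ L' have covariance L L' = omega when z has identity covariance.
    u = rng.normal(size=(n_pop, T)) @ np.linalg.cholesky(omega).T
    y = 0.0 + 1.0 * x + u  # beta0 = 0, beta1 = 1
    # Assumed rule: stratum 1 if the unit's time-averaged y is positive.
    stratum = (y.mean(axis=1) > 0).astype(int)
    Q = np.bincount(stratum, minlength=2) / n_pop  # population shares Q_s
    # Standard stratified sampling: simple random sample of size N_s per stratum.
    idx = np.concatenate([
        rng.choice(np.flatnonzero(stratum == s), size=n, replace=False)
        for s, n in enumerate(n_per_stratum)
    ])
    H = np.array(n_per_stratum) / sum(n_per_stratum)  # sample shares H_s
    w = Q[stratum[idx]] / H[stratum[idx]]
    X = np.stack([np.ones((idx.size, T)), x[idx]], axis=2)  # intercept + regressor
    Y = y[idx]
    ones, eye = np.ones(idx.size), np.eye(T)
    return {
        "OLS": weighted_gls(X, Y, eye, ones),
        "wOLS": weighted_gls(X, Y, eye, w),
        "wGLS": weighted_gls(X, Y, omega, w),
        "uwGLS": weighted_gls(X, Y, omega, ones),
    }


rng = np.random.default_rng(12345)
draws = [one_draw(rng) for _ in range(50)]
# Monte Carlo mean and standard deviation of (beta0_hat, beta1_hat) per estimator.
means = {k: np.mean([d[k] for d in draws], axis=0) for k in draws[0]}
sds = {k: np.std([d[k] for d in draws], axis=0, ddof=1) for k in draws[0]}
```

Because this assumed design oversamples units with low y, the un-weighted estimators should show a clearly biased intercept while the weighted ones stay close to the true values. Comparing `sds["wOLS"]` with `sds["wGLS"]` gives a rough sense of the gain from using the correlation across time.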

Table 1: Simulation Results for Exogenous Stratification: rho = .1

T = 2 (ρ̂ = 0.1014)
              OLS       wOLS      wGLS      uwGLS     wFGLS     uwFGLS
β̂1            1.0013    1.0017    1.0016    1.0013    .9974     .9989
              (.0369)   (.0455)   (.0452)   (0.367)   (.0447)   (.0361)
s(β̂1)         .0356     .0279     .0336     .0353     .0336     .0354
β̂0            .0014     .0025     .0025     .0015     .0006     .0012
              (.0442)   (.0488)   (.0488)   (.0443)   (.0495)   (.0427)
s(β̂0)         .0422     .0457     .0523     .0443     .0523     .0442
Repetitions   500       500       500       500       500       500

T = 5 (ρ̂ = 0.1016)
              OLS       wOLS      wGLS      uwGLS     wFGLS     uwFGLS
β̂1            .9996     .9992     .9990     .9995     .9990     .9997
              (.0248)   (.0294)   (.0291)   (.0245)   (.0283)   (.0243)
s(β̂1)         .0244     .0222     .0254     .0242     .0254     .0242
β̂0            -.0002    -.0002    -.0003    -.0003    -.0020    -.0013
              (.0301)   (.0343)   (.0343)   (.0301)   (.0328)   (.0295)
s(β̂0)         .0263     .0298     .0329     .0285     .0329     .0285
Repetitions   500       500       500       500       500       500

Table 2: Simulation Results for Exogenous Stratification: rho = .5

T = 2 (ρ̂ = .5019)
              OLS       wOLS      wGLS      uwGLS     wFGLS     uwFGLS
β̂1            1.0015    1.0013    1.0011    1.0011    1.0042    1.0035
              (.0359)   (.0448)   (.0387)   (.0309)   (.0357)   (.0294)
s(β̂1)         .0355     .0278     .0284     .0306     .0285     .0306
β̂0            .0015     .0025     .0025     .0016     .0032     .0008
              (.0510)   (.0569)   (.0569)   (.0512)   (.0607)   (.0537)
s(β̂0)         .0422     .0532     .0599     .0513     .0598     .0513
Repetitions   500       500       500       500       500       500

T = 5 (ρ̂ = .4975)
              OLS       wOLS      wGLS      uwGLS     wFGLS     uwFGLS
β̂1            1.0009    1.0003    1.0011    1.0010    1.0008    1.0002
              (.0243)   (.0292)   (.0240)   (.0197)   (.0245)   (.0196)
s(β̂1)         .0243     .0212     .0210     .0198     .0210     .0199
β̂0            -.0017    -.0010    -.0011    -.0017    .0002     .0006
              (.0402)   (.0448)   (.0437)   (.0391)   (.0424)   (.0373)
s(β̂0)         .0263     .0391     .0444     .0385     .0444     .0385
Repetitions   540       540       540       540       500       500

Table 3: Simulation Results for Exogenous Stratification: rho = .9

T = 2 (ρ̂ = .9003)
              OLS       wOLS      wGLS      uwGLS     wFGLS     uwFGLS
β̂1            1.0022    1.0014    1.0003    1.0002    .9990     .9990
              (.0372)   (.0452)   (.0208)   (.0165)   (.0197)   (.0149)
s(β̂1)         .0355     .0251     .0139     .0152     .0139     .0152
β̂0            -.0010    -.0040    -.0041    -.0006    .0038     .0033
              (.0574)   (.0646)   (.0646)   (.0165)   (.0678)   (.0581)
s(β̂0)         .0422     .0579     .0659     .0152     .0660     .0573
Repetitions   500       500       500       500       500       500

T = 5: no entries are reported.

Table 4: Endogenous Stratification with ρ = .1

T = 2 (ρ̂ = .1004)
              OLS       wOLS      wGLS      uwGLS     wFGLS     uwFGLS
β̂1            1.0679    .9995     .9995     1.0684    1.0028    1.0718
              (0.0374)  (0.0409)  (.0406)   (.0371)   (.0408)   (.0368)
s(β̂1)         0.0412    0.0355    .0407     .0397     .0407     .0396
β̂0            0.1229    -.0013    -.0013    .1229     -.0016    .1227
              (0.0399)  (.0418)   (.0418)   (.0399)   (.0414)   (.0398)
s(β̂0)         0.0429    .0445     .0476     .0435     .0475     .0435
Repetitions   1000      1000      1000      1000      500       500

T = 5 (ρ̂ = 0.0963)
              OLS       wOLS      wGLS      uwGLS     wFGLS     uwFGLS
β̂1            1.0321    1.0003    1.0001    1.0314    .9808     1.0124
              (.0241)   (.0265)   (.0264)   (.0240)   (.0259)   (.0241)
s(β̂1)         .0260     .0243     .0267     .0254     .0262     .0250
β̂0            .0522     .0010     .0010     .0549     .0009     .0541
              (.0260)   (.0280)   (.0279)   (.0258)   (.0286)   (.0273)
s(β̂0)         .0265     .0286     .0304     .0282     .0299     .0277
Repetitions   500       500       500       500       499       499

Table 5: Endogenous Stratification with ρ = .5

T = 2 (ρ̂ = 0.5026)
              OLS       wOLS      wGLS      uwGLS     wFGLS     uwFGLS
β̂1            1.0649    1.0021    1.0018    1.0538    1.0021    1.0542
              (.0412)   (.0433)   (.0376)   (.0354)   (.0371)   (.0336)
s(β̂1)         .0413     .0337     .0350     .0344     .0351     .0345
β̂0            .1707     -.0013    -.0013    .1721     -.0015    .1727
              (.0449)   (.0467)   (.0466)   (.0445)   (.0481)   (.0470)
s(β̂0)         .0431     .0523     .0549     .0507     .0550     .0507
Repetitions   500       500       500       500       540       540

T = 5 (ρ̂ = 0.4988)
              OLS       wOLS      wGLS      uwGLS     wFGLS     uwFGLS
β̂1            1.0302    1.0002    .9995     1.0197    .9992     1.0197
              (.0236)   (.0255)   (.0215)   (.0197)   (.0232)   (.0206)
s(β̂1)         .0261     .0238     .0218     .0207     .0218     .0208
β̂0            .0920     .0012     .0009     .1016     .0022     .1040
              (.0347)   (.0374)   (.0363)   (.0334)   (.0394)   (.0366)
s(β̂0)         .0265     .0394     .0411     .0382     .0411     .0382
Repetitions   500       500       500       500       500       500

Table 6: Endogenous Stratification with ρ = .9

T = 2 (ρ̂ = 0.9021)
              OLS       wOLS      wGLS      uwGLS     wFGLS     uwFGLS
β̂1            1.0571    1.0000    .9999     1.0130    .9998     1.0133
              (.0403)   (.0431)   (.0187)   (.0170)   (.0190)   (0.0180)
s(β̂1)         .0420     .0314     .0174     .0173     .0174     .0173
β̂0            .2209     -.0010    -.0012    .2263     -.0006    .2266
              (.0450)   (.0466)   (.0466)   (.0439)   (.0496)   (.0460)
s(β̂0)         .0437     .0558     .0612     .0568     .0612     .0569
Repetitions   499       499       499       499       500       500

T = 5 (ρ̂ = 0.8994)
              OLS       wOLS      wGLS      uwGLS     wFGLS     uwFGLS
β̂1            1.0259    1.0000    1.0006    1.0045    .9996     1.0035
              (.0250)   (.0254)   (.0095)   (.0088)   (.0105)   (.0094)
s(β̂1)         .0266     .0220     .0097     .0092     .0097     .0092
β̂0            .1959     -.0008    -.0009    .1983     .0036     .2022
              (.0455)   (.0485)   (.0475)   (.0446)   (.0501)   (.0460)
s(β̂0)         .0271     .0546     .0569     .0530     .0569     .0530
Repetitions   500       500       500       500       500       500

4. Conclusion

The simulation results show that, under some strong assumptions such as strict exogeneity,
we can improve the efficiency of the estimators in panel data models when the data set
comes from a stratified sample by taking into account the correlation across time. As in
cross-section models, un-weighted estimators are consistent and more efficient when
stratification is exogenous, but under endogenous stratification we need to assign proper
weights to the observations in order to obtain consistent estimators.
The next step is to relax some of the assumptions and to consider more general models, not
just the linear case, moving toward GLM-type methods.
