Panel Data Analysis

Panel Data Analysis
Using STATA
Lecture by Josef Brderl
November 14, 2008
November 14, 2008 Josef Brderl Page 1
Panel Data
Repeated measures of one or more
variables on one or more persons
- Repeated cross-sectional time-series
Mostly from panel surveys
- Also from cross-sectional surveys by
retrospective questions
Advantages of panel data:
- Panel data allow for higher precision
- They allow to study individual dynamics
- Information on the time-ordering of events
- Controlling for unobserved heterogeneity
Panel Data and Causal Inference I
Counterfactual approach to causality (Rubins model)
With cross-sectional data (between estimation)
Assumption of unit homogeneity (no unobserved heterogeneity)
Assumption of conditional independence (no reverse causality)
With panel data I (within estimation)
Problem: period effects, maturation
With panel data II (difference-in-differences estimator, DID)
C
t i
T
t i
Y Y
0 0
, ,

C
t j
T
t i
Y Y
0 0
, ,

C
t i
T
t i
Y Y
0 1
, ,

) ( ) (
0 1 0 1
, , , ,
C
t j
C
t j
C
t i
T
t i
Y Y Y Y
Panel Data and Causal Inference II
With panel data we can tackle one of the two major
problems of Social Research
No simple solution
(e.g. no time-varying
unobserved heterog.)
Controlled treatment
Reverse Causality
(treatment depends on Y)
Within estimation
(before-after
comparison)
Randomization
Self-selection
(leading to unobserved
heterogeneity)
Solution with
panel design
Solution with
experimental design
The two major problems
in Social Research
Example: Marriage-Premium for Men?
Fabricated data: long-format
. list id time wage marr, separator(6)
------------------------- -------------------------
| id time wage marr | | id time wage marr |
|-------------------------| |-------------------------|
1. | 1 1 1000 0 | 13. | 3 1 2900 0 |
2. | 1 2 1050 0 | 14. | 3 2 3000 0 |
3. | 1 3 950 0 | 15. | 3 3 3100 0 |
4. | 1 4 1000 0 | 16. | 3 4 3500 1 |
5. | 1 5 1100 0 | 17. | 3 5 3450 1 |
6. | 1 6 900 0 | 18. | 3 6 3550 1 |
|-------------------------| |-------------------------|
7. | 2 1 2000 0 | 19. | 4 1 3950 0 |
8. | 2 2 1950 0 | 20. | 4 2 4050 0 |
9. | 2 3 2050 0 | 21. | 4 3 4000 0 |
10. | 2 4 2000 0 | 22. | 4 4 4500 1 |
11. | 2 5 1950 0 | 23. | 4 5 4600 1 |
12. | 2 6 2050 0 | 24. | 4 6 4400 1 |
|-------------------------| |-------------------------|
Example: Marriage-Premium for Men?
0
1
0
0
0
2
0
0
0
3
0
0
0
4
0
0
0
5
0
0
0
E
U
R
O

p
e
r

m
o
n
t
h
1 2 3 4 5 6
Time
before marriage after marriage
There is a causal effect:
a marriage-premium
And there is selectivity:
Only high wage men
marry
Example: Computing the Marriage-Premium
These data are like experimental data
Treatment and control group
Before-after comparison
Compute the DID-estimator
The marriage-premium is 500
Within-person comparison (the before-after difference)
To rule out the possibility of maturation or period effects we
compare the within-difference of married (treatment) and
unmarried (control) men
500
2
) 1000 1000 ( ) 2000 2000 (
2
) 3000 3500 ( ) 4000 4500 (
=
+
+
The Fundamental Problem of
Non-Experimental Research
Result of a cross-sectional regression at T=4:
Between-comparison: compare wages of married and unmarried men
A cross-sectional regression is highly misleading!
The bias is due to unobserved heterogeneity
High-wage men self-select into marriage
Technically: endogeneity (x
i4
and u
i4
are correlated)
Self-selection is the fundamental problem of
non-experimental research
Most cross-sectional regression results are therefore highly disputable!
4 4 1 0 4 i i i
u x y + + = | |
2500
2
1000 2000
2
3500 4500
1
=
+
+
= |
No Solution: Pooled-OLS
Pool the data and estimate an OLS regression (POLS)
The result is
This is the mean of the red points minus the mean of the green
points
The bias is still heavy
POLS also relies on a between comparison. It is thus biased due to
unobserved heterogeneity: x
it
and u
it
are correlated
Panel data per se do not remedy the problem of
unobserved heterogeneity!
One has to use appropriate methods of analysis
it it it
u x y + + =
1 0
| |
1833
1
= |
A Solution: Panel Data and Within-Estimation
One has to construct a regression model that relies on the
before-after comparison (like DID)
Starting point: error-components model
person-specific error
i
, idiosyncratic error
it
Error-components model

i
represents person-specific time-constant unobserved
heterogeneity (fixed-effects)
(in our example
i
could be unobserved ability)
Pooled-OLS has to assume that x
it
is uncorrelated with both
error-components
it i it
u c v + =
it i it it
x y c v | + + =
1
Fixed-Effects Regression
How can we get rid of the fixed-effects?
Simply include them as dummies in the regression
Least-squares-dummy-variables-estimator (LSDV)
Only within variation is left
More elegant: within transformation
Time-demeaning the data and using pooled-OLS
Only within variation is left
Fixed-effects have been cancelled out
Time-constant unobserved heterogeneity is no longer a problem
(1)
1 it i it it
x y c v | + + =
(3) ) (
1 i it i it i it
x x y y c c | + =
Average over t for each i
Substract (2) from (1)
(2)
1 i i i i
x y c v | + + =
Example: Fixed-Effects Regression
. xtreg wage marr, fe
Fixed-effects (within) regression Number of obs = 24
Group variable: id Number of groups = 4
R-sq: within = 0.8982 Obs per group: min = 6
between = 0.8351 avg = 6.0
overall = 0.4065 max = 6
F(1,19) = 167.65
corr(u_i, Xb) = 0.5164 Prob > F = 0.0000
------------------------------------------------------------------------------
wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
marr | 500 38.61642 12.95 0.000 419.1749 580.8251
_cons | 2500 16.7214 149.51 0.000 2465.002 2534.998
-------------+----------------------------------------------------------------
sigma_u | 1290.9944
sigma_e | 66.885605
rho | .99732298 (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(3, 19) = 1639.22 Prob > F = 0.0000
-
4
0
0
-
3
0
0
-
2
0
0
-
1
0
0
0
1
0
0
2
0
0
3
0
0
4
0
0
d
e
m
e
a
n
e
d
(
w
a
g
e
)
-.5 0 .5
demeaned(marr)
Mechanics of a FE-Regression
Those, never marrying are at X=0. They contribute nothing to the regression.
The slope is determined by the wages of those marrying only:
It is the difference in the mean wage before and after marriage.
Equivalent FE-estimator I: LSDV
Practical only when N is small
We get estimates for the
i
. regress wage marr pers1-pers4, noconstant
Source | SS df MS Number of obs = 24
-------------+------------------------------ F( 5, 19) = 9052.94
Model | 202500000 5 40500000 Prob > F = 0.0000
Residual | 85000 19 4473.68421 R-squared = 0.9996
-------------+------------------------------ Adj R-squared = 0.9995
Total | 202585000 24 8441041.67 Root MSE = 66.886
------------------------------------------------------------------------------
wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
marr | 500 38.61642 12.95 0.000 419.1749 580.8251
pers1 | 1000 27.30593 36.62 0.000 942.848 1057.152
pers2 | 2000 27.30593 73.24 0.000 1942.848 2057.152
pers3 | 3000 33.4428 89.71 0.000 2930.003 3069.997
pers4 | 4000 33.4428 119.61 0.000 3930.003 4069.997
------------------------------------------------------------------------------
Equivalent FE-estimator II: Individual Slope Regr.
Estimate a regression for every man marrying
The FE-estimator is the mean of the individual slopes
Multi-level literature: random-coefficient model
0
1
0
0
0
2
0
0
0
3
0
0
0
4
0
0
0
5
0
0
0
E
U
R
O

p
e
r

m
o
n
t
h
0 1
marriage
Problem: No Control Group
There is no causal
effect!
But there is selectivity.
And there is a period
effect: general wage
increase at t=4.
0
1
0
0
0
2
0
0
0
3
0
0
0
4
0
0
0
5
0
0
0
E
U
R
O

p
e
r

m
o
n
t
h
1 2 3 4 5 6
Time
before marriage after marriage
Problem: No Control Group
FE-regression (within estimation) yields the wrong answer
Reason is that it does not use the control group information
. xtreg wage3 marr, fe
Fixed-effects (within) regression Number of obs = 24
between = 0.8000 avg = 6.0
F(1,19) = 17.07
corr(u_i, Xb) = 0.5013 Prob > F = 0.0006
------------------------------------------------------------------------------
wage3 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
marr | 500 121.0336 4.13 0.001 246.6738 753.3262
_cons | 2625 52.40907 50.09 0.000 2515.307 2734.693
-------------+----------------------------------------------------------------
Solution: Two-way FE-Regression
Including time fixed-effects (period dummies)
Now also the control-group information is used
. tab time, gen(t)
. xtreg wage3 marr t2-t6, fe
------------------------------------------------------------------------------
wage3 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
marr | -4.59e-13 58.24824 -0.00 1.000 -124.93 124.93
t2 | 50 50.44445 0.99 0.338 -58.19259 158.1926
t3 | 62.5 50.44445 1.24 0.236 -45.69259 170.6926
t4 | 537.5 58.24824 9.23 0.000 412.57 662.43
t5 | 562.5 58.24824 9.66 0.000 437.57 687.43
t6 | 512.5 58.24824 8.80 0.000 387.57 637.43
_cons | 2462.5 35.66961 69.04 0.000 2385.996 2539.004
-------------+----------------------------------------------------------------
One should always include period-dummies in a FE-regression!
Summary of FE-Estimation
Panel data and within estimation (DID, FE-regression) can
remedy the problem of unobserved heterogeneity
However, with FE-regressions we cannot estimate the
effects of time-constant covariates. These are all cancelled
out by the within transformation.
This reflects the fact that panel data do not help to identify
the causal effect of a time-constant covariate!
The "within logic" applies only with time-varying covariates
Something has to happen (the effects of events)
Only then a before-after comparison is possible
Random-Effects Regression
We start with the error-components model (incl. constant)

i
are random variables (i.i.d. random-effects)
Cov(x
it
,
i
) = 0
Pooled GLS estimator (RE-estimator)
Equivalent to pooled OLS after following transformation:
Where
= 1: FE-estimator, = 0: POLS
RE-estimator mixes within and between estimators
it i it it
x y c v | | + + + =
1 0
)} ( ) 1 ( { ) ( ) 1 ( ) (
1 0 i it i i it i it
x x y y c u c u v u | u | u + + + =
2 2
2
1
c v
c
o o
o
u
+
=
T
Example: Random-Effects Regression
. xtreg wage marr, re theta
Random-effects GLS regression Number of obs = 24
between = 0.8351 avg = 6.0
Random effects u_i ~ Gaussian Wald chi2(1) = 128.82
corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000
theta = .96138358
------------------------------------------------------------------------------
wage | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
marr | 502.9802 44.31525 11.35 0.000 416.1239 589.8365
_cons | 2499.255 406.0315 6.16 0.000 1703.448 3295.062
-------------+----------------------------------------------------------------
sigma_u | 706.57936
sigma_e | 66.885605
rho | .99111885 (fraction of variance due to u_i)
------------------------------------------------------------------------------
FE- or RE-Model?
RE allows for estimating effects of time-constant covariates
Many researchers use RE instead of FE
RE is more efficient, if Cov(x
it
,
i
) = 0
However, mostly Cov(x
it
,
i
) 0 due to self-selection
RE-estimation will provide biased estimates
FE-estimation will provide unbiased estimates
Using the RE-estimator to report effects of sex, race, etc. is
risking to throw away the big advantage of panel data only
to be able to write a paper on "The determinants of Y".
To take full advantage of panel data the style of data
analysis has to change
Not big RE-regression models with many variables
But small FE-models using only a few time-varying covariates
Problems with the FE-estimator
FE-estimator has to assume (strict exogeneity assumption)
If this assumption is violated (endogeneity) FE-estimates are biased
Sources of endogeneity
Time-varying unobserved heterogeneity (captured by
it
)
Y shocks trigger the change in X (reverse causality)
Errors in reporting X (measurement errors)
Supposed remedy
IV-estimation (xtivreg)
Structural equation modeling (LISREL)
These methods rest on untestable assumptions. Therefore, it is
unclear whether they improve estimates.
These methods have produced a big mess in social research.
s t x Cov
is it
and all for 0 ) , ( = c
Further Remarks on Panel Regression
Augmenting the FE-model
One can include interaction effects of time-constant and time-varying
covariates (e.g. period dummies)
There is a hybrid model including RE-estimates of time-constant
variables in a FE-model
Unbalanced panels are no problem
Serially correlated errors
it
(xtregar)
Attrition correlated with
i
does not bias FE-estimates!
So-called parametric models for unobserved heterogeneity
are RE-modeling
These models only work if Cov(x
it
,
i
) = 0
Dynamic panel models:
Estimates are usually biased, one has to use IV-methods
Do not use these models!
it i it t i it
x y y c v | o + + + =
1 1 ,
Fixed-Effects Logit
Logistic regression model with fixed-effects
No incidental parameter problem, because the
i
can be conditioned
out (conditional likelihood)
Advantage of the FE-methodology: Estimates of
1
are unbiased
even in the presence of unobserved heterogeneity (if it is time-
constant).
Persons who have only 0s or 1s on the dependent variable are
dropped. For FE-logit you need data with sufficient variance both on
X and Y, i.e. generally you will need panel data with many waves!
) exp( 1
) exp(
) 1 (
1
1
i it
i it
it
x
x
y P
v |
v |
+ +
+
= =
Example: FE-Logit
0
1
0
0
0
2
0
0
0
3
0
0
0
4
0
0
0
5
0
0
0
E
U
R
O

p
e
r

m
o
n
t
h
1 2 3 4 5 6
Time
no further education further education
Y: further education
X: wage
Does a wage raise
increase P(further educ)?
The data show
a) a negative effect of a
wage raise
b) self-selection of high
wage males into further
educ
Example: FE-Logit
. * Pooled-Logit
. logit feduc wage
------------------------------------------------------------------------------
feduc | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
wage | .0009478 .0004256 2.23 0.026 .0001137 .001782
_cons | -2.93141 1.299228 -2.26 0.024 -5.47785 -.3849703
------------------------------------------------------------------------------
. * FE-Logit
. xtlogit feduc wage, fe
note: multiple positive outcomes within groups encountered.
note: 1 group (6 obs) dropped because of all positive or all negative outcomes.
Conditional fixed-effects logistic regression Number of obs = 18
------------------------------------------------------------------------------
feduc | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
wage | -.0021877 .0023218 -0.94 0.346 -.0067383 .0023629
------------------------------------------------------------------------------
Event History Analysis with Repeated Events
Multiple episodes in the dataset
Problem: dependent episodes, biased S.E.
Potential: within estimation
Analysis options
Analyzing time to first event only
Discards much information
Analyzing episodes separately
Does not make use of the within information
Pooled estimation
Biased S.E. and sub-optimal use of the within information
Random-effects models
Sub-optimal use of the within information
Fixed-effects models
Uses the within information (biased S.E., but vce() option)
Fixed-Effects Cox Regression
Proportional hazards frailty model
i person index, j episode index

i
person-specific error term

i
and x
ij
(t) are correlated
Pooled estimates and RE-estimates will be biased
FE-estimator via adding person dummies
Does not work (incidental parameter problem)
FE-estimator: absorb
i
in the base rate
Stratified cox regression: r
0i
(t) is not estimated (partial likelihood), thus
there is no incidental parameter problem
) ) ( ' exp( ) ( ) (
0 i ij ij
t x t r t r v | + =
)) ( ' exp( ) ( ) (
0
t x t r t r
ij i ij
| =
Further Readings
Lecture Notes by Josef Brderl on Panel and EH Analysis
http://www.sowi.uni-mannheim.de/lessm/lehre.html
Textbooks
Wooldridge, J. (2003) Introductory Econometrics. Thomson.
Cameron, A.C. and P.K. Trivedi (2005) Microeconometrics.
Panel Data Analysis
Allison, P.D. (2005) Fixed Effects Regression Methods for
Longitudinal Data Using SAS. SAS Press.
Allison, P.D. (1994) Using Panel Data to Estimate the Effects of
Events. Sociological Methods & Research 23: 174-199.
Halaby, C. (2004) Panel Models in Sociological Research. Annual
Rev. of Sociology 30: 507-544.
EHA with repeated events
Allison, P.D. (1996) Fixed-Effects Partial Likelihood for Repeated
Events. Sociological Methods & Research 25: 207-222.

Panel Data Analysis

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Panel Data Analysis

Uploaded by

Copyright:

Available Formats

Panel Data Analysis

You might also like