
Introduction to Stata Lecture 3: Panel Data

Hayley Fisher
1 March 2010
Key reference: Cameron and Trivedi (2009), chapter 8.

Data used in this lecture

This lecture uses data from the first four waves of the British Household Panel Survey (BHPS). I will make
the relevant source files temporarily available on my website but cannot host them there permanently.
You can get the full set of files from the ESDS (http://www.esds.ac.uk/findingData/bhps.asp). If you
want to learn more about the BHPS and how to use it in Stata, I recommend the BHPS introductory
courses provided by the UK Longitudinal Studies Centre (ULSC) at the University of Essex; details
and course materials are available at http://www.iser.essex.ac.uk/survey/bhps/courses. These notes
are loosely based on parts of their course.

The British Household Panel Survey

The BHPS began in 1991 and has interviewed its initial sample, and additional household members,
every year since then. 5,500 households were selected initially, with additional samples of Scotland,
Wales and Northern Ireland added since. Currently 17 waves of the survey are available. For a full
description of the survey see Taylor, Brice, Buck and Prentice-Lane (2009).
Longitudinal datasets such as the BHPS are rarely provided in a format that is straightforward to
read into Stata and start working with. The BHPS is available in a number of formats, including Stata,
but it comes as a series of files containing different variables, split by year and by part of the survey to
keep file sizes manageable. I am using the individual response and household response files from the first
four waves of the survey. A substantial part of this lecture will be devoted to putting together a panel
from these datasets.

2.1 Assembling a cross section using individual and household data

Start your do-file to assemble your dataset by defining a macro for the folder in which the original
datafiles are stored. This makes it easy to alter the folder referenced if necessary in future. Here I use a
global macro:
global dir BHPS
To recall a global macro we prefix it with a $ sign, so here $dir.
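As a small illustration of macro recall (the local macro year below is purely illustrative and is not part of the original do-file):

global dir BHPS
display "$dir"        // a global macro is recalled with the $ prefix
local year 1991
display "`year'"      // a local macro is recalled with `' quotes, as we use later for xlist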
We could simply read in the entire first file in question (aindresp), but this is a large dataset
with many variables. Instead, we can load in just specific variables. We need to look at the codebook accompanying the dataset to choose the variables (alternatively look at the online codebooks at
http://www.iser.essex.ac.uk/survey/bhps/documentation/volume-b-codebooks). We read in the specific
variables by typing:
use ahid apno asex aage pid amastat ahgspn aqfachi afiyr afiyrl using $dir/aindresp
Note that all variables except pid have the prefix a. This is a convention in the BHPS data files: all
files and variables associated with wave 1 have the prefix a, for wave 2 it is b, and so on. Let's describe
the data to see what has been loaded here.

. describe

Contains data from BHPS/aindresp.dta
  obs:        10,264
 vars:            10
 size:       348,976 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
ahid            long   %12.0g                 household identification number
apno            byte   %8.0g                  person number
asex            byte   %8.0g       asex       sex
pid             long   %12.0g                 cross-wave person identifier
amastat         byte   %8.0g       amastat    marital status
ahgspn          byte   %8.0g       ahgspn     pno of spouse/partner
aage            byte   %8.0g       aage       age at date of interview
aqfachi         byte   %8.0g       aqfachi    highest academic qualification
afiyrl          double %10.0g      afiyrl     annual labour income (1.9.90-1.9.91)
afiyr           double %10.0g      afiyr      annual income (1.9.90-1.9.91)
-------------------------------------------------------------------------------
Sorted by:
Three variables here are vital for the construction of our panel dataset. ahid is a household identification
number which we will use to match data from the household file, and apno is a person identification
number within a given household. This can be used in combination with, for example, ahgspn to
match couples together. pid is a cross-wave person identifier; it has no a prefix since it matches the
same variable in all waves, and it is what connects people over time. We also have data on individuals'
sex, age, academic qualifications, labour income and total income.
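As an aside, a minimal sketch of how these identifiers might be used to attach a partner's income to each respondent is below. The file name partner and the sp_ prefix are arbitrary choices, and it is assumed that ahgspn is non-positive or missing where there is no spouse/partner in the household:

use ahid apno afiyr using $dir/aindresp, clear
rename apno ahgspn                 // so each record can be matched as someone's partner
rename afiyr sp_afiyr              // the partner's annual income
sort ahid ahgspn
save partner, replace
use ahid apno ahgspn afiyr using $dir/aindresp, clear
drop if ahgspn<=0 | ahgspn==.      // keep only respondents with a partner present
sort ahid ahgspn
merge ahid ahgspn using partner
keep if _merge==3
drop _merge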
We are going to merge in data from the household file, so we need to sort the individual data by the
household identification number and save it.
. sort ahid
. save aind, replace
Then we load data from the household response file, ahhresp.dta.
use ahid atenure ahhsize ankids afihhyr using $dir/ahhresp
. describe

Contains data from BHPS/ahhresp.dta
  obs:         5,511
 vars:             5
 size:       104,709 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
ahid            long   %12.0g                 household identification number
ahhsize         byte   %8.0g       ahhsize    number of persons in household
ankids          byte   %8.0g       ankids     number of children in household
atenure         byte   %8.0g       atenure    housing tenure
afihhyr         double %10.0g      afihhyr    annual household income (1.9.90-1.9.91)
-------------------------------------------------------------------------------
Sorted by:
This gives household ID numbers and data on household size, number of children, how housing is owned
and total household income. We see there are 5,511 observations (as opposed to 10,264 from the individual
dataset). After sorting by household ID we can merge the two datasets together:

. merge ahid using aind
variable ahid does not uniquely identify observations in aind.dta

. tabulate _merge

     _merge |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          6        0.06        0.06
          3 |     10,264       99.94      100.00
------------+-----------------------------------
      Total |     10,270      100.00

. keep if _merge==3
(6 observations deleted)

. drop _merge
We see that there are 6 observations for which there is no individual data, just household data. I drop
these.
Another useful command for describing data is codebook. This produces a codebook based on what
is in Stata. I have extracted the codebook from my log file and posted it on my website for reference.
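For instance, a compact report on a few of the variables loaded here could be produced with something like this (a minimal sketch; output not shown):

codebook asex amastat aqfachi, compact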
Having saved the dataset, we can do some analysis with this cross section. First, recode the missing
values. Page A3-14 of Taylor et al. (2009) outlines the way missing values are handled in these datafiles:
any negative values are in fact missing. We can recode these all at once using mvdecode:
. mvdecode _all, mv(-9/-1)
atenure: 17 missing values generated
ahgspn: 1 missing value generated
aqfachi: 371 missing values generated
afiyrl: 352 missing values generated
afiyr: 352 missing values generated
Here _all can be replaced with a list of variables. Having created the necessary variables, simple cross
section regressions can be performed. Here I show that the xi prefix can be used to create interaction
terms as well as a series of category dummies.
. generate married=amastat==1
. generate age2=aage^2
. generate lwages=log(afiyrl)
(3937 missing values generated)
. xi: regress lwages aage age2 i.asex*i.married ankids i.aqfachi, vce(robust)
i.asex            _Iasex_1-2          (naturally coded; _Iasex_1 omitted)
i.married         _Imarried_0-1       (naturally coded; _Imarried_0 omitted)
i.asex*i.marr~d   _IaseXmar_#_#       (coded as above)
i.aqfachi         _Iaqfachi_1-7       (naturally coded; _Iaqfachi_1 omitted)

Linear regression                                      Number of obs =    6318
                                                       F( 12,  6305) =  246.80
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.3383
                                                       Root MSE      =  .92103

------------------------------------------------------------------------------
             |               Robust
      lwages |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        aage |   .1932962   .0072313    26.73   0.000     .1791205    .2074719
        age2 |  -.0023262   .0000903   -25.75   0.000    -.0025033   -.0021491
    _Iasex_2 |  -.3133317   .0411562    -7.61   0.000    -.3940119   -.2326515
 _Imarried_1 |   .4771106   .0390765    12.21   0.000     .4005075    .5537138
_IaseXmar_~1 |  -.7551111   .0507519   -14.88   0.000     -.854602   -.6556201
      ankids |  -.2082833   .0148482   -14.03   0.000    -.2373909   -.1791756
 _Iaqfachi_2 |  -.1241103   .0967663    -1.28   0.200    -.3138051    .0655845
 _Iaqfachi_3 |  -.1621818   .0964851    -1.68   0.093    -.3513254    .0269619
 _Iaqfachi_4 |  -.3929866   .0912033    -4.31   0.000    -.5717762   -.2141971
 _Iaqfachi_5 |  -.5157779   .0899938    -5.73   0.000    -.6921963   -.3393595
 _Iaqfachi_6 |  -.5263538   .1001379    -5.26   0.000    -.7226582   -.3300494
 _Iaqfachi_7 |  -.7906337   .0903131    -8.75   0.000     -.967678   -.6135893
       _cons |   5.916767   .1677032    35.28   0.000     5.588011    6.245522
------------------------------------------------------------------------------
Here _Iasex_2 gives the effect of being female, _Imarried_1 the effect of being married for men, and
_IaseXmar_~1 the difference in the effect of being married on wages between women and men.

2.2 Assembling a two wave panel

Panel data can be in two formats: long or wide. Wide data stores each variable separately for each
wave, so there is only one observation for each individual:
   PID   inc1   inc2   inc3   inc4
     1    200    210    220    250
     2    600    660    700    750
     3    250    280    200    210
     4    150    190    250    300

Long data stores all observations of a variable, for example income, in the same variable, and has a wave
variable and multiple observations for each individual:
   PID   wave   inc
     1      1   200
     1      2   210
     1      3   220
     1      4   250
     2      1   600
     2      2   660
     2      3   700
     2      4   750

To use the panel data features in Stata you need to have your data in long format. If your data is in
wide format you can reconfigure it using the reshape command.
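For example, with the wide data sketched above (an identifier PID and variables inc1-inc4), something along these lines converts between the two formats (a sketch assuming those variable names):

reshape long inc, i(PID) j(wave)    // wide to long: creates wave and a single inc variable
reshape wide inc, i(PID) j(wave)    // long to wide: back to inc1, inc2, inc3, inc4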
To put together a two wave panel in long format we need to extract the same data for wave 2 with
prefix b. This is done as above:
. use bhid bpno bsex bage pid bmastat bhgspn bqfachi bfiyr bfiyrl using $dir/bindresp
. sort bhid
. save bind, replace
file bind.dta saved
. use bhid btenure bhhsize bnkids bfihhyr using $dir/bhhresp
. sort bhid
. merge bhid using bind
variable bhid does not uniquely identify observations in bind.dta
. keep if _merge==3
(2 observations deleted)
. drop _merge
. save wave2, replace
file wave2.dta saved
There are two other steps to take for both the wave 1 and wave 2 datasets constructed here: add a
wave variable (just using generate), and remove the prefix (done using the command renpfix).
. generate wave=2
. renpfix b
To combine the two files we use the append command. (To create a wide panel we would not remove the prefixes and would use merge to combine the datasets.)
. use wave1
. generate wave=1
. renpfix a
. append using wave2
Listing the first 20 observations shows that we now have a long panel dataset with two time periods.
. sort pid wave
. list pid wave hid sex age mastat in 1/20, clean

            pid   wave       hid      sex   age     mastat
  1.   10002251      1   1000209   female    91   never ma
  2.   10004491      1   1000381     male    28   never ma
  3.   10004491      2   2000148     male    29   never ma
  4.   10004521      1   1000381     male    26   never ma
  5.   10004521      2   2000148     male    27   never ma
  6.   10007857      1   1000667   female    57    widowed
  7.   10007857      2   2000296   female    59    widowed
  8.   10014578      1   1001221   female    54    married
  9.   10014578      2   2000369   female    55    married
 10.   10014608      1   1001221     male    57    married
 11.   10014608      2   2000369     male    58    married
 12.   10016813      1   1001418     male    36    married
 13.   10016813      2   2000504     male    37    married
 14.   10016848      1   1001418   female    32    married
 15.   10016848      2   2000504   female    33    married
 16.   10017933      1   1001507   female    49    married
 17.   10017933      2   2000717   female    49    married
 18.   10017968      1   1001507     male    46    married
 19.   10017968      2   2000717     male    46    married
 20.   10019057      1   1001604   female    59   never ma

2.3 Creating a longer panel

When performing the same operations on several waves of data, we can write do-files more efficiently
using the foreach and forvalues loop commands. (The forvalues command would be used when you want
to loop over numbers rather than letters or variables.) Here I use foreach to perform the same commands,
just substituting the wave prefix each time. The command to extract all of the data is shown below.
foreach w in a b c d {
    use `w'hid `w'pno `w'sex `w'age pid `w'mastat `w'hgspn `w'qfachi `w'fiyr `w'fiyrl ///
        using $dir/`w'indresp
    sort `w'hid
    save `w'ind, replace
    clear
    use `w'hid `w'tenure `w'hhsize `w'nkids `w'fihhyr using $dir/`w'hhresp
    sort `w'hid
    merge `w'hid using `w'ind
    keep if _merge==3
    drop _merge
    renpfix `w'
    generate wave = index("abcd","`w'")
    save wave`w', replace
}
We start the foreach command by defining what should be replaced in each iteration of the loop, in this
case `w', and giving the list of values to substitute (a, b, c, d for the first four waves). We then write
out the code, substituting `w' where the prefix would normally be. This reproduces the steps we went
through above. The new function used here is index(): this returns the position of `w' in the string "abcd"
and so generates the wave variable.
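For comparison, a minimal sketch of a forvalues loop, which runs over a numeric range rather than a list of items:

forvalues i = 1/4 {
    display "processing wave `i'"    // `i' takes the values 1, 2, 3, 4 in turn
}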
After this code has been run, we can use a similar loop to append the files together, and to delete
the files created in the process:
foreach w in a b c {
    append using wave`w'
}
compress
save BHPS, replace
foreach w in a b c d {
    capture erase wave`w'.dta
    capture erase `w'ind.dta
}
Note that compress ensures that the data is being stored as efficiently as possible. Sorting and listing
the dataset produced shows:
. sort pid wave
. list pid wave hid sex age mastat in 1/20, clean

            pid   wave       hid      sex   age     mastat
  1.   10002251      1   1000209   female    91   never ma
  2.   10004491      1   1000381     male    28   never ma
  3.   10004491      2   2000148     male    29   never ma
  4.   10004521      1   1000381     male    26   never ma
  5.   10004521      2   2000148     male    27   never ma
  6.   10004521      3   3000192     male    28   never ma
  7.   10007857      1   1000667   female    57    widowed
  8.   10007857      2   2000296   female    59    widowed
  9.   10007857      3   3000257   female    59    widowed
 10.   10014578      1   1001221   female    54    married
 11.   10014578      2   2000369   female    55    married
 12.   10014578      3   3000389   female    56    married
 13.   10014608      1   1001221     male    57    married
 14.   10014608      2   2000369     male    58    married
 15.   10014608      3   3000389     male    59    married
 16.   10016813      1   1001418     male    36    married
 17.   10016813      2   2000504     male    37    married
 18.   10016813      3   3000508     male    37    married
 19.   10016813      4   4000307     male    39    married
 20.   10016848      1   1001418   female    32    married

This shows a panel in long format. Note that age mostly increases by one year between waves for each
individual (the age variable here is age at interview date which can vary), whilst sex is constant.
In order to perform analysis exploiting the panel dimension of the dataset we must declare the data to
be a panel. We do this using xtset, declaring the panel variable (here pid) and the time variable (here
wave). Note that the panel variable and time variable must together uniquely identify every observation
in the dataset.
. xtset pid wave
       panel variable:  pid (unbalanced)
        time variable:  wave, 1 to 4, but with gaps
                delta:  1 unit

This shows that we have an unbalanced panel: as seen in the list above, we do not have an observation
for every person in every time period. Some useful commands for investigating panel data are xtdescribe,
xtsum and xttrans:
. xtdescribe

     pid:  10002251, 10004491, ..., 47737689                    n =      12350
    wave:  1, 2, ..., 4                                         T =          4
           Delta(wave) = 1 unit
           Span(wave)  = 4 periods
           (pid*wave uniquely identifies each observation)

Distribution of T_i:   min      5%     25%     50%     75%     95%     max
                         1       1       2       4       4       4       4

     Freq.  Percent    Cum. |  Pattern
 ---------------------------+---------
      7643    61.89    61.89 |  1111
      1009     8.17    70.06 |  1...
       679     5.50    75.55 |  11..
       596     4.83    80.38 |  ...1
       527     4.27    84.65 |  111.
       458     3.71    88.36 |  .111
       418     3.38    91.74 |  ..11
       290     2.35    94.09 |  .1..
       197     1.60    95.68 |  ..1.
       533     4.32   100.00 |  (other patterns)
 ---------------------------+---------
     12350   100.00          |  XXXX
xtdescribe gives information about the panel structure: we see that there are 12,350 individuals and
4 time periods, and that 62% of people have observations in all four time periods.
. xtsum sex age nkids fiyr lwages mastat

Variable         |      Mean   Std. Dev.        Min         Max |    Observations
-----------------+----------------------------------------------+----------------
sex      overall |  1.531028   .4990427           1           2 |     N =   39190
         between |             .4995023           1           2 |     n =   12350
         within  |                    0    1.531028    1.531028 | T-bar = 3.17328
                 |                                              |
age      overall |  44.00559   18.43386          15          97 |     N =   39190
         between |             18.93358          15        96.5 |     n =   12350
         within  |             1.044435    32.33892    50.33892 | T-bar = 3.17328
                 |                                              |
nkids    overall |  .5951008   .9762426           0           9 |     N =   39190
         between |             .9375915           0           9 |     n =   12350
         within  |             .2487689   -3.154899    3.595101 | T-bar = 3.17328
                 |                                              |
fiyr     overall |  8939.016    8929.79           0    287481.8 |     N =   37455
         between |             8186.321           0    160891.5 |     n =   11982
         within  |             3631.936   -70011.81    204307.5 | T-bar = 3.12594
                 |                                              |
lwages   overall |  8.849195   1.166257   -3.321733    12.56891 |     N =   23609
         between |             1.206297   -3.321733    11.61299 |     n =    8323
         within  |             .4933544    1.250891    14.18753 | T-bar =  2.8366
                 |                                              |
mastat   overall |  2.497844   2.031626           0           6 |     N =   39185
         between |             2.043197           0           6 |     n =   12350
         within  |             .5396737   -2.002156    6.247844 | T-bar = 3.17287
xtsum gives summary statistics and breaks the variation down into between-individual and within-individual
components: so we see that sex does not vary within individuals, and that log wages vary more between
individuals than within individuals. We also see the total number of observations (N), the number of
individuals with observations (n) and the average number of time periods per individual (T-bar).
. xttrans married, freq

           |       Married=1
 Married=1 |         0          1 |     Total
-----------+----------------------+----------
         0 |    10,669        492 |    11,161
           |     95.59       4.41 |    100.00
-----------+----------------------+----------
         1 |       367     15,312 |    15,679
           |      2.34      97.66 |    100.00
-----------+----------------------+----------
     Total |    11,036     15,804 |    26,840
           |     41.12      58.88 |    100.00
xttrans gives an indication of whether there are transitions between groups for categorical variables:
for example, we see here that 96% of unmarried individuals remain unmarried in the next period, and
98% of married individuals remain married in the next period.

Regression using panel data

Having set up our dataset we can perform some regressions. As previously, I use a local macro to store
my list of independent variables:
local xlist "age age2 i.sex*i.married nkids i.qfachi"
We can estimate a pooled OLS regression using the regress command seen in the last lecture. We
should use robust standard errors clustered by individual.
. xi: regress lwages `xlist', vce(cluster pid)
i.sex             _Isex_1-2           (naturally coded; _Isex_1 omitted)
i.married         _Imarried_0-1       (naturally coded; _Imarried_0 omitted)
i.sex*i.married   _IsexXmar_#_#       (coded as above)
i.qfachi          _Iqfachi_1-7        (naturally coded; _Iqfachi_1 omitted)

Linear regression                                      Number of obs =   23520
                                                       F( 12,  8285) =  409.27
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.3092
                                                       Root MSE      =  .96854

                              (Std. Err. adjusted for 8286 clusters in pid)
------------------------------------------------------------------------------
             |               Robust
      lwages |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .1890312   .0055574    34.01   0.000     .1781373    .1999251
        age2 |  -.0022714   .0000708   -32.10   0.000    -.0024101   -.0021327
     _Isex_2 |  -.3258962   .0300076   -10.86   0.000    -.3847186   -.2670737
 _Imarried_1 |    .488806   .0286158    17.08   0.000     .4327118    .5449002
_IsexXmar_~1 |   -.654709   .0376179   -17.40   0.000    -.7284494   -.5809685
       nkids |  -.2203951   .0115555   -19.07   0.000    -.2430467   -.1977435
  _Iqfachi_2 |  -.1272603    .077578    -1.64   0.101    -.2793325     .024812
  _Iqfachi_3 |  -.1781531   .0785097    -2.27   0.023    -.3320517   -.0242544
  _Iqfachi_4 |  -.4158935   .0734584    -5.66   0.000    -.5598903   -.2718966
  _Iqfachi_5 |  -.5313831   .0726375    -7.32   0.000    -.6737708   -.3889953
  _Iqfachi_6 |   -.617404    .080617    -7.66   0.000    -.7754335   -.4593746
  _Iqfachi_7 |  -.8375914    .073551   -11.39   0.000    -.9817697    -.693413
       _cons |   6.046451   .1298038    46.58   0.000     5.792003    6.300899
------------------------------------------------------------------------------
Fixed and random effects regressions are both carried out using the xtreg command. In both cases we
should again use cluster-robust standard errors. The default is for Stata to estimate random effects when
xtreg is used; you must specify the option fe to get fixed effects:
. xi: xtreg lwages `xlist', fe vce(cluster pid)
i.sex             _Isex_1-2           (naturally coded; _Isex_1 omitted)
i.married         _Imarried_0-1       (naturally coded; _Imarried_0 omitted)
i.sex*i.married   _IsexXmar_#_#       (coded as above)
i.qfachi          _Iqfachi_1-7        (naturally coded; _Iqfachi_1 omitted)

Fixed-effects (within) regression               Number of obs      =     23520
Group variable: pid                             Number of groups   =      8286

R-sq:  within  = 0.0451                         Obs per group: min =         1
       between = 0.1672                                        avg =       2.8
       overall = 0.1293                                        max =         4

                                                F(11,8285)         =     38.35
corr(u_i, Xb)  = -0.3782                        Prob > F           =    0.0000

                              (Std. Err. adjusted for 8286 clusters in pid)
------------------------------------------------------------------------------
             |               Robust
      lwages |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .2902458   .0173775    16.70   0.000     .2561815    .3243102
        age2 |  -.0030951   .0002108   -14.68   0.000    -.0035083   -.0026819
     _Isex_2 |  (dropped)
 _Imarried_1 |   .0473639   .0421936     1.12   0.262    -.0353461    .1300739
_IsexXmar_~1 |  -.0574721   .0682978    -0.84   0.400    -.1913529    .0764086
       nkids |  -.1347352   .0193899    -6.95   0.000    -.1727443   -.0967261
  _Iqfachi_2 |   .0239754   .2786775     0.09   0.931    -.5223023    .5702531
  _Iqfachi_3 |  -.2838254   .2629136    -1.08   0.280    -.7992019     .231551
  _Iqfachi_4 |  -.1524719   .2703457    -0.56   0.573    -.6824172    .3774734
  _Iqfachi_5 |  -.4847609   .2739852    -1.77   0.077    -1.021841    .0523187
  _Iqfachi_6 |  -.4576768   .3040223    -1.51   0.132    -1.053637     .138283
  _Iqfachi_7 |  -.3651604   .3077017    -1.19   0.235    -.9683329     .238012
       _cons |   3.198992     .44452     7.20   0.000     2.327621    4.070362
-------------+----------------------------------------------------------------
     sigma_u |  1.1636036
     sigma_e |  .59940523
         rho |  .79029066   (fraction of variance due to u_i)
------------------------------------------------------------------------------
Here sex is dropped: this is because it is invariant over time. The lack of variation in qualifications and
marital status explains the imprecise coefficient estimates here.
. xi: xtreg lwages `xlist', re vce(cluster pid)
i.sex             _Isex_1-2           (naturally coded; _Isex_1 omitted)
i.married         _Imarried_0-1       (naturally coded; _Imarried_0 omitted)
i.sex*i.married   _IsexXmar_#_#       (coded as above)
i.qfachi          _Iqfachi_1-7        (naturally coded; _Iqfachi_1 omitted)

Random-effects GLS regression                   Number of obs      =     23520
Group variable: pid                             Number of groups   =      8286

R-sq:  within  = 0.0341                         Obs per group: min =         1
       between = 0.3457                                        avg =       2.8
       overall = 0.3053                                        max =         4

Random effects u_i ~ Gaussian                   Wald chi2(12)      =   4356.47
corr(u_i, X)   = 0 (assumed)                    Prob > chi2        =    0.0000

                              (Std. Err. adjusted for 8286 clusters in pid)
------------------------------------------------------------------------------
             |               Robust
      lwages |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .2133805   .0058073    36.74   0.000     .2019984    .2247626
        age2 |  -.0025222   .0000737   -34.22   0.000    -.0026667   -.0023778
     _Isex_2 |  -.4601064   .0312502   -14.72   0.000    -.5213557   -.3988571
 _Imarried_1 |    .341099   .0275778    12.37   0.000     .2870475    .3951504
_IsexXmar_~1 |  -.4701444   .0385164   -12.21   0.000    -.5456351   -.3946536
       nkids |   -.201973   .0116565   -17.33   0.000    -.2248194   -.1791267
  _Iqfachi_2 |  -.0940869   .0991879    -0.95   0.343    -.2884916    .1003178
  _Iqfachi_3 |  -.1635236   .0960624    -1.70   0.089    -.3518024    .0247552
  _Iqfachi_4 |  -.3019628   .0911546    -3.31   0.001    -.4806225   -.1233031
  _Iqfachi_5 |  -.4841929   .0903848    -5.36   0.000    -.6613438   -.3070421
  _Iqfachi_6 |   -.524829   .0975616    -5.38   0.000    -.7160463   -.3336117
  _Iqfachi_7 |   -.785563   .0915491    -8.58   0.000    -.9649959   -.6061301
       _cons |   5.485352   .1464953    37.44   0.000     5.198227    5.772478
-------------+----------------------------------------------------------------
     sigma_u |  .87706892
     sigma_e |  .59940523
         rho |  .68163491   (fraction of variance due to u_i)
------------------------------------------------------------------------------
Stata reports the standard deviations of the error components estimated in sigma_u and sigma_e. We
also see different R2 statistics for within and between variation. These can be tabulated if the estimates
have been stored.
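The storing step itself is not shown above; it might have looked something like this (the names POLS, FE and RE are assumptions chosen to match the esttab call below, and esttab comes from the user-written estout package, installable with ssc install estout):

quietly xi: regress lwages `xlist', vce(cluster pid)
estimates store POLS
quietly xi: xtreg lwages `xlist', fe vce(cluster pid)
estimates store FE
quietly xi: xtreg lwages `xlist', re vce(cluster pid)
estimates store RE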
. esttab POLS FE RE, b se stats(r2 r2_o r2_b r2_w)

------------------------------------------------------------
                      (1)             (2)             (3)
                   lwages          lwages          lwages
------------------------------------------------------------
age                 0.189***        0.290***        0.213***
                (0.00556)        (0.0174)       (0.00581)

age2             -0.00227***     -0.00310***     -0.00252***
              (0.0000708)      (0.000211)     (0.0000737)

_Isex_2            -0.326***            0          -0.460***
                 (0.0300)             (0)        (0.0313)

_Imarried_1         0.489***       0.0474           0.341***
                 (0.0286)        (0.0422)        (0.0276)

_IsexXmar_~1       -0.655***      -0.0575          -0.470***
                 (0.0376)        (0.0683)        (0.0385)

nkids              -0.220***      -0.135***        -0.202***
                 (0.0116)        (0.0194)        (0.0117)

_Iqfachi_2         -0.127          0.0240         -0.0941
                 (0.0776)         (0.279)        (0.0992)

_Iqfachi_3         -0.178*         -0.284          -0.164
                 (0.0785)         (0.263)        (0.0961)

_Iqfachi_4         -0.416***       -0.152          -0.302***
                 (0.0735)         (0.270)        (0.0912)

_Iqfachi_5         -0.531***       -0.485          -0.484***
                 (0.0726)         (0.274)        (0.0904)

_Iqfachi_6         -0.617***       -0.458          -0.525***
                 (0.0806)         (0.304)        (0.0976)

_Iqfachi_7         -0.838***       -0.365          -0.786***
                 (0.0736)         (0.308)        (0.0915)

_cons               6.046***        3.199***        5.485***
                  (0.130)         (0.445)         (0.146)
------------------------------------------------------------
r2                  0.309          0.0451
r2_o                                0.129           0.305
r2_b                                0.167           0.346
r2_w                               0.0451          0.0341
------------------------------------------------------------
Standard errors in parentheses
* p<0.05, ** p<0.01, *** p<0.001
Estimation can also easily be implemented in first differences using the regress command and the D.
difference operator. We do not need to generate the differenced variables ourselves. The option noconstant
is used so that Stata does not add a constant term (which would be differenced out). For example:
. regress D.(lwages age age2 female married nkids), vce(cluster pid) noconstant

Linear regression                                      Number of obs =   14667
                                                       F(  4,  6146) =   65.00
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.0208
                                                       Root MSE      =  .73681

                              (Std. Err. adjusted for 6147 clusters in pid)
------------------------------------------------------------------------------
             |               Robust
    D.lwages |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |
         D1. |   .3179554   .0211961    15.00   0.000     .2764036    .3595071
        age2 |
         D1. |  -.0034294   .0002499   -13.72   0.000    -.0039192   -.0029395
      female |
         D1. |  (dropped)
     married |
         D1. |   .0144043   .0390605     0.37   0.712     -.062168    .0909767
       nkids |
         D1. |  -.0957519   .0221907    -4.31   0.000    -.1392534   -.0522504
------------------------------------------------------------------------------

3.1 Hausman test

Stata can easily perform a Hausman test, that is, a test of whether the individual effects are random.
The null hypothesis is that both the fixed and random effects estimators are consistent; the alternative
hypothesis is that the random effects estimator is not consistent. We must first estimate the fixed and
random effects models without robust standard errors. Then the Hausman test is conducted using the
hausman command.
. quietly xi: xtreg lwages `xlist', fe
. estimates store FE1
. quietly xi: xtreg lwages `xlist', re
. estimates store RE1
. hausman FE1 RE1, sigmamore

                 ---- Coefficients ----
             |      (b)          (B)            (b-B)     sqrt(diag(V_b-V_B))
             |      FE1          RE1         Difference          S.E.
-------------+----------------------------------------------------------------
         age |    .2902458     .2133805        .0768653        .0125061
        age2 |   -.0030951    -.0025222       -.0005729        .0001535
 _Imarried_1 |    .0473639      .341099        -.293735        .0336022
_IsexXmar_~1 |   -.0574721    -.4701444        .4126723        .0470245
       nkids |   -.1347352     -.201973        .0672378        .0120169
  _Iqfachi_2 |    .0239754    -.0940869        .1180623        .1387245
  _Iqfachi_3 |   -.2838254    -.1635236       -.1203018        .1676963
  _Iqfachi_4 |   -.1524719    -.3019628        .1494909        .1630687
  _Iqfachi_5 |   -.4847609    -.4841929        -.000568        .1697845
  _Iqfachi_6 |   -.4576768     -.524829        .0671522        .2006431
  _Iqfachi_7 |   -.3651604     -.785563        .4204026        .1869317
------------------------------------------------------------------------------
                          b = consistent under Ho and Ha; obtained from xtreg
           B = inconsistent under Ha, efficient under Ho; obtained from xtreg

    Test:  Ho:  difference in coefficients not systematic

                 chi2(11) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                          =   238.04
                Prob>chi2 =    0.0000

Cameron and Trivedi (2009) recommend using the sigmamore option. Here we see the null hypothesis
is clearly rejected with a p-value of 0.0000, so the random effects estimates are not consistent.

Creating variables to identify changes in variables

We may wish to create a variable which records whether a certain status has changed, for example
whether marital status has changed. Once the data is declared to be a panel this is straightforward. Let's
first recode mastat so that it has just three categories:
recode mastat (0=.) (1/2=1) (3/5=2) (6=3), generate(ma)
Then to find changes we generate a new variable which incorporates the lagged value and current value
of ma:
generate mach=(10*L.ma)+ma
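The labelling itself is not shown in the notes; a sketch of what it might look like is below (the label name machlbl is an assumption; the label text matches the tabulation shown below):

label define machlbl 11 "stayed in couple"             12 "partnership ended"            ///
                     13 "partnered -> never married!"  21 "ex-partner -> partnership"    ///
                     22 "stayed ex-partner"            31 "never married -> partnership" ///
                     32 "never married -> ex-partner"  33 "stayed never married"
label values mach machlbl
label variable mach "marital change"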
Having labelled the values, we have a useful marital change variable. So we can see that there are 352
instances of individuals going from never having been married to having a partner in this sample.
. tabulate mach

              marital change |      Freq.     Percent        Cum.
-----------------------------+-----------------------------------
            stayed in couple |     16,849       63.94       63.94
           partnership ended |        360        1.37       65.30
 partnered -> never married! |        113        0.43       65.73
   ex-partner -> partnership |        180        0.68       66.41
           stayed ex-partner |      3,622       13.74       80.16
never married -> partnership |        352        1.34       81.49
 never married -> ex-partner |         14        0.05       81.55
        stayed never married |      4,863       18.45      100.00
-----------------------------+-----------------------------------
                       Total |     26,353      100.00

References
Cameron, A. Colin and Pravin K. Trivedi, Microeconometrics Using Stata, Texas: Stata Press, 2009.

Taylor, Marcia Freed, John Brice, Nick Buck, and Elaine Prentice-Lane, British Household Panel
Survey User Manual Volume A: Introduction, Technical Report and Appendices, ISER, University
of Essex, Colchester, 2009.
