Lecture 7 (Ch14) Pooled Cross Sections and Simple Panel Data Methods
Panel data
Panel data are cross-section data collected at different points in time that follow the same individuals over time. With panel data you can do a bit more than with pooled cross sections. You usually include year dummies as well.
Example 1
Suppose you would like to see how the fertility rate changes over time after controlling for various characteristics. The next slide shows the OLS estimates of the determinants of fertility over time (data: FERTIL1.dta). The data are collected every other year. The base year for the year dummies is 1972.
      Source |       SS         df       MS
-------------+--------------------------------
       Model |  399.265559      16  24.9540975
    Residual |  2686.24374    1112  2.41568682
-------------+--------------------------------
       Total |   3085.5093    1128  2.73538059

----------------------------------------------------------------------
             |      Coef.   Std. Err.      t     [95% Conf. Interval]
-------------+--------------------------------------------------------
             |  -.1287556   .0183209    -7.03     -.164703   -.0928081
             |    .535383   .1380659     3.88      .264484    .8062821
             |  -.0058384    .001561    -3.74    -.0089013   -.0027756
             |   1.077747   .1733806     6.22     .7375571    1.417937
             |   .2180929   .1327211     1.64     -.042319    .4785049
             |   .3616071   .1207846     2.99     .1246157    .5985984
             |   .1989796   .1668093     1.19    -.1283168    .5262761
             |  -.0553556    .146947    -0.38    -.3436803    .2329692
             |  -.1662171   .1751486    -0.95    -.5098761     .177442
             |   .0825938    .124396     0.66    -.1614836    .3266712
             |   .2092197   .1600797     1.31    -.1048727    .5233121
             |    .301226   .1488953     2.02     .0090786    .5933735
             |  -.0639849   .1556646    -0.41    -.3694143    .2414445
             |   -.037886   .1598956    -0.24    -.3516171    .2758452
   year 1982 |  -.4892665   .1482989    -3.30    -.7802437   -.1982893
   year 1984 |  -.5112715   .1496524    -3.42    -.8049044   -.2176385
    constant |  -7.844731   3.038574    -2.58    -13.80672   -1.882745
----------------------------------------------------------------------
Holding the other characteristics fixed, a woman in 1982 has on average 0.49 fewer children than a woman in the base year. A similar result is found for 1984.
The year dummies for 1982 and 1984 show significant drops in the fertility rate over time.
Example 2
CPS78_85.dta has wage data collected in 1978 and 1985. We estimate an earnings equation that includes education, experience, experience squared, a union dummy, a female dummy and a year dummy for 1985. Suppose that you want to see whether the gender gap has changed over time. You then include an interaction between female and the 1985 dummy; that is, you estimate the following.
$\log(wage) = \beta_0 + \beta_1 (educ) + \beta_2 (exper) + \beta_3 (expersq) + \beta_4 (union) + \beta_5 (female) + \beta_6 (year85) + \beta_7 (year85)(female)$
You can check whether the gender wage gap in 1985 is different from that in the base year (1978) by testing whether $\beta_7$ is equal to zero. The gender gap in each period is given by:
- gender gap in the base year (1978) = $\beta_5$
- gender gap in 1985 = $\beta_5 + \beta_7$
. reg lwage educ exper expersq union female y85 y85fem

      Source |       SS         df       MS            Number of obs =    1084
-------------+--------------------------------         F(  7,  1076) =  113.20
       Model |  135.328704       7   19.332672         Prob > F      =  0.0000
    Residual |  183.762464    1076  .170782959         R-squared     =  0.4241
-------------+--------------------------------         Adj R-squared =  0.4204
       Total |  319.091167    1083   .29463635         Root MSE      =  .41326

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0833217   .0050646    16.45   0.000     .0733841    .0932594
       exper |   .0294761   .0035717     8.25   0.000     .0224679    .0364844
     expersq |  -.0003975   .0000776    -5.12   0.000    -.0005498   -.0002451
       union |    .205237   .0302943     6.77   0.000     .1457945    .2646795
      female |  -.3195333   .0366427    -8.72   0.000    -.3914324   -.2476341
         y85 |   .3530916   .0333324    10.59   0.000     .2876877    .4184954
      y85fem |   .0884046   .0513498     1.72   0.085    -.0123524    .1891616
       _cons |   .3522088   .0763137     4.62   0.000     .2024683    .5019493
------------------------------------------------------------------------------
The coefficient for the interaction term (y85)(female) is positive and significant at the 10% significance level, so the gender gap appears to have narrowed over time.
Gender gap in 1978 = -0.3195
Gender gap in 1985 = -0.3195 + 0.0884 = -0.2311
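The arithmetic behind the two gender gaps can be sketched in a few lines. This is only an illustration in Python (the course itself uses Stata); the variable names are mine, and the numbers are copied from the regression output for Example 2.

```python
# Coefficients and standard error copied from the Example 2 regression output.
b_female = -0.3195333   # gender gap in the base year (1978)
b_y85fem = 0.0884046    # change in the gender gap between 1978 and 1985
se_y85fem = 0.0513498   # standard error of the interaction coefficient

gap_1978 = b_female
gap_1985 = b_female + b_y85fem

# t statistic for the null hypothesis that the gap did not change,
# i.e. that the coefficient on (y85)(female) is zero.
t_y85fem = b_y85fem / se_y85fem

print(f"gender gap in 1978: {gap_1978:.4f}")
print(f"gender gap in 1985: {gap_1985:.4f}")
print(f"t statistic for y85fem: {t_y85fem:.2f}")
```

Because the dependent variable is log(wage), these gaps are log-point differences, which approximate proportional wage differences.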
Policy analysis with pooled cross sections: The difference in difference estimator
I explain a typical policy analysis with pooled cross section data, called the difference-in-difference estimation, using an example.
The most naive analysis would be to run the following regression using only the 1981 data.
$price = \beta_0 + \beta_1 (nearinc) + u$
where price is the real house price (i.e., deflated using the CPI to express it in constant 1978 dollars). Using KIELMC.dta, the result is the following.
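. reg rprice nearinc if year==1981

      Source |       SS         df       MS            Number of obs =     142
-------------+--------------------------------         F(  1,   140) =   27.73
       Model |  2.7059e+10       1  2.7059e+10         Prob > F      =  0.0000
    Residual |  1.3661e+11     140   975815048         R-squared     =  0.1653
-------------+--------------------------------         Adj R-squared =  0.1594
       Total |  1.6367e+11     141  1.1608e+09         Root MSE      =   31238

------------------------------------------------------
      rprice |      Coef.   Std. Err.      t    P>|t|
-------------+----------------------------------------
     nearinc |  -30688.27   5827.709    -5.27   0.000
       _cons |   101307.5   3093.027    32.75   0.000
------------------------------------------------------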
But can we say from this estimation that the incinerator has negatively affected the housing price?
To see this, estimate the same equation using the 1978 data. Note that this is before the rumor about the incinerator being built began.
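. reg rprice nearinc if year==1978

      Source |       SS         df       MS            Number of obs =     179
-------------+--------------------------------         F(  1,   177) =   15.74
       Model |  1.3636e+10       1  1.3636e+10         Prob > F      =  0.0001
    Residual |  1.5332e+11     177   866239953         R-squared     =  0.0817
-------------+--------------------------------         Adj R-squared =  0.0765
       Total |  1.6696e+11     178   937979126         Root MSE      =   29432

-------------------------------------------
      rprice |      Coef.      t    P>|t|
-------------+-----------------------------
     nearinc |  -18824.37    -3.97   0.000
       _cons |   82517.23    31.09   0.000
-------------------------------------------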
Note that, already in 1978, houses near the place where the incinerator is to be built are cheaper than houses farther from the location. So the negative coefficient simply means that the garbage incinerator was built in a location where housing prices are low.
Compared to 1978, the price penalty for houses near the incinerator is greater in 1981. Perhaps the increase in the price penalty in 1981 is caused by the incinerator.
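. reg rprice nearinc if year==1981

      Source |       SS         df
-------------+----------------------
       Model |  2.7059e+10       1
    Residual |  1.3661e+11     140
-------------+----------------------
       Total |  1.6367e+11     141

--------------------------
      rprice |      Coef.
-------------+------------
     nearinc |  -30688.27
       _cons |   101307.5
--------------------------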
The difference-in-difference estimator in this example may be computed as follows. I will show you a more general case later on.
The difference-in-difference estimator:
$\hat{\delta}_1$ = (coefficient for nearinc in 1981) - (coefficient for nearinc in 1978) = -30688.27 - (-18824.37) = -11863.9
So the incinerator has decreased house prices by about 11,864 dollars on average.
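Note that, in this example, the coefficient for (nearinc) in 1978 is equal to
(average price of houses near the incinerator in 1978) - (average price of houses not near the incinerator in 1978).
This is because the regression includes only one dummy variable (just recall Ex.1 of homework 2).
Therefore the difference-in-difference estimator in this example can be written as
$\hat{\delta}_1 = (\overline{price}_{1981,near} - \overline{price}_{1981,far}) - (\overline{price}_{1978,near} - \overline{price}_{1978,far})$
where "near" and "far" denote houses near and not near the incinerator.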
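As a rough check on this, the sketch below rebuilds the four group averages from the two single-year regressions reported earlier (the intercept is the average price of houses not near the incinerator; the intercept plus the nearinc coefficient is the average price of houses near it) and then takes the difference of the differences. Python is used only for illustration, and the variable names are mine.

```python
# Group averages implied by the single-year regressions of rprice on nearinc.
far_1978 = 82517.23                    # _cons, 1978 regression
near_1978 = 82517.23 + (-18824.37)     # _cons + nearinc coefficient, 1978
far_1981 = 101307.5                    # _cons, 1981 regression
near_1981 = 101307.5 + (-30688.27)     # _cons + nearinc coefficient, 1981

# Price penalty for being near the incinerator site in each year.
penalty_1978 = near_1978 - far_1978
penalty_1981 = near_1981 - far_1981

# Difference-in-difference: change in the penalty between 1978 and 1981.
did = penalty_1981 - penalty_1978
print(f"penalty in 1978: {penalty_1978:.2f}")
print(f"penalty in 1981: {penalty_1981:.2f}")
print(f"difference-in-difference: {did:.2f}")
```

The result is about -11863.9, the same number you obtain by subtracting the two nearinc coefficients directly.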
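. reg rprice nearinc y81 y81nrinc

      Source |       SS         df       MS            Number of obs =     321
-------------+--------------------------------         F(  3,   317) =   22.25
       Model |  6.1055e+10       3  2.0352e+10         Prob > F      =  0.0000
    Residual |  2.8994e+11     317   914632739         R-squared     =  0.1739
-------------+--------------------------------         Adj R-squared =  0.1661
       Total |  3.5099e+11     320  1.0969e+09         Root MSE      =   30243

------------------------------------------------------------------------------
      rprice |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     nearinc |  -18824.37   4875.322    -3.86   0.000    -28416.45   -9232.293
         y81 |   18790.29   4050.065     4.64   0.000     10821.88    26758.69
    y81nrinc |   -11863.9   7456.646    -1.59   0.113    -26534.67    2806.867
       _cons |   82517.23    2726.91    30.26   0.000      77152.1    87882.36
------------------------------------------------------------------------------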
The coefficient for y81nrinc is the difference-in-difference estimator. This regression form is more general since, in addition to the policy dummy (nearinc), you can include more variables that affect the housing price, such as the number of bedrooms. When you include more variables, $\hat{\delta}_1$ cannot be expressed in the simple difference-in-difference format. However, the interpretation does not change, and therefore it is still called the difference-in-difference estimator.
The group of people who are affected by the policy is called the treatment group. Those who are not affected by the policy are called the control group. Suppose that you want to know how a change in the spousal tax deduction has affected the hours worked by women. Suppose you have pooled data on workers in 1994 and 1995. The next slide shows the typical procedure you follow to conduct the difference-in-difference analysis.
Difference-in-difference estimator. This shows the effect of the policy change on women's hours worked.
    [95% Conf. Interval]
   -.9199137    .8163574
   -1.206287    .2966021
    -.095553    1.374411
   -4.384433    14.35799
You did not find evidence that receiving the grant reduces the scrap rate.
The reason why we did not find a significant effect is probably the endogeneity problem. Companies with low-ability workers tend to apply for the grant, which creates a positive bias in the estimate. If you observed the average ability of the workers, you could eliminate the bias by including the ability variable. But since you cannot observe ability, you have the following situation.
$\log(scrap) = \beta_0 + \beta_1 (grant) + \beta_2 \log(sales) + \beta_3 \log(employment) + \underbrace{(\beta_4 ability + u)}_{v}$
where ability is in the error term v. $v = (\beta_4 ability + u)$ is called the composite error term.
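Because ability and grant are (negatively) correlated, this causes a bias in the coefficient for (grant). We predicted the direction of the bias in the following way.
$E(\tilde{\beta}_1) = \beta_1 + \beta_4 \tilde{\delta}_1$
Here $\beta_1$ is the true effect of grant (negative) and $\beta_4 \tilde{\delta}_1$ is the bias term. $\beta_4$ is the effect of ability on the scrap rate (negative), and the sign of $\tilde{\delta}_1$ is determined by the correlation between ability and grant (negative). The signs are therefore (-) + (-)(-), so the bias term is positive.
The true negative effect of grant is cancelled out by the bias term. Thus, the bias makes it difficult to find the effect.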
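The direction of the bias can be illustrated with a tiny made-up data set. In the sketch below all numbers are hypothetical (they are not taken from JTRAIN.dta), sales and employment are left out to keep it short, and the error u is set to zero so that the bias formula holds exactly in the sample. Python is used only for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def slope(x, y):
    """OLS slope from regressing y on x and a constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

# Hypothetical true coefficients: grant lowers scrap, ability lowers scrap.
beta1 = -0.3   # true effect of grant
beta4 = -0.2   # effect of ability

# Hypothetical firms: the firms that receive the grant have lower ability.
grant = [1, 1, 1, 0, 0, 0]
ability = [1, 2, 3, 3, 4, 5]

# log(scrap) generated without noise, so the formula is exact here.
lscrap = [beta1 * g + beta4 * a for g, a in zip(grant, ability)]

delta1 = slope(grant, ability)      # regression of ability on grant (negative)
bias = beta4 * delta1               # bias term: (-)(-) = (+)
short_beta1 = slope(grant, lscrap)  # regression that omits ability

print(f"delta1 = {delta1:.2f}, bias = {bias:.2f}")
print(f"estimate omitting ability = {short_beta1:.2f}, true effect = {beta1:.2f}")
```

In this made-up example the bias (+0.4) is larger than the true effect (-0.3), so the regression that omits ability even shows a positive coefficient for grant.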
Now you know that there is a bias. Is there anything we can do to correct for it? When you have panel data, you can eliminate the bias. I will explain the method using this example and generalize it later.
The grant was administered in 1988. Suppose that you have panel data on firms for two periods, 1987 and 1988. Further assume that the average ability of workers does not change over time, so that (ability) is interpreted as the innate ability of workers, such as IQ.
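When you have two-period panel data, the equation can be written as:
$\log(scrap)_{it} = \beta_0 + \beta_1 (grant)_{it} + \beta_2 \log(sales)_{it} + \beta_3 \log(employment)_{it} + \beta_5 (year88)_{it} + \underbrace{(\beta_4 ability_i + u_{it})}_{v_{it}}$
i is the index for the ith firm and t is the index for the period. Since ability is constant over time, ability has only the i index. Now, I will use shorthand notation for $\beta_4 (ability)_i$. Since (ability) is assumed constant over time, write $\beta_4 (ability)_i = a_i$. Then the above equation can be written as:
$\log(scrap)_{it} = \beta_0 + \beta_1 (grant)_{it} + \beta_2 \log(sales)_{it} + \beta_3 \log(employment)_{it} + \beta_5 (year88)_{it} + a_i + u_{it}$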
$a_i$ is called the fixed effect, or the unobserved effect. If you want to emphasize that it is an unobserved firm characteristic, you can call it the firm fixed effect as well.
$u_{it}$ is called the idiosyncratic error.
The bias in OLS occurs because the fixed effect is correlated with (grant). So if we can get rid of the fixed effect, we can eliminate the bias. This is the basic idea. In the next slide, I will show the procedure of what is called first-differenced estimation.
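First, for each firm, take the first difference. That is, compute the following.
$\Delta\log(scrap)_{it} = \log(scrap)_{it} - \log(scrap)_{it-1}$
It follows that
$\Delta\log(scrap)_{it} = \beta_0 + \beta_1 (grant)_{it} + \beta_2 \log(sales)_{it} + \beta_3 \log(employment)_{it} + \beta_5 (year88)_{it} + (a_i + u_{it})$
$\quad - [\beta_0 + \beta_1 (grant)_{it-1} + \beta_2 \log(sales)_{it-1} + \beta_3 \log(employment)_{it-1} + \beta_5 (year88)_{it-1} + (a_i + u_{it-1})]$
$\quad = \beta_1 \Delta(grant)_{it} + \beta_2 \Delta\log(sales)_{it} + \beta_3 \Delta\log(employment)_{it} + \beta_5 \Delta(year88)_{it} + \Delta u_{it}$
So, by taking the first difference, you can eliminate the fixed effect.
$\Delta\log(scrap)_{it} = \beta_1 \Delta(grant)_{it} + \beta_2 \Delta\log(sales)_{it} + \beta_3 \Delta\log(employment)_{it} + \beta_5 \Delta(year88)_{it} + \Delta u_{it}$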
If $\Delta u_{it}$ is not correlated with $\Delta(grant)_{it}$, estimating the first-differenced model using OLS will produce unbiased estimates. If we have controlled for enough time-varying variables, it is reasonable to assume that they are uncorrelated. Note that this model does not have a constant. Now, estimate this model using JTRAIN.dta.
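To see why differencing helps, here is a minimal sketch with made-up numbers (not JTRAIN.dta). The assumptions are mine: six hypothetical firms, a fixed effect that is larger for the firms receiving the grant, no other regressors, and an idiosyncratic error set to zero so the result is exact. With a single dummy regressor and a constant, the OLS slope is just the difference in group means, so no regression routine is needed.

```python
def mean(values):
    return sum(values) / len(values)

def diff_in_means(y, d):
    """OLS slope of y on a single dummy d and a constant."""
    treated = [yi for yi, di in zip(y, d) if di == 1]
    control = [yi for yi, di in zip(y, d) if di == 0]
    return mean(treated) - mean(control)

beta1 = -0.3   # hypothetical true effect of the grant
beta5 = 0.1    # hypothetical 1988 year effect

# Hypothetical fixed effects a_i: firms with a large a_i receive the grant.
a = [2.0, 2.0, 1.0, 0.0, 0.0, -1.0]
grant88 = [1, 1, 1, 0, 0, 0]   # no firm has the grant in 1987

# log(scrap) in each year (idiosyncratic error set to zero).
lscrap87 = [ai for ai in a]
lscrap88 = [beta1 * g + beta5 + ai for g, ai in zip(grant88, a)]

# Cross-section OLS on 1988 only: contaminated by the fixed effect.
cross_section_estimate = diff_in_means(lscrap88, grant88)

# First differences: a_i drops out. The change in year88 equals 1 for every
# firm, so it plays the role of the constant in the differenced regression.
d_lscrap = [y88 - y87 for y88, y87 in zip(lscrap88, lscrap87)]
d_grant = grant88              # grant88 - 0
fd_estimate = diff_in_means(d_lscrap, d_grant)

print(f"cross-section estimate: {cross_section_estimate:.2f}")
print(f"first-difference estimate: {fd_estimate:.2f} (true effect {beta1:.2f})")
```

The cross-section estimate picks up the difference in average fixed effects between the two groups, while the first-difference estimate recovers the true effect.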
. **************************
. * Declare panel          *
. **************************
. tsset fcode year
       panel variable:  fcode (strongly balanced)
        time variable:  year, 1987 to 1989
                delta:  1 unit

. ******************************
. * Generate first differenced *
. * variables                  *
. ******************************
. gen difflscrap=lscrap-L.lscrap
(363 missing values generated)
. gen diffgrant=grant-L.grant
(157 missing values generated)
. gen difflsales=lsales-L.lsales
(226 missing values generated)
. gen difflemploy=lemploy-L.lemploy
(181 missing values generated)
. gen diffd88=d88-L.d88
(157 missing values generated)

. **********************
. * Run the regression *
. **********************
. reg difflscrap diffgrant difflsales difflemploy diffd88 if year<=1988, nocons

      Source |       SS         df       MS            Number of obs =      47
-------------+--------------------------------         F(  4,    43) =    1.82
       Model |  2.71885438       4  .679713595         Prob > F      =  0.1428
    Residual |  16.0749657      43  .373836411         R-squared     =  0.1447
-------------+--------------------------------         Adj R-squared =  0.0651
       Total |    18.79382      47  .399868511         Root MSE      =  .61142

-------------------------------------------------------------------
  difflscrap |      Coef.      t    P>|t|     [95% Conf. Interval]
-------------+-----------------------------------------------------
   diffgrant |  -.3223172    -1.72   0.093     -.701274    .0566396
  difflsales |  -.1733036    -0.47   0.638    -.9106586    .5640514
 difflemploy |   .0233784     0.05   0.963    -.9978775    1.044634
     diffd88 |  -.0272418    -0.23   0.822    -.2705336    .2160501
-------------------------------------------------------------------

When you use the nocons option, Stata omits the constant term.
Note that, when you use this method in your research, it is a good idea to tell your audience what the potential fixed effect would be and whether it is correlated with the explanatory variables. In this example, unobserved ability is potentially an important source of the fixed effect. Of course, one can never tell exactly what the fixed effect is, since it is the aggregate of all the unobserved effects. However, if you tell what is contained in the fixed effect, your audience can understand the potential direction of the bias and why you need to use the first-differenced method.
General case
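The first-differenced model in a more general situation can be written as follows.
$y_{it} = \beta_0 + \beta_1 x_{it1} + \beta_2 x_{it2} + \dots + \beta_k x_{itk} + a_i + u_{it}$
where $a_i$ is the fixed effect.
If $a_i$ is correlated with any of the explanatory variables, the estimated coefficients will be biased. So take the first difference to eliminate $a_i$, then estimate the following model by OLS.
$\Delta y_{it} = \beta_1 \Delta x_{it1} + \beta_2 \Delta x_{it2} + \dots + \beta_k \Delta x_{itk} + \Delta u_{it}$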
Note that when you take the first difference, the constant term is also eliminated, so you should use the nocons option in Stata when you estimate the model. Time-invariant variables are also eliminated. If the treatment variable does not change over time, you cannot use this method.
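The point about the constant and time-invariant variables can be seen directly by differencing a small panel. The sketch below is only an illustration in Python: the helper first_difference, the firm names and the variable region are all hypothetical. It mimics what x - L.x does in Stata, in that a difference is produced only when the previous year is available for the same unit.

```python
def first_difference(panel, variables):
    """Return x_t - x_{t-1} for each unit and variable.

    panel: dict mapping unit id -> dict mapping year -> dict of values.
    Rows without an observation in the previous year are skipped,
    which corresponds to a missing value in the differenced variable.
    """
    rows = []
    for unit, by_year in panel.items():
        for year in sorted(by_year):
            previous = by_year.get(year - 1)
            if previous is None:
                continue
            current = by_year[year]
            row = {"unit": unit, "year": year}
            for v in variables:
                row[v] = current[v] - previous[v]
            rows.append(row)
    return rows

# Hypothetical two-period panel: const is the regression constant,
# region is a time-invariant characteristic, grant changes for firm_a only.
panel = {
    "firm_a": {1987: {"const": 1, "region": 3, "grant": 0, "d88": 0},
               1988: {"const": 1, "region": 3, "grant": 1, "d88": 1}},
    "firm_b": {1987: {"const": 1, "region": 5, "grant": 0, "d88": 0},
               1988: {"const": 1, "region": 5, "grant": 0, "d88": 1}},
}

diffs = first_difference(panel, ["const", "region", "grant", "d88"])
for row in diffs:
    print(row)
```

In every differenced row const and region are zero, so their coefficients cannot be estimated. The same would happen to grant if no firm ever changed its grant status.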
Exercise
The data set ezunem.dta contains city-level unemployment claim statistics for the state of Indiana. It also contains information about whether each city has an enterprise zone. An enterprise zone is an area that encourages business and investment through reduced taxes and restrictions. Enterprise zones are usually created in economically depressed areas with the purpose of increasing economic activity and reducing unemployment.
Using ezunem.dta, you are asked to estimate the effect of enterprise zones on city-level unemployment claims. Use the log of unemployment claims as the dependent variable.
Ex1. First estimate the following model using OLS.
$\log(\text{unemployment claims})_{it} = \beta_0 + \beta_1 (\text{enterprise zone})_{it} + (\text{year dummies})_{it} + v_{it}$
Discuss whether the coefficient for enterprise zone is biased or not. If you think it is biased, what is the direction of the bias?
Ex2. Estimate the model using the first-difference method. Did it change the result? Was your prediction of the bias correct?
OLS results
. reg luclms ez d81 d82 d83 d84 d85 d86 d87 d88

      Source |       SS         df       MS            Number of obs =     198
-------------+--------------------------------         F(  9,   188) =   11.44
       Model |  35.5700512       9  3.95222791         Prob > F      =  0.0000
    Residual |  64.9262278     188  .345352276         R-squared     =  0.3539
-------------+--------------------------------         Adj R-squared =  0.3230
       Total |  100.496279     197  .510133396         Root MSE      =  .58767

------------------------------------------------------------------------------
      luclms |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          ez |  -.0387084   .1148501    -0.34   0.736    -.2652689     .187852
         d81 |  -.3216319   .1771882    -1.82   0.071    -.6711645    .0279007
         d82 |   .1354957   .1771882     0.76   0.445    -.2140369    .4850283
         d83 |  -.2192554   .1771882    -1.24   0.217     -.568788    .1302772
         d84 |  -.5970717   .1799355    -3.32   0.001    -.9520237   -.2421197
         d85 |  -.6216534   .1847186    -3.37   0.001     -.986041   -.2572658
         d86 |  -.6511313   .1847186    -3.52   0.001    -1.015519   -.2867437
         d87 |  -.9188151   .1847186    -4.97   0.000    -1.283203   -.5544275
         d88 |    -1.2575   .1847186    -6.81   0.000    -1.621887    -.893112
       _cons |   11.69439    .125291    93.34   0.000     11.44724    11.94155
------------------------------------------------------------------------------
First differencing
. reg lagluclms lagez lagd81 lagd82 lagd83 lagd84 lagd85 lagd86 lagd87 lagd88, nocons

      Source |       SS         df       MS            Number of obs =     176
-------------+--------------------------------         F(  9,   167) =   41.31
       Model |  17.3537634       9  1.92819594         Prob > F      =  0.0000
    Residual |  7.79583815     167  .046681666         R-squared     =  0.6900
-------------+--------------------------------         Adj R-squared =  0.6733
       Total |  25.1496016     176  .142895463         Root MSE      =  .21606

------------------------------------------------------------------------------
   lagluclms |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       lagez |  -.1818775   .0781862    -2.33   0.021    -.3362382   -.0275169
      lagd81 |  -.3216319    .046064    -6.98   0.000    -.4125748   -.2306891
      lagd82 |   .1354957   .0651444     2.08   0.039     .0068831    .2641083
      lagd83 |  -.2192554   .0797852    -2.75   0.007    -.3767731   -.0617378
      lagd84 |  -.5580256   .0945636    -5.90   0.000    -.7447196   -.3713315
      lagd85 |  -.5565765    .108961    -5.11   0.000    -.7716951   -.3414579
      lagd86 |  -.5860544   .1182979    -4.95   0.000    -.8196066   -.3525023
      lagd87 |  -.8537383   .1269499    -6.72   0.000    -1.104372   -.6031047
      lagd88 |  -1.192423   .1350488    -8.83   0.000    -1.459046   -.9257998
------------------------------------------------------------------------------
$y_{it} = \beta_0 + \beta_1 x_{it1} + \dots + \beta_k x_{itk} + a_i + u_{it}$
Assumption FD2: We have a random sample from the cross section.
Assumption FD3: There is no perfect collinearity. In addition, each explanatory variable changes over time at least for some i in the sample.
Assumption FD4: Strict exogeneity. $E(u_{it} | X_i, a_i) = 0$ for each i, where $X_i$ is shorthand notation for all the explanatory variables for the ith individual in all time periods. This means that $u_{it}$ is uncorrelated with the current year's explanatory variables as well as with other years' explanatory variables.
Assumption FD5: Homoskedasticity. $Var(u_{it} | X_i) = \sigma^2$
Assumption FD6: No serial correlation within the ith individual. $Cov(u_{it}, u_{is}) = 0$ for $t \neq s$
Note that FD2 assumes random sampling across different individuals, but does not assume randomness within each individual. So you need an additional assumption to rule out serial correlation.