Understanding Panel Data in R
Contents
1 Introduction
2 What is Panel Data
3 Panel Data Types
3.1 Long panel data and short panel data
3.2 Balanced and unbalanced panel data
4 Types of Variations in Panel Data
5 Exploring Panel Data
6 Panel Data Models
6.1 Assumptions
6.2 Pooled OLS Model
6.3 Fixed Effects Model
6.4 Random Effects Model
6.5 First Differences Model
6.6 Between Model
7 Tests to Compare Different Models
8 Final Model Selection
9 Other Packages
References
1 Introduction
As panel data becomes more widely available, it is increasingly used for data analysis, since it captures variability better than cross-sectional data or time series data alone.
In this report, we first explain what panel data is and what types of panel data exist. We then look at the types of variation that occur in panel data and how to calculate them. After that, we cover the assumptions underlying the basic linear regression model, build the models specific to panel data and discuss the limitations of each. We then look at the tests we can use to compare the models, and finally select the best model for our data.
2 What is Panel Data
Panel data (also called longitudinal data) consists of observations on the same entities, such as individuals, firms, states or countries, over multiple time periods. For example, the table below is a panel dataset showing crime rates in 150 US cities.
3 Panel Data Types
3.1 Long panel data and short panel data
In short panel data, the number of entities n is large relative to the number of time periods t. In long panel data, each entity is observed over many time periods, so t is large relative to n.
3.2 Balanced and unbalanced panel data
In balanced panel data, every entity is observed in all t time periods, so the total number of observations equals n*t. In unbalanced panel data, each entity has a different number of observations, which means that the data is not rectangular and the total number of observations is less than n*t.
4 Types of Variations in Panel Data
There are three types of variation: overall variation, between variation and within variation.
The overall variation can be decomposed into between variation and within variation.
For the US statewide productivity data set, the different types of variation can be calculated as follows:
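These quantities can also be computed directly in base R. The sketch below uses a small hypothetical panel (not the productivity data) to show that the decomposition holds exactly:

```r
# Hypothetical mini-panel: 3 states observed over 4 years
df <- data.frame(
  state = rep(c("AL", "AZ", "AR"), each = 4),
  year  = rep(1970:1973, times = 3),
  gsp   = c(10, 12, 11, 13, 20, 22, 21, 23, 15, 14, 16, 17)
)

grand_mean <- mean(df$gsp)
state_mean <- ave(df$gsp, df$state)   # the state mean, repeated per row

# Overall variation: each observation around the grand mean
ss_overall <- sum((df$gsp - grand_mean)^2)
# Between variation: the state means around the grand mean
ss_between <- sum((state_mean - grand_mean)^2)
# Within variation: each observation around its own state mean
ss_within  <- sum((df$gsp - state_mean)^2)

# Overall variation decomposes exactly into between + within
all.equal(ss_overall, ss_between + ss_within)   # TRUE
```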
5 Exploring Panel Data
We use the googleVis package to build a motion chart for our data set. Although the Unemployment Rate could also serve as a regressor, we only want to visualize GSP here, so we map GSP to the Y axis as well as to colour and size, and map Year to the X axis. This chart works well for visualizing our panel data: the code is short, yet the result is effective.
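The motion-chart code is roughly as follows. This is a sketch, assuming the productivity data frame has STATE, YR and GSP columns (the variable names are taken from the modeling code later in the report):

```r
library(googleVis)

# One trace per state, animated over the YR time dimension;
# GSP drives the Y axis, the colour and the bubble size
chart <- gvisMotionChart(productivity,
                         idvar    = "STATE",
                         timevar  = "YR",
                         xvar     = "YR",
                         yvar     = "GSP",
                         colorvar = "GSP",
                         sizevar  = "GSP")
plot(chart)   # opens the interactive chart in a browser
```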
We also tried to use a googleVis geochart to visualize our data, but the basic geochart is static, which is not suitable for panel data. Our solution, therefore, was to use the Shiny package to add a time dimension to the geochart.
For Shiny, we needed to create a Shiny UI and a Shiny server: the former defines the interface we see, and the latter redraws the geochart for each point in time.
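A minimal sketch of such an app, assuming the productivity data frame with its STATE, YR and GSP columns (the year range and chart options here are illustrative):

```r
library(shiny)
library(googleVis)

# UI: a year slider plus the chart output
ui <- fluidPage(
  titlePanel("GSP by state"),
  sliderInput("year", "Year", min = 1970, max = 1986, value = 1970, sep = ""),
  htmlOutput("geochart")
)

# Server: redraw the geochart for whichever year the slider selects
server <- function(input, output) {
  output$geochart <- renderGvis({
    yearly <- productivity[productivity$YR == input$year, ]
    gvisGeoChart(yearly,
                 locationvar = "STATE",
                 colorvar    = "GSP",
                 options = list(region = "US",
                                displayMode = "regions",
                                resolution  = "provinces"))
  })
}

shinyApp(ui, server)
```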
(The Shiny server code, the Shiny UI code and the resulting geochart appeared here as screenshots.)
We use the below command to view the trend in GSP across states.
library(gplots)
plotmeans(Y ~ STATE, main="Heterogeneity across states", data=productivity)
6 Panel Data Models
The models used for panel data use different types of estimators depending on the type of variation in the panel data. We will discuss five types of models: the Pooled OLS Model, Fixed Effects Model, Random Effects Model, First Differences Model and Between Model.
But first we need to set the target and the predictors in our data. GSP (Gross State Product) is the target and all the remaining variables are predictors.
attach(productivity)
Y<-cbind(GSP)
X<-cbind(P_CAP,HWY,WATER,UTIL,PC,EMP,UNEMP)
We then declare our productivity data as panel data, specifying the individual variable i = 1,...,n and the time variable t = 1,...,T as the index.
paneldata<-pdata.frame(productivity,index=c("STATE","YR"))
6.1 Assumptions
Before starting with the models, we need to understand the assumptions underlying the basic linear regression model. Consider the equation
y(it) = b0 + b1*x(it) + e(it)
where y(it) is the outcome for individual i at time t, x(it) are the predictors and e(it) is the random error.
1. The covariates are exogenous. This implies that the random error is not correlated with any of the predictors, but arises from something that cannot be explained by the model. It also means that the mean of the random error is zero. This condition is violated, for example, when variables that affect the outcome are not included in our data (omitted variables), when the independent variables are measured with error, or when there is simultaneity bias.
2. The errors are homoskedastic. This implies that the random error term has the same variance for each observation.
3. The errors are uncorrelated (no autocorrelation). This implies that the random error for one individual is not correlated with the random error for another; the individuals are assumed to be independent units.
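These assumptions are easiest to inspect through the residuals of a fitted model. A base-R sketch on simulated (hypothetical) data:

```r
set.seed(1)
# Simulated data that satisfies the assumptions by construction
x <- runif(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.5)
fit <- lm(y ~ x)

# With an intercept, OLS residuals average to zero mechanically,
# so this checks the fit, not exogeneity itself
mean(resid(fit))

# Homoskedasticity: the residual spread should look flat
# when plotted against the fitted values
plot(fitted(fit), resid(fit), xlab = "Fitted", ylab = "Residuals")
abline(h = 0, lty = 2)
```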
If these assumptions do not hold, the estimators from our model become biased and inconsistent. One example is omitted variable bias: leaving an important variable out of the model causes the effects of the other predictors to be over- or under-estimated. As we will discuss later, the Fixed Effects Model can remove this kind of endogeneity bias when the omitted variable is constant over time.
Decomposing the error term, we write e(it) = c(i) + u(it), where:
u(it) = the idiosyncratic error, which is uncorrelated with the predictor variables;
c(i) = the unobserved heterogeneity, which captures variation due to factors specific to an individual. This can be correlated or uncorrelated with the predictor variables.
Including the unobserved heterogeneity term in the error component, we can write our final model as:
y(it) = b0 + b1*x(it) + c(i) + u(it)
6.2 Pooled OLS Model
The pooled OLS model ignores the panel structure and runs ordinary least squares on all n*t observations, leaving the unobserved heterogeneity c(i) inside the error term.
pooled<-plm(Y~X,data=productivity,model="pooling")
summary(pooled)
In the summary, we can see that a few predictors, such as PC, UNEMP and EMP, are highly significant. The R-squared is high; however, the estimates are not reliable, since we did not take the unobserved heterogeneity into account.
We will see how to remedy this limitation with our next model, the Fixed Effects Model.
6.3 Fixed Effects Model
For each unit, we take the average over all time periods:
ybar(i) = b0 + b1*xbar(i) + c(i) + ubar(i)
We then subtract this within-unit average from each observation:
y(it) - ybar(i) = b1*(x(it) - xbar(i)) + (u(it) - ubar(i))
Here, the unobserved heterogeneity term will cancel out since the value of c(i) will remain constant
for an individual over time. We can then run the regression on our transformed equation to get our
model.
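The within transformation can be checked by hand. On simulated data (hypothetical, not the productivity set) with a known individual effect, demeaning removes it and recovers the true slope:

```r
set.seed(2)
n <- 5; t <- 10
id <- rep(1:n, each = t)
ci <- rep(rnorm(n, sd = 5), each = t)    # unobserved heterogeneity c(i)
x  <- rnorm(n * t)
y  <- 1 + 2 * x + ci + rnorm(n * t, sd = 0.1)

# Demean within each individual: z(it) - zbar(i)
y_w <- y - ave(y, id)
x_w <- x - ave(x, id)

# OLS on the demeaned data: c(i) has cancelled, so the slope
# is estimated consistently despite the large individual effects
coef(lm(y_w ~ x_w - 1))   # close to the true slope of 2
```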
fixed<-plm(Y~X,data=productivity,model="within",effect="individual")
summary(fixed)
We can see that the R-squared is pretty high, around 0.96. Variables like PC and EMP are again highly significant.
The model used here is "within", which means that we take the heterogeneity within an individual into account. The effect is "individual", which means that we are not taking time variation into account here.
To also take fixed effects of the time variable into account, we set the effect to "twoways". This controls for factors which have the same impact on all units at a given time but vary over time.
fixed<-plm(Y~X,data=productivity,model="within",effect="twoways")
summary(fixed)
The R-squared here is similar to the previous case, and the same variables, PC and EMP, are highly significant.
6.4 Random Effects Model
The random effects model includes both between and within variation. It assumes that the entity's error term is not correlated with the predictors, which is what allows time-invariant variables to play a role as explanatory variables. In a random effects model, you need to specify the individual characteristics that may or may not influence the predictor variables; the problem is that some of these variables may not be available, leading to omitted variable bias in the model. On the other hand, random effects allows you to generalize the inferences beyond the sample used in the model (Torres-Reyna, 2007).
Now, with our plm package, we change the model argument to "random". As seen below, the idiosyncratic and individual-specific errors are assumed to be uncorrelated with the variables across time. Theta (lambda) shows how much of the variation comes from the individual component compared to the idiosyncratic one; here most of the variation comes from the individual, which is a good sign. An R-squared of approximately 97% is very good. PC, unemployment and employment are strong predictors of GSP, all three being significant at the 10% level.
random<-plm(Y~X,data=productivity,model="random")
summary(random)
6.5 First Differences Model
Starting from y(it) = b0 + b1*x(it) + c(i) + u(it), we can lag each component by one period:
y(i,t-1) = b0 + b1*x(i,t-1) + c(i) + u(i,t-1)
Subtracting, the change in the unobserved effect c(i) is 0, so we are left with the following model (McManus, 2011):
y(it) - y(i,t-1) = b1*(x(it) - x(i,t-1)) + (u(it) - u(i,t-1))
There are several reasons why first differencing is preferred when dealing with panel data. If there is a variable that is unmeasured and unchanging over time, and it is correlated with the other variables, then simply omitting it makes your model biased. First differencing handles this, since the difference in an unchanging variable between time t and t+1 is 0. It is also appropriate when changes are the natural predictor: for example, when predicting income from experience, the change in experience between t and t+1 is a good indicator of the change in a person's income.
A major problem with this type of model, however, is that time-invariant variables drop out of the model, so we cannot estimate their coefficients. The same problem was observed in the fixed effects model.
In R, modeling first differences is straightforward: all we have to do is set the model argument to "fd" to indicate first differences. As seen below, the model uses all the variables to predict GSP, since none of them is constant across time. It gives an R-squared of 77.5%, and everything except PC is significant at the 10% level, although only employment and unemployment are significant at the 5% level, indicating that these two are the most likely to be useful for predicting GSP.
firstdiff<-plm(Y~X,data=productivity,model="fd")
summary(firstdiff)
6.6 Between Model
As the name suggests, the Between model looks at the variation between individuals, i.e. at the cross-sectional variation. To recap what we have covered so far: the fixed effects and first differences models look at within variation, the random effects model looks at both within and between variation, and here we focus on just the variation that occurs between individuals. The model is estimated on the individual time averages, which highlights how each individual mean differs from the overall mean. Hence the formulation is:
ybar(i) = b0 + b1*xbar(i) + c(i) + ubar(i)
between<-plm(GSP~P_CAP+HWY+WATER+PC+EMP+UNEMP,data=productivity,model="between")
summary(between)
This model gives us the highest R-squared, meaning that if we ignore the variation within the states and focus only on the variation between states, we get an almost perfect fit. This type of model is usually unrealistic, since we are collapsing each individual into a single observation; however, if your dataset has no within variation, this model is ideal.
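As a quick sanity check, the between estimator can be reproduced by hand, since it is just OLS on the individual time averages (hypothetical simulated data):

```r
set.seed(3)
id <- rep(1:6, each = 4)   # 6 individuals, 4 periods each
x  <- rnorm(24)
y  <- 1 + 2 * x + rnorm(24)

# Collapse each individual to its time average
ybar <- tapply(y, id, mean)
xbar <- tapply(x, id, mean)

# OLS on the 6 collapsed observations is the between estimator
coef(lm(ybar ~ xbar))
```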
7 Tests to Compare Different Models
First, to choose between pooled OLS and the random effects model, we run the Lagrange Multiplier test. A very small p-value shows that the test is significant, and hence we can rule out the pooled OLS model in favour of the random effects model.
Next, to test between pooled OLS and the fixed effects model, we run an F test. The syntax for that is pFtest(), and we get the following results:
So again we see more support for the fixed effects model over the pooled OLS model, confirming our previous analysis. Finally, to decide between the fixed effects model and the random effects model we run the Hausman test. As stated previously, the random effects model is usually better at capturing the variability in the data than the fixed effects model, but if the Hausman test rejects it, the fixed effects model is the more appropriate choice. The Hausman statistic can be written as:
H = (b_FE - b_RE)' [Var(b_FE) - Var(b_RE)]^(-1) (b_FE - b_RE)
It takes the difference of the two models' coefficient vectors and weights it by the inverse of the difference of their covariance matrices. In R, the syntax is phtest() for the panel Hausman test:
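Putting the three comparisons together, a sketch assuming the pooled, fixed and random objects fitted earlier in the report:

```r
library(plm)

# Pooled OLS vs random effects: Breusch-Pagan Lagrange Multiplier test
plmtest(pooled, type = "bp")

# Pooled OLS vs fixed effects: F test for individual effects
pFtest(fixed, pooled)

# Fixed vs random effects: Hausman test
phtest(fixed, random)
```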
The p-value here is very small, indicating that the random effects estimates are inconsistent, so we should use the fixed effects model, which remains consistent.
Model     Pooled   Fixed    Random             First Differences   Between
Method    OLS      Within   Within + Between   Lag                 Between
8 Final Model Selection
As our tests indicated above, the final model that should be selected to predict GSP is the fixed effects model. This means there is unobserved heterogeneity present in our data and it is correlated with our explanatory variables. It also shows that a model may have a very high R-squared, but if it does not represent the data properly, it is simply useless. In a nutshell, the model you select will depend on the data and on each variable's variability.
9 Other Packages
Apart from the plm package, which is aimed at linear panel models, different packages are available for other types of data.
1. pglm package: used for estimating panel models for GLM-like models, e.g. binary, count and ordered outcomes.
2. phtt package: used for estimating panel models with heterogeneous time trends, i.e. when the unobserved heterogeneity is not constant over time for each cross-sectional unit. Unlike the classical panel models, this lets us drop the assumption of time-constant unobserved heterogeneity.
References
Park, H. M. (2011). Practical Guides to Panel Data Modeling: A Step by Step Analysis Using Stata.
Croissant, Y. and Millo, G. (2008). Panel Data Econometrics in R: The plm Package. Journal of Statistical Software, 27(2).
McManus, P. A. (2011). Introduction to Regression Models for Panel Data Analysis. Indiana University.
Torres-Reyna, O. (2007). Panel Data Analysis: Fixed and Random Effects using Stata. Princeton University.