
PANEL DATA MODELS IN R
Understanding Panel Data in R
Contents

1 Introduction
2 What is Panel Data
3 Panel Data Types
3.1 Long panel data and short panel data
3.2 Balanced and unbalanced panel data
4 Types of Variations in Panel Data
5 Exploring Panel Data
6 Panel Data Models
6.1 Assumptions
6.2 Pooled OLS Model
6.3 Fixed Effects Model
6.4 Random Effects Model
6.5 First Differences Model
6.6 Between Model
7 Tests to Compare Different Models
8 Final Model Selection
9 Other Packages
References

1 Introduction
As panel data becomes more widely available, it is increasingly used for data
analysis, since it captures variability better than cross-sectional data or time series data alone.

In this report, we first explain what panel data is and what types of panel data there are.
We then look at the types of variation that exist in panel data and how to calculate them.

After that, we discuss the assumptions underlying the basic linear regression model. We
then show how to build models specific to panel data and the limitations associated with
each. Next, we look at the tests we can use to compare the models, and finally we choose
the model that best fits our data.

2 What is Panel Data


Panel data is also called longitudinal data or cross-sectional time-series data. The term refers to multi-
dimensional data observed over several different time periods, where the observations follow the same units
(individuals, families, firms, etc.) over time.

For example, the table below shows a panel dataset of crime rates in 150 US
cities.

3 Panel Data Types


In the explanation that follows, we will denote the number of entities or subjects by n and the number of
time periods by t.

3.1 Long panel data and short panel data


Short panel data has many entities but only a few time periods, while long panel data has many time
periods but only a few entities.

Short panel data

Long panel data

3.2 Balanced and unbalanced panel data


In balanced panel data, all entities have measurements in all time periods, as in the example below;
the total number of observations in a balanced panel is n*t.

In unbalanced panel data, the entities have different numbers of observations, so the total number
of observations is smaller than n*t.
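To make the n*t check concrete, here is a small illustrative sketch in Python (not part of the original report; the entity and period labels are made up) that tests whether a panel is balanced by counting observations per entity:

```python
# Illustrative sketch: count-based check for a balanced panel.
from collections import Counter

# (entity, period) observation keys for a toy panel of 3 entities, 2 periods
balanced = [("A", 1), ("A", 2), ("B", 1), ("B", 2), ("C", 1), ("C", 2)]
unbalanced = balanced[:-1]  # drop ("C", 2), so entity "C" has fewer observations

def is_balanced(obs):
    """Count-based check: total observations == n * t and every entity
    contributes the same number of observations."""
    counts = Counter(entity for entity, _ in obs)
    n, t = len(counts), max(counts.values())
    return len(obs) == n * t and min(counts.values()) == t

print(is_balanced(balanced))    # True
print(is_balanced(unbalanced))  # False
```

A fully rigorous check would also verify that every entity is observed in the same set of periods; equal counts are sufficient for this toy example.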

4 Types of Variations in Panel Data
There are three types of variation: overall variation, between variation and within variation.

Overall variation: variation over time and individuals.

SO^2 = sum((original variable - overall mean)^2) / (N - 1)

Between variation: variation between individuals.

SB^2 = sum((individual mean - overall mean)^2) / (N - 1)

Within variation: variation within individuals (over time).

SW^2 = sum((original variable - individual mean)^2) / (N - 1)

Each sum runs over all N observations (an individual's mean is repeated for each of that
individual's observations). The overall variation then decomposes exactly into between
variation and within variation:

SO^2 = SB^2 + SW^2

The US statewide productivity data set with the calculation of the different types of variation:

(e.g., 1 in ID means Alabama, 2 means Arizona.)
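The decomposition can be verified numerically. The sketch below (illustrative only, with made-up numbers rather than the productivity data) computes the three variations for a toy balanced panel and checks that SO^2 = SB^2 + SW^2:

```python
# Illustrative sketch: decomposing overall variation into between and within
# components for a toy balanced panel. Each sum runs over all N = n*t
# observations and is divided by N - 1, so the identity holds exactly.
import numpy as np

# toy panel: 3 entities (rows) observed over 4 periods (columns)
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 6.0, 7.0, 8.0],
              [2.0, 2.0, 2.0, 2.0]])

N = x.size
overall_mean = x.mean()
entity_means = x.mean(axis=1, keepdims=True)  # one mean per entity

s_overall = ((x - overall_mean) ** 2).sum() / (N - 1)
# each entity mean is counted once per observation, hence the factor t
s_between = ((entity_means - overall_mean) ** 2).sum() * x.shape[1] / (N - 1)
s_within  = ((x - entity_means) ** 2).sum() / (N - 1)

print(np.isclose(s_overall, s_between + s_within))  # True
```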

5 Exploring Panel Data
We use the googleVis package to build a motion chart for our data set. Although the
unemployment rate could also serve as a regressor, we only want to predict GSP in our case,
so we map GSP to the Y axis (as well as to colour and size) and Year to the X axis. This graph
works well for visualizing our panel data; the code is simple, yet effective.

We also tried the googleVis geochart to visualize our data, but the basic geochart is static,
which is not suitable for panel data. Our solution, therefore, was to use the Shiny package
to add a time dimension to the geochart.

For Shiny, we needed to create a Shiny UI and a Shiny server: the former defines the interface
we see, and the latter renders the geochart for each time period.
Shiny Server Code:

Shiny UI Code:

The result

We use the command below to view the trend in GSP across states.
library(gplots)
# plot the mean of the target for each state with 95% confidence intervals
plotmeans(Y ~ STATE, main="Heterogeneity across states", data=productivity)

6 Panel Data Models
The models used for panel data use different estimators depending on the type of
variation in the panel data. We will discuss five models: the Pooled OLS Model, Fixed Effects
Model, Random Effects Model, First Differences Model and Between Model.

We will be using the plm package for building models.


install.packages("plm")
library("plm")

First we need to set the target and predictors in our data. GSP (Gross State Product) is the
target, and all remaining variables are predictors.
Y<-cbind(GSP)
X<-cbind(P_CAP,HWY,WATER,UTIL,PC,EMP,UNEMP)

We then declare our productivity data as panel data, specifying the individual variable,
i = 1..n, and the time variable, t = 1..T, as the index.
paneldata<-plm.data(productivity,index=c("STATE","YR"))

6.1 Assumptions
Before starting with the models, we need to understand the assumptions underlying the basic linear
regression model. Consider the equation below:

y(i) = b0 + b1*x1(i) + ... + bk*xk(i) + u(i)
In this we assume that:

1. The covariates are exogenous. This implies that the random error is not correlated with any of
the predictors, but arises from something the model cannot explain. It also means that the
mean of the random error is zero. For example, this condition is violated when variables that
affect the outcome are not included in our data (omitted variables), when there is measurement
error in an independent variable, or when there is simultaneity bias.

2. The errors are homoskedastic. This implies that the random error term has the same
variance for each observation.

3. The errors are uncorrelated (no autocorrelation). This implies that the random error for
one individual is not correlated with the random error for another; the individuals are
assumed to be independent units.

If these assumptions do not hold, the estimators from our model become biased and
inconsistent. One example is omitted variable bias, in which leaving out an important
variable causes us to over- or under-estimate the effects of the other predictors. We can
remove endogeneity bias of this kind by using the Fixed Effects Model, as we will discuss later.

6.2 Pooled OLS Model


This model is based on ordinary least squares with all the above assumptions. We treat
all observations as independent, i.e., not necessarily referring to the same individual unit over
time, and regard the data as a single time series of cross sections. An immediate consequence
is that we ignore the heterogeneity among individual units.
Consider the equation below:

y(it) = b0 + b*x(it) + c(i) + u(it)

Here,
u(it) = the idiosyncratic error, which is uncorrelated with the predictor variables
c(i) = the unobserved heterogeneity, which captures variation due to factors specific to an
individual. This can be correlated or uncorrelated with the predictor variables.
Folding the unobserved heterogeneity into the error component, v(it) = c(i) + u(it), we can
write our final model as:

y(it) = b0 + b*x(it) + v(it)
We now run this model and take a look at the summary:

pooled<-plm(Y~X,data=productivity,model="pooling")
summary(pooled)

In the summary, we can see that a few predictors, such as PC, UNEMP and EMP, are highly significant.
The R-squared is high; however, the estimates are not reliable, since we did not take the
unobserved heterogeneity into account.

We will see how to remedy this limitation by using our next model - Fixed Effects Model.

6.3 Fixed Effects Model


In this model, we assume that unobserved heterogeneity is present in our data and that it is
correlated with our explanatory (predictor) variables.

Consider the equation below:

y(it) = b*x(it) + c(i) + u(it)

For each unit, we take the average over all time periods:

ymean(i) = b*xmean(i) + c(i) + umean(i)

We then subtract this within-unit average from each observation, which gives the fixed
effects transformation:

y(it) - ymean(i) = b*(x(it) - xmean(i)) + (u(it) - umean(i))

Here the unobserved heterogeneity term cancels out, since c(i) is constant for an individual
over time. We can then run the regression on the transformed equation to get our model.

fixed<-plm(Y~X,data=productivity,model="within",effect="individual")
summary(fixed)
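The mechanics of the within transformation can be sketched numerically. The Python snippet below (an illustrative sketch, not the report's R workflow; the data are randomly generated) shows that demeaning each entity's series removes a time-constant effect c(i) exactly:

```python
# Illustrative sketch: demeaning cancels the entity-specific fixed effect.
import numpy as np

rng = np.random.default_rng(0)
n, t = 4, 5
c = rng.normal(size=(n, 1))   # entity-specific fixed effect c(i)
e = rng.normal(size=(n, t))   # idiosyncratic error u(it)
y = c + e                     # y(it) = c(i) + u(it), no regressors for clarity

y_demeaned = y - y.mean(axis=1, keepdims=True)
e_demeaned = e - e.mean(axis=1, keepdims=True)

# after demeaning, c(i) has vanished: the two demeaned series coincide
print(np.allclose(y_demeaned, e_demeaned))  # True
```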

We can see that the R-squared is quite high, around 0.96, and that variables such as PC and
EMP are highly significant.

Here the model argument is "within", which means we take the heterogeneity within an
individual into account, and the effect argument is "individual", which means we do not take
time variation into account.

To include fixed effects for the time variable as well, we set the effect argument to "twoways".
This also controls for factors that affect all units equally but vary over time.

fixed<-plm(Y~X,data=productivity,model="within",effect="twoways")

summary(fixed)

The R-squared here is similar to the previous case, and the same variables, PC and EMP, remain highly significant.

6.4 Random Effects Model


The next model is the random effects model. It is similar to the fixed effects model in that it
assumes there are differences between the individuals, so there is within variation; but it
also models the variation between the individuals. There is, however, another major
difference between the two models. In the fixed effects model we assumed a correlation
between the unobserved heterogeneity and the regressors; here we assume the individual
effect is random, so there is no correlation between it and the regressors. The random
effects model is (Torres-Reyna, 2007):

y(it) = a + b*x(it) + c(i) + u(it), with c(i) uncorrelated with x(it)

As seen from the above equation, the model includes both between and within variation. This
type of model is appropriate if you believe differences across entities influence your target
variable, because it gives you the opportunity to include time-invariant variables: since the
entity's error term is assumed uncorrelated with the predictors, time-invariant variables can
play a role as explanatory variables. In a random effects model, you need to include those
individual characteristics that may influence the predictor variables; the problem is that
some of these variables may not be available, leading to omitted variable bias. On the other
hand, random effects allows us to generalize the inferences beyond the sample used in the
model (Torres-Reyna, 2007).

With the plm package, we now change the model argument to "random". As you can see
below, the idiosyncratic and individual-specific errors are assumed to be uncorrelated with
the regressors across time. Theta (lambda) shows how much of the variation comes from the
individual effect compared with the idiosyncratic error; here most of the variation comes
from the individual effect, which is a good sign. An R-squared of approximately 97% is
extremely good. PC, UNEMP and EMP are strong predictors of GSP, with all three significant
at the 10% level.

random<-plm(Y~X,data=productivity,model="random")
summary(random)
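The role of theta can be sketched with the standard random-effects quasi-demeaning weight, theta = 1 - sqrt(sigma_e^2 / (sigma_e^2 + T*sigma_c^2)). The snippet below is an illustrative sketch; the variance values are made up, not estimated from the productivity data:

```python
# Illustrative sketch: the random-effects quasi-demeaning weight theta.
import math

def theta(sigma2_e, sigma2_c, T):
    """theta = 1 - sqrt(sigma_e^2 / (sigma_e^2 + T * sigma_c^2)).
    theta -> 0 recovers pooled OLS; theta -> 1 recovers fixed effects."""
    return 1 - math.sqrt(sigma2_e / (sigma2_e + T * sigma2_c))

# no individual variance at all: theta is 0 (pooled OLS)
print(theta(sigma2_e=1.0, sigma2_c=0.0, T=17))   # 0.0
# individual variance dominating the idiosyncratic one pushes theta toward 1
print(round(theta(sigma2_e=1.0, sigma2_c=10.0, T=17), 3))
```

A theta close to 1, as reported for our model, means most of the variation is attributed to the individual effect.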

6.5 First Differences Model


In this model, we take the first difference of each variable within an individual, so the change
in the dependent variable is regressed on the changes in the other variables. In our
productivity dataset, for example, the change in GSP between time t and t+1 is explained by
the changes in the regressors over the same period. By differencing, we remove the
time-invariant characteristics and see how the changes impact the outcome. So if we have
the following model:

y(it) = b*x(it) + c(i) + u(it)

We can lag each component by one period to get:

y(i,t-1) = b*x(i,t-1) + c(i) + u(i,t-1)

Now, since the change in the unobserved effect c(i) is 0, subtracting leaves the following
model (McManus, 2011):

delta_y(it) = b*delta_x(it) + delta_u(it)
There are several reasons why first differencing is attractive for panel data. If a variable is
unobserved and unchanging over time, it cannot be measured directly, and if it is correlated
with the other regressors, simply omitting it biases the model. First differencing handles
this, since the difference of an unchanging variable between time t and t+1 is 0. Another
case where this is appropriate is when, for example, you are predicting income from
experience: the change in experience between t and t+1 is a good indicator of the change in
a person's income.
A major drawback of this type of model, however, is that time-invariant variables drop out,
so their coefficients cannot be estimated. The same problem occurs in the fixed effects model.
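The cancellation of a time-invariant effect under differencing can be sketched numerically (illustrative only; randomly generated data, not the productivity set):

```python
# Illustrative sketch: first differencing removes a time-constant unobserved
# effect c(i), just as demeaning does.
import numpy as np

rng = np.random.default_rng(1)
n, t = 3, 6
c = rng.normal(size=(n, 1))   # time-invariant unobserved effect c(i)
e = rng.normal(size=(n, t))   # idiosyncratic error u(it)
y = c + e                     # y(it) = c(i) + u(it)

dy = np.diff(y, axis=1)       # y(it) - y(i,t-1): c(i) cancels
de = np.diff(e, axis=1)

print(np.allclose(dy, de))    # True: the differenced series no longer contains c(i)
```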

In R, modeling first differences is straightforward: we simply set the model argument to "fd".
As seen below, the model uses all the variables to predict GSP, since none of them is constant
across time. It gives an R-squared of 77.5%, and everything except PC is significant at the
10% level, though only EMP and UNEMP are significant at the 5% level, indicating that these
two are the most reliable predictors of GSP.

firstdiff<-plm(Y~X,data=productivity,model="fd")
summary(firstdiff)

6.6 Between Model
As the name suggests, the Between Model looks at the variation between individuals, i.e., at
cross-sectional variation. To recap what we have covered so far: the fixed effects and first
differences models use within variation, the random effects model uses both within and
between variation, and this model focuses only on the variation between individuals, which
reflects the differences between each individual's mean and the overall mean. The model
regresses the entity means of the dependent variable on the entity means of the regressors:

ymean(i) = a + b*xmean(i) + c(i) + umean(i)
In R, we can show this by changing our model argument to "between".

between<-plm(GSP~P_CAP+HWY+WATER+PC+EMP+UNEMP,data=productivity,model="between")
summary(between)

This model gives us the highest R-squared, meaning that if we ignore the variation within
states and focus only on the variation between states, we get an almost perfect fit. This
model is unrealistic here, since we cannot simply collapse the individuals into single
observations; however, if your dataset has no within variation, this model is ideal.

7 Tests to Compare Different Models


The type of model you choose will depend heavily on your dataset, and several tests can
help with the decision. To check whether the random effects model captures the data
better than a simple pooled OLS model, we run the Breusch-Pagan Lagrange Multiplier test.
It tests whether the variance of the individual effect is significantly different from zero; if it
is, we use the random effects model. In R, the syntax for this test is plmtest(). Running it
gives us the following results:

A very small p-value shows that the test is significant, so we reject the pooled OLS model in
favour of the random effects model.
Next, to choose between pooled OLS and a fixed effects model, we run an F-test. The syntax
for that is pFtest(), and we get the following results:

Again we see more support for the fixed effects model over the pooled OLS model,
confirming our previous analysis. Finally, to decide between a fixed effects model and a
random effects model, we run the Hausman test. As noted earlier, the random effects model
is often preferred since it captures both within and between variation; but if the Hausman
test rejects it, the more appropriate model is the fixed effects model. The Hausman test can
be represented as:

H = (b_FE - b_RE)' [Var(b_FE) - Var(b_RE)]^(-1) (b_FE - b_RE)

This takes the difference of the two models' coefficient vectors and weights it by the inverse
of the difference of their covariance matrices. In R, the syntax is phtest() for the panel
Hausman test:

The p-value here is very small, indicating that the random effects estimates are inconsistent,
so we should use the fixed effects model, which remains consistent.
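The statistic itself is easy to compute once the two coefficient vectors and their covariance matrices are available. The sketch below is illustrative; all numbers are made up, not the estimates from our models:

```python
# Illustrative sketch: Hausman statistic for a two-coefficient example,
#   H = (b_FE - b_RE)' [Var(b_FE) - Var(b_RE)]^(-1) (b_FE - b_RE)
import numpy as np

b_fe = np.array([0.30, -0.12])             # fixed-effects estimates (made up)
b_re = np.array([0.28, -0.10])             # random-effects estimates (made up)
v_fe = np.array([[0.010, 0.001],
                 [0.001, 0.008]])          # covariance of b_fe
v_re = np.array([[0.006, 0.0005],
                 [0.0005, 0.005]])         # covariance of b_re (smaller: RE is efficient under H0)

diff = b_fe - b_re
H = diff @ np.linalg.inv(v_fe - v_re) @ diff
print(round(H, 3))  # about 0.27 for these made-up numbers
```

Under the null hypothesis, H follows a chi-squared distribution with as many degrees of freedom as there are coefficients.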

8 Final Model Selection


The following table highlights the R-squared obtained from the different models described above:

Model       | Pooled  | Fixed   | Random           | First Differences | Between
R-squared   | 0.99259 | 0.96254 | 0.96897          | 0.77533           | 0.99472
Method      | OLS     | Within  | Within + Between | Lag               | Between

As our tests indicated above, the final model selected to predict GSP is the fixed effects
model: there is unobserved heterogeneity present in our data, and it is correlated with our
explanatory variables. This goes to show that a model may have a very high R-squared, but
if it does not represent your data properly, it is simply useless. In a nutshell, the model you
select will depend on your data and on each variable's variability.

9 Other Packages
Apart from the plm package, which is specifically aimed at linear panel models, different
packages are available for other types of data.
1. gplm package: used for estimating panel models for GLM-like models, e.g., binary, count
and ordered outcomes.
2. phtt package: used for estimating panel models with heterogeneous time trends. When
the unobserved heterogeneity is not constant over time for each cross-sectional unit, we
can use these models; this relaxes the assumption of time-constant unobserved
heterogeneity made by the classical panel models.

References
Park, H. M. (2011). Practical Guides to Panel Data Modeling: A Step-by-Step Analysis Using Stata.

Croissant, Y. (2008). Panel Data Econometrics in R: The plm Package. Journal of Statistical Software.

Katchova, A. (2013). Panel Data Models. Princeton.

McManus, P. A. (2011). Introduction to Regression Models for Panel Data Analysis. Indiana University.

Torres-Reyna, O. (2007). Panel Data Analysis: Fixed and Random Effects Using Stata. Princeton University.

Wu, J. Introduction of Statistics and Econometrics.

Wikipedia. Panel Data.
