
2002-Present. Jeeshim and KUCC625 (2005-03-26)

Using Dummy Variables in Regression


Park, Hun Myoung
Indiana University at Bloomington

This document explains how to use dummy variables in linear regression models (as opposed to nonlinear models such as logit and probit). The primary focus is on fixed group effect models rather than random effect models.

1. Introduction
A dummy variable is a binary variable that takes the value 1 or 0. It is commonly used to examine group and time effects in regression. Panel data analysis estimates fixed effect and/or random effect models using dummy variables. The fixed effect model examines differences in intercepts among groups, assuming the same slopes. By contrast, the random effect model estimates error variances of groups, assuming the same intercept and slopes. An example of the random effect model is the groupwise heteroscedasticity model, which assumes each group has a different variance (Greene 2000: 511-513). The data used here cover the top 50 information technology firms and come from page 308 of the OECD Information Technology Outlook 2004 (http://thesius.sourceoecd.org/). The data set contains revenue, R&D budget, and net income in current USD millions.

2. Regression without a Dummy


Consider a model that regresses R&D budget in 2002 on net income in 2000 and firm type. The dummy variable d is set to 1 for equipment and software companies and 0 for other firms. Let us take a look at the data structure.

Table 1. Dummy Variable Coding
+-----------------------------------------------------------------------+
| firm       country   rd2002  net2000  type           d                |
|-----------------------------------------------------------------------|
| Samsung    Korea      2,500    4,768  Electronics    0                |
| AT&T       USA          254    4,669  Telecom        0                |
| IBM        USA        4,750    8,093  IT Equipment   1                |
| Siemens    Germany    5,490    6,528  Electronics    0                |
| Verizon    USA            .   11,797  Telecom        0                |
| Microsoft  USA        4,307    9,421  Service & S/W  1                |
| EDS        USA            0    1,143  Service & S/W  1                |
+-----------------------------------------------------------------------+

Let us first think about a linear regression model, estimated by ordinary least squares (OLS), without the dummy variable. Note that β0 is the intercept, β1 is the slope of net income in 2000, and ε_i is the error term of the regression equation.

Model 1: research_i = β0 + β1 income_i + ε_i
http://mypage.iu.edu/~kucc625


The estimated model has intercept 1,482.697 and slope .223. For a $1 million increase in net income, a firm is likely to increase its R&D budget in 2002 by $.223 million, holding all other things constant.

Table 2. Regression without Dummy Variables (Model 1)
      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  1,    37) =    7.07
       Model |  15902406.5     1  15902406.5           Prob > F      =  0.0115
    Residual |  83261299.1    37  2250305.38           R-squared     =  0.1604
-------------+------------------------------           Adj R-squared =  0.1377
       Total |  99163705.6    38   2609571.2           Root MSE      =  1500.1

------------------------------------------------------------------------------
      rd2002 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     net2000 |   .2230523   .0839066     2.66   0.012     .0530414    .3930632
       _cons |   1482.697   314.7957     4.71   0.000     844.8599    2120.533
------------------------------------------------------------------------------
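As a quick numerical illustration of Model 1, the sketch below fits an OLS line with NumPy. The income and research figures are hypothetical (constructed so the true line is exact), not the OECD data used in the tables.

```python
import numpy as np

# Hypothetical firm data in USD millions; the true line is research = 1500 + 0.2*income.
income = np.array([1000.0, 2000.0, 4000.0, 8000.0])
research = 1500.0 + 0.2 * income

# Design matrix with a constant column: research_i = b0 + b1*income_i + e_i
X = np.column_stack([np.ones_like(income), income])
b0, b1 = np.linalg.lstsq(X, research, rcond=None)[0]
print(round(b0, 3), round(b1, 3))  # intercept and slope
```

Since the made-up data lie exactly on a line, OLS recovers the intercept 1500 and slope 0.2.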

3. Regression with a Dummy: Binary Categories


Despite moderate goodness-of-fit statistics such as F and t, Model 1 must be a naïve model. R&D investment tends to vary across industries. So let us take such a difference into account in the model, assuming that equipment and software firms invest more in R&D than do telecommunications and electronics companies. There may or may not be correlation (dependence) between the dummy variable (firm type) and the regressor (net income).

3.1 Model and Estimation

Now, the new model with a dummy variable becomes,

Model 2: research_i = β0 + β1 income_i + δd_i + ε_i,

where δ is the coefficient of the dummy variable, which affects equipment and software companies only. Thus, this model implies two slightly different regression equations for the two groups.

research_i = β0 + β1 income_i + δ*1 + ε_i for equipment and software firms
research_i = β0 + β1 income_i + δ*0 + ε_i for telecom and electronics firms

The regression indicates a positive impact of two-year-lagged net income on firms' R&D budgets. Equipment and software firms on average invest $1,007 million more in R&D than do telecommunications and electronics companies. There is only a tiny difference in the slope (.223 versus .218) between the two models with/without the dummy, supporting the assumption that net income has the same impact on R&D investment for all firms. The regression equations of the two groups are,

Equipment and Software:  Research = 2140.205 + .218*income
Telecom and Electronics: Research = 1133.579 + .218*income

Table 3. Regression with a Dummy Variable (Model 2)
      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  2,    36) =    6.06
       Model |  24987948.9     2  12493974.4           Prob > F      =  0.0054
    Residual |  74175756.7    36  2060437.69           R-squared     =  0.2520
-------------+------------------------------           Adj R-squared =  0.2104
       Total |  99163705.6    38   2609571.2           Root MSE      =  1435.4

------------------------------------------------------------------------------
      rd2002 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     net2000 |   .2180066   .0803248     2.71   0.010     .0551004    .3809128
           d |   1006.626   479.3717     2.10   0.043     34.41498    1978.837
       _cons |   1133.579   344.0583     3.29   0.002     435.7962    1831.361
------------------------------------------------------------------------------
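Model 2 can be sketched the same way by appending a 0/1 column to the design matrix. The data below are again hypothetical, constructed so the true fixed group effect is exactly $1,000 million.

```python
import numpy as np

# Hypothetical data: two firm groups sharing one slope, differing only in intercept.
income = np.array([1000.0, 2000.0, 3000.0, 1000.0, 2000.0, 3000.0])
d      = np.array([   1.0,    1.0,    1.0,    0.0,    0.0,    0.0])  # 1 = equipment/software
research = 1100.0 + 0.2 * income + 1000.0 * d   # exact group effect of 1,000

# Model 2 design matrix: research_i = b0 + b1*income_i + delta*d_i + e_i
X = np.column_stack([np.ones_like(income), income, d])
b0, b1, delta = np.linalg.lstsq(X, research, rcond=None)[0]
# Two parallel lines: intercept b0 + delta for d = 1, intercept b0 for d = 0.
print(round(b0, 1), round(b1, 3), round(delta, 1))
```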

3.2 Comparison between Model 1 and Model 2

Let us draw a plot to highlight the difference between Models 1 and 2 more clearly. Look at the middle red line first. It is the regression line of Model 1 without the dummy variable. The top green line is the regression line for equipment and software companies, while the bottom yellow line is the one for telecommunications and electronics firms in Model 2. Of course, the green and yellow lines are parallel, with a difference of 1,006.626, the coefficient of the dummy variable in Table 3.

Figure 1. Comparison between Model 1 and Model 2 (Fixed Group Effect)

The intercept of equipment and software firms is computed as 2140.205 = 1006.626 + 1133.579.


This plot shows that Model 1 cancels out the group difference and thus reports a misleading intercept. The difference between the two groups of firms looks substantial. The t-test for the dummy parameter rejects the null hypothesis of no difference in intercepts at the .05 level (p=.043). Consequently, we conclude that Model 2, which considers fixed group effects, is better than the simple Model 1. You may also compare goodness-of-fit statistics (e.g., F, t, R-squared, and SSE) of the two models.

3.3 Common Misunderstandings

Some people, especially those who do not know exactly how dummies work, may ask, "What if we code the dummy variable reversely?" The simplest answer is, "It gives equivalent results." Let us set d0 to 1 if d is 0 (telecommunications and electronics firms) and 0 if d is 1 (equipment and software). Then replace d with d0 in Model 2. The model becomes,

Model 2-1: research_i = β0' + β1' income_i + δ'd0_i + ε_i

Model 2-1 is equivalent to Model 2 in that both produce identical regression equations; the ANOVA tables of the two models are identical. The slope of the regressor remains unchanged: β1' = β1. The sign of the dummy parameter is switched: δ' = -δ. The intercept of Model 2-1 is the actual intercept of equipment and software companies, whose dummy variable is excluded in Model 2-1: β0' = β0 + δ. That is, one implies the other. This is because the two models use different baseline categories, or reference points. Model 2 uses telecommunications and electronics firms as the baseline, while Model 2-1 switches to equipment and software companies. They see the same thing from different points of view.

Table 4. Regression with a Reversely Coded Dummy (Model 2-1)
      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  2,    36) =    6.06
       Model |  24987948.9     2  12493974.4           Prob > F      =  0.0054
    Residual |  74175756.7    36  2060437.69           R-squared     =  0.2520
-------------+------------------------------           Adj R-squared =  0.2104
       Total |  99163705.6    38   2609571.2           Root MSE      =  1435.4

------------------------------------------------------------------------------
      rd2002 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     net2000 |   .2180066   .0803248     2.71   0.010     .0551004    .3809128
          d0 |  -1006.626   479.3717    -2.10   0.043    -1978.837   -34.41498
       _cons |   2140.205   434.4846     4.93   0.000     1259.029     3021.38
------------------------------------------------------------------------------
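The equivalence of the two codings is easy to verify numerically. The following sketch, on hypothetical data, fits Model 2 with d and then with the reverse coding d0 = 1 - d, and checks the three relationships: same slope, sign-flipped dummy coefficient, and the intercept shifting to the other group's actual intercept.

```python
import numpy as np

# Hypothetical data with a true group effect of 1,000 plus noise.
income = np.array([1000.0, 2000.0, 3000.0, 1500.0, 2500.0, 3500.0])
d = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
rng = np.random.default_rng(0)
research = 1100.0 + 0.2 * income + 1000.0 * d + rng.normal(0, 50, 6)

def fit(dummy):
    """OLS of research on a constant, income, and the given dummy."""
    X = np.column_stack([np.ones(6), income, dummy])
    return np.linalg.lstsq(X, research, rcond=None)[0]

b0, b1, delta = fit(d)          # Model 2
c0, c1, gamma = fit(1.0 - d)    # Model 2-1 (reverse coding)

assert np.isclose(b1, c1)         # same slope
assert np.isclose(gamma, -delta)  # dummy coefficient flips sign
assert np.isclose(c0, b0 + delta) # intercept becomes the other group's actual intercept
```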

Some may also ask, "Then, why don't we run regressions on a group-by-group basis?" Yes, we may get similar regression equations by running one regression only on equipment and software firms and another regression on telecommunications and electronics companies.

If the coefficient of the dummy variable d turns out statistically insignificant, we can conclude that there is no group effect, or that all firms have the same intercept, in favor of Model 1.


Model 1-1: research_i = α0 + α1 income_i + ε_i for equipment and software firms
Model 1-2: research_j = γ0 + γ1 income_j + ε_j for telecom and electronics firms

What is the difference between this group-by-group regression (Models 1-1 and 1-2) and Model 2 with a dummy? The former assumes that the two groups are different species, like monkey versus lemon. The parameters α and γ are not comparable in a strict statistical sense. Thus, we may not be able to examine the group differences by comparing (eyeballing) the goodness of fit of two separate regressions (Models 1-1 and 1-2). Another difference lies in the efficiency of the slope estimate, which is improved by pooling data; thus, Model 2 produces more efficient estimates than Models 1-1 and 1-2. What if you present Model 1 (pooled regression), Model 1-1, and Model 1-2 at the same time? What if you report Model 1 as well as Model 2? These attempts will end up in logical fallacy, because these models have contradictory assumptions. If Model 2 is true, for example, Model 1 must be false. Model 1-1 is not comparable to Models 1 and 1-2.

4. Meanings of Dummy Variable Coefficients


In order to directly obtain the regression equations for Model 2, you may run the regression with two dummy variables: d for equipment and software firms and d0 for telecommunications and electronics firms (see Table 4). Let us call this model Model 2-2, since it is equivalent to Model 2. Note that the intercept β0 is suppressed to avoid perfect multicollinearity.

Model 2-2: research_i = β1 income_i + δ1 d_i + δ0 d0_i + ε_i

Table 5. Regression with Two Dummies without the Intercept (Model 2-2)
      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  3,    36) =   29.88
       Model |   184685604     3  61561868.1           Prob > F      =  0.0000
    Residual |  74175756.7    36  2060437.69           R-squared     =  0.7135
-------------+------------------------------           Adj R-squared =  0.6896
       Total |   258861361    39  6637470.79           Root MSE      =  1435.4

------------------------------------------------------------------------------
      rd2002 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     net2000 |   .2180066   .0803248     2.71   0.010     .0551004    .3809128
           d |   2140.205   434.4846     4.93   0.000     1259.029     3021.38
          d0 |   1133.579   344.0583     3.29   0.002     435.7962    1831.361
------------------------------------------------------------------------------

You may observe several differences in statistics between Table 3 (Model 2) and Table 5 (Model 2-2). In particular, the coefficients and t statistics of the dummy variables are different, although the two models are equivalent. How do we explain these differences?

The R2 and adjusted R2 are not well defined (incorrect) in Model 2-2, which suppresses the intercept.


The coefficients of the dummy variables in Model 2 and Model 2-2 have different meanings. In Model 2-2, the coefficient estimates of the dummies, δ1 and δ0, are the actual intercepts of the two groups (2,140.205 and 1,133.579). Accordingly, the null hypothesis of each t-test is that the parameter δ1 or δ0 is zero. By contrast, the coefficient of d in Model 2 estimates the difference of δ1 from δ0, where δ0 is the intercept of the baseline category, telecom and electronics firms. Accordingly, the null hypothesis is that the difference, not the actual intercept, is zero: δ = δ1 - δ0 = 0.

Consider the following two plots of regression lines. The left plot depicts a situation where both δ0 and δ1 are close to zero in Model 2-2 and their difference δ = δ1 - δ0 is not substantial in Model 2. The t-tests in both models may not be rejected; there is no group effect. Thus Model 1, a pooled model, may be better than Model 2. In the right plot, δ1 may turn out statistically different (far away) from zero (its t-test rejected), while δ0 is close to zero (not rejected). Accordingly, the difference δ = δ1 - δ0 is also substantial in Model 2 (rejected). This indicates that there is some fixed effect between the two groups, so Model 2 is superior to Model 1.

Figure 2. Meanings of Dummy Variable Coefficients
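The relationship between Model 2 (Table 3) and Model 2-2 (Table 5) can be checked numerically. Below is a minimal sketch on hypothetical data (not the OECD figures), confirming that the dummy coefficients of the no-intercept model are the actual group intercepts.

```python
import numpy as np

# Hypothetical data; d and d0 = 1 - d are complementary group indicators.
income = np.array([1200.0, 2400.0, 3100.0, 900.0, 2100.0, 3300.0])
d = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
d0 = 1.0 - d
rng = np.random.default_rng(1)
research = 1100.0 + 0.2 * income + 1000.0 * d + rng.normal(0, 50, 6)

# Model 2: constant + income + d.  Model 2-2: income + both dummies, no constant.
X2  = np.column_stack([np.ones(6), income, d])
X22 = np.column_stack([income, d, d0])
b0, b1, delta = np.linalg.lstsq(X2, research, rcond=None)[0]
g1, a1, a0    = np.linalg.lstsq(X22, research, rcond=None)[0]

assert np.isclose(g1, b1)          # identical slope
assert np.isclose(a1, b0 + delta)  # actual intercept of the d = 1 group
assert np.isclose(a0, b0)          # actual intercept of the baseline group
```

Both design matrices span the same column space (d + d0 equals the constant column), which is why the fits are identical and only the parameterization differs.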

Let us run the three regression models mentioned so far using SAS and STATA. In SAS, use the REG procedure as follows. Note that /* */ is used for comments.
PROC REG;
   MODEL rd2002 = net2000;              /* Model 1   */
   MODEL rd2002 = net2000 d;            /* Model 2   */
   MODEL rd2002 = net2000 d d0 /NOINT;  /* Model 2-2 */
RUN;

In STATA, run the .regress command as follows. Note that // is used for comments.
. regress rd2002 net2000 // Model 1


. regress rd2002 net2000 d                 // Model 2
. regress rd2002 net2000 d d0, noconstant  // Model 2-2

5. Regression with Dummies: Multiple Categories


Now, imagine a situation where more than two groups need to be considered in a model. We may classify the firms into three types: telecommunications, electronics, and equipment/software, assuming they have different intercepts. Similarly, researchers may examine seasonal impacts (spring, summer, fall, and winter; 1st through 4th quarter) by deseasonalizing data (Greene 2000: 319).

5.1 Model and Data Structure

Here is a regression model with multiple dummy variables.

research_i = β0 + β1 income_i + δ1 d1_i + δ2 d2_i + δ3 d3_i + ε_i

How do we make three dummy variables for the three firm types? The d1 is set to 1 for telecommunications firms and 0 for others; d2 is set to 1 for electronics firms and 0 for others. Similarly, d3 is set to 1 for equipment and software firms and 0 otherwise; so d3 is identical to d of Model 2 in section 3. Look at the data structure of multiple dummies.

Table 6. Data Structure for the Multiple Dummy Model
+----------------------------------------------------------------------+
| firm       rd2002  net2000  type           d1  d2  d3                |
|----------------------------------------------------------------------|
| Samsung     2,500    4,768  Electronics     0   1   0                |
| AT&T          254    4,669  Telecom         1   0   0                |
| IBM         4,750    8,093  IT Equipment    0   0   1                |
| Siemens     5,490    6,528  Electronics     0   1   0                |
| Verizon         .   11,797  Telecom         1   0   0                |
| Microsoft   3,772    9,421  Service & S/W   0   0   1                |
| EDS            24    1,143  Service & S/W   0   0   1                |
+----------------------------------------------------------------------+
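Coding such mutually exclusive dummies by hand is straightforward. The sketch below builds d1 through d3 from the firm type, using the type labels from Table 6; the loop itself is illustrative, not from the text.

```python
# Firm types as in Table 6; "IT Equipment" and "Service & S/W" form the
# equipment/software group (d3).
types = ["Electronics", "Telecom", "IT Equipment", "Electronics", "Service & S/W"]

rows = []
for t in types:
    d1 = 1 if t == "Telecom" else 0
    d2 = 1 if t == "Electronics" else 0
    d3 = 1 if t in ("IT Equipment", "Service & S/W") else 0
    rows.append((d1, d2, d3))
    assert d1 + d2 + d3 == 1  # each firm falls in exactly one category

print(rows[:2])  # [(0, 1, 0), (1, 0, 0)]
```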

5.2 Three Approaches to Running LSDV Regression

Now we are ready for regression analysis, called least squares dummy variable (LSDV) regression. However, here is the problem: when including all three dummy variables and an intercept, we are caught in the so-called dummy variable trap. This problem is perfect multicollinearity; the regression equation is not solvable, since the X matrix is not of full rank. There are three approaches to running regression analyses with multiple dummy variables. First look at the functional forms below. The first approach, call it LSDV1, runs OLS with all dummy variables but without the intercept. The second, LSDV2, omits one of the dummy variables and includes the intercept. The final approach, LSDV3, includes all dummy variables and the intercept, but imposes the restriction that the sum of the parameters of all dummies is zero. Table 10 summarizes the features of the three LSDVs.

research_i = β1 income_i + δ1 d1_i + δ2 d2_i + δ3 d3_i + ε_i (without the intercept)
research_i = β0 + β1 income_i + δ1 d1_i + δ2 d2_i + ε_i (without one of the three dummy variables)
research_i = β0 + β1 income_i + δ1 d1_i + δ2 d2_i + δ3 d3_i + ε_i, with the restriction δ1 + δ2 + δ3 = 0

The biggest difference lies in the meanings of the dummy variable parameters and their hypothesis tests. The first approach reports coefficients that are easy to interpret substantively. They are the actual intercepts of the three groups, as in the following regression equations (see Table 7).

Telecom firms:   Research = 153.624 + .215*income
Electronics:     Research = 1695.486 + .215*income
Equipment & S/W: Research = 2147.559 + .215*income

In the second approach, LSDV2, the intercept is the coefficient of the dropped dummy, playing the role of baseline or reference point. The other dummy coefficients are differences of the corresponding actual intercepts from that baseline (see Table 8). For example, the intercept 2,147.559 in LSDV2 is the actual coefficient of d3, which is dropped. The coefficient -452.073 of d2 is computed as 1695.486 - 2147.559. Likewise, 153.624 in LSDV1 is computed as -1993.935 + 2147.559. What if we omit d2 instead of d3? We would get different parameter estimates and standard errors for the dummy variables. Note that the coefficient of net income is quite similar to those of Model 1 and Model 2 in sections 2 and 3 (.223 versus .218 versus .215).

Table 7. LSDV1 without the Intercept
      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  4,    35) =   28.70
       Model |   198376404     4  49594101.1           Prob > F      =  0.0000
    Residual |  60484956.6    35  1728141.62           R-squared     =  0.7663
-------------+------------------------------           Adj R-squared =  0.7396
       Total |   258861361    39  6637470.79           Root MSE      =  1314.6

------------------------------------------------------------------------------
      rd2002 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     net2000 |   .2151104   .0735702     2.92   0.006      .065755    .3644659
          d1 |   153.6238   469.5762     0.33   0.745    -799.6665    1106.914
          d2 |   1695.486   373.0145     4.55   0.000     938.2267    2452.746
          d3 |   2147.559   397.9181     5.40   0.000     1339.742    2955.375
------------------------------------------------------------------------------

The third approach, LSDV3, produces coefficients that indicate how far each group's actual intercept is from the averaged group effect, the intercept of LSDV3 (see Table 9). For example, the intercept 1,332.223 is computed as (153.624 + 1695.486 + 2147.559)/3. The coefficient of d3, 815.33581, is 2,147.559 - 1,332.223. Note that the 6.14175E-13 in the last part of the SAS output is virtually zero; this is the restriction.


Table 8. LSDV2 without One Dummy Variable


      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  3,    35) =    7.46
       Model |    38678749     3  12892916.3           Prob > F      =  0.0005
    Residual |  60484956.6    35  1728141.62           R-squared     =  0.3900
-------------+------------------------------           Adj R-squared =  0.3378
       Total |  99163705.6    38   2609571.2           Root MSE      =  1314.6

------------------------------------------------------------------------------
      rd2002 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     net2000 |   .2151104   .0735702     2.92   0.006      .065755    .3644659
          d1 |  -1993.935   561.9429    -3.55   0.001     -3134.74   -853.1303
          d2 |  -452.0725   481.2018    -0.94   0.354    -1428.964    524.8192
       _cons |   2147.559   397.9181     5.40   0.000     1339.742    2955.375
------------------------------------------------------------------------------
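The mapping between LSDV1 and LSDV2 can be verified on simulated data. The sketch below uses hypothetical numbers (true intercepts loosely echoing Tables 7 and 8) and checks that the LSDV2 intercept equals the dropped group's actual intercept, and that the LSDV2 dummy coefficients are differences from it.

```python
import numpy as np

# Hypothetical three-group data.
rng = np.random.default_rng(2)
income = rng.uniform(500, 5000, 30)
group = np.repeat([0, 1, 2], 10)                       # telecom, electronics, equip/sw
D = (group[:, None] == np.arange(3)).astype(float)     # columns d1, d2, d3
research = np.array([150.0, 1700.0, 2150.0])[group] + 0.2 * income \
           + rng.normal(0, 100, 30)

# LSDV1: all three dummies, no intercept -> actual group intercepts.
a = np.linalg.lstsq(np.column_stack([income, D]), research, rcond=None)[0][1:]

# LSDV2: drop d3, keep the intercept -> differences from the d3 baseline.
b = np.linalg.lstsq(np.column_stack([np.ones(30), income, D[:, :2]]),
                    research, rcond=None)[0]
cons, diffs = b[0], b[2:]

assert np.isclose(cons, a[2])            # intercept = dropped group's actual intercept
assert np.allclose(diffs, a[:2] - a[2])  # dummy coefficients = actual minus baseline
```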

Table 9. LSDV3 with Restriction Imposed (SAS output)


The REG Procedure
Model: MODEL1
Dependent Variable: rd2002

NOTE: Restrictions have been applied to parameter estimates.

Number of Observations Read                  50
Number of Observations Used                  39
Number of Observations with Missing Values   11

                    Analysis of Variance

Source             DF    Sum of Squares    Mean Square   F Value   Pr > F
Model               3          38678749       12892916      7.46   0.0005
Error              35          60484957        1728142
Corrected Total    38          99163706

Root MSE          1314.58800    R-Square    0.3900
Dependent Mean    2023.56410    Adj R-Sq    0.3378
Coeff Var           64.96399

                    Parameter Estimates

                     Parameter      Standard
Variable     DF       Estimate         Error    t Value    Pr > |t|
Intercept     1     1332.22301     280.18308       4.75      <.0001
net2000       1        0.21511       0.07357       2.92      0.0060
d1            1    -1178.59917     333.36182      -3.54      0.0012
d2            1      363.26336     288.19307       1.26      0.2158
d3            1      815.33581     297.13197       2.74      0.0095
RESTRICT     -1    6.14175E-13             .         .           .

* Probability computed using beta distribution.
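LSDV3 can also be sketched by substituting the restriction δ1 + δ2 + δ3 = 0 directly into the design matrix (δ3 = -δ1 - δ2), one of several equivalent ways to estimate a restricted model. On hypothetical data, the recovered intercept is the average of the three actual group intercepts, mirroring the computation shown for Table 9.

```python
import numpy as np

# Hypothetical three-group data, as before.
rng = np.random.default_rng(3)
income = rng.uniform(500, 5000, 30)
group = np.repeat([0, 1, 2], 10)
D = (group[:, None] == np.arange(3)).astype(float)
research = np.array([150.0, 1700.0, 2150.0])[group] + 0.2 * income \
           + rng.normal(0, 100, 30)

# Restricted design: intercept, income, (d1 - d3), (d2 - d3);
# this encodes delta3 = -delta1 - delta2.
Xr = np.column_stack([np.ones(30), income, D[:, 0] - D[:, 2], D[:, 1] - D[:, 2]])
cons, slope, dlt1, dlt2 = np.linalg.lstsq(Xr, research, rcond=None)[0]
dlt3 = -dlt1 - dlt2                       # recovered from the restriction

# LSDV1 benchmark: the actual group intercepts.
actual = np.linalg.lstsq(np.column_stack([income, D]), research, rcond=None)[0][1:]
assert np.isclose(cons, actual.mean())    # intercept = averaged group effect
assert np.allclose(cons + np.array([dlt1, dlt2, dlt3]), actual)
```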

Table 10 summarizes the differences of three approaches discussed so far.



Table 10. Three Approaches to Running Dummy Variable Models (LSDVs)

                 LSDV1                 LSDV2                        LSDV3
                 (no intercept)        (dropping one dummy)         (imposing restriction)
Intercept?       No                    Yes                          Yes
All dummies?     Yes (d)               No (d-1)                     Yes (d)
Restriction?     No                    No                           Σδi^c = 0*
Meaning of       Fixed group effect    How far away from the        How far away from the
 coefficients    (actual intercepts)   reference point (dropped)?   average group effect?
Coefficients     δi^a                  δi^a = β0 + δi^b,            δi^a = μ + δi^c, where
                                       δdropped^a = β0              μ = (1/d)Σδi^a
H0 of t-test     δi^a = 0              δi^a - δdropped^a = 0        δi^a - (1/d)Σδi^a = 0**

Source: http://mypage.iu.edu/~kucc625/documents/Panel_Data_Models.pdf

5.3 Comparing Statistics of the Three LSDVs

The t-tests for dummy variable parameters should be interpreted with caution, since the three approaches give different meanings to the dummy coefficients (see Table 10). In LSDV1 these coefficients are easy to interpret, because they are actual intercepts. Keep in mind that LSDV2 examines the difference of an actual intercept from the baseline intercept, while LSDV3 checks how far an actual intercept is from the averaged intercept. The null hypotheses of LSDV1 through LSDV3 are δi^a = 0, δi^a - δdropped^a = 0, and δi^a - (1/d)Σδi^a = 0, respectively. Therefore, you may not conclude, for example, that the intercept of the first group (telecommunications) is statistically significant, or that the parameter of d1 is not zero, by referring to the t-test of LSDV2 (t=-3.55 and p<.001). That t-test just tells us that the intercept of telecommunications firms is substantially different from that of equipment and software companies; it does not tell whether the intercept is close to zero or not, because the reference point is not zero. Instead, you need to look at the t-test in LSDV1. The small t statistic .33 and large p-value .745 in Table 7 do not allow us to reject the null hypothesis that the actual intercept of telecommunications firms is zero: δ1^a = 0.

Although LSDV1 without the intercept is easy to interpret, it has serious problems in reporting goodness-of-fit measures (see Table 11). This approach reports wrong SSM and MSM, and thus a wrong R2 and F test for the hypothesis that all coefficients are zero. However, LSDV1 reports correct SSE, MSE, DF_error, and standard errors of parameter estimates. By contrast, LSDV2 and LSDV3 report correct information, at the cost of interpreting dummy coefficients in a more complicated manner.
* This restriction reduces the number of parameters to be estimated, making the model identified.
** In SAS, this H0 needs to be rearranged as (d-1)δi^a - Σδj^a = 0, where j ≠ i.


Table 11. Comparing Statistics of the Three LSDVs

                         LSDV1        LSDV2      LSDV3
R2 and adjusted R2       Incorrect    Correct    Correct
F test                   Incorrect    Correct    Correct
Standard error of b      Correct      Correct    Correct
SSM/MSM                  Incorrect    Correct    Correct
SSE/MSE                  Correct      Correct    Correct
DF_error4                N-K          N-K        N-K

5.4 Software Issues

All data analysis software supports LSDV1 and LSDV2. Only SAS and LIMDEP support linear regression with a restriction. However, LIMDEP reports slightly different parameter estimates across the approaches. Although it provides various econometric models, LIMDEP is not good for working with data sets. SAS and STATA have the TSCSREG procedure and the .xtreg command, respectively, to run fixed/random effect models without dummies. The TSCSREG procedure works only on panel data.

Table 12. Comparing Estimation of Three LSDVs

             LSDV1                 LSDV2             LSDV3
SAS 9.1      REG w/ NOINT          REG               REG w/ RESTRICT
STATA 8.2    .regress w/ nocon     .regress          N/A
LIMDEP 8.0   Regress w/o ONE       Regress w/ ONE    Regress w/ CLS
R 2.xx       > lm() w/ -1          > lm()            N/A
SPSS 12.0    Regression w/ Origin  Regression        N/A

The following script runs LSDV3 using the RESTRICT statement of the REG procedure.
PROC REG;
   MODEL rd2002 = net2000 d1-d3;
   RESTRICT d1 + d2 + d3 = 0;
RUN;

The following STATA .xtreg command runs the fixed (within) effect panel data model.5 Note that the i(type2) option specifies the independent unit, and that type2 is recoded from type so that it takes values 1, 2, and 3 for the three firm types (d1 through d3).
.xtreg rd2002 net2000, fe i(type2)

6. Regression with Dummies: Two-Way LSDVs


The previous section addressed the one-way LSDV, in which only one group variable is considered. Now let us move on to the two-way LSDV.
4 The K denotes the sum of the numbers of dummy variables, regressors, and the intercept included in the model. The N is the total number of observations used in the regression model.
5 Individual dummy coefficients need to be computed, and their standard errors should be corrected (adjusted).


6.1 Data Structure and Estimation

A new group variable is the area of firms' ownership. Here is another set of three dummy variables: g1, g2, g3. The g1 is set to 1 if a firm is owned by Asian countries and 0 otherwise. Similarly, g2 and g3 are coded for European and American companies, respectively. Look at the data structure.

Table 13. Data Structure of the Two-Way LSDV
+----------------------------------------------------------------------------+
| firm       type           d1  d2  d3  area     g1  g2  g3                  |
|----------------------------------------------------------------------------|
| Samsung    Electronics     0   1   0  Asia      1   0   0                  |
| AT&T       Telecom         1   0   0  America   0   0   1                  |
| IBM        IT Equipment    0   0   1  America   0   0   1                  |
| Siemens    Electronics     0   1   0  Europe    0   1   0                  |
| Verizon    Telecom         1   0   0  America   0   0   1                  |
| Microsoft  Service & S/W   0   0   1  America   0   0   1                  |
| EDS        Service & S/W   0   0   1  America   0   0   1                  |
+----------------------------------------------------------------------------+

Now, our model becomes a little bit messy, since it has six dummy variables. In order to avoid perfect multicollinearity, we have to 1) omit two dummy variables, one from each set of dummy variables; 2) omit one dummy variable for ownership area and impose a restriction on firm type; 3) omit one dummy variable for firm type and impose a restriction on ownership area; or 4) impose two restrictions, one for firm type and the other for ownership area. Note that you must not omit the intercept in the two-way fixed effect model. The following is the simplest approach, which omits two dummy variables.

research_i = β0 + β1 income_i + δ1 d1_i + δ2 d2_i + γ1 g1_i + γ2 g2_i + ε_i

Table 14. Two-Way Fixed Effect Model (LSDV2)
      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  5,    33) =    6.19
       Model |  47996204.2     5  9599240.84           Prob > F      =  0.0004
    Residual |  51167501.4    33  1550530.35           R-squared     =  0.4840
-------------+------------------------------           Adj R-squared =  0.4058
       Total |  99163705.6    38   2609571.2           Root MSE      =  1245.2

------------------------------------------------------------------------------
      rd2002 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     net2000 |   .3008584   .0830277     3.62   0.001     .1319374    .4697795
          d1 |  -2446.278   579.9832    -4.22   0.000    -3626.262   -1266.293
          d2 |   -923.931    503.678    -1.83   0.076    -1948.672    100.8097
          g1 |   1375.542   579.5446     2.37   0.024     196.4499    2554.635
          g2 |   907.2314   570.3879     1.59   0.121    -253.2315    2067.694
       _cons |   1440.654    474.693     3.03   0.005     474.8843    2406.424
------------------------------------------------------------------------------

Note that this model has many parameters to be estimated, relative to the number of observations available. We can derive nine regression equations, depending on the combinations of three firm types and three areas of ownership: 9 = 3 x 3.


(1) research_i = β0 + β1 income_i + 0 + 0 + 0 + 0 + ε_i (American equipment & S/W firms)
(2) research_i = β0 + β1 income_i + δ1 + 0 + 0 + 0 + ε_i (American telecom firms)
...
(8) research_i = β0 + β1 income_i + 0 + δ2 + γ1 + 0 + ε_i (Asian electronics firms)
(9) research_i = β0 + β1 income_i + 0 + δ2 + 0 + γ2 + ε_i (European electronics firms)

For example, the regression equation for Asian telecommunications companies is,

Research = 369.918 + .301*income = (1,440.654 - 2,446.278 + 1,375.542) + .301*income

6.2 Full Model versus Restricted Models

Let us call this two-way fixed effect model the full model, or unrestricted model. We have four restricted, or nested, models that include different subsets of the independent variables. Note that the third and fourth models should be estimated by one of the LSDV approaches.

(1) no fixed effect at all: research_i = β0 + β1 income_i + ε_i (Model 1)
(2) type effect only: research_i = β0 + β1 income_i + δd_i + ε_i (Model 2)
(3) type effect only: research_i = β0 + β1 income_i + δ1 d1_i + δ2 d2_i + δ3 d3_i + ε_i
(4) area effect only: research_i = β0 + β1 income_i + γ1 g1_i + γ2 g2_i + γ3 g3_i + ε_i
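The nine implied intercepts can be tabulated directly from the estimates reported in Table 14. The short sketch below reproduces the Asian telecommunications intercept of 369.918 (the dictionary keys are illustrative labels, not from the text).

```python
# Estimates from Table 14; the baseline is American equipment/software firms.
cons = 1440.654
delta = {"telecom": -2446.278, "electronics": -923.931, "equip_sw": 0.0}
gamma = {"asia": 1375.542, "europe": 907.2314, "america": 0.0}

# Nine intercepts, one per (type, area) combination.
intercepts = {(t, a): round(cons + delta[t] + gamma[a], 3)
              for t in delta for a in gamma}
print(intercepts[("telecom", "asia")])  # 369.918, as in the text
```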

Table 15. Fixed Area Effect Model


      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  3,    35) =    3.02
       Model |  20395250.9     3  6798416.97           Prob > F      =  0.0426
    Residual |  78768454.7    35  2250527.28           R-squared     =  0.2057
-------------+------------------------------           Adj R-squared =  0.1376
       Total |  99163705.6    38   2609571.2           Root MSE      =  1500.2

------------------------------------------------------------------------------
      rd2002 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     net2000 |   .2930783    .097524     3.01   0.005     .0950941    .4910626
          g1 |   788.3469   633.0243     1.25   0.221    -496.7608    2073.455
          g2 |  -29.29548   631.1481    -0.05   0.963    -1310.594    1252.003
       _cons |   996.9815   550.7665     1.81   0.079     -121.134    2115.097
------------------------------------------------------------------------------

Which one is the best model? It is not a good idea to compare F statistics and t-tests for individual parameter estimates. Instead, we may use the so-called incremental F-test to examine the change in goodness of fit between the full (unrestricted) model and a restricted model (Greene 2000; Fox 1997). This F-test requires the sum of squared errors (SSE), e'e, of the unrestricted and restricted models. The null hypothesis is that the parameters of the added regressors (dummies here) are all zero (e.g., H0: γ1 = γ2 = γ3 = 0). The formula of the F-test is

F(J, N-K) = [(e*'e* - e'e)/J] / [e'e/(N-K)] = [(R² - R*²)/J] / [(1 - R²)/(N-K)],


where e*'e* and R*² are respectively the SSE and R² of the restricted model, and J is the number of dummy variables actually taken out of the full model (e.g., 2 for the second and third restricted models). Keep in mind that R² in LSDV1 without the intercept is not well defined; so do NOT plug the R² of LSDV1 into the second formula!

Let us compare the full model (Table 14) and the fixed area effect model (Table 15). The F statistic of 8.9005 is large enough to reject the null hypothesis (p<.0008), signaling the superiority of the full model. Adding the two dummy variables reduces the SSE (e'e) substantially.

   F(2, 33) = [(e*'e* - e'e)/J] / [e'e/(N-K)] = [(78,768,454.7 - 51,167,501.4)/2] / [51,167,501.4/(39-6)] = 8.9005

Now consider the full model versus the fixed type effect model (Table 8). The small F statistic indicates that including two more dummy variables does not significantly improve the goodness of fit of the full model (p<.0633). Thus, we do not reject the null hypothesis, in favor of the restricted model.

   F(2, 33) = [(e*'e* - e'e)/J] / [e'e/(N-K)] = [(60,484,956.6 - 51,167,501.4)/2] / [51,167,501.4/(39-6)] = 3.0046

How do we compare the fixed type effect models of Table 3, with one dummy (Model 2), and Table 8, with two dummy variables? In this case, the model with two dummies becomes the full model. The large F statistic allows us to reject the null hypothesis in favor of the full model with two dummies (p<.0080).

   F(1, 35) = [(e*'e* - e'e)/J] / [e'e/(N-K)] = [(74,175,756.7 - 60,484,956.6)/1] / [60,484,956.6/(39-4)] = 7.9223
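The three incremental F tests above can be reproduced in a few lines of Python (a sketch; the SSE values are copied from the Stata outputs, and the p-values would come from an F table or a statistics library):

```python
def incremental_f(sse_restricted, sse_full, j, n, k):
    """Incremental F statistic: j = number of regressors dropped,
    n = observations, k = parameters in the full model."""
    return ((sse_restricted - sse_full) / j) / (sse_full / (n - k))

# Full two-way model (e'e = 51,167,501.4, K = 6) vs. area-only model (Table 15)
print(round(incremental_f(78_768_454.7, 51_167_501.4, 2, 39, 6), 4))  # 8.9005
# Full two-way model vs. type-only model (Table 8)
print(round(incremental_f(60_484_956.6, 51_167_501.4, 2, 39, 6), 4))  # 3.0046
# Type model with two dummies (Table 8) vs. one dummy (Table 3)
print(round(incremental_f(74_175_756.7, 60_484_956.6, 1, 39, 4), 4))  # 7.9223
```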

7. Regression with Threshold Effect


Let us consider the fixed effect of academic degree, grouped into Ph.D., Masters, B.A., and diploma. In general, B.A. degree holders have a diploma as well; Masters degree holders have a B.A. degree as well as a diploma; and so forth. The degree effect is cumulative; this effect is called a threshold effect. Suppose we want to know the threshold effects of academic degree on annual income. Note that one of the dummy variables, say t1, needs to be dropped in order to avoid perfect multicollinearity.

   income_i = β0 + β1effort_i + δ2t2_i + δ3t3_i + δ4t4_i + ε_i

The data structure for the threshold effect model is different from that of ordinary LSDVs. In Table 16, compare d1 through d4 with t1 through t4 to check how differently they recode

academic degree. For the Masters degree, for instance, only d3 is set to 1, while t1 through t3 are all coded 1.

Table 16. Data Structure for Threshold Effect Model
+---------------------------------------------------------------------------+
|  income    effort    degree    d1  d2  d3  d4   t1  t2  t3  t4            |
|---------------------------------------------------------------------------|
|  13.242   1.44977   Diploma     1   0   0   0    1   0   0   0            |
|  32.983   1.01713   B.A.        0   1   0   0    1   1   0   0            |
|  47.962    .67178   Masters     0   0   1   0    1   1   1   0            |
|  52.048   2.11554   Ph.D.       0   0   0   1    1   1   1   1            |
|  50.528   2.55896   B.A.        0   1   0   0    1   1   0   0            |
|  17.179    .68774   Ph.D.       0   0   0   1    1   1   1   1            |

There are four regression equations depending on the degree. They share the same slope, of course. Note that the intercepts are cumulative in the sense that they are actually β0; β0 + δ2; β0 + δ2 + δ3; and β0 + δ2 + δ3 + δ4, respectively.

   (1) income_i = β0 + β1effort_i + δ2 + δ3 + δ4   for Ph.D. degree holders
   (2) income_i = β0 + β1effort_i + δ2 + δ3 + 0    for Masters degree holders
   (3) income_i = β0 + β1effort_i + δ2 + 0 + 0     for B.A. degree holders
   (4) income_i = β0 + β1effort_i + 0 + 0 + 0      for diploma holders

It is notable that each δ captures the marginal value of an academic degree. For example, δ3 is the marginal value of the Masters degree over the B.A.: we may say that Masters degree holders on average earn δ3 more income than B.A. degree holders, holding all others constant.
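The cumulative coding of Table 16 can be generated mechanically. A minimal Python sketch (function and list names are illustrative, not from the original data):

```python
DEGREES = ["Diploma", "B.A.", "Masters", "Ph.D."]  # ordered low to high

def threshold_code(degree):
    """t_j = 1 for the holder's own degree and every degree below it (Table 16)."""
    rank = DEGREES.index(degree) + 1
    return tuple(int(j <= rank) for j in range(1, 5))

print(threshold_code("Masters"))  # (1, 1, 1, 0)
print(threshold_code("Diploma"))  # (1, 0, 0, 0)
```

Contrast this with ordinary dummy coding, where exactly one d_j is 1 per observation.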

8. Regression with Interaction Effect


We have so far discussed regression models with dummy variables that share the same slope; the only differences across groups lie in the intercepts. Now, move on to regression models with dummies that have different slopes and/or intercepts.

8.1 Regression with Different Slope and Intercept

Let us reconsider Model 2 discussed in section 2. This time we add one more regressor, an interaction term between net income in 2000 and the dummy variable. Now, we have a revised regression model.

   research_i = β0 + β1income_i + β2inc_d_i + δd_i + ε_i

The interaction term is the product of the regressor net2000 and the dummy variable d: inc_d = net2000 * d. Note that the interaction term is identical to net2000 if the dummy variable is 1, and is zero otherwise.

This model has two regression equations with different slopes and intercepts. You may compare them with those in section 3.

   Equipment and Software:   Research = 2,047.062 + .255*income
   Telecom. and Electronics: Research = 1,181.956 + .198*income

Table 17. Data Structure for Interaction Effect Model
+-------------------------------------------------------------------+
|      firm            type     rd2002   net2000    inc_d   d       |
|-------------------------------------------------------------------|
|   Samsung     Electronics      2,500     4,768        0   0       |
|      AT&T         Telecom        254     4,669        0   0       |
|       IBM    IT Equipment      4,750     8,093     8093   1       |
|   Siemens     Electronics      5,490     6,528        0   0       |
|   Verizon         Telecom          .    11,797        0   0       |
| Microsoft   Service & S/W      4,307     9,421     9421   1       |
|       EDS   Service & S/W          0     1,143     1143   1       |

The interaction effect turns out to be statistically insignificant at the .05 level (p<.738). Thus, we conclude that the slope for equipment and software companies is not substantially different from that for telecommunications and electronics firms. However, you may not conclude from the small t statistic (p<.186) that the intercept of equipment and software firms is statistically insignificant (close to zero); remember that the parameter δ indicates the difference between the actual intercepts of the two types of firms.

Table 18. Regression Model with Interaction Effect 1
      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  3,    35) =    3.98
       Model |  25227993.1     3  8409331.02           Prob > F      =  0.0153
    Residual |  73935712.5    35  2112448.93           R-squared     =  0.2544
-------------+------------------------------           Adj R-squared =  0.1905
       Total |  99163705.6    38   2609571.2           Root MSE      =  1453.4

------------------------------------------------------------------------------
      rd2002 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     net2000 |   .1975142   .1015407     1.95   0.060    -.0086244    .4036527
       net_d |   .0571731   .1696052     0.34   0.738    -.2871437    .4014899
           d |   865.1056    641.755     1.35   0.186    -437.7264    2167.938
       _cons |   1181.956   376.7763     3.14   0.003     417.0599    1946.853
------------------------------------------------------------------------------

8.2 Regression with Different Slope and the Same Intercept

Now, exclude the dummy variable so that only the regressor and the interaction term remain in the model. This model produces two regression equations with different slopes but the same intercept, which is less common in the real world.

6 The equation of equipment & software firms is research_i = (β0 + δ) + (β1 + β2)income_i + ε_i. Thus, the intercept is 2,047.062 = 1,181.956 + 865.1056, and the slope is .255 = .1975142 + .0571731.
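The same arithmetic can be checked directly from the Table 18 coefficients (a quick Python sketch; values copied verbatim from the output):

```python
# Recover the two group equations from Table 18.
b_income = 0.1975142   # net2000
b_inter  = 0.0571731   # net_d (interaction term)
delta    = 865.1056    # d (dummy)
b0       = 1181.956    # _cons

# Telecom & electronics (d = 0): intercept b0, slope b_income.
# Equipment & software (d = 1): intercept b0 + delta, slope b_income + b_inter.
print(round(b0 + delta, 3), round(b_income + b_inter, 3))  # 2047.062 0.255
```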


   research_i = β0 + β1income_i + β2inc_d_i + ε_i

   Equipment and Software:   Research = 1,480.15 + .353*income
   Telecom. and Electronics: Research = 1,480.15 + .146*income

The t statistic of 1.59 for the interaction term indicates that there is no statistically significant interaction effect (p<.120). Note that the SEE, the square root of the MSE, becomes larger than that of the model in Table 18.

Table 19. Regression Model with Interaction Effect 2
      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  2,    36) =    4.95
       Model |  21389278.1     2  10694639.1           Prob > F      =  0.0126
    Residual |  77774427.5    36  2160400.76           R-squared     =  0.2157
-------------+------------------------------           Adj R-squared =  0.1721
       Total |  99163705.6    38   2609571.2           Root MSE      =  1469.8

------------------------------------------------------------------------------
      rd2002 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     net2000 |   .1463857   .0952541     1.54   0.133    -.0467987     .33957
       net_d |   .2067402   .1297268     1.59   0.120    -.0563579    .4698383
       _cons |    1480.15   308.4474     4.80   0.000     854.5894     2105.71
------------------------------------------------------------------------------

Figure 3 compares the two regression models with interaction effects. The left plot depicts the regression equations with different slopes and intercepts; the regression equations on the right have different slopes but the same intercept.

Figure 3. Regression Model with Interaction Effect

8.3 Limitation and Further Direction

The regression model with an interaction effect is likely to suffer from multicollinearity, because the interaction term tends to be highly correlated with the dummy variable.8 As
7 The equation of equipment & software firms is research_i = β0 + (β1 + β2)income_i + ε_i. Thus, the slope is .353 = .1463857 + .2067402.


more interaction terms are included in the model, the more likely it is that the multicollinearity problem becomes severe. If two groups show different disturbance variances, the pooled regression may result in a biased estimate of the disturbance variance and an incorrect estimate of the covariance matrix (Greene 2000: 323). This is the case for the groupwise heteroscedasticity model, an example of a random group effect model for panel data.

9. Spline Regression (Greene 2000: 324)


What is a spline? Smith (1979) put it as follows: "Splines are generally defined to be piecewise polynomials of degree n whose function values and first n-1 derivatives agree at the points where they join. The abscissas of these joint points are called knots. Polynomials may be considered a special case of splines with no knots, and piecewise (sometimes also called grafted or segmented) polynomials with fewer than the maximum number of continuity restrictions may also be considered splines." The number and degrees of the polynomial pieces and the number and positions of the knots may vary in different situations (Smith 1979: 57).

Suppose we know some threshold values, say ages 19 and 27, which significantly change the intercepts and slopes over the corresponding intervals of an independent variable. We need two dummy variables, d1 and d2. Individuals younger than 19, for example, are coded 0 in both d1 and d2; for those between 19 and 27, only d1 is set to 1; and those older than 27 have 1 in both d1 and d2. The regression equation will be,

   income_i = β0 + β1age_i + δ1d1_i + γ1d1_i age_i + δ2d2_i + γ2d2_i age_i + ε_i

That is,

   income_i = β0 + β1age_i + ε_i                           for those younger than 19
   income_i = (β0 + δ1) + (β1 + γ1)age_i + ε_i             for those between 19 and 27
   income_i = (β0 + δ1 + δ2) + (β1 + γ1 + γ2)age_i + ε_i   for those older than 27

We need two conditions to make the regression function continuous.
   (1) β0 + β1t1* = (β0 + δ1) + (β1 + γ1)t1*   at the age of 19

   (2) (β0 + δ1) + (β1 + γ1)t2* = (β0 + δ1 + δ2) + (β1 + γ1 + γ2)t2*   at the age of 27

Note that t1* and t2* respectively represent the threshold values, often called knots (19 and 27 in this case).

8 Interaction is different from correlation in the sense that regressors may jointly affect the dependent variable whether or not they are correlated (Fox 1997).



   From (1), δ1 + γ1t1* = 0, or δ1 = -γ1t1*
   From (2), δ2 + γ2t2* = 0, or δ2 = -γ2t2*


Then, plug the two results into the original regression model.


   income_i = β0 + β1age_i - γ1t1*d1_i + γ1d1_i age_i - γ2t2*d2_i + γ2d2_i age_i + ε_i
            = β0 + β1age_i + γ1d1_i(age_i - t1*) + γ2d2_i(age_i - t2*) + ε_i

As shown in the last equation, we have to create two new variables: one for d1_i(age_i - t1*) and the other for d2_i(age_i - t2*). Finally, run OLS to estimate the spline regression model. We may test the hypotheses on the knots: γ1 = 0, γ2 = 0, or γ1 = γ2 = 0. The SAS script for this spline regression will be:
   PROC REG;
      MODEL income = age age19 age27;
      TEST age19=0, age27=0;
   RUN;
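The two constructed variables d1(age - t1*) and d2(age - t2*) collapse to truncated terms, since d_k is 0 below its knot. A small Python sketch (names are illustrative) shows how they would be built for each observation:

```python
def spline_terms(age, knots=(19, 27)):
    """Return [age, d1*(age - t1), d2*(age - t2)] for one observation.
    d_k*(age - t_k) equals max(age - t_k, 0), because d_k = 0 below the knot."""
    return [age] + [max(age - t, 0) for t in knots]

print(spline_terms(16))  # [16, 0, 0]
print(spline_terms(25))  # [25, 6, 0]
print(spline_terms(30))  # [30, 11, 3]
```

Regressing income on these three columns (plus a constant) reproduces the continuous piecewise fit above.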

10. Conclusion
Using dummy variables in regression analysis is useful for capturing fixed/random effects. This technique is able to explain how group/time differences affect models. However, it must be used with caution.

First, keep in mind that each LSDV approach has a different interpretation of the dummy parameters, and that the t-tests have different null hypotheses. Otherwise, you may be misled and end up with wrong conclusions.

Second, be parsimonious by minimizing the number of dummies, especially when you do not have many observations. Avoid the problem of too many parameters with a small sample size. Try to hit the highlights, focusing on your main arguments.

Third, and related to the second, be careful not to be caught in the dummy variable trap, perfect multicollinearity. As you include many dummies, the likelihood of getting into trouble increases sharply.

Finally, do not try to compare a monkey and a lemon. Categories should have something in common with each other so that comparison is meaningful from analytic and theoretical perspectives. Comparing an apple and a pear is better than contrasting an apple and an onion. By the same token, telecommunications versus electronics firms makes much more sense than telecommunications firms versus universities.


References
Baltagi, Badi H. 2001. Econometric Analysis of Panel Data. New York: John Wiley & Sons.
Fox, John. 1997. Applied Regression Analysis, Linear Models, and Related Methods. Newbury Park, CA: Sage.
Freund, Rudolf J., and Ramon C. Littell. 2000. SAS System for Regression, 3rd ed. Cary, NC: SAS Institute.
Greene, William H. 2000. Econometric Analysis, 4th ed. Upper Saddle River, NJ: Prentice Hall.
SAS Institute. 2004. SAS 9.1 User's Guide. Cary, NC: SAS Institute. http://www.sas.com/
Smith, Patricia L. 1979. "Splines as a Useful and Convenient Statistical Tool." American Statistician 33(2): 57-62.
STATA Press. 2003. STATA Base Reference Manual, Release 8. College Station, TX: STATA Press. http://www.stata.com/
STATA Press. 2003. STATA Cross-Sectional Time-Series Reference Manual, Release 8. College Station, TX: STATA Press. http://www.stata.com/
http://mypage.iu.edu/~kucc625/documents/Panel_Data_Models.pdf
http://socserv.socsci.mcmaster.ca/jfox/Courses/soc740/lecture-5.pdf
