
Regression on Dummy Variables

1. Nature of Dummy Variables


Types of Variables
Quantitative variables: income, output, prices, costs, heights, etc.
Qualitative (categorical, indicator) variables: sex, race, color, etc.
How to quantify qualitative attributes? Introduce dummy variables.
ex) Two categories

$$Y_i = \beta_1 + \beta_2 D_i + \varepsilon_i$$

where $Y_i$: annual starting salary, $D_i$: dummy variable for race, i.e.

$$D_i = \begin{cases} 1 & \text{if white} \\ 0 & \text{if black} \end{cases}$$
estimation: $Y$ and $X$ can be rewritten as

$$Y = \begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix}, \qquad
X = \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ \vdots & \vdots \\ 1 & 1 \end{bmatrix}$$

where in the second column of $X$, 1 is for whites and 0 for blacks.
And we get

$$\begin{bmatrix} \hat\beta_1 \\ \hat\beta_2 \end{bmatrix} = (X'X)^{-1} X'Y.$$
Test: $H_0: \beta_2 = 0$ to check the difference in salary according to race.
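As a concrete illustration, here is a minimal numpy sketch of this estimator and the test of $H_0: \beta_2 = 0$; the data are simulated, and the salary figures and sample size are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated sample (all figures invented for illustration):
# D_i = 1 for white, 0 for black; Y_i = annual starting salary.
n = 200
D = rng.integers(0, 2, size=n)
Y = 30_000 + 2_000 * D + rng.normal(0, 1_500, size=n)

# Design matrix [1, D_i] and OLS: b = (X'X)^{-1} X'Y.
X = np.column_stack([np.ones(n), D])
b = np.linalg.solve(X.T @ X, X.T @ Y)

# t-ratio for H0: beta_2 = 0 (no salary difference by race).
resid = Y - X @ b
s2 = resid @ resid / (n - X.shape[1])                # sigma^2 estimate
se = np.sqrt(s2 * np.linalg.inv(X.T @ X).diagonal())
print("b =", b, "t(beta_2) =", b[1] / se[1])
```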
ex) Three categories

$$Y_i = \beta_1 + \beta_2 D_{Wi} + \beta_3 D_{Bi} + \varepsilon_i$$

where $Y_i$: annual starting salary, $D_{Wi}, D_{Bi}$: dummy variables for race, i.e.

$$D_{Wi} = \begin{cases} 1 & \text{if white} \\ 0 & \text{otherwise} \end{cases}, \qquad
D_{Bi} = \begin{cases} 1 & \text{if black} \\ 0 & \text{otherwise} \end{cases}$$
interpretation:

$$\begin{cases}
\text{white:} & Y_i = \beta_1 + \beta_2 + \varepsilon_i \\
\text{black:} & Y_i = \beta_1 + \beta_3 + \varepsilon_i \\
\text{other:} & Y_i = \beta_1 + \varepsilon_i
\end{cases}$$
$X$ can be rewritten as

$$X = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ \vdots & \vdots & \vdots \\ 1 & 1 & 0 \end{bmatrix}$$

where 1 is for whites in the second column and 1 is for blacks in the third column.
Number of dummy variables: with $m$ categories, introduce only $(m-1)$ dummies; otherwise the problem of multicollinearity (the dummy variable trap) will occur, as the sketch below shows.
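The trap is easy to see numerically; a quick numpy check on a hypothetical five-observation sample:

```python
import numpy as np

# Three categories, all three dummies PLUS an intercept:
# the dummy columns sum to the intercept column, so X'X is singular.
DW = np.array([1, 0, 0, 1, 0])   # white
DB = np.array([0, 1, 0, 0, 1])   # black
DO = np.array([0, 0, 1, 0, 0])   # other
X_bad = np.column_stack([np.ones(5), DW, DB, DO])
print(np.linalg.matrix_rank(X_bad))   # 3, not 4: rank deficient

# Drop one dummy ("other" becomes the base category): full column rank.
X_ok = np.column_stack([np.ones(5), DW, DB])
print(np.linalg.matrix_rank(X_ok))    # 3 = number of columns
```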
2. Regression with Quantitative and Qualitative Variables
Mixture of quantitative and qualitative variables
ex) When both race and years of work affect income, the model is

$$Y_i = \beta_1 + \beta_2 D_{2i} + \beta_3 X_i + \varepsilon_i$$

where $Y_i$: annual income, $D_{2i}$: dummy for race, $X_i$: years of work.
$X$ can be rewritten as

$$X = \begin{bmatrix} 1 & 1 & X_1 \\ 1 & 0 & X_2 \\ \vdots & \vdots & \vdots \\ 1 & 1 & X_n \end{bmatrix}$$
Test $H_0: \beta_2 = 0$ to check the difference in income according to race.
Interaction between quantitative and qualitative variables
ex) When the change in income due to years of work differs by race, the model is

$$Y_i = \beta_1 + \beta_2 D_{2i} + \beta_3 X_i + \beta_4 D_{2i} X_i + \varepsilon_i$$
$X$ can be rewritten as

$$X = \begin{bmatrix}
1 & 1 & X_1 & 1\cdot X_1 \\
1 & 0 & X_2 & 0\cdot X_2 \\
1 & 0 & X_3 & 0\cdot X_3 \\
\vdots & \vdots & \vdots & \vdots \\
1 & 1 & X_n & 1\cdot X_n
\end{bmatrix}$$
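A short sketch of how this design matrix is assembled; the four observations and their values are hypothetical.

```python
import numpy as np

# Columns: intercept, race dummy D, years of work X, interaction D*X.
D = np.array([1.0, 0.0, 0.0, 1.0])
X_years = np.array([3.0, 5.0, 2.0, 7.0])
X = np.column_stack([np.ones(4), D, X_years, D * X_years])
print(X)
# For D = 1 the slope on years of work is beta_3 + beta_4;
# for D = 0 it is beta_3, so beta_4 captures the difference by race.
```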
3. Dummy Variables in Matrix Form

When all variables are affected by dummies, we have the model augmented by the dummy:

$$Y_{n\times 1} = X\beta + D_{n\times n} X_{n\times k}\, \gamma_{k\times 1} + \varepsilon$$

i.e.,

$$Y = \begin{bmatrix} X_{n\times k} & DX_{n\times k} \end{bmatrix}
\begin{bmatrix} \beta \\ \gamma \end{bmatrix}_{2k\times 1} + \varepsilon$$
Let

$$\tilde X = \begin{bmatrix} X_{n\times k} & DX_{n\times k} \end{bmatrix}, \qquad
\tilde\beta = \begin{bmatrix} \beta \\ \gamma \end{bmatrix}$$

estimation:

$$\tilde b = (\tilde X'\tilde X)^{-1} \tilde X' Y$$
When only some variables are affected by dummies, we have

$$Y = X\beta + DX_1\gamma_1 + \varepsilon$$

where $X_1$: matrix of the variables affected by dummies, $\gamma_1$: vector of the coefficients that are considered to change according to categories.
Let
$m$: number of different kinds of categories (number of dummies)
$D_i$: matrix of the $i$th dummy
$X_i$: matrix of the corresponding variables
$\gamma_i$: vector of the corresponding parameters

Then,

$$Y = X\beta + \sum_{i=1}^{m} D_i X_i \gamma_i + \varepsilon$$
If we write

$$\tilde X = \begin{bmatrix} X & D_1 X_1 & D_2 X_2 & \cdots & D_m X_m \end{bmatrix}, \qquad
\tilde\beta = \begin{bmatrix} \beta \\ \gamma_1 \\ \vdots \\ \gamma_m \end{bmatrix},$$
we get

$$\tilde b = (\tilde X'\tilde X)^{-1} \tilde X' Y.$$
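A simulated sketch of this augmented estimator; the dimensions, break point, and coefficient values are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # k = 2 regressors
d = (np.arange(n) >= 60).astype(float)                  # dummy: last 40 obs
D = np.diag(d)

# Augmented design [X, DX]; gamma measures how all k coefficients
# shift for the observations with d_i = 1.
X_tilde = np.column_stack([X, D @ X])
beta, gamma = np.array([1.0, 2.0]), np.array([0.5, -1.0])
Y = X_tilde @ np.concatenate([beta, gamma]) + rng.normal(0, 0.1, n)

b_tilde = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ Y)
print(b_tilde)   # approximately [1, 2, 0.5, -1]
```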
4. Test for Structural Stability with Dummies

Let

$$H_0:\; Y = X\beta + Z\alpha + \varepsilon$$

$$H_1:\; \begin{cases}
Y_1 = X_1\beta_1 + Z\alpha + \varepsilon_1 & (1\text{st group}) \\
Y_2 = X_2\beta_2 + Z\alpha + \varepsilon_2 & (2\text{nd group})
\end{cases}$$

The model is

$$Y = X\beta + DX\gamma + Z\alpha + \varepsilon$$

where $D = \mathrm{diag}(D_1, D_2, \ldots, D_n)$,

$$D_i = \begin{cases} 0 & (i = 1, \ldots, n_1) \\ 1 & (i = n_1 + 1, \ldots, n) \end{cases}$$
i.e.

$$D = \begin{bmatrix}
0 & & & & & \\
& \ddots & & & & \\
& & 0 & & & \\
& & & 1 & & \\
& & & & \ddots & \\
& & & & & 1
\end{bmatrix}$$
Find $\hat\gamma$ and calculate $\mathrm{s.e.}(\hat\gamma)$ to test the hypothesis

$$H_0: \gamma = 0 \quad \text{vs.} \quad H_1: \gamma \neq 0$$
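A simulated sketch of the test in the common special case where only the slope shifts between the two groups; the break point and coefficients are invented.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n1 = 120, 60
x = rng.normal(size=n)
d = (np.arange(n) >= n1).astype(float)      # 0 for 1st group, 1 for 2nd

# The slope shifts from 1.0 to 1.8 in the second group.
y = 0.5 + np.where(d == 1, 1.8, 1.0) * x + rng.normal(0, 0.3, n)

# Y = X beta + (DX) gamma + eps, D applied to the slope column only.
X = np.column_stack([np.ones(n), x, d * x])
b = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ b
s2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(s2 * np.linalg.inv(X.T @ X).diagonal())
print("gamma_hat =", b[2], "t =", b[2] / se[2])   # rejects H0: gamma = 0
```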
5. Seasonality and Dummy Variables
Deseasonalization is the process of removing the seasonal component from
a time series.
→ Deseasonalize the data by introducing dummy variables.
ex)

$$Y_t = \beta_1 + \beta_2 D_{t2} + \beta_3 D_{t3} + \beta_4 D_{t4} + \beta_5 X_t + u_t \tag{1}$$

$$Y_t = \beta_1 + \beta_2 D_{t2} + \beta_3 D_{t3} + \beta_4 D_{t4} + \beta_5 X_t
+ \beta_6 D_{t2} X_t + \beta_7 D_{t3} X_t + \beta_8 D_{t4} X_t + u_t \tag{2}$$

where $Y_t$: expenditure, $X_t$: income, $D_{t2} = 1$ for the 1st quarter, $D_{t3} = 1$ for the 2nd quarter, $D_{t4} = 1$ for the 3rd quarter.
interpretation: Model (2) is more general than (1). If we use (1) when (2) is true, a model specification error will occur. Only when all the slope dummies of (2) are insignificant can we use model (1).
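A sketch of how designs (1) and (2) might be set up; the quarter pattern and all coefficients are invented, and the notes' convention that $D_{t2}$ flags the 1st quarter (4th quarter as base) is followed.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 80
quarter = np.arange(T) % 4 + 1        # 1, 2, 3, 4, 1, ...
income = rng.normal(100, 10, T)

# Quarter dummies as in the notes; the 4th quarter is the base category.
D2 = (quarter == 1).astype(float)
D3 = (quarter == 2).astype(float)
D4 = (quarter == 3).astype(float)

# Model (1): seasonal intercept shifts only.
X1 = np.column_stack([np.ones(T), D2, D3, D4, income])
# Model (2): adds seasonal slope shifts (interactions with income).
X2 = np.column_stack([X1, D2 * income, D3 * income, D4 * income])

y = X1 @ np.array([20.0, 5.0, -3.0, 2.0, 0.7]) + rng.normal(0, 2, T)
for X in (X1, X2):
    b = np.linalg.solve(X.T @ X, X.T @ y)
    print(b.round(2))   # in (2) the interaction terms come out near 0 here
```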
Multicollinearity
1. Nature of Multicollinearity
Multicollinearity refers to one or more linear relationships among the explanatory variables.
Perfect multicollinearity
Exact linear relationship between explanatory variables.
Since $X$ is not of full rank, $X'X$ is singular.
Since $b = (X'X)^{-1}X'Y$ is not defined, we cannot obtain unique estimates of $\beta$.

ex)

$$Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + \varepsilon_i, \quad \text{where } X_{3i} = c_1 + c_2 X_{2i}$$
Near multicollinearity
Near (or high) linear relationship between explanatory variables.
Since the columns of $X$ are nearly linearly dependent, $|X'X| \approx 0$, and $\mathrm{Var}(b) = \sigma^2 (X'X)^{-1}$ is very large.

ex)

$$NY_t = \beta_1 P_t + \beta_2 GDP_t + \varepsilon_t, \quad \text{where } c_1 P_t + c_2 GDP_t \approx 0 \text{ for some } c_1, c_2$$

($NY_t$: nominal gross income, $P_t$: GDP deflator, $GDP_t$: nominal GDP)
Remark: Multicollinearity is a question of degree (not of kind) and a sample-specific phenomenon.
2. Consequences of Multicollinearity
Large variances and wide confidence intervals for $b$
→ Near collinearity does not destroy the minimum variance property of OLS estimators.
High $R^2$ but few significant t-ratios
The LS estimator and its variance become very sensitive to small changes in the data
Wrong signs of regression coefficients
→ Be careful in attributing a wrong sign to multicollinearity alone, but it should not be ruled out either!
Difficulty in assessing the individual contributions of explanatory variables to $R^2$
3. Detection of Near Multicollinearity
Small t-ratios with a high $R^2$

Auxiliary regressions
Regress $X_2$ on the remaining $X$'s and obtain $R_2^2$.
Regress $X_3$ on the remaining $X$'s and obtain $R_3^2$.
...
If $X_i$ is not a linear combination of the other $X$'s, then the $R_i^2$ of that regression should not be significantly different from zero.
→ Test the hypothesis that $R_i^2 = 0$:

$$F = \frac{R_i^2/(k-1)}{(1 - R_i^2)/(n-k)} \sim F(k-1,\, n-k)$$
Small magnitude of $|X^{*\prime}X^{*}|$

$$X^{*\prime}X^{*} = S^{-1/2}\, \tilde X'\tilde X\, S^{-1/2}$$

where $\tilde X$ is the mean-deviation matrix and $S$ the diagonal matrix of its column sums of squares:

$$\tilde X = \begin{bmatrix}
X_{12} - \bar X_2 & \cdots & X_{1k} - \bar X_k \\
X_{22} - \bar X_2 & \cdots & X_{2k} - \bar X_k \\
\vdots & & \vdots \\
X_{n2} - \bar X_2 & \cdots & X_{nk} - \bar X_k
\end{bmatrix}, \qquad
S = \begin{bmatrix}
\sum_i (X_{i2} - \bar X_2)^2 & & 0 \\
& \ddots & \\
0 & & \sum_i (X_{ik} - \bar X_k)^2
\end{bmatrix}$$

Thus $X^{*\prime}X^{*}$ is the correlation matrix of the regressors, and a determinant close to zero signals near multicollinearity.
The variance inflation factor (VIF)

$$\mathrm{VIF}_i = i\text{th diagonal element of } (X^{*\prime}X^{*})^{-1} = \frac{1}{1 - R_i^2}$$
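A sketch of these diagnostics on simulated near-collinear data, computing the auxiliary $R_i^2$, the $F$ statistic above, and the VIF; the data-generating numbers are invented.

```python
import numpy as np

def r_squared(y, X):
    """R^2 from regressing y on X (X must include an intercept column)."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    tss = (y - y.mean()) @ (y - y.mean())
    return 1 - resid @ resid / tss

rng = np.random.default_rng(4)
n, k = 100, 3                            # k regressors incl. the intercept
x2 = rng.normal(size=n)
x3 = 2 * x2 + rng.normal(0, 0.1, n)      # X_3 nearly collinear with X_2

# Auxiliary regression: X_2 on the remaining regressors.
R2_i = r_squared(x2, np.column_stack([np.ones(n), x3]))
F = (R2_i / (k - 1)) / ((1 - R2_i) / (n - k))
print("R2_i =", R2_i, "F =", F, "VIF =", 1 / (1 - R2_i))
```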
4. Remedial Measures
Dropping some variables from the model
→ Without some theoretical justification, this remedy can lead to model specification errors!
Acquiring additional data or a new sample
Prior information about some parameters
ex)

$$Y_i = \beta_1 + \beta_2 X_{i2} + \beta_3 X_{i3} + \varepsilon_i
= \beta_1 + \beta_2 X_{i2} + 0.3\, X_{i3} + \varepsilon_i \quad (\text{prior information: } \beta_3 = 0.3)$$

$$\Rightarrow\; Y_i - 0.3\, X_{i3} = \beta_1 + \beta_2 X_{i2} + \varepsilon_i$$
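A sketch of the substitution on simulated data; the true coefficients are invented, and the prior $\beta_3 = 0.3$ is the one from the example above.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
x2 = rng.normal(size=n)
x3 = x2 + rng.normal(0, 0.05, n)                  # highly collinear pair
y = 1.0 + 0.8 * x2 + 0.3 * x3 + rng.normal(0, 0.2, n)

# Impose the prior beta_3 = 0.3 and regress Y - 0.3*X_3 on X_2 alone.
y_star = y - 0.3 * x3
X = np.column_stack([np.ones(n), x2])
print(np.linalg.solve(X.T @ X, X.T @ y_star))     # approximately [1.0, 0.8]
```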
Transformation of data
ex)

$$NY_t = \beta_1 P_t + \beta_2 GDP_t + \varepsilon_t
\;\Rightarrow\; NY_t = \beta_1 + \beta_2\, gdp_t + \varepsilon_t$$

($GDP$: nominal GDP, $gdp$: real GDP)
Alternative estimation method: ridge regression
The ridge regression estimator is

$$\hat\beta(\lambda) = (X'X + \lambda I)^{-1} X'Y.$$

This biased estimator has a covariance matrix smaller than that of $b$. The tradeoff of some bias for smaller variance may be worth making, but economists are generally averse to biased estimators, so this approach has little practical use.
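A minimal sketch of the ridge estimator on simulated near-collinear data; the $\lambda$ grid is arbitrary.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimator b(lambda) = (X'X + lambda*I)^{-1} X'y."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

rng = np.random.default_rng(6)
n = 100
x2 = rng.normal(size=n)
x3 = x2 + rng.normal(0, 0.05, n)            # near-collinear regressors
X = np.column_stack([x2, x3])
y = X @ np.array([1.0, 1.0]) + rng.normal(0, 0.5, n)

for lam in (0.0, 1.0, 10.0):                # lam = 0 is plain OLS
    print(lam, ridge(X, y, lam))            # shrinkage stabilizes estimates
```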
5. Is Multicollinearity Necessarily Bad?
If the goal of the study is to use the model to predict or forecast the future mean value of the dependent variable, collinearity may not be bad.
If the objective of the study is not only prediction but also reliable estimation of the individual parameters of the chosen model, then serious collinearity may be bad, because it leads to large standard errors of the estimators.