
Chapter 8: Categorical Data Analysis

Murtaza Haider, Ph.D.


(murtaza.haider@ryerson.ca)
What are Categorical Data?
Categorical data deal with situations where the outcome of an experiment or a process can be categorized into
a finite number of mutually exclusive classes or categories. For instance, a survey of labour force
participation will record an adult’s status as either employed or unemployed. Thus the individual’s status can
be expressed as a dichotomous variable (1/0) where 1 denotes the outcome of being employed and 0 denotes the
outcome of being unemployed. Other examples of categorical data include scenarios involving a choice
between the make of new automobiles. For instance, consider the situation where the national make of the
automobile is being researched. Automobile executives are interested in learning about the determinants of
the consumer choice. In this case, the choice could again be represented as a dichotomous variable carrying
the value ‘1’ if the consumer chooses an American make, and ‘0’ otherwise. Similarly, if the choice was
between American, European, or Japanese car manufacturers, the categorical variable representing the choice
would have three categories: 1 (American), 2 (European), and 3 (Japanese). A categorical variable that
represents more than two outcomes is called a multinomial variable.
The above-mentioned examples are of unordered outcomes. We could have coded American as 2,
Japanese as 1 and European as 3 in the previous example. The change in the order does not have an impact on
the analysis because the order of alternatives is rather arbitrary. There are, however, scenarios where the
ordering of outcomes matters. For instance, consider a study of automobile ownership where households are
coded as follows:
Table 1: Coding of ordinal variables

Category    Description
0           Households without cars
1           Households owning 1 car
2           Households owning 2 cars
3           Households owning 3 cars
4           Households owning 4 or more cars

In the above example, there is a natural ordering, which suggests that households categorized as 2 own more
cars than the households categorized as 1 or 0. In this particular case, we cannot arbitrarily change the order.
Such data, where the order of outcomes is not arbitrary, but rather systematic, is called ordered data, which is
also a type of categorical data.
The use of a categorical variable, such as gender, as an explanatory variable is common in OLS regression
models. As an explanatory variable, a categorical variable captures the systematic differences latent in data
that cannot be accounted for by other variables in the model. For instance, if there is a systematic difference
between the wages of men and women in a particular profession, the gender variable will capture the
gender-based wage differentials. However, when the dependent variable in an econometric model is
categorical rather than continuous, the use of conventional regression (OLS) techniques is no longer
appropriate. The OLS models are therefore modified to account for the categorical dependent variables. Such
modified models are called discrete choice, categorical, limited dependent variable, or qualitative response
models.
We begin this chapter with a discussion of binomial variables and their analysis. This is followed by a
discussion of multinomial variables and their analysis, followed by a discussion of discrete choice models.
This chapter explains the theory and estimation of Binomial, Multinomial, and Conditional Logit models.
The discussion about the estimation uses examples of the estimation routines available in SPSS.

Analysis of categorical data


A wide variety of statistical techniques, methods, and models are available to analyze categorical data. Simple
cross tabulations are commonly used to analyze categorical variables. To illustrate this point, we use a data
set from a study of labour force participation of women (Mroz, 1987). The data contain information on 753
white, married women between the ages of 30 and 60 years. The dependent variable, lfp, reports a woman's
status as employed (1), or otherwise (0). The other variables are described below:
Table 2: Description of variables in the labour force study

Variable Description
lfp Paid Labor Force: 1=yes 0=no
k5 Number of children less than 6 years old
k618 Children between 6 and 18 years of age
age Wife's age in years
wc Wife College: 1=yes 0=no
hc Husband College: 1=yes 0=no
lwg Log of wife's estimated wages
inc Family income excluding wife's in thousands

Table 3: Simple tabulations of the labour force data

Paid Labor Force: 1=yes 0=no

                   Frequency   Percent   Valid Percent   Cumulative Percent
Valid   NotInLF    325         43.2      43.2            43.2
        inLF       428         56.8      56.8            100.0
        Total      753         100.0     100.0

The above table shows that 325 women (43.2%) in the sample were unemployed, whereas the other 56.8%
were employed. We are interested in determining the relationship between the educational attainment of
husband and wife and a woman's status in the labour force. The hypothesis that we would like to test is the
following: if a woman has received college education, she may be more likely to be in the labour force. For
this, we perform a cross-tabulation in SPSS and select the chi-square option in the dialogue box.
Table 4: Cross tabulation of labour force participation and woman’s education

Crosstab

                                                       Wife College: 1=yes 0=no
                                                       NoCol     College   Total
Paid Labor Force:   NotInLF   Count                    257       68        325
1=yes 0=no                    % within Wife College    47.5%     32.1%     43.2%
                    inLF      Count                    284       144       428
                              % within Wife College    52.5%     67.9%     56.8%
Total                         Count                    541       212       753
                              % within Wife College    100.0%    100.0%    100.0%

The above table suggests that 67.9% of college-educated women were employed against 52.5% of women who
did not attend college. Now we would like to know if the association between the wife's education and
participation in the labour force has any statistical significance. We use the chi-square statistic to test the
significance of the association between the two variables.

Table 5: Chi-square test for the association between woman’s education and employment status

Chi-Square Tests

                                Value        df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                                  (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square              14.780 (b)   1    .000
Continuity Correction (a)       14.158       1    .000
Likelihood Ratio                15.076       1    .000
Fisher's Exact Test                                             .000         .000
Linear-by-Linear Association    14.761       1    .000
N of Valid Cases                753
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 91.50.

The column Asymptotic Significance (two-sided) in the above table indicates the significance of the
relationship. A low significance value of 0.05 or less suggests that there may be a relationship between the
two variables. In the above case, the significance value of .000 suggests that there is a statistically significant
relationship between women's education and their participation in the labour force. But what about the
relationship between the educational attainment of women's husbands and their participation in the workforce? A
cross-tab suggested that 60% of the women whose husbands received college education were employed
against 55% of the women whose husbands did not attend college. The chi-square test returned
a high significance value of .160, suggesting that there was no statistically significant relationship between
the husband's educational attainment and the wife's participation in the labour force.
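The Pearson chi-square reported in Table 5 can be reproduced by hand from the counts in Table 4. The following is a minimal sketch in Python (the chapter itself uses SPSS; the function name is illustrative). It assumes a 2x2 table, for which the test has one degree of freedom, so the p-value can be obtained from the complementary error function.

```python
import math

def pearson_chi_square_2x2(table):
    """Return (chi-square, p-value) for a 2x2 table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Rows: NotInLF, inLF. Columns: NoCol, College (counts from Table 4).
observed = [[257, 68], [284, 144]]
chi2, p_value = pearson_chi_square_2x2(observed)
print(f"chi-square = {chi2:.3f}, p-value = {p_value:.5f}")
```

The statistic agrees with the Pearson Chi-Square row of Table 5, and the p-value rounds to the .000 shown by SPSS.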

Econometric models of binomial data


We derive the binary Logit model as a latent variable model following Long and Freese (2005). Assume a
latent variable $y^*$ that ranges from $-\infty$ to $\infty$. The latent variable $y^*$ is related to the observed
independent variables as per the following equation:
$$y_i^* = x_i\beta + \varepsilon_i \tag{1}$$
where $i$ indexes the observation and $\varepsilon$ is the random error. The above equation is similar to the OLS model;
however, the dependent variable is unobserved. The following equation links the latent variable $y^*$ with the
observed variable $y$:
$$y_i = \begin{cases} 1 & \text{if } y_i^* > 0 \\ 0 & \text{if } y_i^* \le 0 \end{cases}$$

Now returning to the example of the labour force survey of married women, we set y = 1 if the woman is in the
labour force and y = 0 if she is unemployed. The independent variables include number of children,
education, and expected income. Now consider that a woman may be about to leave her job while another
woman is steadfast in her career. Regardless of their intentions, in both instances we only observe y = 1.
Now consider that there is an "underlying propensity to work" that manifests itself as being employed, y = 1, or
unemployed, y = 0. We are not able to observe the propensity directly; however, at some point a change in $y^*$
results in a change in y from 1 to 0 or from 0 to 1.
Thus we can express the model as follows:
$$\Pr(y = 1 \mid x) = \Pr(y^* > 0 \mid x) \tag{2}$$
We can substitute the structural model in the above equation to get the following:
$$\Pr(y = 1 \mid x) = \Pr(x_i\beta + \varepsilon_i > 0 \mid x) = \Pr(\varepsilon_i > -x_i\beta \mid x) \tag{3}$$
Assuming that $\varepsilon$ is distributed logistically with variance $\pi^2/3$, the Logit model can be expressed as
$$\Pr(y = 1 \mid x) = \frac{\exp(x_i\beta)}{1 + \exp(x_i\beta)} \tag{4}$$
Unlike the OLS model, where the variance can be estimated because the dependent variable is observed, in
the binary Logit model the variance is assumed because the dependent variable, $y^*$, is latent. One can argue
that OLS can be used to estimate the model with the observed 0/1 outcome. However, this leads to serious
estimation problems. The first and foremost problem is the heteroscedastic error terms. Since $x_i\beta + \varepsilon_i$
can only be 0 or 1, $\varepsilon_i$ is equal to either $-x_i\beta$ or $1 - x_i\beta$. In such a case, the variance is
given by (Greene (1985), p.874):
$$\operatorname{Var}(\varepsilon_i \mid x) = x_i\beta\,(1 - x_i\beta) \tag{5}$$
The above equation shows that the variance of $\varepsilon_i$ changes with x, so the errors are heteroscedastic. Other
major problems include the fact that $x_i\beta$ cannot be constrained to the 0 to 1 interval and that one could not
avoid negative variances.
Estimation of binary Logit models
To identify the binary model, we always set one category or alternative as the base case. The estimated
coefficients are then interpreted as a comparison with the base case. Using the labour force example, the
probability of being employed is given by:
$$\Pr(\text{work}) = \frac{\exp(V_w)}{\exp(V_w) + \exp(V_u)} \tag{7}$$
and the probability of being unemployed is given by:
$$\Pr(\text{unemp}) = 1 - \Pr(\text{work}) \tag{8}$$
In this particular case, we set the coefficients for being unemployed to 0. This implies that
$$V_u = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n = 0$$
Therefore,
$$\Pr(\text{work}) = \frac{\exp(V_w)}{\exp(V_w) + 1}, \quad \text{since } \exp(0) = 1.$$

If we divide both the numerator and the denominator by $\exp(V_w)$, we have

Prwork = 1+
1
1
= 1
1+exp −Vw
[9]
exp V w

Prunemp = 1 − Prwork = 1 − 1
1+exp −Vw

exp −Vw
Prunemp = 1+exp −Vw
[10]

Odds Ratio
The odds ratio between the two outcomes is expressed as follows:
$$\frac{\Pr(\text{work})}{\Pr(\text{unemp})} = \frac{\dfrac{1}{1 + \exp(-V_w)}}{\dfrac{\exp(-V_w)}{1 + \exp(-V_w)}} = \frac{1}{\exp(-V_w)} = \exp(V_w) \tag{11}$$

Log of odds
And finally, the log of odds is expressed as
$$\ln\frac{\Pr(\text{work})}{\Pr(\text{unemp})} = \ln(\exp(V_w)) = V_w,$$
which is equal to $x_i\beta$.

Let us revisit the dataset of labour force participation of married women and estimate a Binary Logit model of
participation in the labour force using income, children, and education of women and their husbands as
explanatory variables.
Interpreting binary Logit model and statistical inference
Table 6 lists the coefficients of a Binary Logit model that estimates the probability of being employed for
married, white women. The column B lists the estimated coefficients (Betas), while the odds are presented in
the last column, Exp(B). We first begin with the interpretation of coefficients. It is better to use the odds
ratio than the actual coefficients to interpret the model. Variable k5 represents number of young children
under the age of 6 in a household. The coefficient, β, is equal to −1.463, and the odds are expressed as
exp(−1.463) = 0.232. This implies that each additional young child decreases the odds of the mother being
employed by a factor of 0.23, all else being equal. The odds of college-educated women being employed are
2.242 times the odds of women who did not receive college education.

Table 6: Estimation of Binary Logit model of labour force participation

Parameter Estimates

Paid Labor Force:
1=yes 0=no (a)        B        Std. Error   Wald     Sig.   Exp(B)
inLF    Intercept     3.182    .644         24.387   .000
        k5            -1.463   .197         55.144   .000   .232
        k618          -.065    .068         .902     .342   .937
        age           -.063    .013         24.189   .000   .939
        wc            .807     .230         12.321   .000   2.242
        hc            .112     .206         .294     .588   1.118
        lwg           .605     .151         16.076   .000   1.831
        inc           -.034    .008         17.611   .000   .966
a. The reference category is: NotInLF.

The impact of age could be interpreted as follows: with an increase in age of one year, the odds of being
employed decline by a factor of 0.94. But what if one is interested in determining the impact of being 10
years older, rather than being just one year older? The general formula for the change in odds is $\exp(\beta\delta)$,
where $\delta$ is the change in the number of units. For a 10-year change in age, the odds of working decline by a
factor of $\exp(-0.0628 \times 10) = 0.53$. That is, the odds decline by almost 50%. If we were interested in
determining the odds of not working, we can simply take the inverse of $\exp(\beta)$. Thus, each additional young
child increases the odds of being unemployed by a factor of $1/0.232 = 4.31$.
We can also interpret the results as a percentage change in odds using the formula $100[\exp(\beta_k\delta) - 1]$. Again,
each additional young child decreases the odds of being employed by 77% ($100[\exp(-1.4629) - 1] = -76.8\%$).
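The conversions used above can be checked with a few lines of Python. This is only an illustrative sketch: the coefficients are copied from column B of Table 6, and the helper names are hypothetical rather than part of any estimation package.

```python
import math

# Coefficients (column B) copied from Table 6.
coefficients = {
    "k5": -1.463, "k618": -0.065, "age": -0.063, "wc": 0.807,
    "hc": 0.112, "lwg": 0.605, "inc": -0.034,
}

def odds_factor(beta, delta=1.0):
    """Multiplicative change in the odds for a change of delta units."""
    return math.exp(beta * delta)

def percent_change_in_odds(beta, delta=1.0):
    """Percentage change in the odds for a change of delta units."""
    return 100.0 * (math.exp(beta * delta) - 1.0)

for name, beta in coefficients.items():
    print(f"{name:5s} Exp(B) = {odds_factor(beta):.3f}  "
          f"% change = {percent_change_in_odds(beta):+.1f}")

# Ten additional years of age (the text uses the unrounded coefficient).
ten_year_factor = odds_factor(-0.0628, delta=10)
# Odds of NOT working for each additional young child.
inverse_k5 = 1.0 / odds_factor(coefficients["k5"])
print(f"10 years of age: {ten_year_factor:.2f}, inverse for k5: {inverse_k5:.2f}")
```

Small differences from the Exp(B) column of Table 6 in the third decimal arise because the table reports coefficients rounded to three places.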
Statistical Inference of Binary Logit models
Statistical inference for Logit models is similar to that for OLS models. Instead of the t-statistic (or
Z-statistic) to evaluate the statistical significance of each coefficient, SPSS uses the Wald statistic, which is
expressed as $(\text{Coefficient}/\text{SE})^2$.
Estimation software also reports the significance level for the Wald statistic. It has been observed that when the
estimated coefficient is very large, the corresponding standard error is also very large, thus returning a very
small value for the Wald statistic. This often leads one to fail to reject the null hypothesis that the estimated
parameter is equal to 0. In cases where the model returns a large coefficient for a variable, the Wald statistic
may not be the best instrument to evaluate the parameter. One may want to rescale the variable in such a case.
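As a rough check, the Wald column of Table 6 can be recomputed from the B and Std. Error columns. The sketch below is an illustration, not the SPSS routine; it assumes the Wald statistic is compared with a chi-square distribution with one degree of freedom, and the function name is hypothetical.

```python
import math

def wald_test(coefficient, std_error):
    """Return (Wald statistic, p-value), assuming 1 degree of freedom."""
    wald = (coefficient / std_error) ** 2
    # For 1 degree of freedom, P(chi2 > w) = erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return wald, p_value

# (B, Std. Error) pairs copied from Table 6.
estimates = {"k5": (-1.463, 0.197), "k618": (-0.065, 0.068), "hc": (0.112, 0.206)}

results = {name: wald_test(b, se) for name, (b, se) in estimates.items()}
for name, (wald, p_value) in results.items():
    print(f"{name:5s} Wald = {wald:7.3f}  Sig. = {p_value:.3f}")
```

The values differ slightly from Table 6 because the table reports B and the standard errors rounded to three decimals.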
Another more informative and reliable method is the likelihood ratio test. Each variable from the final model
is eliminated in turn and the reduced model is estimated to obtain -2 * log-likelihood (-2LL). The procedure is
repeated for every variable in the final model. The likelihood ratio test returns the change in the value of -2LL if
the effect is removed from the final model. The difference between -2LL for the reduced model
and -2LL for the final model has a chi-square distribution when the coefficient for the variable is 0. The
significance level for the chi-square can thus be used to evaluate the relative significance of the effect. The
overall fit can also be evaluated from -2LL for the model. A smaller value for -2LL suggests a better fit. If a
model returns a perfect fit, the likelihood = 1 and -2LL = 0. The model's chi-square is given by:
$$\chi^2 = (-2LL_{\text{intercept}}) - (-2LL_{\text{final}}) \tag{12}$$
If the observed significance level is small (0.000) for χ 2 , we can reject the null hypothesis that coefficients for
variables in the final model are equal to 0. The interpretation of this statistic is similar to the interpretation of
F-statistics in the OLS tradition. The SPSS output for the above-mentioned model suggests that the
significance level of the test is very small, leading us to reject the null hypothesis that coefficients for
variables in the final model are equal to 0.
Table 7: Goodness-of-fit statistics

Model Fitting Information

Model            -2 Log Likelihood   Chi-Square   df   Sig.
Intercept Only   1029.746
Final            905.266             124.480      7    .000

Pseudo R-Square
Cox and Snell    .152
Nagelkerke       .204
McFadden         .121

Other measures of goodness-of-fit statistics include McFadden R-square expressed as follows:


l0−lB lB
R 2McFadden = l0
= 1− l0
[13]

where l0 is the kernel of the log-likelihood of the intercept-only model (only information in the model are
sample shares), while lB is the kernel of the log-likelihood of the final model. This formulation of
McFadden R-square has been adopted in the logistic regression estimation techniques in some software, e.g.
SPSS, which automatically generates this and other goodness-of-fit statistics.
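The figures in Table 7 can be recovered from the two -2LL values and the sample size of 753. The sketch below is illustrative: the McFadden measure follows equation [13], while the Cox and Snell and Nagelkerke formulas are the standard textbook definitions, which are not derived in this chapter and are therefore an assumption about what SPSS computes.

```python
import math

def fit_statistics(neg2ll_intercept, neg2ll_final, n_obs):
    """Model chi-square and pseudo R-squares from -2LL values."""
    chi_square = neg2ll_intercept - neg2ll_final            # equation [12]
    # The ratio of -2LL values equals the ratio of log-likelihoods.
    mcfadden = 1.0 - neg2ll_final / neg2ll_intercept        # equation [13]
    # Standard definitions (assumed, not taken from the chapter).
    cox_snell = 1.0 - math.exp(-chi_square / n_obs)
    nagelkerke = cox_snell / (1.0 - math.exp(-neg2ll_intercept / n_obs))
    return {
        "chi_square": chi_square,
        "mcfadden": mcfadden,
        "cox_snell": cox_snell,
        "nagelkerke": nagelkerke,
    }

stats = fit_statistics(1029.746, 905.266, 753)
for name, value in stats.items():
    print(f"{name:11s} {value:.3f}")
```

All four values agree with Table 7 to three decimals, which suggests the assumed definitions match the reported ones.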
For Logit models, R-square of 0.07 and higher reflects a good fit. In fact, Louviere et al (2000) have argued
that $\rho^2$ values of 0.2 to 0.4 are "considered to be indicative of extremely good model fits." They have cited a
simulation experiment by Domencich and McFadden (1975) who have "equivalenced this range to 0.7 to 0.9
for a linear function."

Multinomial Logit Models


The preceding discussion leads us into the workings of the Multinomial Logit model. Consider the travel
mode choice problem where an individual may have the following three options to commute to work: auto
drive, public transit, and a non-motorized mode, such as bike or walk. We can code the choice set as 1, 2, 3.
The model is represented as follows:
$$\operatorname{Prob}(Y_i = j) = \frac{e^{\beta_j' x_i}}{\sum_{k=1}^{3} e^{\beta_k' x_i}} \tag{14}$$

In the above equation, $\beta_j$ is the vector of coefficients for the variables $x_i$ when $Y_i = j$. The subscript i on x
suggests that it varies across the decision makers (i) and the subscript j on β suggests that it varies across
choices (j). The above model will return a set of probabilities for J alternatives for the decision-maker with
characteristics $x_i$. It will also return J − 1 non-redundant baseline logits. As mentioned earlier, we normalize
the Multinomial Logit model by assuming that one set of parameters is equal to 0, i.e., $\beta_1 = 0$, therefore
$e^{\beta_1' x_i} = 1$. The choice for the base case, whose coefficients are set to 0, is completely arbitrary. The
probabilities are therefore expressed as follows:
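$$\operatorname{Prob}(Y = j) = \frac{e^{\beta_j' x_i}}{1 + \sum_{k=2}^{J} e^{\beta_k' x_i}} \quad \text{for } j = 2, \dots, J, \tag{15}$$
$$\operatorname{Prob}(Y = 1) = \frac{1}{1 + \sum_{k=2}^{J} e^{\beta_k' x_i}} \tag{16}$$
Remember that we arbitrarily set the coefficients of alternative 1 to 0.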


For the multinomial case, let’s say we have three modes: i) auto, ii) transit, iii) walk. We will have two logits,
i.e., two sets of parameters for two choices, while the third choice will serve as the reference category. Let’s
put walk as the reference category in the following example.
Pauto
g 1 = ln Pwalk
= β a0 + β a1 X 1 +. . . +β a2 X 2 [17]
Ptransit
g 2 = ln Pwalk
= β t0 + β t1 X 1 +. . . +β t2 X 2 [18]

g3 = 0 [19]
expg 1  expg 1 
Pauto = expg 1 +expg 2 +expg 3 
= 1+expg 1 +expg 2 
[20]
expg 2 
Ptransit = 1+expg 1 +expg 2 
[21]

Pwalk = 1
1+expg 1 +expg 2 
[22]
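Equations [20] to [22] translate directly into code. The following Python sketch is illustrative only: the logit values passed to the function are hypothetical and do not come from an estimated model.

```python
import math

def mnl_probabilities(g_auto, g_transit):
    """Probabilities for auto, transit and walk, with walk as the reference (g3 = 0)."""
    denominator = 1.0 + math.exp(g_auto) + math.exp(g_transit)
    return {
        "auto": math.exp(g_auto) / denominator,        # equation [20]
        "transit": math.exp(g_transit) / denominator,  # equation [21]
        "walk": 1.0 / denominator,                     # equation [22]
    }

# Hypothetical logits for one decision-maker.
probs = mnl_probabilities(g_auto=1.2, g_transit=0.4)
print(probs)
# The log-odds against the reference category recover the logits.
print(math.log(probs["auto"] / probs["walk"]))
```

The three probabilities sum to one, and the log of each odds against walk returns the corresponding logit, as equations [17] and [18] require.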

Interpretation of Multinomial Logit models


The interpretation of coefficients in a multinomial Logit model remains a complicated affair. It is possible to
have a decline in $P_{ij}$ with an increase in an explanatory variable $x_{ij}$ that returns a positive coefficient $\beta_{ij}$.
A model should therefore be interpreted in terms of the odds ratio. In the log-odds, $\ln(P_{ij}/P_{i0}) = \beta_j' x_i$, a
positive coefficient for a continuous explanatory variable suggests that the odds of registering an observation
in category j rather than in the reference category increase with that particular variable. Similarly, a
negative coefficient for the explanatory variable suggests that the odds of the baseline outcome relative to
the outcome for category j increase with that variable.
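The point that a probability can fall even when its coefficient is positive is easiest to see numerically. The sketch below uses hypothetical coefficients for a three-category model with a single regressor and no intercepts; category 0 is the reference.

```python
import math

def probabilities(x, betas):
    """Multinomial Logit probabilities; betas[0] must be 0.0 (reference category)."""
    weights = [math.exp(beta * x) for beta in betas]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical coefficients: both non-reference categories have positive slopes.
betas = [0.0, 0.5, 2.0]

p_low = probabilities(2.0, betas)
p_high = probabilities(3.0, betas)

print("P(category 1) at x=2:", round(p_low[1], 4))
print("P(category 1) at x=3:", round(p_high[1], 4))
print("odds 1 vs 0 at x=2:", round(p_low[1] / p_low[0], 3))
print("odds 1 vs 0 at x=3:", round(p_high[1] / p_high[0], 3))
```

The probability of category 1 declines as x rises because category 2 gains share faster, while the odds of category 1 against the reference still increase, which is why the odds are the safer basis for interpretation.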
Here we are reproducing a model from Powers and Xie (2000) to explain the interpretation of the estimated
logistic regression model. A sample of 978 young men between the ages of 20 and 22 was
collected where their major activity was coded as working, school, or inactive. Regressors included a binary
variable Black (1 if Black), NONINT (1 if the family is not intact), FCOL (1 if father has some or more
college education), FAMINC (family income in thousands of dollars), UNEMP80 (local unemployment rate
in 1980), and ASVAB (a scholastic test score). The reference category in the model was being inactive instead
of being employed or being in school.
Table 8: Output from a Multinomial Logit model

          Variable   Coefficient   SE      t-Stat   EXP(B)
Working   Constant   0.726         0.347   2.091    2.07
          Black      -0.444        0.219   -2.032   0.64
          NONINT     -0.134        0.192   -0.699   0.87
          FCOL       0.180         0.241   0.745    1.20
          FAMINC     0.407         0.211   1.930    1.50
          UNEMP80    -0.071        0.037   -1.903   0.93
          ASVAB      0.308         0.110   2.794    1.36

School    Constant   0.359         0.333   1.078    1.43
          Black      0.229         0.196   1.166    1.26
          NONINT     -0.547        0.186   -2.941   0.58
          FCOL       0.241         0.235   1.025    1.27
          FAMINC     0.268         0.209   1.283    1.31
          UNEMP80    0.012         0.035   0.361    1.01
          ASVAB      0.177         0.107   1.658    1.19

The first difference you will notice is that there are two sets of coefficients. One set of coefficients estimates
the odds of working against being inactive, and the other set measures the odds of being in school
against being inactive. The odds of Black men working rather than being inactive were exp(−0.444) = 0.64 times
those of whites and others. Stated otherwise, the odds of non-Blacks working were 1/0.64 = 1.56 times those of
Blacks. The estimated coefficient (−0.444) measures the change in log-odds
(LN[Prob-Work/Prob-Inactive]) when the variable Black is increased by one unit, i.e. from 0 to 1, whereas
exp(−0.444) = 0.64 gives the ratio of the odds (Prob-Work/Prob-Inactive) of working against being inactive when
Black = 1 to the odds when Black = 0. Note that if one would like to determine the odds of Blacks being inactive
rather than working, the odds are 1/exp(−0.444) = 1.56 times those of non-Blacks, the same as the odds for
non-Blacks working against being inactive. Similarly, the odds of young men from intact families being in school
against being inactive were 1/exp(−0.547) = 1.73 times as high as those of young men from broken homes. As for
continuous explanatory variables, we can see that the odds of working or being in school against being inactive
increase with family income and test score. A unit increase in family income increases the odds of attending
school by (exp(0.268) − 1) × 100 = 30.7%.

Conditional Logit Models
So far our discussion has focussed on models where the characteristics of the decision-maker, such as age,
income and the like, have been used as regressors. But what about the situations where the attributes of the
choice may also have an impact on the outcome? The Binary and Multinomial Logit models explained in the
previous sections cannot deal with situations where the outcome is also impacted by the attributes of the choice.
Consider for instance the travel mode choice problem explained earlier. Both the characteristics of the
decision-maker and the attributes of the choice, such as travel time and cost by mode, impact the outcome.
To deal with this problem, Professor Daniel McFadden, the 2000 Nobel Laureate in Economics, developed the
Conditional or McFadden Logit model. The Conditional Logit model has been widely applied in modelling
choices in diverse fields such as market research, economics, psychology, and travel demand analysis.
Random Utility Model
The Conditional Logit model can be derived using Random Utility theory. Let us assume that a
decision-maker is faced with two choices, a and b. Let U a represent the utility of alternative a and
U b represent the utility of alternative b. The rational decision-maker will opt for the alternative that maximizes
his or her utility. In addition, the utility can be divided into two components, the observed and unobserved
part. The linear random utility model can be expressed as
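$$U_a = \beta_a' X + \varepsilon_a \quad \text{and} \quad U_b = \beta_b' X + \varepsilon_b$$
If Y = 1 denotes the consumer's choice for alternative a,
$$\begin{aligned}
\operatorname{Prob}(y = 1 \mid x) &= \operatorname{Prob}(U_a > U_b) = \operatorname{Prob}(\beta_a' x + \varepsilon_a - \beta_b' x - \varepsilon_b > 0 \mid x) \\
&= \operatorname{Prob}((\beta_a - \beta_b)' x + \varepsilon_a - \varepsilon_b > 0 \mid x) \\
&= \operatorname{Prob}(\beta' x + \varepsilon > 0 \mid x)
\end{aligned} \tag{23}$$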
If Y is assumed to be a random variable, it can be shown that
$$\operatorname{Prob}(Y_i = j) = \frac{e^{\beta' z_{ij}}}{\sum_{j=1}^{J} e^{\beta' z_{ij}}} \tag{24}$$
For the Conditional Logit model, $z_{ij} = [x_{ij}, w_i]$. If $x_{ij}$ represents the attributes of the choices, the
subscript 'ij' on x suggests that it varies across the decision makers (i) and choices (j), whereas $w_i$ represents
the characteristics of the decision maker (i) and hence does not vary across alternatives. We can re-write the
above equation as follows:
$$\operatorname{Prob}(Y_i = j) = \frac{e^{\beta' x_{ij} + \alpha' w_i}}{\sum_{j=1}^{J} e^{\beta' x_{ij} + \alpha' w_i}} = \frac{e^{\beta' x_{ij}}\, e^{\alpha' w_i}}{\sum_{j=1}^{J} e^{\beta' x_{ij}}\, e^{\alpha' w_i}} \tag{25}$$

It is useful to note that terms that do not vary across alternatives – that is, those specific to the individual – fall
out of the probability. Therefore the above equation can be simplified as follows:
$$\operatorname{Prob}(Y_i = j) = \frac{e^{\beta' x_{ij}}}{\sum_{j=1}^{J} e^{\beta' x_{ij}}} \tag{26}$$
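The cancellation that takes equation [25] to equation [26] can be verified numerically. In the sketch below the utility values are hypothetical: `attribute_utilities` stands in for $\beta' x_{ij}$ and `individual_term` for $\alpha' w_i$, which is the same for every alternative.

```python
import math

def conditional_logit_probs(utilities):
    """Choice probabilities from a list of systematic utilities."""
    # Subtracting the maximum leaves the probabilities unchanged
    # and avoids overflow in exp().
    largest = max(utilities)
    weights = [math.exp(u - largest) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical values of beta'x_ij for three alternatives.
attribute_utilities = [0.5, -0.2, 0.1]
# Hypothetical value of alpha'w_i, identical across alternatives.
individual_term = 1.7

probs_without = conditional_logit_probs(attribute_utilities)
probs_with = conditional_logit_probs([u + individual_term for u in attribute_utilities])

print(probs_without)
print(probs_with)
```

Both calls return the same probabilities, which illustrates why a regressor that is constant across alternatives cannot be identified unless it is interacted with something that varies by alternative.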

To create individual-specific effects, Greene (1997) suggests that a set of dummy variables could be created
for the choices, which can then be multiplied with w i . This method is analogous to the creation of interaction
terms in OLS models. For example, we can use the attributes of shopping centres as regressors along with the
characteristics of the shoppers while modelling the choice of a shopping centre. The assumption is that a
shopper is likely to choose the destination that helps minimize his or her shopping trip distance and offers the
most diverse shopping experience (number of stores). Note that the attributes, such as the number of shops and
the distance from the shopper's trip origin, differ across shopping centres, and the distance to each centre also
differs for each trip maker. However, the characteristics
of the trip maker are the same for all alternatives. In the following table, two decision-makers are faced with
three choices for shopping destination. The regressors are number of stores at each location, distance to the
shopping centre, and income. It is obvious from the table that income does not change over alternatives for
each decision-maker and hence if added as a regressor in the model, income will fall out of the probability
equation.
Table 9: Data sample for Conditional Logit models

No.   Shopper        Alternatives    Stores   Distance (km)   Income (000)   Choice
1     David Miller   Eaton Centre    125      1.8             145            1
1     David Miller   Square 1 Mall   175      15.4            145            0
1     David Miller   Fairview Mall   100      7.5             145            0
2     Mel Lastman    Eaton Centre    125      7.5             250            0
2     Mel Lastman    Square 1 Mall   175      12.8            250            1
2     Mel Lastman    Fairview Mall   100      3.5             250            0

The way to accommodate income as a regressor is to introduce alternative-specific dummy variables and
multiply them with the common characteristics of the individual decision-maker.

Table 10: Example of alternative-specific income variable for Conditional Logit models

Shopper        Alternatives    Stores   Distance (km)   Inc-Eaton   Inc-Sq.One   Inc-Fairview
David Miller   Eaton Centre    125      1.8             145         0            0
David Miller   Square 1 Mall   175      15.4            0           145          0
David Miller   Fairview Mall   100      7.5             0           0            145
Mel Lastman    Eaton Centre    125      7.5             250         0            0
Mel Lastman    Square 1 Mall   175      12.8            0           250          0
Mel Lastman    Fairview Mall   100      3.5             0           0            250

The income variable is introduced in the model as an alternative-specific variable. For example, the variable
Inc-Eaton will capture the impact of income on the utility of shopping at Eaton Centre, whereas the variable
Inc-Fairview will capture the impact of income on shopping at the Fairview Mall. One can see that by
interacting the characteristics of the decision-maker with the alternative-specific dummies (not shown in the
above table), we have created new variables that vary across alternatives for each decision-maker. Also,
remember not to include all interacted income variables in the utility function because if you add them
together, they will again reproduce the original income variable and hence will drop out of the equation during
estimation. In the above example, include any two interacted income variables in the model.
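The step from Table 9 to Table 10 can be sketched in a few lines of Python. The data rows are copied from Table 9; the function and variable names are hypothetical and only illustrate the interaction of income with alternative-specific dummies.

```python
ALTERNATIVES = ["Eaton Centre", "Square 1 Mall", "Fairview Mall"]

# (shopper, alternative, stores, distance_km, income_000, choice) from Table 9.
table9 = [
    ("David Miller", "Eaton Centre", 125, 1.8, 145, 1),
    ("David Miller", "Square 1 Mall", 175, 15.4, 145, 0),
    ("David Miller", "Fairview Mall", 100, 7.5, 145, 0),
    ("Mel Lastman", "Eaton Centre", 125, 7.5, 250, 0),
    ("Mel Lastman", "Square 1 Mall", 175, 12.8, 250, 1),
    ("Mel Lastman", "Fairview Mall", 100, 3.5, 250, 0),
]

def add_income_interactions(rows):
    """Replace income with income x alternative-specific dummy columns."""
    result = []
    for shopper, alternative, stores, distance, income, choice in rows:
        # The dummy is 1 for the row's own alternative and 0 otherwise.
        interactions = [income if alternative == name else 0 for name in ALTERNATIVES]
        result.append((shopper, alternative, stores, distance, *interactions, choice))
    return result

table10 = add_income_interactions(table9)
for row in table10:
    print(row)

# The three interacted columns always add up to the original income,
# which is why only two of them can enter the model.
row_sums = [sum(row[4:7]) for row in table10]
print(row_sums)
```

Each decision-maker appears once per alternative, so the data set has as many rows as decision-makers times alternatives when everyone faces the same choice set.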
Unlike the Multinomial Logit model, the Conditional Logit model returns 1 set of parameters, regardless of
the number of alternatives. However, the data set has to be conditioned so that each decision-maker is
repeated in the data set for the number of available alternatives, which is evident from the above two tables.
Therefore, the total number of rows in the data set is equal to the number of decision-makers (i) times
available alternatives (j). This is only true if all decision-makers are presented with the same choice set. The
Conditional Logit model allows the modeller to restrict the number of alternatives available to a
decision-maker. Consider the example of mode choice. A trip-maker without a valid driver’s license can be
offered a choice set that excludes the auto-drive mode.
The marginal effects for any variable $x_k$ can be computed by differentiating the Logit model with respect to
the variable $x_k$. Therefore, marginal effects are given by the following equation:
$$\delta_{jk} = \frac{\partial P_j}{\partial x_k} = P_j\,[\mathbf{1}(j = k) - P_k]\,\beta \tag{27}$$
The elasticities of probabilities could be expressed as follows:
$$\frac{\partial \ln P_j}{\partial \ln x_{km}} = x_{km}\,[\mathbf{1}(j = k) - P_k]\,\beta_k = x_{km}\,(1 - P_k)\,\beta_k \tag{28}$$

The above is referred to as direct elasticity, where m indexes the regressor (attribute) variable and j, k index the
alternatives. Consider the following example where we would like to determine the direct elasticity of the
auto-drive mode with respect to the cost of driving. Direct elasticity calculations require the following inputs:
$x_{km}$ is the cost of driving,
$P_k$ is the probability of auto-drive, and
$\beta_k$ is the estimated coefficient for the cost variable.
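The direct elasticity in equation [28] can be checked against a numerical derivative. The sketch below assumes a three-mode Conditional Logit model whose utility depends on cost alone; the costs and the cost coefficient are hypothetical values chosen only for illustration.

```python
import math

def choice_probabilities(costs, beta_cost):
    """Conditional Logit probabilities when utility is beta_cost * cost."""
    weights = [math.exp(beta_cost * cost) for cost in costs]
    total = sum(weights)
    return [w / total for w in weights]

def direct_elasticity(cost_k, p_k, beta_cost):
    """Equation [28] for j = k: x_km * (1 - P_k) * beta_k."""
    return cost_k * (1.0 - p_k) * beta_cost

# Hypothetical costs for auto-drive, transit and walk, and a hypothetical coefficient.
costs = [4.0, 3.0, 1.0]
beta_cost = -0.4

probs = choice_probabilities(costs, beta_cost)
analytic = direct_elasticity(costs[0], probs[0], beta_cost)

# Numerical check: raise the cost of driving by a tiny proportion h.
h = 1e-6
probs_up = choice_probabilities([costs[0] * (1.0 + h)] + costs[1:], beta_cost)
numeric = (math.log(probs_up[0]) - math.log(probs[0])) / math.log(1.0 + h)

print(f"analytic = {analytic:.4f}, numeric = {numeric:.4f}")
```

The elasticity is negative, as expected for a cost variable with a negative coefficient: a one percent increase in the cost of driving lowers the probability of auto-drive by roughly the reported percentage.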
Cross elasticities could be computed as follows:
$$\frac{\partial \ln P_j}{\partial \ln x_{km}} = -x_{jm}\,P_k\,\beta_k \quad \text{where } k \neq j \tag{29}$$

Cross elasticity calculations for change in the auto-drive mode with respect to changes in transit costs require
the following inputs:
x jm is the cost of transit,
P k is the probability of auto-drive, and
β k is the estimated coefficient for the cost variable.
In estimating Conditional Logit models, one is not restricted by the number of choices. Here the "size of the
estimation problem is independent of the number of choices" (Greene,1997, p. 920). Greene further argues
that the number of choices should be restricted to 100. The fact remains that even with 100 choices,
interpretation of the model becomes a major concern. From the behavioural perspective, a decision-maker
seldom undertakes simultaneous evaluation of 100 choices. To assume that a rational decision-maker can
simultaneously evaluate 100 choices is debatable at best.
The Conditional Logit model does not contain a constant term (β 0 in the OLS tradition). The Conditional
Logit model can only include J − 1 alternative-specific constants. In the above-mentioned mode choice
problem involving three alternatives, we can create alternative-specific constants for any two alternatives.
In conditional Logit models, we do not set any category as the base case or set its systematic utility to 0. The
binary choice is presented as conditional Logit:
$$P_{\text{auto}} = \frac{\exp(V_a)}{\exp(V_a) + \exp(V_t)} \tag{30}$$

If we divide both the numerator and the denominator by $\exp(V_a)$, we have
$$P_{\text{auto}} = \frac{1}{1 + \dfrac{\exp(V_t)}{\exp(V_a)}} = \frac{1}{1 + \exp(V_t - V_a)} \tag{31}$$

The above equations present an interesting property of conditional Logit models. We do not observe the
actual utility, but the difference in the utility of two choices.
$$P_{\text{transit}} = 1 - \frac{1}{1 + \exp(V_t - V_a)} = \frac{\exp(V_t - V_a)}{1 + \exp(V_t - V_a)} \tag{32}$$

The odds ratio for the Conditional Logit is therefore given by
$$\frac{P_{\text{auto}}}{P_{\text{transit}}} = \frac{\dfrac{1}{1 + \exp(V_t - V_a)}}{\dfrac{\exp(V_t - V_a)}{1 + \exp(V_t - V_a)}} = \exp(V_a - V_t) \tag{33}$$

And the log of odds is given by
$$\ln\frac{P_{\text{auto}}}{P_{\text{transit}}} = \ln(\exp(V_a - V_t)) = V_a - V_t \tag{34}$$

In case we had a third choice, walk, with the utility function expressed as $V_w$, the probabilities are expressed
as:

$$P_{\text{auto}} = \frac{\exp(V_a)}{\exp(V_a) + \exp(V_t) + \exp(V_w)}$$

$$P_{\text{transit}} = \frac{\exp(V_t)}{\exp(V_a) + \exp(V_t) + \exp(V_w)}$$

$$P_{\text{walk}} = \frac{\exp(V_w)}{\exp(V_a) + \exp(V_t) + \exp(V_w)}$$

If you examine the probability functions carefully, you will notice that we are still dealing with differences in
utilities. Let us divide both the numerator and the denominator of $P_{\text{auto}}$ by $\exp(V_a)$:

$$P_{\text{auto}} = \frac{1}{1 + \dfrac{\exp(V_t)}{\exp(V_a)} + \dfrac{\exp(V_w)}{\exp(V_a)}} = \frac{1}{1 + \exp(V_t - V_a) + \exp(V_w - V_a)} \tag{35}$$
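The three-alternative case can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `conditional_logit_probs` and the utility values are hypothetical. The sketch compares the general formula with equation [35] and checks that a common shift in all utilities does not alter the probabilities.

```python
import math

def conditional_logit_probs(utilities):
    """Return exp(V_j) / sum_k exp(V_k) for a dict mapping alternative -> utility."""
    # Subtracting the largest utility is harmless because only differences in
    # utility matter; it also guards against overflow in exp().
    v_max = max(utilities.values())
    exp_v = {alt: math.exp(v - v_max) for alt, v in utilities.items()}
    total = sum(exp_v.values())
    return {alt: e / total for alt, e in exp_v.items()}

# Hypothetical systematic utilities (illustrative values only).
utilities = {"auto": -0.5, "transit": -1.2, "walk": -2.0}
probs = conditional_logit_probs(utilities)

# Equation [35]: P(auto) expressed through utility differences.
p_auto_eq35 = 1.0 / (
    1.0
    + math.exp(utilities["transit"] - utilities["auto"])
    + math.exp(utilities["walk"] - utilities["auto"])
)

# Adding the same constant to every utility leaves the probabilities unchanged.
shifted = conditional_logit_probs({alt: v + 3.0 for alt, v in utilities.items()})

print({alt: round(p, 4) for alt, p in probs.items()})
```

Because a common shift in all utilities cancels out, a full set of J alternative-specific constants cannot be identified, which is consistent with the limit of J − 1 constants noted earlier.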

Interpretation of Conditional Logit models


Let us define the following two probabilities for events $j$ and $j'$:

$$P_{ij} = \frac{e^{\beta' x_{ij}}}{\sum_{j=1}^{J} e^{\beta' x_{ij}}} \tag{36}$$

$$P_{ij'} = \frac{e^{\beta' x_{ij'}}}{\sum_{j=1}^{J} e^{\beta' x_{ij}}} \tag{37}$$

Therefore the odds of opting for $j$ over $j'$ are given by the following:

$$\frac{P_{ij}}{P_{ij'}} = \frac{e^{\beta' x_{ij}}}{e^{\beta' x_{ij'}}} = \exp\left(\beta'(x_{ij} - x_{ij'})\right) \tag{38}$$

while the Logit is expressed as

$$\ln\left(\frac{P_{ij}}{P_{ij'}}\right) = \ln\left[\exp\left(\beta'(x_{ij} - x_{ij'})\right)\right] = \beta'(x_{ij} - x_{ij'}) \tag{39}$$

The above expression suggests that the log-odds of choosing $j$ over $j'$ are given by the "weighted difference
between the individual’s values on the explanatory variables for the two alternatives, with the weights being
the estimated parameters", i.e., the $\beta$s.
The interpretation is illustrated by using a model estimated by David Hensher, which has been reproduced by
Greene (1997) and Powers and Xie (2000). The example is that of a classic mode choice problem where 152
respondents were surveyed. The original model consisted of four choices: air, bus, car, and train. Powers and
Xie (2000) excluded air as an alternative and reported the results for a three-mode choice set where
the choices are 1=train, 2=bus, and 3=car. We have retained the Powers and Xie (2000) results in this discussion.
The explanatory variables are terminal wait time (TTME), in-vehicle time (INVT), in-vehicle cost (INVC),
and GC, which is a generalised cost measure computed as INVC + (INVT * Value of Time).
Table 11: Estimates from a Conditional Logit Model with alternative-specific attributes

Variable    Coefficient    SE       t-Stat
TTME        -0.002         0.007    -0.314
INVC        -0.435         0.133    -3.277
INVT        -0.077         0.019    -3.991
GC           0.431         0.133     3.237

The log-odds for an individual of choosing train (1) over bus (2) are given as:

$$\ln\left(\frac{P_{i1}}{P_{i2}}\right) = -0.002\,(\text{TTME}_1 - \text{TTME}_2) - 0.435\,(\text{INVC}_1 - \text{INVC}_2) - 0.077\,(\text{INVT}_1 - \text{INVT}_2) + 0.431\,(\text{GC}_1 - \text{GC}_2)$$

The above suggests that the odds of choosing a mode decline with the increase in wait time, in-vehicle travel
time, and in-vehicle costs. The odds of choosing a mode, however, increase with the increase in generalised
cost.
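The log-odds calculation can be sketched in Python using the coefficients in Table 11. The attribute values for train and bus below are hypothetical and chosen only for illustration (the units are assumed, not taken from the survey data); the function name is likewise an assumption.

```python
import math

# Coefficients reported in Table 11.
beta = {"TTME": -0.002, "INVC": -0.435, "INVT": -0.077, "GC": 0.431}

def log_odds(x_j, x_k, coefficients):
    """Equation [39]: weighted difference beta'(x_j - x_k)."""
    return sum(b * (x_j[name] - x_k[name]) for name, b in coefficients.items())

# Hypothetical attribute values for one individual (not from the original data).
train = {"TTME": 30, "INVC": 40, "INVT": 300, "GC": 100}
bus = {"TTME": 35, "INVC": 30, "INVT": 400, "GC": 110}

lo_train_bus = log_odds(train, bus, beta)
odds_train_bus = math.exp(lo_train_bus)

print(round(lo_train_bus, 3), round(odds_train_bus, 3))
```

Swapping the two alternatives reverses the sign of the log-odds, so `log_odds(bus, train, beta)` returns the negative of `lo_train_bus`.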
When the attributes of choices as well as the characteristics of the decision-makers explain the utility of the
alternatives, the model can contain alternative-specific variables as well as individual-specific covariates after
multiplying individual-specific covariates with alternative-specific dummies.
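The model is presented as follows:

$$\text{Prob}(Y_i = j) = \frac{e^{\beta' x_{ij} + \alpha_j' w_i}}{\sum_{j=1}^{J} e^{\beta' x_{ij} + \alpha_j' w_i}} = \frac{e^{\beta' x_{ij}}\, e^{\alpha_j' w_i}}{\sum_{j=1}^{J} e^{\beta' x_{ij}}\, e^{\alpha_j' w_i}} \tag{40}$$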

where $x_{ij}$ are the alternative-specific covariates, while $w_i$ are individual-specific attributes. Interpretation of
the above model is similar to that of the Conditional Logit model discussed earlier. In the Conditional Logit
model we included alternative-specific variables. Now we include an individual-specific variable,
household income (HHINC), which does not vary across alternatives. As mentioned earlier, HHINC is
multiplied with the alternative-specific dummies to enter the model as a regressor. Powers and Xie (2000)
omit the lowest coded category (train) and create two alternative-specific constants, DB (dummy for bus) and
DC (dummy for car). The new variables are:
HHINC_DB = HHINC * DB (dummy for bus)
HHINC_DC = HHINC * DC (dummy for car)

Table 12: Estimates from a mixed Conditional Logit model

Variable    Coefficient    SE       t-Stat
TTME        -0.074         0.017    -4.360
INVC        -0.619         0.152    -4.067
INVT        -0.096         0.022    -4.361
GC           0.581         0.150     3.883
DB          -2.108         0.739    -2.577
HHINC_DB     0.031         0.021     1.404
DC          -6.147         1.029    -5.974
HHINC_DC     0.048         0.023     2.682

The results indicate that an increase in household income increases the odds in favour of bus and car
against train. However, a look at the t-statistics reveals that, of the two income terms, only HHINC_DC
returns a statistically significant coefficient.
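The income effect can be illustrated with a short Python sketch built on the coefficients in Table 12. The attribute values and the two income levels are hypothetical (their units are assumed), and the helper names are introduced here only for illustration. Train is the omitted category, so its constant and income coefficient are set to zero.

```python
import math

# Coefficients reported in Table 12; train is the omitted alternative.
beta = {"TTME": -0.074, "INVC": -0.619, "INVT": -0.096, "GC": 0.581}
asc = {"train": 0.0, "bus": -2.108, "car": -6.147}        # DB and DC
income_coef = {"train": 0.0, "bus": 0.031, "car": 0.048}  # HHINC_DB and HHINC_DC

def utility(alt, attrs, hhinc):
    """Systematic utility: beta'x_ij plus the constant and income interaction."""
    v = sum(beta[name] * attrs[name] for name in beta)
    return v + asc[alt] + income_coef[alt] * hhinc

def choice_probs(attributes, hhinc):
    """Equation [40] evaluated for one individual."""
    v = {alt: utility(alt, attrs, hhinc) for alt, attrs in attributes.items()}
    v_max = max(v.values())  # only utility differences matter
    exp_v = {alt: math.exp(x - v_max) for alt, x in v.items()}
    total = sum(exp_v.values())
    return {alt: e / total for alt, e in exp_v.items()}

# Hypothetical attribute values (not from the original data).
attributes = {
    "train": {"TTME": 30, "INVC": 40, "INVT": 300, "GC": 100},
    "bus": {"TTME": 35, "INVC": 30, "INVT": 400, "GC": 110},
    "car": {"TTME": 0, "INVC": 20, "INVT": 350, "GC": 90},
}

low = choice_probs(attributes, hhinc=20)
high = choice_probs(attributes, hhinc=60)

# Change in the odds of car versus train when income rises by 40 units.
odds_change_car = (high["car"] / high["train"]) / (low["car"] / low["train"])
odds_change_bus = (high["bus"] / high["train"]) / (low["bus"] / low["train"])

print(round(odds_change_car, 3), round(odds_change_bus, 3))
```

With the other attributes held fixed, the odds of car against train change by a factor of exp(0.048 × 40), which depends only on the income coefficient and the change in income.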

Independence of Irrelevant Alternatives


It has been shown in the previous section that the odds ratio $P_j / P_k$ is independent of the remaining
probabilities. This property of Logit models is termed the Independence of Irrelevant Alternatives (IIA).
This assumption is rooted in the earlier assumption that error terms are independent and homoscedastic. It
is a highly desirable property of Logit models from the estimation point of view. The IIA assumption, however,
results in strong restrictions on consumer behaviour. Problems resulting from this assumption are highlighted
in the literature as the red bus/blue bus problem (McFadden, 1974, cited in Powers and Xie, 2000). Let’s assume
that a commuter’s choice set consists of four modes: red bus, blue bus, car, and train. Let’s also assume that
commuters are equally likely to take any mode and hence the mode share for each mode is 25%.
The odds between any two alternatives are 1. Let us also assume that the red bus and the blue bus are perfect
substitutes for each other. Hence, if the blue bus is removed from service (we can simply paint the blue
buses red), the blue-bus riders will shift to the red bus, increasing the red bus’s mode share from 25% to
50%. The mode shares for train and car will remain at 25%. However, this is not the case with Logit models.
IIA dictates that with the exclusion of blue bus, the mode shares for red bus, car, and train will all be equal
to 33.33%, thus maintaining the odds between any two alternatives at 1.
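The red bus/blue bus example can be reproduced with a few lines of Python. This is a minimal sketch: equal shares are obtained by assuming equal systematic utilities (set to zero here for convenience), and the function name is hypothetical.

```python
import math

def logit_shares(utilities):
    """Logit mode shares exp(V_j) / sum_k exp(V_k)."""
    exp_v = {alt: math.exp(v) for alt, v in utilities.items()}
    total = sum(exp_v.values())
    return {alt: e / total for alt, e in exp_v.items()}

# Equal utilities (assumed to be zero) give each of the four modes a 25% share.
full_set = {"red bus": 0.0, "blue bus": 0.0, "car": 0.0, "train": 0.0}
shares_full = logit_shares(full_set)

# Remove the blue bus and recompute the shares under the Logit model.
reduced_set = {alt: v for alt, v in full_set.items() if alt != "blue bus"}
shares_logit = logit_shares(reduced_set)

# Shares one would expect if the two buses were perfect substitutes.
shares_substitutes = {"red bus": 0.50, "car": 0.25, "train": 0.25}

for alt in reduced_set:
    print(alt, round(shares_logit[alt], 4), shares_substitutes[alt])
```

The Logit shares keep the odds of car against train at 1 both before and after the removal, which is the IIA property; the price is that the red bus gains less than the full blue bus share.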
Hausman and McFadden (1984) have posited that one can eliminate a subset of choices from the universal
choice set, assuming that the subset is "truly irrelevant" (Greene, 1997, p. 921). The elimination of the
subset of choices will not "influence the parameter estimates systematically." However, if the odds ratios of
the remaining alternatives are not completely independent of the excluded alternatives, the exclusion of the
subset will return inconsistent parameter estimates. Hausman’s specification test checks for independence and
is presented below:
$$\chi^2 = (\hat{\beta}_s - \hat{\beta}_f)' \, [\hat{V}_s - \hat{V}_f]^{-1} \, (\hat{\beta}_s - \hat{\beta}_f) \tag{41}$$

where $s$ represents the estimators obtained for the subset of the choice set and $f$ represents the estimators
obtained for the complete choice set. $\hat{V}_s$ and $\hat{V}_f$ represent the estimates of the asymptotic covariance
matrices. It has been proved that the test statistic is asymptotically distributed as chi-squared with $K$ degrees
of freedom. If the above-mentioned test suggests violations of the IIA assumption, one can estimate either a
Nested Logit model or a Probit model instead.
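The test statistic in equation [41] can be sketched in plain Python as follows. All numbers below are hypothetical and serve only to show the arithmetic; the function names are assumptions, and in practice one would normally rely on a linear algebra library rather than the small solver written out here. The example uses two coefficients, so the statistic is compared with the 5% chi-squared critical value for 2 degrees of freedom (5.991).

```python
def solve_linear_system(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left unchanged.
    m = [[float(x) for x in a[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    z = [0.0] * n
    for row in range(n - 1, -1, -1):
        s = sum(m[row][k] * z[k] for k in range(row + 1, n))
        z[row] = (m[row][n] - s) / m[row][row]
    return z

def hausman_statistic(beta_s, beta_f, v_s, v_f):
    """Equation [41]: d' [V_s - V_f]^{-1} d with d = beta_s - beta_f."""
    d = [bs - bf for bs, bf in zip(beta_s, beta_f)]
    v_diff = [[vs - vf for vs, vf in zip(row_s, row_f)]
              for row_s, row_f in zip(v_s, v_f)]
    z = solve_linear_system(v_diff, d)  # z = [V_s - V_f]^{-1} d
    return sum(di * zi for di, zi in zip(d, z))

# Hypothetical estimates (subset s and full choice set f), K = 2 coefficients.
beta_s = [-0.45, -0.08]
beta_f = [-0.40, -0.07]
v_s = [[0.020, 0.001], [0.001, 0.0005]]
v_f = [[0.015, 0.0005], [0.0005, 0.0003]]

stat = hausman_statistic(beta_s, beta_f, v_s, v_f)
CHI2_CRIT_5PCT_2DF = 5.991  # 5% critical value, chi-squared with 2 df
reject_iia = stat > CHI2_CRIT_5PCT_2DF

print(round(stat, 4), reject_iia)
```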

Estimating Logit models in SPSS


Estimating Multinomial Logit models in SPSS is straightforward. The command NOMREG estimates the
Multinomial Logit model, and it can also be used to estimate the Binary Logit model. The dependent
variable could be a dichotomous variable taking the values 1/0, or it could be a polytomous variable taking
values such as 1, 2, and 3, as was the case in the mode choice problem. Remember that the estimated model will
return J − 1 sets of estimated coefficients, where J is the total number of alternatives.
The Conditional Logit model is not directly available within SPSS. However, one can trick the Cox
Proportional Hazard model in SPSS into running a Conditional Logit model, because the (stratified partial)
likelihood function of the Cox Proportional Hazard model has the same form as the likelihood function of the
Conditional Logit model. The mechanics and theory of Hazard models are not explained here. We offer only the
definitions required to restructure the data to run the Cox Proportional Hazard model in SPSS.
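The correspondence between the two likelihoods can be sketched as follows, using the notation of equation [36]. The symbols $c_i$ (the alternative chosen by decision-maker $i$) and $R_i$ (the risk set of stratum $i$) are introduced here only for illustration. Suppose each stratum contains the $J$ alternatives faced by one decision-maker, with a single event (the chosen alternative) that occurs before any other observation in the stratum leaves the risk set. The contribution of stratum $i$ to the Cox partial likelihood is then

$$L_i(\beta) = \frac{\exp(\beta' x_{i c_i})}{\sum_{j \in R_i} \exp(\beta' x_{ij})} = \frac{\exp(\beta' x_{i c_i})}{\sum_{j=1}^{J} \exp(\beta' x_{ij})}$$

because the risk set $R_i$ at the event time contains all $J$ alternatives. This is the Conditional Logit probability of the chosen alternative, and the product of $L_i(\beta)$ over all decision-makers gives the Conditional Logit likelihood.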
The Cox Proportional Hazard model estimation requires three additional variables. These are the status variable,
the failure time variable, and the strata variable. Remember that the choice variable assumes the value 1 for the
chosen alternative and 0 for non-chosen alternatives (see Table 9). We use the choice variable as the status
variable in the Cox Proportional Hazard model. Make sure that you identify 1 as the single value event in the
option "Define Event" for the status variable.
For the failure time variable in SPSS, the preferred choice (i.e., the chosen mode) should occur at time = 1,
while other modes (choices) should occur at time > 1. Therefore, the time (failure time) variable should
assume the value 1 for the chosen alternative and 2 for other alternatives for every individual. This can be
achieved by creating a new variable t as follows:
$$t = 2 - \text{choice variable} \tag{42}$$
Lastly, we need a variable to control for stratification in the Cox Proportional Hazard model. The strata
variable identifies individual decision-makers. Since each decision-maker is represented by multiple
observations, the strata variable has a unique ID for each individual and thus acts as a grouping variable. The
SPSS code for Conditional Logit is as follows:
COXREG t WITH aasc casc tasc gc ttme hinca
  /STATUS=mode(1)
  /STRATA=subject.
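The data restructuring described above can be sketched in Python before the file is passed to SPSS. The records, field names, and helper functions below are hypothetical and serve only to illustrate equation [42] and the role of the strata variable; they are not part of the original data set.

```python
# Hypothetical long-format choice data: one row per individual per alternative.
rows = [
    {"subject": 1, "alternative": "train", "choice": 1},
    {"subject": 1, "alternative": "bus", "choice": 0},
    {"subject": 1, "alternative": "car", "choice": 0},
    {"subject": 2, "alternative": "train", "choice": 0},
    {"subject": 2, "alternative": "bus", "choice": 0},
    {"subject": 2, "alternative": "car", "choice": 1},
]

def add_cox_variables(records):
    """Add the failure time t = 2 - choice (equation [42]) to each record."""
    restructured = []
    for rec in records:
        new_rec = dict(rec)                 # copy so the input is left unchanged
        new_rec["t"] = 2 - rec["choice"]    # 1 for the chosen mode, 2 otherwise
        restructured.append(new_rec)
    return restructured

def one_choice_per_subject(records):
    """Check that every stratum (subject) has exactly one chosen alternative."""
    chosen = {}
    for rec in records:
        chosen[rec["subject"]] = chosen.get(rec["subject"], 0) + rec["choice"]
    return all(count == 1 for count in chosen.values())

cox_rows = add_cox_variables(rows)
valid = one_choice_per_subject(cox_rows)

for rec in cox_rows:
    print(rec["subject"], rec["alternative"], rec["choice"], rec["t"])
```

Here `subject` plays the role of the strata variable and `choice` the role of the status variable, with 1 marking the event.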

References

Mroz, T. A. (1987). The sensitivity of an empirical model of married women’s hours of work to economic and
statistical assumptions. Econometrica. 55(4), 765-799.
Long, J. Scott, Freese, Jeremy. (2005). Regression models for categorical dependent variables using Stata.
Stata Press. Texas.
Louviere, J. J., Hensher, D. A., and Swait, J. D. (2000). Stated choice methods: Analysis and application.
Cambridge University Press.
Domencich, T., McFadden, D. (1975). Urban travel demand: A behavioral analysis. North-Holland,
Amsterdam.
Greene, William H. (1997). Econometric Analysis. 3rd edition. Prentice Hall.
McFadden, D. (1974). The measurement of urban travel demand. Journal of Public Economics. 3(4), 303-328.
Powers, Daniel A., Xie, Yu. (2000). Statistical methods for categorical data analysis. Academic Press.
California.
