

Journal of Educational and Behavioral Statistics
Winter 2001, Vol. 26, No. 4, pp. 381–409

A Mixture Item Response Model for Multiple-Choice Data
Daniel M. Bolt
Allan S. Cohen
James A. Wollack
University of Wisconsin, Madison

A mixture item response model is proposed for investigating individual differences in the selection of response categories in multiple-choice items. The model
accounts for local dependence among response categories by assuming that
examinees belong to discrete latent classes that have different propensities
towards those responses. Varying response category propensities are captured
by allowing the category intercept parameters in a nominal response model
(Bock, 1972) to assume different values across classes. A Markov Chain Monte
Carlo algorithm for the estimation of model parameters and classification of
examinees is described. A real-data example illustrates how the model can be
used to distinguish examinees that are disproportionately attracted to different
types of distractors in a test of English usage. A simulation study evaluates item
parameter recovery and classification accuracy in a hypothetical multiple-choice
test designed to be diagnostic. Implications for test construction and the use of
multiple-choice tests to perform cognitive diagnoses of item response patterns are
discussed.

Keywords: cognitive diagnosis, differential alternative functioning, item response theory, Markov Chain Monte Carlo estimation, mixture modeling, nominal response model

Several polytomous Item Response Theory (IRT) models have been proposed to
model multiple-choice item responses, including Bock’s (1972) nominal response
model and the multiple-choice models of Samejima (1979) and Thissen & Stein-
berg (1984). In contrast to dichotomous IRT models, which model the probability
of a correct response, polytomous IRT models describe the probability of selecting
each response category. Polytomous IRT modeling can result in more precise esti-
mates of examinee ability (Baker, 1992), as well as deeper insight into the func-
tioning of individual test items, such as the relative attractiveness of distractors at
specific ability levels (Thissen, Steinberg, & Fitzpatrick, 1989).
In this article we demonstrate how polytomous IRT models can also provide a
way to investigate individual differences related to response category selection.
For many educational tests it is believed that the specific response categories exam-
inees select may provide information about examinee cognition that is not appar-
ent from total test scores or IRT-based ability estimates (Mislevy, 1995). Distractors
in multiple-choice items can often be designed to be attractive to students using

certain problem-solving strategies or applying erroneous operational rules (Tatsuoka,
1985a). For example, an arithmetic test containing subtraction items might include
distractors based on an erroneous rule in which the sign of the subtrahend is ignored
(e.g., 3 minus (−2) = '1'; −6 minus (−6) = '−12'). Students who consistently apply
this rule will not only answer items having a negative subtrahend incorrectly, but also
select the same distractors as responses to those items. Provided enough students fol-
low this erroneous rule, local dependence among response categories may become
apparent when fitting an item response model. The detection and interpretation of
patterns in response category dependence can be useful in determining why students
answer items incorrectly, and may ultimately provide a basis for individual-level
cognitive diagnoses of item response vectors (Tatsuoka, 1985a).
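
To make the erroneous rule concrete, it can be coded directly; the following is a minimal Python sketch (the function name is ours, purely for illustration):

def buggy_subtract(minuend, subtrahend):
    # Erroneous rule: the sign of the subtrahend is ignored
    return minuend - abs(subtrahend)

print(buggy_subtract(3, -2))    # returns 1 rather than 5
print(buggy_subtract(-6, -6))   # returns -12 rather than 0
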
One way of accounting for patterns of local dependence among item responses is
through a discrete mixture model (Rost, 1990; Mislevy & Verhelst, 1990; Kelderman
& Macready, 1990; Yamamoto, 1987). Discrete mixture models assume that a data
set is composed of distinct subpopulations of observations that are described by dif-
ferent parametric distributions (Titterington, Smith, & Makov, 1985). In IRT, dis-
crete mixture models allow items to assume different parameter values across
different latent classes of examinees (Rost, 1997). An example of a mixture IRT
model is the mixture Rasch model (Rost, 1990), which allows the Rasch difficulty
parameters of items to vary across classes. Rost (1991) has also presented a mixture
partial credit model in which category threshold parameters vary across classes.
In this article, a mixture model is described for items having unordered response
categories. The model is a generalization of the Nominal Response Model (NRM)
(Bock, 1972) that allows the category intercept parameters of the model to vary
across latent classes. The model can be of practical use both for investigating indi-
vidual differences in response category selection and for classifying item response
vectors into diagnostic classes. Example items from a test that will be studied in an
exploratory application of the model are shown in Appendix A. The items are from
a test of English usage that is used in an English course placement examination
administered at a midwestern university. Each item requires that students identify a usage error in an English sentence by selecting, from several underlined options, the part of the sentence that should be changed (or, if there is no error, by selecting Category 5, “No error”). The table of specifications for this test attends to the type
of error introduced into each sentence (e.g., a punctuation problem, verb tense prob-
lem, misuse of pronoun, etc.). As in the subtraction example, however, there is
likely also to be information in the distractors a student selects when answering
incorrectly. Selecting a distractor other than Category 5 not only demonstrates an
inability to detect the true error in the sentence, but also implies a proposed correc-
tion where no correction was needed.
The remainder of the article is organized as follows. First we present the mixture
NRM, and a Markov Chain Monte Carlo (MCMC) algorithm that can be used to
estimate it. Two applications of the model are considered next. The first is an
exploratory application of the model using data from the English Usage test. In the
context of this analysis, two criteria for determining the correct number of classes
in the data are investigated, and an index for comparing classes is described. The sec-
ond application is a constrained application of the model, as might be applied to a
test with distractors intentionally designed to distinguish classes. A simulation study
is used to evaluate the accuracy of the model in classifying examinees and in recov-
ering model parameters using the MCMC algorithm.

Nominal Response Model


The NRM traces its lineage to the classical choice models of Thurstone (1927), Bradley and Terry (1952), and Luce (1959). Assume a propensity to select category k on item j, denoted $z_{jk}$, is a linear function of a latent ability parameter θ and item category parameters $\lambda_{jk}$ and $\zeta_{jk}$:

$$z_{jk} = \lambda_{jk}\,\theta + \zeta_{jk}. \qquad (1)$$

The λ parameter reflects the influence of θ on category propensity, while the ζ parameter reflects the influence of factors unrelated to θ. In the NRM, $z_{jk}$ is translated into a probability of selecting category k using a multinomial logistic function:

$$P_{jk} = \frac{\exp(z_{jk})}{\sum_{h=1}^{K} \exp(z_{jh})}, \qquad (2)$$

where K is the total number of categories. The probability of selecting category k is affected not only by the propensity towards k, but also by the propensities toward all other categories, making the NRM a “divide-by-total” model (Thissen & Steinberg, 1986). The λs and ζs for an item are often arbitrarily scaled so that $\sum_{k=1}^{K} \lambda_{jk} = 0$ and $\sum_{k=1}^{K} \zeta_{jk} = 0$. Samejima (1972) showed that the item category characteristic curves (ICCCs) for the categories with the largest λ (typically the correct answer) and smallest λ in an item are monotonically increasing and decreasing in θ, respectively, while the λs for all other item categories determine the relative orderings of the modal points of their ICCCs.
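
As a numerical illustration of Equations 1 and 2, the following minimal Python sketch computes NRM category probabilities; the function name and example values are ours, not taken from the article:

import numpy as np

def nrm_probs(theta, lam, zeta):
    # Equation 1: category propensities z_jk = lambda_jk * theta + zeta_jk
    z = lam * theta + zeta
    z -= z.max()                 # stabilize the exponentials
    p = np.exp(z)                # Equation 2: divide-by-total
    return p / p.sum()

# A hypothetical five-category item with sum-to-zero parameters
lam = np.array([1.00, -.25, -.25, -.25, -.25])
zeta = np.array([0.5, 0.2, -0.1, -0.3, -0.3])
print(nrm_probs(0.0, lam, zeta))
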

A Mixture Nominal Response Model


A mixture extension of the NRM assumes that each examinee belongs to a latent class, and allows the propensity $z_{jk}$ to be affected both by θ and class. Class effects are accounted for through class-specific category intercept parameters, $\zeta_{gjk}$. The propensity towards response k on item j for members of class g is written as:

$$z_{gjk} = \lambda_{jk}\,\theta + \zeta_{gjk}, \qquad (3)$$

and the resulting mixture nominal response model (MNRM) as:

$$P_{gjk} = \frac{\exp(z_{gjk})}{\sum_{h=1}^{K} \exp(z_{gjh})}. \qquad (4)$$

If the class memberships of examinees were known, the model would be equivalent to one in which the NRM is used to study differential alternative functioning (Green, Crone, & Folk, 1989; Thissen, Steinberg, & Wainer, 1993). When class membership is unknown, the model takes the standard form of a discrete mixture model,

$$P_{jk} = \sum_{g=1}^{G} \pi_g P_{gjk}, \qquad (5)$$

where the $\pi_g$s are mixing proportions, $\sum_{g=1}^{G} \pi_g = 1$, and G is the number of classes. In addition to the class-specific category intercept and mixing proportion parameters, there are class parameters related to the distribution of θ. The mean and precision (i.e., the inverse of variance) of θ in class g are denoted as $\mu_g$ and $\tau_g$, respectively.
Unless additional constraints are imposed on the above model, not all of its parameters are identified. As in the NRM, an indeterminacy in the item category parameters can be resolved by setting $\sum_{k=1}^{K} \lambda_{jk} = 0$ and $\sum_{k=1}^{K} \zeta_{gjk} = 0$ for all items j and classes g. To define a metric for θ, for Class 1 we set $\mu_1 = 0$ and $\tau_1 = 1$. A constraint implicit in the model is that the λs be the same across classes. This constraint allows θ to function in the same way across classes, and thus also makes the differences in category intercept parameters across classes interpretable. We consider some other identification issues in the context of specific applications of the model below.
By allowing the ζ parameters to vary across classes, a mixture model can account
for varying propensities towards response categories. An example is shown in
Figure 1. In this hypothetical item, the difference between classes occurs with
respect to the third response category, which is 2.0 higher for Class 2 (to make the
difference in parameters apparent, the ζs in Class 2 have not been normalized).
The consequence is a greater probability of selecting the third response category
(conditional on θ) for Class 2 members.
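
In code, the MNRM of Equations 3 through 5 differs from the NRM sketch above only in indexing the intercepts by class; a minimal sketch (names ours):

import numpy as np

def mnrm_probs(theta, lam, zeta_g):
    # Equations 3-4: class-conditional probabilities with
    # class-specific intercepts zeta_g but common slopes lam
    z = lam * theta + zeta_g
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

def marginal_probs(theta, lam, zetas, pi):
    # Equation 5: mix the class-conditional curves by pi_g
    return sum(pi_g * mnrm_probs(theta, lam, zeta_g)
               for pi_g, zeta_g in zip(pi, zetas))
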

A Markov Chain Monte Carlo Estimation Algorithm for the MNRM


MCMC estimation methods have demonstrated much potential for complex statis-
tical models (Gilks, Richardson & Spiegelhalter, 1996) and are receiving increased
attention in IRT (see, for example, Patz & Junker, 1999a). The MCMC methodol-
ogy adopts a perspective of Bayesian inference in which variables are distinguished
as: (a) observed data, denoted Y, and (b) unobserved model parameters and missing
data, denoted Ω. Their joint density can be written as:

$$P(Y, \Omega) = P(Y \mid \Omega)\, P(\Omega), \qquad (6)$$

where P(Ω) represents a prior density for Ω. An application of Bayes’ theorem pro-
duces the posterior distribution of interest:

$$P(\Omega \mid Y) = \frac{P(\Omega)\, P(Y \mid \Omega)}{\int P(\Omega)\, P(Y \mid \Omega)\, d\Omega}. \qquad (7)$$

FIGURE 1. ICCCs for a hypothetical MNRM item


Estimates of the model parameters are computed from this posterior distribution
(Gilks, Richardson & Spiegelhalter, 1996). However, the integration needed to eval-
uate Equation 7 is frequently too complex to be performed analytically. In MCMC,
the integral is evaluated using Monte Carlo integration. A common MCMC method,
the Gibbs sampler, iteratively samples parameter values directly from their full con-
ditional distributions (Gilks, 1996). At each iteration t of the Markov chain, each
model parameter $\omega_s$ in Ω is sampled conditional on all other parameter values $\omega_{-s}$ and the data. When the full conditional distributions are of a familiar form, such as
a normal, direct Gibbs sampling can be easily implemented using statistical software.
However, because the posterior distributions of parameters in IRT models can often
only be determined up to a normalizing constant, alternatives to direct Gibbs sam-
pling are needed (see, e.g., Patz & Junker, 1999a; 1999b). In this article we use the
method of Adaptive Rejection Sampling (ARS) (Gilks & Wild, 1992), which can be
applied when the full conditional distributions are log-concave.
MCMC methods have been found particularly useful in estimating mixture dis-
tributions (Diebolt & Robert, 1994), including mixtures that involve random
effects within classes (Lenk & DeSarbo, 2000). A common MCMC strategy is to
sample a class membership parameter for each observation at each stage of the
Markov chain (Robert, 1996). For the current model, a class membership parame-
ter, ci = 1, 2, . . . , G, is sampled for each examinee i along with a continuous latent
ability parameter θi.

Prior and Full Conditional Distributions


To obtain full conditional distributions, the following prior distributions are assumed
for each MNRM parameter:

$$\begin{aligned}
c_i &\sim \text{Multinomial}(1;\, \pi_1, \pi_2, \ldots, \pi_G)\\
[\theta_i \mid c_i = g] &\sim \text{Normal}(\mu_g, \tau_g)\\
\boldsymbol{\pi} = (\pi_1, \pi_2, \ldots, \pi_G) &\sim \text{Dirichlet}(\alpha_1, \alpha_2, \ldots, \alpha_G)\\
\lambda_{jk} &\sim \text{Normal}(0, \tau_\lambda)\\
\zeta_{gjk} &\sim \text{Normal}(0, \tau_\zeta)\\
\mu_g &\sim \text{Normal}(0, \tau_\mu)\\
\tau_g &\sim \text{Gamma}(\varepsilon, \xi).
\end{aligned}$$

Hyperparameters for several of the prior distributions were fixed at what were regarded as reasonable values: $\tau_\lambda = \tau_\zeta = \tau_\mu = 1$, $\varepsilon = 2$, $\xi = 4$, and $\alpha_1 = \alpha_2 = \cdots = \alpha_G = .01$. Assuming a joint distribution as in Equation 6, full conditional distributions for each parameter conditional upon the data and other model parameters can be determined at least up to a normalizing constant. We use $[a \mid b]$ as notation for the conditional distribution of a given b. The MCMC sampling procedure is then composed of the following steps:
Step 1. Sample a class membership $c_i$ from 1, 2, . . . , G for each examinee. Assuming independence of examinees, the $c_i$s have full conditional distributions of the form:

$$[c_i = g \mid \text{all other parameters}] \propto P(\mathbf{y}_i \mid c_i = g, \boldsymbol{\zeta}_g, \boldsymbol{\lambda}, \theta_i, \mu_g, \tau_g)\, f_\theta(\theta_i \mid c_i = g)\, P(c_i = g)$$

$$\propto \Big[ \prod_j \prod_k P_{gjk}(\theta_i)^{I(y_{ij} = k)} \Big]\, \text{Normal}(\theta_i;\, \mu_g, \tau_g)\, \pi_g,$$

where $\mathbf{y}_i$ is the item response vector, $\boldsymbol{\lambda}$ and $\boldsymbol{\zeta}_g$ are the item category parameters for class g, I is an indicator function taking value 1 if response k is selected on item j and 0 otherwise, and $\text{Normal}(\theta_i;\, \mu_g, \tau_g)$ is the normal density evaluated at $\theta_i$ with mean $\mu_g$ and precision $\tau_g$.
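
Steps 2 and 3 below evaluate the same likelihood as a function of θ or of an item parameter; Step 1 itself reduces to normalizing G weights and drawing a label. A minimal sketch, assuming the mnrm_probs helper above and NumPy's random generator (all names ours):

import numpy as np

def sample_class(y_i, theta_i, lam, zetas, mu, tau, pi, rng):
    # y_i: length-J vector of selected categories (0-based)
    # lam: J x K slopes; zetas: G x J x K class-specific intercepts
    # mu, tau, pi: length-G class means, precisions, mixing proportions
    G, J = len(pi), len(y_i)
    logw = np.empty(G)
    for g in range(G):
        # log of prod_j prod_k P_gjk(theta_i)^I(y_ij = k)
        ll = sum(np.log(mnrm_probs(theta_i, lam[j], zetas[g, j])[y_i[j]])
                 for j in range(J))
        # normal density in precision form, plus the prior weight pi_g
        logw[g] = (ll + 0.5 * np.log(tau[g])
                   - 0.5 * tau[g] * (theta_i - mu[g]) ** 2 + np.log(pi[g]))
    w = np.exp(logw - logw.max())
    return rng.choice(G, p=w / w.sum())
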
Step 2. Sample a latent ability $\theta_i$ for each examinee. Assuming independence of examinees, the θs have full conditional distributions of the form:

$$[\theta_i = \theta \mid \text{all other parameters}] \propto P(\mathbf{y}_i \mid c_i = g, \boldsymbol{\zeta}_g, \boldsymbol{\lambda}, \theta, \mu_g, \tau_g)\, f_\theta(\theta \mid c_i = g)$$

$$\propto \Big[ \prod_j \prod_k P_{gjk}(\theta)^{I(y_{ij} = k)} \Big]\, \text{Normal}(\theta;\, \mu_g, \tau_g).$$

Step 3. Sample category parameters $\lambda_{jk}$ and $\zeta_{gjk}$ for all item categories and classes. Assuming conditional independence across items, the λ parameters have full conditional distributions of the form:

$$[\lambda_{jk} = l \mid \text{all other parameters}] \propto P(\mathbf{y}_j \mid \mathbf{c}, \boldsymbol{\zeta}_{gj}, \boldsymbol{\lambda}_{j(-k)}, l, \boldsymbol{\theta})\, f_\lambda(l)$$

$$\propto \Big[ \prod_g \prod_i P(y_{ijk} \mid \boldsymbol{\zeta}_{gj}, \boldsymbol{\lambda}_{j(-k)}, l, \theta_i)^{I(c_i = g)} \Big]\, \text{Normal}(l;\, 0, 1),$$

where $\mathbf{y}_j$ is the column item response vector for item j across examinees, $\mathbf{c}$ and $\boldsymbol{\theta}$ are the examinee class membership and ability vectors, $\boldsymbol{\zeta}_{gj}$ is the vector of intercept parameters in class g for item j, $\boldsymbol{\lambda}_{j(-k)}$ is the vector of all slope parameters for item j except for category k, and I now indicates whether examinee i is in class g. For the ζ parameters,

$$[\zeta_{gjk} = z \mid \text{all other parameters}] \propto P(\mathbf{y}_j \mid \mathbf{c}, \boldsymbol{\zeta}_{gj(-k)}, z, \boldsymbol{\lambda}_j, \boldsymbol{\theta})\, f_\zeta(z)$$

$$\propto \Big[ \prod_i P(y_{ijk} \mid \boldsymbol{\zeta}_{gj(-k)}, z, \boldsymbol{\lambda}_j, \theta_i)^{I(c_i = g)} \Big]\, \text{Normal}(z;\, 0, 1),$$

where $\boldsymbol{\zeta}_{gj(-k)}$ is the vector of all intercept parameters in class g for item j except for category k, and $\boldsymbol{\lambda}_j$ is the vector of all slope parameters for item j.

Step 4. Sample the mixing proportions $\boldsymbol{\pi} = (\pi_1, \pi_2, \ldots, \pi_G)$. Assuming conditional independence between the mixing proportions and all parameters except the class memberships of examinees, the mixing proportions have a full conditional distribution of the form:

$$[\boldsymbol{\pi} = \mathbf{p} \mid \text{all other parameters}] \propto P(\mathbf{c} \mid \mathbf{p})\, f_\pi(\mathbf{p}) \propto \prod_g p_g^{\,.01 + n_g - 1},$$

where $n_g$ is the number of persons in class g. This full conditional distribution is Dirichlet, with parameters $.01 + n_1, .01 + n_2, \ldots, .01 + n_G$.

Step 5. Sample class ability means and precisions $\mu_g$ and $\tau_g$ for each class. Assuming the ability distribution parameters are independent of all parameters except the θs for examinees in class g, the conditional distributions are of the form:

$$[\mu_g = \mu \mid \text{all other parameters}] \propto \Big[ \prod_i f_\theta(\theta_i \mid \mu, \tau_g)^{I(c_i = g)} \Big]\, f_\mu(\mu)$$

$$\propto \Big[ \prod_i \text{Normal}(\theta_i;\, \mu, \tau_g)^{I(c_i = g)} \Big]\, \text{Normal}(\mu;\, 0, 1),$$

which results in the following full conditional distribution for $\mu_g$:

$$\mu_g \sim \text{Normal}\!\left( \frac{\tau_g \sum_i \theta_i\, I(c_i = g)}{\tau_g n_g + 1},\ \tau_g n_g + 1 \right).$$

For the precision parameters,

$$[\tau_g = \tau \mid \text{all other parameters}] \propto \Big[ \prod_i f_\theta(\theta_i \mid \mu_g, \tau)^{I(c_i = g)} \Big]\, f_\tau(\tau)$$

$$\propto \Big[ \prod_i \text{Normal}(\theta_i;\, \mu_g, \tau)^{I(c_i = g)} \Big]\, \text{Gamma}(\tau;\, 2, 4),$$

where $\text{Gamma}(\tau;\, 2, 4)$ is the gamma density evaluated at τ. This produces a full conditional distribution for $\tau_g$ of:

$$\tau_g \sim \text{Gamma}\!\left( \frac{n_g}{2} + 2,\ \frac{1}{\frac{n_g \hat{\sigma}_g^2}{2} + \frac{1}{4}} \right),$$

where $\hat{\sigma}_g^2$ is the variance of $\theta_i$ about $\mu_g$ for members of class g.
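
Steps 4 and 5 are standard conjugate draws; a minimal sketch of the corresponding updates with NumPy (names ours; note that NumPy's gamma sampler takes shape and scale, matching the Gamma(2, 4) prior and the full conditionals above):

import numpy as np

def sample_pi(n, rng, alpha=0.01):
    # Step 4: Dirichlet draw with parameters alpha + n_g
    return rng.dirichlet(alpha + np.asarray(n, dtype=float))

def sample_mu_tau(theta_g, tau_g, rng):
    # Step 5, for the thetas of examinees currently in class g
    theta_g = np.asarray(theta_g)
    n = len(theta_g)
    prec = tau_g * n + 1.0                      # posterior precision of mu_g
    mu = rng.normal(tau_g * theta_g.sum() / prec, 1.0 / np.sqrt(prec))
    ss = ((theta_g - mu) ** 2).sum()            # n * sigma_hat_g^2
    tau = rng.gamma(n / 2.0 + 2.0, 1.0 / (ss / 2.0 + 0.25))
    return mu, tau
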


Adaptive Rejection Sampling
Because the full conditional distributions in Step 1 are discrete, the sampling of cs
is straightforward. Likewise, the conditional distributions in Steps 4 and 5 are of
familiar forms (i.e., Dirichlet, normal, and gamma distributions), so that direct
Gibbs sampling could be easily conducted. Sampling from the distributions in
Steps 2 and 3, however, is not as straightforward, as the full conditional distribu-
tions are only known up to a normalizing constant. In both cases, however, ARS
can be used because the distributions are log-concave (Gilks, 1996). ARS is a
method of rejection sampling that samples parameter values from an envelope
function of the conditional distribution. For a specified parameter ω having full
conditional distribution d(ω), an envelope function, denoted DS(ω), is constructed
from the intersections of tangent lines of d(ω), which are evaluated at an initial set
of abscissae points, S. A value ω′ sampled from DS(ω) is retained only if a ran-
domly sampled uniform variable is less than or equal to d(ω′)/DS(ω′). The algo-
rithm is adaptive in that the sampled value ω′ and associated d(ω′) provide a new
abscissa point that can be used to construct a tighter envelope for future sampling.
For additional details on ARS, the reader is referred to Gilks and Wild (1992).
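
The accept/reject step can be conveyed by a deliberately stripped-down variant that fixes a two-tangent envelope around the mode instead of refining it adaptively; full ARS adds each evaluated point to the abscissae set to tighten the piecewise-exponential hull. A sketch under those simplifying assumptions (all names ours):

import numpy as np

def two_tangent_sampler(h, dh, x_left, x_right, rng, n=1000):
    # Rejection sampling from a density proportional to exp(h), using
    # tangents to h at x_left (rising side) and x_right (falling side)
    hl, sl = h(x_left), dh(x_left)
    hr, sr = h(x_right), dh(x_right)
    assert sl > 0 > sr, "abscissae must straddle the mode"
    z = (hr - hl + sl * x_left - sr * x_right) / (sl - sr)  # tangent crossing
    peak = hl + sl * (z - x_left)              # envelope height at the kink
    m_left, m_right = np.exp(peak) / sl, np.exp(peak) / (-sr)
    draws = []
    while len(draws) < n:
        v = 1.0 - rng.random()                 # uniform on (0, 1]
        if rng.random() < m_left / (m_left + m_right):
            x = z + np.log(v) / sl             # inverse CDF, left exponential piece
            env = hl + sl * (x - x_left)
        else:
            x = z + np.log(v) / sr             # right piece
            env = hr + sr * (x - x_right)
        if np.log(1.0 - rng.random()) <= h(x) - env:   # accept/reject step
            draws.append(x)
    return np.array(draws)

# Example: the standard normal via h(x) = -x^2/2
rng = np.random.default_rng(0)
samples = two_tangent_sampler(lambda x: -x ** 2 / 2, lambda x: -x, -1.0, 1.0, rng)
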
Gibbs sampling and ARS can be implemented using the WINBUGS software
(Spiegelhalter, Thomas, & Best, 2000). WINBUGS performs its own check to
determine the best method for sampling each parameter. ARS has been studied
with the nominal response model and demonstrated comparable results to marginal
maximum likelihood in recovering the generating parameters of simulated data
(Wollack, Bolt, Cohen, & Lee, in press).
Implementing the MCMC Algorithm
Starting values
Starting values are needed to define the initial state of the Markov chain. For the cur-
rent study, starting values for the λs were selected such that the correct response was
assigned the highest value. For example, if Category 2 was the correct answer on Item
1, initial values of λ12 = 1.00 and λ11 = λ13 = λ14 = λ15 = −.25 (assuming the item has
five categories) were selected. For the mixing proportions and class ability distribu-
tions, values of πg = 1/G, µg = 0, and τg = 1, g = 1, 2, . . . , G were used. Starting val-
ues for all other parameters were randomly generated within the WINBUGS program.
Convergence of the Markov chain
In MCMC applications, the sampled values for initial iterations in the chain are
discarded because of their dependence on the starting state. Several criteria com-
puted in the CODA package of the WINBUGS program, including criteria pro-
posed by Geweke (1992) and Gelman and Rubin (1992), were used to determine
the number of initial iterations, called burn-in iterations, to discard. Parameter val-
ues sampled after the burn-in iterations are obtained from a chain that is assumed
to have converged to its stationary distribution; estimates of model parameters are
computed from these iterations. In this article, the mean across iterations is used
as the parameter estimate.
A couple of potential problems can be encountered when applying MCMC to
mixture models. One is a trapping state, which occurs when very
few or no observations are assigned to a class, and the mixing proportion for that

class nears zero. Trapping states occur more frequently in solutions involving a
large number of classes and when very uninformative priors are used for the πgs.
When encountered, slightly stronger priors can be imposed on the πgs. A second
problem is that the classes can exchange identity over the course of the chain.
Although this did not appear to occur for any of the results reported in this article,
it becomes more likely when the distinguishability of the classes is low. A strategy
that appeared to remedy this problem was to impose ordinal constraints on a ζ
parameter across classes (see also, Hoijtink & Molenaar, 1997). Other strategies
for handling this problem are described by Lenk and DeSarbo (2000).

Exploratory Analysis: English Usage Data


Item response vectors for 1,000 examinees to 12 items from the English Usage sec-
tion of a college-level English placement test were analyzed. The test is used to assist
college freshmen in course selection and consists of items developed by a panel of
college English teachers. The 12 items were selected from a 31-item section of the test
based on their anticipated potential to display individual differences. Each item was
of the format displayed in Appendix A. The 1,000 examinee response patterns were
randomly selected from a larger data set containing a total of 10,165 response patterns.
One-, two-, three-, and four-class MNRM solutions were obtained. Because the
solutions were exploratory, the ζs for all item categories were allowed to vary
across classes. For solutions involving two or more classes, however, this makes
the θs and ζs within each class underidentified (e.g., an increase in the ζ for the
correct response categories can take the place of an increase in θ). One way of
addressing this problem is to fix the µgs to zero in all classes. In practice, however,
this strategy makes the class differences more difficult to interpret, as θ no longer
represents comparable levels of ability across classes. Because the primary objec-
tive in the current analysis was to investigate patterns of local dependence among
response categories, an alternative strategy was used in which the θs were simply
fixed at their estimates from the one-class solution when estimating the two-, three-,
and four-class solutions. In this way, class differences with respect to the category
intercepts could be more easily interpreted, and differences in ability distributions
across classes could also be ascertained.
The Markov chains for each solution were run for at least 10,000 iterations past
the burn-in iterations. Figure 2 illustrates the history of the chain simulated for the
class mixing proportion and ability distribution parameters in the two-class solu-
tion (sampling histories for the item parameters are not shown here). For this chain,
a burn-in of 500 iterations was discarded and the remaining 15,000 iterations were
used to compute parameter estimates.

Determining the Number of Classes


Various criteria for model comparison have been proposed in conjunction with
MCMC estimation (Gelfand, 1995). Two criteria were considered here for deter-
mining the number of MNRM classes to retain. Both criteria use cross-validation
data. The first criterion uses a cross-validation log-likelihood to perform a “pseudo-Bayes factor” comparison between models (Geisser & Eddy, 1979; Gelfand & Dey, 1994). The second criterion is a measure of local dependence computed for pairs of item response categories (Yen, 1984).

FIGURE 2. Markov Chain history for item and class parameters, two-class MNRM solution, English Usage data

Cross Validation Log-Likelihood


A common Bayesian approach to comparing two models, say Model A and Model
B, computes a Bayes factor, which is the ratio of the posterior odds of Model A to
Model B divided by the prior odds of Model A to Model B. A Bayes factor greater
than one supports Model A while a value less than one supports Model B. One lim-
itation in using Bayes factors is that they are only appropriate if it can be assumed
that one of the models being compared is the true model (Smith, 1991). When the
models are more appropriately regarded as proxies for a true model, it is better to
use indices based on a cross-validation sample (Spiegelhalter, Thomas, Best &
Gilks, 1996). Therefore, to evaluate this criterion for the English Usage test, a
cross-validation sample of 1,000 examinee response vectors was randomly
selected from the remaining 9,165 examinees. Using the WINBUGS software, esti-
mates of cross-validation log-likelihoods for each response vector were obtained
by holding constant the parameter estimates computed from the initial calibration
sample, and sampling class membership and ability parameters for each examinee
in the cross-validation sample in a second Markov chain. The likelihood of a
response vector yi from the cross-validation sample is:

$$P(\mathbf{y}_i \mid Y_{cal}) = \int \big[ P(\mathbf{y}_i \mid \boldsymbol{\omega}_i, Y_{cal})\, f_\omega(\boldsymbol{\omega}_i \mid Y_{cal}) \big]\, d\boldsymbol{\omega}_i, \qquad (8)$$

where Ycal denotes the data for the calibration sample and ωi = (ci, θi) consists of
both a class membership and an ability parameter for each examinee i in the cross-
validation sample. An overall cross-validation log-likelihood is computed as the
sum of the examinee log-likelihoods:

$$\log P(Y_{cv} \mid Y_{cal}) = \sum_{i \in cv} \log P(\mathbf{y}_i \mid Y_{cal}). \qquad (9)$$

The log-likelihood in Equation 9 can be computed at each iteration of the Markov chain. An overall estimate of the cross-validation log-likelihood is the mean of the log-likelihoods across 3,000 iterations. This cross-validation log-likelihood can be compared across MNRM solutions involving different numbers of classes, with the best solution being the one that maximizes the log-likelihood (Spiegelhalter et al., 1996).
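
In code, the criterion amounts to evaluating Equation 9 at each post-burn-in iteration and averaging; a minimal sketch assuming the mnrm_probs helper from earlier (names ours):

import numpy as np

def cv_loglik_at_iteration(Y_cv, draws_t, lam, zetas):
    # Equation 9 at one iteration: draws_t holds the (g, theta) pair
    # sampled for each cross-validation examinee at that iteration
    total = 0.0
    for y_i, (g, theta) in zip(Y_cv, draws_t):
        total += sum(np.log(mnrm_probs(theta, lam[j], zetas[g, j])[y_i[j]])
                     for j in range(len(y_i)))
    return total

# Overall criterion: mean of Equation 9 across retained iterations, e.g.
# cv_ll = np.mean([cv_loglik_at_iteration(Y_cv, d, lam, zetas)
#                  for d in chain_draws])
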

Local Dependence Among Response Categories


Unlike dichotomous IRT models, the nominal response model accounts for statis-
tical dependence among all pairs of item response categories, not just the correct-
incorrect responses. The dimensionality of a test is often defined as the number of
latent dimensions needed to achieve local independence among item responses
(McDonald, 1997). In a similar way, the appropriate number of classes for an
MNRM solution corresponds to that needed to achieve local independence among
response categories. As the number of classes increases, the amount of local depen-
dence among response categories decreases. Local dependence among response
categories can be measured using a variation on Yen’s Q3 statistic (Yen, 1984). For
a pair of dichotomously scored items, Q3 is computed as the correlation between
their residuals. An analogous index can be computed for a pair of response cate-
gories, such as category k on item j and category k′ on item j′ (j ≠ j′). For the NRM, the index is given by:

$$r^{(1)}_{jk;\,j'k'} = \text{Corr}\!\left( \big[ I(y_{ij} = k) - P_{jk}(\hat{\theta}_i) \big],\ \big[ I(y_{ij'} = k') - P_{j'k'}(\hat{\theta}_i) \big] \right), \qquad (10)$$

where $I(y_{ij} = k)$ assumes values of 0 and 1 according to whether examinee i selected k on item j, and $\hat{\theta}_i$ represents the ability estimate of examinee i. (The superscript ‘1’ indicates that the residuals are being computed with respect to a one-class nominal response model.) The correlations are computed using the estimated item category and examinee ability parameters. When no local dependence exists, the correlations should be approximately normally distributed (Yen, 1984).
For solutions having two or more latent classes, both class membership and θ
must be accounted for in evaluating local dependence. Because the sampled class
memberships for most examinees place them in more than one class across itera-
tions of the Markov chain, for each examinee a separate ability estimate exists for
each class. Let θ̂gi denote the mean of the sampled values of θ for examinee i when
in class g. For a G-class solution, the residual correlations are then computed as

 
$$r^{(G)}_{jk;\,j'k'} = \text{Corr}\!\left( I(y_{ij} = k) - \sum_{g=1}^{G} P_{gjk}(\hat{\theta}_{gi}) \cdot P(c_i = g),\ \ I(y_{ij'} = k') - \sum_{g=1}^{G} P_{gj'k'}(\hat{\theta}_{gi}) \cdot P(c_i = g) \right), \qquad (11)$$

where P(ci = g) denotes the proportion of iterations in which examinee i was sam-
pled as a member of class g. Computing residual correlations across all pairs of cat-
egories results in a total of K × (J − 1) correlations for each solution (residuals for
pairs of categories within an item are not included), where K equals the number of
categories per item, and J equals the number of items. A residual correlation of
zero implies that the ability dimension and latent classes are able to account for the
dependence among response categories. We use the mean absolute value of these
residual correlations, the minimum and maximum residual correlations, and nor-
mal QQ-plots of the residual correlations to evaluate local dependence.
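
A sketch of the residual computation underlying Equations 10 and 11, assuming mnrm_probs from earlier; theta_hat holds the class-specific ability estimates and post holds the posterior membership proportions P(c_i = g) (all names ours):

import numpy as np

def category_residuals(Y, lam, zetas, theta_hat, post):
    # Y: N x J selected categories (0-based); lam: J x K; zetas: G x J x K
    # theta_hat: N x G class-specific ability estimates; post: N x G
    N, J = Y.shape
    G, _, K = zetas.shape
    res = np.zeros((N, J, K))
    for i in range(N):
        for j in range(J):
            p = sum(post[i, g] * mnrm_probs(theta_hat[i, g], lam[j], zetas[g, j])
                    for g in range(G))
            res[i, j] = -p                 # subtract the mixed model probability
            res[i, j, Y[i, j]] += 1.0      # add the indicator I(y_ij = k)
    return res

def residual_corr(res, j, k, jp, kp):
    # Equation 11 for category k on item j and category kp on a distinct item jp
    return np.corrcoef(res[:, j, k], res[:, jp, kp])[0, 1]
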
Simulation Study Evaluating Cross-Validation Criteria
The usefulness of both the cross-validation log-likelihood and local dependence cri-
teria in determining the correct number of classes was evaluated in a short simulation study. Data for four calibration data sets were generated under one-, two-, three-,
and four-class conditions. Response vectors were simulated for twelve five-category
items, as in the English Usage data. From a base set of λ and ζ parameters used for
Class 1, the parameters for the remaining classes were derived by shifting the ζ pa-
rameters for five response categories upward by 2.5. In this way, five categories are
defined for which each class has a stronger propensity. The ζ-shifts were applied to
different response categories for each additional class. The θ distributions used to
generate the data for each of the four classes were Normal (0, 1), and the mixing
proportions were always equal across classes; that is, π = (.50, .50) in the two-class
condition, (.33, .33, .33) in the three-class condition, and (.25, .25, .25, .25) in the
four-class condition. A calibration data set was simulated consisting of 1,000 exam-
inees for each of one-, two-, three-, and four-class conditions. For each calibration
data set, one-, two-, three-, and four-class solutions were obtained.
Next, five cross-validation data sets, each containing 1,000 item response vec-
tors, were simulated for each of the one-, two-, three-, and four-class conditions
using the same parameters that were used to generate the calibration data sets.
Holding constant the parameter estimates from the calibration data sets, log P(Ycv | Ycal) and residual correlations were computed for each cross-validation data
set with respect to each of the one-, two-, three-, and four-class solutions.
Table 1 reports the results. First, the mean across the five cross-validation data
sets of log P(Ycv | Ycal) is reported, along with the minimum and maximum observed for each data set. In each of the four simulated conditions, log P(Ycv | Ycal) was consistently maximized for the solution in which the correct number of classes was
specified. A similar trend was observed with respect to the residual correlations. As
the number of classes in the solution increases, the residuals generally get smaller,
but usually the drop is not very substantial when the number of classes in the solu-
tion exceeds the number of classes simulated. For example, in the two-class simu-
lation condition, the mean absolute value of the residual correlations is .041 for the
one-class solution, .031 for the two-class solution, and .030 for the three- and four-
class solutions, suggesting the two-class solution is best.

Application of Cross-validation Criteria to the English Usage Data


Application of the same criteria to the English Usage data produced the results in
Table 2. Normal QQ-plots of the residual correlations for all four solutions are
shown in Figure 3. Both the log P(Ycv | Ycal) and residual correlation criteria sug-
gest that the two-class solution is best. First, the two-class solution results in a
cross-validation log-likelihood that is highest. Second, the residual correlations
demonstrate a modest drop from the one-class to two-class solutions, and then vir-
tually no decrease from the two- to three- and four-class solutions. Finally, the nor-
mal QQ-plots of the residual correlations show some departure from normality in
the one-class solution, but a much smaller departure in the two-class solution.
These results, combined with the clear interpretation that could be given to the two-
class solution, led to a decision to retain two classes. Table 3 displays item param-
eter estimates and standard errors for each parameter in the two-class solution.
TABLE 1
Results of Simulation Analyses Evaluating Two Criteria for Determining the Number of MNRM Classes

Cross-validation log-likelihoods [log P(Ycv | Ycal)]

Number of Classes                     Number of Classes in Solution
Simulated                     1                 2                 3                 4
1  Mean                    −13542            −14256            −14670            −15040
   Min, Max           −13710, −13350    −14420, −14050    −14820, −14470    −15300, −14770
2  Mean                    −14482            −13850            −14274            −15052
   Min, Max           −14650, −14360    −13990, −13680    −14440, −14100    −15240, −14740
3  Mean                    −14954            −14788            −14374            −14646
   Min, Max           −15080, −14770    −14880, −14640    −14480, −14260    −14870, −14460
4  Mean                    −15084            −15144            −14944            −14610
   Min, Max           −15130, −15000    −15200, −15090    −15000, −14920    −14700, −14480

Residual correlations

Number of Classes                     Number of Classes in Solution
Simulated                     1                 2                 3                 4
1  Mean Abs Corr             .030              .030              .029              .030
   Mean Min, Max          −.14, .14         −.13, .14         −.17, .13         −.14, .14
2  Mean Abs Corr             .041              .031              .030              .030
   Mean Min, Max          −.23, .34         −.13, .14         −.12, .17         −.13, .15
3  Mean Abs Corr             .052              .037              .028              .029
   Mean Min, Max          −.27, .35         −.21, .29         −.12, .15         −.14, .15
4  Mean Abs Corr             .051              .047              .034              .030
   Mean Min, Max          −.24, .32         −.30, .27         −.22, .26         −.15, .15

Note. Min = Minimum, Max = Maximum, Mean Abs Corr = Mean Absolute Value of Residual Correlation.

TABLE 2
Criteria for the Number of MNRM Classes in English Usage Data

                                Number of Classes in Solution
Statistic                 1           2             3                 4
log P(Ycv | Ycal)      −15200      −14370        −14490            −14860
Mean Abs Corr            .038        .034          .034              .034
Min/Max Corr          −.21, .26   −.20, .17     −.19, .16         −.19, .15
π̂                       1.00      .28, .72    .20, .09, .71    .22, .24, .12, .42

Note. Mean Abs Corr = Mean Absolute Value of Residual Correlation, Min = Minimum, Max = Maximum.

FIGURE 3. Normal QQ-plots of residual correlations, English Usage data

Interpreting Classes in the English Usage Data


To interpret the two classes, ICCCs across classes were compared using a signed
average difference index. For each response category, a weighted average differ-
ence between the ICCCs of the two classes is given by:

$$D_{jk} = \int_\theta \big[ P_{1jk}(\theta) - P_{2jk}(\theta) \big]\, f(\theta)\, d\theta. \qquad (12)$$

This integral can be approximated using the 1,000 θ estimates from the calibra-
tion examinees. Items that perform the most differentially across classes are
TABLE 3
Two-Class Solution: English Usage Data

Class Parameter Estimates

Class     π̂     SE      µ̂      SE      τ̂     SE
1        .28    .04    −0.06   0.04    1.25   0.07
2        .72    .04    −0.02   0.01    1.21   0.02

Item Category Parameter Estimates

                                 Class 1          Class 2
Item   Category    λ̂     SE      ζ̂      SE       ζ̂      SE
1         1       0.28   0.11    0.96   0.19     0.68   0.14
          2       0.56   0.09    1.83   0.17     2.32   0.12
          3       0.05   0.25   −1.58   0.39    −2.10   0.35
          4      −0.82   0.17   −1.44   0.37    −0.87   0.22
          5      −0.08   0.13    0.24   0.22    −0.04   0.17
2         1       0.14   0.37   −1.98   0.45    −1.91   0.43
          2       1.12   0.15    2.19   0.20     3.48   0.20
          3      −0.46   0.28   −1.51   0.40    −1.20   0.36
          4      −0.63   0.16    1.09   0.21     0.95   0.22
          5      −0.17   0.22    0.21   0.25    −1.32   0.43
3         1       0.11   0.13   −0.06   0.27     0.19   0.15
          2      −0.54   0.16   −1.46   0.40    −0.47   0.18
          3       1.37   0.11    1.66   0.19     2.07   0.11
          4      −0.57   0.23   −1.23   0.33    −2.01   0.31
          5      −0.37   0.12    1.10   0.19     0.21   0.15
4         1      −0.21   0.18   −0.50   0.34    −0.02   0.20
          2       0.30   0.35   −2.08   0.49    −2.16   0.39
          3       0.86   0.13    1.96   0.22     2.56   0.15
          4      −0.45   0.28   −1.24   0.40    −1.86   0.36
          5      −0.51   0.13    1.86   0.21     1.47   0.16
5         1      −0.59   0.11    0.21   0.18    −0.34   0.14
          2      −0.46   0.20   −1.55   0.32    −1.77   0.23
          3       1.23   0.10    1.05   0.17     0.52   0.11
          4      −0.49   0.10   −0.46   0.25     0.71   0.10
          5       0.31   0.09    0.74   0.17     0.87   0.09
6         1       1.34   0.10    0.28   0.20     0.69   0.10
          2      −0.23   0.10    0.29   0.17     0.04   0.11
          3       0.08   0.09    0.21   0.18     0.41   0.10
          4      −0.68   0.17   −1.36   0.32    −1.58   0.22
          5      −0.51   0.09    0.58   0.16     0.44   0.10
7         1      −0.48   0.26   −1.71   0.41    −2.35   0.31
          2      −0.56   0.13   −0.59   0.32    −0.30   0.15
          3       0.74   0.09    2.46   0.20     1.74   0.11
          4       0.07   0.11   −0.78   0.40     0.70   0.11
          5       0.22   0.11    0.61   0.24     0.21   0.13
8         1       0.06   0.10   −0.56   0.29     0.53   0.12
          2      −0.02   0.21   −1.62   0.36    −1.74   0.26
          3       0.67   0.08    1.19   0.17     1.85   0.10
          4      −0.17   0.10    0.37   0.19     0.13   0.14
          5      −0.55   0.12    0.63   0.18    −0.77   0.19
9         1      −0.27   0.15   −1.56   0.42    −0.56   0.16
          2      −0.20   0.17   −0.61   0.25    −1.35   0.22
          3      −0.56   0.11    0.13   0.20    −0.17   0.14
          4       1.24   0.09    1.03   0.19     1.57   0.10
          5      −0.21   0.09    1.00   0.17     0.51   0.11
10        1      −0.39   0.25   −1.71   0.40    −1.66   0.31
          2      −0.79   0.14   −0.40   0.27     0.25   0.16
          3       1.28   0.11    1.66   0.18     2.45   0.13
          4      −0.18   0.20   −0.73   0.28    −1.18   0.25
          5       0.07   0.13    1.18   0.18     0.13   0.18
11        1      −1.23   0.18   −1.84   0.33    −1.66   0.23
          2       1.13   0.10    1.44   0.17     0.46   0.13
          3      −0.01   0.09    0.13   0.23     1.16   0.10
          4       0.07   0.15   −0.59   0.26    −1.01   0.18
          5       0.03   0.08    0.86   0.17     1.06   0.10
12        1      −0.12   0.20   −0.76   0.36    −1.14   0.23
          2      −0.53   0.12   −0.19   0.32     1.34   0.12
          3      −0.37   0.30   −2.18   0.49    −2.27   0.33
          4       1.18   0.12    2.20   0.21     1.26   0.13
          5      −0.16   0.12    0.93   0.22     0.82   0.13

determined by computing the sum of the absolute values of the $D_{jk}$s across item categories:

$$TD_j = \sum_k \big| D_{jk} \big|. \qquad (13)$$
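
Both indices are direct averages once the class-conditional curves are available; a minimal sketch assuming mnrm_probs from earlier (names ours):

import numpy as np

def signed_average_difference(lam_j, zeta_1j, zeta_2j, theta_hats):
    # Equation 12: average the ICCC difference over the 1,000
    # calibration theta estimates (a Monte Carlo stand-in for f(theta))
    diffs = [mnrm_probs(t, lam_j, zeta_1j) - mnrm_probs(t, lam_j, zeta_2j)
             for t in theta_hats]
    return np.mean(diffs, axis=0)          # length-K vector of D_jk

def total_difference(D_j):
    # Equation 13: sum of |D_jk| across an item's categories
    return np.abs(D_j).sum()
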

Table 4 reports the Djk and TDj indices for each item. Also indicated is the item
type as reported in the test specifications. A positive Djk indicates that conditional
on θ-level, members of Class 1 are more likely to endorse the category than mem-
bers of Class 2. The TDj indices would appear to suggest that some items more
effectively differentiate the classes than others, with Items 5, 8, 10, 11, and 12
being the most different. A comparison of the classes with respect to the correct
response options shows a clear association between classes and item type. Class 2
is more likely to select the correct option when the error is punctuation-related

TABLE 4
Average ICCC Differences Across Classes: Two-Class Solution, English Usage Data

Item No.  Item Type                  Dj1      Dj2      Dj3      Dj4      Dj5     TDj
1    punctuation for clarity       0.099   −0.146*   0.011   −0.015    0.052   0.323
2    run-on                        0.004   −0.226*   0.005    0.138    0.079   0.454
3    punctuation for clarity      −0.023   −0.055   −0.129*   0.018    0.188   0.414
4    comma splice                 −0.016    0.002   −0.169*   0.010    0.172   0.369
5    subject-verb agreement        0.096    0.008    0.134*  −0.198   −0.041   0.482
6    punctuation for clarity      −0.074*   0.050   −0.027    0.009    0.042   0.205
7    pronoun reference             0.006   −0.043    0.201*  −0.170    0.006   0.434
8    punctuation for clarity      −0.100    0.007   −0.182*   0.062    0.213   0.567
9    comma splice                 −0.047    0.035    0.037   −0.156*   0.131   0.406
10   punctuation for clarity       0.004   −0.057   −0.202*   0.026    0.229   0.517
11   subject-verb agreement        0.001    0.262*  −0.230    0.022   −0.055   0.587
12   adverb-adjective              0.011   −0.293    0.001    0.264*   0.017   0.643

Note. *Denotes correct response category.

(e.g., comma splice; Items 1, 2, 3, 4, 6, 8, 9, and 10), while Class 1 is more likely
to select the correct option when the error is related to word usage (e.g., subject-
verb agreement; Items 5, 7, 11, and 12). At the same time, however, the classes
are also very much distinguished by the distractors selected as incorrect
responses. For example, Class 2 is disproportionately attracted to Option 2 on
Item 12 and Option 4 on Item 5. Each of these options relates to a correctly-placed
colon (:) in a sentence. In selecting these options, members of Class 2 imply the
need to remove or replace the colon, when in fact the colon is not the error. Mem-
bers of Class 2 are also disproportionately attracted to Option 4 on Item 7 and
Option 3 on Item 11. These options are also punctuation-related responses, imply-
ing the perceived need to insert a comma (or some other form of punctuation)
where no punctuation is needed. In general, Class 1 is more likely to select Cate-
gory 5 (“No error”) than Class 2, especially on items in which the error involves
punctuation for clarity, such as Items 3, 8, and 10. In punctuation-for-clarity items,
punctuation that should be present (such as a comma) to improve the clarity of the
sentence has not been included. It would appear that members of Class 1 did not
detect the need for punctuation to improve clarity, and thus indicated the sentence
had no error.
Taken together, it would appear that the best way of distinguishing the two classes
is not with respect to which items they answer correctly, but more generally with
respect to the types of responses to which they are attracted. Class 1 appears to be
disproportionately attracted to word usage as the cause of problems in sentences,
while Class 2 is disproportionately attracted to punctuation. Given that there is fre-
quently some degree of subjectivity in what defines correct English usage, the two
classes would appear to have differential sensitivities as to what constitutes an error
in English usage. An alternative explanation is presented in the conclusion section.
Classification of Item Response Patterns


A useful byproduct of the MCMC approach is that the item response vectors can
be readily classified into the latent classes based on the frequency with which they
were classified into each class over the history of the Markov chain. An estimate
of the posterior probability of class membership is the proportion of times the
examinee was classified in each class.
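
In code this is a simple tabulation over the chain; a minimal sketch (names ours):

import numpy as np

def posterior_membership(c_draws, G):
    # c_draws: iterations x N array of sampled class labels (0-based);
    # returns an N x G array of posterior membership probabilities
    return np.stack([(c_draws == g).mean(axis=0) for g in range(G)], axis=1)

# Estimated class for each examinee = most frequently sampled label
# est_class = posterior_membership(c_draws, G).argmax(axis=1)
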
Table 5 provides examples of response vectors from the cross-validation sam-
ple, and their estimated class for the two-class MNRM solution. Also reported is
the posterior probability of class membership (the proportion of iterations the
examinee was sampled in that class) and latent ability estimate θ̂. Because of the
interpretation that could be given to the two classes, the classification of response
patterns would appear to have practical value. Members of Class 1 can be informed
that they are focusing too heavily on word usage as the cause of errors, while mem-
bers of Class 2 are focusing too heavily on punctuation.

Simulation Analyses
While the previous analysis was exploratory, in many applications distractors may
be intentionally designed so as to distinguish known class types (such as in the sub-
traction example). In such applications, a constrained version of the model can be
fit in which only the ζ parameters associated with categories believed to distinguish
classes are allowed to vary, and equality constraints are imposed on the ζ parame-
ters for other categories (up to the normalization constraint $\sum_{k=1}^{K} \zeta_{gjk} = 0$). For example, based on the exploratory analysis of the English Usage data, future analy-
ses might apply a constrained version of the two-class MNRM to other English
usage test forms in which only categories that are clearly punctuation or word-
usage categories are allowed to vary across classes.
To evaluate the accuracy of item parameter estimates and examinee classifica-
tion, a simulation study was conducted. As in the simulation analysis for the number-

TABLE 5
Classification of Item Response Vectors: Examples from English Usage Data, Cross-Validation Sample

Response Pattern   Estimated Class   Posterior Probability     θ̂
123515545354             1                   .83             −.05
223331555324             1                   .80             1.04
545552555555             1                   .99            −1.16
145552551242             1                   .54            −1.55
125535345515             1                   .97             −.69
223311434324             2                   .98             1.22
223545135552             2                   .94             −.77
223525234325             2                   .87              .06
223331335535             2                   .76              .58
523335331315             2                   .96              .03

of-classes criteria, the data for this study were generated so that only the ζs for a
small number of categories varied across classes. Data sets containing simulated
responses to a twenty-item test for 1,000 examinees were generated for two classes.
For Class 1, the λs and ζs were selected from NRM estimates of 20 items from the
English Usage test. For Class 2, the same ζs and λs were used, with the exception
of selected items for which the ζ for category 2 was increased by 2.5. In each class,
θs were generated from normal distributions with precisions of one. Three factors
were manipulated: (a) the number of items in which there was a ζ-shift for Class 2
(five versus 10 items); (b) the mixing proportions of the classes [π = (.50, .50) vs.
π = (.80, .20)]; and (c) the mean of θ within each class (µ1 = µ2 = 0.0 vs. µ1 = .5,
µ2 = − .5). To impose equality constraints on ζs for which no shift is simulated, the
model was reparameterized so that the same ζs and λs were used for both classes,
and a shift parameter (representing $\zeta_{2jk} - \zeta_{1jk}$) was introduced to account for differ-
ences in the category intercepts across classes. The shift is constrained to zero for
all categories except those in which the shift was simulated. Two-class MNRM
solutions were obtained for each data set. Each chain was run for at least 10,000
iterations after omitting 500 burn-in iterations. Unlike the previous exploratory
analysis, the use of equality constraints on the ζ parameters also allowed θ pa-
rameters to be sampled.
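
The reparameterization can be pictured as building Class 2's intercepts from the common set plus a masked shift, then renormalizing as in the Appendix B code; a minimal sketch (names ours):

import numpy as np

def class2_intercepts(zeta, shift, mask):
    # zeta: J x K common intercepts; shift: J x K shift parameters;
    # mask: J x K 0/1 array, 1 only for categories designed to differ.
    # Categories with mask = 0 are constrained to equal Class 1's values.
    zeta2 = zeta + shift * mask
    # renormalize so each item's intercepts sum to zero
    return zeta2 - zeta2.mean(axis=1, keepdims=True)
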
Parameter recovery results across the eight conditions are shown in Table 6. The
accuracy of the resulting parameter estimates within each class is reported in
terms of Root Mean Square Error. The classification accuracy for examinees is
reported in terms of a hit rate, which is the proportion of examinees that were clas-
sified into the class from which they were simulated.
Across all eight simulation conditions, the λ and ζ parameters appear to be
recovered reasonably well. The estimates for other model parameters were more
noticeably affected by the three factors manipulated in the study. This was antic-
ipated, as each of these factors affects the distinguishability of the classes.
Because the accuracy of parameter estimation is dependent on having examinees
classified correctly (and likewise, classification accuracy is dependent on accu-
rate parameter estimation), smaller between-class differences result in poorer hit
rates and poorer parameter recovery within each class. For example, when five
as opposed to 10 items include a ζ-shift, examinee hit rates tend to be lower and
parameter estimates less accurate. The difference between µ1 and µ2 appears to
have a similar effect. When there is no difference between µs across classes,
θ no longer contributes to estimation of class membership, which leads to
poorer parameter recovery. Finally, the mixing proportions also appeared to
have an effect on hit rates, with lower hit rates in classes simulated to have fewer
examinees.

TABLE 6
Results of Simulation Analyses: Item and Class Parameter Estimation

                                          RMSE                                 Hit Rate
Condition  µ1, µ2   N1, N2   #Shift    λ      ζ    ζ-Shift    π      µ      τ    Class 1  Class 2
1           0, 0   500, 500     5    .115   .128    .435    .013   .090   .032    .867     .825
2           0, 0   500, 500    10    .117   .130    .338    .001   .044   .022    .942     .906
3           0, 0   800, 200     5    .101   .130    .466    .017   .012   .045    .958     .740
4           0, 0   800, 200    10    .104   .131    .171    .011   .071   .182    .967     .884
5         −.5, .5  500, 500     5    .107   .133    .292    .019   .143   .038    .906     .900
6         −.5, .5  500, 500    10    .105   .138    .181    .006   .086   .050    .972     .942
7         −.5, .5  800, 200     5    .106   .108    .214    .027   .077   .263    .989     .690
8         −.5, .5  800, 200    10    .109   .101    .163    .002   .042   .113    .988     .925

Note. RMSE = Root Mean Square Error. #Shift = the number of items simulated to have a category intercept shift across classes.

Conclusion

The goal of this article was to investigate a discrete mixture version of the nominal response model. A unique feature of the mixture model presented in this article is its capacity to explain patterns in the types of incorrect responses examinees select on multiple-choice items. Dependence among response categories in multiple-
choice items is often investigated using dual scaling techniques (see e.g., Nishisato,
1994). The model-based approach presented in this article provides a natural gen-
eralization of the nominal response model that allows it to account for local depen-
dence while simultaneously estimating the ability levels and class memberships of
examinees. Related psychometric models for classifying examinees into cogni-
tively diagnostic classes have been proposed by DiBello, Stout, and Roussos (1995), Tatsuoka (1985b), and Yamamoto (1987), but have typically been applied
to dichotomously scored test items.
The results obtained for the English Usage test would appear to support previ-
ous suggestions regarding the potential diagnostic value of attending to the incor-
rect responses examinees select in multiple-choice test items (Mislevy & Verhelst,
1990). For this particular test, a distinction could be made between two classes—
one drawn to punctuation-related and the other to word-usage response categories.
As mentioned earlier, this class difference could reflect a fundamental individual
difference among students in perceptions of English grammar. Alternatively, it
may be the result of different preconceptions as to the types of errors expected on
the test. Regardless, this detectable class distinction seems to have important impli-
cations for test and item development. Attention in item development should be
given not only to the types of errors introduced into sentences, but also to the types
of distractors introduced as alternatives on test items. In attempting to develop par-
allel forms, for instance, it would appear important to balance not only the pro-
portion of word-usage versus punctuation-related errors introduced into sentences,
but also the relative frequency with which distractors focus on word-usage versus
punctuation. At the same time, the class distinction identified in this study makes
a cognitive diagnosis of response patterns possible. When examinees are given not only test scores but also class membership, they gain information on how
best to improve test scores (e.g., to attend more to word-usage errors, or alterna-
tively, to attend more to punctuation-related errors).
Clearly, one limitation of the model is its large number of parameters. As shown
in the simulation study, the model can be considerably simplified by imposing
equality constraints on ζ parameters across classes for which no differences are
expected. This may often be appropriate in practice, especially when distractors
are intentionally designed to distinguish known latent classes.
Due to the preliminary nature of this study, many practical issues related to use
of the model, such as model comparison criteria, require further investigation. For
example, alternative indices for determining the number of classes, such as the
alternatives to Q3 considered by Chen and Thissen (1997), may perform better than
the two considered here. It should also be possible to consider a multidimensional
model analogous to the discrete mixture model presented here. However, a discrete
representation was considered beneficial for three reasons: (a) It lends itself more
towards diagnostic classification of examinee response patterns; (b) since the
MNRM is heavily dependent on a graphical interpretation, it becomes easier to
interpret the difference between classes when considered in discrete as opposed to
continuous terms; and (c) it is consistent with current emphases by educational and
cognitive psychologists that many differences in test performance are best inter-
preted qualitatively (Mislevy, 1995).
Finally, this study provides some initial support for use of MCMC methods in
estimating IRT mixture models. An advantage of this methodology is its ease of
implementation for even complex latent variable models. Clearly more work can
be done to investigate its feasibility with other types of IRT mixture models.

Appendix A: Sample Items from an English Usage Test

(In each item, four underlined parts of the sentence are numbered 1 through 4; Category 5 is “No error.”)

1. I, Claudius, one of television’s most lauded series, are being rebroadcast. No error. (Answer: 4)

2. Maria, who had just eaten, thought concerning having a candy bar or ice cream. No error. (Answer: 3)

3. Nobody believes that the defendant will be acquitted, even his strongest supporters are convinced of his guilt. No error. (Answer: 3)


Appendix B: WINBUGS Code for Mixture Nominal Response Model: Exploratory Application

#### Beginning of Model Command File ####


#### Notation: r = item response data matrix; theta = vector of
#### student latent abilities; gmem = vector of student class
#### memberships; zeta = matrix of category intercepts; lambda =
#### matrix of category slopes; zetan = matrix of normalized
#### category intercepts; lambdan = matrix of normalized category
#### slopes; mu = vector of class ability means; tau = vector of
#### class ability precisions; pi = vector of class mixing
#### proportions; N = Number of students; T = Number of items;
#### NC = Number of categories per item; G = Number of classes

model
{
#### The item responses r and the fixed abilities theta are supplied
#### directly in the data list. (Copying them into logical nodes,
#### e.g., r[j,k]<-resp[j,k], would define the same nodes both
#### logically and stochastically once the dcat and dnorm statements
#### below are added, which WinBUGS does not permit.)

#### Specify Mixture Nominal Model
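#### For examinee j in latent class gmem[j], the category probabilities
#### follow the nominal response model with class-specific intercepts:
#### p[j,k,l] = exp(zetan[gmem[j],k,l] + lambdan[k,l]*theta[j]) /
####            sum over m of exp(zetan[gmem[j],k,m] + lambdan[k,m]*theta[j]),
#### so class membership shifts the category intercepts only.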

for (j in 1:N) {
for (k in 1:T) {
for (l in 1:NC) {
prop[j,k,l]<-exp(zetan[gmem[j],k,l]+lambdan[k,l]*theta[j])
}
for (l in 1:NC) {
p[j,k,l]<-prop[j,k,l]/(sum(prop[j,k,]))
}
r[j,k]~dcat(p[j,k,])
}

#### Specify Priors for Person Parameters (theta is observed data
#### here, so the dnorm statement below contributes to the likelihood
#### for gmem rather than generating theta)

theta[j]~dnorm(mu[gmem[j]],tau[gmem[j]])
gmem[j]~dcat(pi[1:G])
}

#### Specify Priors for Item and Class Parameters

for (k in 1:T) {
for (l in 1:NC) {
for (j in 1:G) {
zeta[j,k,l]~dnorm(0,1.); zetan[j,k,l]<-zeta[j,k,l]-mean(zeta[j,k,])
}
lambda[k,l]~dnorm(0,1.); lambdan[k,l]<-lambda[k,l]-mean(lambda[k,])
}}
pi[1:G]~ddirch(alpha[1:G])
for (g in 1:G) {
mu[g]~dnorm(0.,1.)
tau[g]~dgamma(2.,4.)
}
}

#### End of Model Command File

#### Beginning of Data List

list(N=1000, T=12, NC=5, G=2, alpha=c(.01,.01),
r=structure(.Data=c(
2,2,3,1,4,1,3,1,4,3,2,5,
. . .
. . .
5,2,5,5,5,5,5,3,5,3,3,2), .Dim=c(1000,12)),
theta=c(
.44,
.
.
-1.35))
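#### Note: WinBUGS fills structure() arrays by rows (unlike S-Plus),
#### so the 12 responses for each examinee appear consecutively in .Data.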

#### End of Data List

#### Beginning of Initial Values List


list(lambda=structure(.Data=c(
-.25, 1.0, -.25, -.25, -.25,
. . .
. . .
-.25, -.25, -.25, 1.0, -.25), .Dim=c(12,5)),
mu=c(.00,.00),tau=c(1.,1.),pi=c(.5,.5))
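#### Initial values for the remaining stochastic nodes (zeta, gmem)
#### are not listed above; WinBUGS can generate them from their
#### priors ("gen inits") once the data and the values given here
#### have been loaded.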

#### End of Initial Values List

References
Baker, F. B. (1992). Item response theory: Parameter estimation techniques. New York:
Marcel Dekker.
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored
in two or more nominal categories. Psychometrika, 37, 29–51.
Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs: I. The
method of paired comparisons. Biometrika, 39, 324–345.
Chen, W. H., & Thissen, D. (1997). Local dependence indexes for item pairs using item
response theory. Journal of Educational and Behavioral Statistics, 22, 265–289.
DiBello, L. V., Stout, W. F., & Roussos, L. A. (1995). Unified cognitive/psychometric diag-
nostic assessment likelihood-based classification techniques. In P. D. Nichols, S. F.
Chipman, & R. L. Brennan (Eds.), Cognitively diagnostic assessment (pp. 361–389).
Hillsdale, NJ: Lawrence Erlbaum.
Diebolt, J., & Robert, C. P. (1994). Estimation of finite mixture distributions through
Bayesian sampling. Journal of the Royal Statistical Society, B, 56, 163–175.
Geisser, S., & Eddy, W. (1979). A predictive approach to model selection. Journal of the
American Statistical Association, 74, 153–160.
Gelfand, A. E. (1995). Model determination using sampling-based methods. In W. R. Gilks,
S. Richardson, & D. J. Spiegelhalter (Eds.), Markov Chain Monte Carlo in practice
(pp. 145–161). Washington, DC: Chapman & Hall.
Gelfand, A. E., & Dey, D. K. (1994). Bayesian model choice: Asymptotics and exact calcu-
lations. Journal of the Royal Statistical Society, B, 56, 501–514.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple
sequences. Statistical Science, 7, 457–472.
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calcula-
tion of posterior moments. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith
(Eds.), Bayesian statistics 4 (pp. 169–193). Oxford: Oxford University Press.
Gilks, W. R. (1996). Full conditional distributions. In W. R. Gilks, S. Richardson, & D. J.
Spiegelhalter (Eds.), Markov Chain Monte Carlo in practice (pp. 75–88). Washington,
DC: Chapman & Hall.
Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (1996). Markov Chain Monte Carlo in
practice. Washington, DC: Chapman & Hall.
Gilks, W. R., & Wild, P. (1992). Adaptive rejection sampling for Gibbs sampling. Applied
Statistics, 41, 337–348.
Green, B. F., Crone, C. R., & Folk, V. G. (1989). A method for studying differential dis-
tractor functioning. Journal of Educational Measurement, 26, 147–160.
Hoijtink, H., & Molenaar, I. W. (1997). A multidimensional item response model: Con-
strained latent class analysis using the Gibbs sampler and posterior predictive checks.
Psychometrika, 62, 171–189.

Kelderman, H., & Macready, G. B. (1990). The use of loglinear models for assessing dif-
ferential item functioning across manifest and latent examinee groups. Journal of Edu-
cational Measurement, 27, 307–327.
Lenk, P. J., & DeSarbo, W. S. (2000). Bayesian inference for finite mixtures of generalized
linear models with random effects. Psychometrika, 65, 93–119.
Luce, R. D. (1959). Individual choice behavior. New York: Wiley.
McDonald, R. P. (1997). Normal-ogive multidimensional model. In W. J. van der Linden
& R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 257–269).
New York: Springer.
Mislevy, R. J. (1995). Foundations of a new test theory. In N. Frederiksen, R. J. Mislevy,
& I. I. Bejar (Eds.), Test theory for a new generation of tests (pp. 19–39). Hillsdale, NJ:
Lawrence Erlbaum.
Mislevy, R. J., & Verhelst, N. (1990). Modeling item responses when different subjects
employ different solution strategies. Psychometrika, 55, 195–215.
Nishisato, S. (1994). Elements of dual scaling: An introduction to practical data analysis.
Hillsdale, NJ: Lawrence Erlbaum.
Patz, R. J., & Junker, B. W. (1999a). A straightforward approach to Markov Chain Monte
Carlo methods for item response models. Journal of Educational and Behavioral Statis-
tics, 24, 146–178.
Patz, R. J., & Junker, B. W. (1999b). Applications and extensions of MCMC in IRT: Mul-
tiple item types, missing data, and rated responses. Journal of Educational and Behav-
ioral Statistics, 24, 342–366.
Robert, C. P. (1996). Mixtures of distributions: Inference and estimation. In W. R. Gilks,
S. Richardson, & D. J. Spiegelhalter (Eds.), Markov Chain Monte Carlo in practice
(pp. 441–464). Washington, DC: Chapman & Hall.
Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item
analysis. Applied Psychological Measurement, 14, 271–282.
Rost, J. (1991). A logistic mixture distribution model for polychotomous item responses.
British Journal of Mathematical and Statistical Psychology, 44, 75–92.
Rost, J. (1997). Logistic mixture models. In W. J. van der Linden & R. K. Hambleton (Eds.),
Handbook of modern item response theory (pp. 449–463). New York: Springer.
Samejima, F. (1972). A general model for free-response data. Psychometric Monograph,
No. 18.
Samejima, F. (1979). A new family of models for the multiple-choice item (Office of Naval
Research Report 79-4). Knoxville, TN: University of Tennessee.
Smith, A. F. M. (1991). Discussion of ‘Posterior Bayes factors’ by M. Aitkin. Journal of
the Royal Statistical Society, B, 53, 132–133.
Spiegelhalter, D., Thomas, A., Best, N., & Gilks, W. (1996). BUGS 0.5: Bayesian infer-
ence using Gibbs sampling manual (version ii) [Computer program]. Cambridge, UK:
MRC Biostatistics Unit.
Spiegelhalter, D., Thomas, A., & Best, N. (2000). WINBUGS version 1.3. [Computer pro-
gram]. Cambridge, UK: MRC Biostatistics Unit, Institute of Public Health.
Tatsuoka, K. K. (1985a). Diagnosing cognitive errors: Statistical pattern classification and
recognition approach. Research Report 85-1-ONR. University of Illinois at Urbana-
Champaign.
Tatsuoka, K. K. (1985b). A probabilistic model for diagnosing misconceptions by the
pattern classification approach. Journal of Educational Statistics, 10, 55–73.

Thissen, D., & Steinberg, L. (1984). A response model for multiple choice items. Psycho-
metrika, 49, 501–519.
Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models. Psychometrika,
51, 567–577.
Thissen, D., Steinberg, L., & Fitzpatrick, A. R. (1989). Multiple-choice models: The dis-
tractors are also part of the item. Journal of Educational Measurement, 26, 161–176.
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning
using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Dif-
ferential item functioning (pp. 67–113). Hillsdale, NJ: Lawrence Erlbaum.
Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review, 34,
273–286.
Titterington, D. M., Smith, A. F. M., & Makov, U. E. (1985). Statistical analysis of finite
mixture distributions. New York: Wiley.
Wollack, J. A., Bolt, D. M., Cohen, A. S., & Lee, Y. S. (in press). Recovery of item param-
eters in the nominal response model: A comparison of marginal maximum likelihood and
Markov Chain Monte Carlo estimation. Applied Psychological Measurement.
Yamamoto, K. (1987). A model that combines IRT and latent class models. Unpublished
doctoral dissertation, University of Illinois at Urbana-Champaign.
Yen, W. (1984). Effects of local item dependence on the fit and equating performance of the
three-parameter logistic model. Applied Psychological Measurement, 8, 125–145.

Authors
DANIEL M. BOLT is an Assistant Professor of Educational Psychology at the University
of Wisconsin, Madison, Wisconsin: dmbolt@facstaff.wisc.edu. His specialty is item
response theory.
ALLAN S. COHEN is Director of Testing and Evaluation Services at the University of Wis-
consin, Madison, Wisconsin: ascohen@facstaff.wisc.edu. His specialty is educational
measurement and test development.
JAMES A. WOLLACK is an Assistant Scientist for Testing and Evaluation Services at the
University of Wisconsin, Madison, Wisconsin: jwollack@facstaff.wisc.edu. His spe-
cialty is item response theory.

Manuscript Received October 2001
Revision Received January 2002
Accepted January 2002
