Several polytomous Item Response Theory (IRT) models have been proposed to model multiple-choice item responses, including Bock's (1972) nominal response model and the multiple-choice models of Samejima (1979) and Thissen & Steinberg (1984). In contrast to dichotomous IRT models, which model the probability of correct response, polytomous IRT models model the probability of selecting each response category. Polytomous IRT modeling can result in more precise estimates of examinee ability (Baker, 1992), as well as deeper insight into the functioning of individual test items, such as the relative attractiveness of distractors at specific ability levels (Thissen, Steinberg, & Fitzpatrick, 1989).
In this article we demonstrate how polytomous IRT models can also provide a way to investigate individual differences related to response category selection. For many educational tests it is believed that the specific response categories examinees select may provide information about examinee cognition that is not apparent from total test scores or IRT-based ability estimates (Mislevy, 1995). Distractors in multiple-choice items can often be designed to be attractive to students using
Downloaded from http://jebs.aera.net at GEORGIAN COURT UNIV on May 13, 2015
0578-02 Bolt 6/5/02 15:27 Page 382
z_jk = λ_jk θ + ζ_jk.    (1)

The λ parameter reflects the influence of θ on category propensity, while the ζ parameter reflects the influence of factors unrelated to θ. In the NRM, z_jk is translated into a probability of selecting category k using a multinomial logistic function:
P_jk = exp(z_jk) / Σ_{h=1}^{K} exp(z_jh).    (2)

P_gjk = exp(z_gjk) / Σ_{h=1}^{K} exp(z_gjh).    (4)
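Equations 1, 2, and 4 amount to a softmax over category propensities. The following Python sketch (not from the original article; all parameter values are hypothetical) computes the category-response probabilities for a single item:

```python
import numpy as np

def mnrm_probs(theta, lam, zeta):
    """Category-response probabilities for one item under the (M)NRM.

    theta : scalar ability
    lam   : (K,) slope parameters lambda_jk
    zeta  : (K,) intercept parameters (class-specific zeta_gjk in the MNRM)
    """
    z = lam * theta + zeta        # Equation 1: category propensities
    ez = np.exp(z - z.max())      # subtract max for numerical stability
    return ez / ez.sum()          # Equations 2/4: multinomial logistic

# Illustration with hypothetical parameters for a K = 5 category item
p = mnrm_probs(theta=0.5,
               lam=np.array([1.2, -0.3, -0.4, -0.5, 0.0]),
               zeta=np.array([0.8, 0.1, -0.2, -0.7, 0.0]))
print(p.round(3))   # five probabilities summing to 1
```

In the MNRM of Equation 4, the same function applies with the class-specific intercepts ζ_gjk supplied in place of ζ_jk.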
where P(Ω) represents a prior density for Ω. An application of Bayes' theorem produces the posterior distribution of interest:
P(Ω | Y) = P(Ω) P(Y | Ω) / ∫ P(Ω) P(Y | Ω) dΩ.    (7)
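Equation 7 can be illustrated on a discrete grid of parameter values, where the integral in the denominator becomes a sum. This Python fragment is a toy illustration with made-up numbers, not part of the estimation procedure itself:

```python
import numpy as np

# Bayes' theorem on a three-point parameter grid: the posterior is the
# normalized product of prior and likelihood. All values are hypothetical.
prior = np.array([0.2, 0.5, 0.3])     # P(Omega) over three parameter values
lik = np.array([0.10, 0.40, 0.05])    # P(Y | Omega) for the observed data
posterior = prior * lik / (prior * lik).sum()
print(posterior.round(3))             # sums to 1
```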
c_i ~ Multinomial(1; π_1, π_2, . . . , π_G)
[θ_i | c_i = g] ~ Normal(µ_g, τ_g)
π = (π_1, π_2, . . . , π_G) ~ Dirichlet(α_1, α_2, . . . , α_G)
λ_jk ~ Normal(0, τ_λ)
ζ_gjk ~ Normal(0, τ_ζ)
µ_g ~ Normal(0, τ_µ)
τ_g ~ Gamma(ε, ξ).
Hyperparameters for several of the prior distributions were fixed at what were
regarded as reasonable values: τλ = τζ = τµ = 1, ε = 2, ξ = 4, and α1 = α2 = . . . = αG
= .01. Assuming a joint distribution as in Equation 6, full conditional distributions
for each parameter conditional upon the data and other model parameters can be
determined at least up to a normalizing constant. We use [a | b] as notation for the conditional distribution of a given b. The MCMC sampling procedure is then composed of the following steps:
Step 1. Sample a class membership c_i for each examinee. Assuming independence of examinees, the class memberships have full conditional distributions of the form:

[c_i = g | y_i, θ_i, λ, ζ, π, µ, τ] ∝ [∏_j ∏_k P_gjk(θ_i)^I(y_ij = k)] Normal(θ_i; µ_g, τ_g) π_g,

where y_i is the item response vector, λ and ζ_g are the item category parameters for class g, I is an indicator function taking value 1 if response k is selected on item j and 0 otherwise, and Normal(θ_i; µ_g, τ_g) is the normal density evaluated at θ_i with mean µ_g and precision τ_g.
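This step can be sketched in Python (an illustration under assumed array shapes, not the authors' implementation): the unnormalized posterior weight of each class is the product of the MNRM likelihood of the response vector, the normal density of θ_i, and the mixing proportion π_g.

```python
import numpy as np

def sample_class(y_i, theta_i, lam, zeta, pi, mu, tau, rng):
    """Draw c_i from its full conditional (Step 1).

    y_i : (J,) observed category indices (0-based)
    lam : (J, K) slopes; zeta : (G, J, K) class-specific intercepts
    pi  : (G,) mixing proportions; mu, tau : (G,) class means / precisions
    """
    G = zeta.shape[0]
    logw = np.empty(G)
    for g in range(G):
        z = lam * theta_i + zeta[g]                             # (J, K) propensities
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True)) # log category probs
        loglik = logp[np.arange(len(y_i)), y_i].sum()
        # log Normal(theta_i; mu_g, tau_g), tau_g a precision (constants dropped)
        logprior = 0.5 * np.log(tau[g]) - 0.5 * tau[g] * (theta_i - mu[g]) ** 2
        logw[g] = loglik + logprior + np.log(pi[g])
    w = np.exp(logw - logw.max())                               # stable normalization
    return rng.choice(G, p=w / w.sum())

# Hypothetical values: J = 3 items, K = 4 categories, G = 2 classes
rng = np.random.default_rng(0)
g = sample_class(y_i=np.array([0, 2, 1]), theta_i=0.3,
                 lam=rng.normal(size=(3, 4)), zeta=rng.normal(size=(2, 3, 4)),
                 pi=np.array([0.4, 0.6]), mu=np.array([-0.5, 0.5]),
                 tau=np.array([1.0, 1.0]), rng=rng)
print(g)   # 0 or 1
```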
Step 2. Sample a latent ability θi for each examinee. Assuming independence of
examinees, the θs have full conditional distributions of the form:
Step 3. Sample category parameters λ_jk and ζ_gjk for all item categories and classes. Assuming conditional independence across items, the λ parameters have full conditional distributions of the form:

[λ_jk | y_j, c, θ, ζ, λ_j(−k)] ∝ [∏_g ∏_i P(y_ijk | ζ_gj, λ_j(−k), λ_jk, θ_i)^I(c_i = g)] Normal(λ_jk; 0, 1),

where y_j is the column item response vector for item j across examinees, c and θ are the examinee class membership and ability vectors, ζ_gj is the vector of intercept parameters in class g for item j, λ_j(−k) is the vector of all slope parameters for item j except for category k, and I now indicates whether examinee i is in class g.
For the ζ parameters,

[ζ_gjk | y_j, c, θ, λ, ζ_gj(−k)] ∝ [∏_i P(y_ijk | ζ_gj(−k), ζ_gjk, λ_j, θ_i)^I(c_i = g)] Normal(ζ_gjk; 0, 1),

where ζ_gj(−k) is the vector of all intercept parameters in class g for item j except for category k, and λ_j is the vector of all slope parameters for item j.
Step 4. Sample the mixing proportions π = (π_1, π_2, . . . , π_G). Assuming conditional independence between the mixing proportions and all parameters except the class memberships of examinees, the mixing proportions have a full conditional distribution of the form:

π ~ Dirichlet(α_1 + n_1, α_2 + n_2, . . . , α_G + n_G),

where n_g denotes the number of examinees currently assigned to class g.
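Because the Dirichlet prior is conjugate to the multinomial class memberships, this step reduces to drawing from a Dirichlet whose prior αs are incremented by the current class counts. A minimal Python sketch with hypothetical memberships:

```python
import numpy as np

# Step 4 sketch: with a Dirichlet(alpha) prior, the full conditional of the
# mixing proportions given class memberships c is Dirichlet(alpha + counts).
def sample_pi(c, G, alpha, rng):
    counts = np.bincount(c, minlength=G)   # examinees per class
    return rng.dirichlet(alpha + counts)

rng = np.random.default_rng(1)
c = np.array([0, 0, 1, 1, 1, 0, 1])        # hypothetical class memberships
pi = sample_pi(c, G=2, alpha=np.full(2, 0.01), rng=rng)
print(pi)                                  # two proportions summing to 1
```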
Step 5. Sample class ability means and precisions µ_g and τ_g for each class. Assuming the ability distribution parameters are independent of all parameters except the θs for examinees in class g, the conditional distributions are of the form:

µ_g ~ Normal( τ_g Σ_i θ_i I(c_i = g) / (τ_g n_g + 1), 1/(τ_g n_g + 1) ),

τ_g ~ Gamma( n_g/2 + 2, 1/(n_g σ̂²_g/2 + 1/4) ),

where n_g is the number of examinees currently in class g and σ̂²_g is the sample variance of their abilities about µ_g.
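A Python sketch of this step, assuming the Normal(0, τ_µ = 1) prior on µ_g and treating ξ = 4 in the Gamma(ε, ξ) prior as a scale parameter, as the form of the conditionals in the text suggests (the abilities below are simulated placeholders):

```python
import numpy as np

def sample_mu_tau(theta_g, tau_g, rng):
    """One Gibbs update of (mu_g, tau_g) given abilities theta_g of class-g members."""
    n = len(theta_g)
    prec = tau_g * n + 1.0                         # posterior precision of mu_g
    mu_new = rng.normal(tau_g * theta_g.sum() / prec, 1.0 / np.sqrt(prec))
    ss = ((theta_g - mu_new) ** 2).sum()           # sum of squares about new mu_g
    tau_new = rng.gamma(shape=n / 2 + 2, scale=1.0 / (ss / 2 + 0.25))
    return mu_new, tau_new

rng = np.random.default_rng(2)
theta_g = rng.normal(0.5, 1.0, size=200)           # abilities of class-g members
mu_g, tau_g = sample_mu_tau(theta_g, tau_g=1.0, rng=rng)
print(round(mu_g, 2), round(tau_g, 2))             # near 0.5 and near 1
```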
FIGURE 2. Markov chain history for item and class parameters, two-class MNRM solution, English Usage data
where Ycal denotes the data for the calibration sample and ωi = (ci, θi) consists of both a class membership and an ability parameter for each examinee i in the cross-validation sample. An overall cross-validation log-likelihood is computed as the sum of the examinee log-likelihoods:
log P(Ycv | Ycal) = Σ_i log P(y_i | Ycal).

r^(1)_jk;j′k′ = Corr[ I(y_ij = k) − P_jk(θ̂_i), I(y_ij′ = k′) − P_j′k′(θ̂_i) ],    (10)

r^(G)_jk;j′k′ = Corr[ I(y_ij = k) − Σ_{g=1}^{G} P_gjk(θ̂_gi)·P(c_i = g), I(y_ij′ = k′) − Σ_{g=1}^{G} P_gj′k′(θ̂_gi)·P(c_i = g) ],    (11)
where P(ci = g) denotes the proportion of iterations in which examinee i was sampled as a member of class g. Computing residual correlations across all pairs of categories results in a total of K × (J − 1) correlations for each solution (residuals for pairs of categories within an item are not included), where K equals the number of categories per item, and J equals the number of items. A residual correlation of zero implies that the ability dimension and latent classes are able to account for the dependence among response categories. We use the mean absolute value of these residual correlations, the minimum and maximum residual correlations, and normal QQ-plots of the residual correlations to evaluate local dependence.
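The residual correlation of Equation 10 can be sketched as follows in Python; the fitted probabilities here are random placeholders standing in for model-based values, so the computed correlation should sit near zero:

```python
import numpy as np

def residual_corr(y, P, j, k, jp, kp):
    """Equation 10 sketch: correlate category-level residuals
    I(y_ij = k) - P_jk(theta_hat_i) for categories (j, k) and (jp, kp).

    y : (N, J) observed category indices; P : (N, J, K) fitted probabilities.
    """
    r1 = (y[:, j] == k).astype(float) - P[:, j, k]
    r2 = (y[:, jp] == kp).astype(float) - P[:, jp, kp]
    return np.corrcoef(r1, r2)[0, 1]

rng = np.random.default_rng(4)
N, J, K = 500, 6, 5
P = rng.dirichlet(np.ones(K), size=(N, J))     # placeholder fitted probabilities
y = np.array([[rng.choice(K, p=P[i, j]) for j in range(J)] for i in range(N)])
r = residual_corr(y, P, j=0, k=2, jp=1, kp=3)
print(round(r, 3))    # near zero when the model accounts for the dependence
```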
Simulation Study Evaluating Cross-Validation Criteria
The usefulness of both the cross-validation log-likelihood and local dependence criteria in determining the correct number of classes was evaluated in a short simulation study.
Residual correlations

                                    Number of Classes in Solution
Number of Classes
Simulated                          1           2           3           4
1   Mean Abs Corr                .030        .030        .029        .030
    Mean Min, Max             −.14, .14   −.13, .14   −.17, .13   −.14, .14
2   Mean Abs Corr                .041        .031        .030        .030
    Mean Min, Max             −.23, .34   −.13, .14   −.12, .17   −.13, .15
3   Mean Abs Corr                .052        .037        .028        .029
    Mean Min, Max             −.27, .35   −.21, .29   −.12, .15   −.14, .15
4   Mean Abs Corr                .051        .047        .034        .030
    Mean Min, Max             −.24, .32   −.30, .27   −.22, .26   −.15, .15

Note. Min = Minimum, Max = Maximum, Mean Abs Corr = Mean Absolute Value of Residual Correlation.
TABLE 2
Criteria for the Number of MNRM Classes in English Usage Data

                              Number of Classes in Solution
Statistic                  1           2             3                4
log P(Ycv | Ycal)       −15200      −14370        −14490           −14860
Mean Abs Corr             .038        .034          .034             .034
Min/Max Corr           −.21, .26   −.20, .17     −.19, .16        −.19, .15
π̂                        1.00      .28, .72      .20, .09, .71    .22, .24, .12, .42

Note. Mean Abs Corr = Mean Absolute Value of Residual Correlation, Min = Minimum, Max = Maximum.
This integral can be approximated using the 1,000 θ estimates from the calibra-
tion examinees. Items that perform the most differentially across classes are
determined by computing the sum of the absolute values of the Djks across item categories: TDj = Σ_k |Djk|.
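The TDj index is simply the sum of absolute Djk values across an item's categories; a small Python sketch with hypothetical Djk values:

```python
import numpy as np

# Hypothetical D_jk values for one item's K = 5 categories; TD_j sums their
# absolute values, so items with larger class differences get larger indices.
D_jk = np.array([0.12, -0.30, 0.05, 0.22, -0.09])
TD_j = np.abs(D_jk).sum()
print(round(TD_j, 2))   # 0.78
```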
Table 4 reports the Djk and TDj indices for each item. Also indicated is the item
type as reported in the test specifications. A positive Djk indicates that conditional
on θ-level, members of Class 1 are more likely to endorse the category than members of Class 2. The TDj indices would appear to suggest that some items more
effectively differentiate the classes than others, with Items 5, 8, 10, 11, and 12
being the most different. A comparison of the classes with respect to the correct
response options shows a clear association between classes and item type. Class 2
is more likely to select the correct option when the error is punctuation-related
(e.g., comma splice; Items 1, 2, 3, 4, 6, 8, 9, and 10), while Class 1 is more likely to select the correct option when the error is related to word usage (e.g., subject-verb agreement; Items 5, 7, 11, and 12). At the same time, however, the classes
are also very much distinguished by the distractors selected as incorrect
responses. For example, Class 2 is disproportionately attracted to Option 2 on
Item 12 and Option 4 on Item 5. Each of these options relates to a correctly-placed
colon (:) in a sentence. In selecting these options, members of Class 2 imply the
need to remove or replace the colon, when in fact the colon is not the error. Members of Class 2 are also disproportionately attracted to Option 4 on Item 7 and Option 3 on Item 11. These options are also punctuation-related responses, implying the perceived need to insert a comma (or some other form of punctuation) where no punctuation is needed. In general, Class 1 is more likely to select Category 5 (“No error”) than Class 2, especially on items in which the error involves
punctuation for clarity, such as Items 3, 8, and 10. In punctuation-for-clarity items,
punctuation that should be present (such as a comma) to improve the clarity of the
sentence has not been included. It would appear that members of Class 2 did not
detect the need for punctuation to improve clarity, and thus indicated the sentence
had no error.
Taken together, it would appear that the best way of distinguishing the two classes
is not with respect to which items they answer correctly, but more generally with
respect to the types of responses to which they are attracted. Class 1 appears to be
disproportionately attracted to word usage as the cause of problems in sentences,
while Class 2 is disproportionately attracted to punctuation. Given that there is frequently some degree of subjectivity in what defines correct English usage, the two
classes would appear to have differential sensitivities as to what constitutes an error
in English usage. An alternative explanation is presented in the conclusion section.
Simulation Analyses
While the previous analysis was exploratory, in many applications distractors may be intentionally designed so as to distinguish known class types (such as in the subtraction example). In such applications, a constrained version of the model can be fit in which only the ζ parameters associated with categories believed to distinguish classes are allowed to vary, and equality constraints are imposed on the ζ parameters for other categories (up to the normalization constraint Σ_{k=1}^{K} ζ_gjk = 0). For example, based on the exploratory analysis of the English Usage data, future analyses might apply a constrained version of the two-class MNRM to other English usage test forms in which only categories that are clearly punctuation or word-usage categories are allowed to vary across classes.
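The constraint structure can be sketched numerically: shared intercepts for non-diagnostic categories, a class-specific shift on the flagged categories, and recentering to satisfy the sum-to-zero normalization. All values below are hypothetical (Python):

```python
import numpy as np

# Constrained two-class intercepts for one K = 5 item: categories 1-3 are held
# equal across classes, categories 4-5 are flagged as class-distinguishing.
common = np.array([0.4, -0.1, 0.3, 0.0, 0.0])    # shared across classes
shift = np.array([0.0, 0.0, 0.0, 0.6, -0.6])     # class-2 shift on flagged cats
zeta = np.stack([common, common + shift])         # (G = 2, K = 5)
zeta -= zeta.mean(axis=1, keepdims=True)          # enforce sum-to-zero per class
print(zeta.sum(axis=1))                           # both row sums are (numerically) zero
```

Because the shift itself sums to zero, recentering preserves the equality constraint on the non-flagged categories.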
To evaluate the accuracy of item parameter estimates and examinee classification, a simulation study was conducted. As in the simulation analysis for the number-
TABLE 5
Classification of Item Response Vectors: Examples from English Usage Data,
Cross-Validation Sample
Response Pattern Estimated Class Posterior Probability θ̂
123515545354 1 .83 −.05
223331555324 1 .80 1.04
545552555555 1 .99 −1.16
145552551242 1 .54 −1.55
125535345515 1 .97 −.69
223311434324 2 .98 1.22
223545135552 2 .94 −.77
223525234325 2 .87 .06
223331335535 2 .76 .58
523335331315 2 .96 .03
Conclusion
The goal of this article was to investigate a discrete mixture version of the nominal response model. A unique feature of the mixture model presented in this article is its capacity to explain patterns in the types of incorrect responses examinees
TABLE 6

                                               RMSE                              Hit Rate
Condition  µ1, µ2    N1, N2     #Shift    λ      ζ    ζ-Shift    π      µ      τ     Class 1  Class 2
1          0, 0      500, 500     5     .115   .128    .435    .013   .090   .032     .867     .825
2          0, 0      500, 500    10     .117   .130    .338    .001   .044   .022     .942     .906
3          0, 0      800, 200     5     .101   .130    .466    .017   .012   .045     .958     .740
4          0, 0      800, 200    10     .104   .131    .171    .011   .071   .182     .967     .884
5          −.5, .5   500, 500     5     .107   .133    .292    .019   .143   .038     .906     .900
6          −.5, .5   500, 500    10     .105   .138    .181    .006   .086   .050     .972     .942
7          −.5, .5   800, 200     5     .106   .108    .214    .027   .077   .263     .989     .690
8          −.5, .5   800, 200    10     .109   .101    .163    .002   .042   .113     .988     .925

Note. RMSE = Root Mean Square Error. #Shift = The number of items simulated to have a category intercept shift across classes.
2. Maria, who had just eaten, thought concerning having a candy bar or ice cream. No error. (Answer: 3)
3. Nobody believes that the defendant will be acquitted, even his strongest supporters are convinced of his guilt. No error. (Answer: 3)

(In each item, four underlined segments of the sentence are numbered 1 through 4, and “No error” is option 5.)
model
{
  # Read in item response data and fixed abilities from the data list
  # (theta is fixed at the values supplied in fixthet)
  for (j in 1:N) {
    for (k in 1:T) {
      r[j,k] <- resp[j,k]
    }
    theta[j] <- fixthet[j]
  }
  # MNRM likelihood: class-specific propensities and category probabilities
  for (j in 1:N) {
    for (k in 1:T) {
      for (l in 1:NC) {
        prop[j,k,l] <- exp(zetan[gmem[j],k,l] + lambdan[k,l]*theta[j])
      }
      for (l in 1:NC) {
        p[j,k,l] <- prop[j,k,l]/sum(prop[j,k,])
      }
      r[j,k] ~ dcat(p[j,k,])
    }
    theta[j] ~ dnorm(mu[gmem[j]], tau[gmem[j]])
    gmem[j] ~ dcat(pi[1:G])
  }
  # Priors, with category parameters recentered to sum to zero
  for (k in 1:T) {
    for (l in 1:NC) {
      for (j in 1:G) {
        zeta[j,k,l] ~ dnorm(0, 1.0)
        zetan[j,k,l] <- zeta[j,k,l] - mean(zeta[j,k,])
      }
      lambda[k,l] ~ dnorm(0, 1.0)
      lambdan[k,l] <- lambda[k,l] - mean(lambda[k,])
    }
  }
  pi[1:2] ~ ddirch(alpha[1:2])
  mu[1] ~ dnorm(0.0, 1.0)
  mu[2] ~ dnorm(0.0, 1.0)
  tau[1] ~ dgamma(2.0, 4.0)
  tau[2] ~ dgamma(2.0, 4.0)
}
References
Baker, F. B. (1992). Item response theory: Parameter estimation techniques. New York: Marcel Dekker.
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored
in two or more nominal categories. Psychometrika, 37, 29–51.
Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs. I.
Method of paired comparisons. Biometrika, 39, 324–345.
Chen, W. H., & Thissen, D. (1997). Local dependence indexes for item pairs using item
response theory. Journal of Educational and Behavioral Statistics, 22, 265–289.
DiBello, L. V., Stout, W. F., & Roussos, L. A. (1995). Unified cognitive/psychometric diagnostic assessment likelihood-based classification techniques. In P. D. Nichols, S. F. Chipman, & R. L. Brennan (Eds.), Cognitively diagnostic assessment (pp. 361–389). Hillsdale, NJ: Lawrence Erlbaum.
Diebolt, J., & Robert, C. P. (1994). Estimation of finite mixture distributions through Bayesian sampling. Journal of the Royal Statistical Society, Series B, 56, 363–375.
Geisser, S., & Eddy, W. (1979). A predictive approach to model selection. Journal of the
American Statistical Association, 74, 153–160.
Gelfand, A. E. (1995). Model determination using sampling-based methods. In W. R. Gilks, S. Richardson, & D. J. Spiegelhalter (Eds.), Markov chain Monte Carlo in practice (pp. 145–161). Washington, DC: Chapman & Hall.
Gelfand, A. E., & Dey, D. K. (1994). Bayesian model choice: Asymptotics and exact calculations. Journal of the Royal Statistical Society, Series B, 56, 501–514.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple
sequences. Statistical Science, 7, 457–472.
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics 4 (pp. 169–193). Oxford: Oxford University Press.
Gilks, W. R. (1996). Full conditional distributions. In W. R. Gilks, S. Richardson, & D. J. Spiegelhalter (Eds.), Markov chain Monte Carlo in practice (pp. 75–88). Washington, DC: Chapman & Hall.
Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in practice. Washington, DC: Chapman & Hall.
Gilks, W. R., & Wild, P. (1992). Adaptive rejection sampling for Gibbs sampling. Applied Statistics, 41, 337–348.
Green, B. F., Crone, C. R., & Folk, V. G. (1989). A method for studying differential distractor functioning. Journal of Educational Measurement, 26, 147–160.
Hoijtink, H., & Molenaar, I. W. (1997). A multidimensional item response model: Constrained latent class analysis using posterior predictive checks. Psychometrika, 62, 171–189.
Authors
DANIEL M. BOLT is an Assistant Professor of Educational Psychology at the University
of Wisconsin, Madison, Wisconsin: dmbolt@facstaff.wisc.edu. His specialty is item
response theory.
ALLAN S. COHEN is Director of Testing and Evaluation Services at the University of Wisconsin, Madison, Wisconsin: ascohen@facstaff.wisc.edu. His specialty is educational measurement and test development.
JAMES A. WOLLACK is an Assistant Scientist for Testing and Evaluation Services at the University of Wisconsin, Madison, Wisconsin: jwollack@facstaff.wisc.edu. His specialty is item response theory.