
MEDICAL CARE

Volume 38, Number 9, Supplement II, pp II-28–II-42


©2000 Lippincott Williams & Wilkins, Inc.

Item Response Theory and Health Outcomes Measurement in the 21st Century

RON D. HAYS, PHD,*† LEO S. MORALES, MD, MPH,*† AND STEVE P. REISE, PHD*‡

Item response theory (IRT) has a number of potential advantages over classical test theory in assessing self-reported health outcomes. IRT models yield invariant item and latent trait estimates (within a linear transformation), standard errors conditional on trait level, and trait estimates anchored to item content. IRT also facilitates evaluation of differential item functioning, inclusion of items with different response formats in the same scale, and assessment of person fit and is ideally suited for implementing computer adaptive testing. Finally, IRT methods can be helpful in developing better health outcome measures and in assessing change over time. These issues are reviewed, along with a discussion of some of the methodological and practical challenges in applying IRT methods.

Key words: item response theory; health outcomes; differential item functioning; computer adaptive testing. (Med Care 2000;38[suppl II]:II-28–II-42)

Classical test theory (CTT) partitions observed item and scale responses into true score plus error. The person to whom the item is administered and the nature of the item itself influence the probability of a particular item response. A major limitation of CTT is that person ability and item difficulty cannot be estimated separately. In addition, CTT yields only a single reliability estimate and corresponding standard error of measurement, but the precision of measurement is known to vary by ability level.

Item response theory (IRT) comprises a set of generalized linear models and associated statistical procedures that connect observed survey responses to an examinee's or a subject's location on an unmeasured underlying ("latent") trait.1 IRT models have a number of potential advantages over CTT methods in assessing self-reported health outcomes. IRT models yield item and latent trait estimates (within a linear transformation) that do not vary with the characteristics of the population with respect to the underlying trait, standard errors conditional on trait level, and trait estimates linked to item content. In addition, IRT facilitates evaluation of whether items are equivalent in meaning to different respondents (differential item functioning) and inclusion of items with different response formats in the same scale, assessing person fit, and it is ideally suited for implementing computerized adaptive testing. IRT methods can also be helpful in developing better health outcome measures over time. After a basic introduction to IRT models, each of these issues is discussed. Then we discuss how IRT models can be useful in assessing change. Finally, we note some of the methodological and practical problems in applying IRT methods.

To illustrate results from real data, we refer to the 9-item measure of physical functioning administered to participants in the HIV Cost and Services Utilization Study (HCSUS).2–4 Study participants were asked to indicate whether their health limited them a lot, a little, or not at all in each of the 9 activities during the past 4 weeks (see

*From UCLA, School of Medicine, Los Angeles, California.
†From UCLA, Department of Psychology, Los Angeles, California.
‡From RAND, Health Sciences, Santa Monica, California.

Address correspondence and requests for reprints to: Ron D. Hays, PhD, Division of General Internal Medicine and Health Services Research, 911 Broxton Plaza, Room 110, Box 951736, Los Angeles, CA 90095-1736. E-Mail: hays@rand.org


Appendix). The items were selected to represent a range of functioning, including basic activities of daily living (feeding oneself, bathing or dressing, preparing meals, or doing laundry), instrumental activities of daily living (shopping), mobility (getting around inside the home, climbing stairs, walking 1 block, walking >1 mile), and vigorous activities. Five items (vigorous activities, climbing 1 flight of stairs, walking >1 mile, walking 1 block, bathing or dressing) are identical to those in the SF-36 health survey.5

Item means, standard deviations, and the percentage not limited in each activity are provided in Table 1. The 9 items are scored on the 3-point response scale, with 1 representing limited a lot, 2 representing limited a little, and 3 representing not limited at all. Items are ordered by their means, which range from 1.97 (vigorous activities) to 2.90 (feeding yourself). These data will be used to provide an example of estimating item difficulty and discrimination parameters, category thresholds, model fit, and the unidimensionality assumption of IRT.

TABLE 1. Item Means, Standard Deviations, and Percent Not Limited for 9 Physical Functioning Items

Item                               Mean (SD)      Not Limited, %
Vigorous activities                1.97 (0.86)    45
Walking >1 mile                    2.22 (0.84)    49
Climbing 1 flight of stairs        2.37 (0.76)    55
Shopping                           2.61 (0.68)    72
Walking 1 block                    2.63 (0.64)    72
Preparing meals or doing laundry   2.67 (0.63)    75
Bathing or dressing                2.80 (0.49)    84
Getting around inside your home    2.81 (0.47)    84
Feeding yourself                   2.90 (0.36)    91

Items are scored 1 = yes, limited a lot; 2 = yes, limited a little; and 3 = no, not limited at all.

IRT Basics

IRT models are mathematical equations describing the association between a respondent's underlying level on a latent trait and the probability of a particular item response using a nonlinear monotonic function.6 The correspondence between the predicted responses to an item and the latent trait is known as the item-characteristic curve (ICC). Most applications of IRT assume unidimensionality, and all IRT models assume local independence.7 Unidimensionality means that only 1 construct is measured by the items in a scale. Local independence means that the items are uncorrelated with each other when the latent trait or traits have been controlled for.8 In other words, local independence is obtained when the complete latent trait space is specified in the model. If the assumption of unidimensionality holds, then only a single latent trait is influencing item responses and local independence is obtained.

Item-Characteristic Curves

With dichotomous items, there tends to be an s-shaped relationship between increasing respondent trait level and increasing probability of endorsing an item. As shown in Figure 1, the ICC displays the nonlinear regression of the probability of a particular response (y axis) as a function of trait level (x axis). Items that produce a nonmonotonic association between trait level and response probability are unusual, but nonparametric IRT models have been developed.9 The middle of the ICC is steeper in slope, implying large changes in the probability of endorsement with small changes in trait level. Item discrimination corresponds to the slope of the ICC. The ICCs for items with a higher probability of endorsement (easier items) are located farther to the left on the trait scale, and those with a lower probability of endorsement (harder items) are located farther to the right (Figure 1). For example, Figure 1 shows ICCs for 3 items, each having 2 possible responses (dichotomous): (1) no and (2) yes. Item 1 is the "easiest" item because the probability of a "yes" response for a given trait level tends to be higher for it than for the other 2 items. Item 3 is the "hardest" item because the probability of a "yes" response for a given trait level tends to be lower than for the other 2 items.

Dichotomous IRT Models

The different kinds of IRT models are distinguished by the functional form specified for the


FIG. 1. Item characteristic curves for 3 dichotomous items.

relationship between underlying ability and item response probability (ie, the ICC). For simplicity, we focus on dichotomous item models here and briefly describe examples for polytomous items (items with multiple response categories). Polytomous models are extensions of dichotomous IRT models. The features of the 3 main types of dichotomous IRT models are summarized in Table 2. As noted, each of these models estimates an item difficulty parameter. The 2- and 3-parameter models also estimate an item discrimination parameter. Finally, the 3-parameter model includes a "guessing" parameter.

TABLE 2. Features of Different Types of Dichotomous IRT Models

Model          Item Difficulty    Item Discrimination    Guessing Parameter
1-Parameter    X
2-Parameter    X                  X
3-Parameter    X                  X                      X

This article takes the position that the Rasch model is nested within the 2- and 3-parameter models. We do not address the side debate in the literature about whether the Rasch model should be referred to as distinct from, rather than a special case of, IRT models.

One-Parameter Model (Rasch)

The Rasch model specifies a 1-parameter logistic (1-PL) function.10,11 The 1-PL model allows items to vary in their difficulty level (probability of endorsement or scoring high on the item), but it assumes that all items are equally discriminating (the item discrimination parameter, α, is fixed at the same value for all items). Observed dichotomous item responses are a function of the latent trait (θ) and the difficulty of the item (β):

P(θ) = e^{Dα(θ − β)} / [1 + e^{Dα(θ − β)}] = 1 / [1 + e^{−Dα(θ − β)}]

D is a scaling factor that can be used to make the logistic function essentially the same as the normal ogive model (ie, setting D = 1.7).
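This response function is straightforward to compute directly. The following Python sketch (ours, not the article's) implements it, including an optional pseudo-guessing parameter discussed later; with c = 0 it is the 2-PL form, and additionally holding the slope constant across items gives the 1-PL (Rasch) form.

```python
import math

def irt_prob(theta, a, b, c=0.0, D=1.7):
    """Probability of endorsing a dichotomous item.

    theta: respondent trait level (z-score metric)
    a: discrimination (slope); b: difficulty; c: pseudo-guessing.
    D = 1.7 scales the logistic to approximate the normal ogive.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# At theta equal to the item difficulty, the endorsement probability
# is 0.5, or (1 + c)/2 when a guessing parameter is present.
print(irt_prob(0.0, a=1.0, b=0.0))          # 0.5
print(irt_prob(0.0, a=1.0, b=0.0, c=0.2))   # 0.6
```

The function is monotonically increasing in θ, which is the s-shaped ICC described earlier.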


Latent trait scores and item difficulty parameters are estimated independently, and both values are on the same z-score metric (constrained to sum to zero). Most trait scores and difficulty estimates fall between −2 and 2.

The difficulty parameter indicates the ability or trait level needed to have a 50% chance of endorsing an item (eg, responding "yes" to a "yes or no" item). In the Rasch model, the log odds of a person endorsing or responding in the higher category is simply the difference between trait level and item difficulty. A nice feature of the Rasch model is that observed raw scores are sufficient for estimating latent trait scores using a nonlinear transformation. In other IRT models, the raw score is not a sufficient statistic.

Figure 1 presents ICCs for 3 dichotomous items, differing in their degree of difficulty on the z-score metric from −1, 0, to 1. Note that the s-shaped curves are parallel (have the same slope) because only item difficulty is allowed to vary in the 1-PL model. The probability of a "yes" response to the easiest item (β1 = −1) for someone of average ability (θ = 0) is approximately 0.73, whereas the probability for the item with the intermediate difficulty level (β2 = 0) is 0.50 (this is true by definition given its difficulty level), and the probability for the hardest item (β3 = 1) is 0.27. The discrimination parameter of each item was set to 1.0.

For purposes of illustration of the Rasch model, we dichotomized the items by collapsing the "yes, limited a lot" and the "yes, limited a little" response options together and coding this 0. The "no, not limited at all" response was coded 1. We fit a 1-PL model to these data using MULTILOG12 (see Table 3). Slope (discrimination) estimates are typically fixed at 1.0 in the absence of any information. In this example, the slopes were fixed at 3.49 by MULTILOG on the basis of the generally high level of discrimination for this set of items.

Item difficulty estimates (Table 3) ranged from −1.60 (feeding yourself) to 0.46 (vigorous activities). The second hardest item was walking >1 mile, followed by climbing 1 flight of stairs, shopping, walking 1 block, bathing or dressing, and getting around inside the home.

TABLE 3. Item Difficulty Estimates for Physical Functioning Items: Rasch Model

Item                               Item Difficulty (SE)
Vigorous activities                0.46 (0.02)
Walking >1 mile                    0.06 (0.03)
Climbing 1 flight of stairs        −0.14 (0.02)
Shopping                           −0.65 (0.02)
Walking 1 block                    −0.67 (0.02)
Preparing meals or doing laundry   −0.78 (0.03)
Bathing or dressing                −1.18 (0.03)
Getting around inside your home    −1.19 (0.03)
Feeding yourself                   −1.60 (0.04)

Items are ordered by difficulty level. Estimates were obtained from MULTILOG, version 6.30. Slopes were fixed at 3.49.

Two-Parameter Model

The 2-parameter (2-PL) IRT model extends the 1-PL Rasch model by estimating an item discrimination parameter (α) as well as an item difficulty parameter. The discrimination parameter is similar to an item-total correlation and typically ranges from approximately 0.5 to 2. Higher values of this parameter are associated with items that are better able to discriminate between contiguous trait levels near the inflection point. This is manifested as a steeper slope in the graph of the probability of a particular response (y axis) by underlying ability or trait level (x axis). An important feature of the 2-PL model is that the distance between an individual's trait level and item difficulty has a greater effect on the probability of endorsing highly discriminating items than on less discriminating items. Thus, more discriminating items provide greater information about a respondent than do less discriminating items. Unlike the Rasch model, discrimination needs to be incorporated, and the raw score is not sufficient for estimating trait scores.

We also fit a 2-PL model for the dichotomized physical functioning items (Table 4). Difficulty estimates were similar to those reported above for the 1-PL model, ranging from −1.62 (feeding yourself) to 0.49 (vigorous activities). Thus, difficulty estimates were robust to whether or not the item discrimination parameter was estimated. Item discriminations (slopes) ranged from 2.51 (vigorous activities) to 4.09 (walking >1 mile). These slopes are very high; each one exceeds the upper value (2.00) of the typical range noted above.
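The probabilities quoted for Figure 1, and the effect of the discrimination parameter, can be checked numerically. This sketch assumes the unscaled logistic (D = 1) and α = 1.0 that the Figure 1 illustration implies:

```python
import math

def p_logistic(theta, a, b, D=1.0):
    # Dichotomous IRT response probability; D = 1 and a = 1.0
    # reproduce the Figure 1 illustration quoted in the text.
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

# Figure 1: three 1-PL items with difficulties -1, 0, and 1.
for b in (-1.0, 0.0, 1.0):
    print(round(p_logistic(0.0, 1.0, b), 2))  # 0.73, then 0.5, then 0.27

# A more discriminating item (larger a) changes faster around its
# difficulty, so the same trait-difficulty distance matters more.
print(p_logistic(0.5, 4.0, 0.0) > p_logistic(0.5, 1.0, 0.0))  # True
```

Note also the Rasch log-odds property: with a = 1 and D = 1, log[p/(1 − p)] equals θ − β exactly.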


TABLE 4. Item Difficulty and Discrimination Estimates for Physical Functioning Items: Two-Parameter Model

Item                               Difficulty (SE)    Discrimination (SE)
Vigorous activities                0.49 (0.03)        2.51 (0.12)
Walking >1 mile                    0.06 (0.02)        4.09 (0.19)
Climbing 1 flight of stairs        −0.14 (0.03)       3.46 (0.15)
Shopping                           −0.64 (0.02)       3.74 (0.26)
Walking 1 block                    −0.66 (0.02)       3.69 (0.26)
Preparing meals or doing laundry   −0.76 (0.03)       3.83 (0.25)
Bathing or dressing                −1.18 (0.03)       3.52 (0.21)
Getting around inside your home    −1.18 (0.03)       3.59 (0.21)
Feeding yourself                   −1.62 (0.05)       3.21 (0.25)

Items are ordered by difficulty level. Estimates were obtained from MULTILOG, version 6.30.

Three-Parameter Model

The 3-parameter (3-PL) model includes a pseudo-guessing parameter (c), as well as item discrimination and difficulty parameters:

P(θ) = c + (1 − c)e^{Dα(θ − β)} / [1 + e^{Dα(θ − β)}]

This additional parameter adjusts for the impact of chance on observed scores. In the 3-PL model, the probability of endorsement at θ = β is (1 + c)/2. In ability testing, examinees can get an answer right by chance, raising the lower asymptote of the function. The relevance of this parameter to HRQOL assessment remains to be demonstrated. Response error, rather than guessing, is a plausible third parameter for health outcomes measurement.

Examples of Polytomous IRT Models

Graded Response Model

The graded response model,13 an extension of the 2-PL logistic model, is appropriate to use when item responses can be characterized as ordered categorical responses. In the graded response model, each item is described by a slope parameter and between-category threshold parameters (one less than the number of response categories). For the graded response model, 1 operating characteristic curve needs to be estimated for each between-category threshold. In the graded response model, items need not have the same number of response categories. Threshold parameters represent the trait level necessary to respond above threshold with 0.50 probability. Category response curves represent the probability of responding in a particular category conditional on trait level. Generally speaking, items with higher slope parameters provide more item information. The spread of the item information and where on the trait continuum information is peaked are determined by the between-category threshold parameters.

We fit the graded response model to the HCSUS physical functioning items, preserving the original 3-point response scale. Responses were scored as shown in the Appendix: 1 = yes, limited a lot; 2 = yes, limited a little; and 3 = no, not limited at all. In running the model, 2 category threshold parameters and 1 slope parameter were estimated for each item.

Table 5 shows the category threshold parameters and the slope parameter for each of the 9 physical functioning items. The category threshold parameters represent the point along the latent trait scale at which a respondent has a 0.50 probability of responding above the threshold. Looking at the first row of Table 5, one can see that a person with a trait level of 0.62 has a 50/50 chance of responding "not limited at all" in vigorous activities. Similarly, a person with a trait level of −0.31 has a 50/50 chance of responding "limited a little" or "not limited at all" in vigorous activities. The trait level associated with a 0.50 probability of responding above the 2 thresholds is higher for the vigorous activities item than for any of the other 8 physical functioning items. This is consistent with the fact that more people reported limitations in vigorous activities than on any of the other items. For example, 65% of the sample reported being limited in vigorous activities compared with only 9% for feeding.
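Under the graded response model, each category's probability is the difference between adjacent boundary (operating characteristic) curves. The sketch below uses the vigorous-activities estimates quoted above (slope 2.22, thresholds −0.31 and 0.62); the D = 1.7 scaling constant is our assumption, although the 0.50 probability at a threshold holds regardless of scaling.

```python
import math

def boundary(theta, a, b, D=1.7):
    # P(responding above threshold b): a 2-PL style boundary curve.
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def grm_category_probs(theta, a, thresholds, D=1.7):
    """Graded response model: the probability of each ordered category
    is the difference between adjacent boundary curves."""
    bounds = [1.0] + [boundary(theta, a, b, D) for b in thresholds] + [0.0]
    return [bounds[k] - bounds[k + 1] for k in range(len(thresholds) + 1)]

# Vigorous activities: slope 2.22, thresholds -0.31 and 0.62.
# At theta = 0.62 the respondent sits exactly at the second threshold,
# so the probability of "not limited at all" is 0.50.
probs = grm_category_probs(0.62, 2.22, [-0.31, 0.62])
print([round(p, 2) for p in probs])
```

Because the boundary curves are ordered, the category probabilities are nonnegative and sum to 1 at every trait level.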


TABLE 5. Category Thresholds and Slope Estimates for HCSUS Physical Functioning Items: Graded Response Model

                                   Threshold Between          Threshold Between
Item                               "A Lot" and "A Little" (SE) "A Little" and "Not at All" (SE)  Slope (SE)
Vigorous activities                −0.31 (0.03)               0.62 (0.04)                       2.22 (0.09)
Climbing 1 flight of stairs        −1.09 (0.04)               −0.05 (0.03)                      2.77 (0.10)
Walking >1 mile                    −0.65 (0.03)               0.17 (0.03)                       3.28 (0.13)
Walking 1 block                    −1.56 (0.04)               −0.62 (0.03)                      3.27 (0.19)
Bathing or dressing                −2.03 (0.07)               −1.13 (0.03)                      3.25 (0.20)
Preparing meals or doing laundry   −1.59 (0.04)               −0.73 (0.03)                      3.27 (0.19)
Shopping                           −1.41 (0.04)               −0.59 (0.03)                      3.39 (0.16)
Getting around inside your home    −2.14 (0.07)               −1.14 (0.04)                      3.18 (0.18)
Feeding yourself                   −2.73 (0.12)               −1.71 (0.06)                      2.35 (0.18)

Estimates were obtained from MULTILOG, version 6.30.

Partial Credit Model

The partial credit model14 is an extension of the Rasch model to polytomous items. Thus, item slopes are assumed to be equal across items. The model depicts the probability of a person responding in category x as a function of the difference between their trait level and a category intersection parameter. These intersection parameters represent the trait level at which a response in a category becomes more likely than a response in the previous category. The number of category intersection parameters is equal to one less than the number of response options. The partial credit model makes no assumption about rank ordering of response categories on the underlying continuum.

Rating Scale Model

The rating scale model15 assumes that the response categories are ordered. Response categories are assigned intersection parameters that are considered equal across items, and item location is described by a single scale location parameter. The location parameter represents the average difficulty of a particular item relative to the category intersections. Each item is assumed to provide the same amount of information and to have the same slope. Therefore, the rating scale model is also an extension of the Rasch model.

Assessing Model Fit

Choosing which model to use depends on the reasonableness of the assumptions about the scale items in the particular application.16 Unlike CTT, the reasonableness of an IRT model can be evaluated by examining its fit to the data. Dimensionality should be evaluated before choosing an IRT model. Tests of equal discrimination should be conducted before choosing a 1-PL model, and tests of minimal guessing, if relevant, should be conducted before choosing a 2-PL model. Finally, item fit χ² statistics17 and model residuals can be examined as a means of checking model predictions against actual test data (Table 6). The mean discrepancy (absolute values) across the 9 items and 3 response categories was 0.01 (SD = 0.01). The item fit χ² statistics were significant (P <0.05) for all items. Because statistical power increases with sample size, larger samples lead to a greater likelihood of significant χ² differences. Appropriate caution is needed in interpreting χ² statistics. The results in Table 6 suggest minimal practical differences between observed and expected response frequencies.

Potential Advantages of Using IRT in Health Outcomes Assessment

Table 7 lists some of the advantages of using IRT in health outcome assessment. This section summarizes these potential advantages.

More Comprehensive and Accurate Evaluation of Item Characteristics

Invariant Item and Latent Trait Estimates. In CTT, item means are confounded by valid group differences, and item-scale correlations are affected by group variability on the construct. When an IRT model fits the data exactly in the population, sample-invariant item and latent trait estimates are possible.18,19 Within sampling error, the


ICC should be the same regardless of what sample it was derived from (within a linear transformation), and the person estimates should be the same regardless of what items they are based on. Invariance is a population property that cannot be directly observed but can be evaluated within a sample. For instance, one can look to see if individual scores are the same regardless of what items are administered or whether item parameters are the same across subsets of the sample. Embretson20 illustrated with simulations that CTT estimates of item difficulty for different subgroups of the population can vary considerably and that the association between the estimates can be nonlinear. In contrast, difficulty estimates derived from the Rasch model were robust and very highly correlated (r = 0.997).

TABLE 6. Difference Between Observed and Expected Response Frequencies (Absolute Values) by Item and Response Category

Item                               Yes, Limited a Lot   Yes, Limited a Little   No, Not Limited at All   P
Vigorous activities                0.01                 0.02                    0.02                     <0.05
Walking >1 mile                    0.01                 0.02                    0.02                     <0.05
Climbing 1 flight of stairs        0.01                 0.03                    0.03                     <0.05
Shopping                           0.01                 0.01                    0.01                     <0.05
Walking 1 block                    0.01                 0.01                    0.01                     <0.05
Preparing meals or doing laundry   0.01                 0.00                    0.01                     <0.05
Bathing or dressing                0.01                 0.01                    0.00                     <0.05
Getting around inside your home    0.00                 0.00                    0.02                     <0.05
Feeding yourself                   0.01                 0.01                    0.01                     <0.05

The mean difference (absolute values) between the observed and expected response frequencies across all items and all response categories was 0.01 (SD = 0.01). The reported P values are based on the item-fit χ² reported by Parscale 3.5.

TABLE 7. Potential Advantages of Using IRT in Health Outcomes Assessment

● More comprehensive and accurate evaluation of item characteristics
● Assess group differences in item and scale functioning
● Evaluate scales containing items with different response formats
● Improve existing measures
● Computerized adaptive testing (CAT)
● Model change
● Evaluate person fit

Item and Scale Information Conditional on Trait Level. Any ICC can be transformed into an item information curve (the utility of information curves is dependent on how well the ICC fits the data).19 Information curves are analogous to reliability of measurement and indicate the precision (reciprocal of the error variance) of an item or test along the underlying trait continuum. An item provides the most information around its difficulty level. The maximum information lies at β in the 1-PL and 2-PL models. In a 3-PL model, the maximum information is not quite at β because as c decreases, information increases (all else being equal). The steeper the slope of the ICC and the smaller the item variance, the greater the item information.

Scale information depends on the number of items and how good the items are. The information provided by a multi-item scale is simply the sum of the item information functions. Standard error of measurement in IRT models is inversely related to information and hence is conditional on trait level: SE(θ) = 1/[I(θ)]^{1/2}. Because information varies by trait level, a scale may be quite precise for some people and not so precise for others. It is also possible to average the individual standard errors to obtain a composite estimate for the population.20 This means that items can be selected that are most informative for specific subgroups of the population.

To illustrate the information function, the formula for the 3-PL logistic model is given below:

I_i(θ) = 2.89α_i²(1 − c_i) / {[c_i + e^{1.7α_i(θ − β_i)}][1 + e^{−1.7α_i(θ − β_i)}]²}
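The information and standard error formulas above can be sketched directly in code (our illustration; with c = 0 this reduces to the 2-PL case, where information peaks exactly at the item difficulty):

```python
import math

def item_information(theta, a, b, c=0.0):
    """Item information for a 3-PL item with D = 1.7 scaling
    (2.89 = 1.7 squared), following the formula in the text."""
    num = 2.89 * a * a * (1.0 - c)
    den = (c + math.exp(1.7 * a * (theta - b))) * \
          (1.0 + math.exp(-1.7 * a * (theta - b))) ** 2
    return num / den

def scale_se(theta, items):
    # Scale information is the sum of the item informations; the
    # standard error of measurement is its reciprocal square root.
    info = sum(item_information(theta, a, b) for a, b in items)
    return 1.0 / math.sqrt(info)

# With c = 0, information at theta = b is 2.89 * a^2 / 4.
print(round(item_information(-1.5, 1.8, -1.5), 2))  # 2.34
```

Adding items can only increase scale information, so the standard error at a given trait level shrinks as informative items are added.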


FIG. 2. Item information functions for 3 items: 3-PL model.

Working through this equation shows that information is higher when the difficulty of the item is closer to the trait level, when the discrimination parameter is higher, and when the pseudo-guessing parameter, c, is smaller. Figure 2 plots item information curves for 3 items that vary in the 3-PL parameters: item 1 (c = 0.0; β = −1.5; α = 1.8), item 2 (c = 0.1; β = −0.5; α = 1.2), and item 3 (c = 0.0; β = 1.0; α = 1.8). Note that the information peaks at the difficulty level for items 1 and 3, because c = 0.0 for both of these items. For item 2, information peaks close to its difficulty level, but the peak is shifted a little because of the 0.1 c parameter.

Trait Estimates Anchored to Item Content. In CTT, the scale score is not typically informative about the item response pattern. However, if dichotomous items are consistent with a Guttman scale,21 then they are ordered along a single dimension in terms of their difficulty, and the pattern of responses to items is determined by the sum of the endorsed items. The linkage between trait level and item content in IRT is similar to the Guttman scale, but IRT models are probabilistic rather than deterministic.

In IRT, item and trait parameters are on the same metric, and the meaning of trait scores can be related directly to the probability of item responses. Hence, it is possible to obtain a relatively concrete picture of response pattern probabilities for an individual given the trait score. If the person's trait level exceeds the difficulty of an item, then the person is more likely than not to "pass" or endorse this item. Conversely, if the trait level is below the item difficulty, then the person is less likely to endorse than not endorse the item.

Assessing Group Differences in Item and Scale Functioning

IRT methods provide an ideal basis for assessing differential item functioning (DIF), defined as different probabilities of endorsing an item by respondents from 2 groups who are equal on the latent trait. When DIF is present, scoring respondents on the latent trait using a common set of item parameters causes trait estimates to be too high or too low for those in 1 group relative to the other.22,23 DIF is identified by looking to see if item characteristic curves differ (item parameters differ) by group.24

One way to assess DIF is to fit multigroup IRT models in which the slope and difficulty parameters are freely estimated versus constrained to be equal for different groups. If the less constrained model fits the data better, this suggests that there is significant DIF.
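The definition of DIF can be made concrete with a toy computation: hold trait level fixed and let the item parameters differ by group. The parameter values below are hypothetical, not estimates from any study.

```python
import math

def p_2pl(theta, a, b, D=1.7):
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

# Hypothetical group-specific parameters for one item: equal slopes,
# but the item is "harder" in group B (uniform DIF).
theta = 0.0  # two respondents with identical trait levels
p_group_a = p_2pl(theta, a=1.5, b=-0.2)
p_group_b = p_2pl(theta, a=1.5, b=0.3)

# Equal trait level, unequal endorsement probability: the item
# functions differently across the groups.
print(round(p_group_a - p_group_b, 3))
```

In practice this comparison is made with estimated parameters, typically via a likelihood-ratio test of the constrained versus unconstrained multigroup models, as described above.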


For example, Morales et al25 compared satisfaction-with-care responses between whites and Hispanics in a study of patients receiving medical care from an association of 48 physician groups.26 This analysis revealed that 2 of the 9 items functioned differently in the 2 groups but that the DIF did not have a meaningful impact on trait scores. When all 9 items were included in the satisfaction scale, the effect size was 0.27, with whites rating care significantly more positively than Hispanics. When the biased items were dropped from the scale, the effect size became 0.26, and the mean scale scores remained significantly different. Thus, statistically significant DIF does not necessarily invalidate comparisons between groups.

Evaluating Scales Containing Items With Different Response Formats

In CTT, typically one tries to avoid combining items with different variances because they have differential impact on raw scale scores. It is possible in CTT to convert items with different response options to a common range of scores (eg, 0 to 100) or to standardize the items so that they have the same mean and standard deviation before combining them. However, these procedures yield arbitrary weighting of items toward the scale score. IRT requires only that item responses have a specifiable relationship with the underlying construct.27 IRT models, such as the graded response model, have been developed that allow different items to have varying numbers of response categories.13

Improving Existing Measures

One possible benefit of IRT is facilitation of the development of new items to improve existing measures. Because standard errors of measurement are estimated conditional on trait level, IRT methods provide a strong basis for identifying where along the trait continuum the measurement provides little information and is in need of improvement. The ideal measure will provide high information at the locations of the trait continuum that are important for the intended application. For example, it may be necessary to identify only people who score so high on a depression scale that mental health counseling is needed to prevent a psychological crisis from occurring. The desired information function would be one that is highly peaked at the trait level associated with the depressive symptom threshold.

IRT statistics cannot tell the researcher how to write better items or exactly what items will fill an identified gap in the item difficulty range. Poorly fitting items can provide a clue to the types of things to avoid when writing new items. An item with a double negative may not fit very well because of respondent confusion. Items bounding the target difficulty range can provide anchors for items that need to be written.

For example, a Rasch analysis of the SF-36 physical functioning scale resulted in log-odds (logit) item location estimates of −1.93 for walking 1 block compared with −3.44 for bathing or dressing.28 To fill the gap between these difficulty levels, one might decide to write an item about preparing meals or laundry. This could be based on intuition about where the item will land or on existing data, including CTT estimates of item difficulty.

Computerized Adaptive Testing

CTT scales tend to be long because they are designed to produce a high coefficient α.29,30 But most of the items are a waste of time for any particular respondent because they yield little information. In contrast, IRT methods make it possible to estimate person trait levels with any subset of items in an item pool. Computerized adaptive testing (CAT) is ideally suited to IRT. Traditional, fixed-length tests require administering items that are too hard for those with low trait values and items that are too easy for those with high trait values.

There are multiple CAT algorithms.31 We describe one example here to illustrate the general approach. First, an item bank of highly discriminating items of varying difficulty levels is developed. Then each item administered is targeted at the trait level of the respondent. Without any prior information, the first item administered is often a randomly selected item of medium difficulty. After each response, examinee trait level and its standard error are estimated. If maximum likelihood is used to estimate trait level, step-size scoring (eg, 0.25 increment up or down) can be used after the first item is administered. The next item administered to those not endorsing the first item is an easier item located the specified step away from the first item.

II-36
Vol. 38, No. 9‚ Suppl. II IRT AND HEALTH OUTCOMES

next item administered is a harder item located the specified step away. After 1 item has been endorsed and 1 not endorsed, maximum likelihood scoring is possible and begun. The next item selected is an item that maximizes the likelihood function (ie, the item with a 50% chance of endorsement in the 1- and 2-PL models). CAT is terminated when the standard error falls below an acceptable value. Note that CAT algorithms can be designed for polytomous items as well.

Modeling Change

IRT models are well suited for tracking change in health. IRT models offer considerable flexibility in longitudinal studies when the same items have not been administered at every data collection wave. Because trait level can be estimated from any subset of items, it is possible to have a good trait estimate even if the items are not identical at different time points. Thus, in theory, the optimal subset of items could be administered to different respondents, and this optimal subset would vary depending on their trait level at each time point. This feature of IRT models means that respondent burden is minimized: each respondent can be administered only the number of items needed to establish a satisfactorily small standard error of measurement. This feature of IRT also will help to ensure that the reliability of measurement is sufficiently high to allow for monitoring individual patients over time.

IRT can also help address the issue of clinically important difference or change (see the article by Testa et al32 in this issue). Anchor-based approaches have been proposed that compare prospectively measured change in health to change on a clinical parameter (eg, viral load) or to retrospectively reported global change.33 For example, the change in a health-related quality of life scale associated with going from detectable to undetectable levels of viral load in people with HIV disease might be deemed clinically important. Because IRT trait estimates have direct implications for the probability of item responses and items are arrayed along a single continuum, substantive meaning can be attached to point-in-time and change scores. Trait level change can therefore be cast in light of concrete change in levels of functioning and well-being to help determine the threshold for clinically meaningful change.

One suggested advantage of IRT over CTT is interval level as opposed to ordinal level measurement. Although interval level and even ratio level measurement has been argued for Rasch models34 and a nonlinear transformation of trait level estimates can provide ratio-scale type of interpretation,35 the trait level scale is not strictly an interval scale. However, it has been noted that assessing change in terms of estimated trait level rather than raw scores can yield more accurate estimates of change.36 Ongoing work directed at item response theory models of change for within-subject change that can be extended to group level change offers exciting possibilities for longitudinal analyses.37

Evaluating Person Fit

An important development in the use of IRT methods is detection of the extent to which a person's pattern of item responses is consistent with the IRT model.38–40 Person-fit indexes have been developed for this purpose. The standardized ZL fit index is one such index: ZL|θ = {Σ ln L|θ − Σ E[ln L|θ]} / {Σ V[ln L|θ]}^(1/2), where ln = natural logarithm. Large negative ZL values (ZL ≤ −2.0) indicate misfit. Large positive ZL values indicate response patterns that are higher in likelihood than the model predicts.

Depending on the context, person misfit can be suggestive of an aberrant respondent, response carelessness, cognitive errors, fumbling, or cheating. The bottom line is that person misfit is a red flag that should be explored. For example, unpublished baseline data from HCSUS revealed a large negative ZL index41 for a respondent who reported that he was “limited a lot” in feeding, getting around inside his home, preparing meals, shopping, and climbing 1 flight of stairs but only “limited a little” in vigorous activities, walking >1 mile, and walking 1 block. The apparent inconsistencies in this response pattern suggest the possibility of carelessness in answers given in this face-to-face interview.

Methodological and Practical Challenges in Applying IRT Methods

Unidimensionality

In evaluations of dimensionality in the context of exploratory factor analysis, it has been recommended that one examine multiple criteria such as
the scree test,42 the Kaiser-Guttman eigenvalues >1.00 rule, the ratio of first to second eigenvalues, parallel analysis,43 the Tucker-Lewis44 reliability coefficient, residual analysis,45 and the interpretability of resulting factors.46 Determining the extent to which items are unidimensional is important in IRT analysis because this is a fundamental assumption of the method. Multidimensional IRT models have been developed.7

It is generally acknowledged that the assumption of unidimensionality “cannot be strictly met because several cognitive, personality, and test-taking factors always affect test performance, at least to some extent.”19 As a result, there has been recognition that establishing “essential unidimensionality” is sufficient for satisfying this assumption. Stout47,48 developed a procedure by which to judge whether or not a data set is essentially unidimensional. In short, a scale is essentially unidimensional when the average between-item residual covariance after fitting a 1-factor model approaches zero as the length of the scale increases.

Essential unidimensionality can be illustrated conceptually by use of a previously published example. In confirmatory factor analysis, it is possible that multiple factors provide a better fit to the data than a single dimension. Categorical confirmatory factor analytic models can now be estimated.49,50 The estimated correlation between 2 factors can be fixed at 1.0, and the fit of this model can be contrasted to a model that allows the factor correlation to be estimated. A χ2 test of the significance of the difference in model fit (1 df) can be used to determine whether 2 factors provide a better fit to the data. Even when 2 factors are extremely highly correlated (eg, r = 0.90), a 2-factor model might provide better fit to the data than a 1-factor model.51 Thus, statistical tests alone cannot be trusted to provide a reasonable answer about dimensionality. This is a case in which, even though unidimensionality was not fully satisfied, the items may be considered to have essential unidimensionality.

The 9 physical functioning items described earlier are polytomous (ie, have 3 response choices). We tested the unidimensionality assumption of IRT by fitting a 1-factor categorical confirmatory factor analysis.49 The model was statistically rejectable because of the large sample size (χ2 = 1,059.29, n = 2,829, df = 27, P < 0.001), but it fit the data well according to practical fit indexes (comparative fit index = 0.99). Standardized factor loadings ranged from 0.72 to 0.94, and the average absolute standardized residual was 0.05.

When Does IRT Matter?

Independent of whether IRT scoring improves on classic approaches to estimating true scores, IRT is likely to be viewed as a better way of analyzing measures. Nonetheless, there is interest in the comparability of CTT- and IRT-based item and person statistics. Recently, Fan52 used data collected from 11th graders on the Texas Assessment of Academic Skills to compare CTT and IRT parameter estimates. The academic skills assessment battery included a 48-item reading test and a 60-item math test, with each of the multiple-choice items scored correct or incorrect. Twenty random samples of 1,000 examinees were drawn from a pool of more than 193,000 participants.

CTT item difficulty estimates were the proportion of examinees passing each item, transformed to the (1 − p)th percentile from the z distribution. This transformation assumed that the underlying trait measured by each item was normally distributed. CTT item discrimination estimates were obtained by taking the Fisher z transformation of the item-scale correlation: z = [ln(1 + r) − ln(1 − r)]/2. Fan52 found that the CTT and IRT item difficulty and discrimination estimates were very similar. In this particular application, the resulting estimates did not change as a result of using the more sophisticated IRT methodology.

In theory, there are many possibilities for identifying meaningful differences between CTT and IRT. Because IRT models may better reflect actual response patterns, one would expect IRT estimates to be more accurate reflections of true status than CTT estimates. As a result, IRT estimates of health outcomes should be more sensitive to true cross-sectional differences and more responsive to change in health over time. Indeed, a recent study found that the sensitivity of the SF-36 physical functioning scale to differences in disease severity was greater for Rasch model–based scoring than for simple summated scoring.53 Similarly, a study of 194 individuals with multiple sclerosis54 revealed that the RAND-36 HIS mental health composite score, an IRT-based summary measure,55 correlated more strongly with the Expanded Disability Status Scale than did the SF-36 mental health summary score, a CTT-based measure. Moreover, the RAND-36 scores were found
to be more responsive to change in seizure frequency than the SF-36 scores in a sample of 142 adults participating in a randomized controlled trial of an antiepileptic drug.56

Because nonlinear transformations of dependent variables can either eliminate or create interactions between independent variables, apparent interactions in raw scores may vanish (and vice versa) when scored with IRT.57 Given the greater complexity and difficulty of IRT model–based estimates, it is important to document when IRT scoring (trait estimates) makes a difference.

Practical Problems in Applying IRT

There are a variety of software products available that can be used to analyze health outcomes data with IRT methods, including BIGSTEPS/WINSTEPS,58 MULTILOG,12 and PARSCALE.59 BIGSTEPS implements the 1-PL model. MULTILOG can estimate dichotomous or polytomous 1-, 2-, and 3-PL models; Samejima's graded response model; Masters' partial credit model; and Bock's nominal response model. Maximum likelihood and marginal maximum likelihood estimates can be obtained. PARSCALE estimates 1-, 2-, and 3-PL logistic models; Samejima's graded response model; Muraki's modification of the graded response model (rating scale version); the partial credit model; and the generalized partial credit model.

None of these programs are particularly easy to learn and implement. The documentation is often difficult to read, and finding out the reason for program failures can be time consuming and frustrating. The existing programs have a striking similarity to the early versions of the LISREL structural equation-modeling program.60 LISREL required a translation of familiar equation language into matrixes and Greek letters. Widespread adoption of IRT in health outcome studies will be facilitated by the development of user-friendly software.

Conclusions

IRT methods will be used in health outcome measurement on a rapidly increasing basis in the 21st century. The growing experience among health services researchers will lead to enhancements of the method's utility for the field and improvements in the collective applications of the methodology. We look forward to a productive 100 years of IRT and health outcomes measurement.

Acknowledgments

This article was written as one product from HCSUS. HCSUS was funded by a cooperative agreement (HS08578) between RAND and the Agency for Healthcare Research and Quality (M.F. Shapiro, principal investigator; S.A. Bozzette, co-principal investigator). Substantial additional support for HCSUS was provided by the Health Resources and Services Administration, National Institute for Mental Health, National Institute for Drug Abuse, and National Institutes of Health Office of Research on Minority Health through the National Institute for Dental Research. The Robert Wood Johnson Foundation, Merck and Company, Glaxo-Wellcome, and the National Institute on Aging provided additional support. Comments on an earlier draft provided by Paul Cleary were very helpful in revising the paper.

References

1. Mellenbergh GJ. Generalized linear item response theory. Psychol Bull 1994;115:300–307.
2. Hays RD, Spritzer KL, McCaffrey D, Cleary PD, Collins R, Sherbourne C, et al. The HIV Cost and Services Utilization Study (HCSUS) measures of health-related quality of life. Santa Monica, Calif: RAND; 1998. DRU-1897-AHCPR.
3. Shapiro MF, Morton SC, McCaffrey DF, Senterfitt JW, Fleishman JA, Perlman JF, et al. Variations in the care of HIV-infected adults in the United States: Results from the HIV Cost and Services Utilization Study. JAMA 1999;281:2305–2315.
4. Wu AW, Hays RD, Kelly S, Malitz F, Bozzette SA. Applications of the Medical Outcomes Study health-related quality of life measures in HIV/AIDS. Qual Life Res 1997;6:531–554.
5. Ware JE, Sherbourne CD. The MOS 36-item short-form health survey (SF-36), I: Conceptual framework and item selection. Med Care 1992;30:473–483.
6. Reise SP, Widaman KF, Pugh RH. Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychol Bull 1993;114:552–566.
7. Reckase MD. The past and future of multidimensional item response theory. Appl Psychol Meas 1997;21:25–36.
8. McDonald RP. The dimensionality of tests and items. Br J Math Stat Psychol 1981;34:100–117.
9. Santor DA, Ramsay JO, Zuroff DC. Nonparametric item analyses of the Beck Depression Inventory: Evaluating gender item bias and response option weights. Psychol Assess 1994;6:255–270.
10. Rasch G. An individualistic approach to item analysis. In: Lazarsfeld PF, Henry NW, eds. Readings in mathematical social science. Cambridge, Mass: Massachusetts Institute of Technology Press; 1966:89–108.
11. Rasch G. Probabilistic models for some intelligence and attainment tests. Copenhagen, Denmark: Danmarks Paedagogiske Institut; 1960.
12. Thissen D. MULTILOG user's guide: Multiple categorical item analysis and test scoring using item response theory. Chicago, Ill: Scientific Software, Inc; 1991.
13. Samejima F. The graded response model. In: van der Linden WJ, Hambleton R, eds. Handbook of modern item response theory. New York, NY: Springer; 1996:85–100.
14. Masters GN. A Rasch model for partial credit scoring. Psychometrika 1982;47:149–174.
15. Andrich D. A rating formulation for ordered response categories. Psychometrika 1978;43:561–573.
16. Andrich D. Distinctive and incompatible properties of two common classes of IRT models for graded responses. Appl Psychol Meas 1995;19:101–119.
17. McDonald RP, Mok MMC. Goodness of fit in item response models. Multivariate Behav Res 1995;30:23–40.
18. Bejar II. A procedure for investigating the unidimensionality of achievement tests based on item parameter estimates. J Educ Meas 1980;17:283–296.
19. Hambleton RK, Swaminathan H, Rogers HJ. Fundamentals of item response theory. Newbury Park, Calif: Sage; 1991.
20. Embretson SE. The new rules of measurement. Psychol Assess 1996;8:341–349.
21. Menzel H. A new coefficient for scalogram analysis. Public Opinion Q 1953;17:268–280.
22. Holland PW, Wainer H. Differential item functioning. Hillsdale, NJ: Erlbaum; 1993.
23. Millsap RE, Everson HT. Methodology review: Statistical approaches for assessing measurement bias. Appl Psychol Meas 1993;17:297–334.
24. Reise SP, Widaman KF, Pugh RH. Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychol Bull 1993;114:552–566.
25. Morales LS, Reise SP, Hays RD. Evaluating the equivalence of health care ratings by whites and Hispanics. Med Care 2000;38:517–527.
26. Hays RD, Brown JA, Spritzer KL, Dixon WJ, Brook RH. Satisfaction with health care provided by 48 physician groups. Arch Intern Med 1998;158:785–790.
27. Thissen D. Repealing rules that no longer apply to psychological measurement. In: Frederiksen N, Mislevy RJ, Bejar II, eds. Test theory for a new generation of tests. Hillsdale, NJ: Lawrence Erlbaum Associates; 1993:79–97.
28. Haley SM, McHorney CA, Ware JE. Evaluation of the MOS SF-36 physical functioning scale (PF-10), I: Unidimensionality and reproducibility of the Rasch item scale. J Clin Epidemiol 1994;47:671–684.
29. Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika 1951;16:297–334.
30. Guttman L. A basis for analyzing test-retest reliability. Psychometrika 1945;10:255–282.
31. Wainer H. Computerized adaptive testing: A primer. Hillsdale, NJ: Lawrence Erlbaum; 1990.
32. Testa MA. Interpretation of quality-of-life outcomes: Issues that affect magnitude and meaning. Med Care 2000;38(suppl II):II-166–II-174.
33. Samsa G, Edelman D, Rothman ML, Williams GR, Lipscomb J, Matchar D. Determining clinically important differences in health status measures: A general approach with illustration to the Health Utilities Index Mark II. Pharmacoeconomics 1999;15:141–155.
34. Fischer GH. Some neglected problems in IRT. Psychometrika 1995;60:459–487.
35. Hambleton RK, Swaminathan H. Item response theory: Principles and applications. Boston, Mass: Kluwer-Nijhoff; 1985.
36. May K, Nicewander WA. Measuring change conventionally and adaptively. Educ Psychol Meas 1998;58:882–897.
37. Mellenbergh GJ, van den Brink WP. The measurement of individual change. Psychol Methods 1998;3:470–485.
38. Reise SP. A comparison of item- and person-fit methods of assessing model-data fit in IRT. Appl Psychol Meas 1990;14:127–137.
39. Reise SP, Flannery WP. Assessing person-fit on measures of typical performance. Appl Meas Educ 1996;9:9–26.
40. Reise SP, Waller NG. Traitedness and the assessment of response pattern scalability. J Pers Soc Psychol 1993;65:143–151.
41. Drasgow F, Levine MV, Williams EA. Appropriateness measurement with polychotomous item response models and standardized indices. Br J Math Stat Psychol 1985;38:67–86.
42. Cattell R. The scree test for the number of factors. Multivariate Behav Res 1966;1:245–276.
43. Montanelli RG, Humphreys LG. Latent roots of random data correlation matrices with squared multiple correlations on the diagonal: A Monte Carlo study. Psychometrika 1976;41:341–347.
44. Tucker LR, Lewis C. A reliability coefficient for maximum likelihood factor analysis. Psychometrika 1973;38:1–10.
45. Hattie J. Methodology review: Assessing unidimensionality of tests and items. Appl Psychol Meas 1985;9:139–164.
46. Floyd FJ, Widaman KF. Factor analysis in the development and refinement of clinical assessment instruments. Psychol Assess 1995;7:286–299.
47. Stout W. A nonparametric approach for assessing latent trait unidimensionality. Psychometrika 1987;52:589–617.
48. Stout W. A new item response theory modeling approach with applications to unidimensional assessment and ability estimation. Psychometrika 1990;55:293–326.
49. Lee SY, Poon WY, Bentler PM. A two stage estimation of structural equation models with continuous and polytomous variables. Br J Math Stat Psychol 1995;48:339–358.
50. Muthen LK, Muthen BO. The comprehensive modeling program for applied researchers: User's guide. Los Angeles, Calif: Muthen & Muthen; 1998.
51. Marshall GN, Hays RD, Sherbourne C, Wells KB. The structure of patient ratings of outpatient medical care. Psychol Assess 1993;5:477–483.
52. Fan X. Item response theory and classical test theory: An empirical comparison of their item/person statistics. Educ Psychol Meas 1998;58:357–381.
53. McHorney CA, Haley SM, Ware JE. Evaluation of the MOS SF-36 physical functioning scale (PF-10), II: Comparison of relative precision using Likert and Rasch scoring methods. J Clin Epidemiol 1997;50:451–461.
54. Nortvedt MW, Riise T, Myhr K, Nyland HI. Performance of the SF-36, SF-12, and RAND-36 summary scales in a multiple sclerosis population. Med Care. In press.
55. Hays RD, Prince-Embury S, Chen H. RAND-36 Health Status Inventory. San Antonio, Tex: Psychological Corp; 1998.
56. Birbeck GL, Kim S, Hays RD, Vickrey BG. Quality of life measures in epilepsy: How well can they detect change over time? Neurology 2000;54:1822–1827.
57. Embretson SE. Item response theory models and spurious interaction effects in factorial ANOVA designs. Appl Psychol Meas 1996;20:201–212.
58. Wright BD, Linacre JM. User's guide to BIGSTEPS: Rasch-model computer program. Chicago, Ill: MESA Press; 1997.
59. Muraki E, Bock RD. PARSCALE (version 3.5): Parameter scaling of rating data. Chicago, Ill: Scientific Software Inc; 1998.
60. Joreskog KG, Sorbom D. LISREL V: Analysis of linear structural relationships by the method of maximum likelihood (user's guide). Chicago, Ill: National Educational Resources; 1981.
Appendix

FIG. 3. Physical functioning items in HCSUS. R indicates respondent.
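As a supplementary illustration, the adaptive testing loop described in the Computerized Adaptive Testing section can be sketched in code. This sketch is not any of the cited programs: the 2-PL item bank, the 0.25 step size, the standard error threshold, and the grid-based maximum likelihood scoring are all hypothetical choices made for the example.

```python
import math
import random

def p_2pl(theta, a, b):
    """Probability of endorsing a 2-PL item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def ml_theta(responses, items):
    """Maximum likelihood trait estimate via a coarse grid search (illustrative)."""
    def loglik(theta):
        total = 0.0
        for u, (a, b) in zip(responses, items):
            p = p_2pl(theta, a, b)
            total += math.log(p) if u == 1 else math.log(1.0 - p)
        return total
    grid = [g / 100.0 for g in range(-400, 401)]
    return max(grid, key=loglik)

def standard_error(theta, items):
    """Standard error from the test information of the administered 2-PL items."""
    info = sum(a * a * p_2pl(theta, a, b) * (1.0 - p_2pl(theta, a, b))
               for a, b in items)
    return 1.0 / math.sqrt(info)

def run_cat(bank, answer, step=0.25, se_target=0.40, max_items=20):
    """Administer items adaptively; return (trait estimate, number of items used)."""
    administered, responses = [], []
    remaining = list(bank)
    theta = 0.0  # no prior information: start at medium difficulty
    while remaining and len(administered) < max_items:
        # Select the remaining item whose difficulty is closest to the current
        # estimate (the ~50% endorsement point in the 1- and 2-PL models).
        item = min(remaining, key=lambda ab: abs(ab[1] - theta))
        remaining.remove(item)
        u = answer(item)
        administered.append(item)
        responses.append(u)
        mixed = 0 < sum(responses) < len(responses)
        if mixed:
            theta = ml_theta(responses, administered)  # ML scoring is now possible
            if standard_error(theta, administered) < se_target:
                break  # measurement is precise enough; stop
        else:
            theta += step if u == 1 else -step  # step-size scoring
    return theta, len(administered)

# Simulate a respondent with true trait level 1.0 on a hypothetical 17-item bank.
random.seed(7)
bank = [(1.5, b / 4.0) for b in range(-8, 9)]  # difficulties from -2.0 to 2.0
answer = lambda item: 1 if random.random() < p_2pl(1.0, *item) else 0
estimate, n_used = run_cat(bank, answer)
print(f"estimated theta = {estimate:.2f} from {n_used} items")
```

Because the simulated respondent answers probabilistically, the printed estimate depends on the seed; the point is the control flow: select the item nearest the current estimate, use step-size scoring until the response pattern contains both an endorsement and a non-endorsement, then switch to maximum likelihood scoring and stop once the standard error is acceptably small.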

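The person-fit computation can be sketched the same way. For dichotomous items under the 2-PL model, the standardized ZL index compares the log likelihood of an observed response pattern with its expectation and variance, conditional on trait level. The HCSUS example in the text involves polytomous items, so this dichotomous version and its item parameters are simplifying assumptions for illustration.

```python
import math

def p_2pl(theta, a, b):
    """Probability of endorsing a 2-PL item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def zl_index(responses, items, theta):
    """Standardized log-likelihood person-fit index for dichotomous items:
    (ln L - E[ln L]) / sqrt(V[ln L]), all conditional on trait level theta."""
    ln_l = expected = variance = 0.0
    for u, (a, b) in zip(responses, items):
        p = p_2pl(theta, a, b)
        q = 1.0 - p
        ln_l += u * math.log(p) + (1 - u) * math.log(q)
        expected += p * math.log(p) + q * math.log(q)
        variance += p * q * math.log(p / q) ** 2
    return (ln_l - expected) / math.sqrt(variance)

# Hypothetical 5-item scale ordered from easy to hard (a = 1.5 throughout).
items = [(1.5, b) for b in (-2.0, -1.0, 0.0, 1.0, 2.0)]
consistent = zl_index([1, 1, 1, 0, 0], items, theta=0.0)  # endorses easy, fails hard
aberrant = zl_index([0, 0, 1, 1, 1], items, theta=0.0)    # fails easy, endorses hard
print(f"consistent: {consistent:.2f}, aberrant: {aberrant:.2f}")
```

A respondent who fails the easy items while endorsing the hard ones produces a large negative ZL, the same red-flag signature as the inconsistent HCSUS response pattern described in the text.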