
Item Response Theory (IRT) Models for Questionnaire Evaluation

Ron D. Hays, Ph.D.

Professor, UCLA Department of Medicine

September 7, 2009 (1st Draft)

Contact Information:

Ron D. Hays, Ph.D.

UCLA Department of Medicine

911 Broxton Avenue, Room 110

Los Angeles, CA 90024-2801

Phone: 310-794-2294

Fax: 310-794-0732

E-mail: drhays@ucla.edu

Acknowledgements: Preparation of this paper was supported in part by grants from the National
Institute on Aging (AG020679-01, P30AG021684), NCMHD (2P20MD000182), the National
Institutes of Health Roadmap for Medical Research Grant (AR052177), and the UCLA Older
Americans Independence Center (P30-AG028748). This paper will be presented at the
Workshop on Question Evaluation Methods sponsored by the National Center for Health
Statistics and National Cancer Institute to be held October 21-23, 2009 at the National Center for
Health Statistics in Hyattsville, MD (during item response theory modeling session 2:30-4:30 on
October 22).

The paper by Reeve (2009) presents a general overview of item response theory (IRT) with
implications for questionnaire evaluation. He discusses IRT models and assumptions along with
examples of how IRT can help in the evaluation of questionnaire items. This response to Reeve
(2009) focuses on the features of IRT mentioned that have the greatest implications for
questionnaire evaluation and development: category response curves, differential item
functioning, information, and computer-adaptive testing. Some issues not covered in Reeve
(2009) are also mentioned, such as the evaluation of person fit and the determination of the
appropriate unit of analysis.

Category Response Curves (CRCs). CRCs help in the evaluation of the response options of
each item. The CRCs display the relative position of the response categories along the
underlying continuum. In the example shown in Figure 7 of the Reeve (2009) paper, the
categories fall along the expected ordinal order from no change to a very great change. In
addition, 2 of 6 response options (i.e., very small change and small change) are never most likely
to be chosen by respondents across all levels of the posttraumatic growth scale continuum.
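
To make the CRC computation concrete, a minimal Python sketch is given below. It evaluates graded-response-model category probabilities over a theta grid and flags categories that are never the modal choice; the six-category item parameters are hypothetical illustrations, not the estimates behind Figure 7 of Reeve (2009).

import numpy as np

def grm_category_probs(theta, a, b):
    # Category response curves for one graded-response-model item:
    # theta is an array of latent trait values, a is the discrimination
    # (slope), and b holds the ordered thresholds (k - 1 values for k categories).
    theta = np.asarray(theta, dtype=float)
    b = np.asarray(b, dtype=float)
    # Boundary (cumulative) response curves P(X >= j | theta)
    p_star = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    cum = np.hstack([np.ones((len(theta), 1)), p_star, np.zeros((len(theta), 1))])
    # Adjacent differences of the boundary curves give the category probabilities
    return cum[:, :-1] - cum[:, 1:]

theta = np.linspace(-3, 3, 121)
probs = grm_category_probs(theta, a=1.5, b=[-1.5, -0.8, -0.2, 0.6, 1.4])
modal = {int(k) for k in probs.argmax(axis=1)}
never_modal = [k for k in range(probs.shape[1]) if k not in modal]
print("Categories never most likely:", never_modal)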

Reeve suggests that one or both of the response categories could be dropped or reworded to
improve the response scale. But it might be challenging to determine what it is about “very
small” and “small” that makes them less likely to be endorsed than “no” change for respondents
with posttraumatic growth scores in the lower range of the construct (−2 to −1 theta).
Additional information such as cognitive interviews may provide insights (Hays & Reeve, 2008).
One might also consider eliciting perceptions of where the problematic response options and
alternative possibilities lie along the underlying construct continuum using the method of equal-
appearing intervals (Thurstone & Chave, 1929). In this method, a sample of raters is asked to
rate the position of intermediate response choices using a 10-cm line anchored by the extreme
(lowest and highest) response choices (Ware et al., 1996).
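
A minimal sketch of the scaling step in that method is shown below; the rater placements on the 10-cm line are hypothetical values used only to illustrate the computation, not data from Ware et al. (1996).

import numpy as np

# Hypothetical placements (cm along a 10-cm line anchored by the extreme
# response choices) from a small sample of raters.
placements = {
    "very small change": [1.2, 0.8, 2.0, 1.5, 1.0],
    "small change": [2.4, 2.0, 3.1, 2.6, 2.2],
    "moderate change": [5.1, 4.8, 5.5, 5.0, 4.6],
}
for label, cm in placements.items():
    q1, median, q3 = np.percentile(cm, [25, 50, 75])
    # The median placement serves as the scale value; a wide interquartile
    # range flags a response choice that raters place inconsistently.
    print(f"{label}: scale value = {median:.1f} cm, IQR = {q3 - q1:.1f} cm")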

Differential item functioning (DIF). If response options are changed and it is important to
know the impact of the revisions, one can evaluate DIF. When the target construct is less well
defined, questions that tap it are more sensitive to changes in response categories (Rockwood,
Sangster & Dillman, 1997). DIF is also more likely to be present if response options are
changed substantially rather than slightly. It has been shown, for example, that items
whose response anchors changed the most were more likely to exhibit DIF across two years of
administration of an attitude survey to Federal Aviation Administration employees (Farmer et al.,
2001).
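
One practical way to run such a check is a logistic regression DIF analysis for a dichotomous item, comparing models with and without group and group-by-score terms. The sketch below uses simulated data with no true DIF; the variable names and the use of a theta estimate as the matching variable are illustrative assumptions.

import numpy as np
import statsmodels.api as sm
from scipy import stats

def logistic_dif_test(item, matching, group):
    # Likelihood-ratio test comparing a model with the matching variable only
    # against one that adds group membership and its interaction with the
    # matching variable (uniform plus nonuniform DIF), for a 0/1 item.
    base = sm.Logit(item, sm.add_constant(matching)).fit(disp=0)
    design = np.column_stack([matching, group, matching * group])
    full = sm.Logit(item, sm.add_constant(design)).fit(disp=0)
    lr = 2.0 * (full.llf - base.llf)
    return lr, stats.chi2.sf(lr, df=2)

# Simulated data with no true DIF, so the p-value should usually exceed .05.
rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, n).astype(float)
theta = rng.normal(size=n)
item = rng.binomial(1, 1.0 / (1.0 + np.exp(-1.3 * (theta - 0.2)))).astype(float)
print(logistic_dif_test(item, theta, group))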

Coons et al. (2009) note that substantial changes to items (stems or response scales) may make
equivalence of the old and new items irrelevant. However, estimating the comparability of the
old and new versions may still be valuable for purposes such as bridging scores. For example, a
few years ago UCLA transitioned from the Picker to the Consumer Assessment of Healthcare

Providers and Systems (CAHPS®) hospital survey. Differences in wording, response options,
and cut-points for “problem scores” yielded large differences in problem score rates between the
two survey instruments that required bridging formulas. Tetrachoric correlations for 5 of 6 item
pairs indicated high correspondence (r’s of 0.71-0.97) in the underlying constructs. Bridged
scores contained less information than directly measured new scores, but with sufficient sample
sizes they could be used to detect trends across the transition (Quigley et al., 2008).
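
For readers who want to reproduce this kind of check, a minimal sketch of maximum-likelihood estimation of a tetrachoric correlation from a 2 x 2 cross-tabulation of dichotomized "problem" responses follows; the cell counts are hypothetical, not the Quigley et al. (2008) data.

import numpy as np
from scipy import optimize, stats

def tetrachoric(table):
    # Maximum-likelihood tetrachoric correlation for a 2 x 2 table, where
    # table[i, j] counts respondents with item 1 = i and item 2 = j (0/1).
    table = np.asarray(table, dtype=float)
    n = table.sum()
    t1 = stats.norm.isf(table[1, :].sum() / n)   # threshold for item 1
    t2 = stats.norm.isf(table[:, 1].sum() / n)   # threshold for item 2

    def neg_loglik(rho):
        phi2 = stats.multivariate_normal.cdf([t1, t2], mean=[0.0, 0.0],
                                             cov=[[1.0, rho], [rho, 1.0]])
        p00 = phi2
        p01 = stats.norm.cdf(t1) - phi2
        p10 = stats.norm.cdf(t2) - phi2
        p11 = 1.0 - stats.norm.cdf(t1) - stats.norm.cdf(t2) + phi2
        probs = np.array([[p00, p01], [p10, p11]])
        return -np.sum(table * np.log(np.clip(probs, 1e-12, 1.0)))

    res = optimize.minimize_scalar(neg_loglik, bounds=(-0.999, 0.999),
                                   method="bounded")
    return res.x

# Hypothetical 2 x 2 table for an old-item/new-item pair (illustration only)
print(round(tetrachoric([[520, 80], [60, 340]]), 2))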

Examination of DIF may need to be done for different modes of administration because the
effect of changes in response options might vary by mode. For example, the CAHPS®
clinician/group survey can be administered using either a four-category (never, sometimes,
usually, always) or six-category (never, almost never, sometimes, usually, almost always,
always) response scale. There is likely to be greater DIF for telephone administration than self-
administration because of the effect of memory limitations on responses over the phone.

Information. Another advantage of IRT pointed out by Reeve (2009) is that information
(precision) is estimated at different points along the construct continuum. Reliability in its basic
formulation is equal to 1 − SE², where SE = 1/√(information). The SE and confidence
interval around estimated scores for those with milder levels of depressive symptoms are larger
than for those with moderate to severe depressive symptoms (see Figure 4 of Reeve, 2009).
Lowering the SE for those with mild symptoms requires adding or replacing existing items with
more informative items at this range of the depressive symptoms continuum.
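
The translation from item parameters to information, SE, and score-level reliability can be sketched as follows; the three graded-response items are hypothetical and are chosen only to show how precision varies across the continuum.

import numpy as np

def grm_item_information(theta, a, b):
    # Samejima graded-response-model item information at each theta value.
    theta = np.asarray(theta, dtype=float)
    b = np.asarray(b, dtype=float)
    p_star = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    cum = np.hstack([np.ones((len(theta), 1)), p_star, np.zeros((len(theta), 1))])
    p = cum[:, :-1] - cum[:, 1:]          # category probabilities
    d_cum = a * cum * (1.0 - cum)         # derivatives of the boundary curves
    d_p = d_cum[:, :-1] - d_cum[:, 1:]    # derivatives of the category curves
    return (d_p ** 2 / np.clip(p, 1e-12, None)).sum(axis=1)

theta = np.linspace(-2, 2, 5)
items = [(2.0, [-2.0, -1.0, 0.0, 1.0]),    # (a, thresholds), hypothetical
         (1.2, [-1.5, -0.5, 0.5, 1.5]),
         (0.9, [-0.5, 0.5, 1.5, 2.5])]
info = sum(grm_item_information(theta, a, b) for a, b in items)
se = 1.0 / np.sqrt(info)                   # SE = 1 / sqrt(information)
reliability = 1.0 - se ** 2                # on the standardized theta metric
for t, i, s, r in zip(theta, info, se, reliability):
    print(f"theta = {t:+.1f}: information = {i:.2f}, SE = {s:.2f}, reliability = {r:.2f}")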

But this is easier said than done. There is a limit on the number of ways to ask about a targeted
range of the construct. One needs to avoid including essentially the same item multiple times.
For example, “I’m generally sad about my life” and “My life is generally sad” are so similar that
this could lead to violations of the IRT local independence assumption. That is, these depressive
symptom items would be correlated above and beyond what the common depression construct
would predict. One would then find significant residual correlations for the item pair after
controlling for the common factor defining the depressive symptom items. Ignoring the local
dependency would lead to inflated slope or discrimination parameters for the pair of items.
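
A common screen for this kind of local dependence is Yen's Q3 statistic, the correlation of model residuals for an item pair after the common trait is taken into account. The sketch below uses a simplified continuous-response simulation rather than a fitted graded response model, so the variance components are illustrative assumptions.

import numpy as np

def q3(x_i, x_j, expected_i, expected_j):
    # Yen's Q3 for one item pair: correlate the residuals (observed minus
    # model-expected item scores) across respondents.
    return np.corrcoef(x_i - expected_i, x_j - expected_j)[0, 1]

# Two near-duplicate items share a wording effect beyond the common trait;
# a third item does not.
rng = np.random.default_rng(1)
n = 1000
theta = rng.normal(size=n)
wording = rng.normal(size=n)                       # shared nuisance factor
x1 = theta + 0.6 * wording + rng.normal(scale=0.5, size=n)
x2 = theta + 0.6 * wording + rng.normal(scale=0.5, size=n)
x3 = theta + rng.normal(scale=0.8, size=n)
print(q3(x1, x2, theta, theta))                    # clearly positive
print(q3(x1, x3, theta, theta))                    # near zero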

Candidate global physical health items administered in the Patient-Reported Outcomes
Measurement Information System (PROMIS) included the commonly used excellent to poor
rating of health and a parallel rating of physical health (Hays, Bjorner et al., 2009). A single-
factor categorical confirmatory factor analytic model for 5 global physical health items including
this pair showed less than adequate fit. Adding a residual correlation (r = 0.29) between the item
pair led to a noteworthy improvement in model fit. When the graded response model was fit to
the 5 items, the discrimination parameters for the locally dependent items were 7.37 and 7.65;
the next largest value for an item was 1.86. Dropping the “In general, how would you rate your
health” item resulted in a discrimination parameter of 2.31 for the “rate your physical health” item,
and this was no longer the largest discrimination parameter for the remaining 4 physical health
items (it was second largest).
With IRT-based SEs and precision, users have access to a more appropriate confidence interval
around an individual’s score for clinical applications. This is very important because use of
patient-reported outcomes (PROs) at the individual level necessitates a high level of reliability
(0.90 minimum reliability was advocated by Nunnally, 1978). Accurate information about the
precision of measures is critically important in clinical decision making. Having the best possible
information is especially important as PROs are used more frequently in the coming years to
monitor effects of different treatment options and as input to clinical decisions (Fung & Hays,
2008).
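
On a standardized theta metric, the 0.90 benchmark translates directly into a precision target using the formulas above: a reliability of 0.90 corresponds to an SE no larger than √(1 − 0.90) ≈ 0.32, which in turn requires information of at least 1/(1 − 0.90) = 10 throughout the score range where individual-level decisions are made.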

Computer-Adaptive Testing. The capacity to draw on a large pool of items (an item bank)
and rely upon CAT to tailor the number and kinds of items administered to different respondents
is one of the most exciting aspects of IRT for questionnaire developers. By tailoring items based
on iterative theta estimates, the smallest number of items can be administered to achieve a target
level of precision for each individual. The items chosen to be administered to an individual are
those that provide the most information at the person’s estimated theta level. As noted by Reeve
(2009), information is driven by the slope or discrimination parameter and revealed in the item
information curve or function.
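
The selection logic can be illustrated with a small simulation. The sketch below assumes a hypothetical bank of dichotomous two-parameter logistic items and simple EAP scoring on a grid; the same maximum-information selection applies to polytomous banks such as those used in PROMIS.

import numpy as np

# Hypothetical 2PL item bank (slopes a, difficulties b)
a = np.array([2.1, 1.8, 1.5, 1.2, 1.0, 0.9])
b = np.array([-1.5, -0.5, 0.0, 0.5, 1.2, 2.0])
grid = np.linspace(-4, 4, 161)
prior = np.exp(-0.5 * grid ** 2)              # standard-normal prior (unnormalized)

def prob(theta, i):
    return 1.0 / (1.0 + np.exp(-a[i] * (theta - b[i])))

def information(theta, i):
    p = prob(theta, i)
    return a[i] ** 2 * p * (1.0 - p)          # 2PL item information

def simulate_cat(true_theta, se_target=0.38, rng=np.random.default_rng(0)):
    posterior = prior.copy()
    administered = []
    while True:
        theta_hat = np.sum(grid * posterior) / np.sum(posterior)          # EAP estimate
        se = np.sqrt(np.sum((grid - theta_hat) ** 2 * posterior) / np.sum(posterior))
        remaining = [i for i in range(len(a)) if i not in administered]
        if se <= se_target or not remaining:
            return theta_hat, se, administered
        # Administer the remaining item with the most information at theta_hat
        nxt = max(remaining, key=lambda i: information(theta_hat, i))
        x = int(rng.random() < prob(true_theta, nxt))                      # simulated answer
        posterior = posterior * (prob(grid, nxt) if x else 1.0 - prob(grid, nxt))
        administered.append(nxt)

print(simulate_cat(true_theta=0.4))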

Because different items, and different sequences of items, are administered in a CAT, attention needs to be
given to the potential for context effects. Lee and Grant (2009) randomly assigned 1,191
English-language and 824 Spanish-language participants in the 2007 California Health Interview
Survey to different orders of administration of a self-rated health item and a list of chronic
conditions. They found no order effect for English-language respondents, but Spanish-language
respondents reported worse self-rated health when it was asked before, rather than after, the
chronic conditions. The National Health Measurement Study of 3,844
individuals found that the responses to the EQ-5D (VAS and U.S. preference-weighted score)
and the SF-36 (SF6D and MCS) were significantly more positive when administered later in a
telephone interview (Hays et al., in preparation). But the magnitude of the order effects was
small.

Person Fit. An issue not addressed by Reeve (2009) that has important implications for
questionnaire evaluation and interpretation is person fit. Person fit evaluates the extent to which
a person’s pattern of item responses is consistent with the IRT model (Reise, 1990). It is in some
sense a micro-level evaluation of DIF. The standardized ZL fit index is an example of a person fit
index (ln = natural logarithm): ZL|θ = [∑ ln L|θ − ∑ E(ln L|θ)] / √(∑ V(ln L|θ)). Large negative
ZL values indicate misfit. Large positive ZL values indicate response patterns that are higher in
likelihood than the model predicts. Person misfit can be suggestive of response carelessness or
cognitive errors. These may occur, for example, if the readability of the survey exceeds the
literacy of the respondent (Paz et al., 2009). It is also possible that the IRT model just does not
apply to that individual. Cases with significant person misfit can be excluded or at least flagged
to determine impact on conclusions from analyses.
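
A minimal sketch of this index for dichotomous two-parameter logistic items is given below; the item parameters and the two response patterns are hypothetical and serve only to show that an aberrant pattern (missing easy items while endorsing hard ones) produces a large negative value.

import numpy as np

def zl_person_fit(responses, a, b, theta):
    # Standardized log-likelihood person fit statistic for 0/1 items at a
    # fixed theta estimate: (observed - expected log-likelihood) / SD.
    responses = np.asarray(responses, dtype=float)
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    loglik = np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
    expected = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))
    variance = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)
    return (loglik - expected) / np.sqrt(variance)

a = np.array([1.8, 1.6, 1.4, 1.2, 1.0, 0.9, 0.8, 0.7])      # hypothetical slopes
b = np.array([-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5])  # hypothetical difficulties
consistent = [1, 1, 1, 1, 1, 0, 0, 0]
aberrant = [0, 0, 0, 1, 0, 1, 1, 1]
print(zl_person_fit(consistent, a, b, theta=0.2))            # near zero or positive
print(zl_person_fit(aberrant, a, b, theta=0.2))              # large negative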

A scatter plot of a person fit index by theta from answers to the PROMIS physical functioning
item bank is provided in Figure 1. [will add more about person misfit cases later]

Unit of analysis. Most applications of IRT in the patient-reported outcomes field analyze the
data at the individual level. But the appropriate unit of analysis for some patient-reported
outcomes is a higher-level aggregate. For example, the CAHPS health plan surveys focus on the
health plan rather than the individual completing the survey. Hence, psychometric analyses need
to be conducted at the health plan level. Group-level IRT analyses of CAHPS health plan survey
data from a sample of 35,572 Medicaid recipients nested within 131 health plans confirmed that
within-plan variation dominates between-plan variation for the items (Reise, Meijer, Ainsworth,
Morales, & Hays, 2006). Hence, large sample sizes are needed to reliably differentiate among
plans. A plan-level 3-parameter IRT model fit showed that CAHPS items had small item
discrimination parameter estimates relative to the person-level estimates. In addition, CAHPS
items had large lower asymptote parameters at the health plan level. While these results are not
surprising in and of themselves, the performance of the CAHPS survey at the unit of analysis for
which it is being used is what is most important.
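
The sample size point can be illustrated with one-way intraclass correlations computed from simulated plan-level data; the number of plans, respondents per plan, and variance components below are hypothetical, not the CAHPS estimates from Reise et al. (2006).

import numpy as np

def plan_level_reliability(scores, plan_ids):
    # One-way ANOVA intraclass correlations: ICC(1) for a single respondent
    # and the Spearman-Brown reliability of plan means at the average plan size.
    plans = np.unique(plan_ids)
    counts = np.array([np.sum(plan_ids == p) for p in plans])
    k = counts.mean()
    grand = scores.mean()
    means = np.array([scores[plan_ids == p].mean() for p in plans])
    ss_between = np.sum(counts * (means - grand) ** 2)
    ss_within = sum(np.sum((scores[plan_ids == p] - m) ** 2)
                    for p, m in zip(plans, means))
    ms_between = ss_between / (len(plans) - 1)
    ms_within = ss_within / (len(scores) - len(plans))
    icc1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    return icc1, k * icc1 / (1 + (k - 1) * icc1)

# Simulated data: between-plan variance is small relative to within-plan variance
rng = np.random.default_rng(2)
plan_ids = np.repeat(np.arange(50), 100)            # 50 plans, 100 respondents each
scores = rng.normal(scale=0.15, size=50)[plan_ids] + rng.normal(size=plan_ids.size)
print(plan_level_reliability(scores, plan_ids))     # small ICC(1), moderate plan-mean reliability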

References

Coons, S. J., Gwaltney, C. J., Hays, R. D., Lundy, J. J., Sloan, J. A., Revicki, D. A.,
Lenderking, W. R., Cella, D., & Basch, E. (2009). Recommendations on evidence needed to
support measurement equivalence between electronic and paper-based Patient-Reported
Outcome (PRO) Measures: ISPOR ePRO good research practices task force report. Value in
Health, 12, 419-429.

Farmer, W. L., Thompson, R. C., Heil, S. K. R., & Heil, M. C. (2001, February). Latent
trait theory analysis of changes in item response anchors. U. S. Department of Transportation,
Federal Aviation Administration, National Technical Information Service, Springfield, Virginia
22161.

Fung, C. H., & Hays, R. D. (2008). Prospects and challenges in using patient-reported
outcomes in clinical practice. Quality of Life Research, 17, 1297-302.

Hays, R. D. (in preparation). Do generic health-related quality of life scores vary by
order of administration?

Hays, R. D., Bjorner, J., Revicki, D. A., Spritzer, K., & Cella, D. (2009). Development of
physical and mental health summary scores from the Patient-Reported Outcomes Measurement
Information System (PROMIS) global items. Quality of Life Research, 18, 873-80.

Hays, R. D., & Reeve, B. B. (2008). Measurement and modeling of health-related quality
of life. In K. Heggenhougen & S. Quah (eds.), International Encyclopedia of Public Health (pp.
241-251). San Diego: Academic Press.

Lee, S., & Grant, D. (2009). The effect of question order on self-rated general health
status in a multilingual survey context. American Journal of Epidemiology, 169, 1525-1530.

Nunnally, J. (1978). Psychometric theory, 2nd edition. New York: McGraw-Hill.

Paz, S. H., Liu, H., Fongwa, M. N., Morales, L. S., & Hays, R. D. (2009). Readability
estimates for commonly used health-related quality of life surveys. Quality of Life Research, 18,
889-900.

Quigley, D., Elliott, M. N., Hays, R. D., Klein, D., & Farley, D. (2008). Bridging from
the Picker Hospital Survey to the CAHPS® hospital survey. Medical Care, 46, 654-661.

Reeve, B. B. (2009, October 21-23). Applying item response theory (IRT) models for
questionnaire evaluation. Workshop on Question Evaluation Methods. National Center for
Health Statistics and National Cancer Institute.

Reise, S. P. (1990). A comparison of item- and person-fit methods of assessing model-
data fit in IRT. Applied Psychological Measurement, 14, 127-137.

Reise, S. P., Meijer, R. R., Ainsworth, A. T., Morales, L. S., & Hays, R. D. (2006).
Application of group-level item response models in the evaluation of consumer reports about
health plan quality. Multivariate Behavioral Research, 41, 85-102.

Rockwood, T. H., Sangster, R. L., & Dillman, D. A. (1997). The effect of response
categories on questionnaire answers. Sociological Methods & Research, 26, 118-140.

Thurstone, L. L., & Chave, E. J. (1929). The measurement of attitude. Chicago:
University of Chicago Press.

Ware, J. E., Gandek, B. L., Keller, S. D., & the IQOLA Project Group. (1996).
Evaluating instruments used cross-nationally: Methods from the IQOLA project. In B. Spilker
(ed.), Quality of Life and Pharmacoeconomics in Clinical Trials, Second Edition (pp. 681-692).
Philadelphia: Lippincott-Raven.

Table 1. Item parameters (graded response model) for global physical health items
in Patient-Reported Outcomes Measurement Information System

Item a b1 b2 b3 b4
Global01 7.37 (na) -1.98 (na) -0.97 (na) 0.03 (na) 1.13 (na)
Global03 7.65 (2.31) -1.89 (-2.11) -0.86 (-0.89) 0.15 ( 0.29) 1.20 ( 1.54)
Global06 1.86 (2.99) -3.57 (-2.80) -2.24 (-1.78) -1.35 (-1.04) -0.58 (-0.40)
Global07 1.13 (1.74) -5.39 (-3.87) -2.45 (-1.81) -0.98 (-0.67) 1.18 ( 1.00)
Global08 1.35 (1.90) -4.16 (-3.24) -2.39 (-1.88) -0.54 (-0.36) 1.31 ( 1.17)
Note: Parameter estimates for 5-item scale are shown first, followed by estimates for 4-
item scale (in parentheses). na = not applicable

Global01: In general, would you say your health is …? Global03: In general, how would
you rate your physical health? Global06: To what extent are you able to carry out your everyday
physical activities? Global07: How would you rate your pain on average? Global08: How would
you rate your fatigue on average?

a = discrimination parameter; b1 = 1st threshold; b2 = 2nd threshold; b3 = 3rd threshold;
b4 = 4th threshold

Figure 1: Scatterplot of Person Fit Index by Theta for the PROMIS Physical Functioning Item
Bank
