
Archives of Clinical Neuropsychology Advance Access published December 8, 2010

Empirical Report
Specificity of Reliable Change Models and Review of the Within-subjects
Standard Deviation as an Error Term
Anton D. Hinton-Bayre*
Department of Surgery, University of Queensland Medical School, Princess Alexandra Hospital, Brisbane, Australia
*Corresponding author at: Department of General Surgery, Princess Alexandra Hospital, Ipswich Road, Buranda, QLD 4120, Australia.
Tel.: +61-7-3394-7111; fax: +61-7-3394-7322.
E-mail address: anton.hb@hotmail.com (A.D. Hinton-Bayre).
Accepted 30 October 2010

Abstract
There is an ongoing debate over the preferred method(s) for determining the reliable change (RC) in individual scores over time. In the
present paper, specificity comparisons of several classic and contemporary RC models were made using a real data set. This included a more
detailed review of a new RC model recently proposed in this journal that used the within-subjects standard deviation (WSD) as the error term.
It was suggested that the RCWSD was more sensitive to change and theoretically superior. The current paper demonstrated that even in the
presence of mean practice effects, false-positive rates were comparable across models when reliability was good and initial and retest
variances were equivalent. However, when variances differed, discrepancies in classification across models became evident. Notably, the
RC using the WSD provided unacceptably high false-positive rates in this setting. It was considered that the WSD was never intended
for measuring change in this manner. The WSD actually combines systematic and error variance. The systematic variance comes from
measurable between-treatment differences, commonly referred to as practice effect. It was further demonstrated that removal of the systematic
variance and appropriate modification of the residual error term for the purpose of testing individual change yielded an error term already
published and criticized in the literature. A consensus on the RC approach is needed. To that end, further comparison of models under varied
conditions is encouraged.

Keywords: practice effects; RC; statistical error

Introduction

The use of reliable change (RC) indices for tracking change in individuals’ test scores continues to proliferate within the
health sciences. Metrics for assessing the statistical significance of individual change scores have been known for several
decades (McNemar, 1962). The classic approach to determining RC is regularly attributed to Jacobson and Truax (1991).
Essentially, RC seeks to determine whether true change has occurred over time by standardizing an estimate of change, dividing
it by the standard error of measurement of the difference. Over the years, there has been an ongoing debate and there remains
considerable disagreement regarding the most appropriate method to use for determining the statistically significant change
in an individual over time. It has been suggested that all RC models can be reduced to a basic formula RC = (Y − Y′)/SE;
where Y is the actual post-test score, Y′ the predicted post-test score, and SE the standard error (Hinton-Bayre, 2005).
A recent article in this journal demonstrated that the majority of popular RC models can be derived through basic test-retest
statistics (Hinton-Bayre, 2010). It was suggested that researchers or clinicians could subsequently derive reliable statistics
according to their preferred model in the absence of a consensus on the most valid model. The study also highlighted
where models could be seen to diverge in their estimate of RC, which will be presented in more detail shortly. The astute
follower of RC literature may have noted that the most recent model proposed by Lewis, Maruff, Silbert, Evered, and Scott (2007)
in this journal had been omitted from the review mentioned. In the Lewis and colleagues article, the authors essentially
proposed a new RC model using the within-subjects standard deviation (WSD) as the error term or SE. Moreover, the
Hinton-Bayre (2010) paper only used two contrived case studies to highlight the discrepancies between the RC models
reviewed. Thus, the current paper had two objectives. The first objective was to compare RC models using an actual data
set, in particular, a group of nonclinical individuals retested on a series of neuropsychological measures. The second objective
was to review the RC model proposed by Lewis and colleagues (2007).

A Brief Chronology of RC Models

The debate over the most appropriate method for determining RC has been ongoing at least since Christensen and Mendoza
(1986) modified the SE utilized in the original RC index reported by Jacobson, Follette, and Revenstorf (1984). The latter
authors had used the standard error of measurement (SEM) to determine whether the difference between scores over time
had changed significantly. Christensen and Mendoza proposed that the standard error of the difference between two test
scores should be the appropriate error term, the calculation of which they provided as SE[1] in Table 1. In their now
classic article, Jacobson and Truax (1991) conceded the inappropriateness of the SEM and proposed the use of the SE[2]
(Table 1) when retest data were not readily available, and initial test and retest variances could be assumed to be equal.
Since that time several models of RC have been proposed. The more salient of these include the model of Chelune,
Naugle, Luders, Sedlak, and Awad (1993), which effectively corrected for the mean practice effects seen in repeated
neuropsychological assessment. The same group in the same year also proposed a standardized regression-based (SRB) approach for
determining RC (McSweeny, Naugle, Chelune, & Lüders, 1993). In the simple SRB model, the retest scores are regressed onto
initial scores from a control group. The predicted score (Y′ ) for an individual is calculated via the least-squares regression line,
and the SE is simply the standard deviation of the residuals. Many subsequent RC variants have been criticized for leading to
unjustified probability statements (Maassen, 2001). Maassen (2004) went on to point out that Christensen and Mendoza had
misinterpreted the SE formula required for testing individual difference scores, having provided a formula for the standard
deviation of the difference scores. Although discussed in more detail later, it is worth noting here that this is the same error
term used by Temkin, Heaton, Grant, and Dikmen (1999) and Rasmussen and colleagues (2001). In reference to classical
test theory, Maassen has strongly and repeatedly argued for the error term to be calculated as seen in SE[3], Table 1
(Maassen, 2004, 2005; Maassen, Bossema, & Brand, 2006, 2009). It should be noted that Iverson, Lovell, and Collins
(2003) and Maassen (2003) were the first to utilize an RC model using an adjustment for mean practice effect for the predicted
score and the SE[3]. More recent debate has focused on the classic approach with modification for practice effects versus the
standard regression approach (e.g., Hinton-Bayre, 2005; Maassen et al., 2006). A small number of studies have reported on both
methods and compared classification rates, which often agree closely (e.g., Erlanger et al., 2003). Maassen and colleagues
(2006) had suggested a modification to the regression-based approach through adjusting the manner in which the predicted
score is calculated, which they argued would provide a more accurate representation of true change. These authors continued
the recommendation for the use of SE[3] and went on to compare several models in both theoretical and actual data sets
(Maassen et al., 2009). Their eventual conclusion was that their approach was generally preferred, providing a more
conservative classification rate, although they conceded that a standard regression approach might be appropriate when
sensitivity is of significant interest. Since then, Hinton-Bayre (2010) demonstrated that most RC models can be derived with the possession of
basic test–retest statistics. More importantly, it was shown that RC models will give different estimates based on the presence
of practice effects, differential practice effects, test–retest reliability, and the individual's relative position compared with
controls at initial testing. Hinton-Bayre also observed that the Maassen and colleagues (2009) approach might not be universally
the most conservative, nor the standard regression approach the most sensitive. Ultimately, a theoretically sound, consensus
approach to RC is desired. As a contribution to this goal, the present paper seeks to empirically compare RC models in con-
temporary research literature, including the RCWSD, as proposed by Lewis and colleagues (2007).

Table 1. SE estimates for RC

        Author                                   SE expression                        Description
SE[1]   Christensen & Mendoza (1986)             SE = √(S²X + S²Y − 2·SX·SY·rXY)      The standard deviation of difference scores
SE[2]   Jacobson & Truax (1991)                  SE = √[2·S²X·(1 − rXY)]              The SEM of the difference score, when variances are equal
SE[3]   Iverson et al. (2003); Maassen (2004)    SE = √[(S²X + S²Y)·(1 − rXY)]        The SEM of the difference score
SE[4]   Lewis et al. (2007)                      SE = √[Σ(X − Y)²/(2n)]               The WSD
SE[5]   McSweeny et al. (1993)                   SE = SY·√(1 − r²XY)                  The standard deviation of the least-squares regression residuals

Notes: SE = standard error; RC = reliable change; SEM = standard error of measurement; X = initial test raw score; Y = retest raw score; SX = initial test
standard deviation in the comparison group; SY = retest standard deviation in the comparison group; rXY = test–retest reliability coefficient; n = comparison
group sample size.
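As an illustration only (this sketch is not from the original article), the five SE terms can be computed in a few lines of Python from the summary statistics defined in the notes above. The identity used for SE[4], Σ(X − Y)² = (n − 1)·S²D + n·d̄², where d̄ is the mean practice effect and S_D the standard deviation of difference scores, is an added algebraic step rather than part of Table 1, and values computed this way will only approximately match the later Table 2 because the published summary statistics are rounded.

```python
import math

def se_terms(s_x, s_y, r_xy, m_x, m_y, n):
    """Return SE[1]..SE[5] from basic test-retest summary statistics."""
    s_d = math.sqrt(s_x**2 + s_y**2 - 2 * s_x * s_y * r_xy)   # SD of difference scores
    se1 = s_d                                                  # Christensen & Mendoza (1986)
    se2 = math.sqrt(2 * s_x**2 * (1 - r_xy))                   # Jacobson & Truax (1991)
    se3 = math.sqrt((s_x**2 + s_y**2) * (1 - r_xy))            # Iverson et al. (2003); Maassen (2004)
    d_bar = m_y - m_x                                          # mean practice effect
    # WSD: sqrt(sum((X - Y)^2) / (2n)), using sum((X - Y)^2) = (n - 1)*S_D^2 + n*d_bar^2
    se4 = math.sqrt(((n - 1) * s_d**2 + n * d_bar**2) / (2 * n))
    se5 = s_y * math.sqrt(1 - r_xy**2)                         # SD of least-squares regression residuals
    return se1, se2, se3, se4, se5

# Example using the SOCT summary statistics reported later in Table 2
print([round(se, 2) for se in se_terms(16.66, 19.47, 0.83, 45.44, 53.41, 104)])
```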

Foundations of the RCWSD

Lewis and colleagues (2007) compared four RC methods. The standard error formulas used are presented in Table 1. The
first two methods were originally described by Jacobson and Truax (1991) and Chelune and colleagues (1993). Notably, the
Jacobson and Truax method does not correct for practice and uses only the pretest variance to estimate the standard error,
assuming pre- and post-test variances are equivalent. The Chelune and colleagues method corrects for practice as seen in a
control group, but uses the Jacobson and Truax error term. Next, the RCISPOCD was considered (Rasmussen et al., 2001);
this method uses the same predicted score Y′ as the Chelune method, but “uses the standard deviation of the change scores
estimated from the control group used to derive the group mean practice effect” (Lewis et al., 2007, p. 250). As alluded to
earlier, this error term can actually be traced back to Christensen and Mendoza (1986), as in the absence of raw scores this
error can be calculated by an expression using basic test–retest statistics (Table 1). Furthermore, this is one of the error
terms described by Temkin and colleagues (1999) in a comparison of various RC methods. Readers should thus be aware
that the Rasmussen and colleagues (2001) RCISPOCD is not yet another method. Finally, the authors introduced the RCWSD,
which uses the WSD as the error term. Lewis and colleagues state that the WSD is “drawn from the estimate of residual
error in a repeated measures analysis of variance that has compared performance over multiple time-points in a control
group” (p. 251). They go on to advise that the WSD can be calculated as described by Bland and Altman (1996). Lewis
and colleagues then argued in favor of the RCWSD on the basis that it (i) tends to be more sensitive to impairment
postoperatively, and (ii) it more truly reflects random error in determining RC (Mollica, Maruff, & Vance, 2004). On the latter
point, the authors argued that the other error terms (e.g., RCISPOCD, Rasmussen et al., 2001) were tainted by the inclusion
of systematic error. The present paper elaborates on the theoretical underpinnings and calculation of the WSD as an error
term in RC. Ultimately, this paper will demonstrate that the WSD approach is inappropriate for use in RC in neuropsychology.
The RCWSD as proposed by Lewis and colleagues (2007) can also be reduced to the basic formula RC = (Y − Y′)/SE. The
predicted post-test score (Y′ ) Lewis and colleagues adopted is simply that used earlier by Chelune and colleagues (1993), where
Y′ is simply the pretest score (X) plus the mean difference score or practice effect seen in an appropriate set of controls. The
novelty of the RCWSD approach lies in the derivation of the error term the authors label as WSD. Bland and Altman (1996)
referred to this error as SW, or the square root of the mean squares (MS) residual for repeated measures, bearing in mind that
residual refers to unexplainable variance. Generally, the variance (or sums of squares, SS) in any ANOVA can be partitioned
as follows: SS total = SS between-people + SS within-people. In the repeated-measures situation, the SS within-people
can be further partitioned into SS between-treatment + SS residual (Winer, 1971). Ordinarily, the F-ratio of interest is
MS between-treatment/MS residual, with the systematic influence of subject differences (MS between-people) removed.
Doyle and Doyle (1997) make exactly this critique, namely that systematic between-treatment variation should not be folded into
measurement error, and go on to point out that there is indeed a significant between-treatment effect in the Bland and Altman data.
Note that between-treatment variation is explained variance that is controlled by correcting for the practice effect in the
numerator when deriving Y′ and thus should not appear in the error variance. The SW (or WSD)
is calculated by finding the variance of the repeated scores for each subject and then finding the square-root of the average of
these values (Bland & Altman, 1996). This calculation process effectively combines the between-treatment variance with the
residual variance, and thus a systematic variance with a random error. Moreover, the square root of twice the MS residual yields the
standard deviation of the difference scores (SE[1]) in a two-level test–retest scenario, and as such its use as an error term for
individual change departs from classical test theory (see Maassen, 2004). Thus, the theoretical benefit of the RCWSD as
proposed by Lewis and colleagues (2007, p. 255) seems unfounded.
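The partitioning argument above can be made concrete with a small simulation. The sketch below is the same point re-expressed in code rather than anything published with the paper: the distributions, means, and error SDs are arbitrary assumptions. It generates two-occasion control data with a built-in practice effect, partitions the within-people sums of squares, and shows that the WSD pools the between-treatment (practice) component with the residual, whereas the square root of twice the MS residual recovers the standard deviation of the difference scores (SE[1]).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
true_score = rng.normal(50, 10, n)
x = true_score + rng.normal(0, 5, n)          # time 1
y = true_score + 8 + rng.normal(0, 5, n)      # time 2, with a practice effect of +8

d = y - x
d_bar = d.mean()

# Repeated-measures partition of the within-people sums of squares (two levels)
ss_within = np.sum((x - y) ** 2) / 2          # SS between-treatment + SS residual
ss_treat = n * d_bar**2 / 2                   # systematic (practice) component
ss_resid = ss_within - ss_treat               # unexplained component

wsd = np.sqrt(ss_within / n)                  # Bland & Altman's S_w; df = n*(k - 1) = n
ms_resid = ss_resid / (n - 1)
sd_diff = d.std(ddof=1)

print(f"WSD                 = {wsd:.2f}")      # pools practice + residual variance
print(f"sqrt(2*MS_residual) = {np.sqrt(2 * ms_resid):.2f}")
print(f"SD of differences   = {sd_diff:.2f}")  # matches sqrt(2*MS_residual)
```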
Bland and Altman (1996) described the use of the SW (or WSD) for evaluating differences between a subject’s measured and
true value – measurement error. They then went on to briefly refer to “repeatability,” which is a concept perhaps more familiar
in the biological, chemical, and physical sciences, where multiple measures are made of the same subject, with the same
measure, by the same observer, under the same conditions, over a short period of time (Chinn, 1991). In a situation where
repeated observations are affected only by chance, the WSD may be a reasonable measure of within-people variability.
However, in the setting of repeated neuropsychological assessment, or even repeated respiratory effort, a learning effect is
often seen. This systematic learning effect is manifest through the between-treatment variance, which when present can be
removed from the within-people variance as described above (Doyle & Doyle, 1997). Bland and Altman actually refer to
the SW (or WSD) when examining variability from a true to a measured score. In the absence of a between-treatment or practice
effect, the WSD would be an appropriate error term in the setting of a single assessment. Thus, the use of the WSD as an error
term for RC parallels the mistake made by Jacobson and colleagues (1984), as pointed out by Christensen and Mendoza  (1986).
To interpret the difference between two measured scores, Bland and Altman stipulate the following error term: √(2·S²W). Thus, by
extension, it appears that Bland and Altman would suggest that this is the appropriate error term for determining RC. Moreover,
it can be readily demonstrated that SE = √(2·MS residual) in the two-level scenario usually seen in RC will exactly equal
the standard deviation of the difference scores, when the MS residual does not incorporate the MS between-treatments. Again, this
is the same error term proposed by Christensen and Mendoza (1986) and subsequently used by Temkin and colleagues (1999)
and Rasmussen and colleagues (2001) in the RCISPOCD. The increased sensitivity reported by Lewis and colleagues (2007) is
solely due to the usually smaller value of the WSD in comparison to the other error terms evaluated—given that the same
numerator is used in all but the Jacobson and Truax (1991) methods. Thus, the practical benefit of greater sensitivity ascribed
to the WSD is flawed on the basis of the inappropriateness of its use as an error term when examining the difference between
two measured scores in the setting of practice effects.
A research note was published by Maassen (2010) that elaborated on the inappropriateness of the WSD as an error term for
determining RC. Using simulated data sets, Maassen demonstrated that in the absence of practice effects (i.e., no difference
between initial and retest means or variances) the WSD tends to underestimate error and thus inflate false-positive
rates. Furthermore, when large practice effects were present, Maassen found that the WSD overestimated error and thus
became effectively insensitive to change. This should not be surprising given that the WSD in calculation—as will be
demonstrated—combines systematic variance, as manifest by mean practice effects, into the error estimate. The present paper seeks to
compare the false-positive rate (1 − specificity) for a selection of models as reviewed by Hinton-Bayre (2010), while also
incorporating the Lewis and colleagues (2007) model. In particular, consideration of where models differ in false-positive
classification will be examined.
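Both of Maassen's observations can also be illustrated with a rough simulation, offered here only as a sketch under arbitrary assumptions (normally distributed scores, a common error SD, a 90% two-tailed criterion, controls classified against their own group mean) rather than as a reproduction of his data sets: with no practice effect the WSD-based denominator flags far more than the nominal 10% of unchanged controls, while with a very large practice effect it flags almost none.

```python
import numpy as np

rng = np.random.default_rng(1)

def false_positive_rate(practice: float, n_controls: int = 104, n_sims: int = 2000) -> tuple:
    """Proportion of unchanged controls flagged as reliably changed at +/-1.645."""
    hits_wsd, hits_sd, total = 0, 0, 0
    for _ in range(n_sims):
        true = rng.normal(50, 10, n_controls)
        x = true + rng.normal(0, 5, n_controls)
        y = true + practice + rng.normal(0, 5, n_controls)
        d = y - x
        wsd = np.sqrt(np.sum((x - y) ** 2) / (2 * n_controls))   # SE[4]
        sd_d = d.std(ddof=1)                                     # SE[1]
        z = d - d.mean()          # practice-corrected change (same-sample control mean)
        hits_wsd += np.sum(np.abs(z / wsd) > 1.645)
        hits_sd += np.sum(np.abs(z / sd_d) > 1.645)
        total += n_controls
    return hits_wsd / total, hits_sd / total

for practice in (0.0, 8.0, 20.0):
    fp_wsd, fp_sd = false_positive_rate(practice)
    print(f"practice={practice:5.1f}  FP(WSD)={fp_wsd:.3f}  FP(SD_diff)={fp_sd:.3f}")
```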

Method

A sample of 104 male rugby league players completed the Speed of Comprehension Test (SOCT; Baddeley, Emslie, &
Nimmo-Smith, 1992), Digit Symbol Substitution Test (DSST) from the WAIS-R (Wechsler, 1981), and Symbol Digit
Modalities Test (SDMT) written version (Smith, 1982), on two occasions 1–3 weeks apart. On average, the sample was
18.5 years of age (SD = 2.5), had 11.9 years of education (SD = 1.0), and an estimated full-scale IQ of 88.5 (SD = 8.5),
based on the National Adult Reading Test-Revised (Crawford, 1992). These data were collected in a larger study prospectively
examining the effects of concussion in contact sport. A limited portion of these data has been previously published
(Hinton-Bayre, 2004).

RC Models

RC models can be broadly separated into mean practice and regression-based approaches. Recall that all RC models can be
reduced to RC = (Y − Y′)/SE. The mean practice approaches, as the term suggests, all estimate Y′ by adding the practice
effect (the difference between the retest and initial test means: MY − MX) to the initial test score (X). The four models considered differ
only in terms of the SE implemented. The Chelune and colleagues (1993) model uses SE[2] (Table 1). Temkin and colleagues
(1999) and Rasmussen and colleagues (2001) effectively used SE[1]. Iverson and colleagues (2003) and Maassen (2004)
used SE[3]. Finally, Lewis and colleagues (2007) used SE[4]. The differences between these models vary as a function of the
difference between test and retest variances, or differential practice (Maassen, 2005), and test reliability. The differences
are predictable and will be reconsidered shortly.
Two regression-based models were considered. The SRB model is attributed to McSweeny and colleagues (1993) and is
calculated using a least-squares regression line to obtain Y′ and the standard error of estimate as the corresponding error
term, SE[5]. The predicted score is Y′ = bX + a, where b is the slope of the least-squares regression line and a the constant (intercept).
The slope b is usually calculated as the covariance of X and Y divided by the variance of X: b = CovXY/VarX. The coefficient b
can also be calculated by multiplying the Pearson product-moment correlation by the standard deviation of retest scores
divided by the standard deviation of the initial test scores: b = rXY × (SY/SX). Maassen and colleagues (2006) suggested an
adjustment to the regression model where the regression coefficient is corrected for proposed bias, such that badj = b/rXY and
thus badj = SY/SX. The aadj constant is calculated in the traditional manner using the respective b coefficient, for example,
aadj = MY − badj·MX. Maassen and colleagues further recommended the use of SE[3] as the corresponding error term.
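To make the two regression-based predictions concrete, the following sketch (illustrative only, and not the author's spreadsheet) computes Y′ and the resulting RC value under the simple SRB model and under the Maassen and colleagues (2006) adjustment, using only summary statistics. The individual's initial and retest scores in the example are hypothetical.

```python
import math

def rc_srb(x, y, m_x, m_y, s_x, s_y, r_xy):
    """Simple SRB model (McSweeny et al., 1993)."""
    b = r_xy * (s_y / s_x)                 # least-squares slope
    a = m_y - b * m_x                      # intercept
    y_pred = b * x + a
    se5 = s_y * math.sqrt(1 - r_xy**2)     # SD of regression residuals (SE[5])
    return (y - y_pred) / se5

def rc_maassen(x, y, m_x, m_y, s_x, s_y, r_xy):
    """Adjusted regression model (Maassen et al., 2006)."""
    b_adj = s_y / s_x                      # adjusted slope, b / r_xy = S_Y / S_X
    a_adj = m_y - b_adj * m_x
    y_pred = b_adj * x + a_adj
    se3 = math.sqrt((s_x**2 + s_y**2) * (1 - r_xy))   # SE[3]
    return (y - y_pred) / se3

# Hypothetical individual retested on the SOCT (normative values from Table 2)
print(round(rc_srb(30.0, 45.0, 45.44, 53.41, 16.66, 19.47, 0.83), 2))
print(round(rc_maassen(30.0, 45.0, 45.44, 53.41, 16.66, 19.47, 0.83), 2))
```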

Results

Table 2 provides the descriptive and inferential statistics generated. Significant practice effects were observed for all three
measures, chiefly reflecting the large sample size. A significant difference between initial and retest variances, referred
to as differential practice (Maassen, 2005), was observed for the SOCT and SDMT. Retest variability was greater than initial
test variability for SOCT suggesting divergence from the mean. Conversely, retest variance was less than initial test variance
for SDMT, suggesting regression to the mean.
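The paper reports t-values for differential practice in Table 2 but does not spell out the procedure used. As a purely assumed example, one common test for the equality of two correlated (paired) variances is the Pitman-Morgan t-test, sketched below on simulated data; it is offered as an illustration of the kind of test involved, not as the analysis actually run.

```python
import numpy as np
from scipy import stats

def pitman_morgan(x: np.ndarray, y: np.ndarray):
    """Test H0: var(X) = var(Y) for paired observations X and Y."""
    n = len(x)
    d, s = x - y, x + y                   # variances are equal iff corr(D, S) = 0
    r_ds = np.corrcoef(d, s)[0, 1]
    t = r_ds * np.sqrt((n - 2) / (1 - r_ds**2))
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    return t, p

rng = np.random.default_rng(2)
x = rng.normal(45, 16, 104)
y = 0.9 * x + rng.normal(10, 14, 104)     # hypothetical retest with a different spread
print(pitman_morgan(x, y))
```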

Table 2. Descriptive and RC statistics for normative sample (n = 104)

                                   SOCT           DSST           SDMT
Initial test mean                  45.44          57.75          53.98
Retest mean                        53.41          63.76          55.88
Initial test SD                    16.66          11.35          10.88
Retest SD                          19.47          11.62           9.34
rXY                                 0.83           0.70           0.65
Mean practice (t)                   7.57           6.90           2.26
Differential practice (t)           3.89           0.44           2.61
Standard error
  SE[1] CM                         10.74           8.88           8.55
  SE[2] JT                          9.58           8.78           9.07
  SE[3] Iverson/Maassen            10.42           8.88           8.46
  SE[4] WSD                         9.43           7.56           6.16
  SE[5] SRB                        10.73           8.29           7.08
RC classification: positive change/negative change (correlation between ZX and RC scores in parentheses)
  Temkin                           7/4 (−.04)     6/6 (−.36)     8/8 (−.56)
  Chelune                          9/6 (−.04)     6/6 (−.36)     5/7 (−.56)
  Iverson                          7/4 (−.04)     6/6 (−.36)     8/8 (−.56)
  WSD                              9/6 (−.04)     6/8 (−.36)     12/10 (−.56)
  SRB                              7/3 (−.01)     5/4 (−.01)     7/5 (−.01)
  Maassen                          5/5 (−.29)     6/6 (−.39)     6/7 (−.42)

Notes: RC = reliable change; SOCT = Speed of Comprehension Test; DSST = Digit Symbol Substitution Test; SDMT = Symbol Digit Modalities Test;
CM = Christensen and Mendoza; JT = Jacobson and Truax; WSD = within-subjects standard deviation; SRB = standardized regression-based; ZX = standardized
initial test score.

The present paper sought to compare the error terms and ultimately classification rates for the six RC models described
earlier. The Jacobson and Truax (1991) RC model does not account for practice effects and thus will not be further considered
here as estimates of change will be biased. Table 2 also presents SE estimates for the error terms noted in Table 1. It can
be readily appreciated that the SEWSD produces a smaller error value on each test, consistent with the observation by Lewis
and colleagues (2007). The SE[1] and SE[3] were equal when SX was equivalent to SY, as was the case for DSST. SE[3]
was always less than or equal to SE[1] as has been previously demonstrated (Maassen, 2004). In the setting of significant
differential practice (e.g., SOCT and SDMT), SE[2], which assumes equality of variances, provides a lesser or greater estimate
of error respectively, compared with SE[1] and SE[3]. The linear regression error term, SE[5], was relatively large in the setting
of divergence from the mean (SX < SY) seen for SOCT. However, when test–retest variances were equal (DSST) or there
was evidence of regression to the mean (SX > SY; SDMT), SE[5] was relatively smaller compared with other error terms,
excluding SEWSD.
Table 2 goes on to present the classification rates of the six RC models. Theoretically, given the sample size of 104, one
would expect roughly five individuals to produce an RC score greater than +1.645 and another five individuals an RC
score less than −1.645, assuming a two-tailed 90% confidence level. In the case of the DSST, where reliability was
good and variances equal, the classification rates approximated the theoretical distribution across RC models. Where there
was divergence from the mean on the SOCT, most models provided a classification rate commensurate with chance levels.
Notably, the RC model of Chelune and colleagues (1993) and the RCWSD yielded the highest false-positive classifications.
Where there was regression to the mean on retesting on the SDMT accompanied by less impressive reliability, false-positive
rates were most varied across models. Nonetheless, the RCWSD provided the greatest false-positive classification rate among
RC models examined, consistently across each of the tests. A series of goodness-of-fit χ² analyses (p < .05) were conducted to
evaluate whether the classifications of change differed from the theoretical distribution, that is, whether more than 10% were significantly improved or
deteriorated. The 21% false-positive rate for the RCWSD on SDMT was the only one to reach statistical significance. The
alleged sensitivity of the WSD as an error term (SE[4]) as described by Lewis and colleagues (2007) came at the expense
of consistently reduced specificity in real data.
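For transparency, the goodness-of-fit comparison for the worst case above can be reproduced arithmetically. The sketch below assumes a simple two-category (changed vs. unchanged) comparison against the 10% chance rate; the exact categorization used in the paper is not stated, so the numbers are indicative only.

```python
from scipy import stats

n = 104
observed_changed = 12 + 10            # RC WSD classifications on the SDMT (Table 2)
observed = [observed_changed, n - observed_changed]
expected = [0.10 * n, 0.90 * n]       # 10% expected beyond +/-1.645 by chance

chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")   # roughly chi2 = 14.4, p < .001
```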
Further examination of Table 2 classifications suggests that there are differences in the symmetry of classification for RC
models. Under a standard normal distribution of RC scores, one would expect approximately 5% of cases in either tail to be
significantly changed due to chance. When variances were equal (e.g., DSST) there was good symmetry of classification. With
divergence from the mean on retesting (e.g., SOCT), there was a trend toward more positive RC changes. Furthermore, when
there was evidence of regression to the mean on retesting (e.g., SDMT), the asymmetry was more obvious for models where
only one or the other variance was used to estimate error, namely SE[2] and SE[5], which use the initial and retest variances, respectively. The most consistent symmetry was seen for the Maassen RC model.

Table 3. ANOVA statistics for worked example

Test    Source                           SS          df    MS         F-value
SOCT    Between people                   61674.42    103    598.78    10.39
        Between treatment                 3304.04      1   3304.04    57.33
        Residual                          5936.46    103     57.64
        Between-treatment + residual      9240.50    104     88.85
DSST    Between people                   23101.00    103    224.28     5.68
        Between treatment                 1878.00      1   1878.00    47.58
        Residual                          4065.50    103     39.47
        Between-treatment + residual      5943.50    104     57.15
SDMT    Between people                   17417.42    103    169.10     4.63
        Between treatment                  186.58      1    186.58     5.11
        Residual                          3763.92    103     36.54
        Between-treatment + residual      3950.50    104     37.99

Notes: SOCT = Speed of Comprehension Test; DSST = Digit Symbol Substitution Test; SDMT = Symbol Digit Modalities Test.
The RC models based on the mean practice effect do not correct for extremeness of the initial test score, whereas regression-
based models do. Individual data have not been presented; however, inspection of the individual RC scores across models
revealed that the standardized deviation or extremeness of the initial (ZX) score [(X − MX)/SX] did not appear to predispose
that individual to later significant change on retesting for any RC model on any of the three measures. However, correlations
between initial standardized scores and subsequent RC scores did vary across models for different test conditions. Correlations
between initial test ZX scores and subsequent RC scores, presented in Table 2, approximated zero when there was evidence
of differential practice such that SX < SY, suggesting divergence from the mean (e.g., SOCT), for mean practice effect
models (r = −.04) and the SRB (r = −.01). Interestingly, the Maassen model suggested a weak negative relationship
between initial ZX scores and RC scores (r = −.29) for the SOCT. When initial and retest variances were equal (e.g., DSST),
the correlation was also weakly negative for mean practice effect models (r = −.36) and the Maassen model (r = −.39),
but essentially zero (r = −.01) for the SRB. Finally, when there was evidence of regression to the mean on retesting
(e.g., SDMT), initial standard scores were moderately negatively correlated with RC scores for mean practice effect models
(r = −.56), but less so with the Maassen model (r = −.42), and not correlated under the SRB model (r = −.01). Thus, the
SRB model appears to be the only one for which more extreme scores at the outset do not influence the RC outcome.
It should be noted that these statements of relationship perhaps only hold in the presence of significant practice effects.
Table 3 presents the inferential statistics for each of the three measures. As mentioned, the SS between-people is not of
interest. Now, it can be demonstrated that the sum of SS between-treatment and SS residual, divided by the sum of df
between-treatment and df residual, gives the within-subjects variance, the square root of which yields the WSD, such that
SOCT WSD = √88.85 = 9.43, DSST WSD = √57.15 = 7.56, and SDMT WSD = √37.99 = 6.16. This clearly demonstrates that the WSD
incorporates systematic and error variance (cf. SE[4] values in Table 2).
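The same arithmetic can be checked directly from the values reported in Table 3; this snippet is simply that calculation written out.

```python
import math

# (SS between-treatment, SS residual) for each measure, as reported in Table 3
table3 = {
    "SOCT": (3304.04, 5936.46),
    "DSST": (1878.00, 4065.50),
    "SDMT": (186.58, 3763.92),
}
df_treatment, df_residual = 1, 103

for test, (ss_treat, ss_resid) in table3.items():
    wsd = math.sqrt((ss_treat + ss_resid) / (df_treatment + df_residual))
    print(f"{test}: WSD = {wsd:.2f}")   # 9.43, 7.56, 6.16 (cf. SE[4] in Table 2)
```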

Discussion

The present paper had two aims. The first was to compare the false-positive rate (hence specificity) in a collection of well-
known and recent RC models using real data. This was achieved in a select sample of relatively homogenous individuals, that
is, young male athletes, who were retested a median of 2 weeks apart. The three measures of focus in this study were subject to
practice effects and moderate to good reliability. The tests differed most clearly on differential practice, otherwise termed inequality
of variance. Recent studies have either focused on comparison of models using limited case examples (Hinton-Bayre, 2010)
or theoretical data sets (Maassen, 2010). Thus, this paper may be considered an extension of these earlier works. The second
aim was to highlight the RCWSD of Lewis and colleagues (2007), the most recent addition to the RC family. In particular, this
paper sought to demonstrate that the improved sensitivity to change alleged by its authors is based on faulty assumptions
regarding what constitutes statistical error. In this way, the current paper operationalizes a recent theoretical critique of the
RCWSD (Maassen, 2010). It should also be noted that errors occurring in Table 1 of Lewis and colleagues have been
amended and the data re-presented in the Appendix.
A series of important findings were elucidated by this study. As suggested in earlier work (Hinton-Bayre, 2010), when initial
and retest variances are equal, RC models will tend to agree regarding classifications of significant RC. Conversely,
departures from this ideal situation witness increasing variation in classification amongst the RC models examined. These
data also reinforce the point that where there is potential for mean practice effects on retesting, as is commonly seen in
performance-based measurement, there is also potential for differential practice. In this journal, Hinton-Bayre (2010) has
already suggested that differential practice appears to play a significant role in providing more varied estimates of RC
across different models. In the present data examined, it was observed that the WSD was consistently the smallest SE, and
furthermore the RCWSD had consistently the highest false-positive rate of all models (13%–21%). Given the limited
sample size for examining significant deviations from a chance false-positive rate of 10%, the RCWSD on the SDMT was
the only occasion on which a statistically significant increase in false-positive classification was seen. Lewis and colleagues (2007)
had claimed that the SEWSD provided an unbiased estimate of error. This paper clearly demonstrates that in the setting of
practice effects, the SEWSD actually combines systematic variance with residual error, as seen in Table 3. Moreover, when the systematic
effect is removed and the error calculated as alluded to by Bland and Altman (1996), this yields an error term already
existing in the literature (RCISPOCD and RCTemkin). This latter error term has been criticized for departing from classical test
theory (Maassen, 2004; Maassen et al., 2006). In the current data set, the RCMaassen provided the most consistent and most
symmetrical false-positive classification rates.
The findings of this paper have significant implications for the implementation of RC statistics in retest analysis of
individual data measured on a continuum. If ideal psychometric properties are present, then the model used is likely of little
consequence. Previous work has demonstrated that when reliability is good, variances are equivalent, and the control/normative
sample is adequately matched, the RC model employed is of lesser concern (Hinton-Bayre, 2010). However, departures
from these ideals show discordance between models. In the absence of an agreed approach, one could argue that RC analysis
may not be appropriate unless certain criteria regarding these parameters are met. It has been suggested that reliability be
>0.70, there should not be differential practice, and the case should fall within 1 SD of the control sample at initial testing
(Hinton-Bayre, 2010). Empirical investigation of these suggestions is wanting. When test and retest data are available, the
use of SE[2] is questionable, given the extra information provided by the possession of retest variance. The question
remains over how to proceed when test and retest variances are not equivalent. Should the investigator shy away from RC
analysis altogether, or should they incorporate a model that considers such a discrepancy (e.g., the RCSRB)? Further
consideration of the scope of application of RC models is required. It is clear, however, that the RCWSD is not theoretically or practically
suited to the task of RC. As argued recently by Maassen (2010), the WSD is a measure of variation for a single score, not a
change score. Thus, the increased sensitivity proposed by Lewis and colleagues (2007) is illusory, and may even paradoxically
be reversed in the setting of large practice effects (Maassen, 2010). In the current real test–retest data set of performance
measures demonstrating practice effects, the RC model of Maassen and colleagues (2006) provided the most consistent and
symmetrical false-positive rate approximating chance levels. However, the SRB was the only RC model not to demonstrate
a relationship between initial test performance and subsequent RC values. It should be noted that no obvious link was
seen between actual extreme scores at initial testing and significant RC scores on retest.
Naturally, the limitations of the present study bring to bear significant caveats regarding the results described. The sample,
although well powered to examine for practice and differential practice effects, was not well powered to examine discrepancies
in classification at the 90% confidence level. The pattern of results is also limited to measures with significant practice effects.
This is arguably a strength, given that practice effects are regularly the norm in performance measurement. We do not see the select nature of the sample
as a limitation given the theoretical focus of this paper, and the relative homogeneity would have facilitated a reduction in
individual-difference error. It is noted, however, that this paper comments only on the specificity of RC models. A clinical or
experimental group is required to compare the sensitivity of RC models. Nonetheless, a control sample enables the examination of the
individual change distributions for an RC model. With a 90% confidence interval, one should expect to see a false-positive
rate of 10%, and this provides a rational and testable evaluation of RC model accuracy. Variations in clinical or experimental
effect will naturally provide an added dimension to consistency of classification, with larger or more variable effects
influencing RC models differently.
It is not presently clear whether one should choose an RC model based on a sensitivity/specificity tradeoff or based on
theoretical appropriateness to the psychometric conditions at hand. As noted previously, multiple test parameters will affect RC values
(Hinton-Bayre, 2010). Systematic examination of variations in such parameters (i.e., practice effects, differential practice,
reliability, individual comparability to controls) would help to further elucidate discrepancies and consistencies across models.
In this way, either a universally appropriate RC model or a decision process for selecting one under certain conditions can be
considered. Such investigation may yield circumstances where it is questionable whether an RC approach is appropriate at all!
This paper has demonstrated that the RC models compared tend to agree in the presence of practice effects, when there is
limited evidence of differential practice. To this end, it is sensible to require that a minimum reliability be met when
making decisions about individual change. Whether 0.7 is too lenient or too strict requires further consideration. It is also again
clear that differential practice has a profound differential influence on RC models. It is not clear whether RC models are
appropriate at all in this setting, or whether a model that accounts for such a discrepancy is preferable, for example, a regression-
based model. In the current data set, the Maassen RC model provided the most consistent and symmetric classification. It was
also clear that the WSD as described by Lewis and colleagues (2007) has neither a theoretical nor practical benefit over its
competitor error terms in RC and is in fact inappropriate for the task. The WSD was never intended to be used for comparison
of actual scores for an individual according to its original authors (see also Chinn, 1991). Bland and Altman (1996) were also
not particularly describing a situation where a learning effect was expected, although one was demonstrated in their data (Doyle
& Doyle, 1997). Removal of the between-treatment variance from the WSD yields a true residual, reflecting unexplained variance.
Use of this latter residual would yield exactly the error term used in RC analyses by Temkin and colleagues (1999) and
Rasmussen and colleagues (2001). Moreover, this error term has been strongly criticized in the literature for its apparent
deviation from classical test theory. For a more contemporary discussion of the various RC models, the interested reader is
directed to Maassen and colleagues (2009), Maassen (2010) and Hinton-Bayre (2010). A consensus on an RC model or
approach to selecting a model for determining individual change would be ideal. To that end, continued comparison of RC
models in theoretical, normative, and clinical/experimental data sets is required. The author is able to provide a spreadsheet
to facilitate the calculation of different RC model scores on request.

Appendix

Reformatted Table 1 from Lewis and colleagues (2007): formulas for the RCIJ&T, the RCIISPOCD, the RCIChelune, and the RCIWSD, the systematic change observed
in the control group, and the error estimates used in the RCI calculations for each neuropsychological test

RCI            Formula                       Error and group change estimates used in the RCI equations
                                             WLT         TMTA          TMTB         COWAT        DSST         GPD           GPND
ΔXc, mean (SD)                               0.8 (3.8)   −0.1 (15.6)   8.5 (24.7)   4.5 (10.5)   −1.5 (8.3)   −7.6 (14.5)   −0.1 (15.6)
RCIJ&T         ΔX/SEdifference               3.54        12.85         26.49        11.02        8.90         12.64         12.10
RCIISPOCD      (ΔX − ΔXc)/SDΔXc              3.80        15.59         24.70        10.47        8.28         14.48         15.58
RCIChelune     (ΔX − ΔXc)/SEdifference       3.54        12.85         26.49        11.02        8.90         12.64         12.10
RCIWSD         (ΔX − ΔXc)/WSDΔXc             2.73        10.96         18.37        8.02         5.92         11.53         10.96

Notes: ΔX = individual time 2 performance − time 1 performance; ΔXc = mean control group time 2 performance − time 1 performance;
SEdifference = √[2·SD²baseline control·(1 − rxx)], where rxx = test–retest reliability of the measure; SDΔXc = standard deviation of the control group change;
WSDΔXc = within-subject standard deviation of the control group. Tasks: WLT = CERAD word-learning task; TMTA = Trail Making Task part A; TMTB
= Trail Making Task part B; COWAT = Controlled Oral Word Association Task; DSST = Digit Symbol Substitution Task; GPD = Grooved Pegboard
Dominant; GPND = Grooved Pegboard Nondominant.

References

Baddeley, A., Emslie, H., & Nimmo-Smith, I. (1992). The Speed and Capacity of Language Processing Test: Manual. Bury St Edmunds, UK: Thames Valley
Test Company.
Bland, J. M., & Altman, D. G. (1996). Statistics note: Measurement error. British Medical Journal, 313, 744.
Chelune, G. J., Naugle, R. I., Luders, H., Sedlak, J., & Awad, I. A. (1993). Individual change after epilepsy surgery: Practice effects and base-rate information.
Neuropsychology, 7 (1), 41– 52.
Chinn, S. (1991). Repeatability and method comparison. Thorax, 46, 454– 456.
Christensen, L., & Mendoza, J. L. (1986). A method of assessing change in a single subject: An alteration of the RC index. Behavior Therapy, 12, 305–308.
Crawford, J. R. (1992). Current and premorbid intelligence measures in neuropsychological assessment. In J. R. Crawford, D. M. Parker, & W. W. McKinlay
(Eds.), A handbook of neuropsychological assessment (pp. 21– 49). Hove, UK: Lawrence Erlbaum.
Doyle, J. R., & Doyle, J. M. (1997). Measurement error is that which we have not yet explained. British Medical Journal, 314, 147– 148.
Erlanger, D., Feldman, D., Kutner, K., Kaushik, T., Kroger, H., Festa, J., et al. (2003). Development and validation of a web-based neuropsychological test
protocol for sports-related return-to-play decision-making. Archives of Clinical Neuropsychology, 18, 293– 316.
Jacobson, N. S., Follette, W. C., & Revenstorf, D. (1984). Psychotherapy outcome research: Methods for reporting variability and evaluating clinical
significance. Behavior Therapy, 15, 336–352.
Jacobson, N. S., & Truax, P. (1991). Clinical significance: A statistical approach to defining meaningful change in psychotherapy research. Journal of
Consulting and Clinical Psychology, 59, 12– 19.
Hinton-Bayre, A. D. (2004). Holding out for a reliable change from confusion to a solution: A comment on Maassen’s “The standard error in the Jacobson and
Truax Reliable Change Index”. Journal of the International Neuropsychological Society, 10, 894–898.
Hinton-Bayre, A. D. (2005). Methodology is more important than statistics when determining reliable change. Journal of the International Neuropsychological
Society, 11, 788 –789.

Hinton-Bayre, A. D. (2010). Deriving Reliable Change statistics from test –retest normative data: Comparison of models and mathematical expressions.
Archives of Clinical Neuropsychology, 25, 244–256.
Iverson, G. L., Lovell, M. R., & Collins, M. W. (2003). Interpreting change on ImPACT following sport concussion. The Clinical Neuropsychologist, 17,
460–467.
Lewis, M. S., Maruff, P., Silbert, B. S., Evered, L. A., & Scott, D. A. (2007). The influence of different error estimates in the detection of postoperative
cognitive dysfunction using reliable change indices with correction for practice effects. Archives of Clinical Neuropsychology, 22, 249–257.
Maassen, G. H. (2001). The unreliable change of reliable change indices. Behaviour Research and Therapy, 39, 495–498.
Maassen, G. H. (2003). Principes voor de definitie van reliable change (2): Reliable change indices en practice effects [Principles of defining reliable change (2):
Reliable Change Indices and practice effects]. Nederlands Tijdschrift voor de Psychologie, 58, 69–79.
Maassen, G. H. (2004). The standard error in the Jacobson and Truax reliable change index (the classical approach to the assessment of reliable change).
Journal of the International Neuropsychological Society, 10, 888–893.
Maassen, G. H. (2005). Reliable change assessment in sport concussion research: A comment on the proposal and reviews of Collie et al. British Journal of
Sports Medicine, 39, 483–488.
Maassen, G. H. (2010). The two errors of using the within-subject standard deviation (WSD) as the standard error of a Reliable Change Index. Archives of
Clinical Neuropsychology, 25, 451– 456.
Maassen, G. H., Bossema, E. R., & Brand, N. (2006). Reliable change assessment with practice effects in sport concussion research: A comment on
Hinton-Bayre. British Journal of Sports Medicine, 40, 829– 833.
Maassen, G. H., Bossema, E. R., & Brand, N. (2009). Reliable change and practice effects: Outcomes of various indices compared. Journal of Clinical and
Experimental Neuropsychology, 31, 339– 352.
McNemar, Q. (1962). Psychological statistics (3rd ed.). New York: Wiley.
McSweeny, A. J., Naugle, R. I., Chelune, G. J., & Lüders, H. (1993). “T-scores for change”: An illustration of a regression approach to depicting change in
clinical neuropsychology. The Clinical Neuropsychologist, 7, 300–312.
Mollica, C. M., Maruff, P., & Vance, A. (2004). Development of a statistical approach to classifying treatment response in individual children with ADHD.
Human Psychopharmacology, 19 (7), 445– 456.
Rasmussen, L. S., Larsen, K., Houx, P., Skovgaard, L. T., Hanning, C. D., & Moller, J. T. (2001). The assessment of postoperative cognitive function. Acta
Anaesthesiologica Scandinavica, 45 (3), 275–289.
Smith, A. (1982). Symbol Digit Modalities Test – Revised. Los Angeles, CA: Western Psychological Services.
Temkin, N. R., Heaton, R. K., Grant, I., & Dikmen, S. S. (1999). Detecting significant change in neuropsychological test performance: A comparison of four
models. Journal of the International Neuropsychological Society, 5, 357–369.
Wechsler, D. (1981). Wechsler Adult Intelligence Scale – Revised: Manual. New York: Psychological Corporation.
Winer, B. J. (1971). Statistical principles in experimental design. New York: McGraw-Hill.
