
Physical Therapy in Sport 4 (2003) 122–128
www.elsevier.com/locate/yptsp

Reliability in evidence-based clinical practice: a primer for allied health professionals☆


Alan M. Batterham a,1, Keith P. George b,*

a Sport and Exercise Section, School of Social Sciences, University of Teesside, Middlesborough TS1 3BA, UK
b Department of Exercise and Sport Science, Crewe and Alsager Faculty, Manchester Metropolitan University, Alsager Campus, Alsager, Cheshire ST7 2HL, UK

Abstract

The aim of this paper is to provide a tutorial on reliability in research and clinical practice. Reliability is defined as the quality of a measure that produces reproducible scores on repeat administrations of a test. Reliability is thus a prerequisite for test validity. All measurements are attended by measurement error. Systematic bias is a non-random change between trials in a test–retest situation. Random error is the noise in the measurement or test. Systematic bias should be evaluated separately from estimates of random error. For variables measured on an interval-ratio scale the most appropriate estimates of random error are the typical error, the percent coefficient of variation, and the 95% limits of agreement. These can be derived via analysis of variance procedures. Estimates of relative, rather than absolute, reliability may be obtained from the intraclass correlation coefficient. For variables that have categories as values the kappa coefficient is recommended. Irrespective of the statistic chosen, 95% confidence intervals should be reported to define the range of values within which the true population value is likely to reside. Small random error implies greater precision for single trials. More precise tests and measurements facilitate more sensitive monitoring of the effects of treatment interventions in research or practice settings. © 2003 Elsevier Ltd. All rights reserved.
Keywords: Reliability; Systematic bias; Random error

1. Introduction

1.1. Aims

This is the second in a series of papers designed to address important knowledge, skills and abilities in the field of research design, methods and statistics. Allied health professionals expected to conduct their own original research, or to use research findings in practice, require such a foundation to ensure scientific rigour in their own investigations, and to facilitate a critical interpretation of the extant literature. The previous paper introduced the crucial concept of validity. Here we tackle the issue of reliability in research and practice. Our aim is, firstly, to define key concepts and terms and, secondly, to illustrate with example data sets the appropriate statistical methods required in assessing reliability.
☆ Reprinted from Physical Therapy in Sport Vol. 1, pp. 54–62 (2000), original doi: 10.1054/ptsp.2000.0010. With permission from Elsevier Ltd.
* Corresponding author. E-mail address: a.batterham@tees.ac.uk (A.M. Batterham).
1 Tel.: +44-1642-342354; fax: +44-1642-342399.

1.2. Operational definitions

Reliability is defined as the quality of a measure that possesses reproducibility. Reproducibility indicates the degree to which a test or measure produces the same scores when applied repeatedly in the same circumstances (Nelson, 1997). A repeatability study, therefore, is required to help establish and quantify reproducibility, and thus provide an indication of the test–retest reliability of a measure. Defined in this manner, reliability is clearly an essential prerequisite for test validity. If a test or measurement tool cannot provide reproducibility in repeated administrations, in the same test conditions, then it could never be considered a valid test. The terms reliability, repeatability, reproducibility, and retest reliability, as well as consistency and stability, are often used interchangeably in the literature. In this paper we will refer to reliability to indicate the extent to which scores for a subject sample on one test can be reproduced in subsequent tests or trials conducted with the same study participants in the same circumstances. In clinical measurements two main types of reliability are of interest: intrarater and interrater reliability. Intrarater reliability refers to the consistency of one practitioner or

1466-853X/03/$ - see front matter © 2003 Elsevier Ltd. All rights reserved.
doi:10.1016/S1466-853X(03)00076-2

investigator, test or measurement tool. For example, consider a therapist interested in the peak torque produced by the knee extensors on an isokinetic dynamometer at a fixed velocity. Ten patients perform the test twice with the same therapist, on the same dynamometer, 2 days apart. An analysis of the reproducibility of the test scores indicates the degree of intrarater reliability. If each patient reproduced exactly the same peak torque score on the second occasion, perfect intrarater reliability would be attained. Interrater reliability refers to the reproducibility of measurement between two or more investigators. For instance, two therapists are asked, independently, to examine 10 ankles using palpation, looking for signs of ankle syndesmosis injury. The therapists are asked to report a positive or negative diagnosis based on the physical examination. The extent to which the two therapists agree indicates the interrater reliability. If the therapists' diagnoses were in accord on all 10 ankles the interrater reliability would be perfect. Assessments of intra- and interrater reliability are important if any confidence is to be placed in the validity of the test or measurement. To provide a further illustration of these concepts in action we have reproduced below a partial extract of the abstract from a recent study (Schoppen et al., 1999).

Objective. To determine the interrater and intrarater reliability and the validity of the Timed up and go test as a measure for physical mobility in elderly patients with an amputation of the lower extremity.

Design. To test interrater reliability, the test was performed for two observers at different times of the same day in an alternating order. To test intrarater reliability, the patients performed the same test for one observer on two consecutive visits with an interval of 2 weeks.

2. Reliability and measurement error

In clinical practice and research, of course, intra- or interrater reliability is never perfect. Many factors can influence the reliability of a test or measure. Several of these confounding influences were presented in the previous article in this series as threats to validity. For example, maturation and history can adversely affect reliability. In a test–retest situation, it is often important that there is not too large a gap between repeat administrations of the test. An interval of several weeks, for example, may lead to changes in the variable under study, which are not necessarily indicative of poor reliability. As an illustration, return to our previous isokinetic test illustration of intrarater reliability. Imagine that, instead of a 2-day period, there had been a 6-week period in-between test and retest. The reader can doubtless think of a range of potential confounds caused by the intervening time interval. Recall also that the definitions of reliability provided indicate that the test circumstances should be consistent in repeat trials. Any

factors relating to the test or measurement situation that differ considerably from test to retest can adversely influence reliability and validity. These may be simple things relating to the study participants such as lack of sleep or minor injuries or illness (Baumgartner and Jackson, 1995). Or, problems may present with the calibration or operation of any measurement instrument. For instance, if the isokinetic dynamometer was calibrated differently or inappropriately on one test occasion the reliability of the test may be compromised. All tests and measurements are attended by measurement error. The evidence-based practitioner requires a working knowledge of measurement error to conduct and interpret research. In clinical practice and research, an examination of reliability requires repeat measurements on a sample of patients. Broadly, two types of error may attend these measurements: random error and systematic bias. Random error refers to the noise in the measurement or test. Small random error in repeat administrations of a test indicates good reliability. Systematic bias is a non-random change between trials in a test–retest situation, whereby all subjects perform consistently better in one trial. Random error results from several sources (Hopkins, in press). Biological error represents a change in a person's capabilities between test and retest; for example, a person's muscular strength may change between the first visit and the second due to physiological adaptations or psychological factors such as motivation. The greater the intervening time period the greater the likelihood of increased biological error. As mentioned previously, instrumentation or equipment problems and uncontrolled confounding variables may also contribute to the noise in the measurements. Systematic bias may result from learning or fatigue effects in repeat testing; this problem relates to the testing threat to internal validity whereby the pretest can influence the values obtained on the post-test. For example, in a series of repeat, maximal isokinetic dynamometer trials the patients may fatigue significantly if the time interval between tests is too short. This would result in the values recorded decreasing systematically across repeat measures. In other instances, patients naïve to the test protocols may improve systematically across trials due to a learning effect or increased confidence in their ability to perform the test. This example highlights the importance of adequately habituating subjects to the test procedures. Only then can a proper assessment of reliability be conducted. Conceptually, the issue of test–retest reliability is relatively straightforward. Less clear is how best to assess and quantify reliability. Certainly to date there has been no consensus in the literature. Myriad diverse approaches and statistical techniques are presented in the literature, resulting in confusion for the researcher and practitioner. We refer readers interested in a more in-depth treatment of the competing theories and methods than the one presented here to excellent recent reviews by Atkinson and Nevill (1998);

Hopkins (in press). We present the approaches and consequent statistical techniques that we consider the most appropriate and instructive. The first example focuses on intrarater reliability for variables at the interval-ratio level of measurement (variables that have real numbers as values). The second example addresses interrater reliability for nominal variables possessing labels or categories as values.
Fig. 1. Mean peak torque in shoulder flexion across three trials.

3. A worked example for intrarater reliability

3.1. Scenario

The data used in this example were collected in the second author's laboratory and represent values for peak torque (N m) for shoulder flexion at 240° per s in 10 subjects across three repeated trials. The data are presented in Table 1.

3.2. Assessment of systematic bias: changes in the mean

The first step in assessing the reliability of this isokinetic test is to examine some simple descriptive statistics. Calculation of the mean (average) value for each of the three trials permits an initial screen for any large, systematic bias. In addition, the plotting of a line graph (Fig. 1) is often helpful to visualize any systematic bias trends. The mean [standard deviation (SD)] values for test 1, 2, and 3, respectively, are 38.3 (13.4), 39.3 (9.3), and 39.6 (12.1) N m. From inspection of Fig. 1 one might argue that there is some evidence of a trend for the values to be systematically increasing across trials. However, note that the mean increase between tests is of the order of 1 N m, or around 2.5% of the mean value. Given the precision associated with isokinetic testing this does not appear to provide strong evidence of any true systematic bias. Erring on the side of caution one could, however, conduct further trials until the mean peak torque values increased no further. This would provide evidence suggesting that the subjects were fully familiarized with the test and that any learning effect was relatively complete.
Table 1
Peak torque (N m) in shoulder flexion at 240° per s on three separate occasions

Subject   Test 1   Test 2   Test 3
1         28       41       37
2         20       20       20
3         58       46       58
4         30       34       31
5         27       37       23
6         39       46       46
7         35       30       52
8         46       49       46
9         61       49       43
10        39       41       40
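For readers who prefer to script these descriptive checks, the following Python sketch (using NumPy; the array layout and variable names are our own illustration, not part of the original analysis) reproduces the trial means and SDs quoted in the text:

import numpy as np

# Peak torque (N m) from Table 1: rows = 10 subjects, columns = 3 trials
torque = np.array([
    [28, 41, 37], [20, 20, 20], [58, 46, 58], [30, 34, 31], [27, 37, 23],
    [39, 46, 46], [35, 30, 52], [46, 49, 46], [61, 49, 43], [39, 41, 40],
], dtype=float)

means = torque.mean(axis=0)        # 38.3, 39.3, 39.6 N m
sds = torque.std(axis=0, ddof=1)   # 13.4, 9.3, 12.1 N m

for trial, (m, s) in enumerate(zip(means, sds), start=1):
    print(f"Test {trial}: mean = {m:.1f} N m, SD = {s:.1f} N m")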

Inspection of the means and SDs in the current example suggests that a statistic to test for significant differences between the repeat trials is hardly necessary. In the case of more obvious changes in the mean from test to test, a repeated measures analysis of variance (ANOVA) is required or, if only two repeat tests (columns of data) were involved, a paired t-test. For the data set in Table 1, a repeated measures ANOVA (Table 2) results in a probability (P) value of 0.897. Statistics of this type test the obtained mean differences between tests against the assumed null hypothesis of no difference between test means. The reported P value in the output is the probability that we would have observed mean differences this large (about 1 N m), or larger, if the null hypothesis were true. The obtained P value of 0.897 suggests that, if there were in reality no true differences between the test means (null hypothesis), we would obtain differences of 1–1.3 N m approximately 90 times in 100. This indicates that our observed mean differences are very likely to have occurred even under the conditions of a true null hypothesis of no real difference. Hence, there appear to be no real (significant) differences between the mean values for repeat administrations of the test. Conventionally, a P value (or alpha) of 0.05 is used to test for statistical significance. P values from repeated measures ANOVA, for example, of less than 0.05 would suggest real differences between the mean values of repeat tests and thus reveal a systematic bias. These significant differences may suggest fatigue or learning processes at work, as discussed previously. The construction of confidence intervals for the differences between test means is more instructive, however, than tests against the null hypothesis.
Table 2
Repeated measures ANOVA output for the data in Table 1

Source of variance   Sum of squares   df   Mean squares (variance)   F-ratio   Significance (P value)
Between subjects     2951.87          9    327.98
Test                 9.267            2    4.633                     0.109     0.897
Error                762.733          18   42.374
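The entries in Table 2 can be reproduced by partitioning the sums of squares directly. Below is a minimal Python sketch of a one-way repeated measures ANOVA (NumPy and SciPy; the structure and names are our own illustration, not the software output referred to in the text):

import numpy as np
from scipy import stats

# Peak torque (N m) from Table 1: rows = 10 subjects, columns = 3 trials
torque = np.array([
    [28, 41, 37], [20, 20, 20], [58, 46, 58], [30, 34, 31], [27, 37, 23],
    [39, 46, 46], [35, 30, 52], [46, 49, 46], [61, 49, 43], [39, 41, 40],
], dtype=float)

n_subj, n_trials = torque.shape
grand_mean = torque.mean()

# Partition the total sum of squares into between-subjects, test (trial) and error components
ss_total = ((torque - grand_mean) ** 2).sum()
ss_subjects = n_trials * ((torque.mean(axis=1) - grand_mean) ** 2).sum()
ss_test = n_subj * ((torque.mean(axis=0) - grand_mean) ** 2).sum()
ss_error = ss_total - ss_subjects - ss_test

df_subjects, df_test = n_subj - 1, n_trials - 1
df_error = df_subjects * df_test

ms_subjects = ss_subjects / df_subjects   # ~327.98
ms_test = ss_test / df_test               # ~4.63
ms_error = ss_error / df_error            # ~42.37

f_ratio = ms_test / ms_error                        # ~0.109
p_value = stats.f.sf(f_ratio, df_test, df_error)    # ~0.897

print(f"MS between subjects = {ms_subjects:.2f}")
print(f"MS test = {ms_test:.3f}, MS error = {ms_error:.3f}")
print(f"F = {f_ratio:.3f}, P = {p_value:.3f}")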

A 95% confidence interval (CI) for the difference between the means of two trials, for example, provides an estimate of how small (lower limit) or how large (upper limit) the true systematic bias might be in the population. Those who insist upon a significance test can derive this information from the confidence interval. If the limits include the value of zero (no difference) then there is no significant difference between test means (P > 0.05). For example, for the data in Table 1, the 95% CI for the difference between test 1 and test 3 is −7.8 to 5.2 N m. These limits are very wide because of the small sample size of n = 10. As the confidence interval crosses zero, there is no significant difference between the mean scores.

The 95% CI quoted above was derived via a paired t-test. Most statistical software packages provide a confidence interval for the difference between two means as part of a t-test output. To illustrate the process, however, we provide a worked example for the difference between test 1 and test 3. The confidence interval for a population mean difference is derived using the sample mean difference and its attendant standard error (SE, given by SD/√N). Firstly, calculate the difference (test 1 minus test 3) for each case in Table 1 (e.g. for Subject 1 the difference is 28 − 37 = −9). This results in the following set of difference scores for the 10 subjects: −9, 0, 0, −1, 4, −7, −17, 0, 18, −1. Next, calculate (by hand, calculator, or statistical software package) the mean difference and SD of the differences. For the above data the mean difference is −1.3 N m with a SD of the differences of 9.09 N m. The SE of the differences is thus SD/√N = 9.09/√10 = 2.874 N m. The 95% CI for the difference between the two test means is given by (Mean Difference − t0.975 × SE) to (Mean Difference + t0.975 × SE), where t0.975 is the appropriate value from the t-distribution with N − 1 degrees of freedom associated with the 95% confidence level (P = 0.05, two-tailed). Values for t can be found in statistical tables in most statistical textbooks. For the above data, with 10 subjects, there are 9 degrees of freedom, giving a t value of 2.26. Hence the 95% CI for the difference between test means is (−1.3 − 2.26 × 2.874) to (−1.3 + 2.26 × 2.874) = −7.8 to 5.2 N m.
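The same worked example can be scripted. The sketch below (NumPy/SciPy; our own variable names) reproduces the 95% CI for the test 1 minus test 3 mean difference and, for comparison, the paired t-test P value:

import numpy as np
from scipy import stats

test1 = np.array([28, 20, 58, 30, 27, 39, 35, 46, 61, 39], dtype=float)
test3 = np.array([37, 20, 58, 31, 23, 46, 52, 46, 43, 40], dtype=float)

diff = test1 - test3                      # -9, 0, 0, -1, 4, -7, -17, 0, 18, -1
n = len(diff)
mean_diff = diff.mean()                   # -1.3 N m
se_diff = diff.std(ddof=1) / np.sqrt(n)   # 2.874 N m

t_crit = stats.t.ppf(0.975, df=n - 1)     # 2.26 for 9 df
ci_low = mean_diff - t_crit * se_diff     # about -7.8 N m
ci_high = mean_diff + t_crit * se_diff    # about +5.2 N m
print(f"95% CI for the mean difference: {ci_low:.1f} to {ci_high:.1f} N m")

# The paired t-test gives the corresponding P value for the null hypothesis of no difference
t_stat, p_val = stats.ttest_rel(test1, test3)
print(f"Paired t-test: t = {t_stat:.2f}, P = {p_val:.3f}")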

3.3. Assessment of random error

Having tested for, and addressed, systematic bias from test to retest, or across several repeat trials, it remains to quantify and report the random error or noise in the measurements. Of course, random error can also result in changes in the mean value of trials, but random errors (positive and negative) tend to cancel each other out if sufficient data are collected, resulting in no change in the mean trial-to-trial value. The most important type of reliability measure is within-subject variation, as it

influences the precision of measurements in an experimental study (Hopkins, in press). The smaller the within-subject variance (indicating random noise) the better the measurement. For example, if a therapist is interested in monitoring improvements in muscle strength during an injury rehabilitation programme, a measuring instrument that resulted in high within-subject variation may be unable to reliably detect such changes. Essentially, the changes in strength would have to be large enough to outweigh the noise in the measurement. As Hopkins (in press) argues, a small within-subject variation facilitates the detection of small, yet potentially clinically meaningful, changes in the dependent variable of interest. Many diverse statistical approaches exist to quantify this random variation in a measure in repeat measurements on the same subjects. This section will deal with what we consider to be the three most appropriate techniques: typical error (and coefficient of variation), limits of agreement, and the intraclass retest correlation coefficient (ICC).

3.3.1. Typical error

The typical error is also known as the standard error of measurement. For two or more repeat trials the typical error can be quantified using a repeated measures ANOVA. For the data in Table 1, TEST would be entered as the within-subjects factor, defined as having three levels (three columns of data representing the three repeat isokinetic tests). This results in the ANOVA table presented in Table 2. The random error is derived from the mean squares error term in the ANOVA output. This indicates the variance in the random noise component from test to test across the three trials. The square root of this mean squares error term provides an estimate of the typical error or standard error of measurement associated with repeat isokinetic trials. This root mean squares error (RMSE) from Table 2 is 6.5 N m and this provides a quantitative indication of the intrarater reliability of the isokinetic test. This typical error can also be presented as a percentage of the mean peak torque value across the three trials: the coefficient of variation (CV). A crude, though reasonably accurate, method is to simply divide the obtained typical error by the grand mean peak torque (the average of all 30 data points, 10 subjects × 3 trials). For the current example the CV is (6.5/39.1) × 100% = 16.6%. A more complex, though more precise, method requires re-running the ANOVA with the test values log-transformed (natural logarithms). The formula for deriving the CV using this method is: CV = 100(e^RMSE − 1) (Bland, 1995). For the data in Table 1 the RMSE from the repeated measures ANOVA with the natural log-transformed data is 0.16. Hence, the CV is 100(e^0.16 − 1) = 17.4%, broadly analogous to the 16.6% calculated from the crude, yet simpler, method. How does one know if a particular typical error is representative of adequate reliability? Atkinson and Nevill (1998) argue that this question has not been adequately addressed in the literature, and present a detailed case for

decision-making based upon analytical goals including sample size estimation in experiments, effect size, and individual differences. Hopkins (in press) recommends that 95% confidence intervals for the typical error are calculated and reported. These can be derived from the χ² distribution and Hopkins (1997) provides a spreadsheet for this purpose. For the data set in Table 1, with three repeat trials, and 18 degrees of freedom for the error term (Table 2), the 95% CI for the typical error of 6.5 N m is 4.9–9.6 N m. These limits represent the range within which the true typical error for the population is likely to reside.

3.3.2. Limits of agreement

The limits of agreement (LOA) method developed by Bland and Altman (1986) is a measure of within-subject variation closely related to the typical error. We will illustrate the basis of the method first by applying it to a portion of the data in Table 1, test 1 and test 2. The LOA technique is based on an analysis of differences between paired scores on test and retest for each subject (test 1 minus test 2). For example, subject 7 in Table 1 scored 35 N m on test 1 followed by 30 N m on test 2, resulting in a difference score of 5 N m. In the LOA method the difference scores are calculated in this way for each subject, permitting the computation of the mean and SD of the differences. As discussed previously, any significant mean difference between test and retest indicates systematic bias. For the test 1 and test 2 data in Table 1, the mean difference is −1 N m with a SD of the differences of 8.5 N m. Bland and Altman (1986) proposed the calculation of the range of values within which a subject's difference scores would fall for 95% of the time. These 95% LOA are computed by multiplying the SD of the differences by 1.96 (one SD either side of the mean represents 68% of a normal distribution; 1.96 SDs represents 95%). In the current example the 95% LOA would thus be ±1.96(8.5) = ±17 N m. Inasmuch as it is preferable to report systematic and random error separately (Hopkins, in press; Atkinson and Nevill, 1998), the correct reporting of the 95% LOA for intrarater reliability for this example would be −1 ± 17 N m. Strictly, as Hopkins (in press) pointed out, the 95% LOA should be calculated by multiplying the SD of the differences not by 1.96, but by the appropriate value from the t-distribution for the chosen cumulative probability and degrees of freedom. In other words, the figure of 1.96 is appropriate only if the sample size is large (i.e. > 120). As we have 10 subjects, the degrees of freedom (df) in the current example is 9 (N − 1). From standard statistical tables available in many research methods and statistics texts, the critical value of t for 9 df at the 95% confidence level is 2.26. Hence, the true 95% LOA are ±2.26(8.5) = ±19 N m. The calculation of 95% LOA from three or more trials is accomplished via the repeated measures ANOVA, as for the typical error. The key statistic is once again the root of the mean squares error from the ANOVA output (RMSE). The 95% LOA are then calculated as LOA = ±1.96(RMSE)(√2), or ±2.77(RMSE) (Bland, 1995). For the full data set in Table 1, with three repeat trials, the RMSE is 6.5 N m. The 95% LOA are, therefore, ±18 N m. Employing the stricter method derived from the t-distribution described previously, the 95% LOA for the three trials are ±2.26(RMSE)(√2), or ±3.2(RMSE) = ±21 N m. The 95% LOA indicate the range within which a subject's peak torque would be expected to fall in repeat administrations of the isokinetic test. For example, a subject gaining a peak torque for shoulder flexion of 45 N m in one trial may be expected 95% of the time to produce a value anywhere between 24 N m (45 − 21) and 66 N m (45 + 21) in a subsequent trial (assuming no systematic bias). Once again the decision as to whether this represents adequate reliability is left to the researcher and practitioner. Much depends on the context in which the measurements are being used and the analytical goals of the user (Atkinson and Nevill, 1998).

Should researchers use the typical error, CV, or the 95% LOA to assess reliability for metric (interval-ratio level) variables? Atkinson and Nevill (1998) argue strongly for the 95% LOA whereas Hopkins (in press) rejects this method and makes a case for the preferred use of the typical error and/or CV. Ideally, the decision should be based upon a thorough evaluation of the assumptions that underpin each statistical technique. These assumptions, which differ between the methods, are beyond the scope of this article. Briefly, many biological measures have measurement errors that increase as the value of the measurement increases (Atkinson and Nevill, 1998). This phenomenon, known as heteroscedasticity, violates the assumption of constant error variance or homoscedasticity underpinning the LOA and typical error methods. Such heteroscedastic data, however, are suited ideally to the percent coefficient of variation method, which assumes that the measurement error is proportional to the magnitude of the measured values. We refer the reader to Atkinson and Nevill (1998) for a comprehensive treatment of the relevant issues. Based on the available literature, our preference is for the typical error or percent CV, provided all underlying assumptions are satisfied.
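As an illustration, the sketch below computes the typical error, both versions of the CV, and the 95% LOA described above from the Table 1 data (Python with NumPy/SciPy; the helper function and names are our own, not a published routine). The confidence interval for the typical error is obtained here from the χ² distribution, which we assume is the kind of interval behind the 4.9–9.6 N m quoted above; Hopkins' spreadsheet may use a slightly different formulation.

import numpy as np
from scipy import stats

torque = np.array([
    [28, 41, 37], [20, 20, 20], [58, 46, 58], [30, 34, 31], [27, 37, 23],
    [39, 46, 46], [35, 30, 52], [46, 49, 46], [61, 49, 43], [39, 41, 40],
], dtype=float)
n_subj, n_trials = torque.shape

def rmse_from_rm_anova(data):
    """Root mean squares error (typical error) from a one-way repeated measures ANOVA."""
    grand_mean = data.mean()
    ss_total = ((data - grand_mean) ** 2).sum()
    ss_subjects = data.shape[1] * ((data.mean(axis=1) - grand_mean) ** 2).sum()
    ss_test = data.shape[0] * ((data.mean(axis=0) - grand_mean) ** 2).sum()
    ss_error = ss_total - ss_subjects - ss_test
    df_error = (data.shape[0] - 1) * (data.shape[1] - 1)
    return np.sqrt(ss_error / df_error), df_error

typical_error, df_error = rmse_from_rm_anova(torque)   # ~6.5 N m, 18 df
cv_crude = 100 * typical_error / torque.mean()          # ~16.6%
rmse_log, _ = rmse_from_rm_anova(np.log(torque))
cv_log = 100 * (np.exp(rmse_log) - 1)                   # ~17.4%

# 95% CI for the typical error from the chi-squared distribution (assumed formulation)
ci_low = typical_error * np.sqrt(df_error / stats.chi2.ppf(0.975, df_error))   # ~4.9 N m
ci_high = typical_error * np.sqrt(df_error / stats.chi2.ppf(0.025, df_error))  # ~9.6 N m

# 95% limits of agreement: two-trial case (test 1 vs test 2) ...
diff_12 = torque[:, 0] - torque[:, 1]
loa_normal = 1.96 * diff_12.std(ddof=1)                        # ~17 N m
loa_t = stats.t.ppf(0.975, n_subj - 1) * diff_12.std(ddof=1)   # ~19 N m

# ... and from the RMSE for all three trials
loa_3trials = 1.96 * typical_error * np.sqrt(2)                          # ~18 N m
loa_3trials_t = stats.t.ppf(0.975, n_subj - 1) * typical_error * np.sqrt(2)  # ~21 N m

print(f"Typical error = {typical_error:.1f} N m (95% CI {ci_low:.1f} to {ci_high:.1f})")
print(f"CV = {cv_crude:.1f}% (crude), {cv_log:.1f}% (log method)")
print(f"95% LOA (test 1 vs test 2): {diff_12.mean():.0f} ± {loa_t:.0f} N m")
print(f"95% LOA (three trials): ±{loa_3trials_t:.0f} N m")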

3.3.3. Intraclass retest correlation coefficient

The methods presented thus far, typical error, CV, and LOA, are all techniques that quantify the degree of absolute reliability or agreement. Baumgartner (1989) differentiated this from relative reliability, the extent to which subjects maintain their rank order or position in a sample with repeat trials. Relative reliability is assessed with various forms of correlation coefficient. Early literature, especially, is replete with the application of the simple Pearson product moment correlation coefficient (PPM) in test–retest reliability studies. This approach, however, has been discredited in the recent literature (Atkinson and Nevill, 1998). A more appropriate type of correlation coefficient for reliability applications is the ICC. Unlike the PPM, the ICC is a univariate, rather than bivariate, statistic, and also it can deal with N > 2 trials. As we have suggested, any observed (measured) scores are composed of a true score (t) and an error score (e). The variance of the observed scores (SDo²) is equal to the variance of the true scores (SDt²) plus the variance of the error scores (SDe²). Therefore, reliability is defined as the ratio of the true score variance to the observed score variance, or the observed score variance minus the error variance, divided by the observed score variance: Reliability = (SDo² − SDe²)/SDo². Partitioning these variances (mean squares) from a set of repeat measurements is achieved using ANOVA (Baumgartner and Jackson, 1995). Unfortunately, there is widespread debate and confusion regarding exactly how to calculate the ICC. Atkinson and Nevill report that there are at least six methods cited in the literature, all resulting in different values. Those interested in entering the debate are referred to Shrout and Fleiss (1979), Bartko (1966), and McGraw and Wong (1996) for an in-depth treatment. In this paper, we present one method for calculating the ICC from ANOVA. In line with our previous recommendations, this method eliminates any systematic trial-to-trial variance from the analysis. Systematic bias, therefore, is not treated as measurement error, because measurement error is assumed to be random noise. Moreover, the method presented provides an estimate of the reliability for a single test, rather than for a mean of multiple tests. Our rationale is that researchers and practitioners often administer a single trial to derive measurements in practice (Morrow and Jackson, 1993). The output required to compute the ICC is provided in Table 2. The necessary formula is (Baumgartner and Jackson, 1995):

ICC = (MSBS − MSERROR)/[MSBS + (k/k1 − 1)MSERROR]

where MSBS is the mean squares between-subjects (or SDo²), k is the number of trials administered (in this case 3), and k1 is the number of trials for which the ICC is being estimated (in this case a single trial). For the data in Table 1, therefore, the required calculation is: ICC = (327.98 − 42.374)/[327.98 + (3/1 − 1)42.374], resulting in an ICC of 0.69. Ideally, a 95% confidence interval for the ICC should be calculated and reported to indicate the likely range of values containing the true population ICC. This is difficult to compute manually and Hopkins (1997) provides a spreadsheet for this purpose. Also, several advanced statistical software programs provide confidence intervals for the ICC. In the current example, the 95% confidence interval for the ICC is 0.35–0.9. As with the PPM, a correlation of 1 indicates perfect relative reliability. In the present example, the ICC of 0.69 could be interpreted as moderate relative reliability. The 95% CI indicates that the true reliability is likely to range from 0.35 (poor) to 0.9 (good). Clearly, this lack of precision is due to a small N of only 10 subjects in this worked example. Reliability studies must include sufficient numbers of subjects and trials to yield meaningful results. Morrow and Jackson (1993) argued that an N of at least 30 representative subjects was necessary for adequate precision. Hopkins (in press) suggests that at least 50 subjects performing three or more trials provides adequate precision in estimating the typical error. A key problem with the ICC (and the PPM) is that its magnitude is highly dependent upon sample heterogeneity. Inspection of the formula presented for the ICC reveals that the numerator is strongly influenced by the magnitude of the observed (measured) variance between subjects (MSBS or SDo²). The greater the range or spread of scores, therefore, the greater the magnitude of the ICC. The typical error, CV, and LOA, however, are unaffected by sample heterogeneity. Notwithstanding this important limitation, we believe that the ICC can provide important information regarding relative reliability. As Bland and Altman (1990) noted, the ICC can be regarded as an index of the information content of a measurement, revealing the ability of the test to discriminate between subjects. When testing relatively homogeneous samples, however, for example elite athletes, the ICC would inevitably be a poor choice of statistic due to low between-subjects variance. We support the view of Atkinson and Nevill (1998), therefore, that the ICC should not be cited as the sole statistic in a reliability study. Moreover, if researchers choose to employ the ICC, the formula used to compute it must be presented and justified, with 95% confidence intervals reported to indicate the precision of the estimates (Morrow and Jackson, 1993).
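A minimal Python sketch of the single-trial ICC calculation described above (our own variable names; the mean squares are taken from the same ANOVA partition used earlier):

import numpy as np

torque = np.array([
    [28, 41, 37], [20, 20, 20], [58, 46, 58], [30, 34, 31], [27, 37, 23],
    [39, 46, 46], [35, 30, 52], [46, 49, 46], [61, 49, 43], [39, 41, 40],
], dtype=float)
n_subj, k = torque.shape   # 10 subjects, k = 3 trials administered
k1 = 1                     # ICC estimated for a single trial

grand_mean = torque.mean()
ss_subjects = k * ((torque.mean(axis=1) - grand_mean) ** 2).sum()
ss_test = n_subj * ((torque.mean(axis=0) - grand_mean) ** 2).sum()
ss_error = ((torque - grand_mean) ** 2).sum() - ss_subjects - ss_test

ms_bs = ss_subjects / (n_subj - 1)              # ~327.98
ms_error = ss_error / ((n_subj - 1) * (k - 1))  # ~42.37

icc = (ms_bs - ms_error) / (ms_bs + (k / k1 - 1) * ms_error)
print(f"ICC (single trial) = {icc:.2f}")        # ~0.69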

4. A worked example for interrater reliability

4.1. Scenario

The data used in this example are fictional, and are used to illustrate the estimation of interrater reliability for a criterion-referenced test. In such tests, subjects are assigned a nominal code that classifies them as belonging to a particular category. Two therapists were asked to examine 50 ankles independently, using a squeeze test, and to report the presence or absence of ankle syndesmosis injury. The data are presented in Table 3.
Table 3
Diagnostic judgements of two therapists for ankle syndesmosis injury using the squeeze test (n = 50 ankles)

                          Therapist 2
Therapist 1     Injured      Non-injured
Injured         20           5
Non-injured     10           15


The data in Table 3 represent a double classification table. The simplest way to estimate interrater reliability is to calculate the proportion of agreement (PA) (Baumgartner and Jackson, 1995). The PA is the number of correct negative and positive diagnoses expressed as a percentage of the total number of ankle examinations. If therapist 1 reports that a particular ankle is injured, and therapist 2 agrees, this would be scored as a correct positive. Similarly, if therapist 1 reports that an ankle is non-injured, and the diagnosis of therapist 2 supports this, then this would be scored as a correct negative. The PA is simply the correct positives plus the correct negatives, divided by the total number of ankles examined. From Table 3, note that there are 20 correct positives and 15 correct negatives. The PA is, therefore, (20 + 15)/50 = 0.7. A PA of 1 would suggest perfect interrater reliability.

The problem with the PA statistic is that it does not account for the fact that, particularly in a yes/no criterion-referenced test, a number of the correct classifications may have occurred by chance. To overcome this limitation, a related statistic (Cohen's kappa) that corrects for chance is preferred. The formula for kappa is κ = (PA − PC)/(1 − PC), where PC is the proportion of agreement expected by chance. Baumgartner and Jackson (1995) provide the equation for calculating PC. Alternatively, many statistical software packages provide kappa as part of the crosstabulations output. For the data in Table 3, kappa = 0.4, indicating relatively poor interrater reliability. Note that the kappa coefficient value of 0.4 is markedly different from the simple PA of 0.7. A 95% confidence interval for kappa can also be constructed using the standard error from the output.
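The PA and kappa calculations can be scripted directly from the 2 × 2 table. In the sketch below (Python/NumPy; our own code, not the crosstabulations output mentioned above), chance agreement PC is computed from the marginal totals, which we assume corresponds to the Baumgartner and Jackson (1995) equation referred to in the text:

import numpy as np

# 2 x 2 classification table from Table 3: rows = therapist 1, columns = therapist 2
#                      T2 injured   T2 non-injured
table = np.array([[20, 5],      # T1 injured
                  [10, 15]])    # T1 non-injured

n = table.sum()                          # 50 ankles
pa = np.trace(table) / n                 # proportion of agreement = 0.7

# Chance agreement from the marginal totals (standard kappa calculation, assumed here)
row_marg = table.sum(axis=1) / n
col_marg = table.sum(axis=0) / n
pc = (row_marg * col_marg).sum()         # 0.5 for these data

kappa = (pa - pc) / (1 - pc)             # 0.4
print(f"PA = {pa:.2f}, PC = {pc:.2f}, kappa = {kappa:.2f}")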

5. Conclusion

This article has attempted to tackle the important issue of reliability in repeated tests on a sample of subjects. All measurements in practice are attended by a degree of measurement error. This error can be systematic (bias) or random (noise). We recommend the separate analysis and reporting of bias and random error. Reliability is an essential prerequisite for test validity, since a measure that is inconsistent from trial to trial could not be considered valid. Therefore, reliability estimates should always be calculated and reported, particularly for new tests or measurement instruments.

For data recorded at the interval-ratio level of measurement the most appropriate techniques for assessing absolute reliability are the typical error, the coefficient of variation, and the 95% limits of agreement. To assess relative reliability for this level of data, the intraclass correlation coefficient is preferred to the Pearson correlation coefficient. For nominal (categorical) data in criterion-referenced tests, Cohen's kappa is the statistic of choice. To provide estimates of precision, we recommend that 95% confidence intervals are calculated and reported, irrespective of the statistic employed.

References
Atkinson, G., Nevill, A.M., 1998. Statistical methods for assessing measurement error (reliability) in variables relevant to sports medicine. Sports Medicine 26, 217–238.
Bartko, J.J., 1966. The intraclass correlation coefficient as a measure of reliability. Psychological Reports 19, 3–11.
Baumgartner, T.A., Jackson, A.S., 1995. Measurement for Evaluation in Physical Education and Exercise Science, fifth ed. Brown and Benchmark, Dubuque, IA, pp. 113–118.
Bland, M., 1995. An Introduction to Medical Statistics, second ed. Oxford University Press, Oxford, pp. 265–272.
Bland, J.M., Altman, D.G., 1986. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet i (8), 307–310.
Bland, J.M., Altman, D.G., 1990. A note on the intraclass correlation coefficient in the evaluation of agreement between two methods of measurement. Computers in Biology and Medicine 20 (5), 337–340.
Hopkins, W.G., 1997. A new view of statistics. sportsci.org: Internet Society for Sport Science. sportsci.org/resource/stats
Hopkins, W.G., in press. Measures of reliability in sports medicine and sport science. Sports Medicine.
McGraw, K.O., Wong, S.P., 1996. Forming inferences about some intraclass correlation coefficients. Psychological Methods 1 (1), 30–46.
Morrow, J.R., Jackson, A.W., 1993. How significant is your reliability? Research Quarterly for Exercise and Sport 64 (3), 352–355.
Nelson, M., 1997. The validation of dietary assessment. In: Margetts, B.M., Nelson, M. (Eds.), Design Concepts in Nutritional Epidemiology, second ed. Oxford Medical Publications, Oxford, p. 242.
Schoppen, T., Boonstra, A., Groothoff, J.W., de Vries, J., Goeken, L.N., Eisma, W.H., 1999. The timed up and go test: reliability and validity in persons with unilateral lower limb amputation. Archives of Physical Medicine and Rehabilitation 80 (7), 825–828.
Shrout, P.E., Fleiss, J.L., 1979. Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin 86, 420–428.

