Confidence intervals are not commonly provided with analytical or other data reported in Clinical Chemistry, although P values are. However, confidence intervals provide an explicit demonstration of the direction and magnitude of uncertainty and are intuitively easy to grasp, unlike P values. It is therefore argued that the Journal should adopt a policy requiring the provision of confidence intervals. Such a policy would improve the statistical rigor of Journal reports.

Indexing Terms: statistics · parametric vs nonparametric distributions · likelihood ratio · receiver-operating characteristic curve

"When possible, quantify findings and present them with appropriate indicators of measurement error or uncertainty (such as confidence intervals). Avoid sole reliance on statistical hypothesis testing, such as the use of P values, which fails to convey important quantitative information."
International Committee of Medical Journal Editors (1)

Recently, a reviewer of a paper my colleagues and I submitted to Clinical Chemistry asked for the P value of a difference between sets of data. We had instead provided the confidence intervals for the data, but this was of no interest: P values were, for the reviewer, the "bottom line" in such comparisons. Or are they? It is the intent of this Opinion to make the case for more appropriate statistical descriptions of experimental data than are currently used in Clinical Chemistry but which are in accord with the most recent Uniform Requirements of the International Committee of Medical Journal Editors (1) quoted above. The Journal issues very detailed statistical guidelines for authors (2), which do include a requirement for the use of appropriate indicators of measurement error or uncertainty, but these guidelines are often not observed in practice and frequently the wrong type of statistical information is provided. Examples are readily found on random inspection of the Journal; some are mentioned here for illustrative purposes only but are not cited in this Opinion, it being my intent to persuade rather than pillory.

For example, test A was claimed to be more sensitive [true positive (TP) rate or fraction = 44%] than test B (TP rate = 17%), but the 95% confidence intervals for these reported sensitivity values actually overlapped, i.e., 22% to 69% and 3.5% to 41%. Test A is not more sensitive; these tests are equivalent.² In another paper, test C was claimed to be superior to test D on the basis of a receiver-operating characteristic (ROC) curve analysis, but neither the areas under the curve (i.e., accuracy) nor the confidence intervals of these accuracies were provided. Finally, a set of tests were compared by ROC curve analyses of each test's accuracy; one test was claimed to possess less discriminating power than the others, but the confidence interval of each estimate of test accuracy was not determined and indeed visual inspection of the data suggested that all tests were equivalent. These are three examples taken from one issue of the Journal.

How should experimental findings be presented? There are now many sources of good advice. For example, Clinical Biostatistics (4) and Statistics in Practice (5), both collections of articles first appearing in Clinical Pharmacology and Therapeutics and the British Medical Journal, respectively, are invaluable; the latter is particularly good on the practical aspects. These aspects are also well addressed in Altman's recent book (3), as its title suggests. Statistics deals with samples taken from populations consisting of all the possible observations that could be made; these samples are assumed to possess the same characteristics as the parent population. Altman (3) has pointed out, however, that sampling from a population with a truly gaussian distribution (synonym: normal distribution) may not always produce samples that are themselves gaussian.

Samples consist of observed data and possess empirical distributions. However, both samples and their parent populations may conform to a variety of probability distributions. These mathematical abstractions are characterized by one or more parameters; for example, the gaussian distribution is completely described by two

Department of Clinical Biochemistry, University Hospital (University of Western Ontario), P.O. Box 5339, London, Ontario, Canada N6A 5A5.
Received September 9, 1992; accepted January 25, 1993.

¹ Nonstandard abbreviations: TP, true positive; TN, true negative; ROC, receiver-operating characteristic; and LR, likelihood ratio.

² The details of the calculation of confidence intervals are outlined in a later section (see Examples of Confidence Interval Calculations). However, it is conceptually easy to grasp, even at this stage, the very obvious overlapping of the two intervals. The more conventional approaches to analyzing data on nominal scales are the proportion tests such as the z-test, χ² test, Fisher's exact test, or McNemar's test, using Yates' continuity correction when the sample sizes are small (3). All of these tests are more involved than the simple calculation of the confidence intervals; what is more, in the quoted example, they all show that there is no statistical difference, i.e., P >0.05, between the stated sensitivities.
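The overlap described in footnote 2 can be illustrated with a short Python sketch. It computes an exact binomial interval by the Clopper-Pearson construction, which is an assumption here: the article does not say which method produced the quoted intervals. The counts used (8 of 18 and 3 of 18) are hypothetical, chosen only because they give rates near 44% and 17%; the article does not report the sample sizes. The function names are likewise illustrative.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect(f, lo, hi, tol=1e-10):
    """Root of a monotonic function f on [lo, hi], found by bisection."""
    sign_lo = f(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == sign_lo:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci(x, n, conf=0.95):
    """Clopper-Pearson (exact binomial) interval for x successes in n trials."""
    alpha = 1 - conf
    # Lower limit: the p at which P(X >= x) equals alpha/2.
    lower = 0.0 if x == 0 else _bisect(
        lambda p: 1 - binom_cdf(x - 1, n, p) - alpha / 2, 0.0, 1.0)
    # Upper limit: the p at which P(X <= x) equals alpha/2.
    upper = 1.0 if x == n else _bisect(
        lambda p: binom_cdf(x, n, p) - alpha / 2, 0.0, 1.0)
    return lower, upper


def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical counts (not taken from the article).
ci_a = exact_ci(8, 18)   # test A: 8/18, about 44%
ci_b = exact_ci(3, 18)   # test B: 3/18, about 17%
print(ci_a, ci_b, intervals_overlap(ci_a, ci_b))
```

Bisection on the binomial tail probabilities keeps the sketch free of third-party libraries; a statistics package offering a beta-quantile function would give the same limits more directly.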
data may be chosen, but the most usual ones are 90%, 95%, and 99%, for which the following multiples of the SD on each side of the mean apply: 1.645, 1.96, and 2.576, respectively. Whereas SD is a descriptive index, the standard error of the mean (SE or, less commonly, SEM) is a measure of uncertainty. Feinstein (7) comments that neither "standard" nor "error" is an appropriate term for this parameter and that these terms can only serve to confuse the unwary.

Despite its inappropriate name, the SE is an extremely important index. It is calculated from the SD and sample size (SE = SD/√n). If the population is repeatedly sampled, and each sample has its mean and SD calculated, how well do these mean values estimate the true mean of the parent population? If each of these sample means is thought of as an individual value, then the standard deviation of these means is the SE. It is thus a measure of the uncertainty of a single sample mean as an estimate of the population mean (3). If the population distribution is gaussian, then the distribution of these sample means will also be gaussian. In addition, the distribution of the sample means will also approach normality, whatever the distribution of the variables in the parent population, provided the sample is sufficiently large (the Central Limit Theorem).

These remarks are a necessary prerequisite to introducing the concept of a confidence interval. This interval for a mean extends on both of its sides by a multiple of the SE. This idea is exactly analogous to that previously described for the SD. Thus 1.96 × SE defines the

³ These references may be obtained from Subscriber Services, American College of Physicians, Independence Mall West, Sixth Street at Race, Philadelphia, PA 19106-1572.

percentage point (α) for the appropriate confidence interval (e.g., for 99%, α = 0.01; for 95%, α = 0.05; for 90%, α = 0.1). A plot of the 95% confidence interval, using the t distribution, is shown in Figure 1 for values of n between 5 and 250. Thus, for sample sizes between 25 and 50, the confidence interval is in the range of 0.25 to 0.5 SDs, whereas for sample sizes >50, the confidence interval will be <0.25 SDs.

What advantage does the knowledge of the confidence interval confer over the more traditional use of SD and SE? The latter parameters are usually used in the traditional process of stating a null, and often an alternative, hypothesis and then using a test statistic to obtain a P value for rejecting or accepting the hypotheses (3, 8). This process gives no indication at all of the magnitude of the effect being studied; it merely produces a probability value. (This aspect will be examined in more detail later: see P Values, below.) By contrast, the confidence interval demonstrates, explicitly, the magnitude of the uncertainty, and its direction, as well as being an intuitively easy concept to grasp. Both Lancet and British Medical Journal have published numerous articles on this topic, which have been gathered, by the latter journal, into a book (9) with an associated computer program, the Confidence Interval Analysis calculator (10). It is, of course, accepted that
930 CLINICAL CHEMISTRY, Vol. 39, No. 6, 1993
the SD value may also be used to calculate the SE and the confidence interval. In practice, this is rarely done, as a random inspection of this Journal will show.

Examples of Confidence Interval Calculations

Some examples are provided to demonstrate the importance of the explicit description of uncertainty.

Confidence intervals: means. The simplest example of the value of using the confidence interval can be seen when referring to a mean value (11). Its confidence interval is obtained by calculating SE, obtaining the appropriate t-value, as explained above, and evaluating the term (t × SE). Thus, when the sample size is 15, the mean = 10.0, SD = 3.0, SE = 3/√15 = 0.775, and t = 2.145, the 95% confidence interval on each side of the mean is 10 ± 0.775 × 2.145 = 10 ± 1.66 (i.e., 8.34 to 11.66); for a population of n = 100, SE = 3/√100 = 0.3, t = 1.984, and the 95% confidence interval is now mean 10 ± 0.595, or 9.405 to 10.595. Such data illustrate the profound influence of population size on the confidence interval already demonstrated in Figure 1. Again, it is not hard to find articles in Clinical Chemistry that display, for example, mean ± SD values for several groups but with the sizes of the groups varying from 20 to >100! Confidence intervals would have given a much clearer understanding of the variability of the data.

The calculations described above may be completely avoided by use of the Confidence Interval Analysis program mentioned earlier (10).

Confidence intervals: proportions. Sensitivity (TP rate) and specificity (true negative, or TN, rate) data are commonly reported in Clinical Chemistry. However, it is unusual to see the confidence intervals provided with such data. The need for these can readily be appreciated by an examination of Figure 2, which shows the effect of population size on the 95% confidence intervals for a test with a sensitivity of 90%. When the population is <20, the zone of uncertainty exceeds 30%. This aspect is certainly not appreciated by many workers who appear to be seduced by the apparently satisfactory test performance indicated by a sensitivity of 90%. The confidence interval of proportions, such as sensitivity and specificity, follows a binomial distribution; therefore, unless the proportion is exactly 50% or the sample size is large, the distribution is asymmetric, as shown in Figure 2. These intervals may be calculated (12, 13) or exact values for the 90% and 95% zones may be obtained for population sizes from n = 2 to 100 from the table of binomial distributions in the Geigy Scientific Tables (14). When n >100, the simple formula given by Gardner and Altman (12) suffices. Alternatively, these limits may be obtained by use of the Confidence Interval Analysis program mentioned earlier (10).

Confidence intervals: likelihood ratios. Bayesian analysis is frequently invoked in Clinical Chemistry. The likelihood ratio is the link between the pretest and the posttest odds of disease (15). Of course, as with all such estimates, the likelihood ratio is subject to error, which the confidence interval quantifies (16). The 95% confidence interval of a likelihood ratio (LR) value is LR^a to LR^b, where a = 1 − (1.96/χ) and b = 1 + (1.96/χ). χ² is evaluated by the simplified formula for a 2 × 2 predictive value table. Beck (17) provides a detailed example of the calculations. Again, it is uncommon to see likelihood ratios associated with this essential indication of variability in Clinical Chemistry, although it is surely as important as the provision of SD or SE.

Confidence intervals: area under the ROC curve. ROC curve analysis is an important and powerful tool for evaluating a test's diagnostic accuracy. The essential index of accuracy, when using ROC curve analysis, is the area under the curve (18). Swets (18) also suggests that areas of 0.5 to 0.7 denote low accuracy, 0.7 to 0.9 moderate accuracy, and >0.9 high accuracy. However, one must actually measure the area to establish the magnitude of the accuracy. Bamber (19) has shown that the area under the curve is related to the Mann-Whitney U-statistic (a nonparametric test based on rank order). This is the basis for the Hanley-McNeil procedure for obtaining these areas (20). Nonetheless, it is rare to encounter this essential index in the pages of Clinical Chemistry, although ROC curves are frequently used. But all such estimates of accuracy also require an indication of the extent of error of this estimation, which is provided by the SE. Unfortunately, the calculations of both the area under the curve and the SE (20, 21) are tedious and prone to error, and are best performed either by spreadsheet analysis (21) or by a more comprehensive computer program (22, 23). Beck and Shultz (21) give an extended illustration of these calculations.
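The relation between the area under the ROC curve and the Mann-Whitney U-statistic, and the SE of that area, can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure of references 20 to 23: it assumes that higher test values indicate disease, counts ties as one half, and uses the large-sample SE expression usually attributed to Hanley and McNeil. The data values and function names are hypothetical.

```python
from math import sqrt


def auc_mann_whitney(diseased, healthy):
    """Area under the ROC curve as U / (n_diseased * n_healthy).

    Each (diseased, healthy) pair scores 1 if the diseased value is higher,
    0.5 if the two are tied, and 0 otherwise.
    """
    score = 0.0
    for d in diseased:
        for h in healthy:
            if d > h:
                score += 1.0
            elif d == h:
                score += 0.5
    return score / (len(diseased) * len(healthy))


def hanley_mcneil_se(auc, n_diseased, n_healthy):
    """Approximate SE of the area (Hanley-McNeil large-sample expression)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc**2 / (1 + auc)
    variance = (
        auc * (1 - auc)
        + (n_diseased - 1) * (q1 - auc**2)
        + (n_healthy - 1) * (q2 - auc**2)
    ) / (n_diseased * n_healthy)
    return sqrt(variance)


# Hypothetical test results, for illustration only.
diseased = [3, 4, 5, 6]
healthy = [1, 2, 3, 4]

area = auc_mann_whitney(diseased, healthy)
se = hanley_mcneil_se(area, len(diseased), len(healthy))
# Normal-approximation 95% interval (an assumption; may exceed 1 for small n).
print(area, se, (area - 1.96 * se, area + 1.96 * se))
```

With groups this small the normal-approximation interval is very wide, which is the point the text makes about reporting an area without its SE.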
Fig. 2. Relationship between population size and the 95% confidence interval for an estimate of sensitivity
These limits were obtained from the Geigy Scientific Tables (14); the values were corroborated by using the Confidence Interval Analysis program (10)

Confidence intervals: regression.⁴ Clinical Chemistry (2) requires an extensive list of statistical parameters

⁴ I have avoided discussion of the advantages of the Deming plot over the conventional least-squares method (24) or of the bias plot (25) for examining the relationship between two variables.
zones of uncertainty (26). Thus, it is possible to see (in the inner zone) that for serum glucose concentrations of 5 and 25 mmol/L, the 95% confidence intervals for the mean blood glucose meter readings are 3.39 to 10.4 and 19.4 to 26.0 mmol/L, respectively. The outer zone shows the uncertainty in predicted values of y for an individual value of x, the prediction or tolerance interval. Clearly, graph B provides much more information about the scatter of the data than does graph A, although the

degrees of freedom and the percentage point (α) for the appropriate confidence interval (e.g., for 99%, α = 0.01, etc.) as previously mentioned.

The value of y (y_est) is calculated for the chosen value of x; thus, y_est = (0.79 × 5.0) + 2.94, and the SE(y_est) is calculated from the expression:

\[ SE(y_{est}) = S_{res} \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{(n - 1) S_x^2}} \]

evaluated thus:

\[ SE(y_{est}) = 2.14 \sqrt{\frac{1}{8} + \frac{(5 - 15.5)^2}{7 (6.99)^2}} = 1.43 \]

The value of t, for 6 degrees of freedom and α = 0.05, is 2.45; the 95% confidence interval is therefore:

\[ y_{est} - t \cdot SE(y_{est}) \ \text{to} \ y_{est} + t \cdot SE(y_{est}) \]

or 3.39 to 10.39 mmol/L.

\[ SE(y_{pred}) = S_{res} \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{(n - 1) S_x^2}} \]

For x = 5 mmol/L, y_pred = (0.79 × 5.00) + 2.94 = 6.89, and the expression for SE(y_pred) is:

\[ SE(y_{pred}) = 2.14 \sqrt{1 + \frac{1}{8} + \frac{(5 - 15.5)^2}{7 (6.99)^2}} = 2.57 \]

As before, the value of t, for 6 degrees of freedom and α = 0.05, is 2.45; the 95% confidence interval is therefore:

[Fig. 3, graphs A and B; x-axis: Serum glucose (mmol/L)]
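The two regression calculations above differ only by the extra "1 +" under the square root, which a short Python sketch makes explicit. The numerical inputs are those quoted in the text (slope 0.79, intercept 2.94, n = 8, mean of x = 15.5, t = 2.45); reading 2.14 as the residual SD and 6.99 as the SD of x is an assumption, as are the function and variable names.

```python
from math import sqrt


def se_fitted_mean(x, n, x_bar, s_x, s_res):
    """SE of the fitted mean value of y at a chosen x (confidence interval)."""
    return s_res * sqrt(1 / n + (x - x_bar) ** 2 / ((n - 1) * s_x**2))


def se_prediction(x, n, x_bar, s_x, s_res):
    """SE of an individual predicted y at a chosen x (prediction interval)."""
    return s_res * sqrt(1 + 1 / n + (x - x_bar) ** 2 / ((n - 1) * s_x**2))


def interval(centre, t_value, se):
    """Two-sided interval: centre - t*SE to centre + t*SE."""
    return centre - t_value * se, centre + t_value * se


# Values quoted in the text; the roles of 2.14 and 6.99 are assumed.
slope, intercept = 0.79, 2.94
n, x_bar, s_x, s_res = 8, 15.5, 6.99, 2.14
t_value = 2.45            # t for 6 degrees of freedom, alpha = 0.05
x = 5.0                   # serum glucose, mmol/L

y_hat = slope * x + intercept
ci = interval(y_hat, t_value, se_fitted_mean(x, n, x_bar, s_x, s_res))
pi = interval(y_hat, t_value, se_prediction(x, n, x_bar, s_x, s_res))
print(y_hat, ci, pi)
```

Both standard errors grow as x moves away from the mean of x, which is why the zones of uncertainty widen toward the ends of the regression line.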
\[ \frac{e^{2z_1} - 1}{e^{2z_1} + 1} \ \text{to} \ \frac{e^{2z_2} - 1}{e^{2z_2} + 1} \]

For the correlation coefficient in the legend to Figure 3, r = 0.941, z = 1.7467, z2 = 2.6228, z1 = 0.87058, and the 95% confidence interval for the correlation coefficient is 0.702 to 0.989.

All of these equations can be evaluated by direct entry into the sets of z-transformations in the Geigy Scientific Tables (27), thus avoiding a set of awkward arithmetic manipulations. Alternatively, the Confidence Interval Analysis program may be used (10).

Confidence intervals: nonparametric analyses. When a studied population has a nonnormal distribution (a fairly common occurrence in the practice of clinical chemistry), the commonly used descriptor of the population is the median. Assume that 11 observations have been made, the results of which are listed in ascending order (a necessary first step in nonparametric statistics): 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, and 25, with a median value of 15. This data set will be used to obtain the 95% confidence interval for the median.

The approximate confidence interval for the median (28) may be calculated as follows:

\[ L = \frac{n}{2} - \left(1.96 \times \frac{\sqrt{n}}{2}\right) \quad \text{and} \quad U = 1 + \frac{n}{2} + \left(1.96 \times \frac{\sqrt{n}}{2}\right) \]

where L and U are the lower and upper limits, respectively, and the multiplier is 1.645 for the 90% confidence interval, 1.96 for 95%, and 2.576 for 99%. For the example above, these limits evaluate to:

P Values

Information for Authors (2) suggests that sole reliance should not be placed on, for example, P values, but the experience quoted above suggests otherwise. As far back as 1978, Rothman (31) stated, in an editorial in the New England Journal of Medicine, that "P values ... are not good measures of the strength of the relation between study variables. P values serve poorly as descriptive statistics." Bailar and Mosteller write (32), in an article originally prepared for the Annals of Internal Medicine, although not cited in Clinical Chemistry's Information for Authors, "Confidence intervals offer a more informative way to deal with the significance test than does a simple P value. Confidence intervals for a single mean or a proportion provide information about both magnitude and its variability." Likewise, Gardner and Altman (33), professional statisticians, comment that "even precise P values convey nothing about the sizes of the differences between study groups." However, a random search through Clinical Chemistry shows an undue reliance on P value boundaries, e.g., P <0.01, >0.05, and so on. Although, in the past, it was necessary to rely on statistical tables for the values of P, many commonly available microcomputer statistical programs can calculate an exact value for P; so why are P values still given in this manner? The undue reliance on a P value above or below 0.05 has in any case been savaged by Feinstein (34), who as a mathematician and clinical epidemiologist speaks with considerable authority, in terms that should be reprinted in all "Advice to Authors"-type articles:

the statistical strategy proposed by Sir Ronald Fisher, who regarded 95% of the inner values [of a distribution] as common and the remaining 5% as significantly uncommon. Although the strat-