Scientific Institute of Public Health, Clinical Biology, J. Wytsmanstraat 14, Brussels, Belgium
University of Liège, Medical Informatics and Biostatistics, CHU Sart Tilman B23, Liège, Belgium
Article info
Article history:
Received 24 July 2011
Received in revised form 27 November 2011
Accepted 28 November 2011
Available online 8 December 2011
Keywords:
External Quality Assessment
Z-scores
Statistics
Abstract
In EQA programs, Z-scores are used to evaluate laboratory performance. They should indicate poorly performing laboratories, regardless of the presence of outliers. For this, two different types of approaches exist. The first type comprises outlier-based approaches, which first exclude outlying values, calculate the average and standard deviation of the remaining data, and obtain Z-scores for all values (e.g., Grubbs and Dixon). The second type comprises the robust approaches (e.g., Tukey and Qn, or the algorithm recommended by ISO). The different approaches were assessed on randomly generated samples from the Normal and Student's t distributions. Part of the sample data were contaminated with outliers. The numbers of false and true outliers were recorded and, subsequently, Positive and Negative Predictive Values were derived. The sampling mean and variability were also calculated for the location and scale estimators. The various approaches performed similarly for sample sizes above 10 and when outliers were at a good distance from the centre. For smaller sample sizes and closer outliers, however, the approaches performed quite differently. Tukey's method was characterised by both a high true and a high false outlier rate, while the ISO and Qn approaches demonstrated weak performance. Grubbs' test yielded the best overall results.
© 2011 Elsevier B.V. All rights reserved.
1. Introduction
In laboratory medicine, External Quality Assessment (EQA) programs for quantitative tests have been running for more than half a century [1]. In such programs, Z-scores have become a way to assess the quality of clinical laboratories by classifying them on a common continuous scale and flagging those with unacceptable results. In addition, these scores can be interpreted in the same way as those derived from internal quality control procedures. Z-scores are based on a measure of the centre and scale of the distribution of the results, in which the difference from the centre is expressed as a multiple of the scale: Z-score = (individual result − centre) / scale. There is common agreement to flag Z-scores with absolute values larger than 3, requesting an action from the laboratory. Z-scores with absolute values larger than 2 and smaller than 3 are regarded as a warning signal, and those with absolute values smaller than 2 as within acceptable limits [2]. Z-scores obtained from several samples can be combined, and the proportion of such scores exceeding a limit (3 or lower) may be used as a long-term evaluation tool. The Z-score method therefore conveys a different kind of information and is more flexible than outlier detection techniques, which search only for discordant values.
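The scoring and flagging rule above can be made concrete with a short sketch (illustrative assumptions: the plain sample mean and standard deviation serve as centre and scale here, whereas the approaches compared in this study differ precisely in how those two quantities are estimated):

```python
from statistics import fmean, stdev

def z_scores(results, centre, scale):
    """Express each result's deviation from the centre as a multiple of the scale."""
    return [(x - centre) / scale for x in results]

def flag(z):
    """Classify one Z-score according to the conventional limits."""
    if abs(z) > 3:
        return "action"
    if abs(z) > 2:
        return "warning"
    return "acceptable"

# Example survey: five concordant results and one aberrant one.
results = [10.1, 9.8, 10.3, 9.9, 10.0, 13.6]
centre, scale = fmean(results), stdev(results)
for x, z in zip(results, z_scores(results, centre, scale)):
    print(f"{x}: Z = {z:+.2f} ({flag(z)})")
```

Because the aberrant value inflates the plain standard deviation, it dampens its own Z-score; this is exactly why the outlier-based and robust estimation approaches discussed below exist.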
…showed no Z-scores beyond 3) and the number of samples for which no Z-score beyond 3 was found (= True + False Negatives). Likewise, the PPV was calculated as the ratio between the True Positives (i.e. the samples to which an outlier was added and that showed a Z-score beyond 3) and the number of samples for which a Z-score beyond 3 was found (= True + False Positives). For the Grubbs and Dixon test based approaches, the P-value below which outliers are excluded (α) was varied. For the Tukey and Qn approaches, D was varied: lower values of D result in lower standard deviations, higher Z-scores and hence a higher flagging rate. For the ISO-13528 approach, the corresponding parameter was varied. The NPV and PPV for each value of the varying parameter were recorded and graphically displayed. Finally, for each simulated data series of samples generated from the Normal distribution, the variability estimate obtained by every approach was recorded and its mean and standard error calculated.
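The simulation loop described above can be sketched as follows. This is an illustrative reconstruction, not the study's code: Tukey-style fences (scale taken proportional to D times the interquartile range) stand in for one of the five approaches, and the sample size, outlier distance, and replicate count are arbitrary choices.

```python
import random
import statistics

def tukey_flags(sample, d=1.5):
    """Flag |Z| > 3 with a Tukey-style scale: D times IQR/1.349.
    Lower D gives a lower scale, hence higher Z-scores and more flags."""
    q1, _, q3 = statistics.quantiles(sample, n=4, method="inclusive")
    centre = statistics.median(sample)
    scale = max(d * (q3 - q1) / 1.349, 1e-9)  # IQR/1.349 estimates sigma under normality
    return [abs((x - centre) / scale) > 3 for x in sample]

def npv_ppv(n=6, sigma=0.5, dist=5, reps=2000, seed=1):
    """NPV and PPV over paired clean/contaminated simulated Normal samples."""
    rng = random.Random(seed)
    tp = fp = tn = fn = 0
    for contaminated in (False, True):
        for _ in range(reps):
            sample = [rng.gauss(0, sigma) for _ in range(n)]
            if contaminated:
                sample[0] += dist * sigma  # add one outlier at +dist sigma
            flagged = any(tukey_flags(sample))
            if contaminated:
                tp, fn = tp + flagged, fn + (not flagged)
            else:
                fp, tn = fp + flagged, tn + (not flagged)
    return tn / (tn + fn), tp / (tp + fp)
```

Sweeping `d` (or, for the test-based approaches, α) and recording the resulting (NPV, PPV) pairs yields trade-off curves of the kind shown in Figs. 1 and 2.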
3. Results
3.1. False positives
A representative part of the False Positive (FP) rates obtained is depicted in the upper part of Table 1 (no outlier). Among all approaches, the Tukey method showed the most distinctive behaviour: while all other FP rates were below 15% for the Normal distribution and below 30% for the Student's t-distribution, Tukey's approach had a rate above 20% for almost all sample sizes. The Dixon and ISO approaches showed the lowest FP rates. In addition, for samples of size 6 or larger, the FP rate of each approach (except ISO) stabilised for the Normal distribution. By contrast, all FP rates increased with increasing sample size for the Student's t-distribution.
3.2. True positives
The True Positive (TP) rates when adding an outlier at a distance of +3σ or +5σ are shown in Table 1. For all outlier distances, the differences between the approaches were similar for the Normal and Student's t-distributions. For outliers at +3σ, none of the approaches flagged the outliers in more than half of the cases at any sample size. Tukey's approach performed best, reaching a flagging rate of nearly 50% as soon as the sample size was 6 or larger. The other approaches performed much more weakly: the ISO approach had flagging rates below 10% for very small samples, and the remaining approaches found outliers in roughly 10–30% of cases. In addition, the results show a clear improvement in flagging rates for all approaches with increasing sample size and outlier distance, with a probability of detection close to 100% for outliers at +7σ. The ISO and Dixon approaches, however, still performed weakly for very small sample sizes.
3.3. Negative and positive predictive value
The NPV and PPV results for sample size n = 6 are shown in Fig. 1. As in Receiver Operating Characteristic (ROC) analysis, a perfect approach would flag no Z-scores larger than 3 when none exist (negative prediction) and would flag them all when they do exist (positive prediction); this corresponds to a curve made up of a vertical line along the Y-axis and a horizontal line intersecting the Y-axis at the value 1. The further the curve departs from this perfect curve, the worse the performance of the approach. For outliers at +3σ, the curves were located far from the ideal line, so NPV and PPV did not reach high levels for any of the approaches: only a combination of negative and positive predictive values of about 60% was feasible, and although the Grubbs approach tended to perform better, there was little difference between the approaches. With increasing outlier distance, however, the curves tend toward the perfect curve. The Qn approach consistently performed the worst.
Table 1
False and true outlier rates, expressed as percentages, for the five different approaches and a representative selection of investigated sample sizes. False outlier rates are shown in the "no outlier" rows, true outlier rates in the other rows.

Sample  Outlier     Normal distribution                     Student's t distribution
size    distance    Grubbs  Dixon  Tukey  ISO    Qn         Grubbs  Dixon  Tukey  ISO    Qn
5       no outlier  10.9    5.7    29.6   4.2    15         12.8    6.7    33.5   5.3    18.7
6       no outlier   8.1    4.0    22.0   4.6     8.1       15.1    9.1    29.1   9.7    14.8
7       no outlier  10.2    4.0    20.7   5.0    10.0       15.0    9.0    28.0   9.2    15.9
8       no outlier   9.3    4.2    20.8   4.7     8.5       18.2   10.0    32.2  11.8    14.8
20      no outlier   9.8    4.5    21.2   8.4     9.1       31.6   24.3    46.6  32.2    31.5
5       +3σ         21.2   11.9    52.1   8.0    25.4       19.9   10.2    44.7   7.9    23.1
6       +3σ         25.7   13.9    44.7  13.1    20.3       22.7   13.6    42.5  12.1    17.6
7       +3σ         30.0   15.5    47.0  15.3    24.7       22.6   12.9    41.4  13.8    23.3
8       +3σ         30.9   13.7    46.5  16.5    20.9       22.4   12.0    40.6  13.9    18.4
20      +3σ         35.9   17.3    50.7  28.6    28.5       20.5   11.0    43.2  19.7    19.9
5       +5σ         60.0   37.1    85.8  19.5    52.3       51.5   33.4    78.9  17.4    46.5
6       +5σ         73.4   52.9    86.6  43.8    56.1       67.7   47.6    83.4  39.0    49.7
7       +5σ         83.2   62.2    91.3  59.2    64.2       73.2   53.2    83.8  50.9    57.7
8       +5σ         89.7   60.0    90.1  70.3    69.2       79.4   48.8    85.8  59.2    60.9
20      +5σ        100     95.2    99.0  99.3    98.5       99.2   81.1    96.7  93.8    92.9
5       +7σ         87.7   66.6    97.7  35.2    76.7       80.1   58.7    95.3  29.7    70.0
6       +7σ         96.7   85.1    99.1  76.9    83.5       93.8   78.5    98.2  71.6    82.0
7       +7σ         99.5   93.1    99.1  91.9    91.2       97.4   86.3    98.3  84.1    83.2
20      +7σ        100    100     100   100     100        100    100     100   100     100
The outlier searching […] outliers were further away from the distribution (0.02–0.1 for an outlier at +5σ, 0.007–0.06 for an outlier at +7σ).

[Figure 1: four panels (Normal and Student's t, outlier at +3σ and +5σ) plotting the NPV versus PPV trade-off curves for the Grubbs, Dixon, Tukey, ISO and Qn approaches.]
Fig. 1. Negative and positive predictive values for the five different approaches, based on samples of size n = 6, for Normal and Student's t-distributions.
[Figure 2: same panel layout as Fig. 1 (Normal and Student's t, outlier at +3σ and +5σ), NPV versus PPV curves for the five approaches.]
Fig. 2. Negative and positive predictive values for the five different approaches, based on samples of size n = 8, for Normal and Student's t-distributions.
…sample sizes. Precision was similar for all approaches and increased with increasing sample size.
4. Discussion
The findings of the present study illustrate that, as far as symmetric unimodal distributions are concerned, the behaviour of the different approaches for estimating Z-scores does not really depend on the kurtosis (peakedness) of the distribution: similar performances were found for the data generated from the Normal and from the Student's t-distribution. Although the Normal and t-distributions cover a wide range of the distributions that describe data reported in EQA surveys, distributions may in some cases be multimodal or exhibit skewness, and the Z-scores may then become unreliable. Unimodality is a prerequisite for obtaining reliable Z-scores. Moreover, in the light of matrix effects [17], the performance of a laboratory should be assessed with respect to its peers in so-called peer group comparisons, and this can only be assured by grouping data according to equal or similar methodology. As a result, peer groups may be small. For example, half of the peer groups in Belgian EQA programmes for chemistry and immunoassays contain 10 laboratories or fewer.
Apart from avoiding multimodality through the EQA set-up, post hoc controls for unimodality and symmetry may be applied as well. Formal tests have been described to test whether the data are unimodal [18]. They are based on kernel density estimation, and standard errors and significance of multimodality can be obtained by bootstrapping. In addition, asymmetry of the data distribution can be assessed by measuring skewness after removing spurious results.
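A post hoc screen of this kind can be sketched as follows. This is a simplified stand-in for the formal procedure of [18], without the bootstrap significance step: sample skewness is computed directly, and modes are counted as local maxima of a Gaussian kernel density estimate, with the bandwidth an illustrative choice.

```python
import math
import statistics

def skewness(data):
    """Adjusted Fisher-Pearson sample skewness; near 0 for symmetric data."""
    n = len(data)
    m = statistics.fmean(data)
    s = statistics.stdev(data)
    g1 = sum((x - m) ** 3 for x in data) / (n * s ** 3)
    return g1 * math.sqrt(n * (n - 1)) / (n - 2)

def count_modes(data, bandwidth, grid=200):
    """Number of local maxima of a Gaussian kernel density estimate on a grid."""
    lo, hi = min(data) - 3 * bandwidth, max(data) + 3 * bandwidth
    xs = [lo + i * (hi - lo) / (grid - 1) for i in range(grid)]
    dens = [sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)
            for x in xs]
    return sum(1 for i in range(1, grid - 1)
               if dens[i - 1] < dens[i] >= dens[i + 1])
```

A peer group with `count_modes(...) > 1` at a reasonable bandwidth, or with a large absolute skewness, would be a candidate for closer inspection before Z-scores are issued.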
Regarding PPV, we observed that, for small sample sizes (n < 10) and for outliers close to the centre, the Tukey and, to a lesser extent, the Grubbs approaches performed better than the other ones. Note that the ISO approach has, in comparison with the other approaches, low outlier-finding capacity for sample sizes below 10. There is, however, not much difference between the various approaches when the sample size increases and/or when outliers are located further away from the centre, so the question of which approach to select for Z-scores only needs to be addressed for small sample sizes.
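For small peer groups, an outlier-search based scoring of the kind favoured here can be sketched as follows (an illustrative sketch, not the study's implementation: two-sided Grubbs critical values at α = 0.05 are taken from standard tables, the test is iterated, and Z-scores are then computed from the mean and standard deviation of the cleaned data):

```python
import statistics

# Two-sided Grubbs critical values at alpha = 0.05, from standard tables.
GRUBBS_CRIT = {5: 1.715, 6: 1.887, 7: 2.020, 8: 2.126, 10: 2.290, 20: 2.709}

def grubbs_clean(sample):
    """Iteratively drop the most extreme value while Grubbs' test rejects it."""
    data = list(sample)
    while len(data) in GRUBBS_CRIT:
        m, s = statistics.fmean(data), statistics.stdev(data)
        if s == 0:
            break
        extreme = max(data, key=lambda x: abs(x - m))
        if abs(extreme - m) / s <= GRUBBS_CRIT[len(data)]:
            break  # most extreme value is not a significant outlier
        data.remove(extreme)
    return data

def grubbs_z_scores(sample):
    """Z-scores for all submitted results, from mean/SD of the cleaned data."""
    clean = grubbs_clean(sample)
    m, s = statistics.fmean(clean), statistics.stdev(clean)
    return [(x - m) / s for x in sample]
```

For `[10.0, 10.1, 9.9, 10.2, 9.8, 20.0]`, the value 20.0 is excluded by the test and then receives a very large Z-score, while the five remaining results stay well within |Z| < 2.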
Table 2
Standard error and average of the variability estimate obtained by the different approaches, for a representative selection of investigated sample sizes. Better estimates have a lower standard error (left part) and an average as close as possible to the original value of σ, which was 0.5 in our setting (right part). Results are obtained from a Normal distribution.

Sample  Outlier    Standard error of estimated SD      Average of estimated SD
size    distance   Grubbs  Dixon  Tukey  ISO          Grubbs  Dixon  Tukey  ISO    Qn
5       0          0.424   0.423  0.454  0.475        0.478   0.490  0.430  0.553  0.571
6       0          0.394   0.394  0.412  0.442        0.483   0.492  0.438  0.541  0.544
7       0          0.393   0.390  0.415  0.437        0.484   0.496  0.455  0.538  0.553
8       0          0.380   0.377  0.397  0.418        0.484   0.492  0.451  0.530  0.531
20      0          0.290   0.287  0.346  0.315        0.488   0.493  0.484  0.511  0.510
5       +3σ        0.561   0.529  0.559  0.585        0.732   0.763  0.549  0.857  0.846
6       +3σ        0.526   0.494  0.513  0.562        0.686   0.721  0.559  0.771  0.771
7       +3σ        0.504   0.471  0.477  0.528        0.658   0.696  0.546  0.726  0.737
8       +3σ        0.477   0.441  0.482  0.501        0.634   0.673  0.548  0.682  0.700
20      +3σ        0.339   0.316  0.350  0.331        0.547   0.566  0.508  0.553  0.566
5       +5σ        0.845   0.847  0.559  0.859        0.822   0.982  0.549  1.168  0.921
6       +5σ        0.741   0.783  0.513  0.724        0.685   0.825  0.559  0.871  0.837
7       +5σ        0.645   0.720  0.477  0.599        0.605   0.740  0.546  0.760  0.760
8       +5σ        0.565   0.681  0.482  0.538        0.547   0.726  0.548  0.697  0.727
20      +5σ        0.281   0.346  0.350  0.331        0.486   0.505  0.508  0.553  0.568
5       +7σ        0.940   1.118  0.559  1.128        0.703   1.001  0.549  1.367  0.923
6       +7σ        0.663   0.896  0.513  0.756        0.535   0.718  0.559  0.881  0.839
7       +7σ        0.461   0.717  0.477  0.600        0.490   0.594  0.546  0.760  0.760
8       +7σ        0.393   0.726  0.482  0.538        0.481   0.618  0.548  0.697  0.727
20      +7σ        0.281   0.278  0.350  0.331        0.486   0.489  0.508  0.553  0.568
For the NPV, Tukey's approach demonstrated the worst performance: in line with its underestimation of variability in the absence of outliers, this approach has a much higher flagging rate than the other approaches, regardless of the contamination of the sample. Further, for leptokurtic data, more values will be wrongly flagged as the sample size increases, which is easily explained by the higher frequency of data in the tails of the distribution. This finding runs counter to Thienpont's suggestion [19] to make the threshold for flagging Z-scores dependent on the sample size. The explanation lies in the fact that all tests assume a Normal distribution, and increasing the threshold value with decreasing sample size would only work for normally distributed data [20]. Nevertheless, changing the threshold has an inverse effect on the NPV and PPV, and it is therefore important to consider NPV and PPV together. The analysis of NPV and PPV shows that the difference between the algorithms disappears with increasing sample size and for outliers further away from the centre. For outliers relatively close to the centre and for smaller sample sizes, however, the outlier-search based algorithms tend to perform better than the robust algorithms.
When the estimated standard deviation is used not only for Z-scores but also for following up the performance of the different peer groups, this standard deviation appears to be overestimated by every approach when outliers are present at a small distance from the centre. This can be explained by the low performance of all the algorithms with respect to outliers relatively close to the centre. We see, however, that here too the outlier-search based algorithms perform better than the robust algorithms when the outliers are more distant from the centre of the distribution. The low efficacy of robust estimators for sample sizes up to 6 has already been stipulated by Rousseeuw [21]. For the particular objective of this study, it can be added that robust estimators underperform for samples of larger size as well, and that the stability of the estimated standard deviation is quite similar across the different approaches.
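The overestimation near the centre can be reproduced in outline with a small Monte Carlo sketch. The assumptions are illustrative: the plain sample standard deviation is compared with the scaled median absolute deviation (MAD), where the MAD stands in for a robust estimator because Qn is not available in Python's standard library, and σ = 0.5 follows the study's setting.

```python
import random
import statistics

def mad_scale(data):
    """Median absolute deviation, scaled by 1.4826 to estimate sigma under normality."""
    med = statistics.median(data)
    return 1.4826 * statistics.median(abs(x - med) for x in data)

def simulate(n=6, sigma=0.5, dist=3, reps=4000, seed=7):
    """Mean of two scale estimates over samples contaminated with one outlier."""
    rng = random.Random(seed)
    sds, mads = [], []
    for _ in range(reps):
        sample = [rng.gauss(0, sigma) for _ in range(n)]
        sample[0] += dist * sigma  # one value shifted dist*sigma from the centre
        sds.append(statistics.stdev(sample))
        mads.append(mad_scale(sample))
    return statistics.fmean(sds), statistics.fmean(mads)
```

With an outlier only 3σ away, the plain standard deviation is inflated well above the true σ = 0.5, while the robust estimate is pulled up less, mirroring the pattern in Table 2.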
When considering NPV, PPV and the bias of the variability estimators together, we would recommend the outlier-search based algorithms over the robust approaches, certainly when sample sizes are small. If a robust approach is nevertheless preferred, we would recommend the Tukey approach for its simplicity, its unbiased estimator of variability and its high flagging rate when outliers are present, although its relatively low negative predictive value may make it unsuitable for punitive EQA programmes.
This leaves the question of the minimal sample size a peer group must reach before its members can be evaluated. Two antagonistic arguments are involved. Firstly, NPV, PPV and the accuracy of the estimated standard deviation increase with increasing sample size; hence, larger sample sizes are preferred. Secondly, when only large peer groups are evaluated, many laboratories will escape evaluation; from this perspective, smaller sample sizes are preferred. In our opinion, the Grubbs and Tukey approaches find the best compromise between these antagonistic arguments with a minimal sample size of 6, which is in line with previously published