
Clinica Chimica Acta 413 (2012) 582–586


Comparison of different approaches to evaluate External Quality Assessment Data


Wim Coucke a,*, Bernard China a, Isabelle Delattre a, Yolande Lenga a, Marjan Van Blerk a,
Christel Van Campenhout a, Philippe Van de Walle a, Kris Vernelen a, Adelin Albert b

a Scientific Institute of Public Health, Clinical Biology, J. Wytsmanstraat 14, Brussels, Belgium
b University of Liège, Medical Informatics and Biostatistics, CHU Sart Tilman B23, Liège, Belgium


Article history:
Received 24 July 2011
Received in revised form 27 November 2011
Accepted 28 November 2011
Available online 8 December 2011
Keywords:
External Quality Assessment
Z-scores
Statistics

Abstract
In EQA programs, Z-scores are used to evaluate laboratory performance. They should indicate poorly performing laboratories, regardless of the presence of outliers. For this, two different types of approaches exist. The first type consists of the outlier-based approaches, which first exclude outlying values, calculate the average and standard deviation of the remaining data and obtain Z-scores for all values (e.g., Grubbs and Dixon). The second type includes the robust approaches (e.g., Tukey and Qn, or the algorithm recommended by ISO). The different approaches were assessed on randomly generated samples from the Normal and Student's t distributions. Part of the sample data were contaminated with outliers. The numbers of false and true outliers were recorded and, subsequently, Positive and Negative Predictive Values were derived. The sampling mean and variability were also calculated for the location and scale estimators. The various approaches performed similarly for sample sizes above 10 and when outliers were at a good distance from the centre. For smaller sample sizes and closer outliers, however, the approaches performed quite differently. Tukey's method was characterised by both a high true and a high false outlier rate, while the ISO and Qn approaches demonstrated weak performance. The Grubbs test yielded the best overall results.
© 2011 Elsevier B.V. All rights reserved.

1. Introduction
In laboratory medicine, External Quality Assessment (EQA) programs for quantitative tests have been running for more than half a century [1]. In such programs, Z-scores have become a way to assess the quality of clinical laboratories by classifying them on a common continuous scale and flagging those with unacceptable results. Moreover, these scores can be interpreted in the same way as scores derived from internal quality control procedures. Z-scores are based on a measure of centre and scale of the distribution of the results, in which the difference from the centre is expressed as a multiple of the scale: Z-score = (individual result − centre) / scale. There is common agreement to flag Z-scores with absolute values larger than 3, requesting action from the laboratory. Z-scores with absolute values larger than 2 and smaller than 3 are regarded as a warning signal, and those with absolute values smaller than 2 as within acceptable limits [2]. Z-scores obtained from several samples can be combined, and the proportion of such scores exceeding a limit, 3 or lower, may be used as a long-term evaluation tool. The Z-score method therefore conveys a different kind of information and is more flexible than outlier detection techniques, which search only for discordant values.
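As a minimal illustration of these rules (our sketch, not part of the original paper; the function names and example values are ours), the following Python snippet computes Z-scores from a given centre and scale and applies the conventional 2/3 action limits described above:

```python
import numpy as np

def z_scores(results: np.ndarray, centre: float, scale: float) -> np.ndarray:
    """Z-score = (individual result - centre) / scale."""
    return (results - centre) / scale

def classify(z: float) -> str:
    """Apply the conventional action limits described in the text [2]."""
    if abs(z) > 3:
        return "action"        # unacceptable result, action requested
    if abs(z) > 2:
        return "warning"       # between 2 and 3: warning signal
    return "acceptable"        # within acceptable limits

# Example: five laboratory results for the same EQA sample
results = np.array([9.8, 10.1, 10.4, 9.9, 12.3])
z = z_scores(results, centre=10.0, scale=0.5)
print([classify(v) for v in z])  # the last laboratory is flagged for action
```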

* Corresponding author at: Scientific Institute of Public Health, Clinical Biology, J. Wytsmanstraat 14, 1050 Brussels, Belgium. Tel.: +32 2 642 5523; fax: +32 2 642 56 45.
E-mail address: wim.coucke@wiv-isp.be (W. Coucke).
0009-8981/$ – see front matter © 2011 Elsevier B.V. All rights reserved.
doi:10.1016/j.cca.2011.11.030

The estimation of location and scale is of primary importance for obtaining reliable Z-scores. In some cases, the estimates are known or fixed beforehand, as derived from a reference method and/or a fit-for-purpose standard deviation [3]. When the mean (centre) and standard deviation (scale) are calculated from the sample values, however, they can be heavily influenced by outliers. For example, when the sample size n < 11, outlying results influence the estimates in such a way that Z-scores are never larger than 3 [4]. Estimating the centre and scale while avoiding the influence of outliers can be done by outlier-based approaches or by robust statistics approaches [3,5].

The outlier-based approaches first exclude outlying values, calculate the classical mean and standard deviation of the remaining data, and then compute Z-scores for all values, including those previously excluded. The robust statistics approaches derive estimates of location and scale that are less influenced by outliers. Healy introduced robust statistics in the field of laboratory medicine [6], while Rocke described their use in EQA for the first time in 1983 [7]. Today, robust statistics are popular in the domain of EQA, as confirmed by the ISO recommendations for calculating Z-scores [2].

In EQA, standard deviations are not solely used for detecting laboratories that reported out-of-range values, but also serve as quality indicators for analytical methods, in particular with respect to accuracy and precision. Here, the uncertainty and bias of the variability estimators themselves are of importance.
Finally, it is worth mentioning that the distribution of EQA data is often taken to be symmetric and possibly contaminated with a few outlying results, and most statistical tests in EQA programmes assume a contaminated Normal distribution. Some authors [8,9], however, claim that the distribution, even in the absence of outliers, may be leptokurtic, i.e. exhibiting heavier tails than the Normal distribution. Thus, comparing the two types of estimating approaches for the specific classes of symmetric and unimodal distributions, represented by the Normal and Student's t-distributions, may be of importance.
This study was designed to address three simple questions: (1) What is the false positive rate of Z-score estimation methods in non-contaminated samples from the Normal and Student's t distributions? (2) What is the true positive rate of Z-score estimation methods in contaminated samples from the same distributions? (3) What are the accuracy and precision of the different variability estimators for the Normal distribution?
2. Materials and methods
A total of 1000 random samples was generated from a Normal distribution with mean and standard deviation arbitrarily set at μ = 10 and σ = 0.5. Data were generated for sample sizes ranging from n = 3 to 20. Subsequently, to obtain data from a leptokurtic distribution, similar simulations were performed using a Student's t-distribution with 5 degrees of freedom. Only samples for which all values were within the interval [μ − 3σ, μ + 3σ] were retained. Next, the samples were contaminated by adding an outlier at μ + 3σ, at μ + 5σ and at μ + 7σ separately, resulting in a sample of size n + 1. Samples of size n = 3 without added outliers were not taken into account. Z-scores were calculated on each sample following five different approaches.
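The simulation set-up can be sketched as follows (our reconstruction; the random generator, the seed and the way the t-samples are shifted and scaled to match μ and σ are assumptions, as the text does not specify them):

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # seed chosen for reproducibility (our choice)
MU, SIGMA = 10.0, 0.5

def draw_sample(n: int, dist: str = "normal") -> np.ndarray:
    """Draw one sample; t-samples (5 df) are shifted/scaled to match MU and SIGMA."""
    if dist == "normal":
        return rng.normal(MU, SIGMA, size=n)
    return MU + SIGMA * rng.standard_t(df=5, size=n)

def clean_sample(n: int, dist: str = "normal") -> np.ndarray:
    """Redraw until all values fall within [MU - 3*SIGMA, MU + 3*SIGMA]."""
    while True:
        x = draw_sample(n, dist)
        if np.all(np.abs(x - MU) <= 3 * SIGMA):
            return x

def contaminate(x: np.ndarray, k: float) -> np.ndarray:
    """Append a single outlier at MU + k*SIGMA, giving a sample of size n + 1."""
    return np.append(x, MU + k * SIGMA)

# 1000 contaminated samples of size 6 + 1, outlier at MU + 5*SIGMA
samples = [contaminate(clean_sample(6), k=5) for _ in range(1000)]
```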
The first approach used the Grubbs test [10,11] to remove outliers in a first step. This test calculates the distance between the most extreme point and the centre of the distribution. If this distance is too large with respect to the standard deviation, the point is flagged as an outlier. The test uses a predefined false alarm rate, which was kept small (α = 0.05). If an outlier was found, the test was repeated on the rest of the sample, using the same predefined α level, until no outlier was present any more. In a second step, Z-scores were calculated from the classical average and standard deviation of the data that were not marked as outliers in the first step. The second approach used the Dixon test [12] to remove outliers in a first step. The method is based on the calculation of ranges between the lowest and highest sample values, and subranges between the most extreme sample values on either side. Like the Grubbs test, it uses a hypothesis test. Here too, outliers were removed until the null hypothesis of absence of outliers was accepted (α = 0.05), and Z-scores were subsequently obtained from the classical average and standard deviation of the data that were not removed.
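A sketch of this first, Grubbs-based approach (our implementation; the two-sided critical value below is the standard one derived from the Student t-distribution, which we assume matches the authors' use of the test):

```python
import numpy as np
from scipy import stats

def grubbs_critical(n: int, alpha: float = 0.05) -> float:
    """Two-sided critical value of the Grubbs statistic for sample size n."""
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    return (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))

def grubbs_z_scores(x: np.ndarray, alpha: float = 0.05) -> np.ndarray:
    """Repeatedly remove the most extreme point while the Grubbs test rejects,
    then compute Z-scores for ALL values from the remaining data."""
    keep = x.copy()
    while len(keep) > 2:
        g = np.abs(keep - keep.mean()) / keep.std(ddof=1)
        i = int(np.argmax(g))
        if g[i] > grubbs_critical(len(keep), alpha):
            keep = np.delete(keep, i)   # flagged as outlier, excluded
        else:
            break
    return (x - keep.mean()) / keep.std(ddof=1)
```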
The third approach, often called the Tukey approach [13,14], calculates a robust estimator of scale by dividing the interquartile distance by the interquartile distance of a standard Normal distribution (D = 1.34898) and uses the median as an estimator of the centre of the distribution. The fourth approach uses Qn [15,16] as a robust estimator of scale and the median as an estimator of the centre. The Qn estimator is approximately the median value of all pairwise differences between the values, rescaled by a fixed factor D to reflect the standard deviation of a Normal distribution. Finally, the robust estimators of scale and centre according to ISO 13528 were calculated. Algorithm A of ISO 13528 [2] is based on calculating the classical average and standard deviation of a Winsorized sample. Winsorizing, i.e. replacing values beyond a certain limit by the limit itself, was applied to values deviating by more than 1.5 standard deviations from the centre.
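The three robust approaches might be sketched as follows (again our reading of the text: the Qn implementation omits the small-sample correction factors discussed by Rousseeuw and Croux [15], and the Algorithm A iteration uses the usual ISO 13528 constants 1.483 and 1.134 with the 1.5 cut-off):

```python
import numpy as np
from itertools import combinations

def tukey_centre_scale(x):
    """Median as centre; IQR rescaled by the standard-Normal IQR (D = 1.34898)."""
    q1, q3 = np.percentile(x, [25, 75])
    return np.median(x), (q3 - q1) / 1.34898

def qn_scale(x, d=2.2219):
    """Qn: roughly the first quartile of all pairwise distances, rescaled by d
    for consistency at the Normal distribution (no small-sample correction)."""
    n = len(x)
    h = n // 2 + 1
    k = h * (h - 1) // 2
    dists = sorted(abs(a - b) for a, b in combinations(x, 2))
    return d * dists[k - 1]

def iso_algorithm_a(x, tol=1e-6, max_iter=100):
    """ISO 13528 Algorithm A: iterated Winsorizing at 1.5 robust SDs."""
    centre = np.median(x)
    scale = 1.483 * np.median(np.abs(x - centre))
    for _ in range(max_iter):
        delta = 1.5 * scale
        w = np.clip(x, centre - delta, centre + delta)  # Winsorize
        new_centre = w.mean()
        new_scale = 1.134 * w.std(ddof=1)
        if abs(new_centre - centre) < tol and abs(new_scale - scale) < tol:
            break
        centre, scale = new_centre, new_scale
    return centre, scale
```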
The ability of the various approaches to flag outliers when they exist and not to flag them when they do not exist can be assessed in a way similar to the evaluation of diagnostic tests. For this purpose, the Negative Predictive Value (NPV) and Positive Predictive Value (PPV) were calculated for each approach by varying a specific parameter that can be changed to over- or under-estimate the standard deviation and, as a consequence, respectively decrease or increase the number of Z-scores above 3. The NPV was calculated as the ratio between the True Negatives (i.e. the samples to which no outlier was added and that showed no Z-scores beyond 3) and the number of samples for which no Z-score beyond 3 was found (= True + False Negatives). Likewise, the PPV was calculated as the ratio between the True Positives (i.e. the samples to which an outlier was added and that showed a Z-score beyond 3) and the number of samples for which a Z-score beyond 3 was found (= True + False Positives). For the approaches based on the Grubbs and Dixon tests, the P-value below which outliers are excluded (α) was varied. For the Tukey and Qn approaches, D was varied: lower values of D result in lower standard deviations, higher Z-scores and hence a higher flagging rate. For the ISO 13528 approach, the 1.5 Winsorizing cut-off factor was varied. The NPV and PPV for each value of the varying parameter were recorded and graphically displayed. Finally, for each simulated data series of samples generated from the Normal distribution, the variability estimator obtained by every approach was recorded and its mean and standard error calculated.
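Once the flag counts over all simulated samples are available, the NPV and PPV reduce to simple ratios; a minimal sketch (variable names are ours):

```python
def predictive_values(tp: int, fp: int, tn: int, fn: int) -> tuple[float, float]:
    """NPV: fraction of 'no Z-score beyond 3' samples that truly had no outlier.
    PPV: fraction of 'Z-score beyond 3' samples that truly contained the outlier."""
    npv = tn / (tn + fn)
    ppv = tp / (tp + fp)
    return npv, ppv

# Sweeping the tuning parameter (alpha for Grubbs/Dixon, D for Tukey/Qn, the
# Winsorizing cut-off for ISO) and recording the (NPV, PPV) pairs traces the
# curves shown in Figs. 1 and 2.
```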
3. Results
3.1. False positives
A representative part of the False Positive (FP) rates obtained is depicted in the upper part of Table 1 (no outlier). Among all approaches, the Tukey method showed the most distinctive behaviour; while all other FP rates were below 15% for the Normal distribution and below 30% for the Student's t-distribution, Tukey's approach had a rate above 20% for almost all sample sizes. The Dixon and ISO approaches showed the lowest FP rates. In addition, for samples of size 6 or larger, the FP rate of each approach (except ISO) stabilised for the Normal distribution. By contrast, all FP rates increased with increasing sample size for the Student's t-distribution.
3.2. True positives
The True Positive (TP) rates when adding an outlier at a distance of μ + 3σ or μ + 5σ are shown in Table 1. For all outlier distances, the differences between the approaches were similar for the Normal and Student's t-distributions. For outliers at μ + 3σ, none of the approaches was able to flag the outliers in more than half of the cases for any sample size. Tukey's approach had the highest performance, reaching a flagging rate of nearly 50% as soon as the sample size was 6 or larger. The other approaches performed much more weakly; the ISO approach had flagging rates below 10% for very small samples. All other approaches exhibited outlier finding rates of roughly 10–30%. In addition, these results point to a clear improvement of the flagging rates for all approaches with increasing sample size and outlier distance, with a probability of detection close to 100% for outliers at μ + 7σ. The ISO and Dixon approaches, however, still performed weakly for very small sample sizes.
3.3. Negative and positive predictive value
The results of the NPV and PPV for sample size n = 6 are shown in Fig. 1. As in Receiver Operating Characteristic (ROC) analysis, the perfect approach, which would flag no Z-scores larger than 3 when they do not exist (negative prediction) and would flag them all when they do exist (positive prediction), would correspond to a curve made up of a vertical line along the Y-axis and a horizontal line intersecting the Y-axis at the value 1. The further the curve departs from this perfect curve, the worse the performance of the approach. For outliers at μ + 3σ, the curves were located far from the ideal line, so that the NPV and PPV did not reach high levels for any of the approaches: only a combination of positive and negative predictive values of about 60% was feasible, and although the Grubbs approach tended to perform better, there was not much difference between the approaches. For increasing outlier distance, however, the curves tend towards the perfect curve. The Qn approach consistently performed the worst. The outlier-searching algorithms showed a slightly better performance, mainly for outliers at moderate distance from the centre (μ + 5σ). There was almost no difference between the results for data generated from the Normal and from the Student's t-distribution.

A similar trend was seen for a sample size n = 8 (Fig. 2). All algorithms exhibited a weak performance for outliers at μ + 3σ. For outliers at μ + 5σ, however, the Grubbs approach performed better than the other approaches. This difference became less clear for more distant outliers, where all approaches showed almost perfect positive and negative predictive values. Focussing on the Grubbs approach, the search for the optimal P-value for excluding outliers (α) was made for different combinations of sample size and outlier distance. The optimal α decreased when outliers became more distant and with increasing sample size. In case of outliers at small distance from the distribution, the optimal α was 0.2 for all sample sizes. This value decreased when the outliers were further away from the distribution (0.02–0.1 for an outlier at μ + 5σ, 0.007–0.06 for an outlier at μ + 7σ).

Table 1
False and true outlier rates, expressed as percentages, for the five different approaches and a representative selection of the investigated sample sizes. False outlier rates are shown in the "no outlier" rows; true outlier rates in the remaining rows.

Sample  Outlier     Normal distribution                      Student's t-distribution
size    distance    Grubbs  Dixon  Tukey  ISO    Qn          Grubbs  Dixon  Tukey  ISO    Qn
5       no outlier  10.9    5.7    29.6   4.2    15.0        12.8    6.7    33.5   5.3    18.7
6       no outlier  8.1     4.0    22.0   4.6    8.1         15.1    9.1    29.1   9.7    14.8
7       no outlier  10.2    4.0    20.7   5.0    10.0        15.0    9.0    28.0   9.2    15.9
8       no outlier  9.3     4.2    20.8   4.7    8.5         18.2    10.0   32.2   11.8   14.8
20      no outlier  9.8     4.5    21.2   8.4    9.1         31.6    24.3   46.6   32.2   31.5
5       μ + 3σ      21.2    11.9   52.1   8.0    25.4        19.9    10.2   44.7   7.9    23.1
6       μ + 3σ      25.7    13.9   44.7   13.1   20.3        22.7    13.6   42.5   12.1   17.6
7       μ + 3σ      30.0    15.5   47.0   15.3   24.7        22.6    12.9   41.4   13.8   23.3
8       μ + 3σ      30.9    13.7   46.5   16.5   20.9        22.4    12.0   40.6   13.9   18.4
20      μ + 3σ      35.9    17.3   50.7   28.6   28.5        20.5    11.0   43.2   19.7   19.9
5       μ + 5σ      60.0    37.1   85.8   19.5   52.3        51.5    33.4   78.9   17.4   46.5
6       μ + 5σ      73.4    52.9   86.6   43.8   56.1        67.7    47.6   83.4   39.0   49.7
7       μ + 5σ      83.2    62.2   91.3   59.2   64.2        73.2    53.2   83.8   50.9   57.7
8       μ + 5σ      89.7    60.0   90.1   70.3   69.2        79.4    48.8   85.8   59.2   60.9
20      μ + 5σ      100     95.2   99.0   99.3   98.5        99.2    81.1   96.7   93.8   92.9
5       μ + 7σ      87.7    66.6   97.7   35.2   76.7        80.1    58.7   95.3   29.7   70.0
6       μ + 7σ      96.7    85.1   99.1   76.9   83.5        93.8    78.5   98.2   71.6   82.0
7       μ + 7σ      99.5    93.1   99.1   91.9   91.2        97.4    86.3   98.3   84.1   83.2
20      μ + 7σ      100     100    100    100    100         100     100    100    100    100

Fig. 1. Negative and positive predictive values for the five different approaches, based on samples of size n = 6, for Normal and Student's t-distributions.

Fig. 2. Negative and positive predictive values for the five different approaches, based on samples of size n = 8, for Normal and Student's t-distributions.

3.4. Variability and bias of standard deviation

Results concerning the variability and bias of the estimated standard deviations are depicted in Table 2. In the absence of outliers, the Tukey and the outlier search-based approaches showed a larger deviation of the estimated standard deviation from the actual population value of 0.5, consistently underestimating it. The reverse occurred when outliers were present: the Tukey, Dixon and Grubbs approaches had the best accuracy, with the latter performing better as outliers became more distant. The Qn and ISO approaches tended to overestimate the standard deviation consistently in the presence of outliers and for all sample sizes. Precision was similar for all approaches and increased with increasing sample size.
4. Discussion
The findings of the present study illustrate that, as far as symmetric unimodal distributions are concerned, the behaviour of the different approaches for estimating Z-scores does not really depend on the kurtosis (peakedness) of the distribution: similar performances were found for data generated from the Normal and from the Student's t-distribution. Although the Normal and t-distributions cover a wide range of the distributions that describe data reported in EQA surveys, distributions may be multimodal or exhibit skewness in some cases, and the Z-scores may then become unreliable. Unimodality is a prerequisite for obtaining reliable Z-scores and, in the light of the presence of matrix effects [17], the performance of a laboratory should be assessed with respect to its peers by so-called peer group comparisons; this can only be assured by grouping data according to equal or similar methodology. As a result, peer groups may be small. For example, half of the peer groups in Belgian EQA programmes for chemistry and immunoassays contain 10 laboratories or fewer.

Apart from avoiding multimodality through the EQA set-up, post hoc controls for unimodality and symmetry may be applied as well. Formal tests have been described to test whether the data are unimodal [18]. They are based on kernel density estimation, and standard errors and significance of multimodality can be obtained by bootstrapping. In addition, asymmetry of the data distribution can be assessed by measuring skewness after removing spurious results.

Regarding the PPV, we observed that, for small sample sizes (n < 10) and for outliers close to the centre, the Tukey and, to a lesser extent, the Grubbs approaches performed better than the other ones. Note that the ISO approach has, in comparison with the other approaches, low outlier finding capacities for sample sizes below 10. There is, however, not much difference between the various approaches when the sample size increases and/or when outliers are located further away from the centre, so that the question of which approach to select for Z-scores only arises for small sample sizes.

Table 2
Standard error and mean of the estimated standard deviation obtained by the different approaches, for a representative selection of the investigated sample sizes. Better estimates have lower values in the left part of the table and, in the right part, should tend as closely as possible to the original value of σ, which was 0.5 in our setting. Results are obtained from a Normal distribution.

Sample  Outlier        Standard error of estimated standard deviation    Mean of estimated standard deviation
size    distance (σ)   Grubbs  Dixon  Tukey  ISO    Qn                   Grubbs  Dixon  Tukey  ISO    Qn
5       0              0.424   0.423  0.454  0.475  0.579                0.478   0.490  0.430  0.553  0.571
6       0              0.394   0.394  0.412  0.442  0.486                0.483   0.492  0.438  0.541  0.544
7       0              0.393   0.390  0.415  0.437  0.508                0.484   0.496  0.455  0.538  0.553
8       0              0.380   0.377  0.397  0.418  0.445                0.484   0.492  0.451  0.530  0.531
20      0              0.290   0.287  0.346  0.315  0.319                0.488   0.493  0.484  0.511  0.510
5       3              0.561   0.529  0.559  0.585  0.798                0.732   0.763  0.549  0.857  0.846
6       3              0.526   0.494  0.513  0.562  0.639                0.686   0.721  0.559  0.771  0.771
7       3              0.504   0.471  0.477  0.528  0.627                0.658   0.696  0.546  0.726  0.737
8       3              0.477   0.441  0.482  0.501  0.542                0.634   0.673  0.548  0.682  0.700
20      3              0.339   0.316  0.350  0.331  0.339                0.547   0.566  0.508  0.553  0.566
5       5              0.845   0.847  0.559  0.859  0.958                0.822   0.982  0.549  1.168  0.921
6       5              0.741   0.783  0.513  0.724  0.761                0.685   0.825  0.559  0.871  0.837
7       5              0.645   0.720  0.477  0.599  0.678                0.605   0.740  0.546  0.760  0.760
8       5              0.565   0.681  0.482  0.538  0.596                0.547   0.726  0.548  0.697  0.727
20      5              0.281   0.346  0.350  0.331  0.344                0.486   0.505  0.508  0.553  0.568
5       7              0.940   1.118  0.559  1.128  0.969                0.703   1.001  0.549  1.367  0.923
6       7              0.663   0.896  0.513  0.756  0.770                0.535   0.718  0.559  0.881  0.839
7       7              0.461   0.717  0.477  0.600  0.678                0.490   0.594  0.546  0.760  0.760
8       7              0.393   0.726  0.482  0.538  0.596                0.481   0.618  0.548  0.697  0.727
20      7              0.281   0.278  0.350  0.331  0.344                0.486   0.489  0.508  0.553  0.568


For the NPV, Tukey's approach demonstrated the worst performance: in line with its underestimation of variability in the absence of outliers, this approach has a much higher flagging rate than the other approaches, regardless of the contamination of the sample. Furthermore, for leptokurtic data, more values will be wrongly flagged as the sample size increases. The latter is easily explained by the higher frequency of data in the tails of the distribution. This finding runs counter to Thienpont's suggestion [19] to make the threshold for flagging Z-scores dependent on the sample size. The explanation lies in the fact that all tests assume a Normal distribution, and increasing the threshold value with decreasing sample size would only work for normally distributed data [20]. Nevertheless, changing the threshold has an inverse effect on the NPV and PPV, and it is therefore important to consider the NPV and PPV together. The analysis of the NPV and PPV shows that the difference between the algorithms disappears with increasing sample size and for outliers further away from the centre. For outliers relatively close to the centre and for smaller sample sizes, however, the outlier-search based algorithms tend to perform better than the robust algorithms.

When the estimated standard deviation is not only used for Z-scores but also for a follow-up of the performance of the different peer groups, this standard deviation appears to be overestimated by every approach when outliers are present at a small distance from the centre. The latter can be explained by the low performance of all the algorithms with respect to outliers relatively close to the centre. We see, however, that also here the outlier-search based algorithms perform better than the robust algorithms when the outliers are more distant from the centre of the distribution. The low efficacy of robust estimators for sample sizes up to 6 has already been pointed out by Rousseeuw [21]. For the particular objective of this study, it can be added that robust estimators also underperform for samples of larger size, and that the stability of the estimated standard deviation is quite similar across the different approaches.

When considering the NPV, PPV and bias of the variability estimators together, we would recommend the outlier-search based algorithms over the robust approaches, certainly when sample sizes are small. If, however, a robust approach is preferred, we would recommend the Tukey approach for its simplicity, its unbiased estimator of variability and its high flagging rate when outliers are present, although its relatively low negative predictive value may make it useless for punitive EQA programmes.

An answer can also be provided to the question of the minimal sample size of a peer group before its members can be evaluated. Two antagonistic arguments are involved. Firstly, the NPV, PPV and accuracy of the estimated standard deviation increase with increasing sample size; hence, larger sample sizes are preferred. Secondly, when only large peer groups are evaluated, many laboratories will escape evaluation; hence, from this perspective, smaller sample sizes are preferred. In our opinion, the Grubbs and Tukey approaches find the best compromise between these antagonistic arguments, with a minimal sample size of 6, which is in line with previously published results [5]. The Grubbs approach should be applied with a high α (0.2) when sample sizes are small and a lower α (0.02–0.01) when sample sizes are larger (n ≥ 10). If the EQA organiser favours the ISO approach, we would definitely not recommend it for sample sizes below 10.

In conclusion, this study focussed on small sample sizes with one outlier added. When sample sizes increase and the probability of encountering multiple outliers becomes high, the outlier searching algorithms applied here may suffer from masking effects when the data contain more outliers, i.e. the presence of an outlier may escape notice if a larger outlier is present. In this case, masking-free modifications of the Grubbs and Dixon tests may be applied [22,23].
References
[1] Plebani M. External quality assessment programs: past, present and future. Jugoslav Med Biohem 2005;24:201–6.
[2] International Organization for Standardization. ISO 13528:2005. Statistical methods for use in proficiency testing by interlaboratory comparisons; 2005.
[3] Thompson M, Ellison SLR, Wood R. The International Harmonised Protocol for the proficiency testing of analytical chemistry laboratories. Pure Appl Chem 2006;78:145–96.
[4] Shiffler RE. Maximum Z scores and outliers. Am Stat 1988;42:79–80.
[5] Hund E, Massart DL, Smeyers-Verbeke J. Inter-laboratory studies in analytical chemistry. Anal Chim Acta 2000;423:145–65.
[6] Healy M. Outliers in clinical chemistry quality-control schemes. Clin Chem 1979;25:675–7.
[7] Rocke D. Robust statistical analysis of interlaboratory studies. Biometrika 1983;70:421–31.
[8] Heydorn K. The distribution of interlaboratory comparison data. Accredit Qual Assur 2008;13:723–4.
[9] Duewer DL. The distribution of interlaboratory comparison data: response to the contribution by K. Heydorn. Accredit Qual Assur 2008;13:725–6.
[10] Grubbs FE. Procedures for detecting outlying observations in samples. Technometrics 1969;11:1–21.
[11] Rosario P, Martínez JL, Silván JM. Comparison of different statistical methods for evaluation of proficiency test data. Accredit Qual Assur 2008;13:493–9.
[12] Dixon WJ. Analysis of extreme values. Ann Math Stat 1950:488–506.
[13] Tukey JW. Exploratory data analysis. Reading, MA: Addison-Wesley; 1977.
[14] Sciacovelli L, Secchiero S, Zardo L, Plebani M. External Quality Assessment Schemes: need for recognised requirements. Clin Chim Acta 2001;309:183–99.
[15] Rousseeuw PJ, Croux C. Alternatives to the median absolute deviation. J Am Stat Assoc 1993;88(424):1273–83.
[16] Wilrich P-T. Robust estimates of the theoretical standard deviation to be used in interlaboratory precision experiments. Accredit Qual Assur 2007;12:231–40.
[17] Miller WG. Specimen materials, target values and commutability for external quality assessment (proficiency testing) schemes. Clin Chim Acta 2003;327:25–37.
[18] Lowthian PJ, Thompson M. Bump-hunting for the proficiency tester: searching for multimodality. Analyst 2002;127:1359–64.
[19] Thienpont LMR, Steyaert HLC, De Leenheer AP. A modified statistical approach for the detection of outlying values in external quality control: comparison with other techniques. Clin Chim Acta 1987;168:337–46.
[20] Zhou Q, Xu J, Xie W, Li S, Li X. Use of robust ZB and ZW to evaluate proficiency testing data. Clin Chim Acta 2011;412:936–9.
[21] Rousseeuw PJ, Verboven S. Robust estimation in very small samples. Comput Stat Data Anal 2002;40:741–58.
[22] Rosner B. On the detection of many outliers. Technometrics 1975;17:221–7.
[23] Jain RB. A recursive version of Grubbs' test for detecting multiple outliers in environmental and chemical data. Clin Biochem 2010;43:1030–3.
