
THE ANALYST

FULL PAPER
A new test for sufficient homogeneity

Tom Fearn^a and Michael Thompson^b

^a Department of Statistical Science, University College London, Gower Street, London, UK WC1E 6BT
^b School of Biological and Chemical Sciences, Birkbeck College (University of London), Gordon House, 29 Gordon Square, London, UK WC1H 0PP

www.rsc.org/analyst

Received 27th April 2001, Accepted 23rd May 2001


First published as an Advance Article on the web 30th July 2001

Certified reference materials and materials distributed in proficiency testing need to be sufficiently
homogeneous, that is, the variance in the mean composition of the distributed portions of the material must be
negligibly small in relation to the variance of the analytical result produced when the material is in normal use.
The requirement for sufficient homogeneity suggests the use of a formal test. Such tests as have been formulated
rely on the duplicated analysis of the material from a number of portions, followed by analysis of variance.
However, the outcome is not straightforward. If the analytical method used is very precise, then an undue
proportion of the materials will be found to be significantly heterogeneous. If it is too imprecise, the test may be
unable to detect heterogeneity. Moreover, the Harmonised Protocol Procedure (M. Thompson and R. Wood, Pure
Appl. Chem., 1993, 65, 2123) seems to be unduly prone to the rejection of material that is in fact satisfactory. We
present a simple new statistical approach that overcomes some of these problems.

Testing for sufficient homogeneity

With the exception of well-mixed true solutions, materials prepared for proficiency tests and other interlaboratory studies are, despite our best efforts, heterogeneous. When such a bulk material is split for distribution to various laboratories, the units produced vary slightly in composition among themselves. Usually the variation is negligible, but we want to be sure of this. When we test for so-called "sufficient homogeneity" in such materials, we are seeking to show that this variation in composition among the distributed units (characterised by the sampling standard deviation σ_sam) is negligible in relation to variation introduced by the measurements conducted by the participants in the proficiency test.

As we expect the standard deviation of interlaboratory variation in proficiency tests to be approximated by σ_p, the target standard deviation, it is natural to use this criterion as a reference value. The ISO/IUPAC/AOAC Harmonised Protocol for Proficiency Testing1 requires that the estimated sampling standard deviation s_sam should be less than 30% of the target standard deviation σ_p, that is, s_sam/σ_p < 0.3.

This condition, when fulfilled, is called "sufficient homogeneity" in the Harmonised Protocol. At this limit, the standard deviation of the resultant z-scores would be inflated by the heterogeneity by somewhat less than 5% relative, for example from 2.0 to 2.1, which is deemed to be acceptable. If the condition were not fulfilled, the z-scores would reflect, to an unacceptable degree, variation in the material as well as variation in laboratory performance. Participants in proficiency testing schemes need to be reassured that the distributed units of the test material are sufficiently similar, and this requirement usually calls for testing.

The test specified in the Harmonised Protocol calls for the selection of 10 or more units at random after the putatively homogenised material has been split and packaged into discrete samples for distribution. The material from each sample is then analysed in duplicate, under randomised repeatability conditions (that is, all in one run), using a method with sufficient analytical precision. The value of s_sam is then estimated from the mean squares after one-way analysis of variance (ANOVA), and a statistical test is carried out.

Tests for sufficient homogeneity are never likely to be wholly satisfactory. The main problem is that, because of the high cost of the analysis, the number of samples taken for testing will be small. This makes the power of the statistical test (that is, the probability of rejecting the material when it is in fact heterogeneous) relatively low. A further problem is that heterogeneity is inherently likely to be patchy, and discrepant distribution units might be under-represented among those selected for test.

However, given that sufficient homogeneity is a reasonable prior assumption, and that the cost of testing for it is often high, it seems sensible to make the main emphasis the avoidance of Type 1 errors (that is, false rejection of a satisfactory material). Homogeneity tests should be regarded as essential, but not foolproof, safeguards. We argue below that the test suggested in the Harmonised Protocol may be too prone to the rejection of good samples, and we suggest an alternative test.

Analytical precision required for homogeneity tests

To test for sufficient homogeneity, we have to estimate σ_sam from the results of a randomised replicated experiment using ANOVA. In the experiment, each selected sample is separately homogenised and analysed in duplicate. Much depends on the quality of the analytical results. If the analytical method is sufficiently precise, σ_sam can be reliably estimated, and a lack of sufficient homogeneity can be detected with reasonably high probability when it is present. If the analytical standard deviation σ_an is not small, however, important sampling variation may be obscured by analytical variation. We may obtain a non-significant result when testing for excess sampling variation, not because it is not present, but because the test has no power to detect it when the analytical variance is high.

The Harmonised Protocol does not specify any limits on the analytical variance, but it seems desirable to do so. There has to be a trade off between the cost of specifying very precise analytical methods and the risk of failing to detect important sampling variation. Based on an informal consideration of this

1414 Analyst, 2001, 126, 1414–1417 DOI: 10.1039/b103812p

This journal is © The Royal Society of Chemistry 2001
trade off (the power aspect is discussed in detail later), we suggest that a reasonable compromise is to require that the analytical (repeatability) precision of the method used in the homogeneity test should satisfy σ_an/σ_p < 0.5.

Handling outlying results

Analytical outliers affect homogeneity test data sets quite often, as at least 20 analytical results are produced in each test. For example, in a proficiency testing scheme studied by the authors, about 13% (18/139) of the homogeneity tests contained a single analytical outlier. Analytical outliers are manifested as an unexpectedly large deviation between the duplicated results on one of the samples. Regardless of the heterogeneity or otherwise of the original bulk material, if we assume that each sample is properly homogenised before the two test portions are removed from it, any outlying difference between duplicate pairs must be due to the analysis rather than the material.

The effect of a single (that is, analytical) outlying result is perhaps unexpected: although it inflates the estimate of the between-sample variance, an outlier helps the material pass the F-test because it also inflates the estimate of analytical variance. The more extreme the analytical outlier, the closer the F-value becomes to unity. Although the Harmonised Protocol calls for all results to be retained, there is a clear case for excluding analytical outliers when they can be unequivocally identified. We therefore recommend that analytical outliers be rejected before tests for sufficient homogeneity, and that the Harmonised Protocol should be revised in this respect. A single outlier could be detected by a procedure such as Dixon's or Grubbs' test on the differences between pairs or, as recommended here, by Cochran's variance test. An alternative approach would be to retain all of the data but accommodate outlying differences (but not outlying means) by robustifying the ANOVA.

Other pathologies of data sets

All of the above considerations depend on the laboratory carrying out the test for sufficient homogeneity correctly and, in particular, selecting the samples for test at random, homogenising them before analysis, analysing the duplicated test portions under strictly randomised repeatability conditions, and recording the results with sufficient digit resolution to allow the analysis of the variation. In the authors' experience, data sets where at least some of these requirements have not been met are common (25/139 instances in our study). Such infringements may invalidate the outcome of the test. We therefore recommend that: (i) detailed instructions be issued to the laboratory conducting the homogeneity test; and (ii) the data be checked for discrepancies as a matter of routine. Such a check could be made visually on a simple plot of the data, searching for such diagnostic features as: (i) trends or discontinuities; (ii) non-random distribution of differences between first and second test results; (iii) excessive rounding; and (iv) outlying results within samples.

The new procedure

Rather than express the criterion for sufficient homogeneity in terms of the estimated sampling variance s_sam², as does the Harmonised Protocol, it would seem more logical to impose a limit on the true sampling variance σ_sam². It is this quantity that is more relevant to the variability in the (untested) samples sent out to laboratories. Thus our criterion for sufficient homogeneity is that the sampling variance σ_sam² must not exceed an allowable quantity σ_all² = 0.09 × σ_p². Then, in testing for homogeneity, it makes sense to test the hypothesis σ_sam² ≤ σ_all² against the alternative σ_sam² > σ_all². The usual F-test in the one-way ANOVA tests the rather stricter hypothesis σ_sam² = 0 against the alternative σ_sam² > 0. Thus a significant F is evidence that there is sampling variation, but not necessarily that it is unacceptably large. Earlier approaches1 have rejected homogeneity if the F-test is significant and the estimated sampling variance s_sam² > σ_all². While this ensures that homogeneity is not rejected unless the estimated sampling variance exceeds the permitted level, this procedure fails to make any allowance for the variability in the estimate s_sam². When the true sampling variance is right on the borderline, so that σ_sam² = 0.09 × σ_p², the estimate s_sam² has a roughly (not exactly, as its distribution is not symmetric) 50% chance of exceeding the limit, thus causing rejection. The probability of rejection will be almost as great for true sampling variances that are close to, but below, the borderline. As has been argued above, it is reasonable to require of any homogeneity testing procedure that it should have a low probability of falsely rejecting sufficient homogeneity. Therefore we have sought to replace the Harmonised Protocol criterion with one that controls this false rejection probability in all situations.

Fortunately, it is not too difficult to derive an explicit test of the hypothesis H: σ_sam² ≤ σ_all². Williams2 shows how to derive confidence intervals for the between-group variance in a one-way ANOVA that are conservative and close to being exact. Using his approach, one can find a one-sided 95% confidence interval (L, H) for the true sampling variance σ_sam² and reject H when this interval does not include σ_all². After a little manipulation, this can be shown to be equivalent to rejecting H when s_sam² > F1σ_all² + F2s_an², where s_sam² and s_an² are the usual estimates of the sampling and analytical variances obtained from the ANOVA, and F1 and F2 are constants that may be derived from standard statistical tables as described below.

Detailed procedure

It is assumed that the data comprise m pairs of duplicate analyses. The first step is to use these to estimate the analytical and sampling variances. If a program to perform a one-way ANOVA is available, this may be used. Alternatively, a full calculation scheme is given below.

(i) Calculate the sum S_i and the difference D_i of each pair of duplicates for i = 1, …, m.

(ii) Calculate the sum of squares of the differences ΣD_i², this sum and all those below being over the range i = 1, …, m.

(iii) Cochran's test statistic is the ratio of D_max², the largest squared difference, to this sum of squared differences:

C = D_max²/ΣD_i².

Calculate the ratio and compare it with critical values from tables.

(iv) Now use the same sum of squared differences to calculate

MSW = (ΣD_i²)/2m.

(v) Calculate the variance of the sums S_i,

v_S = Σ(S_i − S̄)²/(m − 1),

where S̄ = (1/m)ΣS_i is the mean of the S_i, and use this to find

MSB = v_S/2.

(vi) Then estimate the analytical variance as

s_an² = MSW



and the sampling variance as

s_sam² = (MSB − MSW)/2,

or as s_sam² = 0 if the above estimate is negative. (If a program for one-way ANOVA is available, the quantities MSB and MSW above may be extracted from the ANOVA table as the "between" and "within" mean squares, respectively.)

(vii) Calculate the allowable sampling variance as

σ_all² = (0.3 × σ_p)²,

where σ_p is the target standard deviation.

(viii) Taking the values of F1 and F2 from Table 1, calculate the critical value for the test as

c = F1σ_all² + F2s_an².

If s_sam² > c, there is evidence (significant at the 5% level) that the sampling standard deviation in the population of samples exceeds the allowable fraction of the target standard deviation, and the test for homogeneity has been failed. If s_sam² < c, there is no such evidence, and the test for homogeneity has been passed.

Example

The data shown in Table 2 are taken from the Harmonised Protocol.1

Visual appraisal. The data are presented visually in Fig. 1, which shows no suspect results (such as discordant duplicated results or outlying samples) and no features such as trends or discontinuities.

Cochran's test. The largest value of D² is 0.36 and the sum of the D² is 1.47, so the Cochran test statistic is 0.36/1.47 = 0.24. This is less than the 5% critical value of 0.54, so there is no evidence for analytical outliers and we proceed with the complete data set.

Estimate of the analytical variance. s_an² = MSW = 1.47/24 = 0.061.

Estimate of the between-sample variance. The variance of the sums S = a + b is 0.463, so MSB = 0.463/2 = 0.231, and s_sam² = (0.231 − 0.061)/2 = 0.085.

Test for acceptable between-sample variance. The target standard deviation is 1.14 ppm, so the allowable between-sample variance is

σ_all² = (0.3 × 1.14)² = 0.116.

The critical value for the test is

1.79σ_all² + 0.86s_an² = 1.79 × 0.116 + 0.86 × 0.061 = 0.26.

Since s_sam² = 0.085 < 0.26, the test is passed and the material is sufficiently homogeneous.

Table 2 Duplicated results for 12 distribution units of soya flour analysed for copper (ppm), together with some intermediate stages of the calculation

Sample  Result a  Result b  D = a − b  S = a + b  D² = (a − b)²
1       10.5      10.4       0.1       20.9       0.01
2        9.6       9.5       0.1       19.1       0.01
3       10.4       9.9       0.5       20.3       0.25
4        9.5       9.9      −0.4       19.4       0.16
5       10.0       9.7       0.3       19.7       0.09
6        9.6      10.1      −0.5       19.7       0.25
7        9.8      10.4      −0.6       20.2       0.36
8        9.8      10.2      −0.4       20.0       0.16
9       10.8      10.7       0.1       21.5       0.01
10      10.2      10.0       0.2       20.2       0.04
11       9.8       9.5       0.3       19.3       0.09
12      10.2      10.0       0.2       20.2       0.04

Performance of the new test

Here we examine the power of the test by calculating the probability of rejecting the hypothesis of sufficient homogeneity when it is indeed true that σ_sam² > σ_all². This probability depends, naturally, on the amount by which σ_sam² exceeds σ_all². It is convenient to present the results as a function of the ratio q = σ_sam²/σ_p². The probability of rejection is also affected by the size of the analytical variance σ_an², because allowance has to be made for this in the test. Again, it is convenient to quantify this via a ratio, this time as r = σ_an²/σ_p². Fig. 2 shows the rejection probability as a function of q for r = 0, 0.125 and 0.25.

The two extreme values of r, r = 0 and r = 0.25, correspond to σ_an = 0 and σ_an = 0.5σ_p, the latter being the suggested maximum permitted analytical standard deviation. Whenever 0 < σ_an < 0.5σ_p, the power curve will lie somewhere between the two extreme curves in Fig. 2.

To interpret q, note that σ_p² + σ_sam² = σ_p²(1 + q), so z-scores that would have standard deviation σ_p in the absence of sampling variability would have this standard deviation increased by a factor of √(1 + q). This increase is roughly 20% at q = 0.5, 40% at q = 1 and 60% at q = 1.5.

At q = 0.09, corresponding to the allowable sampling variance, the rejection probability is exactly 0.05 when r = 0 and approximately 0.05 for r > 0. As q increases, the rejection probability rises rapidly when r = 0, i.e., when there is no analytical error, but rather less rapidly for r > 0. When r = 0.25, so that σ_an = 0.5σ_p, the probability of declaring lack of sufficient homogeneity when q = 0.5, so that the z-scores are being inflated by 20%, is only 0.55. This performance is somewhat disappointing, but it could only be improved at the cost of one of the following: increasing the number of samples tested, increasing the number of replicate analyses per sample, increasing the risk of falsely rejecting a sufficiently homogeneous sample, or imposing an even stricter limit on the permitted analytical variance. None of these seems desirable.
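As a cross-check, the worked example above (steps (i)–(viii) applied to the Table 2 data) can be reproduced in a few lines of Python. This is an illustrative sketch, not part of the original paper; the factors F1 = 1.79 and F2 = 0.86 are the m = 12 entries of Table 1.

```python
# Table 2: duplicated copper results (ppm) for 12 units of soya flour
a = [10.5, 9.6, 10.4, 9.5, 10.0, 9.6, 9.8, 9.8, 10.8, 10.2, 9.8, 10.2]
b = [10.4, 9.5, 9.9, 9.9, 9.7, 10.1, 10.4, 10.2, 10.7, 10.0, 9.5, 10.0]
m = len(a)

D2 = [(x - y) ** 2 for x, y in zip(a, b)]    # squared differences
S = [x + y for x, y in zip(a, b)]            # sums

C = max(D2) / sum(D2)                        # Cochran's statistic, approx. 0.24
MSW = sum(D2) / (2 * m)                      # s_an^2, approx. 0.061
S_bar = sum(S) / m
vS = sum((s - S_bar) ** 2 for s in S) / (m - 1)
MSB = vS / 2                                 # approx. 0.231
s2_sam = max(0.0, (MSB - MSW) / 2)           # approx. 0.085

sigma_p = 1.14                               # target standard deviation (ppm)
s2_all = (0.3 * sigma_p) ** 2                # allowable sampling variance
F1, F2 = 1.79, 0.86                          # Table 1 factors for m = 12
c = F1 * s2_all + F2 * MSW                   # critical value, approx. 0.26
print("pass" if s2_sam < c else "fail")      # prints "pass"
```

Running this reproduces the intermediate quantities quoted in the example, and the material passes the test.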

Table 1 Factors F1 and F2 for use in testing for sufficient homogeneity(a)

m    20    19    18    17    16    15    14    13    12    11    10    9     8     7

F1   1.59  1.60  1.62  1.64  1.67  1.69  1.72  1.75  1.79  1.83  1.88  1.94  2.01  2.10
F2   0.57  0.59  0.62  0.64  0.68  0.71  0.75  0.80  0.86  0.93  1.01  1.11  1.25  1.43

(a) m is the number of samples that have been measured in duplicate. The two constants are derived from standard statistical tables as

F1 = χ²(m − 1; 0.95)/(m − 1),

where χ²(m − 1; 0.95) is the value exceeded with probability 0.05 by a chi-squared random variable with m − 1 degrees of freedom, and

F2 = (F(m − 1, m; 0.95) − 1)/2,

where F(m − 1, m; 0.95) is the value exceeded with probability 0.05 by a random variable with an F-distribution with m − 1 and m degrees of freedom.
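For values of m outside the range of Table 1, the two factors can be computed directly from the quantile functions of the chi-squared and F distributions. A sketch (assuming scipy is available; not part of the original paper):

```python
from scipy.stats import chi2, f

def table1_factors(m, alpha=0.05):
    """Factors F1 and F2 of Table 1 for m duplicated samples."""
    F1 = chi2.ppf(1 - alpha, m - 1) / (m - 1)    # upper 5% point of chi-squared, m - 1 df
    F2 = (f.ppf(1 - alpha, m - 1, m) - 1) / 2    # upper 5% point of F(m - 1, m)
    return F1, F2

F1, F2 = table1_factors(12)
print(round(F1, 2), round(F2, 2))   # 1.79 0.86, matching the m = 12 column of Table 1
```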

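The rejection probabilities discussed under "Performance of the new test" can also be estimated by Monte Carlo simulation. The sketch below is illustrative, not from the paper: it assumes m = 10 samples and works on the scale σ_p = 1, so it reproduces the qualitative behaviour of Fig. 2 rather than its exact curves (the number of samples underlying Fig. 2 is not stated in the text).

```python
import random

def rejection_prob(q, r, m=10, trials=4000, seed=0):
    """Estimate P(test rejects) when sigma_sam^2 = q and sigma_an^2 = r,
    with sigma_p = 1. F1, F2 are the m = 10 values from Table 1."""
    F1, F2 = 1.88, 1.01
    s2_all = 0.09                                # allowable sampling variance
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        D2, S = [], []
        for _ in range(m):
            mu = rng.gauss(0.0, q ** 0.5)        # true mean of this sample
            x = mu + rng.gauss(0.0, r ** 0.5)    # duplicate analytical results
            y = mu + rng.gauss(0.0, r ** 0.5)
            D2.append((x - y) ** 2)
            S.append(x + y)
        MSW = sum(D2) / (2 * m)
        S_bar = sum(S) / m
        MSB = sum((s - S_bar) ** 2 for s in S) / (m - 1) / 2
        s2_sam = max(0.0, (MSB - MSW) / 2)
        if s2_sam > F1 * s2_all + F2 * MSW:      # the new test's rejection rule
            rejections += 1
    return rejections / trials
```

At q = 0.09 the estimate should come out near the nominal 0.05, and it climbs towards 1 as q grows, more slowly for larger r, as described in the text.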


When used on real data from homogeneity tests, the new procedure was found to be only slightly less likely than the Harmonised Protocol procedure to reject materials when the analytical precision was satisfactory and no other data pathologies were detected. The respective failure rates were 0/114 and 3/114. All of these materials were thought a priori to be sufficiently homogeneous. However, when the analytical data were defective in some way (and this is sometimes unavoidable), the new procedure was much less likely than the Harmonised Protocol to reject materials, the respective failure rates being 2/139 and 22/139.

Fig. 1 Example data for the homogeneity testing procedure.

Fig. 2 Probability of rejecting the hypothesis of sufficient homogeneity as a function of q = σ_sam²/σ_p². The three curves, from left to right, are for values of r = σ_an²/σ_p² of 0, 0.125 and 0.25.

Recommendations for tests for sufficient homogeneity

(i) The precision of the analytical method used in the test should satisfy σ_an/σ_p < 0.5 if at all possible.

(ii) Detailed instructions should be given to the laboratory carrying out the test with regard to the randomisation and labelling of the test materials and reporting of the data (see Appendix).

(iii) Data sets should be inspected for visual pathologies before use.

(iv) Analytical outliers should be deleted from the data set before ANOVA is carried out. (Alternatively, the procedure could be robustified against discrepant duplicate results.)

(v) The test for sufficient homogeneity specified in the Harmonised Protocol should be replaced by the modified method described above.

Appendix: Example instructions for the analyst in testing for sufficient homogeneity

(i) Select 10 (or more) of the packaged units strictly at random. This must be done in a formal way, by assigning a sequential number to the units, either explicitly (by labelling them) or implicitly (e.g., by their position in a linear sequence). The selection is made by use of random numbers from a table or generated by a computer package (e.g., Excel). It is not acceptable to select the units in any other way (e.g., by shuffling them). A new random sequence should be generated for each experiment.

(ii) Homogenise each selected sample in an appropriate manner (e.g., in a blender) and from each weigh out two test portions. Label the test portions as shown below.

Sample  Labels
1       1.1   1.2
2       2.1   2.2
3       3.1   3.2
…       …
10      10.1  10.2

(iii) Sort the 20 test portions into a random order and carry out all analytical operations on them in that order. Again, random number tables or a computer package must be used to generate a new random sequence. An example random sequence (not to be copied) is: 7.1, 3.1, 5.2, 5.1, 10.2, 1.1, 2.1, 9.2, 8.2, 1.2, 4.1, 2.2, 9.1, 10.1, 7.2, 3.2, 8.1, 6.1, 4.2, 6.2.

(iv) The analysis should be conducted if at all possible under repeatability conditions (i.e., in one run) or, if that is impossible, in successive runs with as little change as possible, using a method that has a repeatability standard deviation of less than 0.5σ_p.

(v) Return the 20 analytical results, including the labels, in the run order used.

Acknowledgement

This work was completed with financial support from the Food Standards Agency.

References

1 M. Thompson and R. Wood, Pure Appl. Chem., 1993, 65, 2123.
2 J. S. Williams, Biometrika, 1962, 49, 278.

