
2

SOME BASIC STATISTICAL CONCEPTS

POPULATIONS

The true population of a particular ecosystem can be determined only by carrying out a
census of all living organisms within that ecosystem. This applies equally whether one is
concerned with numbers of people in a town, state or country or with numbers of microbes
in a batch of a food commodity or product. Whilst, in the former case, it is possible at least
theoretically to determine the human population in a non-destructive manner, the same
does not apply to estimates of microbial populations.
When a survey is carried out on people living, for instance, in a single town or village,
it would not be unexpected that the number of residents differs between different houses;
nor that there are differences in ethnicity, age, sex, health and well-being, personal likes and
dislikes, etc. Similarly, there will be both quantitative and qualitative differences in
population statistics between different towns and villages, different parts of a country and
different countries.
A similar situation pertains when one looks at the microbial populations of a food. The
microbial association of foodstuffs differs according to diverse intrinsic and extrinsic
factors, especially the acidity and water activity, and the extent of any processing effects.
Thus the primary microbial population of acid foods will generally consist of yeasts and
moulds, whereas the primary population of raw meat and other protein-rich foodstuffs will
consist largely of Gram-negative non-fermentative bacteria, with smaller populations of other
organisms (Mossel, 1982). In enumerating microbes, it is essential first to define the
population to be counted. For instance, does one need to assess the total population, that is
living and dead organisms, or only the viable population? If the latter, is one concerned only
with specific groups of organisms, for example aerobes, anaerobes, psychrotrophs and
psychrophiles, mesophiles or thermophiles? Even when such questions have been answered, it
would still be impossible to determine the true ecological population of a particular lot of
food, since to do so would require testing of all the food. Such a task would be both
technically and economically impossible.

Statistical Aspects of the Microbiological Examination of Foods


Copyright 2008 by Academic Press. All rights of reproduction in any form reserved.

LOTS AND SAMPLES

An individual lot or batch consists of a bulk quantity of food that has been processed
under essentially identical conditions on a single occasion. The food may be stored and
distributed in bulk or as pre-packaged units each containing one or more individual units
of product (e.g. a single meat pie or a pack of frozen peas). Assuming that the processing
has been carried out under uniform conditions, then, theoretically, the microbial population
of each unit should be typical of the population of the whole lot. In practice, this will not
always be the case. For instance, high levels of microbial contamination may be associated
only with specific parts of a lot due to some processing defect. In addition, estimates of
microbial populations will be affected by the choice of test regime that is used.
It is not feasible to determine the levels and types of aerobic and anaerobic organisms, or of
acidophilic and non-acidophilic organisms, or other distinct classes of microorganism using a
single test. Thus, when a microbiological examination is carried out, the types of
microorganisms that are detected will be defined in part by the test protocol. All such constraints therefore
provide a biased estimate of the microbial population of the lot. Hence, sampling of either
bulk or pre-packaged units of product merely provides a sample of the types and numbers of
microorganisms that make up the population of the lot and those population samples will
themselves be further sampled by our choice of examination protocol. In order to ensure that
a series of samples drawn from a lot properly reflect the diversity of types and numbers of
organisms associated with the product it is essential that the primary samples should be drawn
in a random manner, either from a bulk or as individual packaged units of the foodstuff.
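The random selection of packaged units can be sketched in a few lines of Python. This is a minimal illustration only: the lot size (500 units), the number of primary samples (5), the seed and the function name `draw_random_units` are assumptions made for the example and do not come from the text.

```python
import random


def draw_random_units(lot_size, n_samples, seed=None):
    """Return the 1-based positions of n_samples units chosen at random,
    without replacement, from a lot of lot_size packaged units."""
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    return sorted(rng.sample(range(1, lot_size + 1), n_samples))


# Hypothetical lot of 500 packs from which 5 primary samples are required.
selected = draw_random_units(lot_size=500, n_samples=5, seed=2008)
print(selected)
```

Recording the seed alongside the sampling plan would allow the same selection to be regenerated later.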
Analytical chemists frequently draw large primary samples that are blended and resampled
before taking one or more analytical samples; the purpose is to minimize the between-sample
variation in order to determine an average analytical estimate for a particular analyte. It is
not uncommon for several kilograms of material to be taken as a number of discrete samples
that are then combined. Indeed, for some purposes, such multiple sampling procedures are
commonplace. The sampling of foods for microbiological examination cannot generally be done
in this way because of the risks of cross-contamination during the mixing of primary samples.
A population sample (i.e. a unit of product) may itself be subdivided for analytical purposes
and it is necessary, therefore, to consider the implications of determining microbial
populations in terms of the number, size and nature of the samples taken. In a few instances
it is possible for the analytical sample to be truly representative of the lot sampled. Liquids,
such as milk, can be sufficiently well mixed that the number of organisms in the analytical
sample is representative of the milk in a bulk storage tank. However, because of problems of
mixing, samples withdrawn from a grain silo, or even from individual sacks of grain, may not
necessarily be truly representative. In such circumstances, deliberate stratification (qv) may
be the only practical way of taking samples. Similar situations obtain when one considers
complex raw material (e.g. animal carcases), or composite food products (e.g. ready-to-cook
frozen meals containing slices of cooked meat, Yorkshire pudding, peas, potato and gravy). It
is necessary to consider also the actual sampling protocol to be used: for instance, in sampling
from a meat or poultry carcase, is the sample to be taken by swabbing, rinsing or excision of
skin? Where on the carcase should the sample be taken? For instance, one area may be more
likely to carry high numbers and types of organism than other areas. Hence, standardisation
of sampling protocols is essential. In situations where a composite food consists of discrete
components, a sampling protocol needs to be used that reflects the purpose of the test: is a
composite analytical sample required (i.e. one made up from the various ingredients in
appropriate proportions), or should each ingredient be tested separately? These matters are
considered in more detail in Chapter 5.

AVERAGE SAMPLE POPULATIONS

If a single sample is analysed, the result provides a method-dependent single point estimate
of the population numbers in that sample. Replicate tests on a single sample provide an
improved estimate of population numbers, based on the average of the results, together
with a measure of variability of the estimate for that sample. Similarly, if replicate samples
are tested, the average result provides a better estimate of the number of organisms in the
population based on the inter-sample average and an estimate of the variability between
samples. Thus, we can have greater confidence that the average sample population will
reflect more closely the population in the lot. The standard error of the mean (SEM) provides
an estimate of the extent to which that mean (average) value is reliable. If a sufficient
number of replicate samples is tested then we can derive a frequency distribution for the
counts, such as that shown in Fig. 2.1 (data from Blood, 1974). Note that the distribution
curve has a long left-hand tail and that the curve is not symmetrical, probably because the
data were compiled from results obtained in two different production plants. The statistical
aspects of frequency distributions are discussed in Chapter 3.

FIGURE 2.1 Frequency distribution of colony count data determined at 30°C on beef sausages manufactured in
two factories (modified from Blood, 1974) (reproduced by permission of Leatherhead Food International).
Axes: % frequency (vertical) against colony count, log cfu/g (horizontal).

Adding the individual values and dividing by the number of replicate tests provides a simple
arithmetic mean of the values:

$$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$

where $x_i$ is the value of the ith test and n is the number of tests done. However, it is
possible to derive other forms of average value. For instance, multiplying the individual
counts on n samples and then taking the nth root of the product provides the geometric mean
value ($\bar{x}_g$):

$$\bar{x}_g = \sqrt[n]{x_1 \times x_2 \times x_3 \times \dots \times x_n}$$

It is simpler to determine the approximate geometric mean by taking logarithms of the original
values ($y = \log_{10} x$), adding the log-transformed values and dividing the sum by n to
obtain the mean log value ($\bar{y}$), which equals the logarithm of the geometric mean,
$\log_{10} \bar{x}_g$. This value is then back-transformed by taking the antilog to obtain an
estimate of the geometric mean value:

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{\sum_{i=1}^{n} \log_{10} x_i}{n} = \log_{10} \bar{x}_g$$

The geometric mean is appropriate for data that conform to a log-normal distribution and
for titres obtained from n-fold dilution series. It is important to understand the difference
between the geometric and the arithmetic mean values since both are used in handling
microbiological data. In terms of microbial colony counts, the log mean count is the log10
of the simple arithmetic mean; by contrast, the mean log-count is the arithmetic average of
the log10-transformed counts that, on back-transformation gives the geometric mean count.
The methods are illustrated in Example 2.1.
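
The distinction between the two means, and between the log mean count and the mean log-count, can be checked numerically. The sketch below uses the five colony counts of Example 2.1; the variable names are illustrative only.

```python
import math

counts = [1540, 1360, 1620, 1970, 1420]  # cfu/g, from Example 2.1
n = len(counts)

arithmetic_mean = sum(counts) / n

# Geometric mean, route 1: nth root of the product of the counts.
geometric_mean_root = math.prod(counts) ** (1 / n)

# Geometric mean, route 2: antilog of the mean log-count.
mean_log_count = sum(math.log10(x) for x in counts) / n
geometric_mean_logs = 10 ** mean_log_count

# The log mean count is the log10 of the arithmetic mean; it is a
# different quantity from the mean log-count.
log_mean_count = math.log10(arithmetic_mean)

print(f"arithmetic mean          : {arithmetic_mean:.1f}")
print(f"geometric mean (root)    : {geometric_mean_root:.1f}")
print(f"geometric mean (antilog) : {geometric_mean_logs:.1f}")
print(f"log mean count           : {log_mean_count:.4f}")
print(f"mean log-count           : {mean_log_count:.4f}")
```

The two routes to the geometric mean agree to within floating-point error, and the log mean count is slightly larger than the mean log-count, as it must be whenever the counts are not all equal.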

STATISTICS AND PARAMETERS

A population is described by its parameters: the mean ($\mu$) and the variance ($\sigma^2$). But we
cannot know the values of these parameters except for a finite population (e.g. a set of
pipettes). However, we can obtain estimates of these parameters from the statistics that
describe the sample population in terms of its analytical mean value ($\bar{x}$) and its variance
($s^2$). We can also provide a measure of the likelihood that the same mean result would be
attained if analyses were repeated on a further set of samples from the same lot. Such
estimated values are statistics that can be used as estimates of the true population parameters.
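
The difference between parameters and statistics can be illustrated with a small finite population. In the sketch below the ten pipette volumes are invented for the purpose (they are not data from the text), as are the sample size of four and the seed; the parameters are computed from every member of the population, the statistics from a random sample of it.

```python
import random
import statistics

# Hypothetical finite population: delivered volumes (ml) of a set of ten 1 ml pipettes.
population = [0.98, 1.01, 1.00, 0.99, 1.02, 1.00, 0.97, 1.03, 1.01, 0.99]

# Parameters: known exactly because every member of the population is measured.
mu = statistics.fmean(population)
sigma2 = statistics.pvariance(population)  # divisor n

# Statistics: estimates of the parameters from a random sample of four pipettes.
rng = random.Random(1)
sample = rng.sample(population, 4)
x_bar = statistics.fmean(sample)
s2 = statistics.variance(sample)  # divisor n - 1

print(f"population: mu = {mu:.4f}, sigma^2 = {sigma2:.6f}")
print(f"sample    : x_bar = {x_bar:.4f}, s^2 = {s2:.6f}")
```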

VARIANCE AND ERROR

Results from replicate analyses of a single sample, and analyses of replicate samples, will
always show some variation that reflects the distribution of microbes in the samples tested,
inadequacies of the sampling technique and technical inaccuracies of the method and the
analyst. The variation can be expressed in several ways.
The statistical range is the simplest way to describe the dispersion of values: it is the
difference between the lowest and the highest estimates. In Example 2.1, for instance, the
colony count range is 610 (i.e. 1970 - 1360). The statistical range is often used in Statistical
Process Control (Chapter 12) but, since it depends solely on the values for the extreme
counts, its usefulness is severely limited: it takes no account of the distribution of values
between the two extremes.
The population variance is derived from the mean of the squares of the deviations, viz.
$\sigma^2 = \sum (x - \mu)^2 / n$, where x is an individual result, $\mu$ the population mean value, n the
number in the population and $\sum$ indicates 'sum of'. Each individual result (x) differs from the
population mean $\mu$ by a value $(x - \mu)$, which is referred to statistically as the deviation. But as
the value of $\mu$ is unknown, the sample mean ($\bar{x}$) is used as an estimate of the population
mean. The sample variance ($s^2$) provides an estimate of the population variance ($\sigma^2$) and is
determined as a weighted mean of the squares of the deviations, weighting being introduced
through the application of the concept of degrees of freedom, which assumes that of n
observations, only (n - 1) are available since one observation has been used already in determining
the mean value. The unbiased estimate ($s^2$) of the population variance ($\sigma^2$) is thus derived from:

$$s^2 = \frac{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}{n(n - 1)}$$

The alternative form of this equation, $s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1)$, should not
normally be used in practical calculation of the sample variance since it is based on the squares
of the deviations from the mean value. The mean, and hence each deviation, is usually only a
rounded approximation to a non-terminating decimal value; and since the deviations from the mean
are squared and then summed, any discrepancies are additive and the derived variance may be inaccurate.
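
The equivalence of the two forms of the variance equation can be checked with exact arithmetic. The sketch below uses Python's `fractions` module so that no rounding takes place at all; the function names are illustrative and the data are the colony counts of Example 2.1.

```python
from fractions import Fraction

counts = [1540, 1360, 1620, 1970, 1420]  # cfu/g, from Example 2.1


def variance_computational(values):
    """s^2 = [n * sum(x^2) - (sum x)^2] / [n (n - 1)]"""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return Fraction(n * sum_x2 - sum_x ** 2, n * (n - 1))


def variance_deviations(values):
    """s^2 = sum((x - mean)^2) / (n - 1), with the mean held as an exact fraction."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((x - mean) ** 2 for x in values) / (n - 1)


print(variance_computational(counts))
print(variance_deviations(counts))
```

Because the mean is carried exactly, any disagreement between the two forms in practice can only come from rounding of intermediate values. In floating-point software the deviation form is generally regarded as the better conditioned of the two, so the caution above is best read as applying to calculation with a rounded mean.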
The standard deviation (s) about the sample mean is the square root of the variance
($s = \sqrt{s^2}$). The coefficient of variation (CV), often referred to as the relative
standard deviation (RSD), is the standard deviation expressed as a percentage of the mean:
$\%CV = \%RSD = (s / \bar{x}) \times 100$.
The term standard error is often used conventionally to mean the standard deviation
(described above) and is a statistical measure of the deviation that estimates would
be expected to show in testing repeat samples from the same population. In other words, it
shows how much variation might be expected to occur merely by chance in the characteristics
of samples drawn equally randomly from a single population. However, the SEM is a
measure of the deviation in the mean value which would be expected if repeated analyses
were undertaken on the same lot of product. The SEM is estimated from the square root
of the variance divided by the number of observations used, that is, $SEM = \sqrt{s^2 / n} = s / \sqrt{n}$.
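
As a minimal sketch of these three measures, the standard deviation, coefficient of variation and SEM can be computed for the colony counts of Example 2.1 using Python's standard `statistics` module. The variable names are illustrative.

```python
import math
import statistics

counts = [1540, 1360, 1620, 1970, 1420]  # cfu/g, from Example 2.1
n = len(counts)

mean = statistics.mean(counts)
s = statistics.stdev(counts)   # square root of the sample variance (divisor n - 1)
cv_percent = 100 * s / mean    # %CV, also called %RSD
sem = s / math.sqrt(n)         # standard error of the mean

print(f"s    = {s:.1f}")
print(f"%CV  = {cv_percent:.1f}")
print(f"SEM  = {sem:.1f}")
```

The SEM is smaller than the standard deviation by a factor of the square root of n, which is why increasing the number of replicates narrows the uncertainty of the mean without changing the dispersion of the individual results.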

THE CENTRAL LIMIT THEOREM

We should pause at this point to consider an important statistical theorem, which underlies
many statistical procedures. The central limit theorem is a statement about the sampling
distribution of the mean values from a defined population. It describes the characteristics of
the distribution of mean values that would be obtained from tests on an infinite number of
independent random samples drawn from that population. The theorem states that, for a
distribution with a population mean $\mu$ and a variance $\sigma^2$, the distribution of the average tends to
be Normal, even when the distribution from which the average is computed is non-Normal.
The limiting Normal distribution has the same mean as the parent distribution and its
variance is equal to the variance of the parent divided by the sample size ($\sigma^2 / N$).
Individual results from a finite number of independent, randomly drawn samples from the
same population are distributed around the average (mean) value so that the sum of the
deviations of the values greater than the average equals the sum of the deviations of the
values lower than the average. If sufficient independent random samples are tested then we
can derive a statistical distribution that describes the occurrence of the population
(Chapter 3). Now, no matter what form the actual distribution takes, the distribution of the
average (mean) result in repeated tests always approaches a Normal distribution when sufficient
trials are undertaken. In this situation, the number of trials relates not to the number of
samples per se but to the number of replicate trials.
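
The theorem can be illustrated by simulation. The sketch below is an illustration under stated assumptions rather than anything taken from the text: the parent distribution (exponential with mean 1 and variance 1, which is strongly skewed), the sample size of 30, the 5000 replicate trials and the seed are all arbitrary choices.

```python
import random
import statistics

rng = random.Random(42)
sample_size = 30   # N, the number of results averaged in each trial
n_trials = 5000    # number of replicate trials


def one_sample_mean():
    """Mean of sample_size draws from an exponential parent (mean 1, variance 1)."""
    return statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))


means = [one_sample_mean() for _ in range(n_trials)]

mean_of_means = statistics.fmean(means)
variance_of_means = statistics.variance(means)

print(f"mean of the sample means     : {mean_of_means:.3f} (parent mean 1)")
print(f"variance of the sample means : {variance_of_means:.4f} (sigma^2/N = {1 / sample_size:.4f})")
```

The mean of the simulated averages should lie close to the parent mean and their variance close to the parent variance divided by the sample size, even though the parent distribution itself is far from Normal.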

EXAMPLE 2.1 DERIVATION OF SOME BASIC STATISTICS THAT DESCRIBE A DATA SET

Assume that we wish to determine the statistics that describe a series of replicate colony
counts on n samples, represented by $x_1, x_2, x_3, \dots, x_n$, for which the actual values are
1540, 1360, 1620, 1970 and 1420 colony forming units (cfu)/g.
The range of colony counts provides a measure of the extent of overall deviation
between the largest and the smallest data values and is determined by subtracting the
lowest count from the highest count; for the example data the range is 1970 - 1360 = 610.
The median colony count is the middle value of the ranked data (in an odd-numbered set of
values) or the average of the two middle values in an even-numbered set of values; for this
sequence of counts (1360, 1420, 1540, 1620, 1970) the median value is 1540.
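
The range and median can be confirmed directly, as in the sketch below. The four-value subset used to show the even-numbered case is a hypothetical illustration and is not part of the example data set.

```python
import statistics

counts = [1540, 1360, 1620, 1970, 1420]  # cfu/g

count_range = max(counts) - min(counts)

# Odd-numbered set: the middle value of the five ranked counts.
median_odd = statistics.median(counts)

# Even-numbered set (hypothetical: the first four counts only):
# the average of the two middle ranked values.
median_even = statistics.median(counts[:4])

print(count_range, median_odd, median_even)
```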

The arithmetic average (mean) colony count is the sum of the individual values divided
by the number of values, that is

$$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n},$$

where $\bar{x}$ = mean value and $\sum$ means 'sum of'; for our data the mean count
= $\sum x / n$ = (1540 + 1360 + 1620 + 1970 + 1420)/5 = 1582.
The geometric mean colony count is the nth root of the product obtained by multiplying
together each value of x. Hence, the geometric mean count
$\bar{x}_g = \sqrt[n]{x_1 \times x_2 \times x_3 \times \dots \times x_n}$.
Alternatively, we can transform the x values by deriving their logarithms so that
$y = \log_{10} x$; then the geometric mean is the antilog of the sum of y divided by n:

$$\bar{x}_g = \text{antilog}\left( \frac{\sum_{i=1}^{n} \log_{10} x_i}{n} \right) = \text{antilog}\left( \frac{\sum_{i=1}^{n} y_i}{n} \right)$$

For our data the geometric mean colony count = antilog($\sum \log_{10} x / n$)

= antilog[(log 1540 + log 1360 + log 1620 + log 1970 + log 1420)/5]

= antilog[(3.1875 + 3.1335 + 3.2095 + 3.2945 + 3.1523)/5]

= antilog(15.9773/5)

= antilog(3.19546) = 1568.

The sample variance ($s^2$) is the sum of the squares of the differences between the values
for x and the mean value ($\bar{x}$), divided by the degrees of freedom of the data set (i.e. n - 1).
(One value of n was used in determining the mean value, hence there are only n - 1 degrees
of freedom (df).) Thus

$$s^2 = \frac{n \sum x^2 - \left( \sum x \right)^2}{n(n - 1)} = \frac{\sum x^2 - \left( \sum x \right)^2 / n}{n - 1}$$

Hence for our data,

$$\begin{aligned}
s^2 &= \frac{(1540^2 + 1360^2 + 1620^2 + 1970^2 + 1420^2) - (1540 + 1360 + 1620 + 1970 + 1420)^2 / 5}{5 - 1} \\
    &= \frac{12{,}742{,}900 - 12{,}513{,}620}{4} = \frac{229{,}280}{4} = 57{,}320
\end{aligned}$$
4 4
An alternative form of the equation is:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Hence, with mean ($\bar{x}$) = 1582, the variance is given by:

$$\begin{aligned}
s^2 &= \frac{(1540 - 1582)^2 + (1360 - 1582)^2 + (1620 - 1582)^2 + (1970 - 1582)^2 + (1420 - 1582)^2}{5 - 1} \\
    &= \frac{(-42)^2 + (-222)^2 + 38^2 + 388^2 + (-162)^2}{4} \\
    &= \frac{1764 + 49{,}284 + 1444 + 150{,}544 + 26{,}244}{4} \\
    &= \frac{229{,}280}{4} = 57{,}320
\end{aligned}$$

Note that in this example, where the mean value was an exact (terminating) number, both
methods gave the same result for the variance. However, where the mean value is a
non-terminating decimal that has to be rounded, rounding errors can cause serious inaccuracies
in the variance calculation.

The standard deviation (s) around the mean is the square root of the variance and is
given by:

$$s = \sqrt{s^2} = \sqrt{57{,}320} = 239.4$$

Thence the Relative Standard Deviation (RSD), which is the ratio between the standard
deviation and the mean value expressed as a percentage, is given by 100 × 239.4/1582 = 15.1%.
The variance of the log10-transformed values is derived similarly using the transformed
values, that is, $y = \log_{10} x$; then:

$$\begin{aligned}
s^2 &= [5(3.1875^2 + 3.1335^2 + 3.2095^2 + 3.2945^2 + 3.1523^2) \\
    &\qquad - (3.1875 + 3.1335 + 3.2095 + 3.2945 + 3.1523)^2] / (5 \times 4) \\
    &= [(5 \times 51.070594) - 255.274115] / 20 \\
    &= (255.352971 - 255.274115) / 20 \\
    &= 0.078856 / 20 = 0.0039428 \approx 0.0039
\end{aligned}$$

Using the alternative method with a mean log-count of 3.1955 gives the variance of y as:

$$\begin{aligned}
s^2 &= [(3.1875 - 3.1955)^2 + (3.1335 - 3.1955)^2 + (3.2095 - 3.1955)^2 \\
    &\qquad + (3.2945 - 3.1955)^2 + (3.1523 - 3.1955)^2] / 4 \\
    &= 0.01577124 / 4 = 0.0039428 \approx 0.0039
\end{aligned}$$

Note that, with the mean log-count carried to four decimal places, the two methods agree to
the precision shown; differences between the variance estimates arise only when the mean or
other intermediate values are rounded more coarsely.
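
The sensitivity of the deviation form to rounding of the mean can be explored numerically. The sketch below works with the log-counts rounded to four decimal places, as in the hand calculation, and then rounds the mean log-count to progressively fewer decimals; the function name `var_from_rounded_mean` is illustrative.

```python
import math

counts = [1540, 1360, 1620, 1970, 1420]  # cfu/g
logs = [round(math.log10(x), 4) for x in counts]  # four decimals, as in hand working
n = len(logs)

# Computational form: [n * sum(y^2) - (sum y)^2] / [n (n - 1)]
var_computational = (n * sum(y * y for y in logs) - sum(logs) ** 2) / (n * (n - 1))


def var_from_rounded_mean(values, decimals):
    """Deviation form, with the mean rounded to the given number of decimals."""
    mean = round(sum(values) / len(values), decimals)
    return sum((y - mean) ** 2 for y in values) / (len(values) - 1)


print(f"computational form        : {var_computational:.7f}")
for decimals in (4, 3, 2):
    print(f"deviation form, mean to {decimals} dp: {var_from_rounded_mean(logs, decimals):.7f}")
```

Rounding the mean can only inflate the sum of squared deviations, so the deviation form drifts upwards as the mean is rounded more coarsely.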
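The SD of the mean log-count is $\sqrt{0.0039428} = 0.0628$ and the RSD of the
mean log-count is (0.0628 × 100)/3.1955 = 1.97%.
The reverse transformation of the mean log-count is done by taking the antilog of $\bar{y}$:
$\bar{x}_g = 10^{\bar{y}} = 10^{3.19546} = 1568$. This is the geometric mean; it is not an
estimate of the arithmetic mean $\bar{x}$. For data that conform to a log-normal distribution,
the relationship between the log mean count ($\log_{10} \bar{x}$) and the mean log-count ($\bar{y}$)
is given approximately by the formula:

$$\log_{10} \bar{x} = \bar{y} + \frac{\ln(10) \times s^2}{2} = \bar{y} + \frac{2.3026 \times s^2}{2}$$

where $s^2$ = variance of the log-count.

Hence, for these data where $\bar{y} = 3.1955$ and $s^2 = 0.0039$, the log mean colony count
is given by $\log_{10} \bar{x} = 3.1955 + (2.3026 \times 0.0039)/2 = 3.2000$.

Hence $\bar{x} = 10^{3.2000} = 1585$, which is close to the arithmetic mean of 1582 calculated
directly from the counts.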
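
The relationship between the mean log-count and the log mean count can be checked against the data. The sketch below assumes the standard log-normal result, in which the log10 of the arithmetic mean exceeds the mean log-count by ln(10) times the log-scale variance divided by 2; with only five counts the agreement can be expected to be approximate at best.

```python
import math
import statistics

counts = [1540, 1360, 1620, 1970, 1420]  # cfu/g
logs = [math.log10(x) for x in counts]

mean_log = statistics.fmean(logs)    # mean log-count
var_log = statistics.variance(logs)  # variance of the log-counts (divisor n - 1)

# Assumed log-normal approximation for the log mean count.
approx_log_mean = mean_log + math.log(10) * var_log / 2

# Log mean count obtained directly from the arithmetic mean.
direct_log_mean = math.log10(statistics.fmean(counts))

print(f"mean log-count              : {mean_log:.4f}")
print(f"log mean count (approximate): {approx_log_mean:.4f}")
print(f"log mean count (direct)     : {direct_log_mean:.4f}")
```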

Note that the standard deviation of the mean log-count should not be directly
back-transformed since the value obtained ($10^{0.0628} = 1.1556$) would be misleading. Rather,
the approximate upper and lower 95% confidence limits around the geometric mean
would be determined as $10^{(3.1955 + 2 \times 0.0628)}$ and $10^{(3.1955 - 2 \times 0.0628)}$, that is,
$10^{3.3211} = 2095$ and $10^{3.0699} = 1175$. Hence for these data the geometric mean is 1568 and
the 95% upper and lower confidence limits are 2095 and 1175, respectively. A comparison with the
arithmetic mean and its 95% confidence limits is shown below:

                                    95% Confidence limits
Method        Mean      Median      Lower       Upper
Arithmetic    1582      1540        1104        2060
Geometric     1568                  1175        2095

For these data the difference between the arithmetic and geometric mean values
is small since the individual counts are reasonably evenly distributed about the mean
value and are not heavily skewed. Note that the median value is smaller than both mean
values; with so few results, the single high count of 1970 raises both means but does not
affect the median. The standard deviation of the arithmetic mean value reflects the level of
dispersion of values around the mean value. Note also that the upper and lower 95% confidence
limits are distributed evenly about the arithmetic mean value (1582 ± 478) but are distributed
unevenly around the geometric mean value (1568 - 393 and 1568 + 527).
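
The contrast between symmetric and asymmetric limits can be reproduced as follows. The sketch takes the approximate 95% limits as the mean plus or minus two standard deviations on each scale, as in the working above; because it carries full floating-point precision, its results may differ in the last digit from values obtained with hand-rounded intermediates.

```python
import math
import statistics

counts = [1540, 1360, 1620, 1970, 1420]  # cfu/g
logs = [math.log10(x) for x in counts]

# Arithmetic scale: mean +/- 2 standard deviations gives symmetric limits.
mean = statistics.fmean(counts)
s = statistics.stdev(counts)
arith_limits = (mean - 2 * s, mean + 2 * s)

# Log scale: limits are set around the mean log-count and then back-transformed,
# so they are asymmetric about the geometric mean.
mean_log = statistics.fmean(logs)
s_log = statistics.stdev(logs)
geo_mean = 10 ** mean_log
geo_limits = (10 ** (mean_log - 2 * s_log), 10 ** (mean_log + 2 * s_log))

print(f"arithmetic: {mean:.0f} ({arith_limits[0]:.0f} to {arith_limits[1]:.0f})")
print(f"geometric : {geo_mean:.0f} ({geo_limits[0]:.0f} to {geo_limits[1]:.0f})")
```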

References

Blood, RM (1974) The Clearing House Scheme. Tech Circular No. 558. Leatherhead Food Research
Association.
Mossel, DAA (1982) Microbiology of Foods: The Ecological Essentials of Assurance and Assessment
of Safety and Quality, 3rd edition. University of Utrecht, NL.

Further Reading

Glantz, SA (1981) Primer of Biostatistics, 4th edition. McGraw-Hill, New York, USA.
Hawkins, DM (2005) Biomeasurement: Understanding, Analysing and Communicating Data in the
Biosciences. Oxford University Press, Oxford, UK.
Hoffman, HS (2003) Statistics Explained: Internet Glossary of Statistical Terms.
http://www.animatedsoftware.com/statglos/statglos.htm
