Part II
The use of statistics in business process improvement
All rights reserved. Nothing from this publication may be copied, stored in an authorised data
file, or made public in any form or any manner whether electronic, mechanical, photocopying,
photography or any other means, without the prior written consent of the author.
However, one problem arises! If we ask a computer to do this calculation for us, the computer will
solve this problem as follows:
How big are the deviations from the mean? Well, for these five persons, the deviations are -3, -1,
0, 2, 2. This implies that the average deviation equals (-3 + -1 + 0 + 2 + 2) / 5 = 0. So the computer
makes a mistake! It does not look at the absolute values of the deviations, but takes into account
that some deviations are negative, while others are positive. This implies that the average deviation
will always be zero if it is calculated in this way, because the negative and positive deviations
from the mean always cancel each other out (think about this!).
So statisticians have found a solution to this problem. This solution is to start with squaring all
deviations, because if you square a negative number, it becomes positive (and if you square a
positive number, it stays positive). For example: $(-4)^2 = 16$ and $4^2 = 16$.
In our example, the squared deviations from the mean are 9, 1, 0, 4, 4. This means that the
average squared deviation from the mean equals (9 + 1 + 0 + 4 + 4) / 5 = 3.6. This is the variance.
The variance is the mean of the squared deviations from the mean.
Now, we calculated this number using squared numbers. If we now take the square root of the
variance, we get (in this case) $\sqrt{3.6} \approx 1.9$. This is the standard deviation.
Note that the standard deviation is not equal to the average absolute deviation from the mean
(which was 1.6 in our example). So we cannot interpret the standard deviation in this way. This
mistake is often made, but note that we cannot say: "The standard deviation is the average
deviation from the mean." The only correct explanation is that it is the square root of the average
squared deviation from the mean. In other words: the square root of the variance.
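The calculation above can be retraced in a few lines of Python. This is only a sketch that starts from the five deviations given in the text (the underlying observations are not needed); the variable names are chosen here for illustration.

```python
import math

# Deviations from the mean for the five persons in the example.
deviations = [-3, -1, 0, 2, 2]
n = len(deviations)

# Plain average of the deviations: negative and positive values cancel out.
mean_deviation = sum(deviations) / n

# Average of the absolute deviations.
mean_abs_deviation = sum(abs(d) for d in deviations) / n

# Variance: the mean of the squared deviations.
variance = sum(d ** 2 for d in deviations) / n

# Standard deviation: the square root of the variance.
std_dev = math.sqrt(variance)

print(mean_deviation)        # 0.0
print(mean_abs_deviation)    # 1.6
print(variance)              # 3.6
print(round(std_dev, 1))     # 1.9
```

The output shows that the standard deviation (about 1.9) and the average absolute deviation (1.6) are different numbers.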
Formulae for calculating variance and standard deviation, with the meanings of the symbols used and several other
common symbols.
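Relating to a POPULATION:
$N$ = population size,
$\mu$ = population mean,
$\pi$ = population fraction,
$\sigma^2$ = population variance,
$\sigma$ = population standard deviation.
Relating to a SAMPLE:
$n$ = sample size,
$\bar{x}$ = sample mean,
$p = k/n$ = sample fraction,
$s^2$ = sample variance,
$s$ = sample standard deviation.

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \quad \text{and} \quad \sigma = \sqrt{\sigma^2}$$

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \quad \text{and} \quad s = \sqrt{s^2}$$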
As in general we take a sample out of the population, rather than observing the entire population,
we generally calculate the sample variance and sample standard deviation.
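Summarising: you calculate the standard deviation as follows.
1. Calculate the mean of the observations.
2. Calculate the deviation of each observation from the mean.
3. Square each deviation.
4. Calculate the mean of the squared deviations. This is the variance.
5. Take the square root of the variance. This is the standard deviation.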
If we want to estimate the standard deviation of a population, we have to use the sample data. This
general calculation of the standard deviation is not completely valid for small samples (fewer than
30 records or respondents). If the sample size is small, we need to use a correction. In the
calculation of the mean squared deviation, you do not divide by the sample size (n) but by the
sample size minus one (n-1).
This makes the standard deviation somewhat bigger than would be the case without the correction.
If we now use the standard deviation to make estimates, there will be more uncertainty in the
estimate, which is reasonable if the estimates are based on a small sample. No further explanation
of this concept is required at this stage.
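To see the effect of dividing by n-1 instead of n, the two calculations can be compared with Python's standard `statistics` module. The ages below are hypothetical; they were chosen here only so that the deviations from the mean are -3, -1, 0, 2, 2, as in the earlier example.

```python
import statistics

# Hypothetical ages (assumption for illustration): the mean is 20, so the
# deviations from the mean are -3, -1, 0, 2, 2.
ages = [17, 19, 20, 22, 22]

# Dividing the sum of squared deviations by n.
variance_n = statistics.pvariance(ages)
std_dev_n = statistics.pstdev(ages)

# Dividing the sum of squared deviations by n - 1 (the correction).
variance_n_minus_1 = statistics.variance(ages)
std_dev_n_minus_1 = statistics.stdev(ages)

print(variance_n, round(std_dev_n, 2))                  # 3.6 1.9
print(variance_n_minus_1, round(std_dev_n_minus_1, 2))  # 4.5 2.12
```

With only five observations the correction is clearly visible; as the sample grows, the difference between dividing by n and by n-1 becomes smaller.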
Percentile scores
The measures of central tendency and dispersion discussed so far say something about the results as
a whole. Often, however, the researcher wants to know what an individual score means. For example,
if you have measured something about a person, such as a test result, and they get a score of,
say, 86, what does this score mean? If you do not know how the other respondents scored, the
result tells you nothing: is 86 high or low? Percentile scores are a handy tool if you want to
say something about the meaning of individual scores. The percentile score of an observation
indicates what percentage of the scores are lower than that score.
A percentile score shows the percentage of scores that are lower than a particular score.
To calculate percentiles, you first sort the scores from low to high. Next, for each score, you
calculate what percentage of the scores is lower than that score.
In effect, the results are divided into one hundred units and it is calculated where the score would
be if the scale ran from 1 to 100. An example: if the 20th percentile score of the test equals
75, it means that 20% of the people that took the test have scored lower than 75. If a score of 86 is
in the 93rd percentile, it is quite a good score, because 93% of all people have scored lower.
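The idea of "percentage of scores that are lower" can be sketched in Python as follows. The function name `percent_below` and the test scores are hypothetical and serve only as an illustration; they are not the data behind the 93rd percentile mentioned in the text.

```python
def percent_below(scores, value):
    """Return the percentage of scores that are strictly lower than value."""
    lower = sum(1 for score in scores if score < value)
    return 100 * lower / len(scores)

# Hypothetical test scores (assumption for illustration only).
test_scores = [52, 60, 61, 67, 70, 72, 75, 79, 86, 90]

print(percent_below(test_scores, 86))  # 80.0
```

In this made-up group, eight of the ten scores are lower than 86, so 80% of the scores lie below it.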
You can use the following formula to determine the location of a percentile score:
$L_p = (n+1) \cdot p / 100$
where $L_p$ is the location of the pth percentile, n is the number of observations and p is the
percentile.
Suppose we have the following 10 observations of monthly amount of money spent on a mobile
phone subscription by students:
18, 12, 25, 20, 30, 45, 35, 80, 55, 30
The location of the 25th percentile is ((10+1) × 25/100 =) 2.75. So, the 25th percentile is equal to the
value of the 2.75th observation. Of course, first we need to order the observations from low to high
(12, 18, 20, 25, 30, 30, 35, 45, 55, 80). The 2.75th observation falls between the second and third
observation and is closer to the third than to the second: it lies three-quarters of the distance
between the second and third observation away from the second observation. The distance between
the second and third observation is (20-18=) 2. Three-quarters of that distance is (0.75*2=) 1.5.
So, the 2.75th observation is the second observation (18) plus three-quarters of the distance
between the second and third observation (1.5), which equals (18+1.5=) 19.5. Note that the 50th
percentile is the value of the ((10+1) × 50/100 =) 5.5th observation (which is the mean of the two
middle observations), and hence equal to the median.
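The location formula and the interpolation between neighbouring observations can be written as a small Python function. This is a sketch under one assumption that the text does not cover: when the location falls before the first or after the last observation, the function simply returns the lowest or highest value. The function name `percentile` is chosen here for illustration.

```python
def percentile(data, p):
    """Value at location (n + 1) * p / 100 in the sorted data."""
    ordered = sorted(data)
    n = len(ordered)
    location = (n + 1) * p / 100

    # Assumption: clamp locations that fall outside the observations.
    if location <= 1:
        return ordered[0]
    if location >= n:
        return ordered[-1]

    lower_position = int(location)          # 1-based position of the lower neighbour
    fraction = location - lower_position    # how far to move towards the next one
    lower = ordered[lower_position - 1]
    upper = ordered[lower_position]
    return lower + fraction * (upper - lower)

# Monthly spending on a mobile phone subscription, as in the text.
spending = [18, 12, 25, 20, 30, 45, 35, 80, 55, 30]

print(percentile(spending, 25))  # 19.5
print(percentile(spending, 50))  # 30.0
```

Note that statistical software often uses slightly different location rules, so other tools may report somewhat different percentile values for the same data.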
Quartiles and deciles
Percentile scores are calculated on the basis of dividing into one hundred units, but you will also
meet quartiles and deciles in practice and in the theory books. For quartiles the scores are divided
into four groups of 25% with the aim of working out if a score belongs to the lowest or highest 25%
or is in the middle. A score above the third quartile belongs to the quarter with the highest scores,
while a score below the first quartile belongs to the quarter with the lowest scores.
There are multiple methods to calculate the first and third quartile, which lead to different results.
Perhaps the easiest method is that the first quartile is equal to the median of the first half of the
data, the second quartile is equal to the median, whereas the third quartile is equal to the median of
the second half of the data. Alternatively, the first quartile is equal to the 25th percentile, the
second quartile equals the 50th percentile and the third quartile is equal to the 75th percentile.
If we take the same observations as we used in determining the percentile scores (12, 18, 20, 25,
30, 30, 35, 45, 55, 80), we determine the median and second quartile to be the mean of the 5th and
6th observation, which is ((30+30)/2=) 30. To find the first and third quartile, we divide the
sample into two halves. The first quartile is the median of the first half of observations
(12, 18, 20, 25, 30), which is 20. The third quartile is the median of the second half of
observations (30, 35, 45, 55, 80), which is 45. Under the alternative method, the first quartile is
equal to the 25th percentile, which we calculated above to be 19.5.
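The "median of each half" method can be sketched in Python with the standard `statistics` module. The text only shows an even number of observations; for an odd number, this sketch leaves the middle observation out of both halves, which is an assumption made here. The function name is hypothetical.

```python
import statistics

def quartiles_by_halves(data):
    """Return (Q1, Q2, Q3) using the median of the lower and upper half."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    # Assumption: with an odd number of observations the middle one is skipped.
    upper_half = ordered[len(ordered) - half:]
    q1 = statistics.median(lower_half)
    q2 = statistics.median(ordered)
    q3 = statistics.median(upper_half)
    return q1, q2, q3

spending = [18, 12, 25, 20, 30, 45, 35, 80, 55, 30]
q1, q2, q3 = quartiles_by_halves(spending)
print(q1, q2, q3)  # Q1 = 20, Q2 = 30, Q3 = 45
```

The first quartile found this way (20) differs slightly from the 25th percentile found with the location formula (19.5), which illustrates that the methods lead to different results.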
For deciles, the division is into ten groups, which speaks for itself. People will usually want to know
if a score belongs in the highest or lowest 10%. If the IQ of all the students in your class is
measured and you are in the tenth decile, your future at school looks rosy.
You can use the following formula to determine the location of a decile score:
$L_d = (n+1) \cdot d / 10$
where $L_d$ is the location of the dth decile, n is the number of observations and d is the
decile.
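As a worked illustration (not part of the original text), the decile formula can be applied to the ten ordered observations used earlier (12, 18, 20, 25, 30, 30, 35, 45, 55, 80). The symbol $D_3$ for the third decile is introduced here only for this example.

```latex
L_3 = (10 + 1) \cdot \frac{3}{10} = 3.3
\qquad\Rightarrow\qquad
D_3 = 20 + 0.3 \cdot (25 - 20) = 21.5
```

The 3.3rd observation lies three-tenths of the way from the third observation (20) to the fourth observation (25), in the same way as the interpolation for percentiles.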
2.6
3. What is the average age of the students in your class? What is the variance? What is the
standard deviation?
Average age:
...
Variance:
...
Standard deviation:
...
4. Where would the standard deviation of the age distribution be bigger: in a home for the elderly
or at IKEA on Saturday afternoon? Explain your answer!
...
...
...
...
...
...
...
...
6. Determine the variance and the standard deviation of the following sample data:
9, 15, 11, 31, 23, 13, 15, 17, 21
...
...
...
...
7. A friend calculates a variance and reports that it is -25. How do you know that he has made a
serious calculation error?
...
...
...
...
8. Create a sample of five numbers with a mean of 5 and a standard deviation of 0.
...
...
...
...
3
Correlative statistics
As we already know, sample investigations are made in order to be able to make predictions
about a population. For our statistical calculations, there is an important difference between
frequency and probability distributions.
A frequency distribution is the actual distribution of data from factually observed material such as
the results of a sample investigation. Unlike sample data, data about the population cannot be
calculated from factual observations; we estimate them from the sample data, so there is only a
certain probability that they are correct. If we put in a graph the probability of each event that is
theoretically possible, we have a probability distribution. Two examples of probability
distributions are given below:
[Figure: Probability distribution. How is the probability of an event distributed over all possible events?]
[Figure: Probability distribution of the score when throwing a die: each of the scores 1 to 6 has a probability of 1/6.]
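The difference between a frequency distribution (what was actually observed) and a probability distribution (what is theoretically expected) can be illustrated with a simulated die. This is only a sketch: the 600 throws, the fixed seed and the variable names are assumptions made here, not data from the text.

```python
import random
from collections import Counter
from fractions import Fraction

faces = range(1, 7)

# Probability distribution: every score of a fair die has probability 1/6.
probability = {face: Fraction(1, 6) for face in faces}

# Frequency distribution: simulate 600 throws (fixed seed for repeatability).
rng = random.Random(42)
throws = [rng.randint(1, 6) for _ in range(600)]
counts = Counter(throws)
relative_frequency = {face: counts[face] / len(throws) for face in faces}

for face in faces:
    print(face, probability[face], round(relative_frequency[face], 3))
```

The observed relative frequencies will be close to 1/6 but not exactly equal to it, whereas the probabilities are fixed by the theory.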
The probability of something in the centre of all possibilities is high, the probability of something
that is far away from the centre is small. For example: if the IQ of Dutch males is normally
distributed and you randomly choose a man from the Dutch population, the chance that he has
an average IQ is large. The chance that he has a very low or a very high IQ is smaller.
The distribution is symmetrical. This means that the probability that something is X below
average is the same as the probability that something is X above average. For example: the
probability of selecting a Dutch male that has an IQ that is 20 points below average is exactly
the same as the probability of selecting a Dutch male that has an IQ that is 20 points above
average.
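Both properties can be checked with `statistics.NormalDist` from the Python standard library. The mean of 100 and standard deviation of 15 are assumptions made here for illustration; the text does not give values for the IQ distribution. Because IQ is treated as a continuous variable, the sketch compares the probability of being at least 20 points below average with that of being at least 20 points above average.

```python
from statistics import NormalDist

# Assumed parameters (not from the text): mean 100, standard deviation 15.
iq = NormalDist(mu=100, sigma=15)

# Values near the centre are more likely than values far from it.
density_at_mean = iq.pdf(100)
density_far_away = iq.pdf(140)

# Symmetry: 20 or more points below average is as likely as 20 or more above.
p_low = iq.cdf(80)
p_high = 1 - iq.cdf(120)

print(density_at_mean > density_far_away)   # True
print(round(p_low, 4), round(p_high, 4))    # the two values are equal
```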
these means will be a good estimator of the average height in the population. Also, the chance of
finding this average (181 cm in this case) in a sample is bigger than the chance of finding 178 cm.
But the chance of finding 178 cm is bigger than the chance of finding 165 cm.
The question of how closely a sample result mirrors a population is expressed as a probability.
Calculating the probability is therefore at the centre of statistics.
3.2
Z scores
In view of the fixed relationship in a normal distribution and the central role played by the
arithmetic mean and the standard deviation, it is possible to standardise the distribution so that
it uses a fixed scale. We set the mean of the distribution at 0 (zero) and the standard
deviation at 1. If you look at the graph on the previous page, you will see that is what happened
there.
A z score for an observation is the distance it is from the mean measured in multiples of the
standard deviation.
On the left of the curve we have -1, -2 and -3 times the standard deviation and on the right 1, 2
and 3 times the standard deviation. We now talk about a standard normal distribution or a z
distribution.
It is possible to calculate where each individual sample score in the scale is positioned on the line
of standard deviations. Therefore, we can determine the chance that the value appears in the
population.
Expressing the formula in words: to find the z score of an observation, we subtract the mean of
the sample from the score and divide the result by the standard deviation.
Notice that we use the population symbols and not the sample symbols but use the values from
the sample. Only the sample values are known and we assume the distribution of the sample
corresponds with that of the population.
The probability that a sample value is present in a population is 1.0. The probability that a value is
greater (or less) than the mean is 0.5, because the distribution is symmetrical. Keep this in mind!
If the z score of a value is known we can precisely determine the percentage of the population
that has a value lower than the specified value, higher than the specified value, between the
specified value and the mean, and so on. These percentages are equal to the percentages of the
surface area under the curve. The surface areas can be found in a table, so you do not need to
calculate them yourself. You will find the table in the appendix.
This all sounds quite complicated, so let's look at an example!
Example 1
Imagine that a production manager takes a sample from his daily production of packets of rice.
He conducts a series of tests weighing the packets, with the result that they have an average
weight of 250 grams. Of course it is important that as many packets of rice as possible weigh
around 250 grams. The standard deviation is determined to be 10 grams and the scores show a normal
distribution. What proportion of the theoretical population (the entire production) is expected to
weigh between 240 and 250 grams?
In order to be able to answer this question, a conclusion we made earlier is important. Because it
is a normal distribution, we can assume that the population has a mean of 250 grams and a
standard deviation of 10 grams just like the sample.
We calculate the answer by using z scores. We only have to calculate the value for 240 grams.
The 250 gram score is the mean and the z score of the mean is always zero. Completing the z-score
formula for 240 grams, $z = (x - \mu) / \sigma$, leads us to a z score of (240-250)/10 = -1.00.
Appendix 1
Interpretation of this table: The shaded area for a z-value of 0.58 equals 0.2190
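The value in this interpretation can be reproduced without the table. The sketch below assumes that the shaded area is the area under the standard normal curve between the mean (z = 0) and the given z-value, which matches the number quoted above; the function name `table_area` is hypothetical.

```python
from statistics import NormalDist

def table_area(z):
    """Area under the standard normal curve between 0 and z (assumed table layout)."""
    return NormalDist().cdf(abs(z)) - 0.5

print(round(table_area(0.58), 4))   # 0.219
print(round(table_area(-1.0), 4))   # 0.3413
```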