
STATISTICS IN PRACTICE

Part II
The use of statistics in business process improvement

All rights reserved. Nothing from this publication may be copied, stored in an authorised data
file, or made public in any form or any manner whether electronic, mechanical, photocopying,
photography or any other means, without the prior written consent of the author.

Statistics in practice, Autumn 2010


2.3.2 Distribution measurements


In Statistics in Practice, Part I, we used three centre measures (mean, median and mode) to show
which values the data are grouped around. Often, though, we cannot correctly interpret survey
results from the centre measures alone; we also need information about how the observations are
spread. For example, if the owner of two discotheques wants to know something about the
age of his customers and a survey shows that the mean age at both venues is 22, he
cannot conclude that both customer groups have identical age distributions. It is possible that at one
discotheque the customers are all around 22, whilst the other is more family oriented and attracts both
children and parents, which can still make the mean age 22. To gain insight into the composition
of the ages, a distribution measurement (a measure of spread) is needed. Below we discuss the
distribution measures range, variance and standard deviation, and percentile, quartile and decile scores.
Range
Range is the simplest distribution measure. It is the difference between the highest and lowest
observations. At the discotheque with the customers mainly around the age of 22, the range, that is
the difference in age between the oldest and the youngest customer, may only be a matter of four
years, whilst with the family-oriented one the age distribution is much greater. The youngest visitor
might be 10 years old, the oldest 60. The range is then 50 years, although the mean is equal.
The range is often not a very reliable piece of data. The exceptions (imagine a 102 year old visitor),
which are nearly always present, create a large distortion. Therefore many market researchers do
not calculate the range by the highest minus the lowest score but select, for example, the fifth or the
tenth observation above and below for their calculations. The precise spot to select is arbitrary.
People usually look at the data first and then discard the greatest exceptions. Formally, this is not
allowed but it is very justifiable.
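
The effect of one exceptional visitor on the range can be sketched in a few lines of Python. The ages below are invented purely for illustration, and the helper `trimmed_range` is a hypothetical name for the practice of skipping the most extreme observations before taking the difference.

```python
# Hypothetical ages, invented for illustration (not survey data from the text).
ages = [10, 14, 19, 21, 22, 22, 23, 24, 35, 41, 60, 102]


def value_range(values):
    """Range: the highest observation minus the lowest observation."""
    return max(values) - min(values)


def trimmed_range(values, skip=1):
    """Range after discarding the `skip` lowest and `skip` highest observations."""
    ordered = sorted(values)
    kept = ordered[skip:len(ordered) - skip]
    return kept[-1] - kept[0]


print(value_range(ages))       # 102 - 10 = 92
print(trimmed_range(ages, 1))  # 60 - 14 = 46
```

How many observations to skip remains an arbitrary choice, exactly as the text says.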
Variance and standard deviation
The most common distribution measures are variance and standard deviation. These closely
related indicators show if the values in an investigation are similar or dissimilar to each other. If the
variance (or the standard deviation) is large, then many of the survey scores are far from the mean
value. Conversely, if the variance is small, the majority of the values are
close to the mean. In other words, a high variance (or standard deviation) indicates heterogeneous
scores while a low value shows homogeneous scores.
So how do we calculate and interpret variance and standard deviation? First of all, note that both of
them are measures for the distribution around the mean. So it makes sense to start off with looking
at the average deviation from the mean. An example:
In a group of five people, the respective ages are 17, 19, 20, 22, 22. This means that the mean age
in this group is (17 + 19 + 20 + 22 + 22) / 5 = 20.
How big are the absolute deviations from this mean? For the five persons, these
absolute deviations are (20-17=) 3, (20-19=) 1, (20-20=) 0, (22-20=) 2, (22-20=) 2. So the average
absolute deviation from the mean is (3 + 1 + 0 + 2 + 2) / 5 = 1.6.


However, one problem arises! If we ask a computer to do this calculation for us, the computer will
solve this problem as follows:
How big are the deviations from the mean? For these five persons, the deviations are -3, -1,
0, 2, 2. This implies that the average deviation equals (-3 + -1 + 0 + 2 + 2) / 5 = 0. So the computer
makes a mistake! It does not look at the absolute values of the deviations, but takes into account
that some deviations are negative, while others are positive. This implies that the average deviation
will always be zero if it is calculated in this way (think about this!).
So statisticians have found a solution to this problem. This solution is to start with squaring all
deviations, because if you square a negative number, it becomes positive (and if you square a
positive number, it stays positive). For example: (-4)² = 16 and 4² = 16.
In our example, the squared deviations from the mean are 9, 1, 0, 4, 4. This means that the
average squared deviation from the mean equals (9 + 1 + 0 + 4 + 4) / 5 = 3.6. This is the variance.

The variance is the mean of the squared deviations from the mean.

Now, we calculated this number using squared numbers. If we now take the square root of the
variance, we get (in this case) √3.6 ≈ 1.9. This is the standard deviation.

The standard deviation is the square root of the variance.

Note that the standard deviation is not equal to the average absolute deviation from the mean
(which was 1.6 in our example). So we cannot interpret the standard deviation in this way. This
mistake is often made, but note that we cannot say: "The standard deviation is the average
deviation from the mean." The only correct explanation is that it is the square root of the average
squared deviation from the mean. In other words: the square root of the variance.
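
The calculations for the five ages can be retraced with a short Python sketch. It divides by the number of observations throughout, as the worked example does; the variable names are our own.

```python
import math

ages = [17, 19, 20, 22, 22]
n = len(ages)
mean = sum(ages) / n                                  # 20.0

deviations = [x - mean for x in ages]                 # -3, -1, 0, 2, 2
mean_signed = sum(deviations) / n                     # plus and minus cancel: 0
mean_absolute = sum(abs(d) for d in deviations) / n   # 1.6
variance = sum(d ** 2 for d in deviations) / n        # 3.6
std_dev = math.sqrt(variance)                         # about 1.9

print(mean, mean_signed, mean_absolute, variance, round(std_dev, 1))
```

The standard deviation (about 1.9) and the average absolute deviation (1.6) are visibly different numbers, which is why the two should not be confused.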


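Formulae for calculating variance and standard deviation, with the meanings of the symbols used and several other
common symbols.
Relating to a POPULATION:
μ = population mean,
π = population fraction,
σ² = population variance,
σ = population standard deviation.
Relating to a SAMPLE:
n = sample size,
x̄ = sample mean,
p = k/n = sample fraction,
s² = sample variance,
s = sample standard deviation.

\[ \sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n} \quad \text{and} \quad \sigma = \sqrt{\sigma^2} \]

\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \quad \text{and} \quad s = \sqrt{s^2} \]

In the population formula, n stands for the number of observations in the population.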

As in general we take a sample out of the population, rather than observing the entire population,
we generally calculate the sample variance and sample standard deviation.
Summarising: you calculate the standard deviation as follows.
1 You calculate the mean of all the scores.
2 You subtract the mean from all the values.
3 You square the results.
4 You calculate the mean of all the squared results. This result is called the variance, shown as s².
5 Take the square root of the variance, which gives the standard deviation, depicted as s.

If we want to estimate the standard deviation of a population, we have to use the sample data. This
general calculation of the standard deviation is not completely valid for small samples (<30 records /
respondents). If the sample size is small, we need to use a correction. In the calculation of the
mean of the squared deviations, you do not divide by the sample size (n) but by the sample size minus one (n-1).
This makes the standard deviation somewhat bigger than would be the case without the correction.


If we now use the standard deviation to make estimates, there will be more uncertainty in the
estimate, which is reasonable if the estimates are based on a small sample. No further explanation
of this concept is required at this stage.
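
The size of the correction can be seen by computing both versions for the same data. The sketch below uses Python's standard `statistics` module, in which `pvariance` and `pstdev` divide by n, while `variance` and `stdev` divide by n-1; treating the five ages from the earlier example as a sample is our own assumption, made only for illustration.

```python
import statistics

ages = [17, 19, 20, 22, 22]

pop_var = statistics.pvariance(ages)   # sum of squared deviations / n       -> 3.6
samp_var = statistics.variance(ages)   # sum of squared deviations / (n - 1) -> 4.5
pop_sd = statistics.pstdev(ages)       # square root of 3.6
samp_sd = statistics.stdev(ages)       # square root of 4.5

print(pop_var, samp_var)
print(round(pop_sd, 2), round(samp_sd, 2))
```

With only five observations the corrected standard deviation is noticeably larger; the gap shrinks as the sample grows.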


Percentile scores
The central and distribution measurements discussed so far say something about the results as an
entity. Often however, the researcher wants to know what an individual score means. For example,
if you have measured something about a person - for example test results - and they get a score of
say 86, what does this score mean? If you do not know how the other respondents scored, the
result tells you nothing: is 86 high or low? Percentile scores are a handy tool if you want to
say something about the meaning of individual scores. If you know the percentile score of an
observation, it indicates what percentage of the scores are lower than that score.

A percentile score shows the percentage of scores that are lower than a particular score.

To calculate percentiles, you first sort the scores from low to high. Next, for each score you calculate
what percentage of the scores is lower than that score.
In effect, the results are divided into one hundred units and you work out where the score would fall if
the scale ran from 1 to 100. An example: if the 20th percentile score of the test equals
75, it means that 20% of the people who took the test scored lower than 75. If a score of 86 is
in the 93rd percentile, it is quite a good score, because 93% of all people scored lower.
You can use the following formula to determine the location of a percentile score:
Lp = (n+1) × p/100
where Lp is the location of the p-th percentile, n is the number of observations and p is the percentile
score.
Suppose we have the following 10 observations of monthly amount of money spent on a mobile
phone subscription by students:
18, 12, 25, 20, 30, 45, 35, 80, 55, 30
The location of the 25th percentile is ((10+1) × 25/100 =) 2.75. So, the 25th percentile is equal to the
value of the 2.75th observation. Of course, first we need to order the observations from low to high
(12, 18, 20, 25, 30, 30, 35, 45, 55, 80). The 2.75th observation falls between the second and third
observation and is closer to the third than to the second: it lies three-quarters of the distance
between the second and third observation away from the second
observation. The distance between the second and third observation is (20-18=) 2. Three-quarters
of that distance is (0.75 × 2 =) 1.5. So, the 2.75th observation is the second observation (18) plus
three-quarters of the distance between the second and third observation (1.5), which equals
(18+1.5=) 19.5. Note that the 50th percentile is the value of the ((10+1) × 50/100 =) 5.5th observation (which is
the mean of the two middle observations), and hence equal to the median.
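
The location rule and the interpolation between neighbouring observations can be written as a small Python function. This is a minimal sketch of the (n+1) × p/100 method only; the function name and the handling of locations that fall before the first or after the last observation are our own assumptions.

```python
def percentile(values, p):
    """Value of the p-th percentile using the location rule (n + 1) * p / 100."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) * p / 100   # 1-based position, may be fractional
    lower = int(location)          # observation just below the location
    fraction = location - lower    # share of the distance to the next observation
    if lower < 1:                  # assumption: clamp to the lowest observation
        return ordered[0]
    if lower >= n:                 # assumption: clamp to the highest observation
        return ordered[-1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)


spending = [18, 12, 25, 20, 30, 45, 35, 80, 55, 30]
print(percentile(spending, 25))  # 19.5
print(percentile(spending, 50))  # 30.0, the median
```

Other software may use a different location rule and therefore give slightly different percentile values for the same data.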
Quartiles and deciles
Percentile scores are calculated on the basis of dividing into one hundred units, but you will also
meet quartiles and deciles in practice and in the theory books. For quartiles the scores are divided
into four groups of 25% with the aim of working out if a score belongs to the lowest or highest 25%
or is in the middle. A score above the third quartile belongs to the quarter with the highest scores,
while a score below the first quartile belongs to the quarter with the lowest scores.


There are multiple methods to calculate the first and third quartile, which lead to different results.
Perhaps the easiest method is that the first quartile is equal to the median of the first half of the
data, the second quartile is equal to the median, whereas the third quartile is equal to the median of
the second half of the data. Alternatively, the first quartile is equal to the 25th percentile, the second
quartile equals the 50th percentile and the third quartile is equal to the 75th percentile.
If we take the same observations as we used in determining the percentile scores (12, 18, 20, 25,
30, 30, 35, 45, 55, 80), we determine the median and second quartile to be the mean of the 5th and
6th observations, which is ((30+30)/2=) 30. The first quartile is the median of the first half of the data,
which we get by dividing the sample in two halves. The first half of the
observations is (12, 18, 20, 25, 30), which has a median of 20. The third quartile is the median of the second half of
observations (30, 35, 45, 55, 80), which is 45. With the percentile method, by contrast, the first quartile
is equal to the 25th percentile, which we calculated above to be 19.5.
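
The median-of-halves method can be sketched with Python's standard `statistics.median`. The slicing below assumes an even number of observations, as in the example; with an odd number this sketch would leave the middle observation out of both halves, which is only one of several conventions.

```python
import statistics

data = sorted([12, 18, 20, 25, 30, 30, 35, 45, 55, 80])
half = len(data) // 2

q1 = statistics.median(data[:half])    # median of 12, 18, 20, 25, 30 -> 20
q2 = statistics.median(data)           # mean of the 5th and 6th observation -> 30.0
q3 = statistics.median(data[-half:])   # median of 30, 35, 45, 55, 80 -> 45

print(q1, q2, q3)
```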
For deciles, the division is into ten groups, which speaks for itself. People will usually want to know
if a score belongs in the highest or lowest 10%. If the IQ is measured of all the students in your
class and you are in the tenth decile, your future at school looks rosy.
You can use the following formula to determine the location of a decile score:
Ld = (n+1) × d/10
where Ld is the location of the d-th decile, n is the number of observations and d is the decile
score.


2.6 Questions and assignments


Distribution measures

1. Calculate from the population data below:
a the mathematical mean
b the mode
c the median
d the range
e the variance
f the standard deviation
The scores are:
98 96 102 104 99 103 97 101 98 99 100 100 100 99 101 100 102 101
2. How can we interpret the standard deviation?


3. What is the average age of the students in your class? What is the variance? What is the
standard deviation?
Average age:
...
Variance:..
...
Standard deviation:.
...

4. Where would the standard deviation of the age distribution be bigger: in an elderly home or at
IKEA on Saturday afternoon? Explain your answer!
...
...
...
...

5. Calculate the variance of the following population data:


2

...
...
...
...
6. Determine the variance and the standard deviation of the following sample data:
9 15 11 31 23 13 15 17 21

...
...
...
...


7. A friend calculates a variance and reports that it is -25. How do you know that he has made a
serious calculation error?
...
...
...
...
8. Create a sample of five numbers with a mean of 5 and a standard deviation of 0.
...
...
...
...

Percentiles and quartiles


9. A market researcher gives a score of 45 to a respondent during a survey. The question is how
to place a value on this score? To make a judgement he calculates the percentile score of this
value against the scores of the other respondents. He determines that a score of 45 falls in the
55th percentile. What does this result mean and what is the conclusion of the market researcher?
...
...
...
...
10. Imagine the market researcher had calculated quartiles and the score of 45 fell in the first
quartile. What would be his conclusion then?
...
...
...
...


3 Correlative statistics

As we already know, sample investigations are made in order to be able to make predictions
about a population. For our statistical calculations, there is an important difference between
frequency distributions and probability distributions.
A frequency distribution is the actual distribution of data from factually observed material such as
the results of a sample investigation. Unlike sample data, data about the
population cannot be calculated from factual observations; we estimate them from the sample data, so there is only a
chance that they are correct. If we put in a graph the probability of each event that is theoretically
possible, we have a probability distribution. Two examples of probability distributions are given
below:

Probability distribution
How is the probability of an event distributed over all possible events?
[Figure: probability distribution of the score when throwing a die. Each score from 1 to 6 has probability 1/6.]
[Figure: probability distribution of the sum of two dice, for sums from 2 to 12.]
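
Both probability distributions can be derived by listing every equally likely outcome. The Python sketch below does this with exact fractions; it assumes fair six-sided dice, and the variable names are our own.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# One die: each of the six scores is equally likely.
single = {face: Fraction(1, 6) for face in faces}

# Two dice: count how many of the 36 equally likely pairs give each sum.
counts = Counter(a + b for a, b in product(faces, repeat=2))
two_dice = {total: Fraction(count, 36) for total, count in sorted(counts.items())}

for total, probability in two_dice.items():
    print(total, probability)
```

The sum of two dice peaks at 7 (probability 6/36 = 1/6) and falls away symmetrically to 1/36 at 2 and 12, whereas the single die is flat.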


3.1 Normal distribution


One probability distribution is used very often in statistics. This probability distribution is called the
normal distribution. It has the following characteristics:
- The probability of something in the centre of all possibilities is high, the probability of something
  that is far away from the centre is small. For example: if the IQ of Dutch males is normally
  distributed and you randomly choose a man from the Dutch population, the chance that he has
  an average IQ is large. The chance that he has a very low or a very high IQ is smaller.
- The distribution is symmetrical. This means that the probability that something is X below
  average is the same as the probability that something is X above average. For example: the
  probability of selecting a Dutch male that has an IQ that is 20 points below average is exactly
  the same as the probability of selecting a Dutch male that has an IQ that is 20 points above
  average.

In books on statistics, you seldom see the graph displayed as a bar chart. People normally draw a line
through the centre of the classes to create a smooth graph, known as a curve. This has been done in the
graphs alongside, which show the frequencies of two groups. In both cases, they show a
normal distribution in which seven is the mean score. In
the example on the left, this number is nearly the only score that occurs; there is hardly any
spread. The numbers have a strongly homogeneous character. In the graph on the right, you see
other answers appearing (5, 6, 8 and 9) as well as seven. The distribution is wider and the answers
are therefore heterogeneous. Note that this also means that in the graph on the right, the variance
and standard deviation are bigger than in the graph on the left.
The statistics discussed in this reader are based on answer patterns that give a normal distribution.
This is not always the case in practice. A normal distribution can be recognised by:
- mean, mode and median have the same value;
- there is only one mode; and
- the frequency graph has a symmetrical bell shape.
A good understanding of normal distribution is needed to be able to understand anything about
statistics. The meaning of a normal distribution (also known as a Gaussian Curve) in statistics is
based on an important theory, namely that if we take a large number of surveys of a population, the
results of the surveys will have the tendency to resemble each other. The average of the survey
means will (approximately) equal the mean of the population. This means that the chance of taking
a sample in which the mean is similar to the mean of the population is much greater than taking a
sample in which the mean deviates from that of the population.
An example: If we want to estimate the average height of Dutch males, we can take a
sample of 250 Dutch males, which could show an average height of 181 cm. If we took
another sample of 250 Dutch males, this sample could show an average of 180.5 cm, while a third
and fourth sample could show an average of 180 and 182.5 cm respectively. The average of all


these means will be a good estimator of the average height in the population. Also, the chance of
finding this average (181 cm in this case) in a sample is bigger than the chance of finding 178. But the chance of
finding 178 is bigger than the chance of finding 165.
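
The behaviour of repeated samples can be imitated with a small simulation. The Python sketch below draws 1000 samples of 250 heights each from an assumed population; the population mean of 181 cm follows the example, but the standard deviation of 7 cm, the number of samples and the random seed are invented for illustration only.

```python
import random

random.seed(2010)  # fixed seed so that the sketch is repeatable

POP_MEAN, POP_SD = 181.0, 7.0         # assumed population values (illustration only)
SAMPLE_SIZE, N_SAMPLES = 250, 1000


def sample_mean():
    """Mean height of one random sample of SAMPLE_SIZE simulated males."""
    heights = [random.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    return sum(heights) / SAMPLE_SIZE


sample_means = [sample_mean() for _ in range(N_SAMPLES)]
mean_of_means = sum(sample_means) / N_SAMPLES

# Count sample means close to the population mean and sample means far from it.
near = sum(1 for m in sample_means if abs(m - POP_MEAN) <= 0.5)
far = sum(1 for m in sample_means if abs(m - POP_MEAN) >= 3.0)

print(round(mean_of_means, 2), near, far)
```

Under these assumptions the average of the sample means lands very close to 181, and sample means near 181 are far more common than sample means around 178 or lower.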
The question of how closely a sample result mirrors a population is expressed as probability.
Calculating the probability is therefore at the centre of statistics.

3.2 The normal distribution (2)


Normal distribution was introduced in the previous section where it was used for the sample
distribution. As previously said, the concept is that if there is a normal distribution in the sample, it
will also be the case for the probability distribution in the population.
The normal distribution in the sample is characterised by the similar values for the mean, mode
and median, the presence of only one mode and a Gaussian curve for the graph. In viewing the
normal distribution as a probability distribution, we modify the description somewhat. The criteria
become:
1 a completely symmetrical model with single peak;
2 a fixed relationship between the height and width of the figure;
3 a range that extends to infinity on both the left and right extremities.
The last criterion is necessary because we do not know the distribution in the population and all
values are possible in principle. If all values are possible, the range is infinite, at least in theory.
In graphs of a normal distribution used for a probability distribution, you will always see that the
extremes of the curve never reach the horizontal axis. Look at the graph below.
The most important number for a normal distribution is the standard deviation, the square root of the
average squared difference from the mean. How do we calculate that again? Imagine eight people take an
examination and their grades are 3, 4, 5, 6, 6, 7, 8 and 9. The mathematical mean is 6.0. We
subtract the mean from each grade and then square each result to remove the minus signs. Then
we divide the sum of these results (28) by the number of observations (8). This gives us the variance
(σ² = 3.5). To calculate the standard deviation (σ), we take the square root of the variance, which gives
us 1.87 as the result.
Note: this is not a sample. Therefore, the denominator in the formula is 8 (n) and not 7 (n-1).
What does this 1.87 represent? It means that the scores closest to the mean lie within plus or minus 1.87
of the mean of 6.0, that is, between 4.13 and 7.87. In the example, the people with scores of
three or four on the low side and the smart cookies with eight or nine fall outside this range. But
how many is that when measured in percentage terms? What can we do with this information?
These questions are interesting especially for samples with a large number of units.
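
The exam example can be retraced in Python. As in the text, the eight grades are treated as a complete population, so the sum of squares is divided by n and not by n-1; the variable names are our own.

```python
import math

grades = [3, 4, 5, 6, 6, 7, 8, 9]
n = len(grades)

mean = sum(grades) / n                                # 6.0
variance = sum((g - mean) ** 2 for g in grades) / n   # 28 / 8 = 3.5
sd = math.sqrt(variance)                              # about 1.87

low, high = mean - sd, mean + sd                      # about 4.13 and 7.87
inside = [g for g in grades if low <= g <= high]      # 5, 6, 6, 7
outside = [g for g in grades if g < low or g > high]  # 3, 4, 8, 9

print(round(sd, 2), round(low, 2), round(high, 2))
print(inside, outside)
```

In this small group, four of the eight grades fall inside the band of one standard deviation around the mean.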
In a normal distribution, about 68% of the observations lie within plus or minus one standard
deviation of the mean. With twice the standard deviation, we include about
95.4% of all the observations. If we use three times the standard deviation, we cover about
99.7% of the observations. Refer to the graph below.
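
These percentages follow from the shape of the normal curve and can be checked with `NormalDist` from Python's standard `statistics` module (available from Python 3.8). The sketch computes the share of a standard normal distribution within one, two and three standard deviations of the mean.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

shares = {}
for k in (1, 2, 3):
    # Area under the curve between -k and +k standard deviations.
    shares[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(k, round(100 * shares[k], 1))
# 1 68.3
# 2 95.4
# 3 99.7
```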


3.2 Z scores


In view of the fixed relationship in a normal distribution and the central role played by the
mathematical mean and the standard deviation, there is the possibility to standardise the
distribution to use a fixed scale. The mean of the distribution we set at 0 (zero) and the standard
deviation at 1. If you look at the graph on the previous page, you will see that is what happened
there.
A z score for an observation is the distance it is from the mean measured in multiples of the
standard deviation.
On the left of the curve we have -1, -2 and -3 times the standard deviation and on the right 1, 2
and 3 times the standard deviation. We now talk about a standard normal distribution or a z
distribution.
It is possible to calculate where each individual sample score in the scale is positioned on the line
of standard deviations. Therefore, we can determine the chance that the value appears in the
population.

The formula for z scores is: z = (x - μ) / σ.

Expressing the formula in words: to find the z score of an observation, we subtract the mean of
the sample from the score and divide the result by the standard deviation.
Notice that we use the population symbols and not the sample symbols but use the values from
the sample. Only the sample values are known and we assume the distribution of the sample
corresponds with that of the population.
The probability that a sample value is present in a population is 1.0. The probability that a value is
greater (or less) than the mean is 0.5, because the distribution is symmetrical. Keep this in mind!
If the z score of a value is known we can precisely determine the percentage of the population
that has a value lower than the specified value, higher than the specified value, between the
specified value and the mean, and so on. These percentages are equal to the percentages of the
surface area. The surface areas can be found in a table; you do not need to calculate them
yourself. You will find the table in the appendix.
This all sounds quite complicated, so let's look at an example!
Example 1
Imagine that a production manager takes a sample from his daily production of packets of rice.
He conducts a series of tests weighing the packets, with the result that they have an average weight of
250 grams. Of course it is important that as many packets of rice as possible weigh around 250
grams. The standard deviation is determined to be 10 grams and the scores show a normal
distribution. What share of the theoretical population (the entire production) is expected to weigh
between 240 and 250 grams?
In order to be able to answer this question, a conclusion we made earlier is important. Because it
is a normal distribution, we can assume that the population has a mean of 250 grams and a
standard deviation of 10 grams just like the sample.
We calculate the answer by using z scores. We only have to calculate the value for 240 grams.
The 250 gram score is the mean and the z score of the mean is always zero. Completing the
z-score formula for 240 grams, z = (x - μ) / σ, leads us to a z score of (240-250)/10 = -1.00.


We now look in the z-score table (in the appendix) for the surface area related to a
z score of 1.00. The minus sign can be ignored because what happens to the left
or the right of the mean is identical: plus 1 and minus 1 give the same result, and
the surface area to the left or right of the mean is the same.
From the table we read that a z score of 1.00 in the central column (the grey
shaded area under the arc to the left of the mean) gives a value of 0.3413. From the chapter about
probability calculations, we know the maximum probability value is 1.0. A proportion of 0.3413
is therefore a probability of 34.13%. The graph shows that this corresponds to 34.13% of the
population. In other words: of all the units in the population, 34.13% probably weigh between
240 and 250 grams (refer to the top graph in the diagram alongside).
The production manager is shocked at this result because it means that the goal of each package
containing exactly 250 grams is a long way off.
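
Example 1 can be checked without the table by using `NormalDist` from Python's standard `statistics` module (Python 3.8 or later). The two helper functions are hypothetical names introduced for this sketch; the mean of 250 grams and standard deviation of 10 grams come from the example.

```python
from statistics import NormalDist


def z_score(x, mean, sd):
    """Distance of x from the mean, measured in standard deviations."""
    return (x - mean) / sd


def share_between(low, high, mean, sd):
    """Expected share of a normal population with values between low and high."""
    dist = NormalDist(mean, sd)
    return dist.cdf(high) - dist.cdf(low)


z = z_score(240, mean=250, sd=10)                 # -1.0
share = share_between(240, 250, mean=250, sd=10)  # about 0.3413

print(z, round(share, 4))
```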
Example 2
As the next step, the production manager asks himself what percentage of the
population weighs between 240 and 270 grams.
The interval 240 to 270 grams does not have the mean of 250 grams as a border.
Therefore, we split the interval into two parts, each with the mean of 250 grams as a
border. The section 250-270 has a z score of 2 for 270. According to the table, the
proportion (p) for this is 0.477. The 240-250 section has 240 as its border, and we
already calculated in example 1 that this has a z score of -1.0 and a proportion of 0.341.
Adding both proportions together (0.477 + 0.341) gives us a proportion of 0.818. On a statistical
basis we can make the statement that nearly 82% of the population weighs between 240 and 270
grams. The statement is an expectation; it is probable, but we are not completely certain.
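
The split-at-the-mean approach of Example 2 can be sketched in Python. The helper `area_mean_to_z` is a hypothetical name for what the appendix table lists (the area between the mean and a given z score), computed here with the standard library's `NormalDist` instead of being looked up. The sketch applies only when the interval has the mean inside it, as it does here.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def area_mean_to_z(z):
    """Area under the standard normal curve between the mean and z (sign ignored)."""
    return standard_normal.cdf(abs(z)) - 0.5


mean, sd = 250, 10
z_low = (240 - mean) / sd    # -1.0
z_high = (270 - mean) / sd   # 2.0

share = area_mean_to_z(z_low) + area_mean_to_z(z_high)
print(round(area_mean_to_z(z_low), 3), round(area_mean_to_z(z_high), 3), round(share, 3))
```

The unrounded areas add up to about 0.819 rather than 0.818, because the table values 0.341 and 0.477 were each rounded before adding; the conclusion of nearly 82% is unchanged.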


3.3 Assignments for Z scores


1 A market researcher measures ages in a population and finds the following values: 15, 16, 16,
17, 18 and 20. He calculates the standard deviation to be 1.789. This is the correct result.
true
false
2 The variance of the investigation in question 1 is 0.408.
true
false
3 A frequency distribution shows a mean of 10 and a standard deviation of 5 as core
measurements. The market researcher made a mistake with the calculations because when
entering the sample scores he forgot to place a 1 before all the figures. Instead of 151 he entered
51; 121 was 21, et cetera.
He corrected his error by adding 100 to his mean of 10. This gave a correct final result.
true
false
4 Whilst making his corrections (question 3), he forgot to recalculate the standard deviation. The
result (5) is still the correct answer.
true
false
5 In another investigation, the market researcher calculated the z scores for his observations. He
used the formula z=( - ) / . Is this correct?
true
false
6 To calculate a z score, we subtract the sample mean from the value for which we want the z
score and then divide the result by the standard deviation.
true
false
8 There is a population proportion of 43.43% between -1.5 and +1.5 times the standard
deviation.
true
false
9 Suppose that a sequence of weight measurements is taken for a sample and the mean is 50
grams. The standard deviation is determined to be 5 grams and the observations give a normal
distribution curve. The question is what is the expected size of the theoretical population weighing
between 40 and 50 grams. Would it be correct for our market researcher to calculate a result of
95.4%?
true (=calculated correctly)
false (=not calculated correctly)
10 Suppose that a sequence of weight measurements is taken for a sample and the mean is 50
grams. The standard deviation is determined to be 5 grams and the observations give a normal
distribution curve. The question is:
What is the expected size of the theoretical population weighing between 50 and 60 grams?
Would it be correct for our market researcher to calculate a result of 95.4%?
true (=calculated correctly)
false (=not calculated correctly)
11 A z score of 5 is not possible. Consequently, the table only covers values up to 4.
true
false


Appendix 1
Interpretation of this table: The shaded area for a z-value of 0.58 equals 0.2190
