Figure 1
will become the smallest value in the subsequent group. A "less than" cumulative
frequency column can be constructed in which each entry indicates the position up to
which a specific value or group of values lies. The entries are determined by finding
a running total of the frequency column after the values in the set of data have been
organised in ascending order.
Example 1
A nurse reviewed 50 patient records for the day and then constructed the 2 × 3
contingency table seen below concerning the symptoms that the patients experienced and
the diagnosed disease.
Diagnosed Disease
Symptom
Influenza
Measles
Chicken Pox
24
Excessive Itching
Solution 1
a.
b.
c.
d.
Example 2
A nurse recorded the diastolic blood pressure readings (mmHg) for 20 patients. The raw
data is shown below:
65  70  72  72  77  69  72  80  75  70
69  78  73  70  75  65  78  80  74  72
a.
b.
Solution 2
a.
From the frequency distribution below it is seen that the value of 75 mmHg stops
at the 15th position. Thus there are 15 patients that have a diastolic blood pressure
of 75 mmHg or lower.
Diastolic Blood Pressure (mmHg)   Tally   Frequency   Cumulative Frequency
65                                ||      2           2
69                                ||      2           2 + 2 = 4
70                                |||     3           7
72                                ||||    4           11
73                                |       1           12
74                                |       1           13
75                                ||      2           15
77                                |       1           16
78                                ||      2           18
80                                ||      2           20
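The running total described above can be sketched in Python. This is a minimal illustration using the 20 readings from Example 2; the function name `cumulative_frequency_table` is hypothetical and is not part of the original notes.

```python
from collections import Counter
from itertools import accumulate

# Raw diastolic blood pressure readings (mmHg) from Example 2.
readings = [65, 70, 72, 72, 77, 69, 72, 80, 75, 70,
            69, 78, 73, 70, 75, 65, 78, 80, 74, 72]


def cumulative_frequency_table(values):
    """Return (value, frequency, cumulative frequency) rows in ascending order."""
    counts = Counter(values)
    ordered_values = sorted(counts)            # ascending order is required
    frequencies = [counts[v] for v in ordered_values]
    running_totals = list(accumulate(frequencies))
    return list(zip(ordered_values, frequencies, running_totals))


table = cumulative_frequency_table(readings)
for value, frequency, cumulative in table:
    print(value, frequency, cumulative)
```

The row for 75 mmHg carries a cumulative frequency of 15, which matches the statement that 15 patients have a reading of 75 mmHg or lower.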
b.
The upper boundary for a group can be determined by finding the midway value
between the upper limit of the group and the lower limit of the following group.
The upper boundary of that group will become the lower boundary for the
following group.
From the frequency distribution below it is seen that the upper boundary of 71.5
mmHg occurs at the 7th position. Thus, an estimated 7 patients have a diastolic
blood pressure below 71.5 mmHg. Consequently, it can be estimated that there are
20 − 7 = 13 patients that have a diastolic blood pressure of at least 71.5 mmHg.
Diastolic Blood Pressure (mmHg)   Boundaries          Tally       Frequency   Cumulative Frequency
64 - 67                           63.5 ≤ x < 67.5     ||          2           2
68 - 71                           67.5 ≤ x < 71.5     |||||       5           2 + 5 = 7
72 - 75                           71.5 ≤ x < 75.5     ||||| |||   8           15
76 - 79                           75.5 ≤ x < 79.5     |||         3           18
80 - 83                           79.5 ≤ x < 83.5     ||          2           20
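The grouped table can be rebuilt from the raw readings with a short Python sketch. It assumes whole-number class limits separated by a gap of 1, so that each boundary lies 0.5 either side of a limit; the function name `grouped_table` is hypothetical.

```python
# Raw diastolic blood pressure readings (mmHg) from Example 2.
readings = [65, 70, 72, 72, 77, 69, 72, 80, 75, 70,
            69, 78, 73, 70, 75, 65, 78, 80, 74, 72]

# Class limits used in the grouped frequency distribution.
class_limits = [(64, 67), (68, 71), (72, 75), (76, 79), (80, 83)]


def grouped_table(values, limits):
    """Return (lower boundary, upper boundary, frequency, cumulative frequency) rows."""
    rows = []
    running_total = 0
    for lower, upper in limits:
        frequency = sum(1 for v in values if lower <= v <= upper)
        running_total += frequency
        # Assumption: limits are integers one unit apart, so boundaries are +/- 0.5.
        rows.append((lower - 0.5, upper + 0.5, frequency, running_total))
    return rows


rows = grouped_table(readings, class_limits)

# Cumulative frequency at the upper boundary of 71.5 mmHg.
below_71_5 = next(cf for _, upper_boundary, _, cf in rows if upper_boundary == 71.5)
at_least_71_5 = len(readings) - below_71_5
print(below_71_5, at_least_71_5)
```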
Grouped data
For a grouped set of data, the mean is
$\bar{x} = \frac{\sum fx}{\sum f}$ (also written $\bar{X}$)
The Median
Ungrouped data
If the number of values in the set of data is odd then the median is determined by finding
the middle value after first arranging the values in numerical order (whether ascending or
descending). If the number of values in the set of data is even then the median is
determined by finding the arithmetic mean of the two middle values after first arranging
the values in numerical order. (Recall: The middle position can be determined using a
cumulative frequency column).
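The odd and even cases can be sketched in Python as follows. The function name `median_ungrouped` and the three-value list are hypothetical; the ten-value list is the set of infant ages used in the dispersion example later in these notes.

```python
def median_ungrouped(values):
    """Median of ungrouped data: middle value, or mean of the two middle values."""
    ordered = sorted(values)          # arrange in numerical order first
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # odd count: single middle value
        return ordered[middle]
    # even count: arithmetic mean of the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_case = median_ungrouped([3, 1, 2])
even_case = median_ungrouped([7, 8, 8, 9, 11, 8, 9, 8, 9, 9])
print(odd_case, even_case)
```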
Grouped data
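The formula used to determine the median for grouped data is:
$\text{Median} = L + \left( \frac{\frac{\sum f}{2} - cf_m}{f_m} \right) \times \text{c.w.}$
where
L is the lower boundary of the median class
$\sum f$ is the total frequency
$cf_m$ is the cumulative frequency of the class preceding the median class
$f_m$ is the frequency of the median class
c.w. is the difference between the upper and lower boundaries of the median class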
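As a rough sketch, the grouped median formula can be evaluated in Python. The function name `grouped_median` is hypothetical, equal class widths are assumed, and the median class is taken to be the first class whose cumulative frequency reaches half of the total frequency. The figures are those of the leukaemia example later in these notes.

```python
def grouped_median(lower_boundaries, class_width, frequencies):
    """Median = L + ((sum(f)/2 - cf) / f_m) * class width, with equal class widths."""
    total = sum(frequencies)
    if total <= 0:
        raise ValueError("total frequency must be positive")
    half = total / 2
    cumulative_before = 0             # cumulative frequency of the preceding classes
    for lower, f_m in zip(lower_boundaries, frequencies):
        if cumulative_before + f_m >= half:
            # This class contains the middle position, so it is the median class.
            return lower + (half - cumulative_before) / f_m * class_width
        cumulative_before += f_m
    raise ValueError("median class not found")


median_age = grouped_median([0.5, 5.5, 10.5, 15.5], 5, [5, 9, 8, 3])
print(round(median_age, 1))
```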
The Mode
Ungrouped data
For ungrouped data, the mode is the value that occurs most frequently in the set of data. It
is possible to have more than one modal value.
Grouped data
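The formula used to determine the mode for grouped data is:
$\text{Mode} = L + \left( \frac{\Delta_1}{\Delta_1 + \Delta_2} \right) \times \text{c.w.}$
where
L is the lower boundary of the modal class
$\Delta_1 = f_m - f_a$: the difference between the frequencies of the modal class and the class ABOVE it in the frequency table (the preceding class)
$\Delta_2 = f_m - f_b$: the difference between the frequencies of the modal class and the class BELOW it in the frequency table (the following class)
c.w. is the difference between the upper and lower boundaries of the modal class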
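The grouped mode formula can be sketched in Python as below. The function name `grouped_mode` is hypothetical. Two assumptions are made that the notes do not state: a missing neighbouring class (when the modal class is first or last) is treated as having frequency 0, and if several classes share the highest frequency the first one is used. The figures are those of the leukaemia example later in these notes.

```python
def grouped_mode(lower_boundaries, class_width, frequencies):
    """Mode = L + (d1 / (d1 + d2)) * class width for the modal class."""
    m = frequencies.index(max(frequencies))   # first class with the highest frequency
    f_m = frequencies[m]
    # Assumption: a missing neighbouring class counts as frequency 0.
    f_above = frequencies[m - 1] if m > 0 else 0
    f_below = frequencies[m + 1] if m < len(frequencies) - 1 else 0
    d1 = f_m - f_above                        # modal class minus the class above it
    d2 = f_m - f_below                        # modal class minus the class below it
    if d1 + d2 == 0:
        raise ValueError("formula is undefined when both differences are zero")
    return lower_boundaries[m] + d1 / (d1 + d2) * class_width


modal_age = grouped_mode([0.5, 5.5, 10.5, 15.5], 5, [5, 9, 8, 3])
print(modal_age)
```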
Measures of Dispersion
In order to concisely summarise a set of data, in addition to measures of central tendency,
measures of dispersion are required. A measure of dispersion is a measure that can indicate the
variability within a set of data or how the data in a set is spread or dispersed. There are
five measures of interest:
The range
The range is only applicable with ungrouped data. It is determined by finding the difference
between the largest and the smallest values in the set of data. As such, a weakness of the
range is that it only involves two values and does not provide information about how the values
in between vary.
The interquartile range
The interquartile range is determined by finding the difference between the upper quartile and the
lower quartile for the set of data. The lower quartile (Q1) is defined as that value below which
the first 25% of the data lies, whilst the upper quartile (Q3) is defined as that value below which
the first 75% of the data lies. It follows that the median is actually Q2, that value below which
the first 50% of the data lies. In order to determine the quartiles for ungrouped data, the values
must first be arranged in numerical order. Then the value that lies in the $\frac{1}{4}(n + 1)$th
position represents Q1, the value that lies in the $\frac{1}{2}(n + 1)$th position represents Q2 and
the value that lies in the $\frac{3}{4}(n + 1)$th position represents Q3.
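The position rule can be sketched in Python as follows. The names `value_at_position` and `quartiles` are hypothetical. When a position is not a whole number, this sketch interpolates linearly between the two neighbouring values; that is an assumption, since the notes do not say how fractional positions should be handled. At least three values are needed for the positions to fall inside the data. The data are the infant ages used in the worked example later in these notes.

```python
def value_at_position(ordered, position):
    """Value at a 1-based position in sorted data, interpolating if fractional."""
    whole = int(position)
    fraction = position - whole
    if fraction == 0:
        return ordered[whole - 1]
    lower = ordered[whole - 1]
    upper = ordered[whole]
    # Assumption: linear interpolation between the two neighbouring values.
    return lower + fraction * (upper - lower)


def quartiles(values):
    """Q1, Q2 and Q3 from the (n + 1)/4, (n + 1)/2 and 3(n + 1)/4 positions."""
    ordered = sorted(values)
    n = len(ordered)
    return tuple(value_at_position(ordered, k * (n + 1) / 4) for k in (1, 2, 3))


ages = [7, 8, 8, 9, 11, 8, 9, 8, 9, 9]
q1, q2, q3 = quartiles(ages)
interquartile_range = q3 - q1
print(q1, q2, q3, interquartile_range)
```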
The variance & the standard deviation
The variance is defined as the second moment about the mean of a distribution. Due to the
variance being a squared measure, of more importance is the standard deviation. The standard
deviation is determined by finding the square root of the variance. As such, the units for the
standard deviation would be the same as the units used for the data. By definition, the standard
deviation is that measure that gives an indication of how data is clustered (spread) around the
arithmetic mean $\bar{X}$. Since all the values in the data set are used to determine the standard
deviation, it is the most commonly used measure of variability.
For a population, the formulae used to determine the variance and the standard deviation are:
$\sigma^2 = \frac{\sum f(x - \mu)^2}{N}$ and $\sigma = \sqrt{\frac{\sum f(x - \mu)^2}{N}}$ respectively.
For a sample, the formulae used to determine the variance and the standard deviation are:
$s^2 = \frac{\sum f(x - \bar{x})^2}{n - 1}$ and $s = \sqrt{\frac{\sum f(x - \bar{x})^2}{n - 1}}$ respectively.
Once again, for the purposes of this course, the problems concerning variance and standard
deviation will concern a sample. When considering grouped data, the values for x are the
midpoints of the classes.
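The sample formulae can be sketched in Python as below. The function name `sample_variance` is hypothetical; the values and frequencies are the infant ages (months) used in the worked solution that follows.

```python
from math import sqrt


def sample_variance(values, frequencies):
    """Sample variance: sum of f * (x - mean)^2 divided by (n - 1)."""
    n = sum(frequencies)
    mean = sum(f * x for x, f in zip(values, frequencies)) / n
    squared_deviations = sum(f * (x - mean) ** 2 for x, f in zip(values, frequencies))
    return squared_deviations / (n - 1)


values = [7, 8, 9, 11]        # distinct ages in months
frequencies = [1, 4, 4, 1]    # number of infants at each age

variance = sample_variance(values, frequencies)
standard_deviation = sqrt(variance)   # same units as the data
print(round(variance, 1), round(standard_deviation, 1))
```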
a. Determine:
   i. The mean
   ii. The median
   iii. The mode
   iv. The range
   v. The interquartile range
   vi. The variance
   vii. The standard deviation
   viii. The coefficient of variation
b.
If another sample of infants from the community resulted in the mean age being 9.2
months with a standard deviation of 2.8 months, comment on the difference in the
variability for the two sets of data.
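Solution 1
a.
i. Mean age:
$\bar{x} = \frac{\sum x}{n} = \frac{7 + 8 + 8 + 9 + 11 + 8 + 9 + 8 + 9 + 9}{10} = \frac{86}{10} = 8.6$ months
ii. Median age:
When arranged in ascending order the set of data is:
7  8  8  8  8  9  9  9  9  11
Median age = $\frac{8 + 9}{2} = \frac{17}{2} = 8.5$ months
iii. Modal age: 8 months and 9 months (each value occurs four times, so the set of data has two modes)
iv. Range = 11 − 7 = 4 months
v. Interquartile range:
When arranged in ascending order the set of data is:
7  8  8  8  8  9  9  9  9  11
Q1 is in the $\frac{1}{4}(10 + 1) = \frac{1}{4}(11) = 2.75$th position. Thus Q1 = 8
Q3 is in the $\frac{3}{4}(10 + 1) = \frac{3}{4}(11) = 8.25$th position. Thus Q3 = 9
Interquartile range = 9 − 8 = 1 month
vi. Variance:
x     f     x − x̄    (x − x̄)²    f(x − x̄)²
7     1     -1.6     2.56        2.56
8     4     -0.6     0.36        1.44
9     4     0.4      0.16        0.64
11    1     2.4      5.76        5.76
      Σf = 10                    Σf(x − x̄)² = 10.4
Variance = $s^2 = \frac{\sum f(x - \bar{x})^2}{n - 1} = \frac{10.4}{10 - 1} = \frac{10.4}{9} \approx 1.2$ months²
vii. Standard deviation = $s = \sqrt{\frac{\sum f(x - \bar{x})^2}{n - 1}} = \sqrt{\frac{10.4}{9}} \approx 1.1$ months
viii. Coefficient of variation = $CV = \frac{\text{Standard deviation}}{\text{Arithmetic mean}} \times 100\% = \frac{1.1}{8.6} \times 100\% \approx 12.8\%$
b. Coefficient of variation for the second sample = $\frac{2.8}{9.2} \times 100\% \approx 30.4\%$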
Since the coefficient of variation for the second sample is much larger than that for
the first sample then it can be concluded that the variability in the second sample is
greater. That is, the values in the second set of data show greater spread.
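The comparison in part b can be checked with a short Python sketch. The function name `coefficient_of_variation` is hypothetical; the first sample uses the rounded standard deviation of 1.1 months, as the worked solution does.

```python
def coefficient_of_variation(standard_deviation, mean):
    """Coefficient of variation as a percentage of the arithmetic mean."""
    return standard_deviation / mean * 100


cv_first = coefficient_of_variation(1.1, 8.6)    # first sample of infants
cv_second = coefficient_of_variation(2.8, 9.2)   # second sample of infants
print(round(cv_first, 1), round(cv_second, 1))
print("second sample more variable:", cv_second > cv_first)
```

Because the coefficient of variation is expressed relative to the mean, it allows the spread of two samples with different means to be compared on the same scale.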
Example 2
Given the set of data below concerning the age (years) for the onset of leukaemia,
determine:
a. The mean age of onset
b. The median age of onset
c. The modal age of onset
d. The standard deviation in the ages of onset

Age of onset    f
1 - 5           5
6 - 10          9
11 - 15         8
16 - 20         3
                Σf = 25
Solution 2
For each section, the additional columns required in the frequency table will be
indicated but note that one table showing all columns can be provided.
a. The additional columns required to determine the mean are the midpoint x column
and the fx column:

Age of onset    f          Midpoint x             fx
1 - 5           5          (1 + 5)/2 = 3          15
6 - 10          9          (6 + 10)/2 = 8         72
11 - 15         8          (11 + 15)/2 = 13       104
16 - 20         3          (16 + 20)/2 = 18       54
                Σf = 25                           Σfx = 245

Mean age = $\bar{x} = \frac{\sum fx}{\sum f} = \frac{245}{25} = 9.8$ years
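b. The additional columns required to determine the median are the boundaries column
and the cumulative frequency column:

Boundaries      f     Cumulative frequency
0.5 - 5.5       5     5
5.5 - 10.5      9     5 + 9 = 14
10.5 - 15.5     8     14 + 8 = 22
15.5 - 20.5     3     22 + 3 = 25

Since $\frac{\sum f}{2} = \frac{25}{2} = 12.5$ and the cumulative frequency first reaches 12.5 in the
5.5 - 10.5 class, this is the median class.
Median age = $L + \left( \frac{\frac{\sum f}{2} - cf_m}{f_m} \right) \times cw = 5.5 + \left( \frac{\frac{25}{2} - 5}{9} \right) \times 5 = 5.5 + \frac{7.5}{9} \times 5$
Median age = 5.5 + 4.2 = 9.7 years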
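c. The additional column required to determine the mode is the boundaries column:

Age of onset    Boundaries      f
1 - 5           0.5 - 5.5       5
6 - 10          5.5 - 10.5      9
11 - 15         10.5 - 15.5     8
16 - 20         15.5 - 20.5     3

Modal age = $L + \left( \frac{\Delta_1}{\Delta_1 + \Delta_2} \right) \times cw = 5.5 + \left( \frac{9 - 5}{(9 - 5) + (9 - 8)} \right) \times 5 = 5.5 + \frac{4}{4 + 1} \times 5 = 9.5$ years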
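d. The additional columns required to determine the standard deviation are the
midpoint x column, the x − x̄ column, the (x − x̄)² column and the f(x − x̄)² column:

f          Midpoint x    fx            x − x̄    (x − x̄)²    f(x − x̄)²
5          3             15            -6.8     46.24       231.2
9          8             72            -1.8     3.24        29.16
8          13            104           3.2      10.24       81.92
3          18            54            8.2      67.24       201.72
Σf = 25                  Σfx = 245                          Σf(x − x̄)² = 544

Standard deviation = $s = \sqrt{\frac{\sum f(x - \bar{x})^2}{n - 1}} = \sqrt{\frac{544}{25 - 1}} = \sqrt{\frac{544}{24}} = \sqrt{22.7} \approx 4.8$ years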
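Parts a and d of this solution can be reproduced with a short Python sketch in which each class is represented by its midpoint. The function name `grouped_mean_and_sd` is hypothetical.

```python
from math import sqrt


def grouped_mean_and_sd(class_limits, frequencies):
    """Mean and sample standard deviation of grouped data using class midpoints."""
    midpoints = [(lower + upper) / 2 for lower, upper in class_limits]
    n = sum(frequencies)
    mean = sum(f * x for x, f in zip(midpoints, frequencies)) / n
    squared_deviations = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, frequencies))
    return mean, sqrt(squared_deviations / (n - 1))


age_classes = [(1, 5), (6, 10), (11, 15), (16, 20)]   # age of onset in years
class_frequencies = [5, 9, 8, 3]

mean_age, sd_age = grouped_mean_and_sd(age_classes, class_frequencies)
print(round(mean_age, 1), round(sd_age, 1))
```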
Figure 1: Negatively skewed distribution and positively skewed distribution
In order to measure skewness, Pearson's coefficient of skewness can be determined using the
formula:
$sk = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}} \approx \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}}$
This relationship is valid since the median generally lies between the mode and the mean,
upholding the relationship $\text{Mean} - \text{Mode} \approx 3(\text{Mean} - \text{Median})$. In addition, the
value for Pearson's coefficient of skewness must uphold the inequality $-3 \le sk \le 3$ and if
sk < 0, then the distribution is negatively skewed
sk = 0, then the distribution is symmetrical
sk > 0, then the distribution is positively skewed
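Both forms of the coefficient can be sketched in Python. The function names are hypothetical. Applying them to the rounded results of the leukaemia example (mean 9.8, median 9.7, mode 9.5, standard deviation 4.8 years) is an illustration added here and is not a calculation taken from the notes.

```python
def skewness_from_mode(mean, mode, standard_deviation):
    """Pearson's coefficient of skewness: (mean - mode) / standard deviation."""
    return (mean - mode) / standard_deviation


def skewness_from_median(mean, median, standard_deviation):
    """Pearson's coefficient of skewness: 3 * (mean - median) / standard deviation."""
    return 3 * (mean - median) / standard_deviation


def describe_skewness(sk):
    """Classify a distribution by the sign of its skewness coefficient."""
    if sk < 0:
        return "negatively skewed"
    if sk > 0:
        return "positively skewed"
    return "symmetrical"


# Rounded summary statistics from the leukaemia example (illustrative use only).
sk_mode = skewness_from_mode(9.8, 9.5, 4.8)
sk_median = skewness_from_median(9.8, 9.7, 4.8)
print(round(sk_mode, 4), round(sk_median, 4), describe_skewness(sk_mode))
```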
Graphical Illustrations
A graphical illustration provides an image that can be used as a representation of a set of data.
There are various graphical illustrations that can be employed but it is crucial that the selected
illustration is appropriate for the type of data.
Qualitative data
For qualitative (categorical) data, use of the following graphical illustrations is
appropriate:
1. Bar charts
2. Pie charts
Quantitative discrete data
For quantitative discrete data, use of the following graphical illustrations is appropriate:
1. Bar charts
2. Pie charts
3. Line graphs
4. Dot charts
5. Stem-and-leaf plots
6. Box-and-whisker plots
Quantitative continuous data
For quantitative continuous data, use of the following graphical illustrations is
appropriate:
1. Histograms
2. Stem-and-leaf plots
3. Box-and-whisker plots
4. Cumulative frequency curves
Example 1
Use a pie chart to represent the data below concerning the age of patients who registered at a
clinic.
Age group
Frequency
0 - 10
23
11 - 18
19
19 - 25
26 - 35
36 - 50
Over 50
14
Solution 1
Since the data concerns qualitative data, a pie chart is suitable.
[Figure: Pie chart showing distribution for ages, with one sector per age group (0 - 10, 11 - 18, 19 - 25, 26 - 35, 36 - 50 and over 50 years)]
Example 2
Represent the data below using a suitable chart.
Diagnosed Disease
Symptom
Influenza
Measles
Chicken Pox
24
Excessive Itching
Solution 2
Since the data concerns qualitative data, a clustered bar chart is suitable.
[Figure: Clustered bar chart of frequency (0 to 25) against symptom (fever above 101 °F, excessive itching), with one bar per diagnosed disease (influenza, measles, chicken pox)]
Example 3
A drug company held a clinical trial in order to determine the effect of a new drug on various
participants. The reaction times (minutes) for the drug to take effect for 27 participants are
recorded below:
20 34 11 5 28 22 43 37 50
49 10 40 47 13 20 25 8 29
41 21 16 12 38 15 23 36 42
Use a box-and-whisker plot (box plot) to illustrate the information.
Solution 3
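A box plot provides a five-point summary for a set of data. The data set must first be arranged in
numerical order:
5 8 10 11 12 13 15 16 20
20 21 22 23 25 28 29 34 36
37 38 40 41 42 43 47 49 50
1. Minimum value = 5
2. Q1 = 15, which is the measure in the $\frac{1}{4}(27 + 1) = \frac{1}{4}(28) = 7$th position
3. Median = 25, which is the measure in the $\frac{1}{2}(27 + 1) = \frac{1}{2}(28) = 14$th position
4. Q3 = 40, which is the measure in the $\frac{3}{4}(27 + 1) = \frac{3}{4}(28) = 21$st position
5. Maximum value = 50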
The box plot should be constructed where the length of the plot (including the whiskers)
represents the range of the distribution and the length of the box represents the interquartile
range. The median divides the box into two sections.
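The five-point summary behind the box plot can be sketched in Python. The function name `five_number_summary` is hypothetical. For these 27 values the quartile positions are whole numbers; the linear interpolation used for fractional positions is an assumption that does not affect this example.

```python
def five_number_summary(values):
    """Minimum, Q1, median, Q3 and maximum using the (n + 1) position rule."""
    ordered = sorted(values)
    n = len(ordered)

    def at(position):
        # Value at a 1-based position; interpolate if the position is fractional.
        whole = int(position)
        fraction = position - whole
        if fraction == 0:
            return ordered[whole - 1]
        return ordered[whole - 1] + fraction * (ordered[whole] - ordered[whole - 1])

    return (ordered[0],
            at((n + 1) / 4),
            at((n + 1) / 2),
            at(3 * (n + 1) / 4),
            ordered[-1])


# Reaction times (minutes) for the 27 participants in Example 3.
reaction_times = [20, 34, 11, 5, 28, 22, 43, 37, 50,
                  49, 10, 40, 47, 13, 20, 25, 8, 29,
                  41, 21, 16, 12, 38, 15, 23, 36, 42]

summary = five_number_summary(reaction_times)
box_length = summary[3] - summary[1]       # interquartile range
plot_length = summary[4] - summary[0]      # range, including the whiskers
print(summary, box_length, plot_length)
```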
[Figure: Box-and-whisker plot showing reaction times]