
Descriptive Statistics

In any course concerning statistics, biostatistics or otherwise, performing exploratory data
analysis initially involves being able to effectively represent the relevant data as well as to
concisely describe said data. These actions, representing and summarising data, are the key
concepts under the branch of statistics known as descriptive statistics. Descriptive statistics can
therefore be defined as that branch of statistics that is concerned with describing and
summarising the main characteristics of a sample of data that has been collected. Some measures
that are commonly used to describe a data set are measures of central tendency, measures of
dispersion and measures of deviation from normality. Some tools that are commonly used when
presenting data include contingency tables, frequency distributions and graphs.
Types of data
Data can be classified as primary data or secondary data. Primary data is defined as data that the
researcher uses after gathering it for him/herself. Secondary data is defined as data that the
researcher uses after someone else has gathered it. Since data can take on different values, data
can also be classified as concerning qualitative variables or quantitative variables. A variable is
defined as a characteristic or a quantity whose value may vary from one observation to another.
A qualitative variable is a characteristic that is described using varying words or phrases. For
example, the race of a patient is a qualitative variable. This variable may also be referred to as a
categorical variable. A quantitative variable is a quantity that is inherently a numerical value.
Thus, this variable may also be referred to as a numerical variable. There are two types of
quantitative variables: quantitative discrete variables and quantitative continuous variables. A
quantitative discrete variable is a quantitative variable that can assume only natural numbers or
whole numbers. For example, the number of patients suffering from influenza is a quantitative
discrete variable. A quantitative continuous variable is a quantitative variable that can assume
any real number. For example, the blood glucose reading (mmol/l) for a diabetic is a quantitative
continuous variable. Figure 1 illustrates how all types of data can be classified.

Figure 1

Scales of measurement for data


Data can be measured on different scales (levels) of measurement:
Nominal Scale
This scale is used for the simplest level of measurement. That is, it is applicable with qualitative
data in which the values only fit into categories. For example, the symptoms that a patient suffers
from can be coded using the following scale: 1 = Fever, 2 = Nasal Congestion, 3 = Watery eyes.
Ordinal Scale
This intermediate level of measurement is used with qualitative data that has inherent order
within the categories. For example, the degree of pain experienced by a burn patient can be
coded using the following scale: 1 = Mild, 2 = Moderate, 3 = Severe. As opposed to the nominal
scale example, an inherent order exists for the categories.
Numerical Scales
These scales are used only with quantitative data. There are two types of numerical scales:
Fixed Ratio Numerical Scale - This scale is one in which there is a fixed zero point. For
example, the weight of a patient.
Interval Numerical Scale - This scale is one in which the zero point of the scale is
arbitrary. For example, the temperature of a patient when taken in °C or °F. On either of
these scales, the value of zero does not coincide with an absence of temperature; thus the
zero point is said to be arbitrary.
Contingency Tables & Frequency Distributions
The frequency of a value is defined as the number of observations recorded for that value.
Consequently, two ways of presenting data using a table can be defined:
A contingency table (cross tabulation) is a table that displays the observed frequencies
of qualitative variables in a matrix format. An m × n contingency table indicates that there
are m rows and n columns such that one variable has m categories and the other has n
categories. The two variables must either be measured using a nominal scale or an ordinal
scale. Thus, this format for presenting data provides a count of the number of
observations that exhibit two specific characteristics.
A frequency distribution is a list of the possible values for a quantitative variable and
their frequencies. There are two types of frequency distributions: an ungrouped
frequency distribution and a grouped frequency distribution. An ungrouped frequency
distribution provides the exact values within the set of data and their related frequencies.
A grouped frequency distribution provides groups of values within the set of data and
their related frequencies. In grouped frequency distributions boundaries should be
defined. The lower boundary is the smallest possible real number that could be placed in
a group and the upper boundary is the limiting real number for that group. This number
will become the smallest value in the subsequent group. A "less than" cumulative
frequency column can be constructed where the values in this column indicate the
position up to which a specific value or group of values lie. The values in the column are
determined by finding a running total of the frequency column after ensuring that the
values in the set of data have been organised in ascending order.
Example 1
A nurse reviewed 50 patient records for the day and then constructed the 2 × 3
contingency table seen below concerning the symptoms that the patients experienced and
the diagnosed disease.

                       Diagnosed Disease
Symptom                Influenza   Measles   Chicken Pox
Fever above 101 °F     24          8         2
Excessive Itching      3           6         7

How many patients experienced
a. A fever above 101 °F?
b. Chicken pox?
c. Excessive itching and influenza?
d. A fever above 101 °F and measles?

Solution 1
a. The number of patients that experienced a fever above 101 °F is equal to
   24 + 8 + 2 = 34.
b. The number of patients that experienced chicken pox is equal to 2 + 7 = 9.
c. The number of patients that experienced excessive itching and influenza is equal
   to 3.
d. The number of patients that experienced a fever above 101 °F and measles is
   equal to 8.
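
The row and column totals used in this solution can be checked with a short Python sketch. The layout (a dictionary of rows) and the helper name `cell` are illustrative choices rather than part of the notes, and the count of 6 for excessive itching with measles is an assumption inferred from the total of 50 records, since the solution never states it.

```python
# Observed counts from Example 1 (rows: symptoms, columns: diagnosed diseases).
# Assumption: the "Excessive Itching"/"Measles" count of 6 is inferred from the
# total of 50 patient records; the worked solution does not state it directly.
diseases = ["Influenza", "Measles", "Chicken Pox"]
table = {
    "Fever above 101 F": [24, 8, 2],
    "Excessive Itching": [3, 6, 7],
}

# Row totals: number of patients exhibiting each symptom.
row_totals = {symptom: sum(counts) for symptom, counts in table.items()}

# Column totals: number of patients diagnosed with each disease.
col_totals = {
    disease: sum(counts[j] for counts in table.values())
    for j, disease in enumerate(diseases)
}


def cell(symptom, disease):
    """Return the count for one symptom/disease combination."""
    return table[symptom][diseases.index(disease)]


print("a.", row_totals["Fever above 101 F"])
print("b.", col_totals["Chicken Pox"])
print("c.", cell("Excessive Itching", "Influenza"))
print("d.", cell("Fever above 101 F", "Measles"))
```

Parts a and b are marginal totals (a whole row or a whole column), whereas parts c and d read a single cell of the table.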

Example 2
A nurse recorded the diastolic blood pressure readings (mmHg) for 20 patients. The raw
data is shown below:
65  70  72  72  77  69  72  80  75  70
69  78  73  70  75  65  78  80  74  72

a. Construct an ungrouped frequency table including the cumulative frequency
   column. Use this table to determine the number of patients that have a diastolic
   blood pressure of 75 mmHg or lower.
b. Construct a grouped frequency table commencing with the class 64 - 67 mmHg.
   Use equal widths and include columns concerning boundaries and the cumulative
   frequency. Use this table to estimate the number of patients with a diastolic blood
   pressure of at least 71.5 mmHg.

Solution 2
a. From the frequency distribution below it is seen that the value of 75 mmHg stops
   at the 15th position. Thus there are 15 patients that have a diastolic blood pressure
   of 75 mmHg or lower.

Diastolic Blood Pressure (mmHg)   Tally   Frequency   Cumulative Frequency
65                                ||      2           2
69                                ||      2           4
70                                |||     3           7
72                                ||||    4           11
73                                |       1           12
74                                |       1           13
75                                ||      2           15
77                                |       1           16
78                                ||      2           18
80                                ||      2           20

b. The upper boundary for a group can be determined by finding the midway value
   between the upper limit of the group and the lower limit of the following group.
   The upper boundary of that group will become the lower boundary for the
   following group.
   From the frequency distribution below it is seen that the upper boundary of 71.5
   mmHg occurs at the 7th position. Thus, an estimated 7 patients have a diastolic
   blood pressure below 71.5 mmHg. Consequently, it can be estimated that there are
   20 - 7 = 13 patients that have a diastolic blood pressure of at least 71.5 mmHg.

Diastolic Blood Pressure (mmHg)   Boundaries         Tally       Frequency   Cumulative Frequency
64 - 67                           63.5 ≤ x < 67.5    ||          2           2
68 - 71                           67.5 ≤ x < 71.5    |||||       5           7
72 - 75                           71.5 ≤ x < 75.5    ||||| |||   8           15
76 - 79                           75.5 ≤ x < 79.5    |||         3           18
80 - 83                           79.5 ≤ x < 83.5    ||          2           20
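
Both frequency tables in this solution can be rebuilt from the raw readings with a few lines of Python. This is only a sketch: the variable names are illustrative, and the class limits are generated on the assumption of equal widths of 4 mmHg starting at 64, as the question specifies.

```python
from collections import Counter
from itertools import accumulate

# Raw diastolic blood pressure readings (mmHg) from Example 2.
readings = [65, 70, 72, 72, 77, 69, 72, 80, 75, 70,
            69, 78, 73, 70, 75, 65, 78, 80, 74, 72]

# Ungrouped table: each distinct value, its frequency and the running total.
counts = Counter(readings)
values = sorted(counts)
freqs = [counts[v] for v in values]
cum_freqs = list(accumulate(freqs))

# Grouped table: classes 64-67, 68-71, ..., 80-83 (equal width of 4).
class_limits = [(lo, lo + 3) for lo in range(64, 84, 4)]
# Boundaries lie midway between adjacent class limits.
boundaries = [(lo - 0.5, hi + 0.5) for lo, hi in class_limits]
grouped_freqs = [sum(lo <= r <= hi for r in readings) for lo, hi in class_limits]
grouped_cum = list(accumulate(grouped_freqs))

# Part a: patients with a reading of 75 mmHg or lower.
at_most_75 = dict(zip(values, cum_freqs))[75]

# Part b: patients with a reading of at least 71.5 mmHg, where 71.5 is the
# upper boundary of the second class.
at_least_71_5 = len(readings) - grouped_cum[1]

print(at_most_75, at_least_71_5)
```

The cumulative frequency column is simply a running total of the frequency column, which is why `accumulate` reproduces it directly once the values are in ascending order.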

Measures of Central Tendency


A measure of central tendency is a measure that indicates a type of average or central
value for the set of data. There are three measures of interest:
The arithmetic mean is the most commonly used type of average. It is only appropriate
for use with data measured using a numerical scale. However, it is extremely sensitive to
outliers and thus is not usually appropriate with skewed data.
The median is appropriate with data where the lowest level of measurement is the ordinal
scale. Unlike the arithmetic mean, it is insensitive to outliers.
The mode can be used with data using any scale of measurement.
The Arithmetic Mean
Ungrouped data
The arithmetic mean for a population is represented by the symbol $\mu$. The arithmetic
mean for a sample drawn from a specific population is represented by the symbol $\bar{x}$. For
an ungrouped set of data
$$\mu = \frac{\sum x}{N} \quad \text{or} \quad \bar{x} = \frac{\sum x}{n}$$
where x represents a particular value in
the set of data, N represents the size of the population and n represents the size of the
sample. For the purposes of this course, most problems concerning the mean will
consider the data as a sample.

Grouped data
For a grouped set of data
$$\mu = \frac{\sum fx}{N} \quad \text{or} \quad \bar{x} = \frac{\sum fx}{n}$$
where x represents the midpoint of a
group, f represents the frequency related to the particular midpoint, $N = \sum f$ represents the total
number of values in the population and $n = \sum f$ represents the total number of values in
the sample.

The Median
Ungrouped data
If the number of values in the set of data is odd then the median is determined by finding
the middle value after first arranging the values in numerical order (whether ascending or
descending). If the number of values in the set of data is even then the median is
determined by finding the arithmetic mean of the two middle values after first arranging
the values in numerical order. (Recall: The middle position can be determined using a
cumulative frequency column).
Grouped data
The formula used to determine the median for grouped data is:
$$\text{Median} = L + \left(\frac{\frac{\sum f}{2} - cf_{m-1}}{f_m}\right) \times c.w.$$
where
L is the lower boundary of the median class
$\sum f = n$ is the number of values in the data set
$cf_{m-1}$ is the cumulative frequency of the class PRECEDING the median class
$f_m$ is the frequency of the median class
c.w. is the difference between the upper and lower boundaries of the median class
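
The grouped median formula can be sketched as a small Python function. The function name `grouped_median` and its argument layout are illustrative assumptions, and the median class is taken to be the first class whose cumulative frequency reaches half the total. The sample call uses the leukaemia age-of-onset data from Example 2 later in these notes.

```python
def grouped_median(lower_boundaries, class_width, freqs):
    """Estimate the median of grouped data with equal class widths.

    lower_boundaries: lower boundary of each class, in ascending order
    class_width:      upper boundary minus lower boundary (c.w.)
    freqs:            frequency of each class
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("the frequencies must sum to a positive number")
    half = n / 2
    cumulative = 0  # cumulative frequency of the classes before the current one
    for lower, f in zip(lower_boundaries, freqs):
        if cumulative + f >= half:
            # This is the median class: apply L + ((n/2 - cf) / f_m) * c.w.
            return lower + (half - cumulative) / f * class_width
        cumulative += f


# Leukaemia data: classes 1-5, 6-10, 11-15, 16-20 with frequencies 5, 9, 8, 3.
median_age = grouped_median([0.5, 5.5, 10.5, 15.5], 5, [5, 9, 8, 3])
print(round(median_age, 1))
```

Working with unrounded intermediate values gives about 9.67 years, which agrees with the 9.7 years obtained by hand in Example 2.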

The Mode
Ungrouped data
For ungrouped data, the mode is the value that occurs most frequently in the set of data. It
is possible to have more than one modal value.

Grouped data
The formula used to determine the mode for grouped data is:
$$\text{Mode} = L + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) \times c.w.$$
where
L is the lower boundary of the modal class
$\Delta_1 = f_m - f_a$ is the difference between the frequencies of the modal class and the class ABOVE
$\Delta_2 = f_m - f_b$ is the difference between the frequencies of the modal class and the class BELOW
c.w. is the difference between the upper and lower boundaries of the modal class
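
A minimal Python sketch of the grouped mode formula follows. The function name `grouped_mode` is illustrative. Two assumptions go beyond the notes: a missing neighbouring class (when the modal class is the first or last row) is treated as having frequency 0, and if several classes tie for the highest frequency only the first is used. The sample call uses the leukaemia data from Example 2 later in these notes.

```python
def grouped_mode(lower_boundaries, class_width, freqs):
    """Estimate the mode of grouped data with equal class widths."""
    # Index of the modal class (first class with the highest frequency).
    m = max(range(len(freqs)), key=freqs.__getitem__)
    f_m = freqs[m]
    # Assumption: a neighbouring class that does not exist has frequency 0.
    f_above = freqs[m - 1] if m > 0 else 0               # row above in the table
    f_below = freqs[m + 1] if m < len(freqs) - 1 else 0  # row below in the table
    delta_1 = f_m - f_above
    delta_2 = f_m - f_below
    if delta_1 + delta_2 == 0:
        raise ValueError("modal class is not distinct from its neighbours")
    return lower_boundaries[m] + delta_1 / (delta_1 + delta_2) * class_width


# Leukaemia data: classes 1-5, 6-10, 11-15, 16-20 with frequencies 5, 9, 8, 3.
modal_age = grouped_mode([0.5, 5.5, 10.5, 15.5], 5, [5, 9, 8, 3])
print(modal_age)
```

Here "above" and "below" refer to the rows immediately above and below the modal class in the frequency table, matching the usage in the worked example.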

Measures of Dispersion
In order to concisely summarise a set of data, in addition to measures of central tendency,
measures of dispersion are required. A measure of dispersion is a measure that can indicate the
variability within a set of data or how the data in a set is spread or dispersed. There are
five measures of interest:
The range
The range is only applicable with ungrouped data. It is determined by finding the difference
between the largest and the smallest values in the set of data. As such, a weakness of the
range is that it only involves two values and does not provide information about how the values
in between vary.
The interquartile range
The interquartile range is determined by finding the difference between the upper quartile and the
lower quartile for the set of data. The lower quartile (Q1) is defined as that value below which
the first 25% of the data lies whilst the upper quartile (Q3) is defined as that value below which
the first 75% of the data lies. It follows from this premise that the median
is actually Q2, that value below which the first 50% of the data lies. In order to determine the
quartiles for ungrouped data, the values must first be arranged in numerical order and then the
value that lies in the $\frac{1}{4}(n+1)$th position represents Q1, the value that lies in the $\frac{1}{2}(n+1)$th position
represents Q2 and the value that lies in the $\frac{3}{4}(n+1)$th position represents Q3.
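
The position rule for quartiles can be sketched in Python as follows. The helper names are illustrative. The notes do not say how to handle a fractional position, so linear interpolation between the two neighbouring values is an assumption made here; it does not change the results whenever the neighbouring values are equal. The sample data is the infant age data used in Example 1 of this section.

```python
def value_at_position(sorted_values, position):
    """Return the value at a 1-based position in an ordered list.

    Assumption: a fractional position is resolved by linear interpolation
    between the two neighbouring values.
    """
    whole = int(position)
    fraction = position - whole
    lower = sorted_values[whole - 1]
    if fraction == 0:
        return lower
    upper = sorted_values[whole]
    return lower + fraction * (upper - lower)


def quartiles(values):
    """Return (Q1, Q2, Q3) using the k(n + 1)/4 position rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 3:
        raise ValueError("at least three values are needed for this rule")
    return tuple(value_at_position(ordered, k * (n + 1) / 4) for k in (1, 2, 3))


ages = [7, 8, 8, 9, 11, 8, 9, 8, 9, 9]
q1, q2, q3 = quartiles(ages)
print(q1, q2, q3, "IQR =", q3 - q1)
```

For these ten values the positions are 2.75, 5.5 and 8.25, giving Q1 = 8, Q2 = 8.5 and Q3 = 9.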
The variance & the standard deviation
The variance is defined as the second moment about the mean of a distribution. Due to the
variance being a squared measure, of more importance is the standard deviation. The standard
deviation is determined by finding the square root of the variance. As such, the units for the
standard deviation would be the same as the units used for the data. By definition, the standard
deviation is that measure that gives an indication of how data is clustered (spread) around the
arithmetic mean $\bar{x}$. Since all the values in the data set are used to determine the standard
deviation, it is the most commonly used measure of variability.

For a population, the formulae used to determine the variance and the standard deviation are:
$$\sigma^2 = \frac{\sum f(x - \mu)^2}{N} \quad \text{and} \quad \sigma = \sqrt{\frac{\sum f(x - \mu)^2}{N}}$$
respectively.
For a sample, the formulae used to determine the variance and the standard deviation are:
$$s^2 = \frac{\sum f(x - \bar{x})^2}{n - 1} \quad \text{and} \quad s = \sqrt{\frac{\sum f(x - \bar{x})^2}{n - 1}}$$
respectively. Once again, for the purposes of this
course, the problems concerning variance and standard deviation will concern a sample. When
considering grouped data, the values for x are the midpoints of the classes.

The coefficient of variation


The coefficient of variation (CV) is a standardised measure of spread which is mainly used to
compare the variability in more than one data set. It is usually expressed as a percentage. The
formula used to evaluate the coefficient of variation is:
$$CV = \frac{\text{Standard deviation}}{\text{Arithmetic mean}} \times 100\%$$
Example 1
The set of data below concerns the age (months) when the first word was uttered for a sample of
10 infants from a community:
7  8  8  9  11  8  9  8  9  9

a. Determine:
   i.    The mean
   ii.   The median
   iii.  The mode
   iv.   The range
   v.    The interquartile range
   vi.   The variance
   vii.  The standard deviation
   viii. The coefficient of variation

b. If another sample of infants from the community resulted in the mean age being 9.2
   months with a standard deviation of 2.8 months, comment on the difference in the
   variability for the two sets of data.
Solution 1
a.
i. Mean age:
$$\bar{x} = \frac{\sum x}{n} = \frac{7 + 8 + 8 + 9 + 11 + 8 + 9 + 8 + 9 + 9}{10} = \frac{86}{10} = 8.6 \text{ months}$$

ii. Median age:
When arranged in ascending order the set of data is:
7  8  8  8  8  9  9  9  9  11
$$\text{Median age} = \frac{8 + 9}{2} = \frac{17}{2} = 8.5 \text{ months}$$

iii. Modal ages = 8 months and 9 months

iv. Range = 11 - 7 = 4 months

v. Interquartile range:
When arranged in ascending order the set of data is:
7  8  8  8  8  9  9  9  9  11
Q1 is in the $\frac{1}{4}(10 + 1) = \frac{1}{4}(11) = 2.75$th position. Thus Q1 = 8
Q3 is in the $\frac{3}{4}(10 + 1) = \frac{3}{4}(11) = 8.25$th position. Thus Q3 = 9
Interquartile range = 9 - 8 = 1 month

On organising the data in a frequency distribution, we get the following:

Age (months) x   f
7                1
8                4
9                4
11               1
                 Σf = 10

In order to determine the variance and the standard deviation, three
additional columns should be constructed.

Age (months) x   f         x - x̄    (x - x̄)²   f(x - x̄)²
7                1         -1.6     2.56       2.56
8                4         -0.6     0.36       1.44
9                4         0.4      0.16       0.64
11               1         2.4      5.76       5.76
                 Σf = 10                       Σf(x - x̄)² = 10.4

vi. Variance:
$$s^2 = \frac{\sum f(x - \bar{x})^2}{n - 1} = \frac{10.4}{10 - 1} = \frac{10.4}{9} \approx 1.2 \text{ months}^2$$

vii. Standard deviation:
$$s = \sqrt{\frac{\sum f(x - \bar{x})^2}{n - 1}} = \sqrt{\frac{10.4}{9}} \approx \sqrt{1.2} \approx 1.1 \text{ months}$$

viii. Coefficient of variation:
$$CV = \frac{\text{Standard deviation}}{\text{Arithmetic mean}} \times 100\% = \frac{1.1}{8.6} \times 100\% \approx 12.8\%$$

b. The coefficient of variation for the second sample is
$$CV = \frac{2.8}{9.2} \times 100\% \approx 30.4\%$$
Since the coefficient of variation for the second sample is much larger than that for
the first sample, it can be concluded that the variability in the second sample is
greater. That is, the values in the second set of data show greater spread.
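
The measures in this example can be cross-checked with Python's standard `statistics` module, as in the sketch below. The variable names are illustrative. One difference from the hand calculation should be expected: the code keeps the unrounded standard deviation (about 1.075 months), so its coefficient of variation comes out near 12.5%, whereas the worked solution rounds the standard deviation to 1.1 first and obtains 12.8%.

```python
import statistics

# Age (months) at which the first word was uttered, for the 10 infants.
ages = [7, 8, 8, 9, 11, 8, 9, 8, 9, 9]

mean = statistics.mean(ages)
median = statistics.median(ages)
modes = statistics.multimode(ages)     # every value tied for most frequent
data_range = max(ages) - min(ages)
variance = statistics.variance(ages)   # sample variance (divisor n - 1)
std_dev = statistics.stdev(ages)       # sample standard deviation
cv = std_dev / mean * 100              # coefficient of variation, in percent

# Second sample from part b: mean 9.2 months, standard deviation 2.8 months.
cv_second = 2.8 / 9.2 * 100

print(f"mean={mean}, median={median}, modes={modes}, range={data_range}")
print(f"variance={variance:.2f}, sd={std_dev:.2f}, CV={cv:.1f}%")
print(f"CV of second sample={cv_second:.1f}%")
```

Either way the conclusion of part b is unchanged: the second sample has a coefficient of variation more than twice that of the first.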

Example 2
Given the set of data below concerning the age (years) for the onset of leukaemia,
determine:
a. The mean age of onset
b. The median age of onset
c. The modal age of onset
d. The standard deviation in the ages of onset

Age of onset   f
1 - 5          5
6 - 10         9
11 - 15        8
16 - 20        3
               Σf = 25

Solution 2
For each section, the additional columns required in the frequency table will be
indicated but note that one table showing all columns can be provided.
a. The additional columns required to determine the mean are the midpoint x column
   and the fx column:

Age of onset   f         Midpoint x           fx
1 - 5          5         (1 + 5)/2 = 3        15
6 - 10         9         (6 + 10)/2 = 8       72
11 - 15        8         (11 + 15)/2 = 13     104
16 - 20        3         (16 + 20)/2 = 18     54
               Σf = 25                        Σfx = 245

$$\text{Mean age} = \bar{x} = \frac{\sum fx}{\sum f} = \frac{245}{25} = 9.8 \text{ years}$$

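b. The additional columns required to determine the median are the boundaries column
   and the cumulative frequency column:

Age of onset   Boundaries     f    Cumulative frequency
1 - 5          0.5 - 5.5      5    5
6 - 10         5.5 - 10.5     9    5 + 9 = 14
11 - 15        10.5 - 15.5    8    14 + 8 = 22
16 - 20        15.5 - 20.5    3    22 + 3 = 25

Since half of the total frequency is 25/2 = 12.5 and the cumulative frequency first
reaches 12.5 in the class 6 - 10, this is the median class.
$$\text{Median age} = L + \left(\frac{\frac{\sum f}{2} - cf_{m-1}}{f_m}\right) \times c.w. = 5.5 + \left(\frac{\frac{25}{2} - 5}{9}\right) \times 5 = 5.5 + \frac{7.5}{9} \times 5 \approx 5.5 + 4.2 = 9.7 \text{ years}$$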

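c. The additional column required to determine the mode is the boundaries column:

Age of onset   Boundaries     f
1 - 5          0.5 - 5.5      5
6 - 10         5.5 - 10.5     9
11 - 15        10.5 - 15.5    8
16 - 20        15.5 - 20.5    3
               Σf = 25

The modal class is 6 - 10 since it has the highest frequency (9).
$$\text{Modal age} = L + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) \times c.w. = 5.5 + \left(\frac{9 - 5}{(9 - 5) + (9 - 8)}\right) \times 5 = 5.5 + \frac{4}{4 + 1} \times 5 = 5.5 + 4 = 9.5 \text{ years}$$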

d. The additional columns required to determine the standard deviation are the
   x, fx, x - x̄, (x - x̄)² and f(x - x̄)² columns:

Age of onset   f         x     fx          x - x̄    (x - x̄)²   f(x - x̄)²
1 - 5          5         3     15          -6.8     46.24      231.2
6 - 10         9         8     72          -1.8     3.24       29.16
11 - 15        8         13    104         3.2      10.24      81.92
16 - 20        3         18    54          8.2      67.24      201.72
               Σf = 25         Σfx = 245                       Σf(x - x̄)² = 544

$$\text{Standard deviation} = s = \sqrt{\frac{\sum f(x - \bar{x})^2}{n - 1}} = \sqrt{\frac{544}{25 - 1}} = \sqrt{\frac{544}{24}} \approx \sqrt{22.7} \approx 4.8 \text{ years}$$
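
The grouped mean and standard deviation from parts a and d can be reproduced with the Python sketch below. The variable names are illustrative; as in the worked solution, each class is represented by its midpoint and the sample divisor n - 1 is used.

```python
from math import sqrt

# Age of onset (years): class limits and frequencies from Example 2.
class_limits = [(1, 5), (6, 10), (11, 15), (16, 20)]
freqs = [5, 9, 8, 3]

# Each class is represented by its midpoint x.
midpoints = [(lo + hi) / 2 for lo, hi in class_limits]

n = sum(freqs)                                   # total frequency
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
mean = sum_fx / n                                # grouped mean

# Sum of f(x - mean)^2, then the sample variance and standard deviation.
sum_sq_dev = sum(f * (x - mean) ** 2 for f, x in zip(freqs, midpoints))
variance = sum_sq_dev / (n - 1)
std_dev = sqrt(variance)

print(f"mean={mean:.1f}, sum of squares={sum_sq_dev:.0f}, sd={std_dev:.1f}")
```

Because the midpoints stand in for the unknown individual ages, both results are estimates rather than exact values for the underlying data.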

Measures of Deviation from Normality


A measure of deviation from normality is a measure that can indicate how data in a set deviate
from the shape of a normally distributed set of data. One such measure of deviation is skewness.
The shape for a normally distributed set of data is bell-shaped and symmetrical. For this
distribution, the values for the mean, median and the mode coincide. When deviation from this
shape exists because the left tail is pulled in the negative direction, then the shape is said to have
a negative skew. The data for this distribution consists mostly of larger values and the measures
of central tendency uphold the relationship: mean < median < mode. When deviation from this
shape exists because the right tail is pulled in the positive direction, then the shape is said to have
a positive skew. The data for this distribution consists mostly of smaller values and the measures of
central tendency uphold the relationship: mode < median < mean. Figure 1 illustrates these
shapes.
Figure 1: Symmetric (normal) distribution, negatively skewed distribution and positively skewed distribution

In order to measure skewness, Pearson's coefficient of skewness can be determined using the
formula:
$$sk = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}} \approx \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}}$$
This relationship is valid since the
median generally lies between the mode and the mean, upholding the relationship
$\text{Mean} - \text{Mode} \approx 3(\text{Mean} - \text{Median})$. In addition, the value for Pearson's coefficient of
skewness must uphold the inequality $-3 \le sk \le 3$ and if
sk < 0, then the distribution is negatively skewed;
sk = 0, then the distribution is symmetrical;
sk > 0, then the distribution is positively skewed.
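
Both forms of Pearson's coefficient can be sketched in Python as below. The function names are illustrative. The inputs are the rounded results for the leukaemia data in Example 2 (mean 9.8, median 9.7, mode 9.5, standard deviation 4.8 years); the notes do not compute a skewness for that data, so this pairing is only an illustration.

```python
def pearson_skew_mode(mean, mode, std_dev):
    """Pearson's coefficient of skewness using the mode."""
    return (mean - mode) / std_dev


def pearson_skew_median(mean, median, std_dev):
    """Pearson's coefficient of skewness using the median."""
    return 3 * (mean - median) / std_dev


def describe_skew(sk, tolerance=1e-9):
    """Classify a distribution by the sign of its skewness coefficient."""
    if sk < -tolerance:
        return "negatively skewed"
    if sk > tolerance:
        return "positively skewed"
    return "symmetrical"


# Illustrative inputs: rounded results for the leukaemia data in Example 2.
sk_mode = pearson_skew_mode(9.8, 9.5, 4.8)
sk_median = pearson_skew_median(9.8, 9.7, 4.8)
print(round(sk_mode, 4), round(sk_median, 4), describe_skew(sk_mode))
```

With these inputs the two forms agree, since Mean - Mode = 0.3 equals 3(Mean - Median) = 0.3; in general they only agree approximately.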

Graphical Illustrations
A graphical illustration provides an image that can be used as a representation for a set of data.
There are various graphical illustrations that can be employed but it is crucial that the selected
illustration is appropriate for the type of data.
Qualitative data
For qualitative (categorical) data, use of the following graphical illustrations is
appropriate:
1. Bar charts
2. Pie charts
Quantitative discrete data
For quantitative discrete data, use of the following graphical illustrations is appropriate:
1. Bar charts
2. Pie charts
3. Line graphs
4. Dot charts
5. Stem-and-leaf plots
6. Box-and-whisker plots
Quantitative continuous data
For quantitative continuous data, use of the following graphical illustrations is
appropriate:
1. Histograms
2. Stem-and-leaf plots
3. Box-and-whisker plots
4. Cumulative frequency curves

Example 1
Use a pie chart to represent the data below concerning the age of patients who registered at a
clinic.
Age group   Frequency
0 - 10      23
11 - 18     19
19 - 25
26 - 35
36 - 50
Over 50     14

Solution 1
Since the data concerns qualitative data, a pie chart is suitable.
Figure: Pie chart showing distribution for ages (legend: age groups 0 - 10 years, 11 - 18 years, 19 - 25 years, 26 - 35 years, 36 - 50 years and over 50 years)

Example 2
Represent the data below using a suitable chart.
                       Diagnosed Disease
Symptom                Influenza   Measles   Chicken Pox
Fever above 101 °F     24          8         2
Excessive Itching      3           6         7

Solution 2
Since the data concerns qualitative data, a clustered bar chart is suitable.
Figure: Clustered bar chart of frequency (vertical axis) against symptom (horizontal axis: Fever above 101 °F, Excessive itching), with one bar for each diagnosed disease (Influenza, Measles, Chicken Pox)

Example 3
A drug company held a clinical trial in order to determine the effect of a new drug on various
participants. The reaction times (minutes) for the drug to take effect for 27 participants are
recorded below:
20 34 11 5 28 22 43 37 50
49 10 40 47 13 20 25 8 29
41 21 16 12 38 15 23 36 42
Use a box-and-whisker plot (box plot) to illustrate the information.
Solution 3
A box plot provides a five-point summary for a set of data. The data set must first be arranged in
numerical order:
5  8  10  11  12  13  15  16  20
20  21  22  23  25  28  29  34  36
37  38  40  41  42  43  47  49  50

The following five points must then be ascertained:
1. Minimum value = 5
2. Q1 = 15, which is the measure in the $\frac{1}{4}(27 + 1) = \frac{1}{4}(28) = 7$th position
3. Median = 25, which is the measure in the $\frac{1}{2}(27 + 1) = \frac{1}{2}(28) = 14$th position
4. Q3 = 40, which is the measure in the $\frac{3}{4}(27 + 1) = \frac{3}{4}(28) = 21$st position
5. Maximum value = 50

The box plot should be constructed where the length of the plot (including the whiskers)
represents the range of the distribution and the length of the box represents the interquartile
range. The median divides the box into two sections.
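
The five points behind the box plot can be computed with the Python sketch below. The names are illustrative, and the shortcut of reading values straight from whole-numbered positions is an assumption that only holds here because n + 1 = 28 is divisible by 4.

```python
# Reaction times (minutes) for the 27 participants in Example 3.
times = [20, 34, 11, 5, 28, 22, 43, 37, 50,
         49, 10, 40, 47, 13, 20, 25, 8, 29,
         41, 21, 16, 12, 38, 15, 23, 36, 42]

ordered = sorted(times)
n = len(ordered)

# Assumption: (n + 1) is divisible by 4, so every quartile position is whole.
if (n + 1) % 4 != 0:
    raise ValueError("fractional positions are not handled by this sketch")


def value_at(position):
    """Return the value at a 1-based position in the ordered data."""
    return ordered[position - 1]


five_point_summary = {
    "minimum": ordered[0],
    "Q1": value_at((n + 1) // 4),       # 7th position
    "median": value_at((n + 1) // 2),   # 14th position
    "Q3": value_at(3 * (n + 1) // 4),   # 21st position
    "maximum": ordered[-1],
}

# Length of the box and of the whole plot, respectively.
interquartile_range = five_point_summary["Q3"] - five_point_summary["Q1"]
data_range = five_point_summary["maximum"] - five_point_summary["minimum"]

print(five_point_summary, interquartile_range, data_range)
```

The interquartile range gives the length of the box and the range gives the length of the whole plot including the whiskers.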
Figure: Box-and-whisker plot showing reaction times (axis: Reaction time (minutes))
