
Module 1: Statistical Theory and Introduction to R

Contents
1 Introduction
  1.1 Learning objectives
  1.2 Required reading for Module 1
  1.3 Unit overview
  1.4 Module 1 overview

2 Defining and displaying data
  2.1 Required reading
  2.2 Graphical summaries for single variables

3 Populations and samples
  3.1 Required reading
  3.2 Sampling
  3.3 Probability
  3.4 Probability distributions

4 Summarising distributions
  4.1 Required readings
  4.2 Key summaries
  4.3 Basic summaries using R
  4.4 Confidence interval calculations in R
  4.5 Error and bias

5 Statistical inference
  5.1 Required Readings
  5.2 Estimation
  5.3 Hypothesis testing

6 Additional resources for Module 1
  6.1 Statistical concepts
  6.2 Getting started with R
  6.3 Defining and displaying data
  6.4 Populations and samples
  6.5 Statistical Inference

7 Exercises

1 Introduction

1.1 Learning objectives

This Module aims to enable you to:


1. Recognise the difference between different variable types.
2. Use R to produce graphical summaries.
3. Use R to generate summary statistics.
4. Explain the difference between population parameters and sample statistics.
5. Explain the effect of sample size on the sampling distribution of the mean.
6. Determine if data are normally distributed.
7. Calculate and interpret confidence intervals.
8. Recognise possible sources of error and bias that can make a sample unrepresentative of the population.
9. Explain the process of statistical inference.

1.2 Required reading for Module 1

This module draws on Chapters 1 to 8 and Chapter 14 of Essential Medical Statistics, 2nd Ed., by Betty Kirkwood and Jonathan Sterne.

1.3 Unit overview

Biostatistics is concerned with applying statistical principles and methods to issues in public health,
medicine, and epidemiology. This includes:
• Designing studies
• Collecting data
• Presenting summary statistics and graphs
• Detailed statistical analysis
• Interpreting and making inferences about populations

1.4 Module 1 overview

Our goal in CAM625 Introduction to Biostatistics is to introduce statistical techniques for examining
exposure-outcome associations. We will focus on regression modelling which is the most common
method of analysis. To get there we first cover the basic principles of statistics (Module 1), data
management principles (Module 2), and then focus on two regression techniques: linear regression
(Module 3) and logistic regression (Module 4). This course will be limited to methods for variables
where the response (outcome) is either continuous (scaled) or binary (two-level category). Methods
for other variable types will be covered in CAM627 Extended Epidemiology and Biostatistics.

Module 1 provides an introduction to fundamental biostatistical concepts whilst introducing coding
using R statistical software (more on this below). The required readings for this module are Chapters
1 to 8 and Chapter 14 of the text, Essential Medical Statistics by Betty Kirkwood and Jonathan
Sterne. This text is one of the more accessible biostatistics textbooks, and readings from it are
supplemented with the notes below, which expand or clarify some concepts, and introduce relevant
R code. These module notes should be used in conjunction with the textbook. In addition, there
are support materials for using R, including R tutorial videos in the Module 1 content folder. At the
end of the module, we also provide web links to additional readings, resources, and information that
you may find helpful. We include exercises that will help you consolidate your understanding of the
theoretical concepts and your knowledge of R. It will be helpful to you and your fellow students
to discuss the answers to these exercises on the discussion boards provided. The course tutor and
lecturers will engage on the discussion boards as required but, in the first instance, you should discuss relevant issues and attempt to answer questions as you would in a face-to-face tutorial.
R software
This course uses the R statistical package which is powerful, versatile, comprehensive, and free. In
addition, it provides a code-based method of analysis, as opposed to “point and click”. Advantages
of code-based analysis are:
• efficiency: saved code can easily be re-run without the need to repeat long sequences
of data management and analysis steps every session
• repeatability: analysis can be replicated, and errors more readily detected and
corrected
• transparency: code can be shared with colleagues and reviewers.
Once you learn to use this statistical package it is unlikely that you will ever need any other software
for summarising, graphical presentation of data, statistical analysis, or data management. Some
of you will find this software challenging, and while we have tried hard to make it accessible, it is
not possible to tailor this course to your individual requirements, so you will need to be resourceful.
The challenge of learning this software is eloquently summed up by Hadley Wickham (a well-respected statistician and writer of R packages/add-ons) during his “dplyr” tutorial at the useR2014 conference:
“the bad news is whenever you’re learning a new tool, for a long time you are going to suck, it’s going to be very frustrating, but the good news is that is typical, it happens to everyone and it’s only temporary. Unfortunately, there is no way of going from knowing nothing about a subject to knowing something about a subject and being an expert in it, without going through a period of great frustration and much suckiness.”
We teach the use of R software through examples. We recommend initially reading through the R
code presented in the examples in Module 1, observing the logic and structure of the code. Once
you comprehend these examples, we suggest replicating them yourself to get some practice using
R, and then move on to complete the exercises within each module. Whilst we encourage you to
seek out new code through books and online resources, this is not a requirement of the course. It
is sufficient to understand the code we have shown you, and to adapt it to complete the analyses
required in the course. You should be able to complete all the required tasks using code that is
provided in the notes. There are often several ways of completing the same task but if you are new
to coding we recommend you stick with the methods that we demonstrate. Information to help you
get started with R and RStudio is provided at the end of the module notes and in the Module 1 content folder on MyLO, which also contains a series of R video labs. We suggest that you follow the instructions and install R and RStudio before you begin to read the notes.
The assessment for Module 1 consists of two quizzes, with multiple-choice and short-answer questions that test your understanding of statistical theory and how to apply it using R.

2 Defining and displaying data

Summarising and displaying data is an essential starting point for any statistical analysis. This
is how statisticians get a “feel” for a dataset they are working with. These techniques are also
used for checking the consistency of the data, something that will be covered in more detail in
Module 2. Study data typically consists of observations, measurements, or questionnaire responses
for an individual or unit. Depending on the context of the study, any of the following could be
thought of as an “individual” or unit: a person, a mouse, a family, a biological cell, a hospital, etc.
Properties that are recorded (observations/measurements/responses) for each unit are known as
variables. Formally, in statistics, we make a distinction between variable types, and each variable in
a study is classified as belonging to one of the following types:
• continuous (scaled), takes on any value along a continuum, eg. height and weight; blood
measurements such as haemoglobin, pH, and glucose; lung measurements such as FEV1 and
FVC;
• binary (dichotomous), can only be one of two possible values or states, eg. positive or
negative for a test or trait; exposed or not exposed; diseased or not diseased; alive or dead;
• discrete unordered, more than two possible values with no natural ordering, also called
categorical, nominal, or multinomial, eg, marital status, ethnicity, area of residence;
• discrete ordered, more than two possible values/categories, which have a natural ordering,
also called ordinal, eg. BMI categories, socio-economic status, education level.
It is important to distinguish between these types of variables because the type determines how the data are analysed when they are treated as outcome variables. For example, binary and discrete (ordered or unordered) variables are analysed in different ways when they are outcome variables, but when these data are summarised they are usually lumped together as “categorical” and treated in the same manner.
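To make these types concrete, the following sketch (with made-up values) shows how each variable type is commonly stored in R; the variable names and values are purely illustrative.

height <- c(171.5, 164.0, 180.2) #continuous (scaled): a numeric vector
diseased <- c(TRUE, FALSE, FALSE) #binary: a logical vector (or a two-level factor)
ethnicity <- factor(c("A", "B", "A")) #discrete unordered: a factor
bmi_cat <- factor(c("normal", "obese", "overweight"),
    levels = c("normal", "overweight", "obese"),
    ordered = TRUE) #discrete ordered: an ordered factor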
The first step in presenting the results of a scientific study is to prepare a table of participant
characteristics, such as that shown below in Figure 1. It is usually presented as Table 1 in an
academic publication. This table displays summary statistics, presented separately for males and
females (stratified by sex), for a small selection of variables from a study of respiratory outcomes
for a sample of Chronic Obstructive Pulmonary Disease (COPD) patients. The purpose of the
table is to provide summary information about the number and type of participants in the study; it
provides a description of the study sample. This informs interpretation of the results with respect to
whom the findings apply, and the generalisability of the findings. Continuous variables are typically
summarised as the mean and standard deviation (SD) when they are normally distributed (more
about this later) or by the median and interquartile range when the distribution is skewed. In the
example below, age and height, and the respiratory measures FEV1, FVC and FEV1 % predicted
are summarised as mean (SD). Binary and categorical variables are summarised as the frequency
and percentage for each level of the variable, eg. marital status and highest education. Ordinal
variables may also be summarised by frequency and percentage for each level; in the table shown below, MRC dyspnoea is a 1-5 scale used to establish clinical grades of breathlessness for patients with COPD. It ranges from 1 “Breathless only with strenuous exercise” to 5 “Too breathless to leave the house”. Where clinical guidelines or clinical practice specify thresholds or cut-points of significance, ordinal or continuous variables may be presented as derived variables. Table 1 will then display the frequency and percentage of cases in each category of the derived variable. See section 2.3
in Kirkwood and Sterne for examples.
In the following sections we demonstrate how to prepare graphical and numerical summaries using
R.

Figure 1: Example of a table of participant characteristics

2.1 Required reading

Kirkwood & Sterne: Chapters 2 and 3

2.2 Graphical summaries for single variables

There are numerous graphical methods available to summarise data, ranging from simple bar charts
and scatter plots available in Microsoft Excel, through to complex data visualisations in statistical
software such as R. In this module we will restrict our graphical summaries to the following methods.
• Boxplots.
• Histograms.
• Bar plots.
Other graphical summaries are shown in Kirkwood and Sterne, but they do not provide much more insight into the data than the basic ones covered here. Figure 3.2 in Kirkwood and Sterne shows how to create pie charts. We do not recommend this method of graphical display: the areas of a pie chart's segments are harder to compare than the lengths of bars in a bar plot. Please do not use pie charts in this course, and we strongly
urge you to avoid using them in scientific work.

2.2.1 Boxplot

A useful graphical plot for summarising the distribution of a scaled (continuous) variable is the
boxplot also known as a box and whiskers plot. Boxplots depict both the distribution (the spread)
and centrality (the middle) of scaled data. A boxplot consists of: a box bordered by the 25th and 75th percentiles (the 1st and 3rd quartiles), with a line inside the box showing the median; whiskers extending beyond the box to cover the expected range of the data; and outliers, individual points plotted beyond the whiskers. As we will see later, boxplots are
extremely useful for data checking. Figure 2 displays a boxplot generated in R using the boxplot()
function. The boxplot summarises age in years for 636 children aged 7-10 years from the Peru lung
function study (described briefly on page 27 in K&S). The figure is annotated to show the important
features of a boxplot. These are also described on page 24 of Kirkwood & Sterne. Figure 3 displays
a boxplot of FEV1 from the same study. Notice the small circles at the top and bottom of the figure.
These are outliers in the data: values that fall more than 1.5 times the interquartile range (IQR: 3rd quartile minus 1st quartile) beyond the box. We will consider outliers further in Module 3.

Figure 2: Boxplot of age in years for children in the Peru lung function study

Figure 3: Boxplot of FEV1 for children in the Peru lung function study

In the example shown below, we create a fictitious variable (or vector) in R, called “dt”. The
variable contains the values 4, 5, 6, 5, 2, 9, 5, 19, 4, 3. We then use the boxplot()
function to produce a boxplot shown in Figure 4. Notice the value of 19 is an extreme outlier! We
will manipulate this variable in the Exercises at the end of this module.

dt <- c(4, 5, 6, 5, 2, 9, 5, 19, 4, 3) #create a vector called dt


dt # print the vector dt to the screen

[1] 4 5 6 5 2 9 5 19 4 3
boxplot(dt) #use the boxplot function to create a boxplot of dt

Figure 4: Boxplot of “dt” variable
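If you want to see exactly which values boxplot() treats as outliers, one way (not shown in the textbook) is the base R function boxplot.stats(), applied here to the dt vector created above:

boxplot.stats(dt)$stats #whisker end, lower quartile (hinge), median, upper quartile (hinge), whisker end

[1] 2 4 5 6 9
boxplot.stats(dt)$out #values plotted individually as outliers

[1] 19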

2.2.2 Histogram

The histogram is another graphical method for displaying the distribution of a continuous (scaled)
variable. The range of values is divided into a specified number of equally spaced “bins” on the x-axis, and the frequency of values in each bin is displayed on the y-axis. The bar length in a histogram is directly proportional to the frequency of values falling within the range covered by each bar. In R a histogram is produced using the hist() function. R will decide on the number of bins to use, but this can be changed with the breaks = argument. Figure 5 shows a histogram of age from the Peru lung function study. We can see that the bins are 0.5 years wide.
Figure 6 displays a histogram for the “dt” variable created in the section above using the command
hist(dt) :

Figure 5: Histogram of age from the Peru lung function study.

dt

[1] 4 5 6 5 2 9 5 19 4 3
hist(dt)

Figure 6: Histogram of “dt” variable using default settings

Examine Fig. 6. You will see that the left-most bar is 7 units high and covers data in the range > 0 to ≤ 5, i.e. the vector dt contains seven values between zero and five. The second bar (from the left) indicates that there are two values between > 5 and ≤ 10, and the last bar indicates one value between > 15 and ≤ 20. The hist() function has automatically divided the data into ranges of values, called bins, and has chosen the bin width, in this case 5 units. However, this may not be the best way to visualise the data. To control the width of the bins we can use the breaks= argument, which requires a vector of numbers defining the borders of the bins. The seq() function, used below, creates a vector of numbers from 0 to 20 increasing by 2. These numbers become the borders of the bins, so that each bin is two units wide, as shown in Figure 7.

hist(dt, breaks = seq(0, 20, 2))

Figure 7: Histogram of “dt” variable with user-defined “breaks”
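If you want to check which bin borders the seq() call above produces, you can print the vector directly:

seq(0, 20, 2) #the borders of the bins used in Figure 7

[1]  0  2  4  6  8 10 12 14 16 18 20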

2.2.3 Bar plot/bar chart/bar diagram

A bar plot (as distinct from a histogram) is a method for summarising the frequency distribution of
categorical (nominal or ordinal) data. The bar plot is drawn with one axis containing the names of
the categories and the other axis showing the frequency. Bars extend from the names axis with
their length parallel to the frequency axis. Bars are typically separated by a space so they do
not imply continuity. Figure 8 shows a bar plot of age group categories for subjects in a study of
Onchocerciasis (‘river blindness’) in Sierra Leone. The study is briefly described on page 191 of
Kirkwood and Sterne.
Below we provide the code used to read in the Onchocerciasis data, to prepare the data for plotting,
and to create the bar plot. For the moment, read the code and the accompanying annotation to
follow the process. You will encounter more of this type of coding as we progress through the unit.
Note the use of the factor() function, which allows us to create a new “factor” variable for age group. This variable contains the original data, with the categories ordered as defined by the levels= argument and labelled via the labels= argument. This ensures the bars are plotted in the desired order and labelled accordingly.

# Read in the data, specifying the path where the file is


# stored
onchodata <- read.csv("S:/Menzies/Biostatistics/MPH/KirkwoodSterneData/oncho_ems.csv")
# Create a factor variable for age groups
onchodata$agegrp_f <- factor(onchodata$agegrp, levels = c(0,
1, 2, 3), labels = c("5-9", "10-19", "20-39", ">=40"))
# Create a bar plot with labelled axes
barplot(table(onchodata$agegrp_f), xlab = "Age group", ylab = "Frequency")
Figure 8: Bar plot of age group categories for subjects in the Onchocerciasis study
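If you do not have access to the Onchocerciasis data file, the same steps can be tried on a small made-up vector. The smoking-status data below are purely hypothetical and are only intended to let you practise the table() and barplot() pattern used above.

# hypothetical smoking-status data for ten participants
smoke <- factor(c("never", "ex", "current", "never", "never",
    "ex", "current", "never", "ex", "never"),
    levels = c("never", "ex", "current"))
table(smoke) #frequency of each category
barplot(table(smoke), xlab = "Smoking status", ylab = "Frequency")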

Exercises

We suggest that now would be a good time to try some of the example code provided
above, and to work through Exercises 1.1 to 1.5 which you will find at the end of the notes. If you
need some help with R, watch the R video labs and look at the resources for R in the Module 1
content folder on MyLO.

3 Populations and samples

In statistics it is rare that we have information about every member of a population (defined below).
Typically, we are only ever working with a sample (subset) of its members. Our goal in research is
to make decisions about the entire population, but only under certain circumstances is the sample
useful for this purpose. This section examines the link between samples and populations, and how it is possible to make decisions about a population based on a sample of its members.

3.1 Required reading

Kirkwood and Sterne: Chapters 5 and 14 (sections 14.1 to 14.5)

3.2 Sampling

Section 2.1 of Kirkwood and Sterne briefly introduces the concepts of populations and samples.
These concepts are extended in Chapters 4 and 5, where sampling error and the normal distribution are
introduced. In this section of the notes we provide additional information about populations and
sampling along with an exercise to demonstrate the effect of sample size on a sampling distribution.
A population is the collection of all subjects or objects of scientific interest. Some examples of
populations of interest are: hospitals in Australia, adult male diabetics in Tasmania, families in
Hobart with a history of breast cancer. The population of interest is determined by the research
questions. The distribution of a characteristic (a variable) for all members of a population is referred
to as a population distribution. Examples of population distributions include, members of the
population who have ever smoked, BMI for all members of the population, and survival times
for members of the population who have been diagnosed with lung cancer. A parameter is a
summary value that is used to describe or summarise the population. The population mean is one
such parameter, another is the population proportion. Continuing the examples from above the
proportion of ever smokers, the mean BMI, and the mean survival time would be parameters of the
population.
It is usually not possible to measure the entire population of interest, and determine the popula-
tion parameter for each characteristic, thus a representative sample is taken. The measured
characteristics of the sample are then used to make statistical inferences about the population
distribution. The word inference here is important; we infer the mean of the population from the
mean of the sample. Making inferences about unknown population parameters based on sample
statistics is an important application of statistics.
The sampling distribution refers to the distribution of sample statistics of a given sample size
(n). It is a crucial concept in statistics. Section 4.5 of Kirkwood and Sterne introduces sampling variation and the sampling distribution, and provides a simple example. A sample statistic is a summary value that is calculated to describe the sample; the sample mean is one such statistic. A sampling distribution is a theoretical construct built up of many samples (taken from the same population) of the same size, each used to calculate the same
sample statistic. It is a theoretical construct because in practice when conducting a study we never
take more than one sample from a population, but this theory is used to describe the variability of
the sample statistic. The sample statistic does have some inherent variability due to sampling, so it
is not necessarily equal to the population parameter to which it is related. The standard deviation
of a sampling distribution of a statistic, such as the sample mean, is referred to as the standard
error of the statistic: it is a measure of how precisely the population parameter is estimated by the
sample statistic.

3.2.1 Important properties of the sampling distribution of the sample mean

1. The sample mean is an unbiased estimator of the population mean (on average, the sample mean is correct; the mean of the sample means will closely approximate the population mean).
2. The standard error of the sample mean becomes smaller as the sample size increases (as more of the population is included in the sample).
3. The sampling distribution of the sample mean becomes more like a “normal” distribution as
the sample size increases.
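These properties can be illustrated with a small simulation in R. The sketch below is not from the textbook; the population values (mean 100, SD 15) and the sample size of 10 are purely illustrative, chosen to match the settings used in the Shiny app exercise later in this module.

set.seed(1) #make the pseudorandom draws reproducible
# draw 1000 samples of size n = 10 from a normal population and
# calculate the mean of each sample
sample_means <- replicate(1000, mean(rnorm(10, mean = 100, sd = 15)))
mean(sample_means) #close to 100: the sample mean is unbiased (property 1)
sd(sample_means) #close to 15/sqrt(10) = 4.74, the standard error (property 2)
hist(sample_means) #roughly bell-shaped, i.e. approximately normal (property 3)

Re-running the simulation with a larger sample size (for example, replacing 10 with 100) shows the standard error shrinking and the histogram of sample means becoming smoother and more symmetrical.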

3.2.2 Representative and random sampling

A representative sample is one that represents the population. This may seem rather obvious but
what is not obvious is how to get a representative sample. Using probabilistic sampling is a way
to get a representative sample: this means that the probability of selecting each member of the
population is known. There are a number of probabilistic sampling techniques: mostly these rely on
some form of random sampling.
A simple random sample is a probability sample. The probability of selecting each member of the
population is exactly the same in a truly random sample. This is a very strict definition, and it is
very difficult to ensure that the probability is exactly the same. A list of the entire population is
required and drawing from that list must use a truly random process. While there are process that
are truly random (e.g. radioactive decay), in practice we usually only have available processes that
are pseudorandom (e.g. Mersenne Twister algorithm). This algorithm is used in both Excel and R
to generate random numbers, and is sufficient for epidemiology.
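As an illustration only, a simple random sample can be drawn in R with the sample() function, which relies on the pseudorandom generator mentioned above. The sampling frame below is hypothetical.

set.seed(2024) #fix the seed so the pseudorandom draw can be reproduced
frame <- 1:5000 #hypothetical sampling frame of 5000 unit identifiers
srs <- sample(frame, size = 50) #simple random sample of 50 units, drawn without replacement
head(srs) #first few selected identifiers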
There is a problem! A single random sample may not necessarily be representative. It is entirely
possible that through the random process, the selected sample is not a good representation of the
population. This may seem confusing, since we have just told you that representative sampling relies on random sampling. There is, however, a theoretical basis for using random sampling which ensures that
in the long-run the process is representative. That is a waffly way of saying that random samples
have no inherent systematic bias. What is more important is that statistical techniques rely on the
theory of random sampling. It is a requirement for inferential statistics.

3.3 Probability

In biostatistics, probability theory provides the foundation for statistical inference. A basic knowledge
of probability is therefore essential for understanding statistics at an introductory level. A probability
is a number that indicates the chance or likelihood that a particular event will occur. Probabilities
are expressed as either a proportion, ranging from 0 to 1, or a percentage, from 0% to 100%. A probability of 0 indicates that there is no chance that the event will occur; a probability of 1 (100%) indicates that the event is certain to occur; and a probability of 0.5 (50%) indicates that the event is expected to occur half of the time. Chapter 14 of Kirkwood & Sterne
introduces probability and describes the rules on which probability calculations are based. Ensure
that you understand the concepts in sections 14.1 to 14.5. You may read section 14.6 but we will
consider probability and odds in more detail in Module 4 when we study the analysis of binary
outcomes. For students who are unfamiliar with, or wish to refresh their understanding of, the basic
concepts of probability, we recommend examining the Additional Resources for this section, listed
at the end of the module notes.

3.4 Probability distributions

A probability distribution assigns a probability of occurrence to every possible outcome for a random
variable. In this course, we introduce the binomial and normal probability distributions which are
widely used in biostatistics. When there are only two possible outcomes, such as heads or tails for
a coin toss, the binomial distribution model is used. This is an important probability model in
health and medical research where binary outcomes are common: for example, the presence or
absence of disease, patients responding to treatment or not, survival after an adverse event. The
outcome is often treated as “diseased” versus “healthy”, with the presence of the outcome of interest
usually labelled as “diseased”, although the outcome need not be adverse. We will consider the
probability and the binomial distribution in more detail in Module 4, which covers the analysis of
binary outcome data.
For continuous outcome measures, the distribution of outcomes frequently approximates a nor-
mal (Gaussian) distribution. When a frequency histogram of a continuous outcome variable is
approximately symmetrical and bell-shaped, the normal probability model can be used. A normal
distribution is fully characterised by two parameters: the mean (µ, pronounced mu) and the standard deviation (σ, pronounced sigma). The mean tells you the centre of the distribution, and the standard
deviation, the width (or spread). Many human physical characteristics have distributions that are
approximately normal, including adult height and weight, systolic and diastolic blood pressure, heart
rate and body temperature. The normal distribution is an important and widely used distribution
in statistics. It is central to statistical inference. The properties of the normal distribution are
described in sections 5.4 to 5.6 of Kirkwood and Sterne.

3.4.1 Area under the normal probability curve

The standard normal distribution is a normal distribution that has been standardised to have a
mean of zero and a standard deviation of 1. A normally distributed variable can be standardised
by subtracting the mean from each value, then dividing by the standard deviation. Values on the
standard normal distribution are called z-scores, and they indicate the number of standard deviations
a value is from the mean. Figure 9 below shows the standard normal distribution represented in
a boxplot and a line plot, annotated to show the correspondence between standard deviations,
quartiles and area under the curve. The area under the whole curve (over its entire range from −∞
to ∞) is equal to 1 (or 100%) so we can talk in terms of proportion/ percentage/ probability when
referring to the area above or below a certain value of the standard normal deviate (SND) or z-score.

Figure 9: Characteristics of the standard normal distribution. (from: https://radiant-rstats.github.
io/docs/data/visualize.html)
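The well-known areas annotated in Figure 9 can be verified with the pnorm() function, which is introduced below in Section 3.4.2; a quick check:

pnorm(1) - pnorm(-1) #area within 1 SD of the mean, approximately 0.68 (68%)
pnorm(1.96) - pnorm(-1.96) #area within 1.96 SDs of the mean, approximately 0.95 (95%)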

3.4.1.1 Statistical Tables


Table A1 in the Appendices of Kirkwood and Sterne can be used to calculate areas in the tails of
standard normal distribution. Figure 10 below shows the first page of the table. Circled (in blue if
you are viewing in colour), you can see that the proportion of area above a z-score of 0 (the mean) is
0.5000, or 50% of the area, as you would expect. Similarly, the area above a z-score of 1.05, indicated by the red squares, is 0.1469, or 14.7%. Note: to calculate the area below a z-score you would
subtract the area from 1.

Figure 10: Table A1 from Kirkwood and Sterne.

3.4.2 Normal distribution functions in R

R has functions for calculating aspects of the normal distribution. The pnorm() function calculates
the area under the curve of the standard normal distribution. The default behaviour of this function
is to calculate the area under the curve at the value of the z-score accumulated from −∞ (i.e. the
left side of the distribution). The only argument required is the actual value of the z-score, see the
workings for the example calculations shown on p.46 of Kirkwood and Sterne. We replicate these
below using R.

#### Kirkwood and Sterne Section 5.5

# Area in the UPPER tail of a distribution (pg. 46) SND for


# height x=180cm
z1 <- (180 - 171.5)/6.5
z1

[1] 1.307692
# Area can be found using the pnorm function, this function
# is cumulative from the left so to get the area in the
# region to the right (the UPPER tail) we need to subtract
# the SND from 1
1 - pnorm(z1) #9.55% (slightly different to the book due to rounding error)

[1] 0.09548885
# Area in the LOWER tail of a distribution SND for height
# x=160cm
z2 <- (160 - 171.5)/6.5
z2

[1] -1.769231
# Area can be found using the pnorm function, this function
# is cumulative from the left so it will directly compute the
# area in the region to the left (the lower tail)
pnorm(z2) #3.84% (slightly different to the book due to rounding error)

[1] 0.03842769
Remember that the pnorm() function accumulates from the left (−∞) so if you supply a negative
value of the SND it will return an area <0.5 (<50%), whereas if you supply a positive value it will
return an area >0.5 (>50%).
The qnorm function does the exact opposite of the pnorm() function and returns the SND from the
area under the curve. Commonly the SND value for the 95% confidence interval is computed as
shown in Fig. 5.5 of Kirkwood and Sterne.
Here is how you would use R to calculate the SND values for each limit:
# The qnorm function can be used to convert probabilities
# into SND (z)
qnorm(0.025) #SND for an area 2.5% BELOW the mean

[1] -1.959964
qnorm(1 - 0.025) #SND for an area 2.5% ABOVE the mean

[1] 1.959964

3.4.2.1 Sampling distribution simulation exercise
Please follow the link below to a Shiny app for teaching sampling distributions. Note: Shiny is an R
package for building interactive web applications in R. The app was written by Kyle Hardman at
the University of Missouri, and is described in this blogpost.
https://shiny.rit.albany.edu/stat/sampdist/
In brief, the app plots the sampling distribution for different sample sizes and number of samples.
It plots the following:
1. All of the individual values that have been sampled
2. The sampling distribution of the statistic that is calculated
3. The 4 most recent samples of data.
There are three built-in distributions in the app: normal, exponential and uniform. In this unit we
will only consider the normal distribution but please feel free to explore the other distributions. The
screenshots below show the two components of the app. The first screenshot shows the input panel
and the second, the output display. Shown is an example of the plots displayed for 100 samples,
with a sample size of 10, for a normal distribution with a mean of 100 and a standard deviation
of 15. The top left plot shows a histogram of all sampled values, that is, from the 100 individual
samples of 10. The top right plot shows the sampling distribution of the statistic, in this case, the
mean of each sample. The lower plots show the last four samples.

Sampling Simulation Tasks:
1. Configure the sampling procedure as shown in the panel above:
• Population distribution: normal
• Statistic: mean
• Mean (Mu for normal): 100
• SD (sigma for normal): 15
• Sample size: 10
• Samples to draw at a time: 1
• Check options: “Match scales”, “Show stats”, “Show parent distribution” (more
information about these available on the blog).
2. Click on “Clear All Samples” to begin a new procedure.

3. Click on the tab “Draw/Add Sample(s)” once. Note what is displayed in the different panels.

4. Add 3 more samples and note how the plots change.

5. Add 6 more samples to reach 10 samples overall. Once again, note how the plots change.

6. Now clear the samples. Change the “Samples to draw at a time” to 100. Click on “Draw/Add
Sample(s)”. Provide comments about the plots.

7. Repeat Step 6 but this time draw 1000 samples at a time. What do you notice about the plots?

8. Share your findings on the Discussion Board.

Exercises

Work through Exercises 1.6 to 1.9 which you will find at the end of the notes.

4 Summarising distributions

So far, we have examined graphical methods for creating summaries, and while these are useful
visual methods, more formal numerical summaries are also required to describe certain aspects of
data distributions.

4.1 Required readings

Kirkwood & Sterne: Chapters 4 to 6

4.2 Key summaries

Basic numerical summaries can be broadly divided into two types, those that give information about
the middle of the data, the “central tendency”, and those that give information about the spread or
distribution of the data. The following is a list of the most useful numerical summaries for single
variables:
• Mean
• Median (50th percentile)
• Variance/Standard deviation
• Interquartile range (difference between 75th and 25th percentiles)
• Confidence interval
Chapter 4 of Kirkwood and Sterne describes the methods used to summarise characteristics of the
distribution of a numerical variable. This includes definitions of the mean and standard deviation
of a distribution and the relationship between sample and population values. Ensure that you
understand how these values are calculated, and how to interpret the values. Note that on page
35 Kirkwood and Sterne introduce the concept of degrees of freedom. The number of degrees of
freedom refers to the number of independent pieces of information that were used to calculate the
estimate. This is different from the number of observations in the sample. For example, if we
calculate a mean for 10 values, we know that the sum of the values must be 10 multiplied by the
mean. When 9 of the values are known, we know exactly what the 10th number must be: it is not
free to vary. Thus, we have 10-1=9 degrees of freedom. The use of n-1 in the denominator for
the sample standard deviation, instead of n, is to improve the statistical properties of the sample
estimator when trying to estimate the population standard deviation.
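The role of the n-1 denominator can be checked directly in R using the “dt” vector from earlier; this is a brief sketch, not an example from the textbook.

dt <- c(4, 5, 6, 5, 2, 9, 5, 19, 4, 3)
n <- length(dt)
sum((dt - mean(dt))^2)/(n - 1) #sample variance using n-1 (the degrees of freedom) in the denominator
var(dt) #identical result: R's var() already divides by n-1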
The confidence interval is a method used to estimate the likely range of values for the population
parameter based on a sample statistic, such as the mean. Chapter 6 of Kirkwood and Sterne

describes the calculation of a confidence interval for a mean. You are advised to pay particular
attention to Section 6.3, where the interpretation of a confidence interval is discussed. Confidence intervals are frequently misinterpreted, and your ability to interpret them will be assessed in this unit. Strictly speaking, a 95% confidence interval means that
if we were to take 1000 different samples and compute a 95% confidence interval for each sample,
then approximately 95% of those confidence intervals (ie. 950) will contain the true population
value. A confidence interval does not predict that the true value of the population parameter has
a 95% chance of being in the confidence interval, given the observed data. Confidence intervals
depend on using the normal distribution, so you may also wish to re-read Chapter 5 to refresh your
understanding of the normal distribution and the central limit theorem.

4.3 Basic summaries using R

R has a suite of appropriately named functions for creating numerical summaries. In the following
code example, we create a simple variable (vector) using the assignment operator (<-) and use inbuilt
functions to generate summary statistics.
dt <- c(4, 5, 6, 5, 2, 9, 5, 19, 4, 3) #create a vector called dt
median(dt) #calculate the median

[1] 5
mean(dt) #sample mean

[1] 6.2
sd(dt) #sample standard deviation

[1] 4.871687
var(dt) #sample variance (same as sd^2)

[1] 23.73333
# Additionally some other useful statistics
min(dt) #minimum value

[1] 2
max(dt) #maximum value

[1] 19
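The interquartile range listed among the key summaries can be obtained with the quantile() and IQR() functions:

quantile(dt) #quartiles: 0% = 2, 25% = 4, 50% = 5, 75% = 5.75, 100% = 19
IQR(dt) #interquartile range = 5.75 - 4 = 1.75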
There is one function which combines many of the above:
summary(dt)

Min. 1st Qu. Median Mean 3rd Qu. Max.


2.00 4.00 5.00 6.20 5.75 19.00
We will make more use of the summary() function in the next module.

4.4 Confidence interval calculations in R

There is no function to calculate confidence intervals for single variables in R (there is one for use
with regression which we will see later). Instead, we calculate it from the Standard Normal Deviate
(SND) which was introduced in the previous section and Chapter 5 of Kirkwood & Sterne. Using
the same vector dt as above, assuming it is a sample of values from a population, we can calculate
the confidence interval around the mean in this way:
dt <- c(4, 5, 6, 5, 2, 9, 5, 19, 4, 3) #create a vector called dt
n_dt <- length(dt) - sum(is.na(dt)) #the overall length of the vector minus the
# number of missing values in the vector
mu_dt <- mean(dt) #sample mean
sd_dt <- sd(dt) #sample standard deviation
z_l95 <- qnorm(0.05/2) #SND for the lower bound 95% CI
z_u95 <- qnorm(1 - 0.05/2) #SND for the upper bound 95%CI

# Lower CI
lci_mu_dt <- mu_dt + z_l95 * sd_dt/sqrt(n_dt)
lci_mu_dt

[1] 3.180553
# Upper CI
uci_mu_dt <- mu_dt + z_u95 * sd_dt/sqrt(n_dt)
uci_mu_dt

[1] 9.219447
# Or if we wish to put it together in a vector (with names)
c(Mean = mu_dt, LowerCI = lci_mu_dt, UpperCI = uci_mu_dt)

Mean LowerCI UpperCI


6.200000 3.180553 9.219447
Note: The example above shows a vector with named elements. This is a small example of some
more advanced functionality in R. You can assign attributes to a named vector. It isn’t something
to be overly concerned about and we won’t use it much, but it makes presentation of the results a
little better for this example.
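If you find yourself repeating this calculation, it can be wrapped in a small function. The following is our own convenience sketch, not a built-in R function, and it uses the same z-based approach as above:

ci_mean <- function(x, conf = 0.95) {
    n <- length(x) - sum(is.na(x)) #number of non-missing values
    se <- sd(x, na.rm = TRUE)/sqrt(n) #standard error of the mean
    z <- qnorm(1 - (1 - conf)/2) #SND for the chosen confidence level
    c(Mean = mean(x, na.rm = TRUE), LowerCI = mean(x, na.rm = TRUE) - z * se,
        UpperCI = mean(x, na.rm = TRUE) + z * se)
}
ci_mean(dt) #returns the same values as above: 6.20, 3.18, 9.22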

4.5 Error and bias

There are three issues that should be considered before accepting that an association observed in a
data set between an exposure and an outcome is a real association. It is possible that the association
is a result of random error, systematic error (bias) or confounding. In this module we consider the
two types of error but will leave confounding until Module 3 where we consider multiple regression
for the analysis of continuous response data.
Random error is related to precision and is also known as sampling variability or “noise”. It is
the loss of precision that arises by chance due to:

• biological variation: for example, natural variability in an individual’s blood pressure, choles-
terol or weight
• measurement error: for example, imprecision in measuring devices
• sampling error: a sample that is non-representative by chance
Random error causes a sample to be different from the underlying population. The impact of
random error reduces with increasing sample size.
Systematic error, or bias, is related to accuracy and refers to whether an estimator tends to
systematically overestimate or underestimate the parameter. Systematic error is not affected by
sample size; it is a function of faults in study design, data collection and data analysis. Two main
sources of bias are:
• selection bias: when there is a systematic difference between people included in the study and those who are not included

• measurement bias (or measurement error): resulting from inaccurate measurement or classifi-
cation of outcome or exposure variables
The effects of accuracy and precision are often summarised using a target shooting analogy as shown
in Figure 11. Studies with low random error will provide a more precise estimate of the population
parameter, similar to the second and third images which show targets with shots that are close
together. In contrast, shots in the first and fourth images are further apart. If the study also has
low systematic error, the estimate will also be accurate, similar to the third image, where the shots are tightly clustered around the centre of the target. This represents an estimator that is precise and accurate. Systematic error results in reduced accuracy, as shown in the second and fourth images. With high precision (low random error) but low accuracy (due to systematic error), the shots will be clustered together but not near the centre of the target; this represents an estimator that is precise but not accurate, as shown in the second image.

Figure 11: Precision and accuracy

5 Statistical inference

Statistical inference is the process of generating conclusions about unknown characteristics of a population, based on sample data. For example, we may wish to estimate the mean systolic blood
pressure for Tasmanian males, aged 18 years or over, with Type 2 diabetes. Based on your readings
about populations and sampling, you will recall that the statistical population is defined at the start
of the study. Since it is usually impractical (or impossible) to measure the entire study population,

observations would be made on a representative sample of Tasmanian males, aged 18 years or
over, with Type 2 diabetes. The measured numerical characteristics of the sample, such as the
mean, proportion or variance, are sample statistics which would be used to provide estimates of the
underlying population parameters.
The two components of statistical inference are estimation and hypothesis testing.

5.1 Required Readings

Kirkwood & Sterne: Chapters 7 to 8

5.2 Estimation

There are two types of estimates that can be calculated for each population parameter: a point
estimate and an interval estimate. Point estimates provide single values which are estimates of a
population parameter. It is usually of interest to provide information about two aspects of the population: location (or central tendency) and spread. For a continuous outcome, the sample
mean and median are commonly used to estimate the centre of the distribution. Measures of spread
include the range, variance and standard deviation. Confidence intervals are frequently used as
the interval estimate, with 95% confidence intervals (95% CI) the most commonly used in the
research literature. As described in Chapter 6 of Kirkwood and Sterne, the 95% CI is calculated
so that under repeated sampling it will contain the true population parameter 95% of the time.
However, in reality a single random sample is selected, providing a single estimate and a single
confidence interval, which may or may not contain the true population mean. The interval estimate
provides a measure of the precision of the point estimate by providing a likely range of values. The width of the confidence interval depends on the sample size, so that larger sample sizes provide narrower intervals and more precise estimates.
Although 95% confidence intervals are the most frequently used in health research, this is an
arbitrary value. We will consider this issue in more detail below in the context of hypothesis testing.
Examples of confidence intervals for other percentages are given on page 51 of Kirkwood and Sterne.
Whilst the 95% confidence interval corresponds to a z-score of 1.96, a 90% confidence interval uses a
z-score of 1.64 and a 99% confidence interval uses a z-score of 2.58. Recall from section 5.6 of the
text that the percentage points of the normal distribution correspond to z-values, which indicate
the number of standard deviations away from the mean. Table A2 (p. 272) in Kirkwood and Sterne
presents commonly used percentage points. Revise Chapters 4 to 6 of Kirkwood and Sterne if
required, to ensure you have a good understanding of confidence intervals and their interpretation.
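These percentage points can be checked with the qnorm() function introduced in Section 3.4.2:

qnorm(1 - 0.1/2) #z-score for a 90% confidence interval: 1.64
qnorm(1 - 0.05/2) #z-score for a 95% confidence interval: 1.96
qnorm(1 - 0.01/2) #z-score for a 99% confidence interval: 2.58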
The most commonly used methods for estimating parameters are ordinary least squares (OLS) and
maximum likelihood (MLE). Although we will not be considering the methodologies in detail in
this course, at this stage it is useful to know that these methods use mathematical algorithms to
provide the best fit of the predictor variables to the outcome variable. Least squares estimation
is limited to continuous response variables with linear estimation and normally distributed data.
The most commonly used regression models use OLS estimation, including simple and multiple

linear regression, which we will introduce in Module 3. Maximum likelihood estimation has fewer
limitations and can be applied to non-normal and non-linear estimation. In Module 4 we introduce
Generalized Linear Models (GLM) for logistic regression, which use MLE. For many common
parameters the two methods will provide the same estimates when the assumptions of OLS are
met. These estimation methods are classical approaches that are termed “frequentist” methods. An
alternative approach, which we will not consider in this course, is a Bayesian approach to estimation
which incorporates prior knowledge about the value of the population parameter.

5.3 Hypothesis testing

Chapter 7 of Kirkwood and Sterne introduces the concept of hypothesis testing in the context of
comparison of two means. Chapter 8 considers how to use P-values and confidence intervals to
interpret the results of statistical analyses.
Hypothesis testing is a statistical procedure for testing whether an experimental finding could
plausibly have occurred by chance. Classical statistical hypothesis testing involves defining a
null hypothesis, which is a specific statement about a population parameter (usually of no
effect/difference). The null hypothesis says ‘nothing’s happening’. For example, in a comparison of
the effect of smoking on lung function levels in a group of COPD patients, the null hypothesis might
be: “There is no difference in lung function between current smokers and ex-smokers”. Using sample
statistics, the hypothesis test produces a P-value that indicates the probability of an experimental
finding at least as extreme as observed, if the null hypothesis is true. This concept can be difficult to
grasp and P-values are frequently misinterpreted. A common misinterpretation is that the P-value
indicates the probability that the null hypothesis is true - this is incorrect!

Interpretation of a P-value:
The probability of an experimental finding at least as extreme as the one observed,
assuming that the null hypothesis is true.
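To connect this definition with the normal distribution functions from Section 3.4.2: for a test statistic that follows a standard normal distribution under the null hypothesis, a two-sided P-value is the area in both tails beyond the observed value. The z value below is purely hypothetical.

z <- 2.75 #hypothetical observed test statistic
2 * (1 - pnorm(abs(z))) #two-sided P-value, approximately 0.006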
A common misunderstanding is the idea that P-values are a dichotomous indicator of “significance”.
The P-value can be viewed as a measure of the strength of evidence against the null hypothesis but
it should not be reduced to simply “significant” versus “not significant”. The convention of using a
significance level of 0.05 is a somewhat arbitrary level that has incorrectly become the “holy grail”
of scientific findings. There is quite a body of literature discussing the problems with this approach.
A good commentary by Professor Geoff Cumming, writing for The Conversation, may be found at
the link below. Be sure to watch the video, Dance of the p values, for an enlightening (and musical)
look at null hypothesis significance testing.
http://theconversation.com/the-problem-with-p-values-how-significant-are-they-really-20029

6 Additional resources for Module 1

6.1 Statistical concepts

Statistics at Square One is a basic medical statistics book published by the BMJ. It covers basic
statistical concepts and tests, providing clear descriptions and explanations. The BMJ provides an electronic version here: Statistics at Square One

6.2 Getting started with R

The Module 1 content folder on MyLO includes a series of R videos to help you get started with R.
In addition, the following general resources may also be helpful. These resources, along with others,
are also available in the “Resources for learning R” folder in MyLO.
R cookbook
Marin Stats Lectures

6.3 Defining and displaying data

An overview of basic graphs in R: Graphs by Quick-R


A short video on producing boxplots using R from Marin Stats: Marin Stats on making boxplots
Khan Academy video on reading boxplots: reading boxplots
Two entries from R-bloggers presenting the basics of histograms and how to create them:
1. Basics of histograms
2. How to make a histogram with basic R
Optional: For those wishing to extend themselves, we introduce ggplot2 – a data visualization
package created by Hadley Wickham, offering a powerful graphics language for creating complex,
elegant plots.

6.4 Populations and samples

For students who are unfamiliar with, or who wish to refresh their understanding of, the basic
concepts of probability, we suggest the resources below to support your learning.
The following 5-10 minute videos from Khan Academy may be useful if the concepts were new
to you or were difficult to understand.
1. Basic probability
2. Sample variance
3. Sample standard deviation and bias

4. Normal Distribution
5. Central Limit Theorem
6. Sampling distribution of the sample mean

6.5 Statistical Inference

This Khan Academy video uses a clinical trial example to discuss hypothesis testing and p-values:
Hypothesis testing and p-values

7 Exercises

Exercise 1.1 Produce a boxplot and histogram of the “dt” data as shown in Sections 2.2.1 and 2.2.2 of the notes.
Exercise 1.2 Produce a boxplot and histogram of the dt data without the value of 19. Discuss
this on the discussion board with your fellow classmates.
Exercise 1.3 How is the interquartile range defined, and how does it differ from the range?
Exercise 1.4 What do the numbers on the y-axis of a barplot refer to?
Exercise 1.5 The data used for this exercise are a subsample from a study of respiratory outcomes
for a sample of Chronic Obstructive Pulmonary Disease (COPD) patients, which was introduced in
Section 2 of the module notes. You will find the “Respiratory_subset.csv” file in the Module 1
content folder.
This is your first introduction to working with variables in a data frame (an R dataset). To refer to
a single variable within a data frame, type the name of the data frame followed by the $ symbol and
the variable name without any spaces. For example, to refer to the age variable in the data frame
for this exercise you would type copd$age. You can now treat this “compound” name in the same
way as any object that you have already worked with.
The data frame has 103 observations on the following 11 variables:
copd$id : ID code
copd$study_entry : Date of study entry
copd$MRC_Dyspnoea : MRC breathlessness score (1-5)
copd$weight : Weight in kgs
copd$FEV1 : FEV1 - forced expiratory volume in one second
copd$FVC : FVC - forced vital capacity
copd$FEV1_pcnt_pred : FEV1 percent predicted
copd$DOB : DOB- Date of Birth
copd$sex : Sex 1 = Male, 2 = Female
copd$marital : Marital status 1 = Single, 2 = Married or Defacto, 3 = Separated, Divorced or
Widowed
copd$education : Highest education level 1 = Primary School, 2 = High School, 3 = Certificate or
Diploma, 4 = Year 12, 5 = University
Please run the following code to load the dataset into R, and run some basic data cleaning steps.
Note: You will need to replace the path (S:/Menzies/Biostatistics/MPH/sample data/) in the
first line of code with your own path to locate the file on your computer or use file.choose() as
introduced in the R labs.

# load the lubridate package, needed for interval() and duration() below
library(lubridate)

# read data into R and assign to data.frame named 'copd'
copd <- read.csv("S:/Menzies/Biostatistics/MPH/sample data/Respiratory_subset.csv",
header = T, stringsAsFactors = F, quote = "\"", dec = ".",
na.strings = "")

# Display structure of data


str(copd)

# DATA MANAGEMENT

# Format dates
copd$study_entry_f <- as.Date(copd$study_entry, "%d/%m/%Y")
copd$dob_f <- as.Date(copd$DOB, "%d/%m/%Y")

# Calculate Age
copd$age <- interval(start = copd$dob_f, end = copd$study_entry_f)/duration(num = 1,
units = "years")

# Create new variables to recode categorical variables as


# factors
copd$sex_f <- factor(copd$sex, levels = c(1, 2), labels = c("male",
"female"))

copd$marital_f <- factor(copd$marital, levels = c(1, 2, 3), labels = c("Single",


"Married/deFacto", "Sep/Div/Wid"))
copd$education_f <- factor(copd$education, levels = c(1, 2, 3,
4, 5), labels = c("Primary", "High School", "Cert/Dipl",
"Year 12", "University"))

str(copd)

Note: you will need to use the newly created factor variables in your analyses: sex_f, marital_f and
education_f.
Your tasks are:
a) Examine each variable (excluding ID and date variables) and classify them into
continuous/scaled and categorical, further classify the categorical variables into
binary/discrete, nominal and ordinal.
b) Present an appropriate graphical display for each distribution.
c) Provide a brief summary of the features of the distribution of each variable.
Exercise 1.6 What happens when you supply a SND value of zero to the pnorm function? Explain
your answer.
Exercise 1.7 Use the pnorm() function in R (see ?pnorm for its help page) to calculate the following. Use Table
A1 in Kirkwood and Sterne to verify your answer.
a) The proportion of the standard normal distribution between the z-values of -2.5758

and 2.5758?
b) the proportion of the standard normal distribution between the z-values of -1 and
1?
c) the proportion of the standard normal distribution above the z-values of 1.96
Exercise 1.8 What is the difference between a standard deviation and a standard error?
Exercise 1.9 As a little primer to this exercise, please watch the Khan Academy video on Reasonable samples. Now give some thought to the following: political opinion polls in Australia and other parts of the world are often conducted using a random-digit-dialling sampling method and generally call 2000 telephone numbers. Are these polls likely to be representative? Please contribute to
the discussion board.
Exercise 1.10 Revisit the COPD data from Exercise 1.5 and calculate the following summary
statistics for each of the continuous variables in this dataset:
a) mean and SD
b) median and IQR
c) 95% confidence interval
Exercise 1.11 Discuss the interpretation of a 95% confidence interval on the discussion board.
Exercise 1.12
In the population, a parameter has a value of 14. Based on the means and standard errors of their
sampling distributions, which of these estimates shows the most bias?
a) Mean=11, SE=3
b) Mean=14, SE=7
c) Mean=19, SE=4
d) Mean=8, SE=3
Exercise 1.13 In the population, a parameter has a value of 12. Based on the means and standard
errors of their sampling distributions, which of these statistics estimates this parameter with the
least sampling variability?
a) Mean=12, SE=4
b) Mean=10, SE=1
c) Mean=11, SE=3
d) Mean=14, SE=2
Exercise 1.14 A clinical trial to compare a mouthwash against a control found a difference in
plaque score after one year of 1.1 units, P=0.006 (two-sided). The following are true or false:
a) The probability that the null hypothesis is true is 0.006
b) If the null hypothesis were true, the probability of getting the observed result or
greater is 0.003
c) The alternative hypothesis is a mean difference of 1.1
d) The probability of the alternative hypothesis being true is 0.994
e) The probability that the true mean 1.1 is 95%.

Exercise 1.15 The 95% confidence interval for the mean difference in scores was found to be (0.3
to 1.9 units). The following are true or false:
a) We are 95% sure that the true mean lies between 0.3 and 1.9 units.
b) If the study were repeated many times, the 95% confidence interval would include
the true mean 95% of the time.
c) If we repeated the study with the same sample size, we would expect the mean difference to be within 0.3 to 1.9 units 95% of the time.
d) The study is clinically important.
e) The power of the study is greater than 80%.
Note: Questions 1.11-1.14 are from Statistics at Square One.
