
MTH4106 Introduction to Statistics

Notes 1 Spring 2014


What is Statistics about?
We collect data, then analyse the data, and then interpret the results, to find out about
real-world phenomena. There is always variability in the data: we need to extract
meaningful patterns. Because of the variability, our conclusions cannot be certain. We
need to quantify the uncertainty, so that we can make definite decisions and judge how
likely it is that we are right.
Here are some areas of application, with some typical problems.
Agriculture: Which varieties of wheat make the best bread?
Manufacturing: Can we make a cheaper detergent that is just as effective as the current one?
Health: Does an aspirin a day protect against stroke? If so, are there any side-effects?
Education: What is the best way to teach young children mental arithmetic?
Biology: How does biodiversity affect the environment?
Social science: Do people live in better houses than they did 20 years ago?
Economics: How is the credit crunch affecting food prices?
Market research: What sort of advertising campaign is most effective?
Environmental studies: Are people who live near mobile-phone masts more likely to get cancer?
Meteorology: Is global warming a reality?
Psychology: Are shyness and loneliness related?
Before starting any of these investigations, we need to stop and ask:
What do we want to investigate?
What should we measure?
How should we measure it?
Populations and Samples
When we carry out a statistical investigation we want to find out about a population.
Definition A population is the collection of items under discussion. It may be finite
or infinite; it may be real or hypothetical.
Sometimes, although we have a target population in mind, the study population that we
can actually find out information about may be different.
We are interested in measuring one or more variables for the members of the
population, but recording observations for every member would be costly. The government
carries out such a census of the population every ten years, but it also carries out regular
surveys based on samples of a few thousand people.
Definition A sample is a subset of a population.
The sample should be chosen to be representative of the population, because we
usually want to draw conclusions or inferences about the population based on the
sample. Samples will vary, and the question of whether the data in the sample is
compatible with hypotheses we may have about the population will be considered
both in this course and in MTH5122 Statistical Methods.
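To illustrate how samples vary, the following Python sketch draws several simple random samples from a made-up finite population and compares each sample mean with the population mean. The population values, the sample size of 50 and the function name `sample_means` are assumptions chosen purely for illustration; they are not part of the course material.

```python
import random
import statistics

def sample_means(population, sample_size, n_samples, seed=0):
    """Draw n_samples simple random samples (without replacement)
    and return the mean of each one."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]

# Hypothetical population: 10,000 values of some measured variable.
pop_rng = random.Random(1)
population = [pop_rng.gauss(100, 15) for _ in range(10_000)]

means = sample_means(population, sample_size=50, n_samples=5)
print("population mean:", round(statistics.mean(population), 2))
print("sample means:   ", [round(m, 2) for m in means])
```

Each sample gives a slightly different mean, which is why conclusions about the population drawn from a single sample carry uncertainty.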
For each member of the sample we will measure one or more random variables.
We usually assume something about the distribution of the random variables. For
example, if our data were counts of radioactive particles from a sample of radioactive
sources manufactured to have the same mean, it would be reasonable to assume that
$X_i \sim \text{Poisson}(\lambda)$.
This assumption is called a model, and $\lambda$ is called a parameter.
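As a hedged illustration of what assuming a model means, the Python sketch below simulates counts from a Poisson distribution with a known parameter and then estimates that parameter from the data by the sample mean. The parameter value 4.0, the sample size of 1000 and the helper `poisson_draw` are assumptions made for this illustration; the sampler uses Knuth's multiplication method, which is only suitable for small means.

```python
import math
import random
import statistics

def poisson_draw(lam, rng):
    """Return one Poisson(lam) count using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count = 0
    product = rng.random()
    # Multiply uniforms together until the product drops to the threshold or below.
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

rng = random.Random(2014)   # fixed seed so the run is repeatable
true_lam = 4.0              # assumed parameter value, for illustration only
counts = [poisson_draw(true_lam, rng) for _ in range(1000)]

# The mean of a Poisson distribution equals its parameter,
# so the sample mean is a natural estimate of the parameter.
estimate = statistics.mean(counts)
print("first ten counts:", counts[:10])
print("estimate of the parameter:", round(estimate, 2))
```

In a real investigation the parameter is unknown and only the counts are observed; the simulation fixes it in advance so that the estimate can be compared with the value used to generate the data.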
We will not concern ourselves much with the mechanics of how the sample is chosen,
but the following examples give you some idea of the sorts of problems:
(a) A city engineer wants to estimate the average weekly water consumption for
single-family dwellings in the city.
The population is single-family dwellings in the city. The variable we want to
measure is water consumption. To collect a sample, if the dwellings have water
meters, it might be best to get lists of dwellings and annual usage directly from
the water company. If not, the local authority should have lists of addresses
from which we can sample. Note that we should collect data throughout the year, as
water consumption will be seasonal. Note also that, if there is no water meter,
measuring how much water the household uses may be problematical.
(b) A political scientist wants to determine if a majority of voters favour an elected
House of Lords.
The population is voters in the UK. Electoral rolls provide a list of those eligible
to vote. What we want to measure is their opinion on this issue using a neutral
question. (It would be easy to bias the response by asking a leading question.)
We could choose a sample using the electoral roll and then ask the question by
post, on the telephone or face to face, but all these methods have problems of
non-response and/or cost.
(c) A medical scientist wants to estimate the average length of time until the
recurrence of a certain disease.
The population is people who are suffering from this disease or have done so in
the past. What we want to measure are the dates of the last bout of disease
and the new bout of disease. We could take a sample of patients suffering from the
disease now and follow them until they have another bout. This may be too slow
if the disease doesn't recur often. Alternatively, we could use medical records
from one or more hospitals of people who suffered from the disease, but records
can be wrong and there may be biases introduced.
(d) An electrical engineer wants to determine if the average length of life of
transistors of a certain type is greater than 5000 hours.
The population is transistors of this type. We want to record the length of time
to failure by putting a sample of transistors on test and recording when they
fail. Note that for such experiments, where the items under test are very reliable,
it may be necessary to use an accelerated test in which we subject the items to
higher currents than usual, although this might introduce biases.
In other parts of the course we may not emphasize the underlying population or
exactly how we collect a sample, but remember that these questions will have had to be
considered.
Three methods of collecting data
1) Take a sample from a population.
Do we ask questions (in which case, how do we word the questionnaire?), or
take objective measurements, such as blood pressure?
This is called a survey. If the sample is the whole population, it is called a
census.
2) Design an experiment.
This means that we apply different treatments to different experimental units,
and then measure something to see if there is a difference between the
treatments.
How do we choose the treatments?
How do we choose the experimental units?
How do we decide who or what is given which treatment?
(See MTH6116 Design of Experiments.)
3) If it is impractical or unethical to impose our choice of treatments, we may do
an observational study. We might compare the effect of things that people can
change themselves (for example, diet, or whether they go to the gym), or things
that they cannot change (such as height, or place of birth).
Some practical examples
1. The BBC wants to know how many people watch each of its programmes.
Population = all people in the UK.
Sample = panel of people, chosen to be representative.
Each member of the panel keeps a diary recording all their TV viewing for a
week, then sends it to the BBC. The BBC uses this data to estimate the total
number of people who watched each programme. This is a survey.
2. Health researchers want to know which lifestyle factors affect the chances of
getting various diseases. The UK Biobank has recently recruited 500,000
volunteers. The UK Biobank people collect some information now (for example,
"Do you drink full-cream milk?"), then follow each person's medical records
until they die, recording which diseases they get. They will then be able to test
hypotheses such as "If you cycle daily, you are less likely to have a stroke." This
is an observational study.
3. A marine engineer wants to know if a new sort of paint protects pier supports
from corrosion. He paints ten metal beams, and leaves a further ten beams
unpainted (why?). He puts all the beams in a tank of sea water for three months,
then he measures the amount of corrosion in each beam. This is an experiment.
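One common way to decide which beams are painted is to choose them at random, so that the engineer's own preferences cannot influence the comparison. The notes do not say how the ten beams were chosen, so the Python sketch below is only an assumption about how such an allocation might be made; the beam labels 1 to 20 and the function name `allocate` are hypothetical.

```python
import random

def allocate(units, n_treated, seed=None):
    """Randomly split units into a treated group of size n_treated
    and an untreated group containing the rest."""
    rng = random.Random(seed)
    treated = set(rng.sample(units, n_treated))
    untreated = [u for u in units if u not in treated]
    return sorted(treated), untreated

beams = list(range(1, 21))   # hypothetical labels for the twenty beams
painted, unpainted = allocate(beams, n_treated=10, seed=3)
print("painted:  ", painted)
print("unpainted:", unpainted)
```

Every beam has the same chance of being painted, so any systematic difference in corrosion between the two groups is more plausibly due to the paint than to how the beams were selected.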