
1 Data Summary and Presentation

1.1 Populations and Samples


A population is a defined group of individuals or items to whom the conclusions of a study or experiment apply. A finite population is one where the number of individuals or items in the population can be counted, such as the population of people in a city, the population of all registered companies, or the population of all licensed cars in the UK. For an infinite population: a researcher may notionally repeat an experiment or observational process several times, under the same conditions; then all of the possible observations that he or she could make under these conditions form the population of observations. Examples might be the population of possible claims received by an insurance company or the population of possible industrial accidents. Here the population is not necessarily real, in the sense of the earlier examples, and it is often convenient to regard it as an infinite number of hypothetical values.
A sample is a subset of the population. A variable is a quantity that may take any one of a specified set of values, for a given individual. Examples are age (of persons), income (of households) and socio-economic class (of workers). Data are the set of values of one or more variables recorded on one or more individuals or items.
1.1.1 Reasons for Sampling
1. It may be too expensive or time consuming to measure every item.
2. It may be more accurate to measure a few items carefully than to try to measure every item.
3. It is essential to sample if by examining items you destroy them; e.g., if you are interested in the life length of light bulbs then you must burn them until they fail, so you cannot test every bulb.
The disadvantage of sampling is that information is inevitably lost by not measuring every item.
1.1.2 Random Sample from a Finite Population
This is a sample where every member of the population has an equal chance of being in the sample.
Strictly speaking: a random sample of size n is one chosen such that every sample of size n has the
same chance of being the chosen sample.
It is possible though, that by chance the random sample turns out to be unrepresentative of the
population. One way to reduce this possibility is to increase the sample size, but this may not always
be practical.
1.1.3 How to obtain a random sample
Give every member of the population a number or label.
1. If the population size is small, put each number onto a card and shuffle the cards thoroughly. Choose the required number of cards and investigate the members of the population corresponding to these numbers.
2. If the population size is large, use random number tables. These are pages of numbers which have been generated in such a way that each digit on the page is equally likely to be 0, 1, ..., 9. Suppose the population size is N and the number N has x digits. Proceed as follows:
(a) Decide where to start on the page of random numbers.
(b) Decide in which direction to move along the page.
(c) Read off the random numbers consecutively in groups of x, ignoring any numbers which are greater than N and any which have already occurred.
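Steps (a)-(c) translate directly into code. The sketch below is illustrative only (the function name is ours, and the random digits come from Python's pseudo-random generator rather than a printed table); in practice `random.sample` does the whole job in one call.

```python
import random

def simple_random_sample(population_size, n, seed=None):
    """Mimic the random-number-table procedure: generate x-digit numbers,
    ignoring any that are greater than N (the population size) or repeated."""
    rng = random.Random(seed)
    x = len(str(population_size))                  # number of digits in N
    chosen = []
    while len(chosen) < n:
        candidate = int("".join(rng.choice("0123456789") for _ in range(x)))
        if 1 <= candidate <= population_size and candidate not in chosen:
            chosen.append(candidate)
    return chosen

# Sample 5 labels from a population numbered 1..843 (numbers are illustrative).
print(simple_random_sample(population_size=843, n=5, seed=1))
```

In practice `random.sample(range(1, 844), 5)` achieves the same thing directly; the loop above is only meant to mirror the table-reading procedure.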
1.2 Types of variable
Qualitative (non-numerical):
Categorical: no actual measurement is made, just a qualitative judgement, e.g., sex, hair colour. The observations are said to fall into categories.
Ordinal: there is a natural ordering of the categories, such as degree of severity of a disease (mild, moderate, severe) or occupational group (professional, skilled manual, unskilled manual).
Quantitative (numerical):
Discrete: can only take one of a discrete set of values (e.g., the number of children in a family, the number of bankruptcies in a year, the number of industrial accidents in a month).
Continuous: can in principle take any value in a continuous range, such as a person's height or the time for some event to happen. In practice, all real observations are discrete because they are recorded with finite accuracy (e.g., time to the nearest minute, income to the nearest pound). But if they are recorded sufficiently accurately they are regarded as continuous.
1.3 Tables and Graphs
1.3.1 Presentation of Qualitative Data (ordered and unordered)
1. By a frequency table:
Example 1.1 The number of fatal accidents to children under 15 in the UK during
1987 (source: Action on Accidents, produced by the National Association of Health
Authorities and the Royal Society for the Prevention of Accidents).
Type of accident Number of children Percentage
Pedestrians (road) (P) 260 30.88
Burns, fires (mainly home) (B) 119 14.13
Vehicle occupants (road) (V) 96 11.40
Cyclists (road) (R) 73 8.67
Drownings (home and elsewhere) (D) 63 7.48
Choking on food (C) 50 5.94
Falls (home and elsewhere) (F) 40 4.75
Suffocation (home) (S) 34 4.04
Others (O) 107 12.71
Total 842 100.00
The categories are the types of accident. The number of children dying from each type of accident
is the frequency of that category. The relative frequency or proportion of children dying from
each type of accident is the frequency divided by the total number of deaths. Multiplying the
relative frequencies by 100 gives the percentages (i.e., the relative frequencies per 100 cases).
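The relative frequencies and percentages in the table can be reproduced with a few lines of code; this is an illustrative sketch using the Example 1.1 figures (the dictionary layout is ours).

```python
# Frequencies, relative frequencies and percentages for Example 1.1.
accidents = {
    "Pedestrians (road)": 260, "Burns, fires (mainly home)": 119,
    "Vehicle occupants (road)": 96, "Cyclists (road)": 73,
    "Drownings (home and elsewhere)": 63, "Choking on food": 50,
    "Falls (home and elsewhere)": 40, "Suffocation (home)": 34,
    "Others": 107,
}
total = sum(accidents.values())                    # 842 deaths in all
for category, freq in accidents.items():
    rel = freq / total                             # relative frequency
    print(f"{category:32s} {freq:4d} {100 * rel:6.2f}%")
```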
2. Pictorially: for example by a barplot:
[Horizontal barplot: number of children (scale 0 to 250) for each type of accident, bars ordered by size from Pedestrians (largest) down to Suffocation, with Others shown separately]
If the categories are unordered (as here) it is useful to order them by size in the figure; indeed the figure and table could be combined.
1.3.2 Presentation of Discrete Data
1. By a frequency table, giving the number of occurrences of each value of the variable:
Example 1.2 The following data give the number of plants of the sedge Carex flacca found in 100 throws of a quadrat over a wet meadow.
1 0 1 4 1 0 2 4 3 0 2 0 0 4 2 0 0 2 1 0 2 1 0 3 1
6 0 0 1 2 2 0 2 1 2 0 0 0 0 1 1 0 1 2 1 2 2 3 4 0
0 4 0 3 0 1 5 1 2 4 0 1 1 0 4 0 0 3 8 2 1 3 1 3 0
0 1 0 2 0 5 1 2 0 3 1 2 1 3 0 1 0 0 1 0 0 0 0 2 3
Number of plants       0    1    2    3    4    5    6    7    8    Total
Frequency             37   24   18   10    7    2    1    0    1    100
Relative frequency  0.37 0.24 0.18 0.10 0.07 0.02 0.01 0.00 0.01   1.00
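A frequency table like this can be produced directly from the raw data; here is an illustrative sketch using Python's collections.Counter on the 100 quadrat counts of Example 1.2.

```python
from collections import Counter

# Tally the 100 quadrat counts from Example 1.2 into a frequency table.
data = [1,0,1,4,1,0,2,4,3,0,2,0,0,4,2,0,0,2,1,0,2,1,0,3,1,
        6,0,0,1,2,2,0,2,1,2,0,0,0,0,1,1,0,1,2,1,2,2,3,4,0,
        0,4,0,3,0,1,5,1,2,4,0,1,1,0,4,0,0,3,8,2,1,3,1,3,0,
        0,1,0,2,0,5,1,2,0,3,1,2,1,3,0,1,0,0,1,0,0,0,0,2,3]
freq = Counter(data)
n = len(data)                                   # 100 throws
for value in range(max(data) + 1):
    print(value, freq[value], freq[value] / n)  # value, frequency, rel. freq.
```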
2. Pictorially, for example by a relative frequency graph, e.g., for Example 1.2:
[Relative frequency graph for Example 1.2: number of plants (0 to 8) on the horizontal axis, relative frequency (0.0 to 0.3) on the vertical axis]
1.3.3 Presentation of Small Amounts of Continuous Data
Example 1.3 The systolic blood pressures of 21 women participating in a keep-fit class were as follows.
152 105 123 131 99 115 149 137 126 124 128
143 150 112 135 130 123 118 122 136 141
1. By a dotplot (also known as a dot diagram)
*
* * * * * *** * * ** *** * * ** *
--+---------+---------+---------+---------+---------+----
100 110 120 130 140 150
Systolic blood pressure (mmHg)
2. By a stemplot (also known as a stem and leaf display)
(a) First divide the numbers up into a right hand digit and the rest of the number. The left
hand digits are called the stem values and the right hand digit the leaf value. Aim at having
between 5 and 15 stem values.
(b) Write all the possible stem values from smallest to largest in a column.
(c) Take each number in turn and write its leaf value against its stem value.
(d) Re-write the stemplot putting the leaves on each stem in ascending order. Write the leaves
out carefully in vertical columns so that the shape of the data can be seen.
Stem Leaves (leaf unit = 1) Stem Leaves (leaf unit = 1)
9 9 9 9
10 5 10 5
11 528 11 258
12 364832 12 233468
13 17506 13 01567
14 931 14 139
15 20 15 02
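Steps (a)-(d) can be automated; the following sketch (function name and layout are ours) reproduces the ordered stemplot for the blood pressure data.

```python
# A minimal stemplot builder following steps (a)-(d): split each number
# into a stem (all but the last digit) and a leaf (the last digit), then
# list the sorted leaves against each stem.
def stemplot(values, leaf_unit=1):
    scaled = sorted(int(v // leaf_unit) for v in values)
    stems = {}
    for v in scaled:
        stems.setdefault(v // 10, []).append(v % 10)
    lines = []
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in sorted(stems.get(stem, [])))
        lines.append(f"{stem:3d} | {leaves}")
    return "\n".join(lines)

bp = [152, 105, 123, 131, 99, 115, 149, 137, 126, 124, 128,
      143, 150, 112, 135, 130, 123, 118, 122, 136, 141]
print(stemplot(bp))
```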
Example 1.4 The following data give the per capita income, in $US, for a sample of 16 countries. (Source: IMF)
India 162 Honduras 403 Costa Rica 1491 Israel 3790
Madagascar 183 Botswana 524 Argentina 1988 Japan 6593
The Gambia 208 Ecuador 854 Greece 2822 Australia 6843
Egypt 358 Jamaica 1466 Italy 3439 Kuwait 11554
1. Dotplot:
*
** *
**** * * * * * ** *
+----+----+----+----+----+----+----+----+----+----+----+----+-----
0 2000 4000 6000 8000 10000 12000
Income in US Dollars
2. Stemplot:
Following the procedure in Example 1.3 would give 1140 stem values, which is far too many. To get around this, we round every number down to the nearest 10 or, if the number of stem values is still too large, to the nearest 100. By rounding down you retain the actual figures in the original number.
100 100 200 300 400 500 800 1400
1400 1900 2800 3400 3700 6500 6800 11500
Ignore the zeroes at the end of each number and proceed as in Example 1.3.
Stem  Leaves (leaf unit = 100)      Stem  Leaves (leaf unit = 100)
  0   1123458                         0   1123458
  1   449                             1   449
  2   8                               2   8
  3   47                              3   47
  4                                   4
  5                                   5
  6   58                              6   58
  7                                  HI: 11500
  8
  9
 10
 11   5
If there are a few observations that are a long way from the main body of the data (i.e., there
are several stems with no leaves) then these observations can be listed separately as either high
(HI) or low (LO) values in a stemplot as above.
1.3.4 Symmetric and Skew Data
Data are symmetric if when you imagine a line through the central value in a table or diagram, the
two halves are mirror images. If the data are not symmetric they are said to be skewed. If the larger
values are more spread out than the smaller values, then the data are positively skewed, or right
skew. If the smaller values are more spread out than the larger values, then the data are negatively
skewed, or left skew. The data in Examples 1.2 and 1.4 are right skew while the data in Example 1.3 are approximately symmetric.
1.3.5 Presentation of Large Amounts of Continuous Data
Example 1.5 The individual measurements of span (in inches) of 140 men were as follows.
68.2 64.8 64.2 73.9 69.5 70.8 68.4 72.7 72.6 67.5
67.0 72.7 71.6 72.3 70.0 71.0 71.0 67.4 68.3 66.8
73.1 71.9 73.4 67.6 73.0 69.9 71.8 64.3 71.5 70.4
70.3 73.9 70.8 70.2 65.0 75.4 72.3 71.1 65.5 70.6
70.9 68.3 71.5 66.6 70.0 72.2 67.6 71.2 70.5 66.5
76.3 66.1 76.0 75.1 68.2 68.6 69.4 69.1 70.7 70.5
65.5 69.9 68.0 72.2 69.8 65.5 73.2 64.7 67.5 68.2
72.4 68.5 65.1 65.6 74.8 68.0 70.3 73.2 74.2 74.7
65.8 72.5 70.1 72.2 73.8 66.3 70.3 74.0 69.4 69.7
70.7 67.5 68.4 67.0 68.3 67.6 63.9 66.5 67.1 66.9
65.1 72.1 71.3 67.1 65.4 68.0 70.3 66.7 70.8 74.0
66.5 71.6 73.9 70.8 66.5 69.8 73.9 66.7 67.8 67.9
67.5 65.6 70.3 70.7 67.3 65.8 66.0 72.2 70.8 72.1
64.4 65.7 72.4 68.2 73.2 68.0 68.4 61.5 66.9 61.3
1. Grouped frequency table
(a) Calculate the range of the data i.e., the largest value minus the smallest value.
(b) Divide the range up into groups. Aim at having between 5 and 15 groups.
(c) Calculate the frequency of each group.
Span          Frequency   Relative frequency
61.0 - 62.4        2           0.014
62.5 - 63.9        1           0.007
64.0 - 65.4        9           0.064
65.5 - 66.9       21           0.150
67.0 - 68.4       29           0.207
68.5 - 69.9       11           0.079
70.0 - 71.4       27           0.193
71.5 - 72.9       20           0.143
73.0 - 74.4       14           0.100
74.5 - 75.9        4           0.029
76.0 - 77.4        2           0.014
Total            140           1.000
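The grouping step can be sketched in code. The function below (an illustration of ours, not part of the notes) assigns each observation to an equal-width class; for brevity it is shown on the first ten spans only, with the class layout of the table above (start 61.0, width 1.5, 11 classes).

```python
# Group continuous data into equal-width classes, as in Example 1.5.
def grouped_frequencies(data, start, width, k):
    counts = [0] * k
    for x in data:
        counts[int((x - start) // width)] += 1  # index of the class containing x
    return counts

# A small illustrative subset (the notes use all 140 spans):
spans = [68.2, 64.8, 64.2, 73.9, 69.5, 70.8, 68.4, 72.7, 72.6, 67.5]
print(grouped_frequencies(spans, start=61.0, width=1.5, k=11))
```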
2. Frequency histogram
This is a graph of the information in a grouped frequency table. For each group, a rectangle
is drawn with base equal to the group width and area proportional to the frequency for that
group. Here is a histogram of the span data.
[Histogram of the span data: span (inches) from 60 to 80 on the horizontal axis, frequency per 1.5 in (0 to 30) on the vertical axis]
Usually the groups are of equal width, as in the above example, and the height of the rectangle is then also proportional to the frequency. It is common for the vertical axis to be called frequency in this situation, which really means frequency per group width (sometimes also called frequency density). In the above example, the vertical scale is frequency per 1.5 inches. If we wanted to change it to frequency per inch we would divide the numbers on the vertical axis by 1.5, but the shape of the graph would not change.
In general, though, the intervals in a grouped frequency table are not necessarily equal, and it is the areas of the rectangles, not their heights, that are proportional to the frequencies. If the graph is drawn so that the total area equals 1, it is called a relative frequency histogram. It is exactly the same shape, but the vertical axis is again re-scaled so that areas are proportions.
1.3.6 Comments on Dotplots, Stemplots, Histograms and Boxplots
1. A dotplot is a simple and often very effective display of a small sample of continuous data; it can show location, dispersion, asymmetry (though subtle shape features are hard to see in small samples) and extreme values. Dotplots are particularly good for comparing several small samples with respect to a common scale.
2. A stemplot tabulates the actual values of the original data (possibly after rounding). It therefore
lends itself to simple calculation of quantiles. And it also gives a pictorial view of the data (like a
histogram on its side) and shows location, dispersion and, to some extent, shape. Stemplots can
be useful for small and moderate sample sizes (e.g., up to 50 or 60) but not for large samples.
3. A histogram just plots the frequencies, i.e., the information in a frequency table, so the individual values are lost. This form is good for showing shape (in addition to location and dispersion) for reasonably large samples. It is also possible to compare several histograms on a common scale.
4. A fourth type of graph useful for quantitative data is the boxplot (see below). This is also good for comparing a number of samples, each of a reasonable size (e.g., over 25 each).
1.4 Summary Statistics
In addition to graphical displays it is often useful to have numerical summary statistics that attempt
to condense the important features of the data into a few numbers.
1.4.1 Measures of Location (or Level)
1. Mean. This is sometimes referred to as the arithmetic mean, to distinguish it from other types of mean, such as the geometric mean and the harmonic mean. It is defined as
\[ \text{mean} = \frac{\text{the sum of all observations}}{\text{the total number of observations}} \]
This is often written in the mathematical notation $\bar{x}$, which you will find in textbooks and on calculators. Using Example 1.3 to help explain the notation:
$n$ is the number of observations in the sample ($n = 21$),
$x_1$ is the systolic blood pressure of the first woman in the sample ($x_1 = 152$),
$x_2$ is the systolic blood pressure of the second woman in the sample ($x_2 = 105$), etc.,
$\sum x$ is the sum of all the $x$ values; this is short for the more precise expression $\sum_{i=1}^{n} x_i$,
$\bar{x}$ is the mean of the sample.
Thus
\[ \bar{x} = \frac{\sum x}{n} = \frac{152 + 105 + \cdots + 141}{21} = 128.52 . \]
The mean has some properties that it is useful to understand:
(a) Imagine trying to balance a dotplot of the data on the end of a pencil (where each dot has equal weight). The point on the scale where the figure balances exactly is the mean. This helps us understand why, if the data are symmetric, the mean is in the middle; and it tells us intuitively where the mean must be if the data are not symmetric.
(b) Suppose that you subtract the mean from each data value. Then the resulting differences (sometimes called residuals) must add to zero. That is,
\[ (x_1 - \bar{x}) + (x_2 - \bar{x}) + \cdots + (x_n - \bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0 . \]
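Both the value of the mean and the residuals-sum-to-zero property are easy to verify numerically for the Example 1.3 data; this is an illustrative check, not part of the notes.

```python
# Check that the mean of the Example 1.3 blood pressures is 128.52 and
# that the residuals x_i - xbar sum to (essentially) zero.
bp = [152, 105, 123, 131, 99, 115, 149, 137, 126, 124, 128,
      143, 150, 112, 135, 130, 123, 118, 122, 136, 141]
n = len(bp)
xbar = sum(bp) / n
residuals = [x - xbar for x in bp]
print(round(xbar, 2))               # 128.52
print(abs(sum(residuals)) < 1e-9)   # True: the residuals cancel
```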
2. Median. The median of a set of numbers is the value below which (or equivalently above which) half of them lie. It is also known as the 50-percentile point. To find the median of $n$ observations, first put the observations in increasing order. The median is then given by:
the $\frac{n+1}{2}$th observation if $n$ is odd,
the mean of the $\frac{n}{2}$th and $(\frac{n}{2} + 1)$th observations if $n$ is even.
For the data in Example 1.3, the observations have been written down in increasing order in the stemplot. As $n = 21$, the median is the 11th observation, i.e., median = 128.
For the data in Example 1.4, $\bar{x} = 2667.4$ and median = $(1466 + 1491)/2 = 1478.5$.
Note that for the approximately symmetric data of Example 1.3 the mean and median are similar
but for the right skew data of Example 1.4 the mean is much larger than the median.
3. Quartiles (and other quantiles). In the same way as for the median, we may calculate the value below which some specified fraction of the observations lie. The lower quartile $q_L$ is the value below which one quarter of the observations lie and the upper quartile $q_U$ is the value below which three quarters of the observations lie. The lower and upper quartiles are also known as the 25 and 75 percentiles. Different textbooks may use slightly different definitions of sample quartiles. Here is a standard one: as when finding the median, first put all the $n$ observations in increasing order. Then:
If $n/4$ is not a whole number, calculate $a$, the next whole number larger than $n/4$, and $b$, the next whole number larger than $3n/4$. The lower quartile is the $a$th observation and the upper quartile is the $b$th observation.
If $n/4$ is a whole number, then the lower quartile is the mean of the $\frac{n}{4}$th and $(\frac{n}{4} + 1)$th observations and the upper quartile is the mean of the $\frac{3n}{4}$th and $(\frac{3n}{4} + 1)$th.
In Example 1.3, $n = 21$, so $n/4 = 5.25$, which is not a whole number. So $a = 6$ and the lower quartile $q_L$ is the 6th value in the ordered data. From the stemplot, $q_L = 122$. Also, $3n/4 = 15.75$, so $b = 16$ and the upper quartile is the 16th value, $q_U = 137$. Note that we could also find $q_U$ by counting $a = 6$ values down from the largest.
In Example 1.4, $n = 16$, so $n/4 = 4$, which is a whole number. So the lower quartile is the average of the 4th and 5th observations: $q_L = (358 + 403)/2 = 380.5$, and the upper quartile is the average of the 12th and 13th (or equivalently the 4th and 5th down from the largest): $q_U = (3439 + 3790)/2 = 3614.5$.
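These median and quartile rules translate directly into code; a sketch follows (function names are ours, and this implements the particular convention described above, which is one of several in use).

```python
import math

def median(data):
    x = sorted(data)
    n = len(x)
    if n % 2 == 1:
        return x[(n + 1) // 2 - 1]             # the (n+1)/2-th observation
    return (x[n // 2 - 1] + x[n // 2]) / 2     # mean of n/2-th and (n/2+1)-th

def quartiles(data):
    x = sorted(data)
    n = len(x)
    if n % 4 != 0:
        a = math.floor(n / 4) + 1              # next whole number above n/4
        b = math.floor(3 * n / 4) + 1          # next whole number above 3n/4
        return x[a - 1], x[b - 1]
    q_l = (x[n // 4 - 1] + x[n // 4]) / 2
    q_u = (x[3 * n // 4 - 1] + x[3 * n // 4]) / 2
    return q_l, q_u

bp = [152, 105, 123, 131, 99, 115, 149, 137, 126, 124, 128,
      143, 150, 112, 135, 130, 123, 118, 122, 136, 141]
print(median(bp), quartiles(bp))   # 128 (122, 137)
```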
1.4.2 Measures of spread
1. Range. The range is the largest observation minus the smallest observation.
In Example 1.3, the range is 152 - 99 = 53. In Example 1.4, the range is 11554 - 162 = 11392.
2. Interquartile Range. The range has the disadvantage that it may be greatly affected by extreme values that are a large distance away from the main body of the data, as in Example 1.4, so that it may not give an informative measure of the spread of most of the data. A more stable measure is the interquartile range, which is the range of the middle half of the data. Thus
\[ \text{interquartile range} = \text{upper quartile} - \text{lower quartile} = q_U - q_L . \]
For the data in Example 1.3 the interquartile range is 137 - 122 = 15. For the data in Example 1.4 the interquartile range is 3614.5 - 380.5 = 3234.0.
3. Variance and Standard Deviation.
The sample variance is the sum of squares of the residuals (the differences between each observation and the sample mean) divided by $n - 1$:
\[ \text{variance} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
The units of the variance are the square of the units of the original data, so its numerical value is not particularly useful as a measure of spread. (We will see later that the variance is a useful mathematical parameter; in particular, if you add two independent quantities, then the variance of the sum equals the sum of the variances.)
The corresponding measure of spread that is in the original units is the standard deviation, defined by
\[ \text{standard deviation} = \sqrt{\text{variance}} . \]
The sample standard deviation is usually denoted by $s$ and the variance by $s^2$.
If you calculate the standard deviation using the statistics mode on a calculator, then $s$ is found by pressing the $\sigma_{n-1}$ key (the label varies between makes). If you calculate the standard deviation using a calculator without a statistics mode, the calculating form of the variance is
\[ \text{variance} = s^2 = \frac{1}{n - 1} \left[ \sum x^2 - \frac{(\sum x)^2}{n} \right] \]
For the data in Example 1.3, $n = 21$,
\[ \sum x = 2699, \qquad \sum x^2 = 152^2 + 105^2 + \cdots + 141^2 = 350983 . \]
Hence the variance is
\[ s^2 = \frac{1}{20} \left[ 350983 - \frac{(2699)^2}{21} \right] = 204.8619 \]
and the standard deviation is
\[ s = \sqrt{204.8619} = 14.3130 . \]
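The calculating form is easy to check in code for the Example 1.3 data; an illustrative sketch:

```python
import math

# Variance via the calculating form s^2 = [sum(x^2) - (sum x)^2 / n] / (n - 1),
# reproducing the numbers worked out above.
bp = [152, 105, 123, 131, 99, 115, 149, 137, 126, 124, 128,
      143, 150, 112, 135, 130, 123, 118, 122, 136, 141]
n = len(bp)
sum_x = sum(bp)                    # 2699
sum_x2 = sum(x * x for x in bp)    # 350983
variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)
sd = math.sqrt(variance)
print(round(variance, 4), round(sd, 4))   # 204.8619 14.313
```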
1.5 Five-figure summaries and boxplots
A set of data is often conveniently summarised by the five statistics: minimum, lower quartile, median, upper quartile and maximum. For moderate or larger samples, these can give concise information about location, spread and shape; and several samples can be compared in this way. It is common also to present these numbers graphically in a boxplot.
A scale is drawn (in the same way as for a dotplot) and a box is drawn between the two quartiles. So half of the data are in the box. The median is also marked in the box. Lines (sometimes called whiskers) are drawn from the ends of the box to the minimum and maximum, to show the whole range of the data. Extreme points (e.g., those more than 1.5 interquartile ranges below $q_L$ or above $q_U$) are often plotted separately.
Here are boxplots for the data in Examples 1.3 and 1.4. You can compare these with the dotplots
shown earlier.
[Boxplots: Example 1.4, income (dollars), scale 0 to 10000; Example 1.3, blood pressure (mmHg), scale 100 to 150; extreme points plotted individually]
In Example 1.3 the five-figure summary is (99, 122, 128, 137, 152). Half of the observations are between 122 and 137; the distribution is nearly symmetrical (the median is near the middle of the box) and one extreme low value has been identified, which is not much lower than the next lowest. In Example 1.4 the five-figure summary is (162, 380.5, 1478.5, 3614.5, 11554). The distribution is very positively skewed (the lower whisker is much shorter than the upper one and the median is towards the left end of the box) and an extreme high value, considerably higher than the next highest, is shown.
1.6 Choice of Summary Statistics
The mean and standard deviation together describe location and spread for quantitative data. They are most useful when it makes sense to add or average the measurements arithmetically (e.g., for lengths, times, amounts of money, etc.). They are less useful if the data are skewed because they give no indication of shape; indeed, for highly skewed data the mean may be rather untypical of most of the data. They may also be greatly affected by outliers. They also form the basis for confidence intervals and significance tests when the data are samples from normal populations.
The five-figure summary describes not only location and spread but also shape. These statistics describe data essentially by giving the range of each quarter of the data ordered by size, so they are useful when this kind of description is wanted. In particular, they do not combine the data arithmetically. The median and quartiles are not affected by extreme values or outliers, so they may be useful for this reason.
In Example 1.4, the five-figure summary is much more informative than the mean and standard deviation. The latter give a poor description because of the strong skewness. Also, the observations are amounts of income per head for several countries of very different sizes. It is not clear that it makes much sense to average these numbers arithmetically.
In Example 1.3, the observations are blood pressures of several similar individuals. They are fairly
symmetrically distributed and well described by location and spread only. While it may not be
physically meaningful to average these, it is not unreasonable to do so, and the mean and standard
deviation are a useful summary here.
1.7 Change of origin and scale
It is often convenient or necessary to change the units of measurement. This may involve changing the origin (e.g., local time to Greenwich Mean Time) or changing the scale (dollars to pounds, miles to kilometres), or both (degrees Celsius to degrees Fahrenheit). For example, if $x$ is temperature in °C and $u$ is temperature in °F, then $u = 1.8x + 32$ and $x = (u - 32)/1.8$. There are simple rules for how summary statistics change under such linear transformations.
1. If you subtract a constant $a$ from each of a set of numbers, then the mean of the new set is the mean of the old set minus $a$, and the standard deviation is unchanged. Thus, if $u_i = x_i - a$ for $i = 1, 2, \ldots, n$, then $\bar{u} = \bar{x} - a$ and $s_u = s_x$.
In general, adding or subtracting a constant (changing the origin) will shift measures of location (mean, median, quartiles, etc.) by that amount, but will not affect measures of dispersion (sd, interquartile range, etc.).
2. If you multiply each number by the same positive constant, then both the mean and standard deviation are multiplied by that constant. If $u_i = b x_i$, where $b > 0$, then $\bar{u} = b\bar{x}$ and $s_u = b s_x$.
3. More generally, if $u_i = b x_i + a$, where $b > 0$, then $\bar{u} = b\bar{x} + a$ and $s_u = b s_x$.
4. In particular, if
\[ u_i = \frac{x_i - \bar{x}}{s_x} \]
then $\bar{u} = 0$ and $s_u = 1$. That is, if we subtract the sample mean from each data value and then divide each by the sample standard deviation, the resulting numbers $u_1, u_2, \ldots, u_n$ will have mean equal to 0 and standard deviation equal to 1. These numbers are called the standardised data. This is the special case of 3 above where $a = -\bar{x}/s_x$ and $b = 1/s_x$.
These rules can be very useful when computing summary statistics using a calculator. For example, if all the numbers are multiples of 100, it saves time to compute the mean and standard deviation omitting the two trailing zeros and then to append those two zeros to the results.
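Rule 4 (standardisation) is easily verified numerically; an illustrative sketch using the Example 1.3 blood pressures:

```python
import math

# Verify rule 4: standardised data have mean 0 and standard deviation 1.
bp = [152, 105, 123, 131, 99, 115, 149, 137, 126, 124, 128,
      143, 150, 112, 135, 130, 123, 118, 122, 136, 141]
n = len(bp)
xbar = sum(bp) / n
s = math.sqrt(sum((x - xbar) ** 2 for x in bp) / (n - 1))
u = [(x - xbar) / s for x in bp]                 # standardised data
ubar = sum(u) / n
s_u = math.sqrt(sum((v - ubar) ** 2 for v in u) / (n - 1))
print(abs(round(ubar, 6)), round(s_u, 6))        # 0.0 1.0 (up to rounding error)
```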
1.8 Log transformations
Many variables and relationships are usefully described using log scales. Logs to base 10 are useful to indicate orders of magnitude, but mathematical formulae generally use natural logs, i.e., logs to base $e$. Here are some logarithmic scales: the first shows equal divisions of $y = \log_{10} x$ and the second shows equal divisions of $y = \log_e x$:
[Two logarithmic scales, each marked -2 to 2 in log units: the first labelled 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100; the second labelled 0.1, 0.2, 0.5, 1, 2, 5, 10]
Natural logs also have a very useful numerical property:
the difference between two numbers, as a fraction of their mean value, approximately equals the difference between their natural logs.
For example, $(110 - 90)/100 = 0.20$ and $\log_e(110) - \log_e(90) = 0.20067$. This works well for fractional differences up to about 0.5.
Suppose we take (natural) logs of a set of numbers. What happens to their mean and standard
deviation? The mean of the logs does not equal the log of the mean of the numbers. In general
mean(log x) is less than log(mean of x), though these two are quite close if the standard deviation is small. The approximate formula is
\[ \operatorname{mean}(\log x) \approx \log\bigl(\operatorname{mean}(x)\bigr) - \frac{1}{2} \left( \frac{\operatorname{sd}(x)}{\operatorname{mean}(x)} \right)^2 . \]
Furthermore, the standard deviation of the logs approximately equals the relative standard deviation of the original numbers:
\[ \operatorname{sd}(\log x) \approx \frac{\operatorname{sd}(x)}{\operatorname{mean}(x)} . \]
Thus, if the standard deviation of log x equals 0.20, then the standard deviation of x is approximately
20% of the mean of x. Again, this works well for relative standard deviations up to about 0.5.
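Both approximations can be checked numerically; the data set below is hypothetical, chosen to have a modest relative standard deviation so the approximations apply.

```python
import math

# Check the two log-scale approximations on a small positive data set.
data = [90, 95, 100, 105, 110]
n = len(data)
mean = sum(data) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
logs = [math.log(x) for x in data]
mean_log = sum(logs) / n
sd_log = math.sqrt(sum((v - mean_log) ** 2 for v in logs) / (n - 1))

approx_mean = math.log(mean) - 0.5 * (sd / mean) ** 2
print(round(mean_log, 4), round(approx_mean, 4))  # the two should be close
print(round(sd_log, 4), round(sd / mean, 4))      # the two should be close
```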
2 Describing bivariate data
2.1 Formulae for Univariate and Bivariate Data
Consider a sample of values of two variables $x$ and $y$ for $n$ individuals. Denote these data by $(x_i, y_i)$ for individual $i = 1, 2, \ldots, n$. The sample means are denoted by
\[ \bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n} \quad\text{and}\quad \bar{y} = \frac{\sum y_i}{n} , \]
the sums of squares about the mean are denoted by
\[ C_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n} \quad\text{and}\quad C_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum y_i^2 - \frac{(\sum y_i)^2}{n} , \]
and the standard deviations of $x$ and $y$ are
\[ s_x = \sqrt{\frac{C_{xx}}{n-1}} \quad\text{and}\quad s_y = \sqrt{\frac{C_{yy}}{n-1}} . \]
The sum of products about the mean is denoted by
\[ C_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} \]
and the correlation coefficient is defined by
\[ r = r_{xy} = \frac{C_{xy}}{\sqrt{C_{xx} C_{yy}}} = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) . \]
This is a measure of the strength and direction of a linear relationship between $x$ and $y$. The least squares regression line is given by the equation
\[ y = a + bx , \]
where this line has slope $b$ and intercept $a$ given by
\[ b = \frac{C_{xy}}{C_{xx}} \quad\text{and}\quad a = \bar{y} - b\bar{x} . \]
The residual sum of squares is
\[ \text{RSS} = \sum_{i=1}^{n} (y_i - a - b x_i)^2 = C_{yy} - \frac{C_{xy}^2}{C_{xx}} = C_{yy} (1 - r_{xy}^2) \]
and the residual standard deviation is
\[ s_{\text{res}} = \sqrt{\frac{\text{RSS}}{n-2}} = \sqrt{\frac{(n-1) s_y^2 (1 - r_{xy}^2)}{n-2}} . \]
Interpretations: For a particular $x$, the quantity $a + bx$ (i.e., the point on the line) can be interpreted as the average value of $y$ for individuals with this $x$ value; and the residual standard deviation can be interpreted as the standard deviation of the $y$ values for individuals with the same $x$ value. The meaning of "individual" will depend on the context. The intercept $a$ is the value of $y$ when $x = 0$ and can therefore be regarded as the average $y$ for individuals that have $x = 0$. (Sometimes this interpretation may not make physical sense.) The slope $b$ is the amount by which $y$ changes when $x$ increases by one unit. It can therefore be regarded as the change in the average $y$ when $x$ increases by one unit.
These interpretations make the tacit assumptions (a) that a straight line is a good description of the relationship and (b) that the scatter of $y$ values about the line is roughly the same for each $x$.
Calculators: If you have a statistical calculator with two-variable data entry, it will have keys that give you $\bar{x}$, $\bar{y}$, $s_x$, $s_y$, $a$, $b$ and $r$. It may also have a key labelled $\hat{y}$ that will give you $a + bx$ when you enter a value of $x$. Amazingly, most such calculators do not have a key to give you the residual standard deviation. To calculate $s_{\text{res}}$, use one of the formulae above.
Also, such calculators generally have keys to give you the values of $n$, $\sum x$, $\sum y$, $\sum x^2$, $\sum y^2$ and $\sum xy$, from which you may also calculate $C_{xx}$, $C_{yy}$ and $C_{xy}$ using the above formulae.
Example 2.0 In order to illustrate the various calculations with simple numbers, here are some fictitious data for a small sample of 12 households, where x is weekly income and y is weekly expenditure. Both variables are in dollars per week.
x: 100 100 200 300 300 400 400 400 500 500 500 600
y: 50 100 95 225 280 270 340 380 400 455 480 535
Here is a scatter plot of these data:
[Scatter plot: expenditure (dollars per week) against income (dollars per week), both axes running from 0 to 600]
You can use your calculator to check that the various statistics are $n = 12$, $\bar{x} = 358.33$, $\bar{y} = 300.83$, $s_x = 162.14$, $s_y = 159.86$ and $r_{xy} = 0.97$. The means and standard deviations are all in dollars (per week). The average expenditure is a bit less than the average income, and for both variables the standard deviation is large in proportion to the mean (i.e., $s_x/\bar{x} = 0.45$ and $s_y/\bar{y} = 0.53$). The correlation coefficient is of course dimensionless and is very close to 1, suggesting that there is a strong positive association between expenditure and income, as can be seen from the figure.
You can also check that the intercept, slope and residual standard deviation for the least squares regression line are $a = -41.7$, $b = 0.96$ and $s_{\text{res}} = 41.06$. So the weekly expenditures of households with the same weekly income $x$ vary about a mean of $-41.7 + 0.96x$ dollars with a standard deviation of about 41 dollars. For example, if the income is 400 dollars, the average expenditure is $-41.7 + 0.96 \times 400 = 342.3$ dollars, and the standard deviation of expenditures is 41 dollars. Note that
expenditures for households with the same income vary with a much smaller standard deviation than do expenditures of households with different incomes.
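All of the Example 2.0 results can be reproduced from the formulae of Section 2.1; an illustrative sketch:

```python
import math

# Reproduce the Example 2.0 regression results from the formulae above.
x = [100, 100, 200, 300, 300, 400, 400, 400, 500, 500, 500, 600]
y = [50, 100, 95, 225, 280, 270, 340, 380, 400, 455, 480, 535]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Cxx = sum((xi - xbar) ** 2 for xi in x)
Cyy = sum((yi - ybar) ** 2 for yi in y)
Cxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b = Cxy / Cxx                                   # slope
a = ybar - b * xbar                             # intercept
r = Cxy / math.sqrt(Cxx * Cyy)                  # correlation coefficient
s_res = math.sqrt((Cyy - Cxy ** 2 / Cxx) / (n - 2))
print(round(a, 1), round(b, 2), round(r, 2), round(s_res, 2))
```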
Here is another plot of the data, with the regression line drawn on:
[Scatter plot of expenditure against income (both in dollars per week, axes 0 to 600) with the fitted least squares line]
You can see that the line provides
a good description of how the average expenditure increases with income. The points are scattered
approximately equally above and below the line, with about the same amount of scatter at each x
value. In this example, both x and y are measured in the same units and the slope b = 0.96 has a simple interpretation: for every dollar increase in income, the households spend on average an extra 96 cents. The intercept a = -41.7 formally represents the average expenditure of households with zero income, but this is not meaningful in the present context as there is no reason to suppose that the straight line description is sensible when income is small.
Example 2.1 The following table gives the area in square km (A) and the number of plant species (S) for 14 of the Galapagos islands, along with values of x = log A and y = log S. Note that these are logs to base e. We want to describe how S depends on A.
Island Area No of log area log no of
sq.km species species
A S x y
Daphne Major 0.34 18 -1.07881 2.89037
Espanola 58.27 97 4.06509 4.57471
Fernandina 634.49 93 6.45282 4.53260
Genovesa 17.35 40 2.85359 3.68888
Isabela 4669.32 347 8.44877 5.84933
Marchena 129.49 51 4.86360 3.93183
Pinzon 17.95 108 2.88759 4.68213
Rabida 4.89 70 1.58719 4.24850
San Salvador 572.33 237 6.34972 5.46806
Santa Cruz 903.82 444 6.80663 6.09582
Santa Fe 24.08 62 3.18138 4.12713
Santa Maria 170.92 285 5.14120 5.65249
Seymour 1.84 44 0.60977 3.78419
Tortuga 1.24 16 0.21511 2.77259
Here are scatter plots of S against A and y against x. Look at these carefully to see the effect of the log transformations. Note that the first plot is dominated by the largest island (Isabela), which looks like an outlier, but this is not particularly extreme on the log scale. It looks as if y is roughly linearly related to x and that the scatter of y about the line is roughly the same for all x.
[Figures: scatter plots of no. of species against area in sq.km; the same variables on log scales; and y = log(species) against x = log(area).]
You can check using your calculator that x̄ = 3.7417, ȳ = 4.4499, C_xx = 100.9813, C_yy = 13.9076, C_xy = 32.2638, s_x = 2.7870, s_y = 1.0343, r_xy = 0.8609 and that the least squares line is

y = 3.25 + 0.32x

with residual standard deviation s_res = 0.55. Thus, for islands of the same area A, the log of the number of species (y) will vary with standard deviation 0.55 about a mean value of 3.25 + 0.32 log A. Transforming this back gives the relation between S and A:

S = e^3.25 × A^0.32 = 25.9 A^0.32.
Thus the number of plant species increases roughly as the cube root of the area of the island, and the
relative standard deviation of this number, for islands of the same area, is about 55%.
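The fit can be reproduced directly from the table; here is a short sketch in Python (not part of the original notes) that computes C_xx, C_yy, C_xy and hence the least squares line and correlation for the Galapagos data:

```python
import math

# Area (sq. km) and number of plant species for the 14 islands in the table.
areas = [0.34, 58.27, 634.49, 17.35, 4669.32, 129.49, 17.95,
         4.89, 572.33, 903.82, 24.08, 170.92, 1.84, 1.24]
species = [18, 97, 93, 40, 347, 51, 108, 70, 237, 444, 62, 285, 44, 16]

x = [math.log(a) for a in areas]    # x = log A (natural logs)
y = [math.log(s) for s in species]  # y = log S
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

Cxx = sum((xi - xbar) ** 2 for xi in x)
Cyy = sum((yi - ybar) ** 2 for yi in y)
Cxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b = Cxy / Cxx                    # least squares slope
a = ybar - b * xbar              # least squares intercept
r = Cxy / math.sqrt(Cxx * Cyy)   # correlation coefficient r_xy

print(f"a = {a:.2f}, b = {b:.2f}, r = {r:.4f}")
```

Exponentiating the fitted line then gives the back-transformed relation S = e^a A^b.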
2.2 Principle of least squares
2.2.1 Fitting a constant
Suppose we have a set of n numbers y_1, y_2, . . . , y_n and we wish to approximate them by a single number a, say. What number a is closest to the ys in the sense of least squares? In other words, what a is such that (y_1 - a)^2 + (y_2 - a)^2 + · · · + (y_n - a)^2 is as small as possible?
The answer is a = ȳ, the mean of the ys. Here is a proof: For any a and i we may write

y_i - a = (y_i - ȳ) + (ȳ - a).

Now square both sides:

(y_i - a)^2 = (y_i - ȳ)^2 + 2(y_i - ȳ)(ȳ - a) + (ȳ - a)^2.

Now add up each side over i = 1, 2, . . . , n. On the right hand side we may add up each of the three terms separately. Furthermore, because ∑(y_i - ȳ) = 0, the middle term adds to zero, so

∑_{i=1}^{n} (y_i - a)^2 = ∑_{i=1}^{n} (y_i - ȳ)^2 + n(ȳ - a)^2.

Each term on the right side is positive, or possibly zero, and we can make the right hand side (and hence the left hand side) as small as possible by choosing a = ȳ. Furthermore, the minimum value of ∑(y_i - a)^2 is therefore ∑(y_i - ȳ)^2 = C_yy.
You can also derive this result by using calculus.
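The identity, and the fact that a = ȳ minimises the sum of squares, are easy to check numerically; a small sketch (the data values are illustrative, not from the notes):

```python
ys = [2.0, 5.0, 7.0, 10.0]   # any small data set (illustrative values)
ybar = sum(ys) / len(ys)     # the sample mean, here 6.0

def sum_sq(a):
    """Sum of squared deviations of the ys from a."""
    return sum((y - a) ** 2 for y in ys)

# The identity: sum (y_i - a)^2 = C_yy + n*(ybar - a)^2, where C_yy = sum_sq(ybar)
Cyy = sum_sq(ybar)
for a in [4.0, 6.0, 8.0]:
    assert abs(sum_sq(a) - (Cyy + len(ys) * (ybar - a) ** 2)) < 1e-9

# a = ybar gives the smallest possible sum of squares
assert sum_sq(ybar) <= min(sum_sq(4.0), sum_sq(8.0))
print(ybar, Cyy)
```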
2.2.2 Fitting a straight line
Imagine a scatter plot of points (x_1, y_1), (x_2, y_2), . . . , (x_n, y_n). Suppose we want to approximate the relationship describing how y depends on x by a straight line y = a + bx. What line is closest to the points in the sense of least squares? That is, what values of a and b are such that

∑_{i=1}^{n} (y_i - a - b x_i)^2

is as small as possible? The answer is

b = C_xy / C_xx  and  a = ȳ - b x̄.

You can prove this either using calculus or by extending the above argument using algebra. The equation of the least squares line is therefore

y = ȳ + b(x - x̄).

This line goes through the point (x̄, ȳ) and has slope b = C_xy / C_xx.
2.3 Properties of the correlation coecient
The correlation coefficient r_xy, also known as Pearson's correlation coefficient, is defined as

r_xy = C_xy / √(C_xx C_yy) = (1/(n - 1)) ∑_{i=1}^{n} ((x_i - x̄)/s_x)((y_i - ȳ)/s_y).
It is a measure of the strength and direction of a linear relation between two quantitative variables.
You can see from the second formula above that it does not depend on the units of measurement.
If you change the origin and scale of x or y (or both), r_xy does not change. It has the following mathematical properties:
- -1 ≤ r_xy ≤ 1,
- if r_xy > 0, y tends to increase as x increases,
- if r_xy < 0, y tends to decrease as x increases,
- if r_xy = 1 or r_xy = -1, the points in the xy-scatter plot lie exactly on a straight line,
- the closer r_xy is to 1 or -1, the closer the points are to a straight line.
These properties are illustrated by the following scatter plots:

[Figure: five xy-scatter plots with r = -1, r = -0.6, r = 0, r = 0.9 and r = 1.]
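The invariance of r_xy under changes of origin and scale can be verified numerically; a sketch (illustrative data, not from the notes) computing r_xy before and after rescaling both variables:

```python
import math

def corr(x, y):
    """Pearson correlation coefficient r_xy = C_xy / sqrt(C_xx * C_yy)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    Cxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    Cxx = sum((a - xbar) ** 2 for a in x)
    Cyy = sum((b - ybar) ** 2 for b in y)
    return Cxy / math.sqrt(Cxx * Cyy)

x = [1.0, 2.0, 4.0, 5.0, 9.0]   # illustrative values
y = [2.1, 3.0, 3.9, 6.2, 8.8]

r1 = corr(x, y)
# Change origin and (positive) scale of both variables, e.g. inches -> cm, shift zero:
r2 = corr([2.54 * xi + 10 for xi in x], [0.5 * yi - 3 for yi in y])
print(round(r1, 4))
```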
There are many types of relationships between two variables. Like other summary statistics, the correlation coefficient by itself cannot summarise a relationship adequately. One should always look at a scatter plot of the data. To illustrate this, the figures below show some very different types of relationship where the points all have the same value of r_xy = 0.7:
[Figure: six scatter plots (a)–(f), all with r_xy = 0.7; (a) contains a remote point and (b) contains an outlier.]
In (a) there is a remote point whose x value is very different from that of the other data values. The observations show no relationship, but the remote point makes r_xy = 0.7. In (b) there is an outlier that does not fit the pattern of the rest of the data. The observations are very highly correlated except for the outlier, which brings the correlation down to 0.7. Case (c) is a typical scatter plot for a slightly weak relationship between the variables.
In (d) there is a very strong relationship, but it is not linear. In (e) there are two distinct groups
of observations with no apparent relationship between x and y within either group; but the average x
and y values both differ for the two groups. In (f) there are two distinct groups of observations with
high correlation within each group.
Example 2.2 The following table gives the percentage of 18 London Boroughs devoted to
open space (x) and the percentage of all accidents which involve children in the Boroughs
(y).
Borough x y Borough x y Borough x y
Bermondsey 5.0 46.3 Woolwich 7.0 38.2 Stoke Newington 6.5 30.8
Deptford 2.2 43.4 Stepney 2.5 38.2 Hammersmith 12.2 28.3
Islington 1.5 42.9 Poplar 4.5 37.0 Wandsworth 14.6 23.8
Fulham 4.2 42.2 Southwark 3.1 35.3 Marylebone 23.6 17.8
Shoreditch 1.4 40.0 Camberwell 5.2 33.6 Hampstead 14.8 17.1
Finsbury 2.0 38.8 Paddington 7.2 33.6 Westminster 27.5 10.8
You can check that r_xy = -0.92, which indicates a strong negative association between x and y. But you cannot conclude from the very high negative correlation that providing more open space in a borough will cause the number of accidents involving children to fall. Boroughs with a high proportion of parks may have fewer children living there (e.g., Westminster, where there are a large number of office blocks), so there will be fewer accidents involving children. Also the boroughs with a high rate of accidents involving children tend to be the poorer boroughs. As most accidents occur in the home, it could be cramped housing conditions that are causing the high rate of accidents involving children, not the lack of open space: see Example 1.1, where approximately half the children died from choking, burns or poisoning, which more open space will do little to prevent. Reasons for associations like this one are usually quite complex. A scatter plot of these data is given below.
2.4 Rank correlation
Another measure of association is Spearman's rank correlation coefficient, usually denoted by r_S. It is the same as Pearson's coefficient applied to the ranks of x and y. It therefore has corresponding properties. For example, if r_S = 1, then the ranks of x and y lie exactly on a line: in other words, the values of x and y are in the same order. In general, r_S is more robust, in that it is not affected by extreme values: it just describes how the orderings of the two variables are related.
A simple formula for calculating r_S is

r_S = 1 - 6 ∑_i d_i^2 / (n(n^2 - 1))

where d_i is the difference between the rank of x_i and the rank of y_i.
In Example 2.2, the ranks of x and y and their dierences are:
Borough rx ry d Borough rx ry d Borough rx ry d
Bermondsey 9 18 -9 Woolwich 12 11.5 0.5 Stoke Newington 11 6 5
Deptford 4 17 -13 Stepney 5 11.5 -6.5 Hammersmith 14 5 9
Islington 2 16 -14 Poplar 8 10 -2 Wandsworth 15 4 11
Fulham 7 15 -8 Southwark 6 9 -3 Marylebone 17 3 14
Shoreditch 1 14 -13 Camberwell 10 7.5 2.5 Hampstead 16 2 14
Finsbury 3 13 -10 Paddington 13 7.5 5.5 Westminster 18 1 17
Here ∑_i d_i^2 = (-9)^2 + (-13)^2 + · · · + (17)^2 = 1779 and n = 18, so n^2 - 1 = 323. Hence

r_S = 1 - (6 × 1779)/(18 × 323) = -0.84.

Again this indicates a quite strong negative relationship.
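The ∑d^2 calculation can be reproduced mechanically; a sketch in Python (not part of the notes) that ranks the borough data, using average ranks for ties, and applies the formula:

```python
def ranks(v):
    """Ranks 1..n, with tied values given the average of their ranks."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1                       # extend over a block of ties
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1  # average rank for the tied block
        i = j + 1
    return r

# % open space (x) and % accidents involving children (y), Example 2.2
x = [5.0, 7.0, 6.5, 2.2, 2.5, 12.2, 1.5, 4.5, 14.6,
     4.2, 3.1, 23.6, 1.4, 5.2, 14.8, 2.0, 7.2, 27.5]
y = [46.3, 38.2, 30.8, 43.4, 38.2, 28.3, 42.9, 37.0, 23.8,
     42.2, 35.3, 17.8, 40.0, 33.6, 17.1, 38.8, 33.6, 10.8]

rx, ry = ranks(x), ranks(y)
n = len(x)
sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
rs = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(sum_d2, round(rs, 2))
```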
Here are scatter plots of y against x and of rank(y) against rank(x).

[Figure: Example 2.2: scatter plots of % accidents with children, y, against % open space, x, and of rank of y against rank of x.]
You can see from these why the rank correlation r_S is not as strong as the ordinary correlation r_xy in this example.
In Example 2.1, the correlation between the log number of species and log area is r_xy = 0.86, while the correlation between number of species and area is only r_SA = 0.60, indicating a weaker linear relation (look at the scatter plots again). On the other hand, the log transformation does not change the ordering of the numbers, so the rank correlation between S and A is the same as for y and x. You can check that it is r_S = 0.84, which is very close to r_xy.
3 The Normal Distribution
Example 3.1 The span measurements (in inches) of 1200 men were as follows.
Mid point   Frequency   Relative frequency       Mid point   Frequency   Relative frequency
58.45 1 0.001 70.45 155 0.129
59.45 2 0.002 71.45 129 0.108
60.45 1 0.001 72.45 103 0.086
61.45 4 0.003 73.45 79 0.066
62.45 6 0.005 74.45 49 0.041
63.45 14 0.012 75.45 30 0.025
64.45 34 0.028 76.45 13 0.011
65.45 66 0.055 77.45 10 0.008
66.45 88 0.073 78.45 7 0.006
67.45 109 0.091 79.45 3 0.003
68.45 136 0.113 80.45 1 0.001
69.45 159 0.132 81.45 1 0.001
Here is a relative frequency histogram for these data:

[Figure: relative frequency histogram of span (inches), from about 58 to 82 inches.]
With the 140 observations of span in Example 1.5 the outline of the relative frequency histogram is
jagged. In Example 3.1 there are 1200 observations, the group widths are narrower and the outline
of the relative frequency histogram is smoother. As we obtain more and more observations we can
approximate the outline of the histogram by a smooth curve called a relative frequency curve,
usually denoted by f(x). In probability theory, f(x) is also called a probability density function.
A variable whose frequency curve has the following mathematical form

f(x) = (1/(σ√(2π))) e^(-(1/2)((x - μ)/σ)^2),   -∞ < x < ∞,

is said to have a normal distribution. This is a symmetric bell-shaped curve. The parameter μ represents the population mean and σ is the population standard deviation.
Examples of frequency curves:

[Figure: three frequency curves: Normal; positively skew; symmetric but not normal.]
The relative frequency histogram in Example 3.1 suggests that the span of males may be approximated
by a normal distribution. Many variables are approximately normally distributed. Sometimes skew
data can be transformed (e.g., by taking the square root or logarithm of every observation) in order to
make the resulting values approximately symmetrically distributed and so enable the use of statistical
techniques that assume normally distributed variables.
3.1 Population distribution and parameters
For a sample from a population we may calculate sample statistics such as the sample mean and
sample standard deviation. The corresponding quantities for the population itself are referred to as
population parameters. In order to distinguish between sample statistics and population parameters,
we use Roman letters for sample values and Greek letters for population parameters. In particular, the sample mean and standard deviation are denoted by x̄ and s, and the population mean and standard deviation by μ and σ. It is important to understand this distinction.
A relative frequency curve f(x) describes the distribution of a variable for a population. Mathematically, f(x) is defined so that the total area under the curve equals 1.
3.2 Probability
Of the 1200 men whose spans are depicted by the relative frequency histogram in Example 3.1, the proportion with span less than 65 inches is represented by the area of those bars to the left of 65 in that histogram. Similarly, among the men in the population, the proportion with span less than 65 inches is the area under the frequency curve to the left of 65, i.e., the shaded area:

[Figure: normal frequency curve of span, with the area to the left of 65 shaded.]
The population relative frequency is called the probability, i.e., the probability that a man has a span less than 65 inches is the proportion of the population with a span less than 65 inches. In this example the variable being measured is each man's span. If we denote this variable by X then a convenient way to write down this probability is P(X < 65). Assume that the variable X (a man's span) has a population distribution that is normal with mean μ and standard deviation σ. Subtracting μ from the variable and then dividing by σ gives a new variable whose distribution is normal with mean 0 and standard deviation 1. This new variable is said to have a standard normal distribution.
Algebraically, the new (standardised) variable is Z = (X - μ)/σ, and Z has a standard normal distribution.
Thus:

P(X < c) = P((X - μ)/σ < (c - μ)/σ) = P(Z < (c - μ)/σ).

Thus the proportion with X < c equals the proportion with Z < (c - μ)/σ, i.e., the two shaded areas below are the same.

[Figure: frequency curve of X with the area to the left of c shaded; frequency curve of Z with the area to the left of (c - μ)/σ shaded.]
3.3 Calculation of probabilities for a normal distribution
The shaded area under the standard normal frequency curve can be obtained from Statistical Tables.
Table 2 gives P(Z < z) for values of z from 0 to 3.30. Other probabilities can be calculated from
these.
Example 3.2
1. P(Z < 1.59) = 0.9441
2. P(0.6 < Z < 2.0) = P(Z < 2.0) - P(Z < 0.6) = 0.9772 - 0.7257 = 0.2515
3. P(Z > 1.8) = 1 - P(Z < 1.8) = 1 - 0.9641 = 0.0359
4. P(Z < -0.75) = P(Z > 0.75) = 1 - P(Z < 0.75) = 1 - 0.7734 = 0.2266
5. P(-2.31 < Z < 1.65) = P(Z < 1.65) - P(Z < -2.31) = P(Z < 1.65) - (1 - P(Z < 2.31)) = 0.9505 - 1 + 0.9896 = 0.9401
6. The distribution of hand span X in a population of men is normal with mean 70 inches and standard deviation 3 inches. What proportion of men have a span less than 65 inches?

P(X < 65) = P(Z < (65 - 70)/3) = P(Z < -1.67) = 1 - P(Z < 1.67) = 1 - 0.9525 = 0.0475.
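The table lookups in Example 3.2 can be checked against the standard normal cumulative distribution function, which is available through the error function; a sketch (not part of the notes):

```python
import math

def phi(z):
    """Standard normal CDF: P(Z < z) = (1 + erf(z / sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Parts 1-4 of Example 3.2, to the accuracy of the tables:
assert abs(phi(1.59) - 0.9441) < 0.0005
assert abs((phi(2.0) - phi(0.6)) - 0.2515) < 0.0005
assert abs((1 - phi(1.8)) - 0.0359) < 0.0005
assert abs(phi(-0.75) - 0.2266) < 0.0005

# Part 6: span X ~ Normal(70, 3), so P(X < 65) = P(Z < (65 - 70)/3)
p = phi((65 - 70) / 3)
print(round(p, 4))
```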
3.4 Percentage points
A percentage point is the value of a variable for a given area. For example, 1.645 is the upper 5 percentage point of Z and -1.645 is the lower 5 percentage point. Tables of percentage points for the standard normal distribution, giving the values of z corresponding to various shaded areas, are also available.

[Figure: standard normal curve with the point -1.645 marked, with area 0.05 below it and area 0.95 above it.]
Example 3.2 part 6, continued. What span would be exceeded by only 5% of the population of men? Here we need to find the value of k such that P(X > k) = 0.05, or equivalently, P(X < k) = 0.95; or equivalently, the value of k such that P(Z < (k - 70)/3) = 0.95.
Hence, (k - 70)/3 = 1.645, so k = 70 + (3 × 1.645) = 74.9 inches.
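Percentage points can also be found numerically by inverting the CDF, for instance by bisection; a sketch (not from the notes) recovering the upper 5% point and the answer k ≈ 74.9:

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def z_for_area(p, lo=-10.0, hi=10.0):
    """Find z with P(Z < z) = p by bisection; valid since phi is increasing."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

z95 = z_for_area(0.95)   # upper 5 percentage point of Z
k = 70 + 3 * z95         # span exceeded by only 5% of men
print(round(z95, 3), round(k, 1))
```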
Example 3.3 Find the proportions of values from a normal distribution that lie between μ - kσ and μ + kσ for k = 1, 2, 3. Now

P(μ - kσ < X < μ + kσ) = P(-k < Z < k)
= P(Z < k) - P(Z < -k)
= P(Z < k) - (1 - P(Z < k))
= 2P(Z < k) - 1.

Hence

P(μ - σ < X < μ + σ) = 2P(Z < 1) - 1 = 2 × 0.8413 - 1 = 0.6826
P(μ - 2σ < X < μ + 2σ) = 2P(Z < 2) - 1 = 2 × 0.9772 - 1 = 0.9544
P(μ - 3σ < X < μ + 3σ) = 2P(Z < 3) - 1 = 2 × 0.9987 - 1 = 0.9974
This leads to a useful way of interpreting the standard deviation of a variable that is approximately normally distributed. For such a variable:
- approximately 68% of observations will lie between μ - σ and μ + σ,
- approximately 95% of observations will lie between μ - 2σ and μ + 2σ, and
- approximately 99.7% of observations will lie between μ - 3σ and μ + 3σ.
4 Sampling distributions
What happens when we take random samples from a population and, for each sample, we calculate a statistic such as the sample mean?
4.1 Theory
1. Consider a Normal population that has mean μ and standard deviation σ. Imagine taking a random sample of n observations and calculating the sample mean x̄. Imagine doing this many times. What will the values of x̄ look like? It is found that
- they have mean equal to the population mean μ,
- they have standard deviation equal to σ/√n, and
- they are Normally distributed.
So this distribution will be centered at μ. It will be less spread out for a sample of size n = 100 than for n = 10, say. The larger n is, the smaller is the standard deviation of the distribution of means, and the closer x̄ is likely to be to μ.
2. Suppose we use the sample mean x̄ to estimate the population mean μ. Then the quantity σ/√n is called the standard error of x̄. It measures how different x̄ might be from μ, i.e., it measures the precision (or lack of precision) of the estimate of μ. If the standard error is small, then x̄ is likely to be close to μ. If it is large, then x̄ might be very different from μ.
3. Suppose we know σ and we find n and x̄ from a sample. Then we can use this result to tell us something about the population mean μ. The standardised mean

Z = (X̄ - μ)/(σ/√n)

has a standard normal distribution, so it is likely to be between -2 and +2, and very likely to be between -3 and +3. The probability that Z is between -1.96 and +1.96 is 0.95. So if we form the interval

x̄ - 1.96σ/√n  to  x̄ + 1.96σ/√n

this will probably include μ (with probability 0.95). That is, if we take many random samples and form this interval for each, then 95% of them will include μ. This interval is called a 95% confidence interval for μ.
4. There is another way we use the above result. Suppose there is a hypothesis that the population mean μ has some particular value, for example suppose the hypothesis is that μ = 25. Then we can see if the data agree with this hypothesis by calculating the value of z given by

z = (x̄ - 25)/(σ/√n).

If the hypothesis is correct, z will be from a standard normal distribution. If not, z is more likely to be further from 0.
We can use the standard normal distribution to calculate the probability of getting a value further from 0 than z, assuming the hypothesis is correct. This is called the P-value. If the P-value is very small, this is evidence that the hypothesis is not correct. This procedure is called a hypothesis test or a significance test.
5. Sometimes we want to calculate a confidence interval to see how closely we can estimate the population mean, and sometimes we want to do a significance test to see if the data agree with a hypothesised value of μ. We will look at examples in the next section.
6. Now suppose we do not know the population standard deviation σ. Usually we estimate σ using the sample standard deviation s_x, and instead of Z we consider

T = (X̄ - μ)/(s_x/√n).

Now T does not have a standard normal distribution but a Student's t-distribution with n - 1 degrees of freedom. We usually abbreviate degrees of freedom to df or the Greek letter ν. A t-distribution is still symmetrical, but has longer tails than the Normal, and its shape depends on n. When n is very large it is very close to a standard normal distribution.
So to find a 95% confidence interval for μ we do the same as before except that we use the t-distribution. This gives the interval

x̄ - t_p s_x/√n  to  x̄ + t_p s_x/√n

where the number t_p is the upper 2.5 percentage point of the t-distribution with n - 1 degrees of freedom. We can find t_p from statistical tables (e.g., Table 3) or from computer programs. It is always bigger than 1.96 and as n gets larger t_p gets nearer to 1.96.
And if we want to test a hypothesis, for example that μ = 25, we calculate

t = (x̄ - 25)/(s_x/√n)

and determine the P-value by finding the probability of getting a value further from 0 than t, using the t-distribution with n - 1 degrees of freedom.
7. Now consider a population that is not Normal. Let μ and σ be the population mean and standard deviation. If we take random samples of size n and calculate the sample mean x̄ for each, then the distribution of sample means
- still has mean μ
- and standard deviation σ/√n, and
- if n is large, is approximately Normal.
This last result, the fact that the distribution of sample means is approximately Normal regardless of the population, is known as the central limit theorem. A consequence of this is that if we have a large sample from a non-Normal population, we can calculate approximate confidence intervals for μ, as if the population were Normal.
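These sampling-distribution facts can be checked by simulation; a sketch (not from the notes) drawing many samples of size n = 25 from a Normal population with μ = 100 and σ = 15:

```python
import random
import statistics

random.seed(1)
mu, sigma, n = 100.0, 15.0, 25
num_samples = 20000

# For each sample of size n, record the sample mean xbar.
means = []
for _ in range(num_samples):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(sum(sample) / n)

mean_of_means = statistics.mean(means)
sd_of_means = statistics.stdev(means)

# Theory: the means average mu, with standard deviation sigma/sqrt(n) = 15/5 = 3.
print(round(mean_of_means, 2), round(sd_of_means, 2))
```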
4.2 Simulation of samples from a Normal population
The table below gives 75 random samples, each of size 4, from a Normal population with mean 100 and standard deviation 15 (think of them as IQ scores). Also given are the sample means, standard deviations, values of t = 2(x̄ - 100)/s_x and 95% confidence intervals x̄ ± 3.182 s_x/2.
sample mean s.d. t confidence
interval
1 100 87 122 111 105.00 14.99 0.67 81.15 128.85
2 99 110 122 101 108.00 10.49 1.53 91.31 124.69
3 85 89 98 94 91.50 5.69 -2.99 82.45 100.55
4 98 85 104 76 90.75 12.63 -1.46 70.65 110.85
5 95 120 113 81 102.25 17.65 0.25 74.17 130.33
6 99 120 103 109 107.75 9.14 1.70 93.20 122.30
7 72 134 73 95 93.50 29.01 -0.45 47.34 139.66
8 105 81 61 102 87.25 20.50 -1.24 54.63 119.87
9 100 103 97 117 104.25 8.85 0.96 90.18 118.32
10 118 137 98 130 120.75 17.08 2.43 93.58 147.92
11 100 76 99 92 91.75 11.09 -1.49 74.11 109.39
12 85 120 105 110 105.00 14.72 0.68 81.58 128.42
13 114 91 69 95 92.25 18.46 -0.84 62.87 121.63
14 79 93 83 92 86.75 6.85 -3.87 75.85 97.65 *
15 93 114 129 106 110.50 15.07 1.39 86.53 134.47
16 88 92 104 85 92.25 8.34 -1.86 78.98 105.52
17 114 120 123 112 117.25 5.12 6.73 109.10 125.40 *
18 83 87 105 86 90.25 9.98 -1.95 74.37 106.13
19 102 74 120 67 90.75 24.68 -0.75 51.49 130.01
20 103 79 93 116 97.75 15.65 -0.29 72.85 122.65
21 136 112 97 101 111.50 17.52 1.31 83.62 139.38
22 101 96 128 100 106.25 14.66 0.85 82.93 129.57
23 100 88 110 90 97.00 10.13 -0.59 80.88 113.12
24 111 111 79 107 102.00 15.45 0.26 77.42 126.58
25 83 116 96 107 100.50 14.25 0.07 77.83 123.17
26 67 66 105 87 81.25 18.55 -2.02 51.73 110.77
27 118 101 97 113 107.25 9.88 1.47 91.53 122.97
28 123 95 84 96 99.50 16.58 -0.06 73.12 125.88
29 92 81 105 84 90.50 10.72 -1.77 73.44 107.56
30 106 85 117 66 93.50 22.63 -0.57 57.49 129.51
31 95 81 109 111 99.00 13.95 -0.14 76.80 121.20
32 108 92 92 90 95.50 8.39 -1.07 82.16 108.84
33 140 82 51 86 89.75 36.97 -0.55 30.93 148.57
34 116 93 125 103 109.25 14.10 1.31 86.81 131.69
35 101 97 120 108 106.50 10.08 1.29 90.46 122.54
36 86 106 91 105 97.00 10.03 -0.60 81.04 112.96
37 74 98 102 108 95.50 14.91 -0.60 71.78 119.22
38 105 109 115 107 109.00 4.32 4.17 102.13 115.87 *
39 92 96 124 81 98.25 18.30 -0.19 69.13 127.37
40 122 103 112 112 112.25 7.76 3.16 99.90 124.60
41 107 100 98 91 99.00 6.58 -0.30 88.53 109.47
42 106 71 88 122 96.75 22.08 -0.29 61.62 131.88
43 108 87 108 68 92.75 19.24 -0.75 62.14 123.36
44 101 83 91 89 91.00 7.48 -2.41 79.09 102.91
45 105 76 89 87 89.25 11.95 -1.80 70.23 108.27
46 80 91 103 113 96.75 14.34 -0.45 73.94 119.56
47 64 96 113 91 91.00 20.31 -0.89 58.68 123.32
48 105 84 103 88 95.00 10.55 -0.95 78.21 111.79
49 137 97 101 107 110.50 18.14 1.16 81.64 139.36
50 145 86 94 87 103.00 28.23 0.21 58.09 147.91
51 77 106 87 118 97.00 18.46 -0.33 67.63 126.37
52 119 95 76 93 95.75 17.69 -0.48 67.61 123.89
53 123 101 112 124 115.00 10.80 2.78 97.82 132.18
54 112 89 80 112 98.25 16.30 -0.21 72.32 124.18
55 91 75 114 96 94.00 16.06 -0.75 68.44 119.56
56 113 90 116 124 110.75 14.59 1.47 87.53 133.97
57 65 93 90 112 90.00 19.30 -1.04 59.29 120.71
58 119 107 92 129 111.75 15.95 1.47 86.38 137.12
59 83 107 92 97 94.75 10.01 -1.05 78.82 110.68
60 108 100 111 106 106.25 4.65 2.69 98.86 113.64
61 104 99 121 115 109.75 10.05 1.94 93.77 125.73
62 104 125 104 106 109.75 10.21 1.91 93.51 125.99
63 115 111 90 102 104.50 11.09 0.81 86.85 122.15
64 102 103 115 115 108.75 7.23 2.42 97.25 120.25
65 105 116 89 94 101.00 12.03 0.17 81.86 120.14
66 101 97 111 79 97.00 13.37 -0.45 75.73 118.27
67 122 99 123 67 102.75 26.29 0.21 60.93 144.57
68 63 94 92 101 87.50 16.78 -1.49 60.80 114.20
69 114 119 89 80 100.50 18.95 0.05 70.35 130.65
70 108 97 84 114 100.75 13.20 0.11 79.75 121.75
71 83 96 64 75 79.50 13.48 -3.04 58.06 100.94
72 108 101 106 85 100.00 10.42 0.00 83.41 116.59
73 109 79 96 105 97.25 13.33 -0.41 76.05 118.45
74 134 99 106 106 111.25 15.52 1.45 86.56 135.94
75 99 111 74 82 91.50 16.66 -1.02 64.99 118.01
Here are histograms of the 300 individual scores x, the 75 sample means x̄, the 75 standardised means z = 2(x̄ - 100)/15 and the 75 t values:
[Figure: four histograms: the individual scores x; the sample means; the z values; and the t values; each with a theoretical frequency curve superimposed.]
Superimposed on the histograms are the theoretical frequency curves: for x, Normal with mean 100 and standard deviation 15; for x̄, Normal with mean 100 and standard deviation 7.5; for z, standard normal; and for t, a t-distribution with 3 degrees of freedom.
Note that both x and x̄ are centered at the population mean of 100, but the standard deviation of the distribution of sample means is half that of the original population. Look carefully at the comparison between z and t. Both are symmetric about 0, but the latter, which has a t-distribution, has longer tails, implying a likelihood of more extreme values than for the normal distribution.
Look also at the confidence intervals in the table. Just three of these (marked with a *) fail to include the population mean of 100. The theory says that in the long run 95% of such intervals will include 100, and in our 75 samples 72/75 = 0.96 of intervals do, which is about right.
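The entries in the table can be verified for any row; a sketch (not from the notes) recomputing the mean, standard deviation, t value and 95% confidence interval for sample 1:

```python
import math
import statistics

sample = [100, 87, 122, 111]   # sample 1 from the table
n = len(sample)
xbar = sum(sample) / n
s = statistics.stdev(sample)   # sample standard deviation (divisor n - 1)

t = (xbar - 100) / (s / math.sqrt(n))  # = 2*(xbar - 100)/s since n = 4
half_width = 3.182 * s / math.sqrt(n)  # 3.182 = upper 2.5% point of t with 3 df
lo, hi = xbar - half_width, xbar + half_width
print(round(xbar, 2), round(s, 2), round(t, 2), round(lo, 2), round(hi, 2))
```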
4.3 Simulation of samples from an Exponential population
A variable whose frequency curve has the following mathematical form

f(x) = λ e^(-λx),   0 < x < ∞,

is said to have an exponential distribution. This distribution is often used to describe the times between random events such as arrivals of telephone calls, accidents, earthquakes, etc. In this context the parameter λ is called the rate: the average number of events per unit time. The population mean (i.e., the average time between successive events) is equal to 1/λ. The population standard deviation is also equal to 1/λ. Thus if the times between arrivals of telephone calls had an exponential distribution with rate three per hour, then the average time between calls would be 1/3 of an hour, i.e., 20 minutes. Also the standard deviation of these times would be 20 minutes. The exponential frequency curve is very different from a normal curve. It has a mode at 0 and decays exponentially as x increases. A variable with an exponential distribution is necessarily positive, whereas one with a Normal distribution has a non-zero probability of being negative.
Here are 75 samples, each of size 9, from an Exponential population with mean 1. The population standard deviation is also 1. The 75 sample means are also given below.
sample mean
0.49 2.41 1.61 1.37 0.20 0.60 0.09 0.79 3.41 1.22
0.73 0.05 1.82 1.62 0.08 2.76 0.19 0.17 3.50 1.21
0.42 0.35 3.19 2.89 0.88 1.69 0.38 0.21 0.79 1.20
0.52 0.36 3.95 0.44 0.62 0.37 0.15 2.37 1.08 1.09
1.09 0.45 2.56 1.15 0.73 7.62 0.39 0.92 0.41 1.70
0.01 0.04 0.45 0.57 1.54 0.05 1.24 0.77 2.25 0.77
1.15 0.86 0.86 0.46 0.58 1.70 1.20 0.29 0.03 0.79
0.09 0.16 0.61 2.04 0.46 5.02 0.95 0.12 0.78 1.14
2.20 0.84 0.56 0.73 0.15 0.39 0.99 1.90 0.29 0.90
2.29 5.35 0.16 0.40 1.15 1.15 2.45 0.85 1.11 1.66
0.91 0.03 0.64 0.07 0.27 0.94 0.40 0.20 1.51 0.55
2.55 0.47 0.45 0.14 0.25 1.52 0.49 0.40 3.19 1.05
0.14 0.78 0.90 1.59 0.24 0.22 0.70 0.47 0.79 0.65
0.03 0.95 3.03 0.04 2.17 0.49 1.21 0.11 1.53 1.06
0.01 3.88 0.44 1.13 0.26 0.50 0.56 1.06 1.14 1.00
1.09 1.43 1.49 0.20 1.80 0.28 0.15 1.76 0.16 0.93
1.88 4.44 0.44 0.11 0.42 0.52 2.37 0.48 0.46 1.23
7.10 0.48 1.56 3.59 0.65 0.10 1.65 1.80 0.20 1.90
1.01 0.02 1.01 1.85 0.56 0.40 0.13 0.10 1.23 0.70
1.56 2.41 1.76 1.87 0.85 0.15 1.27 1.74 2.69 1.59
0.80 2.00 2.25 1.39 0.90 0.53 0.10 0.27 1.03 1.03
3.65 2.54 0.41 0.24 0.44 0.45 0.63 0.41 0.59 1.04
2.15 2.20 2.02 0.83 1.27 1.03 0.63 1.74 0.24 1.35
0.69 2.14 1.81 1.41 1.83 2.04 0.96 0.43 0.12 1.27
2.11 0.48 2.09 0.62 1.60 0.26 1.20 0.63 2.30 1.25
0.81 0.70 2.37 0.62 0.50 0.94 0.39 1.61 0.48 0.94
1.79 0.02 0.82 0.71 2.22 1.64 2.77 2.19 2.96 1.68
0.64 1.23 0.88 0.53 0.11 4.25 1.70 1.93 1.48 1.42
0.09 0.25 0.15 0.60 0.18 1.56 2.35 1.23 1.13 0.84
4.38 0.12 0.00 4.33 0.15 2.81 0.05 1.65 2.42 1.77
1.37 1.15 2.26 0.06 0.39 0.24 1.78 0.38 1.41 1.00
1.76 0.79 1.69 0.03 0.07 1.36 2.43 1.42 0.91 1.16
0.66 1.00 0.28 0.61 0.37 0.26 0.28 0.37 0.01 0.43
0.61 0.13 0.19 1.13 0.45 0.69 0.11 0.19 0.21 0.41
1.84 0.98 1.84 1.05 1.47 0.03 0.40 0.66 0.28 0.95
0.76 0.40 0.32 0.35 0.33 0.04 0.05 0.98 1.23 0.49
0.05 0.10 0.22 0.37 0.70 0.38 0.15 0.78 1.00 0.42
0.95 1.05 0.29 1.80 0.09 0.29 0.01 0.55 0.28 0.59
3.10 0.24 0.60 1.09 1.94 0.88 0.86 1.79 1.93 1.38
1.21 0.09 0.34 0.79 0.30 1.42 0.30 1.68 0.29 0.71
0.04 0.78 2.05 0.18 1.53 0.37 1.55 1.08 1.47 1.01
1.21 1.38 0.15 1.16 0.94 0.21 0.91 0.57 1.57 0.90
0.26 0.15 3.68 0.56 0.28 0.80 1.19 0.20 3.28 1.16
1.11 1.05 0.27 1.36 0.12 0.18 2.32 2.46 0.80 1.07
0.84 0.28 0.94 1.04 2.76 0.07 1.78 1.06 1.73 1.17
0.10 0.49 0.09 0.62 3.70 0.80 0.13 0.38 1.84 0.91
0.49 0.40 1.09 2.76 0.19 0.41 0.43 1.27 2.94 1.11
0.71 0.07 1.45 1.39 0.92 0.19 0.17 0.43 0.91 0.69
0.17 0.28 0.73 0.24 0.85 2.40 1.14 1.49 0.16 0.83
1.18 0.80 0.33 0.37 0.23 0.48 1.53 0.35 1.14 0.71
0.60 0.83 0.42 1.11 0.87 0.30 0.30 0.95 0.42 0.65
0.50 0.79 2.06 0.34 0.85 3.39 0.73 0.21 1.78 1.18
1.94 0.18 5.54 0.55 0.03 0.47 0.75 0.56 0.14 1.13
0.27 0.39 1.04 0.18 0.15 0.57 0.78 1.34 0.67 0.60
0.38 0.13 1.77 1.82 1.60 1.11 2.79 0.84 2.09 1.39
2.54 0.65 0.56 0.51 1.10 0.39 0.54 5.40 1.37 1.45
0.35 1.47 0.31 0.00 0.48 3.51 1.00 0.04 5.65 1.42
0.30 0.44 1.22 0.23 0.30 4.66 1.81 0.38 3.42 1.42
0.32 2.44 1.24 1.56 2.61 0.86 0.24 0.39 1.59 1.25
0.27 0.43 0.63 0.09 1.20 2.69 1.88 1.40 0.61 1.02
0.78 0.02 1.49 0.47 1.03 1.42 0.79 0.68 0.33 0.78
0.16 0.18 4.08 0.35 0.77 0.56 0.01 0.08 1.11 0.81
0.72 0.14 0.53 2.83 1.83 2.70 3.17 0.54 1.46 1.55
0.24 1.13 0.05 0.92 0.54 0.69 0.01 0.72 1.23 0.61
0.14 0.05 0.28 0.89 1.48 1.80 1.04 2.29 0.32 0.92
1.53 0.27 2.47 0.25 1.77 2.88 0.25 2.54 0.51 1.39
0.68 0.55 0.13 1.16 1.10 2.51 1.86 0.24 0.05 0.92
3.50 0.18 0.27 0.16 0.89 5.73 0.10 0.22 0.45 1.28
0.94 1.65 0.52 2.68 1.04 0.77 1.08 0.95 1.08 1.19
0.82 0.07 2.26 0.91 2.17 0.09 2.30 0.09 0.76 1.05
4.59 0.60 0.38 0.83 0.41 3.46 1.61 0.39 0.10 1.38
3.00 0.34 0.45 0.27 2.45 0.72 0.84 0.72 0.32 1.01
1.22 0.08 0.10 3.47 0.08 0.70 2.14 3.74 0.20 1.30
0.45 1.78 0.12 0.32 3.39 1.27 0.57 1.09 0.72 1.08
0.02 0.44 0.68 0.43 0.02 0.32 0.16 1.38 0.48 0.44
Here is a histogram of the 675 original observations, along with the exponential frequency curve; and a histogram of the 75 sample means, along with a normal curve with mean μ = 1 and standard deviation σ/√9 = 1/3.

[Figure: histogram of the 675 observations (time, x) with the exponential curve superimposed; histogram of the 75 sample means with the normal curve superimposed.]
This is a vivid demonstration of the central limit theorem. The population is very non-normal, but the distribution of sample means, even for samples of size 9, is not so different from a normal distribution. For a larger sample size, it would be even closer to normal.
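The same experiment is easy to repeat on a larger scale; a sketch (not from the notes) drawing many samples of size 9 from an Exponential(1) population and checking that the sample means have mean ≈ 1 and standard deviation ≈ 1/3:

```python
import random
import statistics

random.seed(2)
n, num_samples = 9, 20000

means = []
for _ in range(num_samples):
    sample = [random.expovariate(1.0) for _ in range(n)]  # rate lambda = 1
    means.append(sum(sample) / n)

# Theory: mean of xbar = 1, sd of xbar = 1/sqrt(9) = 1/3, distribution near Normal.
print(round(statistics.mean(means), 3), round(statistics.stdev(means), 3))
```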
5 Confidence Intervals and Tests based on the t distribution
In §3 we considered populations that have a normal distribution with a known mean and standard
deviation. Often we know that a population has a normal distribution (or approximately a normal distribution) but we do not know its mean and standard deviation. Usually it is the population mean μ that is of most interest. Typically we have a sample from the population and we wish to either
- obtain an estimate of the population mean, or
- compare the mean of the population with some other value.
We use the sample data to make inferences about the population. The sample should be representative of the population from which it is drawn and care should be taken to ensure that this is so, for example by taking a simple random sample (see §1) or possibly a stratified simple random sample.
The size of the sample should be large enough to allow sensible conclusions to be drawn from the data, but it also should be of a size that can easily be handled by the investigators.
5.1 Estimation of a population mean
5.1.1 Method 1: Quoting the sample mean and standard error
We take a sample of n observations from the population and calculate the sample mean x̄. We think
of x̄ as an estimate of μ. We express this by writing
μ̂ = x̄,
where the hat means "an estimate of".
Now if we took another sample of n values, a different x̄ would be obtained and hence a different
estimate of μ. So just quoting x̄ does not give any indication of the possible error involved. However,
we do know that the standard deviation of the sampling distribution of x̄ is σ/√n. If this standard
deviation is small then most of the values of x̄ are close to μ and we are confident that the particular
x̄ from our sample will be close to μ. If this standard deviation is large then some x̄'s will be a long
way from μ and our particular x̄ might be one of these. Thus knowing σ/√n will give us some idea
about how close our x̄ is likely to be to μ. The quantity σ/√n is called the standard error of our
estimate of μ, and is denoted by se(μ̂). Thus in this case
se(μ̂) = σ/√n .
However, in most situations σ is not known and we estimate it using the sample standard deviation
s. The estimated standard error of μ̂ is thus s/√n.
Note: many textbooks use the term "standard error" to refer to the estimated standard error, and
write
se(μ̂) = s/√n .
This is not entirely satisfactory, but is acceptable for the present course provided that we understand
that for small samples s/√n may be rather different from the correct value σ/√n. In particular, the
use of the t distribution (below) will allow for this difference.
In general, the standard error of an estimate of a parameter is the standard deviation of the sampling
distribution of estimates of that parameter (when you imagine taking repeated samples of size n).
Example 5.1 For the blood pressure data in Example 1.3, n = 21, x̄ = 128.52 and
s = 14.31. Let μ be the population mean systolic blood pressure for women attending
keep-fit classes. An estimate of μ is μ̂ = 128.52 mm Hg with estimated standard error
se(μ̂) = 14.31/√21 = 3.12 mm Hg.
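These numbers are quick to check with a short calculation using only the Python standard library (a sketch; the summary values are taken from the example):

```python
import math

n, xbar, s = 21, 128.52, 14.31  # summary statistics from Example 5.1

se = s / math.sqrt(n)  # estimated standard error of the sample mean
print(round(se, 2))    # 3.12 mm Hg, as quoted
```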
5.1.2 Method 2: Confidence intervals
In the above method of estimation we quoted an estimate of μ and its standard error, to express the
uncertainty of our estimate. An alternative method is to find an interval in which we expect μ
to lie; that is, we calculate an interval of the form (x̄ − b, x̄ + b) which is likely to contain μ. It is
conventional, for most purposes, to calculate a 95% confidence interval; that is, to choose b so that
if repeated samples were taken, 95% of the intervals (x̄ − b, x̄ + b) would include μ.
Let t_p be the percentage point of a t distribution with n − 1 df for an upper tail area of p = 0.025.
These values are given in Table 3 under the 0.025 column with ν = n − 1. Then
P( −t_p < (X̄ − μ)/(s/√n) < t_p ) = 0.95
which can be rearranged to
P( X̄ − t_p s/√n < μ < X̄ + t_p s/√n ) = 0.95 .
This statement says that if we take many samples of size n from the population and calculate the
interval x̄ − t_p s/√n to x̄ + t_p s/√n for each sample, then 95% of the intervals would contain the
population mean μ. We actually observe just one sample of size n and calculate just one interval, so
(in this rather indirect sense) we are 95% confident that our particular interval is one of those that
contains μ.
Notes
• The confidence interval can also be written as
( μ̂ − t_p se(μ̂) , μ̂ + t_p se(μ̂) )
which is often abbreviated to μ̂ ± t_p se(μ̂). For a 95% confidence interval, p = 0.025 and it can
be seen from Table 3 that t_p is about 2 when n is large and is a bit greater than 2 for smaller n.
So the 95% confidence interval contains values within about 2 (or a bit more than 2) standard
errors of μ̂.
• If we wish to be even more confident that our interval contains μ, we could use a higher confidence
level than 95%, e.g. 99%. In this case we use the above formula with upper percentage point
p = 0.005, so that the value of t_p is that in Table 3 under the 0.005 column with ν = n − 1.
The resulting interval is longer and more likely to contain μ. In practice it is a very common
convention to use a confidence level of 95%.
Example 5.1 continued. From the dotplot in §1 the sample data are fairly symmetric
and it looks reasonable to assume that they come from a normal distribution. We will
calculate 95% and 99% confidence intervals for μ, the population mean systolic blood
pressure.
We have n = 21 so ν = 20. For a 95% confidence interval we use t_p = 2.086 from Table 3
to get
128.52 ± 2.086 × 3.12 = 128.52 ± 6.508 = (122.0, 135.0) mm Hg.
For a 99% confidence interval we would get t_p = 2.845 and hence
128.52 ± 2.845 × 3.12 = 128.52 ± 8.876 = (119.6, 137.4) mm Hg.
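The same intervals can be reproduced with a short stdlib calculation. In this sketch the t percentage points (2.086 and 2.845 for ν = 20) are copied from Table 3 rather than computed, and the helper name is our own:

```python
import math

n, xbar, s = 21, 128.52, 14.31  # summary statistics from Example 5.1
se = s / math.sqrt(n)

def t_interval(xbar, se, t_p):
    """Confidence interval xbar +/- t_p * se."""
    return (xbar - t_p * se, xbar + t_p * se)

lo95, hi95 = t_interval(xbar, se, 2.086)  # t_p for 95%, nu = 20
lo99, hi99 = t_interval(xbar, se, 2.845)  # t_p for 99%, nu = 20
print(round(lo95, 1), round(hi95, 1))  # 122.0 135.0
print(round(lo99, 1), round(hi99, 1))  # 119.6 137.4
```

As expected, the 99% interval is wider than the 95% one.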
5.2 One sample t tests
In §5.1 we were interested in estimating a population mean μ from a sample of n observations. We
used two types of uncertain inference: (a) a point estimate μ̂ and its standard error se(μ̂), and
(b) a confidence interval for μ.
In this section we consider another type of inference: a hypothesis test. We have a hypothesized
value of μ, denoted by μ_0, say, and we ask the question: do the data agree with the hypothesis?
Example 5.2 Extensive data collected during the first half of this century showed that in
those years, Japanese children born in America grew faster than did Japanese children born
in Japan. The population mean height of 11-year-old Japan-born Japanese boys is known
to be 139.7 cm. In order to investigate whether improved economic and environmental
conditions in postwar Japan had narrowed this gap, a large sample of Japanese children
born in Hawaii was obtained, and the children were categorised with respect to age. There
were 13 eleven-year-old boys in the sample. Their heights (in cm) were as follows:
138, 146, 148, 151, 140, 149, 143, 155, 147, 146, 160, 145, 134.
We will test the hypothesis that the population mean height of Hawaii-born Japanese boys
is 139.7 cm (i.e., the same as that of Japan-born Japanese boys).
The hypothesis that we test is called a null hypothesis and is denoted by H_0. In general a null
hypothesis represents "no change": here it asserts that the population mean height of eleven-year-
old Japanese boys born in Hawaii is the same as that for eleven-year-old Japanese boys born in Japan.
In mathematical notation the null hypothesis is
H_0 : μ = μ_0
where in this case, μ_0 = 139.7 cm.
The logical alternative to H_0 is called the alternative hypothesis and is denoted by H_1. Here our
alternative hypothesis is
H_1 : μ ≠ μ_0 .
Note that both H_0 and H_1 are statements about the population mean μ (not about the sample mean).
Usually H_0 is a precise statement while H_1 is vague.
We are going to use a one sample t-test. We make the assumption that our data are a random
sample from a normal distribution and we calculate the t-statistic:
t = (x̄ − μ_0)/(s/√n) .
If the null hypothesis is true, then t will be a random value from the t-distribution with n − 1 df. Thus
if H_0 is true, t should be reasonably close to zero, while if H_1 is true, t is likely to be further away
from zero: either greater than zero (if μ > μ_0) or less than zero (if μ < μ_0).
Example 5.2 continued Here is a dotplot of the heights of the 13 Hawaii-born eleven-year-old
Japanese boys:
*
* * * * * * * * * * * *
---+---------+---------+---------+---------+---------+---------------
135 140 145 150 155 160 Height in cm
The dotplot is fairly symmetric, and experience suggests that heights in a homogeneous
group of individuals are approximately normally distributed. Note, though, that the
heights are only measured to the nearest cm, so that our data are really discrete. Nevertheless
it should be reasonable to treat them as a random sample from a normal distribution.
We may calculate n = 13, x̄ = 146.31, s = 6.88, and hence
t = (146.31 − 139.7)/(6.88/√13) = 3.46 .
Now, as a random value from the t-distribution with 12 df, the value 3.46 is rather
extreme: the chance of getting a value greater than this is less than 0.005 (see Table 3) and
is in fact about 0.003. The chance of getting a value more extreme (i.e., greater than 3.46
or less than −3.46) is therefore only 2 × 0.003 = 0.006. We therefore say there is evidence
that H_0 is not true and that the population mean height of eleven-year-old Japanese
boys born in Hawaii is greater than 139.7 cm.
In general, if our value of t is a typical value from the t-distribution with n − 1 df we say that our data
are consistent with the null hypothesis. If t is not such a typical value, but is too extreme, we regard
this as evidence that H_0 is not true. Specifically, we calculate a quantity called the P-value, which is
the probability of getting a value of T as extreme as, or more extreme than, our observed
value t, assuming that H_0 is true,
where T has a t-distribution with n − 1 df.
In Example 5.2, the P-value is
P = P(T > 3.46 or T < −3.46) = 2 × P(T > 3.46) = 0.006 .
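The t-statistic for Example 5.2 can be computed directly from the raw heights (a stdlib sketch; the P-value itself still comes from t tables, so only t is calculated here):

```python
import math
import statistics

heights = [138, 146, 148, 151, 140, 149, 143, 155, 147, 146, 160, 145, 134]
mu0 = 139.7  # hypothesised mean height of Japan-born boys (cm)

n = len(heights)
xbar = statistics.mean(heights)
s = statistics.stdev(heights)          # sample standard deviation
t = (xbar - mu0) / (s / math.sqrt(n))  # one sample t-statistic
print(round(xbar, 2), round(s, 2), round(t, 2))  # 146.31 6.88 3.46
```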
5.3 Interpretation of P-values
A P-value is the probability of getting a value of the test statistic that is as extreme as (or more
extreme than) the observed value if H_0 is true. A small value of P is therefore regarded as evidence
that H_0 is not true: the smaller P is, the stronger the evidence is against H_0. As a guide, the common
convention is:
If P < 0.01 there is strong evidence against H_0.
If 0.01 < P < 0.05 there is fairly strong evidence against H_0.
If P > 0.05 there is little or no evidence against H_0, or the data are consistent with H_0.
Thus, in Example 5.2 we found P ≈ 0.006, so there is strong evidence against H_0, i.e., strong evidence
that the mean height of postwar 11-year-old Hawaii-born Japanese boys differs from that of Japan-born
Japanese boys. In fact, the sample mean for the Hawaii-born boys is greater than 139.7 cm, so
the evidence is that the population mean for Hawaii-born boys is greater than for Japan-born boys.
Notes:
• A P-value measures evidence against the null hypothesis. A large P-value (such as 0.8, say) does
not necessarily imply that H_0 is true, because data can be consistent with H_0 and at the same
time be consistent with other hypotheses.
• A small P-value does not mean that H_0 cannot be true, because it is possible (though unlikely)
that extreme data may occur by chance, even when H_0 is true.
• The above guidelines are not hard and fast rules. For example a P-value of 0.06 means much
the same as one of 0.04, even though one of these is less than 0.05 and the other is not.
• What do we mean by "more extreme"? In general, we decide this by thinking about the alternative
hypothesis H_1. In Example 5.2, H_1 says that μ is either greater than 139.7 or less than
139.7. In the first case, we would expect t to be greater than 0 and in the second case, to be less
than 0. Thus, a more extreme value of t than 3.46 is one that is further from zero than 3.46.
• Sometimes it may not be possible for μ to be less than the hypothesised value μ_0. Then our
alternative hypothesis would be H_1 : μ > μ_0; more extreme values of t would just be values
greater than the observed value; and the P-value would be P(T > t) rather than twice this. This
is called a one-sided test. The usual case, where the P-value equals 2 × P(T > t), is a two-sided
test. One-sided tests are appropriate only rarely.
A hypothesis test is a rather limited form of inference. Very often we will also wish to make an
estimate, e.g., using a confidence interval.
For Example 5.2 a 95% confidence interval for μ is
146.3 ± 2.179 × 6.88/√13 = 146.3 ± 4.158 = (142.1, 150.5)
Thus the population mean for the Hawaii-born boys is estimated to be between 142.1 cm and 150.5 cm.
Note that this interval does not include 139.7 cm, the population mean for the Japan-born boys. In
general, if the P-value is greater than 0.05, the 95% confidence interval will include the hypothesised
mean μ_0.
Example 5.3 The mean systolic blood pressure for white males aged 35-44 is 127.2 mm Hg.
The systolic blood pressures (in mm Hg) for a sample of 45 diabetic males aged 35-44 were
as follows.
135 138 149 132 136 136 127 132 128 126
117 136 136 142 135 133 130 131 140 130
140 127 127 124 123 121 131 129 136 125
142 127 127 123 128 131 127 138 137 124
125 133 129 128 133
The researchers were interested in determining whether the mean systolic blood pressure
of 35-44-year-old diabetic males differed from that of 35-44-year-old males in the general
population.
Let μ be the population mean systolic blood pressure for diabetic males aged 35-44. The null and
alternative hypotheses are
H_0 : μ = 127.2 and H_1 : μ ≠ 127.2 .
Again we assume that we have a random sample from a normal population. The sample statistics are
n = 45, x̄ = 131.20, s = 6.3661 and hence
t = (131.20 − 127.2)/(6.3661/√45) = 4.21 .
The P-value is thus P(T > 4.21) + P(T < −4.21) = 2 × P(T > 4.21), where T has a t-distribution
with 44 df. From Table 3 we can see that P(T > 4.21) < 0.0005 and so P < 2 × 0.0005 = 0.001.
Thus there is very strong evidence against H_0, suggesting that 35-44-year-old diabetics have a higher
systolic blood pressure, on average, than 35-44-year-old men in the general population.
A 95% confidence interval for μ is
131.2 ± 2.015 × 6.3661/√45 = 131.2 ± 1.912 = (129.3, 133.1)
so the mean systolic blood pressure for 35-44-year-old diabetic men is estimated to be between
129.3 mm Hg and 133.1 mm Hg.
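The summary statistics quoted for Example 5.3 can be verified from the raw data in the same way (a sketch; the critical values are again taken from tables):

```python
import math
import statistics

bp = [135, 138, 149, 132, 136, 136, 127, 132, 128, 126,
      117, 136, 136, 142, 135, 133, 130, 131, 140, 130,
      140, 127, 127, 124, 123, 121, 131, 129, 136, 125,
      142, 127, 127, 123, 128, 131, 127, 138, 137, 124,
      125, 133, 129, 128, 133]
mu0 = 127.2  # mean systolic blood pressure in the general population

n = len(bp)                            # 45 diabetic males
xbar = statistics.mean(bp)
s = statistics.stdev(bp)
t = (xbar - mu0) / (s / math.sqrt(n))
print(round(xbar, 1), round(s, 4), round(t, 2))  # 131.2 6.3661 4.21
```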
5.4 Procedure for hypothesis tests
A good procedure to follow in performing hypothesis tests is:
1. Set up the null and alternative hypotheses, defining any notation you use.
2. State any assumptions you are making and, if possible, check whether they are reasonable.
3. Calculate the test statistic (t in this section).
4. Obtain the P-value.
5. Interpret the P-value.
6. Write a one-sentence conclusion.
5.5 Relation between confidence intervals and hypothesis tests
Suppose we test a null hypothesis H_0 : μ = μ_0 and find that the P-value is greater than 0.05. Then
the 95% confidence interval for μ will include the hypothesised value μ_0. If P < 0.05 then the 95%
confidence interval will not include μ_0. In other words:
the 95% confidence interval for μ consists of all hypothesised values μ_0 for which the P-value
is greater than 0.05.
Thus, if you calculate a 95% confidence interval that does not include a μ_0 of interest, then you can
infer that the P-value will be less than 0.05. Likewise, if a 99% confidence interval does not include μ_0,
the P-value will be less than 0.01. Note, though, that there is a logical distinction between estimation
and hypothesis testing.
5.6 Two sample t tests and confidence intervals
5.6.1 Matched pairs t test
Example 5.4 The following data give the pH reading for the surface soil and subsoil of 13
areas of acid soil. Test whether the average pH differs between the surface soil and subsoil.
Topsoil pH Subsoil pH Difference Topsoil pH Subsoil pH Difference
6.57 8.34 -1.77 5.49 7.90 -2.41
6.77 6.13 0.64 5.56 5.20 0.36
6.53 6.32 0.21 5.32 5.32 0.00
6.71 8.30 -1.59 5.92 6.21 -0.29
6.72 8.44 -1.72 6.55 5.66 0.89
6.01 6.80 -0.79 6.93 5.66 1.27
4.99 5.42 -0.43
In this example we have, for each area, a pair of observations that are not independent. However, we
assume that the two observations on an area are independent of the two observations on any other
area. We test for the difference, on average, between the surface soil pH and the subsoil pH by first
calculating the difference between the two readings for each area. These differences are assumed to
be a random sample from a normal distribution.
Here is a dotplot of the differences:
* *** * ** * * * * * *
---+---------+---------+---------+---------+--------------
-3 -2 -1 0 1 Difference in pH
The dotplot is fairly symmetric and the assumption looks reasonable.
Let μ be the population mean of the difference between the surface soil pH and the subsoil pH. We
test the null hypothesis
H_0 : μ = 0
against the alternative H_1 : μ ≠ 0. Note that the mean of the differences between surface and subsoil
pH equals the difference between the means, so we may interpret μ as either of these.
The test is exactly the same as for the one sample t test except that the observations are now the
differences. The calculated statistics are:
n = 13, x̄ = −0.4331, s = 1.1500, t = −0.4331/(1.1500/√13) = −1.36 .
The P-value is P(T < −1.36) + P(T > 1.36) = 2 × P(T > 1.36) where T has a t-distribution with 12 df.
From Table 3, P ≈ 2 × 0.10 = 0.20. The P-value is greater than 0.05 and the data are consistent with
H_0. We conclude that there is no evidence that the average pH for subsoil differs from that of surface
soil.
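The paired calculation can be sketched in code using only the standard library (the sign of t simply reflects the direction of the differences, topsoil minus subsoil):

```python
import math
import statistics

topsoil = [6.57, 6.77, 6.53, 6.71, 6.72, 6.01, 4.99,
           5.49, 5.56, 5.32, 5.92, 6.55, 6.93]
subsoil = [8.34, 6.13, 6.32, 8.30, 8.44, 6.80, 5.42,
           7.90, 5.20, 5.32, 6.21, 5.66, 5.66]

d = [t - s for t, s in zip(topsoil, subsoil)]  # one difference per area
n = len(d)
dbar = statistics.mean(d)
sd = statistics.stdev(d)
t_stat = dbar / (sd / math.sqrt(n))            # one sample t on differences
print(round(dbar, 4), round(sd, 2), round(t_stat, 2))  # -0.4331 1.15 -1.36
```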
5.6.2 Two sample t test
Example 5.5 As part of a study on job satisfaction reported in the Journal of Library
Administration (1984), samples of 13 male and 11 female employees of a university library
were asked to complete a job satisfaction questionnaire. The results were as follows, with
the higher the score the greater the job satisfaction.
Male 67 65 65 84 92 95 82 76 78 80 60 74 77
Female 78 67 72 67 65 48 81 63 91 71 78
Do the data suggest that male and female university librarians differ in their mean score
on the job satisfaction questionnaire?
In this example we have two unrelated samples from two different populations. Let μ_x and μ_y be the
population mean job satisfaction scores for male and female university library employees. We test
H_0 : μ_x = μ_y
against the alternative H_1 : μ_x ≠ μ_y. We can also express the null hypothesis as H_0 : μ_x − μ_y = 0.
Thus intuitively we can imagine testing whether the difference in population means is zero, and then
go on and estimate this difference.
We assume that the two samples come from normal populations that have the same standard deviation
σ (where σ is unknown), though their means may be different. We also assume that all the observations
are independent.
Here is a dotplot of the two samples:
*
* * * * *** * * * * * males
*
* * * * ** * * * females
---+---------+---------+---------+---------+---------+-
50 60 70 80 90 100 job satisfaction score
These look reasonably like two samples from normal populations with the same standard deviation.
In general, let n_x, x̄, s_x and n_y, ȳ, s_y be the sample sizes, means and standard deviations for the two
samples. We use the difference in sample means x̄ − ȳ to estimate μ_x − μ_y, that is:
(μ_x − μ_y)̂ = x̄ − ȳ .
Under the above assumptions it can be shown that the standard error of this estimate is
se(μ̂_x − μ̂_y) = σ √(1/n_x + 1/n_y) .
An estimate of σ², the common variance of the two populations, is given by
s_p² = [ (n_x − 1)s_x² + (n_y − 1)s_y² ] / (n_x − 1 + n_y − 1) .
This is a weighted average of the two sample variances, with weights equal to their degrees of freedom.
The square root of this quantity, s_p, is called the pooled standard deviation and is used to estimate
σ, the common standard deviation of the two populations. Thus the estimated standard error of
μ̂_x − μ̂_y is
se(μ̂_x − μ̂_y) = s_p √(1/n_x + 1/n_y)
and the t-statistic is
t = (x̄ − ȳ) / ( s_p √(1/n_x + 1/n_y) ) .
It can be shown that if H_0 is true, this will be a random value from the t-distribution with ν =
(n_x − 1) + (n_y − 1) df. The further t is from 0, the stronger the evidence is against H_0, and we can use
the t-distribution to calculate the P-value.
Example 5.5 continued The summary statistics are:
Male    n_x = 13   x̄ = 76.538   s_x = 10.477
Female  n_y = 11   ȳ = 71.000   s_y = 11.225
The pooled standard deviation is
s_p = [ (12 × 10.477² + 10 × 11.225²) / (12 + 10) ]^(1/2) = 10.823
and the estimated standard error of μ̂_x − μ̂_y is
10.823 × √(1/13 + 1/11) = 4.434 .
The t-statistic is thus
t = (76.538 − 71.000)/4.434 = 5.538/4.434 = 1.25
and from Table 3 with ν = 12 + 10 = 22 we see that P ≈ 2 × 0.11 = 0.22. Thus the data
are consistent with H_0, and we conclude that there is no evidence that the average job
satisfaction score differs between male and female university librarians.
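The pooled calculation can be sketched as follows (stdlib only; the helper name is our own):

```python
import math
import statistics

male = [67, 65, 65, 84, 92, 95, 82, 76, 78, 80, 60, 74, 77]
female = [78, 67, 72, 67, 65, 48, 81, 63, 91, 71, 78]

def pooled_t(x, y):
    """Two sample t-statistic assuming a common standard deviation."""
    nx, ny = len(x), len(y)
    sx, sy = statistics.stdev(x), statistics.stdev(y)
    # pooled variance: weighted average of the two sample variances
    sp = math.sqrt(((nx - 1) * sx**2 + (ny - 1) * sy**2) / (nx + ny - 2))
    se = sp * math.sqrt(1 / nx + 1 / ny)
    t = (statistics.mean(x) - statistics.mean(y)) / se
    return t, sp, se

t, sp, se = pooled_t(male, female)
print(round(sp, 3), round(se, 3), round(t, 2))  # 10.823 4.434 1.25
```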
5.7 Confidence intervals for the difference between two population means
Using the same general method as before, a 95% confidence interval for μ_x − μ_y is given by
( x̄ − ȳ − t_p s_p √(1/n_x + 1/n_y) , x̄ − ȳ + t_p s_p √(1/n_x + 1/n_y) )
where t_p is the upper 0.025 percentage point of the t-distribution with (n_x − 1) + (n_y − 1) df.
Example 5.5 continued From Table 3, for ν = 22 we find that t_p = 2.074. So a 95%
confidence interval for μ_x − μ_y is
5.538 ± 2.074 × 4.434 = 5.538 ± 9.196 = (−3.7, 14.7) .
Thus the difference in population mean job satisfaction score is estimated to be between
−3.7 and 14.7. Note that this interval includes 0, which agrees with our conclusion from
the hypothesis test that the data are consistent with μ_x − μ_y = 0.
6 Two Sample Non-Parametric Tests
Sometimes it may not be reasonable to regard the data as random samples from normal populations.
For example, the data may be too skew, discrete or ordinal. The non-parametric tests in this section
make rather different types of assumptions to test hypotheses of interest, but they do not easily lend
themselves to estimating population parameters. They are best explained by considering examples.
The tests will use ranks of the observations, rather than the observations themselves. In a sample of
n distinct values, the smallest has rank 1, the next smallest has rank 2, and so on up to the largest,
which has rank n. Here is an example, for a sample of n = 5 numbers:
x_i        4.2  0.6  2.1  6.3  3.4
rank(x_i)    4    2    1    5    3
Note that the ranks will always consist of the numbers 1, 2, . . . , n in some order. If in the above
example x_4 was 106.3 rather than 6.3, the ranks would be unchanged. Only the ordering of the
numbers affects their ranks.
6.1 Wilcoxon signed rank test
This is a non-parametric test for paired data.
Example 6.1 Eight volunteers are asked to transfer peas from one dish to another using
a straw, before and after they have consumed two pints of beer. The numbers of peas
transferred in two minutes were as follows.
Volunteer 1 2 3 4 5 6 7 8
Before 32 54 22 43 40 37 16 48
After 26 38 29 15 42 29 17 34
Difference 6 16 -7 28 -2 8 -1 14
Do the results suggest that alcohol consumption affects the level of performance?
As with the matched pairs t-test we calculate the differences between the two test results for each
volunteer. This is essentially because we are not interested in how many peas each person can transfer
(some people will be quicker at doing this than others) but in how this number may differ between
the two conditions. Our null hypothesis is that the general level of performance is the same under
either condition (before or after consuming alcohol). We interpret this as saying that the two numbers
for each person are two random values from the same distribution (though the distribution may differ
between people).
6.1.1 Procedure for calculating the test statistic and P-value
1. If any of the differences are 0, ignore them and reduce the sample size by the number of zero
differences.
2. Ignore the signs of the differences and replace each difference by its rank (i.e., calculate the ranks
of the absolute values of the differences). If two or more differences have the same absolute value,
give each of them the average of the ranks for those differences.
3. Give each rank the sign of the difference corresponding to it.
4. Calculate the test statistic W, defined as follows. Let T_+ and T_− be the sums of the positive and
negative ranks respectively. Then W is the smaller of T_+ and T_−.
5. Find the approximate P-value, or a range in which it lies, from Table 4 (or otherwise).
For the data in Example 6.1 we have
Differences (B-A) 6 16 -7 28 -2 8 -1 14
Ranks of absolute differences 3 7 4 8 2 5 1 6
Sign of difference + + - + - + - +
Now T_+ = 3 + 7 + 8 + 5 + 6 = 29 and T_− = 4 + 2 + 1 = 7, so W = T_− = 7. If H_0 were true it would be
as if each of the ranks 1 to 8 was given a + or − sign with equal probability. Note that the smallest
that T_− can possibly be is 0 (when all signs are +) and the largest that T_− can be is (1/2)n(n + 1) = 36
(when all signs are −). Similarly for T_+. So if H_0 is true we would expect T_+ and T_− both to be
about half way between these values, i.e., 18. The closer these are to 0 or 36 (i.e., the closer W is to
0) the stronger the evidence against H_0.
From Table 4, the row of percentage points for n = 8 is
p     0.05   0.025   0.01   0.005
W_p      5       3      1       0
Thus if we observed W = 4 the tail probability would be between 0.05 and 0.025, so the P-value would
be between 0.10 and 0.05. (As usual, we double the tail probability to get the P-value.) In our case we
observed W = 7, which is greater than 5, so the tail probability is greater than 0.05 and so P > 0.10.
Thus the data are consistent with H_0 and we conclude that there is no evidence that the performance
level changes after drinking the beer.
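The ranking procedure above can be sketched in code (stdlib only; both helper names are our own). Zero differences are dropped and tied absolute differences receive averaged ranks, exactly as in steps 1-4:

```python
def avg_ranks(values):
    """Rank from 1 upwards, giving tied values the average of their ranks."""
    order = sorted(values)
    return [order.index(v) + (order.count(v) + 1) / 2 for v in values]

def signed_rank_W(before, after):
    """Wilcoxon signed rank statistic W for paired samples."""
    diffs = [b - a for b, a in zip(before, after) if b != a]  # drop zeros
    ranks = avg_ranks([abs(d) for d in diffs])
    t_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    t_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(t_plus, t_minus), t_plus, t_minus

before = [32, 54, 22, 43, 40, 37, 16, 48]
after = [26, 38, 29, 15, 42, 29, 17, 34]
W, t_plus, t_minus = signed_rank_W(before, after)
print(W, t_plus, t_minus)  # 7.0 29.0 7.0, matching T+ = 29 and T- = 7
```

The P-value range is then read from Table 4 as in the worked example.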
Here is an example with ties and a zero difference:
Example 6.2 The cinema attendance (in millions) for the 13 regions of Great Britain in
1994 and 1995 were as follows. (Source: Regional Trends 1996)
Region 1994 1995 Difference Rank of abs difference Sign
Yorkshire 10.4 9.7 0.7 6.5 +
North East 6.3 5.8 0.5 5 +
Midlands 18.4 16.8 1.6 11 +
Anglia 6.6 6.2 0.4 3.5 +
London 33.7 31.3 2.4 12 +
Southern 9.7 8.7 1.0 8.5 +
South West 2.0 1.7 0.3 1.5 +
Lancashire 16.6 15.1 1.5 10 +
HTV 5.8 6.8 -1.0 8.5 -
Border 0.6 0.6 0.0 *
Central Scotland 8.2 7.5 0.7 6.5 +
Northern Scotland 1.9 1.5 0.4 3.5 +
Northern Ireland 3.8 3.5 0.3 1.5 +
Test whether there is any evidence that cinema attendance in Great Britain differed
between the two years.
It is interesting to consider what the relevant populations are in this example. Nevertheless, let us
apply Wilcoxon's signed rank test.
There is one zero difference, which we omit, so the sample size becomes n = 13 − 1 = 12. Also we
average the relevant ranks where there are ties: so, for example, the two smallest differences both
equal 0.3, so these each get rank 1.5 (instead of 1 and 2). There is only one negative difference (all
attendances went down except for HTV) so W = T_− = 8.5. The relevant row of Table 4 (n = 12) is
p      0.05   0.025   0.01   0.005
W_p      17      13      9       7
W is between 9 and 7, so the tail probability is between 0.01 and 0.005. Hence P is between 0.02 and
0.01, which represents fairly strong evidence against H_0. We conclude that cinema attendance went
down in 1995.
6.2 Mann-Whitney two sample test
This is a non-parametric test to compare two independent samples.
Example 6.3 A group of 7 people all over 50 years of age and another group of 6 people
all under 30 years of age had their conduction velocity measured. This was done by
measuring both the time taken (in milliseconds) for the signal resulting from a standardized
knock on the Achilles tendon to travel up the relevant nerve to the spinal cord and then
back down again on another nerve to make the muscle twitch, and the distance the
nerve impulse travelled. The conduction velocity is the distance travelled divided by the
time taken and was measured in metres per second. The results were as follows.
Older group 37.7 40.0 42.8 38.2 37.4 33.4 44.7
Younger group 45.9 53.9 40.0 43.7 41.3 44.6
Is there evidence that reactions (as measured by conduction velocity) tend to be different
in the two age groups?
In this example we have unrelated samples from two different populations. We will test the null
hypothesis that the two populations are identical.
6.2.1 Procedure for calculating the test statistic and P-value
1. Pool all observations into one sample and arrange them in order of size.
2. Write down the ranks of the observations. If two or more observations have the same value, give
each of them the average of the ranks for those observations.
3. Write down which of the two samples each observation comes from.
4. Calculate the test statistic U as follows. Let n_x and n_y be the two sample sizes, let R_x and R_y
be the sums of the ranks for each of the samples, and let
U_x = R_x − (1/2)n_x(n_x + 1) and U_y = R_y − (1/2)n_y(n_y + 1) .
Then U is the smaller of U_x and U_y.
5. Percentage points of the distribution of U when H_0 is true are given in Table 5, from which the
approximate P-value (or a range in which it lies) can be deduced.
For the data in Example 6.3, this gives:
ordered data: 33.4 37.4 37.7 38.2 40.0 40.0 41.3 42.8 43.7 44.6 44.7 45.9 53.9
rank:            1    2    3    4  5.5  5.5    7    8    9   10   11   12   13
sample:          x    x    x    x    x    y    y    x    y    y    x    y    y
R_x = 1 + 2 + 3 + 4 + 5.5 + 8 + 11 = 34.5 ,  U_x = 34.5 − (1/2) × 7 × 8 = 6.5
R_y = 5.5 + 7 + 9 + 10 + 12 + 13 = 56.5 ,  U_y = 56.5 − (1/2) × 6 × 7 = 35.5 .
Hence U = 6.5, the smaller of 6.5 and 35.5.
Note that the smallest that U_x could possibly be is 0 (when all x values are less than the smallest y
value) and the largest that U_x could be is 42 (when all x values are greater than the largest y value).
So if H_0 is true we would expect U_x to be in the middle of this range, i.e., 21. (In general, the expected
value of U_x is (1/2)n_x n_y.) Similarly for U_y. Thus the closer U is to 0, the more evidence there is against
H_0.
We can find a range in which the P-value lies from Table 5, which gives percentage points of U under
H_0, for various n_1 and n_2, where n_1 is the smaller of n_x and n_y, and n_2 is the larger of n_x and n_y. In
our case, n_1 = 6 and n_2 = 7 and the relevant row of Table 5 is
p     0.05   0.025   0.01   0.005
U_p      8       6      4       3
Our observed U, 6.5, is between 6 and 8, so the one-sided tail probability is between 0.025 and 0.05.
Hence, for a two-sided test, 0.05 < P < 0.10, which is only weak evidence against H_0. We conclude
that there is only weak evidence that reactions differ for the two populations, though this is probably
because we only have very small samples. Note that here, as in general, it is fallacious to conclude
that H_0 is true.
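The U computation can be sketched as follows (stdlib only; the function name is our own), using averaged ranks for the tied 40.0 readings:

```python
def mann_whitney_U(x, y):
    """Mann-Whitney U from rank sums, with averaged ranks for ties."""
    pooled = sorted(x + y)
    def rank(v):
        # average rank of value v in the pooled sample (ranks start at 1)
        return pooled.index(v) + (pooled.count(v) + 1) / 2
    rx = sum(rank(v) for v in x)
    ry = sum(rank(v) for v in y)
    ux = rx - len(x) * (len(x) + 1) / 2
    uy = ry - len(y) * (len(y) + 1) / 2
    return min(ux, uy), ux, uy

older = [37.7, 40.0, 42.8, 38.2, 37.4, 33.4, 44.7]
younger = [45.9, 53.9, 40.0, 43.7, 41.3, 44.6]
U, ux, uy = mann_whitney_U(older, younger)
print(U, ux, uy)  # 6.5 6.5 35.5
```

As in the worked example, the P-value range is then read from Table 5.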
7 Probability and Binomial and Poisson Distributions
7.1 Idea of probability
In §2 we referred to probabilities of events such as a < X ≤ b, where the random variable X is
continuous and has a normal distribution. Thus, if X represents height (in inches) in a population
of men, then P(66 < X ≤ 72) denotes the proportion of men in the population whose heights are
between 66 and 72 inches. If a man is to be chosen at random from the population, the probability
that his height will be between 66 and 72 inches is P(66 < X ≤ 72).
Now we are going to consider probabilities for general events and discrete variables (i.e., those that
can only take a discrete set of values). An event is a set of values, or outcomes, in which we are
interested. The probability of an event A is denoted P(A) and is a number on a scale from 0 to 1,
where
P(A) = 0 means that A is impossible,
P(A) = 1 means that A is certain,
P(A) = 1/2 means that A is equally likely to happen as not to happen.
All probabilities are either 0, 1 or a number between 0 and 1.
Example 7.1 If we imagine rolling a fair die at random we would expect each of the
six faces to have the same chance of falling uppermost. Indeed, if we rolled the die a
large number of times we would observe that the proportion of times a 6 falls uppermost
converges to 1/6 as we increase the number of rolls. Thus the probability of obtaining a 6
is 1/6. Here the event of interest is "a 6 falls uppermost".
We can imagine a population consisting of the possible outcomes 1, 2, 3, 4, 5, 6 and an experiment
of choosing one of these outcomes at random. If this experiment is repeated many times, then in the
long run the outcome 6 will occur in one sixth of these experiments. In this sense the probability
that a six falls uppermost corresponds to the long-run proportion of times this would happen if the
experiment were repeated many times.
An example of an impossible event is "a 7 falls uppermost", as there is no 7 face. This event has
probability 0. An example of a certain event is "the score on the uppermost face is ≤ 10", as every
possible score is less than or equal to 10 (in fact ≤ 6). This event has probability 1.
7.2 Rules of probability
Probabilities obey the rules of proportions. Imagine a population of individuals among whom a
proportion p have some attribute A, while the remaining proportion 1 − p do not. Imagine choosing
an individual at random from this population. The individual chosen might or might not have the
attribute A: the probability that he or she does is
P(A) = p .
Thus there is a direct correspondence between the probability of the event A and the proportion in the
population who have the attribute A.
Example 7.2 Consider a population of 220 people classified by sex and height:
Tall Short Total
Male 50 50 100
Female 30 90 120
Total 80 140 220
Then
P(Tall) = 80/220 = 4/11      P(Short) = 140/220 = 7/11 = 1 − P(Tall)
P(Male) = 100/220 = 5/11     P(Female) = 120/220 = 6/11 = 1 − P(Male).
In general, for any event A,
P(not A) = 1 − P(A) .
7.2.1 Conditional and joint probabilities: multiplication rules
The conditional probability of A given B is denoted P(A| B). It is the proportion of individuals
who have the attribute A among those who have B. Thus P(Tall | Male) is the proportion of men who
are tall, whereas P(Tall) is the proportion of people (men or women) who are tall. In Example 7.2:
P(Tall | Male) = 50/100 = 1/2        P(Tall | Female) = 30/120 = 1/4
P(Male | Tall) = 50/80 = 5/8         P(Male | Short) = 50/140 = 5/14 .
Note that P(Tall | Male) is different from P(Male | Tall). The former is the proportion of men who are
tall while the latter is the proportion of tall people who are men.
In general P(A | B) is different from P(B | A), though these are often confused in practice. In applications
of probability in courts of law, the confusion is so common that it has a name: the prosecutor's
fallacy.
We may also consider joint probabilities of two or more events:
P(Tall and Male) = 50/220 = 5/22        P(Short and Female) = 90/220 = 9/22

Note that P(Tall and Male) = P(Tall | Male) × P(Male), or 5/22 = 1/2 × 5/11 .
In general, for any events A and B,

P(A and B) = P(A | B) P(B) = P(B | A) P(A) .

This last equality gives us a means of calculating P(A | B) when we know P(B | A), P(A) and P(B).
In this form, it is known as Bayes' Theorem: P(A | B) = P(B | A) P(A) / P(B).
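The multiplication rule and Bayes' theorem can be checked numerically against the table of Example 7.2. The following Python sketch (an illustration only, not part of the course material) uses exact fractions from the standard library:

```python
from fractions import Fraction

# Counts from Example 7.2: 220 people classified by sex and height.
n_tall_male, n_male, n_tall, total = 50, 100, 80, 220

p_tall_and_male = Fraction(n_tall_male, total)     # P(Tall and Male) = 5/22
p_tall_given_male = Fraction(n_tall_male, n_male)  # P(Tall | Male)   = 1/2
p_male = Fraction(n_male, total)                   # P(Male)          = 5/11
p_tall = Fraction(n_tall, total)                   # P(Tall)          = 4/11

# Multiplication rule: P(Tall and Male) = P(Tall | Male) P(Male)
assert p_tall_and_male == p_tall_given_male * p_male

# Bayes' theorem: P(Male | Tall) = P(Tall | Male) P(Male) / P(Tall)
p_male_given_tall = p_tall_given_male * p_male / p_tall
print(p_male_given_tall)  # 5/8
```

The result 5/8 agrees with P(Male | Tall) computed directly from the table as 50/80.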
It is sometimes useful to imagine the joint probabilities in a table, as if they were numbers out of a
total population of 1:
Tall Short Total
Male 5/22 5/22 5/11
Female 3/22 9/22 6/11
Total 4/11 7/11 1
7.2.2 Mutually exclusive events: addition rules
In Example 7.1 the experiment consisted of rolling a die. There were six possible outcomes: 1, 2, 3, 4,
5, or 6. These outcomes are mutually exclusive, meaning that if one occurs any other cannot. Two
events that are not mutually exclusive are "5 or 6" and "an even number", because if the outcome of
the roll is a 6 then both of these events have happened.
Consider the event "5 or 6", which occurs if the experiment results in either a 5 or a 6. This has
probability

P(X = 5 or 6) = 2/6 = 1/6 + 1/6 = P(X = 5) + P(X = 6) .
In general, if A and B are two mutually exclusive events then

P(A or B) = P(A) + P(B) .
In Example 7.2 the events "Tall and Male" and "Tall and Female" are mutually exclusive, since an
individual cannot simultaneously be both of these. Now, by the above addition rule

P(Tall) = P(Tall and Male) + P(Tall and Female), or from the table 80/220 = 50/220 + 30/220 .
Furthermore, expressing the joint probabilities in terms of conditional probabilities:

P(Tall) = P(Tall | Male) P(Male) + P(Tall | Female) P(Female), or 4/11 = 1/2 × 5/11 + 1/4 × 6/11 .
In general for events A and B,

P(A) = P(A | B) P(B) + P(A | not B) P(not B) .

This is called the generalised addition law and is often used in conjunction with Bayes' theorem.
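The generalised addition law can be verified on Example 7.2 with a few lines of Python (a numerical check of the arithmetic above, nothing more):

```python
from fractions import Fraction

# Example 7.2: recover P(Tall) by conditioning on sex.
p_tall_given_male = Fraction(1, 2)
p_tall_given_female = Fraction(1, 4)
p_male, p_female = Fraction(5, 11), Fraction(6, 11)

# P(Tall) = P(Tall | Male) P(Male) + P(Tall | Female) P(Female)
p_tall = p_tall_given_male * p_male + p_tall_given_female * p_female
print(p_tall)  # 4/11
```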
7.2.3 Independence
Example 7.3 Consider a population of individuals classified by sex and by the ability
to curl one's tongue. (This ability is determined by a single gene and is not sex-linked.)
Suppose the population proportions are:
Curl Straight Total
Male .10 .30 .40
Female .15 .45 .60
Total .25 .75 1
Then P(Curl and Male) = 0.10 = 0.25 × 0.40 = P(Curl) × P(Male). Similarly, each joint
probability in the table is the product of the relevant row and column totals. In this case,
sex and the ability to curl one's tongue are independent.
In general two events A and B are independent if

P(A and B) = P(A) × P(B) .

It follows from this that P(A | B) = P(A | not B) = P(A). Thus in Example 7.3

P(Curl | Male) = 0.10/0.40 = 1/4 , P(Curl | Female) = 0.15/0.60 = 1/4 , P(Curl) = 0.25/1 = 1/4 .
Intuitively, if A and B are independent, then the probability that A happens does not depend on
whether B has happened. Note that independence is a property of the probabilities of events, not just
of the (logical) events themselves.
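For a table of joint proportions, independence can be checked mechanically: every cell must equal the product of its row and column totals. Here is a Python sketch for Example 7.3 (the dictionary layout is just one convenient representation, not anything prescribed by the notes):

```python
# Example 7.3: joint proportions for sex and tongue-curling ability.
joint = {("Male", "Curl"): 0.10, ("Male", "Straight"): 0.30,
         ("Female", "Curl"): 0.15, ("Female", "Straight"): 0.45}

# Marginal totals (row and column sums of the table).
p_sex = {s: sum(v for (s2, c), v in joint.items() if s2 == s)
         for s in ("Male", "Female")}
p_curl = {c: sum(v for (s, c2), v in joint.items() if c2 == c)
          for c in ("Curl", "Straight")}

# Independence: every joint probability equals the product of its marginals
# (a small tolerance absorbs floating-point rounding).
independent = all(abs(joint[(s, c)] - p_sex[s] * p_curl[c]) < 1e-12
                  for s in p_sex for c in p_curl)
print(independent)  # True
```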
Example 7.4 Suppose that the probability that a new-born baby is a boy is 1/2 (i.e., a
baby is equally likely to be a boy or a girl) independently of all other births. Consider two
births in a maternity hospital on a particular day. Let

B1 be the event "the first baby born is a boy"
G1 be the event "the first baby born is a girl"
B2 be the event "the second baby born is a boy"
G2 be the event "the second baby born is a girl"
B1B2 be the event "both babies born are boys", etc.

For the first birth, P(B1) = 1/2 and P(G1) = 1/2. Similarly for the second birth P(B2) = P(G2) = 1/2.
Furthermore, since all births are independent, the two babies born are equally likely to be any one
of the four possibilities B1B2, B1G2, G1B2, or G1G2. So each of these events has probability 1/4. Or,
using the independence rule:

P(B1B2) = P(B1) × P(B2) = 1/2 × 1/2 = 1/4 .
7.3 Random variables
In Example 7.1, let X denote the (random) value of the uppermost face after rolling the die. X is
called a random variable. We write

P(X = 6) = 1/6

to denote the probability that the experiment results in the outcome 6; that is, the probability of the
event X = 6. Similarly

P(X = 1) = 1/6 , P(X = 2) = 1/6 , P(X = 3) = 1/6 , P(X = 4) = 1/6 , P(X = 5) = 1/6 .
In this example, X has six possible values, which are equally likely since their probabilities of
occurrence are the same.
Also, different values of X represent mutually exclusive events, so the probabilities of all possible
values must add to 1. For example, consider the event of not getting a 6. Then

P(X ≠ 6) = P(X = 1 or 2 or 3 or 4 or 5) = 5/6 = 1 − 1/6 = 1 − P(X = 6) .

That is, to find the probability that X ≠ 6, we can either add up the probabilities of all values of X
that do not equal 6, or we can subtract the probability that X = 6 from 1.
7.3.1 Mean and Variance of a random variable
The mean (or expectation) of a random variable corresponds to the population mean for the
relevant population. In Example 7.1, our population consists of the numbers 1, 2, 3, 4, 5, and 6 in
equal proportions. The mean of the random variable X is therefore the mean of these six numbers,
or

μ = 1 × 1/6 + 2 × 1/6 + 3 × 1/6 + 4 × 1/6 + 5 × 1/6 + 6 × 1/6 = 3.5 .

In general the mean of a random variable may be found by multiplying each value by its probability
and adding all the products:

μ = Σ_i i P(X = i) .
The variance σ² of a random variable is defined similarly:

σ² = Σ_i (i − μ)² P(X = i) .

So in Example 7.1, X has variance

σ² = (1 − 3.5)² × 1/6 + (2 − 3.5)² × 1/6 + ⋯ + (6 − 3.5)² × 1/6 = 35/12 .
If a large number of independent realisations of the random variable are obtained (e.g., if the die is
rolled many times) the resulting values of X will have a mean close to μ and standard deviation close to σ.
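Both formulae can be evaluated directly for the fair die of Example 7.1. A Python sketch using exact fractions (an illustration, not part of the notes):

```python
from fractions import Fraction

# Fair die: values 1..6, each with probability 1/6.
values = range(1, 7)
p = Fraction(1, 6)

mu = sum(i * p for i in values)               # mean: sum of i * P(X = i)
var = sum((i - mu) ** 2 * p for i in values)  # variance: sum of (i - mu)^2 * P(X = i)
print(mu, var)  # 7/2 35/12
```

The output 7/2 and 35/12 matches the values μ = 3.5 and σ² = 35/12 derived above.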
7.4 Binomial Distribution
An experiment that consists of n independent trials such that each trial can result in only two
possible outcomes (usually called "success" and "failure") and such that the probability of success
is the same for each trial is known as a binomial experiment.
Example 7.5 One hundred seeds are sown and it is observed whether or not each seed
germinates. The sowing of a seed constitutes a "trial" and a "success" on a particular trial
occurs if that seed germinates. Here there are n = 100 trials. If whether or not one
seed germinates does not affect whether or not other seeds germinate (i.e., if the trials are
independent), and if each seed has the same chance of germinating, then this is a binomial
experiment. But if some of the seeds were watered and others were not then the probability
of success would not be the same for each trial and so the binomial situation would not
hold.
We are often interested in how many of the n trials result in a success. We define the random
variable X as the number of successes in n trials. If we have a binomial experiment, then X has a
binomial distribution with index n and success probability p, where p is the probability that
any one trial results in a success.
Let Si be the event "success on the ith trial" and Fi be the event "failure on the ith trial", for
i = 1, 2, . . . , n. Let P(Si) = p, so P(Fi) = 1 − p.
For n = 1 (i.e., there is just one trial), X is either 0 or 1 and

P(X = 0) = 1 − p , P(X = 1) = p .
For n = 2, X may be 0, 1, or 2. Now

P(F1F2) = P(F1) × P(F2) = (1 − p) × (1 − p)
P(F1S2) = P(F1) × P(S2) = (1 − p) × p
P(S1F2) = P(S1) × P(F2) = p × (1 − p)
P(S1S2) = P(S1) × P(S2) = p × p

In the first line above, X = 0, in the second and third lines X = 1 and in the fourth line, X = 2.
Hence

P(X = 0) = (1 − p)² , P(X = 1) = 2p(1 − p) , P(X = 2) = p² .
For n = 3, X may be 0, 1, 2 or 3, and

P(X = 0) = P(F1F2F3) = (1 − p)³
P(X = 1) = P(S1F2F3) + P(F1S2F3) + P(F1F2S3) = 3p(1 − p)²
P(X = 2) = P(S1S2F3) + P(S1F2S3) + P(F1S2S3) = 3p²(1 − p)
P(X = 3) = P(S1S2S3) = p³
For n = 4, X may be 0, 1, 2, 3 or 4 and a similar argument gives probabilities:

(1 − p)⁴ , 4p(1 − p)³ , 6p²(1 − p)² , 4p³(1 − p) , p⁴ .
The coefficients in these binomial probabilities can be obtained from Pascal's triangle. This is a
triangle of numbers in which each number is calculated by adding together the two numbers immediately
above it. Values up to n = 8 are:
n=0 1
n=1 1 1
n=2 1 2 1
n=3 1 3 3 1
n=4 1 4 6 4 1
n=5 1 5 10 10 5 1
n=6 1 6 15 20 15 6 1
n=7 1 7 21 35 35 21 7 1
n=8 1 8 28 56 70 56 28 8 1
Thus if n = 7, P(X = 3) = 35p³(1 − p)⁴. The number 35 is often written as C(7, 3) (read "7 choose 3"),
where it denotes the number of subsets of 3 objects that could be drawn from a set of 7 objects.
Because any choice of 3 objects from 7 must leave behind 4 objects, and vice versa, it must be true
that

C(7, 4) = C(7, 3)

as can be seen from Pascal's triangle above. A formula for calculating this coefficient is:

C(7, 3) = (7 × 6 × 5)/(1 × 2 × 3) = 7!/(3! 4!) = 35 ,

where, for example, 3! = 1 × 2 × 3 = 6.
In general, the number n!, called "factorial n", is 1 × 2 × ⋯ × n, with the convention that 0! = 1, and
the number of choices of r objects from n is

C(n, r) = [n × (n − 1) × ⋯ × (n − r + 1)]/(1 × 2 × ⋯ × r) = n!/(r!(n − r)!) = C(n, n − r) .
A general formula for binomial probabilities is therefore

P(X = r) = C(n, r) p^r (1 − p)^(n−r)    for r = 0, 1, 2, . . . , n.
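Python's standard library computes the coefficient C(n, r) as math.comb(n, r), so the formula can be checked directly (an illustrative sketch, not part of the course material):

```python
from math import comb

def binom_pmf(r, n, p):
    """P(X = r) for a binomial(n, p) random variable."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

# The coefficients for n = 7 match row n=7 of Pascal's triangle...
print([comb(7, r) for r in range(8)])  # [1, 7, 21, 35, 35, 21, 7, 1]

# ...and the probabilities over r = 0..n sum to 1, as they must.
total = sum(binom_pmf(r, 7, 0.3) for r in range(8))
print(round(total, 10))  # 1.0
```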
Formulae for the mean and variance of the binomial distribution are:

μ = np and σ² = np(1 − p) .

For example, for n = 1, μ = 0 × (1 − p) + 1 × p = p and σ² = (0 − p)²(1 − p) + (1 − p)²p = p(1 − p).
We could calculate binomial probabilities using a calculator or by using Table 6, which gives values
of P(X ≤ r) for values of n from 3 to 19, r from 0 to n and p from 0.01 to 0.50. For values of p
greater than 0.5, use the fact that the number of failures, n − X, also has a binomial distribution,
with parameter 1 − p instead of p. Here are some examples:
Example 7.6 Suppose n = 7 and p = 0.30. Then
1. P(X ≤ 2) = 0.6471
2. P(X < 4) = P(X ≤ 3) = 0.8740
3. P(X = 2) = P(X ≤ 2) − P(X ≤ 1) = 0.6471 − 0.3294 = 0.3177
4. P(X > 4) = 1 − P(X ≤ 4) = 1 − 0.9712 = 0.0288
5. P(X ≥ 3) = 1 − P(X ≤ 2) = 1 − 0.6471 = 0.3529
6. Using a calculator, P(X = 2) = 21 × (0.3)² × (0.7)⁵ = 0.3176523 ≈ 0.3177
Example 7.7 The probability that a person will respond to a mailed advertisement is 0.1.
What is the probability that at most two people out of a group of ten will respond?
Let X denote the number of people who respond. Assuming that we have a binomial experiment, we
have n = 10 and p = 0.1. We require P(X ≤ 2) = 0.9298.
Example 7.8 In the game of chuck-a-luck you pay 1p to play and then you throw 3 dice.
If you throw one six you get 1p back; if you throw two sixes you get 2p back and if you
throw 3 sixes you get 3p back. What is the probability you at least get your money back?
Let X be the number of sixes thrown. Here we have a binomial experiment with n = 3 and p = 1/6.
We require P(X ≥ 1) = 1 − P(X = 0) = 1 − (5/6)³ ≈ 1 − 0.5787 = 0.4213.
Alternatively, we may interpolate from Table 6: when p = 0.15, P(X = 0) = 0.6141 and when
p = 0.20, P(X = 0) = 0.5120, so when p = 1/6 ≈ 0.167, P(X = 0) ≈ 0.58. Thus P(X ≥ 1) ≈ 0.42.
7.5 Poisson distribution
Suppose events occur at a rate of λ per unit (for example the unit may be time or length). Let
X denote the (random) number of events that occur in a particular unit. Then X has a Poisson
distribution with mean λ if

P(X = r) = λ^r e^(−λ) / r!    for r = 0, 1, 2, . . . .

Poisson probabilities may be obtained using a calculator or from Table 7, which gives values of P(X ≤ r)
for values of λ from 0 to 20. This table is used in the same way as Table 6 is used for binomial
probabilities.
Example 7.9 During a certain period of the day the average number of telephone calls
per minute coming into a switchboard is 4. What is the probability that
(a) in one minute during this period the switchboard receives at most 3 calls;
(b) in two minutes during this period the switchboard receives more than 8 calls?
For (a), let X be the number of calls in the relevant one-minute period. Suppose that X has a Poisson
distribution with mean λ = 4. We require P(X ≤ 3) = 0.4335 from Table 7.
For (b), let X be the number of calls in the relevant two-minute period. Suppose that X has a Poisson
distribution with mean λ = 8. We require P(X > 8) = 1 − P(X ≤ 8) = 1 − 0.5925 = 0.4075 from
Table 7.
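The same Table 7 values can be reproduced from the Poisson formula. A Python sketch (an illustration only):

```python
from math import exp, factorial

def poisson_cdf(r, lam):
    """P(X <= r) for a Poisson random variable with mean lam."""
    return sum(lam**k * exp(-lam) / factorial(k) for k in range(r + 1))

# Example 7.9(a): at most 3 calls in one minute, lambda = 4.
print(round(poisson_cdf(3, 4), 4))      # 0.4335
# Example 7.9(b): more than 8 calls in two minutes, lambda = 8.
print(round(1 - poisson_cdf(8, 8), 4))  # 0.4075
```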
The variance of the Poisson distribution is equal to the mean λ. Hence the standard deviation is
σ = √λ. For example, in Example 7.9, if we counted the number of calls arriving in a minute, for each
of a large number of different minutes, these numbers would vary about a mean of 4 with a standard
deviation of 2.
The Poisson distribution is a limiting case of the binomial distribution in the following sense: suppose
X has a binomial distribution with index n and success probability p = λ/n. Then as n becomes large
(and therefore p becomes small), the distribution of X tends to Poisson with mean λ. In particular,
the binomial mean is np = λ for all n and the binomial variance is σ² = np(1 − p) = λ(1 − λ/n), which
approaches λ as n increases.
7.6 Approximations to the binomial distribution for large n
Let X have a binomial distribution with index n and success probability p.
1. If n is large and p is not too close to 0 or 1, then

P(X ≤ r) ≈ P(Y ≤ r + 0.5)

where Y has a normal distribution with mean μ = np and standard deviation σ = √(np(1 − p)).
When p is near 0.5, this approximation works well even for quite small n, e.g., n = 20.
2. If n is large and p is close to 0, then the binomial distribution can be approximated by the
Poisson distribution with λ = np.
Example 7.10 The proportion of bull calves born to domestic cattle is 0.512. What is
the probability that out of 100 calves born fewer than 50 are bulls?
Let X denote the number of bull calves born out of the 100 calves. Then X has a binomial distribution
with n = 100 and p = 0.512. We require P(X < 50). Using the normal approximation, we have

P(X < 50) = P(X ≤ 49) ≈ P(Y ≤ 49.5)

where Y has a normal distribution with mean μ = 100 × 0.512 = 51.2 and standard deviation
σ = √(100 × 0.512 × (1 − 0.512)) = 4.999. So

P(X < 50) ≈ P(Z ≤ (49.5 − 51.2)/4.999) = P(Z ≤ −0.3401) = 1 − 0.6331 = 0.3669 .
Example 7.11 It is suggested that 0.006% of insured males die in road accidents each
year. What is the probability that in a given year, an insurance company must pay off 3
out of the 10000 policies against such accidents that they have?
Let X denote the number of claims for death from road accidents in a given year. Then X has
a binomial distribution with n = 10000 and p = 0.00006. Hence X has approximately a Poisson
distribution with λ = 10000 × 0.00006 = 0.6. Thus

P(X = 3) = P(X ≤ 3) − P(X ≤ 2) = 0.9966 − 0.9769 = 0.0197 .
7.7 Normal approximation to the Poisson distribution for large λ
Let X have a Poisson distribution with mean λ. Then if λ is reasonably large,

P(X ≤ r) ≈ P(Y ≤ r + 0.5)

where Y has a normal distribution with mean λ and standard deviation σ = √λ. This approximation
can be used to calculate Poisson probabilities for values of λ outside the range of Table 7.
For example, suppose X has a Poisson distribution with mean λ = 20 and we want to find P(X ≤ 24).
From Table 7 we get P(X ≤ 24) = 0.8432, while the above approximation gives

P(X ≤ 24) ≈ P(Y ≤ 24.5) = P(Z ≤ (24.5 − 20)/√20) = P(Z ≤ 1.0062) = 0.8428

which is not very different. The normal approximation is of course better for larger λ.
8 Inference for Binomial and Poisson Parameters
Example 8.1 A lady says she can tell by taste whether tea has been made with tea bags
or bulk tea. She sips from 15 pairs of cups, one with each kind of tea, and makes the
correct identification 9 times. Is there reason to think that the lady really can tell the
difference?
Let X denote the number of correct identifications out of 15 and let p be the probability the lady
correctly identifies the two teas in a pair. We test the null hypothesis

H0 : p = 0.5

which is what p would be if she made a random choice. If the lady has some ability to choose correctly
we would expect p to be greater than 0.5. However it is also possible that p might be less than 0.5:
she might make the wrong choice more often than by chance! We therefore use a two-sided test as
usual.
The P-value is the probability of getting a result as extreme as (or more extreme than) that observed,
if H0 is true. The one-sided P-value is P(X ≥ 9), where X has a binomial distribution with n = 15
and p = 0.5. Thus the two-sided P-value is

P = 2 × P(X ≥ 9) = 2(1 − P(X ≤ 8)) = 2(1 − 0.6964) = 0.61 .

So the data are consistent with H0. There is no evidence that the lady can distinguish the two types
of tea by tasting.
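The two-sided P-value can be computed exactly rather than read from Table 6. A Python sketch of Example 8.1 (illustrative only):

```python
from math import comb

def binom_cdf(r, n, p):
    """P(X <= r) for a binomial(n, p) random variable."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(r + 1))

# Example 8.1: 9 correct identifications out of 15 pairs; H0: p = 0.5.
n = 15
p_one_sided = 1 - binom_cdf(8, n, 0.5)  # P(X >= 9)
p_two_sided = 2 * p_one_sided
print(round(p_two_sided, 2))            # 0.61
```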
Example 8.2 A group of prospectors for a certain mineral has been operating for some
time in an extensive area covering several hundred square kilometres, where deposits of the
mineral are found randomly located over the area at an average density of 1000 deposits
per square kilometre. The group is considering moving to another location. They decide
to carry out a pilot survey of an area of 10,000 square metres in the new location. If they
find 17 deposits in their pilot area, how strong is the evidence that the new location has a
different density of deposits to the existing one?
Let X denote the number of deposits found in the pilot area. Let λ be the average number of deposits
per 10,000 square metres in the new area. In the old area the density of deposits is 1000 per square
km, which is 10⁻³ per square metre, and therefore 10⁴ × 10⁻³ = 10 per 10,000 square metres. So we
test

H0 : λ = 10 .

If H0 is true, X will have a Poisson distribution with mean λ = 10, because deposits are randomly
located. So the one-sided P-value is

P(X ≥ 17) = 1 − P(X ≤ 16) = 1 − 0.9730 = 0.0270 .

Hence the two-sided P-value is P = 2 × 0.0270 = 0.054. (We use the rule of doubling the one-sided
P-value.) This is greater than, but very close to, the conventional 0.05, and suggests that there is some
(but not strong) evidence that the density of deposits is higher in the new location. The evidence is
not clear cut, but it might be useful to calculate a confidence interval for λ.
Example 8.3 From Mendelian inheritance theory it is expected that certain crosses of
pea will give yellow and green peas in the ratio 3:1. In a particular experiment 180 yellow
and 48 green peas were obtained. Does this experiment support the theory?
Let X denote the number of green peas obtained out of 228. Let p be the probability of a pea being
green. According to the theory, p = 1/(3 + 1) = 0.25, so we test

H0 : p = 0.25 .

If H0 is true, X has a binomial distribution with n = 228 and p = 0.25. The expected value of X is
np = 228 × 0.25 = 57 and the standard deviation of the distribution of X is σ = √(228 × 0.25 × 0.75) ≈
6.538. We find the one-sided P-value using the normal approximation:

P(X ≤ 48) ≈ P(Y ≤ 48.5) = P(Z ≤ (48.5 − 57)/6.538) = P(Z ≤ −1.30) = 1 − 0.9032 = 0.0968 .

Hence the two-sided P-value is P = 2 × 0.0968 = 0.1936, so the data are consistent with H0. We conclude
that there is no evidence to suggest that the theory is wrong.
Example 8.4 Medical researchers studied the effect of tight neckties on the flow of blood
to the head and the possible decrease in the brain's ability to respond to visual information.
Results of a random sample of 250 businessmen found that 167 were wearing their tie too
tight. Find a 95% confidence interval for the proportion of the population of businessmen
who wear their tie too tight.
Let p be the population proportion of businessmen who wear their tie too tight. Let X denote the
number of businessmen out of 250 who wear their tie too tight. Then X has a binomial distribution
with n = 250 and success probability p. The sample proportion 167/250 is an estimate of p:

p̂ = 167/250 = 0.668 .

The standard error of this estimate is

se(p̂) = √(p(1 − p)/n) .

We do not know p but we can obtain an approximate standard error by using p̂ in this formula:

se(p̂) ≈ √(p̂(1 − p̂)/n) = √((167/250) × (83/250) × (1/250)) = 0.0298 .

Since n is large we can use the normal approximation to the sampling distribution of p̂. Thus an
approximate 95% confidence interval for p is

p̂ ± 1.96 se(p̂) = 0.668 ± 1.96 × 0.0298 = (0.610, 0.726) .

Similarly an approximate 99% confidence interval for p is

p̂ ± 2.5758 se(p̂) = 0.668 ± 2.5758 × 0.0298 = (0.591, 0.745) .

Note: 1.96 is the upper 2.5 percentage point of the standard normal distribution, and 2.5758 is the
upper 0.5 percentage point.
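The estimate, standard error and interval limits are quickly reproduced in Python (a numerical check of the calculation above, nothing more):

```python
from math import sqrt

# Example 8.4: 167 of 250 businessmen wear their tie too tight.
n, x = 250, 167
p_hat = x / n                          # sample proportion
se = sqrt(p_hat * (1 - p_hat) / n)     # approximate standard error

lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se  # 95% confidence interval
print(round(p_hat, 3), round(se, 4))   # 0.668 0.0298
print(round(lo, 3), round(hi, 3))      # 0.61 0.726
```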
9 Frequency Data and Chi-Square Tests
9.1 Fitting a Probability Model
Example 9.1 The data below give the observed frequency of different kinds of pea seeds
in crosses from plants with round yellow seeds and plants with wrinkled green seeds. The
Mendelian theory of inheritance suggests that 1/16 of the seeds will be wrinkled and green
(WG), 3/16 will be round and green (RG), 3/16 will be wrinkled and yellow (WY) and
9/16 will be round and yellow (RY).
Type of seed: RY WY RG WG Total
Observed number: 93 27 32 8 160
Do the data agree with the theory?
Let p1, p2, p3, p4 be the proportions of RY, WY, RG, WG seeds in the population. We test the null
hypothesis

H0 : p1 = 9/16, p2 = 3/16, p3 = 3/16, p4 = 1/16 .

If H0 is true then in a sample of 160 observations we would expect 90, 30, 30, 10 of RY, WY, RG,
WG seeds respectively.
We can try to assess how well the data agree with the theory by comparing the observed and expected
frequencies in a table. How can we judge their agreement? To judge how well an expected frequency
E agrees with an observed frequency O we need a measure that depends on the difference O − E, and
also on the size of E. For example, we would consider O = 20 and E = 10 to be in poor agreement,
but O = 210 and E = 200 to be in rather good agreement, even though O − E = 10 in both cases.
According to statistical theory, a good measure is the standardised residual defined by

(O − E)/√E .

A useful informal rule of thumb is that O and E are in reasonably good agreement if (O − E)/√E
is between −2 and +2. For example, when O = 20 and E = 10 we get (20 − 10)/√10 = 3.16, which
is poor agreement. But when O = 210 and E = 200 we get (210 − 200)/√200 = 0.71, which is good
agreement (between −2 and +2). Note that this rule applies to frequencies or counts (that is,
numbers on the counting scale 0, 1, 2, . . .) but not to measurements that have units, such as lengths,
times, etc.
We also need a measure of how well the set of observed frequencies O1, O2, . . . , Ok agree with the
expected frequencies E1, E2, . . . , Ek overall. The conventional measure is the chi-square statistic
defined by

χ²_stat = Σ_{i=1}^{k} (Oi − Ei)² / Ei .

In words, the chi-square statistic is the sum of the squares of the standardised residuals (O − E)/√E.
If χ²_stat = 0 the observed and expected frequencies agree exactly. The larger χ²_stat is, the worse is the
agreement. Furthermore, according to statistical theory, if the null hypothesis (H0) is true, then χ²_stat
should be a random value from a χ²-distribution with degrees of freedom ν equal to the number of
categories minus one. We can then find the P-value as the upper tail probability P(Y > χ²_stat), where Y
has a χ²-distribution with ν degrees of freedom. Table 8 gives percentage points for χ²-distributions. As
usual, a small P-value (corresponding to a large χ²_stat) gives evidence against H0, and would therefore
suggest that the expected frequencies do not agree sufficiently well with the observed frequencies.
Returning to Example 9.1, here are the observed and expected frequencies and standardised residuals:

Type of seed RY WY RG WG Total
Observed number Oi 93 27 32 8 160
Expected number Ei 90 30 30 10 160
(Oi − Ei)/√Ei 0.32 −0.55 0.37 −0.63

The standardised residuals are all comfortably between −2 and +2 so it looks as if the observed and
expected numbers are in good agreement. To do a formal test, we calculate the chi-square statistic:
χ²_stat = (0.32)² + (−0.55)² + (0.37)² + (−0.63)² = 0.933. The degrees of freedom are ν = 4 − 1 = 3.
The relevant extract from Table 8 is

tail probability p 0.90 0.75
χ²_p for ν = 3 0.58 1.21

We have 0.58 < 0.933 < 1.21 so our P-value is between 0.75 and 0.90. The P-value is not small (much
greater than 0.05) so the data are consistent with H0. We conclude that the data do support the
theory.
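The residuals and χ²_stat for Example 9.1 can be computed in a few lines of Python (illustrative only):

```python
from math import sqrt

# Example 9.1: observed seed counts and Mendelian expected proportions.
observed = [93, 27, 32, 8]
expected = [160 * p for p in (9/16, 3/16, 3/16, 1/16)]  # [90, 30, 30, 10]

# Standardised residuals (O - E)/sqrt(E) and the chi-square statistic.
residuals = [(o - e) / sqrt(e) for o, e in zip(observed, expected)]
chi2_stat = sum(r**2 for r in residuals)
print([round(r, 2) for r in residuals])  # [0.32, -0.55, 0.37, -0.63]
print(round(chi2_stat, 3))               # 0.933
```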
9.2 Fitting a binomial distribution
Example 9.2 A factory has 4 machines and the number of machines breaking down each
week is observed for 100 weeks with the following results.
Number of breakdowns 0 1 2 3 4 Total
Observed number of weeks 63 28 6 2 1 100
We are interested in whether breakdowns are random in the sense that each machine has
the same chance of breaking down in any week, independently of what may happen to
other machines and in other weeks.
Let X denote the number of machines breaking down in a week. Then if breakdowns are random in the
above sense, X will have a binomial distribution with index n = 4 and unknown success probability p,
where p is the probability that any given machine breaks down in any week. (Observing a machine for
a week is a "trial" and a machine breaking down is a "success". The 100 weeks represent 100 realisations
of the random variable X.) We will therefore fit a binomial distribution to these data and test the
goodness of fit. The probabilities of 0, 1, 2, 3 or 4 breakdowns in any week are

P(X = r) = C(4, r) p^r (1 − p)^(4−r)    for r = 0, 1, 2, 3, 4.

We want to calculate expected frequencies given by 100 × P(X = r). We do not know the value of p,
so we estimate it from the data as:

p̂ = (total number of successes)/(total number of trials)
  = (0 × 63 + 1 × 28 + 2 × 6 + 3 × 2 + 4 × 1)/(100 × 4) = 50/400 = 0.125 .

Note that this is equivalent to calculating the sample mean number of breakdowns per week (x̄ =
50/100 = 0.5) and equating this to the theoretical mean of the binomial distribution 4p, so that
p̂ = 0.5/4 = 0.125.
Here are the observed and expected frequencies for Example 9.2:
Number of breakdowns 0 1 2 3 4 Total
Observed number of weeks 63 28 6 2 1 100
Probability P(X = r) 0.5862 0.3350 0.0718 0.0068 0.0002 1
Expected number of weeks 58.62 33.50 7.18 0.68 0.02 100
In Example 9.1 all of the expected values were reasonably large. But if one or more of the expected
values is too small (less than about 5) then the above rules for interpreting the standardised residuals
and χ²_stat can fail. It is standard practice then to pool categories so that all expected frequencies are
greater than or equal to 5. This gives the table:
Number of breakdowns 0 1 ≥2
Observed frequency Oi 63 28 9
Expected frequency Ei 58.62 33.50 7.88
(Oi − Ei)/√Ei 0.57 −0.95 0.40
Again, the standardised residuals are comfortably between −2 and +2, suggesting that the Oi and Ei
agree well.
The chi-square statistic is χ²_stat = (0.57)² + (−0.95)² + (0.40)² = 1.39. Also, because p was estimated
from the data, we must reduce the degrees of freedom by one, so ν = 3 − 1 − 1 = 1. From Table 8
with ν = 1, you can check that the P-value is between 0.20 and 0.25. Again, this is not small and the
binomial distribution fits the data well. We conclude that the data are consistent with the hypothesis
that each machine has, independently, the same chance of breaking down in any week.
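The fitted expected frequencies for Example 9.2 can be reproduced in Python, estimating p from the data exactly as above (an illustration only):

```python
from math import comb

# Example 9.2: weeks with 0..4 machine breakdowns, out of 100 weeks.
observed = [63, 28, 6, 2, 1]
n_weeks, n_machines = 100, 4

# Estimate p as (total breakdowns) / (total machine-weeks).
breakdowns = sum(r * o for r, o in enumerate(observed))      # 50
p_hat = breakdowns / (n_weeks * n_machines)                  # 0.125

# Expected frequencies 100 * P(X = r) under binomial(4, p_hat).
expected = [n_weeks * comb(n_machines, r) * p_hat**r
            * (1 - p_hat)**(n_machines - r) for r in range(5)]
print([round(e, 2) for e in expected])  # [58.62, 33.5, 7.18, 0.68, 0.02]
```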
9.3 Fitting a Poisson distribution
Example 9.3 Twenty-five leaves were selected at random from each of six McIntosh apple
trees in a single orchard. The following table shows the distribution of European red mites
on these 150 apple leaves.
Number of mites on a leaf 0 1 2 3 4 5 6 7 Total
Observed number of leaves 70 38 17 10 9 3 2 1 150
We are interested in whether mites are distributed randomly on leaves, in the sense that
any leaf has the same chance of receiving a mite, independently of whether it has other
mites.
If mites are distributed randomly in the above sense, then the numbers of mites per leaf would follow
a Poisson distribution. So we will fit a Poisson distribution and test the goodness of fit.
The mean rate, λ, of mites per leaf is not known. We may use the sample mean as an estimate of λ:

λ̂ = x̄ = (0 × 70 + 1 × 38 + 2 × 17 + 3 × 10 + 4 × 9 + 5 × 3 + 6 × 2 + 7 × 1)/150 = 1.1467 .
Using this estimate, the expected number of leaves with r mites is therefore

150 × (1.1467)^r e^(−1.1467) / r!    for r = 0, 1, 2, . . . .
Here are the results of the calculations; in the last two columns, the rows for leaves with three or
more mites have been pooled:

Number of mites  Observed number  Probability  Expected number  Obs − Exp    Std res
0                70               0.3177       47.65            70 − 47.65   3.24
1                38               0.3643       54.64            38 − 54.64   −2.25
2                17               0.2089       31.33            17 − 31.33   −2.56
3                10               0.0798       11.97            25 − 16.36   2.14
4                 9               0.0229        3.43
5                 3               0.0052        0.79
6                 2               0.0010        0.15
7                 1               0.0002        0.02
>7                0               0.0000        0.00
Total           150               1.0000      149.98
Note that although no leaves were observed with 8 or more mites on them, it is possible that such
leaves exist and this must be allowed for in the calculations. Thus the last group is "> 7" and the last
probability is 1 − (the sum of all the previous probabilities). In this case it is zero to 4 decimal places,
but sometimes it may be greater.
All four standardised residuals are large in absolute value (outside the range −2 to +2) so the fit does
not look good. The chi-square statistic is

χ²_stat = (3.24)² + (−2.25)² + (−2.56)² + (2.14)² = 26.67

with degrees of freedom ν = 4 − 1 − 1 = 2. Again we subtract an extra degree of freedom because λ
was estimated from the data. From Table 8 the P-value is less than 0.001, which provides significant
evidence against H0. (The null hypothesis here is that mites are distributed randomly on the leaves.)
The data therefore suggest that the mites are not distributed at random on the leaves. This could be
due to the fact that the mites exist in colonies, so that a leaf is more likely to be attacked by several
mites than a single mite, or that the mites have only attacked a few trees and the rest are free from
infestation. More information is needed to discover the reason.
9.4 Contingency tables
Example 9.4 A researcher, investigating public attitudes to the level of welfare benefits
in Britain, carried out a pilot survey in the town where she lived. A simple random sample
of 200 individuals was drawn from the electoral register, and each member of the sample
was asked to complete a questionnaire. One question asked what the respondent felt about
the current level of child benefit. Respondents were also assigned to an occupational group
according to the occupation of the principal provider of financial support within their
household. The following table gives the numbers of people giving each response, for each
occupation group:
Response about level of benefit
Too high About right Too low Don't know Total
Occupation
Non-manual 18 29 10 15 72
Manual 13 40 26 14 93
None currently 3 13 11 8 35
Total 34 82 47 37 200
We are interested in whether a person's occupation affects their feeling about child benefit;
i.e., is the pattern of responses the same for each occupation group, and if not, how does
it differ?
We will formally test the null hypothesis that there is no association between occupation and re-
sponse. But before doing so it is helpful to look at summary statistics in the form of row proportions.
Sometimes it may be the column proportions that are of interest, but in this example we want to look
at the row proportions the proportion of each response with in each occupation group. We divide
each entry in the contingency table by the relevant row total; two decimal places is good enough to
see any pattern:
Table of row proportions
Too high About right Too low Dont know Total
Occupation
Non-manual .25 .40 .14 .21 1.00
Manual .14 .43 .28 .15 1.00
None currently .09 .37 .31 .23 1.00
Total .17 .41 .24 .19 1.00
Some of the numbers of responses are quite small so we should not read too much into these proportions.
But the general pattern seems to be that a higher proportion of the non-manual group think that child
benefit is too high, compared with the other two groups, about the same proportions (about 40%)
of the three groups think the benefit is about right, and a lower proportion of the non-manual group
thinks it is too low.
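These row proportions are easy to compute programmatically. The following sketch is illustrative only, not part of the original notes; the counts are those of Example 9.4:

```python
# Row proportions for the contingency table of Example 9.4,
# rounded to 2 d.p. as in the notes.
counts = [
    [18, 29, 10, 15],  # Non-manual
    [13, 40, 26, 14],  # Manual
    [3, 13, 11, 8],    # None currently
]
row_props = [[round(c / sum(row), 2) for c in row] for row in counts]
for row in row_props:
    print(row)  # first row: [0.25, 0.4, 0.14, 0.21]
```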
We can formally test whether there is an association between response and occupation group using
a χ²-test as follows. We will calculate a table of expected frequencies and a table of standardised
residuals and hence a χ²_stat and a P-value.
We can express the null hypothesis H₀ in different ways:
H₀: There is no association between response and occupation group,
H₀: Response and occupation group are independent, or
H₀: The population proportions of responses are the same for each occupation group.
It is often the third of these that is most easily interpreted. If the null hypothesis were true then we
would expect the same row proportions for each occupation group. These would be approximately
34/200 too high, 82/200 about right, 47/200 too low and 37/200 don't know (i.e., estimated from the
column totals).
There are 72 people with non-manual occupations so, if H₀ is true, the expected numbers of these
people in the four response groups are
(34/200) × 72 = 12.24,  (82/200) × 72 = 29.52,  (47/200) × 72 = 16.92,  (37/200) × 72 = 13.32 .
Similarly, if H₀ is true, the expected numbers of the different responses in the manual occupation
group are
(34/200) × 93 = 15.81,  (82/200) × 93 = 38.13,  (47/200) × 93 = 21.85,  (37/200) × 93 = 17.21 .
And for the no-current-occupation group, the expected frequencies are
(34/200) × 35 = 5.95,  (82/200) × 35 = 14.35,  (47/200) × 35 = 8.22,  (37/200) × 35 = 6.48 .
Here is the table of expected frequencies:
Table of expected frequencies
Too high About right Too low Don't know Total
Occupation
Non-manual 12.24 29.52 16.92 13.32 72
Manual 15.81 38.13 21.85 17.21 93
None currently 5.95 14.35 8.22 6.48 35
Total 34 82 47 37 200
Note that it has the same row and column totals as the table of observed frequencies. The general
formula is
expected frequency = (row total × column total) / grand total .
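As an illustration (not part of the original notes), this formula can be applied to the margins of the observed table in a few lines of Python:

```python
# Expected frequencies: E = row total * column total / grand total.
counts = [[18, 29, 10, 15], [13, 40, 26, 14], [3, 13, 11, 8]]
row_totals = [sum(row) for row in counts]        # [72, 93, 35]
col_totals = [sum(col) for col in zip(*counts)]  # [34, 82, 47, 37]
grand = sum(row_totals)                          # 200
expected = [[r * c / grand for c in col_totals] for r in row_totals]
print([round(e, 2) for e in expected[0]])  # [12.24, 29.52, 16.92, 13.32]
```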
To compare the observed and expected frequencies, it is useful to calculate a table of standardised
residuals (O − E)/√E:
Table of standardised residuals
Too high About right Too low Don't know
Occupation
Non-manual 1.65 −0.10 −1.68 0.46
Manual −0.71 0.30 0.89 −0.77
None currently −1.21 −0.36 0.97 0.60
In spite of the pattern we saw above, these standardised residuals are all between −2 and +2, so the
observed and expected frequencies seem to agree reasonably well. The chi-square statistic is
χ²_stat = (1.65)² + (−0.10)² + ⋯ + (0.60)² = 10.62 .
What are the degrees of freedom? Statistical theory tells us that for a contingency table with r rows
and c columns the degrees of freedom is ν = (r − 1) × (c − 1). (This is because the table of expected
frequencies is forced to have the same row and column totals as the table of observed frequencies.)
So in the present example, with 3 rows and 4 columns, ν = 2 × 3 = 6. From Table 8 the P-value is
therefore approximately 0.10. (P(Y > 10.62) ≈ 0.10, where Y has a χ²-distribution with 6 degrees of
freedom.)
On the basis of these data there is therefore insufficient evidence to conclude that there is any rela-
tionship between a person's occupational group and his or her attitude towards the current level of
child benefit.
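For readers who want to check the arithmetic, here is an illustrative sketch (not part of the notes) that recomputes χ²_stat and its degrees of freedom from the counts, using scipy, an assumed dependency, in place of Table 8 for the tail probability:

```python
from scipy.stats import chi2

observed = [[18, 29, 10, 15], [13, 40, 26, 14], [3, 13, 11, 8]]
row_t = [sum(r) for r in observed]
col_t = [sum(c) for c in zip(*observed)]
grand = sum(row_t)
# Pearson chi-square statistic: sum of (O - E)^2 / E over all cells.
stat = sum((o - r * c / grand) ** 2 / (r * c / grand)
           for r, row in zip(row_t, observed)
           for c, o in zip(col_t, row))
dof = (len(row_t) - 1) * (len(col_t) - 1)
p_value = chi2.sf(stat, dof)
print(round(stat, 2), dof)  # 10.62 with 6 degrees of freedom; P-value near 0.10
```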
Example 9.5 The following data give the sex (S) and the usual means of travel to work
(T) for a random sample of 140 employees of a large company (E). The data are coded
for sex as 1 = male and 2 = female and for usual means of travel to work as 1 = car or
motor cycle driver, 2 = car or motor cycle passenger, 3 = public transport and 4 = walk
or pedal cycle.
E S T E S T E S T E S T E S T E S T E S T
1 1 1 21 2 4 41 1 4 61 1 4 81 1 1 101 1 1 121 1 1
2 1 1 22 1 1 42 2 1 62 1 1 82 1 1 102 2 3 122 1 1
3 1 1 23 2 1 43 2 2 63 1 4 83 2 2 103 1 1 123 1 3
4 1 3 24 1 1 44 1 1 64 1 1 84 1 1 104 1 1 124 1 3
5 1 1 25 2 2 45 2 3 65 2 4 85 1 1 105 2 4 125 1 4
6 1 1 26 2 2 46 1 1 66 1 1 86 2 4 106 1 2 126 2 1
7 1 1 27 1 4 47 1 1 67 1 4 87 2 3 107 1 1 127 2 3
8 2 4 28 1 1 48 2 1 68 1 4 88 1 1 108 1 3 128 1 1
9 2 2 29 1 1 49 1 4 69 1 3 89 1 4 109 2 1 129 1 1
10 1 1 30 2 4 50 2 1 70 1 1 90 1 4 110 1 1 130 2 4
11 2 4 31 2 1 51 2 4 71 2 3 91 1 1 111 2 1 131 1 4
12 1 1 32 2 1 52 1 4 72 1 1 92 1 4 112 1 2 132 1 1
13 1 1 33 1 1 53 1 4 73 1 1 93 2 2 113 1 2 133 1 1
14 2 4 34 1 1 54 2 1 74 1 3 94 1 4 114 1 1 134 2 1
15 1 1 35 2 4 55 1 1 75 1 3 95 2 4 115 2 1 135 1 3
16 1 3 36 2 2 56 2 3 76 1 2 96 2 3 116 1 1 136 1 1
17 2 4 37 2 4 57 1 1 77 1 1 97 2 4 117 1 4 137 1 2
18 1 4 38 1 1 58 2 1 78 1 1 98 2 1 118 1 1 138 1 4
19 1 1 39 2 2 59 2 1 79 1 1 99 1 1 119 1 1 139 2 4
20 2 4 40 2 1 60 2 2 80 2 2 100 1 1 120 1 4 140 1 4
Is there any association between a person's sex and his or her mode of transport?
Here is a contingency table compiled from the raw data:
Male (1) Female (2) Total
Car or motor cycle driver (1) 56 16 72
Car or motor cycle passenger (2) 5 10 15
Public transport (3) 9 7 16
Walk or pedal cycle (4) 20 17 37
Total 90 50 140
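Compiling such a table from coded records is a simple tallying exercise. This sketch (illustrative, not from the notes; only the first nine records are typed in) shows one way to do it in Python:

```python
from collections import Counter

# A few of the 140 coded (S, T) pairs, read off the listing above
# (employees 1-9); the full table in the text uses all 140 records.
records = [(1, 1), (1, 1), (1, 1), (1, 3), (1, 1),
           (1, 1), (1, 1), (2, 4), (2, 2)]
tally = Counter(records)
# Rows = transport code 1..4, columns = sex code 1..2, as in the text.
table = [[tally[(s, t)] for s in (1, 2)] for t in (1, 2, 3, 4)]
print(table)  # [[6, 0], [0, 1], [1, 0], [0, 1]]
```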
Let us first carry out the χ²-test. The null hypothesis is that there is no association between sex and
mode of transport. You can check that the expected frequencies and standardised residuals are:
Expected frequencies Standardised residuals
Male Female Male Female
Transport 1 46.29 25.71 1.43 -1.92
2 9.64 5.36 -1.50 2.01
3 10.29 5.71 -0.40 0.54
4 23.79 13.21 -0.78 1.04
So
χ²_stat = (1.43)² + (−1.92)² + (−1.50)² + (2.01)² + (−0.40)² + (0.54)² + (−0.78)² + (1.04)² = 14.10,
with degrees of freedom ν = 3 × 1 = 3. From Table 8, the P-value is between 0.001 and 0.005,
which represents quite strong evidence against H₀. Note also that there are two fairly substantial
standardised residuals of −1.92 and 2.01. We may conclude that there is evidence of a relationship
between a person's sex and his or her means of transport to work.
Now let us look at some proportions to see what the relationship is. The row proportions are
Male Female Total
Car or motor cycle driver (1) .78 .22 1.00
Car or motor cycle passenger (2) .33 .67 1.00
Public transport (3) .56 .44 1.00
Walk or pedal cycle (4) .54 .46 1.00
Total .64 .36 1.00
These tell us the relative numbers of males and females for each mode of transport. There is a majority
of males in the sample (64% male and 36% female). The main pattern is that, among those who drive a
car or motor cycle to work, a higher proportion are male (0.78), while among those who are passengers
a lower proportion are male.
Here are the column proportions:
Male Female Total
Car or motor cycle driver (1) .62 .32 .51
Car or motor cycle passenger (2) .06 .20 .11
Public transport (3) .10 .14 .11
Walk or pedal cycle (4) .22 .34 .26
Total 1.00 1.00 1.00
These tell us, for each sex, the relative frequencies of each mode of transport. From the marginal
proportions, half of the sample work force drive to work, while a quarter walk or cycle; 11% take public
transport and another 11% are passengers. Looking at males and females separately, a rather higher
proportion of males drive (.62 compares with .32) and more females are passengers (.20 compared with
.06). These proportions seem more interesting than the row proportions in this case. The chi-square
test has conrmed that there is good evidence that the patterns dier for males and females.
9.5 Comparing two or more binomial proportions
Example 9.6 Four adjacent areas of heathland were burned in turn in successive years.
100 quadrat samples were then taken at random from each area and the presence or absence
of the grass Agrostis tenuis was noted with the following results:
Area Number of quadrats with
(Number of years since Agrostis tenuis
burning in brackets) Present Absent Total
A (1) 26 74 100
B (2) 40 60 100
C (3) 39 61 100
D (4) 47 53 100
Total 152 248 400
We wish to test whether the frequency of Agrostis tenuis varies from area to area, i.e., whether
the proportion of quadrats containing the grass differs over the four areas.
Let p_A, p_B, p_C and p_D be the population proportions of quadrats containing Agrostis tenuis in the 4
areas. We test the null hypothesis
H₀: p_A = p_B = p_C = p_D
against the alternative hypothesis that at least two of these proportions are different.
If H₀ is true an estimate of p, the common proportion of quadrats containing Agrostis tenuis, is
p̂ = 152/400 = 0.38, and we would expect Agrostis tenuis to be present in 100 × 152/400 = 38 and
absent in 100 × 248/400 = 62 quadrats from area A. Similarly for the other three areas. Thus the
method is the same as for contingency tables. Here are the expected frequencies and standardised
residuals:
Expected frequencies Standardised residuals
Area Present Absent Area Present Absent
A 38 62 A -1.95 1.52
B 38 62 B 0.32 -0.25
C 38 62 C 0.16 -0.13
D 38 62 D 1.46 -1.14
The chi-square statistic is χ²_stat = (−1.95)² + ⋯ + (−1.14)² = 9.76 with degrees of freedom ν = 3 × 1 = 3.
From Table 8 the P-value is between 0.01 and 0.025, so there is some (moderate) evidence against
H₀. Note that all of the standardised residuals are between −2 and +2 but one is close to −2. We
conclude that there is some evidence that some of the four proportions differ.
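Since comparing several binomial proportions uses the same machinery as a contingency table, the test can be checked with a standard routine. This sketch uses scipy, an assumed dependency, not something the notes rely on:

```python
from scipy.stats import chi2_contingency

# Present/absent counts for areas A-D from Example 9.6.
table = [[26, 74], [40, 60], [39, 61], [47, 53]]
# correction=False gives the plain Pearson statistic used in the notes.
stat, p_value, dof, expected = chi2_contingency(table, correction=False)
print(round(stat, 2), dof)  # 9.76 with 3 degrees of freedom
```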
We may also calculate estimates and confidence intervals for proportions of interest and also for
differences between proportions. For example:
For area D: p̂_D = 0.47 with standard error se(p̂_D) = √(0.47 × 0.53/100) = 0.0499. So a 95% confidence
interval for p_D is 0.47 ± 1.96 × 0.0499 = 0.47 ± 0.0978 = (0.372, 0.568).
To estimate the difference in proportions for areas D and A: p̂_D − p̂_A = 0.47 − 0.26 = 0.21 with
standard error
se(p̂_D − p̂_A) = √( p̂_D(1 − p̂_D)/n_D + p̂_A(1 − p̂_A)/n_A ) = √( 0.47 × 0.53/100 + 0.26 × 0.74/100 ) = 0.0664 .
A 95% confidence interval for the difference p_D − p_A is therefore 0.21 ± 1.96 × 0.0664 = 0.21 ± 0.1302 =
(0.08, 0.34).
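These normal-approximation intervals can be wrapped in small helper functions; the function names below are illustrative, not from the notes:

```python
import math

def prop_ci(p_hat, n, z=1.96):
    """Normal-approximation 95% CI for a single proportion."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

def diff_ci(p1, n1, p2, n2, z=1.96):
    """Normal-approximation 95% CI for a difference of proportions."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) - z * se, (p1 - p2) + z * se

lo, hi = prop_ci(0.47, 100)                 # area D
d_lo, d_hi = diff_ci(0.47, 100, 0.26, 100)  # areas D and A
print(round(lo, 3), round(hi, 3))      # 0.372 0.568
print(round(d_lo, 2), round(d_hi, 2))  # 0.08 0.34
```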
10 The Normal Linear Regression Model
10.1 Theory
Consider a sample of values of two variables x and y for n individuals. Denote these data by x_i, y_i for
individual i = 1, 2, . . . , n. Section 2.1 gives formulae for the sample means x̄, ȳ, standard deviations
s_x, s_y, correlation coefficient r_xy, the slope b and intercept a of the least squares regression line
y = a + bx, and for the residual standard deviation about this line s_res.
Here we consider a statistical model, in which we imagine that y_i is drawn from a Normal population
with mean α + βx_i and with standard deviation σ, where α, β and σ are unknown parameters. This is
a model for how the observations y_i are related to x_i. In such models y is called a response variable
and x is an explanatory variable.
Thus, each y_i is from a different Normal population: these populations have the same standard
deviation but different means; and the means all lie on a straight line y = α + βx. Some text books
write this model as an equation
y_i = α + βx_i + e_i
where e_1, e_2, . . . , e_n are unobserved random errors. That is, for a particular x_i we can imagine
generating a y_i by taking the value α + βx_i and adding a random number drawn from a Normal
distribution with mean 0 and standard deviation σ.
Under this model, the slope b and intercept a of the least squares regression line are good estimates
of β and α, and s_res is a good estimate of σ. Furthermore, standard errors and confidence intervals
for various parameters can be calculated. The formulae are as follows:
Estimates of slope β, intercept α and error standard deviation σ:
β̂ = b = C_xy/C_xx ,   α̂ = a = ȳ − b x̄ ,   σ̂ = s_res = √( RSS/(n − 2) ) = √( (n − 1) s_y² (1 − r_xy²) / (n − 2) ) ,
where σ is estimated with ν = n − 2 degrees of freedom.
where is estimated with = n 2 degrees of freedom.
Standard errors of the estimates of slope and intercept are given by:
se(β̂) = σ̂/√C_xx ,   se(α̂) = σ̂ √( 1/n + x̄²/C_xx ) ,
where C_xx = Σ_{i=1}^{n} (x_i − x̄)² = (n − 1) s_x² .
.
The mean response at a given x is denoted by μ_x, where μ_x = α + βx. Thus for a particular value
of x, μ_x is the average y for individuals with this x. The estimate of μ_x and its standard error are
μ̂_x = α̂ + β̂x ,   se(μ̂_x) = σ̂ √( 1/n + (x − x̄)²/C_xx ) .
Note that α corresponds to the mean response at x = 0.
We can construct confidence intervals for β and for μ_x by using the t-distribution in exactly the
same way as for the mean of a Normal population, except that we use n − 2 degrees of freedom instead
of n − 1. Thus a 95% confidence interval for β is
β̂ ± t_p se(β̂)
and a 95% confidence interval for μ_x is
μ̂_x ± t_p se(μ̂_x)
where t_p is the upper 2.5 percentage point of the t-distribution with n − 2 degrees of freedom.
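The estimation formulae above translate directly into code. This sketch (the function name is illustrative, not from the notes) computes the estimates and standard errors for plain Python lists:

```python
import math

def fit_line(x, y):
    """Least squares estimates and standard errors as in Section 10.1."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    Cxx = sum((xi - xbar) ** 2 for xi in x)
    Cxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = Cxy / Cxx                     # slope estimate (beta hat)
    a = ybar - b * xbar               # intercept estimate (alpha hat)
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    s_res = math.sqrt(rss / (n - 2))  # estimate of sigma, on n - 2 df
    se_b = s_res / math.sqrt(Cxx)
    se_a = s_res * math.sqrt(1 / n + xbar ** 2 / Cxx)
    return a, b, s_res, se_a, se_b

# Check on exact data y = 1 + 2x: slope 2, intercept 1, zero residual s.d.
a, b, s_res, se_a, se_b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```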
10.2 Examples
Example 10.1 Here are some data for 26 babies born in University College Hospital in a
particular week. The babies are all boys of the same race. The data are their birth weights
in gm (y) and gestational ages in weeks (x), to the nearest week.
x 42 41 39 40 40 40 39 39 41 42 41 43 42
y 3180 2780 3630 3900 3310 2896 2780 3800 3900 4020 4180 3460 4400
x 41 38 37 38 43 35 37 35 38 40 42 39 34
y 3800 2990 3160 2720 3560 2640 2400 2320 2910 3200 3800 3560 2538
We want to see if the relation between birth weight and gestational age is well described
by the normal linear regression model, and to estimate parameters of interest.
As always, we start by looking at a scatter plot of y against x. Here we regard birth weight as the
response variable and age as the explanatory variable.
[Scatter plot of birth weight (gm) against gestational age (weeks), with the fitted line.]
The model says that weights of babies with gestational age x have a Normal distribution with
standard deviation σ, say, and mean α + βx. This does not look unreasonable, given the small
number of babies in our sample.
You may verify the following calculations: x̄ = 39.46, ȳ = 3301.3, s_x = 2.45, s_y = 578.3, r_xy = 0.7054,
C_xx = 150.4615, C_xy = 25020.31, C_yy = 8361816, a = −3260.8, b = 166.3 and s_res = 418.4.
Applying the formulae in 2.1, you may check that the least squares estimates of α and β are α̂ =
−3260.8 gm and β̂ = 166.3 grams per week, and the estimate of σ is σ̂ = 418.4 gm. The line
y = −3260.8 + 166.3x is drawn on the scatter plot and seems to be a good description of how the
average birth weight depends on age. It is not always easy to judge this from the scatter plot, and
it is customary to plot the residuals y_i − a − bx_i against x_i, which makes it easier to see systematic
departures from the model. There does not appear to be a detectable systematic pattern: about half
of the residuals are positive.
[Plot of residuals (gm) against gestational age (weeks).]
Perhaps there is a suggestion of more scatter at higher x values, which
suggests that the assumption that σ is the same for all x may be questionable, but the sample size is
really too small to take this seriously.
The parameter σ is the standard deviation of birth weights for baby boys with the same gestational
age. This is estimated to be 418 gm, which is quite large but not as large as s_y = 578 gm, the
estimated standard deviation of birth weights for babies with different ages.
The parameter α does not have a physical meaning; the model does not make sense when x = 0, and
anyway we would not expect a straight line relationship to hold for abnormally low gestational ages.
The parameter β is the change in average birth weight when gestational age increases by one week.
The point estimate is β̂ = 166 gm. This change is quite small compared with the standard deviation of
418 gm, which is why there is a lot of overlap between points at neighbouring x-values. The standard
error of β̂ is
se(β̂) = σ̂/√C_xx = 418.4/√150.4615 = 34.11 ,
and the upper 0.025 percentage point of the t-distribution with ν = n − 2 = 24 is t_p = 2.064. Hence
a 95% confidence interval for β is
β̂ ± t_p se(β̂) = 166.3 ± 2.064 × 34.11 = (95.9, 236.7) .
Thus the increase in average weight when age increases by one week is estimated to be between 96 and
237 gm (with conventional 95% confidence).
Suppose we want to estimate the average weight of baby boys born at 36 weeks. Our parameter is
μ_x = α + 36β. The point estimate is μ̂_x = −3260.8 + 36 × 166.3 = 2726 gm. This is the point on the
fitted line at x = 36. This estimate has standard error
se(μ̂_x) = σ̂ √( 1/n + (x − x̄)²/C_xx ) = 418.4 × √( 1/26 + (36 − 39.46)²/150.4615 ) = 143.74 ,
and a 95% confidence interval for μ_x is
μ̂_x ± t_p se(μ̂_x) = 2726 ± 2.064 × 143.74 = (2429, 3023) .
Thus the average weight of baby boys born at 36 weeks is estimated to be between 2429 and 3023 gm.
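The headline numbers of this example can be checked with scipy.stats.linregress (scipy is an assumed dependency, not used in the notes); the slope, intercept and slope standard error should agree with the hand calculations above:

```python
from scipy.stats import linregress

# Gestational ages (weeks) and birth weights (gm) from Example 10.1.
x = [42, 41, 39, 40, 40, 40, 39, 39, 41, 42, 41, 43, 42,
     41, 38, 37, 38, 43, 35, 37, 35, 38, 40, 42, 39, 34]
y = [3180, 2780, 3630, 3900, 3310, 2896, 2780, 3800, 3900, 4020, 4180, 3460, 4400,
     3800, 2990, 3160, 2720, 3560, 2640, 2400, 2320, 2910, 3200, 3800, 3560, 2538]
res = linregress(x, y)
print(round(res.slope, 1), round(res.intercept, 1))  # about 166.3 and -3260.8
print(round(res.stderr, 2))                          # about 34.11, se of the slope
```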