1.1 Definitions:
The set of all possible elementary outcomes of an experiment is called the sample space.
A random variable is a mapping of a sample space to the real line.
Note: we will usually denote a random variable by a capital letter (e.g. Y) and the value taken by a
random variable by a lower case letter (e.g. y).
A discrete random variable is a random variable that can take only a finite or countably infinite
number of values.
Note: in this module discrete random variables will usually be integer-valued.
[see 1024 P10]
1.2 Examples
If the experiment consists of tossing a coin with outcomes head or tail and we toss the coin once,
then clearly the sample space is {head, tail}. One possible random variable, X say, is the number of
heads obtained. X can only take values 0 or 1 and so is a discrete random variable.
Suppose we are interested in monitoring the number of hits at a web site in a year. Denote the
number by Y. Clearly, Y must be integer-valued and so is a discrete random variable, but there is no
obvious upper bound to Y, so it may be convenient to take the set of possible values of Y to be the
countably infinite set {0, 1, 2, ...}.
[see 1024 P10]
1.4 Example
If the experiment consists of tossing a coin with outcomes head or tail and we toss the coin once,
then clearly the sample space is {head, tail}. Suppose X, the number of heads obtained, is our
random variable of interest. Then, if the coin is fair, the probability function of X is
p(0) = p(1) = 1/2.
However, if the fairness of the coin is unknown the probability function of X could be taken to be
$$p(x) = \theta^{x}(1-\theta)^{1-x}, \quad x = 0, 1.$$
Here $\theta$ is a parameter, i.e. a fixed but unknown constant. Clearly, $\theta$ must lie between 0 and 1 in
this case since it is a probability. This is a common situation encountered in Statistics: we might
assume we know the form of a probability function but it contains one or more unknown quantities
(parameters) whose value(s) we need to estimate from sample data.
A Bernoulli trial is an experiment with just two possible outcomes, "success" and "failure", that
occur with probabilities $\theta$ and $1-\theta$ respectively, where $\theta$ is the success probability.
A Bernoulli random variable X has probability function
$$p(x) = \theta^{x}(1-\theta)^{1-x}, \quad x = 0, 1,$$
where $0 \le \theta \le 1$.
This is a basic building block for some familiar but more complex discrete random variables.
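The Bernoulli probability function can be sketched in a few lines of Python. This is only an illustration: the function name `bernoulli_pf` and the argument name `theta` (standing for the success probability) are choices made here, not notation from these notes.

```python
def bernoulli_pf(x, theta):
    """Bernoulli probability function: theta**x * (1 - theta)**(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie between 0 and 1")
    return theta ** x * (1.0 - theta) ** (1 - x)


# A fair coin gives equal probabilities; theta = 0.3 is an arbitrary illustrative value.
fair = [bernoulli_pf(x, 0.5) for x in (0, 1)]
biased = [bernoulli_pf(x, 0.3) for x in (0, 1)]
print(fair, biased)
```

Whatever the value of theta, the two probabilities sum to one, as they must for a probability function.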
[see 1024 P10]
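$$p(x) = \binom{n}{x}\theta^{x}(1-\theta)^{n-x}, \quad x = 0, 1, \ldots, n,$$
where $0 \le \theta \le 1$.
We will often say in such circumstances "X is Binomially distributed" or "X is Binomial(n, $\theta$)" or
"X ~ Binomial(n, $\theta$)".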
[see 1024 P10]
1.6.2 Negative Binomial random variables (including Geometric random variables)
Suppose we undertake a sequence of independent Bernoulli trials, each with success probability $\theta$.
Let X be the number of failures that occur before the kth success. Then X is called a Negative
Binomial random variable with probability function
$$p(x) = \frac{(k+x-1)!}{(k-1)!\,x!}\,(1-\theta)^{x}\,\theta^{k}, \quad x = 0, 1, 2, \ldots,$$
where $0 < \theta \le 1$.
In the special case with k = 1, X is called a Geometric random variable with probability function
$$p(x) = \theta\,(1-\theta)^{x}, \quad x = 0, 1, 2, \ldots.$$
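As a quick numerical check of the Negative Binomial probability function, the sketch below evaluates it in Python, confirms that the probabilities sum to (very nearly) one, and confirms that k = 1 reproduces the Geometric case. The names `neg_binomial_pf` and `geometric_pf`, the symbol `theta` for the success probability, and the values k = 3 and theta = 0.4 are illustrative assumptions.

```python
from math import comb


def neg_binomial_pf(x, k, theta):
    """P(X = x): x failures before the k-th success, success probability theta."""
    # (k + x - 1)! / ((k - 1)! x!) is the binomial coefficient C(k + x - 1, x).
    return comb(k + x - 1, x) * (1.0 - theta) ** x * theta ** k


def geometric_pf(x, theta):
    """Special case k = 1: x failures before the first success."""
    return theta * (1.0 - theta) ** x


k, theta = 3, 0.4  # illustrative values only
total = sum(neg_binomial_pf(x, k, theta) for x in range(500))  # truncated infinite sum
print(total)
print(neg_binomial_pf(2, 1, theta), geometric_pf(2, theta))
```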
$$p(y) = \frac{e^{-\lambda t}(\lambda t)^{y}}{y!}, \quad y = 0, 1, \ldots,$$
where $\lambda > 0$.
This result was obtained in detail in MATH1024.
Of course, by defining the time unit appropriately one can take t = 1, giving probability function
$$p(y) = \frac{e^{-\lambda}\lambda^{y}}{y!}, \quad y = 0, 1, \ldots,$$
where $\lambda > 0$. This is a useful form of the probability function if one is aiming to model count data
more generally, particularly when time is not the main focus.
[see 1024 P13]
So for large n and small $\theta$, Binomial probabilities can be approximated by Poisson probabilities.
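The quality of this approximation can be explored numerically. The sketch below compares Binomial probabilities with Poisson probabilities whose mean is n times the success probability. The values n = 1000 and theta = 0.002, and the names `binomial_pf`, `poisson_pf` and `lam`, are illustrative assumptions rather than anything taken from these notes.

```python
from math import comb, exp, factorial


def binomial_pf(x, n, theta):
    """Binomial(n, theta) probability of exactly x successes."""
    return comb(n, x) * theta ** x * (1.0 - theta) ** (n - x)


def poisson_pf(y, lam):
    """Poisson probability of exactly y events when the mean is lam."""
    return exp(-lam) * lam ** y / factorial(y)


n, theta = 1000, 0.002   # large n, small theta (illustrative values)
lam = n * theta          # matching Poisson mean

for x in range(6):
    print(x, round(binomial_pf(x, n, theta), 5), round(poisson_pf(x, lam), 5))

max_abs_diff = max(abs(binomial_pf(x, n, theta) - poisson_pf(x, lam)) for x in range(11))
```

Repeating the comparison with a smaller n and a larger theta (keeping their product fixed) shows the agreement getting worse.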
1.9 A practical illustration of using discrete random variables
Though this module is concerned primarily with obtaining theoretical results about random
variables, it is worth remembering that the results are often useful in applications. Here is a
simple illustration in statistical modelling.
Suppose that we have data consisting of the number of oil-producing wells in a region of Texas
(data given in Davis (1986)). The data show the locations of oil-field discovery wells in part of the
Eastern Shelf area of the Permian Basin, Fisher and Noland counties, Texas. One question that
could be asked of these data is "Are the oil wells occurring at random in this region, or is there some
pattern to their distribution?"
We may investigate this by defining a suitable random variable and investigating its distribution to
see whether a Poisson distribution might be appropriate. If so, then this would support the view that
the wells are occurring at random. Suppose that we define selected areas according to the grid of
squares in the picture below (these squares are called quadrats), and count the number of wells in
each area. Then the pictorial representation has been transformed into data consisting of counts of
wells over the grid. The data are discrete, in that only integer values are obtainable.
[Figure omitted: map of well locations overlaid with a 10 by 16 grid of quadrats.]
FIGURE 1. Locations of oil field discovery wells in part of the Eastern Shelf
area of the Permian Basin, Fisher and Noland counties, Texas. Quadrats are
approximately 10 square miles in size.
Suppose we count the number of wells in each quadrat. The data set produced is shown below,
from which we might ask the following questions.
0 1 2 2 1 2 0 0 0 0
0 0 2 0 3 3 1 1 0 0
0 3 0 0 3 5 2 1 1 0
2 3 1 0 1 4 1 1 0 1
3 2 2 0 0 1 2 1 2 0
3 0 0 3 1 0 2 1 0 0
2 0 0 0 0 3 1 1 1 0
4 0 0 1 0 0 4 2 1 1
2 0 1 0 0 0 0 1 2 1
1 0 0 1 0 0 2 2 1 0
0 2 2 0 0 1 0 0 0 1
0 1 0 0 1 1 1 0 2 2
1 1 3 4 2 1 3 0 1 0
0 3 3 3 1 0 0 0 0 1
2 2 3 1 1 0 0 0 0 0
6 3 1 2 2 1 0 0 0 0
The first step is to summarise these data and extract answers to some or all of these questions.
If we count the number of 0s, 1s, etc. and arrange the results in a table, the following is produced:
This table shows the frequency of the number of wells in a quadrat, the cumulative frequency, the
proportion and the cumulative proportion. We can see that a proportion 0.4313 of the quadrats
contain no wells. Multiplying these proportions by 100 gives percentage frequencies which show
the percentage of the sample which has 0, 1, etc wells.
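The summary described above can be reproduced with a short Python sketch. The counts are copied from the data set shown earlier in this section; the variable names (`rows`, `counts`, `table`, and so on) are illustrative.

```python
from collections import Counter

# Quadrat counts, one row of the grid per line, as listed in the data set above.
rows = """
0 1 2 2 1 2 0 0 0 0
0 0 2 0 3 3 1 1 0 0
0 3 0 0 3 5 2 1 1 0
2 3 1 0 1 4 1 1 0 1
3 2 2 0 0 1 2 1 2 0
3 0 0 3 1 0 2 1 0 0
2 0 0 0 0 3 1 1 1 0
4 0 0 1 0 0 4 2 1 1
2 0 1 0 0 0 0 1 2 1
1 0 0 1 0 0 2 2 1 0
0 2 2 0 0 1 0 0 0 1
0 1 0 0 1 1 1 0 2 2
1 1 3 4 2 1 3 0 1 0
0 3 3 3 1 0 0 0 0 1
2 2 3 1 1 0 0 0 0 0
6 3 1 2 2 1 0 0 0 0
"""

counts = [int(value) for value in rows.split()]
n_quadrats = len(counts)
freq = Counter(counts)

# Build (wells, frequency, cumulative frequency, proportion, cumulative proportion).
table = []
cumulative = 0
for wells in range(max(counts) + 1):
    cumulative += freq[wells]
    table.append((wells, freq[wells], cumulative,
                  freq[wells] / n_quadrats, cumulative / n_quadrats))

for wells, f, cf, p, cp in table:
    print(f"{wells:>5} {f:>5} {cf:>5} {p:8.4f} {cp:8.4f}")

sample_mean = sum(counts) / n_quadrats
print("sample mean:", sample_mean)
```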
To obtain a diagram illustrating these data the frequencies or relative frequencies may be plotted
against the count number as in the diagram below.
[Figure omitted: bar chart of frequency (left axis) and relative frequency (right axis) against
number of wells, 0 to 6.]
Figure 2 Frequency diagram for the discrete well data
The average number of wells per quadrat is found by adding the 160 observations and dividing by
the sample size (i.e. number of quadrats = 160). The observations sum to 170, so the sample mean
is 170/160, or approximately 1.06.
Does the Poisson distribution provide a good explanation for these data?
Notice that if t is the length of the interval considered (in this case the area of a quadrat), and $\lambda$ is
the rate of occurrence per unit area, then we would expect on average an interval to contain $\mu = \lambda t$
events. This value is the mean of the random variable of interest and is the parameter in the
Poisson probability function which must be determined in order to try to answer the question. Since
we do not know the true mean $\mu$ or the true rate $\lambda$ of the occurrence of the oil wells, we shall need
to use the sample mean 1.06 as an estimate of $\mu$.
If the Poisson distribution fits the data, the relative frequencies will be close to the
corresponding Poisson probabilities with parameter 1.06.
The expected frequencies for this model may be found using $160 \times p(x)$, for x = 0, 1, 2, ..., giving:
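A minimal Python sketch of this calculation is given here. It assumes the observed frequencies 69, 43, 26, 16, 4, 1, 1 tabulated later in this section and the sample mean rounded to 1.06 as in the text; the names `poisson_pf`, `observed` and `expected` are illustrative.

```python
from math import exp, factorial


def poisson_pf(x, mean):
    """Poisson probability of exactly x events for the given mean."""
    return exp(-mean) * mean ** x / factorial(x)


observed = [69, 43, 26, 16, 4, 1, 1]   # quadrats containing 0, 1, ..., 6 wells
n_quadrats = sum(observed)             # 160
mean = 1.06                            # sample mean used as the Poisson parameter

expected = [n_quadrats * poisson_pf(x, mean) for x in range(len(observed))]

for x, (obs, exp_freq) in enumerate(zip(observed, expected)):
    print(f"{x:>2} wells: observed {obs:>3}, Poisson expected {exp_freq:6.1f}")
```

Comparing the two columns side by side is an informal check of fit; a formal goodness-of-fit test is a separate matter not attempted in this sketch.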
There are considerable differences between the observed and expected frequencies: there are too many
zero observations and too many quadrats with larger numbers of wells relative to what the Poisson
assumptions predict (these assumptions depend on events, i.e. the presence of a well, being randomly
distributed in two-dimensional space). The problem here appears to be that the wells are clustering
together too much for the Poisson model, which depends on random occurrence, to be valid. So our
Poisson model appears not to be suitable in this case. In fact, a version of the Negative Binomial
model (see section 1.6.2) turns out to fit the data very well. The details are beyond what we have done
so far (they require results that you will meet in the second half of the module) but, purely to
illustrate that a good model for these data can be found, here are the corresponding results assuming
a Negative Binomial model.
Number of wells   Observed frequency   Relative frequency   Neg. Bin. probability   Expected frequency
0                 69                   0.4313               0.4124                  66.0
1                 43                   0.2687               0.3112                  49.8
2                 26                   0.1625               0.1611                  25.8
3                 16                   0.1000               0.0706                  11.3
4                 4                    0.0250               0.0281                  4.5
5                 1                    0.0062               0.0106                  1.7
6                 1                    0.0062               0.0038                  0.6
The fit here is seen to be very much better, indicating that these oil-producing wells tend to cluster.
The Negative Binomial is thus seen to be a more appropriate model than the Poisson in explaining
this data set.
Final remarks
In this chapter and in subsequent chapters, where material appeared in MATH1024 lectures I refer
you back to the relevant notes from 2015-16 using the notation "1024 Px" or "1024 Sx" to mean
Probability Lecture x or Statistics Lecture x from the MATH1024 notes.
Where relevant I also give information where the material can be found in Mood, Graybill and Boes
(MGB), though the notation in MGB is not always the same as in these notes.
For this chapter the MGB reference is:
MGB Chapter II section 3.1 and Chapter III sections 2.2, 2.4 and 2.5.
MATH2011 Statistical Distribution Theory
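If for a random variable Y there exists a function f(y) such that $F(y) = \int_{-\infty}^{y} f(u)\,du$, then f(y) is called
the probability density function (pdf) of Y.
b) $\int_{-\infty}^{\infty} f(y)\,dy = 1$.
c) $\int_{a}^{b} f(y)\,dy = F(b) - F(a) = P(a < Y \le b)$.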
General relationship between the density function and the distribution function
We may find f(y) from F(y) by noting that
$$f(y) = \frac{dF(y)}{dy}$$
or, if we have f(y), we may find
$$F(y) = \int_{-\infty}^{y} f(u)\,du.$$
Typically, in many situations of practical interest, the pdf is more convenient to use than the cdf.
[1024 P17]
$$P(100 \le Y \le 200) = \int_{100}^{200} 0.01\,e^{-0.01y}\,dy$$
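This integral can be evaluated in closed form through the distribution function, or numerically from the density. The Python sketch below does both as a cross-check. The function names and the midpoint rule with 10,000 steps are illustrative choices, and the density is taken to be 0.01 exp(-0.01 y) for y > 0, as in the integrand above.

```python
from math import exp

RATE = 0.01  # coefficient appearing in the integrand above


def pdf(y):
    """Density 0.01 * exp(-0.01 * y), y > 0."""
    return RATE * exp(-RATE * y)


def cdf(y):
    """Distribution function obtained by integrating the density from 0 to y."""
    return 1.0 - exp(-RATE * y)


def midpoint_integral(f, a, b, steps=10_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


prob_closed = cdf(200) - cdf(100)              # F(200) - F(100)
prob_numeric = midpoint_integral(pdf, 100, 200)
print(prob_closed, prob_numeric)
```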
2.5 Normal random variables
The most important continuous probability model is the Normal distribution. Two examples of
Normal distributions superimposed on observed sets of data are given below. The first example
shows a frequency diagram of heights of young adult males while the second involves diastolic
blood pressures of schoolboys.
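[Figure omitted: frequency diagram of height (in) of young adult males, with a Normal curve superimposed.]
[Figure omitted: frequency diagram of diastolic blood pressure (mm Hg) of schoolboys, with a Normal
curve superimposed.]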
Both distributions are approximately symmetrical about their central values and they exhibit a
similar shape, even though the units of the measurements are very different.
The observed frequencies have been approximated by a smooth curve which in each case is based
on a Normal probability distribution model with appropriately chosen mean and standard deviation
(see later).
Normal random variables are ubiquitous in theoretical statistics and in application areas. One reason
for this is the central limit theorem (which you met last year and which we shall prove later in this
module).
[1024 P19-20]
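$$f(y) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{(y-\mu)^{2}}{2\sigma^{2}}\right\}, \quad -\infty < y < \infty$$
(where exp(z) is a convenient way of writing the exponential function $e^{z}$). Note that $\mu$ and $\sigma > 0$
are parameters. If Y is a random variable with the above pdf, we will write $Y \sim N(\mu, \sigma^{2})$.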
The curve is shown in Figure 3. On the horizontal axis of this figure are marked the positions of
the mean $\mu$ and the values of y that differ from $\mu$ by $\pm\sigma$, $\pm 2\sigma$ and $\pm 3\sigma$. The symmetry of the curve
about $\mu$ is evident from the mathematical model, since changing the sign of $y - \mu$ leaves f(y) unchanged.
The figure shows that a relatively small proportion of the area under the curve lies outside the two
values $y = \mu - 2\sigma$ and $y = \mu + 2\sigma$. The vertical scale is arranged so that the area under the curve is
equal to one. This implies that the area between any two points on the horizontal axis represents
the probability that the variable takes a value between these two points. For example, the
probability that the variable takes a value in the interval $y = \mu - 2\sigma$ up to $y = \mu + 2\sigma$ is very nearly
0.95 and the probability that Y lies outside this range is correspondingly approximately 0.05. It is
important to be able to find the area under any part of a Normal pdf.
[Figure omitted: Normal probability density curve with two horizontal scales, the original variable y
marked at $\mu - 3\sigma$, ..., $\mu + 3\sigma$ and the standardised variable z marked at -3, ..., 3.]
Figure 3 The probability density function of a Normal random variable showing the scales of
the original variable and the standardised variable.
Now f(y) depends on two parameters, the mean $\mu$ and standard deviation $\sigma$. It might be thought
therefore that any relevant probabilities would have to be worked out separately for every pair of
values $\mu$, $\sigma$. Fortunately this is not so. We have seen that the probability that Y lies in the interval
$\mu - 2\sigma$ up to $\mu + 2\sigma$ is about 0.95, which is true without specifying the values of $\mu$ and $\sigma$. In fact
the probabilities depend on an expression of the departure of y from $\mu$ as a multiple of $\sigma$. The
statement above is equivalent to saying that there is a probability of approximately 0.95 that Y lies
within two standard deviations of the mean. On the diagram these multiples are marked on the axis
as $\pm 1$, $\pm 2$ and $\pm 3$ as shown on the lower scale. The probabilities under various parts of any Normal
pdf may be expressed in terms of the standardised deviate
$$z = \frac{y - \mu}{\sigma}.$$
A few important results are given in the table below. More detailed tables of Normal
probabilities are available (search online for "standard normal tables").
Some probabilities associated with Normal random variables
Standardised deviate      Probability of greater deviation
$z = (y - \mu)/\sigma$    In either direction    In one direction
0.0                       1.000                  0.500
1.0                       0.317                  0.159
2.0                       0.046                  0.023
3.0                       0.0027                 0.0013
This table shows probabilities of obtaining a standardised deviate $z = (y - \mu)/\sigma$ more extreme (in
either direction or in one direction) than the tabulated value. For example, for z = 2.0 the
probability of obtaining a value of $(y - \mu)/\sigma$ outside $\pm 2.0$ is 0.046, while the probability of $(y - \mu)/\sigma$
being greater than 2.0 is 0.023. (By symmetry, the probability that $(y - \mu)/\sigma$ is less than $-2.0$ is
also 0.023.) The figure below illustrates these probabilities.
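The tabulated tail probabilities can be reproduced with the complementary error function in Python's standard library, using the identity P(Z > z) = erfc(z / sqrt(2)) / 2 for a standard Normal Z. The function name `upper_tail` is an illustrative choice.

```python
from math import erfc, sqrt


def upper_tail(z):
    """P(Z > z) for a standard Normal random variable Z."""
    return 0.5 * erfc(z / sqrt(2.0))


# (z, probability in either direction, probability in one direction)
tail_table = [(z, 2.0 * upper_tail(z), upper_tail(z)) for z in (0.0, 1.0, 2.0, 3.0)]

for z, either, one in tail_table:
    print(f"{z:3.1f}  {either:7.4f}  {one:7.4f}")
```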
[Figure omitted: standard Normal density plotted against z from -3 to 3, with a tail area of 0.023
marked below z = -2 and above z = 2.]
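The usual tabulation of Normal probabilities is in the form of the cumulative probability that
$z = (y - \mu)/\sigma$ is less than the tabulated value. This may be used for any Normally distributed
random variable $Y \sim N(\mu, \sigma^{2})$ because
$$P(Y \le y) = F(y) = \int_{-\infty}^{y} \frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{(u-\mu)^{2}}{2\sigma^{2}}\right\}du
= \int_{-\infty}^{(y-\mu)/\sigma} \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{z^{2}}{2}\right)dz
= \Phi\left(\frac{y-\mu}{\sigma}\right) = P\left(Z \le \frac{y-\mu}{\sigma}\right),$$
where $\Phi$ denotes the cdf of the standard Normal random variable $Z \sim N(0, 1)$.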
[1024 P19-20]
2.7 Example of Normal probability calculations
Suppose that daily water use at a factory varies about a mean of 15,500 gallons with standard
deviation 1,140 gallons. If demand is Normally distributed:
(i) What proportion of days does the demand fall short of 14,000 gallons?
(ii) What proportion of days does demand exceed 18,000 gallons?
(iii) What is your reaction to a demand of 35,000 gallons?
In each case we first need to calculate the standard Normal deviate, $z = (y - \mu)/\sigma$. Using the table
of the Normal distribution function, and using the symmetry property where necessary, we have
(i) z = (14,000 − 15,500)/1,140 = −1.32.
From tables, the upper tail probability for z = 1.32 is 0.0934, and the lower tail
probability for z = −1.32 will be identical.
Thus 9.34% of daily demands fall short of 14,000 gallons.
(ii) z = (18,000 − 15,500)/1,140 = 2.19, with upper tail probability 0.01426, i.e. about
1% of daily demands exceed 18,000 gallons.
(iii) z = (35,000 − 15,500)/1,140 = 17.11. This lies beyond the range of the tables, but the
tail probability is less than one in a billion. One would be surprised and an
explanation may be sought. It is possible that a mis-recording error has occurred, such
as two days' data being taken together. This idea of "surprise" at an extreme result of
low probability, as predicted by a statistical model, will be important later in this
module and also in modules such as MATH2010 Statistical Methods I.
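The three calculations can also be carried out without tables. The sketch below uses `math.erfc` from Python's standard library; the function names are illustrative. Because it uses the unrounded value of z, the answers differ slightly in the third or fourth decimal place from those read from tables at z = 1.32 and z = 2.19.

```python
from math import erfc, sqrt

MEAN, SD = 15_500.0, 1_140.0  # daily water use in gallons, from the example


def lower_tail(y, mu, sigma):
    """P(Y <= y) for Y ~ N(mu, sigma^2)."""
    z = (y - mu) / sigma
    return 0.5 * erfc(-z / sqrt(2.0))


def upper_tail(y, mu, sigma):
    """P(Y > y) for Y ~ N(mu, sigma^2)."""
    z = (y - mu) / sigma
    return 0.5 * erfc(z / sqrt(2.0))


p_short = lower_tail(14_000, MEAN, SD)     # (i) demand below 14,000 gallons
p_exceed = upper_tail(18_000, MEAN, SD)    # (ii) demand above 18,000 gallons
p_extreme = upper_tail(35_000, MEAN, SD)   # (iii) demand above 35,000 gallons

print(f"(i)   {p_short:.4f}")
print(f"(ii)  {p_exceed:.4f}")
print(f"(iii) {p_extreme:.3e}")
```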
Frequency distributions resembling the Normal pdf in shape are often observed but this form should
not be taken as the norm - despite the use of the name 'Normal'. Many observed distributions are
undeniably far from 'Normal' in shape yet should not be regarded as abnormal in any way.
The importance of Normal random variables lies in the central place that they occupy in sampling
theory, which we shall discuss later. Many of the usual estimation and testing procedures require
that the Normal model for the behaviour of the measurement is reasonably valid.
2.8 The use of a Normal approximation to Binomial probabilities
We have seen in 1.6.1 that the Binomial model is appropriate when considering the number of
successes in independent Bernoulli trials. However, Binomial probabilities can often be
approximated by Normal probabilities when n, the number of trials, is large.
Suppose that we have a Binomial situation, i.e. n trials of a dichotomous random variable (success
or failure) with constant probability $\theta$ of success. The probability that the number of successes Y is r is
given by the Binomial probabilities
$$P(Y = r) = \frac{n!}{r!\,(n-r)!}\,\theta^{r}(1-\theta)^{n-r}, \quad r = 0, 1, \ldots, n.$$
Therefore the probability that Y takes a value between $r_1$ and $r_2$ (inclusive) is given by
$$P(r_1 \le Y \le r_2) = \sum_{r=r_1}^{r_2} \frac{n!}{r!\,(n-r)!}\,\theta^{r}(1-\theta)^{n-r}.$$
The Binomial mean $\mu = n\theta$ and variance $\sigma^{2} = n\theta(1-\theta)$ may be used here (see Chapter 4 for
details).
If n is sufficiently large, the observed number of individuals Y (with the particular characteristic,
"success") in the sample of size n is approximately Normally distributed with mean value $\mu = n\theta$ and
standard deviation $\sigma = \sqrt{n\theta(1-\theta)}$.
Account should be taken of the fact that Binomial random variables are discrete, while the Normal
is continuous. Slightly more accurate approximations are provided if a continuity correction is
used. Thus the probability that the sample will contain between $r_1$ and $r_2$ individuals with the
characteristic of interest is approximately given by the standard Normal probability from
$$z_1 = \frac{r_1 - 0.5 - n\theta}{\sqrt{n\theta(1-\theta)}} \quad \text{to} \quad z_2 = \frac{r_2 + 0.5 - n\theta}{\sqrt{n\theta(1-\theta)}}.$$
[1024 P21]
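To see how the continuity-corrected approximation behaves, the Python sketch below compares it with the exact Binomial sum. The values n = 100, success probability 0.3, and the range 25 to 35 are arbitrary illustrative choices, as are the function and variable names.

```python
from math import comb, erfc, sqrt


def binomial_pf(r, n, theta):
    """Exact Binomial probability of r successes in n trials."""
    return comb(n, r) * theta ** r * (1.0 - theta) ** (n - r)


def std_normal_cdf(z):
    """P(Z <= z) for a standard Normal random variable Z."""
    return 0.5 * erfc(-z / sqrt(2.0))


n, theta = 100, 0.3   # illustrative values only
r1, r2 = 25, 35

exact = sum(binomial_pf(r, n, theta) for r in range(r1, r2 + 1))

mean = n * theta
sd = sqrt(n * theta * (1.0 - theta))
z1 = (r1 - 0.5 - mean) / sd   # continuity correction at the lower end
z2 = (r2 + 0.5 - mean) / sd   # continuity correction at the upper end
approx = std_normal_cdf(z2) - std_normal_cdf(z1)

print(f"exact {exact:.4f}, Normal approximation {approx:.4f}")
```

Dropping the two 0.5 terms and rerunning gives a feel for how much the continuity correction contributes in a particular case.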
Perhaps the second most commonly occurring distribution in scientific investigations is the
Lognormal. The random variable Y is said to be Lognormal if X = log Y is a Normal random
variable. (Note that all logarithms are assumed to be to base e in this module.)
If $X \sim N(\mu, \sigma^{2})$, the pdf of Y is
$$f(y) = \frac{1}{y\,\sigma\sqrt{2\pi}}\exp\left\{-\frac{(\log y-\mu)^{2}}{2\sigma^{2}}\right\}, \quad y > 0.$$
Note that, as in the Normal case, the Lognormal model is a two-parameter model. It has applications
in a variety of fields, such as Economics, where a multiplicative form of the central limit theorem may
apply.
[1024 P additional handout]
Final remarks
In MGB the material in this chapter is covered in Chapter II section 2 and Chapter III sections 3.2,
3.3 and 3.5.
MATH2011 Statistical Distribution Theory
The $y_{(i)}$, $i = 1, \ldots, n$, are called the order statistics corresponding to $y_1, \ldots, y_n$.
You have already met certain order statistics. For example, the sample median is an order
statistic: for odd values of n the sample median is equal to $y_{((n+1)/2)}$, while for even n the
sample median is defined as $\tfrac{1}{2}\{y_{(n/2)} + y_{(n/2+1)}\}$.
We shall concentrate, however, on two particular order statistics: $y_{(1)}$, the sample
minimum, and $y_{(n)}$, the sample maximum. We define the corresponding random variables:
$$Y_{(1)} = \min\{Y_1, \ldots, Y_n\}, \quad Y_{(n)} = \max\{Y_1, \ldots, Y_n\}.$$
6.2 Applications where maxima and/or minima are of interest
In reliability engineering a system will tend to fail at its weakest point (which might
be thought of as the point with the minimum strength).
In designing coastal defences one needs to understand the distribution of the wave
heights of the highest tides.
There is a whole area of statistics devoted to the study of extremes (Extreme value
theory). In this short chapter we just give a brief introduction to the subject.
6.3 The cdf of Y(n), the largest value in a random sample of size n.
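Since $Y_{(n)} = \max\{Y_1, \ldots, Y_n\}$, the probability that $Y_{(n)} \le y$ gives the cumulative distribution
function of $Y_{(n)}$, the sample maximum.
Now the event $\{Y_{(n)} \le y\}$ is identical to the event $\{Y_1 \le y$ and $Y_2 \le y$ and ... and $Y_n \le y\}$.
So
$$G_n(y) = P(Y_{(n)} \le y) = P(\text{all } Y_i \le y) = P(Y_1 \le y \text{ and } Y_2 \le y \text{ and } \ldots \text{ and } Y_n \le y).$$
Thus, by independence,
$$G_n(y) = P(Y_1 \le y)\,P(Y_2 \le y) \cdots P(Y_n \le y) = [F(y)]^{n},$$
where F is the common cdf of the $Y_i$.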
6.4 Example: a simple discrete experiment
Suppose I roll a fair die twice. What is the probability function of the maximum of the two
scores?
We know that F(y) = y/6 for y = 1, 2, 3, 4, 5, 6. Also, n = 2 in this example.
So the distribution function of the higher score is:
$$G_2(y) = (y/6)^{2}, \quad y = 1, 2, \ldots, 6.$$
Hence
$$P(Y_{(2)} = y) = (y/6)^{2} - [(y-1)/6]^{2} = \frac{2y-1}{36}, \quad y = 1, 2, \ldots, 6.$$
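This result can be checked by brute force. The Python sketch below enumerates all 36 equally likely outcomes of two rolls and compares the resulting probability function of the maximum with the formula above, using exact fractions. The function and variable names are illustrative.

```python
from fractions import Fraction
from itertools import product


def max_pf_formula(y, n=2, sides=6):
    """P(maximum = y) from the cdf: (y/sides)^n - ((y - 1)/sides)^n."""
    return Fraction(y, sides) ** n - Fraction(y - 1, sides) ** n


# Enumerate every ordered pair of scores and tabulate the larger one.
outcomes = list(product(range(1, 7), repeat=2))
max_pf_enumerated = {
    y: Fraction(sum(1 for pair in outcomes if max(pair) == y), len(outcomes))
    for y in range(1, 7)
}

for y in range(1, 7):
    print(y, max_pf_formula(y), max_pf_enumerated[y])
```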
6.5 The pdf of the maximum in the continuous case
If the $Y_i$ are continuous, each with density function f, then the density function of $Y_{(n)}$ may
be found by differentiating $G_n(y)$ with respect to y to give:
$$g_n(y) = \frac{d}{dy}[F(y)]^{n} = n[F(y)]^{n-1} f(y),$$
where the domain of the maximum is the same as that of each of the $Y_i$.
6.6 Example: the maximum of a uniform random sample
Note how the probability piles up against the upper end point of the domain of the pdf.
How do you think the expected value and variance of the sample maximum in this case
will change as n increases?
What would happen to Y(n) when the domain of the Yi has no finite upper end point?
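One way to explore the first of these questions is by simulation. The sketch below assumes, purely for illustration, that the sample is drawn from the Uniform distribution on (0, 1), for which the general formula of section 6.5 gives the pdf $n y^{n-1}$ for the maximum, with mean n/(n+1) and variance n/((n+1)^2 (n+2)). The seed, the sample sizes and the number of replications are arbitrary choices.

```python
import random
from statistics import fmean, pvariance

random.seed(6)  # fixed seed so that the sketch is reproducible


def simulate_maximum(n, reps=20_000):
    """Simulated mean and variance of the maximum of n Uniform(0, 1) draws."""
    maxima = [max(random.random() for _ in range(n)) for _ in range(reps)]
    return fmean(maxima), pvariance(maxima)


results = {n: simulate_maximum(n) for n in (2, 5, 20)}

for n, (mean, var) in results.items():
    exact_mean = n / (n + 1)                   # mean of the pdf n * y**(n - 1) on (0, 1)
    exact_var = n / ((n + 1) ** 2 * (n + 2))   # corresponding variance
    print(f"n={n:2d}  mean {mean:.3f} (exact {exact_mean:.3f})  "
          f"variance {var:.4f} (exact {exact_var:.4f})")
```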
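6.7 The cdf of Y(1), the smallest value in a random sample of size n.
Since $Y_{(1)} = \min\{Y_1, \ldots, Y_n\}$, the probability that $Y_{(1)} \le y$ gives the cumulative distribution
function of $Y_{(1)}$, the smallest value in the sample.
Now
$$P(Y_{(1)} \le y) = 1 - P(Y_{(1)} > y) = 1 - P(\text{all } Y_i > y) = 1 - P(Y_1 > y \text{ and } Y_2 > y \text{ and } \ldots \text{ and } Y_n > y),$$
so that, by independence,
$$G_1(y) = 1 - [1 - F(y)]^{n}.$$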
6.8 The pdf of the minimum in the continuous case
If the $Y_i$ are continuous, each with probability density function f, then the pdf of $Y_{(1)}$ may
be found by differentiating $G_1(y)$ with respect to y to give:
$$g_1(y) = n[1 - F(y)]^{n-1} f(y),$$
where the domain of the minimum is the same as that of each of the $Y_i$.
6.9 Example: The distribution of the minimum of an Exponential random sample.
Suppose that $Y_i$, for $i = 1, \ldots, n$, are independent Exponential random variables, each with
probability density function
$$f(y) = \lambda e^{-\lambda y}, \quad 0 < y < \infty.$$
Then
$$g_1(y) = n[1 - F(y)]^{n-1} f(y) \quad \text{for } 0 < y < \infty.$$
Now $F(y) = 1 - e^{-\lambda y}$ so that
$$g_1(y) = n\left[e^{-\lambda y}\right]^{n-1} \lambda e^{-\lambda y} = n\lambda e^{-n\lambda y}, \quad \text{for } 0 < y < \infty.$$
That is, the smallest value in an Exponential random sample of size n
with parameter $\lambda$ (i.e. with mean value $1/\lambda$) is also an Exponential random variable but
with parameter $n\lambda$ (i.e. with mean value $1/(n\lambda)$).
So in this case $Y_{(1)}$ has expected value $1/(n\lambda)$ and variance $1/(n\lambda)^{2}$, which both decrease as
n increases.
Is that a surprise?
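The result can be checked by simulation. The Python sketch below draws repeated Exponential samples, records the minimum of each, and compares the simulated mean and variance of the minimum with the values derived above. The rate 2, the sample size 5, the number of replications and the seed are arbitrary illustrative choices.

```python
import random
from statistics import fmean, pvariance

random.seed(2011)  # fixed seed so that the sketch is reproducible

RATE = 2.0      # Exponential parameter (mean 1 / RATE)
N = 5           # sample size
REPS = 20_000   # number of simulated samples

# Minimum of each simulated sample of N independent Exponential(RATE) values.
minima = [min(random.expovariate(RATE) for _ in range(N)) for _ in range(REPS)]

sim_mean = fmean(minima)
sim_var = pvariance(minima)
theory_mean = 1.0 / (N * RATE)        # mean of an Exponential with parameter N * RATE
theory_var = 1.0 / (N * RATE) ** 2    # corresponding variance

print(f"mean: simulated {sim_mean:.4f}, theory {theory_mean:.4f}")
print(f"variance: simulated {sim_var:.5f}, theory {theory_var:.5f}")
```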
6.10 Closing remarks
We have only given a taster here of an interesting area of statistics.
We have only looked at the marginal behaviour of the minimum and the maximum,
and we have only considered the extreme order statistics. The results can be extended
to include other order statistics.
The closure result in 6.9 hints at some interesting structure in the probabilistic
behaviour of maxima and minima. The central limit theorem basically says that under
certain conditions the sum of n independent, identically distributed random variables
is approximately Normal as n grows large. There are corresponding results for
maxima and minima, though the large-n distribution is not Normal (it is the so-called
generalised extreme value distribution).
MGB go into much greater detail on this topic in their Chapter VI section 5.