
LECTURE NOTES ON

STATISTICAL INFERENCE
KRZYSZTOF PODGÓRSKI
Department of Mathematics and Statistics
University of Limerick, Ireland
November 23, 2009
Contents
1 Introduction 4
1.1 Models of Randomness and Statistical Inference . . . . . . . . . . . . 4
1.2 Motivating Example . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.1 Probability vs. likelihood . . . . . . . . . . . . . . . . . . . . 8
1.2.2 More data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3 Likelihood and theory of statistics . . . . . . . . . . . . . . . . . . . 15
1.4 Computationally intensive methods of statistics . . . . . . . . . . . . 15
1.4.1 Monte Carlo methods: studying statistical methods using computer generated random samples . . . . . . . . . . 16
1.4.2 Bootstrap: performing statistical inference using computers . . . . . . . . . . 18
2 Review of Probability 21
2.1 Expectation and Variance . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Distribution of a Function of a Random Variable . . . . . . . . . . . . 22
2.3 Transform Methods: Characteristic, Probability Generating and Moment Generating Functions . . . . . . . . . . 24
2.4 Random Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.1 Sums of Independent Random Variables . . . . . . . . . . . . 26
2.4.2 Covariance and Correlation . . . . . . . . . . . . . . . . . . 27
2.4.3 The Bivariate Change of Variables Formula . . . . . . . . . . 28
2.5 Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . . 29
2.5.1 Bernoulli Distribution . . . . . . . . . . . . . . . . . . . . . 29
2.5.2 Binomial Distribution . . . . . . . . . . . . . . . . . . . . . 29
2.5.3 Negative Binomial and Geometric Distribution . . . . . . . . 30
2.5.4 Hypergeometric Distribution . . . . . . . . . . . . . . . . . 31
2.5.5 Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . 32
2.5.6 Discrete Uniform Distribution . . . . . . . . . . . . . . . . . 33
2.5.7 The Multinomial Distribution . . . . . . . . . . . . . . . . . 33
2.6 Continuous Random Variables . . . . . . . . . . . . . . . . . . . . . 34
2.6.1 Uniform Distribution . . . . . . . . . . . . . . . . . . . . . . 34
2.6.2 Exponential Distribution . . . . . . . . . . . . . . . . . . . . 35
2.6.3 Gamma Distribution . . . . . . . . . . . . . . . . . . . . . . 35
2.6.4 Gaussian (Normal) Distribution . . . . . . . . . . . . . . . . 36
2.6.5 Weibull Distribution . . . . . . . . . . . . . . . . . . . . . . 38
2.6.6 Beta Distribution . . . . . . . . . . . . . . . . . . . . . . . . 38
2.6.7 Chi-square Distribution . . . . . . . . . . . . . . . . . . . . . 39
2.6.8 The Bivariate Normal Distribution . . . . . . . . . . . . . . . 39
2.6.9 The Multivariate Normal Distribution . . . . . . . . . . . . . 40
2.7 Distributions: further properties . . . . . . . . . . . . . . . . . . . 42
2.7.1 Sum of Independent Random Variables: special cases . . . . 42
2.7.2 Common Distributions: Summarizing Tables . . . . . . . . 45
3 Likelihood 48
3.1 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . 48
3.2 Multi-parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . 55
3.3 The Invariance Principle . . . . . . . . . . . . . . . . . . . . . . . . 59
4 Estimation 61
4.1 General properties of estimators . . . . . . . . . . . . . . . . . . . . 61
4.2 Minimum-Variance Unbiased Estimation . . . . . . . . . . . . . . . . 64
4.3 Optimality Properties of the MLE . . . . . . . . . . . . . . . . . . . 69
5 The Theory of Confidence Intervals 71
5.1 Exact Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . 71
5.2 Pivotal Quantities for Use with Normal Data . . . . . . . . . . . . . . 75
5.3 Approximate Confidence Intervals . . . . . . . . . . . . . . . . . . . 80
6 The Theory of Hypothesis Testing 87
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.2 Hypothesis Testing for Normal Data . . . . . . . . . . . . . . . . . . 92
6.3 Generally Applicable Test Procedures . . . . . . . . . . . . . . . . . 97
6.4 The Neyman-Pearson Lemma . . . . . . . . . . . . . . . . . . . . . 101
6.5 Goodness of Fit Tests . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.6 The χ² Test for Contingency Tables . . . . . . . . . . . . . . . . . . 109
Chapter 1
Introduction
Everything existing in the universe is the fruit of chance.
Democritus, the 5th Century BC
1.1 Models of Randomness and Statistical Inference
Statistics is a discipline that provides a methodology for making inference from real random data about the parameters of the probabilistic models that are believed to generate such data. The position of statistics in relation to real world data and the corresponding mathematical models of probability theory is presented in the following diagram.
The following is a list of a few of the many phenomena to which randomness is attributed.
- Games of chance
  - Tossing a coin
  - Rolling a die
  - Playing Poker
- Natural Sciences
[Figure: a diagram connecting the Real World (Random Phenomena, Data Samples) and Science & Mathematics (Probability Theory, Models, Statistics), linked by Statistical Inference and leading to Prediction and Discovery.]
Figure 1.1: Position of statistics in the context of real world phenomena and mathematical models representing them.
  - Physics (notably Quantum Physics)
  - Genetics
  - Climate
- Engineering
  - Risk and safety analysis
  - Ocean engineering
- Economics and Social Sciences
  - Currency exchange rates
  - Stock market fluctuations
  - Insurance claims
  - Polls and election results
- etc.
1.2 Motivating Example
Let X denote the number of particles that will be emitted from a radioactive source
in the next one minute period. We know that X will turn out to be equal to one of
the non-negative integers but, apart from that, we know nothing about which of the
possible values are more or less likely to occur. The quantity X is said to be a random
variable.
Suppose we are told that the random variable X has a Poisson distribution with parameter θ = 2. Then, if x is some non-negative integer, we know that the probability that the random variable X takes the value x is given by the formula

P(X = x) = θ^x exp(-θ) / x!

where θ = 2. So, for instance, the probability that X takes the value x = 4 is

P(X = 4) = 2^4 exp(-2) / 4! = 0.0902 .
We have here a probability model for the random variable X. Note that we are using
upper case letters for random variables and lower case letters for the values taken by
random variables. We shall persist with this convention throughout the course.
Let us still assume that the random variable X has a Poisson distribution with parameter θ, but where θ is some unspecified positive number. Then, if x is some non-negative integer, we know that the probability that the random variable X takes the value x is given by the formula

P(X = x|θ) = θ^x exp(-θ) / x!,   (1.1)

for θ ∈ R⁺. However, we cannot calculate probabilities such as the probability that X takes the value x = 4 without knowing the value of θ.
Suppose that, in order to learn something about the value of θ, we decide to measure the value of X for each of the next 5 one minute time periods. Let us use the notation X_1 to denote the number of particles emitted in the first period, X_2 to denote the number emitted in the second period, and so forth. We shall end up with data consisting of a random vector X = (X_1, X_2, ..., X_5). Consider x = (x_1, x_2, x_3, x_4, x_5) = (2, 1, 0, 3, 4). Then x is a possible value for the random vector X. We know that the probability that X_1 takes the value x_1 = 2 is given by the formula

P(X_1 = 2|θ) = θ^2 exp(-θ) / 2!

and similarly that the probability that X_2 takes the value x_2 = 1 is given by

P(X_2 = 1|θ) = θ exp(-θ) / 1!

and so on. However, what about the probability that X takes the value x? In order for this probability to be specified we need to know something about the joint distribution of the random variables X_1, X_2, ..., X_5. A simple assumption to make is that the random variables X_1, X_2, ..., X_5 are mutually independent. (Note that this assumption may not be correct, since X_2 may tend to be more similar to X_1 than it would be to X_5.) However, with this assumption we can say that the probability that X takes the value x
is given by

P(X = x|θ) = ∏_{i=1}^{5} θ^{x_i} exp(-θ) / x_i!
           = [θ^2 exp(-θ)/2!] [θ^1 exp(-θ)/1!] [θ^0 exp(-θ)/0!] [θ^3 exp(-θ)/3!] [θ^4 exp(-θ)/4!]
           = θ^{10} exp(-5θ) / 288 .
In general, if x = (x_1, x_2, x_3, x_4, x_5) is any vector of 5 non-negative integers, then the probability that X takes the value x is given by

P(X = x|θ) = ∏_{i=1}^{5} θ^{x_i} exp(-θ) / x_i! = θ^{∑_{i=1}^{5} x_i} exp(-5θ) / ∏_{i=1}^{5} x_i! .
We have here a probability model for the random vector X.
Our plan is to use the value x of X that we actually observe to learn something about the value of θ. The ways and means to accomplish this task make up the subject matter of this course. The central tool for various statistical inference techniques is the likelihood method. Below we present a simple introduction to it using the Poisson model for radioactive decay.
1.2.1 Probability vs. likelihood
In the introduced Poisson model, for a given θ, say θ = 2, we can consider the function p(x) giving the probabilities of observing the values x = 0, 1, 2, .... This function is referred to as the probability mass function. Its graph is presented below.
Such a function can be used when betting on the recorded result of a future experiment: if one wants to optimize the chances of correctly predicting the future, the choice of the number of recorded particles should be either 1 or 2.
So far, we have been told that the random variable X has a Poisson distribution with parameter θ, where θ is some positive number, and there are physical reasons to assume that such a model is correct.
Figure 1.2: Probability mass function for the Poisson model with θ = 2. (Horizontal axis: number of particles; vertical axis: probability.)
However, we have arbitrarily set θ = 2, and this is more questionable. How can we know that it is a correct value of the parameter? Let us analyze this issue in detail.
If x is some non-negative integer, we know that the probability that the random variable X takes the value x is given by the formula

P(X = x|θ) = θ^x e^{-θ} / x!,

for θ > 0. But without knowing the true value of θ, we cannot calculate probabilities such as the probability that X takes the value x = 1.
Suppose that, in order to learn something about the value of θ, an experiment is performed and a value of X = 5 is recorded. Let us take a look at the probability mass function for θ = 2 in Figure 1.2. What is the probability of X taking the value 2? Do we like what we see? Why? Would you bet 1 or 2 in the next experiment?
We certainly have some serious doubts about our choice of θ = 2, which was arbitrary anyway. One can consider, for example, θ = 7 as an alternative to θ = 2. Here are the graphs of the pmf for the two cases. Which of the two choices do we like?
Figure 1.3: The probability mass function for the Poisson model with θ = 2 vs. the one with θ = 7. (Horizontal axis: number of particles; vertical axis: probability.)
Since it was more probable to get X = 5 under the assumption θ = 7 than when θ = 2, we say that θ = 7 is more likely to produce X = 5 than θ = 2. Based on this observation we can develop a general strategy for choosing θ.
Let us summarize our position. So far we know (or assume) about the radioactive emission that it follows the Poisson model with some unknown θ > 0, and the value x = 5 has been observed once. Our goal is to utilize this knowledge somehow. First, we note that the Poisson model is in fact a function not only of x but also of θ:

p(x|θ) = θ^x e^{-θ} / x! .

Let us plug in the observed x = 5, so that we get a function of θ that is called the likelihood function:

l(θ) = θ^5 e^{-θ} / 120 .

Its graph is presented in the next figure. Can you localize on this graph the values of the probabilities that were used to choose θ = 7 over θ = 2? What value of θ appears to be the most preferable if the same argument is extended to all possible values of θ? We observe that the value θ = 5 is most likely to produce the value x = 5. As a result of our likelihood approach we have used the data x = 5 and the Poisson model to make inference - an example of statistical inference.
Figure 1.4: Likelihood function for the Poisson model when the observed value is x = 5. (Horizontal axis: theta; vertical axis: likelihood.)
Likelihood: the Poisson model backward
The Poisson model can be stated as a probability mass function that maps possible values x into probabilities p(x), or, if we emphasize the dependence on θ, into p(x|θ), given below:

p(x|θ) = l(θ|x) = θ^x e^{-θ} / x! .

With the Poisson model with θ given, one can compute the probabilities that various possible numbers x of emitted particles are recorded, i.e. we consider

x ↦ p(x|θ)

with θ fixed. We get the answer to how probable the various outcomes x are.
With the Poisson model where x is observed and thus fixed, one can evaluate how likely it would be to get x under various values of θ, i.e. we consider

θ ↦ l(θ|x)

with x fixed. We get the answer to how likely the various θ are to have produced the observed x.
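The two readings of the same formula are easy to explore numerically. The following R sketch (an illustration added here, with an arbitrary grid of θ values) evaluates the Poisson probability mass function for the fixed value θ = 2 and the likelihood of θ for the fixed observation x = 5, using the built-in dpois function.

# Poisson model read in two ways (illustrative sketch)
theta <- 2
x.values <- 0:10
pmf <- dpois(x.values, lambda = theta)    # x -> p(x | theta), with theta fixed
round(pmf, 4)

x.obs <- 5
theta.grid <- seq(0.01, 15, by = 0.01)
lik <- dpois(x.obs, lambda = theta.grid)  # theta -> l(theta | x), with x fixed
theta.grid[which.max(lik)]                # close to 5, the most likely theta

op <- par(mfrow = c(1, 2))
plot(x.values, pmf, type = "h", xlab = "Number of particles", ylab = "Probability")
plot(theta.grid, lik, type = "l", xlab = "theta", ylab = "Likelihood")
par(op)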
Exercise 1. For the general Poisson model

p(x|θ) = l(θ|x) = θ^x e^{-θ} / x!,

1. for a given θ > 0 find the most probable value of the observation x;
2. for a given observation x find the most likely value of θ.

Give a mathematical argument for your claims.
1.2.2 More data
Suppose that we perform another measurement of the number of emitted particles. Let us use the notation X_1 to denote the number of particles emitted in the first period and X_2 to denote the number emitted in the second period. We shall end up with data consisting of a random vector X = (X_1, X_2). The second measurement yielded x_2 = 2, so that x = (x_1, x_2) = (5, 2).
We know that the probability that X_1 takes the value x_1 = 5 is given by the formula

P(X_1 = 5|θ) = θ^5 e^{-θ} / 5!

and similarly that the probability that X_2 takes the value x_2 = 2 is given by

P(X_2 = 2|θ) = θ^2 e^{-θ} / 2! .
However, what about the probability that X takes the value x = (5, 2)? In order for this probability to be specified we need to know something about the joint distribution of the random variables X_1, X_2. A simple assumption to make is that the random variables X_1, X_2 are mutually independent. In such a case the probability that X takes the value x = (x_1, x_2) is given by

P(X = (x_1, x_2)|θ) = [θ^{x_1} e^{-θ} / x_1!] [θ^{x_2} e^{-θ} / x_2!] = e^{-2θ} θ^{x_1 + x_2} / (x_1! x_2!) .
After a little algebra we easily find the likelihood function of observing X = (5, 2) to be

l(θ|(5, 2)) = e^{-2θ} θ^7 / 240
Figure 1.5: Likelihood of observing (5, 2) (top) vs. the one of observing 5 (bottom). (Horizontal axis: theta; vertical axis: likelihood.)
and its graph is presented in Figure 1.5 in comparison with the previous likelihood for a single observation.
Two important effects of adding the extra information should be noted:
- We observe that the location of the maximum shifted from 5 to 3.5 compared to the single observation.
- We also note that the range of likely values for θ has diminished.
Let us suppose that eventually we decide to measure three more values of X. Let us use the vector notation X = (X_1, X_2, ..., X_5) to denote the observable random vector. Assume that the three extra measurements yielded 3, 7, 7, so that we have x = (x_1, x_2, x_3, x_4, x_5) = (5, 2, 3, 7, 7). Under the assumption of independence the probability that X takes the value x is given by

P(X = x|θ) = ∏_{i=1}^{5} θ^{x_i} e^{-θ} / x_i! .

The likelihood function of observing X = (5, 2, 3, 7, 7) under independence can be easily derived to be

θ^{24} e^{-5θ} / 14515200 .
In general, if x = (x_1, ..., x_n) is any vector of n non-negative integers, then the likelihood is given by

l(θ|(x_1, ..., x_n)) = θ^{∑_{i=1}^{n} x_i} e^{-nθ} / ∏_{i=1}^{n} x_i! .

The value θ̂ that maximizes this likelihood is called the maximum likelihood estimator of θ.
In order to find the value that maximizes the likelihood, methods of calculus can be applied. We note that in our example we deal with only one variable, so computing the derivative is rather straightforward.
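As a numerical sketch of this step (added here for illustration, using the data (5, 2, 3, 7, 7) from above), the likelihood can also be maximized over a grid in R; compare the grid maximizer with the formula you derive in Exercise 2 below.

# Grid search for the MLE in the Poisson model (illustrative sketch)
x <- c(5, 2, 3, 7, 7)
loglik <- function(theta) sum(dpois(x, lambda = theta, log = TRUE))
theta.grid <- seq(0.1, 15, by = 0.01)
ll <- sapply(theta.grid, loglik)
theta.grid[which.max(ll)]                 # the grid maximizer of the likelihood
plot(theta.grid, exp(ll), type = "l", xlab = "theta", ylab = "Likelihood")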
Exercise 2. For the general case of the likelihood based on the Poisson model

l(θ|x_1, ..., x_n) = θ^{∑_{i=1}^{n} x_i} e^{-nθ} / ∏_{i=1}^{n} x_i!

use methods of calculus to derive a general formula for the maximum likelihood estimator of θ. Using the result, find θ̂ for (x_1, x_2, x_3, x_4, x_5) = (5, 2, 3, 7, 7).
Exercise 3. It is generally believed that the time X that passes until there is half of the original radioactive material follows an exponential distribution f(x|θ) = θ e^{-θx}, x > 0. For beryllium 11, five experiments have been performed and the values 13.21, 13.12, 13.95, 13.54, 13.88 seconds have been obtained. Find and plot the likelihood function for θ and, based on this, determine the most likely θ.
1.3 Likelihood and theory of statistics
The strategy of making statistical inference based on the likelihood function as described above is a recurrent theme in mathematical statistics and thus in these lectures. Using mathematical arguments we will compare various strategies for inferring about parameters, and often we will demonstrate that the likelihood based methods are optimal. The likelihood will also show its strength as a criterion for deciding between various claims about the parameters of a model, which is the leading story of testing hypotheses.
In modern days, the role of computers in statistical methodology has increased. New computationally intensive methods of data exploration have become one of the central areas of modern statistics. Even there, methods that refer to likelihood play dominant roles, in particular in Bayesian methodology.
Despite this extensive penetration of statistical methodology by likelihood techniques, by no means can statistics be reduced to the analysis of likelihood. In every area of statistics there are important aspects that require reaching beyond likelihood; in many cases likelihood is not even a focus of study and development. The purpose of this course is to present the importance of the likelihood approach across statistics, but also to present topics for which likelihood plays a secondary role, if any.
1.4 Computationally intensive methods of statistics
The second part of our presentation of modern statistical inference is devoted to computationally intensive statistical methods. The area of data exploration is rapidly growing in importance due to
- common access to inexpensive but advanced computing tools,
- the emergence of new challenges associated with massive, highly dimensional data far exceeding the traditional assumptions on which traditional methods of statistics have been based.
In this introduction we give two examples that illustrate the power of modern computers and computing software, both in the analysis of statistical models and in performing actual
statistical inference. We start by analyzing the performance of a statistical procedure using random sample generation.
1.4.1 Monte Carlo methods: studying statistical methods using computer generated random samples
Randomness can be used to study properties of a mathematical model. The model itself may be probabilistic or not, but here we focus on the probabilistic ones. Essentially, the approach is based on repetitive simulation of random samples corresponding to the model and observing the behavior of the objects of interest. An example of the Monte Carlo method is to approximate the area of a circle by tossing points at random (typically computer generated) onto the paper where the circle is drawn. The percentage of points that fall inside the circle represents (approximately) the percentage of the area covered by the circle, as illustrated in Figure 1.6.
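A minimal R sketch of the circle experiment just described (the sample size 10000 matches Figure 1.6; placing the unit circle inside the square [-1, 1] x [-1, 1] is a choice made here for concreteness):

# Monte Carlo approximation of the area of the unit circle
set.seed(1)                        # for reproducibility
n <- 10000
x <- runif(n, -1, 1)               # random points in the square [-1, 1] x [-1, 1]
y <- runif(n, -1, 1)
inside <- x^2 + y^2 <= 1           # TRUE if the point falls inside the circle
4 * mean(inside)                   # fraction inside times the area of the square; close to pi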
Exercise 4. Write R code that would explore the area of an ellipsoid using the Monte Carlo method.
Below we present an application of the Monte Carlo approach to studying fitting methods for the Poisson model.
Deciding for the Poisson model
Recall that the Poisson model is given by

P(X = x|θ) = θ^x e^{-θ} / x! .

It is relatively easy to demonstrate that the mean value of this distribution is equal to θ and the variance is also equal to θ.
Exercise 5. Present a formal argument showing that for a Poisson random variable X with parameter θ, EX = θ and Var X = θ.
Thus for a sample of observations x = (x_1, ..., x_n) it is reasonable to consider
Figure 1.6: Monte Carlo study of the circle area: the approximation for a sample size of 10000 is 3.1248, which compares to the true value of π = 3.141593.
both

θ̂_1 = x̄   and   θ̂_2 = (1/n) ∑_{i=1}^n x_i² - x̄²

as estimators of θ.
We want to employ the Monte Carlo method to decide which one is better. In the process we generate many samples from the Poisson distribution and check which of the estimators performs better.
Figure 1.7: Monte Carlo results of comparing estimation of θ = 4 by the sample mean (left) vs. estimation using the sample variance (right). (Panel titles: Histogram of means, Histogram of vars.)
The resulting histograms of the values of the estimators are presented in Figure 1.7. It is quite clear from the graphs that the estimator based on the mean is better than the one based on the variance.
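The comparison can be reproduced with a few lines of R. The sketch below assumes θ = 4 as in Figure 1.7; the sample size n = 20 and the number of Monte Carlo replications B = 1000 are illustrative choices, not values taken from the notes.

# Monte Carlo comparison of the two estimators of theta
set.seed(1)
theta <- 4; n <- 20; B <- 1000
means <- numeric(B); vars <- numeric(B)
for (b in 1:B) {
  x <- rpois(n, lambda = theta)
  means[b] <- mean(x)                 # theta-hat 1: the sample mean
  vars[b]  <- mean(x^2) - mean(x)^2   # theta-hat 2: second moment minus squared mean
}
c(sd(means), sd(vars))                # the mean-based estimator varies much less
op <- par(mfrow = c(1, 2))
hist(means, main = "Histogram of means")
hist(vars,  main = "Histogram of vars")
par(op)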
1.4.2 Bootstrap: performing statistical inference using computers
Bootstrap (resampling) methods are one example of Monte Carlo based statistical analysis. The methodology can be summarized as follows:
- Collect a statistical sample, i.e. the same type of data as in classical statistics.
- Use a properly chosen Monte Carlo based resampling from the data (using a random number generator) to create so called bootstrap samples.
- Analyze the bootstrap samples to draw conclusions about the random mechanism
that produced the original statistical data.
This way randomness is used to analyze statistical samples that, by the way, are themselves also a result of randomness. An example illustrating the approach is presented next.
Estimating nitrate ion concentration
Nitrate ion concentration measurements have been collected in a certain chemical lab and their results are given in the following table. The goal is to estimate, based on
0.51 0.51 0.51 0.50 0.51 0.49 0.52 0.53 0.50 0.47
0.51 0.52 0.53 0.48 0.49 0.50 0.52 0.49 0.49 0.50
0.49 0.48 0.46 0.49 0.49 0.48 0.49 0.49 0.51 0.47
0.51 0.51 0.51 0.48 0.50 0.47 0.50 0.51 0.49 0.48
0.51 0.50 0.50 0.53 0.52 0.52 0.50 0.50 0.51 0.51
Table 1.1: Results of 50 determinations of nitrate ion concentration in µg per ml.
these values, the actual nitrate ion concentration. The overall mean of all observations is 0.4998. It is natural to ask what the error of this determination of the nitrate concentration is. If we could repeat our experiment of collecting 50 determinations of nitrate concentration many times, we would see the range of errors that are made. However, this would be a waste of resources and not a viable method at all. Instead we resample new data from our data and use the samples so obtained to assess the error, comparing the resulting means (bootstrap means) with the original one. The differences between these represent the bootstrap estimation errors; their distribution is viewed as a good representation of the distribution of the true error. In Figure 1.8 we see the bootstrap counterpart of the distribution of the estimation error.
Based on this we can safely say that the nitrate concentration is 49.99 ± 0.005.
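A sketch of the resampling step in R, using the 50 values of Table 1.1 (the number of bootstrap samples, B = 1000, is an illustrative choice):

# Bootstrap estimation errors for the mean nitrate concentration
set.seed(1)
nitrate <- c(0.51, 0.51, 0.51, 0.50, 0.51, 0.49, 0.52, 0.53, 0.50, 0.47,
             0.51, 0.52, 0.53, 0.48, 0.49, 0.50, 0.52, 0.49, 0.49, 0.50,
             0.49, 0.48, 0.46, 0.49, 0.49, 0.48, 0.49, 0.49, 0.51, 0.47,
             0.51, 0.51, 0.51, 0.48, 0.50, 0.47, 0.50, 0.51, 0.49, 0.48,
             0.51, 0.50, 0.50, 0.53, 0.52, 0.52, 0.50, 0.50, 0.51, 0.51)
B <- 1000
boot.err <- replicate(B, mean(sample(nitrate, replace = TRUE)) - mean(nitrate))
hist(boot.err, main = "Histogram of bootstrap")  # compare with Figure 1.8
quantile(boot.err, c(0.025, 0.975))              # spread of the bootstrap errors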
Exercise 6. Consider a sample of the daily number of buyers in a furniture store:
8, 5, 2, 3, 1, 3, 9, 5, 5, 2, 3, 3, 8, 4, 7, 11, 7, 5, 12, 5
Consider the two estimators of θ for a Poisson distribution as discussed in the previous section. Describe formally the procedure (in steps) of obtaining a bootstrap confidence
Figure 1.8: Bootstrap estimation error distribution.
interval for θ using each of the discussed estimators, and provide 95% bootstrap confidence intervals for each of them.
Chapter 2
Review of Probability
2.1 Expectation and Variance
The expected value E[Y] of a random variable Y is defined as

E[Y] = ∑_i y_i P(y_i)

if Y is discrete, and

E[Y] = ∫_{-∞}^{∞} y f(y) dy

if Y is continuous, where f(y) is the probability density function. The variance Var[Y] of a random variable Y is defined as

Var[Y] = E(Y - E[Y])²,

or

Var[Y] = ∑_i (y_i - E[Y])² P(y_i)

if Y is discrete, and

Var[Y] = ∫_{-∞}^{∞} (y - E[Y])² f(y) dy

if Y is continuous. When there is no ambiguity we often write EY for E[Y], and VarY for Var[Y].
A function of a random variable is itself a random variable. If h(Y) is a function of the random variable Y, then the expected value of h(Y) is given by

E[h(Y)] = ∑_i h(y_i) P(y_i)

if Y is discrete, and if Y is continuous

E[h(Y)] = ∫_{-∞}^{∞} h(y) f(y) dy.

It is relatively straightforward to derive the following results for the expectation and variance of a linear function of Y:

E[aY + b] = aE[Y] + b,
Var[aY + b] = a² Var[Y],

where a and b are constants. Also

Var[Y] = E[Y²] - (E[Y])².   (2.1)

For expectations, it can be shown more generally that

E[ ∑_{i=1}^k a_i h_i(Y) ] = ∑_{i=1}^k a_i E[h_i(Y)],

where a_i, i = 1, 2, ..., k are constants and h_i(Y), i = 1, 2, ..., k are functions of the random variable Y.
2.2 Distribution of a Function of a Random Variable
If Y is a random variable, then for any regular function g the variable X = g(Y) is also a random variable. The cumulative distribution function of X is given as

F_X(x) = P(X ≤ x) = P(Y ∈ g^{-1}((-∞, x])).

The density function of X, if it exists, can be found by differentiating the right hand side of the above equality.
Example 1. Let Y have a density f_Y and let X = Y². Then

F_X(x) = P(Y² < x) = P(-√x ≤ Y ≤ √x) = F_Y(√x) - F_Y(-√x).

By taking the derivative in x we obtain

f_X(x) = (1/(2√x)) [ f_Y(√x) + f_Y(-√x) ].

If additionally the distribution of Y is symmetric around zero, i.e. f_Y(y) = f_Y(-y), then

f_X(x) = (1/√x) f_Y(√x).
Exercise 7. Let Z be a random variable with the density f_Z(z) = e^{-z²/2}/√(2π), the so called standard normal (Gaussian) random variable. Show that Z² is a Gamma(1/2, 1/2) random variable, i.e. that it has the density given by

(1/√(2πx)) e^{-x/2} = (1/√(2π)) x^{-1/2} e^{-x/2}.

The distribution of Z² is also called the chi-square distribution with one degree of freedom.
Exercise 8. Let F_Y(y) be a cumulative distribution function of some random variable Y that with probability one takes values in a set R_Y. Assume that there is an inverse function F_Y^{-1}: [0, 1] → R_Y so that F_Y(F_Y^{-1}(u)) = u for u ∈ [0, 1]. Check that for U ∼ Unif(0, 1) the random variable Ỹ = F_Y^{-1}(U) has F_Y as its cumulative distribution function.
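For a distribution whose inverse c.d.f. is explicit, the recipe of Exercise 8 is immediate to implement. The R sketch below (an illustration added here) uses the exponential distribution with rate λ = 2, for which F(y) = 1 - e^{-λy} and F^{-1}(u) = -log(1 - u)/λ.

# Inverse-CDF sampling sketch: exponential with rate lambda = 2
set.seed(1)
lambda <- 2
u <- runif(10000)                 # U ~ Unif(0, 1)
y <- -log(1 - u) / lambda         # F^{-1}(U) for F(y) = 1 - exp(-lambda * y)
c(mean(y), 1 / lambda)            # sample mean close to 1/lambda
c(var(y), 1 / lambda^2)           # sample variance close to 1/lambda^2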
The densities of g(Y) are particularly easy to express if g is strictly monotone, as shown in the next result.
Theorem 2.2.1. Let Y be a continuous random variable with probability density function f_Y. Suppose that g(y) is a strictly monotone (increasing or decreasing) differentiable (and hence continuous) function of y. The random variable Z defined by Z = g(Y) has probability density function given by

f_Z(z) = f_Y(g^{-1}(z)) |d/dz g^{-1}(z)|   (2.2)

where g^{-1}(z) is defined to be the inverse function of g(y).
Proof. Let g(y) be a monotone increasing (decreasing) function and let F_Y(y) and F_Z(z) denote the probability distribution functions of the random variables Y and Z. Then

F_Z(z) = P(Z ≤ z) = P(g(Y) ≤ z) = P(Y ≤ (≥) g^{-1}(z)) = (1 -) F_Y(g^{-1}(z)).

By the chain rule,

f_Z(z) = d/dz F_Z(z) = (±) d/dz F_Y(g^{-1}(z)) = f_Y(g^{-1}(z)) |dg^{-1}(z)/dz|.
Exercise 9. (The Log-Normal Distribution) Suppose Z is a standard normal random variable and g(z) = e^{az+b}. Then Y = g(Z) is called a log-normal random variable. Demonstrate that the density of Y is given by

f_Y(y) = (1/√(2πa²)) y^{-1} exp( -log²(y/e^b) / (2a²) ).
2.3 Transform Methods: Characteristic, Probability Generating and Moment Generating Functions
The probability generating function of a random variable Y is a function denoted by G_Y(t) and defined by

G_Y(t) = E(t^Y),

for those t ∈ R for which the above expectation is convergent. The expectation defining G_Y(t) converges absolutely if |t| ≤ 1. As the name implies, the p.g.f. generates the probabilities associated with a discrete distribution P(Y = j) = p_j, j = 0, 1, 2, ...:

G_Y(0) = p_0,   G'_Y(0) = p_1,   G''_Y(0) = 2! p_2.

In general the k-th derivative of the p.g.f. of Y satisfies

G_Y^{(k)}(0) = k! p_k.
The p.g.f. can be used to calculate the mean and variance of a random variable Y. Note that in the discrete case G'_Y(t) = ∑_{j=1}^∞ j p_j t^{j-1} for -1 < t < 1. Let t approach one from the left, t → 1⁻, to obtain

G'_Y(1) = ∑_{j=1}^∞ j p_j = E(Y) = μ_Y.

The second derivative of G_Y(t) satisfies

G''_Y(t) = ∑_{j=1}^∞ j(j - 1) p_j t^{j-2},

and consequently

G''_Y(1) = ∑_{j=1}^∞ j(j - 1) p_j = E(Y²) - E(Y).

The variance of Y satisfies

σ²_Y = EY² - EY + EY - (EY)² = G''_Y(1) + G'_Y(1) - (G'_Y(1))².
The moment generating function (m.g.f.) of a random variable Y is denoted by M_Y(t) and defined as

M_Y(t) = E(e^{tY}),

for some t ∈ R. The moment generating function generates the moments EY^k:

M_Y(0) = 1,   M'_Y(0) = μ_Y = E(Y),   M''_Y(0) = EY²,

and, in general,

M_Y^{(k)}(0) = EY^k.

The characteristic function (ch.f.) of a random variable Y is defined by

φ_Y(t) = E e^{itY},

where i = √(-1).
A very important result concerning generating functions states that the moment generating function uniquely defines the probability distribution (provided it exists in an open interval around zero). The characteristic function also uniquely defines the probability distribution.
Property 1. If Y has the characteristic function φ_Y(t) and the moment generating function M_Y(t), then for X = a + bY

φ_X(t) = e^{ait} φ_Y(bt),
M_X(t) = e^{at} M_Y(bt).
2.4 Random Vectors
2.4.1 Sums of Independent Random Variables
Suppose that Y_1, Y_2, ..., Y_n are independent random variables. Then the moment generating function of the linear combination Z = ∑_{i=1}^n a_i Y_i is the product of the individual moment generating functions:

M_Z(t) = E e^{t ∑ a_i Y_i} = E e^{a_1 t Y_1} E e^{a_2 t Y_2} ··· E e^{a_n t Y_n} = ∏_{i=1}^n M_{Y_i}(a_i t).

The same argument also gives φ_Z(t) = ∏_{i=1}^n φ_{Y_i}(a_i t).
When X and Y are discrete random variables, the condition of independence is equivalent to p_{X,Y}(x, y) = p_X(x) p_Y(y) for all x, y. In the jointly continuous case the condition of independence is equivalent to f_{X,Y}(x, y) = f_X(x) f_Y(y) for all x, y.
Consider random variables X and Y with probability densities f_X(x) and f_Y(y) respectively. We seek the probability density of the random variable X + Y. Our general result follows from

F_{X+Y}(a) = P(X + Y < a)
           = ∬_{x+y<a} f_X(x) f_Y(y) dx dy
           = ∫_{-∞}^{∞} ∫_{-∞}^{a-y} f_X(x) f_Y(y) dx dy
           = ∫_{-∞}^{∞} ∫_{-∞}^{a} f_X(z - y) dz f_Y(y) dy
           = ∫_{-∞}^{a} ∫_{-∞}^{∞} f_X(z - y) f_Y(y) dy dz.   (2.3)
Thus the density function is

f_{X+Y}(z) = ∫_{-∞}^{∞} f_X(z - y) f_Y(y) dy,

which is called the convolution of the densities f_X and f_Y.
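As a numerical illustration (a sketch added here), the convolution integral can be evaluated with R's integrate function; for two independent Exp(λ) variables it reproduces the Gamma(2, λ) density that appears later in Section 2.7.1.

# Numerical convolution of two exponential densities (illustrative sketch)
lambda <- 1
conv <- function(z) integrate(function(y) dexp(z - y, rate = lambda) * dexp(y, rate = lambda),
                              lower = 0, upper = z)$value
z <- 2.5
c(conv(z), dgamma(z, shape = 2, rate = lambda))  # the two values agree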
2.4.2 Covariance and Correlation
Suppose that X and Y are real-valued random variables for some random experiment. The covariance of X and Y is defined by

Cov(X, Y) = E[(X - EX)(Y - EY)]

and (assuming the variances are positive) the correlation of X and Y is defined by

ρ(X, Y) = Cov(X, Y) / ( √Var(X) √Var(Y) ).
Note that the covariance and correlation always have the same sign (positive, nega-
tive, or 0). When the sign is positive, the variables are said to be positively correlated,
when the sign is negative, the variables are said to be negatively correlated, and when
the sign is 0, the variables are said to be uncorrelated. For an intuitive understanding
of correlation, suppose that we run the experiment a large number of times and that
for each run, we plot the values (X, Y ) in a scatterplot. The scatterplot for positively
correlated variables shows a linear trend with positive slope, while the scatterplot for
negatively correlated variables shows a linear trend with negative slope. For uncorre-
lated variables, the scatterplot should look like an amorphous blob of points with no
discernible linear trend.
Property 2. You should satisfy yourself that the following are true:

Cov(X, Y) = EXY - EX EY,
Cov(X, Y) = Cov(Y, X),
Cov(Y, Y) = Var(Y),
Cov(aX + bY + c, Z) = a Cov(X, Z) + b Cov(Y, Z),
Var( ∑_{i=1}^n Y_i ) = ∑_{i,j=1}^n Cov(Y_i, Y_j).
If X and Y are independent, then they are uncorrelated. The converse is not true
however.
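A quick R sketch of this last point (an illustration, with Y = X² chosen as a convenient counterexample): Y is a deterministic function of X, hence clearly dependent on it, yet the two variables are uncorrelated.

# Uncorrelated does not imply independent
set.seed(1)
x <- rnorm(100000)
y <- x^2            # a function of x, so certainly not independent of x
cor(x, y)           # close to 0: x and y are (nearly) uncorrelated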
2.4.3 The Bivariate Change of Variables Formula
Suppose that (X, Y) is a random vector taking values in a subset S of R² with probability density function f. Suppose that U and V are random variables that are functions of X and Y:

U = U(X, Y),   V = V(X, Y).

If these functions have derivatives, there is a simple way to get the joint probability density function g of (U, V). First, we will assume that the transformation (x, y) → (u, v) is one-to-one and maps S onto a subset T of R². Thus, the inverse transformation (u, v) → (x, y) is well defined and maps T onto S. We will assume that the inverse transformation is smooth, in the sense that the partial derivatives

∂x/∂u,   ∂x/∂v,   ∂y/∂u,   ∂y/∂v

exist on T, and the Jacobian

∂(x, y)/∂(u, v) = | ∂x/∂u  ∂x/∂v ; ∂y/∂u  ∂y/∂v | = (∂x/∂u)(∂y/∂v) - (∂x/∂v)(∂y/∂u)

is nonzero on T. Now, let B be an arbitrary subset of T. The inverse transformation maps B onto a subset A of S. Therefore,

P((U, V) ∈ B) = P((X, Y) ∈ A) = ∬_A f(x, y) dx dy.

But, by the change of variables formula for double integrals, this can be written as

P((U, V) ∈ B) = ∬_B f(x(u, v), y(u, v)) |∂(x, y)/∂(u, v)| du dv.

By the very meaning of density, it follows that the probability density function of (U, V) is

g(u, v) = f(x(u, v), y(u, v)) |∂(x, y)/∂(u, v)|,   (u, v) ∈ T.

The change of variables formula generalizes to R^n.
Exercise 10. Let U_1 and U_2 be independent random variables with density equal to one over [0, 1], i.e. standard uniform random variables. Find the density of the following vector of variables:

(Z_1, Z_2) = ( √(-2 log U_1) cos(2πU_2), √(-2 log U_1) sin(2πU_2) ).
2.5 Discrete Random Variables
2.5.1 Bernoulli Distribution
A Bernoulli trial is a probabilistic experiment which can have one of two outcomes, success (Y = 1) or failure (Y = 0), and in which the probability of success is θ. We refer to θ as the Bernoulli probability parameter. The value of the random variable Y is used as an indicator of the outcome, which may also be interpreted as the presence or absence of a particular characteristic. A Bernoulli random variable Y has probability mass function

P(Y = y|θ) = θ^y (1 - θ)^{1-y}   (2.4)

for y = 0, 1 and some θ ∈ (0, 1). The notation Y ∼ Ber(θ) should be read as the random variable Y follows a Bernoulli distribution with parameter θ.
A Bernoulli random variable Y has expected value E[Y] = 0 · P(Y = 0) + 1 · P(Y = 1) = 0(1 - θ) + 1 · θ = θ, and variance Var[Y] = (0 - θ)²(1 - θ) + (1 - θ)² θ = θ(1 - θ).
2.5.2 Binomial Distribution
Consider independent repetitions of Bernoulli experiments, each with a probability of success θ. Next consider the random variable Y, defined as the number of successes in a fixed number n of independent Bernoulli trials. That is,

Y = ∑_{i=1}^n X_i,

where X_i ∼ Bernoulli(θ) for i = 1, ..., n. Each sequence of length n containing y ones and (n - y) zeros occurs with probability θ^y (1 - θ)^{n-y}. The number of sequences with y successes, and consequently (n - y) failures, is

n! / (y!(n - y)!) = \binom{n}{y}.

The random variable Y can take on values y = 0, 1, 2, ..., n with probabilities

P(Y = y|θ) = \binom{n}{y} θ^y (1 - θ)^{n-y}.   (2.5)

The notation Y ∼ Bin(n, θ) should be read as the random variable Y follows a binomial distribution with parameters n and θ. Finally, using the fact that Y is the sum of n independent Bernoulli random variables, we can calculate the expected value as E[Y] = E[∑ X_i] = ∑ E[X_i] = ∑ θ = nθ and the variance as Var[Y] = Var[∑ X_i] = ∑ Var[X_i] = ∑ θ(1 - θ) = nθ(1 - θ).
2.5.3 Negative Binomial and Geometric Distribution
Instead of fixing the number of trials, suppose now that the number of successes, r, is fixed, and that the sample size required in order to reach this fixed number is the random variable N. This is sometimes called inverse sampling. In the case of r = 1, using the independence argument again leads to the geometric distribution

P(N = n|θ) = (1 - θ)^{n-1} θ,   n = 1, 2, ...   (2.6)

which is the geometric probability function with parameter θ. The distribution is so named as successive probabilities form a geometric series. The notation N ∼ Geo(θ) should be read as the random variable N follows a geometric distribution with parameter θ. Write (1 - θ) = q. Then

E[N] = ∑_{n=1}^∞ n θ q^{n-1} = θ ∑_{n=0}^∞ d/dq (q^n) = θ d/dq ( ∑_{n=0}^∞ q^n ) = θ d/dq ( 1/(1 - q) ) = θ/(1 - q)² = 1/θ.
Also,

E[N²] = ∑_{n=1}^∞ n² θ q^{n-1} = θ ∑_{n=1}^∞ d/dq (n q^n) = θ d/dq ( ∑_{n=1}^∞ n q^n )
      = θ d/dq ( (q/(1 - q)) E(N) ) = θ d/dq ( q(1 - q)^{-2} )
      = θ ( 1/θ² + 2(1 - θ)/θ³ ) = 2/θ² - 1/θ.

Using Var[N] = E[N²] - (E[N])², we get Var[N] = (1 - θ)/θ².
Suppose now that sampling continues until a total of r successes are observed. Again, let the random variable N denote the number of trials required. If the r-th success occurs on the n-th trial, then this implies that a total of (r - 1) successes are observed by the (n - 1)-th trial. The probability of this happening can be calculated using the binomial distribution as

\binom{n-1}{r-1} θ^{r-1} (1 - θ)^{n-r}.

The probability that the n-th trial is a success is θ. As these two events are independent we have that

P(N = n|r, θ) = \binom{n-1}{r-1} θ^r (1 - θ)^{n-r}   (2.7)

for n = r, r + 1, .... The notation N ∼ NegBin(r, θ) should be read as the random variable N follows a negative binomial distribution with parameters r and θ. This is also known as the Pascal distribution.
E[N^k] = ∑_{n=r}^∞ n^k \binom{n-1}{r-1} θ^r (1 - θ)^{n-r}
       = (r/θ) ∑_{n=r}^∞ n^{k-1} \binom{n}{r} θ^{r+1} (1 - θ)^{n-r}     since n \binom{n-1}{r-1} = r \binom{n}{r}
       = (r/θ) ∑_{m=r+1}^∞ (m - 1)^{k-1} \binom{m-1}{r} θ^{r+1} (1 - θ)^{m-(r+1)}
       = (r/θ) E[ (X - 1)^{k-1} ],

where X ∼ NegBin(r + 1, θ). Setting k = 1 we get E(N) = r/θ. Setting k = 2 gives

E[N²] = (r/θ) E(X - 1) = (r/θ) ( (r + 1)/θ - 1 ).

Therefore Var[N] = r(1 - θ)/θ².
2.5.4 Hypergeometric Distribution
The hypergeometric distribution is used to describe sampling without replacement. Consider an urn containing b balls, of which w are white and b - w are red. We intend to draw a sample of size n from the urn. Let Y denote the number of white balls selected. Then, for y = 0, 1, 2, ..., n we have

P(Y = y|b, w, n) = \binom{w}{y} \binom{b-w}{n-y} / \binom{b}{n}.   (2.8)
The expected value of the j-th moment of a hypergeometric random variable is

E[Y^j] = ∑_{y=0}^n y^j P(Y = y) = ∑_{y=1}^n y^j \binom{w}{y} \binom{b-w}{n-y} / \binom{b}{n}.

The identities

y \binom{w}{y} = w \binom{w-1}{y-1},   n \binom{b}{n} = b \binom{b-1}{n-1}

can be used to obtain

E[Y^j] = (nw/b) ∑_{y=1}^n y^{j-1} \binom{w-1}{y-1} \binom{b-w}{n-y} / \binom{b-1}{n-1}
       = (nw/b) ∑_{x=0}^{n-1} (x + 1)^{j-1} \binom{w-1}{x} \binom{b-w}{n-1-x} / \binom{b-1}{n-1}
       = (nw/b) E[ (X + 1)^{j-1} ],

where X is a hypergeometric random variable with parameters n - 1, b - 1, w - 1. From this it is easy to establish that E[Y] = nθ and Var[Y] = nθ(1 - θ)(b - n)/(b - 1), where θ = w/b is the fraction of white balls in the population.
2.5.5 Poisson Distribution
Certain problems involve counting the number of events that have occurred in a fixed time period. A random variable Y, taking on one of the values 0, 1, 2, ..., is said to be a Poisson random variable with parameter λ if for some λ > 0,

P(Y = y|λ) = (λ^y / y!) e^{-λ},   y = 0, 1, 2, ...   (2.9)

The notation Y ∼ Pois(λ) should be read as the random variable Y follows a Poisson distribution with parameter λ. Equation 2.9 defines a probability mass function, since

∑_{y=0}^∞ (λ^y / y!) e^{-λ} = e^{-λ} ∑_{y=0}^∞ λ^y / y! = e^{-λ} e^{λ} = 1.

The expected value of a Poisson random variable is

E[Y] = ∑_{y=0}^∞ y e^{-λ} λ^y / y! = λ e^{-λ} ∑_{y=1}^∞ λ^{y-1}/(y - 1)! = λ e^{-λ} ∑_{j=0}^∞ λ^j / j! = λ.
To get the variance we first compute the second moment:

E[Y²] = ∑_{y=0}^∞ y² e^{-λ} λ^y / y! = λ ∑_{y=1}^∞ y e^{-λ} λ^{y-1}/(y - 1)! = λ ∑_{j=0}^∞ (j + 1) e^{-λ} λ^j / j! = λ(λ + 1).

Since we already have E[Y] = λ, we obtain Var[Y] = E[Y²] - (E[Y])² = λ.
Suppose that Y ∼ Binomial(n, p), and let λ = np. Then

P(Y = y|n, p) = \binom{n}{y} p^y (1 - p)^{n-y}
             = \binom{n}{y} (λ/n)^y (1 - λ/n)^{n-y}
             = [ n(n - 1) ··· (n - y + 1)/n^y ] (λ^y / y!) (1 - λ/n)^n (1 - λ/n)^{-y}.

For n large and λ moderate, we have that

(1 - λ/n)^n ≈ e^{-λ},   n(n - 1) ··· (n - y + 1)/n^y ≈ 1,   (1 - λ/n)^{-y} ≈ 1.

Our result is that a binomial random variable Bin(n, p) is well approximated by a Poisson random variable Pois(λ = np) when n is large and p is small. That is,

P(Y = y|n, p) ≈ e^{-np} (np)^y / y!.
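The quality of this approximation is easy to inspect numerically; the following R sketch (with illustrative values n = 1000 and p = 0.003) compares the two probability mass functions.

# Poisson approximation to the binomial: n large, p small
n <- 1000; p <- 0.003
y <- 0:10
round(cbind(binomial = dbinom(y, size = n, prob = p),
            poisson  = dpois(y, lambda = n * p)), 5)  # the two columns nearly agree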
2.5.6 Discrete Uniform Distribution
The discrete uniform distribution with integer parameter N has a random variable Y that can take the values y = 1, 2, ..., N with equal probability 1/N. It is easy to show that the mean and variance of Y are E[Y] = (N + 1)/2 and Var[Y] = (N² - 1)/12.
2.5.7 The Multinomial Distribution
Suppose that we perform n independent and identical experiments, where each experiment can result in any one of r possible outcomes, with respective probabilities p_1, p_2, ..., p_r, where ∑_{i=1}^r p_i = 1. If we denote by Y_i the number of the n experiments that result in outcome number i, then

P(Y_1 = n_1, Y_2 = n_2, ..., Y_r = n_r) = ( n! / (n_1! n_2! ··· n_r!) ) p_1^{n_1} p_2^{n_2} ··· p_r^{n_r}   (2.10)

where ∑_{i=1}^r n_i = n. Equation 2.10 is justified by noting that any sequence of outcomes that leads to outcome i occurring n_i times for i = 1, 2, ..., r will, by the assumption of independence of the experiments, have probability p_1^{n_1} p_2^{n_2} ··· p_r^{n_r} of occurring. As there are n!/(n_1! n_2! ··· n_r!) such sequences of outcomes, equation 2.10 is established.
2.6 Continuous Random Variables
2.6.1 Uniform Distribution
A random variable Y is said to be uniformly distributed over the interval (a, b) if its probability density function is given by

f(y|a, b) = 1/(b - a),   if a < y < b,

and equals 0 for all other values of y. Since F(u) = ∫_{-∞}^u f(y) dy, the distribution function of a uniform random variable on the interval (a, b) is

F(u) = 0 for u ≤ a;   (u - a)/(b - a) for a < u ≤ b;   1 for u > b.

The expected value of a uniform random variable turns out to be the mid-point of the interval, that is

E[Y] = ∫_{-∞}^{∞} y f(y) dy = ∫_a^b y/(b - a) dy = (b² - a²)/(2(b - a)) = (b + a)/2.

The second moment is calculated as

E[Y²] = ∫_a^b y²/(b - a) dy = (b³ - a³)/(3(b - a)) = (1/3)(b² + ab + a²),

hence the variance is

Var[Y] = E[Y²] - (E[Y])² = (1/12)(b - a)².

The notation Y ∼ U(a, b) should be read as the random variable Y follows a uniform distribution on the interval (a, b).
2.6.2 Exponential Distribution
A random variable Y is said to be an exponential random variable if its probability density function is given by

f(y|λ) = λ e^{-λy},   y > 0, λ > 0.

The cumulative distribution of an exponential random variable is given by

F(a) = ∫_0^a λ e^{-λy} dy = -e^{-λy} |_0^a = 1 - e^{-λa},   a > 0.

The expected value E[Y] = ∫_0^∞ y λ e^{-λy} dy requires integration by parts, yielding

E[Y] = -y e^{-λy} |_0^∞ + ∫_0^∞ e^{-λy} dy = -e^{-λy}/λ |_0^∞ = 1/λ.

Integration by parts can be used to verify that E[Y²] = 2/λ². Hence Var[Y] = 1/λ². The notation Y ∼ Exp(λ) should be read as the random variable Y follows an exponential distribution with parameter λ.
Exercise 11. Let U ∼ U[0, 1]. Find the distribution of Y = -log U. Can you identify it as one of the common distributions?
2.6.3 Gamma Distribution
A random variable Y is said to have a gamma distribution if its density function is given by

f(y|α, λ) = λ^α e^{-λy} y^{α-1} / Γ(α),   0 < y,  α > 0,  λ > 0,

where Γ(α) is called the gamma function and is defined by

Γ(α) = ∫_0^∞ e^{-u} u^{α-1} du.

Integration by parts of Γ(α) yields the recursive relationship

Γ(α) = -e^{-u} u^{α-1} |_0^∞ + ∫_0^∞ e^{-u} (α - 1) u^{α-2} du   (2.11)
     = (α - 1) ∫_0^∞ e^{-u} u^{α-2} du = (α - 1) Γ(α - 1).   (2.12)

For integer values α = n, this recursive relationship reduces to Γ(n + 1) = n!. Note that by setting α = 1 the gamma distribution reduces to an exponential distribution. The expected value of a gamma random variable is given by

E[Y] = (λ^α / Γ(α)) ∫_0^∞ y^α e^{-λy} dy = (1/(λ Γ(α))) ∫_0^∞ u^α e^{-u} du,

after the change of variable u = λy. Hence E[Y] = Γ(α + 1)/(λ Γ(α)) = α/λ. Using the same substitution,

E[Y²] = (λ^α / Γ(α)) ∫_0^∞ y^{α+1} e^{-λy} dy = α(α + 1)/λ²,

so that Var[Y] = α/λ². The notation Y ∼ Gamma(α, λ) should be read as the random variable Y follows a gamma distribution with parameters α and λ.
Exercise 12. Let Y ∼ Gamma(α, λ). Show that the moment generating function for Y is given for t ∈ (-∞, λ) by

M_Y(t) = 1/(1 - t/λ)^α.
2.6.4 Gaussian (Normal) Distribution
A random variable Z is a standard normal (or Gaussian) random variable if the density of Z is specified by

f(z) = (1/√(2π)) e^{-z²/2}.   (2.13)

It is not immediately obvious that (2.13) specifies a probability density. To show that this is the case we need to prove

∫_{-∞}^{∞} (1/√(2π)) e^{-z²/2} dz = 1

or, equivalently, that I = ∫_{-∞}^{∞} e^{-z²/2} dz = √(2π). This is a classic result and so is well worth confirming. Consider

I² = ∫_{-∞}^{∞} e^{-z²/2} dz ∫_{-∞}^{∞} e^{-w²/2} dw = ∬ e^{-(z² + w²)/2} dz dw.

The double integral can be evaluated by a change of variables to polar coordinates. Substituting z = r cos θ, w = r sin θ, and dz dw = r dθ dr, we get

I² = ∫_0^∞ ∫_0^{2π} e^{-r²/2} r dθ dr = 2π ∫_0^∞ r e^{-r²/2} dr = -2π e^{-r²/2} |_0^∞ = 2π.

Taking the square root we get I = √(2π). The result I = √(2π) can also be used to establish the result Γ(1/2) = √π. To prove that this is the case note that

Γ(1/2) = ∫_0^∞ e^{-u} u^{-1/2} du = 2 ∫_0^∞ e^{-z²} dz = √π.
The expected value of Z equals zero because z e^{-z²/2} is integrable and antisymmetric around zero. The variance of Z is given by

Var[Z] = (1/√(2π)) ∫_{-∞}^{∞} z² e^{-z²/2} dz.

Thus

Var[Z] = (1/√(2π)) ∫_{-∞}^{∞} z² e^{-z²/2} dz
       = (1/√(2π)) [ -z e^{-z²/2} |_{-∞}^{∞} + ∫_{-∞}^{∞} e^{-z²/2} dz ]
       = (1/√(2π)) ∫_{-∞}^{∞} e^{-z²/2} dz
       = 1.
If Z is a standard normal random variable then Y = μ + σZ is called a general normal (Gaussian) random variable with parameters μ and σ. The density of Y is given by

f(y|μ, σ) = (1/√(2πσ²)) e^{-(y-μ)²/(2σ²)}.

We obviously have E[Y] = μ and Var[Y] = σ². The notation Y ∼ N(μ, σ²) should be read as the random variable Y follows a normal distribution with mean parameter μ and variance parameter σ². From the definition of Y it follows immediately that a + bY, where a and b are known constants, again has a normal distribution.

Exercise 13. Let Y ∼ N(μ, σ²). What is the distribution of X = a + bY?

Exercise 14. Let Y ∼ N(μ, σ²). Show that the moment generating function of Y is given by

M_Y(t) = e^{μt + σ²t²/2}.

Hint: Consider first the standard normal variable and then apply Property 1.
2.6.5 Weibull Distribution
The Weibull distribution function has the form

F(y) = 1 - exp( -(y/b)^a ),   y > 0.

The Weibull density can be obtained by differentiation as

f(y|a, b) = (a/b) (y/b)^{a-1} exp( -(y/b)^a ).

To calculate the expected value

E[Y] = ∫_0^∞ y a (1/b)^a y^{a-1} exp( -(y/b)^a ) dy

we use the substitutions u = (y/b)^a and du = a b^{-a} y^{a-1} dy. These yield

E[Y] = b ∫_0^∞ u^{1/a} e^{-u} du = b Γ( (a + 1)/a ).

In a similar manner, it is straightforward to verify that

E[Y²] = b² Γ( (a + 2)/a ),

and thus

Var[Y] = b² [ Γ( (a + 2)/a ) - Γ²( (a + 1)/a ) ].
2.6.6 Beta Distribution
A random variable is said to have a beta distribution if its density is given by

f(y|a, b) = (1/B(a, b)) y^{a-1} (1 - y)^{b-1},   0 < y < 1.

Here the function

B(a, b) = ∫_0^1 u^{a-1} (1 - u)^{b-1} du

is the beta function, and is related to the gamma function through

B(a, b) = Γ(a)Γ(b)/Γ(a + b).

Proceeding in the usual manner, we can show that

E[Y] = a/(a + b),   Var[Y] = ab / ( (a + b)² (a + b + 1) ).
2.6.7 Chi-square Distribution
Let Z ∼ N(0, 1), and let Y = Z². Then the cumulative distribution function is

F_Y(y) = P(Y ≤ y) = P(Z² ≤ y) = P(-√y ≤ Z ≤ √y) = F_Z(√y) - F_Z(-√y),

so that by differentiating in y we arrive at the density

f_Y(y) = (1/(2√y)) [ f_Z(√y) + f_Z(-√y) ] = (1/√(2πy)) e^{-y/2},

in which we recognize Gamma(1/2, 1/2). Suppose that Y = ∑_{i=1}^n Z_i², where the Z_i ∼ N(0, 1) for i = 1, ..., n are independent. From results on the sum of independent Gamma random variables, Y ∼ Gamma(n/2, 1/2). This density has the form

f_Y(y|n) = e^{-y/2} y^{n/2 - 1} / ( 2^{n/2} Γ(n/2) ),   y > 0   (2.14)

and is referred to as a chi-squared distribution on n degrees of freedom. The notation Y ∼ Chi(n) should be read as the random variable Y follows a chi-squared distribution with n degrees of freedom. Later we will show that if X ∼ Chi(u) and Y ∼ Chi(v), it follows that X + Y ∼ Chi(u + v).
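A short R sketch (an illustration added here, with n = 5 chosen arbitrarily) confirms the agreement between the simulated sum of n squared standard normals and the chi-squared density (2.14).

# Sum of squared standard normals vs. the chi-squared density
set.seed(1)
n <- 5
y <- replicate(10000, sum(rnorm(n)^2))
hist(y, breaks = 50, freq = FALSE, main = "Sum of n squared standard normals")
curve(dchisq(x, df = n), add = TRUE)   # chi-squared density with n degrees of freedom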
2.6.8 The Bivariate Normal Distribution
Suppose that U and V are independent random variables, each with the standard normal distribution. We will need the following parameters: μ_X, μ_Y, σ_X > 0, σ_Y > 0, ρ ∈ [-1, 1]. Now let X and Y be new random variables defined by

X = μ_X + σ_X U,
Y = μ_Y + ρ σ_Y U + σ_Y √(1 - ρ²) V.

Using basic properties of mean, variance, covariance, and the normal distribution, satisfy yourself of the following.

Property 3. The following properties hold:
1. X is normally distributed with mean μ_X and standard deviation σ_X,
2. Y is normally distributed with mean μ_Y and standard deviation σ_Y,
3. Corr(X, Y) = ρ,
4. X and Y are independent if and only if ρ = 0.
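The construction above is easy to verify by simulation. The R sketch below (with parameter values chosen only for illustration) generates (X, Y) exactly as defined and checks the means, standard deviations, and correlation.

# Building a correlated bivariate normal pair from independent U, V
set.seed(1)
muX <- 1; muY <- -2; sigmaX <- 2; sigmaY <- 0.5; rho <- 0.7
u <- rnorm(10000); v <- rnorm(10000)           # independent standard normals
x <- muX + sigmaX * u
y <- muY + rho * sigmaY * u + sigmaY * sqrt(1 - rho^2) * v
c(mean(x), sd(x), mean(y), sd(y), cor(x, y))   # close to muX, sigmaX, muY, sigmaY, rho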
The inverse transformation is

u = (x - μ_X)/σ_X,
v = (y - μ_Y)/(σ_Y √(1 - ρ²)) - ρ(x - μ_X)/(σ_X √(1 - ρ²)),

so that the Jacobian of this transformation is

∂(u, v)/∂(x, y) = 1/(σ_X σ_Y √(1 - ρ²)).

Since U and V are independent standard normal variables, their joint probability density function is

g(u, v) = (1/(2π)) e^{-(u² + v²)/2}.

Using the bivariate change of variables formula, the joint density of (X, Y) is

f(x, y) = 1/(2π σ_X σ_Y √(1 - ρ²)) exp{ -(x - μ_X)²/(2σ_X²(1 - ρ²)) + ρ(x - μ_X)(y - μ_Y)/(σ_X σ_Y (1 - ρ²)) - (y - μ_Y)²/(2σ_Y²(1 - ρ²)) }.
Bivariate Normal Conditional Distributions
In the last section we derived the joint probability density function f of the bivariate normal random variables X and Y. The marginal densities are known. Then,

f_{Y|X}(y|x) = f_{Y,X}(y, x) / f_X(x)
            = 1/√(2πσ_Y²(1 - ρ²)) exp{ -( y - (μ_Y + ρσ_Y(x - μ_X)/σ_X) )² / (2σ_Y²(1 - ρ²)) }.

Then the conditional distribution of Y given X = x is also Gaussian, with

E(Y|X = x) = μ_Y + ρσ_Y (x - μ_X)/σ_X,
Var(Y|X = x) = σ_Y²(1 - ρ²).
2.6.9 The Multivariate Normal Distribution
Let Σ denote the 2 × 2 symmetric matrix

Σ = [ σ_X²       ρσ_Xσ_Y
      ρσ_Xσ_Y    σ_Y²    ].

Then

det Σ = σ_X² σ_Y² - (ρσ_Xσ_Y)² = σ_X² σ_Y² (1 - ρ²)

and

Σ^{-1} = 1/(1 - ρ²) [ 1/σ_X²           -ρ/(σ_Xσ_Y)
                      -ρ/(σ_Xσ_Y)      1/σ_Y²     ].

Hence the bivariate normal distribution of (X, Y) can be written in matrix notation as

f_{(X,Y)}(x, y) = 1/(2π √(det Σ)) exp{ -(1/2) (x - μ_X, y - μ_Y) Σ^{-1} (x - μ_X, y - μ_Y)^T }.

Let Y = (Y_1, ..., Y_p) be a random vector. Let E(Y_i) = μ_i, i = 1, ..., p, and define the p-length vector μ = (μ_1, ..., μ_p). Define the p × p matrix Σ through its elements Cov(Y_i, Y_j) for i, j = 1, ..., p. Then the random vector Y has a p-dimensional multivariate Gaussian distribution if its density function is specified by

f_Y(y) = 1/( (2π)^{p/2} |Σ|^{1/2} ) exp{ -(1/2) (y - μ)^T Σ^{-1} (y - μ) }.   (2.15)

The notation Y ∼ MVN_p(μ, Σ) should be read as the random variable Y follows a multivariate Gaussian (normal) distribution with p-vector mean μ and p × p variance-covariance matrix Σ.
2.7 Distributions: further properties
2.7.1 Sum of Independent Random Variables: special cases
Poisson variables
Suppose X ∼ Pois(λ) and Y ∼ Pois(μ). Assume that X and Y are independent. Then

P(X + Y = n) = ∑_{k=0}^n P(X = k, Y = n - k)
             = ∑_{k=0}^n P(X = k) P(Y = n - k)
             = ∑_{k=0}^n ( e^{-λ} λ^k / k! ) ( e^{-μ} μ^{n-k} / (n - k)! )
             = ( e^{-(λ+μ)} / n! ) ∑_{k=0}^n ( n! / (k!(n - k)!) ) λ^k μ^{n-k}
             = e^{-(λ+μ)} (λ + μ)^n / n!.

That is, X + Y ∼ Pois(λ + μ).
Binomial Random Variables
We seek the distribution of Y + X, where Y ∼ Bin(n, θ) and X ∼ Bin(m, θ). Since X + Y models the situation where the total number of trials is fixed at n + m and the probability of a success in a single trial equals θ, without performing any calculations we expect to find that X + Y ∼ Bin(n + m, θ). To verify that, note that X = X_1 + ··· + X_m, where the X_i are independent Bernoulli variables with parameter θ, while Y = Y_1 + ··· + Y_n, where the Y_i are also independent Bernoulli variables with parameter θ. Assuming that the X_i's are independent of the Y_i's, we obtain that X + Y is the sum of n + m independent Bernoulli random variables with parameter θ, i.e. X + Y has a Bin(n + m, θ) distribution.
Gamma, Chi-square, and Exponential Random Variables
Let X ∼ Gamma(α, λ) and Y ∼ Gamma(β, λ) be independent. Then the moment generating function of X + Y is given as

M_{X+Y}(t) = M_X(t) M_Y(t) = 1/(1 - t/λ)^α · 1/(1 - t/λ)^β = 1/(1 - t/λ)^{α+β}.

But this is the moment generating function of a Gamma random variable distributed as Gamma(α + β, λ). The result X + Y ∼ Chi(u + v), where X ∼ Chi(u) and Y ∼ Chi(v), follows as a corollary.
Let Y_1, ..., Y_n be n independent exponential random variables, each with parameter λ. Then Z = Y_1 + Y_2 + ··· + Y_n is a Gamma(n, λ) random variable. To see that this is indeed the case, write Y_i ∼ Exp(λ), or alternatively Y_i ∼ Gamma(1, λ). Then Y_1 + Y_2 ∼ Gamma(2, λ), and by induction ∑_{i=1}^n Y_i ∼ Gamma(n, λ).
Gaussian Random Variables
Let X ∼ N(μ_X, σ_X²) and Y ∼ N(μ_Y, σ_Y²) be independent. Then the moment generating function of X + Y is given by

M_{X+Y}(t) = M_X(t) M_Y(t) = e^{μ_X t + σ_X² t²/2} e^{μ_Y t + σ_Y² t²/2} = e^{(μ_X + μ_Y)t + (σ_X² + σ_Y²)t²/2},

which proves that X + Y ∼ N(μ_X + μ_Y, σ_X² + σ_Y²).
2.7.2 Common Distributions: Summarizing Tables
Discrete Distributions
Bernoulli(θ)
  pmf: P(Y = y|θ) = θ^y (1 - θ)^{1-y}, y = 0, 1; 0 ≤ θ ≤ 1
  mean/variance: E[Y] = θ, Var[Y] = θ(1 - θ)
  mgf: M_Y(t) = θe^t + (1 - θ)

Binomial(n, θ)
  pmf: P(Y = y|θ) = \binom{n}{y} θ^y (1 - θ)^{n-y}, y = 0, 1, ..., n; 0 ≤ θ ≤ 1
  mean/variance: E[Y] = nθ, Var[Y] = nθ(1 - θ)
  mgf: M_Y(t) = [θe^t + (1 - θ)]^n

Discrete uniform(N)
  pmf: P(Y = y|N) = 1/N, y = 1, 2, ..., N
  mean/variance: E[Y] = (N + 1)/2, Var[Y] = (N + 1)(N - 1)/12
  mgf: M_Y(t) = (1/N) e^t (1 - e^{Nt})/(1 - e^t)

Geometric(θ)
  pmf: P(Y = y|θ) = θ(1 - θ)^{y-1}, y = 1, 2, ...; 0 ≤ θ ≤ 1
  mean/variance: E[Y] = 1/θ, Var[Y] = (1 - θ)/θ²
  mgf: M_Y(t) = θe^t/[1 - (1 - θ)e^t], t < -log(1 - θ)
  notes: The random variable X = Y - 1 is NegBin(1, θ).

Hypergeometric(b, w, n)
  pmf: P(Y = y|b, w, n) = \binom{w}{y}\binom{b-w}{n-y}/\binom{b}{n}, y = 0, 1, ..., n,
       with n - (b - w) ≤ y ≤ w; b, w, n ≥ 0
  mean/variance: E[Y] = nw/b, Var[Y] = nw(b - w)(b - n)/(b²(b - 1))

Negative binomial(r, θ)
  pmf: P(Y = y|r, θ) = \binom{r+y-1}{y} θ^r (1 - θ)^y, y = 0, 1, 2, ...; 0 < θ ≤ 1
  mean/variance: E[Y] = r(1 - θ)/θ, Var[Y] = r(1 - θ)/θ²
  mgf: M_Y(t) = θ^r/(1 - (1 - θ)e^t)^r, t < -log(1 - θ)
  notes: An alternative form of the pmf, used in the derivation in our notes, is given by
       P(N = n|r, θ) = \binom{n-1}{r-1} θ^r (1 - θ)^{n-r}, n = r, r + 1, ...,
       where the random variable N = Y + r. The negative binomial can also be derived as a mixture of Poisson random variables.

Poisson(λ)
  pmf: P(Y = y|λ) = λ^y e^{-λ}/y!, y = 0, 1, 2, ...; 0 < λ
  mean/variance: E[Y] = λ, Var[Y] = λ
  mgf: M_Y(t) = e^{λ(e^t - 1)}
Continuous Distributions
Uniform U(a, b)
  pdf: f(y|a, b) = 1/(b - a), a < y < b
  mean/variance: E[Y] = (b + a)/2, Var[Y] = (b - a)²/12
  mgf: M_Y(t) = (e^{bt} - e^{at})/((b - a)t)
  notes: A uniform distribution with a = 0 and b = 1 is a special case of the beta distribution (with α = β = 1).

Exponential E(λ)
  pdf: f(y|λ) = λe^{-λy}, y > 0, λ > 0
  mean/variance: E[Y] = 1/λ, Var[Y] = 1/λ²
  mgf: M_Y(t) = 1/(1 - t/λ)
  notes: Special case of the gamma distribution. X = Y^{1/a} is Weibull, X = √(2Y) is Rayleigh, X = -log(Y/λ) is Gumbel.

Gamma G(α, λ)
  pdf: f(y|α, λ) = λ^α e^{-λy} y^{α-1}/Γ(α), y > 0, α, λ > 0
  mean/variance: E[Y] = α/λ, Var[Y] = α/λ²
  mgf: M_Y(t) = 1/(1 - t/λ)^α
  notes: Includes the exponential (α = 1) and chi squared (α = n/2, λ = 1/2).

Normal N(μ, σ²)
  pdf: f(y|μ, σ²) = (1/√(2πσ²)) e^{-(y-μ)²/(2σ²)}, σ > 0
  mean/variance: E[Y] = μ, Var[Y] = σ²
  mgf: M_Y(t) = e^{μt + σ²t²/2}
  notes: Often called the Gaussian distribution.

Transforms
The generating functions of the discrete and continuous random variables discussed thus far are given in Table 2.1.
Distrib.      | p.g.f.              | m.g.f.                  | ch.f.
Bin(n, θ)     | (θt + θ̄)^n          | (θe^t + θ̄)^n            | (θe^{it} + θ̄)^n
Geo(θ)        | θt/(1 - θ̄t)         | θ/(e^{-t} - θ̄)          | θ/(e^{-it} - θ̄)
NegBin(r, θ)  | θ^r (1 - θ̄t)^{-r}   | θ^r (1 - θ̄e^t)^{-r}     | θ^r (1 - θ̄e^{it})^{-r}
Poi(λ)        | e^{-λ(1 - t)}        | e^{λ(e^t - 1)}          | e^{λ(e^{it} - 1)}
Unif(α, β)    |                      | (e^{βt} - e^{αt})/((β - α)t) | (e^{iβt} - e^{iαt})/((β - α)it)
Exp(λ)        |                      | (1 - t/λ)^{-1}          | (1 - it/λ)^{-1}
Ga(c, λ)      |                      | (1 - t/λ)^{-c}          | (1 - it/λ)^{-c}
N(μ, σ²)      |                      | exp(μt + σ²t²/2)        | exp(iμt - σ²t²/2)

Table 2.1: Transforms of distributions. In the formulas θ̄ = 1 - θ.
Chapter 3
Likelihood
3.1 Maximum Likelihood Estimation
Let x be a realization of the random variable X with probability density f_X(x|θ), where θ = (θ_1, θ_2, ..., θ_m)^T is a vector of m unknown parameters to be estimated. The set of allowable values for θ, denoted by Θ, is called the parameter space. Define the likelihood function

l(θ|x) = f_X(x|θ).   (3.1)

It is crucial to stress that the argument of f_X(x|θ) is x, but the argument of l(θ|x) is θ. It is therefore convenient to view the likelihood function l(θ) as the probability of the observed data x considered as a function of θ. Usually it is convenient to work with the natural logarithm of the likelihood, called the log-likelihood and denoted by log l(θ|x).
When θ ∈ R¹ we can define the score function as the first derivative of the log-likelihood,

S(θ) = ∂/∂θ log l(θ).

The maximum likelihood estimate (MLE) θ̂ of θ is the solution to the score equation

S(θ) = 0.
At the maximum, the second partial derivative of the log-likelihood is negative, so we define the curvature at θ̂ as I(θ̂), where

I(θ) = -∂²/∂θ² log l(θ).

We can check that a solution θ̂ of the equation S(θ) = 0 is actually a maximum by checking that I(θ̂) > 0. A large curvature I(θ̂) is associated with a tight or strong peak, intuitively indicating less uncertainty about θ.
The likelihood function l(θ|x) supplies an order of preference or plausibility among possible values of θ based on the observed x. It ranks the plausibility of possible values of θ by how probable they make the observed x. If P(x|θ = θ_1) > P(x|θ = θ_2), then the observed x makes θ = θ_1 more plausible than θ = θ_2, and consequently, from (3.1), l(θ_1|x) > l(θ_2|x). The likelihood ratio l(θ_1|x)/l(θ_2|x) = f(x|θ_1)/f(x|θ_2) is a measure of the plausibility of θ_1 relative to θ_2 based on the observed x. The relative likelihood l(θ_1|x)/l(θ_2|x) = k means that the observed value x will occur k times more frequently in repeated samples from the population defined by the value θ_1 than from the population defined by θ_2. Since only ratios of likelihoods are meaningful, it is convenient to standardize the likelihood with respect to its maximum.
When the random variables X_1, ..., X_n are mutually independent we can write the joint density as

    f_X(x) = ∏_{j=1}^n f_{X_j}(x_j),

where x = (x_1, ..., x_n)^T is a realization of the random vector X = (X_1, ..., X_n)^T, and the likelihood function becomes

    l(θ|x) = ∏_{j=1}^n f_{X_j}(x_j|θ).

When the densities f_{X_j}(x_j) are identical, we unambiguously write f(x_j).
Example 2 (Bernoulli Trials). Consider n independent Bernoulli trials. The jth observation is either a success or a failure, coded x_j = 1 and x_j = 0 respectively, and

    P(X_j = x_j) = θ^{x_j} (1 - θ)^{1-x_j}

for j = 1, ..., n. The vector of observations x = (x_1, x_2, ..., x_n)^T is a sequence of ones and zeros, and is a realization of the random vector X = (X_1, X_2, ..., X_n)^T. As the Bernoulli outcomes are assumed to be independent we can write the joint probability mass function as the product of the marginal probabilities, that is,

    l(θ) = ∏_{j=1}^n P(X_j = x_j)
         = ∏_{j=1}^n θ^{x_j} (1 - θ)^{1-x_j}
         = θ^{∑ x_j} (1 - θ)^{n - ∑ x_j}
         = θ^r (1 - θ)^{n-r},

where r = ∑_{j=1}^n x_j is the number of observed successes (1s) in the vector x. The log-likelihood function is then

    log l(θ) = r log θ + (n - r) log(1 - θ),

and the score function is

    S(θ) = ∂/∂θ log l(θ) = r/θ - (n - r)/(1 - θ).

Solving S(θ̂) = 0 we get θ̂ = r/n. We also have

    I(θ) = r/θ² + (n - r)/(1 - θ)² > 0,

guaranteeing that θ̂ is the MLE. Each X_i is a Bernoulli random variable with expected value E(X_i) = θ and variance Var(X_i) = θ(1 - θ). The MLE θ̂(X) is itself a random variable and has expected value

    E(θ̂) = E(r/n) = E( ∑_{i=1}^n X_i / n ) = (1/n) ∑_{i=1}^n E(X_i) = (1/n) ∑_{i=1}^n θ = θ.

If an estimator has on average the value of the parameter that it is intended to estimate, i.e. if E(θ̂) = θ, then we call it unbiased. From the above calculation it follows that θ̂(X) is an unbiased estimator of θ. The variance of θ̂(X) is

    Var(θ̂) = Var( ∑_{i=1}^n X_i / n ) = (1/n²) ∑_{i=1}^n Var(X_i) = (1/n²) ∑_{i=1}^n θ(1 - θ) = θ(1 - θ)/n.   □
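As a quick numerical check of Example 2 (a minimal sketch; the simulated sample below is hypothetical), the closed-form MLE r/n agrees with a direct numerical maximisation of the log-likelihood in R:

    # Bernoulli log-likelihood for a 0/1 sample x
    loglik <- function(theta, x) {
      r <- sum(x); n <- length(x)
      r * log(theta) + (n - r) * log(1 - theta)
    }

    set.seed(1)
    x <- rbinom(50, size = 1, prob = 0.3)   # hypothetical sample, true theta = 0.3

    mean(x)                                                        # closed-form MLE r/n
    optimize(loglik, interval = c(0.001, 0.999), x = x, maximum = TRUE)$maximum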
Example 3 (Binomial sampling). The number of successes in n Bernoulli trials is a random variable R taking values r = 0, 1, ..., n with probability mass function

    P(R = r) = \binom{n}{r} θ^r (1 - θ)^{n-r}.

This is exactly the same sampling scheme as in the previous example, except that instead of observing the sequence x we only observe the total number of successes r. Hence the likelihood function has the form

    l_R(θ|r) = \binom{n}{r} θ^r (1 - θ)^{n-r}.

The relevant mathematical calculations are as follows:

    log l_R(θ|r) = log \binom{n}{r} + r log θ + (n - r) log(1 - θ),
    S(θ) = r/θ - (n - r)/(1 - θ),          θ̂ = r/n,
    I(θ) = r/θ² + (n - r)/(1 - θ)² > 0,
    E(θ̂) = E(R)/n = nθ/n = θ               (unbiased),
    Var(θ̂) = Var(R)/n² = nθ(1 - θ)/n² = θ(1 - θ)/n.   □
Example 4 (Prevalence of a Genotype). Geneticists interested in the prevalence of a certain genotype observe that the genotype makes its first appearance in the 22nd subject analysed. If we assume that the subjects are independent, the likelihood function can be computed from the geometric distribution as l(θ) = θ(1 - θ)^{n-1}. The score function is then S(θ) = 1/θ - (n - 1)/(1 - θ). Setting S(θ̂) = 0 we get θ̂ = 1/n = 1/22. Moreover, I(θ) = 1/θ² + (n - 1)/(1 - θ)² is greater than zero for all θ, implying that θ̂ is the MLE.

Suppose instead that the geneticists had planned to stop sampling once they observed r = 10 subjects with the specified genotype, and the tenth subject with the genotype was the 100th subject analysed overall. The likelihood of θ can then be computed from the negative binomial distribution as

    l(θ) = \binom{n - 1}{r - 1} θ^r (1 - θ)^{n-r}

for n = 100, r = 10. The usual calculation will confirm that θ̂ = r/n is the MLE.   □
Example 5 (Radioactive Decay). In this classic set of data Rutherford and Geiger counted the number of scintillations in 72-second intervals caused by the radioactive decay of a quantity of the element polonium. Altogether there were 10097 scintillations during 2608 such intervals:

    Count     0    1    2    3    4    5    6    7
    Observed  57   203  383  525  532  408  273  139

    Count     8    9    10   11   12   13   14
    Observed  45   27   10   4    1    0    1

The Poisson probability mass function with mean parameter λ is

    f_X(x|λ) = λ^x exp(-λ)/x!.

The likelihood function equals

    l(λ) = ∏_{i=1}^n λ^{x_i} exp(-λ)/x_i! = λ^{∑ x_i} exp(-nλ)/∏ x_i!.

The relevant mathematical calculations are

    log l(λ) = (∑ x_i) log λ - nλ - ∑ log(x_i!),
    S(λ) = ∑ x_i/λ - n,          λ̂ = x̄,
    I(λ) = ∑ x_i/λ² > 0,

implying λ̂ is the MLE. Also E(λ̂) = (1/n) ∑ E(x_i) = λ, so λ̂ is an unbiased estimator. Next, Var(λ̂) = (1/n²) ∑ Var(x_i) = λ/n. It is always useful to compare the fitted values from a model against the observed values.

    i     0   1    2    3    4    5    6    7    8   9   10  11  12  13  14
    O_i   57  203  383  525  532  408  273  139  45  27  10  4   1   0   1
    E_i   54  211  407  525  508  393  254  140  68  29  11  4   1   0   0
    O-E   +3  -8   -24  0    +24  +15  +19  -1   -23 -2  -1  0   0   0   +1

The Poisson law agrees with the observed variation to within about one-twentieth of its range.   □
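The fitted row E_i above can be reproduced directly from the observed counts; a minimal R sketch is:

    counts <- 0:14
    obs    <- c(57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 10, 4, 1, 0, 1)

    n      <- sum(obs)                     # 2608 intervals
    lambda <- sum(counts * obs) / n        # MLE = sample mean, about 3.87

    expected <- n * dpois(counts, lambda)  # fitted Poisson counts
    round(expected)
    round(obs - expected)                  # compare with the residual row above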
Example 6 (Exponential distribution). Suppose the random variables X_1, ..., X_n are i.i.d. Exp(λ). Then

    l(λ) = ∏_{i=1}^n λ exp(-λ x_i) = λ^n exp( -λ ∑ x_i ),
    log l(λ) = n log λ - λ ∑ x_i,
    S(λ) = n/λ - ∑ x_i,          λ̂ = n/∑ x_i,
    I(λ) = n/λ² > 0.
Exercise 15. Demonstrate that the expectation and variance of λ̂ are given as follows:

    E[λ̂] = n λ/(n - 1),          Var[λ̂] = n² λ² / ( (n - 1)²(n - 2) ).

Hint: find the probability distribution of Z = ∑_{i=1}^n X_i, where X_i ∼ Exp(λ).

Exercise 16. Propose the alternative estimator λ̃ = ((n - 1)/n) λ̂. Show that λ̃ is an unbiased estimator of λ with variance

    Var[λ̃] = λ²/(n - 2).

As this example demonstrates, maximum likelihood estimation does not automatically produce unbiased estimates. If it is thought that this property is (in some sense) desirable, then some adjustment to the MLE, usually in the form of scaling, may be required.
Example 7 (Gaussian Distribution). Consider data X_1, X_2, ..., X_n distributed as N(μ, σ), where here σ denotes the variance. Then the likelihood function is

    l(μ, σ) = (2πσ)^{-n/2} exp( -∑_{i=1}^n (x_i - μ)²/(2σ) )

and the log-likelihood function is

    log l(μ, σ) = -(n/2) log(2π) - (n/2) log σ - (1/(2σ)) ∑_{i=1}^n (x_i - μ)².     (3.2)

Unknown mean and known variance. As σ is known we treat this parameter as a constant when differentiating wrt μ. Then

    S(μ) = (1/σ) ∑_{i=1}^n (x_i - μ),     μ̂ = (1/n) ∑_{i=1}^n x_i,     and I(μ) = n/σ > 0.

Also, E[μ̂] = nμ/n = μ, and so the MLE of μ is unbiased. Finally,

    Var[μ̂] = (1/n²) Var( ∑_{i=1}^n x_i ) = σ/n = (E[I(μ)])^{-1}.

Known mean and unknown variance. Differentiating (3.2) wrt σ returns

    S(σ) = -n/(2σ) + (1/(2σ²)) ∑_{i=1}^n (x_i - μ)²,

and setting S(σ) = 0 implies

    σ̂ = (1/n) ∑_{i=1}^n (x_i - μ)².

Differentiating again, and multiplying by -1, yields

    I(σ) = -n/(2σ²) + (1/σ³) ∑_{i=1}^n (x_i - μ)².

Clearly σ̂ is the MLE since

    I(σ̂) = n/(2σ̂²) > 0.

Define

    Z_i = (X_i - μ)/√σ,

so that Z_i ∼ N(0, 1). From the appendix on probability,

    ∑_{i=1}^n Z_i² ∼ χ²_n,

implying E[∑ Z_i²] = n and Var[∑ Z_i²] = 2n. The MLE

    σ̂ = (σ/n) ∑_{i=1}^n Z_i².

Then

    E[σ̂] = E[ (σ/n) ∑_{i=1}^n Z_i² ] = σ,

and

    Var[σ̂] = (σ/n)² Var( ∑_{i=1}^n Z_i² ) = 2σ²/n.   □

Our treatment of the two parameters of the Gaussian distribution in the last example was to (i) fix the variance and estimate the mean using maximum likelihood; and then (ii) fix the mean and estimate the variance using maximum likelihood. In practice we would like to consider the simultaneous estimation of these parameters. In the next section of these notes we extend MLE to multiple-parameter estimation.
3.2 Multi-parameter Estimation

Suppose that a statistical model specifies that the data y has a probability distribution f(y; θ, φ) depending on two unknown parameters θ and φ. In this case the likelihood function is a function of the two variables θ and φ and, having observed the value y, is defined as l(θ, φ) = f(y; θ, φ), with log-likelihood log l(θ, φ). The MLE of (θ, φ) is a value (θ̂, φ̂) for which l(θ, φ), or equivalently log l(θ, φ), attains its maximum value.

Define S_1(θ, φ) = ∂ log l/∂θ and S_2(θ, φ) = ∂ log l/∂φ. The MLEs (θ̂, φ̂) can be obtained by solving the pair of simultaneous equations

    S_1(θ, φ) = 0,
    S_2(θ, φ) = 0.

Let us consider the matrix I(θ, φ):

    I(θ, φ) = ( I_11(θ, φ)   I_12(θ, φ)
                I_21(θ, φ)   I_22(θ, φ) )
            = ( -∂² log l/∂θ²      -∂² log l/∂θ∂φ
                -∂² log l/∂φ∂θ     -∂² log l/∂φ²  ).

The conditions for a value (θ_0, φ_0) satisfying S_1(θ_0, φ_0) = 0 and S_2(θ_0, φ_0) = 0 to be a MLE are that

    I_11(θ_0, φ_0) > 0,     I_22(θ_0, φ_0) > 0,

and

    det I(θ_0, φ_0) = I_11(θ_0, φ_0) I_22(θ_0, φ_0) - I_12(θ_0, φ_0)² > 0.

This is equivalent to requiring that both eigenvalues of the matrix I(θ_0, φ_0) be positive.
Example 8 (Gaussian distribution). Let X_1, X_2, ..., X_n be iid observations from a N(μ, σ²) density in which both μ and σ² are unknown. The log-likelihood is

    log l(μ, σ²) = ∑_{i=1}^n log( (2πσ²)^{-1/2} exp[ -(x_i - μ)²/(2σ²) ] )
                 = ∑_{i=1}^n ( -(1/2) log(2π) - (1/2) log σ² - (x_i - μ)²/(2σ²) )
                 = -(n/2) log(2π) - (n/2) log σ² - (1/(2σ²)) ∑_{i=1}^n (x_i - μ)².

Hence, writing v = σ²,

    S_1(μ, v) = ∂ log l/∂μ = (1/v) ∑_{i=1}^n (x_i - μ) = 0,

which implies that

    μ̂ = (1/n) ∑_{i=1}^n x_i = x̄.     (3.3)

Also

    S_2(μ, v) = ∂ log l/∂v = -n/(2v) + (1/(2v²)) ∑_{i=1}^n (x_i - μ)² = 0

implies that

    σ̂² = v̂ = (1/n) ∑_{i=1}^n (x_i - μ̂)² = (1/n) ∑_{i=1}^n (x_i - x̄)².     (3.4)

Calculating second derivatives and multiplying by -1 gives that I(μ, v) equals

    I(μ, v) = ( n/v                               (1/v²) ∑_{i=1}^n (x_i - μ)
                (1/v²) ∑_{i=1}^n (x_i - μ)        -n/(2v²) + (1/v³) ∑_{i=1}^n (x_i - μ)²  ).

Hence I(μ̂, v̂) is given by

    ( n/v̂      0
      0        n/(2v̂²) ).

Clearly both diagonal terms are positive and the determinant is positive, and so (μ̂, v̂) are indeed the MLEs of (μ, v).

Go back to equation (3.3): X̄ ∼ N(μ, v/n). Clearly E(X̄) = μ (unbiased) and Var(X̄) = v/n. Go back to equation (3.4). From Lemma 1, which is proven below, we have

    n v̂/v ∼ χ²_{n-1},

so that

    E( n v̂/v ) = n - 1,     E(v̂) = ((n - 1)/n) v.

Instead, propose the (unbiased) estimator of σ²

    S² = ṽ = (n/(n - 1)) v̂ = (1/(n - 1)) ∑_{i=1}^n (x_i - x̄)².     (3.5)

Observe that

    E(ṽ) = (n/(n - 1)) E(v̂) = (n/(n - 1)) ((n - 1)/n) v = v,

and ṽ is unbiased as suggested. We can easily show that

    Var(ṽ) = 2v²/(n - 1).
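A short simulation (a minimal sketch; the parameter values and sample size below are hypothetical) illustrates that v̂ of (3.4) is biased downward while S² of (3.5) is unbiased, with variance close to 2v²/(n - 1):

    set.seed(2)
    mu <- 15; v <- 36; n <- 10          # hypothetical values

    vhat <- replicate(10000, { x <- rnorm(n, mu, sqrt(v)); mean((x - mean(x))^2) })
    s2   <- vhat * n / (n - 1)          # the unbiased version (3.5)

    mean(vhat)   # close to (n-1)/n * v = 32.4
    mean(s2)     # close to v = 36
    var(s2)      # close to 2 v^2 / (n-1) = 288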
Lemma 1 (Joint distribution of the sample mean and sample variance). If X_1, ..., X_n are iid N(μ, v), then the sample mean X̄ and the sample variance S² are independent. Also, X̄ is distributed N(μ, v/n) and (n - 1)S²/v is a chi-squared random variable with n - 1 degrees of freedom.

Proof. Define

    W = ∑_{i=1}^n (X_i - X̄)² = ∑_{i=1}^n (X_i - μ)² - n(X̄ - μ)²,

so that

    W/v + (X̄ - μ)²/(v/n) = ∑_{i=1}^n (X_i - μ)²/v.

The RHS is the sum of n independent squared standard normal random variables, and so is distributed χ²_n. Also, X̄ ∼ N(μ, v/n), therefore (X̄ - μ)²/(v/n) is the square of a standard normal and so is distributed χ²_1. These chi-squared random variables have moment generating functions (1 - 2t)^{-n/2} and (1 - 2t)^{-1/2} respectively. Next, W/v and (X̄ - μ)²/(v/n) are independent (for jointly normal random variables zero covariance implies independence):

    Cov(X_i - X̄, X̄) = Cov(X_i, X̄) - Cov(X̄, X̄)
                     = Cov( X_i, (1/n) ∑_j X_j ) - Var(X̄)
                     = (1/n) ∑_j Cov(X_i, X_j) - v/n
                     = v/n - v/n = 0.

But Cov(X_i - X̄, X̄ - μ) = Cov(X_i - X̄, X̄) = 0, hence

    ∑_i Cov(X_i - X̄, X̄ - μ) = Cov( ∑_i (X_i - X̄), X̄ - μ ) = 0.

As the moment generating function of a sum of independent random variables is equal to the product of their individual moment generating functions, we see that

    E[ e^{t(W/v)} ] (1 - 2t)^{-1/2} = (1 - 2t)^{-n/2},

so

    E[ e^{t(W/v)} ] = (1 - 2t)^{-(n-1)/2}.

But (1 - 2t)^{-(n-1)/2} is the moment generating function of a χ² random variable with n - 1 degrees of freedom, and the moment generating function uniquely characterizes the distribution of W/v = (n - 1)S²/v.
Suppose that a statistical model specifies that the data x has a probability distribution f(x; θ) depending on a vector of m unknown parameters θ = (θ_1, ..., θ_m). In this case the likelihood function is a function of the m parameters θ_1, ..., θ_m and, having observed the value of x, is defined as l(θ) = f(x; θ) with log-likelihood log l(θ). The MLE of θ is a value θ̂ for which l(θ), or equivalently log l(θ), attains its maximum value. For r = 1, ..., m define S_r(θ) = ∂ log l/∂θ_r. Then we can (usually) find the MLE θ̂ by solving the set of m simultaneous equations S_r(θ) = 0 for r = 1, ..., m. The matrix I(θ) is defined to be the m × m matrix whose (r, s) element is I_rs = -∂² log l/∂θ_r ∂θ_s. The conditions for a value θ̂ satisfying S_r(θ̂) = 0 for r = 1, ..., m to be a MLE are that all the eigenvalues of the matrix I(θ̂) are positive.
3.3 The Invariance Principle

How do we deal with parameter transformations? We will assume a one-to-one transformation, but the idea applies generally. Consider a binomial sample with n = 10 independent trials resulting in x = 8 successes. The likelihood ratio of θ_1 = 0.8 versus θ_2 = 0.3 is

    l(θ_1 = 0.8)/l(θ_2 = 0.3) = θ_1^8 (1 - θ_1)² / ( θ_2^8 (1 - θ_2)² ) = 208.7,

that is, given the data, θ = 0.8 is about 200 times more likely than θ = 0.3.

Suppose we are interested in expressing θ on the logit scale as

    ψ ≡ log{ θ/(1 - θ) };

then intuitively our relative information about ψ_1 = log(0.8/0.2) = 1.39 versus ψ_2 = log(0.3/0.7) = -0.85 should be

    L*(ψ_1)/L*(ψ_2) = l(θ_1)/l(θ_2) = 208.7.

That is, our information should be invariant to the choice of parameterization. (For the purposes of this example we are not too concerned about how to calculate L*(ψ).)

Theorem 3.3.1 (Invariance of the MLE). If g is a one-to-one function and θ̂ is the MLE of θ, then g(θ̂) is the MLE of g(θ).

Proof. This is trivially true: letting ψ = g(θ), f{y | g^{-1}(ψ)} is maximized in ψ exactly when ψ = g(θ̂). When g is not one-to-one the discussion becomes more subtle, but we simply choose to define ĝ_MLE(θ) = g(θ̂).

It seems intuitive that if θ̂ is most likely for θ and our knowledge (data) remains unchanged, then g(θ̂) is most likely for g(θ). In fact, we would find it strange if θ̂ were an estimate of θ but θ̂² were not an estimate of θ². In the binomial example with n = 10 and x = 8 we get θ̂ = 0.8, so the MLE of g(θ) = θ/(1 - θ) is

    g(θ̂) = θ̂/(1 - θ̂) = 0.8/0.2 = 4.
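A small numerical illustration of the invariance property (a sketch; the optimisation bounds are arbitrary): maximising the binomial log-likelihood directly on the logit scale returns ψ̂ = log(θ̂/(1 - θ̂)), so back-transforming recovers θ̂ = 0.8.

    # Binomial log-likelihood for n = 10 trials, x = 8 successes,
    # written in terms of the logit psi = log(theta/(1-theta))
    loglik_psi <- function(psi) {
      theta <- 1 / (1 + exp(-psi))
      8 * log(theta) + 2 * log(1 - theta)
    }

    psi_hat <- optimize(loglik_psi, interval = c(-10, 10), maximum = TRUE)$maximum
    psi_hat                      # approximately log(0.8/0.2) = 1.39
    1 / (1 + exp(-psi_hat))      # back-transformed: approximately 0.8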
Chapter 4

Estimation

In the previous chapter we have seen an approach to estimation that is based on the likelihood of the observed results. Next we study the general theory of estimation that is used to compare different estimators and to decide on the most efficient one.

4.1 General properties of estimators

Suppose that we are going to observe a value of a random vector X. Let 𝒳 denote the set of possible values X can take and, for x ∈ 𝒳, let f(x|θ) denote the probability that X takes the value x, where the parameter θ is some unknown element of the set Ω.

The problem we face is that of estimating θ. An estimator θ̂ is a procedure which for each possible value x ∈ 𝒳 specifies which element of Ω we should quote as an estimate of θ. When we observe X = x we quote θ̂(x) as our estimate of θ. Thus θ̂ is a function of the random vector X. Sometimes we write θ̂(X) to emphasise this point.

Given any estimator θ̂ we can calculate its expected value for each possible value of θ ∈ Ω. As we have already mentioned when discussing maximum likelihood estimation, an estimator is said to be unbiased if this expected value is identically equal to θ. If an estimator is unbiased then we can conclude that if we repeat the experiment an infinite number of times with θ fixed and calculate the value of the estimator each time, then the average of the estimator values will be exactly equal to θ. To evaluate the usefulness of an estimator θ̂ = θ̂(x) of θ, we examine the properties of the random variable θ̂ = θ̂(X).
Definition 1 (Unbiased estimators). An estimator θ̂ = θ̂(X) is said to be unbiased for a parameter θ if it equals θ in expectation:

    E[θ̂(X)] = E(θ̂) = θ.

Intuitively, an unbiased estimator is right on target.   □

Definition 2 (Bias of an estimator). The bias of an estimator θ̂ = θ̂(X) of θ is defined as bias(θ̂) = E[θ̂(X)] - θ.   □

Note that even if θ̂ is an unbiased estimator of θ, g(θ̂) will generally not be an unbiased estimator of g(θ) unless g is linear or affine. This limits the importance of the notion of unbiasedness. It might be at least as important that an estimator is accurate in the sense that its distribution is highly concentrated around θ.

Exercise 17. Show that for an arbitrary distribution the estimator S² as defined in (3.5) is an unbiased estimator of the variance of this distribution.
Exercise 18. Consider the estimator S² of the variance σ² in the case of the normal distribution. Demonstrate that although S² is an unbiased estimator of σ², S is not an unbiased estimator of σ. Compute its bias.

Definition 3 (Mean squared error). The mean squared error of the estimator θ̂ is defined as MSE(θ̂) = E(θ̂ - θ)². Given the same set of data, θ̂_1 is better than θ̂_2 if

    MSE(θ̂_1) ≤ MSE(θ̂_2)

(uniformly better if this holds for all θ).   □

Lemma 2 (The MSE variance-bias tradeoff). The MSE decomposes as

    MSE(θ̂) = Var(θ̂) + bias(θ̂)².

Proof. We have

    MSE(θ̂) = E(θ̂ - θ)²
            = E{ [θ̂ - E(θ̂)] + [E(θ̂) - θ] }²
            = E[θ̂ - E(θ̂)]² + E[E(θ̂) - θ]² + 2 E{ [θ̂ - E(θ̂)][E(θ̂) - θ] }     (the last term equals 0)
            = E[θ̂ - E(θ̂)]² + E[E(θ̂) - θ]²
            = Var(θ̂) + [E(θ̂) - θ]²,

and the second term is bias(θ̂)².

NOTE. This lemma implies that the mean squared error of an unbiased estimator is equal to the variance of the estimator.
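The decomposition is easy to verify numerically; a minimal sketch (using the biased exponential-rate MLE of Exercise 15, with hypothetical values of λ and n) is:

    # Numerical check of MSE(theta_hat) = Var(theta_hat) + bias(theta_hat)^2
    set.seed(3)
    lambda <- 2; n <- 5                                       # hypothetical values
    lhat <- replicate(50000, { x <- rexp(n, rate = lambda); n / sum(x) })

    mse           <- mean((lhat - lambda)^2)
    decomposition <- var(lhat) + (mean(lhat) - lambda)^2
    c(mse, decomposition)                                     # agree up to simulation error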
Exercise 19. Consider X_1, ..., X_n where X_i ∼ N(μ, σ²) and σ is known. Three estimators of μ are μ̂_1 = X̄ = (1/n) ∑_{i=1}^n X_i, μ̂_2 = X_1, and μ̂_3 = (X_1 + X̄)/2. Discuss their properties, state which one you would recommend, and explain why.
Example 9. Consider X_1, ..., X_n to be independent random variables with means E(X_i) = μ and variances Var(X_i) = σ_i². Consider pooling the estimators of μ into a common estimator using the linear combination

    μ̂ = w_1 X_1 + w_2 X_2 + ··· + w_n X_n.

We will see that the following is true:

(i) The estimator μ̂ is unbiased if and only if ∑ w_i = 1.

(ii) The estimator μ̂ has minimum variance among this class of estimators when the weights are inversely proportional to the variances σ_i².

(iii) The variance of μ̂ for the optimal weights w_i is Var(μ̂) = 1/∑ σ_i^{-2}.

Indeed, we have E(μ̂) = E(w_1 X_1 + ··· + w_n X_n) = ∑_i w_i E(X_i) = μ ∑_i w_i, so μ̂ is unbiased if and only if ∑_i w_i = 1. The variance of our estimator is Var(μ̂) = ∑_i w_i² σ_i², which should be minimized subject to the constraint ∑_i w_i = 1. Differentiating the Lagrangian L = ∑_i w_i² σ_i² - λ(∑_i w_i - 1) with respect to w_i and setting equal to zero yields 2w_i σ_i² = λ, so that w_i = σ_i^{-2}/(∑_j σ_j^{-2}). Then, for the optimal weights, we get Var(μ̂) = ∑_i w_i² σ_i² = (∑_i σ_i^{-4} σ_i²)/(∑_i σ_i^{-2})² = 1/(∑_i σ_i^{-2}).

Assume now that instead of X_i we observe the biased variables X̃_i = X_i + δ for some δ ≠ 0. When σ_i² = σ² we have Var(μ̂) = σ²/n, which tends to zero as n → ∞, whereas bias(μ̂) = δ and MSE(μ̂) = σ²/n + δ². Thus in the general case, when bias is present, it tends to dominate the variance as n gets larger, which is very unfortunate.
Exercise 20. Let X_1, ..., X_n be an independent sample of size n from the uniform distribution on the interval (0, θ), with density for a single observation f(x|θ) = θ^{-1} for 0 < x < θ and 0 otherwise, and consider θ > 0 unknown.

(i) Find the expected value and variance of the estimator θ̂ = 2X̄.

(ii) Find the expected value of the estimator θ̃ = X_(n), i.e. the largest observation.

(iii) Find an unbiased estimator of the form θ̌ = cX_(n) and calculate its variance.

(iv) Compare the mean square errors of the estimators from (i) and (iii).
4.2 Minimum-Variance Unbiased Estimation

Getting a small MSE often involves a tradeoff between variance and bias. For unbiased estimators the MSE obviously equals the variance, MSE(θ̂) = Var(θ̂), so no tradeoff can be made. One approach is to restrict ourselves to the subclass of estimators that are unbiased and have minimum variance.

Definition 4 (Minimum-variance unbiased estimator). If an unbiased estimator of g(θ) has minimum variance among all unbiased estimators of g(θ), it is called a minimum-variance unbiased estimator (MVUE).   □

We will develop a method of finding the MVUE when it exists. When such an estimator does not exist we will be able to find a lower bound for the variance of an unbiased estimator in the class of unbiased estimators, and compare the variance of our unbiased estimator with this lower bound.

Definition 5 (Score function). For the (possibly vector valued) observation X = x to be informative about θ, the density must vary with θ. If f(x|θ) is smooth and differentiable, then for finding the MLE we have used the score function

    S(θ) = S(θ|x) = ∂/∂θ log f(x|θ) = ( ∂f(x|θ)/∂θ ) / f(x|θ).   □
Under suitable regularity conditions (differentiation wrt θ and integration wrt x can be interchanged), we have for X distributed according to f(x|θ):

    E{S(θ|X)} = ∫ [ (∂f(x|θ)/∂θ) / f(x|θ) ] f(x|θ) dx = ∫ ∂f(x|θ)/∂θ dx
              = ∂/∂θ ( ∫ f(x|θ) dx ) = ∂/∂θ 1 = 0.

Thus the score function has expectation zero. The score function S(θ|x) becomes a random variable when for x we substitute a random variable X with the f(x|θ) distribution. In this case we often drop the explicit dependence on X from the notation by simply writing S(θ).

The negative derivative of the score function measures how concave down the log-likelihood is around the value θ.

Definition 6 (Fisher information). The Fisher information Ī(θ) is defined as the average value of the negative derivative of the score function,

    Ī(θ) ≡ E[ -∂S(θ)/∂θ ] = E[I(θ)].

The negative derivative of the score function itself, I(θ) = -∂S(θ)/∂θ, which is a random variable depending on X, is sometimes referred to as the empirical or observed information about θ; it coincides with the curvature I(θ) defined in Chapter 3.   □
Lemma 3. The variance of S(θ) is equal to the Fisher information about θ:

    Ī(θ) = E{S(θ)²} = E[ ( ∂/∂θ log f(X|θ) )² ].

Proof. Using the chain rule,

    ∂²/∂θ² log f = ∂/∂θ ( (1/f) ∂f/∂θ )
                 = -(1/f²) ( ∂f/∂θ )² + (1/f) ∂²f/∂θ²
                 = -( ∂ log f/∂θ )² + (1/f) ∂²f/∂θ².

If integration and differentiation can be interchanged,

    E[ (1/f) ∂²f/∂θ² ] = ∫_𝒳 ∂²f/∂θ² dx = ∂²/∂θ² ∫_𝒳 f dx = ∂²/∂θ² 1 = 0,

thus

    E[ -∂²/∂θ² log f(X|θ) ] = E[ ( ∂/∂θ log f(X|θ) )² ] = Ī(θ).     (4.1)
Theorem 4.2.1 (Cramér-Rao lower bound). Let θ̂ be an unbiased estimator of θ. Then

    Var(θ̂) ≥ { Ī(θ) }^{-1}.

Proof. Unbiasedness, E(θ̂) = θ, implies

    ∫ θ̂(x) f(x|θ) dx = θ.

Assume we can differentiate wrt θ under the integral; then

    ∫ θ̂(x) ∂f(x|θ)/∂θ dx = 1,

since the estimator θ̂(x) cannot depend on θ. Because

    ∂f/∂θ = f ∂(log f)/∂θ,

we now have

    ∫ θ̂(x) f ∂(log f)/∂θ dx = 1,

that is,

    E[ θ̂(X) ∂(log f)/∂θ ] = 1.

Define the random variables

    U = θ̂(X)   and   S = ∂(log f)/∂θ.

Then E(US) = 1. We already know that the score function has expectation zero, E(S) = 0. Consequently Cov(U, S) = E(US) - E(U)E(S) = E(US) = 1. By the well-known property of correlations (which follows from the Cauchy-Schwarz inequality) we have

    {Corr(U, S)}² = {Cov(U, S)}² / ( Var(U) Var(S) ) ≤ 1.

Since, as we mentioned, Cov(U, S) = 1, we get

    Var(U) Var(S) ≥ 1.

By Lemma 3, Var(S) = Ī(θ), and this implies

    Var(θ̂) ≥ 1/Ī(θ),

which is our main result. We call { Ī(θ) }^{-1} the Cramér-Rao lower bound (CRLB).

Why "information"? Variance measures lack of knowledge, so it is reasonable that the reciprocal of the variance should be interpreted as the amount of information carried by the (possibly vector valued) random observation X about θ.

Sufficient conditions for the proof of the CRLB are that all the integrands are finite within the range of x. We also require that the limits of the integrals do not depend on θ; that is, the range of x for which f(x|θ) > 0 cannot depend on θ. This second condition is violated for many density functions; for example, the CRLB is not valid for the uniform distribution. We can make an absolute assessment of unbiased estimators by comparing their variances to the CRLB. We can also assess biased estimators: if the variance of a biased estimator is lower than the CRLB, then it can indeed be a very good estimator, although it is biased.
Example 10. Consider iid random variables X_i, i = 1, ..., n, with

    f_{X_i}(x_i|θ) = (1/θ) exp( -x_i/θ ).

Denote the joint distribution of X_1, ..., X_n by

    f = ∏_{i=1}^n f_{X_i}(x_i|θ) = (1/θ)^n exp( -(1/θ) ∑_{i=1}^n x_i ),

so that

    log f = -n log θ - (1/θ) ∑_{i=1}^n x_i.

The score function is the partial derivative of log f wrt the unknown parameter θ,

    S(θ) = ∂/∂θ log f = -n/θ + (1/θ²) ∑_{i=1}^n x_i,

and

    E{S(θ)} = E[ -n/θ + (1/θ²) ∑_{i=1}^n X_i ] = -n/θ + (1/θ²) E[ ∑_{i=1}^n X_i ].

For X ∼ Exp(1/θ) we have E(X) = θ, implying E(X_1 + ··· + X_n) = E(X_1) + ··· + E(X_n) = nθ and E{S(θ)} = 0, as required. Next,

    Ī(θ) = E[ -∂/∂θ ( -n/θ + (1/θ²) ∑_{i=1}^n X_i ) ] = E[ -n/θ² + (2/θ³) ∑_{i=1}^n X_i ]
         = -n/θ² + (2/θ³) nθ = n/θ².

Hence

    CRLB = θ²/n.

Let us propose μ̂ = X̄ as an estimator of θ. Then

    E(μ̂) = E[ (1/n) ∑_{i=1}^n X_i ] = (1/n) E[ ∑_{i=1}^n X_i ] = θ,

verifying that μ̂ = X̄ is indeed an unbiased estimator of θ. For X ∼ Exp(1/θ) we have E(X) = θ = √(Var(X)), implying

    Var(μ̂) = (1/n²) ∑_{i=1}^n Var(X_i) = nθ²/n² = θ²/n.

We have shown that Var(μ̂) = { Ī(θ) }^{-1}, and therefore conclude that the unbiased estimator μ̂ = x̄ achieves the CRLB.   □
Definition 7 (Efficiency). Define the efficiency of an unbiased estimator θ̂ as

    e(θ̂) = CRLB / Var(θ̂),

where CRLB = { Ī(θ) }^{-1}. Clearly 0 < e(θ̂) ≤ 1. An unbiased estimator θ̂ is said to be efficient if e(θ̂) = 1.   □

Exercise 21. Consider the MLE θ̂ = r/n for the binomial distribution that was considered in Example 3. Show that for this estimator the efficiency is 100%, i.e. its variance attains the CRLB.

Exercise 22. Consider the MLE for the Poisson distribution that was considered in Example 5. Show that also in this case the MLE is 100% efficient.

Definition 8 (Asymptotic efficiency). The asymptotic efficiency of an unbiased estimator θ̂ is the limit of its efficiency as n → ∞. An unbiased estimator θ̂ is said to be asymptotically efficient if its asymptotic efficiency is equal to 1.   □

Exercise 23. Consider the MLE λ̂ for the exponential distribution with parameter λ that was considered in Exercise 16. Find its variance and its mean square error. Consider also the estimator λ̃ from that exercise. Which of the two has smaller variance and which has smaller mean square error? Is λ̃ asymptotically efficient?

Exercise 24. Discuss the efficiency of the estimator of the variance in the normal distribution in the case when the mean is known (see Example 7).
4.3 Optimality Properties of the MLE

Suppose that an experiment consists of measuring random variables X_1, X_2, ..., X_n which are iid with probability distribution depending on a parameter θ. Let θ̂ be the MLE of θ. Define

    W_1 = √( Ī(θ) ) (θ̂ - θ),
    W_2 = √( I(θ) ) (θ̂ - θ),
    W_3 = √( Ī(θ̂) ) (θ̂ - θ),
    W_4 = √( I(θ̂) ) (θ̂ - θ).

Then W_1, W_2, W_3, and W_4 are all random variables and, as n → ∞, the probabilistic behaviour of each of W_1, W_2, W_3, and W_4 is well approximated by that of a N(0, 1) random variable.

Since E[W_1] ≈ 0, we have E[θ̂] ≈ θ and so θ̂ is approximately unbiased. Also, Var[W_1] ≈ 1 implies that Var[θ̂] ≈ ( Ī(θ) )^{-1}, and so θ̂ is asymptotically efficient. The above properties of the MLE carry over to the multivariate case. Here is a brief account of these properties.

Let the data X have probability distribution g(X; θ) where θ = (θ_1, θ_2, ..., θ_m) is a vector of m unknown parameters.

Let I(θ) be the m × m observed information matrix and let Ī(θ) be the m × m Fisher information matrix obtained by replacing the elements of I(θ) by their expected values. Let θ̂ be the MLE of θ. Let CRLB_r be the rth diagonal element of the inverse Fisher information matrix [Ī(θ)]^{-1}. For r = 1, 2, ..., m, define W_1r = (θ̂_r - θ_r)/√(CRLB_r). Then, as n → ∞, W_1r behaves like a standard normal random variable.

Suppose we define W_2r by replacing CRLB_r by the rth diagonal element of the matrix I(θ)^{-1}, W_3r by replacing CRLB_r by the rth diagonal element of the matrix Ī(θ̂)^{-1}, and W_4r by replacing CRLB_r by the rth diagonal element of the matrix I(θ̂)^{-1}. Then it can be shown that, as n → ∞, W_2r, W_3r, and W_4r all behave like standard normal random variables.
Chapter 5

The Theory of Confidence Intervals

5.1 Exact Confidence Intervals

Suppose that we are going to observe the value of a random vector X. Let 𝒳 denote the set of possible values that X can take and, for x ∈ 𝒳, let g(x|θ) denote the probability that X takes the value x, where the parameter θ is some unknown element of the set Ω. Consider the problem of quoting a subset of values which are in some sense plausible in the light of the data x. We need a procedure which for each possible value x ∈ 𝒳 specifies a subset C(x) of Ω which we should quote as a set of plausible values for θ.

Definition 9. Let X_1, ..., X_n be a sample from a distribution that is parameterized by some parameter θ. A random set C(X_1, ..., X_n) of possible values for θ that is computable from the sample is called a confidence region at confidence level 1 - α if

    P( θ ∈ C(X_1, ..., X_n) ) = 1 - α.

If the set C(X_1, ..., X_n) has the form of an interval, then we call it a confidence interval.
Example 11. Suppose we are going to observe data x = (x_1, x_2, ..., x_n), where x_1, x_2, ..., x_n are the observed values of random variables X_1, X_2, ..., X_n which are thought to be iid N(μ, 1) for some unknown parameter μ ∈ (-∞, ∞) = Ω. Consider the subset C(x) = [x̄ - 1.96/√n, x̄ + 1.96/√n]. If we carry out an infinite sequence of independent repetitions of the experiment then we will get an infinite sequence of x̄ values and thereby an infinite sequence of subsets C(x). We might ask what proportion of this infinite sequence of subsets actually contains the fixed but unknown value of μ.

Since C(x) depends on x only through the value of x̄, we need to know how x̄ behaves in the infinite sequence of repetitions. This follows from the fact that X̄ has a N(μ, 1/n) density and so Z = (X̄ - μ)/(1/√n) = √n(X̄ - μ) has a N(0, 1) density. Thus even though μ is unknown we can calculate the probability that the value of Z will exceed 2.78, for example, using the standard normal tables. Remember that the probability is the proportion of experiments in the infinite sequence of repetitions which produce a value of Z greater than 2.78.

In particular we have that P[|Z| ≤ 1.96] = 0.95. Thus 95% of the time Z will lie between -1.96 and +1.96. But

    -1.96 ≤ Z ≤ +1.96  ⇔  -1.96 ≤ √n(X̄ - μ) ≤ +1.96
                        ⇔  -1.96/√n ≤ X̄ - μ ≤ +1.96/√n
                        ⇔  X̄ - 1.96/√n ≤ μ ≤ X̄ + 1.96/√n
                        ⇔  μ ∈ C(X).

Thus we have answered the question we started with. The proportion of the infinite sequence of subsets given by the formula C(X) which will actually include the fixed but unknown value of μ is 0.95. For this reason the set C(X) is called a 95% confidence set or confidence interval for the parameter μ.   □

It is well to bear in mind that once we have actually carried out the experiment and observed our value of x, the resulting interval C(x) either does or does not contain the unknown parameter μ. We do not know which is the case. All we know is that the procedure we used in constructing C(x) is one which 95% of the time produces an interval which contains the unknown parameter.
[Figure 5.1: One hundred confidence intervals for the mean of a normal variable with unknown mean and variance, for samples of size ten. The samples have been drawn from the normal distribution with mean 15 and standard deviation 6.]
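The experiment behind Figure 5.1 is easy to repeat; a minimal R sketch (using the same hypothetical population N(15, 6²) and samples of size ten) is:

    # Simulate 100 samples and count how many 95% t-intervals cover the true mean
    set.seed(4)
    covered <- replicate(100, {
      x  <- rnorm(10, mean = 15, sd = 6)
      ci <- mean(x) + c(-1, 1) * qt(0.975, df = 9) * sd(x) / sqrt(10)
      ci[1] <= 15 && 15 <= ci[2]
    })
    sum(covered)    # typically about 95 of the 100 intervals contain 15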
The crucial step in the last example was finding the quantity Z = √n(X̄ - μ), whose value depends on the parameter of interest but whose distribution is known to be that of a standard normal variable. This leads to the following definition.

Definition 10 (Pivotal Quantity). A pivotal quantity for a parameter θ is a random variable Q(X|θ) whose value depends both on the data X and on the value of the unknown parameter θ, but whose distribution is known.   □

The quantity Z in the example above is a pivotal quantity for μ. The following lemma provides a method of finding pivotal quantities in general.

Lemma 4. Let X be a random variable and define F(a) = P[X ≤ a]. Consider the random variable U = -2 log[F(X)]. Then U has a χ²_2 density. Similarly, the random variable V = -2 log[1 - F(X)] has a χ²_2 density.

Proof. Observe that, for a ≥ 0,

    P[U ≤ a] = P[F(X) ≥ exp(-a/2)]
             = 1 - P[F(X) ≤ exp(-a/2)]
             = 1 - P[X ≤ F^{-1}(exp(-a/2))]
             = 1 - F[F^{-1}(exp(-a/2))]
             = 1 - exp(-a/2).

Hence U has density (1/2) exp(-a/2), which is the density of a χ²_2 variable, as required. The corresponding proof for V is left as an exercise.
This lemma has an immediate, and very important, application. Suppose that we have data X_1, X_2, ..., X_n which are iid with density f(x|θ). Define F(a|θ) = ∫_{-∞}^a f(x|θ) dx and, for i = 1, 2, ..., n, define U_i = -2 log[F(X_i|θ)]. Then U_1, U_2, ..., U_n are iid, each having a χ²_2 density. Hence Q_1(X, θ) = ∑_{i=1}^n U_i has a χ²_{2n} density and so is a pivotal quantity for θ. Another pivotal quantity (also having a χ²_{2n} density) is given by Q_2(X, θ) = ∑_{i=1}^n V_i, where V_i = -2 log[1 - F(X_i|θ)].
Example 12. Suppose that we have data X_1, X_2, ..., X_n which are iid with density

    f(x|θ) = θ exp(-θx)

for x ≥ 0, and suppose that we want to construct a 95% confidence interval for θ. We need to find a pivotal quantity for θ. Observe that

    F(a|θ) = ∫_{-∞}^a f(x|θ) dx = ∫_0^a θ exp(-θx) dx = 1 - exp(-θa).

Hence

    Q_1(X, θ) = -2 ∑_{i=1}^n log[1 - exp(-θX_i)]

is a pivotal quantity for θ having a χ²_{2n} density. Also

    Q_2(X, θ) = -2 ∑_{i=1}^n log[exp(-θX_i)] = 2θ ∑_{i=1}^n X_i

is another pivotal quantity for θ having a χ²_{2n} density.

Using the tables, find A < B such that P[χ²_{2n} < A] = P[χ²_{2n} > B] = 0.025. Then

    0.95 = P[ A ≤ Q_2(X, θ) ≤ B ]
         = P[ A ≤ 2θ ∑_{i=1}^n X_i ≤ B ]
         = P[ A/(2 ∑_{i=1}^n X_i) ≤ θ ≤ B/(2 ∑_{i=1}^n X_i) ],

and so the interval

    [ A/(2 ∑_{i=1}^n X_i),  B/(2 ∑_{i=1}^n X_i) ]

is a 95% confidence interval for θ.
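For the interval just derived, the quantiles A and B come from qchisq; a minimal sketch with a hypothetical sample is:

    # 95% confidence interval for theta based on Q2 = 2*theta*sum(X_i) ~ chi^2_{2n}
    set.seed(5)
    x <- rexp(20, rate = 0.5)              # hypothetical sample, n = 20
    n <- length(x)
    AB <- qchisq(c(0.025, 0.975), df = 2 * n)
    AB / (2 * sum(x))                      # lower and upper 95% limits for theta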
5.2 Pivotal Quantities for Use with Normal Data

Many exact pivotal quantities have been developed for use with Gaussian data.

Exercise 25. Suppose that we have data X_1, X_2, ..., X_n which are iid observations from a N(μ, σ²) density where σ is known. Define

    Q = √n(X̄ - μ)/σ.

Show that this random variable is pivotal for μ. Construct confidence intervals for μ based on this pivotal quantity.
Example 13. Suppose that we have data X_1, X_2, ..., X_n which are iid observations from a N(μ, σ²) density where μ is known. Define

    Q = ∑_{i=1}^n (X_i - μ)²/σ².

We can write Q = ∑_{i=1}^n Z_i², where Z_i = (X_i - μ)/σ. If Z_i has a N(0, 1) density then Z_i² has a χ²_1 density. Hence Q has a χ²_n density and so is a pivotal quantity for σ. If n = 20 then we can be 95% sure that

    9.591 ≤ ∑_{i=1}^n (X_i - μ)²/σ² ≤ 34.170,

which is equivalent to

    √( (1/34.170) ∑_{i=1}^n (X_i - μ)² ) ≤ σ ≤ √( (1/9.591) ∑_{i=1}^n (X_i - μ)² ).

The R command qchisq(p=c(.025,.975),df=20) returns the values 9.590777 and 34.169607 as the 2.5% and 97.5% quantiles of a chi-squared distribution on 20 degrees of freedom.   □
Lemma 5 (The Student t-distribution). Suppose the random variables X and Y are independent, X ∼ N(0, 1) and Y ∼ χ²_n. Then the ratio

    T = X/√(Y/n)

has pdf

    f_T(t|n) = (1/√(πn)) ( Γ([n + 1]/2)/Γ(n/2) ) ( 1 + t²/n )^{-(n+1)/2},

and is known as Student's t-distribution on n degrees of freedom.

Proof. The random variables X and Y are independent and have joint density

    f_{X,Y}(x, y) = (1/√(2π)) (1/(2^{n/2} Γ(n/2))) e^{-x²/2} y^{n/2-1} e^{-y/2},   for y > 0.

The Jacobian ∂(t, u)/∂(x, y) of the change of variables

    t = x/√(y/n)   and   u = y

equals

    ∂(t, u)/∂(x, y) = det( ∂t/∂x  ∂t/∂y ; ∂u/∂x  ∂u/∂y ) = det( √(n/y)  -(x/2)√n y^{-3/2} ; 0  1 ) = (n/y)^{1/2},

and the inverse Jacobian is ∂(x, y)/∂(t, u) = (u/n)^{1/2}. Then

    f_T(t) = ∫_0^∞ f_{X,Y}( t(u/n)^{1/2}, u ) (u/n)^{1/2} du
           = (1/√(2π)) (1/(2^{n/2} Γ(n/2))) ∫_0^∞ e^{-t²u/(2n)} u^{n/2-1} e^{-u/2} (u/n)^{1/2} du
           = (1/√(2π)) (1/(2^{n/2} Γ(n/2) n^{1/2})) ∫_0^∞ e^{-(1 + t²/n)u/2} u^{(n+1)/2-1} du.

The last integrand comes from the pdf of a Gamma((n + 1)/2, 1/2 + t²/(2n)) random variable. Hence

    f_T(t) = (1/√(πn)) ( Γ([n + 1]/2)/Γ(n/2) ) ( 1/(1 + t²/n) )^{(n+1)/2},

which gives the above formula.
Example 14. Suppose that we have data X_1, X_2, ..., X_n which are iid observations from a N(μ, σ²) density where both μ and σ are unknown. Define

    Q = √n(X̄ - μ)/s,     where s² = ∑_{i=1}^n (X_i - X̄)²/(n - 1).

We can write

    Q = Z/√( W/(n - 1) ),

where

    Z = √n(X̄ - μ)/σ

has a N(0, 1) density and

    W = ∑_{i=1}^n (X_i - X̄)²/σ²

has a χ²_{n-1} density (see Lemma 1). It follows immediately that W is a pivotal quantity for σ. If n = 31 then we can be 95% sure that

    16.79077 ≤ ∑_{i=1}^n (X_i - X̄)²/σ² ≤ 46.97924,

which is equivalent to

    √( (1/46.97924) ∑_{i=1}^n (X_i - X̄)² ) ≤ σ ≤ √( (1/16.79077) ∑_{i=1}^n (X_i - X̄)² ).     (5.1)

The R command qchisq(p=c(.025,.975),df=30) returns the values 16.79077 and 46.97924 as the 2.5% and 97.5% quantiles of a chi-squared distribution on 30 degrees of freedom. In Lemma 5 we show that Q has a t_{n-1} density, and so Q is a pivotal quantity for μ. If n = 31 then we can be 95% sure that

    -2.042 ≤ √n(X̄ - μ)/s ≤ +2.042,

which is equivalent to

    X̄ - 2.042 s/√n ≤ μ ≤ X̄ + 2.042 s/√n.     (5.2)

The R command qt(p=.975,df=30) returns the value 2.042272 as the 97.5% quantile of a Student t-distribution on 30 degrees of freedom. (It is important to point out that although a probability statement involving 95% confidence has been attached to each of the two intervals (5.2) and (5.1) separately, this does not imply that both intervals simultaneously hold with 95% confidence.)   □
Example 15. Suppose that we have data X_1, X_2, ..., X_n which are iid observations from a N(μ_1, σ²) density and data Y_1, Y_2, ..., Y_m which are iid observations from a N(μ_2, σ²) density, where μ_1, μ_2, and σ are unknown. Let δ = μ_1 - μ_2 and define

    Q = ( X̄ - Ȳ - δ ) / √( s²(1/n + 1/m) ),

where

    s² = ( ∑_{i=1}^n (X_i - X̄)² + ∑_{j=1}^m (Y_j - Ȳ)² ) / (n + m - 2).

We know that X̄ has a N(μ_1, σ²/n) density and that Ȳ has a N(μ_2, σ²/m) density. Then the difference X̄ - Ȳ has a N(δ, σ²[1/n + 1/m]) density. Hence

    Z = ( X̄ - Ȳ - δ ) / √( σ²[1/n + 1/m] )

has a N(0, 1) density. Let W_1 = ∑_{i=1}^n (X_i - X̄)²/σ² and let W_2 = ∑_{j=1}^m (Y_j - Ȳ)²/σ². Then W_1 has a χ²_{n-1} density, W_2 has a χ²_{m-1} density, and W = W_1 + W_2 has a χ²_{n+m-2} density. We can write

    Q = Z/√( W/(n + m - 2) ),

and so Q has a t_{n+m-2} density and is a pivotal quantity for δ. Define

    Q_2 = ( ∑_{i=1}^n (X_i - X̄)² + ∑_{j=1}^m (Y_j - Ȳ)² ) / σ².

Then Q_2 has a χ²_{n+m-2} density and so is a pivotal quantity for σ.   □
Lemma 6 (The Fisher F-distribution). Let X_1, X_2, ..., X_n and Y_1, Y_2, ..., Y_m be iid N(0, 1) random variables. The ratio

    Z = ( ∑_{i=1}^n X_i²/n ) / ( ∑_{i=1}^m Y_i²/m )

has the distribution called the Fisher, or F, distribution with parameters (degrees of freedom) n and m, the F_{n,m} distribution for short. The corresponding pdf f_{F_{n,m}} is concentrated on the positive half axis:

    f_{F_{n,m}}(z) = ( Γ((n + m)/2) / (Γ(n/2)Γ(m/2)) ) (n/m)^{n/2} z^{n/2-1} ( 1 + (n/m)z )^{-(n+m)/2},   for z > 0.

Observe that if T ∼ t_m, then T² = Z ∼ F_{1,m}, and if Z ∼ F_{n,m}, then Z^{-1} ∼ F_{m,n}. If W_1 ∼ χ²_n and W_2 ∼ χ²_m, then Z = (mW_1)/(nW_2) ∼ F_{n,m}.   □
Example 16. Suppose that we have data X_1, X_2, ..., X_n which are iid observations from a N(μ_X, σ²_X) density and data Y_1, Y_2, ..., Y_m which are iid observations from a N(μ_Y, σ²_Y) density, where μ_X, μ_Y, σ_X, and σ_Y are all unknown. Let

    τ = σ_X/σ_Y

and define

    F* = s²_X/s²_Y = ( ∑_{i=1}^n (X_i - X̄)²/(n - 1) ) / ( ∑_{j=1}^m (Y_j - Ȳ)²/(m - 1) ).

Let

    W_X = ∑_{i=1}^n (X_i - X̄)²/σ²_X   and   W_Y = ∑_{j=1}^m (Y_j - Ȳ)²/σ²_Y.

Then W_X has a χ²_{n-1} density and W_Y has a χ²_{m-1} density. Hence, by Lemma 6,

    Q = ( W_X/(n - 1) ) / ( W_Y/(m - 1) ) = F*/τ²

has an F density with n - 1 and m - 1 degrees of freedom and so is a pivotal quantity for τ. Suppose that n = 25 and m = 13. Then we can be 95% sure that 0.39 ≤ Q ≤ 3.02, which is equivalent to

    √(F*/3.02) ≤ τ ≤ √(F*/0.39).

To see how this might work in practice, try the following R commands one at a time:

    x = rnorm(25, mean = 0, sd = 2)
    y = rnorm(13, mean = 1, sd = 1)
    Fstar = var(x)/var(y); Fstar
    CutOffs = qf(p=c(.025,.975), df1=24, df2=12)
    CutOffs; rev(CutOffs)
    Fstar / rev(CutOffs)
    var.test(x, y)

□
The search for a nice pivotal quantity for δ = μ_1 - μ_2 when the two variances are unequal continues, and is one of the unsolved problems in statistics, referred to as the Behrens-Fisher problem.
5.3 Approximate Confidence Intervals

Let X_1, X_2, ..., X_n be iid with density f(x|θ) and let θ̂ be the MLE of θ. We saw before that the quantities W_1 = √(Ī(θ))(θ̂ - θ), W_2 = √(I(θ))(θ̂ - θ), W_3 = √(Ī(θ̂))(θ̂ - θ), and W_4 = √(I(θ̂))(θ̂ - θ) all have densities which are approximately N(0, 1).

Hence they are all approximate pivotal quantities for θ. W_3 and W_4 are the simplest to use in general. For W_3 the approximate 95% confidence interval is given by

    [ θ̂ - 1.96/√(Ī(θ̂)),  θ̂ + 1.96/√(Ī(θ̂)) ].

For W_4 the approximate 95% confidence interval is given by

    [ θ̂ - 1.96/√(I(θ̂)),  θ̂ + 1.96/√(I(θ̂)) ].

The quantity 1/√(Ī(θ̂)) (or 1/√(I(θ̂))) is often referred to as the approximate standard error of the MLE θ̂.
Let X_1, X_2, ..., X_n be iid with density f(x|θ), where θ = (θ_1, θ_2, ..., θ_m) consists of m unknown parameters, and let θ̂ = (θ̂_1, θ̂_2, ..., θ̂_m) be the MLE of θ. We saw before that for r = 1, 2, ..., m the quantities W_1r = (θ̂_r - θ_r)/√(CRLB_r), where CRLB_r is the lower bound for Var(θ̂_r) given in the generalisation of the Cramér-Rao theorem, have densities which are approximately N(0, 1). Recall that CRLB_r is the rth diagonal element of the matrix [Ī(θ)]^{-1}. In certain cases CRLB_r may depend on the values of unknown parameters other than θ_r, and in those cases W_1r will not be an approximate pivotal quantity for θ_r.

We also saw that if we define W_2r by replacing CRLB_r by the rth diagonal element of the matrix [I(θ)]^{-1}, W_3r by replacing CRLB_r by the rth diagonal element of the matrix [Ī(θ̂)]^{-1}, and W_4r by replacing CRLB_r by the rth diagonal element of the matrix [I(θ̂)]^{-1}, we get three more quantities, all of which have densities that are approximately N(0, 1). W_3r and W_4r only depend on the unknown parameter θ_r and so are approximate pivotal quantities for θ_r. However, in certain cases the rth diagonal element of the matrix [I(θ)]^{-1} may depend on the values of unknown parameters other than θ_r, and in those cases W_2r will not be an approximate pivotal quantity for θ_r. Generally W_3r and W_4r are most commonly used.

We now examine the use of approximate pivotal quantities based on the MLE in a series of examples.
Example 17 (Poisson sampling continued). Recall that λ̂ = x̄ and I(λ) = ∑_{i=1}^n x_i/λ² = nλ̂/λ², with Ī(λ) = E[I(λ)] = n/λ. Hence I(λ̂) = Ī(λ̂) = n/λ̂, and the usual approximate 95% confidence interval is given by

    [ λ̂ - 1.96 √(λ̂/n),  λ̂ + 1.96 √(λ̂/n) ].   □
Example 18 (Bernoulli trials continued). Recall that θ̂ = x̄ and

    I(θ) = ∑_{i=1}^n x_i/θ² + ( n - ∑_{i=1}^n x_i )/(1 - θ)²,

with

    Ī(θ) = E[I(θ)] = n/( θ(1 - θ) ).

Hence

    I(θ̂) = Ī(θ̂) = n/( θ̂(1 - θ̂) ),

and the usual approximate 95% confidence interval is given by

    [ θ̂ - 1.96 √( θ̂(1 - θ̂)/n ),  θ̂ + 1.96 √( θ̂(1 - θ̂)/n ) ].   □
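As a numerical illustration (the counts below are hypothetical), the approximate interval of Example 18 can be computed directly:

    # Approximate 95% (Wald-type) interval for a Bernoulli probability theta
    r <- 58; n <- 100                       # hypothetical number of successes and trials
    theta_hat <- r / n
    se <- sqrt(theta_hat * (1 - theta_hat) / n)   # 1/sqrt(I(theta_hat))
    theta_hat + c(-1.96, 1.96) * se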
Example 19. Let X_1, X_2, ..., X_n be iid observations from the density

    f(x|α, β) = αβ x^{β-1} exp[ -αx^β ]

for x ≥ 0, where both α and β are unknown. It can be verified by straightforward calculations that the observed information matrix I(α, β) is given by

    ( n/α²                              ∑_{i=1}^n x_i^β log x_i
      ∑_{i=1}^n x_i^β log x_i           n/β² + α ∑_{i=1}^n x_i^β (log x_i)²  ).

Let V_11 and V_22 be the diagonal elements of the matrix [I(α̂, β̂)]^{-1}. Then the approximate 95% confidence interval for α is

    [ α̂ - 1.96 √V_11,  α̂ + 1.96 √V_11 ]

and the approximate 95% confidence interval for β is

    [ β̂ - 1.96 √V_22,  β̂ + 1.96 √V_22 ].

Finding α̂ and β̂ is an interesting exercise that you can try to do on your own.   □
Exercise 26. Components are produced in an industrial process and the numbers of flaws in different components are independent and identically distributed with probability mass function p(x) = θ(1 - θ)^x, x = 0, 1, 2, ..., where 0 < θ < 1. A random sample of n components is inspected; n_0 components are found to have no flaws, n_1 components are found to have exactly one flaw, and the remaining components are found to have two or more flaws.

1. Show that the likelihood function is l(θ) = θ^{n_0+n_1} (1 - θ)^{2n-2n_0-n_1}.

2. Find the MLE of θ and the sample information in terms of n, n_0 and n_1.

3. Hence calculate an approximate 90% confidence interval for θ when 90 out of 100 components have no flaws, and seven have exactly one flaw.
Exercise 27. Suppose that X_1, X_2, ..., X_n is a random sample from the shifted exponential distribution with probability density function

    f(x|θ, μ) = (1/θ) e^{-(x-μ)/θ},   μ < x < ∞,

where θ > 0 and -∞ < μ < ∞. Both θ and μ are unknown, and n > 1.

1. The sample range W is defined as W = X_(n) - X_(1), where X_(n) = max_i X_i and X_(1) = min_i X_i. It can be shown that the joint probability density function of X_(1) and W is given by

    f_{X_(1),W}(x_(1), w) = n(n - 1) θ^{-2} e^{-n(x_(1)-μ)/θ} e^{-w/θ} (1 - e^{-w/θ})^{n-2},

for x_(1) > μ and w > 0. Hence obtain the marginal density function of W and show that W has distribution function P(W ≤ w) = (1 - e^{-w/θ})^{n-1}, w > 0.

2. Show that W/θ is a pivotal quantity for θ. Without carrying out any calculations, explain how this result may be used to construct a 100(1 - α)% confidence interval for θ, for 0 < α < 1.
Exercise 28. Let X have the logistic distribution with probability density function

    f(x) = e^{-(x-θ)} / ( 1 + e^{-(x-θ)} )²,   -∞ < x < ∞,

where -∞ < θ < ∞ is an unknown parameter.

1. Show that X - θ is a pivotal quantity and hence, given a single observation X, construct an exact 100(1 - α)% confidence interval for θ. Evaluate the interval when α = 0.05 and X = 10.

2. Given a random sample X_1, X_2, ..., X_n from the above distribution, briefly explain how you would use the central limit theorem to construct an approximate 95% confidence interval for θ. Hint: E(X) = θ and Var(X) = π²/3.
Exercise 29. Let X_1, ..., X_n be iid with density f_X(x|θ) = θ exp(-θx) for x ≥ 0.

1. Show that ∫_0^x f(u|θ) du = 1 - exp(-θx).

2. Use the result in (a) to establish that Q = 2θ ∑_{i=1}^n X_i is a pivotal quantity for θ, and explain how to use Q to find a 95% confidence interval for θ.

3. Derive the information I(θ). Suggest an approximate pivotal quantity for θ involving I(θ) and another approximate pivotal quantity involving I(θ̂), where θ̂ = 1/x̄ is the maximum likelihood estimate of θ. Show how both approximate pivotal quantities may be used to find approximate 95% confidence intervals for θ. Prove that the approximate confidence interval calculated using the approximate pivotal quantity involving I(θ̂) is always shorter than the approximate confidence interval calculated using the approximate pivotal quantity involving I(θ), but that the ratio of the lengths converges to 1 as n → ∞.

4. Suppose n = 25 and ∑_{i=1}^{25} x_i = 250. Use the method explained in (b) to calculate a 95% confidence interval for θ and the two methods explained in (c) to calculate approximate 95% confidence intervals for θ. Compare the three intervals obtained.
Exercise 30. Let X_1, X_2, ..., X_n be iid with density

    f(x|θ) = θ/(x + 1)^{θ+1}

for x ≥ 0.

1. Derive an exact pivotal quantity for θ and explain how it may be used to find a 95% confidence interval for θ.

2. Derive the information I(θ). Suggest an approximate pivotal quantity for θ involving I(θ) and another approximate pivotal quantity involving I(θ̂), where θ̂ is the maximum likelihood estimate of θ. Show how both approximate pivotal quantities may be used to find approximate 95% confidence intervals for θ.

3. Suppose n = 25 and ∑_{i=1}^{25} log[x_i + 1] = 250. Use the method explained in (a) to calculate a 95% confidence interval for θ and the two methods explained in (b) to calculate approximate 95% confidence intervals for θ. Compare the three intervals obtained.
Exercise 31. Let X_1, X_2, ..., X_n be iid with density

    f(x|θ) = θ² x exp(-θx)

for x ≥ 0.

1. Show that ∫_0^x f(u|θ) du = 1 - exp(-θx)[1 + θx].

2. Describe how the result from (a) can be used to construct an exact pivotal quantity for θ.

3. Construct FOUR approximate pivotal quantities for θ based on the MLE θ̂.

4. Suppose that n = 10 and the data values are 1.6, 2.5, 2.7, 3.5, 4.6, 5.2, 5.6, 6.4, 7.7, 9.2. Evaluate the 95% confidence interval corresponding to ONE of the exact pivotal quantities (you may need to use a computer to do this). Compare your answer to the 95% confidence intervals corresponding to each of the FOUR approximate pivotal quantities derived in (c).
Exercise 32. Let X_1, X_2, ..., X_n be iid, each having a Poisson density

    f(x|λ) = λ^x exp(-λ)/x!

for x = 0, 1, 2, .... Construct FOUR approximate pivotal quantities for λ based on the MLE λ̂. Show how each may be used to construct an approximate 95% confidence interval for λ. Evaluate the four confidence intervals in the case where the data consist of n = 64 observations with an average value of x̄ = 4.5.
Exercise 33. Let X_1, X_2, ..., X_n be iid with density

    f_1(x|θ) = (1/θ) exp[-x/θ]

for 0 ≤ x < ∞. Let Y_1, Y_2, ..., Y_m be iid with density

    f_2(y|θ, ψ) = (ψ/θ) exp[-ψy/θ]

for 0 ≤ y < ∞.

1. Derive approximate pivotal quantities for each of the parameters θ and ψ.

2. Suppose that n = 10 and the average of the 10 x values is 24.0. Suppose that m = 40 and the average of the 40 y values is 12.0. Calculate approximate 95% confidence intervals for both θ and ψ.
Chapter 6

The Theory of Hypothesis Testing

6.1 Introduction

Suppose that we are going to observe the value of a random vector X. Let 𝒳 denote the set of possible values that X can take and, for x ∈ 𝒳, let f(x, θ) denote the density (or probability mass function) of X, where the parameter θ is some unknown element of the set Ω.

A hypothesis specifies that θ belongs to some subset Ω_0 of Ω. The question arises as to whether the observed data x are consistent with the hypothesis that θ ∈ Ω_0, often written as H_0 : θ ∈ Ω_0. The hypothesis H_0 is usually referred to as the null hypothesis. The null hypothesis is contrasted with the so-called alternative hypothesis H_1 : θ ∈ Ω_1, where Ω_0 ∩ Ω_1 = ∅.

Hypothesis testing aims at finding in the data x enough evidence to reject the null hypothesis

    H_0 : θ ∈ Ω_0

in favor of the alternative hypothesis

    H_1 : θ ∈ Ω_1.

Due to the focus on rejecting H_0 and on controlling the error rate for such a decision, the roles of the two hypotheses are not exchangeable.
In a hypothesis testing situation, two types of error are possible.

The first type of error is to reject the null hypothesis H_0 : θ ∈ Ω_0 as being inconsistent with the observed data x when, in fact, θ ∈ Ω_0, i.e. when the null hypothesis happens to be true. This is referred to as a Type I Error.

The second type of error is to fail to reject the null hypothesis H_0 : θ ∈ Ω_0 as being inconsistent with the observed data x when, in fact, θ ∈ Ω_1, i.e. when the null hypothesis happens to be false. This is referred to as a Type II Error.

The goal is to propose a procedure that, for given data X = x, automatically points to whichever of the hypotheses is more favorable, and in such a way that the chance of making a Type I Error is at most some prescribed small α ∈ (0, 1), referred to as the significance level of the test. More precisely, for given data x we evaluate a certain numerical characteristic T(x), called a test statistic, and if it falls in a certain critical region R_α (often also called the rejection region), we reject H_0 in favor of H_1. We demand that T(x) and R_α are chosen in such a way that for θ ∈ Ω_0

    P( T(X) ∈ R_α | θ ) ≤ α,

i.e. the probability of a Type I Error is at most α.

Therefore the test procedure can be identified with a test statistic T(x) and a rejection region R_α. It is quite natural to expect that R_α shrinks as α decreases (it should be harder to reject H_0 if the allowed Type I Error is smaller). Thus for a given sample x there should be an α̃ such that for α > α̃ we have T(x) ∈ R_α and for α < α̃ the test statistic T(x) is outside R_α. The value α̃ is called the p-value for the given test.

While the focus in setting up a hypothesis testing problem is on the Type I Error, so that it is controlled by the significance level, it is also important to have the chance of a Type II Error as small as possible. For a given testing procedure, smaller chances of Type I Error come at the cost of bigger chances of Type II Error. However, the chances of Type II Error can serve for comparison of testing procedures for which the significance level is the same. For this reason the concept of the power of a test has been introduced. In general, the power of a test is a function p(θ) of θ ∈ Ω_1 and equals the probability of rejecting H_0 while the true parameter is θ, i.e. under the alternative hypothesis. Among two tests in the same problem and at the same significance level, the one with larger power for all θ ∈ Ω_1 is considered better. The power of a given procedure increases with the sample size of the data; therefore it is often used to determine a sample size so that not rejecting H_0 will represent strong support for H_0, not only a lack of evidence for the alternative.
Example 20. Suppose the data consist of a random sample X_1, X_2, ..., X_n from a N(μ, 1) density. Let Ω = (-∞, ∞) and Ω_0 = (-∞, 0], and consider testing H_0 : μ ∈ Ω_0, i.e. H_0 : μ ≤ 0.

The standard estimate of μ for this example is X̄. It would seem rational to consider that the bigger the value of X̄ that we observe, the stronger the evidence against the null hypothesis that μ ≤ 0, in favor of the alternative μ > 0. Thus we decide to use T(X) = X̄ as our test statistic. How big does X̄ have to be in order for us to reject H_0? In other words, we want to determine the rejection region R_α. It is quite natural to consider R_α = [a_α, ∞), so we reject H_0 if X̄ is too large, i.e. X̄ ≥ a_α. To determine a_α we recall that controlling the Type I Error means that

    P( X̄ ≥ a_α | μ ) ≤ α,

where μ ≤ 0. For such μ, we clearly have

    P( X̄ ≥ a_α | μ ) ≤ P( X̄ ≥ a_α | μ = 0 ) = 1 - Φ(a_α √n),

from which we get that a_α = z_{1-α}/√n assures that the Type I Error is controlled at level α.

Suppose that n = 25 and we observe x̄ = 0.32. Finding the p-value is then equivalent to determining the chances of getting such a large value by a random variable that has the distribution of X̄, i.e. N(μ, 1/n). In our particular case this is N(μ, 0.04), and the probability of getting a value for X̄ as large as 0.32 is the area under a N(0, 0.04) curve between 0.32 and ∞, which is the area under a N(0, 1) curve between 0.32/0.20 = 1.60 and ∞, or 0.0548. This quantity is called the p-value. The p-value is used to measure the strength of the evidence against H_0 : μ ≤ 0, and H_0 is rejected if the p-value is less than some small number such as 0.05. You might like to try the R commands 1-pnorm(q=0.32,mean=0,sd=sqrt(.04)) and 1-pnorm(1.6).

Consider the test statistic T(X) = √n X̄ and suppose we observe T(x) = t. A rejection region that results in the significance level α can be defined as

    R_α = [z_{1-α}, ∞).

In order to calculate the p-value we need to find α̃ such that t = z_{1-α̃}, which is equivalent to α̃ = P(T > t).   □
Exercise 34. Since the images on the two sides of coins are made of raised metal, the toss may slightly favor one face or the other if the coin is allowed to roll on one edge upon landing. For the same reason, coin spinning is much more likely to be biased than flipping. Conjurers trim the edges of coins so that when spun they usually land on a particular face.

To investigate this issue a strict method of coin spinning has been designed and the results of it recorded for various coins. We assume that the number of considered tosses is fairly large (bigger than 100). Formulate a hypothesis testing problem for this situation and in the process answer the following questions.

1. Formulate the null and alternative hypotheses.

2. Propose a test statistic used to decide between the hypotheses.

3. Derive a rejection region that guarantees the chance of a Type I Error to be at most α.

4. Explain how, for an observed proportion p of Heads, one could obtain the p-value for the proposed test.

5. Derive a formula for the power of the test.

6. Study how the power depends on the sample size. In particular, it is believed that a certain coin is tilted toward Heads and that the true chance of landing Heads is at least 0.51. Design an experiment in which the chances of making a correct decision using your procedure are 95%.

7. If one hundred thousand spins of a coin are made, what are the chances that the procedure will lead to the correct decision?

8. Suppose that one hundred thousand spins of a coin have been made and the coin landed Heads 50877 times. Find the p-value and report a conclusion.
Example 21 (The power function). Suppose our rule is to reject H_0 : μ ≤ 0 in favor of H_1 : μ > 0 if the p-value is less than 0.05. In order for the p-value to be less than 0.05 we require √n x̄ > 1.65, and so we reject H_0 if x̄ > 1.65/√n. What are the chances of rejecting H_0 if μ = 0.2? If μ = 0.2 then X̄ has the N[0.2, 1/n] distribution and so the probability of rejecting H_0 is

    P( N(0.2, 1/n) ≥ 1.65/√n ) = P( N(0, 1) ≥ 1.65 - 0.2√n ).

For n = 25 this is given by P{N(0, 1) ≥ 0.65} = 0.2578. This calculation can be verified using the R command 1-pnorm(1.65-0.2*sqrt(25)). The following table gives the results of this calculation for n = 25 and various values of μ ∈ Ω_1.

    μ:     0.00  0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
    Prob:  0.05  .125  .258  .440  .637  .802  .912  .968  .991  .998  .999

This is called the power function of the test. The R command

    Ns=seq(from=(-1),to=1, by=0.1)

generates and stores the sequence -1.0, -0.9, ..., +1.0, and the probabilities in the table were calculated using 1-pnorm(1.65-Ns*sqrt(25)). The graph of the power function is presented in Figure 6.1.   □

Example 22 (Sample size). How large would n have to be so that the probability of rejecting H_0 when μ = 0.2 is 0.90? We would require 1.65 - 0.2√n = -1.28, which implies that √n = (1.65 + 1.28)/0.2, or n = 215.   □
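The arithmetic of Example 22 is easily checked in R (here the z quantiles are taken from qnorm rather than rounded to 1.65 and 1.28):

    # Smallest n giving 90% power at mu = 0.2
    ceiling(((qnorm(0.95) + qnorm(0.90)) / 0.2)^2)    # gives 215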
[Figure 6.1: The power function of the test that a normal sample of size 25 has mean value bigger than zero.]
So the general plan for testing a hypothesis is clear: choose a test statistic T, observe the data, calculate the observed value t of the test statistic T, calculate the p-value as the maximum over all values of θ in Ω_0 of the probability of getting a value for T as large as t, and reject H_0 : θ ∈ Ω_0 if the p-value so obtained is too small.
6.2 Hypothesis Testing for Normal Data
Many standard test statistics have been developed for use with normally distributed
data.
Example 23. Suppose that we have data $X_1, X_2, \ldots, X_n$ which are iid observations from a $N(\mu, \sigma^2)$ density, where both $\mu$ and $\sigma$ are unknown. Here $\theta = (\mu, \sigma)$ and $\Theta = \{(\mu, \sigma) : -\infty < \mu < \infty,\ 0 < \sigma < \infty\}$. Define
\[
\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \quad \text{and} \quad s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}.
\]
(a) Suppose that for a fixed value $\mu_0$ we consider $\Theta_0 = \{(\mu, \sigma) : -\infty < \mu \le \mu_0,\ 0 < \sigma < \infty\}$, which can be simply reported as $H_0 : \mu \le \mu_0$. Define $T = \sqrt{n}(\bar{X} - \mu_0)/s$. Let $t$ denote the observed value of $T$. Then the rejection region at the level $\alpha$ is defined as
\[
R_\alpha = [\,t_{1-\alpha,\,n-1}, \infty),
\]
where $t_{p,k}$ is, as usual, the $p$-quantile of the Student t-distribution with $k$ degrees of freedom. It is clear that the p-value is the $\alpha$ that is determined from the equality $t_{1-\alpha,\,n-1} = t$, which is equivalent to $\alpha = P(T > t) = 1 - F(t)$, where $F$ is the cdf of the Student t-distribution with $n-1$ degrees of freedom.

(b) Suppose $H_0 : \mu \ge \mu_0$. Let $T$ be as before and let $t$ denote the observed value of $T$. By analogy with the previous case,
\[
R_\alpha = (-\infty, t_{\alpha,\,n-1}\,],
\]
and the p-value is given by $\alpha = P(T < t) = F(t)$.

(c) Suppose $H_0 : \mu = \mu_0$. Define $T$ as before and let $t$ denote the observed value of $T$. Then a rejection region at the level $\alpha$ can be defined as
\[
R_\alpha = (-\infty, t_{\alpha/2,\,n-1}\,] \cup [\,t_{1-\alpha/2,\,n-1}, \infty).
\]
It is clear that the p-value can be obtained from the equation $|t| = t_{1-\alpha/2,\,n-1}$, or equivalently
\[
\alpha = 2P(T > |t|) = 2\big(1 - F(|t|)\big).
\]
(d) Suppose $H_0 : \sigma \le \sigma_0$. Define $T = \sum_{i=1}^{n}(X_i - \bar{X})^2/\sigma_0^2$. Let $t$ denote the observed value of $T$. Then the rejection region can be set as
\[
R_\alpha = [\,\chi^2_{1-\alpha,\,n-1}, \infty),
\]
where $\chi^2_{p,k}$ is, as usual, the $p$-quantile of the chi-squared distribution with $k$ degrees of freedom. Let us verify that the test statistic with this rejection region indeed gives significance at the level $\alpha$. For $\sigma \le \sigma_0$ we have
\[
P\big(T \ge \chi^2_{1-\alpha,\,n-1}\big) = P\Big( \sum_{i=1}^{n}(X_i - \bar{X})^2/\sigma^2 \ge \chi^2_{1-\alpha,\,n-1}\,\sigma_0^2/\sigma^2 \Big) \le P\Big( \sum_{i=1}^{n}(X_i - \bar{X})^2/\sigma^2 \ge \chi^2_{1-\alpha,\,n-1} \Big) = \alpha.
\]
The p-value is obtained from $t = \chi^2_{1-\alpha,\,n-1}$, which is equivalent to $\alpha = P(T > t) = 1 - F(t)$, where $F$ is the cdf of the chi-squared distribution with $n-1$ degrees of freedom.

(e) The case $H_0 : \sigma^2 \ge \sigma_0^2$ can be treated analogously, so that
\[
R_\alpha = [\,0, \chi^2_{\alpha,\,n-1}\,],
\]
and the p-value is obtained as $\alpha = P(T < t) = F(t)$.

(f) Finally, for $H_0 : \sigma = \sigma_0$ and $T$ defined as before we consider a rejection region
\[
R_\alpha = [\,0, \chi^2_{\alpha/2,\,n-1}\,] \cup [\,\chi^2_{1-\alpha/2,\,n-1}, \infty).
\]
It is easy to see that $R_\alpha$ and $T$ give the significance level $\alpha$. Moreover, the p-value is determined by
\[
\alpha = 2\,\big( P(T < t) \wedge P(T > t) \big) = 2\,\big( F(t) \wedge (1 - F(t)) \big),
\]
where $\wedge$ stands for the minimum operator and $F$ is the cdf of the chi-squared distribution with $n-1$ degrees of freedom. □
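In R the two most common of these procedures are only a few lines; the minimal sketch below uses a placeholder sample x and illustrative values mu0 and sigma0 (all assumptions, to be replaced by the data and hypotheses at hand) to carry out the two-sided t-test of part (c) and the chi-squared calculation of part (d).

# One-sample tests for normal data; x, mu0 and sigma0 are placeholders
x      <- rnorm(20, mean = 5.5, sd = 1.2)        # replace with the observed data
mu0    <- 5
sigma0 <- 1
t.test(x, mu = mu0, alternative = "two.sided")   # part (c): H0: mu = mu0
T_obs <- sum((x - mean(x))^2) / sigma0^2         # part (d): chi-squared statistic
pchisq(T_obs, df = length(x) - 1, lower.tail = FALSE)   # p-value for H0: sigma <= sigma0

The one-sided t-tests of parts (a) and (b) follow by setting alternative = "greater" or "less" in t.test.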
Exercise 35. The following data have been obtained as a result of measuring the temperature at noon on the first of December, for ten consecutive years, in a certain location in Ireland:
3.7, 6.6, 8.0, 2.5, 4.5, 4.5, 3.5, 7.7, 4.0, 5.7.
Using this data set, perform tests for the following problems:
$H_0 : \mu = 6$ vs. $H_1 : \mu \ne 6$,
$H_0 : \sigma \le 1$ vs. $H_1 : \sigma > 1$.
Report p-values and write conclusions.
In the following we examine two-sample problems under the normal distribution assumption.
Example 24. Suppose that we have data $X_1, X_2, \ldots, X_n$ which are iid observations from a $N(\mu_1, \sigma^2)$ density and data $Y_1, Y_2, \ldots, Y_m$ which are iid observations from a $N(\mu_2, \sigma^2)$ density, where $\mu_1$, $\mu_2$, and $\sigma$ are unknown. Here
\[
\theta = (\mu_1, \mu_2, \sigma)
\]
and
\[
\Theta = \{(\mu_1, \mu_2, \sigma) : -\infty < \mu_1 < \infty,\ -\infty < \mu_2 < \infty,\ 0 < \sigma < \infty\}.
\]
Recall the pooled estimator of the common variance
\[
s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2 + \sum_{j=1}^{m}(y_j - \bar{y})^2}{n + m - 2}.
\]
(a) Suppose
\[
\Theta_0 = \{(\mu_1, \mu_2, \sigma) : -\infty < \mu_1 \le \mu_2 < \infty,\ 0 < \sigma < \infty\},
\]
which can be simply expressed by $H_0 : \mu_1 \le \mu_2$ vs. $H_1 : \mu_1 > \mu_2$. Define $T = (\bar{x} - \bar{y})/\sqrt{s^2(1/n + 1/m)}$. Let $t$ denote the observed value of $T$. Then the following rejection region
\[
R_\alpha = [\,t_{1-\alpha,\,n+m-2}, \infty)
\]
for $T$ defines a test at significance level $\alpha$. It is clear, by the same arguments as before, that $\alpha = P(T > t) = 1 - F(t)$ is the p-value for the discussed procedure. Here $F$ is the cdf of the Student t-distribution with $n+m-2$ degrees of freedom.

(b) The case symmetric to the previous one, $H_0 : \mu_1 \ge \mu_2$, can be treated by taking
\[
R_\alpha = (-\infty, t_{\alpha,\,n+m-2}\,],
\]
and the p-value $\alpha = P(T < t)$.

(c) Two-sided testing for $H_0 : \mu_1 = \mu_2$ vs. $H_1 : \mu_1 \ne \mu_2$ is addressed by
\[
R_\alpha = (-\infty, t_{\alpha/2,\,n+m-2}\,] \cup [\,t_{1-\alpha/2,\,n+m-2}, \infty),
\]
with the p-value given by $\alpha = P(|T| > |t|) = 2P(T > |t|) = 2(1 - F(|t|))$.

(d) Suppose that we have data $X_1, X_2, \ldots, X_n$ which are iid observations from a $N(\mu_1, \sigma_1^2)$ density and data $Y_1, Y_2, \ldots, Y_m$ which are iid observations from a $N(\mu_2, \sigma_2^2)$ density, where $\mu_1$, $\mu_2$, $\sigma_1$, and $\sigma_2$ are all unknown. Here $\theta = (\mu_1, \mu_2, \sigma_1, \sigma_2)$ and
\[
\Theta = \{(\mu_1, \mu_2, \sigma_1, \sigma_2) : -\infty < \mu_1 < \infty,\ -\infty < \mu_2 < \infty,\ 0 < \sigma_1 < \infty,\ 0 < \sigma_2 < \infty\}.
\]
Define
\[
s_1^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} \quad \text{and} \quad s_2^2 = \frac{\sum_{j=1}^{m}(y_j - \bar{y})^2}{m-1}.
\]
Suppose
\[
\Theta_0 = \{(\mu_1, \mu_2, \sigma, \sigma) : -\infty < \mu_1 < \infty,\ -\infty < \mu_2 < \infty,\ 0 < \sigma < \infty\},
\]
or simply $H_0 : \sigma_1 = \sigma_2$ vs. $H_1 : \sigma_1 \ne \sigma_2$. Define
\[
T = \frac{s_1^2}{s_2^2}.
\]
Let $t$ denote the observed value of $T$. Define a rejection region by
\[
R_\alpha = [\,0, F_{n-1,m-1}(\alpha/2)\,] \cup [\,F_{n-1,m-1}(1 - \alpha/2), \infty),
\]
where $F_{k,l}(p)$ is the $p$-quantile of the Fisher F-distribution with $k$ and $l$ degrees of freedom. We note that $F_{n-1,m-1}(\alpha/2) = 1/F_{m-1,n-1}(1 - \alpha/2)$. The p-value can be obtained by taking $\alpha = 2\,\big(P(T < t) \wedge P(T > t)\big) = 2\,\big(F(t) \wedge \tilde{F}(1/t)\big)$, where $F$ is the cdf of the Fisher distribution with $n-1$ and $m-1$ degrees of freedom and $\tilde{F}$ is the one with $m-1$ and $n-1$ degrees of freedom. □
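These two-sample procedures are available in R as t.test and var.test. The sketch below uses placeholder vectors x and y (simulated values, not data from the text) and is only meant to show which options correspond to which part of the example.

# Two-sample normal tests; x and y are placeholder samples
x <- rnorm(8, mean = 0.40, sd = 0.10)
y <- rnorm(4, mean = 0.32, sd = 0.10)
var.test(x, y)                                            # part (d): F test of H0: sigma1 = sigma2
t.test(x, y, var.equal = TRUE)                            # part (c): pooled two-sample t test
t.test(x, y, var.equal = TRUE, alternative = "greater")   # part (a): H1: mu1 > mu2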
Exercise 36. The following table gives the concentration of norepinephrine (μmol per gram creatinine) in the urine of healthy volunteers in their early twenties.

Male    0.48 0.36 0.20 0.55 0.45 0.46 0.47 0.23
Female  0.35 0.37 0.27 0.29

The problem is to determine whether there is evidence that the concentration of norepinephrine differs between genders.
1. Testing for the difference between means in the two-sample normal problem is the main testing procedure. However, it requires verifying that the variances in the two samples are the same. Carry out a test that checks whether there is a significant difference between the variances. Evaluate the p-value and make a conclusion.
2. If the above procedure did not reject the equality-of-variances assumption, carry out a procedure that examines the equality of the concentrations between genders. Report the p-value and write down a conclusion.
6.3 Generally Applicable Test Procedures

Suppose that we observe the value of a random vector $X$ whose probability density function is $g(x|\theta)$ for $x \in \mathcal{X}$, where the parameter $\theta = (\theta_1, \theta_2, \ldots, \theta_p)$ is some unknown element of the set $\Theta \subseteq R^p$. Let $\Theta_0$ be a specified subset of $\Theta$. Consider the hypothesis $H_0 : \theta \in \Theta_0$ vs. $H_1 : \theta \in \Theta_1$. In this section we consider three ways in which good test statistics may be found for this general problem.

The Likelihood Ratio Test: This test statistic is based on the idea that the maximum of the log likelihood over the subset $\Theta_0$ should not be too much less than the maximum over the whole set $\Theta$ if, in fact, the parameter actually does lie in the subset $\Theta_0$. Let $\log l(\theta)$ denote the log likelihood function. The test statistic is
\[
T_1(x) = 2 \log \frac{l(\hat{\theta})}{l(\hat{\theta}_0)} = 2\,[\log l(\hat{\theta}) - \log l(\hat{\theta}_0)],
\]
where $\hat{\theta}$ is the value of $\theta$ in the set $\Theta$ for which $\log l(\theta)$ is a maximum and $\hat{\theta}_0$ is the value of $\theta$ in the set $\Theta_0$ for which $\log l(\theta)$ is a maximum.

The Maximum Likelihood Test Statistic: This test statistic is based on the idea that $\hat{\theta}$ and $\hat{\theta}_0$ should be close to one another. Let $I(\theta)$ be the $p \times p$ information matrix. Let $B = I(\hat{\theta})$. The test statistic is
\[
T_2(x) = (\hat{\theta} - \hat{\theta}_0)^T B\, (\hat{\theta} - \hat{\theta}_0).
\]
Other forms of this test statistic follow by choosing $B$ to be the information matrix evaluated at $\hat{\theta}_0$, or the observed information evaluated at $\hat{\theta}$ or at $\hat{\theta}_0$.

The Score Test Statistic: This test statistic is based on the idea that $\hat{\theta}_0$ should almost solve the likelihood equations. Let $S(\theta)$ be the $p \times 1$ vector whose $r$th element is given by $\partial \log l / \partial \theta_r$. Let $C$ be the inverse of $I(\hat{\theta}_0)$, i.e. $C = I(\hat{\theta}_0)^{-1}$. The test statistic is
\[
T_3(x) = S(\hat{\theta}_0)^T C\, S(\hat{\theta}_0).
\]
In order to calculate p-values we need to know the probability distribution of the test statistic under the null hypothesis. Deriving the exact probability distribution may be difficult, but approximations suitable for situations in which the sample size is large are available in the special case where $\Theta$ is a $p$-dimensional set and $\Theta_0$ is a $q$-dimensional subset of $\Theta$ for $q < p$, whence it can be shown that, when $H_0$ is true, the probability distributions of $T_1(x)$, $T_2(x)$ and $T_3(x)$ are all approximated by a $\chi^2_{p-q}$ density.
Example 25. Let $X_1, X_2, \ldots, X_n$ be iid, each having a Poisson distribution with parameter $\lambda$. Consider testing $H_0 : \lambda = \lambda_0$, where $\lambda_0$ is some specified constant. Recall that
\[
\log l(\lambda) = \Big[\sum_{i=1}^{n} x_i\Big] \log \lambda - n\lambda - \log\Big[\prod_{i=1}^{n} x_i!\Big].
\]
Here $\Theta = [0, \infty)$ and the value of $\lambda$ for which $\log l(\lambda)$ is a maximum is $\hat{\lambda} = \bar{x}$. Also $\Theta_0 = \{\lambda_0\}$ and so trivially $\hat{\lambda}_0 = \lambda_0$. We saw also that
\[
S(\lambda) = \frac{\sum_{i=1}^{n} x_i}{\lambda} - n
\quad \text{and that} \quad
I(\lambda) = \frac{\sum_{i=1}^{n} x_i}{\lambda^2}.
\]
Suppose that $\lambda_0 = 2$, $n = 40$ and that when we observe the data we get $\bar{x} = 2.50$. Hence $\sum_{i=1}^{n} x_i = 100$. Then
\[
T_1 = 2[\log l(2.5) - \log l(2.0)] = 200 \log(2.5) - 200 - 200 \log(2.0) + 160 = 4.62.
\]
The information is $B = I(\hat{\lambda}) = 100/2.5^2 = 16$. Hence
\[
T_2 = (\hat{\lambda} - \lambda_0)^2 B = 0.25 \times 16 = 4.
\]
We have $S(\lambda_0) = S(2.0) = 10$ and $I(\lambda_0) = 25$, and so
\[
T_3 = 10^2/25 = 4.
\]
Here $p = 1$, $q = 0$, implying $p - q = 1$. Since $P[\chi^2_1 \ge 3.84] = 0.05$, all three test statistics produce a p-value less than 0.05 and lead to the rejection of $H_0 : \lambda = 2$. □
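The three statistics in this example are easy to reproduce numerically; the following R sketch recomputes them from the summary values n = 40 and sum of the observations equal to 100 used above (the constant term of the log likelihood is dropped, as it cancels in the ratio).

# LRT, maximum likelihood (Wald) and score statistics for H0: lambda = 2
n       <- 40
sum_x   <- 100
lam_hat <- sum_x / n                                       # MLE, 2.5
lam0    <- 2
loglik  <- function(lam) sum_x * log(lam) - n * lam        # constant term omitted
T1 <- 2 * (loglik(lam_hat) - loglik(lam0))                 # about 4.63
T2 <- (lam_hat - lam0)^2 * (sum_x / lam_hat^2)             # 4
T3 <- (sum_x / lam0 - n)^2 / (sum_x / lam0^2)              # 4
pchisq(c(T1, T2, T3), df = 1, lower.tail = FALSE)          # p-values, all below 0.05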
Example 26. Let $X_1, X_2, \ldots, X_n$ be iid with density $f(x|\lambda, \theta) = \lambda \theta x^{\theta - 1} \exp(-\lambda x^{\theta})$ for $x \ge 0$. Consider testing $H_0 : \theta = 1$. Here $\Theta = \{(\lambda, \theta) : 0 < \lambda < \infty,\ 0 < \theta < \infty\}$ and $\Theta_0 = \{(\lambda, 1) : 0 < \lambda < \infty\}$ is a one-dimensional subset of the two-dimensional set $\Theta$. Recall that
\[
\log l(\lambda, \theta) = n\log[\lambda] + n\log[\theta] + (\theta - 1)\sum_{i=1}^{n}\log[x_i] - \lambda \sum_{i=1}^{n} x_i^{\theta}.
\]
Hence the vector $S(\lambda, \theta)$ is given by
\[
\begin{pmatrix}
n/\lambda - \sum_{i=1}^{n} x_i^{\theta} \\
n/\theta + \sum_{i=1}^{n}\log[x_i] - \lambda \sum_{i=1}^{n} x_i^{\theta}\log[x_i]
\end{pmatrix}
\]
and the matrix $I(\lambda, \theta)$ is given by
\[
\begin{pmatrix}
n/\lambda^2 & \sum_{i=1}^{n} x_i^{\theta}\log[x_i] \\
\sum_{i=1}^{n} x_i^{\theta}\log[x_i] & n/\theta^2 + \lambda \sum_{i=1}^{n} x_i^{\theta}\log[x_i]^2
\end{pmatrix}.
\]
We have that $\hat{\theta} = (\hat{\lambda}, \hat{\theta})$, which requires numerical methods for its calculation; this is discussed in the sample of exam problems. Also $\hat{\theta}_0 = (\hat{\lambda}_0, 1)$, where $\hat{\lambda}_0 = 1/\bar{x}$.

Suppose that the observed value of $T_1(x)$ is 3.20. Then the p-value is $P[T_1(x) \ge 3.20] \approx P[\chi^2_1 \ge 3.20] = 0.0736$. In order to get the maximum likelihood test statistic, plug in the values $\hat{\lambda}, \hat{\theta}$ for $\lambda, \theta$ in the formula for $I(\lambda, \theta)$ to get the matrix $B$. Then calculate $T_2(x) = (\hat{\theta} - \hat{\theta}_0)^T B (\hat{\theta} - \hat{\theta}_0)$ and use the $\chi^2_1$ tables to calculate the p-value.

Finally, to calculate the score test statistic, note that the vector $S(\hat{\theta}_0)$ is given by
\[
\begin{pmatrix}
0 \\
n + \sum_{i=1}^{n}\log[x_i] - \sum_{i=1}^{n} x_i \log[x_i]/\bar{x}
\end{pmatrix}
\]
and the matrix $I(\hat{\theta}_0)$ is given by
\[
\begin{pmatrix}
n\bar{x}^2 & \sum_{i=1}^{n} x_i \log[x_i] \\
\sum_{i=1}^{n} x_i \log[x_i] & n + \sum_{i=1}^{n} x_i \log[x_i]^2/\bar{x}
\end{pmatrix}.
\]
Since $T_3(x) = S(\hat{\theta}_0)^T C S(\hat{\theta}_0)$, where $C = I(\hat{\theta}_0)^{-1}$, we have that $T_3(x)$ is
\[
\Big[\, n + \sum_{i=1}^{n}\log[x_i] - \sum_{i=1}^{n} x_i \log[x_i]/\bar{x} \,\Big]^2
\]
multiplied by the lower diagonal element of $C$, which is given by
\[
\frac{n\bar{x}^2}{[n\bar{x}^2]\big[\, n + \sum_{i=1}^{n} x_i \log[x_i]^2/\bar{x} \,\big] - \big[\sum_{i=1}^{n} x_i \log[x_i]\big]^2}.
\]
Hence we get that
\[
T_3(x) = \frac{\big[\, n + \sum_{i=1}^{n}\log[x_i] - \sum_{i=1}^{n} x_i \log[x_i]/\bar{x} \,\big]^2\, n\bar{x}^2}
{[n\bar{x}^2]\big[\, n + \sum_{i=1}^{n} x_i \log[x_i]^2/\bar{x} \,\big] - \big[\sum_{i=1}^{n} x_i \log[x_i]\big]^2}.
\]
No numerical techniques are needed to calculate the value of $T_3(x)$, and for this reason the score test is often preferred to the other two. However, there is some evidence that the likelihood ratio test is more powerful, in the sense that it has a better chance of detecting departures from the null hypothesis. □
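The two-parameter MLE $(\hat{\lambda}, \hat{\theta})$ mentioned above has to be found numerically. One possible approach, shown as a minimal R sketch below, is to minimise the negative log likelihood with optim; the simulated sample x and the starting values are illustrative assumptions, and the log-parameterisation is used only to keep the search inside the positive quadrant.

# Numerical MLE for f(x) = lambda*theta*x^(theta-1)*exp(-lambda*x^theta)
set.seed(1)
x <- rweibull(50, shape = 1.3, scale = 2)            # placeholder data
n <- length(x)
negloglik <- function(logpar) {
  lambda <- exp(logpar[1]); theta <- exp(logpar[2])  # enforce positivity
  -(n * log(lambda) + n * log(theta) +
      (theta - 1) * sum(log(x)) - lambda * sum(x^theta))
}
fit <- optim(c(0, 0), negloglik)                     # Nelder-Mead over the log-parameters
exp(fit$par)                                         # (lambda_hat, theta_hat)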
Exercise 37. Suppose that household incomes in a certain country have a Pareto distribution with probability density function
\[
f(x) = \frac{\alpha v^{\alpha}}{x^{\alpha+1}}, \qquad v \le x < \infty,
\]
where $\alpha > 0$ is unknown and $v > 0$ is known. Let $x_1, x_2, \ldots, x_n$ denote the incomes for a random sample of $n$ such households. We wish to test the null hypothesis $\alpha = 1$ against the alternative that $\alpha \ne 1$.
1. Derive an expression for $\hat{\alpha}$, the MLE of $\alpha$.
2. Show that the generalised likelihood ratio test statistic, $\Lambda(x)$, satisfies
\[
\ln\{\Lambda(x)\} = n - n\ln(\hat{\alpha}) - \frac{n}{\hat{\alpha}}.
\]
3. Show that the test accepts the null hypothesis if
\[
k_1 < \sum_{i=1}^{n} \ln(x_i) < k_2,
\]
and state how the values of $k_1$ and $k_2$ may be determined. Hint: Find the distribution of $\ln(X)$, where $X$ has a Pareto distribution.
Exercise 38. A Geiger counter (radioactivity meter) is calibrated using a source of known radioactivity. The counts recorded by the counter, $x_i$, over 200 one-second intervals are recorded:

8 12 6 11 3 9 9 8 5 4 6 11 6 14 3 5 15 11 7 6 9 9 14 13
6 11 . . . . . . . . . . . . . . . . . . . . . . 9 8 5 8 9 14 14

The sum of the counts is $\sum_{i=1}^{200} x_i = 1800$. The counts can be treated as observations of iid Poisson random variables with parameter $\lambda$ and p.m.f.
\[
f(x_i; \lambda) = \lambda^{x_i} e^{-\lambda}/x_i!, \qquad x_i = 0, 1, \ldots;\ \lambda > 0.
\]
If the Geiger counter is functioning correctly then $\lambda = 10$, and to check this we would test $H_0 : \lambda = 10$ versus $H_1 : \lambda \ne 10$. Suppose that we choose to test at a significance level of 5%. The test can be performed using a generalized likelihood ratio test. Carry out such a test. What does this imply about the Geiger counter? Finally, given the form of the MLE, what was the point of recording the counts in 200 one-second intervals rather than recording the count in one 200-second interval?
6.4 The Neyman-Pearson Lemma

Suppose we are testing a simple null hypothesis $H_0 : \theta = \theta'$ against a simple alternative $H_1 : \theta = \theta''$, where $\theta$ is the parameter of interest and $\theta'$, $\theta''$ are particular values of $\theta$. Observed values of the i.i.d. random variables $X_1, X_2, \ldots, X_n$, each with p.d.f. $f_X(x|\theta)$, are available. We are going to reject $H_0$ if $(x_1, x_2, \ldots, x_n) \in R_\alpha$, where $R_\alpha$ is a region of the $n$-dimensional space called the critical or rejection region. The critical region $R_\alpha$ is determined so that the probability of a Type I error is $\alpha$:
\[
P[\,(X_1, X_2, \ldots, X_n) \in R_\alpha \mid H_0\,] = \alpha.
\]
Definition 11. We call the test defined through $R_\alpha$ the most powerful at the significance level $\alpha$ in the testing problem $H_0 : \theta = \theta'$ against the alternative $H_1 : \theta = \theta''$ if any other test of this problem has lower power.

The Neyman-Pearson lemma provides us with a way of finding most powerful tests in the above problem. It demonstrates that the likelihood ratio test is the most powerful for the above problem. To avoid the distracting technicalities of the non-continuous case, we formulate and prove it for the continuous distribution case.

Lemma 7 (The Neyman-Pearson lemma). Let $R_\alpha$ be a subset of the sample space defined by
\[
R_\alpha = \{x : l(\theta'|x)/l(\theta''|x) \le k\},
\]
where $k$ is uniquely determined from the equality
\[
\alpha = P[X \in R_\alpha \mid H_0].
\]
Then $R_\alpha$ defines the most powerful test at the significance level $\alpha$ for testing the simple hypothesis $H_0 : \theta = \theta'$ against the alternative simple hypothesis $H_1 : \theta = \theta''$.
Proof. For any region $R$ of $n$-dimensional space, we will denote the probability that $X \in R$ by $\int_R l(\theta)$, where $\theta$ is the true value of the parameter. The full notation, omitted to save space, would be
\[
P(X \in R \mid \theta) = \int \cdots \int_R l(\theta | x_1, \ldots, x_n)\, dx_1 \cdots dx_n.
\]
We need to prove that if $A$ is another critical region of size $\alpha$, then the power of the test associated with $R_\alpha$ is at least as great as the power of the test associated with $A$, or in the present notation, that
\[
\int_A l(\theta'') \le \int_{R_\alpha} l(\theta''). \qquad (6.1)
\]
By the definition of $R_\alpha$ we have
\[
\int_{A^c \cap R_\alpha} l(\theta'') \ge \frac{1}{k} \int_{A^c \cap R_\alpha} l(\theta'). \qquad (6.2)
\]
On the other hand,
\[
\int_{A \cap R_\alpha^c} l(\theta'') \le \frac{1}{k} \int_{A \cap R_\alpha^c} l(\theta'). \qquad (6.3)
\]
We now establish (6.1), thereby completing the proof.
\begin{align*}
\int_A l(\theta'') &= \int_{A \cap R_\alpha} l(\theta'') + \int_{A \cap R_\alpha^c} l(\theta'') \\
&= \int_{R_\alpha} l(\theta'') - \int_{A^c \cap R_\alpha} l(\theta'') + \int_{A \cap R_\alpha^c} l(\theta'') \\
&\le \int_{R_\alpha} l(\theta'') - \frac{1}{k}\int_{A^c \cap R_\alpha} l(\theta') + \frac{1}{k}\int_{A \cap R_\alpha^c} l(\theta') \qquad \text{(see (6.2), (6.3))} \\
&= \int_{R_\alpha} l(\theta'') - \frac{1}{k}\int_{R_\alpha} l(\theta') + \frac{1}{k}\int_{A} l(\theta') \\
&= \int_{R_\alpha} l(\theta'') - \frac{\alpha}{k} + \frac{\alpha}{k} = \int_{R_\alpha} l(\theta''),
\end{align*}
since both $R_\alpha$ and $A$ have size $\alpha$.
Example 27. Suppose $X_1, \ldots, X_n$ are iid $N(\theta, 1)$, and we want to test $H_0 : \theta = \theta'$ versus $H_1 : \theta = \theta''$, where $\theta'' > \theta'$. According to the Z-test, we should reject $H_0$ if $Z = \sqrt{n}(\bar{X} - \theta')$ is large, or equivalently if $\bar{X}$ is large. We can now use the Neyman-Pearson lemma to show that the Z-test is best. The likelihood function is
\[
L(\theta) = (2\pi)^{-n/2} \exp\Big\{-\sum_{i=1}^{n}(x_i - \theta)^2/2\Big\}.
\]
According to the Neyman-Pearson lemma, a best critical region is given by the set of $(x_1, \ldots, x_n)$ such that $L(\theta')/L(\theta'') \le k_1$, or equivalently, such that
\[
\frac{1}{n} \ln\big[L(\theta'')/L(\theta')\big] \ge k_2.
\]
But
\begin{align*}
\frac{1}{n} \ln\big[L(\theta'')/L(\theta')\big]
&= \frac{1}{n} \sum_{i=1}^{n}\big[(x_i - \theta')^2/2 - (x_i - \theta'')^2/2\big] \\
&= \frac{1}{2n} \sum_{i=1}^{n}\big[(x_i^2 - 2\theta' x_i + \theta'^2) - (x_i^2 - 2\theta'' x_i + \theta''^2)\big] \\
&= \frac{1}{2n} \sum_{i=1}^{n}\big[2(\theta'' - \theta')x_i + \theta'^2 - \theta''^2\big] \\
&= (\theta'' - \theta')\,\bar{x} + \frac{1}{2}\big[\theta'^2 - \theta''^2\big].
\end{align*}
So the best test rejects $H_0$ when $\bar{x} \ge k$, where $k$ is a constant. But this is exactly the form of the rejection region for the Z-test. Therefore, the Z-test is best. □
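To make the conclusion concrete, the cut-off $k$ and the resulting power can be computed for any particular pair of simple hypotheses. The R sketch below uses illustrative values $\theta' = 0$, $\theta'' = 0.5$, $n = 25$ and $\alpha = 0.05$, all of which are assumptions rather than values taken from the text.

# Most powerful (Z) test of H0: theta = 0 vs H1: theta = 0.5 at level 0.05
n      <- 25
theta0 <- 0
theta1 <- 0.5
alpha  <- 0.05
k      <- theta0 + qnorm(1 - alpha) / sqrt(n)    # reject H0 when xbar >= k
power  <- 1 - pnorm(sqrt(n) * (k - theta1))      # P(xbar >= k) under H1
c(cutoff = k, power = power)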
Exercise 39. A random sample of $n$ flowers is taken from a colony and the numbers $X$, $Y$ and $Z$ of the three genotypes AA, Aa and aa are observed, where $X + Y + Z = n$. Under the hypothesis of random cross-fertilisation, each flower has probabilities $\theta^2$, $2\theta(1-\theta)$ and $(1-\theta)^2$ of belonging to the respective genotypes, where $0 < \theta < 1$ is an unknown parameter.
1. Show that the MLE of $\theta$ is $\hat{\theta} = (2X + Y)/(2n)$.
2. Consider the test statistic $T = 2X + Y$. Given that $T$ has a binomial distribution with parameters $2n$ and $\theta$, obtain a critical region of approximate size $\alpha$ based on $T$ for testing the null hypothesis that $\theta = \theta_0$ against the alternative that $\theta = \theta_1$, where $\theta_1 < \theta_0$ and $0 < \alpha < 1$.
3. Show that the above test is the most powerful of size $\alpha$.
4. Deduce approximately how large $n$ must be to ensure that the power is at least 0.9 when $\alpha = 0.05$, $\theta_0 = 0.4$ and $\theta_1 = 0.3$.
Definition 12. For a general testing problem $H_0 : \theta \in \Theta_0$ vs. $H_1 : \theta \in \Theta_1$, a test at significance level $\alpha$ is called uniformly most powerful if, at each $\theta_1 \in \Theta_1$, its power is at least as large as the power of any other test of the same problem at the same significance level.

It is easy to note that if the test (rejection region) derived from the Neyman-Pearson lemma does not depend on $\theta_1$, then it is uniformly most powerful for the problem $H_0 : \theta = \theta_0$ vs. $H_1 : \theta \in \Theta_1$.
Exercise 40. Let $X_1, X_2, \ldots, X_n$ be a random sample from the Weibull distribution with probability density function $f(x) = \lambda\theta x^{\theta-1}\exp(-\lambda x^{\theta})$, for $x > 0$, where $\lambda > 0$ is unknown and $\theta > 0$ is known.
1. Find the form of the most powerful test of the null hypothesis that $\lambda = \lambda_0$ against the alternative hypothesis that $\lambda = \lambda_1$, where $\lambda_0 > \lambda_1$.
2. Find the distribution function of $X^{\theta}$ and deduce that this random variable has an exponential distribution.
3. Find the critical region of the most powerful test at the 1% level when $n = 50$, $\lambda_0 = 0.05$ and $\lambda_1 = 0.025$. Evaluate the power of this test.
4. Explain what is meant by the power of a test and describe how the power may be used to determine the most appropriate size of a sample. Use this approach in the situation described in the previous item to determine the minimal sample size for a test that would have chances of any kind of error smaller than 1%.
Exercise 41. In a particular set of Bernoulli trials, it is widely believed that the probability of a success is $\theta = \frac{3}{4}$. However, an alternative view is that $\theta = \frac{2}{3}$. In order to test $H_0 : \theta = \frac{3}{4}$ against $H_1 : \theta = \frac{2}{3}$, $n$ independent trials are to be observed. Let $\hat{\theta}$ denote the proportion of successes in these trials.
1. Show that the likelihood ratio approach leads to a size $\alpha$ test in which $H_0$ is rejected in favour of $H_1$ when $\hat{\theta} < k$ for some suitable $k$.
2. By applying the central limit theorem, write down the large sample distributions of $\hat{\theta}$ when $H_0$ is true and when $H_1$ is true.
3. Hence find an expression for $k$ in terms of $n$ when $\alpha = 0.05$.
4. Find $n$ so that this test has power 0.95.
6.5 Goodness of Fit Tests

Suppose that we have a random experiment with a random variable $Y$ of interest. Assume additionally that $Y$ is discrete with density function $f$ on a finite set $S$. We repeat the experiment $n$ times to generate a random sample $Y_1, Y_2, \ldots, Y_n$ from the distribution of $Y$. These are independent variables, each with the distribution of $Y$.

In this section, we assume that the distribution of $Y$ is unknown. For a given probability mass function $f_0$, we will test the hypotheses $H_0 : f = f_0$ versus $H_1 : f \ne f_0$. The test that we will construct is known as the goodness of fit test for the conjectured density $f_0$. As usual, our challenge in developing the test is to find an appropriate test statistic, one that gives us information about the hypotheses and whose distribution, under the null hypothesis, is known, at least approximately.

Suppose that $S = \{y_1, y_2, \ldots, y_k\}$. To simplify the notation, let $p_j = f_0(y_j)$ for $j = 1, 2, \ldots, k$. Now let $N_j = \#\{i \in \{1, 2, \ldots, n\} : Y_i = y_j\}$ for $j = 1, 2, \ldots, k$. Under the null hypothesis, $(N_1, N_2, \ldots, N_k)$ has the multinomial distribution with parameters $n$ and $p_1, p_2, \ldots, p_k$, with $E(N_j) = np_j$ and $\mathrm{Var}(N_j) = np_j(1 - p_j)$. This result indicates how we might begin to construct our test: for each $j$ we can compare the observed frequency of $y_j$ (namely $N_j$) with the expected frequency of the value $y_j$ (namely $np_j$) under the null hypothesis. Specifically, our test statistic will be
\[
X^2 = \frac{(N_1 - np_1)^2}{np_1} + \frac{(N_2 - np_2)^2}{np_2} + \cdots + \frac{(N_k - np_k)^2}{np_k}.
\]
Note that the test statistic is based on the squared errors (the differences between the expected frequencies and the observed frequencies). The reason that the squared errors are scaled as they are is the following crucial fact, which we will accept without proof: under the null hypothesis, as $n$ increases to infinity, the distribution of $X^2$ converges to the chi-square distribution with $k - 1$ degrees of freedom.

For $m > 0$ and $r$ in $(0, 1)$, we will let $\chi^2_{m,r}$ denote the quantile of order $r$ for the chi-square distribution with $m$ degrees of freedom. Then the following test has approximate significance level $\alpha$: reject $H_0 : f = f_0$ versus $H_1 : f \ne f_0$ if and only if $X^2 > \chi^2_{k-1, 1-\alpha}$. The test is an approximate one and works best when $n$ is large. Just how large $n$ needs to be depends on the $p_j$. One popular rule of thumb proposes that the test will work well if all the expected frequencies satisfy $np_j \ge 1$ and at least 80% of the expected frequencies satisfy $np_j \ge 5$.
Example 28 (Genetical inheritance). In crosses between two types of maize four distinct types of plants were found in the second generation. In a sample of 1301 plants there were 773 green, 231 golden, 238 green-striped, and 59 golden-green-striped. According to a simple theory of genetical inheritance the probabilities of obtaining these four types of plants are $\frac{9}{16}$, $\frac{3}{16}$, $\frac{3}{16}$ and $\frac{1}{16}$ respectively. Is the theory acceptable as a model for this experiment?

Formally we will consider the hypotheses:
$H_0 : p_1 = \frac{9}{16}$, $p_2 = \frac{3}{16}$, $p_3 = \frac{3}{16}$ and $p_4 = \frac{1}{16}$;
$H_1 :$ not all the above probabilities are correct.

The expected frequency for each type of plant under $H_0$ is $np_i = 1301\,p_i$. We therefore calculate the following table:

Observed Counts   Expected Counts   Contributions to $X^2$
$O_i$             $E_i$             $(O_i - E_i)^2/E_i$
773               731.8125          2.318
231               243.9375          0.686
238               243.9375          0.145
59                81.3125           6.123
                                    $X^2$ = 9.272

Since $X^2$ embodies the differences between the observed and expected values, we can say that if $X^2$ is large there is a big difference between what we observe and what we expect, so the theory does not seem to be supported by the observations. If $X^2$ is small the observations apparently conform to the theory and act as support for it. Under $H_0$ the test statistic $X^2$ is approximately distributed as $\chi^2$ with 3 degrees of freedom. In order to define what we would consider to be an unusually large value of $X^2$ we will choose a significance level of $\alpha = 0.05$. The R command qchisq(p=0.05,df=3,lower.tail=FALSE) calculates the 5% critical value for the test as 7.815. Since our value of $X^2$ is greater than the critical value 7.815, we reject $H_0$ and conclude that the theory is not a good model for these data. The R command pchisq(q=9.272,df=3,lower.tail=FALSE) calculates the p-value for the test, equal to 0.026. (These data are examined further in chapter 9 of Snedecor and Cochran.) □
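The same test is available directly through R's chisq.test; the short sketch below reproduces the statistic and p-value reported above from the observed counts and the hypothesised probabilities.

# Goodness of fit test for the maize data
observed <- c(773, 231, 238, 59)
probs    <- c(9, 3, 3, 1) / 16
chisq.test(observed, p = probs)      # X-squared = 9.27, df = 3, p-value about 0.026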
Very often we do not have a list of probabilities to specify our hypothesis as we had in the above example. Rather, our hypothesis relates to the probability distribution of the counts without necessarily specifying the parameters of the distribution. For instance, we might want to test that the number of male babies born on successive days in a maternity hospital follows a binomial distribution, without specifying the probability that any given baby will be male. Or, we might want to test that the number of defective items in large consignments of spare parts for cars follows a Poisson distribution, again without specifying the parameter of the distribution.

The $X^2$ test is applicable when the probabilities depend on unknown parameters, provided that the unknown parameters are replaced by their maximum likelihood estimates and provided that one degree of freedom is deducted for each parameter estimated.

Example 29. Feller reports an analysis of flying-bomb hits in the south of London during World War II. Investigators partitioned the area into 576 sectors, each being $\frac{1}{4}$ km$^2$. The following table gives the resulting data:

No. of hits (x)               0    1    2   3   4  5
No. of sectors with x hits    229  211  93  35  7  1

If the hit pattern is random, in the sense that the probability that a bomb will land in any particular sector is constant, irrespective of the landing place of previous bombs, a Poisson distribution might be expected to model the data.

x   $P(x) = \lambda^x e^{-\lambda}/x!$   Expected 576 P(x)   Observed   $(O_i - E_i)^2/E_i$
0   0.395                                227.53              229        0.0095
1   0.367                                211.34              211        0.0005
2   0.170                                98.15               93         0.2702
3   0.053                                30.39               35         0.6993
4   0.012                                7.06                7          0.0005
5   0.002                                1.31                1          0.0734
                                                                        $X^2$ = 1.0534

The MLE of $\lambda$ was calculated as $\hat{\lambda} = 535/576 = 0.9288$, that is, the total number of observed hits divided by the number of sectors. We carry out the chi-squared test as before, except that we now subtract one additional degree of freedom because we had to estimate $\lambda$. Under $H_0$ the test statistic $X^2$ is approximately distributed as $\chi^2$ with 4 degrees of freedom. The R command qchisq(p=0.05,df=4,lower.tail=FALSE) calculates the 5% critical value for the test as 9.488. Alternatively, the R command
pchisq(q=1.0534,df=4,lower.tail=FALSE)
calculates the p-value for the test, equal to 0.90. The result of the chi-squared test is not statistically significant, indicating that the divergence between the observed and expected counts can be regarded as random fluctuation about the expected values. Feller comments, "It is interesting to note that most people believed in a tendency of the points of impact to cluster. If this were true, there would be a higher frequency of sectors with either many hits or no hits and a deficiency in the intermediate classes. The above table indicates perfect randomness and homogeneity of the area; we have here an instructive illustration of the established fact that to the untrained eye randomness appears as regularity or tendency to cluster." □
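A compact way to redo this calculation in R (treating the last cell simply as x = 5, as in the table above) is sketched below. Because $\lambda$ is estimated from the data, the p-value is taken from a chi-squared distribution with 4 rather than 5 degrees of freedom, so the p-value reported by the built-in chisq.test is not used directly.

# Flying-bomb data: chi-squared goodness of fit with an estimated Poisson mean
hits     <- 0:5
observed <- c(229, 211, 93, 35, 7, 1)
lambda   <- sum(hits * observed) / sum(observed)            # 535/576 = 0.9288
expected <- sum(observed) * dpois(hits, lambda)
X2 <- sum((observed - expected)^2 / expected)
pchisq(X2, df = length(hits) - 1 - 1, lower.tail = FALSE)   # about 0.90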
6.6 The $\chi^2$ Test for Contingency Tables

Let $X$ and $Y$ be a pair of categorical variables and suppose there are $r$ possible values for $X$ and $c$ possible values for $Y$. Examples of categorical variables are Religion, Race, Social Class, Blood Group, Wind Direction, Fertiliser Type, etc. The random variables $X$ and $Y$ are said to be independent if $P[X = a, Y = b] = P[X = a]P[Y = b]$ for all possible values $a$ of $X$ and $b$ of $Y$. In this section we consider how to test the null hypothesis of independence using data consisting of a random sample of $N$ observations from the joint distribution of $X$ and $Y$.

Example 30. A study was carried out to investigate whether hair colour (columns) and eye colour (rows) were genetically linked. A genetic link would be supported if the proportions of people having various eye colourings varied from one hair colour grouping to another. 1000 people were chosen at random and their hair colour and eye colour recorded. The data are summarised in the following table:

$O_{ij}$   Black  Brown  Fair  Red  Total
Brown      60     110    42    30   242
Green      67     142    28    35   272
Blue       123    248    90    25   486
Total      250    500    160   90   1000

The proportion of people with red hair is 90/1000 = 0.09 and the proportion having blue eyes is 486/1000 = 0.486. So if eye colour and hair colour were truly independent we would expect the proportion of people having both red hair and blue eyes to be approximately equal to (0.090)(0.486) = 0.04374, or equivalently we would expect the number of people having both red hair and blue eyes to be close to (1000)(0.04374) = 43.74. The observed number of people having both red hair and blue eyes is 25. We can do similar calculations for all other combinations of hair colour and eye colour to derive the following table of expected counts:

$E_{ij}$   Black  Brown  Fair    Red    Total
Brown      60.5   121    38.72   21.78  242
Green      68.0   136    43.52   24.48  272
Blue       121.5  243    77.76   43.74  486
Total      250.0  500    160.00  90.00  1000

In order to test the null hypothesis of independence we need a test statistic which measures the magnitude of the discrepancy between the observed table and the table that would be expected if independence were in fact true. In the early part of the twentieth century, long before the invention of maximum likelihood or the formal theory of hypothesis testing, Karl Pearson (one of the founding fathers of Statistics) proposed the following method of constructing such a measure of discrepancy: for each cell in the table calculate $(O_{ij} - E_{ij})^2/E_{ij}$, where $O_{ij}$ is the observed count and $E_{ij}$ is the expected count, and add the resulting values across all cells of the table. The resulting total is called the $\chi^2$ test statistic, which we will denote by $W$. The null hypothesis of independence is rejected if the observed value of $W$ is surprisingly large. In the hair and eye colour example the discrepancies are as follows:

$(O_{ij} - E_{ij})^2/E_{ij}$   Black  Brown  Fair   Red
Brown                          0.004  1.000  0.278  3.102
Green                          0.015  0.265  5.535  4.521
Blue                           0.019  0.103  1.927  8.029

\[
W = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = 24.796.
\]
What we would now like to calculate is the p-value, which is the probability of getting a value for $W$ as large as 24.796 if the hypothesis of independence were in fact true. Fisher showed that, when the hypothesis of independence is true, $W$ behaves somewhat like a $\chi^2$ random variable with degrees of freedom given by $(r-1)(c-1)$, where $r$ is the number of rows in the table and $c$ is the number of columns. In our example $r = 3$, $c = 4$ and so $(r-1)(c-1) = 6$, and so the p-value is $P[W \ge 24.796] \approx P[\chi^2_6 \ge 24.796] = 0.0004$. Hence we reject the independence hypothesis. □
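R's chisq.test reproduces the whole computation from the observed table; the sketch below builds the matrix used above and reports $W$, the degrees of freedom and the p-value.

# Chi-squared test of independence for the hair/eye colour table
counts <- matrix(c(60, 110, 42, 30,
                   67, 142, 28, 35,
                   123, 248, 90, 25),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(eyes = c("Brown", "Green", "Blue"),
                                 hair = c("Black", "Brown", "Fair", "Red")))
chisq.test(counts)     # X-squared = 24.796, df = 6, p-value about 0.0004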
Exercise 42. It is believed that the number of breakages in a damaged chromosome, $X$, follows a truncated Poisson distribution with probability mass function
\[
P(X = k) = \frac{e^{-\lambda}}{1 - e^{-\lambda}} \cdot \frac{\lambda^k}{k!}, \qquad k = 1, 2, \ldots,
\]
where $\lambda > 0$ is an unknown parameter. The frequency distribution of the number of breakages in a random sample of 33 damaged chromosomes was as follows:

Breakages     1   2  3  4  5  6  7  8  9  10  11  12  13  Total
Chromosomes   11  6  4  5  0  1  0  2  1  0   1   1   1   33

1. Find an equation satisfied by $\hat{\lambda}$, the MLE of $\lambda$.
2. Discuss approximations of $\hat{\lambda}$. Show that the observed data give the estimate $\hat{\lambda} = 3.6$.
3. Using this value for $\hat{\lambda}$, test the null hypothesis that the number of breakages in a damaged chromosome follows a truncated Poisson distribution. The categories 6 to 13 should be combined into a single category in the goodness-of-fit test.