
Introduction to distribution theory and regression analysis

(STA2030S)

Christien Thiart and the STA2030 team


Department of Statistical Sciences
University of Cape Town
© 20 July 2009

Contents

1 Random variables, univariate distributions
  1.1 Assumed statistical background
  1.2 Random variables
  1.3 Probability mass functions
  1.4 Probability density functions
      1.4.1 The Gamma distribution
      1.4.2 The Beta distribution
  1.5 Distribution function
  1.6 Functions of random variables - cumulative distribution technique
  Tutorial Exercises

2 Bivariate Distributions
  2.1 Joint random variables
  2.2 Independence and conditional distributions
  2.3 The bivariate Gaussian distribution
  2.4 Functions of bivariate random variables
      2.4.1 General principles of the transformation technique
  Tutorial Exercises

3 Moments of univariate distributions and moment generating function
  3.1 Assumed statistical background
  3.2 Moments of univariate distributions
      3.2.1 Moments - examples A to F
  3.3 The moment generating function
  3.4 Moment generating functions for functions of random variables
  3.5 The central limit theorem
  Tutorial Exercises

4 Moments of bivariate distributions
  4.1 Assumed statistical background
  4.2 Moments of bivariate distributions: covariance and correlation
  4.3 Conditional moments and regression of the mean
  Tutorial Exercises

5 Distributions of Sample Statistics
  5.1 Random samples and statistics
  5.2 Distributions of sample mean and variance for Gaussian distributed populations
  5.3 Application to χ² goodness-of-fit tests
  5.4 Student's t distribution
  5.5 Applications of the t distribution to two-sample tests
  5.6 The F distribution
  Tutorial Exercises

6 Regression analysis
  6.1 Introduction
  6.2 Simple (linear) regression - model, assumptions
  6.3 Matrix notation for simple regression
  6.4 Multivariate regression - model, assumptions
  6.5 Graphical residual analysis
  6.6 Variable diagnostics
      6.6.1 Analysis of variance (ANOVA)
  6.7 Subset selection of regressor variables - building the regression model
      6.7.1 All possible regressions
      6.7.2 Stepwise regression
  6.8 Further residual analysis
  6.9 Inference about regression parameters and predicting
      6.9.1 Inference on regression parameters
      6.9.2 Drawing inferences about E[Y|xh]
      6.9.3 Drawing inferences about future observations
  Tutorial Exercises

A Attitude

B Maths Toolbox
  B.1 Differentiation (e.g. ComMath, chapter 3)
  B.2 Integration (ComMath, chapter 7)
  B.3 General
  B.4 Double integrals
  B.5 Matrices: e.g. ComMath, Chapter 5

Chapter 1

Random variables, univariate distributions
These notes have been prepared for the course STA2030S, for which the pre-requisite at UCT is a
pass in STA1000, STA2020 and a course in mathematics. It is thus assumed that the material in
the textbook for STA1000 (INTROSTAT, by L G Underhill and D J Bradfield) is known.
This course is based on the following three principles:

- First year statistical background (Introstat),
- Sharp mathematical tools (Appendix B), and
- Attitude (Appendix A).

1.1 Assumed statistical background

- Concept of a random variable (Introstat, chapter 4)
- Probability mass functions: binomial, Poisson (Introstat, chapters 5 and 6)
- Probability density functions: normal, uniform, exponential (Introstat, chapters 5 and 6)
- Concept of a cumulative distribution function (Introstat, chapter 6)
- Quantiles (Introstat, chapters 1 and 6)

1.2 Random variables

[Random variable: a numerical quantity whose value is determined by the outcome of a random experiment (or repeatable process). Each repetition of the random experiment is called a trial.]

It is useful to start this course with the question: what do we mean by probability? The basis in STA1000 was the concept of a sample space, subsets of which are defined as events. Very often the events are closely linked to values of a random variable, i.e. a real-valued number whose true value is at present unknown, either through ignorance or because it is still undetermined. Thus if T is a random variable, then typical events may be {T ≤ 20}, {0 < T < 1}, {T is a prime number}.


In general terms, we can view an event as a statement, which is in principle verifiable as true or
false, but the truth of which will only be revealed for sure at a later time (if ever). The concept of a
random experiment was introduced in STA1000, as being the observation of nature (the real world)
which will resolve the issue of the truth of the statement (i.e. whether the event has occurred or
not). The probability of an event is then a measure of the degree of likelihood that the statement
is true (that the event has occurred or will occur), i.e. the degree of credibility in the statement.
This measure is conventionally standardized so that if the statement is known to be false (the
event is impossible) then the probability is 0; while if it is known to be true (the event is certain)
then the probability is 1. The axioms of Kolmogorov (INTROSTAT, chapter 3) give minimal
requirements for a probability measure to be rationally consistent.
Kolmogorov (definition of a probability measure)

A probability measure on a sample space Ω is a function P from subsets (events, say A) of Ω to the interval [0, 1] which satisfies the following three axioms:

(1) P[Ω] = 1.
(2) If A ⊆ Ω then P[A] ≥ 0.
(3) If A1 and A2 are disjoint then P[A1 ∪ A2] = P[A1] + P[A2]. In general, if A1, A2, ..., Ar, ... is a sequence of mutually exclusive events (that is, the intersection between Ai and Aj is empty (Ai ∩ Aj = ∅) for i ≠ j), then

    P[A1 ∪ A2 ∪ ...] = P[∪_{i=1}^{∞} Ai] = Σ_{i=1}^{∞} P[Ai]

Within this broad background, there are two rather divergent ways in which a probability measure may be assessed and/or interpreted:

Frequency Intuition: Suppose the identical random experiment can be conducted N times (e.g. rolling dice, spinning coins), and let M be the number of times that the event E is observed to occur. We can define a probability of E by:

    Pr[E] = lim_{N→∞} M/N

i.e. probability is interpreted as the relative proportion of times that E occurs in many trials. This (interpretation) is still a leap of faith! Why should the future behave like the past? Why should the occurrence of E at the next occasion obey the average laws of the past? In any case, what do we mean by "identical" experiments? Nevertheless, this approach does give a sense of objectivity to the interpretation of probability.

Very often, the "experiments" are hypothetical mental experiments! These tend to be based on the concept of equally likely elementary events, justified by symmetry arguments (e.g. the faces of a die).
Subjective Belief: The problem is that we cannot even conceptually view all forms of uncertainty in frequency-of-occurrence terms. Consider for example:

- The petrol price next December
- The grade of ore in a specific undeveloped block of ore in a mine
- Your marks for this course

None of these can be repeatedly observed, and yet we may well have a strong subjective sense
of the probability of events defined in terms of these random variables. The subjective view
accepts that most, in fact very nearly all, sources of uncertainty include at least some degree
of subjectivity, and that we should not avoid recognizing probability as a measure of our
subjective lack of knowledge. Of course, where well-defined frequencies are available, or can
be derived from elementary arguments, we should not lightly dismiss these. But ultimately,
the only logical rational constraint on probabilities is that they do satisfy Kolmogorov's
axioms (for without this, the implied beliefs are incoherent, in the sense that actions or
decisions consistent with stated probabilities violating the axioms can be shown to lead to
certain loss).
The aim of statistics is to argue from specific observations of data to more general conclusions
(inductive reasoning). Probability is only a tool for coping with the uncertainties inherent in
this process. But it is inevitable that the above two views of probability extend to two distinct
philosophies of statistical inference, i.e. the manner in which sample data should be extrapolated to
general conclusions about populations. These two philosophies (paradigms) can be summarized as
follows:
Frequentist, or Sampling Theory: Here probability statements are used only to describe the
results of (at least conceptually) repeatable experiments, but not for any other uncertainties (for example regarding the value of an unknown parameter, or the truth of a null
hypothesis, which are assumed to remain constant no matter how often the experiment is
repeated). The hope is that different statisticians should be able to agree on these probabilities, which thus have a claim to objectivity. The emergence of statistical inference during the
late 19th and early 20th centuries as a central tool of the scientific method occurred within
this paradigm (at a time when objectivity was prized above all else in science). Certainly,
concepts of repeatable experiments and hypothesis testing remain fundamental cornerstones
of the scientific method.
The emphasis in this approach is on sampling variability: what might have happened if
the underlying experiments could have been repeated many times over? This approach was
adopted throughout first year.
Bayesian: Here probability statements are used to represent all uncertainty, whether a result
of sampling variability or of lack of information. The term Bayesian arises because of the
central role of Bayes' theorem in making probability statements about unknown parameters
conditional on observed data, i.e. inferring likely causes from observed consequences, which
was very much the context in which Bayes worked.
The emphasis is on characterizing how degrees of uncertainty change from the position prior
to any observations, to the position after (or posterior to) these observations.
One cannot say that one philosophy of inference is better than another (although some have
tried to argue in this way). Some contexts lend themselves to one philosophy rather than the
other, while some statisticians feel more comfortable with one set of assumptions rather than the
other.
Nevertheless, for the purposes of this course and for much of third year (STA3030), we will largely
limit ourselves to discussing the frequentist approach: this approach is perhaps simpler, will avoid
confusion of concepts at this relatively early stage in your training, and is the classical approach
used in reporting experimental results in many fields. In fact, the fundamental purpose of this
course is to demonstrate precisely how the various tests, confidence intervals, etc. in first year
are derived from basic distributional theory applied to sampling variability viewed in a frequentist
sense.
Notation: We shall use upper case letters (X, Y, ...) to signify random variables (e.g. time to failure of a machine, size of an insurable loss, blood pressure of a patient), and lower case letters (x, y, ...) to represent algebraic quantities. The expression {X ≤ x} thus represents the event, or the assertion, that the random variable denoted by X takes on a value not exceeding the real number x. (We can quite legitimately define events such as {X ≤ 3.81}.)

1.3 Probability mass functions

Now suppose that X can take on discrete values only. Let Ω be the set of all such values, e.g. Ω = {0, 1, 2, ...}.

Discrete random variable: A random variable X is defined to be discrete if the range of possible values of X is countable; e.g. the number of cars parked in front of Fuller Hall; the number of babies that are born at Tygerberg hospital; the number of passengers on the Jammie Shuttle at 10 am.

We define the probability mass function (pmf) (sometimes simply termed the probability function) by pX(θ) = Pr[X = θ] (the argument θ is any scalar value). The following properties are then evident:

(i) pX(xi) ≥ 0 for xi ∈ Ω (a countable set)
(ii) pX(x) = 0 for all other values of x
(iii) Σ_i pX(xi) = 1

Example A: Consider the experiment of tossing a coin 4 times. Define the random variable Y as the number of heads when a coin is tossed 4 times.

(1) Write down the sample space.
(2) Derive the probability mass function (pmf) (assume Pr(head) = p).
(3) Check that the 3 properties of a pmf are satisfied.

First list the sample space: we need to fill 4 positions, and the rv Y (the number of heads) can take on the values 0, 1, 2, 3 or 4. The sample space is obtained by listing all the possibilities:

    Y    Sample space    probability                pmf
    0    TTTT            (1-p)(1-p)(1-p)(1-p)       (1-p)^4
    1    HTTT            p(1-p)(1-p)(1-p)           4p(1-p)^3
         THTT            (1-p)p(1-p)(1-p)
         TTHT            (1-p)(1-p)p(1-p)
         TTTH            (1-p)(1-p)(1-p)p
    2    HHTT            pp(1-p)(1-p)               6p^2(1-p)^2
         HTHT            p(1-p)p(1-p)
         HTTH            p(1-p)(1-p)p
         TTHH            (1-p)(1-p)pp
         THHT            (1-p)pp(1-p)
         THTH            (1-p)p(1-p)p
    3    HHHT            ppp(1-p)                   4p^3(1-p)
         HTHH            p(1-p)pp
         HHTH            pp(1-p)p
         THHH            (1-p)ppp
    4    HHHH            pppp                       p^4

Thus, the pmf is given by

    pY(y) = C(4, y) p^y (1 - p)^{4-y},  y = 0, 1, 2, 3, 4

where C(4, y) denotes the binomial coefficient "4 choose y". Note that pY(y) ≥ 0 for all y = 0, 1, 2, 3, 4 (a countable set), and is zero for all other values of y. Furthermore the pmf sums to one:

    Σ_{y=0}^{4} C(4, y) p^y (1 - p)^{4-y} = (p + (1 - p))^4 = 1   (binomial theorem)

The random variable Y in Example A is an example of a binomial random variable. In general the probability mass function of a binomial random variable, with parameters n and p, where n is the number of trials and p is the probability of observing a success (be careful how you define the concept of success), is given as:

    pX(x) = C(n, x) p^x (1 - p)^{n-x},  x = 0, 1, 2, ..., n

In shorthand notation we write this pmf as X ∼ B(n, p). For more exercises on the binomial see Introstat, chapter 5.
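The three pmf properties can also be checked numerically. The following short Python sketch (not part of Introstat; the helper name binomial_pmf is ours) evaluates the B(n, p) pmf for Example A and confirms that it is non-negative, zero outside the support, and sums to one:

```python
from math import comb

def binomial_pmf(x, n, p):
    """pmf of a B(n, p) random variable at the point x."""
    if x < 0 or x > n:
        return 0.0              # zero outside the countable support
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Example A: number of heads in 4 tosses, taking Pr(head) = p = 0.6 for illustration
n, p = 4, 0.6
pmf = [binomial_pmf(y, n, p) for y in range(n + 1)]
print(pmf)          # p_Y(0), ..., p_Y(4), all non-negative
print(sum(pmf))     # 1.0 (third property of a pmf)
```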
Example B: Consider the following probability mass function:

    pX(x) = x/k,  x = 1, 2, 3, 4;  zero elsewhere

(1) Find the constant k so that pX(x) satisfies the conditions of being a pmf of a random variable X.
(2) Find Pr[X = 2 or 3].
(3) Find Pr[3/2 < X < 9/2].

Solution:

(1) To find k we need to evaluate the sum of the pmf and set it equal to one (third property of a pmf):

    Σ_{i=1}^{4} pX(xi) = 1/k + 2/k + 3/k + 4/k = (1 + 2 + 3 + 4)/k = 1   (given it is a pmf)

Thus 10/k = 1, which gives k = 10.

(2) Pr[X = 2 or 3] = Pr[X = 2] + Pr[X = 3] = 2/10 + 3/10 = 5/10.

(3) Pr[3/2 < X < 9/2] = Pr[2 ≤ X ≤ 4] = 1 − Pr[X = 1] = 1 − 1/10 = 9/10.

Example C: The pmf of the random variable S is given in the following table:

    Value of S:   0      1      2      3      4
    pS(s):        0.15   0.25   0.25   0.15   c

(1) Find the constant c so that pS(s) satisfies the conditions of being a pmf of the random variable S.
(2) Find Pr[S = 6 or 2.5].
(3) Find Pr[S > 3].

Solution:

(1) To find c we need to evaluate the sum of the pmf and set it equal to one (third property of a pmf):

    c = 1 − (0.15 + 0.25 + 0.25 + 0.15) = 0.20

(2) Pr[S = 6 or 2.5] = 0
(3) Pr[S > 3] = Pr[S = 4] = 0.20
In first year you covered other discrete probability mass functions (you need to revise them yourself), but here are some important notes on some of them:

Binomial (Example A and Introstat Chapter 5)
The binomial sampling situation refers to the number of successes (which may, in some cases, be the undesirable outcomes!) in n independent trials, in which the probability of success in a single trial is p for all trials. This situation is often described initially in terms of sampling with replacement, where the replacement ensures a constant value of p (as otherwise we have the hypergeometric distribution), but applies whenever sequential trials have constant success probability (e.g. sampling production from a continuous process).

Poisson (Introstat, Chapter 5)
(1) The Poisson distribution with parameter λ = np is introduced as an approximation to the binomial distribution when n is large and p is small.
(2) The Poisson distribution also arises in conjunction with the exponential distribution in the important context of a memoryless (Poisson) process. Such a process describes discrete occurrences (e.g. failures of a piece of equipment, claims on an insurance policy), when the probabilities of future events do not depend on what has happened in the past. For example, if the probability that a machine will break down in the next hour is independent of how long it is since the last breakdown, then this is a memoryless process. For such a process in which the rate of occurrence is λ (number of occurrences per unit time), it was shown in INTROSTAT that (a) the time between successive occurrences is a continuous random variable having the exponential distribution with parameter λ (i.e. a mean of 1/λ); and (b) the number of occurrences in a fixed interval of time t is a discrete random variable having the Poisson distribution with parameter (i.e. mean) λt.

These distributions are assumed known; revise Chapters 5 and 7 (Introstat), and also check and see if you can prove that these 4 probability mass functions satisfy the conditions of a pmf. For each discrete distribution do at least 4 exercises from Introstat.

You need to realize one important rule:

Although some pmfs have special names (e.g. Binomial, Poisson, Geometric, Negative binomial), each will still follow the rules of a pmf.

1.4 Probability density functions

A discrete random variable can only give rise to a countable number of possible values; in contrast, a continuous random variable is a random variable whose set of possible values is uncountable (e.g. you can measure your weight to 2 decimals, 4 decimals - you can measure it to any degree of accuracy!). Examples of continuous random variables:

- Lifetime of an energy saver globe
- Petrol consumption of your car
- The length of time you have to wait for the Jammie Shuttle

A continuous random variable (say X) is described by its probability density function (pdf). The function fX(x) is the probability density function (pdf). Once again, the subscript X identifies the random variable under consideration, while the argument x is an arbitrary algebraic quantity. The pdf satisfies the following properties:

(i) fX(x) ≥ 0 ... but can be greater than 1!
(ii) ∫_{x=−∞}^{∞} fX(x) dx = 1
(iii) Pr[a < X ≤ b] = ∫_{x=a}^{b} fX(x) dx

Example D: Let T be a random variable of the continuous type with pdf given by

    fT(t) = c t²  for 0 ≤ t ≤ 2

(1) Show that for fT(t) to be a pdf, c = 3/8.
(2) Find P[T > 1].
(3) Draw the graph of fT(t).

Solution:

(1) Since fT(t) is a pdf, we must have that ∫_{t=0}^{2} fT(t) dt = 1:

    ∫_{t=0}^{2} c t² dt = c [t³/3]_{0}^{2} = (c/3)(8 − 0) = (8/3)c

thus

    (8/3)c = 1  (given it is a pdf)
    c = 3/8

(2)

    P[T > 1] = 1 − P[T ≤ 1]
             = 1 − ∫_{t=0}^{1} fT(t) dt
             = 1 − ∫_{t=0}^{1} (3/8) t² dt
             = 1 − (3/8)[t³/3]_{0}^{1}
             = 1 − 1/8
             = 7/8

Example E: Let W be a random variable of the continuous type with pdf given by

    fW(w) = 2e^{−2w}  for 0 < w < ∞

(1) Show that fW(w) is a pdf.
(2) Find P[2 < W < 10].
(3) Draw the graph of fW(w).

Solution:

(1) It is clear that fW(w) ≥ 0; we need to show that ∫_{w=0}^{∞} fW(w) dw = 1:

    ∫_{w=0}^{∞} 2e^{−2w} dw = [−e^{−2w}]_{0}^{∞} = −(e^{−∞} − e^{0}) = −(0 − 1) = 1

(2)

    P[2 < W < 10] = ∫_{w=2}^{10} 2e^{−2w} dw = [−e^{−2w}]_{2}^{10} = −(e^{−20} − e^{−4}) = 0.01832

Comment: The random variable W follows an exponential distribution with parameter λ = 2 (see below, and Introstat, chapter 5).
Example F: Let X be a random variable of the continuous type with pdf given by

    fX(x) = 1/10  for 2 < x < c

(1) Find c so that fX(x) is a pdf.
(2) Find P[X > 14].
(3) Find P[X ≤ 5].
(4) Draw the graph of fX(x).

Solution:

(1)

    ∫_{x=2}^{c} fX(x) dx = ∫_{x=2}^{c} (1/10) dx = (1/10)[x]_{2}^{c} = (1/10)(c − 2)

For fX(x) to be a pdf, (1/10)(c − 2) = 1, thus c = 12.

(2) P[X > 14] = 0 (fX(x) = 0 outside the bounds).

(3)

    P[X ≤ 5] = ∫_{x=2}^{5} (1/10) dx = (1/10)[x]_{2}^{5} = (1/10)(5 − 2) = 3/10

Comment: The random variable X follows a uniform distribution (X ∼ U(2, 12)).

If the continuous random variable X is equally likely to take on any value in the interval (a, b) then X has the uniform distribution, X ∼ U(a, b), with probability density function

    fX(x) = 1/(b − a)  for a < x < b

and 0 otherwise. (Introstat, chapter 5.)

Other continuous probability density functions that you came across in first year include the normal (Introstat, Chapter 5), t, F and Chi-squared distributions (Introstat, Chapters 9 and 10, and later chapters of these notes).

You need to realize two important rules:

- Although some pdfs have special names (e.g. Weibull, log-normal, Laplace), each will still follow the rules of a pdf.
- A pdf given without its bounds is not a pdf! A pdf will always have a lower bound and an upper bound. Always specify these bounds, even when they are −∞ or ∞.
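The integrals in Examples D, E and F can be confirmed numerically. The sketch below (not part of the notes; it assumes NumPy and SciPy are installed) integrates each pdf over its support and recomputes the probabilities found above:

```python
import numpy as np
from scipy.integrate import quad

# Example D: f_T(t) = (3/8) t^2 on [0, 2]
f_T = lambda t: 3/8 * t**2
print(quad(f_T, 0, 2)[0])          # total probability, approx 1
print(quad(f_T, 1, 2)[0])          # Pr[T > 1], approx 7/8 = 0.875

# Example E: f_W(w) = 2 exp(-2w) on (0, inf)
f_W = lambda w: 2 * np.exp(-2 * w)
print(quad(f_W, 0, np.inf)[0])     # approx 1
print(quad(f_W, 2, 10)[0])         # Pr[2 < W < 10], approx 0.01832

# Example F: f_X(x) = 1/10 on (2, 12)
f_X = lambda x: 1/10
print(quad(f_X, 2, 12)[0])         # approx 1
print(quad(f_X, 2, 5)[0])          # Pr[X <= 5], approx 3/10
```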

1.4.1 The Gamma distribution

For your maths toolbox:

(1) We first define the gamma function as follows:

    Γ(n) = ∫_{0}^{∞} x^{n−1} e^{−x} dx.

Note that this equation gives a function of n, and not a function of x, which is an arbitrarily chosen variable of integration! The argument n may be an integer or any real number.

(2) The easiest case is:

    Γ(1) = ∫_{0}^{∞} e^{−x} dx = 1.

(3) It can also be shown that Γ(1/2) = √π. (This follows by a change of variable in the integration from x to z = √(2x), and recognizing the form of the density of the normal distribution, whose integral is known. Only the students with strong mathematical skills should try showing this result.)

(4) An important property of the gamma function is given by the following, in which the second line follows by integration by parts:

    Γ(n + 1) = ∫_{0}^{∞} x^n e^{−x} dx
             = [x^n (−e^{−x})]_{0}^{∞} − ∫_{0}^{∞} n x^{n−1} (−e^{−x}) dx
             = 0 + n ∫_{0}^{∞} x^{n−1} e^{−x} dx
             = n Γ(n)

Use of this result together with Γ(1) = 1 and Γ(1/2) = √π allows us to evaluate Γ(n) for all integer and half-integer arguments. In particular, it is easily confirmed that for integer values of n, Γ(n) = (n − 1)!.

(5) Evaluate:

    (a) Γ(5) = (5 − 1)! = 24
    (b) Γ(4½) = Γ(3½ + 1) = 3½ Γ(3½) = ... = 3½ × 2½ × 1½ × ½ × Γ(½) = 3½ × 2½ × 1½ × ½ × √π

The Gamma distribution is defined as:

    fX(x) = λ^α x^{α−1} e^{−λx} / Γ(α)   for 0 < x < ∞, α > 0 and λ > 0.

(Note that in general we will assume that the value of a density function is 0 outside of the range for which it is specifically defined.)

Exercise: Show that the pdf given above for the gamma distribution is indeed a proper pdf. (HINT: transform the integration to a new variable u = λx.)

Alternative definition: In some texts, the gamma distribution is defined in terms of α and a parameter β defined by β = 1/λ. In this case the density is written as:

    fX(x) = x^{α−1} e^{−x/β} / (β^α Γ(α))   for 0 < x < ∞.

Of course, the mathematical fact that fX(x) satisfies the properties of a pdf does not imply or show that any random variable will have this distribution in practical situations. We shall, a little later in this course, demonstrate two important situations in which the gamma distribution arises naturally. These situations are:

(1) The gamma special case in which α = n/2 (for integer n) and λ = 1/2 is the χ² (chi-squared) distribution with n degrees of freedom (which you met frequently in the first year course).

(2) When α = 1 the gamma distribution is called the exponential distribution.
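The gamma function results in the maths toolbox above are easy to check numerically; the following minimal Python sketch (not from the notes; it assumes SciPy is available) confirms Γ(5) = 4!, Γ(1/2) = √π and the recursion Γ(n + 1) = nΓ(n):

```python
from math import factorial, pi, sqrt
from scipy.special import gamma as G   # the gamma function, not the gamma distribution

print(G(5), factorial(4))        # Gamma(5) = 4! = 24
print(G(0.5), sqrt(pi))          # Gamma(1/2) = sqrt(pi)
print(G(4.5), 3.5 * G(3.5))      # recursion: Gamma(n + 1) = n * Gamma(n)
```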

1.4.2 The Beta distribution

We start by defining the beta function, as follows:

    B(m, n) = ∫_{0}^{1} x^{m−1} (1 − x)^{n−1} dx

Note that this equation gives a function of m and n (and not of x). Note the symmetry of the arguments: B(m, n) = B(n, m). It can be shown that the beta and gamma functions are related by the following expression, which we shall not prove:

    B(m, n) = Γ(m) Γ(n) / Γ(m + n)

Clearly, the function defined by:

    fX(x) = x^{m−1} (1 − x)^{n−1} / B(m, n)   for 0 < x < 1

satisfies all the properties of a probability density function. A probability distribution with pdf given by fX(x) is called the beta distribution, or more correctly the beta distribution of the first kind. (We shall meet the second kind shortly.) We will later see particular situations in which this distribution arises naturally, e.g. in comparing variances or sums of squares.

1.5 Distribution function

The distribution function (sometimes denoted as the cumulative distribution function (cdf)) of the random variable X is defined by:

    FX(b) = Pr[X ≤ b]

The subscript X refers to the specific random variable under consideration. The argument b is an arbitrary algebraic symbol: we can just as easily talk of FX(y), or FX(t), or just FX(5.17).

Since for any pair of real numbers a < b, the events {a < X ≤ b} and {X ≤ a} are mutually exclusive, while their union is the event {X ≤ b}, it follows that Pr[X ≤ b] = Pr[X ≤ a] + Pr[a < X ≤ b], or (equivalently):

    Pr[a < X ≤ b] = FX(b) − FX(a)

Some properties of the cdf FX are:

(1) FX is a monotone, non-decreasing function; that is, if a < b, then FX(a) ≤ FX(b).
(2) lim_{a→∞} FX(a) = 1. (In practice this means that if a ≥ the upper bound then FX(a) = 1.)
(3) lim_{a→−∞} FX(a) = 0. (In practice this means that if a ≤ the lower bound then FX(a) = 0.)
Some examples: find the cdf for Examples A - F.

Example A (cont): The cdf is given by:

    FY(θ) = Σ_{y ≤ θ} C(4, y) p^y (1 − p)^{4−y}

e.g. for θ = 1:

    FY(1) = (1 − p)^4 + 4p(1 − p)^3

Example B (cont): The cdf is given by:

    FX(θ) = Σ_{x ≤ θ} x/10

e.g. for θ = 3:

    FX(3) = 1/10 + 2/10 + 3/10 = 6/10

Example C (cont): The cdf is given by:

    FS(θ) = Σ_{s ≤ θ} pS(s)

e.g. for θ = 5:

    FS(5) = FS(4) = 1
Example D (cont): The cdf is given by:

    FT(θ) = ∫_{0}^{θ} (3/8) t² dt = (3/8)[t³/3]_{0}^{θ} = θ³/8

Thus the cdf of T is:

    FT(t) = 0        t < 0
          = t³/8     0 ≤ t ≤ 2
          = 1        t > 2

Example E (cont): The cdf is given by:

    FW(θ) = ∫_{0}^{θ} 2e^{−2w} dw = [−e^{−2w}]_{0}^{θ} = −(e^{−2θ} − e^{0}) = 1 − e^{−2θ}

Thus the cdf of W is given by:

    FW(w) = 0            w ≤ 0
          = 1 − e^{−2w}  w > 0

Example F (cont): The cdf is given by:

    FX(θ) = ∫_{2}^{θ} (1/10) dx = (1/10)[x]_{2}^{θ} = (θ − 2)/10

Thus the cdf of X is given by:

    FX(x) = 0             x ≤ 2
          = (x − 2)/10    2 < x < 12
          = 1             x ≥ 12

The cumulative distribution function can also be used to find the quantiles (e.g. lower quartile, median and upper quartile) of distributions. (Reminder from first year (Introstat, chapter 1): 1/4 of the sample data is below the lower quartile; 1/2 (or 50%) is below the median; and 3/4 (75%) is below the upper quartile.)

Example D (quantiles): Use the cdf of T and find the median.

The cdf of T is given by:

    FT(t) = 0        t < 0
          = t³/8     0 ≤ t ≤ 2
          = 1        t > 2

Denote the median (the 1/2 quantile) by t(m). Then

    FT(t(m)) = t(m)³/8 = 1/2

thus t(m)³ = 4, and the median is t(m) = ∛4 ≈ 1.587.

Example E (quantiles): Use the cdf of W and find the lower quartile.

The cdf of W is given by:

    FW(w) = 0            w ≤ 0
          = 1 − e^{−2w}  w > 0

Denote the lower quartile (the 1/4 quantile) by w(l). Then

    FW(w(l)) = 1 − e^{−2w(l)} = 1/4

thus

    e^{−2w(l)} = 3/4
    −2w(l) = ln(3/4)
    w(l) = 0.143841

Example F (quantiles): Use the cdf of X and find the upper quartile.

The cdf of X is given by:

    FX(x) = 0             x ≤ 2
          = (x − 2)/10    2 < x < 12
          = 1             x ≥ 12

Denote the upper quartile (the 3/4 quantile) by x(u). Then

    FX(x(u)) = (x(u) − 2)/10 = 3/4

and the upper quartile is x(u) = 9½.
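The same quantiles can be obtained by inverting the cdfs directly in a few lines of Python (a quick check, not part of the notes):

```python
import math

# Example D: F_T(t) = t^3/8 on [0, 2]; the median solves t^3/8 = 0.5
print(round((8 * 0.5) ** (1 / 3), 5))        # cube root of 4, approx 1.5874

# Example E: F_W(w) = 1 - exp(-2w); the lower quartile solves 1 - exp(-2w) = 0.25
print(round(-math.log(0.75) / 2, 6))         # approx 0.143841

# Example F: F_X(x) = (x - 2)/10 on (2, 12); the upper quartile solves (x - 2)/10 = 0.75
print(2 + 10 * 0.75)                         # 9.5
```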

1.6 Functions of random variables - cumulative distribution technique

Before proceeding to consideration of specific distributions, let us briefly review some results from integration theory. Suppose we have an integral of the form:

    ∫_{a}^{b} f(x) dx

but that we would rather work in terms of a new variable defined by u = g(x). There are at least two possible reasons for this change:

(i) the variable u may have more physical or statistical meaning (i.e. a statistical reason); or
(ii) it may be easier to solve the integral in terms of u than in terms of x (i.e. a mathematical reason).

We shall suppose that the function g(x) is monotone (increasing or decreasing) and continuously differentiable. We can then define a continuously differentiable inverse function, say g^{−1}(u), which is nothing more than the solution for x in terms of u from the equation g(x) = u. For example, if g(x) = e^{−x}, then g^{−1}(u) = −ln(u). We then define the Jacobian of the transformation from x to u by:

    |J| = |dx/du| = |d g^{−1}(u)/du|

Note that J is a function of u.

Example: Continuing with the example of g(x) = e^{−x} and g^{−1}(u) = −ln(u), we have that:

    |J| = |d[−ln(u)]/du| = |−1/u| = 1/u

since u > 0.

Important Note: Since dx/du = [du/dx]^{−1}, it follows that we can also define the Jacobian by |J| = |[dg(x)/dx]^{−1}|, but the result will still be written in terms of x, requiring a further substitution to get it in terms of u as required. Note also that some texts define the Jacobian as the inverse of our definition, and care has thus to be exercised in interpreting results involving Jacobians.

Example (continued): In the previous example, dg(x)/dx = −e^{−x}, and thus the Jacobian could be written as |[−e^{−x}]^{−1}| = e^{x}; substituting x = −ln(u) then gives |J| = e^{−ln(u)} = u^{−1} = 1/u, as before.
Theorem 1.1 For any monotone function g(x), defining a transformation of the variable of integration:

    ∫_{a}^{b} f(x) dx = ∫_{c}^{d} f[g^{−1}(u)] |J| du

where c is the smaller of g(a) and g(b), and d the larger.

This theorem then defines a procedure for changing the variable of integration from x to u = g(x) (where g(x) is monotone):

(1) Solve for x in terms of u to get the inverse function g^{−1}(u).
(2) Differentiate g^{−1}(u) and take the absolute value to obtain the Jacobian |J|.
(3) Calculate the minimum and maximum values for u (i.e. c and d in the theorem).
(4) Write down the new integral, as given by the theorem.

Example: Evaluate:

    ∫_{0}^{∞} x e^{−x²/2} dx

(1) Solve for x in terms of u to get the inverse function g^{−1}(u).
    Substitute u = x²/2, which is monotone over the range given; this gives x = √(2u).
(2) Differentiate g^{−1}(u) to obtain the Jacobian |J|:
    |J| = |d√(2u)/du| = 1/√(2u).
(3) Calculate the minimum and maximum values for u (i.e. c and d in the theorem).
    Clearly, u also runs from 0 to ∞.
(4) Write down the new integral, as given by the theorem:

    ∫_{0}^{∞} x e^{−x²/2} dx = ∫_{0}^{∞} √(2u) e^{−u} (1/√(2u)) du = ∫_{0}^{∞} e^{−u} du = 1

We consider Examples D, E and F. In what follows we are going to use the cdfs (found in Examples D, E and F above) to evaluate the cdf and pdf of a function of the random variable.

Example D (cont): Define Y = (T + 2)/2; find the pdf of Y by using the cdf of T.

Solution:

    FY(y) = Pr[Y ≤ y]
          = Pr[(T + 2)/2 ≤ y]
          = Pr[T ≤ 2y − 2]
          = FT(2y − 2)
          = (2y − 2)³/8

thus the pdf of Y is:

    fY(y) = dFY(y)/dy = 3(2y − 2)² × 2/8 = 3(y − 1)²  for 1 ≤ y ≤ 2

Homework: check that your answer is right.

Example E (cont): Define H = 2W; find the pdf of H by using the cdf of W.

Solution:

    FH(h) = Pr[H ≤ h]
          = Pr[2W ≤ h]
          = Pr[W ≤ h/2]
          = FW(h/2)
          = 1 − exp[−2(h/2)]
          = 1 − exp[−h]

thus the pdf of H is:

    fH(h) = dFH(h)/dh = exp[−h]  for h > 0.

Example F (cont): Define Y = X + 8; find the pdf of Y by using the cdf of X.

Solution:

    FY(y) = Pr[Y ≤ y]
          = Pr[X + 8 ≤ y]
          = Pr[X ≤ y − 8]
          = FX(y − 8)
          = (y − 8 − 2)/10
          = (y − 10)/10

thus the pdf of Y is:

    fY(y) = dFY(y)/dy = 1/10  for 10 < y < 20.

Homework: check that these answers are correct in D, E and F.
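One way to do this check is by simulation. The sketch below (not part of the notes; it assumes NumPy) simulates T by inverting its cdf, applies the transformation of Example D (cont), and compares the empirical cdf of Y with the derived FY(y) = (y − 1)³:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate T with cdf F_T(t) = t^3/8 on [0, 2] by inverting the cdf: T = (8U)^(1/3)
u = rng.uniform(size=200_000)
t = (8 * u) ** (1 / 3)

# Transform as in Example D (cont.)
y = (t + 2) / 2

# The derived cdf is F_Y(y) = (y - 1)^3 on [1, 2]; compare at a few points
for y0 in (1.2, 1.5, 1.8):
    print(np.mean(y <= y0), (y0 - 1) ** 3)   # empirical vs theoretical, should be close
```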

Tutorial Exercises
1. Which of the following functions are valid probability mass functions or probability density functions?

(a)  fX(x) = (1/6)(x² − 1)    0 ≤ x ≤ 3
           = 0                otherwise

(b)  fX(x) = (7/20)(x² − 1)   1 ≤ x ≤ 3
           = 0                otherwise

(c)  fX(x) = 1/5              1 ≤ x ≤ 6
           = 0                otherwise

(d)  pX(x) = e^{−1}/x!        x = 1, 2, ...

(e)  pS(s) = s                s = 1/12; 3/12; 1/3; 5/12
           = 0                otherwise

(f)  pT(t) = C(4, t)(1/2)^4   t = 0, 1, 2, 3
           = 0                otherwise

2. In question one, state the conditions that will make the pdfs or pmfs that are invalid, valid.

3. For what values of D can pY(y) be a probability mass function?

    pY(y) = (1 − D)/4   y = 0
          = (1 + D)/2   y = 2
          = (1 − D)/4   y = 4
          = 0           otherwise

4. A random variable X has probability density function given by

    fX(x) = kx³   0 ≤ x ≤ 4
          = 0     otherwise

(a) For fX(x) to be a valid pdf show that k = 1/64.
(b) Find the cumulative distribution function (cdf) of X.
(c) Find Pr[2 < X ≤ 3].
(d) Find the lower quartile.
(e) Say Y = 16X; find the pdf of Y.

5. A random variable Y has probability density function given by

    fY(y) = c e^{−y}   y ≥ 0
          = 0          otherwise

(a) For fY(y) to be a valid pdf show that c = 1.
(b) Find the cumulative distribution function (cdf) of Y.
(c) Find Pr[Y > 4].
(d) Find the median.
(e) Say S = 3Y; find the pdf of S.

6. A random variable X has probability density function given by

    fX(x) = 2x   0 ≤ x ≤ b
          = 0    otherwise

(a) For fX(x) to be a valid pdf show that b = 1.
(b) Find the cumulative distribution function (cdf) of X.
(c) Find Pr[1/2 < X ≤ 5].
(d) Find the upper quartile.
(e) Say Y = 4X; find the pdf of Y.

7. A random variable X has probability mass function given by

    pX(x) = e^{−4} c^x / x!   x = 0, 1, 2, ...
          = 0                 otherwise

(a) For pX(x) to be a valid pmf show that c = 4.
(b) Find the cumulative distribution function (cdf) of X (hint: do it term by term; no pattern emerges).
(c) Find Pr[X > 3].
8. A random variable T has probability mass function given by

    pT(t) = C(5, t) C(4, 2 − t) / c   t = 0, 1, 2
          = 0                         otherwise

(a) For pT(t) to be a valid pmf show that c = 36.
(b) Find the cumulative distribution function (cdf) of T.
(c) Find Pr[T > 3].
(d) Find Pr[T = 3/2].

10. A random variable X has probability density function given by

    fX(x) = 10/x²   x > c
          = 0       x ≤ c

(a) For fX(x) to be a valid pdf show that c = 10.
(b) Find the cumulative distribution function (cdf) of X.
(c) Find Pr[2 < X ≤ 3].
(d) Find the lower quartile.
(e) Find Pr[X > 7].

11. Two fair dice are rolled. Let S equal the sum of the two dice.

(a) List all the possible values S can take.
(b) List the sample space.
(c) Find the pmf of S.

12. Let D represent the difference between the number of heads and the number of tails obtained when an (unbiased) coin is tossed 5 times.

(a) List all the possible values D can take.
(b) List the sample space.
(c) Find the pmf of D.

13. The following question is from Ross (1998). Suppose X has the following cumulative distribution function:

    FX(θ) = 0                  θ < 0
          = θ/4                0 ≤ θ < 1
          = 1/2 + (θ − 1)/4    1 ≤ θ < 2
          = 11/12              2 ≤ θ < 3
          = 1                  θ ≥ 3

(a) Find Pr[X = i], i = 1, 2, 3.

Chapter 2

Bivariate Distributions

Assumed statistical background
Chapter one of these notes.

Maths Toolbox
Please work through the section on double integration - Maths Toolbox B.4.

2.1 Joint random variables

Up to now, we have assumed that our random experiment results in the observation of a value of
a single random variable. Quite typically, however, a single experiment (observation of the real
world) will result in a (possibly quite large) collection of measurements, which can be expressed as
a vector of observations describing the outcome of the experiment. For example:
- A medical researcher may record and report various characteristics of each subject being tested, such as blood pressure, height, weight, smoking habits, etc., as well as the direct medical observations;
- An investment analyst may wish to examine a large number of financial indicators specific for each share under consideration;
- The environmentalist would be interested in recording many different pollutants in the air.

Each such vector of observations is termed a multivariate observation or random variable. A very large proportion of statistical analysis is concerned with the analysis of such multivariate data. For this course (except in the chapter on regression), however, we will be limiting ourselves to the simplest case of two variables only, i.e. bivariate random variables.

For any bivariate pair of random variables, say (X, Y), we can define events such as {X ≤ x, Y ≤ y}, i.e. the joint occurrence of the events X ≤ x and Y ≤ y. Once again, upper case letters (X, Y) will denote random variables, while lower case letters (x, y) denote observable real numbers. As with univariate random variables, the discrete case is easily handled by means of a probability mass function. We shall once again, with no loss of generality, assume that the discrete random variables are defined on the non-negative integers. For a discrete pair of random variables, the joint probability mass function is defined by:

    pXY(x, y) = Pr[X = x, Y = y]

i.e. by the probability of the joint occurrence of X = x and Y = y. Just as in the univariate case, the joint pmf will have the following three properties:

(i) pXY(x, y) ≥ 0
(ii) pXY(x, y) > 0 on a countable set, and
(iii) Σ_x Σ_y pXY(x, y) = 1

The joint cumulative distribution function is given as:

    FXY(x, y) = Σ_{i=0}^{x} Σ_{j=0}^{y} pXY(i, j)

Note that the event {X = x} is equivalent to the union of all (mutually disjoint) events of the form X = x, Y = y as y ranges over all non-negative integers. It thus follows that the marginal probability mass function of X, i.e. the probability that X = x, must be given by:

    pX(x) = Σ_{y=0}^{∞} pXY(x, y)

and similarly the marginal for Y is:

    pY(y) = Σ_{x=0}^{∞} pXY(x, y).

Example G: Assume the joint probability mass function of X and Y is given in the following joint probability table:

                x = 0      x = 1      x = 2      x = 3
    y = 0      0.03125    0.06250    0.03125    0
    y = 1      0.06250    0.15625    0.12500    0.03125
    y = 2      0.03125    0.12500    0.15625    0.06250
    y = 3      0          0.03125    0.06250    0.03125

Note the following:

- From the table it is clear that X can take on the values x = 0, 1, 2, 3 and Y the values y = 0, 1, 2, 3. The pmf is only defined over this range of 16 points in two-dimensional space.
- pXY(0, 1) = Pr[X = 0, Y = 1] = 0.06250
- pXY(5, 1) = 0 (X is outside the valid range)
- Σ_{i=0}^{3} Σ_{j=0}^{3} pXY(i, j) = 1 (the joint pmf sums to 1)

Summing down each column then gives the marginal pmf for X; so, for example:

    pX(0) = 0.03125 + 0.06250 + 0.03125 + 0 = 0.125

and similarly pX(1) = pX(2) = 0.375 and pX(3) = 0.125. It is easily seen that this is also the marginal distribution of Y (obtained by adding across the rows of the table).
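The marginal calculations for Example G are easy to reproduce with NumPy (a quick check, not part of the notes):

```python
import numpy as np

# Joint pmf of Example G: rows indexed by y = 0..3, columns by x = 0..3
p = np.array([
    [0.03125, 0.06250, 0.03125, 0.00000],
    [0.06250, 0.15625, 0.12500, 0.03125],
    [0.03125, 0.12500, 0.15625, 0.06250],
    [0.00000, 0.03125, 0.06250, 0.03125],
])

print(p.sum())         # joint pmf sums to 1
print(p.sum(axis=0))   # marginal pmf of X (sum down each column): 0.125, 0.375, 0.375, 0.125
print(p.sum(axis=1))   # marginal pmf of Y (sum across each row): the same values here
```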

For continuous random variables, we need to introduce the concept of the joint probability density function fXY(x, y). In principle, the joint pdf is defined to be the function for which:

    Pr[a < X ≤ b, c < Y ≤ d] = ∫_{x=a}^{b} ∫_{y=c}^{d} fXY(x, y) dy dx

for all a < b and c < d. Just as in the univariate case, the joint pdf will have the following three properties:

(i) fXY(x, y) ≥ 0
(ii) fXY(x, y) > 0 on a measurable set, and
(iii) the total volume (under the surface over the x-y plane) is one:

    ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY(x, y) dx dy = 1

Note that in particular this condition requires that:

    FXY(x, y) = ∫_{u=−∞}^{x} ∫_{v=−∞}^{y} fXY(u, v) dv du

and it is easy to move from the joint cdf to the joint pdf:

    fXY(x, y) = ∂²FXY(x, y) / ∂x∂y

As for discrete random variables, we can also define marginal pdfs for X and Y:

    fX(x) = ∫_{y=−∞}^{∞} fXY(x, y) dy

and similarly for the marginal pdf of Y:

    fY(y) = ∫_{x=−∞}^{∞} fXY(x, y) dx.

The marginal pdfs describe the overall variation in one variable, irrespective of what happens with the other. For example, if X and Y represent height and weight of a randomly chosen individual from a population, then the marginal distribution of X describes the distribution of heights in the population.
Example H: Suppose the joint pdf of X, Y is given by:

    fXY(x, y) = (3/28)(x + y²)   for 0 ≤ x ≤ 2; 0 ≤ y ≤ 2

Is this function a joint pdf? Yes, because fXY(x, y) ≥ 0 and

    ∫_{x=0}^{2} ∫_{y=0}^{2} (3/28)(x + y²) dy dx
        = (3/28) [ ∫_{x=0}^{2} ∫_{y=0}^{2} x dy dx + ∫_{x=0}^{2} ∫_{y=0}^{2} y² dy dx ]
        = (3/28) [ ∫_{x=0}^{2} x [y]_{y=0}^{2} dx + ∫_{x=0}^{2} [y³/3]_{y=0}^{2} dx ]
        = (3/28) [ 2 ∫_{x=0}^{2} x dx + ∫_{x=0}^{2} (8/3) dx ]
        = (3/28) [ 4 + 16/3 ]
        = (3/28)(28/3) = 1

The marginal pdf of X is:

    fX(x) = ∫_{y=0}^{2} (3/28)(x + y²) dy = (3/28)[xy + y³/3]_{y=0}^{2} = (3/28)(2x + 8/3)   for 0 ≤ x ≤ 2

The marginal pdf of Y is:

    fY(y) = ∫_{x=0}^{2} (3/28)(x + y²) dx = (3/28)[x²/2 + y²x]_{x=0}^{2} = (3/28)(2 + 2y²)   for 0 ≤ y ≤ 2

Note that marginal pdfs are still pdfs!

Find the probability that both X and Y are less than one:

    Pr[0 ≤ X ≤ 1, 0 ≤ Y ≤ 1]
        = ∫_{x=0}^{1} ∫_{y=0}^{1} (3/28)(x + y²) dy dx
        = (3/28) [ ∫_{x=0}^{1} x [y]_{y=0}^{1} dx + ∫_{x=0}^{1} [y³/3]_{y=0}^{1} dx ]
        = (3/28) [ ∫_{x=0}^{1} x dx + ∫_{x=0}^{1} (1/3) dx ]
        = (3/28) [ 1/2 + 1/3 ]
        = (3/28)(5/6)
        = 5/56
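Both integrals for Example H can be verified numerically with SciPy's double-integration routine (a sketch, not part of the notes; note that dblquad integrates a function of (y, x) with y as the inner variable):

```python
from scipy.integrate import dblquad

f = lambda y, x: 3/28 * (x + y**2)

total, _ = dblquad(f, 0, 2, lambda x: 0, lambda x: 2)   # x from 0 to 2, y from 0 to 2
print(total)                                            # approx 1

p, _ = dblquad(f, 0, 1, lambda x: 0, lambda x: 1)       # Pr[0 <= X <= 1, 0 <= Y <= 1]
print(p, 5/56)                                          # approx 0.0892857
```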

Example I: Suppose that X and Y are continuous random variables, with joint pdf given by:

    fXY(x, y) = c e^{−x} e^{−2y}   for x, y > 0

1. Find c.
We need to integrate out the joint pdf and set the definite integral equal to 1:

    ∫_{x=0}^{∞} ∫_{y=0}^{∞} c e^{−x} e^{−2y} dy dx
        = c ∫_{x=0}^{∞} e^{−x} (1/2) ∫_{y=0}^{∞} 2e^{−2y} dy dx
        = (c/2) ∫_{x=0}^{∞} e^{−x} [−e^{−2y}]_{y=0}^{∞} dx
        = (c/2) ∫_{x=0}^{∞} e^{−x} dx
        = (c/2) [−e^{−x}]_{x=0}^{∞}
        = c/2

thus setting the integral equal to 1 we obtain c = 2.

2. Find the marginal pdf of X:

    fX(x) = ∫_{y=0}^{∞} 2e^{−x} e^{−2y} dy = e^{−x} [−e^{−2y}]_{y=0}^{∞} = e^{−x}[0 − (−1)] = e^{−x}   for x > 0

3. Find the marginal pdf of Y:

    fY(y) = ∫_{x=0}^{∞} 2e^{−x} e^{−2y} dx = 2e^{−2y} [−e^{−x}]_{x=0}^{∞} = 2e^{−2y}   for y > 0

Note that marginal pdfs are still pdfs!


4. Find Pr[Y < X].
We need to evaluate the double integral over the region y < x, within the domain 0 < x < ∞ and 0 < y < ∞ (see Figure 2.1).

    Pr[Y < X] = ∫∫_{y<x} 2e^{−x} e^{−2y} dy dx
              = ∫_{x=0}^{∞} ∫_{y=0}^{x} 2e^{−x} e^{−2y} dy dx
              = ∫_{x=0}^{∞} e^{−x} [−e^{−2y}]_{y=0}^{x} dx
              = ∫_{x=0}^{∞} e^{−x} [1 − e^{−2x}] dx
              = ∫_{x=0}^{∞} e^{−x} dx − ∫_{x=0}^{∞} e^{−3x} dx
              = 1 − 1/3
              = 2/3

[Figure 2.1: the dotted area indicates the region of integration, y < x.]

2.2 Independence and conditional distributions

Recall from the first year notes the concepts of conditional probabilities and of independence of events. If A and B are two events then the probability of A conditional on the occurrence of B is given by:

    Pr[A|B] = Pr[A ∩ B] / Pr[B]

provided that Pr[B] > 0. The concept of the intersection of two events (A ∩ B) is that of the joint occurrence of both A and B. The events A and B are independent if Pr[A ∩ B] = Pr[A]·Pr[B], which implies that Pr[A|B] = Pr[A] and Pr[B|A] = Pr[B] whenever the conditional probabilities are defined.

The same ideas carry over to the consideration of bivariate (or multivariate) random variables. For discrete random variables, the linkage is direct: we have immediately that:

    Pr[X = x | Y = y] = Pr[X = x, Y = y] / Pr[Y = y] = pXY(x, y) / pY(y)

provided that pY(y) > 0. This relationship applies for any x and y such that pY(y) > 0, and defines the conditional probability mass function for X, given that Y = y. We write the conditional pmf as:

    pX|Y(x|y) = pXY(x, y) / pY(y).

In a similar manner, we define the conditional probability mass function for Y, given that X = x, i.e. pY|X(y|x).

By definition of independent events, the events {X = x} and {Y = y} are independent if and only if pXY(x, y) = pX(x)·pY(y).

If this equation holds true for all x and y, then we say that the random variables X and Y are independent. In this case, it is easily seen that all events of the form a < X ≤ b and c < Y ≤ d are independent of each other, which (inter alia) also implies that FXY(x, y) = FX(x)·FY(y) for all x and y.
Example G (cont.) Refer back to Example G in the previous section, and calculate the conditional probability mass function for X, given Y = 2. We have noted that pY(2) = 0.375, and thus:

    pX|Y(0|2) = pXY(0, 2)/pY(2) = 0.03125/0.375 = 0.0833
    pX|Y(1|2) = pXY(1, 2)/pY(2) = 0.12500/0.375 = 0.3333
    pX|Y(2|2) = pXY(2, 2)/pY(2) = 0.15625/0.375 = 0.4167
    pX|Y(3|2) = pXY(3, 2)/pY(2) = 0.06250/0.375 = 0.1667

Note that the conditional probabilities again add to 1, as required.

The random variables are not independent, since (for example) pX(2) = pY(2) = 0.375, and thus pX(2)·pY(2) = 0.140625, while pXY(2, 2) = 0.15625.
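Continuing the NumPy check of Example G (again not part of the notes), the conditional pmf and the independence test look like this:

```python
import numpy as np

# Joint pmf of Example G: rows y = 0..3, columns x = 0..3
p = np.array([
    [0.03125, 0.06250, 0.03125, 0.00000],
    [0.06250, 0.15625, 0.12500, 0.03125],
    [0.03125, 0.12500, 0.15625, 0.06250],
    [0.00000, 0.03125, 0.06250, 0.03125],
])

p_X = p.sum(axis=0)            # marginal of X
p_Y = p.sum(axis=1)            # marginal of Y

# Conditional pmf of X given Y = 2: divide the row y = 2 by p_Y(2)
print(p[2, :] / p_Y[2])        # approx 0.0833, 0.3333, 0.4167, 0.1667

# Independence would require p_XY(x, y) = p_X(x) p_Y(y) in every cell
print(np.allclose(p, np.outer(p_Y, p_X)))   # False: X and Y are not independent
```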
Once again there is a slight technical problem when it comes to continuous distributions, since all events of the form {X = x} have zero probability. Nevertheless, we continue to define the conditional probability density function for X given Y = y as:

    fX|Y(x|y) = fXY(x, y) / fY(y)

provided that fY(y) > 0, and similarly for Y given X = x. This corresponds to the formal definition of conditional probabilities in the sense that for small enough values of h > 0:

    Pr[a < X ≤ b | y < Y ≤ y + h] ≈ ∫_{x=a}^{b} fX|Y(x|y) dx

The continuous random variables X and Y are independent if and only if fXY(x, y) = fX(x)fY(y) for all x and y, which is clearly equivalent to the statement that the marginal and conditional pdfs are identical.
Example H (cont.) In this example, we have from the previous results that:

    fX(x) fY(y) = (3/28)(2x + 8/3) × (3/28)(2 + 2y²) ≠ fXY(x, y)

thus X and Y are not independent, and the conditional pdf of Y|X is

    fY|X(y|x) = fXY(x, y)/fX(x) = [(3/28)(x + y²)] / [(3/28)(2x + 8/3)] = (x + y²)/(2x + 8/3)   for 0 ≤ y ≤ 2, given any x ∈ [0, 2]

Homework: 1. Find the conditional pdf of X|Y, and 2. Show that the two conditional pdfs are indeed pdfs.

Find Pr[(Y < 1)|(X = 0)]:

    Pr[(Y < 1)|(X = 0)] = ∫_{y=0}^{1} fY|X(y|0) dy
                        = ∫_{y=0}^{1} (3y²/8) dy
                        = (3/8)[y³/3]_{0}^{1}
                        = 1/8
Example I (cont.) In this example, we have from the previous results that:

    fX(x) fY(y) = e^{−x} × 2e^{−2y} = fXY(x, y)

thus X and Y are independent, and the conditional pdf of Y|X is just the marginal of Y:

    fY|X(y|x) = fXY(x, y)/fX(x) = fX(x)fY(y)/fX(x) = fY(y)

2.3 The bivariate Gaussian distribution

In the same way that the Gaussian (normal) distribution played such an important role in the univariate statistics of the first year syllabus, its generalization is equally important for bivariate (and in general, multivariate) random variables. The bivariate Gaussian (normal) distribution is defined by the following joint probability density function for (X, Y):

    fXY(x, y) = 1 / (2π σX σY √(1 − ρ²)) × exp[ −Q(x, y) / (2(1 − ρ²)) ]        (2.1)

where Q(x, y) is a quadratic function in x and y defined by:

    Q(x, y) = ((x − μX)/σX)² − 2ρ((x − μX)/σX)((y − μY)/σY) + ((y − μY)/σY)²

In matrix notation this pdf can be expressed in the form:

    fXY(x, y) = 1 / (2π |Σ|^{1/2}) × exp[ −(1/2)(z − μ)' Σ^{−1} (z − μ) ]

where z and μ are column vectors of (x, y) and (μX, μY) respectively, and the matrix Σ is given by:

    Σ = [ σX²       ρσXσY ]
        [ ρσXσY     σY²   ]

The form (2.1) applies also to the general multivariate Gaussian distribution, except that for a multivariate random variable of dimension p, the 2π term is raised to the power of p/2.

We now briefly introduce a few key properties of the bivariate Gaussian distribution:

PROPERTY 1: MARGINAL DISTRIBUTIONS ARE GAUSSIAN.
In other words, fY(y) is the pdf of a Gaussian distribution with mean μY and variance σY². Similarly, the marginal distribution of X is Gaussian with mean μX and variance σX².

PROPERTY 2: CONDITIONAL DISTRIBUTIONS ARE GAUSSIAN.
The conditional pdf for X given Y = y is a Gaussian distribution, with mean (i.e. the conditional mean for X given Y = y)

    μX + ρ[σX/σY](y − μY)   and conditional variance   σX²(1 − ρ²).

Similarly the conditional distribution for Y given X = x is Gaussian with mean

    μY + ρ[σY/σX](x − μX)   and conditional variance   σY²(1 − ρ²).

Note that both regressions are linear, but that the slopes are not reciprocals of each other unless |ρ| = 1.

Note also that the conditional distributions have reduced variances, by the fractional factor (1 − ρXY²).

PROPERTY 3: ρXY = 0 IMPLIES THAT X AND Y ARE INDEPENDENT.
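Property 2 is easy to see empirically. The sketch below (a simulation check, not part of the notes; the parameter values are chosen arbitrarily for illustration) samples from a bivariate Gaussian and compares the sample conditional mean and variance, near a chosen value of Y, with the formulas above:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_x, mu_y, s_x, s_y, rho = 1.0, -2.0, 2.0, 0.5, 0.7

cov = np.array([[s_x**2, rho * s_x * s_y],
                [rho * s_x * s_y, s_y**2]])
x, y = rng.multivariate_normal([mu_x, mu_y], cov, size=500_000).T

# Condition (approximately) on Y close to y0 and compare with Property 2
y0 = -1.5
sel = np.abs(y - y0) < 0.01
print(x[sel].mean(), mu_x + rho * (s_x / s_y) * (y0 - mu_y))   # conditional mean, approx 2.4
print(x[sel].var(), s_x**2 * (1 - rho**2))                     # conditional variance, approx 2.04
```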

2.4 Functions of bivariate random variables

When faced with bivariate, or in general multivariate, random variables (observations), there is often a need to derive the distributions of particular functions of the individual variables. There are at least two reasons why this may be necessary:

- It may be useful to summarize the data in some way, by taking the means, sums, differences or ratios of similar or related variables. For example, height and weight together give some measure of obesity, but weight divided by (height)² may be a more succinct measure.
- Some function, such as the ratio of two random variables, may be more physically meaningful than the individual variables on their own. For example, the ratio of prices of two commodities is more easily compared across countries and across time, than are the individual variables.

2.4.1 General principles of the transformation technique

The principles will be stated for the bivariate case only, but do in fact apply to multivariate distributions in general.
Suppose that we know the bivariate probability distribution of a pair (X, Y) of random variables, and that two new random variables U and V are defined by the following transformations:

    U = g(X, Y)        V = h(X, Y)

which are assumed to be jointly one-to-one (i.e. each pair X, Y is transformed to a unique pair U, V). In other words, each event of the form {X = x, Y = y} corresponds uniquely to the event {U = u, V = v}, where u = g(x, y) and v = h(x, y). The uniqueness of this correspondence allows us in principle to solve for x and y in terms of u and v, whenever only u and v are known. This solution defines an inverse function, which we shall express in the form:

    x = φ(u, v)        y = ψ(u, v).

Let us further suppose that all the above functions are continuously differentiable. We can then define the Jacobian of the transformation (precisely as we did for change of variables in multiple integrals) as follows:

    |J| = | det [ ∂φ(u,v)/∂u   ∂φ(u,v)/∂v ] |
          |     [ ∂ψ(u,v)/∂u   ∂ψ(u,v)/∂v ] |
        = | (∂φ(u,v)/∂u)(∂ψ(u,v)/∂v) − (∂φ(u,v)/∂v)(∂ψ(u,v)/∂u) |

We then have the following theorem.

Theorem 2.1 Suppose that the joint pdf of X and Y is given by fXY(x, y), and that the continuously differentiable functions g(x, y) and h(x, y) define a one-to-one transformation of the random variables X and Y to U = g(X, Y) and V = h(X, Y), with inverse transformation given by X = φ(U, V) and Y = ψ(U, V). The joint pdf of U and V is then given by:

    fUV(u, v) = fXY(φ(u, v), ψ(u, v)) |J|

Note 1: We have to have as many new variables (i.e. U and V) as we had original variables (in this case 2). The method only works in this case. Even if we are only interested in a single transformation (e.g. U = X + Y) we need to invent a second variable, quite often something trivial such as V = Y. We will then have the joint distribution of U and V, and we will need to extract the marginal pdf of U by integration.

Note 2: Some texts define the Jacobian in terms of the derivatives of g(x, y) and h(x, y) w.r.t. x and y. With this definition, one must use the inverse of |J| in the theorem, as it can be shown that with our definition of |J|:

    1/|J| = | det [ ∂g(x,y)/∂x   ∂g(x,y)/∂y ] |
            |     [ ∂h(x,y)/∂x   ∂h(x,y)/∂y ] |

This is sometimes the easier way to compute the Jacobian in any case.
Example: Suppose that X and Y are independent random variables, each having the gamma distribution with λ = 1 in each case, and with α = a and α = b respectively, i.e.:

    fXY(x, y) = x^{a−1} y^{b−1} e^{−(x+y)} / (Γ(a)Γ(b))

for non-negative x and y. We define the following transformations:

    U = X/(X + Y)        V = X + Y.

Clearly 0 < U < 1 and 0 < V < ∞. The inverse transformations are easily seen to be defined by the functions:

    x = uv        y = v(1 − u)

and thus

    |J| = | det [ ∂x/∂u   ∂x/∂v ] | = | det [  v      u    ] | = |v(1 − u) + uv| = v
          |     [ ∂y/∂u   ∂y/∂v ] |   |     [ −v    1 − u  ] |

The joint pdf is thus:

    fUV(u, v) = [u^{a−1} v^{a−1} v^{b−1} (1 − u)^{b−1} e^{−v} / (Γ(a)Γ(b))] × v
              = u^{a−1} (1 − u)^{b−1} v^{a+b−1} e^{−v} / (Γ(a)Γ(b))

over the region defined by 0 < u < 1 and 0 < v < ∞. As an exercise, show that U has the Beta distribution (of the first kind) with parameters a and b, while V has the gamma distribution with parameters λ = 1 and α = a + b, and that U and V are independent.
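A simulation gives a quick sanity check of this exercise. The sketch below (not part of the notes; it assumes NumPy and SciPy, and a = 2, b = 3 are arbitrary illustrative values) draws independent gammas, forms U and V, and compares them with the claimed Beta and gamma distributions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a, b = 2.0, 3.0

x = rng.gamma(shape=a, scale=1.0, size=200_000)   # gamma with lambda = 1, alpha = a
y = rng.gamma(shape=b, scale=1.0, size=200_000)

u = x / (x + y)
v = x + y

print(u.mean(), a / (a + b))                        # Beta(a, b) mean a/(a+b) = 0.4
print(v.mean(), a + b)                              # gamma(alpha = a+b, lambda = 1) mean = 5
print(stats.kstest(u, 'beta', args=(a, b)).pvalue)  # not small: consistent with Beta(a, b)
print(np.corrcoef(u, v)[0, 1])                      # near 0, consistent with independence
```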

Tutorial Exercises
1. The joint mass function of two discrete random variables, W and V, is given by

    pWV(1, 1) = 1/8    pWV(1, 2) = 1/4    pWV(2, 1) = 1/8    pWV(2, 2) = 1/2

(a) Find the marginal mass functions of W and V.
(b) Find the conditional mass function of V given W = 2.
(c) Are W and V independent?
(d) Compute [i] Pr(WV ≤ 3)  [ii] Pr[W + V > 2]  [iii] Pr[W/V > 1].

2. If the joint pmf of C and D is given by

    pCD(c, d) = cd/36   for c = 1, 2, 3 and d = 1, 2, 3
              = 0       elsewhere

(a) Find the marginal pmfs of C and D.
(b) Find the conditional mass function of C given D = 2.
(c) Are C and D independent?
(d) Find the probability mass function of X = CD.

3. The joint pdf of X and Y is given by

    fXY(x, y) = k(x² + xy/2)   0 < x < 1, 0 < y < 1
              = 0              elsewhere

(a) Use the definition of a joint pdf to find k (a constant).
(b) Find the marginal pdf of Y.
(c) Find fX|Y(x|y).
(d) Find [i] P(1 < Y < 2 | x = 3)  [ii] P(0 < X < 1/2 | y = 1)
4. The joint mass function of two discrete random variables, W and V, is given in the following table:

               w = 2    w = 3    w = 4
    v = −1     0.20     0.05     0.04
    v = 0      0.05     0.10     0.07
    v = 1      0.04     0.04     0.16
    v = 2      0.15     0.10     k

(a) Use the definition of a joint pmf to find k.
(b) Find the marginal mass functions of W and V.
(c) Find the conditional mass function of V given W = 2.
(d) Find [i] Pr(W ≤ 2, V > 0)  [ii] Pr[0 < V < 2 | W = 2]
(e) Find the pmf of X = V + W².

5. The joint pdf of X and Y is given by

    fXY(x, y) = x + y   0 < x < 1, 0 < y < 1
              = 0       elsewhere

(a) Find the marginal density functions of X and Y.
(b) Are X and Y independent?
(c) Find the conditional density function of X given Y = 1/2.
(d) Find [i] Pr[X ≤ 1/2]  [ii] Pr[(0 < X < 0.5) | (Y = 1/2)]
(e) Find the joint pdf of W = XY and V = Y.
(f) Find the marginal pdf of W.
(g) Find the pdf of S = X + Y.

6. The joint pdf of S and T is given by

    fST(s, t) = c(2s + 3t)   0 ≤ s ≤ 1, 0 ≤ t ≤ 1
              = 0            elsewhere

(a) Use the definition of a joint pdf to show that c = 2/5.
(b) Find the marginal density functions of S and T.
(c) Are S and T independent?
(d) Find the conditional density function of S given T.
(e) Find [i] Pr[S < 1/2, T ≥ 0]  [ii] Pr[(0 ≤ S ≤ 1) | (T = 1/2)]
(f) Find the joint pdf of W = S + T and V = T.

Chapter 3

Moments of univariate distributions and moment generating function
3.1 Assumed statistical background

- More about random variables (Introstat, chapter 6).
- Probability mass functions: binomial, Poisson, geometric (Introstat, chapters 5 and 6).
- Probability density functions: Gaussian, uniform, exponential (Introstat, chapters 5 and 6).
- Chapter one of these notes.

3.2 Moments of univariate distributions

Let g(x) be any real-valued function defined on the real line. Then by g(X) we mean the random variable which takes on the value g(x) whenever X takes on the value x (e.g. g(x) = 2x + 3). In previous chapters, we have seen how in particular circumstances we can derive the distribution of g(X), once we know the distribution of X. The expectation of g(X) is, in the case of a continuous random variable, defined by:

    E[g(X)] = ∫_{−∞}^{∞} g(x) fX(x) dx

and in the case of a discrete random variable defined as:

    E[g(X)] = Σ_x g(x) pX(x)

Note that while g(X) is a random variable, E[g(X)] is a number.
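For a discrete distribution the definition is just a weighted sum, as the short Python sketch below illustrates (not part of the notes; it reuses the pmf of Example B of Chapter 1 and the illustrative g(x) = 2x + 3):

```python
# Expectation of g(X) for the discrete pmf of Example B: p_X(x) = x/10, x = 1, 2, 3, 4
pmf = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}

g = lambda x: 2 * x + 3

E_X  = sum(x * p for x, p in pmf.items())      # E[X] = 3.0
E_gX = sum(g(x) * p for x, p in pmf.items())   # E[g(X)] = 2 E[X] + 3 = 9.0
print(E_X, E_gX)
```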


If we interpret the probabilities in a frequentist sense, then this expectation can be seen as a long-run average of the random variable g(X). More generally, the value represents, in a sense, the centre of gravity of the distribution of g(X). An important special case arises when $g(x) = x^r$ for some positive integer value of r; the expectation is then called the r-th moment of the distribution of X, written in the case of a continuous random variable as:
$$\mu'_r = E[X^r] = \int_{-\infty}^{\infty} x^r f_X(x)\,dx$$
and in the case of a discrete random variable as:
$$\mu'_r = E[X^r] = \sum_x x^r p_X(x)$$
The case r = 1, or E[X], is well known: it is simply the expectation, or the mean, of X itself, which we shall often write as $\mu_X$.
For r > 1 it is more convenient to work with the expectations of $(X - \mu_X)^r$. These values are called the central moments of X, where the r-th central moment for a continuous random variable is defined by:
$$\mu_r = E[(X - \mu_X)^r] = \int_{-\infty}^{\infty} (x - \mu_X)^r f_X(x)\,dx$$
and in the case of a discrete random variable as:
$$\mu_r = E[(X - \mu_X)^r] = \sum_x (x - \mu_X)^r p_X(x)$$

Each central moment $\mu_r$ measures, in its own unique way, some aspect of the manner in which the distribution (or the observed values) of X is spread out around the mean $\mu_X$.
You should be familiar with the case r = 2, which gives the variance of X, also written as $\sigma_X^2$ or Var[X]. In the case of a continuous random variable we write:

$$
\begin{aligned}
\mu_2 &= E[(X - \mu_X)^2] \\
 &= \int_{-\infty}^{\infty} (x - \mu_X)^2 f_X(x)\,dx \\
 &= \int_{-\infty}^{\infty} (x^2 - 2\mu_X x + \mu_X^2)\,f_X(x)\,dx \\
 &= \int_{-\infty}^{\infty} x^2 f_X(x)\,dx \;-\; 2\mu_X \int_{-\infty}^{\infty} x f_X(x)\,dx \;+\; \mu_X^2 \int_{-\infty}^{\infty} f_X(x)\,dx \\
 &= E[X^2] - 2\mu_X\,\mu_X + \mu_X^2 \\
 &= \mu'_2 - \mu_X^2
\end{aligned}
$$

Homework: In the case of a discrete random variable, show that $\mu_2 = \mu'_2 - \mu_X^2$.


The variance is always non-negative (in fact, strictly positive, unless X takes on one specific value
with probability 1), and measures the magnitude of the spread of the distribution. This interpretation should be well-known from first year. We now introduce two further central moments, the
3rd and the 4th.
Consider a probability density function which has a skewed shape similar to that shown in Figure 3.1. The mean of this distribution is at x = 2, but the distribution is far from symmetric around this mean.

Figure 3.1: Example of a skew distribution (density f(x), with the mean indicated)

Now consider what happens when we examine the third central moment. For $X < \mu_X$ we have $(X - \mu_X)^3 < 0$, while $(X - \mu_X)^3 > 0$ for $X > \mu_X$. For a perfectly symmetric distribution, the positives and negatives cancel out in taking the expectation, so that $\mu_3 = 0$. But for a distribution such as that shown in Figure 3.1, very large positive values of $(X - \mu_X)$ will occur, but no very large negative values. The net result is that $\mu_3 > 0$, and we term such a distribution positively skewed. Negatively skewed distributions (with the long tail to the left) can also occur, but are perhaps less common in practice and in applications.
Of course, the magnitude of $\mu_3$ will also depend on the amount of spread. In order to obtain a feel for the skewness of the distribution, it is useful to eliminate the effect of the spread itself (which is already measured by the variance). This elimination is effected by defining a coefficient of skew by:
$$\frac{\mu_3}{\sigma_X^3} = \frac{\mu_3}{\left(\sqrt{\mathrm{Var}(X)}\right)^3}$$
which, incidentally, does not depend on the units of measurement used for X. For the distribution illustrated in Figure 3.1, the coefficient of skew turns out to be 1.4. (This distribution is, in fact, the gamma distribution with $\alpha = 2$.)
In defining and interpreting the fourth central moment, we may find it useful to examine Figure 3.2. The two distributions do in fact have the same mean (0) and variance (1). This may be surprising at first sight, as the more sharply peaked distribution appears to be more tightly concentrated around the mean. What has happened, however, is that this distribution has much longer tails. The flatter distribution (actually a Gaussian distribution) has a density very close to zero outside of the range $-3 < x < 3$; but for the more sharply peaked distribution, the density falls away much more slowly, and is still quite detectable at 5. In evaluating the variance, the occasionally very large values for $(X - \mu_X)^2$ inflate the variance sufficiently to produce equal variances for the two distributions.

Figure 3.2: Example of differences in kurtosis

But consider what happens when we calculate $\mu_4$: the occasional large discrepancies create an even greater effect when raised to the power 4, and thus $\mu_4$ is larger for the sharply-peaked-and-long-tailed distribution than for the flatter, short-tailed distribution. The single word to describe this contrast is kurtosis, and the sharp-peaked, long-tailed distribution is said to have greater kurtosis than the other.
Thus the fourth central moment, $\mu_4$, is a measure of kurtosis, in the sense that for two distributions having the same variance, the one with the higher $\mu_4$ has the greater kurtosis (is more sharply peaked and long-tailed). But as with the third moment, $\mu_4$ is also affected by the spread, and thus once again it is useful to have a measure of kurtosis only (describing the shape, not the spread, of the distribution). This elimination of spread is achieved by the coefficient of kurtosis defined as:
$$\frac{\mu_4}{\sigma_X^4} = \frac{\mu_4}{(\mathrm{Var}(X))^2}$$
which again does not depend on the units of measurement used for X.
For the Gaussian distribution (the flatter of the two densities illustrated in Figure 3.2), the coefficient of kurtosis is always 3 (irrespective of mean and variance). The more sharply peaked density in Figure 3.2 is that of a random variable which follows a so-called mixture distribution, i.e. its value derives from the Gaussian distribution with mean 0 and variance 4 with probability 0.2, and from the Gaussian distribution with mean 0 and variance 0.25 otherwise. The coefficient of kurtosis in this case turns out to be 9.75.
There is, however, an alternative definition of the coefficient of kurtosis, obtained by subtracting 3 from the above, i.e.:
$$\frac{\mu_4}{\sigma_X^4} - 3$$
so that the second definition measures, in effect, departure from the Gaussian distribution (negative values corresponding to distributions that are flatter and shorter-tailed than the Gaussian, and positive values to distributions which are more sharply peaked and heavier-tailed than the Gaussian).
One could in principle continue further with still higher order moments, but there seems to be little practical value in doing so: the first four moments do give considerable insight into the shape of the distribution. Working with moments has great practical value, since from any set of n observations of a random variable we can obtain the corresponding sample moments based on $\sum_{i=1}^{n}(x_i - \bar{x})^r$. These sample moments can be used to match the sample data to a particular family of distributions.
Some useful formulae: Apart from the first moment, it is the central moments $\mu_r$ which best describe the shape of the distribution, but more often than not it is easier to calculate the raw or uncentred moments $\mu'_r$. Fortunately, there are close algebraic relationships between the two types of moments, which are stated below for r = 2, 3, 4. The derivation of these relationships is left as an exercise. The relationship for the variance is so frequently used that it is worth remembering (although all three formulae are easily recollected once their derivation is understood):
$$\mu_2 = \sigma_X^2 = \mu'_2 - (\mu_X)^2$$
$$\mu_3 = \mu'_3 - 3\mu'_2\mu_X + 2(\mu_X)^3$$
$$\mu_4 = \mu'_4 - 4\mu'_3\mu_X + 6\mu'_2(\mu_X)^2 - 3(\mu_X)^4$$
Example: Suppose that X has the exponential distribution with parameter $\lambda$. The mean is given by:
$$\mu_X = E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx = \int_0^{\infty} x\,\lambda e^{-\lambda x}\,dx$$
which can be integrated by parts as follows:
$$\mu_X = \left[x(-e^{-\lambda x})\right]_0^{\infty} + \int_0^{\infty} e^{-\lambda x}\,dx = 0 + \frac{1}{\lambda} = \frac{1}{\lambda}$$
For r > 1 we have in general that:
$$\mu'_r = E[X^r] = \int_{-\infty}^{\infty} x^r f_X(x)\,dx = \int_0^{\infty} x^r \lambda e^{-\lambda x}\,dx$$
which may again be integrated by parts as follows:
$$\mu'_r = \left[x^r(-e^{-\lambda x})\right]_0^{\infty} + \int_0^{\infty} r x^{r-1} e^{-\lambda x}\,dx
= 0 + \frac{r}{\lambda}\int_0^{\infty} x^{r-1}\,\lambda e^{-\lambda x}\,dx = \frac{r}{\lambda}\,\mu'_{r-1}$$
From this recursion relationship it follows that $\mu'_2 = 2/\lambda^2$, $\mu'_3 = 6/\lambda^3$, and $\mu'_4 = 24/\lambda^4$. It is left as an exercise to convert these raw moments (using the above formulae) to the second, third and fourth central moments, and hence also to determine the coefficients of skew (2) and kurtosis (9) for the exponential distribution. Note that these coefficients do not depend on $\lambda$.
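As a quick check (not part of the original derivation), these two coefficients can be confirmed numerically; the Python sketch below estimates them from a large simulated exponential sample with $\lambda = 1.5$ (an arbitrary choice) and compares them with the exact values 2 and 9.

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 1.5                                    # arbitrary rate parameter
x = rng.exponential(scale=1.0 / lam, size=1_000_000)

mu = x.mean()
m2 = np.mean((x - mu) ** 2)                  # sample central moments
m3 = np.mean((x - mu) ** 3)
m4 = np.mean((x - mu) ** 4)

print("skew     :", m3 / m2 ** 1.5, "(exact 2)")
print("kurtosis :", m4 / m2 ** 2,   "(exact 9)")
```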

3.2.1 Moments - examples A-F

We consider Examples A-F of chapter one again, to obtain some of the moments:


Example A:
$$p_Y(y) = \binom{4}{y} p^y (1-p)^{4-y}, \quad y = 0, 1, 2, 3, 4$$
Note that $p_Y(y) \ge 0$ for all y = 0, 1, 2, 3, 4 (a countable set) and is zero for all other values of y.
Find the mean of Y:
$$
\begin{aligned}
E[Y] &= \sum_{y=0}^{4} y\,p_Y(y) \\
 &= \sum_{y=0}^{4} y\binom{4}{y} p^y (1-p)^{4-y} \\
 &= 0 + \sum_{y=1}^{4} \frac{4!}{(y-1)!(4-y)!}\,p^y (1-p)^{4-y} \\
 &= 4p \sum_{y=1}^{4} \frac{(4-1)!}{(y-1)!(4-y)!}\,p^{y-1}(1-p)^{4-y} \\
 &= 4p \sum_{x=0}^{3} \frac{3!}{x!\,(3+1-(x+1))!}\,p^{x}(1-p)^{3+1-(x+1)} \qquad \text{by setting } x = y-1 \text{ and } m = 4-1 \\
 &= 4p \sum_{x=0}^{3} \binom{3}{x} p^{x}(1-p)^{3-x} \\
 &= 4p(1) \qquad \text{because } \sum_{x=0}^{3}\binom{3}{x} p^{x}(1-p)^{3-x} = 1 \text{ (pmf of } B(3,p)) \\
 &= 4p
\end{aligned}
$$

Example B: Recall that the pmf of X is given by
$$p_X(x) = \frac{x}{10}, \quad x = 1, 2, 3, 4; \text{ zero elsewhere}$$
Find the mean and variance of X:
$$
E[X] = \sum_{x=1}^{4} x\,p_X(x) = \sum_{x=1}^{4} x\,\frac{x}{10}
 = 1\cdot\tfrac{1}{10} + 2\cdot\tfrac{2}{10} + 3\cdot\tfrac{3}{10} + 4\cdot\tfrac{4}{10}
 = \frac{1 + 4 + 9 + 16}{10} = \frac{30}{10} = 3
$$
$$
E[X^2] = \sum_{x=1}^{4} x^2\,\frac{x}{10}
 = 1\cdot\tfrac{1}{10} + 4\cdot\tfrac{2}{10} + 9\cdot\tfrac{3}{10} + 16\cdot\tfrac{4}{10}
 = \frac{1 + 8 + 27 + 64}{10} = \frac{100}{10} = 10
$$
$$
\mathrm{Var}[X] = E[X^2] - (E[X])^2 = 10 - 9 = 1
$$

Example C: Recall that the pmf of S is given by the following table:

      Value of S     0      1      2      3      4
      p_S(s)         0.15   0.25   0.25   0.15   0.20

The mean and variance of S are:
$$
E[S] = \sum_{s=0}^{4} s\,p_S(s) = 0(0.15) + 1(0.25) + 2(0.25) + 3(0.15) + 4(0.20) = 2.0
$$
$$
E[S^2] = \sum_{s=0}^{4} s^2\,p_S(s) = 0(0.15) + 1(0.25) + 4(0.25) + 9(0.15) + 16(0.20) = 5.8
$$
$$
\mathrm{Var}[S] = E[S^2] - (E[S])^2 = 5.8 - 4.0 = 1.8
$$

Example D: Recall that the pdf of T is given by
$$f_T(t) = \frac{3}{8}t^2 \quad \text{for } 0 \le t \le 2$$
Find $\mu'_r$:
$$
\mu'_r = E[T^r] = \int_0^2 t^r f_T(t)\,dt = \int_0^2 \frac{3}{8}\,t^{r+2}\,dt
 = \left[\frac{3}{8}\,\frac{t^{r+2+1}}{r+2+1}\right]_{t=0}^{2}
 = \frac{3}{8}\,\frac{2^{r+2+1}}{r+2+1}
$$
Thus the mean is
$$E[T] = \mu'_1 = \frac{3}{8}\,\frac{2^{1+2+1}}{1+2+1} = \frac{3}{2}$$
and
$$E[T^2] = \mu'_2 = \frac{3}{8}\,\frac{2^{2+2+1}}{2+2+1} = \frac{12}{5}$$
thus the variance is
$$\mathrm{Var}[T] = E[T^2] - (E[T])^2 = \frac{12}{5} - \frac{9}{4} = \frac{12\cdot 4 - 9\cdot 5}{20} = \frac{3}{20}$$
Example E: Recall that the pdf of W is given by
$$f_W(w) = 2e^{-2w} \quad \text{for } 0 < w < \infty$$
Find $\mu'_r$:
$$
\begin{aligned}
\mu'_r = E[W^r] &= \int_0^{\infty} w^r f_W(w)\,dw = \int_0^{\infty} w^r\,2e^{-2w}\,dw \\
 &= \int_0^{\infty} \left[\frac{x}{2}\right]^r 2e^{-x}\,\frac{1}{2}\,dx \qquad \text{[by setting } 2w = x,\ dw = \tfrac{1}{2}dx] \\
 &= \left(\frac{1}{2}\right)^r \int_0^{\infty} x^{(r+1)-1} e^{-x}\,dx \\
 &= \left(\frac{1}{2}\right)^r \Gamma(r+1)
\end{aligned}
$$
Thus the mean is
$$E[W] = \mu'_1 = \left(\tfrac{1}{2}\right)^1 \Gamma(1+1) = \frac{1}{2}$$
$$E[W^2] = \mu'_2 = \left(\tfrac{1}{2}\right)^2 \Gamma(2+1) = \frac{2}{4} = \frac{1}{2}$$
$$\mathrm{Var}[W] = \mu'_2 - (\mu'_1)^2 = \frac{1}{2} - \frac{1}{4} = \frac{1}{4}$$

Example F: Recall that the pdf of X is given by
$$f_X(x) = \frac{1}{10} \quad \text{for } 2 < x < 12$$
Find $\mu'_r$:
$$
\mu'_r = E[X^r] = \int_2^{12} x^r f_X(x)\,dx = \int_2^{12} x^r\,\frac{1}{10}\,dx
 = \left[\frac{1}{10}\,\frac{x^{r+1}}{r+1}\right]_{x=2}^{12}
 = \frac{1}{10}\,\frac{12^{r+1} - 2^{r+1}}{r+1}
$$
Thus the mean is
$$E[X] = \mu'_1 = \frac{1}{10}\,\frac{12^2 - 2^2}{2} = \frac{1}{10}\cdot\frac{1}{2}(144 - 4) = 7$$
$$E[X^2] = \mu'_2 = \frac{1}{10}\,\frac{12^3 - 2^3}{3} = \frac{1}{30}(12^3 - 2^3) = \frac{1720}{30}$$
and the variance is
$$\mathrm{Var}[X] = \mu'_2 - (\mu'_1)^2 = \frac{172}{3} - 49 = \frac{25}{3} = 8.3333$$

3.3 The moment generating function

The moment generating function (mgf) of a random variable X is defined by:
$$M_X(t) = E[e^{tX}]$$
for real-valued arguments t, i.e. by:
$$M_X(t) = \int_{-\infty}^{\infty} e^{tx} f_X(x)\,dx$$
for continuous random variables, or by:
$$M_X(t) = \sum_{x=0}^{\infty} e^{tx} p_X(x)$$
for discrete random variables. Note that the moment generating function is a function of t, and not of realizations of X. The random variable X (or more correctly, perhaps, its distribution) defines the function of t given by $M_X(t)$.
Clearly $M_X(0) = 1$, since for t = 0, $e^{tx} = 1$ (a constant). For non-zero values of t, we cannot be sure that $M_X(t)$ exists at all. For the purposes of this course, we shall assume that $M_X(t)$ exists for all t in some neighbourhood of t = 0, i.e. that there exists an $\epsilon > 0$ such that $M_X(t)$ exists for all $-\epsilon < t < +\epsilon$. This is a restrictive assumption, in that particular otherwise well-behaved distributions are excluded. We are, however, invoking this assumption for convenience only in this course. If we used a purely imaginary argument, i.e. $it$, where $i = \sqrt{-1}$, then the corresponding function $E[e^{itX}]$ (called the characteristic function of X, when viewed as a function of t) does exist for all proper distributions. Everything which we shall be doing with mgfs carries through to characteristic functions as well, but that extension involves us in issues of complex analysis. For ease of presentation in this course, therefore, we restrict ourselves to the mgf.
Recalling the power series expansion of $e^x$, we see that:
$$
\begin{aligned}
M_X(t) &= E\left[1 + tX + \frac{t^2X^2}{2!} + \frac{t^3X^3}{3!} + \frac{t^4X^4}{4!} + \cdots\right] \\
 &= 1 + tE[X] + \frac{t^2}{2!}E[X^2] + \frac{t^3}{3!}E[X^3] + \frac{t^4}{4!}E[X^4] + \cdots \\
 &= 1 + t\mu'_1 + \frac{t^2}{2!}\mu'_2 + \frac{t^3}{3!}\mu'_3 + \frac{t^4}{4!}\mu'_4 + \cdots
\end{aligned}
$$

Now consider what happens when we repeatedly differentiate $M_X(t)$. Writing:
$$M_X^{(r)} = \frac{d^r M_X(t)}{dt^r}$$
we obtain:
$$M_X^{(1)} = \mu'_1 + \frac{2t}{2!}\mu'_2 + \frac{3t^2}{3!}\mu'_3 + \frac{4t^3}{4!}\mu'_4 + \cdots
 = \mu'_1 + t\mu'_2 + \frac{t^2}{2!}\mu'_3 + \frac{t^3}{3!}\mu'_4 + \cdots$$
Similarly:
$$M_X^{(2)} = \mu'_2 + t\mu'_3 + \frac{t^2}{2!}\mu'_4 + \frac{t^3}{3!}\mu'_5 + \cdots,$$
and continuing in this way, we have in general:
$$M_X^{(r)} = \mu'_r + t\mu'_{r+1} + \frac{t^2}{2!}\mu'_{r+2} + \frac{t^3}{3!}\mu'_{r+3} + \cdots.$$
If we now set t = 0 in the above expressions, we obtain $\mu'_1 = \mu_X = M_X^{(1)}(0)$, $\mu'_2 = M_X^{(2)}(0)$, and in general:
$$\mu'_r = M_X^{(r)}(0).$$
We thus have a procedure for determining moments by performing just one integration or summation (to get $M_X(t)$), and the required number of differentiations. This is often considerably simpler than attempting to compute the moments directly by repeated integrations or summations. This only gives the uncentred moments, but the central moments can be derived from these raw moments, using the formulae given earlier in this chapter.
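A small symbolic sketch of this procedure (not part of the original notes) is given below. It differentiates the mgf of the exponential distribution, $M_X(t) = \lambda/(\lambda - t)$ (a special case of the gamma mgf derived later in this section), and recovers the raw moments $r!/\lambda^r$ obtained earlier by integration by parts.

```python
import sympy as sp

t, lam = sp.symbols('t lambda', positive=True)
M = lam / (lam - t)                         # mgf of the exponential distribution (assumed known)

for r in range(1, 5):
    mu_r = sp.diff(M, t, r).subs(t, 0)      # mu'_r = r-th derivative of M at t = 0
    print(r, sp.simplify(mu_r))             # 1/lambda, 2/lambda**2, 6/lambda**3, 24/lambda**4
```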
The expansion for MX (t) used above indicates that the mgf is fully determined by the moments
of the distribution, and vice versa. Since the distribution is in fact fully characterized by its
moments, this correspondence suggests that there is a one-to-one correspondence between mgfs
and probability distributions. This argument is not a proof at this stage, but the above assertion
can in fact be proved for all probability distributions whose mgfs exist in a neighbourhood of
t = 0. The importance of this result is that if we can derive the mgf of a random variable, then
we have in principle also found its distribution. In practice, what we do is to calculate and record
the mgfs for a variety of distributional forms. Then when we find a new mgf, we can check back
to see which distribution it matches. This idea will be illustrated later. We now derive mgfs
for some important distributional classes.
Example (Geometric distribution): $p_X(x) = pq^x$ for $x = 0, 1, 2, \ldots$, where $q = 1 - p$. Thus:
$$M_X(t) = E[e^{Xt}] = \sum_{x=0}^{\infty} e^{tx} p_X(x) = \sum_{x=0}^{\infty} e^{tx} p q^x = p \sum_{x=0}^{\infty} (qe^t)^x.$$
Using the standard sum of a geometric series, we obtain $M_X(t) = p/(1 - qe^t)$, provided that $qe^t < 1$. The mgf thus exists for all $t < -\ln(q)$, where $-\ln(q)$ is a positive upper bound since $q < 1$ by definition.
Exercise: Obtain the mean and variance of the geometric distribution from the mgf.
Example (Poisson distribution):
$$p_X(x) = \frac{\lambda^x e^{-\lambda}}{x!}$$
The mgf is thus:
$$
M_X(t) = \sum_{x=0}^{\infty} e^{tx}\,\frac{\lambda^x e^{-\lambda}}{x!}
 = \sum_{x=0}^{\infty} \frac{(\lambda e^t)^x e^{-\lambda}}{x!}
 = e^{-\lambda} e^{\lambda e^t} \sum_{x=0}^{\infty} \frac{(\lambda e^t)^x e^{-\lambda e^t}}{x!}
$$
Now the term in the summation expression is the pmf of a Poisson distribution with parameter $\lambda e^t$, and the summation thus evaluates to 1. The mgf is thus:
$$M_X(t) = e^{\lambda(e^t - 1)}$$
This recognition of a term which is equivalent to the pdf or pmf of the original distribution, but with modified parameters, is often the key to evaluating mgfs.

The first two derivatives are:
$$M_X^{(1)}(t) = \lambda e^t e^{\lambda(e^t - 1)}$$
and:
$$M_X^{(2)}(t) = \lambda e^t e^{\lambda(e^t - 1)} + (\lambda e^t)^2 e^{\lambda(e^t - 1)}$$
Setting t = 0 in these expressions gives $\mu_X = \lambda$ and $\mu'_2 = \lambda + \lambda^2$. Thus:
$$\sigma_X^2 = \mu'_2 - (\mu_X)^2 = \lambda.$$

Exercise (Binomial distribution): Show that the mgf of the binomial distribution is given by:
$$M_X(t) = (q + pe^t)^n$$
where $q = 1 - p$. Hint: combine the terms with x as exponent.

Example (Gamma distribution):
$$
\begin{aligned}
M_X(t) &= \int_0^{\infty} e^{tx}\,\frac{\lambda^{\alpha} x^{\alpha-1} e^{-\lambda x}}{\Gamma(\alpha)}\,dx \\
 &= \int_0^{\infty} \frac{\lambda^{\alpha} x^{\alpha-1} e^{-(\lambda - t)x}}{\Gamma(\alpha)}\,dx \\
 &= \frac{\lambda^{\alpha}}{(\lambda - t)^{\alpha}} \int_0^{\infty} \frac{(\lambda - t)^{\alpha} x^{\alpha-1} e^{-(\lambda - t)x}}{\Gamma(\alpha)}\,dx
\end{aligned}
$$
The integrand in the last line above is the pdf of the gamma distribution with parameters $\alpha$ and $\lambda - t$, provided that $t < \lambda$. Thus, for $t < \lambda$, the integral above evaluates to 1. Note how once again we have recognized another form of the distribution with which we began. We have thus demonstrated that for the gamma distribution:
$$M_X(t) = \frac{\lambda^{\alpha}}{(\lambda - t)^{\alpha}} = \left(1 - \frac{t}{\lambda}\right)^{-\alpha}$$
We leave it as an exercise to verify that $\mu_X = \alpha/\lambda$, and that $\sigma_X^2 = \alpha/\lambda^2$, from the mgf.
Recall that the $\chi^2$ distribution with n degrees of freedom is the gamma distribution with $\alpha = n/2$ and $\lambda = \tfrac{1}{2}$. Thus the mgf of the $\chi^2$ distribution with n degrees of freedom is given by:
$$M_X(t) = (1 - 2t)^{-n/2}.$$
We shall make use of this result later.
We consider Examples A-F of chapter one again, to obtain some moment generating functions:
Example A: $Y \sim B(4, p)$ - already done as homework (see the binomial exercise above).
Examples B, C and D: the mgf is not useful here.
Example E: Let W be a random variable of the continuous type with pdf given by
$$f_W(w) = 2e^{-2w} \quad \text{for } 0 < w < \infty.$$
The mgf of W is given by
$$
M_W(t) = E[e^{Wt}] = \int_0^{\infty} e^{wt} f_W(w)\,dw = \int_0^{\infty} e^{wt}\,2e^{-2w}\,dw = \int_0^{\infty} 2e^{-w(2-t)}\,dw
$$
To solve this integral we make the transformation
$$w(2 - t) = x, \quad \text{so that } w = \frac{x}{2-t} \text{ and } dw = \frac{1}{2-t}\,dx,$$
thus (for $t < 2$):
$$
M_W(t) = \int_0^{\infty} 2e^{-x}\,\frac{1}{2-t}\,dx = \frac{2}{2-t}\,[0 - (-1)] = \frac{2}{2-t}
$$
Example F: Let X be a random variable of the continuous type with pdf given by
$$f_X(x) = \frac{1}{10} \quad \text{for } 2 < x < 12.$$
The mgf of X is given by
$$
\begin{aligned}
M_X(t) = E[e^{Xt}] &= \int_2^{12} e^{xt} f_X(x)\,dx = \int_2^{12} e^{xt}\,\frac{1}{10}\,dx \\
 &= \frac{1}{10}\,\frac{1}{t}\int_2^{12} t\,e^{xt}\,dx = \frac{1}{10t}\left[e^{xt}\right]_{x=2}^{12} \\
 &= \frac{1}{10t}\,[e^{12t} - e^{2t}] = \frac{e^{12t} - e^{2t}}{10t}
\end{aligned}
$$
(for $t \ne 0$; as always, $M_X(0) = 1$).

3.4 Moment generating functions for functions of random variables

Apart from its use in facilitating the finding of moments, the mgf is a particularly useful tool in deriving the distributions of functions of one or more random variables. We shall be seeing a number of examples of this application later. For now, we examine the general principles.
Suppose that we know the mgf of a random variable X, and that we are now interested in the mgf of Y = aX + b, for some constants a and b. Clearly:
$$E[e^{tY}] = E[e^{atX + bt}] = e^{bt} E[e^{atX}]$$
since b, and thus $e^{bt}$, is a constant for any given t. We have thus demonstrated that $M_Y(t) = e^{bt} M_X(at)$. Note that by $M_X(at)$ we mean the mgf of X evaluated at an argument of $at$ (hint: think of $M_X(\cdot)$ as the mgf of the random variable X evaluated at the argument "$\cdot$").
For example, suppose that X is Gaussian distributed with mean $\mu$ and variance $\sigma^2$. We know that $Z = (X - \mu)/\sigma$ has the standard Gaussian distribution, i.e. $M_Z(t) = e^{t^2/2}$. But then $X = \sigma Z + \mu$, and by the above result:
$$M_X(t) = e^{\mu t} M_Z(\sigma t) = e^{\mu t + \sigma^2 t^2/2}.$$

A second important property relates to the mgfs of sums of independent random variables. Suppose that X and Y are independent random variables with known mgfs. Let U = X + Y; then:
$$M_U(t) = E[e^{tU}] = E[e^{tX + tY}] = E[e^{tX} e^{tY}] = E[e^{tX}]\,E[e^{tY}]$$
where the last equality follows from the independence of X and Y. In other words:
$$M_U(t) = M_X(t)\,M_Y(t).$$
This result can be extended: for example, if Z is independent of X and Y (and thus also of U = X + Y), and we define V = X + Y + Z, then V = U + Z, and thus:
$$M_V(t) = M_U(t)\,M_Z(t) = M_X(t)\,M_Y(t)\,M_Z(t).$$
Taking this argument further in an inductive sense, we have the following theorem:

Theorem 3.1 Suppose that $X_1, X_2, X_3, \ldots, X_n$ are independent random variables, and that $M_i(t)$ is (for ease of notation) the mgf of $X_i$. Then the mgf of $S = \sum_{i=1}^{n} X_i$ is given by:
$$M_S(t) = \prod_{i=1}^{n} M_i(t).$$

An interesting special case of the theorem is that in which the $X_i$ are identically distributed, which implies that the moment generating functions are identical: $M_X(t)$, say. In this case:
$$M_S(t) = [M_X(t)]^n.$$
Since the sample mean is $\bar{X} = S/n$, we can combine our previous results to get:
$$M_{\bar{X}}(t) = \left[M_X\!\left(\frac{t}{n}\right)\right]^n.$$
This is an important result which we will use again later.
We can use Theorem 3.1 to prove an interesting property of the Poisson, Gamma and Gaussian distributions (which does not carry over to all distributions in general, however). This property can be described as the closure of each of these families under addition of independent variables within the family.
Poisson Distr.: Suppose that $X_1, X_2, X_3, \ldots, X_n$ are independent random variables, such that $X_i$ has the Poisson distribution with parameter $\lambda_i$. Then:
$$M_i(t) = e^{\lambda_i(e^t - 1)}$$
and the mgf of $S = \sum_{i=1}^{n} X_i$ is:
$$\prod_{i=1}^{n} e^{\lambda_i(e^t - 1)} = \exp\left[\sum_{i=1}^{n} \lambda_i (e^t - 1)\right]$$
which is the mgf of the Poisson distribution with parameter $\sum_{i=1}^{n} \lambda_i$. Thus S has this Poisson distribution.

Gamma Distr.: Suppose that $X_1, X_2, X_3, \ldots, X_n$ are independent random variables, such that each $X_i$ has a Gamma distribution with a common value for the $\lambda$ parameter, but with possibly different values for the $\alpha$ parameter, say $\alpha_i$ for $X_i$. Then:
$$M_i(t) = \left(1 - \frac{t}{\lambda}\right)^{-\alpha_i}$$
and the mgf of $S = \sum_{i=1}^{n} X_i$ is:
$$\prod_{i=1}^{n}\left(1 - \frac{t}{\lambda}\right)^{-\alpha_i} = \left(1 - \frac{t}{\lambda}\right)^{-\sum_{i=1}^{n}\alpha_i}$$
which is the mgf of the Gamma distribution with parameters $\sum_{i=1}^{n}\alpha_i$ and $\lambda$. Thus S has this Gamma distribution.
For the special case of the chi-squared distribution, suppose that $X_i$ has the $\chi^2$ distribution with $r_i$ degrees of freedom. It follows from the above result that S then has the $\chi^2$ distribution with $\sum_{i=1}^{n} r_i$ degrees of freedom.

Gaussian Distr.: Suppose that $X_1, X_2, X_3, \ldots, X_n$ are independent random variables, such that $X_i$ has a Gaussian distribution with mean $\mu_i$ and variance $\sigma_i^2$. Then:
$$M_i(t) = e^{\mu_i t + \sigma_i^2 t^2/2}$$
and the mgf of $S = \sum_{i=1}^{n} X_i$ is:
$$\exp\left[\sum_{i=1}^{n} \mu_i t + \sum_{i=1}^{n} \sigma_i^2 t^2/2\right]$$
which is the mgf of the Gaussian distribution with mean $\sum_{i=1}^{n}\mu_i$ and variance $\sum_{i=1}^{n}\sigma_i^2$. Thus S (the sum) has the Gaussian distribution with this mean and variance.
The concept of using the mgf to derive distributions extends beyond additive transformations. The following result illustrates this, and is of sufficient importance to be classed as a theorem.

Theorem 3.2 Suppose that X has the Gaussian distribution with mean $\mu$ and variance $\sigma^2$, and let:
$$Y = \frac{(X - \mu)^2}{\sigma^2}.$$
Then Y has the $\chi^2$ distribution with one degree of freedom.
(We will skip the proof.)
Corollary: If $X_1, X_2, \ldots, X_n$ are independent random variables from a common Gaussian distribution with mean $\mu$ and variance $\sigma^2$, then:
$$Y = \sum_{i=1}^{n} \frac{(X_i - \mu)^2}{\sigma^2}$$
has the $\chi^2$ distribution with n degrees of freedom. This is a very important result!
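The corollary is easy to check by simulation (a sketch not in the original notes): generate many samples of n standardized squared deviations from a Gaussian population and compare their sums with the $\chi^2$ distribution with n degrees of freedom; the values $\mu = 5$, $\sigma = 2$, n = 4 below are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
mu, sigma, n, reps = 5.0, 2.0, 4, 100_000

x = rng.normal(mu, sigma, size=(reps, n))
y = np.sum((x - mu) ** 2 / sigma ** 2, axis=1)   # one value of Y per simulated sample

# Compare a few quantiles of the simulated Y with those of chi-square(n)
for q in (0.25, 0.5, 0.95):
    print(q, np.quantile(y, q), stats.chi2(df=n).ppf(q))
```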

3.5 The central limit theorem

We come now to what is perhaps the best known, and most widely used and misused, result concerning convergence in distribution, viz. the central limit theorem. We state it in the following form:


Theorem 3.3 (The Central Limit Theorem) Let $X_1, X_2, X_3, \ldots$ be an iid sequence of random variables, having finite mean ($\mu$) and variance ($\sigma^2$). Suppose that the common mgf, say $M_X(t)$, and its first two derivatives exist in a neighbourhood of t = 0. For each n, define:
$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$$
and:
$$Z_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}.$$
Then the sequence $Z_1, Z_2, Z_3, \ldots$ converges in distribution to Z, which has the standard Gaussian distribution.
Comment 1: The practical implication of the central limit theorem is that for large enough n,
the distribution of the sample mean can be approximated by a Gaussian distribution with
mean and variance 2 /n, provided only that the underlying sampling distribution satisfies
the conditions of the theorem. This is very useful, as it allows the powerful statistical inferential procedures based on Gaussian theory to be applied, even when the sampling distribution
itself is not Gaussian. However, the theorem can be seriously misused by application to cases
with relatively small n.
Comment 2: The assumption of the existence of the twice-differentiable mgf is quite strong.
However, the characteristic function (i.e. using imaginary arguments) and its first two derivatives will exist if the first two moments exist, which we have already assumed.
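The following simulation sketch (not part of the original notes) illustrates Comment 1: sample means of size n = 30 drawn from a markedly non-Gaussian population (here exponential with $\lambda = 1$, an arbitrary choice) are already close to the $N(\mu, \sigma^2/n)$ approximation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, reps = 30, 50_000
lam = 1.0                                    # exponential population: mean 1/lam, variance 1/lam**2

means = rng.exponential(scale=1.0 / lam, size=(reps, n)).mean(axis=1)
z = (means - 1.0 / lam) / ((1.0 / lam) / np.sqrt(n))   # standardized sample means

# Compare P(Z_n <= z) with the standard Gaussian cdf at a few points
for point in (-1.0, 0.0, 1.645):
    print(point, np.mean(z <= point), stats.norm.cdf(point))
```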

Tutorial exercises
1. A random variable X has probability density function given by
$$f_X(x) = \begin{cases} \dfrac{1}{64}x^3, & 0 \le x \le 4 \\ 0 & \text{otherwise} \end{cases}$$
(a) Find the expectation and variance of X.
(b) Is this distribution skew? Motivate your answer.

2. A random variable Y has probability density function given by
$$f_Y(y) = \begin{cases} e^{-y}, & 0 \le y < \infty \\ 0 & \text{otherwise} \end{cases}$$
(a) Find $\mu'_r$.
(b) Find the mean and variance of Y.
(c) Find the kurtosis of Y.
(d) Find the mgf, $M_Y(t)$, of Y.
(e) Use the mgf to derive the mean and variance of Y.
(f) Let W = 2Y; find the mgf of W. Can you identify the distribution of W?

3. A random variable X has probability density function given by
$$f_X(x) = \begin{cases} 2x, & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$
(a) Find $\mu'_r$.
(b) Find the mean and variance of X.
(c) Find the kurtosis of X.
(d) Comment on the distribution of X.

4. A random variable X has probability mass function given by
$$p_X(x) = \begin{cases} \dfrac{4^x e^{-4}}{x!}, & x = 0, 1, 2, \ldots \\ 0 & \text{otherwise} \end{cases}$$
(a) Find $\mu'_r$.
(b) Find the mean and variance of X.
(c) Find the mgf of X.
(d) Use this mgf to find the first two moments of X.

5. A random variable T has probability mass function given by
$$p_T(t) = \begin{cases} \dfrac{\binom{5}{t}\binom{4}{2-t}}{36}, & t = 0, 1, 2 \\ 0 & \text{otherwise} \end{cases}$$
Find the mean and variance of T.


6. A random variable X has probability density function given by
$$f_X(x) = \begin{cases} \dfrac{10}{x^2}, & x > 10 \\ 0, & 0 \le x \le 10 \end{cases}$$
Find the mean and variance of X.


7. The following questions are from Hogg and Craig (1978):
(a) Let $\bar{X}$ denote the mean of a random sample of size 75 from the distribution that has the pdf
$$f_X(x) = \begin{cases} 1, & 0 < x < 1 \\ 0 & \text{elsewhere} \end{cases}$$
Use the central limit theorem to show that, approximately, $\Pr[0.45 < \bar{X} < 0.55] = 0.866$.
(b) Let $\bar{X}$ denote the mean of a random sample of size 100 from a distribution that is $\chi^2_{50}$ (chi-square with 50 degrees of freedom). Compute an approximate value of $\Pr[49 < \bar{X} < 51]$.

Hogg R.V. and Craig A.T. (1978): Introduction to Mathematical Statistics (fourth edition). Macmillan Publishing Co., Inc., New York, 438 p.


Chapter 4

Moments of bivariate distributions

4.1 Assumed statistical background

Chapters two and three of these notes.

4.2 Moments of bivariate distributions: covariance and correlation

The concept of moments is directly generalizable to bivariate (or any multivariate) distributions. If (X, Y) is a bivariate random variable, then we define the joint (r, s)-th moment by:
$$\mu'_{rs} = E[X^r Y^s]$$
which for continuous random variables is given by:
$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x^r y^s f_{XY}(x, y)\,dy\,dx$$
and for discrete random variables by:
$$\sum_x \sum_y x^r y^s p_{XY}(x, y).$$
As with univariate moments, it is useful to subtract out the means for higher order moments. Thus the (r, s)-th central moment of (X, Y) is defined by:
$$\mu_{rs} = E[(X - \mu_X)^r (Y - \mu_Y)^s]$$
The simplest such case is when r = 1 and s = 1, thus
$$\mu_{11} = E[(X - \mu_X)(Y - \mu_Y)]$$
which is termed the covariance of X and Y, written as Covar(X, Y) or as $\sigma_{XY}$. While variance measures the extent of dispersion of a single variable about its mean, covariance measures the extent to which two variables vary together around their means. If large values of X (i.e. $X > \mu_X$) tend to be associated with large values of Y (i.e. $Y > \mu_Y$), and vice versa, then $(X - \mu_X)(Y - \mu_Y)$ will tend to take on positive values more often than negative values, and we will have $\sigma_{XY} > 0$. X and Y will then be said to be positively correlated. Conversely, if large values of the one variable tend to be associated with small values of the other, then we have $\sigma_{XY} < 0$, and the variables are said to be negatively correlated. If $\sigma_{XY} = 0$, then we say that X and Y are uncorrelated.
Covariance, or the sample estimate of covariance, is an extremely important concept in statistical
practice. Two important uses are the following:
Exploring possible causal links: For example, early observational studies had shown that the
level of cigarette smoking and the incidence of lung cancer were positively correlated. This
suggested a plausible hypothesis that cigarette smoking was a causative factor in lung cancer.
This was not yet a proof, but suggested important lines of future research.
Prediction: Whether or not a causal link exists, it remains true that if $\sigma_{XY} > 0$, and we observe $X \gg \mu_X$, then we would be led to predict that $Y \gg \mu_Y$. Thus, even without proof of
a causal link between cigarette smoking and lung cancer, the actuary would be justified in
classifying a heavy smoker as a high risk for lung cancer (even if, for example, it is propensity
to cancer that causes addiction to cigarettes).
As with other moments, it is often easier to calculate the uncentred moment $\mu'_{11}$ than to calculate the covariance by direct integration. The following is thus a useful result, worth remembering:
$$
\sigma_{XY} = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - \mu_X E[Y] - \mu_Y E[X] + \mu_X\mu_Y = \mu'_{11} - \mu_X\mu_Y
$$
We continue with examples G and H from chapter two:


Example G: The joint probability mass function of X and Y is:

               x = 0     x = 1     x = 2     x = 3
      y = 0    0.03125   0.06250   0.03125   0
      y = 1    0.06250   0.15625   0.12500   0.03125
      y = 2    0.03125   0.12500   0.15625   0.06250
      y = 3    0         0.03125   0.06250   0.03125

In order to obtain the covariance, we first calculate:
$$
E[XY] = \sum_{x=0}^{3}\sum_{y=0}^{3} xy\,p_{XY}(x,y) = 0\cdot 0\cdot 0.03125 + \cdots + 3\cdot 3\cdot 0.03125 = 2.5 = \frac{5}{2}
$$
$$
E[X] = \sum_{x=0}^{3} x\,p_X(x) = 0(0.125) + 1(0.375) + 2(0.375) + 3(0.125) = 1.5 = \frac{3}{2}
$$
It is left as an exercise to show that $\mu_Y = \frac{3}{2}$, $\sigma_X^2 = \frac{3}{4}$ and $\sigma_Y^2 = \frac{3}{4}$. The covariance is:
$$
\sigma_{XY} = E[XY] - \mu_X\mu_Y = \frac{5}{2} - \frac{3}{2}\cdot\frac{3}{2} = \frac{1}{4}
$$

Example H: Recall that the joint pdf of X, Y is given by:
$$f_{XY}(x, y) = \frac{3}{28}(x + y^2) \quad \text{for } 0 \le x \le 2;\ 0 \le y \le 2$$
It is left as an exercise to show that $\mu_X = 16/14$, $\mu_Y = 9/7$, $\sigma_X^2 = \frac{46}{147}$ and $\sigma_Y^2 = \frac{71}{245}$. In order to obtain the covariance, we first calculate:
$$
\begin{aligned}
\mu'_{11} = E[XY] &= \int_{x=0}^{2}\int_{y=0}^{2} xy\,f_{XY}(x,y)\,dy\,dx \\
 &= \frac{3}{28}\int_{x=0}^{2}\int_{y=0}^{2} xy\,(x + y^2)\,dy\,dx \\
 &= \frac{3}{28}\int_{x=0}^{2}\int_{y=0}^{2} (x^2 y + x y^3)\,dy\,dx \\
 &= \frac{3}{28}\int_{x=0}^{2} \left[\frac{x^2y^2}{2} + \frac{xy^4}{4}\right]_{y=0}^{2} dx \\
 &= \frac{3}{28}\int_{x=0}^{2} (2x^2 + 4x)\,dx \\
 &= \frac{3}{28}\left[\frac{2x^3}{3} + \frac{4x^2}{2}\right]_{x=0}^{2} \\
 &= \frac{3}{28}\left(\frac{2\cdot 4\cdot 2}{3} + \frac{4\cdot 2\cdot 2}{2}\right) = \frac{10}{7}
\end{aligned}
$$
Thus the covariance is:
$$\sigma_{XY} = \frac{10}{7} - \frac{16}{14}\cdot\frac{9}{7} = -\frac{2}{49}$$
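A short numerical check of these values (not in the original notes) can be done by integrating the joint pdf directly; the sketch below uses scipy's dblquad.

```python
from scipy.integrate import dblquad

# joint pdf of Example H; dblquad expects the inner variable (y) as first argument
f = lambda y, x: 3.0 / 28.0 * (x + y ** 2)

E_XY = dblquad(lambda y, x: x * y * f(y, x), 0, 2, 0, 2)[0]
E_X  = dblquad(lambda y, x: x * f(y, x),     0, 2, 0, 2)[0]
E_Y  = dblquad(lambda y, x: y * f(y, x),     0, 2, 0, 2)[0]

print(E_XY, 10 / 7)                 # 1.4285...  = 10/7
print(E_XY - E_X * E_Y, -2 / 49)    # -0.0408... = -2/49
```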

We now state and prove a few important results concerning bivariate random variables, which depend on the covariance. We start, however, with a more general result:

Theorem 4.1 If X and Y are independent random variables, then for any real-valued functions g(x) and h(y), $E[g(X)h(Y)] = E[g(X)]\,E[h(Y)]$.

Proof: We shall give the proof for continuous distributions only. The discrete case follows analogously. Since, by independence, we have that $f_{XY}(x, y) = f_X(x)f_Y(y)$, it follows that:
$$
\begin{aligned}
E[g(X)h(Y)] &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x)h(y)f_X(x)f_Y(y)\,dy\,dx \\
 &= \int_{-\infty}^{\infty} g(x)f_X(x)\left[\int_{-\infty}^{\infty} h(y)f_Y(y)\,dy\right] dx \\
 &= \left[\int_{-\infty}^{\infty} g(x)f_X(x)\,dx\right]\left[\int_{-\infty}^{\infty} h(y)f_Y(y)\,dy\right] \\
 &= E[g(X)]\,E[h(Y)]
\end{aligned}
$$
Note in particular that this result implies that if X and Y are independent, then $E[XY] = \mu'_{11} = \mu_X\mu_Y$, and thus that $\sigma_{XY} = 0$. We record this result as a theorem:

Theorem 4.2 If X and Y are independent random variables, then $\sigma_{XY} = 0$.

The converse of this theorem is not true in general (i.e. we cannot in general conclude that X and Y are independent if $\sigma_{XY} = 0$), although an interesting special case does arise with the Gaussian distribution (see property 3 of the bivariate Gaussian distribution). That the converse is not true is demonstrated by the following simple discrete example:

Example of uncorrelated but dependent variables: Suppose that X and Y are discrete random variables, with $p_{XY}(x, y) = 0$, except for the four cases indicated below:
$$p_{XY}(0, 1) = p_{XY}(1, 0) = p_{XY}(0, -1) = p_{XY}(-1, 0) = \tfrac{1}{4}$$
Note that $p_X(1) = p_X(-1) = \tfrac{1}{4}$ and $p_X(0) = \tfrac{1}{2}$, and similarly for Y. Thus X and Y are not independent, because (for example) $p_{XY}(0, 0) = 0$, while $p_X(0)p_Y(0) = \tfrac{1}{4}$.
We see easily that $\mu_X = \mu_Y = 0$, and thus $\sigma_{XY} = E[XY]$. But XY = 0 for all four cases with non-zero probability, and thus E[XY] = 0. Thus the variables are uncorrelated, but dependent.
Theorem 4.3 For any real numbers a and b:
$$\mathrm{Var}[aX + bY] = a^2\sigma_X^2 + b^2\sigma_Y^2 + 2ab\,\sigma_{XY}$$
Proof: Clearly:
$$E[aX + bY] = a\,E[X] + b\,E[Y] = a\mu_X + b\mu_Y$$
Thus the variance of aX + bY is given by:
$$
\begin{aligned}
E[(aX + bY - a\mu_X - b\mu_Y)^2]
 &= E[a^2(X-\mu_X)^2 + b^2(Y-\mu_Y)^2 + 2ab(X-\mu_X)(Y-\mu_Y)] \\
 &= a^2 E[(X-\mu_X)^2] + b^2 E[(Y-\mu_Y)^2] + 2ab\,E[(X-\mu_X)(Y-\mu_Y)] \\
 &= a^2\sigma_X^2 + b^2\sigma_Y^2 + 2ab\,\sigma_{XY}
\end{aligned}
$$
Special Cases: The following are useful special cases:
$$\mathrm{Var}[X + Y] = \sigma_X^2 + \sigma_Y^2 + 2\sigma_{XY}$$
$$\mathrm{Var}[X - Y] = \sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}$$
If X and Y are independent (or even if only uncorrelated), these two cases reduce to:
$$\mathrm{Var}[X + Y] = \mathrm{Var}[X - Y] = \sigma_X^2 + \sigma_Y^2$$


As with the interpretation of third and fourth moments, the interpretation of covariance is confounded by the fact that the magnitude of $\sigma_{XY}$ is influenced by the spreads of X and Y themselves. As before, we can eliminate the effects of the variances by defining an appropriate correlation coefficient, namely
$$\rho_{XY} = \frac{\sigma_{XY}}{\sigma_X\sigma_Y} = \frac{\mathrm{Covar}[X, Y]}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}}$$
To summarize:
1. $\rho_{XY}$ has the same sign as $\sigma_{XY}$, and takes on the value zero when X and Y are uncorrelated.
2. If X and Y are precisely linearly related, then $|\rho_{XY}| = 1$, with the sign being determined by the slope of the line; and
3. If X and Y are independent, then $\rho_{XY} = 0$.
The magnitude of the correlation coefficient is thus a measure of the degree to which the two variables are linearly related, while its sign indicates the direction of this relationship.

Example G: In example G, we had $\sigma_X^2 = \frac{3}{4}$, $\sigma_Y^2 = \frac{3}{4}$ and $\sigma_{XY} = \frac{1}{4}$. Thus:
$$\rho_{XY} = \frac{\frac{1}{4}}{\sqrt{\frac{3}{4}\cdot\frac{3}{4}}} = \frac{1}{3}$$
Example H: In example H, we had $\sigma_X^2 = \frac{46}{147}$, $\sigma_Y^2 = \frac{71}{245}$ and $\sigma_{XY} = -\frac{2}{49}$. Thus:
$$\rho_{XY} = \frac{-\frac{2}{49}}{\sqrt{0.31293 \times 0.28980}} = -0.13554$$
Example I: For homework, show that $\rho_{XY} = 0$.

4.3 Conditional moments and regression of the mean

An alternative to covariance or correlation as a means of investigating the relationship between two random variables is to examine the means of the two conditional distributions. The conditional pdfs $f_{X|Y}(x|y)$ and $f_{Y|X}(y|x)$ (or, for discrete distributions, the conditional probability mass functions $p_{X|Y}(x|y)$ and $p_{Y|X}(y|x)$) define proper probability distributions, for which moments, and in particular means, can be calculated in the usual manner. We shall write $\mu_{X|y}$, or $E[X|Y = y]$, to represent the mean of X conditional on Y = y, and similarly $\mu_{Y|x}$, or $E[Y|X = x]$, for Y conditional on X = x.
Note that initially we have a sample space that includes all possible values of X and Y. But when we condition, we consider only a fragment of the original space.
The conditional mean $E[X|Y = y]$ is the expectation (or long-run average) of X amongst all outcomes for which Y = y (or, more correctly for continuous random variables, for $y \le Y \le y + h$ in the limit as $h \to 0$). Note that $E[X|Y = y]$ is a function of the real number y only; it is not a random variable, and it does not depend on any observed value of X, since it is an average over all X's within a stated class. The conditional expectation of X given Y = y is also termed the regression (of the mean of) X on Y, which is often plotted graphically (i.e. $E[X|Y = y]$ versus y).


We can, of course, also compute the regression of Y on X, and it is worth emphasizing that the
two regressions will not in general give identical plots.
In first year, you were introduced to the concept of linear regression, which can be viewed as a best
linear approximation to the relationship between E[Y |X = x] and x. In the previous chapter, we
noted that for the bivariate gaussian distribution, the means of the two conditional distributions,
i.e. the regressions are truly linear. In general, however, regressions will not be linear for other
distributions. This contrast is illustrated in the following example:
Example H (cont.): We continue further with example H from the previous chapter, for which:
$$f_{XY}(x, y) = \frac{3}{28}(x + y^2) \quad \text{for } 0 \le x \le 2;\ 0 \le y \le 2$$
From our previous results, it is easy to confirm that:
$$f_{Y|X}(y|x) = \frac{x + y^2}{2x + \frac{8}{3}}$$
for $0 \le y \le 2$, and zero elsewhere (but only defined for $0 \le x \le 2$).
Thus the regression of Y on X is given by:
$$
\begin{aligned}
\mu_{Y|x} &= \int_{y=0}^{2} y\,f_{Y|X}(y|x)\,dy \\
 &= \int_{y=0}^{2} y\,\frac{x + y^2}{2x + \frac{8}{3}}\,dy \\
 &= \frac{1}{2x + \frac{8}{3}}\int_{y=0}^{2} (xy + y^3)\,dy \\
 &= \frac{1}{2x + \frac{8}{3}}\left[\frac{xy^2}{2} + \frac{y^4}{4}\right]_{y=0}^{2} \\
 &= \frac{2x + 4}{2x + \frac{8}{3}}
\end{aligned}
$$
It is left as an exercise to find the regression of X on Y .


Note that the two regressions are non-linear, and are not inverse functions of each other.
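A small numerical check of this regression curve (not in the original notes): for a few values of x, integrate $y\,f_{Y|X}(y|x)$ directly and compare with the closed form $(2x + 4)/(2x + \tfrac{8}{3})$.

```python
from scipy.integrate import quad

def cond_mean(x):
    """E[Y | X = x] for Example H, by numerical integration."""
    density = lambda y: (x + y ** 2) / (2 * x + 8 / 3)   # conditional pdf on 0 <= y <= 2
    return quad(lambda y: y * density(y), 0, 2)[0]

for x in (0.0, 1.0, 2.0):
    print(x, cond_mean(x), (2 * x + 4) / (2 * x + 8 / 3))   # the two columns agree
```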

Tutorial Exercises
1. For each of the tutorial exercises (1-6) of Chapter 2, calculate:
i) The marginal means and marginal variances.
ii) $\mu'_{11}$.
iii) The covariance.
iv) The correlation coefficient.
v) The regression of Y on X.
vi) The regression of X on Y.
2. Let E[X] = 3, Var[X] = 5, E[Y ] = 0 and Var[Y ] = 1. If X and Y are independent, find

(a) E[2 + X]
(b) E[(2 + X)2 ]
(c) Var[2 + X]
(d) E[2X + 1]
(e) Var[2X + 1]
(f) E[XY ]
(g) Var[XY ]
(h) E[X + 3Y ]
(i) Var[X + 3Y ]


Chapter 5

Distributions of Sample Statistics

Assumed statistical background
Introstat - Chapters 8, 9 and 10
Chapters 1-4 of this course

5.1 Random samples and statistics

In applying probability concepts to real world situations (e.g. planning or decision making), we
usually need to know the distributions of the underlying random variables, such as heights of
people in a population (for a clothing manufacturer, say), sizes of insurance claims (for the actuary), or numbers of calls through a telephone switchboard (for the telecommunications engineer).
Each random variable is defined in terms of a population, which may, in fact, be somewhat
hypothetical, and will almost always be very large or even infinite. We cannot then determine
the distribution of the random variable by total census or enumeration of the population, and we
resort to sampling. Typically, this involves three conceptual steps:
1. Conceptualize or visualize a convenient hypothetical population defined on possible values of the random variable itself (rather than the real observational units). For example, the population may be viewed as all non-negative integers, or all real numbers (rather than hours of the day or students). Sometimes a convenient population may include strictly impossible situations; for example, we may view the population of student heights as all real numbers from $-\infty$ to $+\infty$!
2. Use a combination of professional judgment and mathematical modeling to postulate a distributional form on this population, e.g.:
Persons' heights assumed $N(\mu, \sigma^2)$ on all real numbers;
Actuarial claims assumed to be Gamma($\alpha$, $\lambda$) distributed on positive real numbers;
Number of calls through a switchboard in any given period assumed Poisson($\lambda$).

Note that this will usually leave a small number of parameters unspecified at this point, to
be estimated from data.
3. Observe an often quite small number of actual instances (outcomes of random experiments,
or realizations of random variables), the sample, and use the assumed distributional forms
to generalize sample results to the entire assumed population, by estimating the unknown
parameters.

The critical issue here is the choice of the sample. In order to make the extension of sample results
to the whole population in any way justifiable, the sample needs to be representative of the
population. We need now to make this concept precise. Consider for example:
Are the heights of students in the front row of the lecture room representative of all UCT
students?
Would the number of calls through a company switchboard during 09h00-10h00 on Monday
morning be representative?
Would 10 successive insurance claims be representative of claims over the entire year?
The above examples suggest two possible sources of non-representativeness, viz. (i) observing certain parts of the population preferentially, and (ii) observing outcomes which are not independent
of each other. One way to ensure representativeness is to deliberately impose randomness on the
selected population, in a way which ensures some uniformity of coverage. If X is the random
variable in whose distribution we are interested (e.g. heights of male students) then we attempt
to design a scheme whereby each male student is equally likely to be chosen, independently of
all others. Quite simply, each observation is then precisely a realization of X. Practical issues of
ensuring this randomness and independence will be covered in later courses in this department,
but it is useful to think critically of any sampling scheme in these terms. In an accompanying
tutorial, there are a few examples to think about. Discuss them amongst yourselves.
For this course, we need to define the above concepts in a rigorous mathematical way. For this
purpose, we shall refer to the random variable X in which we are interested as the population random variable, with distribution function given by FX (x), etc. Any observation, or single realization
of X will usually be denoted by subscripts, e.g. X1 , X2 , . . . , Xi , . . .. The key concept is that of a
random sample defined as follows:
Definition: A random sample of size n of a population random variable X is a collection of n iid random variables $X_1, X_2, \ldots, X_n$, all having the same distribution as X.
It is usual to summarize the results of a random sample by a small number of functions of
X1 , X2 , . . . , Xn . You will be familiar with 5-number summaries, and with the sample mean
and variance (or even higher order sample moments). All of these summaries have the property
that they can be calculated from the sample, without any knowledge of the distribution of X. Any
summary which satisfies this property is called a statistic. Formally:
Definition: A statistic is any function of a random sample which does not depend on any unknown
parameters of the distribution of the population random variable.
Thus, a function such as $\left(\sum_{i=1}^{n} X_i^4\right)/\left(\sum_{i=1}^{n} X_i^2\right)^2$ would be a statistic, but $\sum_{i=1}^{n} (X_i - \mu_X)^2$ would generally not be (unless the population mean $\mu_X$ were known for sure a priori).
Let T (X1 , X2 , . . . , Xn ) be a statistic. It is important to realize that T (X1 , X2 , . . . , Xn ) is a random variable, one which takes on the numerical value T (x1 , x2 , . . . , xn ) whenever we observe the
joint event defined by X1 = x1 , X2 = x2 , . . . , Xn = xn . If we are to use this statistic to draw
inferences about the distribution of X, then it is important to understand how observed values
of T (X1 , X2 , . . . , Xn ) vary from one sample to the next: in other words, we need to know the
probability distribution of T (X1 , X2 , . . . , Xn ). This we are able to do, using the results of the previous chapters, and we shall be doing so for a number of well-known cases in the remainder of this
chapter. In doing so, we shall largely be restricting ourselves to the case in which the distribution
of the population random variable X is normal with mean $\mu$ and variance $\sigma^2$, one or both of which
are unknown. The central limit theorem will allow us to use the same results as an approximation
for non-normal samples in some cases.

5.2 Distributions of sample mean and variance for Gaussian distributed populations

Assume we draw a sample from a Gaussian population (e.g. $X_i \sim N(\mu_i, \sigma_i^2)$); recall that the moment generating function of $X_i$ is given by
$$M_{X_i}(t) = \exp\{\mu_i t + \tfrac{1}{2}\sigma_i^2 t^2\},$$
where, remember, the t is simply keeping a space:
$$M_{X_i}(\cdot) = \exp\{\mu_i\,\cdot\; + \tfrac{1}{2}\sigma_i^2\,\cdot^2\}$$
The sample mean is the statistic:
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i.$$
Assuming that the sample does in fact satisfy our definition of a random sample, the mgf of $\bar{X}$ is:
$$
\begin{aligned}
M_{\bar{X}}(t) &= E[e^{t\bar{X}}] \\
 &= E\left[\exp\left(t\,\frac{\sum X_i}{n}\right)\right] \\
 &= E[\exp(\tfrac{t}{n}X_1)]\,E[\exp(\tfrac{t}{n}X_2)]\cdots E[\exp(\tfrac{t}{n}X_n)] \qquad \text{(given independence)} \\
 &= M_{X_1}(\tfrac{t}{n})\,M_{X_2}(\tfrac{t}{n})\cdots M_{X_n}(\tfrac{t}{n}) \\
 &= \left[M_X(\tfrac{t}{n})\right]^n \qquad \text{(given identical distributions)} \\
 &= \left[\exp\{\mu\,\tfrac{t}{n} + \tfrac{1}{2}\sigma^2(\tfrac{t}{n})^2\}\right]^n \\
 &= \exp\{n\mu\,\tfrac{t}{n} + \tfrac{1}{2}n\sigma^2(\tfrac{t}{n})^2\} \\
 &= \exp\left\{\mu t + \frac{1}{2}\,\frac{\sigma^2}{n}\,t^2\right\}
\end{aligned}
$$
which is the mgf of the Gaussian distribution with mean $\mu$ and variance $\sigma^2/n$. The central limit theorem told us that this was true asymptotically for all well-behaved population distributions, but for Gaussian sampling this is an exact result for all n.
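A short simulation (not part of the original notes) illustrating the exactness claim for a small sample size: with n = 5 draws from N(10, 4) (arbitrary values), the sample mean already matches N(10, 4/5) exactly, not just approximately.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
mu, sigma, n, reps = 10.0, 2.0, 5, 200_000

xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

# Kolmogorov-Smirnov comparison against the exact N(mu, sigma^2/n) distribution
print(stats.kstest(xbar, stats.norm(loc=mu, scale=sigma / np.sqrt(n)).cdf))
```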
The distribution of the sample variance is a little more complicated, and we shall have to approach this stepwise. As a first step, let $U_i = (X_i - \mu)/\sigma$, and consider:
$$\sum_{i=1}^{n} U_i^2 = \frac{\sum_{i=1}^{n}(X_i - \mu)^2}{\sigma^2}.$$
By the corollary to Theorem 3.2, we know that this has the $\chi^2$ distribution with n degrees of freedom. In principle we know its pdf (from the gamma distribution), and we can compute

as many moments as we wish. Integration of the density to obtain cumulative probabilities is more difficult, but fortunately this is what is given in tables of the $\chi^2$ distribution. For any $\alpha$ $(0 < \alpha < 1)$, let us denote by $\chi^2_{n;\alpha}$ the value which is exceeded with probability $\alpha$. In other words, if V has the $\chi^2$ distribution with n degrees of freedom, then:
$$\Pr[V > \chi^2_{n;\alpha}] = \alpha.$$
Now $\sum_{i=1}^{n}(X_i - \mu)^2/\sigma^2$ is not a statistic, but in the special case in which $\mu$ is known, it is the ratio of a statistic to the remaining unknown parameter ($\sigma^2$), and can thus be used to draw inferences about $\sigma^2$. Let us briefly look further at this case, to see how knowledge of the distribution of $\sum_{i=1}^{n}(X_i - \mu)^2/\sigma^2$ allows us to draw inferences. Throughout this chapter, we will be illustrating how the results which we derive apply to concepts of point and interval estimation and of hypothesis tests, with which you should already be familiar. Further examination of the underlying theoretical principles and philosophies of these concepts will be dealt with in STA3030F.
Point estimation of $\sigma^2$: From the properties of the $\chi^2$ (gamma) distributions, we know that:
$$E\left[\frac{\sum_{i=1}^{n}(X_i - \mu)^2}{\sigma^2}\right] = n.$$
Re-arranging terms, noting that $\sigma$ is a constant (and not a random variable), even though it is unknown, we get:
$$E\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2\right] = \sigma^2.$$
Suppose now that, based on the observed values from a random sample $X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n$, we propose to use the following as an estimate of $\sigma^2$:
$$\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2.$$
From one sample, we can make no definite assertions about how good this estimate is. But if samples are repeated many times, and the same estimation procedure applied every time, then we know that we will average out at the correct answer in the long run. We say that the estimate is thus unbiased.
Hypothesis tests on $\sigma^2$: Now suppose that a claim is made that $\sigma^2 = \sigma_0^2$, where $\sigma_0^2$ is a given positive real number. Is this true? We might make this the null hypothesis, with an alternative given by $\sigma^2 > \sigma_0^2$. If the null hypothesis is true, then from the properties of the $\chi^2$ distribution, we know that:
$$\Pr\left[\frac{\sum_{i=1}^{n}(X_i - \mu)^2}{\sigma_0^2} \ge \chi^2_{n;\alpha}\right] = \alpha.$$
Suppose now that for a specific observed sample, we find that
$$\frac{\sum_{i=1}^{n}(x_i - \mu)^2}{\sigma_0^2}$$
is in fact larger than $\chi^2_{n;\alpha}$ for some suitably small value of $\alpha$. What do we conclude? We cannot be sure whether the null hypothesis is true or false. But we do know that either the null hypothesis is false, or we have observed a low probability event (one that occurs with probability less than $\alpha$). For sufficiently small $\alpha$, we would be led to reject the null hypothesis at a $100\alpha\%$ significance level.


Confidence interval for $\sigma^2$: Whatever the true value of $\sigma^2$, we know from the properties of the $\chi^2$ distribution that:
$$\Pr\left[\chi^2_{n;1-\alpha/2} \le \frac{\sum_{i=1}^{n}(X_i - \mu)^2}{\sigma^2} \le \chi^2_{n;\alpha/2}\right] = 1 - \alpha$$
or, after re-arranging terms, that:
$$\Pr\left[\frac{\sum_{i=1}^{n}(X_i - \mu)^2}{\chi^2_{n;\alpha/2}} \le \sigma^2 \le \frac{\sum_{i=1}^{n}(X_i - \mu)^2}{\chi^2_{n;1-\alpha/2}}\right] = 1 - \alpha.$$
An important point to note is that this is a statement of probabilities concerning the outcome of a random sample from a normal distribution with known mean $\mu$, and fixed (but unknown) variance $\sigma^2$. It is NOT a probability statement about values of $\sigma^2$.
In view of this result, we could state, on the basis of a specific observed random sample, that we have a substantial degree of confidence in the claim that $\sigma^2$ is included in the interval
$$\left[\frac{\sum_{i=1}^{n}(x_i - \mu)^2}{\chi^2_{n;\alpha/2}}\ ;\ \frac{\sum_{i=1}^{n}(x_i - \mu)^2}{\chi^2_{n;1-\alpha/2}}\right].$$
In any one specific case, the assertion is either true or false, but we don't know which. However, if we repeat the procedure many times, forming intervals in this way every time, then in the long run, on average, we will be correct $100(1 - \alpha)\%$ of the time, which is why the above interval is conventionally called a $100(1 - \alpha)\%$ confidence interval for $\sigma^2$.
Example: A factory manufactures ball bearings of a known mean diameter of 9mm, but individual ball bearings are assumed to be normally distributed with a mean of 9mm and a standard deviation of $\sigma$.
A sample of 10 ball bearings is taken, and their diameters in mm $(x_1, x_2, \ldots, x_{10})$ are carefully measured. It turns out that $\sum_{i=1}^{10}(x_i - 9)^2 = 0.00161$.
Our estimate of $\sigma^2$ is thus $0.00161/10 = 0.000161$, while that of the standard deviation is $\sqrt{0.000161} = 0.0127$. A 95% confidence interval can be calculated by looking up in tables that $\chi^2_{10;0.025} = 20.483$ and $\chi^2_{10;0.975} = 3.247$. The confidence interval for $\sigma^2$ is thus:
$$\left[\frac{0.00161}{20.483}\ ;\ \frac{0.00161}{3.247}\right] = \left[7.86\times 10^{-5}\ ;\ 49.58\times 10^{-5}\right].$$
The corresponding confidence interval for $\sigma$ is obtained by taking square roots, to give [0.0089 ; 0.0223].
Suppose, however, that the whole point of taking the sample was our skepticism about a claim made by the factory manager that $\sigma \le 0.009$. We test this against the alternative hypothesis that $\sigma > 0.009$. For this purpose, we calculate the ratio $0.00161/(0.009)^2$, which comes to 19.88. From $\chi^2$ tables we find that $\chi^2_{10;0.05} = 18.307$ while $\chi^2_{10;0.025} = 20.483$. We can thus say that we would reject the factory manager's claim at the 5% significance level, but not at the 2.5% level (or, alternatively, that the p-value for this test is between 0.025 and 0.05). There is evidently reason for our skepticism about the manager's claim!
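For reference (not part of the original notes), the same confidence interval and test can be reproduced with scipy; the numbers below are exactly those used in the example.

```python
from scipy.stats import chi2

n, ss = 10, 0.00161             # sum of squared deviations about the known mean of 9mm
sigma0 = 0.009
alpha = 0.05

# 95% confidence interval for sigma^2 (known-mean case, n degrees of freedom)
ci = (ss / chi2.ppf(1 - alpha / 2, df=n), ss / chi2.ppf(alpha / 2, df=n))
print(ci)                                        # approx (7.86e-05, 4.96e-04)

# Test of sigma <= 0.009 against sigma > 0.009
stat = ss / sigma0 ** 2                          # 19.88
print(stat, chi2.sf(stat, df=n))                 # p-value between 0.025 and 0.05
```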
The problem, however, is that in most practical situations, if we don't know the variance, we also do not know the mean. The obvious solution is to replace the population mean by the sample mean, i.e. to base the same sorts of inferential procedures as we had above on:
$$\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{\sigma^2} = \frac{(n-1)S^2}{\sigma^2}$$
where $S^2$ is the usual sample variance defined by:
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2.$$
Certainly, if we define $V_i = (X_i - \bar{X})/\sigma$, then the above is $\sum_{i=1}^{n} V_i^2$, and it is easily shown that the $V_i$'s are normally distributed with zero mean. The variance of $V_i$ can be shown to be $(n-1)/n$, which is slightly less than 1, but the real problem is that the $V_i$ are not independent, since $\sum_{i=1}^{n} V_i = 0$, and thus Theorem 3.2 and its corollary no longer apply. And yet it seems intuitively that not too much can have changed. We now proceed to examine this case further. Firstly, however, we need the following theorem:
Theorem 5.1 For random samples from the normal distribution, the statistics $\bar{X}$ and $S^2$ are independent random variables.
The proof is not part of this course.

Theorem 5.2 For random samples from the normal distribution, $(n-1)S^2/\sigma^2$ has the $\chi^2$ distribution with $n-1$ degrees of freedom.
The proof is not part of this course.

Comment: The only effect of replacing the population mean by the sample mean is to change the distribution from the $\chi^2$ with n degrees of freedom to one with $n-1$ degrees of freedom. The one linear function relating the n terms has lost us one degree of freedom.
Note that the expectation of $(n-1)S^2/\sigma^2$ is thus $n-1$, and that $E[S^2] = \sigma^2$, i.e. $S^2$ is now the unbiased estimator of $\sigma^2$.
Example: Suppose, in the context of the previous example, that we discover that it is $\bar{X}$ which is equal to 9mm, and not $\mu$, and that in fact $\sum_{i=1}^{n}(x_i - \bar{x})^2 = 0.00161$. The unbiased estimate of $\sigma^2$ is thus $0.00161/9 = 0.000179$.
For the significance test, we now need to compare the observed value (19.88) of the statistic with critical points of the $\chi^2$ distribution with 9 (not 10) degrees of freedom. We find now that $\chi^2_{9;0.025} = 19.023$, so that the hypothesis is rejected at the 2.5% significance level. Similarly, the 95% confidence interval needs to be based on $\chi^2_{9;0.025}$ and $\chi^2_{9;0.975} = 2.70$, which gives $[8.46\times 10^{-5}\ ;\ 59.6\times 10^{-5}]$ as the confidence interval for $\sigma^2$, or [0.0092 ; 0.024] for $\sigma$.

5.3 Application to $\chi^2$ goodness-of-fit tests

The previous section shows that if a sequence of random variables $Z_1, Z_2, \ldots, Z_n$ are approximately normally distributed with mean 0 and variance 1, and are nearly independent in the sense that the only relationship between them is of the form $\sum_{i=1}^{n} Z_i = 0$, then the $\chi^2$ distributional result remains correct, but we lose one degree of freedom. Strictly speaking we needed $\mathrm{Var}[Z_i] = 1 - 1/n$, but that is often a small effect. This suggests an intuitive rule, that the sum of squares of approximately standardized normal random variables will have a $\chi^2$ distribution, but with one degree of freedom lost for each relationship between them. This intuition serves us well in many circumstances.
Recall the $\chi^2$ goodness of fit test from first year. We could view this in the following way. Suppose we have a random sample $X_1, X_2, \ldots, X_n$, and a hypothesis that these come from a specified

distribution described by F(x). We partitioned the real line into a fixed number (say K) of contiguous intervals, and calculated the theoretical probability, say $p_k$, that X would fall in each interval $k = 1, 2, \ldots, K$, assuming that the hypothesis were true, as follows:
$$p_k = \int_{\{x\,\in\,\text{interval } k\}} f(x)\,dx$$
where f(x) is the pdf corresponding to F(x). Let the random variable $N_k$ be the number of observations in which the random variable is observed to fall into interval k. Evidently, $\sum_{k=1}^{K} N_k = n$. Any one $N_k$ taken on its own is binomially distributed with parameters n and $p_k$. For sufficiently large n, the distribution of $N_k$ can be approximated by the normal distribution with mean $np_k$ and variance $np_k(1 - p_k)$. (Conventional wisdom suggests that the approximation is reasonable if $np_k > 5$, or some would say $> 10$.) Thus:
$$\frac{N_k - np_k}{\sqrt{np_k(1 - p_k)}}$$
has approximately the standard normal distribution. It is perhaps a little neater to work in terms of:
$$Z_k = \frac{N_k - np_k}{\sqrt{np_k}}$$
which is also then approximately normal with zero mean and variance equal to $1 - p_k$. Note that the $Z_k$ are related by the constraint that:
$$\sum_{k=1}^{K} \sqrt{np_k}\,Z_k = \sum_{k=1}^{K} (N_k - np_k) = 0.$$
Thus
$$\sum_{k=1}^{K} Z_k^2 = \sum_{k=1}^{K} \frac{(N_k - np_k)^2}{np_k}$$
is a sum of squares of K terms which are approximately normally distributed with mean 0. If the choice of categories is reasonably balanced, then the $p_k$ will be approximately equal, i.e. $p_k \approx 1/K$, in which case each term has variance of approximately $1 - 1/K$. This is then fully analogous to the situation in Theorem 5.2 (with K playing the role of n there), and we would expect the same result to occur, viz. that the above sum has the $\chi^2$ distribution with $K - 1$ degrees of freedom. It does, in fact, turn out that this is a good approximation, which is thus the basis for the $\chi^2$ test.
If the distribution function F(x) is not fully specified at the start, but involves parameters to be estimated from the data, this imposes further relationships between the $Z_k$, leading to further losses of degrees of freedom.
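As an illustration (not part of the original notes), the sketch below carries out such a goodness-of-fit test for a fully specified hypothesized distribution, using K equiprobable intervals and comparing the statistic with the $\chi^2$ distribution on $K - 1$ degrees of freedom; the data here are simulated, so the details are purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, K = 500, 10

x = rng.normal(size=n)                       # sample; hypothesis: standard normal

# K equiprobable intervals under the hypothesized distribution, so each p_k = 1/K
cuts = stats.norm.ppf(np.arange(1, K) / K)   # K - 1 interior cut points
N_k = np.bincount(np.searchsorted(cuts, x), minlength=K)   # observed counts per interval
expected = n / K                             # n * p_k for every interval

chi_sq = np.sum((N_k - expected) ** 2 / expected)
print(chi_sq, stats.chi2.sf(chi_sq, df=K - 1))   # statistic and its p-value
```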

5.4 Student's t distribution

For a random sample of size n from the $N(\mu, \sigma^2)$ distribution, we know that:
$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$
has the standard normal distribution, and this fact can be used to draw inferences about $\mu$ if $\sigma$ is known. For example, a test of the hypothesis that $\mu = \mu_0$ can be based on the fact (see normal tables) that:
$$\Pr\left[\frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} > 1.645\right] = 0.05$$

if the hypothesis is true. Thus, if the observed value of this expression exceeds 1.645, then we could reject the hypothesis (in favour of $\mu > \mu_0$) at the 5% significance level. Similarly, a confidence interval for $\mu$ for known $\sigma$ can be based on the fact that:
$$\Pr\left[-1.96 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < +1.96\right] = 0.95$$
which after re-arrangement of the terms gives:
$$\Pr\left[\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right] = 0.95$$
We must re-emphasize that the probability refers to random (sampling) variation in $\bar{X}$, and not to $\mu$, which is viewed as a constant in this formulation.
In practice, however, the population variance is seldom known for sure. The obvious thing to do is to replace the population variance by the sample variance, i.e. to base inferences on:
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$
This is now a function of two statistics, viz. $\bar{X}$ and S. Large values of the ratio can be due to large deviations of $\bar{X}$ from the population mean or to values of S below the population standard deviation. Fortunately, we do know that $\bar{X}$ and S are independent, and we know their distributions, and thus we should be able to derive the distribution of T. It is useful to approach this in the following manner. Let us first define:
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$
(which has the standard normal distribution) and:
$$U = \frac{(n-1)S^2}{\sigma^2}$$
which has the $\chi^2$ distribution with $n-1$ degrees of freedom. Z and U are independent, and:
$$T = \frac{Z}{\sqrt{U/(n-1)}}.$$

In the following theorem, we derive the pdf of T in a slightly more general context which will prove useful later.

Theorem 5.3 Suppose that Z and U are independent random variables having the standard normal distribution and the $\chi^2$ distribution with m degrees of freedom respectively. Then the pdf of
$$T = \frac{Z}{\sqrt{U/m}}$$
is given by:
$$f_T(t) = \frac{\Gamma((m+1)/2)}{\sqrt{m\pi}\,\Gamma(m/2)}\left(1 + \frac{t^2}{m}\right)^{-(m+1)/2}$$

Comments: The pdf for T defines the t-distribution, or more correctly Student's t distribution, with m degrees of freedom. It is not hard to see from the functional form that the pdf has a bell-shaped distribution around t = 0, superficially rather like that of the normal distribution. The t-distribution has higher kurtosis than the normal distribution, although it tends to the normal distribution as $m \to \infty$.

For m > 2, the variance of T is $m/(m-2)$. The variance does not exist for $m \le 2$, and in fact for m = 1 even the integral defining E[T] is not defined (although the median is still at T = 0). The t-distribution with 1 degree of freedom is also termed the Cauchy distribution.
As you should know, tables of the t-distribution are widely available. Values in these tables can be expressed as numbers $t_{m;\alpha}$, such that if T has the t-distribution with m degrees of freedom, then:
$$\Pr[T > t_{m;\alpha}] = \alpha.$$
Hypothesis tests and confidence intervals can thus be based upon observed values of the ratio:
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$
and critical values of the t-distribution with $n-1$ degrees of freedom. This should be very familiar to you.

5.5 Applications of the t distribution to two-sample tests

A greater understanding of Theorem 5.3 can be developed by looking at the various types of twosample tests, and the manner in which different t-tests occur in each of these. These were all
covered in first year, but we need to examine the origins of these tests.
Suppose that $X_{A1}, X_{A2}, \ldots, X_{Am}$ is a random sample of size m from the $N(\mu_A, \sigma_A^2)$ distribution, and that $X_{B1}, X_{B2}, \ldots, X_{Bn}$ is a random sample of size n from the $N(\mu_B, \sigma_B^2)$ distribution. We suppose further that the two samples are independent. Typically, we are interested in drawing inferences about the difference in population means:
$$\mu_{A-B} = \mu_A - \mu_B.$$
Let $\bar{X}_A$ and $\bar{X}_B$ be the corresponding sample means. We know that $\bar{X}_A$ is normally distributed with mean $\mu_A$ and variance $\sigma_A^2/m$, while $\bar{X}_B$ is normally distributed with mean $\mu_B$ and variance $\sigma_B^2/n$. Since $\bar{X}_A$ and $\bar{X}_B$ are independent (since the samples are independent), we know further that $\bar{X}_A - \bar{X}_B$ is normally distributed with mean $\mu_{A-B}$ and variance:
$$\frac{\sigma_A^2}{m} + \frac{\sigma_B^2}{n}$$
and thus the term:
$$Z_{AB} = \frac{\bar{X}_A - \bar{X}_B - \mu_{A-B}}{\sqrt{\sigma_A^2/m + \sigma_B^2/n}}$$
has the standard normal distribution.

If the variances are known, then we can immediately move to inferences about $\mu_{A-B}$. If they are not known, we will wish to use the sample variances $S_A^2$ and $S_B^2$. The trick inspired by Theorem 5.3 is to look for a ratio of a standard normal variate to the square root of a $\chi^2$ variate, and hope that the unknown population variances will cancel. Certainly, $(m-1)S_A^2/\sigma_A^2$ has the $\chi^2$ distribution with $m-1$ degrees of freedom, and $(n-1)S_B^2/\sigma_B^2$ has the $\chi^2$ distribution with $n-1$ degrees of freedom. Furthermore, we know that their sum:
$$U_{AB} = \frac{(m-1)S_A^2}{\sigma_A^2} + \frac{(n-1)S_B^2}{\sigma_B^2}$$
has the $\chi^2$ distribution with $m+n-2$ degrees of freedom. (Why?) This does not, however, seem to lead to any useful simplification in general.

But see what happens if $\sigma_A^2 = \sigma_B^2 = \sigma^2$, say. In this case $Z_{AB}$ and $U_{AB}$ become:
$$Z_{AB} = \frac{\bar{X}_A - \bar{X}_B - \mu_{A-B}}{\sigma\sqrt{\frac{1}{m} + \frac{1}{n}}}$$
and
$$U_{AB} = \frac{(m-1)S_A^2 + (n-1)S_B^2}{\sigma^2} = \frac{(m+n-2)S_{\text{pool}}^2}{\sigma^2}$$
where the pooled variance estimator is defined by:
$$S_{\text{pool}}^2 = \frac{(m-1)S_A^2 + (n-1)S_B^2}{m+n-2}.$$
Now if we take $T = Z_{AB}/\sqrt{U_{AB}/(m+n-2)}$, then we get:
$$T = \frac{\bar{X}_A - \bar{X}_B - \mu_{A-B}}{S_{\text{pool}}\sqrt{\frac{1}{m} + \frac{1}{n}}}$$
which by Theorem 5.3 has the t-distribution with $m+n-2$ degrees of freedom. We can thus carry out hypothesis tests, or construct confidence intervals for $\mu_{A-B}$.
Example: Suppose we have observed the results of two random samples as follows:

    m = 8:  $\bar{x}_A = 61$,  $\sum_{i=1}^{8}(x_{Ai} - \bar{x}_A)^2 = 1550$
    n = 6:  $\bar{x}_B = 49$,  $\sum_{i=1}^{6}(x_{Bi} - \bar{x}_B)^2 = 690$

We are required to test the null hypothesis that $\mu_A - \mu_B \le 5$, against the one-sided alternative that $\mu_A - \mu_B > 5$ at the 5% significance level, under the assumption that the variances of the two populations are the same. The pooled variance estimate is:
$$s_{\text{pool}}^2 = \frac{1550 + 690}{8 + 6 - 2} = 186.67$$
and thus under the null hypothesis, the t-statistic works out to be:
$$t = \frac{61 - 49 - 5}{\sqrt{186.67\left(\frac{1}{8} + \frac{1}{6}\right)}} = 0.949$$
The 5% critical value for the t-distribution with 8 + 6 - 2 = 12 degrees of freedom is 1.782, and we cannot thus reject the null hypothesis.
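As a numerical check (not part of the original notes), the pooled two-sample calculation above can be reproduced in Python directly from the summary statistics:

```python
import numpy as np
from scipy import stats

# Summary statistics from the worked example above.
m, n = 8, 6
xbar_A, xbar_B = 61.0, 49.0
ss_A, ss_B = 1550.0, 690.0          # sums of squared deviations
delta0 = 5.0                        # hypothesised difference mu_A - mu_B

s2_pool = (ss_A + ss_B) / (m + n - 2)                        # pooled variance estimate
t_obs = (xbar_A - xbar_B - delta0) / np.sqrt(s2_pool * (1/m + 1/n))
t_crit = stats.t.ppf(0.95, df=m + n - 2)                     # upper 5% critical value

print(round(s2_pool, 2), round(t_obs, 3), round(t_crit, 3))  # 186.67, 0.949, 1.782
```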
In general, when variances are not equal, there appears to be no way in which a ratio of a normal to the square root of a $\chi^2$ can be constructed in such a way that both unknown population variances cancel out. This is called the Behrens-Fisher problem. Nevertheless, we would expect that a ratio of the form
$$\frac{\bar{X}_A - \bar{X}_B - \mu_{A-B}}{\sqrt{S_A^2/m + S_B^2/n}}$$
should have something like a t-distribution. Empirical studies (e.g. computer simulation) have shown that this is indeed true, but the relevant degrees of freedom giving the best approximation to the true distribution of the ratio depend on the problem structure in a rather complicated manner (and usually turn out to be a fractional number, making it hard to interpret). A number of approximations have been suggested on the basis of numerical studies, one of which is incorporated into the STATISTICA package.
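One widely used approximation of this kind is the Welch-Satterthwaite formula for the approximate (usually fractional) degrees of freedom. The notes do not name the particular approximation used by STATISTICA, so treat the following Python sketch, with hypothetical sample variances and sizes, purely as an illustration of the idea:

```python
import numpy as np
from scipy import stats

def welch_df(s2_a, m, s2_b, n):
    """Welch-Satterthwaite approximate degrees of freedom (usually fractional)."""
    num = (s2_a / m + s2_b / n) ** 2
    den = (s2_a / m) ** 2 / (m - 1) + (s2_b / n) ** 2 / (n - 1)
    return num / den

# Hypothetical sample variances and sizes, just to show the mechanics.
s2_a, m = 12.5, 8
s2_b, n = 40.2, 6
print(welch_df(s2_a, m, s2_b, n))

# scipy implements the same idea for raw data via Welch's t-test, e.g.
# stats.ttest_ind(sample_a, sample_b, equal_var=False)
```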

There is one further special case, however, which is interesting because it allows, ironically, the relaxation of other assumptions. This is the case in which m = n. We can now pair the observations (at this stage in any way we like), and form the differences $Y_i = X_{Ai} - X_{Bi}$, say. The $Y_i$ will be iid normally distributed with mean $\mu_{A-B}$ and with unknown variance $\sigma_Y^2 = \sigma_A^2 + \sigma_B^2$. The problem thus reduces to the problem of drawing inferences about the population mean for a single sample, when the variance is unknown. Note that we only need to estimate $\sigma_Y^2$, and not the individual variances $\sigma_A^2$ and $\sigma_B^2$.
In order to apply this idea, we need only to have that the $Y_i$ are iid. It is perfectly permissible to allow $X_{Ai}$ and $X_{Bi}$ to share some dependencies for the same i. They might be correlated, or both of their means may be shifted from the relevant population means by the same amount. This allows us to apply the differencing technique to paired samples, i.e. when each pair $X_{Ai}, X_{Bi}$ relates to observations on the same subject under two different conditions. For example, each i may relate to a specific hospital patient chosen at random, while $X_{Ai}$ and $X_{Bi}$ refer to responses to two different drugs tried at different times. All we need to verify is that the differences are iid normal, after which we use one-sample tests.
Example: A random sample of ten students is taken, and their results in economics and statistics are recorded in each case as follows.

    Student   Economics   Statistics   Difference (Y)
       1          73          66              7
       2          69          71             -2
       3          64          49             15
       4          87          81              6
       5          58          61             -3
       6          84          74             10
       7          96          89              7
       8          58          60             -2
       9          90          85              5
      10          82          76              6

If care was taken in the selection of the random sample of students, then the statistics results and the economics results taken separately would represent random samples. But the two marks for the same student are unlikely to be independent, as a good student in one subject is usually likely to perform well in another. But the last column above represents a single random sample of the random variable defined by the amount by which the economics mark exceeds the statistics mark, and these seem to be plausibly independent. For example, if there is no true mean difference (across the entire population) between the two sets of marks, then there is no reason to suggest that knowing that one student scored 5% more on economics than on statistics has anything to do with the difference experienced by another student, whatever their absolute marks.
The test of the hypothesis of no difference between the two courses is equivalent to the null hypothesis that E[Y] = 0. The sample mean and sample variance of the differences above are 4.9 and 32.99 respectively. The standard (one sample) t-statistic is thus $4.9/\sqrt{32.99/10} = 2.698$. Relevant critical values of the t-distribution with 9 degrees of freedom are $t_{9;0.025} = 2.262$ and $t_{9;0.01} = 2.821$. Presumably a two-sided test is relevant (as we have not been given any reason why differences in one direction should be favoured over the other), and thus the p-value lies between 5% ($2 \times 0.025$) and 2% ($2 \times 0.01$). Alternatively, we can reject the hypothesis at the 5%, but not at the 2% significance level.
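The paired calculation is again easy to verify numerically; the sketch below (illustrative Python, not part of the notes) uses the differences from the table above.

```python
import numpy as np
from scipy import stats

econ = np.array([73, 69, 64, 87, 58, 84, 96, 58, 90, 82])
stat = np.array([66, 71, 49, 81, 61, 74, 89, 60, 85, 76])
y = econ - stat                                   # paired differences

n = len(y)
t_obs = y.mean() / np.sqrt(y.var(ddof=1) / n)     # one-sample t on the differences
print(round(y.mean(), 2), round(y.var(ddof=1), 2), round(t_obs, 3))  # 4.9, 32.99, 2.698

# Equivalent library calls:
print(stats.ttest_1samp(y, 0.0))      # two-sided p-value lies between 0.02 and 0.05
print(stats.ttest_rel(econ, stat))    # paired two-sample form, same statistic
```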


5.6    The F distribution

In the previous section, we have looked at the concept of a two-sample problem. Our concern there was with comparing the means of two populations. Now, however, let us look at comparing their variances. There are at least two reasons why we may wish to do this:
1. We have a real interest in knowing whether one population is more or less variable than another. For example, we may wish to compare the variability in two production processes, or in two laboratory measurement procedures.
2. We may merely wish to know whether we can use a pooled variance estimate for the t-test for comparing the means.
In the case of variances, it is convenient to work in terms of ratios, i.e. $\sigma_A^2/\sigma_B^2$. Equality of variances means that this ratio is 1. We have available to us the sample variances $S_A^2$ and $S_B^2$, and we might presumably wish to base inferences upon $S_A^2/S_B^2$. The important question is: what is the probability distribution of $S_A^2/S_B^2$ for any given population ratio $\sigma_A^2/\sigma_B^2$?
We know that $U = (m-1)S_A^2/\sigma_A^2$ and $V = (n-1)S_B^2/\sigma_B^2$ have $\chi^2$ distributions with $m-1$ and $n-1$ degrees of freedom respectively. Thus let us consider the function:
$$\frac{U/(m-1)}{V/(n-1)} = \frac{S_A^2/S_B^2}{\sigma_A^2/\sigma_B^2}.$$
Since we know the distributions of U and V, we can derive the distribution of the above ratio, which will give us a measure of the manner in which the sample variance ratio departs from the population variance ratio. The derivation of this distribution is quite simple in principle, although it becomes algebraically messy. We shall not give the derivation here, but recommend it as an excellent exercise for the student. We simply state the result in the following theorem.

Theorem 5.4 Suppose that U and V are independent random variables, having $\chi^2$ distributions with r and s degrees of freedom respectively. Then the probability density function of
$$Z = \frac{U/r}{V/s}$$
is given by:
$$f_Z(z) = \frac{\Gamma((r+s)/2)}{\Gamma(r/2)\Gamma(s/2)}\left(\frac{r}{s}\right)^{r/2} z^{r/2 - 1}\left[\frac{rz}{s} + 1\right]^{-(r+s)/2}$$
for z > 0.

The distribution defined by the above pdf is called the F-distribution with r and s degrees of freedom (often called the numerator and denominator degrees of freedom respectively). One word of caution: when using tables of the F-distribution, be careful to read the headings, to see whether the numerator degrees of freedom are shown as the column or the row. Tables are not consistent in this sense.
As with the other distributions we have looked at, we shall use the symbol $F_{r,s;\alpha}$ to represent the upper $100\alpha\%$ critical value for the F-distribution with r and s degrees of freedom. In other words, with Z defined as above:
$$\Pr[Z > F_{r,s;\alpha}] = \alpha.$$
Tables are generally given separately for a number of values of $\alpha$, each of which gives $F_{r,s;\alpha}$ for various combinations of r and s.


Example: We have two alternative laboratory procedures for carrying out the same analysis. Let us call these A and B. Seven analyses have been conducted using procedure A (giving measurements $X_{A1}, \ldots, X_{A7}$), and six using procedure B (giving measurements $X_{B1}, \ldots, X_{B6}$). We wish to test the null hypothesis that $\sigma_A^2 = \sigma_B^2$, against the alternative that $\sigma_A^2 > \sigma_B^2$, at the 5% significance level. Under the null hypothesis, $S_A^2/S_B^2$ has the F-distribution with 6 and 5 degrees of freedom. Since $F_{6,5;0.05} = 4.95$, it follows that:
$$\Pr\left[\frac{S_A^2}{S_B^2} > 4.95\right] = 0.05$$
Suppose now that we observe $S_A^2 = 5.14$ and $S_B^2 = 1.08$. This may look convincing, but the ratio is only 4.76, which is less than the critical value. We can't at this stage reject the null hypothesis (although I would not be inclined to accept it either!).
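A quick numerical check of this example (an illustration, not part of the notes):

```python
from scipy import stats

s2_A, s2_B = 5.14, 1.08
m, n = 7, 6

f_obs = s2_A / s2_B                                  # observed variance ratio
f_crit = stats.f.ppf(0.95, dfn=m - 1, dfd=n - 1)     # F_{6,5;0.05}
p_value = stats.f.sf(f_obs, dfn=m - 1, dfd=n - 1)    # one-sided p-value under H0

print(round(f_obs, 2), round(f_crit, 2), round(p_value, 3))  # 4.76, 4.95, just above 0.05
```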

You may have noticed that tables of the F-distribution are only provided for smaller values of $\alpha$, e.g. 10%, 5%, 2.5% and 1%, all of which correspond to variance ratios greater than 1. For one-sided hypothesis tests, it is always possible to define the problem in such a way that the alternative involves a ratio greater than 1. But for two-sided tests, or for confidence intervals, one does need both ends of the distribution. There is no problem! Since
$$Z = \frac{U/r}{V/s}$$
has (as we have seen above) the F-distribution with r and s degrees of freedom, it is evident that:
$$Y = \frac{1}{Z} = \frac{V/s}{U/r}$$
has the F-distribution with s and r degrees of freedom. Now, by definition:
$$1 - \alpha = \Pr[Z > F_{r,s;1-\alpha}] = \Pr\left[Y < \frac{1}{F_{r,s;1-\alpha}}\right]$$
and thus:
$$\Pr\left[Y \ge \frac{1}{F_{r,s;1-\alpha}}\right] = \alpha.$$
Since Y is continuous, and has the F-distribution with s and r degrees of freedom, it follows therefore that:
$$F_{s,r;\alpha} = \frac{1}{F_{r,s;1-\alpha}}$$
i.e. for smaller values of $\alpha$, and thus larger values of $1-\alpha$, we have:
$$F_{r,s;1-\alpha} = \frac{1}{F_{s,r;\alpha}}.$$

Example (Confidence Intervals): Suppose that we wish to find a 95% confidence interval for the ratio $\sigma_A^2/\sigma_B^2$ in the previous example. We now know that:
$$\Pr\left[F_{6,5;0.975} < \frac{S_A^2/\sigma_A^2}{S_B^2/\sigma_B^2} < F_{6,5;0.025}\right] = 0.95$$
which after some algebraic re-arrangement gives:
$$\Pr\left[\frac{1}{F_{6,5;0.025}}\,\frac{S_A^2}{S_B^2} < \frac{\sigma_A^2}{\sigma_B^2} < \frac{1}{F_{6,5;0.975}}\,\frac{S_A^2}{S_B^2}\right] = 0.95.$$
The tables give us $F_{6,5;0.025} = 6.98$ directly, and thus $1/F_{6,5;0.025} = 0.143$. Using the above relationship, we also know that $1/F_{6,5;0.975} = F_{5,6;0.025} = 5.99$ (from tables). Since the observed value of $S_A^2/S_B^2$ is 5.14/1.08 = 4.76, it follows that the required 95% confidence interval for the variance ratio is [0.682 ; 28.51].
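The same interval can be reproduced with scipy's F quantiles (illustrative sketch, not part of the notes):

```python
from scipy import stats

s2_A, s2_B = 5.14, 1.08
r, s = 6, 5                      # numerator and denominator degrees of freedom

ratio = s2_A / s2_B
lower = ratio / stats.f.ppf(0.975, dfn=r, dfd=s)   # divide by F_{6,5;0.025}
upper = ratio * stats.f.ppf(0.975, dfn=s, dfd=r)   # multiply by F_{5,6;0.025} = 1/F_{6,5;0.975}

print(round(lower, 3), round(upper, 2))            # approximately 0.68 and 28.5
```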

Tutorial exercises

1. Suppose we are interested in students' smoking habits (e.g. the distribution of weekly smoking (number of cigarettes during one week, say)). The correct procedure would be to draw a sample of students at random (from student numbers, say) and to interview each in order to establish their smoking patterns. This may be a difficult or expensive procedure. Comment on the extent to which the following alternative procedures also satisfy our definition of a random sample:
(a) Interview every 100th student entering Cafe Nescafe during one week.
(b) Interview all residents of Smuts Hall, or of Smuts and Fuller Halls.
(c) E-mail questionnaires to all registered students and analyze responses received.
(d) Include all relevant questions on next year's STA204F sample questionnaire.
(e) Interview all students at the Heidelberg next Friday night.

2. Laboratory measurements on the strength of some material are supposed to be distributed normally around a true mean material strength $\mu$, with variance $\sigma^2$. Let $X_1, X_2, \ldots$ denote individual measurements. Based on a random sample of size 16, the following statistic was computed: $\sum_{i=1}^{16}(x_i - \bar{x})^2 = 133$. Can you reject the hypothesis $H_0: \sigma^2 = 4.80$?

3. In the problem of Question 2, suppose that for a sample of size 21, a value of 5.1 for the statistic $s^2$ was observed. Construct a 95% confidence interval for $\sigma^2$.

4. Let $X \sim N(3, 9)$, $W \sim N(0, 4)$, $Y \sim N(-3, 25)$ and $Z \sim \chi^2_{23}$ (chi-squared with 23 degrees of freedom) be independent random variables. Write down the distributions of the following random variables:
(a) $T = \frac{X - 3}{3}$
(b) $D = T^2$
(c) $A = \frac{W}{2}$
(d) $B = \frac{(Y + 3)^2}{25}$
(e) $K = \frac{(X + Y)^2}{34}$
(f) $E = K + Z$
(g) $G = \frac{6W^2}{E}$
(h) $S = X + W + Y$

5. W, X, Y, Z are independent random variables, where W, X, and Y have the following normal distributions:
$$W \sim N(0, 1) \qquad X \sim N(0, \tfrac{1}{9}) \qquad Y \sim N(0, \tfrac{1}{16})$$
and $Z \sim \chi^2_{40}$ (chi-squared distribution with 40 degrees of freedom). We assert that
$$T = \frac{cY}{\sqrt{bX^2 + W^2 + Z}}$$
has a t distribution with m degrees of freedom. For what values of c, b and m is the assertion true, and what results justify the assertion?

6. Eight operators in a manufacturing plant were sent on a special training course. The times, in minutes, that each took for a particular activity were measured before and after the course; these were as follows:

    Operator               1   2   3   4   5   6   7   8
    Time (before course)  23  18  16  15  19  21  31  22
    Time (after course)   17  14  13  13  12  20  14  17

Would you conclude that the course has speeded up their times? Is there evidence for the
claim that the course, on average, leads to a reduction of at least one minute per activity?
7. One of the occupational hazards of being an airplane pilot is the hearing loss that results from being exposed to high noise levels. To document the magnitude of the problem, a team of researchers measured the cockpit noise levels in 18 commercial aircraft. The results (in decibels) are as follows:

    Plane:             1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
    Noise level (dB): 74  77  80  82  82  85  80  75  75  72  90  87  73  83  86  83  83  80

(a) Find a 95% confidence interval for $\mu$ by firstly assuming that $\sigma^2 = 27$, and secondly by assuming that $\sigma^2$ is unknown and that you have to estimate it from the sample. (Assume that you are sampling from a normal population.)
(b) Find a 95% confidence interval for $\sigma^2$ by firstly assuming that $\mu = 80.5$, and secondly by assuming that $\mu$ is unknown and that you have to estimate it from the sample. (Assume that you are sampling from a normal population.)
8. Two procedures for refining oil have been tested in a laboratory. Independent tests with each procedure yielded the following recoveries (in ml per l oil):
Procedure A: 800.9; 799.1; 824.7; 814.1; 805.9; 798.7; 808.0; 811.8; 796.6; 820.5
Procedure B: 812.6; 818.3; 523.0; 911.2; 823.9; 841.1; 834.7; 824.5; 841.8; 819.4; 809.9; 837.5; 826.3; 817.5
We assume that recovery per test is distributed $N(\mu_A, \sigma_A^2)$ for procedure A, and $N(\mu_B, \sigma_B^2)$ for procedure B.
(a) If we assume $\sigma_A^2 = \sigma_B^2$, test whether Procedure B (the more expensive procedure) has higher recovery than A. Construct a 95% confidence interval for the difference $\mu_B - \mu_A$.
(b) What if we cannot assume $\sigma_A^2 = \sigma_B^2$?


Chapter 6

Regression analysis

Assumed statistical background
Introstat - Chapter 12
Bivariate distributions and their moments - Chapters 2 and 4 of these notes
Maths Toolbox
Maths Toolbox B.5

Assumed software background
You can use any package with which you are comfortable, e.g. Excel, SPSS or Statistica. You need to be able to (a) complete a regression analysis, (b) supply residual diagnostics and (c) interpret the output of the package.

6.1    Introduction

In Chapter 2 we considered situations in which we simultaneously observed two (or more) variables on each member/object of a sample (or observed two or more outcomes in each independent trial of a random experiment).

Example RA: We draw a sample of size 10 from the STA2030S class and measure the height and weight of each member in the sample:

              1     2     3     4     5     6     7     8     9    10
    height   1.65  1.62  1.67  1.74  1.65  1.69  1.81  1.90  1.79  1.58
    weight    65    63    64    67    63    68    73    85    65    54

We can produce a scatter diagram (scatterplot, bivariate scatter) of the two variables (Figure 6.1). A random scatter of dots (points) would indicate that no relationship exists between the two variables; however, in Figure 6.1 it is clear that we have a distinct correlation between weight and height (as height increases, the corresponding average weight increases).
In first year statistics (Introstat, Chapter 12) you considered two questions:


[Figure 6.1: Example RA - a bivariate scatter plot of weight versus height for a sample of size 10 from the STA2030S class]
1. Is there a relationship between the variables (the correlation problem)?
2. How do we predict values for one variable, given particular values for the other variable(s)? (the regression problem)

If the answer to the first question is yes, you continued with regression.
In this chapter we are going to study (linear) regression: how one or more variables (explanatory variables, predictor variables, regressors or X variables) affect another variable (dependent variable, outcome variable, response variable or Y). In particular we are going to apply linear regression (fitting a straight line through the data). Note that in some instances we can apply a transformation to either the Y's or the X's (to force linearity) before we fit a straight line. Linearity refers to the fact that the left hand side values (Y's) are a linear function of the parameters ($\beta_0$ and $\beta_1$) and a function of the X values.

6.2    Simple (linear) regression - model, assumptions

The term simple (linear) regression refers to the case when we only have one explanatory variable X. The simple linear regression model (6.1) is:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \quad \text{for } i = 1, \ldots, n \tag{6.1}$$
where:
$Y_i$ is the value of the response (observed, dependent) variable in the ith trial,
$\beta_0$ and $\beta_1$ are parameters,
$X_i$ is a fixed (known) value of the explanatory variable in the ith trial,
$\varepsilon_i$ is a random error term with the following properties:
    $\varepsilon_i \sim (0, \sigma^2)$ (that is, $E[\varepsilon_i] = 0$ and $\mathrm{Var}[\varepsilon_i] = \sigma^2$),
    $\varepsilon_i$ and $\varepsilon_j$ are uncorrelated, that is $\mathrm{Cov}[\varepsilon_i, \varepsilon_j] = 0$ for all $i \ne j$.

Model (6.1) is simple in the sense that it only includes one explanatory variable, and it is linear in the parameters. $Y_i$ is a random variable, and the expectation and variance of $Y_i$ are given by
$$E[Y_i] = E[\beta_0 + \beta_1 X_i + \varepsilon_i] = \beta_0 + \beta_1 X_i + E[\varepsilon_i] = \beta_0 + \beta_1 X_i$$
$$\mathrm{Var}[Y_i] = \mathrm{Var}[\beta_0 + \beta_1 X_i + \varepsilon_i] = \mathrm{Var}[\varepsilon_i] = \sigma^2$$
Furthermore, any $Y_i$ and $Y_j$ are uncorrelated.
If we also assume that the error terms are Gaussian ($\varepsilon_i \sim N[0, \sigma^2]$), then
$$Y_i \sim N[\beta_0 + \beta_1 X_i, \sigma^2]$$
(e.g. compare to the properties of the conditional distribution discussed in Chapter 2).
The parameters $\beta_0$ and $\beta_1$ in the simple regression model are also known as the regression parameters. $\beta_0$ is called the intercept of the regression line (the value Y takes when X is 0). $\beta_1$ is the slope of the regression line (the change in the conditional mean of Y per unit increase in X). The parameters are usually unknown, and in this course we are going to estimate them by using the method of least squares, sometimes called ordinary least squares (OLS). The method of least squares minimizes the sum of the squared residuals (observed - fitted), $Q = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$. We want to find estimates for $\beta_0$ and $\beta_1$ that minimize Q:
$$Q = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 X_i)^2$$
The values of $\beta_0$ and $\beta_1$ which minimize Q can be derived by differentiating Q with respect to $\beta_0$ and $\beta_1$:
$$\frac{\partial Q}{\partial \beta_0} = -2\sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 X_i)$$
$$\frac{\partial Q}{\partial \beta_1} = -2\sum_{i=1}^{n}X_i(Y_i - \beta_0 - \beta_1 X_i)$$

Denote the values of $\beta_0$ and $\beta_1$ that minimize Q by $\hat{\beta}_0$ and $\hat{\beta}_1$. If we set these partial derivatives equal to zero we obtain the following pair of simultaneous equations:
$$\sum_{i=1}^{n}(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0$$
and
$$\sum_{i=1}^{n}X_i(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0$$
or, by bringing in the summation, we write
$$\sum_{i=1}^{n}Y_i - n\hat{\beta}_0 - \hat{\beta}_1\sum_{i=1}^{n}X_i = 0$$
and
$$\sum_{i=1}^{n}X_iY_i - \hat{\beta}_0\sum_{i=1}^{n}X_i - \hat{\beta}_1\sum_{i=1}^{n}X_i^2 = 0$$
By rearranging the terms we obtain the so-called normal equations:
$$\sum_{i=1}^{n}Y_i = n\hat{\beta}_0 + \hat{\beta}_1\sum_{i=1}^{n}X_i$$
$$\sum_{i=1}^{n}X_iY_i = \hat{\beta}_0\sum_{i=1}^{n}X_i + \hat{\beta}_1\sum_{i=1}^{n}X_i^2 \tag{6.2}$$
By solving the equations simultaneously we obtain the following OLS estimates for $\beta_0$ and $\beta_1$:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}X_iY_i - \frac{(\sum_{i=1}^{n}X_i)(\sum_{i=1}^{n}Y_i)}{n}}{\sum_{i=1}^{n}X_i^2 - \frac{(\sum_{i=1}^{n}X_i)^2}{n}}$$
$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}.$$
For the example in figure 6.1, the regression line is given in figure 6.2.
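As an illustration (not part of the original notes), the OLS formulas above can be evaluated directly for the Example RA data in Python:

```python
import numpy as np

# Example RA data: heights (X) and weights (Y) of the 10 class members.
x = np.array([1.65, 1.62, 1.67, 1.74, 1.65, 1.69, 1.81, 1.90, 1.79, 1.58])
y = np.array([65, 63, 64, 67, 63, 68, 73, 85, 65, 54], dtype=float)
n = len(x)

# OLS estimates from the formulas derived from the normal equations.
b1 = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / (np.sum(x**2) - np.sum(x)**2 / n)
b0 = y.mean() - b1 * x.mean()

print(round(b0, 3), round(b1, 3))   # compare with the fitted line shown in Figure 6.2
```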
[Figure 6.2: Least squares regression line for Example RA: weight versus height for a sample of size 10 from the STA2030S class; the fitted line is y = -59.398 + 73.858x with R^2 = 0.7848]

The estimated regression line (Figure 6.2), fitted by OLS, has the following properties:
1. The sum of the residuals is zero:

$$\sum_{i=1}^{n}\hat{\varepsilon}_i = \sum_{i=1}^{n}(Y_i - \hat{Y}_i) = \sum_{i=1}^{n}(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = \sum_{i=1}^{n}Y_i - n\hat{\beta}_0 - \hat{\beta}_1\sum_{i=1}^{n}X_i = 0 \tag{6.3}$$
where the last step follows from the first normal equation.

2. The sum of squared errors (or the residual sum of squares), denoted by SSE, is
$$SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n}\hat{\varepsilon}_i^2$$
(SSE is equivalent to the minimal value of Q used before). Under OLS it is a minimum. The error mean square or residual mean square (MSE) is defined here as the SSE divided by $n-2$ degrees of freedom, thus
$$MSE = \frac{SSE}{n-2} = \frac{\sum_{i=1}^{n}\hat{\varepsilon}_i^2}{n-2}$$

MSE is an unbiased estimator for $\sigma^2$, thus $E[MSE] = \sigma^2$. (Homework: Can you prove this?)

3. $\sum_{i=1}^{n}\hat{Y}_i = \sum_{i=1}^{n}Y_i$ (easily seen from (6.3)).

4. $\sum_{i=1}^{n}X_i\hat{\varepsilon}_i = \sum_{i=1}^{n}X_i(Y_i - \hat{Y}_i) = \sum_{i=1}^{n}X_iY_i - \hat{\beta}_0\sum_{i=1}^{n}X_i - \hat{\beta}_1\sum_{i=1}^{n}X_i^2 = 0$ (from the normal equations).

5. $\sum_{i=1}^{n}\hat{Y}_i\hat{\varepsilon}_i = 0$.

The formulas for the estimators of $\beta_0$ and $\beta_1$ are quite messy; imagine how they are going to look if we have more than one explanatory variable (multivariate regression). Before we continue we will move to matrix notation. Matrix notation is very useful in statistics. Without matrix notation you cannot read most statistical handbooks; it is an elegant mathematical way of writing complicated models, and it makes the manipulation of these models much easier.

6.3    Matrix notation for simple regression

Model (6.1) can be written in matrix notation as:
$$Y = X\beta + \varepsilon \tag{6.4}$$
where
$Y$ is an $n \times 1$ observed response vector,
$\varepsilon$ is an $n \times 1$ vector of uncorrelated random error variables with expectation $E[\varepsilon] = 0$ and variance matrix $\mathrm{Var}(\varepsilon) = \sigma^2 I$,
$\beta$ is a $2 \times 1$ vector of regression coefficients that must be estimated, and
$X$ is an $n \times 2$ matrix of fixed regressors or explanatory variables, whose rank is 2 (we will assume that $n > 2$). That is,
$$Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}, \quad X = \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \quad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}$$
or we may rewrite (6.4) as:
$$\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix}\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix} = \begin{bmatrix} \beta_0 + \beta_1 X_1 + \varepsilon_1 \\ \beta_0 + \beta_1 X_2 + \varepsilon_2 \\ \vdots \\ \beta_0 + \beta_1 X_n + \varepsilon_n \end{bmatrix}$$
The error vector expectation and variance can be written as:



$$E[\varepsilon] = \begin{bmatrix} E[\varepsilon_1] \\ E[\varepsilon_2] \\ \vdots \\ E[\varepsilon_n] \end{bmatrix} = 0; \qquad \mathrm{Var}[\varepsilon] = \sigma^2 I = \sigma^2\begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}$$

The normal equations (6.2) can be written in terms of matrix notation as
$$X'X\hat{\beta} = X'Y \tag{6.5}$$
where
$$\hat{\beta} = \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix}, \quad X'X = \begin{bmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2 \end{bmatrix}, \quad X'Y = \begin{bmatrix} \sum Y_i \\ \sum X_iY_i \end{bmatrix}$$
thus
$$X'X\hat{\beta} = \begin{bmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2 \end{bmatrix}\begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix} = \begin{bmatrix} n\hat{\beta}_0 + \hat{\beta}_1\sum X_i \\ \hat{\beta}_0\sum X_i + \hat{\beta}_1\sum X_i^2 \end{bmatrix} = \begin{bmatrix} \sum Y_i \\ \sum X_iY_i \end{bmatrix}$$
To obtain the OLS estimates we pre-multiply (6.5) by the inverse of $X'X$ (one of our assumptions was that X is of full column rank, so that $(X'X)^{-1}$ exists):
$$(X'X)^{-1}X'X\hat{\beta} = (X'X)^{-1}X'Y$$
thus
$$\hat{\beta} = (X'X)^{-1}X'Y$$
We consider the data of Example RA. At this stage we assume the X variable is height and the Y
variable is weight, hence the matrices are then:

$$Y = \begin{bmatrix} 65 \\ 64 \\ \vdots \\ 54 \end{bmatrix}, \quad X = \begin{bmatrix} 1 & 1.65 \\ 1 & 1.62 \\ \vdots & \vdots \\ 1 & 1.58 \end{bmatrix}, \quad X'X = \begin{bmatrix} 10 & 17.1 \\ 17.1 & 29.3286 \end{bmatrix}, \quad X'Y = \begin{bmatrix} 669 \\ 1150.46 \end{bmatrix}$$
thus
$$(X'X)^{-1} = \frac{1}{0.876}\begin{bmatrix} 29.3286 & -17.1 \\ -17.1 & 10 \end{bmatrix}$$


and
$$\hat{\beta} = (X'X)^{-1}X'Y = \frac{1}{0.876}\begin{bmatrix} 29.3286 & -17.1 \\ -17.1 & 10 \end{bmatrix}\begin{bmatrix} 669 \\ 1150.46 \end{bmatrix} = \begin{bmatrix} -59.3979 \\ 73.8584 \end{bmatrix}$$
and the fitted values are given by
$$\hat{Y} = X\hat{\beta} = \begin{bmatrix} 1 & 1.65 \\ 1 & 1.62 \\ \vdots & \vdots \\ 1 & 1.58 \end{bmatrix}\begin{bmatrix} -59.3979 \\ 73.8584 \end{bmatrix} = \begin{bmatrix} 62.46849315 \\ 60.25273973 \\ \vdots \\ 57.29840183 \end{bmatrix}$$

The residuals ($\hat{\varepsilon} = Y - \hat{Y}$) are then
$$\hat{\varepsilon} = \begin{bmatrix} 65 \\ 64 \\ \vdots \\ 54 \end{bmatrix} - \begin{bmatrix} 62.46849315 \\ 60.25273973 \\ \vdots \\ 57.29840183 \end{bmatrix} = \begin{bmatrix} 2.531506849 \\ 3.747260274 \\ \vdots \\ -3.298401826 \end{bmatrix}$$
Note: $\sum_{i=1}^{10}\hat{\varepsilon}_i = 0$.
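As an aside (not in the original notes), the normal equations (6.5) can be solved numerically straight from the $X'X$ and $X'Y$ matrices printed above; a minimal Python sketch:

```python
import numpy as np

# X'X and X'Y as computed in the notes for Example RA.
XtX = np.array([[10.0, 17.1],
                [17.1, 29.3286]])
XtY = np.array([669.0, 1150.46])

beta_hat = np.linalg.solve(XtX, XtY)   # equivalent to (X'X)^{-1} X'Y
print(beta_hat)                        # approximately [-59.398, 73.858]
```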

6.4    Multivariate regression - model, assumptions

In the previous section we only considered one explanatory variable. We can generalize/extend the regression models discussed so far to the more general models where we simultaneously consider more than one explanatory variable. We will consider $(p-1)$ regressor variables.
The linear regression model is given by
$$Y = X\beta + \varepsilon \tag{6.6}$$
where
$Y$ is an $n \times 1$ observed response vector (the same as in the simple case),
$\varepsilon$ is an $n \times 1$ vector of uncorrelated random error variables with expectation $E[\varepsilon] = 0$ and variance matrix $\mathrm{Var}(\varepsilon) = \sigma^2 I$ (the same as in the simple case),
$\beta$ is a $p \times 1$ vector of regression coefficients that must be estimated (note the additional regression parameters), and
$X$ is an $n \times p$ matrix of fixed regressors or explanatory variables, whose rank is p (we will assume that $n > p$). The X matrix includes a column of ones (for the intercept term) as well as a column for each of the $(p-1)$ X variables.


The X matrix and the $\beta$ vector are now:
$$X = \begin{bmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1,p-1} \\ 1 & X_{21} & X_{22} & \cdots & X_{2,p-1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{n,p-1} \end{bmatrix}, \qquad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{bmatrix}$$
The row subscript in the X matrix identifies the trial/case, and the column subscript identifies the X variable.
If $\hat{\beta}$ is the ordinary least squares estimator (OLSE) of $\beta$, obtained by minimizing $(Y - X\beta)'(Y - X\beta)$ over all $\beta$, then
$$\hat{\beta} = (X'X)^{-1}X'Y$$
The fitted values are $\hat{Y} = X\hat{\beta} = X(X'X)^{-1}X'Y = HY$, and the residual terms are given by the vector
$$\hat{\varepsilon} = Y - \hat{Y} = Y - X\hat{\beta} = Y - X(X'X)^{-1}X'Y = Y - HY = (I - H)Y, \quad \text{where } H = X(X'X)^{-1}X'.$$
H is called the hat matrix. It is an $n \times n$ square matrix, symmetric and idempotent (see Appendix B.5). Note that $(I - H)$ is therefore also idempotent.
The variance-covariance matrix of the residuals is:
$$\mathrm{Var}[\hat{\varepsilon}] = \mathrm{Var}[(I - H)Y] = (I - H)\,\mathrm{Var}[Y]\,(I - H)' = (I - H)\,\sigma^2 I\,(I - H)' = \sigma^2(I - H)$$
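A small numerical sketch (illustrative, not from the notes) of the hat matrix and its properties, using the Example RA design matrix:

```python
import numpy as np

heights = np.array([1.65, 1.62, 1.67, 1.74, 1.65, 1.69, 1.81, 1.90, 1.79, 1.58])
X = np.column_stack([np.ones_like(heights), heights])   # n x 2 design matrix

H = X @ np.linalg.inv(X.T @ X) @ X.T                     # hat matrix H = X(X'X)^{-1}X'

print(np.allclose(H, H.T))        # symmetric
print(np.allclose(H @ H, H))      # idempotent: HH = H
print(round(np.trace(H), 6))      # trace(H) = p = 2, the number of fitted parameters
```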

Properties of the regression coefficients:

1. $E[\hat{\beta}] = \beta$ (unbiased):
$$E[\hat{\beta}] = E[(X'X)^{-1}X'Y] = (X'X)^{-1}X'E[Y] = (X'X)^{-1}X'X\beta = \beta$$

2. $\mathrm{Var}[\hat{\beta}] = \sigma^2(X'X)^{-1}$:
$$\mathrm{Var}[\hat{\beta}] = \mathrm{Var}[(X'X)^{-1}X'Y] = (X'X)^{-1}X'\,\mathrm{Var}[Y]\,X(X'X)^{-1} = (X'X)^{-1}X'\,\sigma^2 I\,X(X'X)^{-1} = \sigma^2(X'X)^{-1}$$

3. The estimated variance-covariance matrix of $\hat{\beta}$ is $MSE\,(X'X)^{-1}$.

4. $\hat{\beta}$ is the best linear unbiased estimator (BLUE) of $\beta$.

5. If we assume that the $\varepsilon$ are Gaussian (normal) in distribution, then the BLUEs and the maximum likelihood estimators (MLEs) coincide (proof in STA3030).

6.5    Graphical residual analysis

To recap, a residual is the difference between an observed value and a fitted value, thus:
$$\hat{\varepsilon}_i = Y_i - \hat{Y}_i$$
or in matrix notation the residual vector ($n \times 1$) is
$$\hat{\varepsilon} = Y - \hat{Y}$$
Some properties of residuals were discussed before, e.g. the residuals sum to zero (implying that the mean of the residuals is 0). Thus the residuals are not independent random variables: if we know $n-1$ residuals we can determine the last one. The variance of the residuals is estimated by $\frac{SSE}{n-2}$, which is an unbiased estimator of the variance $\sigma^2$ (see STA3030F for properties of estimators).
The standardized residual is
$$\hat{\varepsilon}_i^* = \frac{\hat{\varepsilon}_i}{\sqrt{MSE}}$$

Informally (by graphical analysis of residuals and raw data) we can investigate how the regression model (6.1 and 6.4) departs from the model assumptions.
To investigate whether the (linear) model is appropriate we should plot the explanatory variable(s) on the X axis and the residuals on the Y axis. The residuals should fall in a horizontal band centered around 0 (with no obvious pattern, as in Figure 6.3 (a)). Figures 6.3 (b), (c) and (d) suggest non-standard properties of the residuals:

1. Non-linearity of regression function
We start off in the case of the simple regression model by using a scatterplot. Does this scatterplot suggest strongly that a linear regression function is appropriate? (e.g. see Figure 6.1). Once the model is fitted, we use the resulting residuals to plot the explanatory variable(s) on the X axis and the residuals on the Y axis. Do the residuals depart from 0 in a systematic way? (Figure 6.3 (b) is an example that suggests a curvilinear regression function.)
If the data depart from linearity we can consider some transformations on the data in order to create/force a linear model for new resulting forms of the variables.
Logarithmic transformation: Consider the model

[Figure 6.3: Prototype residual plots (a)-(d) (Neter et al. (1996), Figure 3.4)]

$$Y = \beta_0\beta_1^X\varepsilon \tag{6.7}$$
However this model (6.7) is intrinsically linear, since it can be transformed to a linear format by taking $\log_{10}$ (or $\log_e = \ln$) on both sides:
$$\underbrace{\log_{10} Y}_{Y'} = \underbrace{\log_{10}\beta_0}_{\beta_0'} + X\underbrace{\log_{10}\beta_1}_{\beta_1'} + \underbrace{\log_{10}\varepsilon}_{\varepsilon'} \tag{6.8}$$
Note that model (6.8) is now in the standard linear regression form, whereas (6.7) was not.
Homework: Choose two functions like (6.7). Graph both functions, then transform them to obtain forms like (6.8). Plot these new relationships. What do you see?
Note that sometimes it may only be necessary to take the logarithm of either X or Y.
Reciprocal Transformation
Consider the following model which is not linear in X:
$$Y = \beta_0 + \beta_1\frac{1}{X} + \varepsilon \tag{6.9}$$
The transformation
$$X' = \frac{1}{X}$$
makes model (6.9) linear.
Homework: Choose two functions like (6.9). Graph both functions, then transform (reciprocal). Plot these new relationships. What do you see?
2. Non-constancy of error variance
Figure 6.3 (c) is an example where the error variance increases with X. In the case of multiple regression, a plot of residuals (on the Y axis) versus fitted values ($\hat{Y}$ on the X axis) is an effective way to assess whether or not the error variance is constant over the fitted values for the mean response.
Heteroscedasticity: the error variance is not constant over all observations (the plot of the residuals fans out).
Homoscedasticity: the error variance is constant (horizontal band of residuals).
When we have heteroscedasticity the OLS parameter estimators are still unbiased, but they do not have minimum variance.
In the case of Poisson distributions (mean equals the variance) a useful transformation to stabilize the variance (and improve normality) is the square-root transformation ($\sqrt{y}$).


Standard deviation proportional to X
When
$$\mathrm{Var}[\varepsilon_i] = kX_i^2$$
then an appropriate transformation (in the case of Gaussian error terms) is to divide by $X_i$:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i; \qquad \varepsilon_i \sim N[0, kX_i^2]$$
$$\underbrace{\frac{Y_i}{X_i}}_{Y_i'} = \beta_1 + \beta_0\underbrace{\frac{1}{X_i}}_{X_i'} + \underbrace{\frac{\varepsilon_i}{X_i}}_{\varepsilon_i'}$$
so that the re-expressed model is again a simple linear regression (with the roles of the intercept and slope interchanged), and the variance is now constant:
$$\mathrm{Var}[\varepsilon_i'] = \mathrm{Var}\left[\frac{\varepsilon_i}{X_i}\right] = \frac{\mathrm{Var}[\varepsilon_i]}{X_i^2} = \frac{kX_i^2}{X_i^2} = k$$

3. Error terms are not independent
When the residuals are independent we should see a random distribution of residuals around the base line (Figure 6.3 (a)).
When data are obtained in a time order, plot the residuals against time (time does not need to be a variable in the model) (Figure 6.3 (d)).

4. Outliers are extreme observations. In a graph a particular data point does not fall into the random scatter of residuals but outside it (Figure 6.4). Outliers need to be investigated carefully. Did we make a typing error? Did we measure it wrongly? Outliers can be excluded from the analysis, but be careful - you only exclude an observation if you have cause!
In this course we are only going to examine outliers graphically and explore the standardized residuals (residual divided by its standard error, $\sqrt{MSE}$). We are not going to study measures of influence (most of them are based on the hat matrix) such as the studentized residual, DFFITS, Cook's distance and so on.


[Figure 6.4: Residual plot with one outlier]
5. The error terms are not Gaussian in distribution
Normality of the error terms will be investigated graphically (see the specific graph in your software, e.g. a normal probability plot of the residuals). We expect to see a straight line. Departures from a straight line might be an indication that the residuals are not Gaussian.
Sometimes a transformation of the variables will bring the residuals nearer to normality (some variance stabilizing transformations will also improve normality).

6. Important explanatory variables omitted from the model
Plot your residuals against any variables omitted from the model. Do the residuals vary systematically?

6.6    Variable diagnostics

6.6.1    Analysis of variance (ANOVA)

The deviation $Y_i - \bar{Y}$ (a quantity measuring the variation of the observation $Y_i$ about the overall mean) can be decomposed as follows:
$$\underbrace{Y_i - \bar{Y}}_{\text{I}} = \underbrace{\hat{Y}_i - \bar{Y}}_{\text{II}} + \underbrace{Y_i - \hat{Y}_i}_{\text{III}}$$
where I is the total deviation, II is the deviation of the fitted OLS regression value from the overall mean, and III is the deviation of the observed value from the fitted value on the regression line (Figure 6.5).
The sum of the squared deviations (the mixed terms are zero) is given by


[Figure 6.5: Decomposition for one observation - ANOVA, showing $Y_i$, $\hat{Y}_i$, $\bar{Y}$ and the fitted line $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$]

$$\underbrace{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}_{SSTO} = \underbrace{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2}_{SSR} + \underbrace{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}_{SSE}$$
where SSTO is the total sum of squares (corrected for the mean) with $n-1$ degrees of freedom (df), SSR is the regression sum of squares with $p-1$ degrees of freedom ($p-1$ explanatory regressor variables), and SSE denotes the error sum of squares (defined before) with $n-p$ degrees of freedom (p parameters are fitted).
In matrix notation, and for any value of p, the sums of squares are
$$SSTO = Y'Y - n\bar{Y}^2$$
$$SSR = \hat{\beta}'X'Y - n\bar{Y}^2 = Y'X\hat{\beta} - n\bar{Y}^2$$
$$SSE = Y'Y - \hat{\beta}'X'Y = Y'Y - Y'X\hat{\beta}$$
A sum of squares divided by its degrees of freedom is called a mean square (MS). The breakdown of the total sum of squares and associated degrees of freedom are displayed in an analysis of variance table (ANOVA table):

ANOVA Table

    Source        SS                                      df       MS
    Regression    $SSR = \hat{\beta}'X'Y - n\bar{Y}^2$     $p-1$    $MSR = SSR/(p-1)$
    Error         $SSE = Y'Y - \hat{\beta}'X'Y$            $n-p$    $MSE = SSE/(n-p)$
    Total         $SSTO = Y'Y - n\bar{Y}^2$                $n-1$

The coefficient of multiple determination is denoted by $R^2$ and is defined as
$$R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}$$
Note that $0 \le R^2 \le 1$.
$R^2$ measures the proportionate reduction in the (squared) variation of Y achieved by the introduction of the entire set of X variables considered in the model. The coefficient of multiple correlation R is the positive square root of $R^2$. In the case of simple regression, R is the absolute value of the coefficient of correlation.
Adding more explanatory variables to the model will increase $R^2$ (SSR becomes larger, SSTO does not change). A modified measure that adjusts for the number of regressor variables is the adjusted coefficient of multiple determination, denoted by $R_a^2$. It is defined by:
$$R_a^2 = 1 - \frac{n-1}{n-p}\,\frac{SSE}{SSTO}$$
Note that $R_a^2 < R^2$.
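To tie these quantities together, here is an illustrative Python sketch (not part of the notes) that computes the ANOVA decomposition, $R^2$ and $R_a^2$ for the Example RA data via the matrix formulas; treat the output as indicative and compare with the $R^2$ reported in Figure 6.2.

```python
import numpy as np

heights = np.array([1.65, 1.62, 1.67, 1.74, 1.65, 1.69, 1.81, 1.90, 1.79, 1.58])
weights = np.array([65, 63, 64, 67, 63, 68, 73, 85, 65, 54], dtype=float)

X = np.column_stack([np.ones_like(heights), heights])
Y = weights
n, p = X.shape

b = np.linalg.solve(X.T @ X, X.T @ Y)          # OLS estimates
ybar = Y.mean()

SSTO = Y @ Y - n * ybar**2
SSR = b @ X.T @ Y - n * ybar**2
SSE = Y @ Y - b @ X.T @ Y

R2 = SSR / SSTO
R2_adj = 1 - (n - 1) / (n - p) * SSE / SSTO
MSE = SSE / (n - p)

print(round(R2, 4), round(R2_adj, 4), round(MSE, 3))
```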

6.7    Subset selection of regressor variables - building the regression model

Although p regressor variables are available, not all of them may be necessary for an adequate fit of the model to the data. After the functional form of each regressor variable is obtained (i.e. $X_i^2$, $\log(X_j)$, $X_iX_j$, and so on), we seek a "best" subset of regressor variables. This best subset is not necessarily unique, but may be one of a set of best subsets.
To find a subset there are basically two strategies: all possible regressions and stepwise regression (which we take to include the special cases of forward selection and backward elimination).

6.7.1    All possible regressions

In the all possible regressions search procedure, all possible subsets (equations) are computed, and a selection of a pool of "best" models (subsets of variables) is performed under some criterion ($R^2$, adjusted $R^2$, MSE, $C_p$).
The purpose of the all possible regressions approach is to identify a small pool of potential models (based on a specific criterion). Once the pool is identified, the models are scrutinized and a best model is selected.
If there are $(p-1) = k$ explanatory variables and one intercept term, there will be $2^k$ possible models. For example if p = 3 (constant, $X_1$, $X_2$) the following $2^2$ models are possible:
$$E[Y] = \beta_0, \qquad E[Y] = \beta_0 + \beta_1 X_1, \qquad E[Y] = \beta_0 + \beta_1 X_2, \qquad E[Y] = \beta_0 + \beta_1 X_1 + \beta_2 X_2$$
where the meaning and the values of the coefficients $\beta_0, \beta_1, \beta_2$ differ in each model.
The criterion (one of $R^2$, adjusted $R^2$, MSE, $C_p$ or others) that is used to find the best subsets does not form part of this course. It is expected that students must be able to apply this in practical applications. As a general guideline one might prefer either Mallows' $C_p$ or the adjusted $R^2$ (go and explore/experiment with some of them!).

6.7.2    Stepwise regression

Some practitioners prefer stepwise regression because this technique requires less computation than all-possible-subsets regression. This search method computes a sequence of regression equations. At each step an explanatory variable is added or deleted. The common criterion for adding (or deleting) a regressor variable examines the effect of that particular variable: at each step, the variable chosen is the one which produces the greatest reduction (or smallest increase) in the error sum of squares. Under stepwise regression we can distinguish basically three procedures: (i) forward selection, (ii) backward elimination and (iii) forward selection with a view back.
The theory involved in stepwise regression will not be discussed further. It is expected that the student should explore/experiment with these methods and compare the outcomes with the set of best models found under all-possible-regressions. Do the two main-stream regression methods produce the same result? Most software programs require some input values (for stepwise regression, e.g. F-to-include, etc.). How do the models with different starting parameters differ?

6.8    Further residual analysis

In this chapter we did not explore a number of refined diagnostics for identifying the optimal model; instead we only discussed graphical methods to identify and correct model problems. Other topics (or problems) that help us to refine regression models are outliers, influential observations and collinearity (Neter et al. (1996)); these do not form part of this course.

6.9    Inference about regression parameters and prediction

6.9.1    Inference on regression parameters

Earlier in this chapter we noted that the regression parameter estimator is unbiased, that is
$$E[\hat{\beta}] = \beta$$
and that the variance-covariance matrix of $\hat{\beta}$ is
$$\mathrm{Var}[\hat{\beta}] = \sigma^2(X'X)^{-1}.$$
The estimated covariance matrix is $\hat{\sigma}^2(X'X)^{-1}$, where $\hat{\sigma}^2 = MSE$. The diagonal elements of this estimated covariance matrix report the variances for $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_{p-1}$. Thus
$$\widehat{\mathrm{Var}}[\hat{\beta}_j] = \hat{\sigma}^2(X'X)^{-1}_{jj}$$
(where $(X'X)^{-1}_{jj}$ is the jth diagonal element of $(X'X)^{-1}$).
If we further assume that the error terms are Gaussian ($N(0, \sigma^2 I)$), and the population variance $\sigma^2$ is unknown, then, using the results of Chapter 5:
$$\frac{\hat{\beta}_j - \beta_j}{\hat{\sigma}\sqrt{(X'X)^{-1}_{jj}}} \sim t_{n-p} \quad \text{for } j = 0, 1, \ldots, p-1.$$
Now we consider the null hypothesis:
$$H_0: \beta_j = 0 \quad \text{vs} \quad H_1: \beta_j \ne 0.$$
Then under $H_0$ the test statistic is
$$t_0 = \frac{\hat{\beta}_j}{\hat{\sigma}\sqrt{(X'X)^{-1}_{jj}}}$$
and the underlying distribution is the t distribution with $(n-p)$ degrees of freedom.
The $100(1-\alpha)\%$ confidence interval for $\beta_j$ is given by
$$\hat{\beta}_j - t_{(n-p),\alpha/2}\,\hat{\sigma}\sqrt{(X'X)^{-1}_{jj}} \;\le\; \beta_j \;\le\; \hat{\beta}_j + t_{(n-p),\alpha/2}\,\hat{\sigma}\sqrt{(X'X)^{-1}_{jj}}.$$
This interval is sometimes written as
$$\hat{\beta}_j \pm t_{(n-p),\alpha/2}\,\hat{\sigma}\sqrt{(X'X)^{-1}_{jj}}. \tag{6.10}$$
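As an illustration (not in the notes), interval (6.10) can be computed for the simple regression of Example RA; the standard errors come from the diagonal of $MSE\,(X'X)^{-1}$.

```python
import numpy as np
from scipy import stats

heights = np.array([1.65, 1.62, 1.67, 1.74, 1.65, 1.69, 1.81, 1.90, 1.79, 1.58])
weights = np.array([65, 63, 64, 67, 63, 68, 73, 85, 65, 54], dtype=float)

X = np.column_stack([np.ones_like(heights), heights])
Y = weights
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y
resid = Y - X @ b
MSE = resid @ resid / (n - p)

se = np.sqrt(MSE * np.diag(XtX_inv))          # standard errors of b0 and b1
t_crit = stats.t.ppf(0.975, df=n - p)         # 95% two-sided critical value

for j in range(p):
    print(f"beta_{j}: {b[j]:.3f} +/- {t_crit * se[j]:.3f}")
```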

6.9.2    Drawing inferences about $E[Y|x_h]$

Define the $(1 \times p)$ row vector
$$X_h = [1 \;\; X_{h1} \;\; X_{h2} \;\; \cdots \;\; X_{h,p-1}].$$
The corresponding prediction for the Y value at $X_h$ is denoted by $\hat{Y}_h$ and is
$$\hat{Y}_h = X_h\hat{\beta}.$$
$$E[\hat{Y}_h] = E[X_h\hat{\beta}] = X_h E[\hat{\beta}] = X_h\beta$$
so that $\hat{Y}_h$ is unbiased, with variance
$$\mathrm{Var}[\hat{Y}_h] = \mathrm{Var}[X_h\hat{\beta}] = X_h\,\mathrm{Var}[\hat{\beta}]\,X_h' = \sigma^2 X_h(X'X)^{-1}X_h'$$
and the estimated variance is given by
$$\hat{\sigma}^2 X_h(X'X)^{-1}X_h'.$$
Thus we can construct a t random variable based on $\hat{Y}_h$, namely
$$\frac{\hat{Y}_h - X_h\beta}{\hat{\sigma}\sqrt{X_h(X'X)^{-1}X_h'}} \tag{6.11}$$
and a confidence interval of the form
$$\hat{Y}_h \pm t_{(n-p),\alpha/2}\,\hat{\sigma}\sqrt{X_h(X'X)^{-1}X_h'}.$$

6.9.3    Drawing inferences about future observations

Define a new observation $Y_{h(\text{new})}$ corresponding to $X_h$. $Y_{h(\text{new})}$ is assumed to be independent of the n $Y_i$'s observed. The prediction is
$$\hat{Y}_{h(\text{new})} = X_h\hat{\beta}.$$
$$E[\hat{Y}_{h(\text{new})}] = E[X_h\hat{\beta}] = X_h E[\hat{\beta}] = X_h\beta$$
$$E[Y_{h(\text{new})} - \hat{Y}_{h(\text{new})}] = X_h\beta - X_h\beta = 0$$
and
$$\mathrm{Var}[Y_{h(\text{new})} - \hat{Y}_{h(\text{new})}] = \mathrm{Var}[Y_{h(\text{new})}] + \mathrm{Var}[\hat{Y}_{h(\text{new})}] = \sigma^2 + \sigma^2 X_h(X'X)^{-1}X_h' = \sigma^2[1 + X_h(X'X)^{-1}X_h']$$
and the estimated variance is
$$\hat{\sigma}^2[1 + X_h(X'X)^{-1}X_h'].$$
Thus, just as in the previous section, we can construct a t random variable with $n-p$ degrees of freedom, and use this random variable to set up a confidence interval for a new predicted value:
$$\hat{Y}_{h(\text{new})} \pm t_{(n-p),\alpha/2}\,\hat{\sigma}\sqrt{1 + X_h(X'X)^{-1}X_h'}.$$
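A short illustrative sketch (not from the notes) computing both the confidence interval for the mean response and the prediction interval for a new observation at a chosen $X_h$, using the Example RA fit; the value $X_h = [1, 1.70]$ is an arbitrary choice.

```python
import numpy as np
from scipy import stats

heights = np.array([1.65, 1.62, 1.67, 1.74, 1.65, 1.69, 1.81, 1.90, 1.79, 1.58])
weights = np.array([65, 63, 64, 67, 63, 68, 73, 85, 65, 54], dtype=float)

X = np.column_stack([np.ones_like(heights), heights])
Y = weights
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y
MSE = (Y - X @ b) @ (Y - X @ b) / (n - p)

xh = np.array([1.0, 1.70])                        # hypothetical new X value (height 1.70 m)
yh = xh @ b                                       # point prediction
t_crit = stats.t.ppf(0.975, df=n - p)

se_mean = np.sqrt(MSE * xh @ XtX_inv @ xh)        # for the mean response E[Y | x_h]
se_new = np.sqrt(MSE * (1 + xh @ XtX_inv @ xh))   # for a new observation

print(yh, yh - t_crit * se_mean, yh + t_crit * se_mean)   # CI for the mean response
print(yh, yh - t_crit * se_new, yh + t_crit * se_new)     # prediction interval
```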

Tutorial Exercises

The tutorial exercises for this chapter are different from those of the previous chapters. This tutorial involves hands-on (lab) work. Under Resources on the course Vula site you will find a folder called "Regression tutorial". The data sets are from problems taken from Neter et al. (1996).

1. This question is based on the data in the file Gradepointaverage.dat (Neter et al. (1996), p 38).
A new entrance test for 20 students selected at random from a first year class was administered to determine whether a student's grade point average (GPA) at the end of the first year (Y) can be predicted from the entrance test score (X).

          1    2    3   ...  18   19   20
    X    5.5  4.8  4.7  ... 5.9  4.1  4.7
    Y    3.1  2.3  3.0  ... 3.8  2.2  1.5

(a) Assume a linear regression model and use any statistical software package to answer the following questions:
    Fit an appropriate model;
    Obtain an estimate of the GPA for a student with an entrance test score X = 5.0;
    Obtain the residuals. Do the residuals sum to zero?
    Estimate $\sigma^2$.
(b) Using matrix methods, find:
    i. $Y'Y$; $X'X$; $X'Y$
    ii. $(X'X)^{-1}$, and hence the vector of estimated regression coefficients.
    iii. The ANOVA table (using matrix notation).
    iv. The estimated variance-covariance matrix of $\hat{\beta}$.
    v. The estimated value of $Y_{h(\text{new})}$, when $X_h = [1, 5.0]$.
    vi. The estimated variance of $\hat{Y}_{h(\text{new})}$.
    vii. Find the hat matrix H.
    viii. Find $\hat{\sigma}^2$.
(c) Did you encounter any violations of model assumptions? Comment.

For the following exercises: fit a linear model, analyse the residuals (graphically) and come up with a best model. Also follow additional guidelines from class. Interpret all the results and report back on the best model.

2. The data file Kidneyfunction.dat (Neter et al. (1996), Problem 8.15, p 358).

3. The data file roofingshingles.dat (Neter et al. (1996), Problem 8.9, p 356).

References: Neter J., Kutner M.H., Nachtsheim C.J. and Wasserman W. (1996). Applied linear statistical models (fourth edition).

Appendix A

Attitude

Stats comes slowly; every day builds on the previous day. I do not miss a class.
I come to every tutorial, and I come prepared (leave the problem cases for the tutorial).
Exercise, exercise, exercise. (Every day.)
I am paying for this course, so I use all the resources (classes, tutorials, hotseats); the lecturers are there to answer your questions, and they are available!
Questions need to be specific! (Show that you are doing your bit.)

Appendix B

Maths Toolbox

The maths toolbox includes a summary (not necessarily complete) of some of the maths tools that you need to understand and succeed in this course. It is important to realize that this is a statistics course and not a maths course: we do not teach maths, we assume that your maths tools are sharp.

B.1    Differentiation (e.g. ComMath, chapter 3)

B.2    Integration (ComMath, chapter 7)

B.3    General

1. Binomial theorem (ComMath, chapter 2)
$$(a + b)^n = \sum_{i=0}^{n}\binom{n}{i}a^ib^{n-i}$$

2. Geometric series
$$\sum_{j=0}^{n-1}ar^j = a\,\frac{1 - r^n}{1 - r}$$
$$\sum_{j=0}^{\infty}ar^j = \frac{a}{1 - r}, \quad \text{for } |r| < 1$$

3. Expansion of $e^x$ (ComMath, chapter 4)
$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots = \sum_{j=0}^{\infty}\frac{x^j}{j!} \quad \text{for } -\infty < x < \infty$$

B.4    Double integrals

Now let us turn to double integrals (a special case of multiple integrals), which are written as follows:
$$\int_{x=a}^{b}\int_{y=c}^{d} f(x, y)\, dy\, dx$$
The evaluation of the double integral (or any multiple integral) is in principle very simple: it is the repeated application of the rules for single integrals. So in the above example, we would first evaluate:
$$\int_{y=c}^{d} f(x, y)\, dy$$
treating x as if it were a constant. The result will be a function of x only, say k(x), and thus in the second stage we need to evaluate:
$$\int_{x=a}^{b} k(x)\, dx.$$
In the same way that the single integral can be viewed as measuring area under a curve, the double integral measures volume under a two-dimensional surface.

There is a theorem which states that (for any situations of interest to us here) the order of integration is immaterial: we could just as well have first integrated with respect to x (treating y as a constant), and then integrated the resulting function of y. There is one word of caution required in applying this theorem, however, and that relates to the limits of integration. In evaluating the above integral in the way in which we described initially, there is no reason why the limits on y should not be functions of the outer variable x, which is (remember) being treated as a constant; the result will still be a function of x, and thereafter we can go on to the second step. When integrating over x, however, the limits must be true constants. Now what happens if we reverse the order of integration? The outer integral in y cannot any longer have limits depending on x, so there seems to be a problem! It is not the theorem that is wrong; it is only that we have to be careful to understand what we mean by the limits.
The limits must describe a region (an area) in the X-Y plane over which we wish to integrate, which can be described in many ways. We will at a later stage of the course encounter a number of examples of this, but consider one (quite typical) case: suppose we wish to integrate f(x, y) over all x and y satisfying $x \ge 0$, $y \ge 0$ and $x + y \le 1$. The region over which we wish to integrate is the shaded area in Figure B.1; the idea is to find the volume under the surface defined by f(x, y), in the column whose base is the shaded triangle in the figure. If we first integrate w.r.t. y, treating x as a constant, then the limits on y must be 0 and $1-x$; and since this integration can be done for any x between 0 and 1, these become the limits for x, i.e. we would view the integral as:
$$\int_{x=0}^{1}\int_{y=0}^{1-x} f(x, y)\, dy\, dx.$$
But if we change the order around, and first integrate w.r.t. x, treating y as a constant, then the limits on x must be 0 and $1-y$, and this integration can be done for any y between 0 and 1, in which case we would view the integral as:
$$\int_{y=0}^{1}\int_{x=0}^{1-y} f(x, y)\, dx\, dy.$$
We may also wish to transform or change the variables of integration in a double integral. Thus suppose that we wish to convert from x and y to variables u and v defined by the continuously differentiable functions:
$$u = g(x, y), \qquad v = h(x, y).$$

[Figure B.1: Region of double integration - the shaded triangle with $x \ge 0$, $y \ge 0$ and the line $x + y = 1$ as its upper boundary]


We shall suppose that these functions define a 1-1 mapping, i.e. for any given u and v we can find
a unique solution for x and y in terms of u and v, which we could describe as inverse functions,
say:
x = (u, v)
y = (u, v).
We now define the Jacobian of the (bivariate) transformation from (x, y) to (u, v) by the absolute
value of the determinant of the matrix of all partial derivatives of the original variables (i.e. x and
y) with respect to the new variables, in other words:

|J| =
=


x/u

y/u

x y

u v

Note that the Jacobian is a function of u and v.


x/v
y/v

y x
u v

Theorem B.1 Suppose that the continuously differentiable functions g(x, y) and h(x, y) define a
one-to-one transformation of the variables of integration, with inverse functions (u, v) and (u, v).
Then:
Z
Z
Z Z
b

f (x, y) dy dx =

x=a

y=b

u=a

v=c

f [(u, v), (u, v] |J| dv du

where {a u b ; c v d } describes the region in the transformed space corresponding to


{a x b ; c y d} in the original variables.
This equation then defines a procedure for changing the variable of integration from x and y to
u = g(x, y) and v = h(x, y):


1. Solve for x and y in terms of u and v to get the inverse functions $\psi(u, v)$ and $\phi(u, v)$.
2. Obtain all partial derivatives of $\psi(u, v)$ and $\phi(u, v)$ with respect to u and v, and hence obtain the Jacobian |J|.
3. Calculate the minimum and maximum values for u and v (where the ranges for the variable in the inner integral may depend on the variable in the outer integral).
4. Write down the new integral, as given by the theorem.
Example: Evaluate:
$$\int_{x=0}^{\infty}\int_{y=0}^{\infty} e^{-x}e^{-2y}\, dy\, dx$$
Let's do the transformation V = X (we sometimes refer to this as a dummy transformation) and U = 2Y. The inverse transformation is easy to write down directly as $X = V$ and $Y = U/2$. The Jacobian is given by:
$$|J| = \left|\det\begin{bmatrix} \partial x/\partial u & \partial x/\partial v \\ \partial y/\partial u & \partial y/\partial v \end{bmatrix}\right| = \left|\det\begin{bmatrix} 0 & 1 \\ \tfrac{1}{2} & 0 \end{bmatrix}\right| = \frac{1}{2}$$
Clearly the ranges for the new variables are $0 \le u < \infty$ and $0 \le v < \infty$. The integral is thus given by:
$$\int_{u=0}^{\infty}\int_{v=0}^{\infty} \frac{1}{2}e^{-v}e^{-u}\, dv\, du$$
thus
$$\int_{u=0}^{\infty}\int_{v=0}^{\infty} \frac{1}{2}e^{-v}e^{-u}\, dv\, du = \int_{u=0}^{\infty}\frac{1}{2}e^{-u}\left[-e^{-v}\right]_{v=0}^{\infty} du = \int_{u=0}^{\infty}\frac{1}{2}e^{-u}\, du = \frac{1}{2}\left[-e^{-u}\right]_{u=0}^{\infty} = \frac{1}{2}$$

Note: It is possible to start by evaluating the derivatives of u = g(x, y) and v = h(x, y) w.r.t.
x and y. But it is incorrect simply to invert each of the four derivatives (as functions of
x and y), and to substitute them in the above. What you have to do is to evaluate the
corresponding determinant first, and then to invert the determinant. This will still be a
function of x and y, and you will have to then further substitute these out. This seems to be
a more complicated route, and is not advised, although it is done in some textbooks.
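For what it is worth, the example above can also be checked symbolically; a small sketch using sympy (purely illustrative, not part of the notes):

```python
import sympy as sp

x, y, u, v = sp.symbols('x y u v', positive=True)

# Direct evaluation of the double integral.
direct = sp.integrate(sp.exp(-x) * sp.exp(-2*y), (y, 0, sp.oo), (x, 0, sp.oo))

# Evaluation after the transformation V = X, U = 2Y, with |J| = 1/2.
transformed = sp.integrate(sp.Rational(1, 2) * sp.exp(-v) * sp.exp(-u),
                           (v, 0, sp.oo), (u, 0, sp.oo))

print(direct, transformed)   # both equal 1/2
```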


B.5    Matrices (e.g. ComMath, Chapter 5)

B5.1.1 You need to know the following definitions and matrix operators:
    Matrices ($n \times p$)
    Vectors ($n \times 1$)
    Manipulating rows and columns of a matrix
    Matrix operations (add, subtract, multiply)
    Square, symmetric matrices
    The transpose of a matrix
    The trace of a matrix
    The inverse of a non-singular matrix
    The rank of a matrix

B5.1.2 Expectation of a random matrix.
For a random matrix T with dimension $m \times k$, the expectation is
$$E[T] = \begin{bmatrix} E[T_{11}] & E[T_{12}] & \cdots & E[T_{1k}] \\ E[T_{21}] & E[T_{22}] & \cdots & E[T_{2k}] \\ \vdots & \vdots & & \vdots \\ E[T_{m1}] & E[T_{m2}] & \cdots & E[T_{mk}] \end{bmatrix}$$
If we consider W = AT, where T is a random matrix ($m \times k$) and A is a constant matrix ($r \times m$), then
$$E[A] = A, \qquad E[W] = E[AT] = AE[T].$$

B5.1.3 The variance-covariance matrix of a random vector T ($k \times 1$) is:
$$\mathrm{Var}[T] = \begin{bmatrix} \mathrm{Var}[T_1] & \mathrm{Cov}[T_1, T_2] & \cdots & \mathrm{Cov}[T_1, T_k] \\ \mathrm{Cov}[T_2, T_1] & \mathrm{Var}[T_2] & \cdots & \mathrm{Cov}[T_2, T_k] \\ \vdots & \vdots & & \vdots \\ \mathrm{Cov}[T_k, T_1] & \mathrm{Cov}[T_k, T_2] & \cdots & \mathrm{Var}[T_k] \end{bmatrix}$$
The variance-covariance matrix is a symmetric matrix: $\mathrm{Cov}[T_i, T_j] = \mathrm{Cov}[T_j, T_i]$ for $i \ne j$.
If we consider W = AT, where T is a random vector ($k \times 1$) and A is a constant matrix of dimension $m \times k$, then
$$\mathrm{Var}[A] = 0, \qquad \mathrm{Var}[W] = \mathrm{Var}[AT] = A\,\mathrm{Var}[T]\,A'.$$

B5.1.4 A matrix S is idempotent if
$$S^2 = SS = S.$$
An example of an idempotent matrix is the hat matrix,
$$H = X(X'X)^{-1}X'$$
for which
$$H^2 = HH = X(X'X)^{-1}X'X(X'X)^{-1}X' = X(X'X)^{-1}X' = H.$$
