
Advanced Statistics I

Institut für Statistik und Ökonometrie


(Christian-Albrechts-Universität zu Kiel)

October 21, 2016

Contents

1. Elements of probability theory
   1.1. Sample Space and Events
   1.2. Probability
   1.3. Properties of the Probability Function
   1.4. Conditional Probability
   1.5. Independence
   1.6. Total Probability Rule and Bayes' Law

2. Random variables and their probability distributions
   2.1. Univariate Random Variables
   2.2. Univariate Cumulative Distribution Functions
   2.3. Multivariate Random Variables
   2.4. Marginal and Conditional Distributions
   2.5. Independence of Random Variables

3. Moments of random variables
   3.1. Expectation of a Random Variable
   3.2. Properties of the Expectation and Extensions
   3.3. Conditional Expectation
   3.4. Moments of a Random Variable
   3.5. Moment-Generating Functions
   3.6. Joint Moments and Moments of Linear Combinations
   3.7. Means and Variances of Linear Combinations of Random Variables

4. Parametric families of density functions
   4.1. Discrete Density Functions
   4.2. Continuous Density Functions
   4.3. Normal Family of Densities
   4.4. Exponential Class of Distributions
   4.5. Further Extensions

5. Basic asymptotics
   5.1. Convergence of Number and Function Sequences
   5.2. Convergence Concepts for Sequences of Random Variables
   5.3. Weak Laws of Large Numbers
   5.4. Central Limit Theorems
   5.5. Asymptotic Distributions of Functions of Asymptotically Normally Distributed Random Variables

6. Samples and Statistics
   6.1. Random (iid) Sampling
   6.2. Empirical Distribution Function
   6.3. Sample Moments
   6.4. Sample Mean and Variance from Normal Random Samples
   6.5. Pdfs of Functions of Random Variables
   6.6. Order Statistics

Appendix

A. Tables

1. Elements of probability theory


Probability, or chance, is a way of expressing knowledge or belief that an event will occur or has occurred. In mathematics, the concept has been given an exact meaning in probability theory, which is used extensively in such areas of study as statistics, finance, gambling, science, and philosophy to draw conclusions about the likelihood of potential events and the underlying mechanics of complex systems. Of course, we all have some understanding or intuition of what probability stands for; what we need is a way of working with probability that is tractable and free of contradictions.

1.1. Sample Space and Events


Probability theory is the foundation upon which all of statistics is built. The objective
of probability theory is to quantify the level of uncertainty associated with observing various
outcomes of a chance situation (i.e. the possible outcomes of a random experiment or a random
phenomenon). For example, we might be interested in
1. the level of uncertainty associated with the event that an ideal coin will turn up heads;
2. the level of uncertainty associated with the event that the German GDP (gross domestic
product) increases this year by more than 3%.
The fundamental tool to measure uncertainty is probability.¹ In the following, we consider
some important definitions (sample space and events) that are used to discuss probability.

Definition 1.1 (Sample space) A set S that contains all possible outcomes of a given random experiment is called a sample space.

Example 1.1 If the experiment consists of tossing a die, the sample space contains six possible outcomes, given by S = {1, 2, 3, 4, 5, 6}.

¹ At least for statisticians; other scientists regard constructs building on fuzzy sets and fuzzy logic as an alternative; see e.g. Novák (2005, Fuzzy Sets and Systems 156: 341–348).


Example 1.2 If the experiment consists of recording the number of traffic deaths in Germany
next year, the sample space would contain all positive integers
S = {0, 1, 2, ...}.
(In principle, there is an upper bound given by the current total German population, but considering the actual outcomes observed so far, taking this into account does not make a real difference.)

Example 1.3 If the experiment consists of observing the length of life of a light bulb, the sample
space would contain all positive numbers
S = (0, ∞).

Remark 1.1 The sample space need not be identically equal to the set of all possible outcomes
(could be larger). The only concern of practical importance is that the sample space is specified
large enough to contain the set of all possible outcomes as a subset.

The sample space S, as any set, can be classified according to whether the number of elements
in the set is
1. finite (discrete sample space), e.g., S = {0, 1, 2, ..., 6};
2. countably infinite (discrete sample space), e.g., S = N = {0, 1, 2, ...};
3. uncountably infinite (continuous sample space), e.g., S = R.
The fundamental entities to which probabilities will be assigned are events.

Definition 1.2 (Event) An event, say A, is a subset of the sample space S (including S
itself).

Note the following:


1. Events may be composite; an event consisting of a single element or outcome is called
elementary event.


2. Let A be an event, a subset of S. We say the event A occurs if the outcome of the random
experiment is in A.
3. The event S is called the sure or certain event.

Example 1.4 The experiment consists of tossing a die and counting the number of dots facing
up. The sample space is defined to be S = {1, 2, ..., 6}. Consider the following subsets of S:
A1 = {1, 2, 3},    A2 = {2, 4, 6},    A3 = {6}.

1. A1 is an event whose occurrence means that the number of dots is less than four.
2. A2 is an event whose occurrence means that the number of dots is even.
3. A3 is an elementary event.
Note that the intersections of A1 and A2 and of A1 and A3 are
A1 ∩ A2 = {2},    A1 ∩ A3 = ∅.
This means that A1 and A3, in contrast to A1 and A2, cannot occur simultaneously. They are called mutually exclusive events.

For each event A in S we associate a number between 0 and 1 that will be called the probability of A. As said, each of us has some intuition about what probability is, but that is irrelevant at this time (see below for possible interpretations of probabilities). What we're after right now is an axiomatic notion of probability. For this purpose we will use an appropriate set function, say P(·), with the set of all events as the domain of P(·).
It seems natural to define the domain of P (and hence the collection of all events) as the collection of all subsets of S. This is quite acceptable for finite and even countably infinite sample spaces. However, for uncountable sample spaces a technical problem arises: certain subsets of S will not be considered as events because it is impossible to assign probability to them in a consistent manner². This issue is addressed in the definition of an event; it implies that every event is a subset of S but does not say that every subset of S is an event! We thus need...

Definition 1.3 (Event space) The set of all events in the sample space S is called the event
space Y .
² Subsets of S that cannot be events are so complicated that they are irrelevant for all practical purposes.



The definitions of events and the event space as the set of all events introduced above do not indicate which subsets of S are events belonging to the event space Y to which we want to assign probability. For this reason we'll work with event spaces having a certain, tractable structure. We thus use a collection of subsets of S which represents a so-called sigma algebra in S as our event space Y.
Definition 1.4 (Sigma algebra) A collection of subsets of S is called a sigma algebra, denoted by B, if it satisfies the following conditions:
1. ∅ ∈ B (the empty set is an element of B);
2. If A ∈ B, then Aᶜ ∈ B (B is closed under complementation);
3. If A1, A2, ... ∈ B, then ∪_{i=1}^∞ Ai ∈ B (B is closed under countable unions).
Remark 1.2 Property (i) states that the empty set is always in a sigma algebra. Since S is the complement of ∅, properties (i) and (ii) imply that S is always in a sigma algebra as well. Hence, by using a sigma algebra in S as our event space we make sure that it contains the certain event.

Remark 1.3 An event space with property (ii) ensures the following: If A is an event (to which we can assign a certain probability), then Aᶜ is also an event, so that we can assign a probability that A does not occur.

Remark 1.4 An event space with property (iii) ensures the following: If A1, A2, ... are events, then ∪_{i=1}^∞ Ai is also an event.
Remark 1.5 Finally, by using DeMorgan's law we obtain from properties (ii) and (iii) that ∩_{i=1}^∞ Ai ∈ B (B is also closed under countable intersections). This result is obtained as follows: If A1, A2, ... ∈ B, then it follows from property (ii) that A1ᶜ, A2ᶜ, ... ∈ B and from property (iii) that ∪_{i=1}^∞ Aiᶜ ∈ B. According to property (ii) it follows that (∪_{i=1}^∞ Aiᶜ)ᶜ ∈ B, and by DeMorgan's law ((A ∪ B)ᶜ = Aᶜ ∩ Bᶜ) we have (∪_{i=1}^∞ Aiᶜ)ᶜ = ∩_{i=1}^∞ Ai ∈ B.
Associated with the sample space S we can have many different sigma algebras. For example, the collection of the two sets {∅, S} is a sigma algebra B in S.
A typical sigma algebra used as event space Y if the sample space S is finite or countable is
B = {all subsets of S, including ∅ and S}.
Note that if S has n elements there are 2^n sets in B.


Example 1.5 If S = {1, 2, 3}, then the sigma algebra consisting of all subsets of S is the following collection of 2³ = 8 sets:
{1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}, ∅.
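As a quick illustrative check (not part of the original notes), the following Python sketch enumerates the power set of S = {1, 2, 3} and verifies the three closure properties of Definition 1.4:

```python
from itertools import combinations

# Power set of S = {1, 2, 3}: the largest sigma algebra on a finite sample space.
S = frozenset({1, 2, 3})
B = {frozenset(c) for r in range(len(S) + 1) for c in combinations(S, r)}

assert len(B) == 2 ** len(S)                  # 2^3 = 8 sets, as in Example 1.5
assert frozenset() in B                       # property 1: contains the empty set
assert all(S - A in B for A in B)             # property 2: closed under complementation
assert all(A | C in B for A in B for C in B)  # property 3: closed under (finite) unions
```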

As mentioned above, if S is uncountable a sigma algebra containing all subsets of S cannot be used as an event space.
A typical sigma algebra used as event space Y if the sample space is an interval on the real line, i.e., S ⊆ R, is given by the B containing all closed, open and half-open intervals
[a, b], (a, b], [a, b), (a, b),    a, b ∈ S,
as well as all sets that can be formed by taking (possibly countably infinite) unions and intersections of these intervals.³

1.2. Probability
Having defined the sample space S and the event space Y of an experiment, we are now in a
position to define probability. Before we discuss the axiomatic probability definition used in probability theory, we consider the most important non-axiomatic probability definitions. Three of them emerged in the course of the development of probability theory: classical probability, relative frequency probability, and subjective probability.

Non-axiomatic Probability Definitions


Definition 1.5 (Classical probability) Let S be the finite sample space for an experiment having N(S)⁴ equally likely outcomes, and let A ⊆ S be an event containing N(A) elements. Then the probability of the event A, denoted by P(A), is given by P(A) = N(A)/N(S) (the relative size of the event set A).
This probability concept was introduced by the French mathematician Pierre-Simon Laplace.
According to this definition probabilities are images of sets generated by a set function, P , with
a domain consisting of all subsets of a finite sample space S and with a range given by the
interval [0, 1].
³ This special sigma algebra is usually referred to as the collection of Borel sets (see, e.g., Mittelhammer, 1996, p. 21).
⁴ N(·) denotes the size-of-set function assigning to a set A the number of elements that are in A.



Example 1.6 The experiment consists of tossing a fair die and counting the number of dots
facing up. The sample space with equally likely outcomes is S = {1, 2, ..., 6}. We have N (S) = 6.
Let Ei (i = 1, ..., 6) denote the elementary events in S. According to the classical definition we
have
P(Ei) = N(Ei)/N(S) = 1/6,    and    P(S) = N(S)/N(S) = 1 (probability of the certain event).

For the event A = {1, 2, 3} we obtain

P(A) = N(A)/N(S) = 3/6 = 1/2.
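The computation above is mechanical enough to script. A minimal Python sketch using exact rational arithmetic (the function name `classical_probability` is my own, not from the notes):

```python
from fractions import Fraction

def classical_probability(A, S):
    """Classical (Laplace) probability: the relative size of the event set A."""
    assert A <= S, "an event must be a subset of the sample space"
    return Fraction(len(A), len(S))

S = set(range(1, 7))                                       # fair die: S = {1, ..., 6}
assert classical_probability({1}, S) == Fraction(1, 6)     # an elementary event
assert classical_probability({1, 2, 3}, S) == Fraction(1, 2)
assert classical_probability(S, S) == 1                    # the certain event
```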

Remark 1.6 The classical definition has two major drawbacks.
First, its use requires that the sample space is finite. Note that for infinite sample spaces we have N(S) = ∞ and possibly N(A) = ∞.
Second, its use requires that the outcomes of the experiment be equally likely. Hence, the classical definition cannot be used in an experiment consisting, e.g., of tossing an unfair die.

A probability definition which does not suffer from these limitations is the relative frequency
probability which is defined as follows:

Definition 1.6 (Relative frequency probability) Let n be the number of times that an experiment is repeatedly performed under identical conditions. Let A be an event in the sample space S, and define nA to be the number of times in n repetitions of the experiment that the event A occurs. Then the probability of the event A is given by the limit of the relative frequency nA/n, as P(A) = lim_{n→∞} nA/n.
According to this definition the probability of an event A is the image of A generated by a set function P, where the image is defined as the limiting fraction of the number of outcomes in a sequence of experiments that are observed to be elements of A. Note that the range of the set function is the interval [0, 1], since 0 ≤ nA ≤ n.

Example 1.7 The experiment consists of tossing a coin with S = {head, tail}. The coin was tossed various numbers of times, with the following results:

n (No. of tosses)   n_head (No. of heads)   n_head/n (Rel. freq.)
100                 48                      .4800
500                 259                     .5180
5000                2,509                   .5018

It would appear that lim_{n→∞} (n_head/n) = 1/2.
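The limiting behaviour of nA/n can also be illustrated by simulation. A small Python sketch (the seed and the toss counts are arbitrary choices of mine):

```python
import random

random.seed(1)  # fixed seed so the run is reproducible

def relative_frequency(n, p_head=0.5):
    """Fraction of heads observed in n independent tosses of a (possibly biased) coin."""
    heads = sum(random.random() < p_head for _ in range(n))
    return heads / n

for n in (100, 500, 5_000, 500_000):
    print(n, relative_frequency(n))  # the relative frequencies drift towards 1/2
```

The weak law of large numbers (Chapter 5) makes precise the sense in which these fractions concentrate around 1/2 as n grows.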
Remark 1.7 The frequency definition allows, in contrast to the classical definition, for an infinite sample space S as well as for outcomes which are not equally likely. However, the frequency definition has the following drawbacks:
First, while for many types of experiments nA/n will converge to a limiting value (such as in our coin-tossing example), we cannot exclude situations where the limit of nA/n does not exist.
Second, how could we ever observe the limiting value if an infinite number of repetitions of the experiment is required?
A third approach to defining probability is the subjective probability which is defined as follows:
Definition 1.7 (Subjective probability) The subjective probability of an event A is a real
number, P (A), in the interval [0, 1], chosen to express the degree of personal belief in the
likelihood of occurrence or validity of event A, the number 1 being associated with certainty.
Like the classical and frequency definitions of probability, subjective probabilities can be interpreted as images of set functions. However, P(A) as the image of A can obviously vary depending on who is assigning the probability.⁵
Remark 1.8 The subjective probability definition has the following properties: Unlike the relative frequency approach, subjective probabilities can be defined for experiments that cannot be repeated. For example, consider the event that the social democrats will win the next election. This does not fit into the frequency definition of probability, since one can observe the outcome of that election only once.
Remark 1.9 Often we are interested in the true likelihood of an event and not in our personal
perceptions. For example, if we consider some game of chance (such as a lottery or roulette)
we are typically interested in the loss probability as a result of the particular construction of the
game.
⁵ Subjective probability plays a crucial role in Bayesian statistics and decision theory.


Axiomatic Probability Definition


In this subsection we consider the axiomatic definition of probability, which is the foundation of modern probability theory and of modern statistics. This axiomatic definition was introduced by the Russian mathematician Andrey N. Kolmogorov.⁶ It consists of a set of axioms defining desirable mathematical properties of the measure (P) which we use in order to assign probabilities to events.
The role of probability theory is to allow one to work with probabilities, whatever they may stand for, and not to impose a certain interpretation. As we shall see, the axiomatic definition of probability is general enough to accommodate all the non-axiomatic concepts discussed above.

Definition 1.8 (Probability function) Given a sample space S and an associated event space Y (a sigma algebra on S), a probability (set) function is a set function P with domain Y that satisfies the following axioms:
1. (Axiom 1) P(A) ≥ 0 for all A ∈ Y (non-negativity).
2. (Axiom 2) P(S) = 1 (standardization).
3. (Axiom 3) If A1, A2, ... ∈ Y is a sequence of disjoint events (that is, Ai ∩ Aj = ∅ for i ≠ j; i, j = 1, 2, ...), then P(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai) (countable additivity).
Remark 1.10 This definition tells us which set functions can be used as probability set functions to assign probabilities; it does not tell us what value the probability set function P assigns
to a given event and it makes no attempt to tell what particular set function P to choose. For
any sample space S many different probability functions can be defined.
Example 1.8 Let S = {1, 2, ..., 6} be the sample space for rolling a fair die and observing the number of dots facing up. The set function
P(A) = N(A)/6    for A ⊆ S
(where N(A) is the size of set A) represents a probability set function on the events of S. We can verify this by noting that
1. the value of the function P(A) ≥ 0 for all A ⊆ S (non-negativity);
2. the value of the function for the set S is P(S) = N(S)/6 = 1 (standardization);
⁶ See Kolmogorov, A.N. (1956), Foundations of the Theory of Probability, New York: Chelsea; the original German version (Grundbegriffe der Wahrscheinlichkeitsrechnung) appeared in 1933.



3. for any collection of disjoint sets A1, A2, ..., An we have
P(∪_{i=1}^n Ai) = N(∪_{i=1}^n Ai)/6 = Σ_{i=1}^n N(Ai)/6 = Σ_{i=1}^n P(Ai) (additivity).

Example 1.9 Let the sample space be S = {1, 2, ...} = N and consider the set function
P(A) = Σ_{x∈A} (1/2)^x    for A ⊆ S.
This set function represents a probability set function since
1. the value of the function P(A) ≥ 0 for all A ⊆ S, because P is defined as a sum of non-negative numbers (non-negativity);
2. the value of the function for the set S is
P(S) = Σ_{x∈S} (1/2)^x = Σ_{x=1}^∞ (1/2)^x = Σ_{x=0}^∞ (1/2)^x − 1 = 1/(1 − 1/2) − 1 = 2 − 1 = 1
(infinite geometric series), so P(S) = 1 (standardization);
3. for any collection of disjoint sets A1, A2, ..., An we have
P(∪_{i=1}^n Ai) = Σ_{x ∈ ∪_{i=1}^n Ai} (1/2)^x = Σ_{i=1}^n Σ_{x∈Ai} (1/2)^x = Σ_{i=1}^n P(Ai) (additivity).
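The standardization and additivity claims of Example 1.9 can be checked numerically; the truncation point 200 below is an arbitrary choice of mine (the neglected tail is of order 2^(-200)):

```python
def P(A):
    """The set function of Example 1.9: P(A) = sum over x in A of (1/2)**x."""
    return sum(0.5 ** x for x in A)

# Standardization: the geometric series over S = {1, 2, ...} sums to 1.
assert abs(P(range(1, 200)) - 1.0) < 1e-12

# Additivity for two disjoint events:
A1, A2 = {1, 3}, {2, 6}
assert abs(P(A1 | A2) - (P(A1) + P(A2))) < 1e-15
```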

Example 1.10 Let S = [0, ∞) be the sample space for an experiment consisting of observing the length of life of a light bulb and consider the set function
P(A) = ∫_{x∈A} (1/2) e^{−x/2} dx    for A ∈ Y.
This set function represents a probability set function since
1. the value of the function P(A) ≥ 0 for all A ⊆ S, because P is defined as an integral with a non-negative integrand (non-negativity);
2. the value of the function for the set S is
P(S) = ∫_{x∈S} (1/2) e^{−x/2} dx = ∫_0^∞ (1/2) e^{−x/2} dx = 1 (standardization);
3. for any collection of disjoint sets A1, A2, ..., An we have
P(∪_{i=1}^n Ai) = ∫_{x ∈ ∪_{i=1}^n Ai} (1/2) e^{−x/2} dx = Σ_{i=1}^n ∫_{x∈Ai} (1/2) e^{−x/2} dx = Σ_{i=1}^n P(Ai)
(the Ai's are nonoverlapping intervals: additivity property of Riemann integrals), which gives additivity.
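The integrals of Example 1.10 can likewise be approximated numerically; the midpoint rule, the step count, and the cut-off at x = 60 below are arbitrary implementation choices of mine:

```python
from math import exp

def P(a, b, steps=200_000):
    """Approximate P([a, b]) = integral over [a, b] of (1/2) e^(-x/2) dx (midpoint rule)."""
    h = (b - a) / steps
    return sum(0.5 * exp(-(a + (i + 0.5) * h) / 2) * h for i in range(steps))

# Standardization: the mass beyond x = 60 is e^(-30), numerically negligible.
assert abs(P(0.0, 60.0) - 1.0) < 1e-6

# Additivity over disjoint intervals, e.g. [0, 1) and [1, 3):
assert abs(P(0.0, 3.0) - (P(0.0, 1.0) + P(1.0, 3.0))) < 1e-9
```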
Once we have defined the 3-tuple {S, Y, P} (called a probability space) for an experiment under consideration, all of the information is established that is needed to assign (and compute) probabilities to the various events of interest.
In probability theory, we know what the probability space looks like and are thus capable of computing the probabilities of interest. In practice, we typically only observe outcomes of random experiments but don't know the details of the model (the probability space generating the outcomes). It is the discovery of the appropriate probability set function P that represents the major challenge in statistical real-life applications. This is the objective of the statistical inference procedures (inferential statistics) to be discussed in the course Advanced Statistics II.
The three axioms governing the behavior of a probability function entail many properties of the probability function. Some of these properties will be discussed in the next subsection.

1.3. Properties of the Probability Function


 

Theorem 1.1 Let A be an event in S. Then P(Aᶜ) = 1 − P(A).

Proof
The sets A and Aᶜ form a partition of S, that is, A ∪ Aᶜ = S with A ∩ Aᶜ = ∅. Therefore
P(S) = P(A ∪ Aᶜ) = P(A) + P(Aᶜ) = 1,
where the second equality follows from Axiom 3 and the third from Axiom 2. Solving the last equation for P(Aᶜ) leads to the desired result. □

Theorem 1.2 It holds that P(∅) = 0.

Proof
Since ∅ = Sᶜ we immediately have
P(∅) = 1 − P(S) = 1 − 1 = 0,
by Theorem 1.1 and Axiom 2. □



Theorem 1.3 Let A and B be events in S such that A ⊆ B. Then P(A) ≤ P(B) and P(B \ A) = P(B) − P(A).

Proof
Since A ⊆ B, we have A ∩ (B \ A) = ∅ and A ∪ (B \ A) = B, and thus, by Axiom 3,
P(B) = P(A) + P(B \ A).
The second result of the theorem follows immediately. Since P(B \ A) ≥ 0 by Axiom 1, we also have P(A) ≤ P(B). □

Theorem 1.4 Let A and B be events in S. Then P(A) = P(A ∩ B) + P(A ∩ Bᶜ).

Proof
The set A can be written as
A = A ∩ S = A ∩ (B ∪ Bᶜ) = (A ∩ B) ∪ (A ∩ Bᶜ)    (intersection is distributive).
Since (A ∩ B) ∩ (A ∩ Bᶜ) = ∅, we have by Axiom 3
P(A) = P[(A ∩ B) ∪ (A ∩ Bᶜ)] = P(A ∩ B) + P(A ∩ Bᶜ). □

Theorem 1.5 Let A and B be events in S. Then P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Proof
The union of A and B can be written as A ∪ B = B ∪ (A ∩ Bᶜ), where B ∩ (A ∩ Bᶜ) = ∅. Therefore
P(A ∪ B) = P[B ∪ (A ∩ Bᶜ)] = P(B) + P(A ∩ Bᶜ)    (Axiom 3)
= P(B) + P(A) − P(A ∩ B)    (Theorem 1.4). □

Corollary 1.1 (Boole's Inequality) It holds that P(A ∪ B) ≤ P(A) + P(B).

Proof
Follows from Theorem 1.5 since P(A ∩ B) ≥ 0. □



Theorem 1.6 Let A be an event in S. Then P(A) ∈ [0, 1].

Proof
The fact that ∅ ⊆ A implies that P(∅) ≤ P(A) (Theorem 1.3), and the fact that A ⊆ S implies that P(A) ≤ P(S) (Theorem 1.3). Since P(S) = 1 and P(∅) = 0, we have 0 ≤ P(A) ≤ 1. □

Theorem 1.7 (Bonferroni's Inequality) Let A and B be events in S. Then P(A ∩ B) ≥ 1 − P(Aᶜ) − P(Bᶜ).

Proof
By Theorem 1.1 we have P(A ∩ B) = 1 − P((A ∩ B)ᶜ). DeMorgan's law states that (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ. Therefore
P(A ∩ B) = 1 − P(Aᶜ ∪ Bᶜ) = 1 − P(Aᶜ) − P(Bᶜ) + P(Aᶜ ∩ Bᶜ)    (Theorem 1.5).
Since P(Aᶜ ∩ Bᶜ) ≥ 0 (Axiom 1), we have P(A ∩ B) ≥ 1 − P(Aᶜ) − P(Bᶜ). □

Theorem 1.8 Let A1, ..., An be events in S. Then P(∪_{i=1}^n Ai) ≤ Σ_{i=1}^n P(Ai) and P(∩_{i=1}^n Ai) ≥ 1 − Σ_{i=1}^n P(Aiᶜ).

Proof
The proposition can be proven by induction using the base case n = 2, for which the statements hold according to Corollary 1.1 and Theorem 1.7 (for further details, see Mittelhammer, 1996, p. 17). □
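These inequalities hold for any probability function; as a sanity check of my own (not from the notes), the following Python sketch spot-checks Boole's and Bonferroni's inequalities under classical probability on a finite sample space, with randomly generated events (sizes and inclusion probabilities are arbitrary):

```python
import random

random.seed(0)                   # arbitrary seed, for reproducibility
OUTCOMES = range(1_000)          # finite sample space with equally likely outcomes

def P(A):
    return len(A) / 1_000        # classical probability (Theorem 1.9)

for _ in range(100):             # 100 randomly generated pairs of events
    A = {x for x in OUTCOMES if random.random() < 0.4}
    B = {x for x in OUTCOMES if random.random() < 0.7}
    # Boole's inequality: P(A u B) <= P(A) + P(B)
    assert P(A | B) <= P(A) + P(B) + 1e-12
    # Bonferroni's inequality: P(A n B) >= 1 - P(A^c) - P(B^c)
    assert P(A & B) >= (1 - (1 - P(A)) - (1 - P(B))) - 1e-12
```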

Theorem 1.9 (Classical probability) Let S be the finite sample space for an experiment having n = N(S) equally likely outcomes, say E1, ..., En, and let A ⊆ S be an event containing N(A) elements. Then the probability of the event A is given by N(A)/N(S).

Proof
Since all outcomes are equally likely with P(E1) = ... = P(En) = k (say), and since S = ∪_{i=1}^n Ei with Ei ∩ Ej = ∅ for i ≠ j, we have by Axioms 2 and 3 that
P(S) = P(∪_{i=1}^n Ei) = Σ_{i=1}^n P(Ei) = nk = 1.
It follows that P(Ei) = k = 1/n. Let I ⊆ {1, ..., n} be the index set identifying the N(A) outcomes that define A, that is, A = ∪_{i∈I} Ei. Then by Axiom 3 we have
P(A) = P(∪_{i∈I} Ei) = Σ_{i∈I} P(Ei) = Σ_{i∈I} 1/n = N(A)/N(S). □

Remark 1.11 Theorem 1.9 states that the classical probability definition is implied by the
axiomatic definition. Thus whenever the conditions of the classical definition (finite S with
equally likely outcomes) apply, we can use the classical definition to assign probabilities to
events.

1.4. Conditional Probability


So far, we have considered probabilities of events on the assumption that no information was available about the experiment other than the sample space S. Sometimes, however, it is known that an event B has happened. The question is then: how can we use this information in making a statement concerning the outcome of another event A, that is, how can we update the probability calculation for the event A based on the information that B has happened?

Example 1.11 The experiment consists of tossing two fair coins. The sample space is S =
{(H,H), (H,T), (T,H), (T,T)} (H= Head, T=Tail). Consider the events
A = {both coins show same face},

B = {at least one coin shows H}.

Then P(A) = 2/4 = 1/2. If B is known to have happened, we know for sure that the outcome (T,T) cannot have happened. This suggests that
P(A conditional on B having happened) = 1/3.

This update of the probability calculation leads to the concept of conditional probability which
is defined as follows.

Definition 1.9 (Conditional probability) Let A and B be any two events in a sample space S. If P(B) ≠ 0, then the conditional probability of event A, given event B, is given by P(A | B) = P(A ∩ B)/P(B).



Note that what happens in the conditional probability calculation is that B becomes the sample space:
P(B | B) = P(B ∩ B)/P(B) = P(B)/P(B) = 1.
The intuition is that the original sample space S has been updated to B (since it is given that B occurs). Note further that since B will occur, it is clear that A occurs iff A occurs concurrently with B, that is, iff A ∩ B occurs. Hence, P(A | B) ∝ P(A ∩ B). The division by P(B) ensures that P(A | B), as defined, represents a probability function.

Remark 1.12 It is clear that the conditional probabilities as defined above are values of a set function. That these are values of a probability set function satisfying Axioms 1–3 is established in the following theorem.

Theorem 1.10 Given a probability space {S, Y, P} and an event B for which P(B) ≠ 0, P(A | B) = P(A ∩ B)/P(B) defines a probability set function with domain Y.

Proof
To prove the theorem we need to show that the set function P(A | B) = P(A ∩ B)/P(B) adheres to Axioms 1–3 of probability on the domain Y.
1. Clearly, P(A | B) ≥ 0 for all A ∈ Y, since P(A ∩ B) ≥ 0 and P(B) > 0.
2. Also, P(S | B) = 1, since P(S | B) = P(S ∩ B)/P(B) = P(B)/P(B).
3. If A1, A2, ... is a disjoint sequence of sets in Y, then
P(∪_{i=1}^∞ Ai | B) = P[(∪_{i=1}^∞ Ai) ∩ B]/P(B)    (by def. of conditional probability)
= P[∪_{i=1}^∞ (Ai ∩ B)]/P(B)    (since ∩ is distributive)
= Σ_{i=1}^∞ P(Ai ∩ B)/P(B)    (since (Ai ∩ B) ∩ (Aj ∩ B) = ∅ for i ≠ j)
= Σ_{i=1}^∞ P(Ai | B)    (by def. of conditional probability). □

Remark 1.13 Since P(A | B) adheres to the probability axioms, all of the properties that we have discussed for unconditional probabilities apply to conditional probabilities as well (see Mittelhammer, 1996, Theorems 1.1c–1.8c).



Example 1.12 The experiment consists of tossing two fair coins. The sample space is S = {(H,H), (H,T), (T,H), (T,T)}. The conditional probability of the event of obtaining two heads,
A = {(H,H)},
given that the first coin toss results in heads,
B = {(H,H), (H,T)},
is
P(A | B) = P(A ∩ B)/P(B) = (1/4)/(1/2) = 1/2    (classical def.).
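Definition 1.9 translates directly into code. The sketch below recomputes Example 1.12 and Example 1.11 with exact rational arithmetic (the helper names `P` and `P_cond` are my own):

```python
from fractions import Fraction
from itertools import product

S = set(product("HT", repeat=2))       # two fair coin tosses, equally likely outcomes

def P(A):
    return Fraction(len(A), len(S))    # classical probability

def P_cond(A, B):
    """P(A | B) = P(A n B) / P(B), defined whenever P(B) != 0."""
    return P(A & B) / P(B)

A = {("H", "H")}                       # two heads
B = {("H", "H"), ("H", "T")}           # first toss results in heads
assert P_cond(A, B) == Fraction(1, 2)  # Example 1.12

same_face = {("H", "H"), ("T", "T")}
at_least_one_H = S - {("T", "T")}
assert P_cond(same_face, at_least_one_H) == Fraction(1, 3)  # Example 1.11
```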

The definition of conditional probability can be transformed to obtain the multiplication rule. It allows one to factorize the joint probability of the events A and B into the conditional probability of event A given event B and the unconditional probability of B.

Theorem 1.11 (Multiplication Rule) Let A and B be any two events in S for which P(B) ≠ 0. Then P(A ∩ B) = P(A | B)P(B).

Proof
The result follows from the definition of conditional probability. 

Example 1.13 A computer manufacturer has quality-control inspectors examine every produced computer. A computer is shipped to a retailer only if it passes inspection. The probability that a computer is defective, say event D, is P(D) = 0.02. The probability that an inspector assigns a pass (event A) to a defective computer (that is, given event D) is P(A | D) = 0.05. The joint probability that a computer is defective (D) and shipped to the retailer (A) is P(A ∩ D) = P(A | D)P(D) = 0.05 · 0.02 = 0.001.

The multiplication rule can be extended to more than two events as follows:

Theorem 1.12 (Extended Multiplication Rule) Let A1, A2, ..., An, n ≥ 2, be events in S. Then, if all of the conditional probabilities exist,
P(∩_{i=1}^n Ai) = P(A1) · P(A2 | A1) · ... · P(An | A_{n−1} ∩ A_{n−2} ∩ ... ∩ A1)
= P(A1) ∏_{i=2}^n P(Ai | ∩_{j=1}^{i−1} Aj).



Proof
Let B = ∩_{i=1}^{n−1} Ai such that P(∩_{i=1}^n Ai) = P(An ∩ B). Hence we have by the multiplication rule for n = 2
P(∩_{i=1}^n Ai) = P(An | B)P(B) = P(An | ∩_{i=1}^{n−1} Ai) · P(∩_{i=1}^{n−1} Ai).
Now consider the last factor of the last equation and let C = ∩_{i=1}^{n−2} Ai such that P(∩_{i=1}^{n−1} Ai) = P(A_{n−1} ∩ C) = P(A_{n−1} | C)P(C). Hence we have
P(∩_{i=1}^n Ai) = P(An | ∩_{i=1}^{n−1} Ai) · P(A_{n−1} | ∩_{i=1}^{n−2} Ai) · P(∩_{i=1}^{n−2} Ai).
Sequentially repeating this factorization of the last factor leads to the result. □
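As an illustration of the extended multiplication rule (a hypothetical card-drawing example of my own, not from the notes): the probability of drawing three aces in a row from a 52-card deck without replacement factorizes as P(A1) · P(A2 | A1) · P(A3 | A1 ∩ A2):

```python
from fractions import Fraction

# P(A1): first card is an ace; P(A2 | A1): second is an ace given the first was;
# P(A3 | A1 n A2): third is an ace given the first two were.
p = Fraction(4, 52) * Fraction(3, 51) * Fraction(2, 50)
assert p == Fraction(1, 5525)
```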

1.5. Independence
Definition 1.10 (Independence of events (2-event case)) Let A and B be two events in S. Then A and B are independent iff P(A ∩ B) = P(A)P(B). If A and B are not independent, A and B are said to be dependent events.
An intuitively appealing interpretation of independence is obtained by considering its implication for conditional probabilities. In particular, independence of A and B implies
P(A | B) = P(A ∩ B)/P(B) = P(A)P(B)/P(B) = P(A), as long as P(B) > 0;
P(B | A) = P(B ∩ A)/P(A) = P(B)P(A)/P(A) = P(B), as long as P(A) > 0.
Thus the probability of event A occurring is unaffected by the occurrence of event B, and vice versa.
Independence of A and B implies independence of the complements as well. In fact we have the following theorem:

Theorem 1.13 If events A and B are independent, then events A and Bᶜ, Aᶜ and B, and Aᶜ and Bᶜ are also independent.
Proof
To establish the independence of A and Bᶜ note that
P(A ∩ Bᶜ) = P(A) − P(A ∩ B)    (Theorem 1.4)
= P(A) − P(A)P(B)    (independence of A and B)
= P(A)[1 − P(B)]
= P(A)P(Bᶜ)    (Theorem 1.1).
The independence of Aᶜ and B is obtained analogously. To establish the independence of Aᶜ and Bᶜ note that
P(Aᶜ ∩ Bᶜ) = P((A ∪ B)ᶜ)    (DeMorgan's law)
= 1 − P(A ∪ B)    (Theorem 1.1)
= 1 − [P(A) + P(B) − P(A ∩ B)]    (Theorem 1.5)
= 1 − P(A) − P(B) + P(A)P(B)    (independence of A and B)
= 1 − P(A) − P(B)[1 − P(A)]
= P(Aᶜ) − P(B)P(Aᶜ)    (Theorem 1.1)
= P(Aᶜ)[1 − P(B)] = P(Aᶜ)P(Bᶜ)    (Theorem 1.1). □

The following example illustrates the concept of independence.


Example 1.14 The work force of a company has the following distribution among type and gender of workers:

Gender     Sales    Clerical    Production    Total
Male         825         675           750    2,250
Female     1,675         825           250    2,750
Total      2,500       1,500         1,000    5,000

The experiment consists of randomly choosing a worker and observing type and gender. Are the event of observing a female (A) and the event of observing a clerical worker (B) independent from the point of view of classical probability? We obtain P(A) = 2,750/5,000 = 0.55 and P(B) = 1,500/5,000 = 0.30, with P(A)P(B) = 0.165. Also, P(A ∩ B) = 825/5,000 = 0.165. Hence A and B are independent.
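The independence check in Example 1.14 can be reproduced directly from the table counts; the sketch below uses exact fractions so that the product test is not blurred by rounding:

```python
from fractions import Fraction

# Counts from the work-force table: (gender, worker type) -> head count.
counts = {
    ("male",   "sales"):  825, ("male",   "clerical"): 675, ("male",   "production"): 750,
    ("female", "sales"): 1675, ("female", "clerical"): 825, ("female", "production"): 250,
}
total = sum(counts.values())  # 5000 workers in all

# Classical probabilities of the two events and of their intersection.
p_female   = Fraction(sum(v for (g, t), v in counts.items() if g == "female"), total)
p_clerical = Fraction(sum(v for (g, t), v in counts.items() if t == "clerical"), total)
p_joint    = Fraction(counts[("female", "clerical")], total)

print(p_female, p_clerical, p_joint)    # 11/20 3/10 33/200
print(p_joint == p_female * p_clerical) # True -> A and B are independent
```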
The concept of independent events can be generalized to more than two events as follows:
Definition 1.11 (Independence of events (n-event case)) Let A1, A2, . . . , An be events in the sample space S. The events A1, A2, . . . , An are (jointly) independent iff

P(∩_{j∈J} A_j) = ∏_{j∈J} P(A_j),   for all subsets J ⊆ {1, 2, . . . , n} for which N(J) ≥ 2.

If the events A1, A2, . . . , An are not independent, they are said to be dependent events.

Remark 1.14 Note that this definition requires that the joint probability of any subcollection of events from A1, A2, . . . , An can be factorized. In the case of n = 3 events, joint independence requires

P(A1 ∩ A2) = P(A1)P(A2),  P(A1 ∩ A3) = P(A1)P(A3),  P(A2 ∩ A3) = P(A2)P(A3)

(pairwise independence) and

P(A1 ∩ A2 ∩ A3) = P(A1)P(A2)P(A3).

The following example illustrates that pairwise independence between all pairs of events (Ai , Aj )
is not sufficient for joint independence of A1 , A2 , . . . , An .
Example 1.15 Let the sample space S consist of all permutations of the letters a, b, c along with three triples of each letter, that is,

S = {aaa, bbb, ccc, abc, bca, cba, acb, bac, cab}.

Furthermore, let each element of S have probability 1/9. Consider the events

Ai = {ith place in the triple is occupied by a}.

According to classical probability we obtain, for all i = 1, 2, 3,

P(Ai) = 3/9 = 1/3,   and   P(A1 ∩ A2) = P(A1 ∩ A3) = P(A2 ∩ A3) = 1/9,

so A1, A2, A3 are pairwise independent. But they are not jointly independent, since

P(A1 ∩ A2 ∩ A3) = 1/9 ≠ P(A1)P(A2)P(A3) = 1/27.
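The gap between pairwise and joint independence in Example 1.15 is easy to verify by enumeration; the following sketch checks every pair and then the triple:

```python
from fractions import Fraction

# Sample space of Example 1.15: nine equally likely letter triples.
S = ["aaa", "bbb", "ccc", "abc", "bca", "cba", "acb", "bac", "cab"]

def prob(event):
    return Fraction(len(event), len(S))

# A[i] = "the (i+1)-th place in the triple is occupied by the letter a"
A = [{w for w in S if w[i] == "a"} for i in range(3)]

# Pairwise independence holds for every pair ...
pairwise = all(prob(A[i] & A[j]) == prob(A[i]) * prob(A[j])
               for i in range(3) for j in range(i + 1, 3))

# ... but the factorization of the triple intersection fails: 1/9 != 1/27.
joint = prob(A[0] & A[1] & A[2]) == prob(A[0]) * prob(A[1]) * prob(A[2])
print(pairwise, joint)  # True False
```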

1.6. Total Probability Rule and Bayes' Law


Bayes' rule (or law) provides an alternative representation of conditional probabilities. This representation provides the means for reevaluating the probability of an event B, given the additional information that event A occurs. By this rule the probability of the event B is, in effect, updated in light of the additional information provided by the occurrence of event A. This rule was discovered by the English clergyman and mathematician Thomas Bayes.
Bayes' rule is a simple consequence of the total probability rule established in the following theorem:

Theorem 1.14 (Law of Total Probability) Let the events Bi, i ∈ I, be a finite or countably infinite partition of S, so that Bj ∩ Bk = ∅ for j ≠ k, and ∪_{i∈I} Bi = S. Let P(Bi) > 0 ∀i. Then the total probability of event A is

P(A) = Σ_{i∈I} P(A | Bi) P(Bi).

Proof
First note that

A = A ∩ S = A ∩ (∪_{i∈I} Bi) = ∪_{i∈I} (A ∩ Bi)   (since ∩ is distributive over ∪).

From Bj ∩ Bk = ∅ it follows that (A ∩ Bj) ∩ (A ∩ Bk) = ∅ for all j ≠ k. By Axiom 3 and the multiplication rule we have

P(A) = P[∪_{i∈I} (A ∩ Bi)]
     = Σ_{i∈I} P(A ∩ Bi)   (by Axiom 3)
     = Σ_{i∈I} P(A | Bi) P(Bi)   (by multiplication rule). □

Note that the total probability rule states that the total (unconditional) probability of an event
A can be represented by the sum of conditional probabilities given the events Bi weighted by
the unconditional probabilities P (Bi ).
Bayes' rule is obtained as the following corollary to the Law of Total Probability:

Corollary 1.2 (Bayes' Law) Let the events Bi, i ∈ I, be a finite or countably infinite partition of S, so that Bj ∩ Bk = ∅ for j ≠ k and ∪_{i∈I} Bi = S. Let P(Bi) > 0 ∀i ∈ I. Then, provided P(A) ≠ 0,

P(Bj | A) = P(A | Bj) P(Bj) / Σ_{i∈I} P(A | Bi) P(Bi),   ∀j ∈ I.

Proof
By the definition of the conditional probability for Bj | A, the multiplication rule, and the total probability rule we immediately have

P(Bj | A) = P(Bj ∩ A)/P(A) = P(A | Bj) P(Bj) / Σ_{i∈I} P(A | Bi) P(Bi). □


Remark 1.15 In the case of two events with S = B ∪ B̄, Bayes' rule implies

P(B | A) = P(A | B) P(B) / [P(A | B) P(B) + P(A | B̄) P(B̄)],

which is obtained by setting I = {1, 2}.

Example 1.16 Consider a blood test for detecting a certain disease. Let A be the event that the test result is positive and B be the event that the individual has the disease. The test detects the disease with probability 0.98 given that the disease is, in fact, present in the individual being tested, that is, P(A | B) = 0.98. The test yields a false positive result for 1 percent of healthy individuals, that is, P(A | B̄) = 0.01. 0.1 percent of the population has the disease, which implies that P(B) = 0.001. What is the probability that a randomly chosen person to be tested actually has the disease if the test result is positive?
The application of Bayes' rule yields

P(B | A) = P(A | B) P(B) / [P(A | B) P(B) + P(A | B̄) P(B̄)] = (0.98 · 0.001) / (0.98 · 0.001 + 0.01 · 0.999) = 0.089,

where P(B̄) = 1 − P(B) = 0.999.

Hence, Bayes' rule provides the means for updating the (unconditional) probability of the event B, given the information signal that the event A occurs.
Bayes' rule is general enough to lead to an inferential principle, giving rise to so-called Bayesian statistics. More about this in Advanced Statistics II.
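The blood-test computation can be packaged as a small function implementing the two-event form of Bayes' rule from Remark 1.15; the numbers below are exactly those of Example 1.16:

```python
# Two-event Bayes' rule: P(B | A) = P(A|B)P(B) / [P(A|B)P(B) + P(A|not-B)P(not-B)].
def bayes_posterior(prior, like, like_complement):
    """Posterior P(B | A) given prior P(B), P(A | B), and P(A | not-B)."""
    evidence = like * prior + like_complement * (1.0 - prior)  # total probability P(A)
    return like * prior / evidence

p_disease   = 0.001  # P(B): prevalence of the disease
p_pos_dis   = 0.98   # P(A | B): detection probability
p_pos_nodis = 0.01   # P(A | not-B): false-positive rate

posterior = bayes_posterior(p_disease, p_pos_dis, p_pos_nodis)
print(round(posterior, 3))  # 0.089
```

Despite the accurate test, the posterior is small because the disease is rare: the 1-percent false positives among the 99.9 percent healthy swamp the true positives.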

2. Random variables and their probability distributions
2.1. Univariate Random Variables
Although we are able to define probability for any event space, in most experiments it is often
easier to deal with a summary variable (i.e. with numbers) than with the original probability
structure.

Example 2.1 Consider the experiment of tossing a coin 50 times. Typically, we are not interested in knowing which of the 2^50 possible 50-tuples in sample space S has occurred. Rather, we would like to know the number of heads in 50 tosses, which can be defined as a variable X. Note that the sample space of X is the set {0, 1, 2, ..., 50}, which is much easier to deal with than the original sample space S. By defining the variable X, we have defined a function from the original sample space S to a new sample space.

Definition 2.1 (Univariate random variable) Let {S, Y, P} be a probability space. If X : S → R (or simply, X) is a real-valued function having as its domain the elements of S, then X : S → R (or X) is called a random variable.

Example 2.2 In some experiments random variables are implicitly used; examples are:

Experiment               Random variable
Toss two dice            X = sum of the numbers
Toss a coin 50 times     X = number of heads in 50 tosses
Toss a coin 50 times     X = squared number of heads in 50 tosses

A few words on notation:


1. X(w) denotes the image of w ∈ S generated by the random variable X : S → R.
2. x = X(w) denotes a (realized) value of the function X.

Uppercase letters (X) will be used to denote random variables, and corresponding lowercase letters (x) will denote the realized values.¹
By defining a random variable, we have also defined a new sample space, namely, the range
of the random variable. This range, denoted by R(X), is the set of all x values which can be
generated on the sample space S using the function X:
R(X) = {x : x = X(w), w ∈ S}.
This raises the following important questions. First, how can we embed the new sample space
R(X) within a probability space that can be used for assigning probabilities to events in terms
of random-variable outcomes? Second, what is the probability function on R(X), say PX ?
The answer is given by the induced probability function. Suppose we have a discrete sample space S = {w1, ..., wn} with a probability function P. Now define a random variable X(w) with range R(X) = {x1, ..., xm}.
Assume that we observe X = xi iff the experiment's outcome is wj such that xi = X(wj). Since the elementary event wj ∈ S is equivalent to the event xi ∈ R(X), both events should have the same probability. Thus

PX(X = xi) = P({wj : xi = X(wj), wj ∈ S}).
Note that the function PX on the left-hand side is an induced probability set function on R(X)
defined in terms of the original function P .

Example 2.3 Consider the experiment of tossing a fair coin two times. Define the random variable X to be the number of heads in the two tosses. Thus

Experiment's outcome w ∈ S          (H,H)   (H,T)   (T,H)   (T,T)
Variable's realization x = X(w)         2       1       1       0

The random variable's range is R(X) = {0, 1, 2}. Since, for example, PX(X = 1) = P({(H,T)}) + P({(T,H)}), the induced probability function on R(X) is obtained as

x              0      1      2
PX(X = x)    1/4    1/2    1/4

¹ As with any rule, there are exceptions: econometrics, time series analysis...
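The induced probability function of Example 2.3 can be computed mechanically by pushing the probabilities of the elementary outcomes through X; a minimal sketch:

```python
from fractions import Fraction
from itertools import product
from collections import Counter

# All 4 equally likely outcomes of two coin tosses.
S = list(product("HT", repeat=2))

# Random variable X(w) = number of heads; count how many outcomes map to each value x.
counts = Counter(w.count("H") for w in S)

# Induced probability function P_X on R(X) = {0, 1, 2}.
P_X = {x: Fraction(n, len(S)) for x, n in sorted(counts.items())}
print(P_X)  # {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
```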

In practice, it is useful to have a representation of the induced probability set function, PX, in a compact closed-form formula. This leads us to the definition of a probability density function. With every random variable, we associate a density function. Random variables can be either discrete or continuous.² In the following, we first consider discrete random variables and their probability density functions.

Definition 2.2 (Discrete random variable) A random variable X is called discrete iff its
range R(X) is countable.

Definition 2.3 (Discrete probability density function) The discrete probability density function (pdf) of a discrete random variable X, denoted by f, is defined by f : R → [0, 1] with

f(x) = PX(X = x) if x ∈ R(X),   and   f(x) = 0 else.

Remark 2.1 Even though the range R(X) of a discrete random variable consists of a countable number of elements, the domain of the pdf is the entire uncountable real line R. However, the value of f(x) is set to zero at all points x ∉ R(X). This definition is adopted for the sake of convenience: it standardizes the domain for all random variables to be R.

Remark 2.2 Quite often, the pdf of a discrete random variable is called a probability mass function. The reason is that, unlike for continuous distributions, probability amasses at discrete points. (The consequence is that these points have nonzero probability measure.) The interpretation is the same, namely the pmf gives the probability of observing x as outcome.
² Combinations are also possible, but not encountered too often.
Remark 2.3 The pdf allows us to obtain the probability for an event in R(X). Consider the event A ⊆ R(X), written as a union of elementary events A = ∪_{x∈A} {x}. Since elementary events are disjoint, we know from Axiom 1.3 that

PX(A) = PX(∪_{x∈A} {x}) = Σ_{x∈A} PX(x) = Σ_{x∈A} f(x)   (Ax. 3).

Thus, we can use the pdf to calculate probabilities for events on R(X) by summing the probabilities of the elementary events given by the pdf.

Example 2.4 Consider the experiment of tossing two fair dice and observing the number of dots facing up. The sample space is S = {(i, j) : i = 1, . . . , 6; j = 1, . . . , 6}, where i, j are the numbers of dots. S consists of 36 elementary events. Define a random variable X to be the sum of the dots, such that x = X((i, j)) = i + j. We obtain the following correspondence between outcomes of X and events in S:

x = X((i, j))    Bx = {(i, j) : x = i + j, (i, j) ∈ S}    PX(x) = f(x) = P(Bx)
2                {(1, 1)}                                 1/36
3                {(1, 2), (2, 1)}                         2/36
4                {(1, 3), (2, 2), (3, 1)}                 3/36
...              ...                                      ...
12               {(6, 6)}                                 1/36

The range is R(X) = {2, 3, . . . , 12}. Consider the event A = {3, 4}. The probability is obtained as PX(A) = Σ_{x∈A} f(x) = f(3) + f(4) = 5/36. A compact algebraic form for the pdf f is

f(x) = [(6 − |x − 7|)/36] I_{2,3,...,12}(x).
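The compact closed form for the two-dice pdf can be validated against brute-force enumeration of the 36 elementary events; a minimal sketch:

```python
from fractions import Fraction
from itertools import product
from collections import Counter

# pdf of X = sum of two dice, computed by enumeration ...
S = list(product(range(1, 7), repeat=2))
counts = Counter(i + j for i, j in S)
f_enum = {x: Fraction(n, 36) for x, n in counts.items()}

# ... and via the compact closed form f(x) = (6 - |x - 7|)/36 on {2, ..., 12}.
def f(x):
    return Fraction(6 - abs(x - 7), 36) if 2 <= x <= 12 else Fraction(0)

print(all(f(x) == f_enum[x] for x in range(2, 13)))  # True
print(f(3) + f(4))  # 5/36, the probability of the event A = {3, 4}
```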

If the range R(X) is continuous with events A defined as intervals on R(X), the summation operation over the elements in A (i.e. Σ_{x∈A}) is not defined. Thus, defining a probability set function on events in R(X) as PX(A) = Σ_{x∈A} f(x) will not be possible! For this reason, we substitute the summation operation Σ_{x∈A} by the integration operation ∫_{x∈A}. This leads us to the following definition of a continuous probability density function:

Definition 2.4 (Continuous probability density function) A random variable X is called continuous iff
- its range R(X) is uncountably infinite, and
- there exists a function f : R → [0, ∞) such that ∀A,

PX(A) = ∫_{x∈A} f(x)dx   and   f(x) = 0 ∀x ∉ R(X).

The function f is called a continuous probability density function.

Example 2.5 Consider a Formula 1 circuit of 10 km. Suppose that accidents are equally likely to occur at each point of the circuit. Define the continuous random variable X to be the point of a potential accident, with range R(X) = [0, 10]. In order to obtain the pdf for X, consider the event A of an accident between two points a and b, such that A = [a, b]. Since all points are equally likely, we obtain

PX(A) = (length of A)/(length of R(X)) = (b − a)/10.

According to the definition, the pdf f for X has to satisfy

∫_{x∈A} f(x)dx = ∫_a^b f(x)dx = PX(A) = (b − a)/10,   0 ≤ a ≤ b ≤ 10,

with

∂[∫_a^b f(x)dx]/∂b = f(b) = ∂[(b − a)/10]/∂b = 1/10,   ∀b ∈ [0, 10].

Hence, the function

f(x) = (1/10) I_[0,10](x)

can be used as a pdf for X, and for any event A on R(X) we obtain PX(A) = ∫_{x∈A} (1/10)dx. For example, the probability for A = [0, 5] is PX(A) = ∫_0^5 (1/10)dx = 1/2.
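The integrals of Example 2.5 can be approximated numerically; the sketch below uses a simple midpoint rule (an approximation, exact here only because the integrand is constant up to floating-point error):

```python
# Uniform pdf on the 10 km circuit: f(x) = 1/10 on [0, 10], 0 elsewhere.
def f(x):
    return 0.1 if 0.0 <= x <= 10.0 else 0.0

def integrate(g, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return sum(g(a + (k + 0.5) * h) for k in range(n)) * h

print(round(integrate(f, 0.0, 5.0), 6))   # 0.5 -> P_X([0, 5]) = 1/2
print(round(integrate(f, 0.0, 10.0), 6))  # 1.0 -> f integrates to one
```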

Remark 2.4 The definition of the continuous pdf implies that the probability for an elementary event A = {a} is zero, since

PX(A) = ∫_a^a f(x)dx = 0.

But this does not mean that the event A is impossible! Instead, we might interpret this to mean that A is relatively impossible, relative to all other outcomes that can occur in R(X)∖A. Note the difference between zero probability and impossibility in the continuous case. Any number in the range of a continuous random variable has zero probability of occurring; yet some number is always realized.

Remark 2.5 Consider the sets [a, b], (a, b], [a, b), (a, b) and note that

[a, b] = (a, b] ∪ {a} = [a, b) ∪ {b} = (a, b) ∪ {a} ∪ {b}.

Since the sets are disjoint and since PX({a}) = PX({b}) = 0, probability Axiom 1.3 implies that

PX([a, b]) = PX((a, b]) = PX([a, b)) = PX((a, b)) = ∫_a^b f(x)dx.

The interpretation of the function value of a continuous pdf f(x) is fundamentally different from that of a discrete pdf:
- If f is discrete, f(x) = PX(x) = probability of the outcome x.
- If f is continuous, f(x) is not the probability of outcome x, which is PX(x) = 0. Note that if f(x) were a probability, we would have f(x) = 0 ∀x.

Remark 2.6 Sometimes, generalized functions are used to circumvent this problem and have a unified notion of the density function for both continuous and discrete random variables. The Dirac δ function plays an important role in this approach, being defined as δ(x) = 0 ∀x ≠ 0 and ∫_{−∞}^{∞} δ(x)dx = 1. Then, after conveniently redefining f to include the term P(X = a) δ(x − a), one can integrate over a neighborhood of a such that ∫ f(x)dx = P(X = a) ≠ 0.

An important task of statistical inference is the identification of an appropriate function f which


can be used as a pdf representing the stochastic behavior of a random variable. Note that the
selected f should ensure that the probabilities obtained from f adhere to the probability axioms.
The following definition identifies the restrictions on admissible choices of f :

Definition 2.5 (Class of discrete pdfs) The function f : R → R is a member of the class of discrete pdfs iff
(i_a) the set C = {x : f(x) > 0, x ∈ R} is countable;
(ii_a) f(x) = 0 ∀x ∉ C;
(iii_a) Σ_{x∈C} f(x) = 1.

Definition 2.6 (Class of continuous pdfs) The function f : R → R is a member of the class of continuous pdfs iff
(i_b) f(x) ≥ 0 ∀x ∈ R;
(ii_b) ∫_{x∈R} f(x)dx = 1.

Remark 2.7 The definition tells us whether a particular function f can be used as a pdf or not. The conditions (i_a)-(iii_a) and (i_b),(ii_b), respectively, ensure that the corresponding set functions used to compute probabilities, namely,

PX(A) = Σ_{x∈A} f(x)   and   PX(A) = ∫_{x∈A} f(x)dx,

are in fact probability set functions which adhere to the probability axioms. As to the sufficiency of the conditions (i_a)-(iii_a) for the resulting PX to satisfy the axioms in the discrete case:
- (i_a), (ii_a) imply that f(x) ≥ 0 ∀x, so that PX(A) = Σ_{x∈A} f(x) ≥ 0 ∀ events A (Ax. 1).
- (i_a), (ii_a) imply that we can set C = R(X), such that together with (iii_a), PX(R(X)) = Σ_{x∈R(X)} f(x) = 1 (Ax. 2).
- Let {Ai, i ∈ I} be a collection of disjoint events. Then the set function used to compute the probability for ∪_{i∈I} Ai is

PX(∪_{i∈I} Ai) = Σ_{x∈∪_{i∈I} Ai} f(x) = Σ_{i∈I} Σ_{x∈Ai} f(x) = Σ_{i∈I} PX(Ai)   (disjoint Ai's; Ax. 3).

As to the sufficiency of the conditions (i_b) and (ii_b) for the resulting PX to satisfy the axioms in the continuous case:
- (i_b) says that f(x) ≥ 0 ∀x, so that PX(A) = ∫_{x∈A} f(x)dx ≥ 0 ∀ events A (Ax. 1).
- (ii_b) says that ∫_R f(x)dx = 1, so there is at least one event A ⊆ R such that ∫_A f(x)dx = 1. We can set A = R(X), so that PX(R(X)) = ∫_{x∈R(X)} f(x)dx = 1 (Ax. 2).


- Let {Ai, i ∈ I} be a collection of disjoint events. Then the set function used to compute the probability for ∪_{i∈I} Ai is

PX(∪_{i∈I} Ai) = ∫_{x∈∪_{i∈I} Ai} f(x)dx = Σ_{i∈I} ∫_{x∈Ai} f(x)dx = Σ_{i∈I} PX(Ai)   (Ax. 3),

where the second equality holds for nonoverlapping Ai's by the additivity property of Riemann integrals.

For a discussion of the necessity of the conditions (i_a)-(iii_a) and (i_b),(ii_b) for the resulting PX to satisfy the probability axioms, see Mittelhammer 2000, p. 57. He shows that all conditions, except for the condition (i_b) that f(x) ≥ 0 ∀x in the continuous case, are necessary. In the continuous case, the property f(x) ≥ 0 is not necessary for the following reason. The function f could technically be negative at a finite number of x values, because the value of ∫_a^b f(x)dx is invariant to changes in f(x) at a finite number of points having measure zero.

Example 2.6
1. Consider the function f(x) = (0.3)^x (0.7)^{1−x} I_{0,1}(x). Can this f serve as a pdf? Since (i) f(x) > 0 on the countable set {0, 1}, (ii) f(x) = 0 ∀x ∉ {0, 1}, and (iii) Σ_{x=0}^{1} f(x) = 1, the function f can serve as a pdf.
2. Consider the function f(x) = (x² + 1) I_[−1,1](x). Can this f serve as a pdf? While f(x) ≥ 0 ∀x ∈ R, f does not integrate to 1:

∫_R f(x)dx = ∫_{−1}^{1} (x² + 1)dx = 8/3 ≠ 1.

Thus, f cannot serve as a pdf.

Remark 2.8 Obtaining a pdf based on functions like f(x) = (x² + 1) I_[−1,1](x) (i.e. satisfying (i_b) but not (ii_b)) is not complicated. In general, a pdf based on such a so-called density kernel is obtained by simply dividing the kernel by ∫_R f(x)dx.
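The kernel-normalization step of Remark 2.8 can be checked numerically for the kernel of Example 2.6; the midpoint-rule integration below is an approximation, but accurate enough to recover 8/3 and the unit integral of the normalized density:

```python
# Density kernel from Example 2.6: nonnegative, but integrates to 8/3, not 1.
def kernel(x):
    return x * x + 1.0 if -1.0 <= x <= 1.0 else 0.0

def integrate(g, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return sum(g(a + (k + 0.5) * h) for k in range(n)) * h

c = integrate(kernel, -1.0, 1.0)
print(round(c, 4))  # 2.6667 (= 8/3, so the kernel is not itself a pdf)

# Normalizing by the integral turns the kernel into a valid pdf.
def pdf(x):
    return kernel(x) / c

print(round(integrate(pdf, -1.0, 1.0), 4))  # 1.0
```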

2.2. Univariate Cumulative Distribution Functions


Pdfs offer one view of the distribution of a random variable; the cumulative distribution function offers another.

Definition 2.7 (Cumulative distribution function) The cumulative distribution function (cdf) of a random variable X, denoted by F, is defined by

F : R → [0, 1] such that F(b) = PX(X ≤ b),   ∀b ∈ R.

Remark 2.9 For a discrete random variable the cdf is then

F(b) = Σ_{x≤b} f(x),   ∀b ∈ R,

and for a continuous random variable

F(b) = ∫_{−∞}^{b} f(x)dx,   ∀b ∈ R.

Example 2.7 Let the random variable X be the duration of a telephone call (in min), with range R(X) = {x : x > 0}. Let the pdf be

f(x) = (1/θ) e^{−x/θ} I_(0,∞)(x),   with θ > 0.

The cdf is obtained as

F(b) = ∫_0^b (1/θ) e^{−x/θ} dx = (1 − e^{−b/θ}) I_(0,∞)(b).

Assume that θ = 100 (average duration). Then the probability that the duration is less than 50 min is F(50) = 1 − e^{−50/100} = 0.39.
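The exponential cdf of the call-duration example is easy to evaluate directly; the sketch below reproduces F(50) and also shows how the cdf yields interval probabilities:

```python
import math

# cdf of the call-duration example: F(b) = 1 - exp(-b/theta) for b > 0, 0 otherwise.
theta = 100.0  # average duration in minutes

def F(b):
    return 1.0 - math.exp(-b / theta) if b > 0 else 0.0

print(round(F(50.0), 2))  # 0.39

# The cdf also gives interval probabilities: P(50 < X <= 100) = F(100) - F(50).
print(round(F(100.0) - F(50.0), 4))  # 0.2387
```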

Example 2.8 Let the random variable X be the number of dots observed rolling a die, with range R(X) = {1, 2, . . . , 6}. The pdf is

f(x) = (1/6) I_{1,...,6}(x).

The cdf is obtained as

F(b) = Σ_{x≤b} (1/6) I_{1,...,6}(x) = (1/6) ⌊b⌋ I_[1,6](b) + I_(6,∞)(b)

(⌊b⌋ denotes the integer part of the number b).



The definition of the cdf implies that a cdf F (x) satisfies certain properties.

Theorem 2.1 (Properties of a cdf) For any cdf F, it holds that
(i) lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1;
(ii) F(x) is a non-decreasing function in x; that is, F(a) ≤ F(b) for a < b;
(iii) F(x) is right-continuous; that is, lim_{h↓0} F(x + h) = F(x).

Proof
Property (i) follows from the fact that
lim_{x→−∞} F(x) = lim_{x→−∞} PX(X ≤ x) = PX(∅) = 0, and
lim_{x→∞} F(x) = lim_{x→∞} PX(X ≤ x) = PX(R(X)) = 1.
Property (ii) follows from the fact that we accumulate (by integration / summation) non-negative values if we move from the left to the right.
Property (iii) follows from the fact that
lim_{h↓0} F(x + h) = lim_{h↓0} PX(X ≤ x + h) = PX(X ≤ x) = F(x). □

The following theorems establish the relationship between a cdf and pdf.

Theorem 2.2 Let x1 < x2 < x3 < ⋯ be the countable set of outcomes in the range of the discrete random variable X. Then the pdf for X is obtained as

f(x1) = F(x1),
f(xi) = F(xi) − F(x_{i−1}),   i = 2, 3, . . . ,
f(x) = 0,   x ∉ R(X).

Proof
Since summation of f leads to F , taking differences of F leads to f . 

Theorem 2.3 Let f(x) and F(x) denote the pdf and cdf of a continuous random variable X. Then the pdf for X is obtained as

f(x) = dF(x)/dx wherever f(x) is continuous,   and   f(x) = 0 elsewhere.


Proof
Wherever f is continuous we have

dF(x)/dx = d/dx ∫_{−∞}^{x} f(u)du = f(x)   (Fundamental Theorem of Calculus).

At points where f is discontinuous (such that the derivative of F does not exist) we can set f to an arbitrary non-negative value (for example 0), since the value of F(x) = ∫_{−∞}^{x} f(u)du is invariant to changes in f(u) at a finite set of points having measure zero. □

Example 2.9 Recall the Example, where X is the duration of a telephone call, with cdf

F(x) = (1 − e^{−x/θ}) I_(0,∞)(x).

A continuous pdf for X is given by

f(x) = dF(x)/dx = (1/θ) e^{−x/θ}   for x ∈ (0, ∞),
f(x) = 0 (say)   for x = 0,
f(x) = 0   for x ∈ (−∞, 0).

2.3. Multivariate Random Variables


So far, we have discussed univariate random variables, where only one real-valued function was
defined on the sample space S. If we define concurrently two or more real-valued functions, we
obtain multivariate random variables (random vectors).
Definition 2.8 (Multivariate random variable) Let {S, Y, P} be a probability space. If X : S → R^n (or simply, X) is a real-valued vector function having as its domain the elements of S, then X : S → R^n (or X) is called a multivariate (n-variate) random variable.
Remark 2.10 The realized value of the multivariate random variable is

x = (x1, x2, . . . , xn)′ = (X1(w), X2(w), . . . , Xn(w))′ = X(w)   for w ∈ S,

and its range is

R(X) = {(x1, . . . , xn) : xi = Xi(w), i = 1, . . . , n, w ∈ S}.

The definitions of pdfs for multivariate discrete and continuous random variables are analogous
to those in the univariate cases, and are as follows:
Definition 2.9 (Discrete multivariate pdf) A multivariate random variable X = (X1, . . . , Xn)′ is called discrete iff its range R(X) is countable. The discrete joint pdf of a discrete random variable X, denoted by f, is defined by f : R^n → [0, 1] such that

f(x) = PX(X1 = x1, . . . , Xn = xn) if x ∈ R(X),   and   f(x) = 0 else.

Definition 2.10 (Continuous multivariate pdf) A random vector X = (X1, . . . , Xn)′ is called continuous iff
- its range R(X) is uncountably infinite, and
- there exists a function f : R^n → [0, ∞) such that for any event A,

PX(A) = ∫⋯∫_{(x1,...,xn)∈A} f(x1, . . . , xn) dx1 ⋯ dxn

and f(x1, . . . , xn) = 0 ∀(x1, . . . , xn) ∉ R(X).

The function f is called a continuous joint pdf.
As in the univariate case, a function f selected to serve as a joint pdf should ensure that
the probabilities obtained from the selected function f adhere to the probability axioms. The
following definition identifies the restrictions on admissible choices of f :
Definition 2.11 (Class of discrete joint pdfs) The function f : R^n → R is a member of the class of discrete joint pdfs iff
(i_a) the set C = {(x1, . . . , xn) : f(x1, . . . , xn) > 0, (x1, . . . , xn) ∈ R^n} is countable;
(ii_a) f(x1, . . . , xn) = 0 ∀(x1, . . . , xn) ∉ C;
(iii_a) Σ_{(x1,...,xn)∈C} f(x1, . . . , xn) = 1.

Definition 2.12 (Class of continuous joint pdfs) The function f : R^n → R is a member of the class of continuous joint pdfs iff
(i_b) f(x1, . . . , xn) ≥ 0 ∀(x1, . . . , xn) ∈ R^n;
(ii_b) ∫⋯∫_{R^n} f(x1, . . . , xn) dx1 ⋯ dxn = 1.

Remark 2.11 The conditions (i_a)-(iii_a) and (i_b),(ii_b), respectively, ensure that the corresponding set functions used to compute joint probabilities, namely,

PX(A) = Σ_{(x1,...,xn)∈A} f(x1, . . . , xn)   and   PX(A) = ∫⋯∫_{(x1,...,xn)∈A} f(x1, . . . , xn) dx1 ⋯ dxn,

are in fact probability set functions which adhere to the probability axioms. Sufficiency and necessity of the conditions can be shown by generalizing the arguments used in the univariate case.

Example 2.10 Consider that NASA announces that a small meteorite will hit a rectangular area of 12 km² = 4 km × 3 km but cannot get any more precise. Suppose further that each point in that rectangle is equally likely to be struck. Define X = (X1, X2) to be the coordinates of the point of strike, with range R(X) = {(x1, x2) : x1 ∈ [−2, 2], x2 ∈ [−1.5, 1.5]} (coordinates in km relative to the center of the rectangle).
In order to derive the continuous pdf of X, consider some closed rectangle A = [a, b] × [c, d] in R(X). Since all points are equally likely, we obtain

PX(X ∈ A) = (area of A)/(area of R(X)) = (b − a)(d − c)/12.

According to the definition, the pdf f for X has to satisfy

∫_c^d ∫_a^b f(x1, x2) dx1 dx2 = (b − a)(d − c)/12,   −2 ≤ a ≤ b ≤ 2; −1.5 ≤ c ≤ d ≤ 1.5,

with

∂²[∫_c^d ∫_a^b f(x1, x2) dx1 dx2]/(∂d ∂b) = f(b, d) = ∂²[(b − a)(d − c)/12]/(∂d ∂b) = 1/12,

∀b ∈ [−2, 2], d ∈ [−1.5, 1.5]. Hence, the function

f(x1, x2) = (1/12) I_[−2,2](x1) I_[−1.5,1.5](x2)

can be used as a joint pdf for X, and for any event A ⊆ R(X) we obtain PX(A) = ∫∫_{(x1,x2)∈A} (1/12) dx1 dx2.

The concept of a cdf can be extended to the multivariate case as well.

Definition 2.13 (Joint cdf) The joint cdf of an n-dimensional random variable X, denoted by F, is defined by

F : R^n → [0, 1] such that F(b1, . . . , bn) = PX(X1 ≤ b1, . . . , Xn ≤ bn),   ∀(b1, . . . , bn) ∈ R^n.

Remark 2.12 For a discrete random variable the joint cdf is obtained as

F(b1, . . . , bn) = Σ_{x1≤b1} ⋯ Σ_{xn≤bn} f(x1, . . . , xn),   ∀(b1, . . . , bn) ∈ R^n,

and for a continuous random variable as

F(b1, . . . , bn) = ∫_{−∞}^{bn} ⋯ ∫_{−∞}^{b1} f(x1, . . . , xn) dx1 ⋯ dxn,   ∀(b1, . . . , bn) ∈ R^n.

Theorem 2.4 (Properties of joint cdfs) For any multivariate cdf F, it holds that
(i) lim_{bi→−∞} F(b1, . . . , bn) = PX(∅) = 0, for any i = 1, . . . , n;
(ii) lim_{bi→∞, ∀i} F(b1, . . . , bn) = PX(R(X)) = 1;
(iii) F is a non-decreasing function in (x1, . . . , xn); that is, F(a) ≤ F(b) for (in the suitably defined vector inequality) a = (a1, . . . , an)′ < (b1, . . . , bn)′ = b;
(iv) discrete joint cdfs have a countable number of jump discontinuities, and joint cdfs for continuous random variables are continuous without jump discontinuities.

Proof
Analogous to the proof of Theorem 2.1. 

Similar to the univariate case the joint cdf can be used to obtain the joint pdf.

Theorem 2.5 Let (X, Y) be a discrete bivariate random variable with joint cdf F(x, y) and range R(X, Y) = {x1 < x2 < x3 < ⋯} × {y1 < y2 < y3 < ⋯}. Then the joint pdf is obtained as

f(x1, y1) = F(x1, y1),
f(x1, yj) = F(x1, yj) − F(x1, y_{j−1}),   j ≥ 2,
f(xi, y1) = F(xi, y1) − F(x_{i−1}, y1),   i ≥ 2,
f(xi, yj) = F(xi, yj) − F(xi, y_{j−1}) − F(x_{i−1}, yj) + F(x_{i−1}, y_{j−1}),   i, j ≥ 2.

Proof
Since summation of f leads to F , taking differences of F leads to f . 

Remark 2.13 The result of Theorem 2.5 for the bivariate case can be generalized to the n-variate case. However, this requires a somewhat cumbersome notation.

Theorem 2.6 Let f(x1, . . . , xn) and F(x1, . . . , xn) denote the joint pdf and cdf for a continuous multivariate random variable X = (X1, . . . , Xn). Then the joint pdf for X is obtained as

f(x1, . . . , xn) = ∂^n F(x1, . . . , xn)/(∂x1 ⋯ ∂xn) wherever f(·) is continuous,   and   f(x1, . . . , xn) = 0 elsewhere.

Proof
Wherever f is continuous we have

∂^n F(x1, . . . , xn)/(∂x1 ⋯ ∂xn) = ∂^n [∫_{−∞}^{xn} ⋯ ∫_{−∞}^{x1} f(u1, . . . , un) du1 ⋯ dun]/(∂x1 ⋯ ∂xn) = f(x1, . . . , xn)   (Fundamental Theorem of Calculus).

At points where f is discontinuous (such that the derivative of F does not exist) we can set f to an arbitrary non-negative value (for example 0), since the value of F(x1, . . . , xn) = ∫_{−∞}^{xn} ⋯ ∫_{−∞}^{x1} f(u1, . . . , un) du1 ⋯ dun is invariant to changes in f(·) at a finite set of points having measure zero. □

Example 2.11 Recall the meteorite example, where X = (X1, X2) is the point of strike. The joint cdf is obtained as

F(b1, b2) = ∫_{−∞}^{b2} ∫_{−∞}^{b1} (1/12) I_[−2,2](x1) I_[−1.5,1.5](x2) dx1 dx2,

with four different integration areas, such that

F(b1, b2) = [∫_{−1.5}^{b2} ∫_{−2}^{b1} (1/12) dx1 dx2] I_[−2,2](b1) I_[−1.5,1.5](b2)
          + [∫_{−1.5}^{b2} ∫_{−2}^{2} (1/12) dx1 dx2] I_(2,∞)(b1) I_[−1.5,1.5](b2)
          + [∫_{−1.5}^{1.5} ∫_{−2}^{b1} (1/12) dx1 dx2] I_[−2,2](b1) I_(1.5,∞)(b2)
          + [∫_{−1.5}^{1.5} ∫_{−2}^{2} (1/12) dx1 dx2] I_(2,∞)(b1) I_(1.5,∞)(b2)

or

F(b1, b2) = [(b1 + 2)(b2 + 1.5)/12] I_[−2,2](b1) I_[−1.5,1.5](b2)
          + [4(b2 + 1.5)/12] I_(2,∞)(b1) I_[−1.5,1.5](b2)
          + [3(b1 + 2)/12] I_[−2,2](b1) I_(1.5,∞)(b2)
          + 1 · I_(2,∞)(b1) I_(1.5,∞)(b2).
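The four-region joint cdf of the meteorite example collapses to a single expression if the arguments are clipped at the upper edges of the rectangle; the sketch below uses that compact equivalent (a reformulation, not the notes' four-indicator form verbatim) and evaluates it at a few points:

```python
# Joint cdf of the meteorite example: uniform density 1/12 on [-2, 2] x [-1.5, 1.5].
def F(b1, b2):
    """F(b1, b2) = P(X1 <= b1, X2 <= b2); clipping reproduces all four regions."""
    if b1 < -2.0 or b2 < -1.5:
        return 0.0               # below the rectangle in either coordinate
    u1 = min(b1, 2.0)            # clip at the right edge
    u2 = min(b2, 1.5)            # clip at the top edge
    return (u1 + 2.0) * (u2 + 1.5) / 12.0

print(F(0.0, 0.0))  # 0.25 (the center splits the rectangle into quarters)
print(F(2.0, 1.5))  # 1.0
print(F(5.0, 0.0))  # 0.5  (b1 beyond the rectangle: only the x2-constraint binds)
```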

2.4. Marginal and conditional distributions


From the joint pdf f(x1, x2) of a bivariate random variable (X1, X2) we can easily derive the marginal pdfs of X1 and X2, denoted by f1(x1) and f2(x2), which can be used to assign (marginal) probabilities to the events x1 ∈ A1 and x2 ∈ A2, that is, P(x1 ∈ A1) and P(x2 ∈ A2). This bears an intimate relation to the law of total probability.

Theorem 2.7 Let X = (X1, X2) be a discrete random variable with joint pdf f(x1, x2) and range R(X) = R(X1) × R(X2). The marginal pdfs are given by

f1(x1) = Σ_{x2∈R(X2)} f(x1, x2),   and   f2(x2) = Σ_{x1∈R(X1)} f(x1, x2).

Proof
For any x1 ∈ R(X1), let

A = {(x1, x2) : x2 ∈ R(X2)}.

That is, A is a line in the plane R(X) with first coordinate equal to x1. Then for any x1 ∈ R(X1),

f1(x1) = P(x1)   [by def.]
       = P(x1, x2 ∈ R(X2))   [P(x2 ∈ R(X2)) = 1]
       = P((x1, x2) ∈ A)   [def. of A]
       = Σ_{(x1,x2)∈A} f(x1, x2)
       = Σ_{x2∈R(X2)} f(x1, x2).

The proof for f2(x2) is analogous. □

Remark 2.14 In order to obtain the marginal pdf, we simply sum out the variables that are
not of interest in the joint pdf. If the bivariate random variable is continuous, the marginal
pdfs are obtained as in the discrete case with integrals replacing sums.

Theorem 2.8 Let X = (X1, X2) be a continuous random variable with joint pdf f(x1, x2). The corresponding marginal pdfs are given by

f1(x1) = ∫_{−∞}^{∞} f(x1, x2) dx2,   and   f2(x2) = ∫_{−∞}^{∞} f(x1, x2) dx1.

Proof
For any event x1 ∈ B, let A = {(x1, x2) : x1 ∈ B, x2 ∈ R(X2)}. Then for any event x1 ∈ B,

P(x1 ∈ B) = P(x1 ∈ B; x2 ∈ R(X2)) = P((x1, x2) ∈ A)
          = ∫∫_{(x1,x2)∈A} f(x1, x2) dx2 dx1
          = ∫_{x1∈B} [∫_{−∞}^{∞} f(x1, x2) dx2] dx1 = ∫_{x1∈B} f1(x1) dx1,

where the bracketed term has to be the pdf of X1, f1(x1), in order to obtain P(X1 ∈ B). The proof for f2(x2) is analogous. □

Example 2.12 Consider the continuous random variable X = (X1, X2) with joint pdf

f(x1, x2) = (x1 + x2) I_[0,1](x1) I_[0,1](x2).

The corresponding marginal pdf of X1 is obtained as

f1(x1) = ∫_{−∞}^{∞} f(x1, x2) dx2 = ∫_0^1 (x1 + x2) I_[0,1](x1) dx2 = [x1 x2 + x2²/2]_{x2=0}^{x2=1} I_[0,1](x1) = (x1 + 1/2) I_[0,1](x1).


Remark 2.15 The concept of marginal pdfs can be straightforwardly generalized from the bivariate to the n-variate case. In this case marginal pdfs can be joint pdfs themselves. The
n-variate generalization is presented in the following definition.

Definition 2.14 (Marginal pdfs) Let f(x1, . . . , xn) be the joint pdf for the n-dimensional random variable (X1, . . . , Xn). Let J = {j1, j2, . . . , jm}, 1 ≤ m < n, be a set of indices selected from the index set I = {1, 2, . . . , n}. Then the marginal density function for the m-dimensional random variable (Xj1, . . . , Xjm) is given by

f_{j1...jm}(xj1, . . . , xjm) = Σ⋯Σ_{xi∈R(Xi), i∈I∖J} f(x1, . . . , xn)   (discrete case),

f_{j1...jm}(xj1, . . . , xjm) = ∫⋯∫ f(x1, . . . , xn) ∏_{i∈I∖J} dxi   (continuous case).

The conditional pdf is in a sense the exact opposite of the marginal pdf: while the marginal pdf averages over the other variables, the conditional one takes the others explicitly into account, at least in a particular sense. Recall the discussion on conditional probabilities of events.
From the joint pdf of a bivariate random variable (X1, X2) we can easily derive the conditional pdf of X1 given X2, which can be used to assign the probability to the event X1 ∈ C given that (i.e. conditional on) X2 ∈ D. This probability is obtained as

P(X1 ∈ C | X2 ∈ D) = Σ_{x1∈C} f(x1 | x2 ∈ D)             (discrete case)
                   = ∫_{x1∈C} f(x1 | x2 ∈ D) dx1         (continuous case),

where f(x1 | x2 ∈ D) denotes the conditional pdf of X1 given that x2 ∈ D. The pdf f(x1 | x2 ∈ D) can be derived as follows:
Consider a discrete bivariate random variable (X1, X2), with joint pdf f(x1, x2), and the following two pairs of equivalent events:

x1 ∈ C  ⇔  (x1, x2) ∈ A = {(x1, x2) : x1 ∈ C, x2 ∈ R(X2)}
x2 ∈ D  ⇔  (x1, x2) ∈ B = {(x1, x2) : x1 ∈ R(X1), x2 ∈ D}
Then the conditional probability for X1 ∈ C given X2 ∈ D is given by:

P(X1 ∈ C | X2 ∈ D) = P(A | B)              (equivalence of events)
                   = P(A ∩ B) / P(B)       (by def.), for P(B) > 0.

Since the intersection of A and B is A ∩ B = {(x1, x2) : x1 ∈ C, x2 ∈ D}, we get:

P(X1 ∈ C | X2 ∈ D) = [ Σ_{x1∈C} Σ_{x2∈D} f(x1, x2) ] / [ Σ_{x1∈R(X1)} Σ_{x2∈D} f(x1, x2) ]
                   = Σ_{x1∈C} [ Σ_{x2∈D} f(x1, x2) / Σ_{x2∈D} f2(x2) ],

where in the denominator Σ_{x1∈R(X1)} f(x1, x2) = f2(x2) is the marginal pdf of X2, and the term in square brackets has to be the pdf f(x1 | x2 ∈ D) in order to obtain P(X1 ∈ C | X2 ∈ D).

Thus, if (X1, X2) is a discrete random variable, the conditional pdf for X1 given x2 ∈ D can be defined by

f(x1 | x2 ∈ D) = Σ_{x2∈D} f(x1, x2) / Σ_{x2∈D} f2(x2),

and, if D is a single point d, by

f(x1 | x2 = d) = f(x1, d) / f2(d).

If (X1, X2) is a continuous random variable, we can substitute the summation operations by integrations, such that the conditional pdf for X1 given x2 ∈ D is defined as

f(x1 | x2 ∈ D) = ∫_{x2∈D} f(x1, x2) dx2 / ∫_{x2∈D} f2(x2) dx2.

However, a problem arises when D is a single point d, such that

f(x1 | x2 = d) = ∫_d^d f(x1, x2) dx2 / ∫_d^d f2(x2) dx2 = 0/0,
which is undetermined! This problem is circumvented by redefining the conditional probability in the continuous case in terms of a limit. In particular, in the continuous case we define the conditional probability for X1 ∈ A given X2 = d as

P(X1 ∈ A | X2 = d) ≡ lim_{ε→0⁺} P(X1 ∈ A | d − ε ≤ X2 ≤ d + ε)

= lim_{ε→0⁺} ∫_{x1∈A} [ ∫_{d−ε}^{d+ε} f(x1, x2) dx2 / ∫_{d−ε}^{d+ε} f2(x2) dx2 ] dx1
  (by def. of a conditional prob.)

= lim_{ε→0⁺} ∫_{x1∈A} [ 2ε f(x1, x̄2) / (2ε f2(x̃2)) ] dx1,   with x̄2, x̃2 ∈ [d − ε, d + ε]
  (by the mean value theorem for integrals)

= lim_{ε→0⁺} ∫_{x1∈A} f(x1, x̄2) / f2(x̃2) dx1.

As ε → 0⁺, the interval [d − ε, d + ε] reduces to [d, d] = {d}, so that x̄2 → d and x̃2 → d. Thus, we get:

P(X1 ∈ A | X2 = d) = ∫_{x1∈A} f(x1, d) / f2(d) dx1,

where the integrand has to be the pdf f(x1 | x2 = d) in order to obtain P(X1 ∈ A | X2 = d).
This implies that the conditional pdf of x1 given x2 = d in the continuous case can be defined as

f(x1 | x2 = d) = f(x1, d) / f2(d).

Note that this conditional pdf has exactly the same form as in the discrete case.
Example 2.13 Consider the continuous random variable X = (X1, X2) with joint pdf

f(x1, x2) = (x1 + x2) I[0,1](x1) I[0,1](x2),

and marginal pdf (see above):

f2(x2) = (x2 + 1/2) I[0,1](x2).

Then the conditional pdf of X1 given X2 ≤ .5 is obtained as

f(x1 | x2 ≤ .5) = ∫_0^{.5} f(x1, x2) dx2 / ∫_0^{.5} f2(x2) dx2          (def.)
                = ∫_0^{.5} (x1 + x2) I[0,1](x1) dx2 / ∫_0^{.5} (x2 + 1/2) I[0,1](x2) dx2
                = (.5 x1 + .125) I[0,1](x1) / .375
                = (4/3 x1 + 1/3) I[0,1](x1).

The conditional pdf of X1 given x2 = .75 is

f(x1 | x2 = .75) = f(x1, .75) / f2(.75)          (def.)
                 = (x1 + .75) I[0,1](x1) / 1.25
                 = (4/5 x1 + 3/5) I[0,1](x1).
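The ratio of integrals in Example 2.13 can be checked numerically. The following is an illustrative sketch (not part of the notes): it computes the numerator and denominator with midpoint Riemann sums and compares the result with the closed-form conditional pdf (4/3) x1 + 1/3.

```python
# Illustrative numerical check of Example 2.13 (a sketch, not part of the notes):
# f(x1 | x2 <= .5) should equal (4/3) x1 + 1/3 on [0, 1].

def joint(x1, x2):
    return x1 + x2 if 0 <= x1 <= 1 and 0 <= x2 <= 1 else 0.0

def cond_pdf(x1, n=10_000):
    h = 0.5 / n  # midpoint Riemann sums over x2 in [0, .5]
    num = sum(joint(x1, (k + 0.5) * h) for k in range(n)) * h
    den = sum(((k + 0.5) * h + 0.5) * h for k in range(n))  # integral of f2(x2)
    return num / den

for x1 in (0.0, 0.5, 1.0):
    assert abs(cond_pdf(x1) - (4 * x1 / 3 + 1 / 3)) < 1e-9
print("conditional pdf check passed")
```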


The concept of conditional pdfs can be straightforwardly generalized from the bivariate to the
n-variate case. The n-variate generalization is presented in the following definition:

Definition 2.15 (Conditional pdfs) Let f(x1, ..., xn) be the joint pdf for the n-dimensional random variable (X1, ..., Xn). Let J1 = {j1, ..., jm} and J2 = {j(m+1), ..., jn} be two mutually exclusive index sets whose union is equal to the index set {1, 2, ..., n}. Then the conditional pdf for the m-dimensional random variable (Xj1, ..., Xjm), given Xj(m+1) = d(m+1), ..., Xjn = dn, is given by

f(xj1, ..., xjm | xji = di, i = m + 1, ..., n) = f(x1, ..., xn) / f_{j(m+1)...jn}(d(m+1), ..., dn),

where xji = di if ji ∈ J2, whenever the marginal density in the denominator is positive valued.

Remark 2.16 From the conditional pdf we can straightforwardly derive the conditional cdf by
using the conditional pdf in the general definition of a cdf.

2.5. Independence of Random Variables


The independence of two events A and B means that P(A ∩ B) = P(A) · P(B) (see Section 1.5). This concept of independence can be straightforwardly applied to multivariate random variables.

Definition 2.16 (Independence of random variables) The random variables X1 and X2 are said to be independent iff

P(X1 ∈ A1, X2 ∈ A2) = P(X1 ∈ A1) · P(X2 ∈ A2)   ∀ A1, A2.

Remark 2.17 This definition is not immediately operational since the factorization has to hold for all pairs of events. Thus, the following result can be useful in practice:

Theorem 2.9 The random variables X1 and X2 with joint pdf f(x1, x2) and marginal pdfs f1(x1) and f2(x2) are independent iff

f(x1, x2) = f1(x1) · f2(x2)   ∀(x1, x2)

(except possibly at points of discontinuity for a joint continuous pdf f).



Proof
(Continuous case) Let A1, A2 be any pair of events. Then, if the joint pdf can be factorized,

P(X1 ∈ A1, X2 ∈ A2) = ∫_{x1∈A1} ∫_{x2∈A2} f(x1, x2) dx2 dx1              (def.)
                    = ∫_{x1∈A1} f1(x1) dx1 · ∫_{x2∈A2} f2(x2) dx2        (by factorization)
                    = P(X1 ∈ A1) · P(X2 ∈ A2).                           (def.)

Thus, the factorization is sufficient for independence. Now suppose that X1, X2 are independent and let Ai = {xi : xi < ai}, (i = 1, 2), for arbitrary ai's. Then, by independence,

P(X1 ∈ A1, X2 ∈ A2) = ∫_{−∞}^{a1} ∫_{−∞}^{a2} f(x1, x2) dx2 dx1          (def.)
                    = ∫_{−∞}^{a1} f1(x1) dx1 · ∫_{−∞}^{a2} f2(x2) dx2.   (by independence)

Differentiating the integrals w.r.t. a1 and a2 yields f(a1, a2) = f1(a1) · f2(a2). Thus, factorization is necessary for independence.
The proof for the discrete case is analogous. □

Remark 2.18 An important implication of the independence of X1 and X2 is that the conditional pdfs are identical to the corresponding marginal pdfs, that is,

f(x1 | x2 = d) = f(x1, d) / f2(d) = f1(x1) f2(d) / f2(d) = f1(x1).       (def.)

Thus the probability of event X1 ∈ A is unaffected by the occurrence or nonoccurrence of event X2 = d.

Example 2.14 Recall the meteorite example, where X = (X1, X2) is the point of strike with joint pdf

f(x1, x2) = (1/12) I[−2,2](x1) I[−1.5,1.5](x2).

Are X1 and X2 independent? The marginal pdfs are

f1(x1) = (1/12) I[−2,2](x1) ∫_{−1.5}^{1.5} 1 dx2 = (1/4) I[−2,2](x1),
f2(x2) = (1/12) I[−1.5,1.5](x2) ∫_{−2}^{2} 1 dx1 = (1/3) I[−1.5,1.5](x2).

Thus, f(x1, x2) = f1(x1) f2(x2), and X1 and X2 are independent.
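The factorization criterion of Theorem 2.9 can be checked pointwise on a grid. The following is an illustrative sketch (not part of the notes) for the uniform joint pdf of Example 2.14.

```python
# Illustrative numerical check of Example 2.14 (a sketch, not from the notes):
# the uniform joint pdf 1/12 on [-2, 2] x [-1.5, 1.5] factorizes into the
# marginals 1/4 and 1/3 at every grid point, in and out of the support.

def joint(x1, x2):
    return 1 / 12 if -2 <= x1 <= 2 and -1.5 <= x2 <= 1.5 else 0.0

def f1(x1):
    return 1 / 4 if -2 <= x1 <= 2 else 0.0

def f2(x2):
    return 1 / 3 if -1.5 <= x2 <= 1.5 else 0.0

grid = [i / 2 for i in range(-6, 7)]  # -3.0, -2.5, ..., 3.0
for x1 in grid:
    for x2 in grid:
        assert abs(joint(x1, x2) - f1(x1) * f2(x2)) < 1e-12
print("f(x1, x2) = f1(x1) f2(x2) everywhere: X1, X2 independent")
```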


Remark 2.19 If X1 and X2 are independent, then knowing the marginal pdfs f1 and f2 is
sufficient to determine the joint pdf: f (x1 , x2 ) = f1 (x1 )f2 (x2 ). However, if X1 and X2 are
dependent, then knowing the marginal pdfs f1 and f2 is not sufficient to determine the joint pdf
f.
Example 2.15 Consider the joint pdf

f(x1, x2; θ) = [1 + θ(2x1 − 1)(2x2 − 1)] I[0,1](x1) I[0,1](x2),    θ ∈ [−1, 1].

For any choice of θ ∈ [−1, 1], the marginal pdfs are

f1(x1) = I[0,1](x1)   and   f2(x2) = I[0,1](x2).

Hence, for all suitable values of θ in the joint pdf f, we obtain the very same marginal pdfs f1 and f2. Thus, knowing f1 and f2 is insufficient to determine f and, in particular, the value of θ.
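The claim of Example 2.15 is easy to verify numerically, since the factor (2x2 − 1) integrates to zero over [0, 1]. The following is an illustrative sketch (not part of the notes): the computed marginal of X1 equals 1 for every choice of θ.

```python
# Illustrative sketch (not from the notes): for the joint pdf
# f(x1, x2; theta) = 1 + theta (2 x1 - 1)(2 x2 - 1) on [0,1]^2, the marginal
# of X1 is uniform for every theta, so marginals alone cannot identify theta.

def marginal_f1(x1, theta, n=1_000):
    h = 1.0 / n  # midpoint Riemann sum over x2
    return sum((1 + theta * (2 * x1 - 1) * (2 * (k + 0.5) * h - 1)) * h
               for k in range(n))

for theta in (-1.0, 0.0, 0.7):
    for x1 in (0.1, 0.5, 0.9):
        assert abs(marginal_f1(x1, theta) - 1.0) < 1e-9
print("identical uniform marginals for every theta")
```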
So far, we considered the concept of independence for bivariate random variables. It can be
extended to the n-variate case with the following definition:
Definition 2.17 (Independence in the n-variate case) The random variables X1, ..., Xn are said to be independent iff

P(X1 ∈ A1, ..., Xn ∈ An) = Π_{i=1}^{n} P(Xi ∈ Ai)   ∀ A1, ..., An.

The generalization of the joint pdf factorization theorem from the bivariate to the n-variate case is given in the following theorem:

Theorem 2.10 The random variables X1, ..., Xn with joint pdf f(x1, ..., xn) and marginal pdfs fi(xi), i = 1, ..., n, are all independent of each other iff

f(x1, ..., xn) = Π_{i=1}^{n} fi(xi)   ∀(x1, ..., xn)

(except possibly at points of discontinuity for a joint continuous pdf f).

Proof
The proof is a direct extension of that for the bivariate case (Theorem 2.9). □

The independence concept for random variables can be extended to the independence of random
variables, which are defined as functions of other independent random variables:


Theorem 2.11 If X1 and X2 are independent random variables, and if Y1 and Y2 are defined as functions y1 = g1(x1) and y2 = g2(x2), then Y1 and Y2 are independent.

Proof
Define the events Yi ∈ Ai and Xi ∈ Bi, i = 1, 2, such that they are equivalent, i.e.

Bi = {xi : gi(xi) ∈ Ai, xi ∈ R(Xi)},   i = 1, 2.

The joint probability for Y1 ∈ A1 and Y2 ∈ A2 is

P(Y1 ∈ A1, Y2 ∈ A2) = P(X1 ∈ B1, X2 ∈ B2)       (equivalence of events)
                    = P(X1 ∈ B1) P(X2 ∈ B2)     (independence)
                    = P(Y1 ∈ A1) P(Y2 ∈ A2). □  (equivalence of events)

An extension for random vectors and vector functions exists.


While it is quite convenient to think of random variables as simply coding actual events into numbers, and of pdfs as translating the corresponding probabilities to sets of real numbers, we can discuss the distribution of random variables independently of the probability space where they originate. In fact, for any distribution one can construct a suitable probability space. For this reason, the main object of interest in the following chapters is the probability distribution of a random variable.


3. Moments of random variables


3.1. Expectation of a Random Variable
The expected value, or expectation, of a random variable represents its average value and can
be thought of as a measure of the center of its pdf.
Definition 3.1 (Expectation; discrete case) The expected value of a discrete random variable X exists, and is defined by

E(X) = Σ_{x∈R(X)} x f(x),   iff   Σ_{x∈R(X)} |x f(x)| = Σ_{x∈R(X)} |x| f(x) < ∞.

Remark 3.1 The existence condition ensures that the series Σ_{x∈R(X)} x f(x) defining the expectation is absolutely convergent. Furthermore, note that absolute convergence implies standard convergence, that is,

| Σ_{x∈R(X)} x f(x) | ≤ Σ_{x∈R(X)} |x| f(x) < ∞,

such that the (countable) sum defining the expectation is finite and exists.

Remark 3.2 Also note that, if R(X) is finite and |x| < ∞ ∀x ∈ R(X), then Σ_{x∈R(X)} |x| f(x) < ∞, such that the existence condition is automatically satisfied; but if R(X) is countably infinite, there is no guarantee that Σ_{x∈R(X)} |x| f(x) < ∞.

Finally, note that standard convergence does not ensure the uniqueness of the limit in the countably infinite case. This means that a change of the ordering of the terms in an infinite sum can result in a change of the value of the sum (the Riemann rearrangement theorem). For this reason standard convergence is sometimes called conditional convergence (conditional on the ordering). Absolute convergence ensures the uniqueness of the converged value (see the examples below).


Example 3.1 Consider the experiment of rolling a die, and recall the pdf for the number of dots facing up given by

f(x) = (1/6) I{1,2,...,6}(x)   with   R(X) = {1, 2, ..., 6}.

The expected value equals E(X) = Σ_{x=1}^{6} x · (1/6) I{1,2,...,6}(x) = 3.5.

Example 3.2 Consider a random variable with pdf

f(xk) = 1/2^k   with   R(X) = { xk = (−1)^k 2^k / k, k = 1, 2, ... }.

The sum defining the expectation is

Σ_{k=1}^{∞} xk f(xk) = Σ_{k=1}^{∞} (−1)^k / k = − Σ_{k=1}^{∞} (−1)^{k−1} (1)^k / k = −ln(1 + 1),

using Σ_{k=1}^{∞} (−1)^{k−1} x^k / k = ln(1 + x) for x ∈ (−1, 1]. Thus, the sum is convergent.

Example 3.3 But the sum is not absolutely convergent, since

Σ_{k=1}^{∞} |xk| f(xk) = Σ_{k=1}^{∞} 1/k = 1 + 1/2 + (1/3 + 1/4) + (1/5 + ··· + 1/8) + ··· = ∞,

where each bracketed group of terms exceeds 1/2. Thus, the uniqueness of the converged value (and hence of the expected value) is not ensured.
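The non-uniqueness under conditional convergence can be made concrete. The following is an illustrative sketch (not part of the notes): the alternating harmonic series is summed in its natural order and in a rearranged order (one positive term followed by two negative terms), and the two partial sums approach different limits.

```python
import math

# Illustrative sketch (not from the notes): the alternating harmonic series
# converges only conditionally, so rearranging its terms changes the limit
# (Riemann rearrangement), which is why absolute convergence is required.

def alternating(n):
    # natural order: 1 - 1/2 + 1/3 - 1/4 + ...  converges to ln 2
    return sum((-1) ** (k + 1) / k for k in range(1, n + 1))

def rearranged(blocks):
    # one positive term, then two negative terms:
    # 1 - 1/2 - 1/4 + 1/3 - 1/6 - 1/8 + ...  converges to (1/2) ln 2
    s = 0.0
    for j in range(1, blocks + 1):
        s += 1 / (2 * j - 1) - 1 / (4 * j - 2) - 1 / (4 * j)
    return s

assert abs(alternating(1_000_000) - math.log(2)) < 1e-5
assert abs(rearranged(500_000) - 0.5 * math.log(2)) < 1e-5
print("same terms, different sums after rearrangement")
```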

The expectation of a continuous random variable is defined as follows:

Definition 3.2 (Expectation; continuous case) The expected value of a continuous random variable X exists, and is defined by

E(X) = ∫_{−∞}^{∞} x f(x) dx,   iff   ∫_{−∞}^{∞} |x| f(x) dx < ∞.

The condition is sometimes called integrability (of a random variable).

Remark 3.3 The existence condition is necessary to ensure that the improper Riemann integral ∫_{−∞}^{∞} x f(x) dx (and hence the expectation) exists.

Example 3.4 Consider the pdf f(x) = 3x² I[0,1](x). The expected value equals

E(X) = ∫_{−∞}^{∞} x · 3x² I[0,1](x) dx = 3 ∫_0^1 x³ dx = 0.75.

Example 3.5 Consider a random variable with pdf

f(x) = (1/π) · 1/(1 + x²),   −∞ < x < ∞   (Cauchy distribution).

This is the classical example of a random variable whose expected value does not exist. In order to see this, write

∫_{−∞}^{∞} |x| f(x) dx = ∫_{−∞}^{∞} (|x|/π) · 1/(1 + x²) dx = (2/π) ∫_0^{∞} x/(1 + x²) dx.

For any positive number a we obtain

∫_0^a x/(1 + x²) dx = [ ln(1 + x²)/2 ]_{x=0}^{x=a} = ln(1 + a²)/2.

Thus,

∫_{−∞}^{∞} |x| f(x) dx = (2/π) lim_{a→∞} ∫_0^a x/(1 + x²) dx = (1/π) lim_{a→∞} ln(1 + a²) = ∞.
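The practical consequence of a nonexistent expectation can be seen in simulation. The following is an illustrative sketch (not part of the notes): sample means of Cauchy draws do not stabilize as more observations are averaged, in contrast to distributions with a finite mean.

```python
import math, random

# Illustrative sketch (not from the notes): sample means of Cauchy draws do not
# settle down as the sample size grows, reflecting the nonexistent expectation.
# A standard way to simulate a Cauchy variate is tan(pi*(U - 1/2)), U uniform (0,1).

random.seed(0)

def cauchy_mean(n):
    return sum(math.tan(math.pi * (random.random() - 0.5)) for _ in range(n)) / n

means = [cauchy_mean(10_000) for _ in range(5)]
print("five sample means of size 10,000:", means)  # typically wildly different
```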

In applications, the following result w.r.t. sufficient conditions for the existence of the expectation can be useful:
Theorem 3.1 If |x| < c ∀x ∈ R(X), for some choice of c ∈ (0, ∞), then E(X) exists.

Proof
For a discrete random variable we obtain

Σ_{x∈R(X)} |x| f(x) < Σ_{x∈R(X)} c f(x) = c Σ_{x∈R(X)} f(x) = c < ∞;

the proof for the continuous case is analogous. □

Remark 3.4 The theorem indicates that the expectation exists if the outcomes of the random
variables are bounded. In general, the existence of the expectation depends on the behavior
of the tails of the distribution: too much probability mass for large realizations may prevent
convergence of the sum/integral defining the expectation.

3.2. Properties of the Expectation and Extensions


In many situations we are interested in the expectation of a function of a random variable rather than the expectation of the random variable itself. Consider, for example, the revenue of a company Y = pX, where p is the selling price, which is fixed, and X represents the (random) sales volume. How could E(Y) = E(pX) be determined (without having to find the distribution of Y)?
The following theorem identifies a straightforward approach of obtaining the expectation of a function Y = g(X) of a random variable X.

Theorem 3.2 Let X be a random variable with pdf f(x). Then the expectation of the random variable Y = g(X) is given by¹

E(g(X)) = Σ_{x∈R(X)} g(x) f(x)             (discrete)
        = ∫_{−∞}^{∞} g(x) f(x) dx          (continuous).

Proof
(Discrete case) Since the outcome y is equivalent to the event x ∈ {x : g(x) = y}, the pdf of Y = g(X), say h, can be represented as

h(y) = PY(y) = PX({x : g(x) = y, x ∈ R(X)}) = Σ_{x∈{x:g(x)=y}} f(x).

Thus,

E(g(X)) = E(Y) = Σ_{y∈R(Y)} y h(y) = Σ_{y∈R(Y)} y Σ_{x∈{x:g(x)=y}} f(x)
        = Σ_{y∈R(Y)} Σ_{x∈{x:g(x)=y}} g(x) f(x)     (y: fixed value in the inner sum)
        = Σ_{x∈R(X)} g(x) f(x)                      (summing over y ∈ R(Y) and then over x ∈ {x : g(x) = y} is equivalent to summing over all x ∈ R(X)).

The proof for the continuous case is analogous. □

Example 3.6 Consider the experiment of rolling a die and the number of dots facing up denoted by X. The expectation of the function Y = g(X) = X² is

E(X²) = Σ_{x=1}^{6} x² · (1/6) I{1,2,...,6}(x) = 91/6.

¹ It is tacitly assumed that the sum and integral are absolutely convergent for the expectation to exist.



Remark 3.5 An implication of Theorem 3.2 is that the expectation of an indicator function equals the probability of the set being indicated. Let X be a random variable with pdf f and define

IA(x) = 1 if x ∈ A, and 0 else.

Then,

E(IA(X)) = Σ_{x∈R(X)} IA(x) f(x) = Σ_{x∈A} f(x) = P(x ∈ A)           (discrete case)
E(IA(X)) = ∫_{x∈R(X)} IA(x) f(x) dx = ∫_{x∈A} f(x) dx = P(x ∈ A)     (continuous case).

It follows that probabilities can be represented as expectations.

The following theorem indicates that, in general, E(g(X)) ≠ g(E(X)):

Theorem 3.3 (Jensen's Inequality) Let X be a non-degenerate random variable² with expectation E(X), and let g be a function with smooth derivative on an open interval I containing R(X) (that is, R(X) ⊂ I).

If g is convex on I,           then E(g(X)) ≥ g(E(X));
if g is strictly convex on I,  then E(g(X)) > g(E(X)).

Proof
Let ℓ(x) be a tangent to g(x) at the point g(E(X)), say ℓ(x) = a + bx. Now, if g is convex on I (g′′ ≥ 0), we have

g(x) ≥ ℓ(x) = a + bx   ∀x ∈ I.

Thus, for a (discrete) X with pdf f, we obtain

E(g(X)) = Σ_{x∈R(X)} g(x) f(x) ≥ Σ_{x∈R(X)} (a + bx) f(x) = a + b E(X)
        = ℓ(E(X))     (def. of ℓ(x))
        = g(E(X))     (ℓ is tangent at E(X)),

so that E(g(X)) ≥ g(E(X)), as was to be shown (for a continuous X the argument is analogous).

Now, if g is strictly convex on I (implying g′′ > 0), we have

g(x) > ℓ(x) = a + bx   ∀x ∈ I for which x ≠ E(X).

Then, assuming that no x ∈ R(X) is assigned probability one (this means that X is non-degenerate), the previous inequality results become strict, implying

E(g(X)) > g(E(X)),

as was to be shown. □

² A degenerate random variable has only one outcome that is assigned a probability of 1.

Remark 3.6 Jensen's Inequality also applies to concave functions: if g is concave, then E(g(X)) ≤ g(E(X)). Moreover, it applies in fact to any (strictly) convex/concave function, but the proof is a bit more involved if one drops the smoothness of the derivative (the key argument is that any convex function is actually continuous with piecewise smooth derivative, so the extension only has to deal with the discontinuities).
Example 3.7 One immediate application of Jensen's Inequality shows that

E(X²) ≥ (E(X))²,   since g(x) = x² is convex.

Note that this implies that Var(X) = E(X²) − (E(X))² ≥ 0.
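The inequality in Example 3.7 can be confirmed on a concrete distribution. The following is an illustrative sketch (not part of the notes), using the die pdf from Examples 3.1 and 3.6, where the Jensen gap E(X²) − (E(X))² is exactly the variance.

```python
# Minimal numerical illustration of Jensen's inequality (a sketch, not from the
# notes): for the die pdf f(x) = 1/6 on {1,...,6} and the convex g(x) = x^2,
# E[g(X)] exceeds g(E[X]), and the gap is exactly Var(X).

xs = range(1, 7)
p = 1 / 6
mean = sum(x * p for x in xs)          # E[X] = 3.5
mean_sq = sum(x * x * p for x in xs)   # E[X^2] = 91/6
assert mean_sq > mean ** 2             # strict, since X is non-degenerate
assert abs((mean_sq - mean ** 2) - 35 / 12) < 1e-12  # Var of a fair die is 35/12
print("Jensen gap (= variance):", mean_sq - mean ** 2)
```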


We now discuss some properties of the expectation operator.

Theorem 3.4 If c is a constant, then E(c) = c.

Proof
E(c) = ∫_{−∞}^{∞} c f(x) dx = c ∫_{−∞}^{∞} f(x) dx = c. □

Theorem 3.5 If c is a constant, then E(cX) = c E(X).

Proof
E(cX) = ∫_{−∞}^{∞} c x f(x) dx = c ∫_{−∞}^{∞} x f(x) dx = c E(X). □

Theorem 3.6 E( Σ_{i=1}^{k} gi(X) ) = Σ_{i=1}^{k} E(gi(X)).

Proof
Let g(X) = Σ_{i=1}^{k} gi(X). Then, by Theorem 3.2,

E( Σ_{i=1}^{k} gi(X) ) = ∫_{−∞}^{∞} Σ_{i=1}^{k} gi(x) f(x) dx = Σ_{i=1}^{k} ∫_{−∞}^{∞} gi(x) f(x) dx   (additivity property of Riemann integrals)
                       = Σ_{i=1}^{k} E(gi(X)). □

Note that Theorem 3.6 indicates that the expectation of a sum is the sum of the expectations.

Corollary 3.1 E(a + bX) = a + b E(X).

Proof
This follows directly from Theorem 3.6, by defining g1(X) = a and g2(X) = bX. □

So far, we considered the expectation of a function of a univariate random variable. This concept can be generalized to a function of a multivariate random variable as indicated in the following theorem:

Theorem 3.7 Let (X1, ..., Xn)′ be a multivariate random variable with joint pdf f(x1, ..., xn). Then the expectation of the random variable Y = g(X1, ..., Xn) is given by³

E(Y) = Σ_{(x1,...,xn)∈R(X)} g(x1, ..., xn) f(x1, ..., xn)          (discrete)
     = ∫ ··· ∫ g(x1, ..., xn) f(x1, ..., xn) dx1 ··· dxn           (continuous).

Proof
This follows from a direct extension of the proof of Theorem 3.2 for the univariate case. □

Example 3.8 Consider a bivariate random variable (X1, X2) with joint pdf

f(x1, x2) = 6 x1 x2² I[0,1](x1) I[0,1](x2),

and the function g(x1, x2) = .5(x1 + x2). Then, by Theorem 3.7,

E(g(X1, X2)) = ∫_0^1 ∫_0^1 (1/2)(x1 + x2) · 6 x1 x2² dx1 dx2 = .7083.
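The double integral in Example 3.8 evaluates to 17/24 ≈ .7083, which can be confirmed numerically. The following is an illustrative sketch (not part of the notes), using a two-dimensional midpoint rule.

```python
# Numerical check of Example 3.8 (an illustrative sketch, not from the notes):
# E[g(X1, X2)] with g = (x1 + x2)/2 under f = 6 x1 x2^2 on [0,1]^2 is 17/24.

def expectation(n=400):
    h = 1.0 / n
    total = 0.0
    for i in range(n):            # midpoint rule in both coordinates
        x1 = (i + 0.5) * h
        for j in range(n):
            x2 = (j + 0.5) * h
            total += 0.5 * (x1 + x2) * 6.0 * x1 * x2 * x2 * h * h
    return total

val = expectation()
assert abs(val - 17 / 24) < 1e-4
print("E[g(X1, X2)] ≈", val)
```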


The expectation property in Theorem 3.6 concerning the sum of functions of univariate random variables can be extended to sums of functions of multivariate random variables. This is indicated in the following theorem:

Theorem 3.8 E( Σ_{i=1}^{k} gi(X1, ..., Xn) ) = Σ_{i=1}^{k} E(gi(X1, ..., Xn)).

³ It is tacitly assumed that the sum and integral are absolutely convergent for the expectation to exist.

Proof
(Continuous case) Let g(X1, ..., Xn) = Σ_{i=1}^{k} gi(X1, ..., Xn). Then, by Theorem 3.7 and the additivity property of Riemann integrals,

E( Σ_{i=1}^{k} gi(X1, ..., Xn) ) = ∫ ··· ∫ Σ_{i=1}^{k} gi(x1, ..., xn) f(x1, ..., xn) dx1 ··· dxn
                                 = Σ_{i=1}^{k} ∫ ··· ∫ gi(x1, ..., xn) f(x1, ..., xn) dx1 ··· dxn
                                 = Σ_{i=1}^{k} E(gi(X1, ..., Xn))

(discrete case analogous). □

Corollary 3.2 E( Σ_{i=1}^{k} Xi ) = Σ_{i=1}^{k} E(Xi) (the expectation of a sum is the sum of the expectations).

Proof
This follows by Theorem 3.8 with gi = Xi. □

In the case that the random variables are independent, the expectation of a product is the product of the expectations, as indicated in the following theorem:

Theorem 3.9 Let X1, ..., Xn be independent random variables. Then E( Π_{i=1}^{n} Xi ) = Π_{i=1}^{n} E(Xi).

Proof
(Continuous case) Let g(X1, ..., Xn) = Π_{i=1}^{n} Xi. Then, by Theorem 3.7 we have

E( Π_{i=1}^{n} Xi ) = ∫ ··· ∫ [ Π_{i=1}^{n} xi ] f(x1, ..., xn) dx1 ··· dxn
                    = ∫ ··· ∫ [ Π_{i=1}^{n} xi ] Π_{i=1}^{n} fi(xi) dx1 ··· dxn   (f(x1, ..., xn) = Π_{i=1}^{n} fi(xi) by independence)
                    = Π_{i=1}^{n} ∫_{−∞}^{∞} xi fi(xi) dxi
                    = Π_{i=1}^{n} E(Xi)

(discrete case analogous). □

3.3. Conditional Expectation


So far, we have considered unconditional expectations, that is, expectations w.r.t. unconditional/marginal distributions. If we take the expectation w.r.t. a conditional distribution, we obtain the conditional expectation.
The conditional expectation is one of the most important concepts used in statistics and econometrics, and is the key element of regression analysis.

Definition 3.3 (Conditional expectation) Let (X1, ..., Xn) and (Y1, ..., Ym) be random vectors with joint pdf f(x1, ..., xn, y1, ..., ym). Let g(Y1, ..., Ym) be a real-valued function of (Y1, ..., Ym). Then the conditional expectation of g(Y1, ..., Ym), given (x1, ..., xn) ∈ B, is defined as

(discrete)    E( g(Y1, ..., Ym) | (x1, ..., xn) ∈ B ) = Σ_{(y1,...,ym)∈R(Y)} g(y1, ..., ym) f(y1, ..., ym | (x1, ..., xn) ∈ B)

(continuous)  E( g(Y1, ..., Ym) | (x1, ..., xn) ∈ B ) = ∫ ··· ∫ g(y1, ..., ym) f(y1, ..., ym | (x1, ..., xn) ∈ B) dy1 ··· dym.

Remark 3.7 An important special case of the definition given above is obtained by setting g(Y1, ..., Ym) = Y, where Y is a univariate random variable:

E( Y | (x1, ..., xn) ∈ B ) = Σ_{y∈R(Y)} y f(y | (x1, ..., xn) ∈ B)            (discrete)
                           = ∫ y f(y | (x1, ..., xn) ∈ B) dy                  (continuous).

Example 3.9 Consider a bivariate random variable (X, Y) with joint pdf

f(x, y) = (1/96)(x² + 2xy + 2y²) I[0,4](x) I[0,2](y).

What is the conditional expectation E(Y | x = 1)? To answer this, we need to compute f(y | x = 1) = f(x, y)/fX(x) |_{x=1}. The marginal pdf for X is

fX(x) = ∫_0^2 f(x, y) dy = ( (1/48) x² + (1/24) x + (1/18) ) I[0,4](x),

such that f(y | x = 1) = [ .088235 + .176471(y + y²) ] I[0,2](y). Thus, we find that

E(Y | x = 1) = ∫ y f(y | x = 1) dy = ∫_0^2 y [ .088235 + .176471(y + y²) ] dy = 1.3529.
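The result of Example 3.9 (exactly 23/17 ≈ 1.3529) can be reproduced by numerical integration. The following is an illustrative sketch (not part of the notes): both ∫ y f(1, y) dy and the marginal fX(1) are computed with a midpoint rule, and their ratio is the conditional mean.

```python
# Numerical check of Example 3.9 (an illustrative sketch, not from the notes):
# E(Y | x = 1) under f(x, y) = (x^2 + 2xy + 2y^2)/96 on [0,4] x [0,2] is 23/17.

def f(x, y):
    return (x * x + 2 * x * y + 2 * y * y) / 96.0

def cond_mean_y(x, n=20_000):
    h = 2.0 / n
    num = den = 0.0
    for k in range(n):            # midpoint rule over y in [0, 2]
        y = (k + 0.5) * h
        num += y * f(x, y) * h    # integral of y * f(x, y) dy
        den += f(x, y) * h        # marginal fX(x)
    return num / den

val = cond_mean_y(1.0)
assert abs(val - 23 / 17) < 1e-6
print("E(Y | x = 1) ≈", val)
```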

Remark 3.8 All properties of expectations discussed above also apply analogously to conditional expectations.

The conditional expectation E(Y | (x1, ..., xn) ∈ B) was introduced as being conditional on a particular event (x1, ..., xn) ∈ B. Rather than specifying a particular event, we might leave the event for (X1, ..., Xn) unspecified and interpret the conditional expectation of Y as a function of (X1, ..., Xn), denoted by E(Y | X1, ..., Xn). Note that E(Y | X1, ..., Xn) is then a function of random variables and, therefore, itself a random variable. E(Y | x1, ..., xn) is referred to as the regression function of a regression of Y on the Xi's.

Example 3.10 Recall the example of the bivariate random variable with joint pdf

f(x, y) = (1/96)(x² + 2xy + 2y²) I[0,4](x) I[0,2](y).

The regression function of a regression of Y on X is obtained as

E(Y | x) = ∫ y f(x, y)/fX(x) dy = ∫_0^2 y (x² + 2xy + 2y²) I[0,4](x) dy / [ (2x² + 4x + 16/3) I[0,4](x) ]
         = (2x² + (16/3)x + 8) / (2x² + 4x + 16/3)   for x ∈ [0, 4].

For x ∉ [0, 4], the regression function is not defined.

Remark 3.9 Note that the regression function is a nonlinear function of x.

The following theorem (referred to as the law of iterated expectations) indicates how we obtain the unconditional expectation of g(Y) from the conditional expectation of the random variable g(Y) conditional on the random variable X.

Theorem 3.10 E( E(g(Y) | X) ) = E(g(Y)).

Proof
(Continuous case) Let f(x, y) be the joint pdf of X and Y, fX(x) the marginal pdf of X, and f(y|x) the conditional pdf. Then

E[ E(g(Y) | X) ] = ∫_{R(X)} [ E(g(Y) | x) ] fX(x) dx = ∫_{R(X)} [ ∫_{R(Y)} g(y) f(y|x) dy ] fX(x) dx     (3.1)
                 = ∫_{R(X)} ∫_{R(Y)} g(y) f(x, y) dy dx            ( f(y|x) fX(x) = f(x, y) )
                 = ∫_{R(Y)} g(y) [ ∫_{R(X)} f(x, y) dx ] dy        ( ∫_{R(X)} f(x, y) dx = fY(y) )
                 = E(g(Y)),

where E(g(Y) | X) in (3.1) is a random variable. (Discrete case analogous.) □

Remark 3.10 The law of iterated expectations straightforwardly extends to the case where the random variables Y and/or X are multivariate (being random vectors). In particular, we get

E( E(g(Y1, ..., Yn) | X1, ..., Xn) ) = E( g(Y1, ..., Yn) ).

An interesting consequence is that E(Y) = c whenever E(Y | X) = c for some random variable X.
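The law of iterated expectations is transparent in a small discrete example. The following is an illustrative sketch (not part of the notes, with a hypothetical joint pmf chosen for illustration): averaging the conditional means E(Y | X = x) with the marginal weights P(X = x) reproduces the unconditional mean E(Y).

```python
# Illustrative check of the law of iterated expectations (a sketch, not from the
# notes) on a small, hypothetical discrete joint pmf for (X, Y).

pmf = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

def p_x(xv):                    # marginal P(X = xv)
    return sum(p for (x, _), p in pmf.items() if x == xv)

def e_y_given(xv):              # conditional E(Y | X = xv)
    return sum(y * p for (x, y), p in pmf.items() if x == xv) / p_x(xv)

e_y = sum(y * p for (_, y), p in pmf.items())               # unconditional E(Y)
iterated = sum(e_y_given(xv) * p_x(xv) for xv in (0, 1))    # E( E(Y | X) )
assert abs(iterated - e_y) < 1e-12
print("E(E(Y|X)) =", iterated, "= E(Y) =", e_y)
```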

3.4. Moments of a Random Variable


Moments of random variables are expectations of power functions of random variables. They can be used to measure certain characteristics of the pdf of the random variable, for example, the dispersion and skewness. There are two types of moments, namely, non-central and central moments.

Definition 3.4 (rth non-central moment) Let X be a random variable with pdf f(x). Then the rth non-central moment of X, denoted by μ′r, is defined as

μ′r = E(X^r) = Σ_{x∈R(X)} x^r f(x)              (discrete)
             = ∫_{−∞}^{∞} x^r f(x) dx            (continuous).

Remark 3.11 The first non-central moment is simply the expectation (also called the mean) of the random variable, that is, μ′1 = E(X), and will be denoted by the symbol μ. Furthermore, note that μ′0 = E(X⁰) = 1.

Definition 3.5 (rth central moment) Let X be a random variable with pdf f(x). Then the rth central moment of X, denoted by μr, is defined for r ∈ ℕ as

μr = E((X − μ)^r) = Σ_{x∈R(X)} (x − μ)^r f(x)    (discrete)
                  = ∫_{−∞}^{∞} (x − μ)^r f(x) dx  (continuous).

Remark 3.12 Note that μ0 = E((X − μ)⁰) = 1, and μ1 = E(X − μ) = 0. Note further that one may actually extend the definition of the moments to real r, with proper care; e.g. to avoid complex roots of negative numbers, it is not uncommon to work with E(|X|^r) when r is not a natural number. The more commonly used moments are the ones defined by natural r.

The second central moment is commonly known as the variance.

Definition 3.6 (Variance and standard deviation) The variance of a random variable X is the 2nd central moment, Var(X) = E((X − μ)²), and will be denoted by the symbol σ². The non-negative square root of Var(X) is the standard deviation of X and will be denoted by the symbol σ.

The variance and standard deviation are measures of the dispersion of a distribution around the mean: the larger the variance, the larger the dispersion. They are not the only measures of dispersion (or scale). The relationship between the variance and the dispersion can be examined by means of Chebyshev's inequality, which is a special case of Markov's inequality.

Theorem 3.11 (Markov's inequality) Let X be a random variable with pdf f, and let g be a non-negative function of X. Then

P(g(x) ≥ a) ≤ E(g(X))/a   for any a > 0.

Proof
(Discrete case) We can decompose E(g(X)) into

E(g(X)) = Σ_{x∈R(X)} g(x) f(x) = Σ_{x:g(x)<a} g(x) f(x) + Σ_{x:g(x)≥a} g(x) f(x),

where both partial sums are non-negative, such that

E(g(X)) ≥ Σ_{x:g(x)≥a} g(x) f(x)
        ≥ Σ_{x:g(x)≥a} a f(x)                     (since g(x) ≥ a ∀x ∈ {x : g(x) ≥ a})
        = a Σ_{x:g(x)≥a} f(x) = a P(g(x) ≥ a),

and thus P(g(x) ≥ a) ≤ E(g(X))/a. (Continuous case analogous.) □

Remark 3.13 Markov's inequality states that the probability for g(x) ≥ a always has an upper bound (independent of the probability distribution) as long as g(x) is non-negative valued. Note that the upper bound for the probability is increasing with the expectation E(g(X)), and that finiteness of the expectation is required.

As a special case of Markov's inequality we obtain Chebyshev's inequality.

Corollary 3.3 (Chebyshev's inequality) P(|x − μ| ≥ kσ) ≤ 1/k²   for k > 0.

Proof
Let

g(x) = (x − μ)²/σ² ≥ 0,   where   E(g(X)) = E((X − μ)²)/σ² = 1,

and, for convenience, set a = k² > 0. Then Markov's inequality implies

P( (x − μ)²/σ² ≥ k² ) ≤ E((X − μ)²)/(σ² k²) = 1/k².

Doing some obvious algebra, we get the inequality

P(|x − μ| ≥ kσ) ≤ 1/k². □

Remark 3.14
1. From further obvious algebra, we also get P(|x − μ| < kσ) ≥ 1 − 1/k².
2. Chebyshev's inequality allows us to examine the relationship between the variance σ² and the dispersion of a pdf. For this purpose, set kσ = c in Chebyshev's inequality, where c > 0. Then

P(|x − μ| ≥ c) ≤ σ²/c² → 0   as σ² → 0.

This implies e.g. that, as σ² → 0, the pdf concentrates over the interval (μ − c, μ + c) for
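Chebyshev's bound can be checked against simulated data. The following is an illustrative sketch (not part of the notes): for uniform draws on [0, 1] (with μ = 1/2 and σ² = 1/12), the empirical frequency of |x − μ| ≥ kσ stays below 1/k², although the bound is quite loose for this distribution.

```python
import math, random

# Illustrative check of Chebyshev's inequality (a sketch, not from the notes):
# for uniform draws on [0, 1] (mu = 1/2, sigma^2 = 1/12), the empirical
# frequency of |x - mu| >= k*sigma should never exceed the bound 1/k^2.

random.seed(42)
mu, sigma = 0.5, math.sqrt(1 / 12)
xs = [random.random() for _ in range(100_000)]
for k in (1.5, 2.0, 3.0):
    freq = sum(abs(x - mu) >= k * sigma for x in xs) / len(xs)
    assert freq <= 1 / k ** 2  # the bound is loose here but always valid
    print(f"k={k}: empirical {freq:.4f} <= bound {1 / k ** 2:.4f}")
```

For k = 2 the threshold kσ ≈ .577 exceeds the maximal possible deviation .5, so the empirical frequency is exactly zero while the bound is still .25, illustrating how conservative Chebyshev's inequality can be.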

The third central moment E((X − μ)³) can be used as a measure of skewness of the pdf of X, that means the deviation from symmetry around μ.

Definition 3.7 (Symmetry of a pdf) The pdf f is said to be symmetric around μ iff

f(μ + δ) = f(μ − δ)   for any δ > 0.

Otherwise f is said to be skewed.

A symmetric pdf has μ3 = E((X − μ)³) = 0.⁴ For μ3 > 0 (μ3 < 0) the pdf is said to be skewed to the right (left); but note that precise interpretations of the value of the skewness depend on the scale and shape of the density; it is not uncommon to report a standardized skewness (μ3/σ³). E.g. one can easily build counterexamples of asymmetric pdfs with zero skewness.
With respect to the existence of non-central moments, the following theorem is useful:

Theorem 3.12 If E(|X|^r) exists for an r > 0, then E(|X|^s) exists ∀s ∈ [0, r].

Proof
(Continuous case) We need to show that if ∫_{−∞}^{∞} |x|^r f(x) dx < ∞, then ∫_{−∞}^{∞} |x|^s f(x) dx < ∞ for s ≤ r. For this purpose, define the sets

A_{<1} = {x : |x|^s < 1}   and   A_{≥1} = {x : |x|^s ≥ 1},

such that

∫_{−∞}^{∞} |x|^s f(x) dx = ∫_{x∈A_{<1}} |x|^s f(x) dx + ∫_{x∈A_{≥1}} |x|^s f(x) dx.

Since f(x) ≥ |x|^s f(x) ∀x ∈ A_{<1}, we can write

P(|x|^s < 1) = ∫_{x∈A_{<1}} f(x) dx ≥ ∫_{x∈A_{<1}} |x|^s f(x) dx.

Now note that, for r ≥ s, we have |x|^r ≥ |x|^s ∀x ∈ A_{≥1}. It follows that

∫_{x∈A_{≥1}} |x|^r f(x) dx ≥ ∫_{x∈A_{≥1}} |x|^s f(x) dx.

Finally,

∫_{−∞}^{∞} |x|^s f(x) dx ≤ P(|x|^s < 1) + ∫_{x∈A_{≥1}} |x|^r f(x) dx
                         ≤ P(|x|^s < 1) + ∫_{−∞}^{∞} |x|^r f(x) dx      (since ∫_{x∈A_{<1}} |x|^r f(x) dx ≥ 0)
                         < ∞                                            (since P(|x|^s < 1) ∈ [0, 1] and E(|X|^r) exists).

The proof for the discrete case is analogous. □

⁴ Though the condition μ3 = 0 is necessary for a symmetric pdf, it is not sufficient; see Mittelhammer (1996, p. 136).

Remark 3.15 The theorem implies that if we can show the existence of the rth order non-central moment, then all lower-order non-central moments are known to exist. It also implies that if E(|X|^r) does not exist (is infinite), then necessarily E(|X|^k) cannot exist for k > r; otherwise, Theorem 3.12 would be contradicted.

Remark 3.16 The proof of the theorem can be shortened with the help of Lyapunov's (norm) inequality, which states that (E[|X|^s])^{1/s} ≤ (E[|X|^r])^{1/r} for all 0 < s ≤ r.

Example 3.11 Consider the pdf

f(x) = 2/(x + 1)³ I[0,∞)(x).

Examine E(X^r), that is

E(X^r) = ∫_0^∞ x^r · 2/(x + 1)³ dx = 2 ∫_1^∞ (y − 1)^r y^{−3} dy

(the 2nd equality is obtained by substituting y = x + 1, so that y − 1 = x and dy = dx). If r = 2, we get

E(X²) = 2 lim_{b→∞} [ ln(y) + 2y^{−1} − (1/2) y^{−2} ]_{y=1}^{y=b} = ∞.

Thus, E(X²) does not exist. By Theorem 3.12, moments of order larger than 2 also do not exist.

With respect to the existence of central moments, the following theorem is useful:

59

3. Moments of random variables


Theorem 3.13 If E(|Y − E(Y)|^r) exists for an r > 0, then E(|Y − E(Y)|^s) exists ∀ s ∈ [0, r].
Proof
This follows from Theorem 3.12 upon defining X = Y − E(Y). □

In addition to the moments of a random variable, there exist further useful measures of pdf
characteristics, including the median and quantiles.
Definition 3.8 (Median) Any number, b, satisfying

P(x ≤ b) ≥ 1/2   and   P(x ≥ b) ≥ 1/2

is called a median of X and is denoted by med(X).


The median is a measure of location (for the center of the distribution) and also a special
quantile of a distribution.
Definition 3.9 (Quantile) Any number, b_p, satisfying

P(x ≤ b_p) ≥ p   and   P(x ≥ b_p) ≥ 1 − p

is called a pth quantile of the distribution of X (or a (100p)th percentile of the distribution of X).
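For a discrete distribution the two defining inequalities can be checked directly; a small sketch for a fair die (the pmf and the choice p = 1/2 are illustrative; exact fractions avoid spurious failures of the ≥ comparisons in floating point):

```python
from fractions import Fraction

# Definition 3.9 for a fair die: b_p is a p-quantile iff
# P(X <= b_p) >= p  and  P(X >= b_p) >= 1 - p.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def is_quantile(b, p):
    le = sum(w for x, w in pmf.items() if x <= b)
    ge = sum(w for x, w in pmf.items() if x >= b)
    return le >= p and ge >= 1 - p

# For p = 1/2 both 3 and 4 qualify: a median need not be unique.
medians = [b for b in range(1, 7) if is_quantile(b, Fraction(1, 2))]
print(medians)  # [3, 4]
```

The non-uniqueness shown here is exactly why the definition says "a median" rather than "the median".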

3.5. Moment-Generating Functions


As its name suggests, the Moment-Generating Function (MGF) can be used to determine
moments of a random variable. However, the main use of the MGF is not to generate moments,
but to help in characterizing a distribution.
Definition 3.10 (Moment-Generating Function) The MGF of a random variable X, denoted by M_X(t), is

M_X(t) = E(e^{tX}) = Σ_{x ∈ R(X)} e^{tx} f(x)     (discrete)
                   = ∫ e^{tx} f(x) dx             (continuous),

provided that the expectation exists for t in some neighborhood of 0. That is, there exists an h > 0 such that E(e^{tX}) exists ∀ t ∈ (−h, h).



Remark 3.17 The condition that M_X(t) be defined ∀ t ∈ (−h, h) is a technical condition ensuring that M_X(t) is differentiable at the point t = 0, a property which will become evident shortly.

The following theorem indicates how the MGF generates non-central moments.

Theorem 3.14 Let X be a random variable for which the MGF M_X(t) exists. Then

μ'_r = E(X^r) = d^r M_X(t)/dt^r |_{t=0}.
Proof
(Continuous Case) Given differentiability of M_X in a neighbourhood of 0, we may differentiate under the integral sign (interchanging the order of integration and differentiation) and obtain⁵

d^r M_X(t)/dt^r = (d^r/dt^r) ∫ e^{tx} f(x) dx = ∫ [d^r e^{tx}/dt^r] f(x) dx = ∫ x^r e^{tx} f(x) dx.

Evaluating at t = 0 yields ∫ x^r f(x) dx = E(X^r). (Discrete case analogous.) □

Example 3.12 Consider the pdf

f(x) = e^{−x} I_{(0,∞)}(x)     (pdf of an exponential distribution).

The MGF is given by

M_X(t) = ∫ e^{tx} e^{−x} I_{(0,∞)}(x) dx = ∫₀^∞ e^{x(t−1)} dx = [e^{x(t−1)}/(t − 1)]_{x=0}^{x=∞} = 0 − 1/(t − 1) = 1/(1 − t)     for t < 1.

The mean and the 2nd non-central moment are given by

μ = dM_X(t)/dt |_{t=0} = 1/(1 − t)² |_{t=0} = 1,
μ'₂ = d²M_X(t)/dt² |_{t=0} = 2/(1 − t)³ |_{t=0} = 2.
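The two derivatives can be reproduced symbolically (a quick sketch using sympy):

```python
import sympy as sp

t = sp.symbols('t')
M = 1 / (1 - t)  # MGF of the unit exponential, valid for t < 1

mu1 = sp.diff(M, t).subs(t, 0)      # E(X)   = 1
mu2 = sp.diff(M, t, 2).subs(t, 0)   # E(X^2) = 2
print(mu1, mu2)  # 1 2
```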

⁵ If M_X(t) = ∫ e^{tx} f(x) dx exists for t ∈ (−h, h), then d^r M_X(t)/dt^r exists ∀ t ∈ (−h, h) and for all positive integers r, and the derivative of M_X(t) can be found by differentiating under the integral sign (see Mittelhammer 1996, Lemma 3.3, p. 142). More details can be found in Casella and Berger (2002), Chap. 2.4.






Remark 3.18 The MGF M_X(t) = E(e^{tX}) can be written as a series expansion in terms of the moments of the pdf of X. In particular, a Taylor-series expansion of g(t) = e^{tX} around t = 0 yields

M_X(t) = E(e^{tX}) = E(e^{0·X}) + E(X e^{0·X}) (1/1!) t + E(X² e^{0·X}) (1/2!) t² + E(X³ e^{0·X}) (1/3!) t³ + ⋯
       = 1 + μ'₁ t + μ'₂ (1/2!) t² + μ'₃ (1/3!) t³ + ⋯
       = Σ_{i=0}^∞ (t^i/i!) μ'_i.
This representation indicates that if the MGF exists, it characterizes an infinite set of moments.
But note that the existence of all moments is not equivalent to the existence of the MGF.⁶

The following list summarizes useful elementary results for MGFs. Let X₁, ..., Xₙ be independent random variables having respective MGFs M_{Xᵢ}(t), i = 1, ..., n. Then we get

1. for Y = aXᵢ + b the MGF

M_Y(t) = E(e^{(aXᵢ + b)t}) = e^{bt} M_{Xᵢ}(at);

2. for Y = Σᵢ₌₁ⁿ Xᵢ the MGF

M_Y(t) = E(e^{(Σᵢ₌₁ⁿ Xᵢ)t}) = E(∏ᵢ₌₁ⁿ e^{Xᵢt}) = ∏ᵢ₌₁ⁿ E(e^{Xᵢt}) = ∏ᵢ₌₁ⁿ M_{Xᵢ}(t),

where the third equality holds by independence;

3. for Y = Σᵢ₌₁ⁿ aᵢXᵢ + b the MGF

M_Y(t) = e^{bt} ∏ᵢ₌₁ⁿ M_{Xᵢ}(aᵢt).
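These rules can be checked symbolically for unit exponentials from Example 3.12 (a sketch; the constants a = 3 and b = 5 are arbitrary illustrative choices):

```python
import sympy as sp

t = sp.symbols('t')
M = lambda s: 1 / (1 - s)  # MGF of a unit exponential (Example 3.12)

# Rule 2: for independent X1, X2, the MGF of X1 + X2 is M(t) * M(t).
M_sum = M(t) * M(t)
EY = sp.diff(M_sum, t).subs(t, 0)       # E(X1) + E(X2) = 2
EY2 = sp.diff(M_sum, t, 2).subs(t, 0)   # second non-central moment = 6

# Rule 1: Y = a*X + b has MGF e^{bt} M(at), so E(Y) = a*E(X) + b.
a, b = 3, 5
EY_lin = sp.diff(sp.exp(b * t) * M(a * t), t).subs(t, 0)  # 3*1 + 5 = 8
print(EY, EY2, EY_lin)  # 2 6 8
```

Reading the moments off the derivatives at t = 0 uses Theorem 3.14.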

The following theorem indicates that the MGF can be useful for identifying the pdf of a given random variable.

Theorem 3.15 (MGF Uniqueness Theorem) If an MGF exists for a random variable X
having pdf f (x), then
1. the MGF is unique;

⁶ An example of a distribution whose MGF does not exist although all of its non-central moments exist and are finite can be found in Casella and Berger (2002), Example 2.3.10.




2. and, conversely, the MGF determines the pdf of X uniquely, at least up to a set of points
having probability 0.
Proof
See, e.g., Widder (1961, p. 41), Advanced Calculus, Englewood Cliffs: Prentice-Hall.

Remark 3.19 The theorem says that there is essentially a one-to-one correspondence between
pdfs and MGFs:
1. A pdf has one and only one MGF associated with it, if an MGF exists at all.
2. Furthermore, there is typically only one pdf associated with a given MGF. (If there is
more than one pdf, then they differ only at a set of points having probability 0).
Remark 3.20 This correspondence between pdfs and MGFs implies the following: if the MGF of a given random variable X is known, and if one knows a pdf that produces exactly this MGF, then that pdf can be treated as the pdf of the random variable X.
Example 3.13 Suppose that Z has an MGF defined by M_Z(t) = 1/(1 − t) for |t| < 1. Now, consider the pdf

f(x) = e^{−x} I_{(0,∞)}(x),

which has an MGF

M_X(t) = 1/(1 − t)

(see the previous example). Then, by the uniqueness theorem, the pdf of Z can be specified as

f(z) = e^{−z} I_{(0,∞)}(z).

The MGF can be extended to the case of a multivariate random variable.


Definition 3.11 (Moment-Generating Function; multivariate) The MGF of a multivariate random variable X = (X₁, ..., Xₙ)' is

M_X(t) = E(e^{t'X}) = E(e^{Σᵢ₌₁ⁿ tᵢXᵢ}),     where t = (t₁, ..., tₙ)',

provided that the expectation exists for all tᵢ in some neighborhood of 0, i = 1, ..., n. That is, there exists an h > 0 such that E(e^{t'X}) exists ∀ tᵢ ∈ (−h, h), i = 1, ..., n.

Remark 3.21 The rth order non-central moment of Xᵢ is obtained from the rth order partial derivative w.r.t. tᵢ:

μ'_r(Xᵢ) = E(Xᵢ^r) = ∂^r M_X(t)/∂tᵢ^r |_{t=0}.



Remark 3.22 If we take cross partial derivatives of the multivariate MGF, we obtain joint non-central moments (which will be discussed in the next section in more detail):

E(Xᵢ^r Xⱼ^s) = ∂^{r+s} M_X(t)/(∂tᵢ^r ∂tⱼ^s) |_{t=0}.

Remark 3.23 An analog of the MGF uniqueness theorem establishing a correspondence between joint pdfs and multivariate MGFs applies to the multivariate MGF.

Remark 3.24 In cases where the MGF of a random variable X does not exist, it may be replaced with the so-called characteristic function, φ_X(t) = E(e^{itX}), where i is the imaginary unit, i² = −1. The characteristic function can be shown to always exist, but the presence of the imaginary unit i makes it more difficult to deal with, and we do not go into details.

3.6. Joint Moments and Moments of Linear Combinations


In the case of multivariate random variables, joint moments characterize the relationship between the individual variables.

Definition 3.12 (Joint non-central moment) Let X and Y be two random variables with joint pdf f(x, y). Then the joint non-central moment of (X, Y) of order (r, s) is defined as

μ'_{r,s} = E(X^r Y^s) = Σ_{x ∈ R(X)} Σ_{y ∈ R(Y)} x^r y^s f(x, y)     (discrete)
         = ∫∫ x^r y^s f(x, y) dx dy                                  (cont.).

Definition 3.13 (Joint central moment) Let X and Y be two random variables with joint pdf f(x, y). Then the joint central moment of (X, Y) of order (r, s) is defined as

μ_{r,s} = E[(X − E X)^r (Y − E Y)^s] = Σ_{x ∈ R(X)} Σ_{y ∈ R(Y)} (x − E X)^r (y − E Y)^s f(x, y)     (discrete)
        = ∫∫ (x − E X)^r (y − E Y)^s f(x, y) dx dy                                                  (cont.).

The joint moment of order (1, 1), namely μ_{1,1}, is commonly known as the covariance, which measures the linear association between X and Y.




Definition 3.14 (Covariance) The covariance between the random variables X and Y is the joint central moment of order (1, 1),

Cov(X, Y) = E[(X − E(X))(Y − E(Y))],

and will be denoted by the symbol σ_{XY}.
Remark 3.25 The covariance can be represented in terms of non-central moments, namely

σ_{XY} = E[(X − E(X))(Y − E(Y))] = E[XY − E(X)Y − E(Y)X + E(X)E(Y)] = E(XY) − E(X)E(Y).

From this relationship we obtain the result that

E(XY) = E(X)E(Y)     iff     σ_{XY} = 0.

The covariance has an upper bound in absolute value, which depends on the variances of the corresponding random variables. This upper bound follows from the Cauchy-Schwarz Inequality.

Theorem 3.16 (Cauchy-Schwarz Inequality) [E(WZ)]² ≤ E(W²) E(Z²).
Proof
Consider the random variable (α₁W + α₂Z)², which is non-negative ∀ (α₁, α₂); hence

E[(α₁W + α₂Z)²] ≥ 0     ∀ (α₁, α₂),
α₁² E(W²) + α₂² E(Z²) + 2α₁α₂ E(WZ) ≥ 0     ∀ (α₁, α₂),

[α₁, α₂] A [α₁, α₂]' ≥ 0     ∀ (α₁, α₂),     where     A = ⎡ E(W²)   E(WZ) ⎤
                                                          ⎣ E(WZ)   E(Z²) ⎦.

Note that the last inequality is the defining property of positive semidefiniteness for the (2 × 2) matrix A. The positive semidefiniteness of A requires that

E(W²) ≥ 0,     E(Z²) ≥ 0,     |A| = E(W²) E(Z²) − [E(WZ)]² ≥ 0.

The last inequality implies the desired result. □




A generalization of Cauchy-Schwarz is given by Hölder's inequality, stating that E(|WZ|) ≤ (E(|W|^p))^{1/p} (E(|Z|^q))^{1/q} for any 1/p + 1/q = 1, p, q > 1, for which the expectations exist. The Cauchy-Schwarz Inequality allows us to establish an upper bound for the covariance, as indicated in the following theorem.
Theorem 3.17 (Covariance bound) |σ_{XY}| ≤ σ_X σ_Y.
Proof
Let W = X − E(X) and Z = Y − E(Y) in the Cauchy-Schwarz Inequality. Then

(E[(X − E(X))(Y − E(Y))])² ≤ E[(X − E(X))²] E[(Y − E(Y))²],

such that σ²_{XY} ≤ σ²_X σ²_Y, or equivalently |σ_{XY}| ≤ σ_X σ_Y. □

Using this upper bound, we can define a normalized version of the covariance, the so-called correlation. The correlation, unlike the covariance, is scale-invariant.

Definition 3.15 (Correlation) The correlation between the random variables X and Y is defined by

corr(X, Y) = ρ_{XY} = σ_{XY}/(σ_X σ_Y).

From the upper bound of the covariance, we obtain directly an upper bound for the correlation, as indicated in the following theorem.

Theorem 3.18 (Correlation bound) −1 ≤ ρ_{XY} ≤ 1.
Proof
This follows directly from the upper bound for the covariance, |σ_{XY}| ≤ σ_X σ_Y. □

A fundamental relationship between the covariance and stochastic (in)dependence is indicated in the next theorem.

Theorem 3.19 If X and Y are independent, then σ_{XY} = 0 and ρ_{XY} = 0.
Proof
(Discrete case) If X and Y are independent, then f(x, y) = f_X(x) f_Y(y). It follows that

σ_{XY} = Σ_{x ∈ R(X)} Σ_{y ∈ R(Y)} (x − E(X))(y − E(Y)) f_X(x) f_Y(y)     (by def. of σ_{XY})
       = [Σ_{x ∈ R(X)} (x − E(X)) f_X(x)] · [Σ_{y ∈ R(Y)} (y − E(Y)) f_Y(y)]
       = (E(X) − E(X)) · (E(Y) − E(Y)) = 0.

(Continuous case analogous.) □



Remark 3.26 The converse of Theorem 3.19 is not true: the fact that σ_{XY} = 0 does not necessarily imply that X and Y are independent. This is illustrated in the following example.
Example 3.14 Let X and Y have the joint pdf f(x, y) = 1.5 I_{[−1,1]}(x) I_{[0,x²]}(y). Note that this is a uniform pdf with support given by the points (x, y) on and below the parabola y = x². The range of Y depends on X, so that the support of f(y|x) depends on x, and thus X and Y must be dependent! Nonetheless, σ_{XY} = 0. To see this, note that

E(XY) = 1.5 ∫_{−1}^{1} ∫_{0}^{x²} xy dy dx = 1.5 ∫_{−1}^{1} x [y²/2]_{y=0}^{y=x²} dx = 1.5 ∫_{−1}^{1} (1/2) x⁵ dx = 0,

E(X) = ∫_{−1}^{1} x [∫_{0}^{x²} 1.5 dy] dx = ∫_{−1}^{1} x · 1.5x² dx = 0     (the inner integral is f_X(x) = 1.5x²),

E(Y) = 0.3.

Therefore, σ_{XY} = E(XY) − E(X) E(Y) = 0.
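A Monte Carlo sketch of this example (rejection sampling from the enclosing rectangle; the sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw uniformly from the rectangle [-1, 1] x [0, 1] and keep the points
# below the parabola y = x^2; the kept points follow f(x, y) = 1.5 on
# that region, where X and Y are dependent but uncorrelated.
x = rng.uniform(-1, 1, 600_000)
y = rng.uniform(0, 1, 600_000)
keep = y <= x ** 2
x, y = x[keep], y[keep]

cov_xy = np.mean(x * y) - x.mean() * y.mean()
print(round(cov_xy, 3), round(y.mean(), 2))  # approximately 0 and 0.3
```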


The next theorem indicates that if the correlation takes its maximum absolute value, that is ρ_{XY} = 1 or ρ_{XY} = −1, then there is a perfect linear relationship between X and Y.

Theorem 3.20 If ρ_{XY} = 1 or −1, then P(y = a + bx) = 1, where b ≠ 0.
Proof
Define Z = α₁(X − E(X)) + α₂(Y − E(Y)), where E Z = 0 and

Var(Z) = E(Z²) = α₁² E[(X − E(X))²] + α₂² E[(Y − E(Y))²] + 2α₁α₂ E[(X − E(X))(Y − E(Y))]

       = [α₁, α₂] A [α₁, α₂]' ≥ 0     ∀ (α₁, α₂),     where     A = ⎡ σ²_X     σ_{XY} ⎤
                                                                    ⎣ σ_{XY}   σ²_Y   ⎦.

Now, if ρ_{XY} = 1 or −1, then σ²_{XY} = σ²_X σ²_Y, which implies that |A| = 0, such that A is singular. In this case, the columns of A are linearly dependent, so that there exist α₁ ≠ 0 and α₂ ≠ 0 such that

⎡ σ²_X     σ_{XY} ⎤ ⎡ α₁ ⎤   ⎡ 0 ⎤
⎣ σ_{XY}   σ²_Y   ⎦ ⎣ α₂ ⎦ = ⎣ 0 ⎦.

A solution for α₁ and α₂ is given by α₁ = −σ_{XY}/σ²_X and α₂ = 1. Since Var(Z) = 0 at those values for α₁, α₂, we have

P(z = E(Z)) = P(z = 0) = 1.

Inserting the definition of Z together with α₁ = −σ_{XY}/σ²_X and α₂ = 1 yields

P(y = [E(Y) − (σ_{XY}/σ²_X) E(X)] + [σ_{XY}/σ²_X] x) = P(y = a + bx) = 1. □




Remark 3.27 If ρ_{XY} = 1 or −1, such that P(y = a + bx) = 1, then the joint pdf f(x, y) is degenerate: all the probability mass of f(x, y) is concentrated on the line y = a + bx. This generates a perfect linear relationship between X and Y.

3.7. Means and Variances of Linear Combinations of Random Variables
The earlier results for the mean and variance of random variables can be extended to obtain
the mean and variance of linear combinations of random variables. The first result concerns
the mean.
Theorem 3.21 Let Y = Σᵢ₌₁ⁿ aᵢXᵢ, where the aᵢ's are real constants. Then E(Y) = Σᵢ₌₁ⁿ aᵢ E(Xᵢ).
Proof
This follows directly from Theorem 3.8, indicating that the expectation of a sum of random variables is equal to the sum of their expectations. □

Remark 3.28 The matrix representation of this result is obtained as follows. Let

a = (a₁, ..., aₙ)'   and   X = (X₁, ..., Xₙ)'.

Then Y = a'X, such that E(Y) = a' E(X).

For the variance of a linear combination, we have the following result.

Theorem 3.22 Let Y = Σᵢ₌₁ⁿ aᵢXᵢ, where the aᵢ's are real constants. Then

σ²_Y = Σᵢ₌₁ⁿ aᵢ² σ²_{Xᵢ} + 2 Σ_{i<j} aᵢaⱼ σ_{XᵢXⱼ}.

Proof
We have

σ²_Y = E[(Y − E(Y))²] = E[(Σᵢ₌₁ⁿ aᵢXᵢ − Σᵢ₌₁ⁿ aᵢ E(Xᵢ))²] = E[(Σᵢ₌₁ⁿ aᵢ(Xᵢ − E(Xᵢ)))²]
     = E[Σᵢ₌₁ⁿ aᵢ²(Xᵢ − E(Xᵢ))² + 2 Σ_{i<j} aᵢaⱼ(Xᵢ − E(Xᵢ))(Xⱼ − E(Xⱼ))]
     = Σᵢ₌₁ⁿ aᵢ² σ²_{Xᵢ} + 2 Σ_{i<j} aᵢaⱼ σ_{XᵢXⱼ}. □



In order to rewrite this result in matrix notation, we shall define the covariance matrix of a multivariate random variable.

Definition 3.16 (Covariance matrix) The covariance matrix of the n-dimensional random vector X = (X₁, ..., Xₙ)' is the n × n symmetric matrix

Cov(X) = E[(X − E(X))(X − E(X))'] =
    ⎡ σ²_{X₁}     σ_{X₁X₂}   ⋯   σ_{X₁Xₙ} ⎤
    ⎢ σ_{X₂X₁}    σ²_{X₂}    ⋯   σ_{X₂Xₙ} ⎥
    ⎢    ⋮           ⋮       ⋱      ⋮     ⎥
    ⎣ σ_{XₙX₁}    σ_{XₙX₂}   ⋯   σ²_{Xₙ}  ⎦.

Remark 3.29 The variance of the ith variable in X is given by the (i, i)th diagonal entry of the covariance matrix. The covariance of the ith and jth variables is displayed in the (i, j)th as well as the (j, i)th off-diagonal entry of the covariance matrix. Note that this implies that a covariance matrix is symmetric, that is, Cov(X) = Cov(X)'.

Remark 3.30 Let a = (a₁, ..., aₙ)' and X = (X₁, ..., Xₙ)'. Then the variance of Y = a'X given in Theorem 3.22 can obviously be represented as

σ²_Y = a' Cov(X) a.

Note that since a variance is non-negative (σ²_Y ≥ 0), the expression a' Cov(X) a is also non-negative for any a. This implies that a covariance matrix is necessarily positive semidefinite!

The preceding results can be extended to the case where Y is a vector of linear combinations of a random vector X.
Theorem 3.23 Let Y = AX, where A = (a_{hm}) is a k × n matrix of real constants, and X = (Xᵢ) is an n × 1 vector of random variables. Then E(Y) = A E(X).
Proof
The linear combination AX can be written as

Y = AX = ⎡ a₁₁X₁ + a₁₂X₂ + ⋯ + a₁ₙXₙ       ⎤
         ⎢            ⋮                    ⎥
         ⎣ a_{k1}X₁ + a_{k2}X₂ + ⋯ + a_{kn}Xₙ ⎦,

such that E(Y) = A E(X) follows immediately from the application of Theorem 3.21 (expectation of one linear combination) to each of the k linear combinations. □

For the covariance matrix of k linear combinations, we have the following result.

Theorem 3.24 Let Y = AX, where A = (a_{hm}) is a k × n matrix of real constants, and X = (Xᵢ) is an n × 1 vector of random variables. Then Cov(Y) = A Cov(X) A'.

Proof
We have by definition of a covariance matrix

Cov(Y) = E[(Y − E(Y))(Y − E(Y))'],

where Y − E(Y) = AX − A E(X) = A(X − E(X)). Thus we get

Cov(Y) = E[A(X − E(X))(X − E(X))' A'] = A E[(X − E(X))(X − E(X))'] A' = A Cov(X) A'. □
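Both matrix results can be illustrated numerically (a sketch; the covariance matrix, the vectors, the seed, and the sample size are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# An arbitrary 3 x 3 covariance matrix (B @ B.T is positive semidefinite).
B = rng.normal(size=(3, 3))
Sigma = B @ B.T

# Remark 3.30: Var(a'X) = a' Cov(X) a >= 0.
a = np.array([1.0, -2.0, 0.5])
var_Y = a @ Sigma @ a
assert var_Y >= 0

# Theorem 3.24 checked by simulation: Cov(AX) = A Cov(X) A'.
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, -1.0]])
X = rng.multivariate_normal(np.zeros(3), Sigma, size=400_000)
sample_cov = np.cov((X @ A.T).T)
print(np.allclose(sample_cov, A @ Sigma @ A.T, atol=0.1))  # True
```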


4. Parametric families of density functions


In this chapter we consider specific density functions of common statistical distributions, including those of the binomial, Poisson, and normal distributions.

We usually deal with a parametric family of density functions. That is, the pdf f is indexed by one or more parameters, say θ, which allow us to vary certain characteristics of the distribution while staying within one functional form. A specific member of a family of densities is associated with a specific value of the parameters.

In the following we will use the generic notation f(x; θ) to denote a family of densities for a random variable X. The admissible values of the parameters are called the parameter space and will be denoted by Θ.
Each parametric family of densities has its own distinguishing characteristics that make the
pdfs appropriate for specifying the probability space of some experiments and inappropriate
for others.
The characteristics include
whether the pdfs are discrete or continuous,
whether the pdfs are restricted to positive-valued and/or integer-valued random variables,
whether the pdfs are symmetric or skewed.
In the following we will consider a list of commonly used parametric families of density functions.
For each family, we will consider the major characteristics and application contexts.



4.1. Discrete Density Functions


Family Name: Discrete Uniform
Parameterization: Θ_N = {N : N is a positive integer}
Density Definition: f(x; N) = (1/N) I_{{1,2,...,N}}(x)
Moments: μ = (N + 1)/2, σ² = (N² − 1)/12, μ₃ = 0
MGF: M_X(t) = (1/N) Σ_{j=1}^{N} e^{jt}

Background and Application: The discrete uniform distribution assigns equal probability
to each of N possible outcomes of an experiment. Hence, the discrete uniform can be used to
model the probability space of any experiment having N outcomes that are all equally likely.
Example 4.1 Consider the experiment of rolling a die. The pdf of the number of dots facing up is

f(x; N = 6) = (1/6) I_{{1,2,...,6}}(x),

and belongs to the family of discrete uniforms.
Family Name: Bernoulli
Parameterization: Θ_p = {p : 0 ≤ p ≤ 1}
Density Definition: f(x; p) = p^x (1 − p)^{1−x} I_{{0,1}}(x)
Moments: μ = p, σ² = p(1 − p), μ₃ = 2p³ − 3p² + p
MGF: M_X(t) = pe^t + (1 − p)

Background and Application: The Bernoulli distribution can be used to model experiments that have two possible outcomes, often termed success and failure, which are coded 0 (failure) and 1 (success). The event x = 0 has probability f(0; p) = 1 − p, and the event x = 1 has probability f(1; p) = p.
Example 4.2 A simple example consists of tossing a coin with a probability of a head p and
x = 1 if head faces up.
The Bernoulli distribution plays an important role in microeconometrics, where we often
consider discrete binary 0-1 decisions of consumers or households (for example the decision to
buy or not to buy a certain product).
Family Name: Binomial
Parameterization: Θ = {(n, p) : n ∈ ℕ\{0}, 0 ≤ p ≤ 1}
Density Definition: f(x; n, p) = [n!/(x!(n − x)!)] p^x (1 − p)^{n−x} for x ∈ {0, 1, ..., n}; 0 otherwise
Moments: μ = np, σ² = np(1 − p), μ₃ = np(1 − p)(1 − 2p)
MGF: M_X(t) = (1 − p + pe^t)^n




Background and Application: The binomial density is used to model an experiment that consists of n independent repetitions of a Bernoulli-type experiment with a success probability p. The quantity of interest x is the total number of successes in n such Bernoulli trials.

The functional form of the pdf is obtained directly from the construction of the experiment and can be derived as follows. Let (Z₁, ..., Zₙ) be a collection of n independent Bernoulli distributed random variables, each with P(zᵢ = 1) = p. Then the random variable X = Σᵢ₌₁ⁿ Zᵢ represents the number of successful Bernoulli trials with zᵢ = 1. Since the Zᵢ's are independent, the probability of obtaining, in a sequence of n trials, a particular sequence of zᵢ outcomes with x successful Bernoulli trials and n − x failures is

p^x (1 − p)^{n−x}.

The number of different sequences of n trials that result in x successful Bernoulli trials and n − x failures is the binomial coefficient

C(n, x) = n!/(x!(n − x)!).

(That is the number of different ways of placing the x outcomes with zᵢ = 1 into the n positions.) Since the C(n, x) different sequences are mutually exclusive and have the same probability, it follows that the probability of x successful trials is the sum of the probabilities of the individual sequences:

P_X(x) = f(x; n, p) = C(n, x) p^x (1 − p)^{n−x}.

Example 4.3 What is the probability of obtaining at least one 6 in four rolls of a fair die? This experiment can be modeled as a sequence of n = 4 Bernoulli trials with success probability p = 1/6 = P(6 dots face up). Define the random variable X = total number of 6s in four rolls. Then X ∼ binomial(n = 4, p = 1/6) and

P(at least one 6) = P(x > 0) = 1 − P(x = 0) = 1 − C(4, 0) (1/6)⁰ (5/6)⁴ ≈ .518.
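The same number drops out of scipy's binomial pmf (a quick sketch):

```python
from scipy.stats import binom

# Example 4.3: X = number of 6s in four rolls, X ~ binomial(n=4, p=1/6).
p_at_least_one = 1 - binom.pmf(0, n=4, p=1 / 6)
print(round(p_at_least_one, 3))  # 0.518
```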
A generalization of the binomial distribution to the case where there is interest in more than 2
types of outcomes for each trial is the multinomial distribution.




Family Name: Multinomial
Parameterization: Θ = {(n, p₁, ..., p_m) : n > 0 integer, 0 ≤ pᵢ ≤ 1 ∀i, Σᵢ₌₁^m pᵢ = 1}
Density Definition: f(x₁, ..., x_m; n, p₁, ..., p_m) = [n!/(∏ᵢ₌₁^m xᵢ!)] ∏ᵢ₌₁^m pᵢ^{xᵢ} for xᵢ ∈ {0, ..., n} with Σᵢ₌₁^m xᵢ = n; 0 otherwise
Moments: μᵢ = npᵢ, σᵢ² = npᵢ(1 − pᵢ), Cov(Xᵢ, Xⱼ) = −npᵢpⱼ (i ≠ j), μ₃,ᵢ = npᵢ(1 − pᵢ)(1 − 2pᵢ)
MGF: M_X(t) = (Σᵢ₌₁^m pᵢ e^{tᵢ})^n

Background and Application: The multinomial density is used to model an experiment that consists of n independent repetitions of an experiment with m > 2 different types of outcomes, each with probability pᵢ. The quantities of interest x₁, ..., x_m are the total numbers of each type of outcome of the experiment in n repetitions; that is, xᵢ represents the total number of outcomes of type i. Note that the range of the random variable (X₁, ..., X_m) is given by R(X) = {(x₁, ..., x_m) : xᵢ ∈ {0, 1, ..., n} ∀i, Σᵢ₌₁^m xᵢ = n}.

The functional form of the pdf, given by f(x₁, ..., x_m; n, p₁, ..., p_m) = [n!/(∏ᵢ₌₁^m xᵢ!)] ∏ᵢ₌₁^m pᵢ^{xᵢ}, is obtained by directly extending the arguments used in the binomial case, based upon recognizing that the probability of obtaining, in a sequence of n repetitions, a particular sequence of outcomes with xᵢ outcomes of type i is

∏ᵢ₌₁^m pᵢ^{xᵢ},

and that the number of different sequences of n repetitions that result in xᵢ type-i outcomes for i = 1, ..., m equals

n!/(x₁! ⋯ x_m!).
Family Name: Negative Binomial (Pascal)
Parameterization: Θ = {(r, p) : r > 0 integer, 0 < p < 1}
Density Definition: f(x; r, p) = C(x − 1, r − 1) p^r (1 − p)^{x−r} for x = r, r + 1, ...; 0 otherwise
Moments: μ = r/p, σ² = r(1 − p)/p², μ₃ = r((1 − p) + (1 − p)²)/p³
MGF: M_X(t) = e^{rt} p^r [1 − (1 − p)e^t]^{−r} for t < −ln(1 − p)

Background and Application: The negative binomial density is used to model an experiment consisting of independent Bernoulli trials with a success probability p, just like the binomial density. The quantity of interest x is the number of Bernoulli trials necessary to obtain r successes.




The comparison of the binomial and the negative binomial distribution reveals that the roles of the number of trials and the number of successes are reversed w.r.t. what is a random variable and what is a parameter.

The functional form of the negative binomial pdf, given by f(x; r, p) = C(x − 1, r − 1) p^r (1 − p)^{x−r}, is obtained from arguments similar to those used in the binomial case. Let (X₁, X₂, ...) be a collection of independent Bernoulli distributed random variables, each with P(Xᵢ = 1) = p. The probability of obtaining a particular sequence of x trials that result in r successes, with the last trial being the rth success, is

p^r (1 − p)^{x−r}.

The number of different sequences of x trials that result in r successes and end with the rth success is

C(x − 1, r − 1) = (x − 1)!/((r − 1)!(x − r)!).

(That is the number of different ways of placing r − 1 successes in the first x − 1 positions of the sequence; note that the last trial has to be a success.)

A special case of the negative binomial distribution is the geometric distribution, which is obtained by setting the parameter r = 1. Its pdf has the form

f(x; p) = p (1 − p)^{x−1}     for x = 1, 2, ... .

Background and Application: The geometric density is used to model an experiment that
consists of independent Bernoulli trials with a success probability p. The quantity of interest
x is the Bernoulli trial at which the first success occurs.
The geometric distribution has a property known as the memoryless property: for any positive integers i and j we obtain

P(x > i + j | x > i) = P(x > j).

That is, the probability of observing an additional j failures, having already observed i failures, is the same as the probability of observing j failures at the start of the sequence. Thus, the geometric distribution "forgets" what has occurred. This memoryless property of the geometric distribution can be established as follows. First note that for any integer k

P(x > k) = P(no success in k trials) = (1 − p)^k.




Hence we obtain

P(x > i + j | x > i) = P(x > i + j and x > i)/P(x > i) = P(x > i + j)/P(x > i)
                     = (1 − p)^{i+j}/(1 − p)^i = (1 − p)^j = P(x > j).

Example 4.4 The geometric distribution is often used to model lifetimes of components. For example, assume a probability of p = .001 that a light bulb will fail on any given day. Then the probability that the lifetime of the light bulb X will be at least 30 days is

P(x > 30) = Σ_{x=31}^∞ .001 (1 − .001)^{x−1} = .999³⁰ ≈ .970,

where the summands are values of the geometric pdf.
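Both the lifetime computation and the memoryless property can be checked with scipy's geometric distribution (a sketch; the values p = 0.2, i = 5, j = 7 are arbitrary illustrative choices):

```python
from scipy.stats import geom

# Example 4.4 via the survival function: P(X > 30) = (1 - p)^30.
print(round(geom.sf(30, 0.001), 3))  # 0.97

# Memoryless property: P(X > i + j | X > i) = P(X > j).
p, i, j = 0.2, 5, 7
lhs = geom.sf(i + j, p) / geom.sf(i, p)
rhs = geom.sf(j, p)
print(abs(lhs - rhs) < 1e-12)  # True
```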

Remark 4.1 The memoryless property of the geometric distribution can be interpreted as a
lack-of-aging property. It indicates that the geometric distribution is not appropriate to model
lifetimes for which the failure probability is expected to increase with time.
Family Name: Poisson
Parameterization: Θ_λ = {λ : λ > 0}
Density Definition: f(x; λ) = (λ^x/x!) e^{−λ} for x = 0, 1, 2, ...; 0 otherwise
Moments: μ = λ, σ² = λ, μ₃ = λ
MGF: M_X(t) = e^{λ(e^t − 1)}
Background and Application: The Poisson density can be used to model experiments where the quantity of interest takes values in the non-negative integers, for example, the number of occurrences of a certain event (such as the number of goals in a soccer game).

An important property of the Poisson distribution is that it provides an approximation to the probabilities generated by the binomial distribution. In fact, the limit of the binomial density as the number of Bernoulli trials n → ∞ is the Poisson density. This can be established as follows.
The binomial density for n Bernoulli trials and success probability p is given by f(x; n, p) = [n!/(x!(n − x)!)] p^x (1 − p)^{n−x}. Let np = λ for some λ > 0, such that p = λ/n. Then the binomial pdf can be rewritten as

f(x; n, p) = [n!/(x!(n − x)!)] (λ/n)^x (1 − λ/n)^{n−x}
           = [n(n − 1) ⋯ (n − [x − 1])/x!] (λ/n)^x (1 − λ/n)^{n−x}
           = [n(n − 1) ⋯ (n − [x − 1])/n^x] (λ^x/x!) (1 − λ/n)^n (1 − λ/n)^{−x}.

Letting n → ∞ yields

(n/n) ((n − 1)/n) ⋯ ((n − x + 1)/n) → 1,     (1 − λ/n)^n → e^{−λ},     (1 − λ/n)^{−x} → 1,

such that lim_{n→∞} f(x; n, p) = λ^x e^{−λ}/x!, which represents the Poisson density.

The usefulness of this result is that for a large number of trials n, and thus for a small success probability p = λ/n, we can approximate the binomial by a Poisson density, that is,

[n!/(x!(n − x)!)] p^x (1 − p)^{n−x} ≈ (np)^x e^{−np}/x!.

Note that the Poisson density is relatively easy to evaluate, whereas, for large n, the calculation of the factorial expressions in the binomial density can be cumbersome.
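The quality of this approximation is easy to check numerically (a sketch; λ = 3 and n = 10000 are arbitrary illustrative choices):

```python
from scipy.stats import binom, poisson

# Poisson approximation to the binomial: n large, p = lam/n small.
lam, n = 3.0, 10_000
gaps = [abs(binom.pmf(k, n, lam / n) - poisson.pmf(k, lam)) for k in range(20)]
print(max(gaps) < 1e-4)  # True
```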
Poisson density and Poisson process: The Poisson distribution models experiments whose outcomes are governed by the so-called Poisson process. A Poisson process is defined as follows:

Definition 4.1 (Poisson process) Let an experiment consist of observing the occurrence of a certain event over a time interval [0, t]. The experiment is said to follow a Poisson process if:
1. the probability that the event occurs once over a small time interval Δt is approximately proportional to Δt as λ(Δt) + o(Δt),¹ where λ > 0;
2. the probability that the event occurs twice or more often over a small time interval Δt is negligible, being of order of magnitude o(Δt);
3. the numbers of occurrences of the event that are observed in non-overlapping intervals are independent events.

¹ o(Δt) stands for "of smaller order than Δt" and means that the value of o(Δt) approaches zero at a rate faster than Δt. That is, lim_{Δt→0} o(Δt)/Δt = 0.




Theorem 4.1 Let X be the number of times a certain event occurs in the interval [0, t]. If the experiment underlying X follows a Poisson process, then the pdf of X is the Poisson density.

Proof
Partition the interval [0, t] into n disjoint subintervals I_j, j = 1, ..., n, each of length Δt = t/n. Let X(I_j) denote the number of times that the event occurs within subinterval I_j, so that X = Σ_{j=1}^n X(I_j). Consider the event X = k, and note that P(x = k) = P(Aₙ) + P(Bₙ), where Aₙ and Bₙ are the disjoint sets

Aₙ = {Σ_{j=1}^n x(I_j) = k; x(I_j) ≤ 1 ∀j},
Bₙ = {Σ_{j=1}^n x(I_j) = k; x(I_j) ≥ 2 for at least one j}.

Note that Bₙ ⊂ {x(I_j) ≥ 2 for at least one j} = ∪_{j=1}^n {x(I_j) ≥ 2}. Now, Theorem 1.3 and Boole's inequality applied to P(∪_{j=1}^n {x(I_j) ≥ 2}) imply

P(Bₙ) ≤ Σ_{j=1}^n P({x(I_j) ≥ 2}) = Σ_{j=1}^n o(t/n) = n · o(t/n) = t · [o(t/n)/(t/n)]     (by property (2) of a Poisson process).

It follows that lim_{n→∞} P(Bₙ) = 0. Now consider P(Aₙ). Define, for each subinterval, a success as observing the event once, and a failure otherwise. Then by property (1)

P(success) = λ(t/n) + o(t/n)     and     P(failure) = 1 − λ(t/n) − o(t/n).

Since the occurrences of events are independent across the n subintervals by property (3), they can be interpreted as a sequence of independent Bernoulli trials. Hence, P(Aₙ) is obtained from a binomial distribution as

P(Aₙ) = C(n, k) [λ(t/n) + o(t/n)]^k [1 − λ(t/n) − o(t/n)]^{n−k},

where C(n, k) is the binomial coefficient. Since limₙ Binomial(n, p) = Poisson(λ_P = np), and o(t/n) disappears at a rate faster than t/n as n → ∞, we get

lim_{n→∞} P(Aₙ) = e^{−λt} (λt)^k / k!     (i.e., a Poisson density with x = k),

where we have set λ(t/n) = p (the binomial parameter) such that pn = λt = λ_P (the Poisson parameter). Finally, since P(x = k) = P(Aₙ) + P(Bₙ) ∀n, we have

P(x = k) = lim_{n→∞} [P(Aₙ) + P(Bₙ)] = e^{−λt} (λt)^k / k!.

Thus, under a Poisson process, the number of times an event occurs in an interval follows a Poisson distribution. □




Remark 4.2 The parameter λ is interpreted as the mean rate of occurrence of the event per unit of time, or the intensity of the Poisson process. This follows from the fact that for a Poisson variable E X = λt, such that E X/t = λ.
Family Name: Hypergeometric
Parameterization: Θ = {(M, K, n) : M = 1, 2, 3, ...; K = 0, 1, ..., M; n = 1, 2, ..., M}
Density Definition: f(x; M, K, n) = C(K, x) C(M − K, n − x)/C(M, n) for integer values max[0, n − (M − K)] ≤ x ≤ min(n, K); 0 otherwise
Moments: μ = nK/M,
σ² = n (K/M) ((M − K)/M) ((M − n)/(M − 1)),
μ₃ = n (K/M) ((M − K)/M) ((M − 2K)/M) ((M − n)/(M − 1)) ((M − 2n)/(M − 2))
MGF: M_X(t) = [((M − n)! (M − K)!)/M!] · H(−n, −K, M − K − n + 1, e^t),
where H(·) is the hypergeometric function
H(α, β, r, Z) = 1 + (αβ/r)(Z/1!) + (α(α + 1)β(β + 1)/(r(r + 1)))(Z²/2!) + ⋯

Background and Application: The hypergeometric density can be used to model experiments where there are M objects (in an urn), of which K are of one type, say type A, and M − K are of a different type, say type B; n objects are randomly drawn from the set of the M objects without replacement; the quantity of interest x is the number of type-A objects in the set of n drawn objects.

Note that the binomial and the hypergeometric density both assign probabilities to outcomes of the type "observe x type-A outcomes from a total number of n trials". However, under the binomial experiment the n trials are independent and identical and correspond to drawing objects with replacement, in contrast to the hypergeometric experiment.

The functional form of the hypergeometric pdf, given by f(x; M, K, n) = C(K, x) C(M − K, n − x)/C(M, n), is obtained from the following arguments. The number of different ways of choosing the sample of size n from M objects is C(M, n); the number of different ways of choosing x type-A objects is C(K, x); the number of different ways of choosing n − x type-B objects is C(M − K, n − x). Since all possible samples having x type-A objects and n − x type-B objects are equally likely, we can apply the classical probability definition in order to obtain the probability of drawing a sample with x type-A objects:

P(sample with x type-A objects) = f(x; M, K, n) = C(K, x) C(M − K, n − x)/C(M, n).


4.2. Continuous Density Functions


Family Name: Continuous Uniform

Parameterization

Ω = {(a, b) : −∞ < a < b < ∞}

Density Definition

f(x; a, b) = 1/(b − a) I_[a,b](x)

Moments

μ = (a + b)/2, σ² = (b − a)²/12, μ_3 = 0

MGF

M_X(t) = (e^{bt} − e^{at}) / ((b − a)t) for t ≠ 0; M_X(0) = 1

Background and Application: The continuous uniform density is used to model experiments
having a continuous sample space with outcomes that are equally likely in the interval [a, b].
Example 4.5 A simple example consists of spinning a wheel of fortune with radius r. The
point X (measured as arc length along the rim) at which the wheel stops is uniformly distributed with a = 0 and b = 2πr.
Family Name: Gamma

Parameterization

Ω = {(α, β) : α > 0, β > 0}

Density Definition

f(x; α, β) = 1/(β^α Γ(α)) x^{α−1} e^{−x/β} I_{(0,∞)}(x),

Γ(α) = ∫_0^∞ y^{α−1} e^{−y} dy (gamma function)

Moments

μ = αβ, σ² = αβ², μ_3 = 2αβ³

MGF

M_X(t) = (1 − βt)^{−α} for t < 1/β

Remark 4.3 The gamma function has the following properties.

1. Γ(1) = ∫_0^∞ e^{−y} dy = 1.

2. If α > 0 is an integer, then Γ(α) = (α − 1)!.

3. Γ(1/2) = π^{1/2}.

4. For any real α > 0, the gamma function satisfies the recursion Γ(α + 1) = αΓ(α). This
can be verified through integration by parts.
Background and Application: The gamma family of densities can be used to model experiments whose outcomes are coded as non-negative real numbers and whose probabilities are to
be assigned via a pdf that is skewed to the right.



Gamma density and Poisson process: The gamma distribution models the waiting time
(duration) between occurrences of events under a Poisson process. This result is obtained from
the following arguments. Let Y be the number of occurrences of an event under a Poisson
process in the interval [0, t], such that Y ~ Poisson(λt), where λ is the intensity of the process.
Let X represent the time elapsed until the first event occurs. Then the probability for y = 0 can be written as

P(y = 0) = P(no event in [0, t]) = P(x > t) = e^{−λt} (λt)^0 / 0! = e^{−λt}.

Then the cdf for X is obtained as

F_X(t) = P(X ≤ t) = 1 − P(X > t) = 1 − e^{−λt}   (def.)

Hence the pdf of the waiting time X can be derived as

∂F_X(t)/∂t = f_X(t) = λ e^{−λt},

which is a gamma density with α = 1 and β = 1/λ. This important particular case of a gamma
distribution is the exponential distribution; see below.
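The step from the cdf F_X(t) = 1 − e^{−λt} to the pdf λe^{−λt} can be checked by differentiating the cdf numerically; a minimal sketch, where λ = 2 is an arbitrary illustrative intensity:

```python
from math import exp

lam = 2.0  # assumed Poisson intensity λ per unit of time

def F(t):
    """cdf of the waiting time to the first event: 1 - P(no event in [0, t])."""
    return 1.0 - exp(-lam * t)

def f(t):
    """pdf derived in the text: λ e^{-λt}, a Gamma(α = 1, β = 1/λ) density."""
    return lam * exp(-lam * t)

# A central difference of the cdf should reproduce the pdf.
h = 1e-6
for t in (0.1, 0.5, 2.0):
    print((F(t + h) - F(t - h)) / (2 * h), f(t))
```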
The gamma distribution has an additivity property, which is indicated in the following
theorem.
Theorem 4.2 Let X_1, ..., X_n be independent random variables with X_i ~ Gamma(α_i, β), i =
1, ..., n. Then Y = Σ_{i=1}^n X_i ~ Gamma(Σ_{i=1}^n α_i, β).

Proof
Since the MGFs of the X_i's are M_{X_i}(t) = (1 − βt)^{−α_i} for t < 1/β, and since the X_i's are independent,
the MGF of Y is

M_Y(t) = Π_{i=1}^n M_{X_i}(t) = Π_{i=1}^n (1 − βt)^{−α_i} = (1 − βt)^{−Σ_{i=1}^n α_i}   for t < 1/β,   (by indep.)

Thus, by the uniqueness theorem, Y ~ Gamma(Σ_{i=1}^n α_i, β).
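Theorem 4.2 can be illustrated by simulation, since Python's standard library provides a gamma sampler; the shapes α_i and common scale β below are arbitrary illustrative values:

```python
import random
from statistics import fmean, pvariance

random.seed(0)
alphas, beta = [0.5, 1.5, 3.0], 2.0   # assumed shapes α_i and common scale β
N = 100_000

# Y = Σ X_i with X_i ~ Gamma(α_i, β), independent
ys = [sum(random.gammavariate(a, beta) for a in alphas) for _ in range(N)]

a_sum = sum(alphas)                     # Theorem 4.2: Y ~ Gamma(Σ α_i, β)
print(fmean(ys), a_sum * beta)          # sample mean vs. μ = αβ
print(pvariance(ys), a_sum * beta**2)   # sample variance vs. σ² = αβ²
```

The sample mean and variance of Y agree (up to Monte Carlo error) with the moments of a Gamma(Σα_i, β) distribution from the table above.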

A further property of the gamma distribution is that a scaled gamma-distributed random
variable also has a gamma distribution, as indicated in the following theorem.

Theorem 4.3 Let X ~ Gamma(α, β). Then, for any c > 0, Y = cX ~ Gamma(α, cβ).

Proof
Since the MGF of X is M_X(t) = (1 − βt)^{−α} for t < 1/β, the basic properties of MGFs imply that the
MGF of Y is

M_Y(t) = M_{cX}(t) = M_X(ct) = (1 − cβt)^{−α}   for t < (cβ)^{−1}.

Thus, by the uniqueness theorem, Y ~ Gamma(α, cβ).

An important special case of the gamma distribution, obtained by setting α = 1 and β = θ, is
the exponential distribution.
Gamma Subfamily Name: Exponential

Parameterization

Ω = {θ : θ > 0}

Density Definition

f(x; θ) = (1/θ) e^{−x/θ} I_{(0,∞)}(x)

Moments

μ = θ, σ² = θ², μ_3 = 2θ³

MGF

M_X(t) = (1 − θt)^{−1} for t < 1/θ

Background and Application The exponential density can be used to model experiments
whose outcomes are coded as non-negative real numbers and whose probabilities are to be
assigned via a pdf that is monotonically decreasing in x.
A specific application of the exponential distribution is the modeling of the time that passes until
a Poisson process produces the first success (recall the previous discussion of the relationship
between the gamma density and the Poisson process).
The exponential distribution has the memoryless property, which makes it an appealing
candidate to model operating lives until failure of certain objects. This is shown in

Theorem 4.4 If X ~ Exponential(θ), then P(x > s + t | x > s) = P(x > t) for all s, t > 0.

Proof
We have

P(x > s + t | x > s) = P(x > s + t) / P(x > s) = [∫_{s+t}^∞ (1/θ) e^{−x/θ} dx] / [∫_s^∞ (1/θ) e^{−x/θ} dx]

= e^{−(s+t)/θ} / e^{−s/θ} = e^{−t/θ} = P(x > t).

Remark 4.4 The memoryless property implies that, given that an object has already functioned
for s units of time without failing, the probability that it will function for at least an additional
t units of time, that is P(x > s + t | x > s), is the same as the unconditional probability that
it would function for at least t units of time, that is P(x > t) (this can be reformulated as:
the exponential distribution possesses a constant hazard rate). This indicates that the
exponential distribution is not appropriate to model lifetimes for which the failure probability is
expected to increase with time.
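The memoryless property in Theorem 4.4 is easy to verify numerically from the survival function; a minimal sketch, where θ = 3 is an arbitrary illustrative mean:

```python
from math import exp

theta = 3.0   # assumed mean θ of the exponential distribution

def surv(t):
    """Survival function P(X > t) = e^{-t/θ}."""
    return exp(-t / theta)

# P(X > s + t | X > s) should equal P(X > t) for all s, t > 0.
for s, t in [(1.0, 2.0), (5.0, 0.5), (10.0, 10.0)]:
    cond = surv(s + t) / surv(s)
    print(s, t, cond, surv(t))
```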
Remark 4.5 Sometimes the exponential distribution is parameterized as f(x; λ) = λe^{−λx} (with λ = 1/θ), so
do pay attention to which variant is meant.
A further important special case of the gamma distribution, obtained by setting α = v/2 and
β = 2, is the chi-square distribution.
Gamma Subfamily Name: Chi-Square

Parameterization

Ω_v = {v : v is a positive integer};
v is called the degrees of freedom

Density Definition

f(x; v) = 1/(2^{v/2} Γ(v/2)) x^{(v/2)−1} e^{−x/2} I_{(0,∞)}(x)

Moments

μ = v, σ² = 2v, μ_3 = 8v

MGF

M_X(t) = (1 − 2t)^{−v/2} for t < 1/2

Background and Application The chi-square distribution plays an important role in statistical inference, especially when sampling from a normal distribution. In particular (as we will
show later), the sum of the squares of v independent standard normal random variables has a
χ²(v) distribution. Furthermore, the critical values of many statistical tests are obtained as a
quantile of a χ²(v) distribution, that is, a value h for which

P(x ≥ h) = ∫_h^∞ 1/(2^{v/2} Γ(v/2)) x^{(v/2)−1} e^{−x/2} dx = α.

Family Name: Beta

Parameterization

Ω = {(α, β) : α > 0, β > 0}

Density Definition

f(x; α, β) = 1/B(α, β) x^{α−1} (1 − x)^{β−1} I_{(0,1)}(x),

B(α, β) = ∫_0^1 x^{α−1} (1 − x)^{β−1} dx (beta fct.)

Moments

μ = α/(α + β),
σ² = αβ / ((α + β + 1)(α + β)²),
μ_3 = 2αβ(β − α) / ((α + β + 2)(α + β + 1)(α + β)³)

MGF

M_X(t) = 1 + Σ_{r=1}^∞ [B(r + α, β)/B(α, β)] t^r/r!

Some useful properties of the beta function include the fact that B(α, β) = B(β, α) and
B(α, β) = Γ(α)Γ(β)/Γ(α + β), so that the beta function can be evaluated in terms of Γ(·).
Background and Application The beta density can be used to model experiments whose
outcomes are coded as real numbers on the interval [0, 1]. It is used in modeling proportions.
The beta density can assume a large variety of shapes depending on the parameters α and β.


Parameter values        Shape of the beta density

α < β                   skewed to the right, μ_3 > 0
α > β                   skewed to the left, μ_3 < 0
α = β                   symmetric about the mean
α > 1 and β > 1         inverted U-shaped with lim_{x→1} f(x) = 0 and lim_{x→0} f(x) = 0
α < 1                   lim_{x→0} f(x) = ∞
β < 1                   lim_{x→1} f(x) = ∞
α < 1 and β < 1         U-shaped with lim_{x→1} f(x) = ∞ and lim_{x→0} f(x) = ∞
α = 1 and β = 1         uniform on (0, 1)

4.3. Normal Family of Densities


The normal (Gaussian) family of distributions is the most extensively used distribution in
statistics and econometrics. There are three main reasons for this.
1. The normal distribution is very tractable analytically.
2. The normal density has a bell shape, whose symmetry makes it an appealing candidate
to model the probability space of many experiments.
3. There is the Central Limit Theorem (which we will discuss in Chapter 5), which indicates
that under mild conditions, the normal distribution can be used to approximate a large
variety of distributions in large samples.
Family Name: Univariate Normal

Parameterization

Ω = {(μ, σ) : μ ∈ (−∞, ∞), σ > 0}

Density Definition

f(x; μ, σ) = 1/(σ√(2π)) exp{ −(1/2) ((x − μ)/σ)² }

Moments

E(X) = μ, Var(X) = σ², μ_3 = 0

MGF

M_X(t) = exp{ μt + (1/2) σ²t² }

Background and Application The normal family of densities is indexed by the two parameters μ and σ, which correspond to the mean and the standard deviation, respectively.
In order to denote a normally distributed random variable with mean μ and variance σ², we
will use the usual notation X ~ N(μ, σ²). A normal distribution with μ = 0 and σ² = 1 is
called the standard normal distribution, and is abbreviated by N(0, 1).



The functional form of the MGF of the normal distribution is obtained as follows:

M_X(t) = ∫_{−∞}^{∞} e^{tx} (1/(σ√(2π))) e^{−(1/(2σ²))(x−μ)²} dx    (by def.)

= e^{tμ} ∫_{−∞}^{∞} e^{t(x−μ)} (1/(σ√(2π))) e^{−(1/(2σ²))(x−μ)²} dx    (expansion by e^{tμ} e^{−tμ})

= e^{tμ} ∫_{−∞}^{∞} (1/(σ√(2π))) e^{−(1/(2σ²))[(x−μ)² − 2σ²t(x−μ) + σ⁴t² − σ⁴t²]} dx    (expansion by e^{σ⁴t²/(2σ²)} e^{−σ⁴t²/(2σ²)})

= e^{tμ} e^{σ²t²/2} ∫_{−∞}^{∞} (1/(σ√(2π))) e^{−(1/(2σ²))(x − [μ + σ²t])²} dx

= e^{tμ + (1/2)σ²t²},

since the integrand in the last integral is the pdf of a N(μ + σ²t, σ²), which integrates to 1.
The normal density is symmetric about its mean μ, has its maximum at x = μ, and has inflection
points (where the curve changes from concave to convex) at x = μ ± σ.
A useful property of normally distributed random variables is that they can easily be transformed into a variable having a standard normal distribution.
Theorem 4.5 If X ~ N(μ, σ²), then Z = (X − μ)/σ ~ N(0, 1).

Proof
Since −μ/σ is a constant, the MGF of Z = −μ/σ + (1/σ)X is

M_Z(t) = e^{−(μ/σ)t} M_X(t/σ) = e^{−(μ/σ)t} e^{μ(t/σ) + (1/2)σ²(t/σ)²} = e^{(1/2)t²},

where M_X(t/σ) is the MGF of a N(μ, σ²) evaluated at t/σ, and e^{(1/2)t²} represents the MGF of a
normal density with μ = 0 and σ² = 1.

Remark 4.6 Theorem 4.5 implies that the probability of an event A, P_X(A), for the random
variable X ~ N(μ, σ²) is equal to the probability P_Z(B) of the equivalent event B = {z : z =
(x − μ)/σ, x ∈ A} for a standard normal random variable Z ~ N(0, 1). Hence, the standard
normal distribution is sufficient to assign probabilities to all events involving Gaussian random
variables.


Example 4.6 Let X ~ N(17, 1/4). The probability of the event x ∈ [16, 18] can be computed as

P(16 ≤ x ≤ 18) = P( (16 − 17)/(1/2) ≤ (x − 17)/(1/2) ≤ (18 − 17)/(1/2) )

= P(−2 ≤ z ≤ 2)

= Φ(2) − Φ(−2) ≈ 0.9545,

where Φ(·) denotes the cdf of a standard normal distribution. Note that Φ is not available in
closed form.
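Although Φ has no closed form, it can be evaluated through the error function; a short sketch reproducing the computation of Example 4.6:

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal cdf, expressed through the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

mu, sigma = 17.0, 0.5                    # X ~ N(17, 1/4)
lo, hi = (16 - mu) / sigma, (18 - mu) / sigma
p = Phi(hi) - Phi(lo)                    # = Φ(2) − Φ(−2)
print(round(p, 4))                       # ≈ 0.9545
```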

Normal and chi-square distribution: There is a relationship between standard normal random variables and the χ² distribution, which is the subject of the following two theorems:

Theorem 4.6 If X ~ N(0, 1), then Y = X² ~ χ²(1).


Proof
The MGF of Y is defined as

M_Y(t) = E(e^{Yt}) = E(e^{X²t}) = ∫_{−∞}^{∞} e^{x²t} (1/√(2π)) e^{−(1/2)x²} dx

= ∫_{−∞}^{∞} (1/√(2π)) e^{−(1/2)x²(1−2t)} dx

= (1 − 2t)^{−1/2} ∫_{−∞}^{∞} (1/(√(2π)(1 − 2t)^{−1/2})) e^{−(1/2){x/(1−2t)^{−1/2}}²} dx

= (1 − 2t)^{−1/2},

since the integrand in the last integral is the pdf of a N(0, [1 − 2t]^{−1}), which integrates to 1,
and where (1 − 2t)^{−1/2} represents the MGF of a χ²(1) density.

Theorem 4.7 Let (X_1, . . . , X_n) be independent N(0, 1)-distributed random variables. Then Y =
Σ_{i=1}^n X_i² ~ χ²(n).

Proof
The X_i²'s are χ²(1)-distributed by Theorem 4.6, and are independent by Theorem 2.11. Thus the MGF
of Y is obtained as

M_Y(t) = Π_{i=1}^n M_{X_i²}(t) = Π_{i=1}^n (1 − 2t)^{−1/2} = (1 − 2t)^{−n/2},   (by indep.)

where (1 − 2t)^{−1/2} is the MGF of a χ²(1) density and (1 − 2t)^{−n/2} represents the MGF of a χ²(n) density.
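Theorem 4.7 can be illustrated by simulation: sums of v squared standard normals should have mean v and variance 2v, matching the χ²(v) table above (v = 4 and the sample size are arbitrary illustrative choices):

```python
import random
from statistics import fmean, pvariance

random.seed(1)
v, N = 4, 100_000

# Y = Σ_{i=1}^{v} Z_i² with Z_i iid N(0, 1), so Y ~ χ²(v) by Theorem 4.7
ys = [sum(random.gauss(0, 1) ** 2 for _ in range(v)) for _ in range(N)]

print(fmean(ys))      # ≈ μ = v
print(pvariance(ys))  # ≈ σ² = 2v
```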



The univariate normal distribution discussed so far has a straightforward multivariate generalization.
Family Name: Multivariate Normal

Parameterization

Ω(μ, Σ) = {(μ, Σ) : μ ∈ R^n, Σ is an (n × n) p.d. symmetric matrix},

where μ = (μ_1, . . . , μ_n)′ and Σ has diagonal elements σ_1², . . . , σ_n² and off-diagonal elements σ_{ij}, i.e.

Σ = [ σ_1² · · · σ_{1n} ; ... ; σ_{n1} · · · σ_n² ]

Density Definition

f(x; μ, Σ) = 1/((2π)^{n/2} |Σ|^{1/2}) exp{ −(1/2)(x − μ)′ Σ^{−1} (x − μ) }

Moments

E(X) = μ, Cov(X) = Σ, μ_3 = [0] ((n × 1) zero vector)

MGF

M_X(t) = exp{ μ′t + (1/2) t′Σt }, t = (t_1, . . . , t_n)′

Background and Application The n-variate normal family of distributions is indexed by
n + n(n + 1)/2 parameters: n parameters in the mean vector (μ) and n + (n² − n)/2 parameters
in the covariance matrix (Σ).
In order to illustrate graphically some of the characteristics of a multivariate Gaussian density,
we consider the bivariate case with n = 2.

1. The multivariate Gaussian density is bell-shaped and has its maximum at x = (x_1, x_2)′ =
μ = (μ_1, μ_2)′.

2. The iso-density contours, given by the set of points {(x_1, x_2) : f(x; μ, Σ) = c},
have the form of an ellipse. Its center is given by μ and its orientation depends on Σ. See
e.g. Mittelhammer (1996, Figs. 4-14 and 4-15).
Properties of Multivariate Normal Distributions
A useful property is that linear combinations of a vector of multivariate normally distributed random variables are also normally
distributed, as stated in the following theorem.

Theorem 4.8 Let X be an n-dimensional N(μ, Σ)-distributed random variable. Let A be any
(k × n) matrix of constants with rank(A) = k, and let b be any (k × 1) vector of constants.
Then the (k × 1) random vector Y = AX + b is N(Aμ + b, AΣA′) distributed.

Proof
The MGF of Y is defined as

M_Y(t) = E(e^{t′Y}) = E(e^{t′(AX + b)}) = e^{t′b} E(e^{t′AX}).

Defining t′A = t*′ (with A′t = t*) allows us to write

M_Y(t) = e^{t′b} E(e^{t*′X}) = e^{t′b} M_X(t*) = e^{t′b} e^{μ′t* + (1/2) t*′Σt*} = e^{t′(Aμ + b) + (1/2) t′AΣA′t},

which is the MGF of a N(Aμ + b, AΣA′) distribution.

Remark 4.7 Theorem 4.8 can be used to standardize a normally distributed random vector.
Let Z be a N(0, I)-distributed (n × 1) random vector, that is, a vector of n independent N(0, 1)-distributed random variables. Then the (n × 1) random vector Y with a N(μ, Σ) distribution
can be represented in terms of Z as Y = μ + AZ, where A is selected such that AA′ = Σ.²
This follows from Theorem 4.8, since

Y ~ N(A·0 + μ, AIA′) = N(μ, Σ).

Furthermore, the inversion of the function Y = μ + AZ standardizes the normally distributed
vector Y:

A^{−1}(Y − μ) = Z ~ N(0, I).
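For n = 2 the Cholesky factor can be written down explicitly; a minimal sketch with an arbitrary illustrative covariance matrix Σ, checking that AA′ = Σ:

```python
from math import sqrt

# Assumed target covariance Σ (positive definite, 2 × 2)
Sigma = [[4.0, 1.2],
         [1.2, 2.0]]

# Lower-triangular Cholesky factor A with AA' = Σ, worked out for the 2×2 case
a11 = sqrt(Sigma[0][0])
a21 = Sigma[1][0] / a11
a22 = sqrt(Sigma[1][1] - a21 ** 2)
A = [[a11, 0.0],
     [a21, a22]]

# Multiply A by its transpose and compare with Σ
AAt = [[a11 * a11, a11 * a21],
       [a21 * a11, a21 * a21 + a22 * a22]]
print(AAt)
```

Given such an A, Y = μ + AZ with Z ~ N(0, I) has the desired N(μ, Σ) distribution.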
A further important property of the multivariate normal distribution is that marginal densities
obtained from a multivariate normal are normal densities, as stated in the following theorem.

Theorem 4.9 Let Z be an n-dimensional N(μ, Σ)-distributed random variable, partitioned as

Z = [ Z^{(1)} ; Z^{(2)} ],  μ = [ μ^{(1)} ; μ^{(2)} ],  Σ = [ Σ_{11}, Σ_{12} ; Σ_{21}, Σ_{22} ],

where Z^{(1)} and μ^{(1)} are (m × 1), Z^{(2)} and μ^{(2)} are ((n − m) × 1), Σ_{11} is (m × m), Σ_{12} is
(m × (n − m)), Σ_{21} is ((n − m) × m), and Σ_{22} is ((n − m) × (n − m)).

Then the marginal pdf of Z^{(1)} is N(μ^{(1)}, Σ_{11}), and the marginal pdf of Z^{(2)} is N(μ^{(2)}, Σ_{22}).

Proof
Let

A = [ I , 0 ]   (I the (m × m) identity matrix, 0 an (m × (n − m)) zero matrix)   and   b = 0

in Theorem 4.8. It follows that Z^{(1)} = AZ is N(Aμ, AΣA′) distributed,
with Aμ = μ^{(1)} and AΣA′ = Σ_{11}. (The proof for Z^{(2)} is analogous.)
² Note that the choice of A is not unique, but this normally doesn't matter, since A typically does not appear
alone, but paired with A′. Still, A is often selected such that AA′ = Σ is the Cholesky decomposition,
where the so-called Cholesky factor A is restricted to be lower triangular.



Remark 4.8 Note that Theorem 4.9 can be applied to obtain the marginal pdf of any subset of
the normal random variable (Z1 , ..., Zn ) by simply ordering them appropriately in the definition
of Z in the theorem. Also note that the normal marginal pdfs are derived from joint multivariate
normality. The derivation does not go in the opposite direction. That is, marginal normality
does not imply joint normality (for an example, see Casella and Berger, 2002, Exercise 4.47).
The following theorem states an important result concerning conditional densities obtained
from multivariate normal distributions.
Theorem 4.10 Let Z be an n-dimensional N(μ, Σ)-distributed random variable, partitioned into
Z^{(1)} ((m × 1)) and Z^{(2)} (((n − m) × 1)), with μ and Σ partitioned conformably as in Theorem 4.9,
and let z^0 be an n-dimensional vector of constants partitioned conformably with the partition of Z
into z^{(1)0} and z^{(2)0}. Then the conditional distributions of Z^{(1)}|Z^{(2)} = z^{(2)0} and Z^{(2)}|Z^{(1)} = z^{(1)0}
are

Z^{(1)}|(Z^{(2)} = z^{(2)0}) ~ N( μ^{(1)} + Σ_{12} Σ_{22}^{−1} (z^{(2)0} − μ^{(2)}), Σ_{11} − Σ_{12} Σ_{22}^{−1} Σ_{21} ),

Z^{(2)}|(Z^{(1)} = z^{(1)0}) ~ N( μ^{(2)} + Σ_{21} Σ_{11}^{−1} (z^{(1)0} − μ^{(1)}), Σ_{22} − Σ_{21} Σ_{11}^{−1} Σ_{12} ).

Proof
By definition, the conditional distribution of Z^{(1)}|Z^{(2)} = z^{(2)0} is obtained as

f(z^{(1)} | z^{(2)} = z^{(2)0}) = f(z^{(1)}, z^{(2)0}) / f(z^{(2)0})

= [ (2π)^{−n/2} |Σ|^{−1/2} exp{ −(1/2) (z − μ)′ Σ^{−1} (z − μ) } ] /
  [ (2π)^{−(n−m)/2} |Σ_{22}|^{−1/2} exp{ −(1/2) (z^{(2)0} − μ^{(2)})′ Σ_{22}^{−1} (z^{(2)0} − μ^{(2)}) } ],

where z = (z^{(1)′}, z^{(2)0′})′. The determinant |Σ| and the inverse Σ^{−1} can be partitioned as³

|Σ| = |Σ_{22}| · |Σ_{11.2}|,

Σ^{−1} = [ Σ_{11.2}^{−1} , −Σ_{11.2}^{−1} Σ_{12} Σ_{22}^{−1} ;
          −Σ_{22}^{−1} Σ_{21} Σ_{11.2}^{−1} , Σ_{22}^{−1} + Σ_{22}^{−1} Σ_{21} Σ_{11.2}^{−1} Σ_{12} Σ_{22}^{−1} ],

where

Σ_{11.2} = Σ_{11} − Σ_{12} Σ_{22}^{−1} Σ_{21}.

Inserting the partitioned determinant and inverse and collecting terms produces the conditional density
as stated in the theorem. The proof for the conditional distribution of Z^{(2)}|Z^{(1)} = z^{(1)0} is analogous.

³ See Lütkepohl, H. (1996, p. 30 and 50), Handbook of Matrices, Chichester. Double-check yourself by multiplication!


Remark 4.9 Note that the mean of the conditional distribution, given by

E(Z^{(1)} | z^{(2)}) = μ^{(1)} + Σ_{12} Σ_{22}^{−1} (z^{(2)} − μ^{(2)}),

is a linear function of the conditioning variable z^{(2)}. This linearity of the conditional mean is
a specific feature of the multivariate normal distribution as a member of the family of elliptically
contoured distributions.
Remark 4.10 Consider the special case where Z_{(1)} is a scalar and Z^{(2)} is a (k × 1) vector.
Then the conditional mean of Z_{(1)} given z^{(2)}, that is, the regression function of Z_{(1)} on Z^{(2)}, has
the form

E(Z_{(1)} | z^{(2)}) = a + b z^{(2)},   a (1 × 1), b (1 × k),

where

a = μ_{(1)} − Σ_{12} Σ_{22}^{−1} μ^{(2)},   b = Σ_{12} Σ_{22}^{−1}.

Hence, the regression function of Z_{(1)} on Z^{(2)} obtained from a multivariate normal distribution
for (Z_{(1)}, Z^{(2)}) is linear.
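For the bivariate case (m = 1, n = 2), Theorem 4.10 can be checked pointwise: the density ratio f(z_1, z_2^0)/f(z_2^0) should coincide with the stated conditional normal pdf. A sketch with arbitrary illustrative parameters:

```python
from math import exp, pi, sqrt

# Assumed illustrative parameters of a bivariate normal (m = 1, n = 2)
mu1, mu2 = 1.0, -1.0
s11, s12, s22 = 2.0, 0.8, 1.5        # Σ = [[s11, s12], [s12, s22]]

def phi(x, m, v):
    """Univariate normal pdf with mean m and variance v."""
    return exp(-0.5 * (x - m) ** 2 / v) / sqrt(2 * pi * v)

def joint(z1, z2):
    """Bivariate normal pdf with the quadratic form written out for n = 2."""
    det = s11 * s22 - s12 ** 2
    q = (s22 * (z1 - mu1) ** 2
         - 2 * s12 * (z1 - mu1) * (z2 - mu2)
         + s11 * (z2 - mu2) ** 2) / det
    return exp(-0.5 * q) / (2 * pi * sqrt(det))

z2_0 = 0.5                                  # conditioning value z°(2)
m_cond = mu1 + s12 / s22 * (z2_0 - mu2)     # conditional mean, Theorem 4.10
v_cond = s11 - s12 ** 2 / s22               # conditional variance Σ11.2

for z1 in (-1.0, 0.0, 2.0):
    ratio = joint(z1, z2_0) / phi(z2_0, mu2, s22)   # f(z1 | z2°) by definition
    print(ratio, phi(z1, m_cond, v_cond))
```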
The following theorem states that in the case of a normal distribution, zero covariance implies
independence of the random variables, which in general is not true for other distributions.
Theorem 4.11 Let X = (X_1, ..., X_n)′ be a N(μ, Σ)-distributed random variable. Then (X_1, ..., X_n)
are independent iff Σ is a diagonal matrix.
Proof
To see that under normality zero covariances imply independence, consider the joint pdf

f(x; μ, Σ) = 1/((2π)^{n/2} |Σ|^{1/2}) exp{ −(1/2)(x − μ)′ Σ^{−1} (x − μ) }.

The diagonality of Σ implies that

|Σ| = Π_{i=1}^n σ_i²,   Σ^{−1} = diag(1/σ_1², . . . , 1/σ_n²).

Therefore, the joint pdf factors into the product

f(x; μ, Σ) = Π_{i=1}^n (1/((2π)^{1/2} σ_i)) exp{ −(1/2) (x_i − μ_i)²/σ_i² },

which is the product of the n marginal densities. This implies independence. The proposition that
independence implies zero covariances was proven in Chapter 3 (Theorem 3.19) and is true for any
joint distribution.

Remark 4.11 Please note that marginal normality does not imply joint normality, not even
for uncorrelated variables. To see this, take e.g. X_1 standard normal and X_2 = SX_1, where S
is +1 or −1 with probability 0.5 each and independent of X_1. Then X_2 is standard normal too, and
uncorrelated with X_1, but not independent of X_1.
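This construction is easy to reproduce by simulation; a minimal sketch (the sample size is an arbitrary choice):

```python
import random
from statistics import fmean

random.seed(2)
N = 100_000

x1 = [random.gauss(0, 1) for _ in range(N)]
s  = [random.choice((-1.0, 1.0)) for _ in range(N)]
x2 = [si * xi for si, xi in zip(s, x1)]      # X2 = S·X1, S independent of X1

# Zero correlation: E(X1 X2) = E(S) E(X1²) = 0 ...
cov = fmean([a * b for a, b in zip(x1, x2)])
print(cov)

# ... yet complete dependence: |X2| = |X1| always
print(all(abs(a) == abs(b) for a, b in zip(x1, x2)))
```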

Remark 4.12 Sometimes, one may be interested in the behavior of so-called quadratic forms
in X, i.e. Q = X′AX with A some conformable matrix. While there are several results,
e.g. on the mean and variance of Q, we present a result for a relevant particular case, A = Σ^{−1}:
if X ~ N(μ, Σ), then X′Σ^{−1}X ~ χ²(dim(X), μ′Σ^{−1}μ), with χ²(r, δ) a so-called non-central
chi-squared distribution with r degrees of freedom and non-centrality parameter δ (if δ = 0, the
usual χ² with r degrees of freedom is recovered).

4.4. Exponential Class of Distributions


The majority of families of distributions introduced so far are special cases of the exponential
class of distributions.
As we shall see later, the families of distributions from the exponential class have many nice
statistical properties which often simplify procedures for statistical inference (see Advanced
Statistics II ).

Definition 4.2 (Exponential class of densities) The pdf f(x; Θ) is a member of the exponential class of pdfs iff it has the form

f(x; Θ) = exp{ Σ_{i=1}^k c_i(Θ) g_i(x) + d(Θ) + z(x) }   for x ∈ A,
f(x; Θ) = 0   otherwise,

where

x = (x_1, . . . , x_n)′;
Θ = (Θ_1, . . . , Θ_k)′;
c_i(Θ), d(Θ)   real-valued functions of Θ not depending on x;
g_i(x), z(x)   real-valued functions of x not depending on Θ;
A ⊂ R^n   a range/support which does not depend on Θ.

Remark 4.13 In order to check whether a family of pdfs belongs to the exponential class, we
must identify the functions c_i(Θ), d(Θ), g_i(x), z(x) and show that the family has pdfs of the
form given in the definition, which is not always trivial.

Example 4.7 Consider a univariate N(μ, σ²) distribution with n = 1 and k = 2 (# of parameters). Set

c_1(Θ) = μ/σ²,  c_2(Θ) = −1/(2σ²),  g_1(x) = x,  g_2(x) = x²;

d(Θ) = −(1/2) ln(2πσ²) − μ²/(2σ²),  z(x) = 0,  A = R.

Substitution into the definition of the exponential class yields

f(x; Θ) = exp{ (μ/σ²) x − (1/(2σ²)) x² − (1/2) ln(2πσ²) − μ²/(2σ²) }

= (1/((2π)^{1/2} σ)) exp{ −(1/2) (x − μ)²/σ² }.
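The algebra of Example 4.7 can be double-checked numerically: the exponential-class representation should reproduce the textbook normal pdf for any x. A minimal sketch, with μ and σ² as arbitrary illustrative values:

```python
from math import exp, log, pi, sqrt

mu, sigma2 = 0.7, 2.5          # assumed illustrative N(μ, σ²) parameters

def normal_pdf(x):
    """Standard textbook form of the N(μ, σ²) pdf."""
    return exp(-0.5 * (x - mu) ** 2 / sigma2) / sqrt(2 * pi * sigma2)

# Exponential-class pieces from Example 4.7
c1 = mu / sigma2               # c1(Θ) = μ/σ²
c2 = -1.0 / (2 * sigma2)       # c2(Θ) = −1/(2σ²)
d = -0.5 * log(2 * pi * sigma2) - mu ** 2 / (2 * sigma2)   # d(Θ)

def expo_class_pdf(x):
    """exp{c1 g1(x) + c2 g2(x) + d + z(x)} with g1(x)=x, g2(x)=x², z(x)=0."""
    return exp(c1 * x + c2 * x ** 2 + d)

for x in (-2.0, 0.0, 1.3):
    print(normal_pdf(x), expo_class_pdf(x))
```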

Remark 4.14 Further members of the exponential class are: Bernoulli, binomial, multinomial,
negative binomial, Poisson, geometric, gamma, chi-square, exponential, beta, etc. Families of
distributions that do not belong to the exponential class are, e.g.: discrete uniform, continuous
uniform, hypergeometric.

4.5. Further extensions


The distribution models we have discussed so far are widely used in practice. They will however
not be perfect choices in all practical situations, and more flexible variants may be required at
times. We review a couple of the more popular choices now.


Generalized normal distribution


There are several generalizations of the normal distribution, e.g. one exhibiting skewness; see
below. Another focuses on the tails of the distribution. The t distribution (see Section 6.5)
can be seen as such a generalization. The particular generalization discussed now (also called
generalized error distribution, exponential power distribution, or generalized Gaussian distribution) has the same exponential structure as the normal (which the t-density does not have),
but allows for different powers in the exponential.
Family Name: Generalized normal

Parameterization

μ ∈ R, α, β ∈ (0, ∞)

Density Definition

f(x; μ, α, β) = β/(2αΓ(1/β)) e^{−(|x−μ|/α)^β}, x ∈ R

CDF

F(x) = 1/2 + sgn(x − μ) γ(1/β, (|x − μ|/α)^β) / (2Γ(1/β)),

where γ(s, x) = ∫_0^x t^{s−1} e^{−t} dt (lower incomplete gamma function)

Moments

E(X) = μ, σ² = α² Γ(3/β)/Γ(1/β), μ_3 = 0

The Laplace distribution⁴ is a popular particular case obtained from setting β = 1. The other
obvious popular particular case is, for β = 2, the normal distribution. For β → ∞, the limit is
a uniform distribution (which one?).

Skewed distributions
Some of the basic distributions we examined so far are skewed, but this is not the case with any
of the distributions having R(X) = R. Stock returns, for instance, exhibit a slight skewness, and
one may want to capture that when modelling them. There are several approaches designed
to "skew" given (continuous) symmetric (about zero) distributions (typically used with the
normal, the t or the Laplace distributions). One is to simply multiply the base density with
two different weights, one for positive and one for negative x. Another, slightly more
elegant (and more general), way is to attach a continuum of weights.
Family Name: Generic skew distributions

Parameterization

λ ∈ R, h a pdf, G a cdf,
h and g = G′ continuous, symmetric about 0

Density Definition

f(x) = 2 h(x) G(λx)

⁴ The Laplace distribution is also called double exponential; yet be careful here, there is another distribution,
the Gumbel distribution (or generalized extreme value distribution of type I), with cdf F(x) =
exp{ −exp{ −(x − μ)/β } }, which sometimes goes by the same name for obvious reasons.



Most characteristics depend on the precise shape of h and G, except for the symmetric case
λ = 0, where G is irrelevant and E(X) = 0. Note also that the distribution is not standardized
even if h or G are.
For λ → ∞ (λ → −∞), f(x) converges to the positive (negative) half-h distribution given by f(x) =
2h(x) I_{(0,∞)}(x) (f(x) = 2h(x) I_{(−∞,0)}(x)). For λ = 0, symmetry is recovered.
Remark 4.15 The pdf f is indeed a pdf irrespective of the choices of h and G. To see this,
consider two independent random variables, X_1 ~ G and X_2 ~ h. Since both are symmetric
about the origin, X_1 − λX_2 must also be symmetric, such that

P(X_1 − λX_2 < 0) = 1/2.

At the same time,

P(X_1 − λX_2 < 0) = E( I_{(−∞,0)}(X_1 − λX_2) )

= E( E( I_{(−∞,0)}(X_1 − λX_2) | X_2 ) )   (LIE)

= E( P(X_1 ≤ λX_2 | X_2) ).

But P(X_1 ≤ λX_2 | X_2) = G(λX_2), such that

P(X_1 − λX_2 < 0) = E( G(λX_2) ) = ∫_{−∞}^{∞} G(λx) h(x) dx = 1/2,

and therefore

∫_{−∞}^{∞} f(x) dx = 2 ∫_{−∞}^{∞} G(λx) h(x) dx = 1.

The best-known member of the class is the skew normal distribution, with density given by

f(x) = 2 φ(x) Φ(λx),   λ ∈ R.

Location-scale families
As the name says, the focus is here on location and scale of the distribution, but less on its
shape. The important observation (which we also made in conjunction with the normal and
the standard normal distribution) is that one may construct an entire class of distributions by
shifting and scaling a given base distribution.
Shifting and scaling a pdf are essentially linear transformations of the underlying random
variable. Let Z be some random variable with cdf G. A variable X defined as a linear transformation of Z, X = a + bZ with b > 0, then has expectation E(X) = a + b E(Z) and standard deviation
σ_X = b σ_Z. The cdf of X is F(x) = G((x − a)/b) and the pdf is f(x) = (1/b) g((x − a)/b). Assume now that
Z is standardized, i.e. E(Z) = 0 and Var(Z) = 1. Then, the distribution of X has the shape
of that of Z, with expectation E(X) = a and variance Var(X) = b², i.e. only the coefficients
of the linear transformation. This leads us to

Family Name: Location-scale (univariate)

Parameterization

μ ∈ R, σ ∈ (0, ∞), g a pdf

Density Definition

f(x; μ, σ) = (1/σ) g((x − μ)/σ)

CDF

F(x) = G((x − μ)/σ)

Moments

μ, σ² (if g is standardized with finite variance)

MGF

M_X(t) = e^{μt} M_Z(σt)

Note that the family can actually be defined for base densities that do not have finite variance
(or even expectation); this is the reason why the class is called location-scale and not mean-variance.

Remark 4.16 This is quite a generic class of distributions, closed under linear transformations.
I.e., if X has a location-scale distribution (with a given base g), then so does Y = a + bX for
any a and any b ≠ 0.
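The relation F(x) = G((x − μ)/σ) and f(x; μ, σ) = (1/σ)g((x − μ)/σ) can be checked by differentiating F numerically; a minimal sketch using the standard logistic as an assumed illustrative base distribution (it has a closed-form cdf, though it is standardized in location only):

```python
from math import exp

# Base cdf G and pdf g: standard logistic (symmetric about 0) — an assumed
# illustrative choice of base distribution.
def G(z):
    return 1.0 / (1.0 + exp(-z))

def g(z):
    return exp(-z) / (1.0 + exp(-z)) ** 2

mu, sigma = 2.0, 0.5     # location and scale parameters (arbitrary values)

def F(x):
    """Location-scale cdf F(x) = G((x − μ)/σ)."""
    return G((x - mu) / sigma)

def f(x):
    """Location-scale pdf f(x; μ, σ) = (1/σ) g((x − μ)/σ)."""
    return g((x - mu) / sigma) / sigma

# The location-scale pdf should be the derivative of the location-scale cdf.
h = 1e-6
for x in (1.0, 2.0, 3.5):
    print((F(x + h) - F(x - h)) / (2 * h), f(x))
```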

The class can be extended multivariately by writing

X = μ + HZ,   μ ∈ R^n, H an (n × n) matrix.

If Z has zero mean and the identity matrix as covariance matrix, E(X) = μ and Cov(X) =
HH′ = Σ, leading to

Family Name: Location-scale (multivariate)

Parameterization

μ ∈ R^n, Σ pos. def., g a multivariate pdf

Density Definition

f(x; μ, Σ) = |Σ|^{−1/2} g(Σ^{−1/2}(x − μ))

Moments

μ, Σ (if g is standardized with finite variance)

It is often convenient to pick g such that it is the density of a vector of independent standardized
random variables, but, in practice, it suffices otherwise that its mean is zero and its covariance
matrix is the identity matrix. As mentioned before, the multivariate normal distribution is a
member of this class.
member of this class.
One interesting subclass of multivariate location-scale distributions is the class of the elliptical
distributions, defined as

Family Name: Elliptical distributions

Parameterization

μ ∈ R^n, Σ pos. def., g : g(‖x‖²) integrable

Density kernel

f(x; μ, Σ) = g((x − μ)′ Σ^{−1} (x − μ))

The name comes from the fact that the level curves of the density function are ellipses. Again,
the multivariate normal distribution is a particular case, with g(u) = e^{−u/2}. The exact expression
of the pdf depends strongly on the particular choice of g and on the dimension n, but the covariance
matrix (if finite) is proportional to Σ (for this reason, the correlations are the same).

Mixtures
This class of distributions is well-motivated when one could physically model outcomes of an
experiment as first choosing at random from which population (distribution) an observation
should come, and then drawing the actual observation (according to, but independent of, the
distribution picked in the first step). The resulting overall distribution can easily be derived using the law of total probability as a linear combination of the densities in the second
step. Abstracting from the physical structure of the probability space, we may simply use
the resulting expressions for distributions, since they deliver more flexible distribution models
(essentially a continuum of distributions between two base shapes).
Family Name: Mixture distribution (countable)

Parameterization

w_i ≥ 0, Σ_{i≥1} w_i = 1, f_i pdfs

Density definition

f(x) = Σ_i w_i f_i(x)

Moments

μ = Σ_i w_i μ_i,
Σ = Σ_i w_i [ (μ_i − μ)(μ_i − μ)′ + Σ_i ]

Note that the linear combination involves density functions and not the corresponding random
variables! The difference can e.g. be seen in the expression of the covariance matrix.
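The two-step sampling scheme described above can be simulated directly, and the sample mean then compared with the mixture mean μ = Σ w_i μ_i; a minimal sketch with assumed illustrative weights and component parameters:

```python
import random
from statistics import fmean

random.seed(3)

# Two-component univariate Gaussian mixture (assumed illustrative values)
ws  = [0.3, 0.7]          # weights w_i
mus = [-1.0, 2.0]         # component means μ_i
sds = [0.5, 1.5]          # component standard deviations

mix_mean = sum(w * m for w, m in zip(ws, mus))   # μ = Σ w_i μ_i

# Physical two-step experiment: first pick the component, then draw from it.
N = 100_000
draws = []
for _ in range(N):
    i = 0 if random.random() < ws[0] else 1
    draws.append(random.gauss(mus[i], sds[i]))

print(fmean(draws), mix_mean)
```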

Remark 4.17 One can imagine an uncountable mixture, f(x) = ∫ f(x; θ) w(θ) dθ, where the
variable θ plays the role of an index and w(θ) that of weights (analogously to the countable case,
w should be non-negative and integrate to unity to maintain the interpretation as weights).
These mixtures are sometimes called compound distributions⁵ and one may allow for vector-valued θ as well.⁶ Like a countable mixture, the compound distribution simply marginalizes
over draws from the parameter distribution.
The best-known members of this class are Gaussian mixtures, where the f_i are normal densities
with different means and covariance matrices (or variances in the univariate case).
⁵ The name comes from treating the parameters θ as random, possessing a distribution of their own, i.e. w(θ),
independent of the distributions f.
⁶ The parameters of this parameter (vector) distribution w are called meta-parameters.


5. Basic asymptotics
In this chapter we consider sequences of random variables of the form
Yn = g(X1 , ..., Xn ),

where n = 1, 2, 3, .....

A simple (and very typical) example for such a sequence is the average of n random variables
Y_n = (1/n) Σ_{i=1}^n X_i.

The objective of asymptotic theory is to establish results relating to the stochastic behavior of
such sequences Y_n when n → ∞. In particular,
Y_n may converge to a constant in various ways,
or the distribution of Y_n may converge to some limiting distribution.
What are reasons to study the asymptotic behavior of sequences of random variables Yn ?
Estimators for parameters and test statistics are typically functions such as Yn = g(X1 , ..., Xn ),
where n refers to the sample size (number of data observations). In order to evaluate/compare
the quality of such estimators and test statistics it is necessary to know their probability
characteristics and distributions. However, in many cases their actual probability density or
distribution is unknown or analytically intractable, when n is finite. Asymptotic theory often
provides tractable approximations to the distribution of functions g(X1 , ..., Xn ), when n is
sufficiently large.
In the following section we begin by reviewing some basic concepts from real analysis.

5.1. Convergence of Number and Function Sequences


Recall the following

Definition 5.1 (Convergence of real sequences) A sequence of real numbers {y_n} converges to y ∈ R¹ iff for every real ε > 0 there exists an integer N(ε) such that

|y_n − y| < ε  ∀ n ≥ N(ε).

The existence of the limit is denoted by y_n → y or lim_{n→∞} y_n = y.

It can be shown that for the limit of a sequence of numbers to exist, it is necessary (but not
sufficient) that the sequence is bounded, as defined next.
Definition 5.2 (Boundedness of real number sequences) A sequence of real numbers {y_n}
is bounded iff there exists a finite number m > 0 such that

|y_n| ≤ m  ∀ n ∈ N.
Example 5.1 The sequence y_n = 3 + 1/n², n ∈ N, is bounded, since |y_n| ≤ 4 ∀ n ∈ N, and has
a limit, y_n → 3. The sequence y_n = sin(n), n ∈ N, is bounded, since |sin(x)| ≤ 1 ∀ x, but does
not have a limit, since sin(n) cycles between +1 and −1.
Remark 5.1 The concept of convergence can be extended to sequences of real valued matrices
by applying the definition of convergence of real number sequences to the sequence of matrices
element by element.
A further important limit concept is the convergence of a function sequence.
Definition 5.3 (Convergence of Function Sequences) Let {f_n(x)}, n ∈ N, be a sequence of functions having a common domain D ⊆ R^m. The function sequence {f_n(x)} converges to a function f(x) with domain D′ ⊆ D iff for n → ∞

  f_n(x) → f(x) ∀ x ∈ D′.

f is called the limiting function of {f_n}.
Remark 5.2 The definition implies that the values of the functions f_n(x), n = 1, 2, 3, ..., converge to f(x) pointwise for each single x ∈ D′. Hence, f(x) can be viewed as an approximation of f_n(x) for any given x when n is large. This is not the same as f(x) being an approximation of f_n(x) at all x at the same time (uniform convergence, sup_x |f_n(x) − f(x)| → 0), but pointwise convergence is in many cases a good enough tool.
Example 5.2
- The function sequence f_n(x) = n⁻¹ + 2x², x ∈ R¹, n ∈ N, has the limit function f(x) = 2x², since lim_{n→∞} f_n(x) = 2x² ∀ x ∈ R¹.
- For the function sequence

    f_n(x) = xⁿ,   x ∈ [0, 1], n ∈ N,

  the limiting function is

    lim_{n→∞} f_n(x) = f(x) =  0  for x ∈ [0, 1),
                               1  for x = 1.

  Note that f_n(x) is continuous at each point of the domain D = [0, 1]. In contrast, the limit function f(x) is not continuous at x = 1.
In order to characterize the convergence properties of a sequence, we may use the concept of
the order of magnitude of a sequence.
Definition 5.4 (Order of Magnitude of a Sequence) Let {x_n} be a real number sequence.
- {x_n} is said to be at most of order n^k, denoted by O(n^k), if there exists a finite constant c such that

    |x_n / n^k| ≤ c ∀ n ∈ N.

- {x_n} is said to be of order smaller than n^k, denoted by o(n^k), if

    x_n / n^k → 0.
Remark 5.3 Note that
1. if {x_n} is O(n^k), then {x_n} is o(n^{k+δ}) ∀ δ > 0;
2. if {x_n} is o(n^k), then {x_n} is O(n^k) (where for any c > 0 the bound holds for all sufficiently large n).
Notationally, O(n⁰) and o(n⁰) are written as O(1) and o(1). Also, note that k need not be positive. The big-Oh and small-Oh are also known as Landau symbols.
Example 5.3 Let {x_n} be defined by x_n = 3n³ − n² + 2, n ∈ N. Since

  x_n / n³ = 3 − 1/n + 2/n³ → 3 < ∞,

we have

  x_n = O(n³).

Since for a positive δ

  x_n / n^{3+δ} = n^{−δ} (3 − 1/n + 2/n³) → 0,

we have

  x_n = o(n^{3+δ}).
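A quick numerical sketch of Example 5.3 (an added illustration, not in the original notes): the ratio x_n/n³ stays bounded, while x_n/n^{3+δ} drains to zero.

```python
# x_n = 3n^3 - n^2 + 2 from Example 5.3
def x(n):
    return 3 * n**3 - n**2 + 2

# O(n^3): |x_n / n^3| <= c for all n; c = 4 works (x_1/1 = 4, and the ratio -> 3)
assert all(x(n) / n**3 <= 4 for n in range(1, 10_000))

# o(n^{3+delta}): x_n / n^{3+delta} -> 0 for any delta > 0 (shown for delta = 0.1)
delta = 0.1
ratios = [x(n) / n ** (3 + delta) for n in (10, 10**3, 10**6)]
assert ratios[0] > ratios[1] > ratios[2] and ratios[2] < 0.9
```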


5.2. Convergence Concepts for Sequences of Random Variables

In this section we extend the convergence concepts for real number sequences to sequences of random variables. For sequences of random variables, we distinguish among the following types/modes of convergence:
1. convergence in distribution;
2. convergence in probability;
3. convergence in mean square;
4. almost-sure convergence.
In the following subsection, we begin with the weakest mode of convergence, the convergence
in distribution.

Convergence in Distribution
Definition 5.5 (Convergence in Distribution) Let {Y_n} be a sequence of random variables with an associated sequence of cdfs {F_n}. If there exists a cdf F such that as n → ∞

  F_n(y) → F(y) at every point y at which F is continuous,

we say that Y_n converges in distribution to the random variable Y with cdf F. We denote this by Y_n →d Y or Y_n →d F. The function F is called the limiting cdf/limiting distribution of {Y_n}.
Remark 5.4 The definition implies that if Y_n →d Y, then as n becomes large, the actual cdf of Y_n can be approximated by the cdf F of the random variable Y. The associated approximation error disappears as n → ∞ (see below).

Remark 5.5 The limiting cdf F can be the cdf of a degenerate random variable with Y = c, where c is a constant. In this case, we say that Y_n converges in distribution to a constant, and we denote this by Y_n →d c.

Example 5.4 Let {Y_n} be a sequence of random variables with an associated sequence of cdfs {F_n} given by

  F_n(y) =  0         for y < 0,
            (y/θ)ⁿ    for 0 ≤ y < θ,
            1         for y ≥ θ.

We see that as n → ∞,

  F_n(y) → F(y) =  0  for y < θ,
                   1  for y ≥ θ,

which is the cdf of a random variable degenerate at θ, and we have Y_n →d θ.
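For intuition: a cdf of the form (y/θ)ⁿ on [0, θ) arises, for instance, as the cdf of the maximum of n iid Uniform(0, θ) draws (this interpretation is an added illustration, not stated in the notes). A Python sketch of the collapse onto θ:

```python
import random

random.seed(0)
theta = 2.0

def F_n(y, n):
    # cdf from Example 5.4
    if y < 0:
        return 0.0
    if y < theta:
        return (y / theta) ** n
    return 1.0

# pointwise convergence to the degenerate cdf I_{[theta, inf)}(y)
for y in (0.5, 1.0, 1.9):
    assert F_n(y, 500) < 1e-8      # -> 0 for y < theta
assert F_n(theta, 500) == 1.0      # = 1 for y >= theta

# simulated maxima of n = 500 uniforms concentrate near theta
maxima = [max(random.uniform(0, theta) for _ in range(500)) for _ in range(200)]
assert all(m > theta - 0.1 for m in maxima)
```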


Remark 5.6 For non-negative, integer-valued discrete random variables and for continuous random variables, the convergence of the sequence of pdfs f_n(y) of the random variables Y_n to the pdf f(y) of a random variable Y is sufficient for establishing convergence in distribution of Y_n to Y (Y_n →d Y). For more details see Mittelhammer (1996, Theorem 5.1).
Example 5.5 Consider the random variables {Z_n} with

  Z_n = (3 + 1/n) Y + 2n/(n − 1),   where Y ~ N(0, 1) ∀ n.

The associated sequence of pdfs {f_n} is

  f_n = N( 2n/(n − 1), (3 + 1/n)² ),

such that

  f_n → N(2, 9).

Hence Z_n →d Z ~ N(2, 9).


The following theorem is based upon the uniqueness of MGFs (see Theorem 3.15), and is very useful for identifying limiting distributions.

Theorem 5.1 Let {Y_n} be a sequence of random variables having an associated sequence of MGFs {M_{Y_n}(t)}. Let M_Y(t) be the MGF of Y. Then

  Y_n →d Y   iff   M_{Y_n}(t) → M_Y(t) ∀ t ∈ (−h, h), for some h > 0.

Proof
See Lukacs (1970, p. 49-50), Characteristic Functions, London, Griffin.

The theorem implies that if we can establish that lim_{n→∞} M_{Y_n}(t) is equal to the MGF M_Y(t), then the distribution associated with M_Y(t) is the limiting distribution of the sequence {Y_n}.
Example 5.6 Let

  X_n ~ χ²(n)   with MGF   M_{X_n}(t) = (1 − 2t)^{−n/2} ∀ n.

Consider

  Z_n = (X_n − n)/√(2n) = −√(n/2) + (1/√(2n)) X_n.

(Since E(X_n) = n and Var(X_n) = 2n, the variable Z_n is a standardized χ² variable.) The MGF of Z_n is therefore

  M_{Z_n}(t) = e^{−√(n/2) t} M_{X_n}( t/√(2n) ) = e^{−√(n/2) t} ( 1 − √(2/n) t )^{−n/2}.

In order to establish the limit of M_{Z_n}(t) as n → ∞, we consider the limit of the transformation ln M_{Z_n}(t). First note that

  ln M_{Z_n}(t) = −(n/2) ln( 1 − √(2/n) t ) − √(n/2) t.

A Taylor series expansion of the first term on the r.h.s. around t = 0 yields

  ln M_{Z_n}(t) = [ √(n/2) t + t²/2 + o(1) ] − √(n/2) t = t²/2 + o(1) → t²/2.

(Recall, o(1) represents a term of order smaller than n⁰ = 1, disappearing as n → ∞.) It follows that

  lim_{n→∞} M_{Z_n}(t) = lim_{n→∞} e^{ln M_{Z_n}(t)} = e^{lim_{n→∞} ln M_{Z_n}(t)} = e^{t²/2}.

Since exp(t²/2) is the MGF of a N(0, 1)-distribution, we know by Theorem 5.1 that Z_n →d Z ~ N(0, 1).
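The limit in Example 5.6 can be checked numerically. A small Python sketch (an added illustration, not part of the notes) evaluates M_{Z_n}(t) for growing n and compares it with exp(t²/2):

```python
import math

def mgf_Zn(t, n):
    # M_{Z_n}(t) = e^{-sqrt(n/2) t} (1 - sqrt(2/n) t)^{-n/2}, valid for sqrt(2/n) t < 1
    return math.exp(-math.sqrt(n / 2) * t) * (1 - math.sqrt(2 / n) * t) ** (-n / 2)

t = 0.5
target = math.exp(t**2 / 2)   # N(0,1) MGF at t

# the approximation error shrinks monotonically over this grid of n
errors = [abs(mgf_Zn(t, n) - target) for n in (10, 100, 10_000)]
assert errors[0] > errors[1] > errors[2]
assert errors[2] < 1e-2
```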
Generally speaking, the asymptotic distribution for a random variable Z_n is any distribution that provides an approximation to the true distribution of Z_n for large n. If {Z_n} has a limiting distribution, this limiting distribution might be considered as an asymptotic distribution, since the limiting distribution provides an approximation to the distribution of Z_n for large n. But what should we do if the sequence {Z_n} has no limiting distribution or a degenerate limiting distribution (Z_n →d constant)? The following definition of the asymptotic distribution generalizes the concept of approximating distributions for large n to include cases where Z_n has no limiting distribution or a degenerate limiting distribution.

Definition 5.6 (Asymptotic Distribution) Let {Z_n} be a sequence of random variables defined by

  Z_n = g(X_n, θ_n),   where X_n →d X (nondegenerate) and {θ_n} is a sequence of parameter values.

Then an asymptotic distribution for Z_n is the distribution of g(X, θ_n), denoted by

  Z_n ~a g(X, θ_n)   ("Z_n is asymptotically distributed as g(X, θ_n)").

Example 5.7 In the last example, we showed that if X_n ~ χ²(n), then

  W_n = (X_n − n)/√(2n) →d W ~ N(0, 1).

Now consider the random variable

  Y_n = g(W_n, n) = √(2n) W_n + n;

according to the definition, the asymptotic distribution of Y_n is obtained as

  Y_n = g(W_n, n) ~a g(W, n) = √(2n) W + n ~ N(n, 2n),

and hence Y_n ~a N(n, 2n). Note that Y_n, in contrast to W_n, has no limiting distribution. (For this reason, many speak of an approximate distribution.)

The following theorem facilitates identification of the limiting distribution of continuous functions of random variables with a limiting distribution.

Theorem 5.2 Let X_n →d X, and let g(X_n) be a continuous function which depends on n only via X_n. Then g(X_n) →d g(X).

Proof
See Serfling (1980, p. 24-25), Approximation Theorems, New York, Wiley.

Example 5.8 Consider Z_n →d Z ~ N(0, 1). Then

  g(Z_n) = 2Z_n + 5 →d 2Z + 5 ~ N(5, 4);

  g(Z_n) = Z_n² →d Z² ~ χ²(1).

Convergence in Probability
If a sequence of random variables {Y_n} converges in probability to a random variable Y, then the realizations of Y_n are arbitrarily close to the realizations of Y with probability approaching one as n → ∞.

Definition 5.7 (Convergence in probability) The sequence of random variables {Y_n} converges in probability to the random variable Y iff

  lim_{n→∞} P(|Y_n − Y| < ε) = 1 ∀ ε > 0.

We denote this by Y_n →p Y, or plim Y_n = Y, where Y is called the probability limit of Y_n.

Remark 5.7 The definition implies that if n is large enough, observing outcomes of Y_n is essentially equivalent to observing outcomes of Y. Also note that the probability limit Y can be a degenerate random variable with Y = c, where c is a constant. We still denote this by Y_n →p c.

Example 5.9 Consider the random variable Y_n with pdf

  f_n(y) = (1/n) I_{0}(y) + (1 − 1/n) I_{1}(y).

Hence we have P(|Y_n − 1| = 0) = 1 − 1/n → 1 as n → ∞, so that

  lim_{n→∞} P(|Y_n − 1| < ε) = 1 ∀ ε > 0,   and   plim Y_n = 1.

Example 5.10 Let Y ~ N(0, 1) and Z_n ~ N(0, 1/n), and assume Y and Z_n are independent. Consider

  Y_n = Z_n + Y ~ N(0, 1 + 1/n).

Since Y_n − Y = Z_n, we obtain by Chebyshev's inequality

  lim_{n→∞} P(|Y_n − Y| < ε) = lim_{n→∞} P(|Z_n| < ε) ≥ lim_{n→∞} ( 1 − Var(Z_n)/ε² ) = 1.

Hence plim Y_n = Y.
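A Monte-Carlo sketch of Example 5.10 (an added illustration with ad hoc simulation settings): the fraction of draws with |Y_n − Y| = |Z_n| ≥ ε shrinks as n grows.

```python
import random

random.seed(1)

def share_far(n, eps, reps=5000):
    # estimate P(|Y_n - Y| >= eps) = P(|Z_n| >= eps) with Z_n ~ N(0, 1/n)
    count = 0
    for _ in range(reps):
        z_n = random.gauss(0.0, (1.0 / n) ** 0.5)
        if abs(z_n) >= eps:
            count += 1
    return count / reps

eps = 0.2
shares = [share_far(n, eps) for n in (10, 100, 1000)]
assert shares[0] > shares[1] > shares[2]
assert shares[2] < 0.01     # essentially no mass left beyond eps for n = 1000
```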

Some differences between convergence in probability and convergence in distribution:
1. Y_n →d Y means that the random variables Y_n and Y have (in the limit) the same probability distribution. However, it is immaterial whether outcomes of Y_n and Y are related in any way. This results from the fact that random variables with the same distribution are not necessarily the same random variables.
2. Y_n →p Y involves the outcomes of Y_n and Y, and not merely their distributions. That is, for large enough n, observing outcomes of Y_n is essentially equivalent to observing outcomes of Y. Of course, this implies that Y_n and Y must have (in the limit) the same probability distribution.
3. Still, for any sequence Y_n converging in distribution, Y_n →d Y, one can construct a sequence of random variables Ỹ_n equivalent in distribution to Y_n (i.e. Ỹ_n ~ Y_n but otherwise unrelated) that converges in probability, though in general to an independent copy of Y.
The following theorem facilitates identification of the probability limit of continuous functions of sequences of random variables.

Theorem 5.3 Let X_n →p X, and let g(X_n) be a continuous function which depends on n only via X_n. Then plim g(X_n) = g(plim X_n) = g(X).

Proof
See Serfling (1980, p. 24-25), Approximation Theorems, New York, Wiley.

Remark 5.8 The theorem implies that the plim operator acts analogously to the standard lim operator of real analysis.
Example 5.11 Let X_n →p 3. Then the probability limit of Y_n = ln(X_n) + √X_n is

  plim Y_n = ln(plim X_n) + √(plim X_n) = ln(3) + √3.
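Theorem 5.3 in action (a Python sketch, added as illustration; the construction X_n = 3 + N(0, 1)/√n is one ad hoc choice of a sequence with plim X_n = 3):

```python
import math
import random

random.seed(2)
target = math.log(3) + math.sqrt(3)   # g(plim X_n) = ln(3) + sqrt(3)

def g(x):
    # the continuous map from Example 5.11
    return math.log(x) + math.sqrt(x)

def share_close(n, eps, reps=4000):
    # fraction of simulated g(X_n) within eps of g(3)
    close = 0
    for _ in range(reps):
        x_n = 3 + random.gauss(0.0, 1.0) / math.sqrt(n)   # X_n ->p 3
        close += abs(g(x_n) - target) < eps
    return close / reps

# for large n, nearly all realizations of g(X_n) sit next to ln(3) + sqrt(3)
assert share_close(10_000, eps=0.05) > 0.99
```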

The following theorem establishes further useful properties of the plim operator, which obtain as special cases of Theorem 5.3.

Theorem 5.4 For sequences of random variables X_n, Y_n, and a constant a:
1. plim(a X_n) = a (plim X_n);
2. plim(X_n + Y_n) = plim X_n + plim Y_n (the plim of a sum is the sum of the plims);
3. plim(X_n Y_n) = (plim X_n)(plim Y_n) (the plim of a product is the product of the plims);
4. plim(X_n / Y_n) = (plim X_n)/(plim Y_n) if Y_n ≠ 0 and plim Y_n ≠ 0.

Proof
All results follow from Theorem 5.3 and from the fact that the functions being considered are continuous and depend on n only via X_n and Y_n.

Remark 5.9 The results of Theorem 5.4 extend to matrices by applying them to matrices element-by-element; see Mittelhammer (1996, p. 244-245).
The following theorem indicates that convergence in probability implies convergence in distribution.

Theorem 5.5 Let Y_n →p Y; then Y_n →d Y.

Proof
Note that Y_n →p Y implies that observing outcomes of Y_n is essentially equivalent to observing outcomes of Y. Hence Y_n and Y must have the same probability distribution. For a formal proof, see Mittelhammer (1996, p. 246).

Example 5.12 Let Y_n = (2 + 1/n) X + 3, where X ~ N(1, 2). Then the plim of Y_n is

  plim Y_n = plim(2 + 1/n) · plim X + plim 3 = 2X + 3 = Y ~ N(5, 8).

The fact that Y_n →p Y ~ N(5, 8) implies according to Theorem 5.5 that Y_n →d Y ~ N(5, 8).
The converse of Theorem 5.5 is generally not true. However, in the special case where we have convergence in distribution to a constant, the converse of Theorem 5.5 does hold, as stated in the following theorem.

Theorem 5.6 Let Y_n →d c; then Y_n →p c.

Proof
Suppose that Y_n →d c. This implies for the cdf F_n of the sequence Y_n that F_n(y) → F(y) = I_{[c,∞)}(y) at every continuity point of F. Then as n → ∞,

  P(|Y_n − c| < ε) ≥ F_n(c + δ) − F_n(c − ε) → 1 − 0 = 1,   for δ ∈ (0, ε) and ε > 0,

since F_n(c + δ) → 1 and F_n(c − ε) → 0. It follows that Y_n →p c.

The following theorem combines the concepts of convergence in distribution and in probability to produce an extension of Theorem 5.2 that is useful for obtaining the limiting distribution of a wider variety of functions of random variables.

Theorem 5.7 (Slutsky's Theorem) Let X_n →d X and Y_n →p c. Then,
1. X_n + Y_n →d X + c;
2. X_n Y_n →d X c;
3. X_n / Y_n →d X/c, if Y_n ≠ 0 with probability 1 and c ≠ 0.

Proof
See e.g. Rohatgi and Saleh (2001, p. 270).

Convergence in Mean Square

Definition 5.8 (Convergence in Mean Square) The sequence of random variables {Y_n} converges in mean square to the random variable Y iff

  lim_{n→∞} E[(Y_n − Y)²] = 0.

We denote this by Y_n →m Y.

Remark 5.10 Since E[(Y_n − Y)²] essentially measures the average squared distance between Y_n and Y, convergence in mean square implies that Y_n and Y are arbitrarily close to one another as n → ∞. In particular, first and second-order moments of Y_n and Y converge to one another, as indicated in the following theorem, which provides necessary and sufficient conditions for convergence in mean square.

Theorem 5.8 It holds that Y_n →m Y iff
1. E(Y_n) → E(Y),
2. Var(Y_n) → Var(Y),
3. Cov(Y_n, Y) → Var(Y).

Proof
Necessity of conditions (1)-(3):
[E(Y_n) → E(Y)]: Note that

  |E(Y_n) − E(Y)| = |E(Y_n − Y)| ≤ E(|Y_n − Y|)   (since |x| ≥ x ∀ x)
                  = E[ (|Y_n − Y|²)^{1/2} ] ≤ [ E(|Y_n − Y|²) ]^{1/2} = [ E((Y_n − Y)²) ]^{1/2},

where the second inequality is Jensen's inequality for the concave function x^{1/2}. Hence the fact that E[(Y_n − Y)²] → 0 implies that |E(Y_n) − E(Y)| → 0, which in turn implies that E(Y_n) → E(Y).
[Var(Y_n) → Var(Y)]: By expansion of E(Y_n²) we obtain

  E(Y_n²) = E(Y_n² − 2Y Y_n + Y²) − E(Y²) + 2E(Y Y_n)
          = E[(Y_n − Y)²] + E(Y²) + 2E[Y(Y_n − Y)],

where the last term is bounded by

  |E[Y(Y_n − Y)]| ≤ [ E(Y²) E((Y_n − Y)²) ]^{1/2}

(Cauchy-Schwarz inequality: [E(WZ)]² ≤ E(W²) E(Z²), i.e., |E(WZ)| ≤ [E(W²) E(Z²)]^{1/2}). It follows that

  E[(Y_n − Y)²] + E(Y²) − 2[ E(Y²) E((Y_n − Y)²) ]^{1/2} ≤ E(Y_n²) ≤ E[(Y_n − Y)²] + E(Y²) + 2[ E(Y²) E((Y_n − Y)²) ]^{1/2}.

Hence the fact that E[(Y_n − Y)²] → 0 implies that E(Y_n²) → [0 + E(Y²) ± 2 · 0] = E(Y²), which together with (1) implies that Var(Y_n) → Var(Y).
[Cov(Y_n, Y) → Var(Y)]: Note that

  E[(Y_n − Y)²] = E(Y_n²) − 2E(Y Y_n) + E(Y²)
               = Var(Y_n) + [E(Y_n)]² − 2( Cov(Y_n, Y) + E(Y) E(Y_n) ) + Var(Y) + [E(Y)]².

If E[(Y_n − Y)²] → 0, with (1) and (2) we obtain from the preceding equality that

  2 Var(Y) = 2 lim_{n→∞} Cov(Y_n, Y),

which implies Cov(Y_n, Y) → Var(Y).
Sufficiency of conditions (1)-(3):
As shown above, we have

  E[(Y_n − Y)²] = Var(Y_n) + [E(Y_n)]² − 2( Cov(Y_n, Y) + E(Y) E(Y_n) ) + Var(Y) + [E(Y)]².

This shows directly that conditions (1)-(3) imply that E[(Y_n − Y)²] → 0.

The necessary and sufficient conditions in Theorem 5.8 simplify when Y is a constant, as stated in the following corollary.

Corollary 5.1 Y_n →m c iff E(Y_n) → c and Var(Y_n) → 0.

Proof
This follows directly from Theorem 5.8, upon letting Y = c and noting that E(c) = c and Var(c) = Cov(Y_n, c) = 0.

The following theorem indicates that convergence in mean square implies convergence in probability. This result can be useful as a tool for establishing convergence in probability in cases where convergence in mean square is relatively easy to demonstrate.

Theorem 5.9 Let Y_n →m Y; then Y_n →p Y.

Proof
Note that (Y_n − Y)² is a non-negative random variable, and letting a = ε² > 0, we have by Markov's inequality that

  P( (Y_n − Y)² ≥ ε² ) ≤ E[(Y_n − Y)²] / ε²,

or, for the complementary event,

  P( (Y_n − Y)² < ε² ) = P( |Y_n − Y| < ε ) ≥ 1 − E[(Y_n − Y)²] / ε².

Thus, mean square convergence, which means E[(Y_n − Y)²] → 0, implies

  lim_{n→∞} P(|Y_n − Y| < ε) = 1 ∀ ε > 0.

Thus, plim Y_n = Y.

Example 5.13 Let Y_n = (1/n) Σ_{i=1}^n X_i, where the X_i's are independently N(0, 1)-distributed. We have

  E(Y_n) = 0   and   Var(Y_n) = 1/n → 0,

which implies that Y_n →m 0, so that according to Theorem 5.9, Y_n →p 0.
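A Monte-Carlo check of Example 5.13 (an added illustration): the simulated mean squared error of Y_n tracks the theoretical Var(Y_n) = 1/n.

```python
import random

random.seed(3)

def simulate_mse(n, reps=20_000):
    # Monte-Carlo estimate of E[(Y_n - 0)^2] for Y_n = (1/n) sum of n iid N(0,1)
    total = 0.0
    for _ in range(reps):
        y_n = sum(random.gauss(0.0, 1.0) for _ in range(n)) / n
        total += y_n ** 2
    return total / reps

mse_10, mse_100 = simulate_mse(10), simulate_mse(100)
assert abs(mse_10 - 1 / 10) < 0.02     # theory: Var(Y_n) = 1/n
assert abs(mse_100 - 1 / 100) < 0.005
assert mse_100 < mse_10                # the mean squared error shrinks with n
```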

The following example demonstrates that the converse of Theorem 5.9 is generally not true. That means that convergence in probability does not imply mean square convergence.

Example 5.14 Let the pdf of Y_n be given by

  f_n(y) =  1 − 1/n²  for y = 0,
            1/n²      for y = n.

It immediately follows that P(Y_n = 0) → 1, so that plim Y_n = 0. However,

  E[(Y_n − 0)²] = E(Y_n²) = 0² · (1 − 1/n²) + n² · (1/n²) = 1 ∀ n,

so that Y_n does not converge to 0 in mean square.
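The two computations in Example 5.14 are simple enough to mirror directly in code (an added illustration):

```python
def prob_far(n, eps=0.5):
    # P(|Y_n - 0| >= eps) = P(Y_n = n) = 1/n^2 for n > eps: vanishes, so plim Y_n = 0
    return 1 / n**2

def second_moment(n):
    # E(Y_n^2) = 0^2 * (1 - 1/n^2) + n^2 * (1/n^2): the rare far-away atom keeps it at 1
    return 0**2 * (1 - 1 / n**2) + n**2 * (1 / n**2)

assert prob_far(1000) < 1e-5
assert all(abs(second_moment(n) - 1) < 1e-12 for n in range(2, 500))
```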

5.3. Weak Laws of Large Numbers

In this section, we consider the convergence behavior of a specific sequence of random variables, namely, the mean of a sequence of random variables

  X̄_n = (1/n) Σ_{i=1}^n X_i,   n = 1, 2, 3, ....

The convergence behavior of such a specific sequence deserves special attention, since a large number of parameter-estimation and hypothesis-testing procedures in econometrics can be defined in terms of averages of random variables. Hence, the analysis of the convergence behavior of averages is useful for analyzing the asymptotic behavior of these procedures.

Definition 5.9 (Weak Law of Large Numbers) Let {X_n} be a sequence of random variables with finite expected values E(X_n) = μ_n. We say that {X_n} obeys the weak law of large numbers (WLLN) if

  (1/n) Σ_{i=1}^n (X_i − μ_i) = X̄_n − μ̄_n →p 0.
Remark 5.11 For E(X_i) = μ_i = μ ∀ i, the WLLN implies that

  (1/n) Σ_{i=1}^n X_i − μ →p 0   and   (1/n) Σ_{i=1}^n X_i →p μ,

such that the (sample) average converges in probability to the expectation of the random variables.
There are a variety of conditions that can be placed on the stochastic behavior of the variables in the sequence {X_n} to ensure that they obey the WLLN. These conditions typically relate to independence, variances, and covariances of the variables in {X_n}. A WLLN which is based upon the condition that the variables in the sequence {X_n} are iid is Khinchin's WLLN, as follows.

Theorem 5.10 (Khinchin's WLLN) Let {X_n} be a sequence of iid random variables with finite expectations E(X_i) = μ ∀ i. Then X̄_n = (1/n) Σ_{i=1}^n X_i →p μ.
Proof
The proof here is based on the additional assumption that the MGF of X_i, denoted by M_{X_i}(t), exists.¹ The MGF of X̄_n is obtained as (see Section 3.5)

  M_{X̄_n}(t) = E[ exp( t (1/n) Σ_{i=1}^n X_i ) ] = E[ Π_{i=1}^n exp( (t/n) X_i ) ]   (by def.)
             = Π_{i=1}^n E[ exp( (t/n) X_i ) ] = Π_{i=1}^n M_{X_i}(t/n)   (by independence of the X_i's)
             = [ M_{X_i}(t/n) ]ⁿ.   (identical distribution of the X_i's)   (5.1)

For n → ∞, we get for M_{X̄_n}(t) = [M_{X_i}(t/n)]ⁿ

  lim_{n→∞} M_{X̄_n}(t) = lim_{n→∞} [ 1 + n( M_{X_i}(t/n) − 1 )/n ]ⁿ   (by rewriting M_{X_i}(t/n))
                        = exp{ lim_{n→∞} n [ M_{X_i}(t/n) − 1 ] }   (since lim_{n→∞} (1 + a_n/n)ⁿ = exp{lim_{n→∞} a_n}).

For the limit in the exponent we have

  lim_{n→∞} ( M_{X_i}(t/n) − 1 ) / n⁻¹ = "0/0".

By a mild abuse of l'Hospital's rule² we obtain

  lim_{n→∞} ( M_{X_i}(t/n) − 1 ) / n⁻¹ = lim_{n→∞} [ d( M_{X_i}(t/n) − 1 )/dn ] / [ d(n⁻¹)/dn ]
                                       = lim_{n→∞} [ M′_{X_i}(t/n) · (−t/n²) ] / ( −n⁻² )
                                       = lim_{n→∞} M′_{X_i}(t/n) · t = μ t,

since the first derivative M′_{X_i}(t*) converges to M′_{X_i}(0) = E(X_i) = μ as t* = t/n → 0. Thus lim_{n→∞} M_{X̄_n}(t) = e^{μt}. Note that e^{μt} is the MGF of a random variable that is degenerate at the expected value μ. Therefore, by Theorem 5.1 and Theorem 5.6 (X_n →d c ⟹ X_n →p c), we have X̄_n →p μ.

¹ For a more general proof, see Rao (1973), Linear Statistical Inference and Its Applications. New York: John Wiley & Sons.


Example 5.15 Let {X_n} be a sequence of iid random variables, with X_n ~ Gamma(α, β), such that E(X_n) = αβ. Khinchin's WLLN implies that

  X̄_n →p E(X_n) = αβ.

Hence, for large enough n, the outcome of the random variable X̄_n can be taken as a close approximation of αβ (or rather the other way round). This is the property of a consistent estimator for αβ, as we shall discuss in Advanced Statistics II.
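A Monte-Carlo sketch of Example 5.15 (an added illustration; note that Python's random.gammavariate(alpha, beta) uses beta as a scale parameter, so its mean is alpha * beta, matching the parameterization above):

```python
import random

random.seed(4)
alpha, beta = 2.0, 3.0   # E(X_i) = alpha * beta = 6

def sample_mean(n):
    # average of n iid Gamma(alpha, beta) draws
    return sum(random.gammavariate(alpha, beta) for _ in range(n)) / n

# for large n the sample average sits close to alpha * beta
errs = [abs(sample_mean(n) - alpha * beta) for n in (10, 10_000)]
assert errs[1] < 0.5
```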
Remark 5.12 Note that Khinchin's WLLN does not require the existence of the variances of the variables in the sequence {X_n}. On the other hand, it requires that the variables are iid. This is too restrictive for many situations we encounter, where the variables are not iid.
WLLNs that relax the iid assumption can be defined by imposing various other conditions on the (co)variances of the X_n's. The next theorem states necessary and sufficient conditions for a sequence {X_n} to satisfy the WLLN.

² Use the Stolz-Cesàro lemma for rigor.

Theorem 5.11 Let {X_n} be a sequence of random variables with finite variances, and let {μ_n} be the corresponding sequence of their expectations. Then

  X̄_n − μ̄_n →p 0   iff   E[ (X̄_n − μ̄_n)² / (1 + (X̄_n − μ̄_n)²) ] → 0.

Proof
Sufficiency: For any two positive numbers a ≤ b, we have a/(1 + a) ≤ b/(1 + b). Hence the event

  (X̄_n − μ̄_n)² ≥ ε²   implies the event   (X̄_n − μ̄_n)² / (1 + (X̄_n − μ̄_n)²) ≥ ε²/(1 + ε²).

It follows that³

  P( (X̄_n − μ̄_n)² ≥ ε² ) ≤ P( (X̄_n − μ̄_n)² / (1 + (X̄_n − μ̄_n)²) ≥ ε²/(1 + ε²) )
                           ≤ E[ (X̄_n − μ̄_n)² / (1 + (X̄_n − μ̄_n)²) ] / ( ε²/(1 + ε²) ),

where the last step is Markov's inequality applied to the positive random variable (X̄_n − μ̄_n)²/(1 + (X̄_n − μ̄_n)²). If the expectation on the r.h.s. → 0 as n → ∞, then, ∀ ε > 0, the probability on the l.h.s. → 0. That is,

  P( (X̄_n − μ̄_n)² ≥ ε² ) = P( |X̄_n − μ̄_n| ≥ ε ) → 0,

and for the complementary event

  P( |X̄_n − μ̄_n| < ε ) → 1,

so that X̄_n − μ̄_n →p 0.
Necessity: See Rohatgi and Saleh (2001, p. 276).

Remark 5.13 The theorem states that {X_n} obeys a WLLN, such that X̄_n − μ̄_n →p 0, iff the condition E[(X̄_n − μ̄_n)²/(1 + (X̄_n − μ̄_n)²)] → 0 is satisfied. But since this condition applies not to the individual variables in {X_n}, but to their average, Theorem 5.11 is of limited use. However, any condition placed on the individual variables in {X_n} that ensures that E[(X̄_n − μ̄_n)²/(1 + (X̄_n − μ̄_n)²)] → 0 is sufficient to guarantee that X̄_n − μ̄_n →p 0. The following theorem identifies one such condition.

Theorem 5.12 Let {X_n} be a sequence of random variables with respective expectations given by {μ_n}. If

  Var(X̄_n) → 0,

then X̄_n − μ̄_n →p 0.

³ Note that the event of tossing a 2 implies the event of tossing an even number.

Proof
Note that

  0 ≤ E[ (X̄_n − μ̄_n)² / (1 + (X̄_n − μ̄_n)²) ] ≤ E[ (X̄_n − μ̄_n)² ] = Var(X̄_n).

Hence, Var(X̄_n) → 0 implies that E[(X̄_n − μ̄_n)²/(1 + (X̄_n − μ̄_n)²)] → 0, and it follows from Theorem 5.11 that X̄_n − μ̄_n →p 0.


Example 5.16 Let {X_n} be a sequence of random variables with

  E(X_i) = μ_i = 1/2^i,   Var(X_i) = 4,   and   Cov(X_i, X_j) = 0 ∀ i ≠ j.

The mean and variance of the average X̄_n are

  E(X̄_n) = (1/n) Σ_{i=1}^n μ_i = (1/n) Σ_{i=1}^n 1/2^i = (1 − (1/2)ⁿ)/n,   Var(X̄_n) = 4/n.

Since Var(X̄_n) → 0, it follows by Theorem 5.12 that

  X̄_n − μ̄_n = X̄_n − (1 − (1/2)ⁿ)/n →p 0.

Also note that

  (1 − (1/2)ⁿ)/n → 0,   so that   X̄_n →p 0.

5.4. Central Limit Theorems

Central limit theorems (CLTs) are concerned with the conditions under which sequences of random variables converge in distribution to known families of distributions. Let {X_n} be a sequence of random variables, and let S_n = Σ_{i=1}^n X_i, n = 1, 2, .... Here we focus on the convergence in distribution of sequences of random variables of the following form:

  b_n^{−1} (S_n − a_n) →d Y ~ N(0, 1),

where {a_n} and {b_n} are sequences of appropriately chosen real constants. A statement of conditions on {X_n}, {a_n}, and {b_n} that ensure the convergence in distribution result constitutes a CLT.
What are the reasons to study the particular problem concerning the convergence in distribution specified above?
- As we shall see in the course Advanced Statistics II, many procedures for parameter estimation and hypothesis testing are specified as functions of sums of random variables such as S_n = Σ_{i=1}^n X_i.
- CLTs are then often useful for establishing the asymptotic distributions for those procedures. Recall that asymptotic distributions provide approximations to the exact but often unknown distribution.
Similar to the case of the WLLN, there are a variety of conditions that can be placed on the stochastic behavior of the variables in the sum S_n = Σ_{i=1}^n X_i that ensure convergence in distribution according to a CLT. The simplest, but least general, of all CLTs is the Lindeberg-Lévy CLT, which assumes iid variables.

Theorem 5.13 (Lindeberg-Lévy) Let {X_n} be a sequence of iid random variables with E(X_i) = μ and Var(X_i) = σ² ∈ (0, ∞) ∀ i. Then

  Y_n = (1/(σ√n)) Σ_{i=1}^n (X_i − μ) = √n (X̄_n − μ)/σ →d N(0, 1).

Proof
The proof here is based on the additional assumption that the MGF of X_i, denoted by M_{X_i}(t), exists.⁴ Consider the standardized variable

  Z_i = (X_i − μ)/σ,   which is iid with   E(Z_i) = 0,   Var(Z_i) = 1.

Now, rewrite Y_n as

  Y_n = (1/√n) Σ_{i=1}^n (X_i − μ)/σ = (1/√n) Σ_{i=1}^n Z_i.

The MGF of Y_n = (1/√n) Σ_{i=1}^n Z_i is obtained as

  M_{Y_n}(t) = E[ exp{ t (1/√n) Σ_{i=1}^n Z_i } ]   (def.)
             = Π_{i=1}^n M_{Z_i}(t/√n)   (by independence)
             = [ M_{Z_i}(t/√n) ]ⁿ,   (identical distribution)

and its log-transformation is

  ln M_{Y_n}(t) = n ln M_{Z_i}(t/√n) = L(t/√n) / n⁻¹,   where L(·) = ln M_{Z_i}(·).

Since ln M_{Z_i}(0) = ln(1) = 0, the limit lim_{n→∞} ln M_{Y_n}(t) has the indeterminate form 0/0. Applying l'Hospital's rule (again, abusively, since n is discrete) yields

  lim_{n→∞} ln M_{Y_n}(t) = lim_{n→∞} [ L′(t/√n) · (−(1/2) t n^{−3/2}) ] / ( −n⁻² ) = lim_{n→∞} [ L′(t/√n) · t ] / ( 2 n^{−1/2} ),

with

  L′(0) = M′_{Z_i}(0)/M_{Z_i}(0) = E(Z_i)/1 = 0,

so the expression is again of the form 0/0. A second application of l'Hospital's rule yields

  lim_{n→∞} ln M_{Y_n}(t) = lim_{n→∞} [ L″(t/√n) · (−(1/2) t n^{−3/2}) · t ] / ( −n^{−3/2} ) = lim_{n→∞} L″(t/√n) · t²/2.

Since

  L″(0) = M″_{Z_i}(0)/M_{Z_i}(0) − [M′_{Z_i}(0)]²/[M_{Z_i}(0)]² = E(Z_i²) − (E(Z_i))² = 1,

we obtain

  lim_{n→∞} ln M_{Y_n}(t) = t²/2.

Thus

  lim_{n→∞} M_{Y_n}(t) = exp{ lim_{n→∞} ln M_{Y_n}(t) } = exp{ t²/2 }.

Since M(t) = e^{t²/2} is the MGF of a N(0, 1) distribution, we have that Y_n →d N(0, 1). ∎

⁴ For a more general proof, see Rao (1973), Linear Statistical Inference and Its Applications. New York: John Wiley & Sons.

Remark 5.14 In order to understand the mechanics behind the Lindeberg-Lévy CLT, consider the random variable

  Σ_{i=1}^n X_i,   with   E( Σ_{i=1}^n X_i ) = nμ   and   Var( Σ_{i=1}^n X_i ) = nσ².

Note that Σ_{i=1}^n X_i does not have a limiting distribution, as its mean and variance diverge to ∞ as n increases. Hence, some form of centering and scaling of Σ_{i=1}^n X_i is necessary for obtaining a limiting distribution. An appropriate centering and scaling which stabilize the mean and variance is obtained by defining the random variable

  Y_n = (1/(σ√n)) [ Σ_{i=1}^n X_i − nμ ] = (1/√n) Σ_{i=1}^n Z_i
        (centering by nμ, scaling by 1/(σ√n)),

such that

  E(Y_n) = 0   and   Var(Y_n) = 1 ∀ n (!).

An additional effect of this centering and scaling is that it removes any tendencies for higher-order moments of the X_i's to cause the moments of Y_n to deviate from those of a N(0, 1) distribution as n → ∞. To see this, consider the third moment of Y_n, which is (see Mittelhammer, 1996, p. 271)

  E(Y_n³) = (1/√n) E(Z_i³).

Thus E(Y_n³) → 0 as n → ∞ regardless of the specific value of E(Z_i³) = E[ ((X_i − μ)/σ)³ ]. Recall that E(W³) = 0 if W ~ N(0, 1). Now consider the fourth moment of Y_n, which is

  E(Y_n⁴) = (1/n²) [ n E(Z_i⁴) + 3n(n − 1) ],

so that E(Y_n⁴) → 3 as n → ∞ regardless of the specific value of E(Z_i⁴) = E[ ((X_i − μ)/σ)⁴ ]. Recall that E(W⁴) = 3 if W ~ N(0, 1). This type of argument can be continued ad infinitum to show that all moments (assumed to be finite) of Y_n converge to their Gaussian counterparts.⁵
Remark 5.15 The Lindeberg-Lévy CLT states that

  Y_n = (1/(σ√n)) [ Σ_{i=1}^n X_i − nμ ] →d Y ~ N(0, 1).

This limiting distribution can be used to obtain asymptotic distributions for functions of Y_n. For the variable S_n = Σ_{i=1}^n X_i, for example, we obtain

  S_n = σ√n Y_n + nμ ~a σ√n Y + nμ,   such that   S_n ~a N(nμ, nσ²).

For the average X̄_n = (1/n) Σ_{i=1}^n X_i we have

  X̄_n = σ Y_n/√n + μ ~a σ Y/√n + μ,   such that   X̄_n ~a N(μ, σ²/n).

Note that according to the WLLN X̄_n →p μ, and hence X̄_n →d μ. Because the limiting distribution of X̄_n is degenerate, it provides no information about the variability of X̄_n for finite n and is useless for establishing an approximation of the distribution.

Example 5.17 Let {X_n} be a sequence of iid χ²(1) random variables with E(X_i) = 1 and Var(X_i) = 2. Recall that by the additivity property of the χ²-distribution (Theorem 4.2), Σ_{i=1}^n X_i ~ χ²(n). By the CLT of Lindeberg-Lévy, we have

  Y_n = ( Σ_{i=1}^n X_i − n·1 ) / √(n·2) →d N(0, 1),   such that   Σ_{i=1}^n X_i ~a N(n, 2n).

This implies that we can approximate the pdf of a χ²(n)-distribution by a (scaled) Gaussian pdf if the d.o.f. is large.

⁵ This argument can be exploited in order to prove the Lindeberg-Lévy CLT by means of a Taylor series expansion of the MGF of Y_n (see Rohatgi and Saleh, 2001, p. 297).
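A simulation sketch of Example 5.17 (an added illustration): draws of χ²(n) are generated as sums of n squared standard normals, standardized, and compared against a N(0, 1) probability.

```python
import math
import random

random.seed(5)
n = 400

def chi2_draw(n):
    # chi^2(n) as a sum of n squared iid standard normals (additivity property)
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n))

reps = 2000
draws = [chi2_draw(n) for _ in range(reps)]

# standardize: (chi^2(n) - n)/sqrt(2n) should look approximately N(0, 1)
z = [(d - n) / math.sqrt(2 * n) for d in draws]
share_below_one = sum(v <= 1.0 for v in z) / reps
assert abs(share_below_one - 0.8413) < 0.05   # Phi(1) is about 0.8413
```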
Remark 5.16 The CLT of Lindeberg-Lévy requires that the random variables are iid. However, in many applications the assumption that the variables are iid is violated, since we have variables which are correlated and/or have different distributions. Fortunately, there are various other CLTs which do not need the iid condition. Instead, they place alternative conditions on the stochastic behavior of the random variables in the sequence {X_n}. In the following, we consider such CLTs for variables which are not iid. The results are presented without proofs.⁶
A CLT for non-identically distributed random variables is that of Lindeberg. It is based on the condition that the contribution that each variable X_i makes to the variance of Σ_{i=1}^n X_i is negligible as n → ∞.

Theorem 5.14 (Lindeberg's CLT) Let {X_n} be a sequence of independent random variables with E(X_i) = μ_i and Var(X_i) = σ_i² < ∞ ∀ i. Define b_n² = Σ_{i=1}^n σ_i², σ̄_n² = (1/n) Σ_{i=1}^n σ_i², μ̄_n = (1/n) Σ_{i=1}^n μ_i, and let f_i be the pdf of X_i. If ∀ ε > 0,

  (continuous case:)  lim_{n→∞} (1/b_n²) Σ_{i=1}^n ∫_{(x_i − μ_i)² ≥ ε² b_n²} (x_i − μ_i)² f_i(x_i) dx_i = 0,

  (discrete case:)    lim_{n→∞} (1/b_n²) Σ_{i=1}^n Σ_{(x_i − μ_i)² ≥ ε² b_n², f_i(x_i) > 0} (x_i − μ_i)² f_i(x_i) = 0,

then

  ( Σ_{i=1}^n X_i − Σ_{i=1}^n μ_i ) / ( Σ_{i=1}^n σ_i² )^{1/2} = n^{1/2} ( X̄_n − μ̄_n ) / σ̄_n →d N(0, 1).

The conditions may seem a bit intricate, but they are implied, e.g., for independent sequences by b_n² → ∞ and E(|X_i|^{2+δ}) < C < ∞ ∀ i for some real δ > 0.
A further CLT for non-identically distributed random variables relies on the condition that the ranges of the variables are almost surely bounded.

Theorem 5.15 (CLT for bounded variables) Let {X_n} be a sequence of independent random variables such that P(|X_i| ≤ m) = 1 ∀ i for some m ∈ (0, ∞), and suppose E(X_i) = μ_i and Var(X_i) = σ_i² < ∞ ∀ i. If Σ_{i=1}^n Var(X_i) = Σ_{i=1}^n σ_i² → ∞ as n → ∞, then

  n^{1/2} ( X̄_n − μ̄_n ) / σ̄_n →d N(0, 1).

⁶ For the proofs, see Mittelhammer (1996, p. 274-282).

The CLTs presented so far are applicable to sequences of random scalars. In order to discuss CLTs for sequences of random vectors, that is, multivariate CLTs, a result of Cramér and Wold, termed the Cramér-Wold device, is very useful. The Cramér-Wold device allows us to reduce the question of convergence in distribution for random vectors to the question of convergence in distribution for random scalars. Thus it facilitates the use of CLTs for random scalars in order to obtain multivariate CLTs.

Theorem 5.16 (Cramér-Wold Device) The sequence of (k × 1)-dimensional random vectors {X_n} converges in distribution to the (k × 1)-dimensional random vector X iff

  ℓ′X_n →d ℓ′X ∀ ℓ ∈ R^k.

Proof
Sufficiency: Our proof of sufficiency assumes the existence of MGFs. By Theorem 5.1 (convergence of MGFs), the fact that

  ℓ′X_n →d ℓ′X   implies that   M_{ℓ′X_n}(t) → M_{ℓ′X}(t).

The MGFs obtain as

  M_{ℓ′X_n}(t) = E( e^{t ℓ′X_n} ) = E( e^{(t*)′X_n} ) = M_{X_n}(t*),   with t* = tℓ (i.e., t ℓ′ = (t*)′),

and likewise M_{ℓ′X}(t) = M_X(t*). Hence, if M_{ℓ′X_n}(t) → M_{ℓ′X}(t) ∀ ℓ, then M_{X_n}(t*) → M_X(t*), which implies that X_n →d X.
Necessity: Since ℓ′X_n is a continuous function g of X_n, Theorem 5.2 (limiting distributions of continuous functions) implies that

  ℓ′X_n = g(X_n) →d g(X) = ℓ′X.

Remark 5.17 The Cramér-Wold device implies that to establish convergence in distribution of a vector X_n to a vector X, it suffices to demonstrate that every linear combination of X_n converges in distribution to the corresponding linear combination of X. The implications of the Cramér-Wold device in the context of normal distributions are formalized in the following corollary.

Corollary 5.2 X_n →d X ~ N(μ, Σ) iff ℓ′X_n →d ℓ′X ~ N(ℓ′μ, ℓ′Σℓ) ∀ ℓ ∈ R^k.

Remark 5.18 The Corollary implies that to establish convergence in distribution of a vector $X_n$ to a multivariate Normal distribution it suffices to demonstrate that any linear combination of $X_n$ converges in distribution to a univariate Normal distribution by using an appropriate univariate CLT. Hence the Cramér-Wold Device allows us to extend CLTs for scalar variables to multivariate CLTs. The following CLT is a multivariate extension of the univariate CLT of Lindeberg-Lévy.

Theorem 5.17 (Multivariate Lindeberg-Lévy) Let $\{X_i\}$ be a sequence of iid $(k \times 1)$ random vectors with $E(X_i) = \mu$ and $\operatorname{Cov}(X_i) = \Sigma$ $\forall i$, where $\Sigma$ is a $(k \times k)$ positive definite matrix. Then
\[
\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right) \xrightarrow{d} N(0, \Sigma).
\]
Proof
Consider the linear combination of $X_i$ given by $Z_i = \ell' X_i$ $(\ell \neq 0)$. Note that $Z_i \sim$ iid, with
\[
E(Z_i) = \ell'\mu = \mu_z \qquad \text{and} \qquad \operatorname{Var}(Z_i) = \ell'\Sigma\ell = \sigma_z^2.
\]
Applying the Lindeberg-Lévy CLT for random scalars to the iid sequence $\{Z_i\}$ yields
\[
\frac{\sum_{i=1}^{n} Z_i - n\mu_z}{\sqrt{n}\,\sigma_z}
= \frac{\ell'}{\sqrt{\ell'\Sigma\ell}}\,\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right)
\xrightarrow{d} N(0, 1).
\]
Then, by Slutsky's Theorem, we get
\[
\ell'\,\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right)
= \underbrace{\sqrt{\ell'\Sigma\ell}}_{\text{constant}}\;\;
\underbrace{\frac{\ell'}{\sqrt{\ell'\Sigma\ell}}\,\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right)}_{\xrightarrow{d}\, N(0,1)}
\;\xrightarrow{d}\; N(0, \ell'\Sigma\ell).
\]
By the Cramér-Wold Device, the last equation is sufficient to conclude that
\[
\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right) \xrightarrow{d} N(0, \Sigma). \qquad \blacksquare
\]

Remark 5.19 Various other multivariate CLTs can be constructed using the Cramér-Wold Device, including the multivariate extensions of the Lindeberg CLT (Theorem 5.14) and of the CLT for bounded variables (Theorem 5.15).
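The multivariate Lindeberg-Lévy result can be checked by simulation. The following sketch (not part of the original notes; the uniform-based bivariate population, the direction $\ell$, and all sample sizes are illustrative assumptions) verifies the Cramér-Wold logic numerically: the linear combination $\ell'\sqrt{n}(\bar{X}_n - \mu)$ should have mean near 0 and variance near $\ell'\Sigma\ell$.

```python
import math
import random

random.seed(42)

# Illustrative bivariate population (assumption): X = (U, U + V), U, V iid Uniform(0,1),
# so mu = (0.5, 1.0) and Sigma = [[1/12, 1/12], [1/12, 2/12]].
mu = (0.5, 1.0)
Sigma = [[1 / 12, 1 / 12], [1 / 12, 2 / 12]]
ell = (1.0, -2.0)  # arbitrary direction (assumption) for the Cramer-Wold check

def cw_stat(n):
    """Return ell' * sqrt(n) * (Xbar_n - mu) for one sample of size n."""
    s1 = s2 = 0.0
    for _ in range(n):
        u, v = random.random(), random.random()
        s1 += u
        s2 += u + v
    z1 = math.sqrt(n) * (s1 / n - mu[0])
    z2 = math.sqrt(n) * (s2 / n - mu[1])
    return ell[0] * z1 + ell[1] * z2

reps, n = 2000, 400
draws = [cw_stat(n) for _ in range(reps)]
mean_hat = sum(draws) / reps
var_hat = sum((d - mean_hat) ** 2 for d in draws) / reps
# theoretical variance of the linear combination: ell' Sigma ell
var_theory = sum(ell[i] * Sigma[i][j] * ell[j] for i in range(2) for j in range(2))
print(mean_hat, var_hat, var_theory)
```

Here $\ell'\Sigma\ell = 5/12$; the Monte Carlo variance of the linear combination should land close to that value.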


5.5. Asymptotic Distributions of Functions of Asymptotically Normally Distributed Random Variables
In this section we consider the asymptotic distribution of differentiable functions of asymptotically normally distributed random variables. The practical use of results about this asymptotic
distribution is that once the asymptotic distribution of a sequence {Xn } is known, the asymptotic distribution of interesting functions of Xn need not be derived anew. We will apply
those results in order to obtain the asymptotic distribution for parameter estimators and test
statistics (Advanced Statistics II ).

Theorem 5.18 (Asymptotic Distribution of g(X_n); Delta method) Let $\{X_n\}$ be a sequence of $(k \times 1)$ random vectors such that $\sqrt{n}(X_n - \mu) \xrightarrow{d} Z \sim N(0, \Sigma)$. Let $g(x)$ be a function that has first-order partial derivatives in a neighborhood of the point $x = \mu$ that are continuous at $\mu$, and suppose the gradient vector of $g(x)$ evaluated at $x = \mu$,
\[
G_{(1 \times k)} = \left[\partial g(\mu)/\partial x_1 \;\; \ldots \;\; \partial g(\mu)/\partial x_k\right],
\]
is not the zero vector. Then
\[
\sqrt{n}\,(g(X_n) - g(\mu)) \xrightarrow{d} N(0, G\Sigma G') \qquad \text{and} \qquad g(X_n) \overset{a}{\sim} N\!\left(g(\mu),\; \tfrac{1}{n}\, G\Sigma G'\right).
\]

Proof
The proof is based upon a first-order Taylor series expansion of the function $g(x)$ around the point $\mu$. This yields (see Mittelhammer, 1996, Lemma 5.6)
\[
g(X_n) = g(\mu) + G(X_n - \mu) + \underbrace{[(X_n - \mu)'(X_n - \mu)]^{1/2}\, R(X_n)}_{\text{remainder term}},
\qquad \text{where} \quad \lim_{X_n \to \mu} R(X_n) = R(\mu) = 0.
\]
Multiplying by $\sqrt{n}$ and rearranging terms leads to
\[
\sqrt{n}\,[g(X_n) - g(\mu)] = G \underbrace{\sqrt{n}(X_n - \mu)}_{\xrightarrow{d}\, Z \sim N(0,\Sigma)} + \left\{[\sqrt{n}(X_n - \mu)]'[\sqrt{n}(X_n - \mu)]\right\}^{1/2} R(X_n).
\]
To see that the second term converges in probability to 0, first note that
\[
\left\{[\sqrt{n}(X_n - \mu)]'[\sqrt{n}(X_n - \mu)]\right\}^{1/2} \xrightarrow{d} \left\{[N(0,\Sigma)]'[N(0,\Sigma)]\right\}^{1/2}
\]
(note that this follows from Theorem 5.2). Then, the application of Slutsky's theorem to the second term yields
\[
\underbrace{\left\{[\sqrt{n}(X_n - \mu)]'[\sqrt{n}(X_n - \mu)]\right\}^{1/2}}_{\xrightarrow{d}\, \{[N(0,\Sigma)]'[N(0,\Sigma)]\}^{1/2}}\;\;\underbrace{R(X_n)}_{\xrightarrow{p}\, 0}\;\;\xrightarrow{p}\;0.
\]
By application of Slutsky's theorem to
\[
\sqrt{n}\,[g(X_n) - g(\mu)] = G \underbrace{\sqrt{n}(X_n - \mu)}_{\xrightarrow{d}\, Z \sim N(0,\Sigma)} + \left\{[\sqrt{n}(X_n - \mu)]'[\sqrt{n}(X_n - \mu)]\right\}^{1/2} R(X_n),
\]
it follows that
\[
\sqrt{n}\,(g(X_n) - g(\mu)) \xrightarrow{d} G \cdot N(0, \Sigma) \sim N(0, G\Sigma G'). \qquad \blacksquare
\]




Remark 5.20 Should the Jacobian be a singular matrix at $x = \mu$, the result can still be applied, but the resulting limiting distribution exhibits degeneracy (its elements are linearly dependent). To deal with this situation, a higher-order expansion may be employed for the null space of $G$. The details are not straightforward for the general case, but the idea can easily be illustrated in the univariate case. Instead of expanding $g$ linearly about $\mu$, one simply uses a more precise Taylor expansion, say $g(x) = g(\mu) + \tfrac{1}{2} g''(\mu)(x - \mu)^2 + R_x$, whenever the first-order derivative $g'$ is zero at $\mu$ but the second is continuous and nonzero. The result will be no limiting normal distribution, though. Should $G$, on the other hand, be undefined at $\mu$, e.g. because the derivatives have a pole, like $g(x) = \sqrt[3]{x}$ at 0, one must analyze the result on a case-by-case basis, since the limiting behavior of the transformation depends on the nature and shape of the pole.
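As a numerical sketch of the Delta method (illustrative and not from the notes): take $X_i$ iid Uniform(0,1), so $\mu = 0.5$ and $\sigma^2 = 1/12$, and $g(x) = x^2$, so $G = g'(\mu) = 1$ and the limiting variance is $G^2\sigma^2 = 1/12$.

```python
import math
import random

random.seed(0)

# Assumed setup: X_i ~ iid Uniform(0,1), g(x) = x**2.
# mu = 0.5, sigma2 = 1/12, gradient G = g'(mu) = 2*mu = 1,
# so sqrt(n)*(g(Xbar_n) - g(mu)) is approximately N(0, G^2 * sigma2).
mu, sigma2 = 0.5, 1 / 12
G = 2 * mu

def delta_stat(n):
    """sqrt(n) * (g(Xbar_n) - g(mu)) for one sample of size n."""
    xbar = sum(random.random() for _ in range(n)) / n
    return math.sqrt(n) * (xbar ** 2 - mu ** 2)

reps, n = 3000, 300
vals = [delta_stat(n) for _ in range(reps)]
m = sum(vals) / reps
v = sum((x - m) ** 2 for x in vals) / reps
print(m, v, G ** 2 * sigma2)  # Monte Carlo variance should be near G^2 * sigma2
```

The simulated variance of $\sqrt{n}(g(\bar{X}_n) - g(\mu))$ should be close to $1/12$, in line with the theorem.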


6. Samples and Statistics


In this chapter we begin to study problems and methods related to statistical inference. In
the preceding chapters we discussed fundamental ideas of probability theory and the theory of
distributions. There a typical question was:
Given the probability space, what can we say about the characteristics and properties
of outcomes of a random experiment?
Statistical inference turns this question around:
Given the observed characteristics and properties of outcomes of an experiment,
what can we conclude (infer) about the probability space?
A typical problem of statistical inference is as follows:
Suppose we seek information about some characteristics of a collection of elements, called
population.
For reasons of time or cost we may not wish (or even be able) to study each element of
the population. Our object is rather to draw conclusions about the unknown population
characteristics on the basis of information on some characteristics of a suitably selected
sample.
Formally, let X be a random variable that represents the population under investigation, and let f(x; θ) denote the parametric family of pdfs of X. The set of possible parameter values θ is denoted by Ω.
Then the job of the statistician is to decide, on the basis of a sample randomly drawn from the population, which member of the family of pdfs {f(x; θ), θ ∈ Ω} can represent the pdf of X.
We now discuss the notions of random samples and sample statistics in more detail.


6.1. Random (iid) Sampling


Often, the collected data of an experiment consist of several observed values of a variable of
interest. If the process of data collection is random, it is referred to as random sampling. Here,
we will consider two general random sampling methods of data collection:
1. random sampling from a population distribution (which includes random sampling with
replacement);
2. random sampling without replacement.

Definition 6.1 (Random Sampling From a Population Distribution) Let X be a random variable with pdf f (x). The set of random variables X1 , ..., Xn is called a random sample
of size n from the population distribution with pdf f (x), if
X1 , ..., Xn are iid random variables with pdf f (x).
The set of observed values x1, ..., xn is called a realization of the sample.

Remark 6.1 According to the definition, we consider a situation where the variable of interest
X has a pdf f (x) and where we have repeated observations on this variable. The first observation
x1 is a realized value of X1 , the second x2 a realized value of X2 , and so on. Each xi represents
an observation of the same variable, and each Xi has the same marginal pdf given by f (x).
Furthermore, the observations are taken in such a way that the Xi s are independent. Hence,
the joint pdf of a random sample from a population distribution X1 , ..., Xn is given by
\[
f(x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i).
\]

Remark 6.2 When a population with a finite number of elements is sampled, random
sampling from a population distribution is alternatively referred to as random sampling with
replacement. (This is not entirely random anymore.)

Example 6.1 Consider an urn containing N balls, J red and N J black balls. Let the random
variable X represent the color of a ball, with x = 1 for a red ball and x = 0 for a black ball. If
we sample n balls with replacement, the population distribution is a Bernoulli distribution,
\[
f(x; p) = p^{x}(1 - p)^{1 - x}, \qquad \text{with } p = \frac{J}{N}.
\]


The joint pdf of a random sample X1 , ..., Xn obtained by drawing n times from the urn with
replacement is
\[
f(x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i) = p^{\sum_{i=1}^{n} x_i}\, (1 - p)^{\,n - \sum_{i=1}^{n} x_i}.
\]

Note that if we assume that the population ratio p = J/N is unknown, then estimation of p
would be an object of statistical inference. Note also that the value of the population ratio p has
a direct influence on the probability of drawing a red (black) ball. Thus, we have a probabilistic
link between the population and the sample characteristics.
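A minimal sketch of this joint pmf (the urn values N = 5, J = 2 are illustrative assumptions, not from the notes); it also checks that the product form sums to one over all 2^n outcome sequences:

```python
from itertools import product

# Illustrative urn (assumption): N = 5 balls, J = 2 red, so p = J/N = 0.4.
p = 2 / 5

def joint_pmf(xs):
    """Joint pmf of iid Bernoulli draws: p**sum(x) * (1-p)**(n - sum(x))."""
    s = sum(xs)
    return p ** s * (1 - p) ** (len(xs) - s)

n = 3
# Summing over all binary sequences of length n must give 1.
total = sum(joint_pmf(xs) for xs in product([0, 1], repeat=n))
print(total, joint_pmf((1, 0, 1)))
```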
Random Sampling Without Replacement is relevant for populations with a finite number of elements. It means that once the characteristic of an element has been observed, the element is removed from the population before another element is drawn. Removing an element from the population changes the composition of the population and hence the population distribution f for the next draw. This implies that the corresponding sample variables $X_1, \ldots, X_n$ are neither identically distributed nor mutually independent. Hence, the joint pdf of the sample variables cannot be written as $f(x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i)$.
Example 6.2 Consider an urn containing N balls, J red and N − J black balls. Let the random variable X represent the color of a ball, with x = 1 for a red ball and x = 0 for a black ball. Suppose a random sample $X_1, \ldots, X_n$ without replacement of size $n \le N$ is drawn from the urn. Then, the pdf for the first sample variable $X_1$ is
\[
f(x_1) = \left(\frac{J}{N}\right)^{x_1} \left(\frac{N - J}{N}\right)^{1 - x_1}.
\]
The pdf of $X_2$ given the realization of the first variable $x_1$ is
\[
f(x_2 \mid x_1) = \left(\frac{J - x_1}{N - 1}\right)^{x_2} \left(\frac{N - J - (1 - x_1)}{N - 1}\right)^{1 - x_2},
\]
since the first draw reduces the population size by one, and for $x_1 = 1$ it reduces the number of red balls. Hence, the second sample variable $X_2$ has a pdf which is different from that of the first one. Furthermore, the distribution of $X_2$ depends on the realization of the first variable. In general, the pdf of the $\ell$th variable $X_\ell$ given the realization of the first $\ell - 1$ variables is
\[
f(x_\ell \mid x_{\ell-1}, \ldots, x_1) = \left(\frac{J - \sum_{i=1}^{\ell-1} x_i}{N - (\ell - 1)}\right)^{x_\ell} \left(\frac{N - J - \left(\ell - 1 - \sum_{i=1}^{\ell-1} x_i\right)}{N - (\ell - 1)}\right)^{1 - x_\ell}.
\]
Hence, the joint pdf of the random sample is obtained as
\[
f(x_1, \ldots, x_n) = f(x_1)\, f(x_2 \mid x_1) \cdots f(x_n \mid x_{n-1}, \ldots, x_1) \neq \prod_{i=1}^{n} f(x_i).
\]
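The sequential factorization of Example 6.2 can be checked numerically. In this sketch (the urn values N = 6, J = 2 and the sample size n = 3 are illustrative assumptions), the product f(x₁)f(x₂|x₁)··· sums to one over all sequences, and the implied distribution of the number of red balls matches the hypergeometric pmf:

```python
from itertools import product
from math import comb

# Illustrative urn (assumption): N = 6 balls, J = 2 red, sample of size n = 3.
N, J, n = 6, 2, 3

def seq_pmf(xs):
    """f(x1) * f(x2 | x1) * ... for sampling without replacement, as in Example 6.2."""
    prob, reds = 1.0, 0
    for k, x in enumerate(xs):
        p_red = (J - reds) / (N - k)  # prob. of red at draw k+1 given history
        prob *= p_red if x == 1 else 1 - p_red
        reds += x
    return prob

total = sum(seq_pmf(xs) for xs in product([0, 1], repeat=n))
p_one_red = sum(seq_pmf(xs) for xs in product([0, 1], repeat=n) if sum(xs) == 1)
hyper = comb(J, 1) * comb(N - J, n - 1) / comb(N, n)  # hypergeometric P(#red = 1)
print(total, p_one_red, hyper)
```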



In the remainder of our course, we will consider primarily random sampling (iid) from a population.
In statistical inference, we use functions of the random sample X1 , ..., Xn to map/transform
sample information into inferences regarding the population characteristics of interest. The
functions used for this mapping are called statistics, defined as follows.
Definition 6.2 (Statistic) Let X1 , ..., Xn be a random sample from a population and let T (x1 , ..., xn )
be a real-valued function which does not depend on unobservable quantities. Then the random
variable
Y = T (X1 , ..., Xn ) is called a (sample) statistic.
Remark 6.3 The definition requires that the function T does not depend on unobservable quantities, like unknown population parameters. This implies that a statistic is a random variable
whose outcomes can be observed. Two of the most commonly used statistics are the sample
mean and the sample variance, given by
\[
\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \qquad \text{and} \qquad S_n^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2.
\]

Remark 6.4 Note that sample statistics are random variables, while population characteristics (like the population mean $\mu$ or the population variance $\sigma^2$) are fixed constants. Furthermore note that the distribution of sample statistics $T(X_1, \ldots, X_n)$ depends on the joint distribution of the random sample $X_1, \ldots, X_n$ and therefore on unknown parameters. This link allows for estimation; see Advanced Statistics II.
In the following sections we will examine a number of statistics (including the sample mean
and variance) that will be useful in statistical inference.

6.2. Empirical Distribution Function


The empirical distribution function (edf) provides information about the functional form
of the cdf of the underlying population from which a sample is drawn.
Definition 6.3 (Empirical Distribution Function) Let $X_1, \ldots, X_n$ denote a random sample from a population distribution. Then the edf is the following function:
\[
F_n(t) = \frac{1}{n} \sum_{i=1}^{n} I_{(-\infty, t]}(X_i), \qquad t \in (-\infty, \infty).
\]



The realization of the random variable $F_n(t)$ is denoted by $\hat{F}_n(t)$.

Remark 6.5 The edf $F_n$ at point $t$ represents the fraction of sample variables that have values $\le t$. The edf $F_n(t)$ is the empirical counterpart of the population cdf $F(t)$. Note that the cdf of a population is typically unknown, since it depends on unknown parameters and/or an unknown law of distribution.

In order to assess the information content of the edf w.r.t. the underlying population distribution, it is useful to examine the properties of the random variable Fn (t). We begin with the
pdf of the edf.

Theorem 6.1 Let $F_n(t)$ be the edf corresponding to a random sample of size $n$ from a population with cdf $F(t)$. Then the pdf of $F_n(t)$ is
\[
P\!\left(F_n(t) = \frac{j}{n}\right) = \binom{n}{j}\, [F(t)]^{j}\, [1 - F(t)]^{n - j} \qquad \text{for} \quad j \in \{0, 1, 2, \ldots, n\},
\]
and 0 otherwise.


Proof
From the definition of $F_n(t)$, it follows that
\[
P\!\left(F_n(t) = \frac{j}{n}\right) = P(n F_n(t) = j) = P\!\left(\sum_{i=1}^{n} \underbrace{I_{(-\infty,t]}(X_i)}_{=\,Y_i} = j\right).
\]
Note that $Y_i = I_{(-\infty,t]}(X_i) \sim$ Bernoulli, with $P(Y_i = 1) = P(X_i \le t) = F(t)$. Since the $X_i$'s are iid, it follows that the $Y_i$'s are also iid, such that
\[
\sum_{i=1}^{n} Y_i = \sum_{i=1}^{n} I_{(-\infty,t]}(X_i) \sim \operatorname{Bin}[n, F(t)].
\]
Thus we obtain
\[
P\!\left(F_n(t) = \frac{j}{n}\right) = P\!\left(\sum_{i=1}^{n} I_{(-\infty,t]}(X_i) = j\right) = \underbrace{\binom{n}{j}\, [F(t)]^{j}\, [1 - F(t)]^{n-j}}_{\text{binomial pdf evaluated at } j}. \qquad \blacksquare
\]

Remark 6.6 Based upon the pdf of the edf, it is rather straightforward to derive the mean, the variance and the asymptotic behavior of $F_n(t)$. The mean of the edf at point $t$ is
\[
E(F_n(t)) = E\!\left[\frac{1}{n} \sum_{i=1}^{n} I_{(-\infty,t]}(X_i)\right] = E\!\left(\frac{1}{n} \underbrace{\sum_{i=1}^{n} Y_i}_{\sim\, \text{Binomial with mean } nF(t)}\right) = \frac{1}{n}\,[n F(t)] = F(t).
\]
Hence the distribution of $F_n(t)$ is centered on the value of the population cdf $F(t)$. (This means that $F_n(t)$ provides an unbiased estimate for the value of $F(t)$; see Advanced Statistics II.) The variance of the edf at point $t$ is
\[
\operatorname{Var}[F_n(t)] = \operatorname{Var}\!\left(\frac{1}{n} \underbrace{\sum_{i=1}^{n} Y_i}_{\sim\, \text{Binomial with variance } nF(t)[1-F(t)]}\right) = \frac{1}{n^2}\, n F(t)[1 - F(t)] = \frac{1}{n}\, F(t)[1 - F(t)].
\]
Note that the variance of the edf, and hence the spread of its distribution, decreases as the sample size $n$ increases.
As to the probability limit of the sequence of edfs $\{F_n(t)\}$ at point $t$: since $E(F_n(t)) = F(t)\ \forall n$ and $\lim_{n\to\infty} \operatorname{Var}[F_n(t)] = 0$, it follows that
\[
F_n(t) \xrightarrow{m} F(t) \quad \Rightarrow \quad \operatorname{plim} F_n(t) = F(t).
\]
This implies that the probability that realizations of $F_n(t)$ differ from $F(t)$ converges to 0 as $n \to \infty$. (This means that $F_n(t)$ provides a consistent estimate for the value of $F(t)$; see Advanced Statistics II again.) Since $F_n(t) = \frac{1}{n}\sum_i Y_i$ is the average of iid Bernoulli variables with mean $E(Y_i) = \mu = F(t)$ and variance $\operatorname{Var}(Y_i) = \sigma^2 = F(t)[1 - F(t)]$, we can use the CLT of Lindeberg-Lévy to obtain
\[
\frac{\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n} Y_i - \mu\right)}{\sigma} = \frac{\sqrt{n}\,(F_n(t) - F(t))}{\sqrt{F(t)[1 - F(t)]}} \xrightarrow{d} N(0, 1).
\]
Hence, the asymptotic distribution of the edf at point $t$ is
\[
F_n(t) \overset{a}{\sim} N\!\left(F(t),\; \frac{1}{n}\, F(t)[1 - F(t)]\right).
\]

All in all, these properties possessed by the edf Fn (t) will make it a good statistic to use in
providing information about the population cdf F (t).
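A small simulation sketch (illustrative, not from the notes; a Uniform(0,1) population is assumed, so F(t) = t on [0,1]) of the edf and of the sup-distance D_n discussed next, which can be evaluated exactly at the order statistics:

```python
import random

random.seed(1)

def edf(sample, t):
    """F_n(t) = fraction of sample values <= t."""
    return sum(1 for x in sample if x <= t) / len(sample)

def sup_distance(sample):
    """D_n = sup_t |F_n(t) - F(t)| for a Uniform(0,1) population (F(t) = t);
    the supremum is attained at the order statistics."""
    xs = sorted(sample)
    n = len(xs)
    return max(max(abs((i + 1) / n - x), abs(i / n - x)) for i, x in enumerate(xs))

d_small = sup_distance([random.random() for _ in range(50)])
d_large = sup_distance([random.random() for _ in range(50000)])
print(d_small, d_large)  # D_n shrinks as n grows
```

The shrinking sup-distance illustrates that the edf approximates the population cdf ever more closely as the sample size grows.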
We have shown that the edf Fn (t) converges in probability to the cdf F (t) for each value of t.
The Glivenko-Cantelli Theorem strengthens this result showing that it is possible to make
a probability statement simultaneously for all t values.
Theorem 6.2 (Glivenko-Cantelli Theorem) Let $D_n = \sup_{-\infty < t < \infty} |F_n(t) - F(t)|$. Then
\[
D_n \xrightarrow{p} 0.
\]
Proof
For a proof see Fisz, M. (1976, p. 456-459), Wahrscheinlichkeitsrechnung und mathematische Statistik,
Berlin, VEB Verlag.



Remark 6.7 The theorem implies that the sequence of functions $\{F_n(t)\}$ converges in probability uniformly as $n \to \infty$ to the function $F(t)$.¹ Thus for large enough $n$, the edf provides a good approximation of the cdf over its entire domain (not only at individual points).

6.3. Sample Moments


Based on random samples, statistics called sample moments can be defined that are sample
counterparts to the moments of the population distribution (population moments) defined
in Chapter 3. The sample moments have properties that make them useful for estimating the
values of the corresponding population moments. Therefore, they play a central role for the
estimation of parameters. In the following discussion of sample moments we will assume that
the sample variables are from random samples from the population distribution.
Definition 6.4 (Sample Moments) Let $X_1, \ldots, X_n$ denote a random sample. Then the $r$th order non-central sample moment (or moment about the origin) is
\[
M_r' = \frac{1}{n} \sum_{i=1}^{n} X_i^r.
\]
The $r$th order central sample moment (or moment about the mean) is
\[
M_r = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_n)^r,
\]
where $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. The realizations of the random variables $M_r'$ and $M_r$ are denoted by $m_r'$ and $m_r$, respectively.

¹The result is actually stronger: $D_n$ vanishes almost surely (with probability 1).



We now discuss some important stochastic properties of the non-central sample moments.
Those properties will suggest that sample moments will be useful for estimating the values of
the corresponding population moments.
Let $M_r' = \frac{1}{n}\sum_{i=1}^{n} X_i^r$ be the $r$th order non-central sample moment for a random sample $(X_1, \ldots, X_n)$ from a population distribution. If we assume that the appropriate population moments (denoted by $\mu_s'$) exist, we get the following results:

For the mean of $M_r'$ we obtain
\[
E(M_r') = \frac{1}{n} \sum_{i=1}^{n} E(X_i^r) = E(X_i^r) = \mu_r'.
\]
Hence, the mean of the sample moment is equal to the value of the corresponding population moment. (Thus $M_r'$ provides unbiased estimates for the value of $\mu_r'$.)
For the variance of $M_r'$ we obtain
\[
\operatorname{Var}(M_r') = \frac{1}{n^2} \sum_{i=1}^{n} \operatorname{Var}(X_i^r) = \frac{1}{n} \operatorname{Var}(X_i^r) = \frac{1}{n}\left[\mu_{2r}' - (\mu_r')^2\right].
\]
This implies that the variance goes to zero as $n \to \infty$.


The probability limit of $M_r'$ is obtained as follows. Since $E(M_r') = \mu_r'\ \forall n$ and $\lim_{n\to\infty} \operatorname{Var}(M_r') = 0$, we have
\[
M_r' \xrightarrow{m} \mu_r' \quad \Rightarrow \quad \operatorname{plim} M_r' = \mu_r'.
\]
This implies that the probability that realizations of $M_r'$ differ from $\mu_r'$ converges to 0 as $n \to \infty$. (Thus $M_r'$ provides consistent estimates for the value of $\mu_r'$.)
Since $M_r' = \frac{1}{n}\sum_i X_i^r$ is the average of iid variables with mean $E(X_i^r) = \mu_r'$ and variance $\operatorname{Var}(X_i^r) = \mu_{2r}' - (\mu_r')^2$, we can use the CLT of Lindeberg-Lévy to obtain
\[
\frac{\sqrt{n}\left(\frac{1}{n}\sum_i X_i^r - \mu_r'\right)}{\sqrt{\mu_{2r}' - (\mu_r')^2}} \xrightarrow{d} N(0, 1).
\]
Hence, the asymptotic distribution of $M_r'$ is
\[
M_r' = \frac{1}{n}\sum_i X_i^r \;\overset{a}{\sim}\; N\!\left(\mu_r',\; \frac{1}{n}\left[\mu_{2r}' - (\mu_r')^2\right]\right).
\]
(This asymptotic distribution is useful for testing hypotheses about the value of $\mu_r'$; see Advanced Statistics II.)
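A quick simulation sketch (illustrative; a Uniform(0,1) population is assumed, whose non-central moments are $\mu_r' = 1/(r+1)$) showing that non-central sample moments concentrate around their population counterparts:

```python
import random

random.seed(7)

# Assumed population: Uniform(0,1), so mu'_r = 1/(r+1).
def sample_moment(sample, r):
    """M'_r = (1/n) * sum X_i**r."""
    return sum(x ** r for x in sample) / len(sample)

n = 100000
sample = [random.random() for _ in range(n)]
m1 = sample_moment(sample, 1)
m2 = sample_moment(sample, 2)
print(m1, m2)  # close to 1/2 and 1/3 for large n
```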
Two of the most commonly used statistics are the first-order non-central sample moment, also referred to as the sample mean, and the second-order central sample moment, also referred to as the sample variance.


Definition 6.5 (Sample Mean) Let $X_1, \ldots, X_n$ denote a random sample. The sample mean is
\[
\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i = M_1'.
\]
Remark 6.8 From the discussion of the properties of sample moments, we know that
\[
E(\bar{X}_n) = \mu, \qquad \operatorname{Var}(\bar{X}_n) = \frac{1}{n}(\mu_2' - \mu^2) = \frac{\sigma^2}{n}, \qquad \operatorname{plim} \bar{X}_n = \mu, \qquad \bar{X}_n \overset{a}{\sim} N\!\left(\mu, \tfrac{1}{n}\sigma^2\right).
\]

Definition 6.6 (Sample Variance) Let $X_1, \ldots, X_n$ denote a random sample. The sample variance is
\[
S_n^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2 = M_2.
\]
Important stochastic properties of the sample variance are summarized in the following theorem.
Theorem 6.3 Let $S_n^2$ be the sample variance of a random sample $X_1, \ldots, X_n$ from a population distribution. Then, assuming that the appropriate population moments exist,
1. $E(S_n^2) = \frac{n-1}{n}\,\sigma^2$,
2. $\operatorname{Var}(S_n^2) = \frac{1}{n}\left[\left(\frac{n-1}{n}\right)^2 \mu_4 - \frac{(n-1)(n-3)}{n^2}\,\sigma^4\right]$,
3. $\operatorname{plim} S_n^2 = \sigma^2$,
4. $\sqrt{n}\,(S_n^2 - \sigma^2) \xrightarrow{d} N(0, \mu_4 - \sigma^4)$,
5. $S_n^2 \overset{a}{\sim} N\!\left(\sigma^2, \tfrac{1}{n}(\mu_4 - \sigma^4)\right)$.
Proof
1. The mean of the sample variance is obtained as follows:
\[
\begin{aligned}
E(S_n^2) &= E\!\left[\frac{1}{n}\sum_{i=1}^{n} (X_i - \mu + \mu - \bar{X}_n)^2\right] \\
&= E\!\left[\frac{1}{n}\sum_{i=1}^{n} (X_i - \mu)^2\right] + E\!\left[\frac{1}{n}\sum_{i=1}^{n} (\mu - \bar{X}_n)^2\right] - 2\,E\!\left[(\bar{X}_n - \mu)\,\underbrace{\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)}_{=\,\bar{X}_n - \mu}\right] \\
&= \sigma^2 + E(\bar{X}_n - \mu)^2 - 2\,E(\bar{X}_n - \mu)^2 \\
&= \sigma^2 - E(\bar{X}_n - \mu)^2 = \sigma^2 - \operatorname{Var}(\bar{X}_n) = \sigma^2 - \frac{1}{n}\,\sigma^2 = \left(1 - \frac{1}{n}\right)\sigma^2.
\end{aligned}
\]
Note that in contrast to the expectation of the sample mean, the expectation of the sample variance differs from the corresponding population moment. (Hence $S_n^2$ provides a biased estimate for $\sigma^2$, but one that is easy to bias-correct.)
2. The proof for the variance of the sample variance follows from rewriting
\[
\operatorname{Var}(S_n^2) = E\!\left[S_n^2 - \sigma^2\left(1 - \tfrac{1}{n}\right)\right]^2
\]
in terms of corresponding sums of the $X_i$'s and taking expectations. The corresponding algebra is conceptually straightforward, but tedious (for details, see Rohatgi and Saleh, 2001, pp. 315-317). Note that
\[
\lim_{n\to\infty} \operatorname{Var}(S_n^2) = \lim_{n\to\infty} \underbrace{\frac{1}{n}}_{\to\, 0} \Big[\underbrace{\big(\tfrac{n-1}{n}\big)^2}_{\to\, 1} \mu_4 - \underbrace{\tfrac{(n-1)(n-3)}{n^2}}_{\to\, 1}\,\sigma^4\Big] = 0.
\]
3. Since $E(S_n^2) = \frac{n-1}{n}\,\sigma^2 \to \sigma^2$ and $\operatorname{Var}(S_n^2) \to 0$ as $n \to \infty$, we have
\[
S_n^2 \xrightarrow{m} \sigma^2 \quad \Rightarrow \quad \operatorname{plim} S_n^2 = \sigma^2.
\]
(Hence $S_n^2$ provides consistent estimates for $\sigma^2$.)


4. and 5. For the proof of the asymptotic distribution of $S_n^2$, first note that
\[
n S_n^2 = \sum_{i=1}^{n} (X_i - \mu + \mu - \bar{X}_n)^2 = \sum_{i=1}^{n} (X_i - \mu)^2 + 2(\mu - \bar{X}_n)\sum_{i=1}^{n}(X_i - \mu) + n(\mu - \bar{X}_n)^2.
\]
Subtracting $n\sigma^2$ and dividing by $\sqrt{n}$ yields
\[
\sqrt{n}\,(S_n^2 - \sigma^2) = \underbrace{\frac{1}{\sqrt{n}}\Big[\sum_{i=1}^{n}(X_i - \mu)^2 - n\sigma^2\Big]}_{Z_n \,\xrightarrow{d}\, N(0,\, \mu_4 - \sigma^4)} + \underbrace{2(\mu - \bar{X}_n)\,\sqrt{n}\,\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)}_{V_n \,\xrightarrow{p}\, 0} + \underbrace{\sqrt{n}\,(\bar{X}_n - \mu)^2}_{W_n \,\xrightarrow{p}\, 0}.
\]
Regarding the limiting behavior of the second term $V_n$, note that $\sqrt{n}\,\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu) \xrightarrow{d} N(0, \sigma^2)$ by the CLT of Lindeberg-Lévy, and $\operatorname{plim}(\mu - \bar{X}_n) = \mu - \operatorname{plim} \bar{X}_n = 0$. So by Slutsky's theorem $V_n \xrightarrow{p} 0$. The last term $W_n$ can be written as
\[
W_n = \Big[n^{1/4}\,\tfrac{1}{n}\Big(\sum_{i=1}^{n} X_i - n\mu\Big)\Big]^2 = \Big[\underbrace{\tfrac{1}{n^{3/4}}\Big(\sum_{i=1}^{n} X_i - n\mu\Big)}_{U_n}\Big]^2,
\]
where $E(U_n) = 0$ and $\operatorname{Var}(U_n) = \frac{1}{n^{3/2}}\, n\sigma^2 \to 0$, so that $U_n \xrightarrow{p} 0$ and $\operatorname{plim} W_n = [\operatorname{plim} U_n]^2 = 0$. The first term $Z_n$ can be rewritten as
\[
Z_n = \frac{1}{\sqrt{n}}\Big[\sum_{i=1}^{n}(X_i - \mu)^2 - n\sigma^2\Big] = \sqrt{n}\Big[\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2 - \sigma^2\Big],
\]
where the $(X_i - \mu)^2$ are iid variables with $E(X_i - \mu)^2 = \sigma^2$ and $\operatorname{Var}[(X_i - \mu)^2] = E(X_i - \mu)^4 - [E(X_i - \mu)^2]^2 = \mu_4 - \sigma^4$. So by Lindeberg-Lévy's CLT, $Z_n \xrightarrow{d} N(0, \mu_4 - \sigma^4)$. Collecting all terms, we have by Slutsky's theorem
\[
\sqrt{n}\,(S_n^2 - \sigma^2) = (Z_n + V_n + W_n) \xrightarrow{d} N(0, \mu_4 - \sigma^4), \qquad \text{so that} \qquad S_n^2 \overset{a}{\sim} N\!\left(\sigma^2, \tfrac{1}{n}[\mu_4 - \sigma^4]\right).
\]
(This asymptotic distribution is useful for testing hypotheses about the value of $\sigma^2$.) $\blacksquare$
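The bias result of part 1 can be checked by simulation. In this sketch (illustrative; a Uniform(0,1) population with $\sigma^2 = 1/12$ and the sample/replication sizes are assumptions), the Monte Carlo mean of $S_n^2$ lands near $((n-1)/n)\sigma^2$ rather than $\sigma^2$:

```python
import random

random.seed(3)

def sample_variance(xs):
    """S_n^2 = (1/n) * sum (x_i - xbar)**2, the biased version used in Theorem 6.3."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / n

# Assumed population: Uniform(0,1) with sigma2 = 1/12.
n, reps = 10, 50000
mean_s2 = sum(sample_variance([random.random() for _ in range(n)])
              for _ in range(reps)) / reps
target = (n - 1) / n * (1 / 12)
print(mean_s2, target)  # E(S_n^2) = ((n-1)/n) * sigma2, not sigma2 itself
```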

So far we have considered sample moments for random samples of scalar random variables. For random samples of multivariate variables, the joint sample moments between pairs of variables become relevant. One of the most commonly used joint sample moments is the sample covariance. It is the sample counterpart of the population covariance.
Definition 6.7 (Sample Covariance) Let (X1 , Y1 ), ..., (Xn , Yn ) denote a random sample. Then
the sample covariance is
\[
S_{XY} = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_n)(Y_i - \bar{Y}_n) = \frac{1}{n} \sum_{i=1}^{n} X_i Y_i - \bar{X}_n \bar{Y}_n.
\]

We now examine important properties of the sample covariance as the sample counterpart of the population covariance $\sigma_{XY}$. Let $S_{XY}$ be the sample covariance for a random sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ from a joint population distribution. If we assume that the appropriate population moments exist, we get the following results:
For the mean of $S_{XY}$ we obtain
\[
E(S_{XY}) = \frac{1}{n} \sum_{i=1}^{n} E[(X_i - \bar{X}_n)(Y_i - \bar{Y}_n)],
\]
where
\[
E[(X_i - \bar{X}_n)(Y_i - \bar{Y}_n)] = E\Big[X_i Y_i - X_i\,\tfrac{1}{n}(Y_1 + \ldots + Y_n) - Y_i\,\tfrac{1}{n}(X_1 + \ldots + X_n) + \tfrac{1}{n^2}\Big(\sum_{i=1}^{n} X_i\Big)\Big(\sum_{i=1}^{n} Y_i\Big)\Big].
\]
Since $(X_1, Y_1), \ldots, (X_n, Y_n)$ are iid ($X_i$ is independent of $X_j$ and $Y_j$ for $i \neq j$, and vice versa), we obtain
\[
\begin{aligned}
E[(X_i - \bar{X}_n)(Y_i - \bar{Y}_n)] &= \mu_{1,1}' - 2\,\tfrac{1}{n}\big[\mu_{1,1}' + (n-1)\mu_X \mu_Y\big] + \tfrac{1}{n^2}\big[n\mu_{1,1}' + (n^2 - n)\mu_X \mu_Y\big] \\
&= \mu_{1,1}'\big(1 - \tfrac{1}{n}\big) - \big(1 - \tfrac{1}{n}\big)\mu_X \mu_Y = \big(1 - \tfrac{1}{n}\big)\sigma_{XY}.
\end{aligned}
\]
Using this result, we find that
\[
E(S_{XY}) = \frac{1}{n} \sum_{i=1}^{n} E[(X_i - \bar{X}_n)(Y_i - \bar{Y}_n)] = \frac{n-1}{n}\,\sigma_{XY}.
\]
Hence, as is the case for the sample variance, the expectation of the sample covariance differs from the corresponding population moment. However, the difference goes to 0 as $n \to \infty$, so that the distribution of $S_{XY}$ becomes centered on $\sigma_{XY}$.
The variance of $S_{XY}$ has the form
\[
\operatorname{Var}(S_{XY}) = \frac{1}{n}\big[\mu_{2,2} - (\mu_{1,1})^2\big] + o\big(\tfrac{1}{n}\big),
\]
where the $\mu_{r,s}$ denote central joint moments, so that $\mu_{1,1} = \sigma_{XY}$. This result is obtained from a Taylor series expansion (see Kendall, M., Stuart, A. (1994, Chap. 13), The Advanced Theory of Statistics, Vol. 1, New York, Wiley). Note that the variance of $S_{XY}$ disappears as $n \to \infty$, so that its distribution increasingly concentrates within a small neighborhood of $\sigma_{XY}$. Since $E(S_{XY}) \to \sigma_{XY}$ and $\lim_{n\to\infty} \operatorname{Var}(S_{XY}) = 0$, we have
\[
S_{XY} \xrightarrow{m} \sigma_{XY} \quad \Rightarrow \quad \operatorname{plim} S_{XY} = \sigma_{XY}.
\]

In order to obtain the asymptotic distribution of $S_{XY}$, represent $S_{XY}$ as
\[
S_{XY} = \underbrace{\frac{1}{n}\sum_{i=1}^{n} X_i Y_i}_{=\, M_{1,1}'} - \underbrace{\frac{1}{n}\sum_{i=1}^{n} X_i}_{=\, \bar{X}_n}\; \underbrace{\frac{1}{n}\sum_{i=1}^{n} Y_i}_{=\, \bar{Y}_n} = g(M_{1,1}', \bar{X}_n, \bar{Y}_n).
\]
Note that $M_{1,1}'$, $\bar{X}_n$, and $\bar{Y}_n$ are averages of the iid variables $X_i Y_i$, $X_i$, and $Y_i$ respectively, with mean and covariance matrix
\[
E\begin{pmatrix} X_i Y_i \\ X_i \\ Y_i \end{pmatrix} = \begin{pmatrix} \mu_{1,1}' \\ \mu_X \\ \mu_Y \end{pmatrix} = \mu^*, \qquad
\operatorname{Cov}\begin{pmatrix} X_i Y_i \\ X_i \\ Y_i \end{pmatrix} = \begin{pmatrix}
\mu_{2,2}' - (\mu_{1,1}')^2 & \mu_{2,1}' - \mu_{1,1}'\mu_X & \mu_{1,2}' - \mu_{1,1}'\mu_Y \\
\cdot & \sigma_X^2 & \sigma_{XY} \\
\cdot & \cdot & \sigma_Y^2
\end{pmatrix} = \Sigma.
\]
Then by the multivariate CLT of Lindeberg-Lévy,
\[
\sqrt{n}\begin{pmatrix} M_{1,1}' - \mu_{1,1}' \\ \bar{X}_n - \mu_X \\ \bar{Y}_n - \mu_Y \end{pmatrix} \xrightarrow{d} N(0, \Sigma).
\]



Since $S_{XY} = g(M_{1,1}', \bar{X}_n, \bar{Y}_n) = M_{1,1}' - \bar{X}_n \bar{Y}_n$ is a differentiable function of asymptotically normally distributed variables, $S_{XY}$ itself is, by Theorem 5.18, asymptotically normally distributed:
\[
S_{XY} = g(M_{1,1}', \bar{X}_n, \bar{Y}_n) \;\overset{a}{\sim}\; N\Big(\underbrace{\mu_{1,1}' - \mu_X \mu_Y}_{g(\mu^*) \,=\, \sigma_{XY}},\; \tfrac{1}{n}\, G\Sigma G'\Big),
\]
where
\[
G = (1, -\mu_Y, -\mu_X) \quad \text{(gradient vector of } g \text{ evaluated at } \mu^*\text{)}, \qquad
\tfrac{1}{n}\, G\Sigma G' = \tfrac{1}{n}\big[\mu_{2,2} - (\mu_{1,1})^2\big].
\]
Thus the asymptotic distribution of the sample covariance is
\[
S_{XY} \;\overset{a}{\sim}\; N\Big(\sigma_{XY},\; \tfrac{1}{n}\big[\mu_{2,2} - (\mu_{1,1})^2\big]\Big).
\]


As we similarly argued in the case of the sample mean and variance, the properties of the sample covariance $S_{XY}$ derived above are useful for estimating its population counterpart, the covariance $\sigma_{XY}$.
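A simulation sketch of the consistency of the sample covariance (illustrative, not from the notes; the joint population $(X, Y) = (U, U + V)$ with $U, V$ iid Uniform(0,1) is an assumption, and gives $\sigma_{XY} = \operatorname{Var}(U) = 1/12$):

```python
import random

random.seed(11)

def sample_cov(pairs):
    """S_XY = (1/n) * sum x_i*y_i - xbar*ybar."""
    n = len(pairs)
    xbar = sum(x for x, _ in pairs) / n
    ybar = sum(y for _, y in pairs) / n
    return sum(x * y for x, y in pairs) / n - xbar * ybar

# Assumed joint population: (X, Y) = (U, U + V), U, V iid Uniform(0,1),
# so sigma_XY = Var(U) = 1/12.
n = 200000
pairs = []
for _ in range(n):
    u, v = random.random(), random.random()
    pairs.append((u, u + v))
s_xy = sample_cov(pairs)
print(s_xy)  # close to 1/12 for large n
```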

6.4. Sample Mean and Variance from Normal Random Samples

In the previous section we investigated the properties of the sample mean $\bar{X}_n$ and sample variance $S_n^2$ for a random sample, without assuming a specific population distribution for the random sample. This section deals with additional properties of $\bar{X}_n$ and $S_n^2$ that arise when random sampling is from a Normal distribution, still one of the most widely used statistical models. In particular, we will show that $\bar{X}_n$ and $S_n^2$ are then independent random variables, $\bar{X}_n$ is then normally distributed, and $S_n^2$ is then Gamma distributed.


In order to show the independence of $\bar{X}_n$ and $S_n^2$ when the random sample is from a Normal distribution, the following result is useful.

Theorem 6.4 Let
B : real $(q \times n)$ matrix,
A : real symmetric $(n \times n)$ matrix with rank $p$,
X : $(n \times 1)$ random vector with a $N(\mu, \sigma^2 I)$-distribution.
Then the linear form $BX$ and the quadratic form $X'AX$ are independent if $BA = 0$.



Proof
Let the $(n \times n)$ diagonal matrix of the eigenvalues of $A$ be denoted by²
\[
\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_n), \qquad \text{where } \lambda_i \text{ is the } i\text{th eigenvalue of } A,
\]
and the $(n \times n)$ matrix of the corresponding eigenvectors by
\[
P = (P_1, \ldots, P_n), \qquad \text{where } P_i \text{ is the } i\text{th eigenvector of } A,
\]
with $P'P = I$, so that $P' = P^{-1}$ and $PP' = PP^{-1} = I$. Then the so-called spectral decomposition of $A$ is
\[
A = P\Lambda P', \qquad \text{so that} \qquad P'AP = \Lambda = \begin{pmatrix} D & 0 \\ 0 & 0 \end{pmatrix} \text{ (say)},
\]
where $D = \operatorname{diag}(\lambda_1, \ldots, \lambda_p)$ is the $(p \times p)$ diagonal matrix of the nonzero eigenvalues, since $\operatorname{rk}(A) = p = (\#\text{ of eigenvalues} \neq 0)$.
Now let $BA = 0$, so that
\[
BAP = 0 \qquad \text{and} \qquad B\underbrace{PP'}_{I}AP = 0,
\]
and let $C = BP$, a $(q \times n)$ matrix, so that
\[
\underbrace{BP}_{C}\,\underbrace{P'AP}_{\Lambda} = C\Lambda = 0.
\]
Partitioning $C$ conformably with $\Lambda$, we get
\[
C = \Big(\underbrace{C_1}_{(q \times p)}\;\; \underbrace{C_2}_{(q \times (n-p))}\Big), \qquad
C\Lambda = (C_1 \;\; C_2)\begin{pmatrix} D & 0 \\ 0 & 0 \end{pmatrix} = (C_1 D \;\; 0) = 0.
\]
This implies that $C_1 D = 0$ and, since $D \neq 0$, that $C_1 = 0$. Thus the matrix $C = BP$ must have the form
\[
C = (0 \;\; C_2).
\]
Now use the eigenvector matrix $P$ and $X \sim N(\mu, \sigma^2 I)$ to define
\[
Z = P'X \quad (n \times 1), \qquad \text{where} \qquad Z \sim N(P'\mu, \sigma^2 P'P) = N(P'\mu, \sigma^2 I).
\]
Note that the elements of $Z$, say $(Z_1, \ldots, Z_n)$, are independent variables since they are uncorrelated and normally distributed. Collecting all terms, we obtain for the quadratic form of $X$
\[
X'AX = X'\underbrace{P\Lambda P'}_{A}X = Z'\Lambda Z = \sum_{i=1}^{p} \lambda_i Z_i^2.
\]
For the linear form of $X$ we obtain
\[
BX = \underbrace{BP}_{C}\,\underbrace{P'X}_{Z} = CZ = (0 \;\; C_2)\begin{pmatrix} Z_1 \\ \vdots \\ Z_n \end{pmatrix} = C_2 \begin{pmatrix} Z_{p+1} \\ \vdots \\ Z_n \end{pmatrix}.
\]
Thus
\[
X'AX = g_1(Z_1, \ldots, Z_p) \qquad \text{and} \qquad BX = g_2(Z_{p+1}, \ldots, Z_n),
\]
and because $(Z_1, \ldots, Z_p)$ and $(Z_{p+1}, \ldots, Z_n)$ are independent, $X'AX$ and $BX$ are independent. $\blacksquare$

²The eigenvalues and eigenvectors of $A$ obtain from the $n$ nontrivial solutions of $(A - \lambda_i I)P_i = 0$ under the normalizing restrictions $P_i'P_i = 1$ and $P_i'P_j = 0\ \forall\, i \neq j$.

We now use the preceding theorem to establish the independence of $\bar{X}_n$ and $S_n^2$ and the distribution of $nS_n^2/\sigma^2$ when the random sample is from a Normal distribution.

Theorem 6.5 Let $\bar{X}_n$ and $S_n^2$ be the sample mean and sample variance of a size-$n$ random sample from a $N(\mu, \sigma^2)$-distribution. Then
1. $\bar{X}_n \sim N(\mu, \tfrac{1}{n}\sigma^2)$,
2. $\bar{X}_n$ and $S_n^2$ are independent,
3. $nS_n^2/\sigma^2 \sim \chi_{n-1}^2$.
Proof
1. The normality of $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ follows immediately from the fact that $\bar{X}_n$ is a linear combination of iid Gaussian random variables. (Hence $\bar{X}_n$ is not only asymptotically normally distributed but also exactly normally distributed when we sample from a Normal distribution.)

2. Write the sample mean as
\[
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i = \underbrace{\left[\tfrac{1}{n}, \ldots, \tfrac{1}{n}\right]}_{B\ (1 \times n)} \begin{pmatrix} X_1 \\ \vdots \\ X_n \end{pmatrix} = BX, \qquad X \sim N(\mu_X, \sigma^2 I),
\]
where $\mu_X = (\mu, \ldots, \mu)'$.



Also write the vector of differences between the sample variables and the sample mean as
\[
\begin{pmatrix} X_1 - \bar{X}_n \\ \vdots \\ X_n - \bar{X}_n \end{pmatrix} = IX - \underbrace{\begin{pmatrix} \tfrac{1}{n} & \cdots & \tfrac{1}{n} \\ \vdots & & \vdots \\ \tfrac{1}{n} & \cdots & \tfrac{1}{n} \end{pmatrix}}_{H\ (n \times n)} X = (I - H)X,
\]
where $(I - H)$ is symmetric and idempotent, i.e. $(I - H)(I - H) = (I - H)$. Using the matrix $(I - H)$, we can write the sample variance as
\[
S_n^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X}_n)^2 = \frac{1}{n}\left[(X_1 - \bar{X}_n), \ldots, (X_n - \bar{X}_n)\right] \begin{pmatrix} X_1 - \bar{X}_n \\ \vdots \\ X_n - \bar{X}_n \end{pmatrix}
= \frac{1}{n}\, X'(I - H)'(I - H)X = X' \underbrace{\tfrac{1}{n}(I - H)}_{A} X = X'AX.
\]
It follows from Theorem 6.4 that the sample mean $\bar{X}_n = BX$ and the sample variance $S_n^2 = X'AX$ are independent, since
\[
BA = B\,\tfrac{1}{n}(I - H) = \tfrac{1}{n}(B - BH) = \tfrac{1}{n}(B - B) = 0,
\]
where
\[
BH = \left[\tfrac{1}{n}, \ldots, \tfrac{1}{n}\right] \begin{pmatrix} \tfrac{1}{n} & \cdots & \tfrac{1}{n} \\ \vdots & & \vdots \\ \tfrac{1}{n} & \cdots & \tfrac{1}{n} \end{pmatrix} = \left[n\,\tfrac{1}{n^2}, \ldots, n\,\tfrac{1}{n^2}\right] = B.
\]

3. In order to obtain the distribution of $nS_n^2/\sigma^2$, we represent this variable as
\[
\frac{nS_n^2}{\sigma^2} = \frac{1}{\sigma^2}\, X'(I - H)'(I - H)X.
\]
Furthermore, let $\mu_X = (\mu, \ldots, \mu)'$ and note that
\[
(I - H)\mu_X = \mu_X - \mu_X = 0.
\]
This allows us to write
\[
\frac{nS_n^2}{\sigma^2} = \frac{1}{\sigma^2}\Big[(I - H)X - \underbrace{(I - H)\mu_X}_{0}\Big]'\Big[(I - H)X - \underbrace{(I - H)\mu_X}_{0}\Big]
= \frac{1}{\sigma^2}\,(X - \mu_X)'(I - H)'(I - H)(X - \mu_X)
= \frac{1}{\sigma^2}\,(X - \mu_X)'(I - H)(X - \mu_X).
\]
Since the $(n \times n)$ matrix $(I - H)$ is symmetric and idempotent, it follows that³
\[
\operatorname{rk}(I - H) = \operatorname{trace}(I - H) = \operatorname{trace}(I) - \operatorname{trace}(H) = n - 1,
\]
and that the eigenvalues of $(I - H)$ are either 0 or 1, with $(\#\text{ eigenvalues} = 1) = \operatorname{rk}(I - H) = n - 1$. The spectral decomposition of $(I - H)$ into the matrices of eigenvalues $(\Lambda)$ and eigenvectors $(P)$, i.e.
\[
I - H = P\Lambda P',
\]
then yields
\[
P'(I - H)P = \Lambda = \begin{pmatrix} I_{(n-1)} & 0 \\ 0 & 0 \end{pmatrix}.
\]
If this is accounted for in the last equation for $nS_n^2/\sigma^2$, we get
\[
\frac{nS_n^2}{\sigma^2} = \frac{1}{\sigma^2}\,(X - \mu_X)'(I - H)(X - \mu_X) = \frac{1}{\sigma^2}\,(X - \mu_X)'P\Lambda P'(X - \mu_X) = Z'\Lambda Z,
\]
where
\[
Z = P'\,\frac{X - \mu_X}{\sigma} \sim N(0, P'P) = N(0, I)
\]
is a vector of iid $N(0,1)$ variables. Thus
\[
\frac{nS_n^2}{\sigma^2} = Z'\Lambda Z = \sum_{i=1}^{n-1} Z_i^2 \sim \chi_{(n-1)}^2. \qquad \blacksquare
\]

³See Lütkepohl (1996, Chap. 5.2), Handbook of Matrices, Chichester, Wiley.

From the fact that $nS_n^2/\sigma^2 \sim \chi_{(n-1)}^2$ it follows that the sample variance $S_n^2$ is Gamma-distributed, as stated in the following theorem.

Theorem 6.6 Under the assumptions of Theorem 6.5,
\[
S_n^2 \sim \text{Gamma with } \alpha = \frac{n-1}{2}, \quad \beta = \frac{2\sigma^2}{n}.
\]
Proof
Let
\[
Y = \frac{nS_n^2}{\sigma^2} \sim \chi_{(n-1)}^2, \qquad \text{so that} \qquad S_n^2 = \frac{Y\sigma^2}{n}.
\]
Then the MGF for $S_n^2$ is obtained as
\[
M_{S_n^2}(t) \overset{\text{(def.)}}{=} E\exp\{S_n^2\, t\} = E\exp\Big\{Y\,\underbrace{\big(\tfrac{\sigma^2}{n}\, t\big)}_{t^*}\Big\} = \underbrace{(1 - 2t^*)^{-\frac{n-1}{2}}}_{\text{MGF of } Y \,\sim\, \chi_{(n-1)}^2 \text{ at } t^*} = \Big(1 - 2\,\tfrac{\sigma^2}{n}\, t\Big)^{-\frac{n-1}{2}},
\]
which is the MGF associated with the Gamma distribution having $\alpha = \frac{n-1}{2}$ and $\beta = \frac{2\sigma^2}{n}$. $\blacksquare$

Remark 6.9 The theorem implies that the sample variance Sn2 is exactly Gamma distributed
when we sample from a Normal distribution. Hence if the normal model is supposed to be the
correct model, we do not need to rely on the normal approximation for the Sn2 distribution which
is given in Theorem 6.3.
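The chi-square result of Theorem 6.5(3) can be checked by simulation. In this sketch (the normal parameters $\mu = 2$, $\sigma = 3$ and the sizes are illustrative assumptions), $nS_n^2/\sigma^2$ should have mean $n-1$ and variance $2(n-1)$, matching the $\chi^2_{n-1}$ distribution (equivalently, $S_n^2$ is Gamma as in Theorem 6.6):

```python
import random

random.seed(5)

def s2(xs):
    """Sample variance S_n^2 with divisor n."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / n

# Assumed normal population (illustrative): mu = 2, sigma = 3, sample size n = 8.
# n * S_n^2 / sigma^2 should be chi-square with n-1 df: mean 7, variance 14.
mu, sigma, n, reps = 2.0, 3.0, 8, 40000
draws = [n * s2([random.gauss(mu, sigma) for _ in range(n)]) / sigma ** 2
         for _ in range(reps)]
m = sum(draws) / reps
v = sum((d - m) ** 2 for d in draws) / reps
print(m, v)  # compare with n-1 = 7 and 2*(n-1) = 14
```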

6.5. Pdfs of Functions of Random Variables


In the preceding section, we examined a number of statistics (sample mean, sample variance,
etc.), which are useful in a number of statistical inference problems. However, one needs to be
concerned with a much larger variety of functions of random samples to deal with the variety
of inference problems that arise in practice. Furthermore, in order to assess the properties of
statistical procedures, it will be necessary to identify the distribution for functions of random
samples that are used as estimators or as hypothesis-testing statistics.
In this section, we will discuss three approaches, which can be used to derive the pdf for
functions of a random sample:
1. the MGF Approach,
2. the Equivalent Events Approach,
3. the Change of Variable Approach.
MGF Approach
Let X₁, ..., X_n be a random sample from a known population distribution, and let Y = g(X₁, ..., X_n) denote the function of interest. Then one can attempt to derive the MGF of Y, i.e.

    M_Y(t) = E[e^(Yt)] = E[e^(g(X₁,...,X_n)t)],

and identify the distribution associated with that MGF.
Example 6.3 Consider the sample mean X̄_n = (1/n) Σ_(i=1)^n X_i, where X_i ∼ iid Exponential(θ). The MGF of X̄_n is

    M_(X̄_n)(t) = E exp{[(1/n) Σ_(i=1)^n X_i] t}
                = E ∏_(i=1)^n exp{X_i t/n}
                = (indep.) ∏_(i=1)^n E exp{X_i t/n}
                = ∏_(i=1)^n (1 − θt/n)^(−1) = (1 − θt/n)^(−n),   for t < n/θ,

where E exp{X_i t/n} is the MGF of X_i ∼ Exponential(θ) evaluated at t/n. This is the MGF of a Gamma distribution with α = n and β = θ/n. Thus the sampling distribution of X̄_n is Gamma.

Remark 6.10 Note that the MGF approach is applicable only if the MGF exists and if we can recognize the distribution that corresponds to the derived MGF.
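The MGF identification in Example 6.3 can be corroborated numerically: the simulated moments of X̄_n should match those of a Gamma(α = n, β = θ/n) distribution, i.e. mean αβ = θ and variance αβ² = θ²/n. A minimal sketch (θ and n are arbitrary choices):

```python
import random

random.seed(1)
theta, n = 2.0, 10              # Exponential(theta) with mean theta
reps = 20000
means = [sum(random.expovariate(1 / theta) for _ in range(n)) / n
         for _ in range(reps)]

# Example 6.3: Xbar_n ~ Gamma(alpha=n, beta=theta/n)
alpha, beta = n, theta / n
mean_theory = alpha * beta      # = theta = 2.0
var_theory = alpha * beta ** 2  # = theta^2 / n = 0.4
mean_sim = sum(means) / reps
var_sim = sum((m - mean_sim) ** 2 for m in means) / reps
```

Note that `random.expovariate` is parameterized by the rate 1/θ, so the draws have mean θ as in the example.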

Equivalent Events Approach (Discrete Case)


In the case of discrete random variables, we can use the equivalent-events approach for
deriving the distribution of functions of random variables.
Let Y = g(X) be the function of interest, where X represents a discrete random variable with pdf f_X. Consider the set of elementary events x generating a particular elementary event y, i.e.,

    A_y = {x : g(x) = y, x ∈ R(X)}.

Then the probability of the elementary event y can be written as

    f_Y(y) = P_Y(y) = P_X(x ∈ A_y) = Σ_(x ∈ A_y) f_X(x),

which defines the pdf of Y. The extension to the case of multivariate variables is straightforward.

Example 6.4 Let x = (X₁, X₂, X₃)' have a joint discrete pdf given by

    x        (0,0,0)  (0,0,1)  (0,1,1)  (1,0,1)  (1,1,0)  (1,1,1)
    f_x(x)     1/8      3/8      1/8      1/8      1/8      1/8

What is the joint pdf of Y = (Y₁, Y₂) with Y₁ = X₁ + X₂ + X₃ and Y₂ = |X₃ − X₂|?

The mapping between the outcomes in the range of x and that of Y is

    x   (0,0,0)  (0,0,1)  (0,1,1)  (1,0,1)  (1,1,0)  (1,1,1)
    y    (0,0)    (1,1)    (2,0)    (2,1)    (2,1)    (3,0)

Hence the joint pdf of Y = (Y₁, Y₂) is obtained as

    y        (0,0)  (1,1)  (2,0)  (2,1)  (3,0)
    f_y(y)    1/8    3/8    1/8    2/8    1/8
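The equivalent-events computation of Example 6.4 is mechanical enough to script: sum f_x over the preimage A_y of each value y. The following sketch reproduces the table above.

```python
from collections import defaultdict

# Joint pdf of x = (X1, X2, X3) from Example 6.4
f_x = {(0, 0, 0): 1/8, (0, 0, 1): 3/8, (0, 1, 1): 1/8,
       (1, 0, 1): 1/8, (1, 1, 0): 1/8, (1, 1, 1): 1/8}

# Equivalent events: f_Y(y) = sum of f_x(x) over A_y = {x : g(x) = y}
f_y = defaultdict(float)
for (x1, x2, x3), p in f_x.items():
    y = (x1 + x2 + x3, abs(x3 - x2))   # (Y1, Y2)
    f_y[y] += p

# f_y now equals {(0,0): 1/8, (1,1): 3/8, (2,0): 1/8, (2,1): 2/8, (3,0): 1/8}
```

The two x-outcomes (1,0,1) and (1,1,0) map to the same y = (2,1), which is why their probabilities accumulate to 2/8.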

Change of Variables Approach (Continuous case)


A useful technique for deriving the pdf of functions of continuous random variables is
the change-of-variables technique. If the function of interest y = g(x) is monotone and
continuously differentiable, the expression for the pdf of Y is given in the following theorem.
Theorem 6.7 Let X be a continuous random variable with pdf f(x) and support Ω = {x : f(x) > 0}. Suppose that y = g(x) is a continuously differentiable function with

1. dg(x)/dx ≠ 0 ∀x in some open interval containing Ω,
2. and an inverse x = g⁻¹(y) defined ∀y ∈ Ω_Y = {y : y = g(x), x ∈ Ω}.

Then the pdf of Y = g(X) is given by

    h(y) = f(g⁻¹(y)) |dg⁻¹(y)/dy|   for y ∈ Ω_Y.

Proof
The fact that g(x) is continuously differentiable with dg(x)/dx ≠ 0 implies that g is either monotonically increasing (case a) or monotonically decreasing (case b).

Case a, where dg(x)/dx > 0: In this case we have

    P(Y ≤ b) = P(X ≤ g⁻¹(b)).

Thus the cdf of Y is obtained as

    H(b) = P(Y ≤ b) = P(X ≤ g⁻¹(b)) = ∫_(−∞)^(g⁻¹(b)) f(x) dx.

Then the pdf of Y is obtained by differentiation of H as

    h(b) = dH(b)/db = (chain rule) f(g⁻¹(b)) · dg⁻¹(b)/db,

where dg⁻¹(b)/db > 0.



Case b, where dg(x)/dx < 0: In this case the cdf of Y is given by

    H(b) = P(Y ≤ b) = P(X ≥ g⁻¹(b)) = ∫_(g⁻¹(b))^∞ f(x) dx.

Thus the pdf of Y is obtained as

    h(b) = −f(g⁻¹(b)) · dg⁻¹(b)/db,

where dg⁻¹(b)/db < 0.

Combining both cases (increasing and decreasing g), the pdf h(b) can be expressed concisely as

    h(b) = f(g⁻¹(b)) |dg⁻¹(b)/db|. ∎

Example 6.5 Consider the Cobb-Douglas production function defined as the following product of weighted input factors:

    Q = β₀ ∏_(i=1)^k x_i^(β_i) e^W,

where Q : output, x_i > 0 : deterministic quantities of input factors, β_i : corresponding partial production elasticities, β₀ > 0 : efficiency parameter, and W ∼ N(0, σ²) : stochastic error term. What is the pdf of Q? In order to answer this question rewrite Q as

    Q = exp{ln β₀ + Σ_(i=1)^k β_i ln x_i + W} = exp Z,

where Z ∼ N(ln β₀ + Σ_(i=1)^k β_i ln x_i, σ²) = N(μ_Z, σ²).

The function q = exp z is monotonically increasing with dq/dz = exp z > 0 ∀z. The inverse is z = ln q with dz/dq = 1/q > 0. Thus Theorem 6.7 applies, and the pdf of Q is

    h(q) = f_Z(g⁻¹(q)) · dg⁻¹(q)/dq = [1/(σ√(2π))] exp{−(ln q − μ_Z)²/(2σ²)} · (1/q),   for q > 0.

This is the density of a lognormal distribution.
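As a sanity check on the change of variables in Example 6.5, the derived lognormal density should integrate to one over (0, ∞). A minimal numerical sketch (the values of μ_Z and σ are arbitrary):

```python
import math

def lognormal_pdf(q, mu_z, sigma):
    # h(q) = f_Z(ln q) * (1/q): the change-of-variable density from Example 6.5
    f_z = math.exp(-(math.log(q) - mu_z) ** 2 / (2 * sigma ** 2)) \
          / (sigma * math.sqrt(2 * math.pi))
    return f_z / q

mu_z, sigma = 0.5, 0.8
step = 0.001
# Riemann sum over (0, 100]; the truncated tail mass beyond q = 100 is negligible here
integral = sum(lognormal_pdf(i * step, mu_z, sigma) * step
               for i in range(1, 100001))
```

The Jacobian factor 1/q is exactly what distinguishes h(q) from a normal density evaluated at ln q; dropping it would make the integral differ from one.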


Remark 6.11 It is important to note that Theorem 6.7 does not apply to cases where the function g does not have an inverse, such as, for example, the function y = x². However, it can be generalized to cases with piecewise invertible functions (see Mittelhammer, 1996, p. 338).

Furthermore, the results of Theorem 6.7 can be extended to the multivariate case, as stated in the following theorem.

Theorem 6.8 Let x be a continuous (n × 1) random vector with joint pdf f(x) and support Ω. Furthermore, let g(x) be an (n × 1) vector function which is

1. continuously differentiable ∀x in some open rectangle Θ containing Ω,
2. and with an inverse x = g⁻¹(y), which exists ∀y ∈ Ω_Y = {y : y = g(x), x ∈ Ω}.

Assume that the Jacobian matrix

    J = [ ∂g₁⁻¹(y)/∂y₁   ...   ∂g₁⁻¹(y)/∂y_n ]
        [      ...       ...        ...      ]
        [ ∂g_n⁻¹(y)/∂y₁  ...   ∂g_n⁻¹(y)/∂y_n ]

satisfies det(J) ≠ 0, and assume that all partial derivatives in J are continuous ∀y ∈ Ω_Y. Then the joint pdf of Y = g(x) is given by

    h(y) = f(g₁⁻¹(y), ..., g_n⁻¹(y)) |det(J)|   for y ∈ Ω_Y.

Proof
See Rohatgi and Saleh (2001), p. 133-134. 

Remark 6.12 In the multivariate change-of-variable Theorem 6.8, there are as many coordinates in y as there are elements in the argument x, i.e., dim(y) = dim(x) = n. In cases where dim(y) < dim(x) = n, we need to introduce auxiliary variables to obtain an invertible function having n coordinates and to apply Theorem 6.8. Then, in a second step, the auxiliary variables are integrated out from the derived joint pdf.

We now illustrate the multivariate change-of-variable approach in order to derive the Student-t density and the F-density. Both densities play an important role in hypothesis-testing procedures when random sampling is from a normal population.
The t-density
Theorem 6.9 Let Z ∼ N(0,1), let Y ∼ χ²_v, and let Z and Y be independent. Then

    T = Z / √(Y/v)

has the t-density with v degrees of freedom, defined as

    f(t; v) = [Γ((v+1)/2) / (Γ(v/2)√(vπ))] (1 + t²/v)^(−(v+1)/2).

Proof
The proof is based on the multivariate change-of-variable Theorem 6.8. Define the (2 × 1) vector function g as

    [ t ]   [ g₁(z, y) ]   [ z/√(y/v) ]
    [ w ] = [ g₂(z, y) ] = [    y     ],

where g₂ is an auxiliary function introduced to allow the use of Theorem 6.8. The function g is continuously differentiable with an inverse function g⁻¹ which is obtained by solving for z and y as

    [ z ]   [ g₁⁻¹(t, w) ]   [ t√(w/v) ]
    [ y ] = [ g₂⁻¹(t, w) ] = [    w    ].

The Jacobian of the inverse g⁻¹ is thus

    J = [ ∂g₁⁻¹/∂t  ∂g₁⁻¹/∂w ] = [ √(w/v)  t/(2√(wv)) ]
        [ ∂g₂⁻¹/∂t  ∂g₂⁻¹/∂w ]   [   0         1      ],   with |det(J)| = √(w/v).

The assumed joint pdf of (Z, Y) is

    f(z, y) = (indep.) f(z)f(y) = [1/√(2π)] e^(−z²/2) · [1/(2^(v/2) Γ(v/2))] y^((v/2)−1) e^(−y/2),

where the first factor is the N(0,1) density of Z and the second is the χ²_v density of Y. Then by Theorem 6.8 the joint pdf of (T, W) is given by

    h(t, w) = [1/√(2π)] e^(−t²w/(2v)) · [1/(2^(v/2) Γ(v/2))] w^((v/2)−1) e^(−w/2) · √(w/v)
            = w^((v−1)/2) e^(−w[1 + (t²/v)]/2) / [Γ(v/2) √(vπ) 2^((v+1)/2)].

The marginal pdf of T is obtained by integrating w out from the joint pdf h(t, w), i.e.,

    f(t) = ∫₀^∞ h(t, w) dw = ∫₀^∞ w^((v−1)/2) e^(−w[1 + (t²/v)]/2) / [Γ(v/2) √(vπ) 2^((v+1)/2)] dw.

Making an appropriate variable substitution in this integral yields⁴

    f(t) = [Γ((v+1)/2) / (Γ(v/2)√(vπ))] (1 + t²/v)^(−(v+1)/2). ∎

Remark 6.13 The t-distribution with a pdf as given in Theorem 6.9 has the following moments:

    μ = 0 for v > 1,   σ² = v/(v−2) for v > 2,   μ₃ = 0 for v > 3,

where the parameter v > 0 is referred to as the degrees of freedom. The MGF does not exist. The t-density is symmetric about 0 and has fatter tails than a standard normal density.

⁴ For details, see Mittelhammer (1996, p. 342).
Theorem 6.9, defining the t-distribution, facilitates the derivation of the pdf of the so-called t-statistic when the random sample is from a normal distribution, as stated in the following theorem.
Theorem 6.10 Let (X₁, ..., X_n) be a random sample from a N(μ, σ²) distribution. Then the t-statistic given by

    T = (X̄_n − μ) / √(σ̂_n²/n),

where

    X̄_n = (1/n) Σ_(i=1)^n X_i,   σ̂_n² = [n/(n−1)] S_n²,   S_n² = (1/n) Σ_(i=1)^n (X_i − X̄_n)²,

follows a t-distribution with v = n − 1 degrees of freedom.


Proof
Rewrite the t-statistic as

    T = (X̄_n − μ)/√(σ̂_n²/n) = [(X̄_n − μ)/√(σ²/n)] / √{[nS_n²/σ²]/(n − 1)},

where (X̄_n − μ)/√(σ²/n) ∼ N(0, 1), nS_n²/σ² ∼ χ²_(n−1), and X̄_n and S_n² are independent (see Theorem 6.5). Thus Theorem 6.9 applies, so that T ∼ t_(n−1). ∎
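Theorem 6.10 can be illustrated by simulation: t-statistics computed from normal samples of size n should be centered at zero with the variance v/(v−2) of a t_(n−1) distribution. A sketch with arbitrary choices of μ, σ, and n:

```python
import math
import random

def t_stat(xs, mu):
    # T = (xbar - mu) / sqrt(sigma_hat^2 / n), with sigma_hat^2 = n/(n-1) * S_n^2
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / n     # S_n^2 (divisor n)
    sigma2_hat = n / (n - 1) * s2                 # unbiased variance estimator
    return (xbar - mu) / math.sqrt(sigma2_hat / n)

random.seed(2)
n, mu, sigma = 8, 1.0, 3.0
reps = 20000
ts = [t_stat([random.gauss(mu, sigma) for _ in range(n)], mu)
      for _ in range(reps)]

# T ~ t_{n-1}, so (Remark 6.13) Var(T) = v/(v-2) with v = n - 1 = 7
v = n - 1
var_theory = v / (v - 2)    # = 1.4
mean_sim = sum(ts) / reps
var_sim = sum((t - mean_sim) ** 2 for t in ts) / reps
```

Note that the simulated variance exceeds 1, reflecting the fatter tails of the t-density relative to the standard normal.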

An important property of the t_v-distribution is that it converges to a normal distribution as v → ∞, as stated in the following theorem.

Theorem 6.11 Let Z ∼ N(0,1), let Y ∼ χ²_v, and let Z and Y be independent, so that

    T_v = Z/√(Y/v) ∼ t_v.

Then

    T_v →d N(0, 1)   for v → ∞.

Proof
Since Y ∼ χ²_v, we have E(Y) = v and Var(Y) = 2v, so that

    E(Y/v) = 1,   Var(Y/v) = 2/v → 0,   for v → ∞.

Hence Y/v converges in mean square to 1, which implies plim(Y/v) = 1. Also note that since Z ∼ N(0,1) ∀v, it follows trivially that Z →d N(0,1). Then by Slutsky's theorem,

    T_v = Z/√(Y/v) →d Z/1 ∼ N(0, 1). ∎



The F-density
The F-density arises as the density of a ratio of two independent χ² random variables.

Theorem 6.12 Let Y₁ ∼ χ²_(v₁), let Y₂ ∼ χ²_(v₂), and let Y₁ and Y₂ be independent. Then

    F = (Y₁/v₁) / (Y₂/v₂)

has the F-density with v₁ numerator and v₂ denominator degrees of freedom, defined as

    m(f; v₁, v₂) = [Γ((v₁+v₂)/2) / (Γ(v₁/2)Γ(v₂/2))] (v₁/v₂)^(v₁/2) f^((v₁/2)−1) [1 + (v₁/v₂)f]^(−(v₁+v₂)/2) I₍₀,∞₎(f).

Proof
The proof, based upon the multivariate change-of-variable technique (Theorem 6.8), is straightforward (see Mittelhammer, 1996, p. 345). ∎

Remark 6.14 The F-distribution with a pdf as given in Theorem 6.12 has the following moments:

    μ = v₂/(v₂ − 2)   for v₂ > 2,
    σ² = 2v₂²(v₁ + v₂ − 2) / [v₁(v₂ − 2)²(v₂ − 4)]   for v₂ > 4,
    μ₃ = 8v₁v₂³(v₁ + v₂ − 2)(2v₁ + v₂ − 2) / [v₁³(v₂ − 2)³(v₂ − 4)(v₂ − 6)] > 0   for v₂ > 6,

where the degrees-of-freedom parameters are v₁ > 0 and v₂ > 0. The MGF does not exist. The F-density is skewed to the right.
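The construction in Theorem 6.12 and the mean formula of Remark 6.14 can be checked by building F directly from two independent χ² variables, each simulated as a sum of squared standard normals (the choices of v₁ and v₂ below are arbitrary):

```python
import random

random.seed(3)

def chi2(v):
    # chi-square(v) draw as a sum of v squared iid N(0,1) variables
    return sum(random.gauss(0, 1) ** 2 for _ in range(v))

v1, v2 = 4, 10
reps = 20000
fs = [(chi2(v1) / v1) / (chi2(v2) / v2) for _ in range(reps)]

# Remark 6.14: E(F) = v2/(v2 - 2) for v2 > 2
mean_theory = v2 / (v2 - 2)    # = 1.25
mean_sim = sum(fs) / reps
```

That the mean exceeds 1 (even though both numerator and denominator have mean 1) reflects the right skew of the F-density noted in the remark.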

6.6. Order Statistics


Situations arise in practice where we are interested in the largest or smallest value in a random sample rather than in the average value. Examples are:

- When constructing a dike, we are interested in the highest flood water level rather than in the average water level;
- As the risk manager of a portfolio of risky assets, we are interested in the smallest portfolio return, whose expected value represents the expected maximal loss.

The largest and smallest value in a random sample are examples of order statistics.

Definition 6.8 (Order Statistic) Let X₁, X₂, ..., X_n be a random sample. Then X_[1] ≤ X_[2] ≤ ... ≤ X_[n], where the X_[i]'s are the X_i's arranged in order of increasing magnitude, are the order statistics of the sample X₁, X₂, ..., X_n. The variable X_[i] is called the ith order statistic.

Remark 6.15 Note that the order statistics are indeed statistics, since they are defined as
functions of the random sample.

Example 6.6 Let the random variable X be the return of a portfolio of risky assets. Then the 1st order statistic X_[1] = min{X₁, ..., X_n} is a critical variable for a risk manager, who might be interested in the probability

    P(X_[1] ≤ −10%).

Note that we need the sampling distribution of X_[1] in order to compute this probability.

The following theorem establishes the cdf for the sampling distribution of the kth order statistic
for a random sample from a population distribution (i.e. a sample consisting of iid random
variables).

Theorem 6.13 Let (X₁, ..., X_n) be a random sample from a population distribution with cdf F, and let X_[k] be the kth order statistic. Then the cdf of X_[k] is given by

    F_(X_[k])(b) = Σ_(j=k)^n C(n, j) F(b)^j [1 − F(b)]^(n−j),

where C(n, j) denotes the binomial coefficient.

Proof
For a given b, define the random variable

    Y_i = I_((−∞,b])(X_i),

which is Bernoulli-distributed with p = P(Y_i = 1) = P(X_i ≤ b) = F(b). Since the Y_i's are iid, it follows that

    Σ_(i=1)^n Y_i = Σ_(i=1)^n I_((−∞,b])(X_i) ∼ Binomial[n, p = F(b)].

Now note the equivalence of the following two events:

    {X_[k] ≤ b}   (the event that the kth order statistic is less than or equal to b)

and

    {Σ_(i=1)^n I_((−∞,b])(X_i) ≥ k}   (the event that at least k outcomes are less than or equal to b).

Since both events are equivalent, they have the same probability. Thus the cdf of X_[k] is obtained as

    F_(X_[k])(b) = P(X_[k] ≤ b) = P(Σ_(i=1)^n Y_i ≥ k) = Σ_(j=k)^n C(n, j) F(b)^j [1 − F(b)]^(n−j),

where the last expression sums the binomial pdf of Σ Y_i over j ≥ k. ∎
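The binomial-sum formula of Theorem 6.13 is easy to evaluate and to check against simulated order statistics. The sketch below uses a uniform(0,1) sample, so the parent cdf is F(x) = x; the values of n, k, and b are arbitrary:

```python
import math
import random

def order_stat_cdf(b, k, n, F):
    # Theorem 6.13: F_{X[k]}(b) = sum_{j=k}^{n} C(n,j) F(b)^j (1 - F(b))^(n-j)
    p = F(b)
    return sum(math.comb(n, j) * p ** j * (1 - p) ** (n - j)
               for j in range(k, n + 1))

n, k, b = 6, 2, 0.4
theory = order_stat_cdf(b, k, n, lambda x: x)   # uniform(0,1) parent cdf

random.seed(7)
reps = 20000
# Empirical frequency of the event {X_[k] <= b}
sim = sum(1 for _ in range(reps)
          if sorted(random.random() for _ in range(n))[k - 1] <= b) / reps
```

Passing the parent cdf F as a function keeps the formula usable for any population distribution, as Remark 6.16 below emphasizes.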

The cdfs for the smallest and the largest order statistic, X_[1] and X_[n], are obtained as special cases of Theorem 6.13.

Corollary 6.1 Under the conditions of Theorem 6.13 the cdfs of X_[1] and X_[n] are given by

    F_(X_[1])(b) = 1 − [1 − F(b)]^n   and   F_(X_[n])(b) = F(b)^n.

Proof
From Theorem 6.13, we have

    F_(X_[1])(b) = Σ_(j=1)^n C(n, j) F(b)^j [1 − F(b)]^(n−j)
                 = Σ_(j=0)^n C(n, j) F(b)^j [1 − F(b)]^(n−j) − C(n, 0) F(b)^0 [1 − F(b)]^(n−0)
                 = 1 − [1 − F(b)]^n,

since the sum of a binomial pdf over its entire support equals 1. Also,

    F_(X_[n])(b) = Σ_(j=n)^n C(n, j) F(b)^j [1 − F(b)]^(n−j) = C(n, n) F(b)^n [1 − F(b)]^(n−n) = F(b)^n. ∎
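Corollary 6.1 for the maximum can likewise be checked by simulation: for a uniform(0,1) sample, F_(X_[n])(b) = b^n. A minimal sketch (n and b are arbitrary choices):

```python
import random

random.seed(6)
n, b, reps = 5, 0.8, 20000

# Empirical frequency of the event {X_[n] <= b} = {all n draws <= b}
hits = sum(1 for _ in range(reps)
           if max(random.random() for _ in range(n)) <= b)

prob_theory = b ** n        # Corollary 6.1: F_{X[n]}(b) = F(b)^n = 0.8^5
prob_sim = hits / reps
```

The event {X_[n] ≤ b} is exactly the event that every single draw is ≤ b, which is why the independence of the draws turns F(b) into F(b)^n.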

Remark 6.16 Note that the particular analytical form of the cdf of the order statistics F_(X_[k])(b) depends upon the analytical form of the cdf of the parent distribution F, i.e., the population distribution of the random sample. For unknown distributions, Markov's inequality may help to gauge the order of magnitude of the maximum or minimum of a random sequence.

Example 6.7 Suppose that the life of a certain light bulb, measured in hours and denoted by X, is exponentially distributed with pdf

    f(x) = (1/θ) e^(−x/θ) I₍₀,∞₎(x),   with E(X) = θ = 1000 hours.

In a random sample of n = 10 such light bulbs, what is the distribution of the life of the bulb that fails first, and what is its expected life?

First recall that the cdf of X is F(x) = 1 − e^(−x/θ); then the cdf for the life of the bulb that fails first, X_[1], is

    F_(X_[1])(x) = (Corollary 6.1) 1 − [1 − F(x)]^n = 1 − [1 − 1 + e^(−x/θ)]^n = 1 − e^(−xn/θ),

which is the cdf of an exponential distribution with parameter θ/n. Thus X_[1] is exponentially distributed with mean E(X_[1]) = θ/n = 1000/10 = 100.
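Example 6.7 can be reproduced by simulation: the minimum of n = 10 Exponential(θ = 1000) lifetimes should again be exponentially distributed, with mean θ/n = 100. A minimal sketch:

```python
import random

random.seed(4)
theta, n = 1000.0, 10
reps = 20000

# Life of the bulb that fails first in each simulated batch of n bulbs
mins = [min(random.expovariate(1 / theta) for _ in range(n))
        for _ in range(reps)]

# X_[1] ~ Exponential(theta/n), so E(X_[1]) = theta/n = 100
mean_sim = sum(mins) / reps
```

Here, too, `random.expovariate` takes the rate 1/θ, so each simulated lifetime has mean θ = 1000 hours.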
Remark 6.17 Having established the cdf of an order statistic, we can use the duality between cdfs and pdfs in order to obtain the pdf of the order statistic. If the random sample is from a discrete population distribution with support x₁ < x₂ < ... < x_n, then the pdf of X_[k] is obtained as

    f_(X_[k])(x_i) = F_(X_[k])(x_i) − F_(X_[k])(x_(i−1))   for i ≥ 2,   with f_(X_[k])(x₁) = F_(X_[k])(x₁).

In the case of a continuous population distribution, the pdf is obtained by differentiation of the cdf as⁵

    f_(X_[k])(x) = dF_(X_[k])(x)/dx = [n!/((k−1)!(n−k)!)] f(x) F(x)^(k−1) [1 − F(x)]^(n−k).
Example 6.8 Let {X₁, ..., X_n} be a random sample from a uniform(0,1) distribution with pdf f(x) = 1 for x ∈ (0,1) and cdf F(x) = x. Then the pdf of the kth order statistic is

    f_(X_[k])(x) = [n!/((k−1)!(n−k)!)] f(x) F(x)^(k−1) [1 − F(x)]^(n−k)
                 = [n!/((k−1)!(n−k)!)] x^(k−1) (1 − x)^(n−k)
                 = [Γ(n+1)/(Γ(k)Γ(n−k+1))] x^(k−1) (1 − x)^((n−k+1)−1).

Thus, the kth order statistic from a uniform(0,1) random sample has a beta(α, β) distribution with α = k and β = n − k + 1.
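The beta(k, n−k+1) result of Example 6.8 implies E(X_[k]) = k/(n+1), which gives a quick simulation check (the values of n and k are arbitrary):

```python
import random

random.seed(5)
n, k = 7, 3
reps = 20000

# kth smallest value in each simulated uniform(0,1) sample of size n
kth = [sorted(random.random() for _ in range(n))[k - 1]
       for _ in range(reps)]

# X_[k] ~ beta(k, n-k+1), whose mean is k/(n+1)
mean_theory = k / (n + 1)    # = 0.375
mean_sim = sum(kth) / reps
```

The mean k/(n+1) says that the n order statistics split (0,1) into n+1 segments of equal expected length, a convenient way to remember the result.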
So far we have considered the marginal distribution of one order statistic. The joint sampling distribution of any pair of order statistics is given in the following theorem.

Theorem 6.14 Let (X₁, ..., X_n) be a random sample from a population distribution with cdf F, and let X_[k] and X_[ℓ], with k < ℓ, be the kth and ℓth order statistics. Then their joint cdf is given by

    F_(X_[k] X_[ℓ])(b_k, b_ℓ) = F_(X_[ℓ])(b_ℓ)   for b_k ≥ b_ℓ,

and

    F_(X_[k] X_[ℓ])(b_k, b_ℓ) = Σ_(i=k)^n Σ_(j=max{0, ℓ−i})^(n−i) [n!/(i!j!(n−i−j)!)] F(b_k)^i [F(b_ℓ) − F(b_k)]^j [1 − F(b_ℓ)]^(n−i−j)   for b_k < b_ℓ.

⁵ For details of the algebraic manipulations see Casella and Berger (2002, p. 229).



Proof
Case b_k ≥ b_ℓ: Since X_[k] ≤ X_[ℓ], the event {X_[ℓ] ≤ b_ℓ} implies the event {X_[k] ≤ b_k}. Hence, we have

    F_(X_[k] X_[ℓ])(b_k, b_ℓ) = P(X_[k] ≤ b_k, X_[ℓ] ≤ b_ℓ) = P(X_[ℓ] ≤ b_ℓ) = F_(X_[ℓ])(b_ℓ).

Case b_k < b_ℓ: Note that the event {X_[k] ≤ b_k, X_[ℓ] ≤ b_ℓ} corresponds to the event

    {at least k of the X_i's ≤ b_k and at least ℓ of the X_i's ≤ b_ℓ}.

This event can be represented as

    {X_[k] ≤ b_k, X_[ℓ] ≤ b_ℓ} = ⋃_((i,j)∈I) {i of the X_i's ≤ b_k and j of the X_i's are such that b_k < X_i ≤ b_ℓ},

where I = {(i, j) : max{0, ℓ−i} ≤ j ≤ n−i; k ≤ i ≤ n}.

Note that the events involved in the union operation are disjoint. In order to assign probabilities to those events we can use the multinomial distribution as follows: categorize the outcomes of the X_i's into one of the three types

    {X_i ≤ b_k},   {b_k < X_i ≤ b_ℓ},   {X_i > b_ℓ},

which occur with probabilities

    F(b_k),   F(b_ℓ) − F(b_k),   1 − F(b_ℓ).

Then directly applying the multinomial distribution yields

    P({i of the X_i's ≤ b_k and j of the X_i's are such that b_k < X_i ≤ b_ℓ})
        = [n!/(i!j!(n−i−j)!)] F(b_k)^i [F(b_ℓ) − F(b_k)]^j [1 − F(b_ℓ)]^(n−i−j).

Finally, summing these probabilities over all of the disjoint events in the union yields the second part of the theorem. ∎

Remark 6.18 From the joint cdf of two order statistics we can derive their joint pdf. If the random sample is discrete, the pdf can be obtained by differencing the cdf. In the continuous case the pdf is obtained by differentiation of the cdf, which yields

    f_(X_[k],X_[ℓ])(x_k, x_ℓ) = ∂²F_(X_[k],X_[ℓ])(x_k, x_ℓ)/(∂x_k ∂x_ℓ)
        = [n!/((k−1)!(ℓ−1−k)!(n−ℓ)!)] f(x_k) f(x_ℓ) F(x_k)^(k−1) [F(x_ℓ) − F(x_k)]^(ℓ−1−k) [1 − F(x_ℓ)]^(n−ℓ),

for x_k < x_ℓ. This result for the joint pdf of two order statistics can be extended to derive the joint pdf of all order statistics of a continuous random sample, which is given by⁶

    f_(X_[1],...,X_[n])(x₁, ..., x_n) = n! ∏_(i=1)^n f(x_i),   for x₁ < x₂ < ... < x_n.

We now derive the distribution of interesting functions of order statistics, such as the sample median, the sample range, and the sample midrange.

Definition 6.9 (Sample Median) Let (X_[1], ..., X_[n]) be the order statistics of a random sample of size n. Then the sample median is defined as

    M = X_[k]                  if n is odd with n = 2k − 1, k ∈ ℕ,
    M = (X_[k] + X_[k+1])/2    if n is even with n = 2k.

Hence the sample median is the middle order statistic (if n is odd) or the average of the two middle order statistics (if n is even) and is a measure of the center of the empirical distribution.
Definition 6.10 (Sample Range, Sample Midrange) Let (X_[1], ..., X_[n]) be the order statistics of a random sample of size n. Then the sample range and midrange are defined as

    R = X_[n] − X_[1],   and   MR = (X_[1] + X_[n])/2.

The sample midrange represents a further measure of the center of the empirical distribution, while the sample range measures the dispersion of the empirical distribution.

⁶ See Mood, Graybill and Boes (1974), Theorem 12, p. 254.



Distribution of the Median
When the sample size n is odd with n = 2k − 1, we have M = X_[k], and therefore the pdf of the sample median is immediately obtained as the pdf of the kth order statistic:

    f_M(m) = f_(X_[k])(m).

When the sample size n is even with n = 2k, the median is a function of two random variables. Then its pdf can be obtained by using the change-of-variable approach based upon the joint pdf of the two random variables. When the random sample is from a continuous population distribution, the joint pdf of (X_[k], X_[k+1]) is

    f_(X_[k],X_[k+1])(x_k, x_(k+1)) = [n!/((k−1)!0!(n−k−1)!)] f(x_k) f(x_(k+1)) F(x_k)^(k−1) [F(x_(k+1)) − F(x_k)]^0 [1 − F(x_(k+1))]^(n−k−1),

for x_k < x_(k+1).

Now define the (2 × 1) vector function g as

    [ v ]   [ g₁(x_k, x_(k+1)) ]   [        x_k        ]
    [ m ] = [ g₂(x_k, x_(k+1)) ] = [ (x_k + x_(k+1))/2 ],

where g₁ is an auxiliary function introduced to allow the use of the change-of-variable approach. The function g is continuously differentiable with an inverse function g⁻¹ which is obtained by solving for x_k and x_(k+1) as

    [  x_k    ]   [ g₁⁻¹(v, m) ]   [   v    ]
    [ x_(k+1) ] = [ g₂⁻¹(v, m) ] = [ 2m − v ].

The Jacobian of the inverse g⁻¹ is thus

    J = [ ∂g₁⁻¹/∂v  ∂g₁⁻¹/∂m ] = [  1  0 ]
        [ ∂g₂⁻¹/∂v  ∂g₂⁻¹/∂m ]   [ −1  2 ],   with |det(J)| = 2.

Then by the multivariate change-of-variable approach the joint pdf of V = X_[k] and M = (X_[k] + X_[k+1])/2 is given by

    f_(V,M)(v, m) = |det(J)| f_(X_[k],X_[k+1])(g₁⁻¹(v, m), g₂⁻¹(v, m))
                  = [2n!/((k−1)!(n−k−1)!)] f(v) f(2m − v) F(v)^(k−1) [1 − F(2m − v)]^(n−k−1).

Finally, the marginal pdf of the sample median M is obtained by integrating v out from the joint pdf f_(V,M)(v, m), i.e.,

    f_M(m) = ∫ f_(V,M)(v, m) dv.

Note that the result, i.e. the functional form of f_M(m), depends upon the population distribution of the sample and the functional forms of F and f.

Distribution of the Sample Range and Midrange
The sampling distributions of the sample range R = X_[n] − X_[1] and the sample midrange MR = (X_[1] + X_[n])/2 can be obtained in the same way as we derived the distribution of the median. For further details, see Mood, Graybill and Boes (1974, Chap. VI.5.2) and Casella and Berger (2002, Chap. 5.4).


Appendix

A. Tables
Table A.1.: Quantiles of the χ² distribution

 ν     0.5%     1%    2.5%     5%     10%     90%     95%   97.5%     99%   99.5%
 1    0.000   0.000   0.001   0.004   0.016   2.706   3.841   5.024   6.635   7.879
 2    0.010   0.020   0.051   0.103   0.211   4.605   5.991   7.378   9.210  10.597
 3    0.039   0.100   0.213   0.353   0.587   6.252   7.816   9.350  11.346  12.836
 4    0.195   0.292   0.484   0.712   1.065   7.780   9.488  11.144  13.277  14.859
 5    0.406   0.552   0.831   1.146   1.611   9.237  11.071  12.833  15.086  16.749
 6    0.673   0.871   1.237   1.636   2.204  10.645  12.592  14.450  16.812  18.547
 7    0.987   1.238   1.690   2.168   2.833  12.017  14.067  16.013  18.475  20.277
 8    1.343   1.646   2.180   2.733   3.490  13.362  15.507  17.535  20.090  21.955
 9    1.734   2.088   2.700   3.325   4.168  14.684  16.919  19.023  21.666  23.589
10    2.155   2.558   3.247   3.940   4.865  15.987  18.307  20.483  23.209  25.188
11    2.603   3.053   3.816   4.575   5.578  17.275  19.675  21.920  24.725  26.757
12    3.074   3.570   4.404   5.226   6.304  18.549  21.026  23.337  26.217  28.299
13    3.565   4.107   5.009   5.892   7.042  19.812  22.362  24.736  27.688  29.819
14    4.075   4.660   5.629   6.571   7.790  21.064  23.685  26.119  29.141  31.319
15    4.601   5.229   6.262   7.261   8.547  22.307  24.996  27.488  30.578  32.801
16    5.142   5.812   6.908   7.962   9.312  23.542  26.296  28.845  32.000  34.267
17    5.697   6.408   7.564   8.672  10.085  24.769  27.587  30.191  33.409  35.718
18    6.265   7.015   8.231   9.390  10.865  25.989  28.869  31.526  34.805  37.156
19    6.844   7.633   8.907  10.117  11.651  27.204  30.144  32.852  36.191  38.582
20    7.434   8.260   9.591  10.851  12.443  28.412  31.410  34.170  37.566  39.997

X ∼ χ²(ν): Quantiles χ²_p(ν) of the χ² distribution with ν degrees of freedom.

Table A.2.: Quantiles of the standard normal distribution

  p     0.000   0.001   0.002   0.003   0.004   0.005   0.006   0.007   0.008   0.009
0.50x  0.0000  0.0025  0.0050  0.0075  0.0100  0.0125  0.0150  0.0175  0.0201  0.0226
0.51x  0.0251  0.0276  0.0301  0.0326  0.0351  0.0376  0.0401  0.0426  0.0451  0.0476
0.52x  0.0502  0.0527  0.0552  0.0577  0.0602  0.0627  0.0652  0.0677  0.0702  0.0728
0.53x  0.0753  0.0778  0.0803  0.0828  0.0853  0.0878  0.0904  0.0929  0.0954  0.0979
0.54x  0.1004  0.1030  0.1055  0.1080  0.1105  0.1130  0.1156  0.1181  0.1206  0.1231
0.55x  0.1257  0.1282  0.1307  0.1332  0.1358  0.1383  0.1408  0.1434  0.1459  0.1484
0.56x  0.1510  0.1535  0.1560  0.1586  0.1611  0.1637  0.1662  0.1687  0.1713  0.1738
0.57x  0.1764  0.1789  0.1815  0.1840  0.1866  0.1891  0.1917  0.1942  0.1968  0.1993
0.58x  0.2019  0.2045  0.2070  0.2096  0.2121  0.2147  0.2173  0.2198  0.2224  0.2250
0.59x  0.2275  0.2301  0.2327  0.2353  0.2378  0.2404  0.2430  0.2456  0.2482  0.2508
0.60x  0.2533  0.2559  0.2585  0.2611  0.2637  0.2663  0.2689  0.2715  0.2741  0.2767
0.61x  0.2793  0.2819  0.2845  0.2871  0.2898  0.2924  0.2950  0.2976  0.3002  0.3029
0.62x  0.3055  0.3081  0.3107  0.3134  0.3160  0.3186  0.3213  0.3239  0.3266  0.3292
0.63x  0.3319  0.3345  0.3372  0.3398  0.3425  0.3451  0.3478  0.3505  0.3531  0.3558
0.64x  0.3585  0.3611  0.3638  0.3665  0.3692  0.3719  0.3745  0.3772  0.3799  0.3826
0.65x  0.3853  0.3880  0.3907  0.3934  0.3961  0.3989  0.4016  0.4043  0.4070  0.4097
0.66x  0.4125  0.4152  0.4179  0.4207  0.4234  0.4261  0.4289  0.4316  0.4344  0.4372
0.67x  0.4399  0.4427  0.4454  0.4482  0.4510  0.4538  0.4565  0.4593  0.4621  0.4649
0.68x  0.4677  0.4705  0.4733  0.4761  0.4789  0.4817  0.4845  0.4874  0.4902  0.4930
0.69x  0.4959  0.4987  0.5015  0.5044  0.5072  0.5101  0.5129  0.5158  0.5187  0.5215
0.70x  0.5244  0.5273  0.5302  0.5330  0.5359  0.5388  0.5417  0.5446  0.5476  0.5505
0.71x  0.5534  0.5563  0.5592  0.5622  0.5651  0.5681  0.5710  0.5740  0.5769  0.5799
0.72x  0.5828  0.5858  0.5888  0.5918  0.5948  0.5978  0.6008  0.6038  0.6068  0.6098
0.73x  0.6128  0.6158  0.6189  0.6219  0.6250  0.6280  0.6311  0.6341  0.6372  0.6403
0.74x  0.6433  0.6464  0.6495  0.6526  0.6557  0.6588  0.6620  0.6651  0.6682  0.6713
0.75x  0.6745  0.6776  0.6808  0.6840  0.6871  0.6903  0.6935  0.6967  0.6999  0.7031
0.76x  0.7063  0.7095  0.7128  0.7160  0.7192  0.7225  0.7257  0.7290  0.7323  0.7356
0.77x  0.7388  0.7421  0.7454  0.7488  0.7521  0.7554  0.7588  0.7621  0.7655  0.7688
0.78x  0.7722  0.7756  0.7790  0.7824  0.7858  0.7892  0.7926  0.7961  0.7995  0.8030
0.79x  0.8064  0.8099  0.8134  0.8169  0.8204  0.8239  0.8274  0.8310  0.8345  0.8381
0.80x  0.8416  0.8452  0.8488  0.8524  0.8560  0.8596  0.8633  0.8669  0.8705  0.8742
0.81x  0.8779  0.8816  0.8853  0.8890  0.8927  0.8965  0.9002  0.9040  0.9078  0.9116
0.82x  0.9154  0.9192  0.9230  0.9269  0.9307  0.9346  0.9385  0.9424  0.9463  0.9502
0.83x  0.9542  0.9581  0.9621  0.9661  0.9701  0.9741  0.9782  0.9822  0.9863  0.9904
0.84x  0.9945  0.9986  1.0027  1.0069  1.0110  1.0152  1.0194  1.0237  1.0279  1.0322
0.85x  1.0364  1.0407  1.0450  1.0494  1.0537  1.0581  1.0625  1.0669  1.0714  1.0758
0.86x  1.0803  1.0848  1.0893  1.0939  1.0985  1.1031  1.1077  1.1123  1.1170  1.1217
0.87x  1.1264  1.1311  1.1359  1.1407  1.1455  1.1503  1.1552  1.1601  1.1650  1.1700
0.88x  1.1750  1.1800  1.1850  1.1901  1.1952  1.2004  1.2055  1.2107  1.2160  1.2212
0.89x  1.2265  1.2319  1.2372  1.2426  1.2481  1.2536  1.2591  1.2646  1.2702  1.2759
0.90x  1.2816  1.2873  1.2930  1.2988  1.3047  1.3106  1.3165  1.3225  1.3285  1.3346
0.91x  1.3408  1.3469  1.3532  1.3595  1.3658  1.3722  1.3787  1.3852  1.3917  1.3984
0.92x  1.4051  1.4118  1.4187  1.4255  1.4325  1.4395  1.4466  1.4538  1.4611  1.4684
0.93x  1.4758  1.4833  1.4909  1.4985  1.5063  1.5141  1.5220  1.5301  1.5382  1.5464
0.94x  1.5548  1.5632  1.5718  1.5805  1.5893  1.5982  1.6072  1.6164  1.6258  1.6352
0.95x  1.6449  1.6546  1.6646  1.6747  1.6849  1.6954  1.7060  1.7169  1.7279  1.7392
0.96x  1.7507  1.7624  1.7744  1.7866  1.7991  1.8119  1.8250  1.8384  1.8522  1.8663
0.97x  1.8808  1.8957  1.9110  1.9268  1.9431  1.9600  1.9774  1.9954  2.0141  2.0335
0.98x  2.0537  2.0749  2.0969  2.1201  2.1444  2.1701  2.1973  2.2262  2.2571  2.2904
0.99x  2.3263  2.3656  2.4089  2.4573  2.5121  2.5758  2.6521  2.7478  2.8782  3.0902

Z ∼ N(0,1): Quantiles z_p = Φ⁻¹(p) of the standard normal distribution.
