The subject of probability theory is the foundation upon which all of statistics and econometrics is built, providing the means for modelling populations, experiments, or almost anything else that could be considered a random phenomenon. Through these models, econometricians are able to draw inferences about populations, inferences based on the examination of only a part of the whole.
Foundations
Definition 1 The set, S, of all possible outcomes of an experiment is called the sample space for the experiment.
Take the simple example of tossing a coin. There are two outcomes, heads and tails, so we can write S = {H, T}. If two coins are tossed in sequence, we can write the four outcomes as S = {HH, HT, TH, TT}.
Definition 2 An event is any collection of possible outcomes of an experiment, that is, any subset of S (including S itself).
Continuing the two-coin example, one event is A = {HH, HT}, the event that the first coin is heads. Another event is the empty set ∅. A third event is {HH}.
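As a quick illustration (mine, not part of the notes), the two-coin sample space and these events can be represented as Python sets, and the defining property of an event, being a subset of S, checked directly:

```python
# Sample space and events for the two-coin experiment (illustrative sketch).
S = {"HH", "HT", "TH", "TT"}   # sample space
A = {"HH", "HT"}               # event: the first coin is heads
empty = set()                  # the empty event
single = {"HH"}                # the event {HH}

# every event is a subset of the sample space
assert A <= S and empty <= S and single <= S
```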
Definition 3 A collection of subsets of S is called a sigma algebra, denoted by B, if it satisfies the following three properties:
1. ∅ ∈ B.
2. If A ∈ B, then A^C ∈ B, where A^C is the complement of A.
3. If A1, A2, … ∈ B, then ∪_{i=1}^∞ Ai ∈ B.
The collection of the two sets {∅, S} is an example of a (trivial) sigma algebra. Associated with the sample space S we can have many different sigma algebras. The only sigma algebra we will be concerned with is the smallest one that contains all of the open sets in a given sample space S.
If S is finite or countable, then these technicalities do not arise and we define B to be the collection of all subsets of S. In general, if S is uncountable, it is not an easy task to describe B. However, B is chosen to contain any set of interest.
Example 4 Let S = (−∞, ∞), the real line. Then B is chosen to contain all sets of the form [a, b], (a, b), (a, b], and [a, b).
Definition 5 Given a sample space S and an associated sigma algebra B, a probability function is a function P with domain B that satisfies:
1. P(A) ≥ 0 for all A ∈ B.
2. P(S) = 1.
3. If A1, A2, … ∈ B are pairwise disjoint (that is, Ai ∩ Aj = ∅ for all i ≠ j), then P(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai).
If P(B) > 0, the conditional probability of the event A given the event B is

P(A|B) = P(A ∩ B) / P(B).

For any B, the conditional probability function is a valid probability function where S has been replaced by B. Rearranging the definition, we can write

P(A ∩ B) = P(A|B) P(B),

which is often useful.
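Under equally likely outcomes, conditional probabilities reduce to counting, which makes the definition easy to check by hand. A minimal sketch on the two-coin sample space (the event names are my own choices):

```python
from fractions import Fraction

S = {"HH", "HT", "TH", "TT"}           # equally likely outcomes

def prob(event):
    """P(event) under the uniform distribution on S."""
    return Fraction(len(event), len(S))

A = {"HH", "HT"}                       # first coin heads
B = {"HH", "HT", "TH"}                 # at least one head

p_cond = prob(A & B) / prob(B)         # P(A|B) = (1/2)/(3/4) = 2/3
assert p_cond == Fraction(2, 3)
assert prob(A & B) == p_cond * prob(B) # the rearranged identity
```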
We can say that the occurrence of B carries no information about the likelihood of event A when P(A|B) = P(A), in which case we find

P(A ∩ B) = P(A) P(B).     (1)

We say that the events A and B are statistically independent when (1) holds. Furthermore, we say that the collection of events A1, …, An is mutually independent if for any subcollection Ai1, …, Aik we have

P(∩_{j=1}^k Aij) = ∏_{j=1}^k P(Aij).

Let me emphasize that this must hold for any subcollection. For instance, pairwise independence does not imply joint independence.
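The classical counterexample can be verified directly: with two fair coins, take A = first coin heads, B = second coin heads, C = both coins show the same face. A sketch of my own, not from the notes:

```python
from fractions import Fraction
from itertools import combinations

S = {"HH", "HT", "TH", "TT"}

def prob(event):
    return Fraction(len(event), len(S))

A = {"HH", "HT"}   # first coin heads
B = {"HH", "TH"}   # second coin heads
C = {"HH", "TT"}   # both coins agree

# each pair of events is independent...
for E, F in combinations((A, B, C), 2):
    assert prob(E & F) == prob(E) * prob(F)

# ...but the three events are not jointly independent: 1/4 != 1/8
assert prob(A & B & C) != prob(A) * prob(B) * prob(C)
```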
Random variables
A random variable X is a function from a sample space S into the real line. This induces a new sample space, the real line, and a new probability function on the real line. Typically, we denote random variables by uppercase letters such as X, and use lowercase letters such as x for potential and realized values.
Example 6 Experiment: toss two dice; random variable: X = sum of the numbers.
Example 7 Experiment: toss a coin 25 times; random variable: X = number of heads.
Example 8 Experiment: obtain a master's degree; random variable: X = hourly wage one year later.
For a random variable X we define its cumulative distribution function (cdf) as

F(x) = P(X ≤ x).

Sometimes we write this as FX(x) to denote that it is the cdf of X. A function is a cdf if and only if the following properties hold:
1. lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1.
2. F(x) is nondecreasing in x.
3. F(x) is right-continuous.
We say that a random variable is discrete if F(x) is a step function. In this case, the range of X consists of a countable set of real numbers x1, x2, …. The size of the jump at any point x is equal to f(x) ≡ P(X = x), which is called the probability mass function (pmf). By definition,

F(x) = Σ_{i: xi ≤ x} f(xi).     (2)
We say that the random variable X is continuous if F(x) is continuous in x. In this case P(X = x) = 0 for all x ∈ ℝ. The analog of (2) in the continuous case is to substitute an integral for the sum, and we get

F(x) = ∫_{−∞}^x f(t) dt.     (3)

Thus, we define the probability density function (pdf) of a continuous random variable as

f(x) = dF(x)/dx,

such that

P(a ≤ X ≤ b) = ∫_a^b f(u) du

for any real a and b with a < b. These expressions only make sense if F(x) is differentiable. While there are examples of continuous random variables which do not possess a pdf, these cases are highly unusual and do not play any role in econometrics.
Theorem 9 A function f(x) is a pmf (or pdf) of a random variable X if and only if
1. f(x) ≥ 0 for all x.
2. Σ_x f(x) = 1 (pmf) or ∫_{−∞}^∞ f(x) dx = 1 (pdf).
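Theorem 9 can be checked directly for the two-dice sum of Example 6; the sketch below (my own illustration) tabulates the pmf with exact rational arithmetic and verifies both conditions:

```python
from fractions import Fraction
from itertools import product

# pmf of X = sum of two fair dice (Example 6)
pmf = {}
for a, b in product(range(1, 7), repeat=2):
    pmf[a + b] = pmf.get(a + b, Fraction(0)) + Fraction(1, 36)

assert all(p >= 0 for p in pmf.values())  # condition 1: nonnegativity
assert sum(pmf.values()) == 1             # condition 2: probabilities sum to 1
assert pmf[7] == Fraction(1, 6)           # the most likely sum
```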
Expectation
The expected value, or mean, or expectation, of a random variable is merely its average value, where we speak of the average value as one that is weighted according to the probability distribution.
For any measurable real function g, we define the mean or expectation E[g(X)] as follows. If X is discrete,

E[g(X)] = Σ_i g(xi) f(xi),

and if X is continuous,

E[g(X)] = ∫_{−∞}^∞ g(x) f(x) dx,

provided that the sum or integral exists. If E[|g(X)|] = ∞, we say that E[g(X)] does not exist.
The expectation is a linear operator:

E[a g1(X) + b g2(X) + c] = a E[g1(X)] + b E[g2(X)] + c.
The expected value of a random variable has another property: it minimizes the mean squared distance from the random variable,

E[X] = argmin_{b∈ℝ} E[(X − b)^2].
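This minimization property can be verified numerically for the two-dice sum: over a grid of candidate values b, the mean squared distance E[(X − b)^2] is smallest at b = E[X] = 7. A sketch of my own:

```python
from itertools import product

outcomes = [a + b for a, b in product(range(1, 7), repeat=2)]  # 36 equally likely sums

def mse(b):
    """E[(X - b)^2] under the uniform distribution on the 36 outcomes."""
    return sum((x - b) ** 2 for x in outcomes) / 36

mean = sum(outcomes) / 36                   # E[X] = 7.0
grid = [b / 10 for b in range(20, 121)]     # candidates 2.0, 2.1, ..., 12.0
best = min(grid, key=mse)
assert best == mean                         # minimized exactly at the mean
```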
For each integer n, we define the nth moment of X as E[X^n]. The first moment is simply called the mean and is denoted by μ ≡ E[X]. The nth central moment of X is defined as E[(X − μ)^n]. The second central moment is commonly known as the variance,

Var(X) ≡ E[(X − μ)^2],

and its square root is the standard deviation. The standard deviation is easier to interpret in that its measurement unit is the same as that of the original variable X. The measurement unit of the variance is the square of the original unit.
Theorem 10 If X is a random variable with finite variance, then for any constants a and b,

Var(aX + b) = a^2 Var(X).
Theorem 11 If X is a random variable with finite variance, then

Var(X) = E[X^2] − (E[X])^2.
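Both theorems can be verified exactly for the two-dice sum with rational arithmetic (again a sketch of my own, not part of the notes):

```python
from fractions import Fraction
from itertools import product

outcomes = [a + b for a, b in product(range(1, 7), repeat=2)]
p = Fraction(1, 36)                       # each of the 36 outcomes equally likely

EX = sum(x * p for x in outcomes)         # E[X] = 7
EX2 = sum(x * x * p for x in outcomes)    # E[X^2]
var = sum((x - EX) ** 2 * p for x in outcomes)

assert var == EX2 - EX ** 2               # Theorem 11
a, b = 3, 5                               # Theorem 10 with arbitrary constants
var_ab = sum((a * x + b - (a * EX + b)) ** 2 * p for x in outcomes)
assert var_ab == a ** 2 * var
```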
Definition 12 An n-dimensional random vector is a function from a sample space S into ℝ^n, n-dimensional Euclidean space.
For simplicity (and because no insight is gained by considering higher-dimensional vectors) we will consider only n = 2, so our random vector is the ordered pair (X, Y).
Example 13 Consider the experiment of tossing a coin three times, and define X and Y as two numerical functions of the outcome. In this way we have defined the bivariate random vector (X, Y). This random vector is called a discrete random vector because it has only a countable (in this case, finite) number of possible values.
Definition 14 The joint cdf of a bivariate random vector (X, Y) is

F(x, y) = P(X ≤ x, Y ≤ y).
Definition 15 Let (X, Y) be a discrete bivariate random vector. Then the function f(x, y) from ℝ^2 into ℝ defined by f(x, y) ≡ P(X = x, Y = y) is called the joint probability mass function (or joint pmf) of (X, Y).
Definition 16 Let (X, Y) be an absolutely continuous bivariate random vector. Then the function f(x, y) from ℝ^2 into ℝ defined by

f(x, y) = ∂^2 F(x, y) / ∂x∂y

is called the joint probability density function (or joint pdf) of (X, Y).
The joint pmf (or pdf) fX,Y(x, y) satisfies
1. fX,Y(x, y) ≥ 0 for every (x, y) ∈ ℝ^2.
2. Σ_{(x,y)∈ℝ^2} fX,Y(x, y) = 1 (pmf) or ∫_{−∞}^∞ ∫_{−∞}^∞ fX,Y(x, y) dy dx = 1 (pdf).
The variable X is itself a random variable in the sense of section 2, and its probability distribution is described by its pmf, namely fX(x) = P(X = x). We now call fX(x) the marginal pmf (or pdf) of X to emphasize the fact that it is the pmf (or pdf) of X only, but in the context of the probability model that gives the joint distribution of the vector (X, Y). The marginal pmfs (or pdfs) are easily calculated from the joint pmf (or pdf):

fX(x) = Σ_{y∈ℝ} fX,Y(x, y) and fY(y) = Σ_{x∈ℝ} fX,Y(x, y),

or

fX(x) = ∫_{−∞}^∞ fX,Y(x, y) dy and fY(y) = ∫_{−∞}^∞ fX,Y(x, y) dx.
The marginal pmf of X and Y can be used to compute probabilities or expectations that involve only X or
Y . But to compute a probability or expectation that simultaneously involves both X and Y , we must use
the joint pmf. It is important to note that the joint pmf uniquely determines the marginals but the reverse
is not true.
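The last point can be made concrete: the two joint pmfs below, one for independent fair bits and one for perfectly dependent fair bits, have identical uniform marginals on {0, 1} yet are different joint distributions. A sketch of my own:

```python
from fractions import Fraction

# independent fair bits vs. perfectly dependent fair bits
joint_indep = {(x, y): Fraction(1, 4) for x in (0, 1) for y in (0, 1)}
joint_equal = {(0, 0): Fraction(1, 2), (0, 1): Fraction(0),
               (1, 0): Fraction(0), (1, 1): Fraction(1, 2)}

def marginal_x(joint):
    """f_X(x) = sum over y of f_{X,Y}(x, y)."""
    m = {}
    for (x, y), p in joint.items():
        m[x] = m.get(x, Fraction(0)) + p
    return m

assert marginal_x(joint_indep) == marginal_x(joint_equal)  # same marginals...
assert joint_indep != joint_equal                          # ...different joints
```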
Definition 17 For any measurable function g(X, Y),

E[g(X, Y)] = Σ_{(x,y)∈ℝ^2} g(x, y) fX,Y(x, y)

in the discrete case, and

E[g(X, Y)] = ∫_{−∞}^∞ ∫_{−∞}^∞ g(x, y) fX,Y(x, y) dy dx

in the continuous case.
Definition 18 Let (X, Y) be a discrete (continuous) bivariate random vector with joint pmf (pdf) fX,Y(x, y) and marginal pmfs (pdfs) fX(x) and fY(y). For any x such that fX(x) > 0, the conditional pmf (pdf) of Y given X = x is

fY|X(y|x) = fX,Y(x, y) / fX(x).

Note that we are conditioning on an event of probability 0 when the variables are continuous. The definition does, however, define a valid pdf.
Conditional pmfs and pdfs can be used in exactly the same way as other univariate pmfs and pdfs. We can get the conditional expected value of g(Y) given X = x as

E[g(Y)|X = x] = Σ_y g(y) fY|X(y|x) (discrete)

or

E[g(Y)|X = x] = ∫_{−∞}^∞ g(y) fY|X(y|x) dy (continuous).

In particular we can get the conditional mean and variance of Y given X = x. Note that the conditional variance is defined as

Var(Y|X = x) = E[(Y − E[Y|X = x])^2 | X = x]
             = E[Y^2|X = x] − (E[Y|X = x])^2.
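For a concrete check, take X = result of the first of two fair coin tosses (0 or 1) and Y = total number of heads; the conditional pmf, mean and variance of Y given X = 1 follow directly from the definitions. My own sketch:

```python
from fractions import Fraction

# joint pmf of (X, Y): X = first toss, Y = total heads in two fair tosses
joint = {(0, 0): Fraction(1, 4), (0, 1): Fraction(1, 4),
         (1, 1): Fraction(1, 4), (1, 2): Fraction(1, 4)}

x = 1
fx = sum(p for (a, _), p in joint.items() if a == x)         # marginal f_X(1) = 1/2
cond = {y: p / fx for (a, y), p in joint.items() if a == x}  # f_{Y|X}(y|1)
assert sum(cond.values()) == 1                               # a valid pmf

Ey = sum(y * p for y, p in cond.items())                     # E[Y|X=1] = 3/2
Ey2 = sum(y * y * p for y, p in cond.items())                # E[Y^2|X=1] = 5/2
assert Ey2 - Ey ** 2 == Fraction(1, 4)                       # Var(Y|X=1)
```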
The covariance of X and Y is defined as

Cov(X, Y) ≡ E[(X − E(X))(Y − E(Y))] = E[XY] − E[X] E[Y],

and the correlation of X and Y is

ρX,Y = Cov(X, Y) / √(Var(X) Var(Y)).

By the Cauchy–Schwarz inequality, (E[XY])^2 ≤ E[X^2] E[Y^2], which implies −1 ≤ ρX,Y ≤ 1, with |ρX,Y| = 1 if and only if Y is a linear function of X.
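Both expressions for the covariance, and the bound on the correlation, can be checked on the same two-toss example (X = first toss, Y = total heads); a sketch of my own:

```python
from itertools import product
from math import sqrt

# equally likely points of (X, Y): X = first toss, Y = total heads
pts = [(a, a + b) for a, b in product((0, 1), repeat=2)]
n = len(pts)

EX = sum(x for x, _ in pts) / n
EY = sum(y for _, y in pts) / n
cov = sum((x - EX) * (y - EY) for x, y in pts) / n       # definition
cov_alt = sum(x * y for x, y in pts) / n - EX * EY       # shortcut formula
assert abs(cov - cov_alt) < 1e-12

var_x = sum((x - EX) ** 2 for x, _ in pts) / n
var_y = sum((y - EY) ** 2 for _, y in pts) / n
rho = cov / sqrt(var_x * var_y)                          # correlation
assert -1 <= rho <= 1
```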
Remark 33 This chapter outlined the mathematical foundations on which the statistical developments are based. However, in the following we shall ignore as far as possible the technical difficulties and instead concentrate on the statistical issues. In particular, we shall pay little or no attention to three technical difficulties that occur throughout:
1. The estimators that will be derived need to be measurable. We shall not check that this requirement is satisfied. In practice, the sets and functions usually turn out to be measurable, although verification of their measurability can be quite difficult.
2. Typically, the estimators are also required to be integrable. This condition will be tacitly assumed, even if it is not as universally satisfied as measurability.
3. We will sometimes assume that we can interchange the order of differentiation and integration. There are several results in calculus that give conditions under which this operation is legitimate (e.g. the dominated convergence theorem). We will not check these conditions in our examples.
Remark 34 These notes are a (very short!) summary of the first 4 chapters of Casella and Berger (2001). This book is naturally recommended for many more results, examples and proofs. There exist many introductory books that are easier to read if you have never studied probability before. For instance, Anderson, Sweeney, Williams, Camm, and Cochran (2013) provides a very informal introduction to probability and statistics.
References
Anderson, D. R., D. J. Sweeney, T. A. Williams, J. D. Camm, and J. J. Cochran (2013): Statistics for Business & Economics. Cengage Learning.
Casella, G., and R. L. Berger (2001): Statistical Inference, second edition. Brooks/Cole.