
Probability theory

The subject of probability theory is the foundation upon which all of statistics/econometrics is built, providing a means for modelling populations, experiments, or almost anything else that could be considered a random phenomenon. Through these models, econometricians are able to draw inferences about populations, inferences based on examination of only a part of the whole.

Foundations

Definition 1 The set, $S$, of all possible outcomes of an experiment is called the sample space for the experiment.
Take the simple example of tossing a coin. There are two outcomes, heads and tails, so we can write $S = \{H, T\}$. If two coins are tossed in sequence, we can write the four outcomes as $S = \{HH, HT, TH, TT\}$.
Definition 2 An event is any collection of possible outcomes of an experiment, that is, any subset of $S$ (including $S$ itself).
Continuing the two-coin example, one event is $A = \{HH, HT\}$, the event that the first coin is heads. Another event is $\emptyset$. A third event is $\{HH\}$.
Definition 3 A collection of subsets of $S$ is called a sigma algebra, denoted by $\mathcal{B}$, if it satisfies the following three properties:
1. $\emptyset \in \mathcal{B}$.
2. If $A \in \mathcal{B}$, then $A^{C} \in \mathcal{B}$, where $A^{C}$ is the complement of $A$.
3. If $A_1, A_2, \ldots \in \mathcal{B}$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{B}$.
The collection of the two sets $\{\emptyset, S\}$ is an example of a (trivial) sigma algebra. Associated with the sample space $S$ we can have many different sigma algebras. The only sigma algebra we will be concerned with is the smallest one that contains all of the open sets in a given sample space $S$.
If $S$ is finite or countable, then these technicalities do not arise and we define $\mathcal{B}$ to be the collection of all subsets of $S$. In general, if $S$ is uncountable, it is not an easy task to describe $\mathcal{B}$. However, $\mathcal{B}$ is chosen to contain any set of interest.
Example 4 Let $S = (-\infty, \infty)$, the real line. Then $\mathcal{B}$ is chosen to contain all sets of the form $[a, b]$, $(a, b)$, $(a, b]$, and $[a, b)$ for all real numbers $a$ and $b$.


We can now give the axiomatic definition of probability.
Definition 5 Given a sample space $S$ and an associated sigma algebra $\mathcal{B}$, a probability function is a function $P$ with domain $\mathcal{B}$ that satisfies
1. $P(A) \ge 0$ for all $A \in \mathcal{B}$.
2. $P(S) = 1$.
3. If $A_1, A_2, \ldots \in \mathcal{B}$ are pairwise disjoint (that is, if $A_i \cap A_j = \emptyset$ for all $i \ne j$), then $P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$.
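As a simple illustration, take the roll of a fair die, so that $S = \{1, 2, \ldots, 6\}$ and $\mathcal{B}$ is the collection of all subsets of $S$. Assigning $P(A) = (\text{number of elements of } A)/6$ satisfies all three axioms: it is nonnegative, $P(S) = 6/6 = 1$, and the probability of a union of disjoint events is the sum of their probabilities, e.g. $P(\{1\} \cup \{2\}) = 2/6 = P(\{1\}) + P(\{2\})$.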

If $P(B) > 0$, the conditional probability of the event $A$ given the event $B$ is
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$
For any $B$, the conditional probability function is a valid probability function where $S$ has been replaced by $B$. Rearranging the definition, we can write
$$P(A \cap B) = P(A \mid B)\, P(B),$$
which is often useful.
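For instance, in the two-coin example, let $B = \{HH, HT, TH\}$ be the event that at least one coin is heads and $A = \{HH\}$ the event that both are heads. Then $P(A \mid B) = P(A \cap B)/P(B) = (1/4)/(3/4) = 1/3$, whereas the unconditional probability is $P(A) = 1/4$.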
We can say that the occurrence of $B$ provides no information about the likelihood of event $A$ when $P(A \mid B) = P(A)$, in which case we find
$$P(A \cap B) = P(A)\, P(B). \qquad (1)$$
We say that the events $A$ and $B$ are statistically independent when (1) holds. Furthermore, we say that the collection of events $A_1, \ldots, A_k$ is jointly independent if for any subcollection $A_{i_1}, \ldots, A_{i_m}$ we have
$$P\left(\bigcap_{j=1}^{m} A_{i_j}\right) = \prod_{j=1}^{m} P\left(A_{i_j}\right).$$
Let me emphasize that this must hold for any subcollection. For instance, pairwise independence does not imply joint independence.
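A classic illustration: toss two fair coins and let $A$ be the event that the first coin is heads, $B$ the event that the second coin is heads, and $C$ the event that exactly one coin is heads. Each pair of these events is independent (each event has probability $1/2$ and each pairwise intersection has probability $1/4$), yet $P(A \cap B \cap C) = 0 \ne 1/8 = P(A)\,P(B)\,P(C)$, so the three events are not jointly independent.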

Random variables

A random variable $X$ is a function from a sample space $S$ into the real line. This induces a new sample space, the real line, and a new probability function on the real line. Typically, we denote random variables by uppercase letters such as $X$, and use lowercase letters such as $x$ for potential values and realized values.
Example 6 Experiment: toss two dice; random variable: $X =$ sum of the numbers.
Example 7 Experiment: toss a coin 25 times; random variable: $X =$ number of heads.
Example 8 Experiment: obtain a master's degree; random variable: $X =$ hourly wage one year later.
For a random variable $X$ we define its cumulative distribution function (cdf) as
$$F(x) = P(X \le x).$$

Sometimes we write this as $F_X(x)$ to denote that it is the cdf of $X$. A function is a cdf if and only if the following properties hold:
1. $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$.
2. $F(x)$ is nondecreasing in $x$.
3. $F(x)$ is right-continuous.
We say that a random variable is discrete if $F(x)$ is a step function. In this case, the range of $X$ consists of a countable set of real numbers, $\tau_1, \tau_2, \ldots$ The size of the jump at any point $x$ is equal to $f(x) \equiv P(X = x)$, which is called the probability mass function (pmf). By definition,
$$F(x) = \sum_{i : \tau_i \le x} f(\tau_i). \qquad (2)$$
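For example, if $X$ is the number of heads in two tosses of a fair coin, its pmf is $f(0) = 1/4$, $f(1) = 1/2$, $f(2) = 1/4$, and its cdf is a step function with $F(x) = 0$ for $x < 0$, $F(x) = 1/4$ for $0 \le x < 1$, $F(x) = 3/4$ for $1 \le x < 2$, and $F(x) = 1$ for $x \ge 2$.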

We say that the random variable $X$ is continuous if $F(x)$ is continuous in $x$. In this case $P(X = x) = 0$ for all $x \in \mathbb{R}$. The analog of (2) in the continuous case is to substitute integrals for sums, and we get
$$F(x) = \int_{-\infty}^{x} f(t)\, dt. \qquad (3)$$

Thus, we define the probability density function (pdf) of a continuous random variable as
$$f(x) = \frac{dF(x)}{dx}$$
such that
$$P(a \le X \le b) = \int_{a}^{b} f(u)\, du$$
for any real $a$ and $b$ with $a < b$. These expressions only make sense if $F(x)$ is differentiable. While there are examples of continuous random variables which do not possess a pdf, these cases are highly unusual and do not play any role in econometrics.
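For example, a random variable that is uniformly distributed on $[0, 1]$ has cdf $F(x) = x$ for $x \in [0, 1]$ (with $F(x) = 0$ below and $F(x) = 1$ above this interval) and pdf $f(x) = 1$ on $[0, 1]$, so that, say, $P(0.2 \le X \le 0.5) = \int_{0.2}^{0.5} 1\, du = 0.3$.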
Theorem 9 A function $f(x)$ is a pmf (or pdf) of a random variable $X$ if and only if
1. $f(x) \ge 0$ for all $x$.
2. $\sum_{x} f(x) = 1$ (pmf) or $\int_{-\infty}^{\infty} f(x)\, dx = 1$ (pdf).

Expectation

The expected value, or mean, or expectation, of a random variable is merely its average value, where the average is weighted according to the probability distribution.
For any measurable real function $g$, we define the mean or expectation $E[g(X)]$ as follows. If $X$ is discrete,
$$E[g(X)] = \sum_{i} g(\tau_i) f(\tau_i),$$
and if $X$ is continuous,
$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\, dx,$$
provided that the sum or integral exists. If $E[|g(X)|] = \infty$, we say that $E[g(X)]$ does not exist.
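For instance, if $X$ is the value shown by a fair die, then $E[X] = \sum_{i=1}^{6} i \cdot \frac{1}{6} = 3.5$; note that the expected value need not be a value the random variable can actually take.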
The expectation is a linear operator:
$$E[a\, g_1(X) + b\, g_2(X) + c] = a\, E[g_1(X)] + b\, E[g_2(X)] + c.$$
The expected value of a random variable has another property: it minimizes the mean squared distance from the random variable,
$$E[X] = \arg\min_{b \in \mathbb{R}} E\left[(X - b)^2\right].$$

For each integer $n$, we define the $n$th moment of $X$ as $E[X^n]$. The first moment is simply called the mean and is denoted by $\mu \equiv E[X]$. The $n$th central moment of $X$ is defined as $E[(X - \mu)^n]$. The second central moment is commonly known as the variance,
$$Var(X) \equiv E\left[(X - \mu)^2\right],$$
and its square root is the standard deviation. The standard deviation is easier to interpret in that the measurement unit of the standard deviation is the same as that of the original variable $X$. The measurement unit of the variance is the square of the original unit.
Theorem 10 If $X$ is a random variable with finite variance, then for any constants $a$ and $b$,
$$Var(aX + b) = a^2 Var(X).$$
Theorem 11 If $X$ is a random variable with finite variance, then
$$Var(X) = E\left[X^2\right] - \left(E[X]\right)^2.$$
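Continuing the die example, $E[X^2] = (1 + 4 + 9 + 16 + 25 + 36)/6 = 91/6$, so $Var(X) = 91/6 - (3.5)^2 = 35/12 \approx 2.92$, and the standard deviation is $\sqrt{35/12} \approx 1.71$.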

Multivariate random variables

Definition 12 An $n$-dimensional random vector is a function from a sample space $S$ into $\mathbb{R}^n$, $n$-dimensional Euclidean space.
For simplicity (and because no insight is gained by considering higher-dimensional vectors) we will consider only $n = 2$, so our random vector is the ordered pair $(X, Y)$.
Example 13 Consider the experiment of tossing a coin three times. Define
$$X = \text{number of heads on the first two tosses},$$
$$Y = \text{number of heads on all three tosses}.$$
In this way we have defined the bivariate random vector $(X, Y)$. This random vector is called a discrete random vector because it has only a countable (in this case, finite) number of possible values.
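Listing the eight equally likely outcomes shows, for example, that $P(X = 1, Y = 1) = 2/8$ (the outcomes $HTT$ and $THT$), $P(X = 2, Y = 3) = 1/8$ (the outcome $HHH$), and $P(X = 1, Y = 3) = 0$.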
Definition 14 The joint cdf of a bivariate random vector $(X, Y)$ is
$$F(x, y) = P(X \le x, Y \le y).$$
Definition 15 Let $(X, Y)$ be a discrete bivariate random vector. Then the function $f(x, y)$ from $\mathbb{R}^2$ into $\mathbb{R}$ defined by $f(x, y) \equiv P(X = x, Y = y)$ is called the joint probability mass function (or joint pmf) of $(X, Y)$.
Definition 16 Let $(X, Y)$ be an absolutely continuous bivariate random vector. Then the function $f(x, y)$ from $\mathbb{R}^2$ into $\mathbb{R}$ defined by
$$f(x, y) = \frac{\partial^2 F(x, y)}{\partial x \partial y}$$
is called the joint probability density function (or joint pdf) of $(X, Y)$.
The joint pmf (or pdf) $f_{X,Y}(x, y)$ satisfies
1. $f_{X,Y}(x, y) \ge 0$ for every $(x, y) \in \mathbb{R}^2$.
2. $\sum_{(x,y) \in \mathbb{R}^2} f_{X,Y}(x, y) = 1$ or $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy\, dx = 1$.

The variable $X$ is itself a random variable in the sense of section 2, and its probability distribution is described by its pmf, namely, $f_X(x) = P(X = x)$. We now call $f_X(x)$ the marginal pmf (or pdf) of $X$ to emphasize the fact that it is the pmf (or pdf) of $X$ only, but in the context of the probability model that gives the joint distribution of the vector $(X, Y)$. The marginal pmfs (or pdfs) are easily calculated from the joint pmf (or pdf):
$$f_X(x) = \sum_{y \in \mathbb{R}} f_{X,Y}(x, y) \quad \text{and} \quad f_Y(y) = \sum_{x \in \mathbb{R}} f_{X,Y}(x, y),$$
or
$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy \quad \text{and} \quad f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dx.$$
The marginal pmfs of $X$ and $Y$ can be used to compute probabilities or expectations that involve only $X$ or $Y$. But to compute a probability or expectation that simultaneously involves both $X$ and $Y$, we must use the joint pmf. It is important to note that the joint pmf uniquely determines the marginals, but the reverse is not true.
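In Example 13, for instance, summing the joint pmf $f_{X,Y}(x, y)$ over $y$ gives the marginal pmf of $X$: $f_X(0) = 1/4$, $f_X(1) = 1/2$, $f_X(2) = 1/4$, which is simply the distribution of the number of heads in two tosses of a fair coin.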
Definition 17 For any measurable function $g(X, Y)$,
$$E[g(X, Y)] = \sum_{(x,y) \in \mathbb{R}^2} g(x, y) f_{X,Y}(x, y)$$
for discrete rv or, for continuous rv,
$$E[g(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) f_{X,Y}(x, y)\, dy\, dx.$$

Conditional distribution and independence

Definition 18 Let $(X, Y)$ be a discrete (continuous) bivariate random vector with joint pmf (pdf) $f_{X,Y}(x, y)$ and marginal pmfs (pdfs) $f_X(x)$ and $f_Y(y)$. For any $x$ such that $f_X(x) > 0$, the conditional pmf (pdf) of $Y$ given $X = x$ is
$$f_{Y|X}(y \mid x) = \frac{f_{X,Y}(x, y)}{f_X(x)}.$$
Note that we are conditioning on an event of probability $0$ when the variables are continuous. The definition does, however, define a valid pdf.
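In Example 13, conditioning on $X = 1$ gives $f_{Y|X}(1 \mid 1) = (2/8)/(1/2) = 1/2$ and $f_{Y|X}(2 \mid 1) = 1/2$: given exactly one head in the first two tosses, the total number of heads is $1$ or $2$ with equal probability, depending only on the third toss.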
Conditional pmfs and pdfs can be used in exactly the same way as other univariate pmfs and pdfs. We can get the conditional expected value of $g(Y)$ given $X = x$ as
$$E[g(Y) \mid X = x] = \sum_{y} g(y) f_{Y|X}(y \mid x) \quad \text{(discrete)}$$
or
$$E[g(Y) \mid X = x] = \int g(y) f_{Y|X}(y \mid x)\, dy \quad \text{(continuous)}.$$
In particular we can get the conditional mean and variance of $Y$ given $X = x$. Note that the conditional variance is defined as
$$Var(Y \mid X = x) = E\left[(Y - E[Y \mid X = x])^2 \mid X = x\right] = E\left[Y^2 \mid X = x\right] - \left(E[Y \mid X = x]\right)^2.$$

Theorem 19 (law of iterated expectations) For any two rv $X$ and $Y$,
$$E[X] = E[E(X \mid Y)],$$
provided that the expectations exist.
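To see the theorem at work in Example 13, note that $E[Y \mid X = x] = x + 1/2$ (the heads already observed plus the expected contribution of the third toss), so $E[E(Y \mid X)] = \frac{1}{4}(0.5) + \frac{1}{2}(1.5) + \frac{1}{4}(2.5) = 1.5$, which is indeed $E[Y]$, the expected number of heads in three tosses of a fair coin.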
Theorem 20 (conditioning theorem) For any function $g(x)$,
$$E[g(X) Y \mid X] = g(X) E[Y \mid X].$$
Theorem 21 (law of iterated variance) For any two rv $X$ and $Y$,
$$Var[X] = E[Var(X \mid Y)] + Var[E(X \mid Y)],$$
provided that the expectations exist.
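Similarly, in Example 13, $Var(Y \mid X = x) = 1/4$ for every $x$ (only the third toss remains uncertain), so $E[Var(Y \mid X)] = 1/4$, while $Var(E[Y \mid X]) = Var(X + 1/2) = Var(X) = 1/2$; the sum is $3/4$, which matches the variance of the number of heads in three fair tosses.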
Definition 22 Let $(X, Y)$ be a bivariate random vector with joint pmf or pdf $f_{X,Y}(x, y)$ and marginal pmfs or pdfs $f_X(x)$ and $f_Y(y)$. Then $X$ and $Y$ are independent random variables if and only if
$$f_{X,Y}(x, y) = f_X(x) f_Y(y) \quad \text{for every } x \in \mathbb{R} \text{ and } y \in \mathbb{R}.$$
Theorem 23 Let $(X, Y)$ be a bivariate random vector with joint pmf or pdf $f_{X,Y}(x, y)$. Then $X$ and $Y$ are independent if, and only if, there exist functions $g(x)$ and $h(y)$ such that
$$f_{X,Y}(x, y) = g(x) h(y) \quad \text{for every } x \in \mathbb{R} \text{ and } y \in \mathbb{R}.$$
Theorem 24 Let $X$ and $Y$ be independent random variables.
1. If $g(x)$ is a function of $x$ only and $h(y)$ is a function of $y$ only, then
$$E[g(X) h(Y)] = E[g(X)] E[h(Y)].$$
2. For any $A \subseteq \mathbb{R}$ and $B \subseteq \mathbb{R}$, the events $(X \in A)$ and $(Y \in B)$ are independent events.

Definition 25 The covariance of $X$ and $Y$ is the number defined by
$$Cov(X, Y) = E[(X - E(X))(Y - E(Y))].$$

Theorem 26 For any rv $X$ and $Y$,
$$Cov(X, Y) = E[XY] - E[X] E[Y].$$
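In Example 13, $E[XY] = 1 \cdot 1 \cdot \frac{1}{4} + 1 \cdot 2 \cdot \frac{1}{4} + 2 \cdot 2 \cdot \frac{1}{8} + 2 \cdot 3 \cdot \frac{1}{8} = 2$, so $Cov(X, Y) = 2 - 1 \times 1.5 = 0.5$: the two counts covary positively because they share the first two tosses.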

Theorem 27 If $X$ and $Y$ are independent rv, then $Cov(X, Y) = 0$.


Remark 28 The reverse is not true!
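A standard counterexample: let $X$ take the values $-1$, $0$, and $1$ with probability $1/3$ each and let $Y = X^2$. Then $Cov(X, Y) = E[X^3] - E[X] E[X^2] = 0$, yet $X$ and $Y$ are clearly not independent, since for instance $P(X = 0, Y = 1) = 0 \ne P(X = 0) P(Y = 1) = \frac{1}{3} \cdot \frac{2}{3}$.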
Theorem 29 If $X$ and $Y$ are any two rv and $a$, $b$ and $c$ are any three constants, then
$$Var(aX + bY + c) = a^2 Var(X) + b^2 Var(Y) + 2ab\, Cov(X, Y).$$
Theorem 30 The correlation coefficient of $X$ and $Y$ is the number defined by
$$\rho_{X,Y} = \frac{Cov(X, Y)}{\sqrt{Var(X) Var(Y)}}.$$
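In Example 13, $\rho_{X,Y} = 0.5/\sqrt{0.5 \times 0.75} \approx 0.82$, a strong but not perfect positive association between the two counts.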

Theorem 31 (Cauchy-Schwarz inequality) For any two random variables $X$ and $Y$,
$$\left(E[XY]\right)^2 \le E\left[X^2\right] E\left[Y^2\right].$$

Theorem 32 For any two rv $X$ and $Y$,
1. $-1 \le \rho_{X,Y} \le 1$.
2. $|\rho_{X,Y}| = 1$ if and only if there exist numbers $a \ne 0$ and $b$ such that $P(Y = aX + b) = 1$. If $\rho_{X,Y} = 1$, then $a > 0$, and if $\rho_{X,Y} = -1$, then $a < 0$.

Remark 33 This chapter outlined the mathematical foundations on which the statistical developments are based. However, in the following we shall ignore as far as possible the technical difficulties and instead concentrate on the statistical issues. In particular, we shall pay little or no attention to three technical difficulties that occur throughout:
1. The estimators that will be derived need to be measurable. We shall not check that this requirement is satisfied. In practice, the sets and functions usually turn out to be measurable, although verification of their measurability can be quite difficult.
2. Typically, the estimators are also required to be integrable. This condition will be tacitly assumed, even if it is not as universally satisfied as measurability.
3. We will sometimes assume that we can interchange the order of differentiation and integration. There are several results in calculus that give conditions under which this operation is legitimate (e.g. the dominated convergence theorem). We will not check these conditions in our examples.
Remark 34 These notes are a (very short!) summary of the first four chapters of Casella and Berger (2001). This book is naturally recommended for many more results, examples and proofs. There exist many introductory books that are easier to read if you have never studied probability before. For instance, Anderson, Sweeney, Williams, Camm, and Cochran (2013) provide a very informal introduction to probability and statistics.

References
Anderson, D. R., D. J. Sweeney, T. A. Williams, J. D. Camm, and J. J. Cochran (2013): Statistics for Business & Economics. Cengage Learning.
Casella, G., and R. L. Berger (2001): Statistical Inference, second edition. Brooks/Cole.
