
Probability

201A: Econometrics I
Peter E. Rossi
Anderson School of Management

2018

Outline

1. Role of Probability in Econometrics
2. Sample Spaces and Events
3. Probability
4. Interpretations of Probability
5. Finite Sample Spaces and Counting Rules
6. Independence
7. Conditional Probability
8. Bayes Theorem
9. Bayes Theorem and the Weight of Evidence for a Hypothesis
10. Classic Game Show Host Problem

Role of Probability in Econometrics

- Probability is a "language," a mathematical representation of uncertainty.
- All of econometrics starts with a data generating mechanism or model.
- To be useful, models must have random components based on probability theory.
- Probability can also be used to summarize empirical findings and make predictions:
  - How likely is it that a theory is true or represents the data?
  - If I (an agent or firm) undertake action a, what is the probability distribution of outcomes (and what does this have to do with data)?

Sample Spaces and Events

One way to think about data is that it arises as the outcome of an "experiment."
The sample space, Ω, is the set of possible outcomes of the experiment.
ω ∈ Ω are the sample outcomes, realizations, or elements.
Subsets of Ω are called events.
Examples:
1. Toss a coin twice. Ω = {HT, TH, HH, TT}. An event is that the sequence of coin tosses ends in a head, A = {HH, TH}.
2. Compute the return on holding a stock for one period. Ω = (−∞, ∞) (is this ok?). An event is that returns are positive, A = (0, ∞).
3. Expose a market to an ad and measure the sales response. Ω = [0, ∞). An event might be that the profits on incremental sales exceed the cost of the ad (again, this specifies an interval of the real line).

Set Operations

Complement. A^c = {ω ∈ Ω : ω ∉ A} (note ":" means "such that"). Ω^c = ∅. Sometimes the complement of A is referred to as "not A."
Union. ∪_{i=1}^∞ Ai = {ω ∈ Ω : ω ∈ Ai for at least one i}. For two sets, we sometimes say "A or B."
Intersection. ∩_{i=1}^∞ Ai = {ω ∈ Ω : ω ∈ Ai for all i}. For two sets, we sometimes say "A and B" and write AB.
Set Difference. A − B = {ω ∈ Ω : ω ∈ A, ω ∉ B}.
Set Inclusion. A ⊂ B. This means A is a subset of B.
Disjoint Sets. A1, A2, . . . are disjoint or mutually exclusive if Ai ∩ Aj = ∅ when i ≠ j.

Partitions, Indicator Functions, and Monotone Sequences
A partition of the sample space is a sequence of disjoint sets A1, A2, . . . such that ∪i Ai = Ω.
Example: consider a sequence of N coin tosses; the number of heads in each sequence defines a partition of the sample space. Looking ahead: if our goal is to make inferences regarding the probability of a head H given the realization of one sequence, does knowing which element of this partition the sequence falls in "summarize" the information in the sequence completely?
An indicator function of an event A is defined by

IA(ω) = 1 if ω ∈ A, and IA(ω) = 0 if ω ∉ A.

A sequence of sets A1, A2, . . . is monotone increasing if A1 ⊂ A2 ⊂ . . ., and we can define lim_{n→∞} An = ∪_{i=1}^∞ Ai.
Example: Ai = [0, 2 − 1/i). This is a monotone increasing sequence and lim_{n→∞} An = [0, 2).

Probability

Assign a real number, P(A), to each event. With some restrictions, this mapping from events to real numbers can be called a probability measure or probability distribution.
Axioms of Probability.
Axiom 1: P(A) ≥ 0.
Axiom 2: P(Ω) = 1.
Axiom 3: If A1, A2, . . . are disjoint, then P(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai).
How can we interpret what "probability" means or where this measure comes from? There are three major schools of thought.
- Frequency interpretation: probabilities are long-run frequencies (e.g. a coin toss).
- Subjective: probabilities represent degrees of belief.
- Probabilities are useful building blocks for models of observed data.

Interpretation of Probability

The frequency interpretation is strained for anything other than simple experiments. Empirical work always starts with one dataset. In most work, this data is NOT the outcome of an experiment but is passively observed. We want to make inferences and make predictions given this one dataset.
Examples:
- Returns on a risky asset: certainly these are random and can be characterized by a probability measure. Should we think that we have to run the stock market forward for an infinite amount of time and simply compute the frequency of any interval of returns to understand where this measure comes from?
- Consider the problem of evaluating a hypothesis (is a given theory correct given a dataset? not too different from: is a defendant guilty of a crime given the evidence?). What is the relevance of a frequency interpretation? Is this even possible?

Properties of Probability Measures

Some simple consequences of the Axioms are:
- P(∅) = 0 and P(A) ≤ 1.
- P(A^c) + P(A) = 1, or P(A^c) = 1 − P(A).
- P(A ∪ B) = P(A) + P(B) − P(AB).
Continuity of Probabilities. What does this mean? The probability of the limit is the limit of the probabilities: if a sequence of sets gets very close to a limit set, so will the probabilities. If An → A, then P(An) → P(A), where A = lim_{n→∞} An. The key observation to see why this is true is to express the limit as a union of disjoint sets and then recall that, since probabilities are nonnegative and sum to one, limits of sums of probabilities of disjoint sets must converge.
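
A quick numerical illustration of continuity, assuming (purely for illustration) a uniform probability measure on [0, 4], so P([0, x)) = x/4:

# An = [0, 2 - 1/n) is monotone increasing with limit A = [0, 2).
# Under the assumed uniform measure on [0, 4], P(A) = 0.5.
n <- c(1, 2, 5, 10, 100, 1000)
p_An <- punif(2 - 1/n, min = 0, max = 4)  # P(An) = (2 - 1/n)/4
cbind(n, p_An)                            # p_An increases toward 0.5 = P(A)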

Finite Sample Spaces and Counting Measure

Suppose the sample space is finite (is this much of a restriction?). Classic examples include the coin toss experiment, but this covers any case in which the set of possible outcomes is finite. If all outcomes are equally likely, then you get what is called a counting measure. That is, to determine the probability of an event, we simply count the number of outcomes in the event and express that as a ratio to the total number of possible outcomes:

P(A) = |A| / |Ω|

This probability measure is called the uniform probability measure. Computing probabilities with this sort of measure involves simply counting the total number of outcomes in the sample space and counting the number of outcomes that constitute an event. This can get somewhat tricky and is the subject of combinatorics. Given we have computers, these considerations are less relevant today than in the heyday of probability theory among starving Russian probabilists.
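
A minimal R sketch of the counting measure, using the two-coin-toss example from the Sample Spaces slide (the object names are illustrative):

omega <- c("HH", "HT", "TH", "TT")   # sample space, all outcomes equally likely
A <- c("HH", "TH")                   # event: sequence ends in a head
length(A) / length(omega)            # P(A) = |A|/|Omega| = 0.5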

How to Count if You Must

We only need to know two "counting" methods. Most other identities can be derived using these two ideas.
- The number of ways of ordering n objects is n! = n × (n − 1) × · · · × 2 × 1.
- The number of ways of choosing k objects from n is given by the so-called binomial coefficients (think of choosing a committee of size k from among n students):

  C(n, k) = n(n − 1) · · · (n − k + 1) / k! = n! / (k!(n − k)!)

  That is, the numerator is the number of ways of choosing k different people from n, but it counts identical committees more than once (e.g. (A, B) and (B, A)). We divide by k! to remove this "double-counting" of non-unique committees.
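
Both rules are built into R, so a quick sketch of the committee example needs no hand-counting:

factorial(5)                                   # 120 ways to order 5 objects
choose(10, 3)                                  # 120 committees of size 3 from 10 students
factorial(10) / (factorial(3) * factorial(7))  # same value, from the formula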

Independent Events

The concept of independence between events is a very difficult one. The approach taken in our text is to define independence as resulting from a product rule for the probability of the joint occurrence (intersection) of the events. Below we will explain this further using conditional probability.
Definition:
Two events, A and B, are independent if and only if

P(AB) = P(A) P(B)

Sometimes this is denoted A ⊥⊥ B or, by some economists, A ⊥ B. However, this definition is not very satisfactory, as it does not explain what independence means other than via a mathematical identity. Independence is often assumed as a property of the sampling experiment (e.g. we assume coin tosses are independent). Pseudo-random number generators used in statistical computing (more on this later) are constructed to appear as though their draws are independent.
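
A hedged simulation check of the product rule (this example and the seed are assumptions, not from the text):

set.seed(201)                       # assumed seed, for reproducibility
n <- 1e5
t1 <- sample(c("H", "T"), n, replace = TRUE)   # first toss
t2 <- sample(c("H", "T"), n, replace = TRUE)   # second, independent toss
mean(t1 == "H" & t2 == "H")         # P(AB), approximately 0.25
mean(t1 == "H") * mean(t2 == "H")   # P(A)P(B), approximately 0.25 as well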

Equally Probable and Disjoint Events
Equally probable events do not have to be independent!
The classic example is disjoint events. If we toss a coin, then the event of a head is equally probable as the event of a tail, even though the two events are not independent: for any two events with positive probability, P(A), P(B) > 0, while (by the definition of disjoint) P(AB) = 0, so the product rule fails.
Example of computations that exploit disjoint events and independence: consider the probability that at least one head is obtained in a sequence of n coin tosses. One way is to count all outcomes with at least one head and divide by the total number of outcomes. Another is to express the problem using the complementary event (no heads obtained):

P(A) = 1 − P(A^c) = 1 − P(all tails)
     = 1 − P(T1 · · · Tn)
     = 1 − P(T1) · · · P(Tn)
     = 1 − (1/2)^n
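
A sketch verifying the calculation both ways in R (n = 4 is an assumed example):

n <- 4
1 - (1/2)^n                                     # formula: 1 - P(all tails) = 0.9375
seqs <- expand.grid(rep(list(c("H", "T")), n))  # enumerate all 2^n sequences
mean(apply(seqs == "H", 1, any))                # counting measure gives the same 0.9375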

Birthday Problem

A classic problem illustrating a uniform probability measure and the use of complementary events to simplify counting is the Birthday Problem.
- Consider a class of n students. What is the probability that there are at least two students with the same birthday (month and day)?
- In order to solve this, we have to make some assumptions regarding the distribution of birthdays. If we assume that birthdays are uniformly distributed (all birthdays are equally likely), then we have simplified the problem: simply count the number of possible class configurations with two or more duplicate birthdays and divide by the total number of class configurations.
- How do we enumerate all possibilities with two or more duplicates? Might be hard without a computer. Key insight: express this problem as computing the probability of the complementary event.
- What is the probability that no birthdays will be identical in a class of size n?

Birthday Problem (continued)

Now let's compute the probability that no one has the same birthday. Let A be the event that there are two or more duplicate birthdays.

P(A^c) = P(B2 ≠ B1) × P(B3 ∉ {B1, B2}) × · · · × P(Bn ∉ {B1, . . . , Bn−1})
       = (365/365) × (364/365) × (363/365) × · · · × ((365 − (n − 1))/365)

How should we compute this? Let's illustrate in R. How would you do this if the probability of a given birthdate is not uniform across the 365 possibilities?

R Implementation
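
A minimal sketch of the computation, assuming a uniform 1/365 probability for each birthdate (the function and object names are illustrative; the non-uniform case would replace the equal 1/365 terms with date-specific probabilities):

bday_prob <- function(n) {
  # P(at least one shared birthday) = 1 - product over i of (365 - i)/365
  1 - prod((365 - 0:(n - 1)) / 365)
}
bday_prob(23)                          # about 0.507: a 23-student class more
                                       # likely than not shares a birthday
class_size <- 1:60
prob <- sapply(class_size, bday_prob)
plot(class_size, prob, type = "l", main = "Same Bday Probs by Class Size")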

R Implementation

[Plot: "Same Bday Probs by Class Size" — prob (0.0 to 1.0) against class_size (0 to 60); the curve rises steeply toward 1.]

Conditional Probability

Definition. If P(B) > 0, then the conditional probability of A given B is

P(A|B) = P(A and B) / P(B)

Some think about this in a frequency setting as the fraction of times we observe A among the events in which B occurs. It can also be applied in a subjectivist or non-frequency sense. Example: B is the event that there is a DNA match between a suspect and DNA material collected from the scene of a crime; A is the event that the suspect is guilty. We want to compute P(A|B) as a "rational" juror. This is the view that conditional probabilities tell us exactly how we should learn about an unobservable (guilt or innocence) given data (a DNA match).
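
A small counting illustration of the definition (an assumed example on the two-toss sample space):

omega <- c("HH", "HT", "TH", "TT")   # equally likely outcomes
B <- omega != "TT"                   # conditioning event: at least one head
A <- omega == "HH"                   # event: both tosses are heads
sum(A & B) / sum(B)                  # P(A|B) = P(AB)/P(B) = 1/3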

Conditional Probability as a Partition

Consider a 3-coin-toss experiment. The number of heads partitions the sample space of all coin toss sequences into four sets. Even if P(H) ≠ 1/2, the probability of each sequence within a given column of the table (each element of the partition) is the same. If the coin is not fair, then the sequences with more (fewer) heads will be more (less) probable. There is a sense in which this partition represents all information with respect to the probability of a head. We will make this precise later.

Heads:   0      1      2      3
        TTT    HTT    HHT    HHH
               THT    HTH
               TTH    THH
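
A sketch that reproduces this partition by enumeration (object names are illustrative):

tosses <- expand.grid(t1 = c("H", "T"), t2 = c("H", "T"), t3 = c("H", "T"),
                      stringsAsFactors = FALSE)
seqs  <- apply(tosses, 1, paste0, collapse = "")  # the 8 sequences
heads <- rowSums(tosses == "H")                   # number of heads in each
split(seqs, heads)                                # the four partition elements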

Conditional Probability and Independence

Lemma. If A and B are independent events, then

P(A|B) = P(A and B) / P(B) = P(A) P(B) / P(B) = P(A)

What this means is that if A and B are independent, we don't change our views regarding the probability of A occurring if we observe B.

Inverse Probability and Fallacies

Prosecutor's Fallacy. P(A|B) ≠ P(B|A). Sometimes P(A|B) is called the inverse probability of P(B|A). Just because event B is highly likely given A doesn't mean the inverse probability is large. Inverse probabilities are given by

P(A|B) = P(AB) / P(B) = P(B|A) P(A) / P(B)

This means that P(A|B) = P(B|A) only if the marginal probabilities of A and B are the same.
Why is it called the prosecutor's fallacy? Just because the observed evidence (a DNA match, eyewitness testimony) is highly probable given that the suspect committed the crime doesn't mean that the probability of guilt is large given the evidence: this depends on the prior probability of guilt as well as the occurrence of the evidence in the general population. If the evidence (driving a white car, showing up in a Checker cab) is common, then the probability of guilt given the evidence may be low.
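
A hedged numeric illustration (all three probabilities are assumptions chosen for illustration):

p_B_given_A <- 0.99    # P(evidence | guilt): near certain
p_A <- 0.001           # P(guilt): 1 in 1000 plausible suspects
p_B <- 0.05            # P(evidence): the evidence is common in the population
p_B_given_A * p_A / p_B  # P(guilt | evidence) is only about 0.02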

Bayes Theorem

Theorem. Let A1, . . . , Ak form a partition of the sample space such that P(Ai) > 0 for each i. If P(B) > 0, then

P(Ai|B) = P(B|Ai) P(Ai) / Σj P(B|Aj) P(Aj)

The denominator comes from the fact that the {Ai} constitute a partition and is sometimes called the "law of total probability." Some call P(Ai) the prior ("before") probability of the event, and P(Ai|B) is called the posterior ("after") probability. This is an updating formula that tells you how the event B updates probabilities from prior to posterior.
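
A small updating helper sketching the formula (the function name and example numbers are assumptions):

bayes_post <- function(prior, lik) {
  # prior: P(Aj) over a partition; lik: P(B | Aj) for each Aj
  prior * lik / sum(prior * lik)   # denominator: law of total probability
}
bayes_post(prior = c(1/2, 1/2), lik = c(0.3, 0.1))  # posterior 0.75, 0.25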

The Classic Game Show Host Problem

Problem. You are on a game show where you "compete" with the host for prizes. This problem is styled after "Let's Make a Deal," hosted by Monty Hall.
- There are three doors (doors A, B, C). There is a prize (a car) behind one of the doors and a "goat" or nothing behind the others.
- The host asks you to choose one door. Let's say you pick door A.
- The host reveals what is behind door B, and there is no prize behind it.
- The host asks if you want to switch from door A to door C (sometimes the host might ask you to "bet" or pay for this privilege).
- Should you switch?

The Classic Game Show Host Problem

Solution. Compute the correct conditional probability that the prize is behind door C given the information you have:

P(Prize behind C | Monty picks B, I pick A)

This should be a simple exercise in using Bayes Theorem to convert statements about the probability of Monty picking doors given where the prize is into probabilities of the prize's location given the doors shown. This problem has generated much controversy, including many mathematicians who get it wrong. The reasons:
- conditional probability reasoning is hard
- people forget that the host is strategic (the host is never going to show you the prize!)

The Classic Game Show Host Problem
Solution. Let C^P be the event that the prize is behind door C. Let B^M be the event that the host ("Monty") shows you door B. Let A^Ch be the event that you choose door A to begin with.

P(B^M | A^Ch) = P(B^M | A^P) P(A^P) + P(B^M | B^P) P(B^P) + P(B^M | C^P) P(C^P)
              = 0.5 × 1/3 + 0 × 1/3 + 1 × 1/3 = 1/2

P(C^P | B^M, A^Ch) = P(B^M | C^P, A^Ch) P(C^P | A^Ch) / P(B^M | A^Ch)
                   = (1 × 1/3) / (1/2) = 2/3

Here we are using two key insights:
- Monty will never show the prize (hence P(B^M | B^P) = 0)
- If the prize is behind door A, then Monty should randomize in showing door B or C (if he doesn't, he will reveal even more information than he already has).
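
A Monte Carlo check of the 2/3 result (a sketch; the seed is an assumption). Monty never opens your door or the prize door, and randomizes when both unchosen doors are empty:

set.seed(201)
n <- 1e5
prize <- sample(c("A", "B", "C"), n, replace = TRUE)      # car placed at random
monty <- ifelse(prize == "A",
                sample(c("B", "C"), n, replace = TRUE),   # randomize between B, C
                ifelse(prize == "B", "C", "B"))           # must avoid the prize
switch_to <- ifelse(monty == "B", "C", "B")  # you picked A; switch to other door
mean(switch_to == prize)                     # about 2/3; staying wins about 1/3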

