
Review of Probability Theory

Mário A. T. Figueiredo

Instituto Superior Técnico

&

Instituto de Telecomunicações

Lisboa, Portugal

LxMLS: Lisbon Machine Learning School

July 22, 2014

Mário A. T. Figueiredo (IST & IT), LxMLS 2014: Probability Theory, July 22, 2014

Probability theory

The study of probability has roots in games of chance (dice, cards, ...)

Great names of science: Cardano, Fermat, Pascal, Laplace, Kolmogorov, Bernoulli, Poisson, Cauchy, Boltzmann, de Finetti, ...

Natural tool to model uncertainty, information, knowledge, belief, ...

...thus also learning, decision making, inference, ...

What is probability?

Classical definition: P(A) = N_A / N
...with N mutually exclusive, equally likely outcomes, N_A of which result in the occurrence of A. (Laplace, 1814)

Example: P(randomly drawn card is ♠) = 13/52.
Example: P(getting 1 in throwing a fair die) = 1/6.

Frequentist definition: P(A) = lim_{N→∞} N_A / N
...relative frequency of occurrence of A in an infinite number of trials.

Subjective probability: P(A) is a degree of belief. (de Finetti, 1930s)
...gives meaning to P(it will rain tomorrow).
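The frequentist view is easy to illustrate numerically: the relative frequency N_A/N of an event stabilizes around P(A) as N grows. A minimal sketch of the fair-die example above (standard library only; the function name is an illustration, not from the slides):

```python
import random

def relative_frequency(event, trials, seed=0):
    """Estimate P(event) as N_A / N over repeated throws of a fair die."""
    rng = random.Random(seed)  # fixed seed, so the run is reproducible
    hits = sum(event(rng.randint(1, 6)) for _ in range(trials))
    return hits / trials

# P(getting 1 in throwing a fair die) = 1/6, so the estimate should be near 0.1667
estimate = relative_frequency(lambda face: face == 1, trials=100_000)
```

With 100,000 trials the estimate lands within about ±0.01 of 1/6, matching the limit definition.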

Key concepts: Sample space and events

Sample space X = set of possible outcomes of a random experiment.
Examples:
  - Tossing two coins: X = {HH, TH, HT, TT}
  - Roulette: X = {1, 2, ..., 36}
  - Drawing a card from a shuffled deck: X = the 52 cards {A♣, 2♣, ..., Q♠, K♠}.

An event is a subset of X.
Examples:
  - exactly one H in a 2-coin toss: A = {TH, HT} ⊆ {HH, TH, HT, TT}
  - odd number in the roulette: B = {1, 3, ..., 35} ⊆ {1, 2, ..., 36}
  - a hearts card is drawn: C = {A♥, 2♥, ..., K♥} ⊆ X

Kolmogorov's Axioms for Probability

Probability is a function that maps events A into the interval [0, 1].
Kolmogorov's axioms for probability (1933):
  - For any A ⊆ X, P(A) ≥ 0
  - P(X) = 1
  - If A1, A2, ... ⊆ X are disjoint events, then P(∪_i A_i) = Σ_i P(A_i)

From these axioms, many results can be derived. Examples:
  - P(∅) = 0
  - C ⊆ D ⇒ P(C) ≤ P(D)
  - P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
  - P(A ∪ B) ≤ P(A) + P(B) (union bound)
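For a finite, equally likely sample space, these derived results can be checked exactly with rational arithmetic. A small sketch on one throw of a fair die (events A and B are illustrative choices, not from the slides):

```python
from fractions import Fraction

# Sample space: one throw of a fair die; each outcome has probability 1/6.
X = {1, 2, 3, 4, 5, 6}

def P(event):
    """Classical probability: |event| / |X| for event ⊆ X."""
    return Fraction(len(event & X), len(X))

A = {1, 2, 3}   # "at most 3"
B = {2, 4, 6}   # "even"

# Inclusion-exclusion: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
lhs = P(A | B)                     # note: "|" is set union in Python
rhs = P(A) + P(B) - P(A & B)
```

Here P(A ∪ B) = 5/6 while P(A) + P(B) = 1, so the union bound holds strictly.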

Conditional Probability and Independence

If P(B) > 0, P(A|B) = P(A ∩ B) / P(B) (conditional probability of A given B)

...satisfies all of Kolmogorov's axioms:
  - For any A ⊆ X, P(A|B) ≥ 0
  - P(X|B) = 1
  - If A1, A2, ... ⊆ X are disjoint, then P(∪_i A_i | B) = Σ_i P(A_i|B)

Events A, B are independent (A ⊥⊥ B) ⇔ P(A ∩ B) = P(A) · P(B).

Conditional Probability and Independence

If P(B) > 0, P(A|B) = P(A ∩ B) / P(B)

Events A, B are independent (A ⊥⊥ B) ⇔ P(A ∩ B) = P(A) · P(B).

Relationship with conditional probabilities:
  A ⊥⊥ B ⇒ P(A|B) = P(A ∩ B) / P(B) = P(A) P(B) / P(B) = P(A)

Example: X = 52 cards, A = {3♣, 3♦, 3♥, 3♠}, and B = hearts;
  P(A) = 1/13, P(B) = 1/4
  P(A ∩ B) = P({3♥}) = 1/52
  P(A) · P(B) = (1/13) · (1/4) = 1/52
  P(A|B) = P(3♥ | ♥) = 1/13
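The card example above can be verified exactly by enumerating the deck (suit letters "CDHS" are a representation choice):

```python
from fractions import Fraction
from itertools import product

deck = set(product(range(1, 14), "CDHS"))   # 52 cards as (rank, suit) pairs

def P(event):
    return Fraction(len(event), len(deck))

A = {c for c in deck if c[0] == 3}          # the four 3s
B = {c for c in deck if c[1] == "H"}        # hearts

p_joint = P(A & B)        # P(A ∩ B) = 1/52
p_prod  = P(A) * P(B)     # (1/13)(1/4) = 1/52, so A and B are independent
p_cond  = P(A & B) / P(B) # P(A|B) = 1/13 = P(A)
```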

Bayes' Theorem

Law of total probability: if A1, ..., An is a partition of X,
  P(B) = Σ_i P(B|A_i) P(A_i) = Σ_i P(B ∩ A_i)

Bayes' theorem: if A1, ..., An is a partition of X,
  P(A_i|B) = P(B ∩ A_i) / P(B) = P(B|A_i) P(A_i) / Σ_j P(B|A_j) P(A_j)
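Both identities can be exercised on a small made-up setup (the two-urn numbers below are assumptions for illustration, not from the slides):

```python
from fractions import Fraction as F

# Hypothetical setup: pick urn A1 or A2 (a partition of the sample space),
# then B = "drew a red ball" with the given conditional probabilities.
prior = {"A1": F(1, 2), "A2": F(1, 2)}   # P(A_i)
lik   = {"A1": F(3, 4), "A2": F(1, 4)}   # P(B | A_i)

# Law of total probability: P(B) = Σ_i P(B|A_i) P(A_i)
p_B = sum(lik[a] * prior[a] for a in prior)

# Bayes' theorem: P(A_i | B) = P(B|A_i) P(A_i) / P(B)
posterior = {a: lik[a] * prior[a] / p_B for a in prior}
```

Here P(B) = 1/2 and the posterior puts probability 3/4 on A1: observing B shifts belief toward the urn that makes B more likely.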

Random Variables

A (real) random variable (RV) is a function X : X → R.

Discrete RV: range of X is countable (e.g., N or {0, 1})

Continuous RV: range of X is uncountable (e.g., R or [0, 1])

Example: number of heads in tossing two coins,
  X = {HH, HT, TH, TT},
  X(HH) = 2, X(HT) = X(TH) = 1, X(TT) = 0.
  Range of X = {0, 1, 2}.

Example: distance traveled by a tossed coin; range of X = R+.

Random Variables: Distribution Function

Distribution function: F_X(x) = P({ω ∈ X : X(ω) ≤ x})

Example: number of heads in tossing 2 coins; range(X) = {0, 1, 2}.

Probability mass function (pmf, discrete RV): f_X(x) = P(X = x),
  F_X(x) = Σ_{x_i ≤ x} f_X(x_i).

Important Discrete Random Variables

Uniform: X ∈ {x1, ..., xK}, pmf f_X(x_i) = 1/K.

Bernoulli RV: X ∈ {0, 1}, pmf f_X(x) = p if x = 1, and 1 − p if x = 0.
Can be written compactly as f_X(x) = p^x (1 − p)^(1−x).

Binomial RV: X ∈ {0, 1, ..., n} (sum of n Bernoulli RVs),
  f_X(x) = Binomial(x; n, p) = (n choose x) p^x (1 − p)^(n−x)

Binomial coefficients (n choose x):
  (n choose x) = n! / ((n − x)! x!)
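The binomial pmf is a one-liner with `math.comb`; two sanity checks are that it sums to 1 over {0, ..., n} and that (anticipating the expectation slide) its mean is n·p. A sketch with illustrative parameters:

```python
from math import comb

def binomial_pmf(x, n, p):
    """f_X(x) = (n choose x) p^x (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.3
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]
total = sum(pmf)                                       # should be 1
mean = sum(x * f for x, f in zip(range(n + 1), pmf))   # should be n*p = 3
```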

Random Variables: Distribution Function

Distribution function: F_X(x) = P({ω ∈ X : X(ω) ≤ x})

Example: continuous RV with uniform distribution on [a, b].

Probability density function (pdf, continuous RV): f_X(x), such that
  F_X(x) = ∫_{−∞}^{x} f_X(u) du,   P(X ∈ [c, d]) = ∫_c^d f_X(x) dx,   P(X = x) = 0
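For the uniform example, both sides of P(X ∈ [c, d]) = F(d) − F(c) are easy to compute: the CDF has a closed form, and the integral of the pdf can be approximated numerically. A sketch with assumed endpoints a, b, c, d:

```python
def uniform_pdf(x, a, b):
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_cdf(x, a, b):
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

a, b = 2.0, 6.0
c, d = 3.0, 5.0
exact = uniform_cdf(d, a, b) - uniform_cdf(c, a, b)   # P(X in [c, d]) = 0.5

# Midpoint-rule approximation of the integral of the pdf over [c, d]
m = 1000
h = (d - c) / m
approx = sum(uniform_pdf(c + (i + 0.5) * h, a, b) for i in range(m)) * h
```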

Important Continuous Random Variables

Uniform: f_X(x) = Uniform(x; a, b) = 1/(b − a) if x ∈ [a, b], and 0 if x ∉ [a, b] (previous slide).

Gaussian: f_X(x) = N(x; μ, σ²) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²))

Exponential: f_X(x) = Exp(x; λ) = λ e^{−λx} if x ≥ 0, and 0 if x < 0.
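The Gaussian CDF has no elementary closed form, but it can be written with the error function from the standard library; the exponential CDF is 1 − e^{−λx}. A sketch (the "one sigma" check is a standard fact, not stated on the slide):

```python
from math import erf, exp, sqrt, pi

def gaussian_pdf(x, mu, var):
    return exp(-(x - mu)**2 / (2 * var)) / sqrt(2 * pi * var)

def gaussian_cdf(x, mu, var):
    # Closed form via the error function: Phi(z) = (1 + erf(z / sqrt(2))) / 2
    return 0.5 * (1 + erf((x - mu) / sqrt(2 * var)))

def exponential_cdf(x, lam):
    return 1 - exp(-lam * x) if x >= 0 else 0.0

# P(mu - sigma <= X <= mu + sigma) for a Gaussian: about 0.683
within_one_sigma = gaussian_cdf(1, 0, 1) - gaussian_cdf(-1, 0, 1)
```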

Expectation of Random Variables

Expectation:
  E(X) = Σ_i x_i f_X(x_i),   X ∈ {x1, ..., xK} ⊂ R (discrete)
  E(X) = ∫ x f_X(x) dx,      X continuous

Example: Bernoulli, f_X(x) = p^x (1 − p)^(1−x), for x ∈ {0, 1}.
  E(X) = 0 · (1 − p) + 1 · p = p.

Example: Binomial, f_X(x) = (n choose x) p^x (1 − p)^(n−x), for x ∈ {0, ..., n}.
  E(X) = n p.

Example: Gaussian, f_X(x) = N(x; μ, σ²).  E(X) = μ.

Linearity of expectation:
  E(X + Y) = E(X) + E(Y);   E(αX) = α E(X), α ∈ R
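The discrete expectation formula is a direct sum over the pmf. A sketch checking the Bernoulli and binomial examples above, plus the scaling case of linearity (the parameter values are illustrative):

```python
from math import comb

def expectation(pmf):
    """E(X) = Σ x f_X(x) for a pmf given as {value: probability}."""
    return sum(x * f for x, f in pmf.items())

p = 0.3
bernoulli = {0: 1 - p, 1: p}

n = 5
binomial = {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}

e_bern = expectation(bernoulli)   # = p
e_binom = expectation(binomial)   # = n * p = 1.5
# E(2X) = 2 E(X): relabel each value x as 2x, keep the probabilities
scaled = expectation({2 * x: f for x, f in binomial.items()})
```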

Expectation of Functions of Random Variables

  E(g(X)) = Σ_i g(x_i) f_X(x_i),   X discrete, g(x_i) ∈ R
  E(g(X)) = ∫ g(x) f_X(x) dx,      X continuous

Example: variance, var(X) = E[(X − E(X))²] = E(X²) − E(X)²

Example: Bernoulli variance, E(X²) = E(X) = p, thus var(X) = p(1 − p).

Example: Gaussian variance, E[(X − μ)²] = σ².

Probability as expectation of an indicator, 1_A(x) = 1 if x ∈ A, and 0 if x ∉ A:
  P(X ∈ A) = ∫_A f_X(x) dx = ∫ 1_A(x) f_X(x) dx = E(1_A(X))
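Both identities (variance via E(X²) − E(X)², and probability as the expectation of an indicator) can be checked exactly for discrete RVs. A sketch with a Bernoulli(1/3) and a fair die (choices of p and A are illustrative):

```python
from fractions import Fraction as F

def expect(pmf, g=lambda x: x):
    """E(g(X)) = Σ g(x) f_X(x) for a pmf given as {value: probability}."""
    return sum(g(x) * f for x, f in pmf.items())

# Bernoulli(p): var(X) = E(X^2) - E(X)^2 = p(1 - p)
p = F(1, 3)
bern = {0: 1 - p, 1: p}
var = expect(bern, lambda x: x**2) - expect(bern)**2   # = 2/9

# Indicator: P(X in A) = E(1_A(X)); here X = fair die, A = even faces
die = {x: F(1, 6) for x in range(1, 7)}
A = {2, 4, 6}
p_A = expect(die, lambda x: 1 if x in A else 0)        # = 1/2
```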

Two (or More) Random Variables

Joint pmf of two discrete RVs: f_{X,Y}(x, y) = P(X = x ∧ Y = y).
Extends trivially to more than two RVs.

Joint pdf of two continuous RVs: f_{X,Y}(x, y), such that
  P((X, Y) ∈ A) = ∫∫_A f_{X,Y}(x, y) dx dy,   A ⊆ R²
Extends trivially to more than two RVs.

Marginalization:
  f_Y(y) = Σ_x f_{X,Y}(x, y),    if X is discrete
  f_Y(y) = ∫ f_{X,Y}(x, y) dx,   if X continuous

Independence:
  X ⊥⊥ Y ⇔ f_{X,Y}(x, y) = f_X(x) f_Y(y) ⇒ E(X Y) = E(X) E(Y)
(the converse of this last implication does not hold).
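Marginalization and the independence factorization are mechanical for a discrete joint pmf stored as a table. A sketch on a made-up joint pmf (the table values are assumptions, chosen so that X ⊥⊥ Y holds):

```python
from fractions import Fraction as F

# A made-up joint pmf over two binary RVs (values are assumptions)
joint = {(0, 0): F(1, 8), (0, 1): F(3, 8),
         (1, 0): F(1, 8), (1, 1): F(3, 8)}

def marginal_Y(joint):
    """f_Y(y) = Σ_x f_{X,Y}(x, y)."""
    out = {}
    for (x, y), f in joint.items():
        out[y] = out.get(y, 0) + f
    return out

fY = marginal_Y(joint)            # {0: 1/4, 1: 3/4}
fX = {0: F(1, 2), 1: F(1, 2)}     # marginal over y, by the same sum

# f_{X,Y}(x,y) = f_X(x) f_Y(y) for every cell, so X ⊥⊥ Y here
independent = all(joint[(x, y)] == fX[x] * fY[y] for x in fX for y in fY)
e_xy = sum(x * y * f for (x, y), f in joint.items())   # equals E(X) E(Y) = 3/8
```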

Conditionals and Bayes' Theorem

Conditional pmf (discrete RVs):
  f_{X|Y}(x|y) = P(X = x | Y = y) = P(X = x ∧ Y = y) / P(Y = y) = f_{X,Y}(x, y) / f_Y(y).

Conditional pdf (continuous RVs): f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y)
...the meaning is technically delicate.

Bayes' theorem: f_{X|Y}(x|y) = f_{Y|X}(y|x) f_X(x) / f_Y(y)   (pdf or pmf).

Also valid in the mixed case (e.g., X continuous, Y discrete).

Joint, Marginal, and Conditional Probabilities: An Example

A pair of binary variables X, Y ∈ {0, 1}, with joint pmf:

  f_{X,Y}(0, 0) = 1/5,    f_{X,Y}(0, 1) = 2/5
  f_{X,Y}(1, 0) = 1/10,   f_{X,Y}(1, 1) = 3/10

Marginals:
  f_X(0) = 1/5 + 2/5 = 3/5,      f_X(1) = 1/10 + 3/10 = 4/10
  f_Y(0) = 1/5 + 1/10 = 3/10,    f_Y(1) = 2/5 + 3/10 = 7/10

Conditional probabilities, e.g.:
  f_{X|Y}(0|0) = (1/5) / (3/10) = 2/3,    f_{X|Y}(1|1) = (3/10) / (7/10) = 3/7
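This example can be reproduced exactly in a few lines, marginalizing the joint table and then dividing to get the conditionals:

```python
from fractions import Fraction as F

# The joint pmf from the example above
joint = {(0, 0): F(1, 5), (0, 1): F(2, 5),
         (1, 0): F(1, 10), (1, 1): F(3, 10)}

# Marginals by summing out the other variable
fX = {x: sum(f for (xx, y), f in joint.items() if xx == x) for x in (0, 1)}
fY = {y: sum(f for (x, yy), f in joint.items() if yy == y) for y in (0, 1)}

# Conditional pmf: f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y), here for y = 0
fX_given_Y0 = {x: joint[(x, 0)] / fY[0] for x in (0, 1)}
```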

An Important Multivariate RV: Multinomial

Multinomial: X = (X1, ..., XK), X_i ∈ {0, ..., n}, such that Σ_i X_i = n,

  f_X(x1, ..., xK) = (n choose x1, x2, ..., xK) p1^{x1} p2^{x2} ··· pK^{xK},  if Σ_i x_i = n
  f_X(x1, ..., xK) = 0,                                                      if Σ_i x_i ≠ n

where (n choose x1, x2, ..., xK) = n! / (x1! x2! ··· xK!).

Parameters: p1, ..., pK ≥ 0, such that Σ_i p_i = 1.

Generalizes the binomial from binary to K classes.

Example: tossing n independent fair dice, p1 = ··· = p6 = 1/6.
x_i = number of outcomes with i dots. Of course, Σ_i x_i = n.
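The multinomial pmf is a short function of the count vector; two checks are the fair-dice example (with assumed n = 3) and that the pmf sums to 1 over all count vectors with Σ x_i = n:

```python
from math import factorial
from itertools import product

def multinomial_pmf(counts, probs):
    """f_X(x1,...,xK) = n!/(x1!···xK!) · p1^x1 ··· pK^xK, with n = Σ x_i."""
    n = sum(counts)
    coef = factorial(n)
    for x in counts:
        coef //= factorial(x)   # each intermediate division is exact
    out = coef
    for x, p in zip(counts, probs):
        out *= p**x
    return out

# n = 3 throws of a fair die: P(one 1, one 2, one 3) = 3!/6^3 = 1/36
p = [1 / 6] * 6
val = multinomial_pmf([1, 1, 1, 0, 0, 0], p)

# Sum over all count vectors with Σ x_i = n (small K, n, assumed probs)
n, K = 3, 3
q = [0.5, 0.3, 0.2]
total = sum(multinomial_pmf(c, q)
            for c in product(range(n + 1), repeat=K) if sum(c) == n)
```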

An Important Multivariate RV: Gaussian

Multivariate Gaussian: X ∈ R^n,

  f_X(x) = N(x; μ, C) = (1 / √(det(2π C))) exp(−(1/2) (x − μ)^T C^{−1} (x − μ))

Parameters: vector μ ∈ R^n and matrix C ∈ R^{n×n}.

Expected value: E(X) = μ. Meaning of C: next slide.
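For n = 2 the density can be written out with the explicit 2×2 inverse and determinant, which keeps the sketch dependency-free (a real implementation would use a linear-algebra library):

```python
from math import exp, pi, sqrt

def mvn_pdf_2d(x, mu, C):
    """N(x; mu, C) for n = 2, with the 2x2 inverse written out by hand."""
    (a, b), (c, d) = C
    det = a * d - b * c                          # det(C), assumed > 0
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # (x - mu)^T C^{-1} (x - mu), using C^{-1} = [[d, -b], [-c, a]] / det
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    # det(2 pi C) = (2 pi)^2 det(C) for n = 2
    return exp(-0.5 * quad) / sqrt((2 * pi) ** 2 * det)

# At x = mu the exponent vanishes: the 2-D standard Gaussian peaks at 1/(2 pi)
at_mean = mvn_pdf_2d((0, 0), (0, 0), ((1, 0), (0, 1)))
```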

Covariance, Correlation, and all that...

Covariance between two RVs:
  cov(X, Y) = E[(X − E(X))(Y − E(Y))] = E(X Y) − E(X) E(Y)

Relationship with variance: var(X) = cov(X, X).

Correlation: corr(X, Y) = ρ(X, Y) = cov(X, Y) / √(var(X) var(Y)) ∈ [−1, 1]

X ⊥⊥ Y ⇒ cov(X, Y) = 0, but the converse does not hold.

Covariance matrix of a multivariate RV, X ∈ R^n:
  cov(X) = E[(X − E(X))(X − E(X))^T] = E(X X^T) − E(X) E(X)^T

Covariance of a Gaussian RV: f_X(x) = N(x; μ, C) ⇒ cov(X) = C
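The failure of the converse (cov = 0 does not imply independence) has a standard counterexample, not from the slides: X uniform on {−1, 0, 1} and Y = X². A sketch checking it exactly:

```python
from fractions import Fraction as F

def E(pmf, g=lambda v: v):
    """E(g(V)) for a pmf over tuples, given as {value: probability}."""
    return sum(g(v) * f for v, f in pmf.items())

# X uniform on {-1, 0, 1}, Y = X^2: dependent but uncorrelated
joint = {(x, x * x): F(1, 3) for x in (-1, 0, 1)}

e_x  = E(joint, lambda v: v[0])          # 0
e_y  = E(joint, lambda v: v[1])          # 2/3
e_xy = E(joint, lambda v: v[0] * v[1])   # 0
cov  = e_xy - e_x * e_y                  # 0, yet X and Y are not independent:

# f_{X,Y}(1, 1) = 1/3, but f_X(1) f_Y(1) = (1/3)(2/3) = 2/9
not_independent = joint[(1, 1)] != F(1, 3) * F(2, 3)
```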

Statistical Inference

Scenario: observed RV Y depends on unknown variable(s) X.
Goal: given an observation Y = y, infer X.

Two main philosophies:
  Frequentist: X = x is fixed (not an RV), but unknown;
  Bayesian: X is a random variable with pdf/pmf f_X(x) (the prior);
  this prior expresses/formalizes knowledge about X.

In both philosophies, a central object is f_{Y|X}(y|x), which goes by
several names: likelihood function, observation model, ...

This is not statistical/machine learning! f_{Y|X}(y|x) is assumed known.

In the Bayesian philosophy, all the knowledge about X is carried by

  f_{X|Y}(x|y) = f_{Y,X}(y, x) / f_Y(y) = f_{Y|X}(y|x) f_X(x) / f_Y(y)

...the posterior (or a posteriori) pdf/pmf.

Statistical Inference
The posterior pdf/pmf fX |Y (x|y ) has all the information/knowledge
about X , given Y = y (conditionality principle).
How to make an optimal guess b
x about X , given this information?
Need to define optimal: loss function: L(b
x , x) R+ measures
loss/cost incurred by guessing b
x if truth is x.
The optimal Bayesian decision minimizes the expected loss:
b
xBayes = arg min E[L(b
x , X )|Y = y ]
b
x

where
Z

L(b
x , x) fX |Y (x|y ) dx, continuous (estimation)
X
E[L(b
x , X )|Y = y ] =

L(b
x , x) fX |Y (x|y ),
discrete (classification)

M
ario A. T. Figueiredo (IST & IT)

LxMLS 2014: Probability Theory

July 22, 2014

23 / 35

Classical Statistical Inference Criteria

Consider that X ∈ {1, ..., K} (discrete/classification problem).
Adopt the 0/1 loss: L(x̂, x) = 0, if x̂ = x, and 1 otherwise.
Optimal Bayesian decision:

  x̂_Bayes = arg min_x̂ Σ_{x=1}^{K} L(x̂, x) fX|Y(x|y)
          = arg min_x̂ (1 − fX|Y(x̂|y))
          = arg max_x̂ fX|Y(x̂|y) ≡ x̂_MAP

MAP = maximum a posteriori criterion.
The same criterion can be derived for continuous X, using
lim_{ε→0} L_ε(x̂, x), where L_ε(x̂, x) = 0, if |x̂ − x| < ε, and 1 otherwise.
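The 0/1-loss argument above can be checked numerically; a minimal Python sketch (the posterior values below are made up for illustration):

```python
# Hypothetical posterior pmf f_{X|Y}(x|y) over K = 3 classes.
posterior = {1: 0.2, 2: 0.5, 3: 0.3}

def expected_01_loss(guess, posterior):
    # E[L(guess, X) | Y = y] under the 0/1 loss = 1 - f_{X|Y}(guess|y)
    return sum(p for x, p in posterior.items() if x != guess)

# Minimizing the expected 0/1 loss...
x_bayes = min(posterior, key=lambda g: expected_01_loss(g, posterior))
# ...picks the same class as maximizing the posterior (MAP).
x_map = max(posterior, key=posterior.get)
assert x_bayes == x_map == 2
```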

Classical Statistical Inference Criteria

Consider the MAP criterion: x̂_MAP = arg max_x fX|Y(x|y).
From Bayes law:

  x̂_MAP = arg max_x fY|X(y|x) fX(x) / fY(y) = arg max_x fY|X(y|x) fX(x)

...we only need to know the posterior fX|Y(x|y) up to a normalization factor.
Also common to write: x̂_MAP = arg max_x [log fY|X(y|x) + log fX(x)].
If the prior is flat, fX(x) = C, then

  x̂_MAP = arg max_x fY|X(y|x) ≡ x̂_ML

ML = maximum likelihood criterion.

Statistical Inference: Example

Observed n i.i.d. (independent and identically distributed) Bernoulli RVs:
Y = (Y1, ..., Yn), with Yi ∈ {0, 1}.
Common pmf: fYi|X(y|x) = x^y (1 − x)^(1−y), where x ∈ [0, 1].
Likelihood function: fY|X(y1, ..., yn|x) = Π_{i=1}^{n} x^{yi} (1 − x)^{1−yi}
Log-likelihood function:

  log fY|X(y1, ..., yn|x) = n log(1 − x) + log(x / (1 − x)) Σ_{i=1}^{n} yi

Maximum likelihood: x̂_ML = arg max_x fY|X(y|x) = (1/n) Σ_{i=1}^{n} yi
Example: n = 10, observed y = (1, 1, 1, 0, 1, 0, 0, 1, 1, 1), x̂_ML = 7/10.
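A quick sketch of this example in Python, checking the closed-form ML estimate against a grid maximization of the log-likelihood (the grid resolution is an arbitrary choice):

```python
import math

# The slide's data: n = 10 Bernoulli observations.
y = [1, 1, 1, 0, 1, 0, 0, 1, 1, 1]

# Closed-form ML estimate: the sample mean.
x_ml = sum(y) / len(y)

def log_likelihood(x, y):
    # log f_{Y|X}(y_1,...,y_n | x) = n log(1-x) + log(x/(1-x)) * sum(y)
    return len(y) * math.log(1 - x) + math.log(x / (1 - x)) * sum(y)

# A grid maximization agrees with the closed form.
grid = [i / 1000 for i in range(1, 1000)]
x_grid = max(grid, key=lambda x: log_likelihood(x, y))
assert x_ml == 0.7 and abs(x_grid - x_ml) < 1e-3
```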

Statistical Inference: Example (Continuation)

Observed n i.i.d. Bernoulli RVs.
Likelihood: fY|X(y1, ..., yn|x) = Π_{i=1}^{n} x^{yi} (1 − x)^{1−yi} = x^{Σ yi} (1 − x)^{n − Σ yi}
How to express knowledge that (e.g.) X is around 1/2? A convenient
choice: a conjugate prior, i.e., form of the posterior = form of the prior.
In our case, the Beta pdf:

  fX(x) ∝ x^{α−1} (1 − x)^{β−1},   α, β > 0

Posterior: fX|Y(x|y) ∝ x^{α−1+Σ_i yi} (1 − x)^{β−1+n−Σ_i yi}
MAP: x̂_MAP = (α + Σ_i yi − 1) / (α + β + n − 2)
Example: α = 4, β = 4, n = 10, y = (1, 1, 1, 0, 1, 0, 0, 1, 1, 1):
x̂_MAP = 0.625 (recall x̂_ML = 0.7).
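A sketch of the Beta-prior computation; the closed-form MAP estimate (the posterior mode) reproduces the slide's numbers:

```python
# Bernoulli observations with a Beta(alpha, beta) prior on x.
y = [1, 1, 1, 0, 1, 0, 0, 1, 1, 1]
alpha, beta = 4, 4
s, n = sum(y), len(y)

# Mode of the Beta(alpha + s, beta + n - s) posterior.
x_map = (alpha + s - 1) / (alpha + beta + n - 2)
x_ml = s / n

assert x_map == 0.625  # pulled from x_ml = 0.7 toward the prior mean 0.5
```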

Another Classical Statistical Inference Criterion

Consider that X ∈ ℝ (continuous/estimation problem).
Adopt the squared error loss: L(x̂, x) = (x̂ − x)².
Optimal Bayesian decision:

  x̂_Bayes = arg min_x̂ E[(x̂ − X)² | Y = y]
          = arg min_x̂ (x̂² − 2 x̂ E[X|Y = y])
          = E[X|Y = y] ≡ x̂_MMSE

MMSE = minimum mean squared error criterion.
Does not apply to classification problems.
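The claim that the posterior mean minimizes the expected squared error can be checked on a discretized posterior; a sketch (the posterior shape is made up for illustration):

```python
# A made-up posterior discretized on a grid over [0, 1].
grid = [i / 100 for i in range(101)]
weights = [x * (1 - x) for x in grid]   # unnormalized posterior values
z = sum(weights)
posterior = [w / z for w in weights]

# Posterior mean = the MMSE estimate.
post_mean = sum(x * p for x, p in zip(grid, posterior))

def expected_sq_loss(guess):
    # E[(guess - X)^2 | Y = y], approximated on the grid
    return sum((guess - x) ** 2 * p for x, p in zip(grid, posterior))

# The grid point minimizing the expected squared loss sits at the mean.
best = min(grid, key=expected_sq_loss)
assert abs(best - post_mean) < 1e-2
```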

Back to the Bernoulli Example

Observed n i.i.d. Bernoulli RVs.
Likelihood: fY|X(y1, ..., yn|x) = Π_{i=1}^{n} x^{yi} (1 − x)^{1−yi} = x^{Σ yi} (1 − x)^{n − Σ yi}
In our case, the Beta pdf: fX(x) ∝ x^{α−1} (1 − x)^{β−1}, α, β > 0
Posterior: fX|Y(x|y) ∝ x^{α−1+Σ_i yi} (1 − x)^{β−1+n−Σ_i yi}
MMSE: x̂_MMSE = (α + Σ_i yi) / (α + β + n)
Example: α = 4, β = 4, n = 10, y = (1, 1, 1, 0, 1, 0, 0, 1, 1, 1):
x̂_MMSE ≈ 0.611 (recall that x̂_MAP = 0.625, x̂_ML = 0.7).
A conjugate prior is equivalent to "virtual counts";
this is often called smoothing in NLP and ML.
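The "virtual counts" reading can be made explicit in code: the MMSE estimate is the empirical frequency after adding α fictitious ones and β fictitious zeros (with α = β = 1 this is the familiar add-one/Laplace smoothing); a sketch:

```python
y = [1, 1, 1, 0, 1, 0, 0, 1, 1, 1]
alpha, beta = 4, 4
s, n = sum(y), len(y)

# MMSE estimate = posterior mean of Beta(alpha + s, beta + n - s).
x_mmse = (alpha + s) / (alpha + beta + n)

# Same number, read as a smoothed frequency with virtual counts.
smoothed = (s + alpha) / (n + alpha + beta)
assert x_mmse == smoothed
assert abs(x_mmse - 11 / 18) < 1e-12  # ≈ 0.611
```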

The Bernstein–von Mises Theorem

In the previous example, we had n = 10, y = (1, 1, 1, 0, 1, 0, 0, 1, 1, 1),
thus Σ_i yi = 7. With a Beta prior with α = 4 and β = 4, we had

  x̂_ML = 0.7,   x̂_MAP = (3 + Σ_i yi)/(6 + n) = 0.625,   x̂_MMSE = (4 + Σ_i yi)/(8 + n) ≈ 0.611

Consider n = 100 and Σ_i yi = 70, with the same Beta(4, 4) prior:

  x̂_ML = 0.7,   x̂_MAP = 73/106 ≈ 0.689,   x̂_MMSE = 74/108 ≈ 0.685

...both Bayesian estimates are much closer to the ML estimate.
This illustrates an important result in Bayesian inference, the
Bernstein–von Mises theorem; under (mild) conditions, as n → ∞,

  lim x̂_MAP = lim x̂_MMSE = x̂_ML

Message: if you have a lot of data, priors don't matter.
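A sketch of this convergence at work, holding the empirical frequency fixed at 0.7 while n grows (the n values are arbitrary):

```python
alpha, beta = 4, 4  # the slide's Beta(4, 4) prior

def estimates(n, s):
    # ML, MAP, and MMSE estimates for s ones in n Bernoulli trials.
    x_ml = s / n
    x_map = (alpha + s - 1) / (alpha + beta + n - 2)
    x_mmse = (alpha + s) / (alpha + beta + n)
    return x_ml, x_map, x_mmse

gaps = []
for n in (10, 100, 10000):
    s = 7 * n // 10  # keep the empirical frequency at 0.7
    x_ml, x_map, x_mmse = estimates(n, s)
    gaps.append(abs(x_map - x_ml))

# The MAP-to-ML gap shrinks as n grows: the prior washes out.
assert gaps[0] > gaps[1] > gaps[2]
```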

Important Inequalities

Markov's inequality: if X ≥ 0 is an RV with expectation E(X), then

  P(X > t) ≤ E(X)/t

Simple proof:

  t P(X > t) = ∫_t^∞ t fX(x) dx ≤ ∫_t^∞ x fX(x) dx = E(X) − ∫_0^t x fX(x) dx ≤ E(X)

(the last step uses ∫_0^t x fX(x) dx ≥ 0).

Chebyshev's inequality: if μ = E(Y) and σ² = var(Y), then

  P(|Y − μ| ≥ s) ≤ σ²/s²

...a simple corollary of Markov's inequality, with X = |Y − μ|², t = s².
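Both inequalities can be sanity-checked by simulation; a sketch with exponential samples (the distribution and thresholds are arbitrary choices):

```python
import random

random.seed(0)
xs = [random.expovariate(1.0) for _ in range(100_000)]  # nonnegative RV
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)

# Markov: P(X > t) <= E(X)/t, for X >= 0.
for t in (1.0, 2.0, 5.0):
    assert sum(x > t for x in xs) / len(xs) <= mean / t

# Chebyshev: P(|X - mu| >= s) <= var/s^2.
for s in (1.0, 2.0, 3.0):
    assert sum(abs(x - mean) >= s for x in xs) / len(xs) <= var / s ** 2
```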

More Important Inequalities

Cauchy–Schwarz inequality for RVs:

  E(|X Y|) ≤ √(E(X²) E(Y²))

Recall that a real function g is convex if, for any x, y, and λ ∈ [0, 1],

  g(λx + (1 − λ)y) ≤ λ g(x) + (1 − λ) g(y)

Jensen's inequality: if g is a real convex function, then

  g(E(X)) ≤ E(g(X))

Examples: E(X)² ≤ E(X²), thus var(X) = E(X²) − E(X)² ≥ 0;
E(log X) ≤ log E(X), for X a positive RV (log is concave).
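Both examples of Jensen's inequality can be checked on a sample; a minimal sketch (the sampling distribution is an arbitrary choice):

```python
import math
import random

# Jensen's inequality on a sample: g(E X) <= E g(X) for convex g.
random.seed(1)
xs = [random.uniform(0.1, 2.0) for _ in range(10_000)]
mean = sum(xs) / len(xs)

# Convex g(x) = x^2: E(X)^2 <= E(X^2), i.e. var(X) >= 0.
assert mean ** 2 <= sum(x * x for x in xs) / len(xs)

# Concave log (so the inequality flips): E(log X) <= log E(X).
assert sum(math.log(x) for x in xs) / len(xs) <= math.log(mean)
```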

Entropy and all that...

Entropy of a discrete RV X ∈ {1, ..., K}:  H(X) = −Σ_{x=1}^{K} fX(x) log fX(x)
Positivity: H(X) ≥ 0;
H(X) = 0 ⇔ fX(i) = 1, for exactly one i ∈ {1, ..., K}.
Upper bound: H(X) ≤ log K;
H(X) = log K ⇔ fX(x) = 1/K, for all x ∈ {1, ..., K}.
A measure of the uncertainty/randomness of X.
Continuous RV X, differential entropy: h(X) = −∫ fX(x) log fX(x) dx
h(X) can be positive or negative. Example: if fX(x) = Uniform(x; a, b),
then h(X) = log(b − a).
If fX(x) = N(x; μ, σ²), then h(X) = (1/2) log(2πe σ²).
If var(Y) = σ², then h(Y) ≤ (1/2) log(2πe σ²).
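The discrete-entropy bounds are easy to verify in code; a minimal sketch (the pmfs are made up for illustration):

```python
import math

def entropy(pmf):
    # H(X) = -sum_x f(x) log f(x), with the convention 0 log 0 = 0.
    return -sum(p * math.log(p) for p in pmf if p > 0)

K = 4
uniform = [1 / K] * K           # maximal uncertainty
peaked = [1.0, 0.0, 0.0, 0.0]   # no uncertainty
skewed = [0.7, 0.1, 0.1, 0.1]

assert abs(entropy(uniform) - math.log(K)) < 1e-12  # upper bound attained
assert entropy(peaked) == 0.0                       # lower bound attained
assert 0.0 < entropy(skewed) < math.log(K)
```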

Kullback–Leibler divergence

Kullback–Leibler divergence (KLD) between two pmfs:

  D(fX ‖ gX) = Σ_{x=1}^{K} fX(x) log( fX(x) / gX(x) )

Positivity: D(fX ‖ gX) ≥ 0;
D(fX ‖ gX) = 0 ⇔ fX(x) = gX(x), for all x ∈ {1, ..., K}.
KLD between two pdfs:

  D(fX ‖ gX) = ∫ fX(x) log( fX(x) / gX(x) ) dx

Positivity: D(fX ‖ gX) ≥ 0;
D(fX ‖ gX) = 0 ⇔ fX(x) = gX(x), almost everywhere.
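A sketch of the discrete KLD and its basic properties (the two pmfs are made up; note the KLD is not symmetric, so it is a divergence, not a distance):

```python
import math

def kld(f, g):
    # D(f || g) = sum_x f(x) log(f(x)/g(x)), assuming g(x) > 0 where f(x) > 0.
    return sum(p * math.log(p / q) for p, q in zip(f, g) if p > 0)

f = [0.5, 0.3, 0.2]
g = [0.4, 0.4, 0.2]

assert kld(f, f) == 0.0        # zero iff the pmfs coincide
assert kld(f, g) > 0.0         # positivity
assert kld(f, g) != kld(g, f)  # not symmetric
```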

Enjoy LxMLS 2014
