Machine Learning 1 — WS2014 — Module IN2064

Machine Learning Worksheet 2

Probability Theory Refresher

1 Basic Probability

Problem 1: A secret government agency has developed a scanner which determines whether a person is a terrorist. The scanner is fairly reliable; 95% of all scanned terrorists are identified as terrorists, and 95% of all upstanding citizens are identified as such. An informant tells the agency that exactly one passenger of 100 aboard an aeroplane in which you are seated is a terrorist. The agency decides to scan each passenger, and the shifty looking man sitting next to you tests as "TERRORIST". What are the chances that this man is a terrorist? Show your work!

The random variable T indicates whether a person is a terrorist, so with the information given above: p(T = 1) = 0.01 and p(T = 0) = 0.99. The random variable S indicates whether the scanner reports "TERRORIST". Thus p(S = 1∣T = 1) = 0.95 (given in the text) and therefore (laws of probability) p(S = 0∣T = 1) = 0.05. Similarly, p(S = 0∣T = 0) = 0.95 and p(S = 1∣T = 0) = 0.05. We are interested in p(T = 1∣S = 1):

\[
p(T = 1 \mid S = 1) = \frac{p(S = 1 \mid T = 1)\, p(T = 1)}{p(S = 1 \mid T = 1)\, p(T = 1) + p(S = 1 \mid T = 0)\, p(T = 0)} = \frac{19}{118} \approx 0.16
\]

Note that in the denominator we compute p(S = 1) using the law of total probability (also known as the sum rule).

Problem 2: Two balls are placed in a box as follows: A fair coin is tossed and a white ball is placed in the box if a head occurs, otherwise a red ball is placed in the box. The coin is tossed again and a red ball is placed in the box if a tail occurs, otherwise a white ball is placed in the box. Balls are drawn from the box three times in succession (always replacing the drawn ball in the box). It is found that on all three occasions a red ball is drawn. What is the probability that both balls in the box are red? Show your work!

Denote by RRR the event that 3 red balls are drawn. Similarly, denote by rr the event that 2 red balls are placed in the box, rw the event that first a white and then a red ball are placed in the box, and wr and ww the remaining two possibilities. We know that
\[
p(rr) = p(rw) = p(wr) = p(ww) = \frac{1}{4}
\]
Furthermore
p(RRR∣rr) = 1, p(RRR∣rw) = p(RRR∣wr) = 1/8, p(RRR∣ww) = 0
Therefore
\[
p(rr \mid RRR) = \frac{p(RRR \mid rr)\, p(rr)}{p(RRR)} = \frac{1/4}{5/16} = \frac{4}{5}
\]
where p(RRR) = ∑x∈{rr,rw,wr,ww} p(RRR∣x)p(x) is computed with the sum rule.
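
The same computation can be checked numerically; the following sketch (my own, with illustrative variable names) reproduces the 4/5:

```python
# Sanity check for the two-ball box problem (illustrative sketch).
priors = {"rr": 0.25, "rw": 0.25, "wr": 0.25, "ww": 0.25}         # from two fair coin tosses
likelihoods = {"rr": 1.0, "rw": 0.5**3, "wr": 0.5**3, "ww": 0.0}  # p(RRR | box contents)

p_rrr = sum(likelihoods[b] * priors[b] for b in priors)           # sum rule: 5/16
print(likelihoods["rr"] * priors["rr"] / p_rrr)                   # Bayes' rule: 0.8 = 4/5
```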


Problem 3: There are eleven urns labeled by u ∈ {0, 1, 2, . . . , 10}, each containing ten balls. Urn u
contains u black balls and 10 − u white balls. Alice selects an urn u at random and draws N times with
replacement from that urn, obtaining nB black balls and N − nB white balls. If after N = 10 draws nB = 3
black balls have been drawn, what is the probability that the urn Alice is using is urn u?

Problem 4: Now, let Alice draw another ball from the same urn. What is the probability that the next
drawn ball is black (show your work)?

Our goal is to compute P(u ∣ nB, N). Given are the uniform prior P(u) = 1/11 and the likelihood P(nB ∣ u, N), where bu = u/10 denotes the fraction of black balls in urn u:

\[
P(n_B \mid u, N) = \binom{N}{n_B} b_u^{\,n_B} (1 - b_u)^{N - n_B}
\]
Using the product rule:
P (u, nB , N ) = P (u∣nB , N )P (nB , N )
Assuming that P (N ∣u) = P (N ):

P (u, nB , N ) = P (nB ∣u, N )P (u, N ) = P (nB ∣u, N )P (N )P (u)

Thus:
\[
P(u \mid n_B, N) = \frac{P(n_B \mid u, N)\, P(u)\, P(N)}{P(n_B, N)}
\]
Applying again the sum rule for the denominator:

\[
P(n_B, N) = \sum_u P(n_B \mid u, N)\, P(u)\, P(N)
\]

finally resulting in:


\[
P(u \mid n_B, N) = \frac{P(n_B \mid u, N)\, P(u)}{\sum_{u'} P(n_B \mid u', N)\, P(u')}
\]
We derive the following conditional distribution:
u      P(u ∣ nB = 3, N = 10)
0      0
1      0.063
2      0.22
3      0.29
4      0.24
5      0.13
6      0.047
7      0.0099
8      0.00086
9      0.0000096
10     0

Now, let Alice draw another ball from the same urn. What is the probability that the next drawn ball is black?


Let BN+1 denote the event that the next ball is black. Then, using the sum rule:

\[
P(B_{N+1} \mid n_B, N) = \sum_u P(B_{N+1} \mid u, n_B, N)\, P(u \mid n_B, N)
\]

For a fixed u, balls are drawn with replacement, therefore P(BN+1 ∣ u, nB, N) = bu, thus

\[
P(B_{N+1} \mid n_B, N) = \sum_u b_u\, P(u \mid n_B, N) \approx 0.333
\]
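
For completeness, a short Python sketch (my own addition, not part of the original solution) reproduces both the posterior table and the predictive probability:

```python
import numpy as np
from math import comb

# Posterior over urns after n_B = 3 black balls in N = 10 draws, and the
# predictive probability that draw N + 1 is black (illustrative sketch).
N, n_B = 10, 3
u = np.arange(11)                # urn labels 0, ..., 10
b_u = u / 10.0                   # fraction of black balls in urn u
prior = np.full(11, 1.0 / 11.0)  # Alice picks an urn uniformly at random

likelihood = comb(N, n_B) * b_u**n_B * (1.0 - b_u)**(N - n_B)
posterior = likelihood * prior / np.sum(likelihood * prior)

print(np.round(posterior, 4))    # matches the table above
print(np.sum(b_u * posterior))   # ~0.333, probability the next ball is black
```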

Problem 5: Calculate the mean and the variance of the uniform random variable X with PDF p(x) =
1, ∀ x ∈ [0, 1], and 0 elsewhere.

Let X ∼ Uniform(0, 1). Then
\[
E[X] = \int_{-\infty}^{+\infty} x\, p(x)\, dx = \int_0^1 x\, dx = \left[\frac{x^2}{2}\right]_0^1 = 0.5
\]
and
\[
E[X^2] = \int_{-\infty}^{+\infty} x^2\, p(x)\, dx = \int_0^1 x^2\, dx = \left[\frac{x^3}{3}\right]_0^1 = \frac{1}{3}
\]
and thus Var[X] = E[X²] − E[X]² = 1/3 − 1/4 = 1/12.
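
As a quick Monte Carlo cross-check (my own addition, not required for the exercise):

```python
import numpy as np

# Empirical check of E[X] = 1/2 and Var[X] = 1/12 for X ~ Uniform(0, 1).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1_000_000)
print(x.mean(), x.var())  # approximately 0.5 and 0.0833 (= 1/12)
```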

Problem 6: Consider two random variables X and Y with joint density p(x, y). Prove the following two
results:

\[
E[X] = E_Y\big[E_{X\mid Y}[X]\big] \tag{1}
\]
\[
\operatorname{Var}[X] = E_Y\big[\operatorname{Var}_{X\mid Y}[X]\big] + \operatorname{Var}_Y\big[E_{X\mid Y}[X]\big] \tag{2}
\]

Here EX∣Y [X] denotes the expectation of X under the conditional density p(x∣y), with a similar notation
for the conditional variance.

We consider the discrete case only; the sum rule and reordering of terms are all that is needed. For equation (1):

\[
\sum_y \Big(\sum_x x\, p(x\mid y)\Big) p(y) = \sum_x \sum_y x\, p(x\mid y)\, p(y) = \sum_x x \sum_y p(x, y) = \sum_x x\, p(x) = E[X]
\]

Using this result and carefully keeping track of the meanings of the various symbols, we can also derive the second equation:

\begin{align*}
E_Y\big[\operatorname{Var}_{X\mid Y}[X]\big]
&= \sum_y \Big(\sum_x \big(x - E_{X\mid Y}[X]\big)^2 p(x\mid y)\Big) p(y) \\
&= \sum_y \sum_x x^2 p(x, y) - 2 \sum_y \Big(\sum_x x\, p(x\mid y)\Big) E_{X\mid Y}[X]\, p(y) + \sum_y \sum_x E^2_{X\mid Y}[X]\, p(x, y) \\
&= \sum_x x^2 p(x) - 2 \sum_y E^2_{X\mid Y}[X]\, p(y) + \sum_y E^2_{X\mid Y}[X]\, p(y) \\
&= E_X[X^2] - E_Y\big[E^2_{X\mid Y}[X]\big]
\end{align*}

\begin{align*}
\operatorname{Var}_Y\big[E_{X\mid Y}[X]\big]
&= \sum_y \Big(E_{X\mid Y}[X] - E_Y\big[E_{X\mid Y}[X]\big]\Big)^2 p(y) \\
&= \sum_y E^2_{X\mid Y}[X]\, p(y) - 2\, E_X[X] \sum_y E_{X\mid Y}[X]\, p(y) + E^2_X[X] \\
&= E_Y\big[E^2_{X\mid Y}[X]\big] - 2\, E^2_X[X] + E^2_X[X] \\
&= E_Y\big[E^2_{X\mid Y}[X]\big] - E^2_X[X]
\end{align*}

Thus we get:
\[
E_Y\big[\operatorname{Var}_{X\mid Y}[X]\big] + \operatorname{Var}_Y\big[E_{X\mid Y}[X]\big] = E_X[X^2] - E^2_X[X] = \operatorname{Var}[X]
\]
There is an alternative, shorter way:

\[
E_Y\big[\operatorname{Var}_{X\mid Y}[X]\big] + \operatorname{Var}_Y\big[E_{X\mid Y}[X]\big]
= E_Y\big[E_{X\mid Y}[X^2]\big] - E_Y\big[E^2_{X\mid Y}[X]\big] + E_Y\big[E^2_{X\mid Y}[X]\big] - E^2_Y\big[E_{X\mid Y}[X]\big]
= E_Y\big[E_{X\mid Y}[X^2]\big] - E^2_X[X] = E_X[X^2] - E^2_X[X] = \operatorname{Var}[X]
\]
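
Both identities can also be verified numerically on an arbitrary discrete joint distribution; the following sketch (my own addition) uses a small joint table chosen purely for illustration:

```python
import numpy as np

# Check E[X] = E_Y[E_{X|Y}[X]] and the law of total variance on a small,
# arbitrary discrete joint distribution p(x, y) (values are illustrative only).
p_xy = np.array([[0.10, 0.20, 0.05],   # rows: x in {0, 1, 2}
                 [0.15, 0.05, 0.20],
                 [0.05, 0.10, 0.10]])  # columns: y in {0, 1, 2}
x_vals = np.array([0.0, 1.0, 2.0])

p_y = p_xy.sum(axis=0)                       # marginal p(y)
p_x = p_xy.sum(axis=1)                       # marginal p(x)
p_x_given_y = p_xy / p_y                     # column j holds p(x | y = j)
e_x_given_y = x_vals @ p_x_given_y           # E_{X|Y}[X] for each y
var_x_given_y = (x_vals**2) @ p_x_given_y - e_x_given_y**2

e_x = x_vals @ p_x
var_x = (x_vals**2) @ p_x - e_x**2
print(e_x, e_x_given_y @ p_y)                                        # equation (1)
print(var_x, var_x_given_y @ p_y + ((e_x_given_y - e_x)**2) @ p_y)   # equation (2)
```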

Consider these two results with X set to a parameter θ that we want to learn and Y set to a random variable representing a possible observed data set D. Then
\[
E_\theta[\theta] = E_D\big[E_\theta[\theta \mid D]\big]
\]
says that the posterior mean of θ (the inner expectation on the right hand side), averaged over the distribution generating the data (out of which one particular realization D is chosen), is equal to the prior mean of θ. This is weird! Here one mixes Bayesian analysis (posteriors) with frequentist viewpoints (expectation over all possible data sets) and thus gets a rather strange result.
With
\[
\operatorname{var}[\theta] = E_D\big[\operatorname{var}_\theta[\theta \mid D]\big] + \operatorname{var}_D\big[E_\theta[\theta \mid D]\big]
\]
one gets an interesting statement about the expected posterior variance of θ (the first term on the right hand side): on average it is never larger than the prior variance (the left hand side), because the variance of the posterior mean (the second term on the right hand side) is a non-negative quantity.

2 Probability Inequalities

Inequalities are useful for bounding quantities that might otherwise be hard to compute. We’ll begin
with a simple inequality, called the Markov inequality after Andrei A. Markov, a student of Pafnuty Chebyshev.

2.1 Markov Inequality

Let X be a non-negative, discrete random variable, and let c > 0 be a positive constant.

Problem 7: Show that


\[
P(X > c) \le \frac{E[X]}{c}.
\]
Now, consider flipping a fair coin n times. Using the Markov Inequality, what is the probability of getting
more than (3/4)n heads?

\[
E[X] = \sum_x p(x)\, x = \sum_{x \le c} p(x)\, x + \sum_{x > c} p(x)\, x \ge \sum_{x > c} p(x)\, x \ge c \sum_{x > c} p(x) = c\, P(X > c)
\]

For the coin example, X is the number of heads, E[X] = n/2 and c = (3/4)n, so we get an upper bound of (n/2)/((3/4)n) = 2/3.
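
To get a feeling for how loose this bound is, it can be compared with the exact binomial tail; the following sketch (my own addition, n = 20 is an arbitrary choice) does so:

```python
from math import comb

# Markov bound 2/3 versus the exact probability of more than (3/4)n heads.
n = 20
exact = sum(comb(n, k) for k in range(3 * n // 4 + 1, n + 1)) / 2**n
print(exact, 2 / 3)  # exact tail ~0.006 for n = 20, far below the Markov bound
```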

2.2 Chebyshev Inequality

Apply the Markov Inequality to the deviation of a random variable from its mean, i.e. for a general
random variable X we wish to bound the probability of the event {∣X − E[X]∣ > a}, which is the same as
the event {(X − E[X])² > a²}.

Problem 8: Prove that


\[
P\big(\lvert X - E[X]\rvert > a\big) \le \frac{\operatorname{Var}(X)}{a^2}
\]
holds. Again, consider flipping a fair coin n times. Now use the Chebyshev Inequality to bound the
probability of getting more than (3/4)n heads.

\[
P\big(\lvert X - E[X]\rvert > a\big) = P\big((X - E[X])^2 > a^2\big) \le \frac{\operatorname{Var}(X)}{a^2},
\]
where the last step applies the Markov inequality to the non-negative random variable (X − E[X])², whose expectation is Var(X), with c = a².
We then rewrite the event of interest so that the inequality above applies:

\[
P\big(X > \tfrac{3}{4}n\big) = P\big(X - \tfrac{1}{2}n > \tfrac{1}{4}n\big) \le P\big(\lvert X - E[X]\rvert \ge \tfrac{1}{4}n\big) \le \frac{n/4}{n^2/16} = \frac{4}{n}
\]

Note that here X is the sum of n independent coin flips, so Var(X) = n/4, using the rule that the variance of a sum of independent variables is the sum of the variances. Interpretation of the result: the bound 4/n goes to zero as n grows, reflecting that with more tosses the fraction of heads concentrates around its mean.
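
A small numerical comparison (again my own illustration) shows how the Chebyshev bound 4/n decays with n while the Markov bound stays at 2/3:

```python
from math import comb

# Chebyshev bound 4/n versus Markov bound 2/3 and the exact tail probability.
for n in (20, 100, 1000):
    exact = sum(comb(n, k) for k in range(3 * n // 4 + 1, n + 1)) / 2**n
    print(n, exact, 4 / n, 2 / 3)
```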


2.3 Jensen’s Inequality

Let f be a convex function on an interval I. If λ1 , . . . , λn are positive numbers with λ1 + ⋅ ⋅ ⋅ + λn = 1, then for any x1 , . . . , xn ∈ I:
f (λ1 x1 + ⋅ ⋅ ⋅ + λn xn ) ≤ λ1 f (x1 ) + ⋅ ⋅ ⋅ + λn f (xn )

Problem 9: Prove this inequality by using induction on n.

For n = 2 there is nothing to prove; the statement is exactly the definition of convexity.


Assume the statement is true for n, and consider positive weights λ1 , . . . , λn+1 summing to 1 (so λ1 < 1). Then
\[
f\Big(\sum_{i=1}^{n+1} \lambda_i x_i\Big)
= f\Big(\lambda_1 x_1 + (1 - \lambda_1) \sum_{i=2}^{n+1} \frac{\lambda_i}{1 - \lambda_1}\, x_i\Big)
\le \lambda_1 f(x_1) + (1 - \lambda_1)\, f\Big(\sum_{i=2}^{n+1} \frac{\lambda_i}{1 - \lambda_1}\, x_i\Big)
\le \sum_{i=1}^{n+1} \lambda_i f(x_i),
\]
where the first inequality is the n = 2 case (definition of convexity) and the second inequality uses the fact that ∑_{i=2}^{n+1} λi /(1 − λ1 ) = 1, so the induction assumption applies.
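
A small numerical spot check of the inequality (my own illustration, using the convex function f(x) = x² and random convex weights):

```python
import numpy as np

# Spot check of Jensen's inequality for the convex function f(x) = x**2.
rng = np.random.default_rng(1)
x = rng.uniform(-3.0, 3.0, size=5)     # arbitrary points x_1, ..., x_n
lam = rng.dirichlet(np.ones(5))        # positive weights summing to 1
f = lambda t: t**2
print(f(lam @ x), lam @ f(x))          # left-hand side <= right-hand side
```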

Submit to homework@class.brml.org with subject line homework sheet 2 by 2014/10/20, 1:59 CET
