1 Basic Probability
Problem 1: A secret government agency has developed a scanner which determines whether a person
is a terrorist. The scanner is fairly reliable; 95% of all scanned terrorists are identified as terrorists, and
95% of all upstanding citizens are identified as such. An informant tells the agency that exactly one
passenger of 100 aboard an aeroplane in which you are seated is a terrorist. The agency decides to scan
each passenger, and the shifty-looking man sitting next to you is tested as "TERRORIST". What are the
chances that this man is a terrorist? Show your work!
The random variable T indicates whether a person is a terrorist, i.e., with the given information above:
p(T = 1) = 0.01 and p(T = 0) = 0.99. The random variable S indicates whether the scanner reports "TERRORIST"
or not. So p(S = 1∣T = 1) = 0.95 (given in the text) and therefore (laws of probability) p(S = 0∣T = 1) = 0.05.
Similarly, p(S = 0∣T = 0) = 0.95 and p(S = 1∣T = 0) = 0.05. We are interested in p(T = 1∣S = 1):

p(T = 1∣S = 1) = p(S = 1∣T = 1) p(T = 1) / (p(S = 1∣T = 1) p(T = 1) + p(S = 1∣T = 0) p(T = 0)) = 0.0095 / 0.0590 ≈ 0.16

Note that in the denominator we compute p(S = 1) using the law of total probability (also known as the sum
rule).
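These numbers can be verified with a short script (a sketch; the variable names are mine, not from the sheet):

```python
# Bayes' rule for the scanner problem: p(T=1 | S=1).
p_t = 0.01              # prior: one terrorist among 100 passengers
p_s_given_t = 0.95      # scanner flags a true terrorist
p_s_given_not_t = 0.05  # scanner falsely flags an upstanding citizen

# Law of total probability for the denominator p(S=1).
p_s = p_s_given_t * p_t + p_s_given_not_t * (1 - p_t)
posterior = p_s_given_t * p_t / p_s
print(f"p(S=1)     = {p_s:.4f}")        # 0.0590
print(f"p(T=1|S=1) = {posterior:.4f}")  # 0.1610
```

Despite the positive test, the man is a terrorist with probability of only about 16%, because the prior p(T = 1) = 0.01 is so small.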
Problem 2: Two balls are placed in a box as follows: A fair coin is tossed and a white ball is placed in
the box if a head occurs, otherwise a red ball is placed in the box. The coin is tossed again and a red ball
is placed in the box if a tail occurs, otherwise a white ball is placed in the box. Balls are drawn from the
box three times in succession (always with replacing the drawn ball back in the box). It is found that on
all three occasions a red ball is drawn. What is the probability that both balls in the box are red? Show
your work!
Denote by RRR the event that 3 red balls are drawn. Similarly, denote by rr the event that 2 red
balls are placed in the box, rw the event that first a white and then a red ball are placed in the box,
and wr and ww for the remaining two possibilities. We know that

p(rr) = p(rw) = p(wr) = p(ww) = 1/4
Furthermore
p(RRR∣rr) = 1, p(RRR∣rw) = p(RRR∣wr) = 1/8, p(RRR∣ww) = 0
Therefore
p(rr∣RRR) = p(RRR∣rr) p(rr) / p(RRR) = (1/4) / (5/16) = 4/5
where p(RRR) = ∑x∈{rr,rw,wr,ww} p(RRR∣x)p(x) is computed with the sum rule.
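The same computation can be done exactly with rational arithmetic (a sketch; names are mine, not from the sheet):

```python
from fractions import Fraction

# Exact posterior p(rr | RRR) via Bayes' rule, done with exact fractions.
# Keys: contents of the box after the two coin tosses.
quarter = Fraction(1, 4)
prior = {"rr": quarter, "rw": quarter, "wr": quarter, "ww": quarter}
reds = {"rr": 2, "rw": 1, "wr": 1, "ww": 0}   # number of red balls in the box

# Likelihood of drawing red three times with replacement.
lik = {k: Fraction(reds[k], 2) ** 3 for k in prior}

p_rrr = sum(lik[k] * prior[k] for k in prior)       # 5/16
posterior_rr = lik["rr"] * prior["rr"] / p_rrr      # 4/5
print(p_rrr, posterior_rr)
```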
Submit to homework@class.brml.org with subject line homework sheet 2 by 2014/10/20, 1:59 CET
Machine Learning 1 — WS2014 — Module IN2064 Sheet 2 ⋅ Page 2
Problem 3: There are eleven urns labeled by u ∈ {0, 1, 2, . . . , 10}, each containing ten balls. Urn u
contains u black balls and 10 − u white balls. Alice selects an urn u at random and draws N times with
replacement from that urn, obtaining nB black balls and N − nB white balls. If after N = 10 draws nB = 3
black balls have been drawn, what is the probability that the urn Alice is using is urn u?
Problem 4: Now, let Alice draw another ball from the same urn. What is the probability that the next
drawn ball is black (show your work)?
Our goal is to compute P(u∣nB, N). Given are P(u) = 1/11 (the urn is selected at random) and, with b_u = u/10
denoting the probability of drawing a black ball from urn u:

P(nB∣u, N) = C(N, nB) · b_u^nB · (1 − b_u)^(N − nB)
Using the product rule twice:

P(u, nB, N) = P(u∣nB, N) P(nB, N) = P(nB∣u, N) P(N∣u) P(u)

Assuming that P(N∣u) = P(N):
Thus:
P(u∣nB, N) = P(nB∣u, N) P(u) P(N) / P(nB, N)
Applying again the sum rule for the denominator: P(nB, N) = Σ_u′ P(nB∣u′, N) P(u′) P(N), so P(N) cancels in the ratio.
Now, let Alice draw another ball from the same urn. What is the probability that the next drawn ball is black?
Let B_{N+1} denote the event that the next ball is black. Thus, using the sum rule:

P(B_{N+1}∣nB, N) = Σ_u P(B_{N+1}∣u, nB, N) P(u∣nB, N)

For a fixed u, balls are drawn with replacement, therefore P(B_{N+1}∣u, nB, N) = b_u, thus:

P(B_{N+1}∣nB, N) = Σ_u b_u P(u∣nB, N)
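Both the posterior of Problem 3 and the predictive probability of Problem 4 can be checked numerically (a sketch; variable names are mine, not from the sheet):

```python
from math import comb

# Posterior over urns u in {0, ..., 10} after drawing nB = 3 black balls in
# N = 10 draws with replacement, and the predictive probability that the
# next draw is black.
N, nB = 10, 3
urns = range(11)
prior = 1 / 11                    # uniform prior over the eleven urns
b = {u: u / 10 for u in urns}     # b_u: chance of black from urn u

lik = {u: comb(N, nB) * b[u] ** nB * (1 - b[u]) ** (N - nB) for u in urns}
evidence = sum(lik[u] * prior for u in urns)          # P(nB | N)
post = {u: lik[u] * prior / evidence for u in urns}   # P(u | nB, N)

# Problem 4: P(B_{N+1} | nB, N) = sum_u b_u * P(u | nB, N)
p_next_black = sum(b[u] * post[u] for u in urns)
for u in urns:
    print(f"P(u={u:2d} | nB=3, N=10) = {post[u]:.4f}")
print(f"P(next ball black) = {p_next_black:.4f}")
```

Note that urns 0 and 10 get zero posterior mass, since both black and white balls have been observed.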
Problem 5: Calculate the mean and the variance of the uniform random variable X with PDF p(x) =
1, ∀ x ∈ [0, 1], and 0 elsewhere.
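A worked solution sketch (this derivation is not part of the original sheet):

```latex
E[X] = \int_0^1 x \,\mathrm{d}x = \tfrac{1}{2}, \qquad
E[X^2] = \int_0^1 x^2 \,\mathrm{d}x = \tfrac{1}{3}, \qquad
\operatorname{Var}[X] = E[X^2] - E[X]^2 = \tfrac{1}{3} - \tfrac{1}{4} = \tfrac{1}{12}.
```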
Problem 6: Consider two random variables X and Y with joint density p(x, y). Prove the following two
results:

E_X[X] = E_Y[E_{X∣Y}[X]]

Var_X[X] = E_Y[Var_{X∣Y}[X]] + Var_Y[E_{X∣Y}[X]]

Here E_{X∣Y}[X] denotes the expectation of X under the conditional density p(x∣y), with a similar notation
for the conditional variance.
Consider the discrete case only, using the sum rule and term reordering as the only necessary strategies:

E_Y[E_{X∣Y}[X]] = Σ_y p(y) Σ_x x p(x∣y) = Σ_x x Σ_y p(x, y) = Σ_x x p(x) = E_X[X]
Using this result and carefully dealing with the meanings of the various symbols, we can also derive the second
equation:

E_Y[Var_{X∣Y}[X]] = Σ_y p(y) Σ_x (x − E_{X∣Y}[X])² p(x∣y)
= Σ_x x² p(x) − 2 Σ_y (E_{X∣Y}[X])² p(y) + Σ_y (E_{X∣Y}[X])² p(y)
= E_X[X²] − E_Y[(E_{X∣Y}[X])²]

Var_Y[E_{X∣Y}[X]] = Σ_y (E_{X∣Y}[X] − E_Y[E_{X∣Y}[X]])² p(y)
= Σ_y (E_{X∣Y}[X])² p(y) − 2 E_X[X] Σ_y E_{X∣Y}[X] p(y) + (E_X[X])²
= E_Y[(E_{X∣Y}[X])²] − 2 (E_X[X])² + (E_X[X])²
= E_Y[(E_{X∣Y}[X])²] − (E_X[X])²

Thus we get:

E_Y[Var_{X∣Y}[X]] + Var_Y[E_{X∣Y}[X]] = E_X[X²] − (E_X[X])² = Var[X]
There is an alternative, shorter way:

E_Y[Var_{X∣Y}[X]] + Var_Y[E_{X∣Y}[X]] = E_Y[E_{X∣Y}[X²]] − E_Y[(E_{X∣Y}[X])²] + E_Y[(E_{X∣Y}[X])²] − (E_Y[E_{X∣Y}[X]])² = E_X[X²] − (E_X[X])² = Var[X]
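The law of total variance can be checked numerically on a small example (a sketch; the probability table below is made up for illustration):

```python
# Numeric check of Var[X] = E_Y[Var_{X|Y}[X]] + Var_Y[E_{X|Y}[X]] on a small
# discrete joint distribution p(x, y).
p = {(0, 0): 0.1, (1, 0): 0.3, (0, 1): 0.4, (2, 1): 0.2}  # p[(x, y)]
ys = sorted({y for _, y in p})
p_y = {y: sum(q for (_, y2), q in p.items() if y2 == y) for y in ys}

def cond_moment(y, k):
    """E[X^k | Y = y] computed from p(x | y) = p(x, y) / p(y)."""
    return sum(x ** k * q for (x, y2), q in p.items() if y2 == y) / p_y[y]

# Left-hand side: Var[X] from the marginal of X.
ex = sum(x * q for (x, _), q in p.items())
ex2 = sum(x ** 2 * q for (x, _), q in p.items())
var_x = ex2 - ex ** 2

# Right-hand side: E_Y[Var_{X|Y}[X]] + Var_Y[E_{X|Y}[X]].
e_var = sum((cond_moment(y, 2) - cond_moment(y, 1) ** 2) * p_y[y] for y in ys)
var_e = sum((cond_moment(y, 1) - ex) ** 2 * p_y[y] for y in ys)
print(f"Var[X] = {var_x:.4f},  E[Var] + Var[E] = {e_var + var_e:.4f}")
```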
Now interpret these two results by setting X to a parameter θ that we want to learn and Y to a random
variable representing a possible observed data set D. Then

E_θ[θ] = E_D[E_θ[θ∣D]]

says that the posterior mean of θ (the right-hand side of the above equation), averaged over
the distribution generating the data (out of which one particular realization of D is chosen), is equal
to the prior mean of θ. This is weird! Here one mixes Bayesian analysis (posteriors) with frequentist
viewpoints (expectations over all possible data sets) and thus gets a rather strange result.
With

Var[θ] = E_D[Var_θ[θ∣D]] + Var_D[E_θ[θ∣D]]

one gets an interesting statement about the expected posterior variance of θ (the first term on the
right-hand side): on average it is never larger than the prior variance (the left-hand side), because
the variance of the posterior mean (the second term on the right-hand side) is a nonnegative quantity.
2 Probability Inequalities
Inequalities are useful for bounding quantities that might otherwise be hard to compute. We’ll begin
with a simple inequality, called the Markov inequality after Andrei A. Markov, a student of Pafnuty
Chebyshev.
Let X be a non-negative, discrete random variable, and let c > 0 be a positive constant. Then the Markov inequality states:

P(X ≥ c) ≤ E[X] / c
Apply the Markov Inequality to the deviation of a random variable from its mean, i.e. for a general
random variable X we wish to bound the probability of the event {∣X − E[X]∣ > a}, which is the same as
the event {(X − E[X])2 > a2 }.
P(∣X − E[X]∣ > a) = P((X − E[X])² > a²) ≤ Var(X) / a²
Then start from the statement and transform it into something similar to the above (here X is the number of
heads in n fair coin tosses, so E[X] = n/2 and Var(X) = n/4):

P(X > 3n/4) = P(X − n/2 > n/4) ≤ P(∣X − E[X]∣ ≥ n/4) ≤ (n/4) / (n²/16) = 4/n
Note that we treat X here as the sum of n independent coin flips and thus use the rule that the
variance of a sum of independent variables is the sum of the variances. Interpretation of the result:
the bound 4/n tends to zero, so with more tosses the probability of deviating this far from the mean
vanishes.
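The bound can be compared against the exact binomial tail (a sketch; the function name and the choice of n values are mine):

```python
from math import comb

def tail_prob(n):
    """Exact P(X > 3n/4) for X = number of heads in n fair coin flips."""
    return sum(comb(n, k) for k in range(3 * n // 4 + 1, n + 1)) / 2 ** n

# Chebyshev gives P(X > 3n/4) <= 4/n; compare with the exact binomial tail.
for n in (8, 16, 64, 256):
    print(f"n={n:4d}  exact={tail_prob(n):.6f}  bound=4/n={4 / n:.4f}")
```

As expected, the exact tail probability is far below the Chebyshev bound, which is loose but holds for every n.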
Let f be a convex function on an interval I. If λ1 , . . . , λn are positive numbers with λ1 + ⋅ ⋅ ⋅ + λn = 1, then for any
x1 , . . . , xn ∈ I:
f (λ1 x1 + ⋅ ⋅ ⋅ + λn xn ) ≤ λ1 f (x1 ) + ⋅ ⋅ ⋅ + λn f (xn )
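A quick spot-check of this inequality for one convex function (a sketch; the weights and points below are arbitrary illustrative choices):

```python
from math import exp

# Jensen's inequality for the convex function f(x) = exp(x):
# f(λ1*x1 + ... + λn*xn) <= λ1*f(x1) + ... + λn*f(xn).
lams = [0.2, 0.5, 0.3]   # positive weights summing to 1
xs = [-1.0, 0.5, 2.0]

lhs = exp(sum(l * x for l, x in zip(lams, xs)))
rhs = sum(l * exp(x) for l, x in zip(lams, xs))
print(f"f(sum λx) = {lhs:.4f} <= sum λf(x) = {rhs:.4f}")
```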