
Machine learning

2nd Report
Ioannis Kouroudis

1.1

f(θ) = θ^t (1−θ)^h:
    f'(θ)  = t (1−θ)^h θ^(t−1) − h (1−θ)^(h−1) θ^t
    f''(θ) = (1−θ)^(h−2) θ^(t−2) ((t^2 + (2h−1)t + h^2 − h)θ^2 + ((2−2h)t − 2t^2)θ + t^2 − t)

f(θ) = log(θ^t (1−θ)^h):
    f'(θ)  = ((t+h)θ − t) / ((θ−1)θ)
    f''(θ) = −((t+h)θ^2 − 2tθ + t) / ((θ−1)^2 θ^2)
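The entries above can be checked symbolically, for instance with sympy; the snippet below is only an illustrative sketch, and the printed derivatives agree with the table entries up to algebraic rearrangement.

# Sketch: symbolic check of the derivative table (requires sympy).
import sympy as sp

theta, t, h = sp.symbols('theta t h', positive=True)
f = theta**t * (1 - theta)**h        # original function
g = sp.log(f)                        # its logarithm

for name, expr in [('f', f), ('log f', g)]:
    d1 = sp.simplify(sp.diff(expr, theta))       # first derivative
    d2 = sp.simplify(sp.diff(expr, theta, 2))    # second derivative
    print(name, d1, d2, sep='\n  ')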

1.2
The logarithm is a monotonic function and therefore preserves the critical points. However, as can be seen from exercise 1.1, the derivatives of the logarithmic expression are substantially simpler than the corresponding original ones. Furthermore, although the critical points of both functions are the same, the values of the log-likelihood are of much larger magnitude than those of the raw likelihood, which can become vanishingly small, and are consequently less prone to underflow, cancellation and round-off errors.
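This can be illustrated numerically; in the sketch below the 2000 per-sample probabilities are arbitrary values, chosen only to show that the raw product underflows while the sum of logarithms does not.

# Sketch: the product of many per-sample probabilities underflows in double
# precision, while the sum of their logarithms remains representable.
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.01, 0.99, size=2000)    # 2000 artificial per-sample probabilities

likelihood = np.prod(p)                   # underflows to 0.0
log_likelihood = np.sum(np.log(p))        # a finite negative number, no underflow

print(likelihood, log_likelihood)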

2.3
MLE = argmax(P(D|θ))
By using Bayes' rule and omitting the constant denominator, since it does not shift the critical point:
MAP = argmax(P(D|θ)P(θ))
Consequently, we are looking for a probability distribution of θ that does not affect the critical points. The only such distribution is a constant one, i.e. a uniform distribution. In this sense, the MLE is a special case of the MAP estimate with a uniform prior.
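As a concrete check, using the posterior form derived in 2.4 below with the uniform prior written as Beta(1, 1), i.e. a = b = 1:

P(θ|D) ∝ θ^(1+m−1) (1−θ)^(1+l−1) = θ^m (1−θ)^l, whose maximum is at θ_MAP = m/(m+l) = MLE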

2.4
The posterior probability, for a binomial likelihood and a Beta prior belief, is a Beta distribution as well, given by Beta(a+m, b+l), and its mean is (a+m)/(a+b+m+l).
The proof of the above is the following (the constants can be omitted, as we only need to determine the form of the distribution and normalize it afterwards):

P(θ|D) ∝ Bin(N, m) Beta(a, b) → P(θ|D) ∝ θ^(a+m−1) (1−θ)^(b+l−1) → P(θ|D) = Beta(a + m, b + l)

MLE = m/(m+l)

The prior mean, since the prior is a Beta distribution, is given by

E(θ) = a/(a+b)

We can solve the following equations to get an expression of the form E(θ|D) = λE(θ) + (1 − λ)MLE:

(a/(a+b)) · p = a/(a+b+m+l)

and

(m/(m+l)) · k = m/(a+b+m+l)

which give

p = (a+b)/(a+b+m+l)

k = (m+l)/(a+b+m+l)

As can easily be seen, p + k = 1, and p·E(θ) + k·MLE = (a+m)/(a+b+m+l) = E(θ|D). Therefore:

E(θ|D) = λE(θ) + (1 − λ)MLE

with λ = (a+b)/(a+b+m+l)
Since both λ and 1 − λ are larger than zero and smaller than one, E(θ|D) is a convex combination of E(θ) and the MLE; to prove that it lies between the two it is sufficient to show that E(θ) and the MLE cannot both be smaller (or both larger) than E(θ|D).
Supposing

a/(a+b) < (a+m)/(a+b+m+l), which cross-multiplies to al < mb,

and, in addition, supposing

m/(m+l) < (a+m)/(a+b+m+l), which cross-multiplies to mb < al,

we obtain a contradiction. The case a/(a+b) > (a+m)/(a+b+m+l) together with m/(m+l) > (a+m)/(a+b+m+l) is ruled out in the same way. Equality holds only when the prior mean already coincides with the MLE (al = mb); otherwise we can indeed assume strict inequality. Consequently, one of the two terms is larger and one smaller than the posterior mean, and it is proven that E(θ|D) lies between E(θ) and the MLE.
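The convex-combination form and the betweenness of the posterior mean can also be checked numerically; in the sketch below the values a = 2, b = 3, m = 7, l = 5 are arbitrary example values.

# Sketch: numeric check that E[theta|D] = lambda*E[theta] + (1 - lambda)*MLE
# and that it lies between the prior mean and the MLE (a, b, m, l are examples).
a, b, m, l = 2.0, 3.0, 7.0, 5.0

prior_mean = a / (a + b)                    # E(theta)
mle        = m / (m + l)                    # MLE
post_mean  = (a + m) / (a + b + m + l)      # mean of Beta(a+m, b+l)
lam        = (a + b) / (a + b + m + l)      # the weight derived above

assert abs(post_mean - (lam * prior_mean + (1 - lam) * mle)) < 1e-12
assert min(prior_mean, mle) <= post_mean <= max(prior_mean, mle)
print(prior_mean, post_mean, mle)           # 0.4  0.529...  0.583...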

5a
The overall likelihood is

P(D|λ) = ∏_{i=1}^{n} e^(−λ) λ^(k_i) / k_i!

f(λ) = log P(D|λ) = −nλ + Σ_{i=1}^{n} (k_i log λ − log k_i!)

The critical (in this case maximum) point of this function is at df/dλ = 0, which is at

λ_max = (Σ_{i=1}^{n} k_i) / n

Since the observed counts k_i are i.i.d. Poisson(λ), E[k_i] = λ.


Therefore,

E[λ_max] = E[Σ_{i=1}^{n} k_i] / n = nλ / n = λ

Therefore the estimator is unbiased.
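This unbiasedness can also be illustrated empirically; the sketch below is a small Monte Carlo check in which the true λ, the sample size n and the number of repetitions are arbitrary example values.

# Sketch: empirical check that the sample mean of Poisson counts is an unbiased
# estimator of lambda (lam_true, n and repeats are arbitrary example values).
import numpy as np

rng = np.random.default_rng(0)
lam_true, n, repeats = 3.5, 20, 200_000

samples = rng.poisson(lam_true, size=(repeats, n))   # 'repeats' independent data sets
lam_hat = samples.mean(axis=1)                        # MLE of lambda for each data set

print(lam_hat.mean())   # close to lam_true = 3.5, up to Monte Carlo error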

5b
As seen before, the posterior distribution is given by

P(λ|D) = P(D|λ) P(λ) / P(D)

Substituting P(λ) with the Gamma prior β^a λ^(a−1) e^(−βλ) / Γ(a), substituting P(D|λ) with the Poisson likelihood from 5a, and omitting P(D) entirely, since the MAP estimate is not affected by it, we get

P(λ|D) ∝ e^(−nλ) (∏_{i=1}^{n} λ^(k_i) / k_i!) · β^a λ^(a−1) e^(−βλ) / Γ(a) ∝ λ^(Σ_{i=1}^{n} k_i + a − 1) e^(−λ(β+n))

Since we are calculating the MAP estimate, the proportionality allows us to drop constants without affecting the final result. Further, referring to the answers of 1.1 and 1.2, we can use the natural logarithm to ease the calculations:

λ_MAP = argmax(ln P(λ|D)) = argmax((Σ_{i=1}^{n} k_i + a − 1) ln λ − λ(n + β))

By differentiating the last expression with respect to λ we get


(Σ_{i=1}^{n} k_i + a − 1) / λ − (n + β) = 0 → λ = (Σ_{i=1}^{n} k_i + a − 1) / (n + β)

At this point, we should check the second derivative to determine the nature of the critical point. The second derivative is −(Σ_{i=1}^{n} k_i + a − 1)/λ^2, which is negative (for Σ_{i=1}^{n} k_i + a > 1), so the calculated λ is indeed a maximum. Consequently,
λ_MAP = (Σ_{i=1}^{n} k_i + a − 1) / (n + β)
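As a sanity check, the closed-form λ_MAP can be compared with a brute-force maximization of the unnormalized log-posterior; in the sketch below the prior parameters a, β and the counts k_i are arbitrary example values, and the grid search is only an illustration, not part of the derivation.

# Sketch: compare the closed-form MAP estimate with a grid maximization of the
# unnormalized log-posterior (a, beta and the counts are arbitrary example values).
import numpy as np

a, beta = 2.0, 1.5                     # Gamma prior parameters (example values)
k = np.array([3, 5, 2, 4, 6])          # example Poisson counts
n = len(k)

lam_map = (k.sum() + a - 1) / (n + beta)            # closed-form result from above

grid = np.linspace(1e-6, 20, 2_000_000)             # dense grid over plausible lambda
log_post = (k.sum() + a - 1) * np.log(grid) - grid * (n + beta)
lam_grid = grid[np.argmax(log_post)]

print(lam_map, lam_grid)   # the two values agree up to the grid resolution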
