Homework #1 Solution

Density Estimation

(a)
The PDF of the Beta distribution is

    f(x; \alpha, \beta) = \frac{1}{B(\alpha, \beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}

When \beta = 1, it becomes

    f(x; \alpha) = \alpha x^{\alpha - 1}

The likelihood is

    L = L(\alpha) = \prod_{i=1}^{n} f(x_i; \alpha) = \prod_{i=1}^{n} \alpha x_i^{\alpha - 1}

    \ln L = n \ln \alpha + (\alpha - 1) \sum_{i=1}^{n} \ln x_i

Let

    \frac{\partial \ln L}{\partial \alpha} = \frac{n}{\alpha} + \sum_{i=1}^{n} \ln x_i = 0

We have

    \hat{\alpha} = -\frac{n}{\sum_{i=1}^{n} \ln x_i}
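As a quick numerical sanity check (not part of the original solution), the following Python sketch compares the closed-form estimate \hat{\alpha} = -n / \sum_i \ln x_i with a brute-force grid search over the log-likelihood; the true \alpha = 3, the sample size, and the seed are illustrative assumptions:

    import numpy as np

    # Sketch: check the closed-form MLE alpha_hat = -n / sum(ln x_i)
    # for f(x; alpha) = alpha * x^(alpha - 1) on (0, 1), i.e. Beta(alpha, 1).
    rng = np.random.default_rng(0)
    alpha_true = 3.0                                  # assumed true parameter
    x = rng.beta(alpha_true, 1.0, size=10_000)        # Beta(alpha, 1) samples

    alpha_hat = -len(x) / np.sum(np.log(x))           # closed-form MLE

    # Brute-force check: maximize the log-likelihood over a grid of alpha values.
    grid = np.linspace(0.1, 10, 2000)
    loglik = len(x) * np.log(grid) + (grid - 1) * np.sum(np.log(x))
    print(alpha_hat, grid[np.argmax(loglik)])         # both should be close to 3.0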
For f(x; \theta) = N(\theta, \theta), i.e.

    f(x; \theta) = \frac{1}{\sqrt{2\pi\theta}} e^{-\frac{(x - \theta)^2}{2\theta}}

the likelihood is

    L = L(\theta) = \prod_{i=1}^{n} f(x_i; \theta) = (2\pi\theta)^{-\frac{n}{2}} \exp\Big( -\sum_{i=1}^{n} \frac{(x_i - \theta)^2}{2\theta} \Big)

    \ln L = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\theta - \frac{1}{2\theta}\sum_{i=1}^{n} (x_i - \theta)^2
          = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\theta - \frac{1}{2\theta}\sum_{i=1}^{n} x_i^2 + \sum_{i=1}^{n} x_i - \frac{n\theta}{2}

Let

    \frac{\partial \ln L}{\partial \theta} = -\frac{n}{2\theta} + \frac{1}{2\theta^2}\sum_{i=1}^{n} x_i^2 - \frac{n}{2} = 0

Then we have

    n\theta^2 + n\theta - \sum_{i=1}^{n} x_i^2 = 0

We know \theta > 0, so

    \hat{\theta} = \frac{-1 + \sqrt{1 + \frac{4}{n}\sum_{i=1}^{n} x_i^2}}{2}
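A small sketch (under the assumption that the model above is X ~ N(\theta, \theta) with \theta > 0) recovering \hat{\theta} from the positive root of the quadratic; the true \theta = 2 and the sample size are illustrative:

    import numpy as np

    # Sketch: solve n*theta^2 + n*theta - sum(x_i^2) = 0 for the positive root.
    rng = np.random.default_rng(1)
    theta_true = 2.0                                  # assumed true parameter
    x = rng.normal(theta_true, np.sqrt(theta_true), size=100_000)

    s2 = np.mean(x ** 2)                              # (1/n) * sum of x_i^2
    theta_hat = (-1 + np.sqrt(1 + 4 * s2)) / 2        # positive root of theta^2 + theta - s2 = 0
    print(theta_hat)                                  # should be close to 2.0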
Since \hat{f}(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h} K\Big(\frac{x - X_i}{h}\Big), substituting z = \frac{x - t}{h} we have

    \int \frac{1}{h} K\Big(\frac{x - t}{h}\Big) f(t)\, dt = \int K(z) f(x - zh)\, dz

Therefore

    E[\hat{f}(x)] = \int K(z) f(x - zh)\, dz
                  = \int K(z) \Big[ f(x) - zh f'(x) + \frac{z^2 h^2}{2} f''(x) + o(h^2) \Big] dz
                  = f(x)\int K(z)\, dz - f'(x)\, h \int z K(z)\, dz + f''(x)\, \frac{h^2}{2}\int z^2 K(z)\, dz + o(h^2)
                  = f(x) + \frac{h^2 \sigma_K^2}{2} f''(x) + o(h^2)

where \sigma_K^2 = \int z^2 K(z)\, dz, using \int K(z)\, dz = 1 and \int z K(z)\, dz = 0 for a symmetric kernel. Hence the bias is

    E[\hat{f}(x)] - f(x) = \frac{h^2 \sigma_K^2}{2} f''(x) + o(h^2)
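A simulation sketch of this bias formula (assumptions: Gaussian kernel, data drawn from N(0, 1), evaluation at x_0 = 0, bandwidth h = 0.3); the simulated bias and the leading term (h^2 \sigma_K^2 / 2) f''(x_0) should be close:

    import numpy as np

    rng = np.random.default_rng(2)
    n, h, x0, trials = 2000, 0.3, 0.0, 400

    def f(x):            # true density N(0, 1)
        return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

    def f_second(x):     # second derivative of the N(0, 1) density
        return (x ** 2 - 1) * f(x)

    est = []
    for _ in range(trials):
        X = rng.normal(size=n)
        K = np.exp(-((x0 - X) / h) ** 2 / 2) / np.sqrt(2 * np.pi)   # Gaussian kernel values
        est.append(np.mean(K) / h)                                  # f_hat(x0)

    print("simulated bias:", np.mean(est) - f(x0))
    print("leading term  :", h ** 2 / 2 * f_second(x0))   # sigma_K^2 = 1 for the Gaussian kernel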
Part (a)
The density estimate is equivalent to the fraction of the samples that fall within the
given bin, divided by the length of the bin (h). Mathematically:

    \hat{f}_n(x) = \frac{1}{nh}\sum_{i=1}^{n} \mathbf{1}_{(x_0, x_0 + h]}(x_i)

Let B = \sum_{i=1}^{n} \mathbf{1}_{(x_0, x_0 + h]}(x_i), so B \sim \mathrm{Binomial}(n, F(x_0 + h) - F(x_0)). Then

    E[\hat{f}_n(x)] = \frac{1}{nh} E(B) = \frac{F(x_0 + h) - F(x_0)}{h}

    \mathrm{Var}[\hat{f}_n(x)] = \frac{1}{n^2 h^2} \mathrm{Var}(B) = \frac{\big(F(x_0 + h) - F(x_0)\big)\big(1 - F(x_0 + h) + F(x_0)\big)}{n h^2}
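A simulation sketch of the mean and variance derived above (assumptions: data from N(0, 1), bin (x_0, x_0 + h] with x_0 = 0 and h = 0.4, values chosen only for illustration):

    import numpy as np
    from math import erf, sqrt

    def Phi(z):                                        # standard normal CDF
        return 0.5 * (1 + erf(z / sqrt(2)))

    rng = np.random.default_rng(3)
    n, h, x0, trials = 500, 0.4, 0.0, 2000
    p = Phi(x0 + h) - Phi(x0)                          # F(x0 + h) - F(x0)

    est = []
    for _ in range(trials):
        x = rng.normal(size=n)
        B = np.sum((x > x0) & (x <= x0 + h))           # number of samples falling in the bin
        est.append(B / (n * h))                        # histogram density estimate

    print("simulated mean:", np.mean(est), " theory:", p / h)
    print("simulated var :", np.var(est), " theory:", p * (1 - p) / (n * h ** 2))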
Naive Bayes
(a) The label variable Y follows a Bernoulli distribution, with parameter \pi = P(Y = 1).
(b) For each feature X_j, P(X_j | Y = y_k) follows a Gaussian distribution N(\mu_{jk}, \sigma_j).
Using the Naive Bayes assumption that, for all j' \neq j, X_j and X_{j'} are conditionally
independent given Y, we compute P(Y = 1 | X) and show that it can be written in the form

    P(Y = 1 | X) = \frac{1}{1 + \exp(w_0 + w^T X)}
By Bayes rule,

    P(Y = 1 | X) = \frac{P(X | Y = 1) P(Y = 1)}{P(X | Y = 1) P(Y = 1) + P(X | Y = 0) P(Y = 0)}
                 = \frac{1}{1 + \big(\frac{1}{\pi} - 1\big)\, \frac{P(X | Y = 0)}{P(X | Y = 1)}}
Explicitly, using

    P(X | Y = y_k) = \prod_{j=1}^{D} \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\Big( -\frac{(x_j - \mu_{jk})^2}{2\sigma_j^2} \Big)

we get

    P(Y = 1 | X) = \frac{1}{1 + \big(\frac{1}{\pi} - 1\big) \prod_{j=1}^{D} \exp\big( (2\sigma_j^2)^{-1}\big( (x_j - \mu_{j1})^2 - (x_j - \mu_{j0})^2 \big) \big)}

Expanding the exponent,

    \big(\tfrac{1}{\pi} - 1\big) \prod_{j=1}^{D} \exp\big( (2\sigma_j^2)^{-1}\big( (x_j - \mu_{j1})^2 - (x_j - \mu_{j0})^2 \big) \big)
    = \exp\big( \log\big(\tfrac{1}{\pi} - 1\big) \big) \exp\Big( \sum_j (2\sigma_j^2)^{-1}\big( x_j^2 - 2x_j\mu_{j1} + \mu_{j1}^2 - x_j^2 + 2x_j\mu_{j0} - \mu_{j0}^2 \big) \Big)
    = \exp\Big( \log\big(\tfrac{1}{\pi} - 1\big) + \sum_j (2\sigma_j^2)^{-1}(\mu_{j1}^2 - \mu_{j0}^2) + \sum_j (\sigma_j^2)^{-1}(\mu_{j0} - \mu_{j1}) x_j \Big)

Matching this with

    P(Y = 1 | X) = \frac{1}{1 + \exp(w_0 + w^T X)}

gives us:
(a) w_0 = \log\big(\frac{1}{\pi} - 1\big) + \sum_j (2\sigma_j^2)^{-1}(\mu_{j1}^2 - \mu_{j0}^2)
(b) w_j = \frac{\mu_{j0} - \mu_{j1}}{\sigma_j^2}, for j \ge 1.
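A quick check of the derived weights (the parameters below are synthetic, not from the homework): the Bayes-rule posterior and the logistic form with w_0 and w_j as above should agree exactly:

    import numpy as np

    rng = np.random.default_rng(4)
    D = 3
    pi = 0.6                                   # P(Y = 1)
    mu0, mu1 = rng.normal(size=D), rng.normal(size=D)
    sigma2 = rng.uniform(0.5, 2.0, size=D)     # per-feature variances sigma_j^2

    def gauss(x, mu, s2):
        return np.exp(-(x - mu) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

    x = rng.normal(size=D)

    # Direct Bayes-rule computation.
    p1 = pi * np.prod(gauss(x, mu1, sigma2))
    p0 = (1 - pi) * np.prod(gauss(x, mu0, sigma2))
    posterior = p1 / (p1 + p0)

    # Logistic form with the derived weights.
    w0 = np.log(1 / pi - 1) + np.sum((mu1 ** 2 - mu0 ** 2) / (2 * sigma2))
    w = (mu0 - mu1) / sigma2
    logistic = 1 / (1 + np.exp(w0 + w @ x))

    print(posterior, logistic)                 # the two numbers should agree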
(b) (10 points) IMPORTANT: if someone solves the problem with \sigma_{jk} or \sigma_j, either
one is fine. Here are two solutions.
Case 1: \sigma_{jk}
The data is D = \{(X_i, Y_i)\}_{i=1}^{N}, and the parameters are \theta = \{p_k, \mu_{jk}, \sigma_{jk}\}_{j=1,\dots,D;\, k=1,\dots,K}.
By the product rule and the Naive Bayes assumption, we know

    P(X_i, Y_i; \theta) = P(Y_i) P(X_i | Y_i) = P(Y_i) \prod_{j=1}^{D} P(x_{ij} | Y_i)
                        = p_{Y_i} \prod_{j=1}^{D} N(\mu_{j,Y_i}, \sigma_{j,Y_i})
                        = p_{Y_i} \prod_{j=1}^{D} \frac{1}{\sqrt{2\pi}\,\sigma_{j,Y_i}} e^{-\frac{(x_{ij} - \mu_{j,Y_i})^2}{2\sigma_{j,Y_i}^2}}
The log-likelihood is

    L = \ln \prod_{i=1}^{N} P(X_i, Y_i; \theta) = \sum_{i=1}^{N} \ln P(X_i, Y_i; \theta)
      = \sum_{i=1}^{N} \Big[ \ln p_{Y_i} + \sum_{j=1}^{D} \ln \frac{1}{\sqrt{2\pi}\,\sigma_{j,Y_i}} e^{-\frac{(x_{ij} - \mu_{j,Y_i})^2}{2\sigma_{j,Y_i}^2}} \Big]
      = \sum_{i=1}^{N} \ln p_{Y_i} - \frac{ND}{2}\ln 2\pi - \sum_{i=1}^{N}\sum_{j=1}^{D} \ln \sigma_{j,Y_i} - \sum_{i=1}^{N}\sum_{j=1}^{D} \frac{(x_{ij} - \mu_{j,Y_i})^2}{2\sigma_{j,Y_i}^2}

To enforce the constraint \sum_{k=1}^{K} p_k = 1, define the Lagrangian

    \tilde{L} = L + \lambda \Big( \sum_{k=1}^{K} p_k - 1 \Big)
We also denote N_k = \sum_{i=1}^{N} I(Y_i = k). Setting

    \frac{\partial \tilde{L}}{\partial p_k} = \sum_{i=1}^{N} \frac{I(Y_i = k)}{p_k} + \lambda = \frac{N_k}{p_k} + \lambda = 0

and using \sum_{k} p_k = 1 (which gives \lambda = -N), we have

    \hat{p}_k = \frac{N_k}{N}
Also, setting

    \frac{\partial L}{\partial \mu_{jk}} = \frac{1}{\sigma_{jk}^2} \sum_{i=1}^{N} I(Y_i = k)(x_{ij} - \mu_{jk}) = 0

we have

    \hat{\mu}_{jk} = \frac{\sum_{i=1}^{N} I(Y_i = k)\, x_{ij}}{N_k}
Also, setting

    \frac{\partial L}{\partial \sigma_{jk}} = -\sum_{i=1}^{N} \frac{I(Y_i = k)}{\sigma_{jk}} + \sum_{i=1}^{N} \frac{I(Y_i = k)(x_{ij} - \mu_{jk})^2}{\sigma_{jk}^3}
                                            = -\frac{N_k}{\sigma_{jk}} + \frac{\sum_{i=1}^{N} I(Y_i = k)(x_{ij} - \mu_{jk})^2}{\sigma_{jk}^3} = 0

and since \sigma_{jk} > 0, we have

    \hat{\sigma}_{jk}^2 = \frac{\sum_{i=1}^{N} I(Y_i = k)(x_{ij} - \hat{\mu}_{jk})^2}{N_k}
                        = \frac{1}{N_k} \sum_{i=1}^{N} I(Y_i = k) \Big( x_{ij} - \frac{\sum_{i'=1}^{N} I(Y_{i'} = k)\, x_{i'j}}{N_k} \Big)^2
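A short sketch of these estimators on synthetic data (the data-generating process below is an assumption made only for illustration); it computes the class priors, the per-class means, the per-class variances (Case 1), and, for comparison, the shared variances of Case 2 below:

    import numpy as np

    rng = np.random.default_rng(5)
    N, D, K = 1000, 4, 3
    Y = rng.integers(0, K, size=N)
    true_mu = rng.normal(size=(K, D))
    X = true_mu[Y] + rng.normal(size=(N, D))                              # unit-variance features

    p_hat = np.array([np.mean(Y == k) for k in range(K)])                 # N_k / N
    mu_hat = np.array([X[Y == k].mean(axis=0) for k in range(K)])         # per-class means
    var_jk = np.array([((X[Y == k] - mu_hat[k]) ** 2).mean(axis=0)        # Case 1: sigma_jk^2
                       for k in range(K)])
    var_j = ((X - mu_hat[Y]) ** 2).mean(axis=0)                           # Case 2: shared sigma_j^2

    print(p_hat)                          # each entry should be near 1/3
    print(var_jk.mean(), var_j.mean())    # both should be near 1.0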
Case 2: \sigma_j
Here each feature has a single variance shared across classes, P(x_{ij} | Y_i) = N(\mu_{j,Y_i}, \sigma_j). The log-likelihood is

    L = \ln \prod_{i=1}^{N} P(X_i, Y_i; \theta) = \sum_{i=1}^{N} \ln P(X_i, Y_i; \theta)
      = \sum_{i=1}^{N} \Big[ \ln p_{Y_i} + \sum_{j=1}^{D} \ln \frac{1}{\sqrt{2\pi\sigma_j^2}} e^{-\frac{(x_{ij} - \mu_{j,Y_i})^2}{2\sigma_j^2}} \Big]
      = \sum_{i=1}^{N} \ln p_{Y_i} - \frac{ND}{2}\ln 2\pi - \sum_{i=1}^{N}\sum_{j=1}^{D} \ln \sigma_j - \sum_{i=1}^{N}\sum_{j=1}^{D} \frac{(x_{ij} - \mu_{j,Y_i})^2}{2\sigma_j^2}

The estimates of p_k and \mu_{jk} are the same as in Case 1.
Collecting the terms that depend on \sigma_j, we let

    L(\sigma_j) = -\sum_{i=1}^{N}\sum_{j=1}^{D} \ln \sigma_j - \sum_{i=1}^{N}\sum_{j=1}^{D} \frac{(x_{ij} - \mu_{j,Y_i})^2}{2\sigma_j^2}

Setting

    \frac{\partial L}{\partial \sigma_j^2} = -\frac{N}{2\sigma_j^2} + \frac{\sum_{i=1}^{N} (x_{ij} - \mu_{j,Y_i})^2}{2\sigma_j^4} = 0

and since \sigma_j > 0, we have

    \hat{\sigma}_j^2 = \frac{1}{N} \sum_{i=1}^{N} (x_{ij} - \hat{\mu}_{j,Y_i})^2 = \frac{1}{N} \sum_{k=1}^{K} \sum_{i=1}^{N} I(Y_i = k)(x_{ij} - \hat{\mu}_{jk})^2
Nearest Neighbor
    x-coordinate   y-coordinate   label
          0             49          1
         -7             32          1
         -9             47          1
         29             12          2
         49             31          2
         37             38          2
          8              9          3
         13             -1          3
         -6             -3          3
        -21             12          3
         27            -32          4
         19            -14          4
         27            -20          4

                   mean     std
    x-coordinate   12.77    20.7170
    y-coordinate   12.31    25.9306
    L1 Distance   L2 Distance   label
       2.5851        1.8856       1
       2.2674        1.6211       1
       2.9424        2.0830       1
       0.6272        0.4753       2
       2.3253        1.6781       2
       2.0161        1.4500       2
       0.6564        0.5843       3
       0.6464        0.4575       3
       1.6407        1.3129       3
       2.1719        1.9884       3
       1.8419        1.5415       4
       0.8581        0.8113       4
       1.3791        1.0947       4

Table 4: L1 and L2 distances between the normalized queried data and each of the normalized
labeled data.
If sorted (from minimum to maximum) by L2 distance:

    L2 Distance   label
       0.4575       3
       0.4753       2
       0.5843       3
       0.8113       4
       1.0947       4
       1.3129       3
       1.4500       2
       1.5415       4
       1.6211       1
       1.6781       2
       1.8856       1
       1.9884       3
       2.0830       1

Table 5: Sorting based on L2 distance between the normalized queried data and each of the
normalized labeled data.
If sorted (from minimum to maximum) by L1 distance:

    L1 Distance   label
       0.6272       2
       0.6464       3
       0.6564       3
       0.8581       4
       1.3791       4
       1.6407       3
       1.8419       4
       2.0161       2
       2.1719       3
       2.2674       1
       2.3254       2
       2.5851       1
       2.9424       1

Table 6: Sorting based on L1 distance between the normalized queried data and each of the
normalized labeled data.
Thus:
If using the L2 distance metric and K = 1, the predicted student major will be label 3
(Computer Science).
If using the L2 distance metric and K = 5, the predicted student major will be label 3
(Computer Science). This is actually a tie between Computer Science and Economics, and the
tie is broken in favor of the labeled point with the shortest distance.
If using the L1 distance metric and K = 1, the predicted student major will be label 2
(Electrical Engineering).
If using the L1 distance metric and K = 5, the predicted student major will be label 3
(Computer Science). This is again a tie between Computer Science and Economics, broken in
favor of the labeled point with the shortest distance.
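The procedure can be reproduced with a short sketch: z-score the features with the sample mean and standard deviation above, compute L1/L2 distances to the queried student, and break K-NN ties by the closest tied neighbor. The query point below is only a placeholder; the actual queried student is given in the problem statement, which is not reproduced here:

    import numpy as np
    from collections import Counter

    X = np.array([[0, 49], [-7, 32], [-9, 47], [29, 12], [49, 31], [37, 38],
                  [8, 9], [13, -1], [-6, -3], [-21, 12], [27, -32], [19, -14], [27, -20]])
    y = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4])
    query = np.array([10.0, 5.0])                      # placeholder query point

    mu, sd = X.mean(axis=0), X.std(axis=0, ddof=1)     # sample mean and std per feature
    Xn, qn = (X - mu) / sd, (query - mu) / sd

    def knn_predict(dist, k):
        votes = Counter(y[np.argsort(dist)[:k]])
        best = max(votes.values())
        tied = [c for c, v in votes.items() if v == best]
        # Tie-break: among tied classes, pick the one with the closest labeled point.
        return min(tied, key=lambda c: dist[y == c].min())

    l1 = np.abs(Xn - qn).sum(axis=1)
    l2 = np.sqrt(((Xn - qn) ** 2).sum(axis=1))
    for name, d in [("L1", l1), ("L2", l2)]:
        print(name, "K=1:", knn_predict(d, 1), " K=5:", knn_predict(d, 5))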
(b) Probabilistic K-Nearest Neighbor:
The unconditional density p(x) can be computed as follows:

    p(x) = \sum_{c} p(x | Y = c)\, p(Y = c) = \sum_{c} \frac{K_c}{N_c V} \cdot \frac{N_c}{N} = \sum_{c} \frac{K_c}{N V} = \frac{K}{N V}

Then, by Bayes rule,

    p(Y = c | x) = \frac{p(x | Y = c)\, p(Y = c)}{p(x)} = \frac{\frac{K_c}{N_c V} \cdot \frac{N_c}{N}}{\frac{K}{N V}} = \frac{\frac{K_c}{N V}}{\frac{K}{N V}} = \frac{K_c}{K}
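A minimal sketch of the resulting rule p(Y = c | x) = K_c / K, i.e. the fraction of the K nearest neighbours that belong to class c (the volume V cancels); the toy data below are hypothetical:

    import numpy as np
    from collections import Counter

    def knn_posterior(X, y, query, k):
        dist = np.sqrt(((X - query) ** 2).sum(axis=1))
        nearest = y[np.argsort(dist)[:k]]              # labels of the K nearest neighbours
        counts = Counter(nearest)
        return {c: counts.get(c, 0) / k for c in np.unique(y)}   # K_c / K per class

    X = np.array([[0.1], [0.2], [0.35], [0.8], [0.9]])
    y = np.array([0, 0, 0, 1, 1])
    print(knn_posterior(X, y, np.array([0.3]), k=3))   # {0: 1.0, 1: 0.0}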
(a) (3 points)
(a) The joint probability distribution, P(X = x, P = p):

    P(X = x, P = p) = P(X = x | P = p) P(P = p) = \binom{n}{x} p^x (1 - p)^{n - x}\, I(0 < p < 1) = \binom{n}{x} p^x (1 - p)^{n - x}

(b) The marginal probability distribution, P(X = x):

    P(X = x) = \int P(X = x, p)\, dp = \int_0^1 \binom{n}{x} p^x (1 - p)^{n - x}\, dp = \binom{n}{x} B(x + 1, n - x + 1)

The posterior distribution is then

    P(P = p | X = x) = \frac{P(P = p, X = x)}{P(X = x)} = \frac{p^x (1 - p)^{n - x}}{B(x + 1, n - x + 1)}

(b) (3 points)
With the prior P(P = p) = \frac{p^{\alpha - 1}(1 - p)^{\beta - 1}}{B(\alpha, \beta)}, we get

    P(P = p | X = x) = \frac{P(P = p, X = x)}{P(X = x)}
                     = \frac{\binom{n}{x} p^x (1 - p)^{n - x} \cdot \frac{p^{\alpha - 1}(1 - p)^{\beta - 1}}{B(\alpha, \beta)}}{\binom{n}{x} \frac{B(x + \alpha,\, n - x + \beta)}{B(\alpha, \beta)}}
                     = \frac{p^{x + \alpha - 1}(1 - p)^{n - x + \beta - 1}}{B(x + \alpha,\, n - x + \beta)}
(c) (9 points)
(a) MLE and MAP of (a):

    \frac{\partial P(X = x | P = p)}{\partial p} = \binom{n}{x} \Big[ x p^{x - 1}(1 - p)^{n - x} - p^x (n - x)(1 - p)^{n - x - 1} \Big]
                                                 = \binom{n}{x} p^{x - 1}(1 - p)^{n - x - 1} \big( x(1 - p) - (n - x)p \big) = 0

    \hat{p} = \frac{x}{n}

MAP gives the same result because the prior does not depend on p. If the prior is independent
of the parameter p, the MLE and MAP estimates are the same.
(b) MLE and MAP of (b):
The MLE estimate is the same as above. For the MAP estimate,

    \frac{\partial P(P = p | X = x)}{\partial p} = \frac{1}{B(x + \alpha,\, n - x + \beta)} \Big[ (x + \alpha - 1) p^{x + \alpha - 2}(1 - p)^{n - x + \beta - 1} - p^{x + \alpha - 1}(n - x + \beta - 1)(1 - p)^{n - x + \beta - 2} \Big]
                                                 = \frac{1}{B(x + \alpha,\, n - x + \beta)} p^{x + \alpha - 2}(1 - p)^{n - x + \beta - 2} \big( (x + \alpha - 1)(1 - p) - p(n - x + \beta - 1) \big) = 0

    \hat{p} = \frac{x + \alpha - 1}{n + \alpha + \beta - 2}
The MLE and MAP estimates differ because of the prior distribution. When x = 2 and n = 10,
the MLE gives \hat{p} = 0.2, while the MAP estimate is \hat{p} = \frac{2 + 50 - 1}{10 + 50 + 50 - 2} = 0.4722. If we have
a strong prior such as \alpha = 50, \beta = 50 (i.e., the coin is believed to be fair), the MAP
estimate is not sensitive to the small number of exceptional occurrences (2 out of 10). Thus,
MAP is more robust than MLE.
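The comparison can be reproduced with a few lines (a sketch using the numbers above; the grid search is only a sanity check that the MAP estimate maximizes the Beta(x + \alpha, n - x + \beta) posterior):

    x, n = 2, 10
    alpha, beta = 50, 50

    p_mle = x / n
    p_map = (x + alpha - 1) / (n + alpha + beta - 2)
    print(p_mle, p_map)                   # 0.2 vs ~0.4722

    # Sanity check: the posterior is proportional to p^(x+alpha-1) * (1-p)^(n-x+beta-1).
    grid = [i / 10000 for i in range(1, 10000)]
    post = [p ** (x + alpha - 1) * (1 - p) ** (n - x + beta - 1) for p in grid]
    print(grid[post.index(max(post))])    # ~0.4722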
Decision Tree
Part (a) We should split on Traffic because it gives a perfect prediction of Accident rate; the
other feature cannot give a perfect prediction. [5 points: just mentioning the fact that Traffic
gives a perfect prediction.]
Part (b) We can think about decision trees as partitioning the space of observations along each
axis. If every feature is continuous and ordered, we can transform T1 into T2 by taking each
decision boundary, subtracting off the appropriate mean, and then dividing by the appropriate
variance. Both trees have the same structure and the same accuracy. In other words, a linear
transformation does not change the informativeness of the features. [5 points: the argument that
informativeness does not change under a linear transformation.]
Part (c)
Consider the difference between the Gini index and the cross entropy:

    G - CE = \sum_{k=1}^{K} p_k (1 - p_k) + \sum_{k=1}^{K} p_k \log p_k = \sum_{k=1}^{K} p_k (1 - p_k + \log p_k)

Now examine the function f(x) = 1 - x + \log_a(x), where the base a of the log is less than or
equal to e (the cross entropy is defined with base 2). Note that f is continuous on the positive
real line. Its derivative is

    \frac{d}{dx} f = -1 + \frac{1}{x \ln a}

which is also continuous on the positive real line. For all a \le e we have \ln a \le 1, so
\frac{1}{x \ln a} > 1 for all x \in (0, 1), and \frac{1}{x \ln a} \ge 1 at x = 1. This implies that \frac{d}{dx} f(x) > 0 for
x \in (0, 1), so f has no critical points in (0, 1).
Note that f(x) \to -\infty as x \to 0^+ and f(1) = 0. Then f cannot take any positive values on
(0, 1): if f(x_0) > 0 for some x_0 \in (0, 1), then, since f is continuous and f(1) = 0, f would
have to decrease somewhere on (x_0, 1), so its derivative would be negative at some point of
(0, 1), contradicting \frac{d}{dx} f(x) > 0 there.
Thus 1 - p_k + \log p_k < 0 for p_k \in (0, 1), meaning that G - CE < 0, i.e., the Gini index is
always less than the cross entropy (whenever some p_k lies strictly between 0 and 1). [5 points:
any correct proof is acceptable. Partial credit should be given if some of the ideas are correct.]
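A quick numerical illustration of this inequality (random class distributions; K = 4 and the seed are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(7)
    for _ in range(5):
        p = rng.dirichlet(np.ones(4))                 # random distribution over K = 4 classes
        gini = np.sum(p * (1 - p))
        cross_entropy = -np.sum(p * np.log2(p))       # base-2 cross entropy
        print(round(gini, 4), round(cross_entropy, 4), gini < cross_entropy)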