CSCI567 Fall 2015                                                    Homework #1 Solution

1 Density Estimation

(a)
The PDF of the Beta distribution is
\[
f(x; \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\, x^{\alpha-1}(1-x)^{\beta-1}.
\]
When $\beta = 1$, it becomes
\[
f(x; \alpha) = \alpha x^{\alpha-1}.
\]
The likelihood is
\[
L = L(\alpha) = \prod_{i=1}^n f(x_i; \alpha) = \prod_{i=1}^n \alpha x_i^{\alpha-1},
\qquad
\ln L = n \ln \alpha + (\alpha - 1) \sum_{i=1}^n \ln x_i.
\]
Let
\[
\frac{\partial \ln L}{\partial \alpha} = \frac{n}{\alpha} + \sum_{i=1}^n \ln x_i = 0.
\]
We have
\[
\hat{\alpha} = -\frac{n}{\sum_{i=1}^n \ln x_i}.
\]
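As a quick sanity check (not part of the original solution), here is a minimal Python sketch, assuming NumPy and SciPy are available: it draws samples from Beta(alpha, 1), evaluates the closed-form estimator -n / sum(ln x_i), and compares it with a direct numerical maximization of the log-likelihood.

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(0)
    alpha_true = 2.5
    x = rng.beta(alpha_true, 1.0, size=10_000)   # Beta(alpha, beta=1) samples

    # Closed-form MLE derived above: alpha_hat = -n / sum(log x_i)
    alpha_closed = -len(x) / np.sum(np.log(x))

    # Numerical check: maximize ln L(alpha) = n*ln(alpha) + (alpha - 1)*sum(log x_i)
    neg_loglik = lambda a: -(len(x) * np.log(a) + (a - 1.0) * np.sum(np.log(x)))
    alpha_numeric = minimize_scalar(neg_loglik, bounds=(1e-6, 50.0), method="bounded").x

    print(alpha_closed, alpha_numeric)   # both should be close to alpha_true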

Now consider
\[
f(x; \theta) = N(\theta, \theta) = \frac{1}{\sqrt{2\pi\theta}}\, e^{-\frac{(x-\theta)^2}{2\theta}},
\qquad
L = L(\theta) = \prod_{i=1}^n f(x_i; \theta) = (2\pi\theta)^{-\frac{n}{2}} \prod_{i=1}^n e^{-\frac{(x_i-\theta)^2}{2\theta}}.
\]
\[
\ln L = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\theta - \frac{1}{2\theta}\sum_{i=1}^n (x_i-\theta)^2
      = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\theta - \frac{1}{2\theta}\sum_{i=1}^n x_i^2 + \sum_{i=1}^n x_i - \frac{n\theta}{2}.
\]
Let
\[
\frac{\partial \ln L}{\partial \theta} = -\frac{n}{2\theta} + \frac{\sum_{i=1}^n x_i^2}{2\theta^2} - \frac{n}{2} = 0.
\]
Then, multiplying by $2\theta^2$, we have
\[
\sum_{i=1}^n x_i^2 - n\theta - n\theta^2 = 0.
\]
We know $\theta > 0$, so
\[
\hat{\theta} = \frac{-1 + \sqrt{1 + \frac{4}{n}\sum_{i=1}^n x_i^2}}{2}.
\]
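A similar verification sketch (again not from the original solution): it simulates data from a normal whose mean and variance are both theta and checks the quadratic-formula MLE against numerical maximization.

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(1)
    theta_true = 3.0
    x = rng.normal(loc=theta_true, scale=np.sqrt(theta_true), size=10_000)

    # Closed-form MLE from the quadratic n*theta^2 + n*theta - sum(x_i^2) = 0
    n, s2 = len(x), np.sum(x ** 2)
    theta_closed = (-1.0 + np.sqrt(1.0 + 4.0 * s2 / n)) / 2.0

    # Numerical check of the same log-likelihood
    def neg_loglik(t):
        return 0.5 * n * np.log(2 * np.pi * t) + np.sum((x - t) ** 2) / (2 * t)
    theta_numeric = minimize_scalar(neg_loglik, bounds=(1e-6, 50.0), method="bounded").x

    print(theta_closed, theta_numeric)   # both should be close to theta_true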

(b)




\[
\mathbb{E}_{X_1,\dots,X_n}[\hat{f}(x)]
= \mathbb{E}\!\left[\frac{1}{n}\sum_{i=1}^n \frac{1}{h} K\!\left(\frac{x - X_i}{h}\right)\right]
= \mathbb{E}\!\left[\frac{1}{h} K\!\left(\frac{x - X}{h}\right)\right]
= \int \frac{1}{h} K\!\left(\frac{x - t}{h}\right) f(t)\, dt.
\]
By letting $z = \frac{x-t}{h}$, we have
\[
\int \frac{1}{h} K\!\left(\frac{x - t}{h}\right) f(t)\, dt = \int K(z)\, f(x - zh)\, dz.
\]
By Taylor's theorem, we have
\[
f(x - zh) = f(x) - zh f'(x) + \frac{z^2 h^2}{2} f''(x) + \cdots
\]
so
\[
\mathbb{E}[\hat{f}(x)] = \int K(z)\, f(x - zh)\, dz
= \int K(z)\left[f(x) - zh f'(x) + \frac{z^2 h^2}{2} f''(x) + \cdots\right] dz
\]
\[
= f(x)\int K(z)\, dz - f'(x)\, h \int z K(z)\, dz + f''(x)\, \frac{h^2}{2} \int z^2 K(z)\, dz + \cdots
= f(x) + \frac{h^2 \sigma_K^2}{2} f''(x) + o(h^2),
\]
using the standard kernel assumptions $\int K(z)\, dz = 1$, $\int z K(z)\, dz = 0$ (symmetric kernel), and $\int z^2 K(z)\, dz = \sigma_K^2$.
So we know the bias term is
\[
\mathbb{E}[\hat{f}(x)] - f(x) = \frac{h^2 \sigma_K^2}{2} f''(x) + o(h^2).
\]
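The leading bias term can be checked empirically. The sketch below (an illustration, not part of the original solution) uses a Gaussian kernel (so sigma_K^2 = 1) and standard normal data, and compares the average of many KDE estimates at a point with the predicted bias h^2 * sigma_K^2 * f''(x) / 2.

    import numpy as np

    rng = np.random.default_rng(2)
    n, h, x0, trials = 2000, 0.3, 0.0, 400

    def gaussian_kernel(u):
        return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

    estimates = []
    for _ in range(trials):
        data = rng.standard_normal(n)
        estimates.append(np.mean(gaussian_kernel((x0 - data) / h)) / h)

    f = lambda t: np.exp(-0.5 * t**2) / np.sqrt(2 * np.pi)
    f_second = (x0**2 - 1) * f(x0)               # f''(x) for the standard normal density
    empirical_bias = np.mean(estimates) - f(x0)
    predicted_bias = h**2 * 1.0 * f_second / 2   # sigma_K^2 = 1 for the Gaussian kernel

    print(empirical_bias, predicted_bias)        # should be of similar size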
2

2 Histogram Density Estimates

Part (a)
The estimate of the density is equivalent to the fraction of the samples that fall within the given bin, divided by the bin width $h$. Mathematically:
\[
\hat{f}_n(x) = \frac{1}{nh} \sum_{i=1}^{n} \mathbf{1}_{(x_0,\, x_0+h]}(x_i),
\]
where $\mathbf{1}_{(x_0,\, x_0+h]}(x_i)$ is $1$ if $x_i \in (x_0, x_0 + h]$ and $0$ otherwise.


Let $B$ be the number of samples that fall in $(x_0, x_0+h]$, so $B \sim \mathrm{Binomial}(n, p)$ with $p = F(x_0+h) - F(x_0)$. The mean of $B$ is $np$, hence:
\[
\mathbb{E}[\hat{f}_n(x)] = \frac{1}{nh}\,\mathbb{E}(B) = \frac{F(x_0+h) - F(x_0)}{h}.
\]
The variance of $B$ is $np(1-p)$, hence:
\[
\mathrm{Var}(\hat{f}_n(x)) = \frac{1}{n^2h^2}\,\mathrm{Var}(B) = \mathbb{E}[\hat{f}_n(x)]\,\frac{1 - F(x_0+h) + F(x_0)}{nh}.
\]

If we let $h \to 0$ and $n \to \infty$, then
\[
\mathbb{E}[\hat{f}_n(x)] \to f(x_0),
\]
since the pdf is the derivative of the CDF. But since $x$ is between $x_0$ and $x_0+h$, $f(x_0) \to f(x)$. So if we use smaller and smaller bins as we get more data, the histogram density estimate is asymptotically unbiased. We'd also like its variance to shrink as the sample grows. Since $1 - F(x_0+h) + F(x_0) \to 1$ as $h \to 0$, to get the variance to go away we need $nh \to \infty$.
To put this together, our first conclusion is that histogram density estimates will be consistent when $h \to 0$ but $nh \to \infty$ as $n \to \infty$. The bin width $h$ needs to shrink, but more slowly than $1/n$.
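A small simulation sketch of this consistency argument, under the assumption that the true density is standard normal; the bin width shrinks like n^(-1/3), so h goes to 0 while nh goes to infinity.

    import numpy as np

    # Histogram density estimate at a point x, with bin (x0, x0 + h] chosen to contain x.
    def hist_density(data, x, h):
        x0 = np.floor(x / h) * h          # left edge of the bin containing x
        count = np.sum((data > x0) & (data <= x0 + h))
        return count / (len(data) * h)

    rng = np.random.default_rng(3)
    true_pdf = lambda t: np.exp(-0.5 * t**2) / np.sqrt(2 * np.pi)

    # Shrink h more slowly than 1/n (here h ~ n^(-1/3)) so that n*h -> infinity.
    for n in [100, 10_000, 1_000_000]:
        data = rng.standard_normal(n)
        h = n ** (-1 / 3)
        print(n, hist_density(data, 0.5, h), true_pdf(0.5))   # estimate approaches the truth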

3 Naive Bayes

(a) (10 points) Suppose $X = \{X_i\}_{i=1}^D \in \mathbb{R}^D$ represents the features and $Y \in \{0, 1\}$ represents the class labels. Let the following assumptions hold:

(a) The label variable $Y$ follows a Bernoulli distribution, with parameter $\theta = P(Y = 1)$.
(b) For each feature $X_j$, we have $P(X_j \mid Y = y_k)$, which follows a Gaussian distribution $N(\mu_{jk}, \sigma_j)$.

Using the Naive Bayes assumption, that for all $j' \neq j$, $X_j$ and $X_{j'}$ are conditionally independent given $Y$, compute $P(Y = 1 \mid X)$ and show that it can be written in the following form:
\[
P(Y = 1 \mid X) = \frac{1}{1 + \exp(w_0 + w^T X)}.
\]
Specifically, find the explicit form of $w_0$ and $w$ in terms of $\theta$, $\mu_{jk}$, and $\sigma_j$, for $j = 1, \dots, D$ and $k \in \{0, 1\}$.

Solution: For ease of indexing, and without loss of generality, let $y_0 = 0$ and $y_1 = 1$:
\[
P(Y = 1 \mid X) = \frac{P(X \mid Y = 1)\, P(Y = 1)}{P(X)} = P(X \mid Y = 1)\,\frac{\theta}{P(X)},
\]
with $P(Y = 1) = \theta$. Now
\[
P(X) = \sum_y P(Y = y)\, P(X \mid Y = y) = \theta\, P(X \mid Y = 1) + (1 - \theta)\, P(X \mid Y = 0).
\]
Thus
\[
P(Y = 1 \mid X) = \frac{1}{1 + \left(\frac{1}{\theta} - 1\right)\frac{P(X \mid Y = 0)}{P(X \mid Y = 1)}}.
\]
Explicitly, using
\[
P(X \mid Y = y_k) = \prod_{j=1}^D (2\pi\sigma_j^2)^{-\frac{1}{2}} \exp\!\left(-(2\sigma_j^2)^{-1}(x_j - \mu_{jk})^2\right),
\]
we get
\[
P(Y = 1 \mid X) = \frac{1}{1 + \left(\frac{1}{\theta} - 1\right)\prod_{j=1}^D \exp\!\left((2\sigma_j^2)^{-1}\left((x_j - \mu_{j1})^2 - (x_j - \mu_{j0})^2\right)\right)}.
\]
Consider only the second term of the denominator:
\[
\left(\frac{1}{\theta} - 1\right)\prod_{j=1}^D \exp\!\left((2\sigma_j^2)^{-1}\left((x_j - \mu_{j1})^2 - (x_j - \mu_{j0})^2\right)\right)
\]
\[
= \exp\!\left(\log\!\left(\tfrac{1}{\theta} - 1\right)\right)\exp\!\left(\sum_j (2\sigma_j^2)^{-1}\left(x_j^2 - 2x_j\mu_{j1} + \mu_{j1}^2 - x_j^2 + 2x_j\mu_{j0} - \mu_{j0}^2\right)\right)
\]
\[
= \exp\!\left(\log\!\left(\tfrac{1}{\theta} - 1\right) + \sum_j (2\sigma_j^2)^{-1}\left(\mu_{j1}^2 - \mu_{j0}^2\right) + \sum_j (\sigma_j^2)^{-1}(\mu_{j0} - \mu_{j1})\, x_j\right).
\]
Comparing this with the second term in the denominator of
\[
P(Y = 1 \mid X) = \frac{1}{1 + \exp(w_0 + w^T X)}
\]
gives us:
(a) $w_0 = \log\!\left(\frac{1}{\theta} - 1\right) + \sum_j \left[(2\sigma_j^2)^{-1}\left(\mu_{j1}^2 - \mu_{j0}^2\right)\right]$;
(b) $w_j = \frac{\mu_{j0} - \mu_{j1}}{\sigma_j^2}$, for $j \geq 1$.
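The algebra above can be verified numerically. The following sketch (illustrative only; the parameter values are made up) computes the class posterior of a Gaussian Naive Bayes model directly by Bayes' rule and through the logistic form with the derived w_0 and w.

    import numpy as np

    rng = np.random.default_rng(4)
    D = 4
    theta = 0.3
    mu = rng.normal(size=(2, D))          # mu[k, j] = mean of feature j under class k
    sigma2 = rng.uniform(0.5, 2.0, D)     # shared per-feature variances sigma_j^2
    x = rng.normal(size=D)

    def log_gauss(v, m, s2):
        return -0.5 * np.log(2 * np.pi * s2) - (v - m) ** 2 / (2 * s2)

    # Direct Bayes-rule posterior
    log_p1 = np.log(theta) + np.sum(log_gauss(x, mu[1], sigma2))
    log_p0 = np.log(1 - theta) + np.sum(log_gauss(x, mu[0], sigma2))
    posterior_direct = np.exp(log_p1) / (np.exp(log_p1) + np.exp(log_p0))

    # Logistic form with the weights derived above
    w0 = np.log(1 / theta - 1) + np.sum((mu[1] ** 2 - mu[0] ** 2) / (2 * sigma2))
    w = (mu[0] - mu[1]) / sigma2
    posterior_logistic = 1.0 / (1.0 + np.exp(w0 + w @ x))

    print(posterior_direct, posterior_logistic)   # should match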

(b) (10 points) IMPORTANT: if someone solves the problem with $\sigma_{jk}$ or $\sigma_j$, either one is fine. Here are two solutions.
Case 1: $\sigma_{jk}$.
The data is $\mathcal{D} = \{(X_i, Y_i)\}_{i=1}^N$, and the parameters are $\theta = \{p_k, \mu_{jk}, \sigma_{jk}\}_{j=1,\dots,D;\, k=1,\dots,K}$.
By the product rule and the Naive Bayes assumption, we know
\[
P(X_i, Y_i; \theta) = P(Y_i)\, P(X_i \mid Y_i) = P(Y_i) \prod_{j=1}^D P(x_{ij} \mid Y_i)
= p_{Y_i} \prod_{j=1}^D N(\mu_{j,Y_i}, \sigma_{j,Y_i})
= p_{Y_i} \prod_{j=1}^D \frac{1}{\sqrt{2\pi}\,\sigma_{j,Y_i}}\, e^{-\frac{(x_{ij} - \mu_{j,Y_i})^2}{2\sigma_{j,Y_i}^2}}.
\]

The log likelihood is
\[
L = L(\mathcal{D}; \theta) = \ln \prod_{i=1}^N P(X_i, Y_i; \theta) = \sum_{i=1}^N \ln P(X_i, Y_i; \theta)
= \sum_{i=1}^N \ln\!\left(p_{Y_i} \prod_{j=1}^D \frac{1}{\sqrt{2\pi}\,\sigma_{j,Y_i}}\, e^{-\frac{(x_{ij} - \mu_{j,Y_i})^2}{2\sigma_{j,Y_i}^2}}\right)
\]
\[
= \sum_{i=1}^N \ln p_{Y_i} - \sum_{i=1}^N \sum_{j=1}^D \ln\!\left(\sqrt{2\pi}\,\sigma_{j,Y_i}\right) - \sum_{i=1}^N \sum_{j=1}^D \frac{(x_{ij} - \mu_{j,Y_i})^2}{2\sigma_{j,Y_i}^2}
\]
\[
= \sum_{i=1}^N \ln p_{Y_i} - \frac{ND\ln 2\pi}{2} - \frac{1}{2}\sum_{i=1}^N \sum_{j=1}^D \ln \sigma_{j,Y_i}^2 - \sum_{i=1}^N \sum_{j=1}^D \frac{(x_{ij} - \mu_{j,Y_i})^2}{2\sigma_{j,Y_i}^2}.
\]

To solve the MLE problem with the constraint $\sum_{k=1}^K p_k = 1$, we use a Lagrange multiplier:
\[
\tilde{L} = L + \lambda\left(\sum_{k=1}^K p_k - 1\right).
\]
We also denote $N_k = \sum_{i=1}^N I(Y_i = k)$ as the number of samples belonging to class $k$. We let
\[
\frac{\partial \tilde{L}}{\partial p_k} = \sum_{i=1}^N \frac{I(Y_i = k)}{p_k} + \lambda = \frac{N_k}{p_k} + \lambda = 0.
\]

Then, combining with the constraint $\sum_{k=1}^K p_k = 1$ (which gives $\lambda = -N$), we have
\[
\hat{p}_k = \frac{N_k}{N}.
\]
Also, we let
\[
\frac{\partial L}{\partial \mu_{jk}} = \sum_{i=1}^N I(Y_i = k)\,\frac{x_{ij} - \mu_{jk}}{\sigma_{jk}^2}
= \frac{1}{\sigma_{jk}^2}\sum_{i=1}^N I(Y_i = k)(x_{ij} - \mu_{jk}) = 0.
\]
Then we have
\[
\hat{\mu}_{jk} = \frac{\sum_{i=1}^N I(Y_i = k)\, x_{ij}}{N_k}.
\]

Also, we let
\[
\frac{\partial L}{\partial \sigma_{jk}^2}
= -\frac{1}{2\sigma_{jk}^2}\sum_{i=1}^N I(Y_i = k) + \sum_{i=1}^N I(Y_i = k)\,\frac{(x_{ij} - \mu_{jk})^2}{2\sigma_{jk}^4}
= -\frac{N_k}{2\sigma_{jk}^2} + \frac{\sum_{i=1}^N I(Y_i = k)(x_{ij} - \mu_{jk})^2}{2\sigma_{jk}^4} = 0.
\]
Since $\sigma_{jk} > 0$, we have
\[
\hat{\sigma}_{jk}^2 = \frac{\sum_{i=1}^N I(Y_i = k)(x_{ij} - \hat{\mu}_{jk})^2}{N_k}
= \frac{\sum_{i=1}^N I(Y_i = k)\left(x_{ij} - \frac{\sum_{i'=1}^N I(Y_{i'} = k)\, x_{i'j}}{N_k}\right)^{2}}{N_k}.
\]
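For concreteness, a minimal NumPy sketch of these Case 1 estimators on synthetic data (the data-generating values are invented for illustration):

    import numpy as np

    rng = np.random.default_rng(5)
    N, D, K = 500, 3, 2
    y = rng.integers(0, K, size=N)
    true_mu = np.array([[0.0, 1.0, -1.0], [2.0, -2.0, 0.5]])
    X = true_mu[y] + rng.normal(scale=1.5, size=(N, D))

    # MLE formulas derived above (Case 1: per-class, per-feature variances)
    p_hat = np.array([np.mean(y == k) for k in range(K)])                 # N_k / N
    mu_hat = np.array([X[y == k].mean(axis=0) for k in range(K)])         # class means
    var_hat = np.array([((X[y == k] - mu_hat[k]) ** 2).mean(axis=0)       # class variances
                        for k in range(K)])

    print(p_hat)     # close to [0.5, 0.5]
    print(mu_hat)    # close to true_mu
    print(var_hat)   # close to 1.5**2 = 2.25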

Case 2: $\sigma_j$. It has the same answers for $p_k$ and $\mu_{jk}$.
The data is $\{(X_i, Y_i)\}_{i=1}^N$, and the parameters are $\theta = \{p_k, \mu_{jk}, \sigma_j\}_{j=1,\dots,D;\, k \in \{0,1\}}$.
The log likelihood is
\[
L = L(\mathcal{D}; \theta) = \ln \prod_{i=1}^N P(X_i, Y_i; \theta) = \sum_{i=1}^N \ln P(X_i, Y_i; \theta)
= \sum_{i=1}^N \ln\!\left(p_{y_i} \prod_{j=1}^D \frac{1}{\sqrt{2\pi\sigma_j^2}}\, e^{-\frac{(x_{ij} - \mu_{j,y_i})^2}{2\sigma_j^2}}\right)
\]
\[
= \sum_{i=1}^N \ln p_{y_i} - \sum_{i=1}^N \sum_{j=1}^D \ln\sqrt{2\pi\sigma_j^2} - \sum_{i=1}^N \sum_{j=1}^D \frac{(x_{ij} - \mu_{j,y_i})^2}{2\sigma_j^2}
= \sum_{i=1}^N \ln p_{y_i} - \frac{ND\ln 2\pi}{2} - \sum_{i=1}^N \sum_{j=1}^D \ln \sigma_j - \sum_{i=1}^N \sum_{j=1}^D \frac{(x_{ij} - \mu_{j,y_i})^2}{2\sigma_j^2}.
\]

We keep the terms of $L$ that depend on the $\sigma_j$'s:
\[
L(\sigma) = -\sum_{i=1}^N \sum_{j=1}^D \ln \sigma_j - \sum_{i=1}^N \sum_{j=1}^D \frac{(x_{ij} - \mu_{j,y_i})^2}{2\sigma_j^2}.
\]
We let
\[
\frac{\partial L}{\partial \sigma_j^2} = -\frac{N}{2\sigma_j^2} + \frac{\sum_{i=1}^N (x_{ij} - \mu_{j,y_i})^2}{2\sigma_j^4} = 0.
\]
Since $\sigma_j > 0$, we have
\[
\hat{\sigma}_j^2 = \frac{\sum_{i=1}^N (x_{ij} - \mu_{j,y_i})^2}{N}
= \frac{1}{N}\sum_{k}\sum_{i=1}^N I(Y_i = k)(x_{ij} - \hat{\mu}_{jk})^2.
\]
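A corresponding sketch for Case 2 on the same synthetic setup as the previous snippet: the shared per-feature variance pools squared deviations from each sample's own class mean over all N samples.

    import numpy as np

    rng = np.random.default_rng(5)
    N, D, K = 500, 3, 2
    y = rng.integers(0, K, size=N)
    true_mu = np.array([[0.0, 1.0, -1.0], [2.0, -2.0, 0.5]])
    X = true_mu[y] + rng.normal(scale=1.5, size=(N, D))

    mu_hat = np.array([X[y == k].mean(axis=0) for k in range(K)])
    # Case 2: one variance per feature, pooled across classes
    var_shared = ((X - mu_hat[y]) ** 2).mean(axis=0)
    print(var_shared)   # close to 1.5**2 = 2.25 for each feature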

4 Nearest Neighbor

(a) Let us use the following label convention:
Mathematics: 1
Electrical Engineering: 2
Computer Science: 3
Economics: 4
Thus the unnormalized data is:

x-coordinate   y-coordinate   label
      0             49          1
     -7             32          1
     -9             47          1
     29             12          2
     49             31          2
     37             38          2
      8              9          3
     13             -1          3
     -6             -3          3
    -21             12          3
     27            -32          4
     19            -14          4
     27            -20          4

Table 1: Unnormalized labeled data.


The mean and standard deviation are:

        x-coordinate   y-coordinate
mean       12.77          12.31
std        20.7170        25.9306

Table 2: Mean and standard deviation of the labeled data.


The normalized queried student coordinate is:

normalized queried student x-coordinate   normalized queried student y-coordinate
                0.3490                                    -0.2047

Table 3: Normalized queried data.


The L1 and L2 distances between the queried student coordinate and each labeled data point are:

L1 Distance   L2 Distance   label
   2.5851        1.8856       1
   2.2674        1.6211       1
   2.9424        2.0830       1
   0.6272        0.4753       2
   2.3253        1.6781       2
   2.0161        1.4500       2
   0.6564        0.5843       3
   0.6464        0.4575       3
   1.6407        1.3129       3
   2.1719        1.9884       3
   1.8419        1.5415       4
   0.8581        0.8113       4
   1.3791        1.0947       4

Table 4: L1 and L2 distances between the normalized queried data and each of the normalized labeled data points.
If sorted (from minimum to maximum) by L2 distance:

L2 Distance   label
   0.4575       3
   0.4753       2
   0.5843       3
   0.8113       4
   1.0947       4
   1.3129       3
   1.4500       2
   1.5415       4
   1.6211       1
   1.6781       2
   1.8856       1
   1.9884       3
   2.0830       1

Table 5: Sorting based on L2 distance between the normalized queried data and each of the normalized labeled data points.
If sorted (from minimum to maximum) by L1 distance:

L1 Distance   label
   0.6272       2
   0.6464       3
   0.6564       3
   0.8581       4
   1.3791       4
   1.6407       3
   1.8419       4
   2.0161       2
   2.1719       3
   2.2674       1
   2.3253       2
   2.5851       1
   2.9424       1

Table 6: Sorting based on L1 distance between the normalized queried data and each of the normalized labeled data points.
Thus:
If using the L2 distance metric and K = 1, the predicted student major will be label 3 (Computer Science).
If using the L2 distance metric and K = 5, the predicted student major will be label 3 (Computer Science). Actually this is a tie between Computer Science and Economics, but the tie is broken in favor of the labeled data point with the shortest distance.
If using the L1 distance metric and K = 1, the predicted student major will be label 2 (Electrical Engineering).
If using the L1 distance metric and K = 5, the predicted student major will be label 3 (Computer Science). Actually this is a tie between Computer Science and Economics, but the tie is broken in favor of the labeled data point with the shortest distance.
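The table computations above can be reproduced with the short sketch below (not part of the original solution). The query's raw coordinates are not restated here, so the snippet starts from the normalized query of Table 3; the normalization uses the sample standard deviation (ddof = 1), which matches Table 2.

    import numpy as np
    from collections import Counter

    data = np.array([[0, 49], [-7, 32], [-9, 47], [29, 12], [49, 31], [37, 38],
                     [8, 9], [13, -1], [-6, -3], [-21, 12], [27, -32], [19, -14], [27, -20]],
                    dtype=float)
    labels = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4])

    mean, std = data.mean(axis=0), data.std(axis=0, ddof=1)
    normalized = (data - mean) / std
    query = np.array([0.3490, -0.2047])          # normalized query from Table 3

    for name, dist in [("L1", np.sum(np.abs(normalized - query), axis=1)),
                       ("L2", np.sqrt(np.sum((normalized - query) ** 2, axis=1)))]:
        order = np.argsort(dist)
        for k in (1, 5):
            votes = Counter(labels[order[:k]])
            top = max(votes.values())
            # tie-break: among the most frequent labels, pick the one whose neighbor is nearest
            winner = next(lab for lab in labels[order[:k]] if votes[lab] == top)
            print(name, "K =", k, "->", winner)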
(b) Probabilistic K-Nearest Neighbor:
The unconditional density $p(x)$ can be computed as follows:
\[
p(x) = \sum_c p(x \mid Y = c)\, p(Y = c)
= \sum_c \frac{K_c}{N_c V}\cdot\frac{N_c}{N}
= \sum_c \frac{K_c}{NV}
= \frac{K}{NV}.
\]
The posterior probability of class membership $p(Y = c \mid x)$ can be computed as follows:
\[
p(Y = c \mid x) = \frac{p(x \mid Y = c)\, p(Y = c)}{p(x)}
= \frac{\frac{K_c}{N_c V}\cdot\frac{N_c}{N}}{\frac{K}{NV}}
= \frac{\frac{K_c}{NV}}{\frac{K}{NV}}
= \frac{K_c}{K}.
\]
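A small illustrative sketch of this estimator (the dataset and K below are made up): it estimates p(Y = c | x) as K_c / K among the K nearest neighbors.

    import numpy as np
    from collections import Counter

    rng = np.random.default_rng(6)
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
    y = np.array([0] * 100 + [1] * 100)

    def knn_posterior(x, K=15):
        nearest = np.argsort(np.linalg.norm(X - x, axis=1))[:K]
        counts = Counter(y[nearest])
        return {c: counts.get(c, 0) / K for c in np.unique(y)}   # K_c / K for each class c

    print(knn_posterior(np.array([1.5, 1.5])))   # roughly balanced between the two classes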

5 MLE and MAP

(a) (3 points)
(a) The joint probability distribution, $P(X = x, P = p)$:
\[
P(X = x, P = p) = P(X = x \mid P = p)\, P(P = p)
= \binom{n}{x} p^x (1-p)^{n-x}\, I(0 < p < 1)
= \binom{n}{x} p^x (1-p)^{n-x}.
\]
(b) The marginal probability distribution, $P(X = x)$:
\[
P(X = x) = \int P(X = x, p)\, dp
= \int_0^1 \binom{n}{x} p^x (1-p)^{n-x}\, dp
= \binom{n}{x} B(x+1,\, n-x+1).
\]

(c) The posterior distribution, $P(P = p \mid X = x)$:
\[
P(P = p \mid X = x) = \frac{P(P = p, X = x)}{P(X = x)}
= \frac{\binom{n}{x} p^x (1-p)^{n-x}}{\binom{n}{x} B(x+1,\, n-x+1)}
= \frac{p^x (1-p)^{n-x}}{B(x+1,\, n-x+1)}.
\]
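A numerical check (assuming SciPy is available), confirming that normalizing the binomial likelihood under a uniform prior recovers the Beta(x + 1, n - x + 1) density:

    import numpy as np
    from scipy.stats import beta, binom

    n, x = 10, 3
    p_grid = np.linspace(1e-4, 1 - 1e-4, 500)

    joint = binom.pmf(x, n, p_grid)                                 # C(n,x) p^x (1-p)^(n-x)
    posterior = joint / (joint.sum() * (p_grid[1] - p_grid[0]))     # normalize over p (Riemann sum)
    print(np.max(np.abs(posterior - beta.pdf(p_grid, x + 1, n - x + 1))))   # small (grid error only)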

(b) (3 points)

(a) The marginal probability distribution, $P(X = x)$:
\[
P(X = x) = \int_0^1 P(X = x, p)\, dp
= \int_0^1 \binom{n}{x} p^x (1-p)^{n-x}\, \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha, \beta)}\, dp
\]
\[
= \frac{\binom{n}{x}}{B(\alpha, \beta)} \int_0^1 p^{x+\alpha-1} (1-p)^{n-x+\beta-1}\, dp
= \frac{\binom{n}{x}\, B(x+\alpha,\, n-x+\beta)}{B(\alpha, \beta)}.
\]
(b) The posterior distribution, $P(P = p \mid X = x)$:
\[
P(P = p \mid X = x) = \frac{P(P = p, X = x)}{P(X = x)}
= \frac{\binom{n}{x} p^x (1-p)^{n-x}\, \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha,\beta)}}{\frac{\binom{n}{x}\, B(x+\alpha,\, n-x+\beta)}{B(\alpha,\beta)}}
= \frac{p^{x+\alpha-1}(1-p)^{n-x+\beta-1}}{B(x+\alpha,\, n-x+\beta)}.
\]
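The same kind of check with a Beta(alpha, beta) prior (the values below are arbitrary): the normalized product of likelihood and prior should match the Beta(x + alpha, n - x + beta) density.

    import numpy as np
    from scipy.stats import beta, binom

    n, x, a, b = 10, 3, 2.0, 5.0
    p_grid = np.linspace(1e-4, 1 - 1e-4, 500)

    unnorm = binom.pmf(x, n, p_grid) * beta.pdf(p_grid, a, b)
    posterior = unnorm / (unnorm.sum() * (p_grid[1] - p_grid[0]))
    print(np.max(np.abs(posterior - beta.pdf(p_grid, x + a, n - x + b))))   # small (grid error only)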

(c) (9 points)
(a) MLE and MAP of (a):
\[
\frac{\partial P(X = x \mid P = p)}{\partial p} = \frac{\partial}{\partial p}\binom{n}{x} p^x (1-p)^{n-x}
= \binom{n}{x}\left[x p^{x-1}(1-p)^{n-x} - p^x (n-x)(1-p)^{n-x-1}\right]
\]
\[
= \binom{n}{x} p^{x-1}(1-p)^{n-x-1}\left[x(1-p) - (n-x)p\right] = 0
\quad\Longrightarrow\quad
\hat{p} = \frac{x}{n}.
\]
MAP has the same result because the prior does not depend on $p$ (it is uniform on $(0, 1)$). If the prior is independent of the parameter $p$, the MLE and MAP estimates are the same.
(b) MLE and MAP of (b):
The MLE estimate is the same as above. For the MAP estimate,
\[
\frac{\partial P(P = p \mid X = x)}{\partial p}
= \frac{1}{B(x+\alpha,\, n-x+\beta)}\,\frac{\partial}{\partial p}\, p^{x+\alpha-1}(1-p)^{n-x+\beta-1}
\]
\[
= \frac{1}{B(x+\alpha,\, n-x+\beta)}\left[(x+\alpha-1)p^{x+\alpha-2}(1-p)^{n-x+\beta-1} - p^{x+\alpha-1}(n-x+\beta-1)(1-p)^{n-x+\beta-2}\right]
\]
\[
= \frac{p^{x+\alpha-2}(1-p)^{n-x+\beta-2}}{B(x+\alpha,\, n-x+\beta)}\left[(x+\alpha-1)(1-p) - p(n-x+\beta-1)\right] = 0
\quad\Longrightarrow\quad
\hat{p} = \frac{x+\alpha-1}{n+\alpha+\beta-2}.
\]
The MLE and MAP estimates are different because of the prior distribution.
When $x = 2$, $n = 10$, we will say that $\hat{p} = 0.2$ under MLE estimation. However, if we have a plausible prior distribution such as $\alpha = 50$, $\beta = 50$ (i.e., the coin is fair), the MAP estimate is $\hat{p} = \frac{2+50-1}{10+50+50-2} = 0.4722$. The MAP estimate is not sensitive to the small number of exceptional occurrences (2 out of 10). Thus, MAP is more robust than MLE here.
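The arithmetic of this comparison in a couple of lines of Python:

    # Quick numerical check of the MLE vs MAP comparison above.
    x, n = 2, 10
    alpha, beta_ = 50, 50

    p_mle = x / n
    p_map = (x + alpha - 1) / (n + alpha + beta_ - 2)
    print(p_mle, round(p_map, 4))   # 0.2 and 0.4722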

6 Decision Tree

Part (a) We should split on Traffic because it gives a perfect prediction of Accident rate. The others cannot give a perfect prediction. [5 points: just mentioning the fact that Traffic gives a perfect prediction.]
Part (b) We can think about decision trees as partitioning the space of observations along each axis. If every feature is continuous and ordered, we can transform T1 into T2 by taking each decision boundary, subtracting off the appropriate mean, and then dividing by the appropriate variance. Both trees have the same structure and the same accuracy. In other words, a linear transformation does not change the informativeness of the features. [5 points: the argument that informativeness doesn't change with a linear transformation.]
Part (c)
Consider the difference between the Gini Index and the Cross Entropy:
\[
G - CE = \sum_{k=1}^K \left[p_k(1 - p_k)\right] + \sum_{k=1}^K \left[p_k \log p_k\right]
= \sum_{k=1}^K p_k\left(1 - p_k + \log p_k\right).
\]
Now examine the function $f(x) = 1 - x + \log(x)$, where the base of the log is less than or equal to $e$ (the cross entropy is defined with base 2). Note that $f$ is continuous on the positive real line. Now consider the derivative $\frac{d}{dx} f = -1 + \frac{1}{x \ln(a)}$, where $a$ is the base of the log. This function is also continuous on the positive real line. For all $a \leq e$, $\ln(a) \leq 1$, so $\frac{1}{x \ln(a)} \geq \frac{1}{x} > 1$ for all $x \in (0, 1)$, and for $x = 1$, $\frac{1}{x \ln(a)} \geq 1$. This implies that $\frac{d}{dx} f(x) > 0$ for $x \in (0, 1)$ and $a < e$, so $f$ has no critical points in $(0, 1)$.
Note that $f(x) \to -\infty$ as $x \to 0^{+}$, and consider $x = 1$: $f(1) = 0$, and $f$ has no critical points before it, so it cannot take any positive values on $(0, 1)$ (if $f$ were to take a positive value, then since it is continuous and decreases toward $-\infty$ as $x \to 0^{+}$, it would have to have a negative derivative somewhere, meaning its derivative would have a zero, meaning $f$ would have a critical point: contradiction).
Thus $1 - p_k + \log p_k < 0$ for $p_k \in (0, 1)$, meaning that $G - CE < 0$, meaning that the Gini Index is always less than the Cross Entropy. [5 points: any correct proof is acceptable. Some partial credit should be given as well if some of the ideas are correct.]
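A quick numerical spot-check of the inequality on random class distributions (using the natural log, a base less than or equal to e as in the proof):

    import numpy as np

    rng = np.random.default_rng(7)
    for _ in range(5):
        p = rng.dirichlet(np.ones(4))            # random distribution over K = 4 classes
        gini = np.sum(p * (1 - p))
        entropy = -np.sum(p * np.log(p))
        print(round(gini, 4), round(entropy, 4), gini < entropy)   # always True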
