CSCI567 Fall 2015                                                    Homework #1 Solution

1 Density Estimation

(a)
The PDF of the Beta distribution is
\[
f(x; \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\, x^{\alpha-1}(1-x)^{\beta-1}.
\]
When $\beta = 1$, it becomes
\[
f(x; \alpha) = \alpha x^{\alpha-1}.
\]
The likelihood is
\[
L = L(\alpha) = \prod_{i=1}^n f(x_i; \alpha) = \prod_{i=1}^n \alpha x_i^{\alpha-1},
\qquad
\ln L = n \ln \alpha + (\alpha - 1) \sum_{i=1}^n \ln x_i.
\]
Let
\[
\frac{\partial \ln L}{\partial \alpha} = \frac{n}{\alpha} + \sum_{i=1}^n \ln x_i = 0.
\]
We have
\[
\hat{\alpha} = -\frac{n}{\sum_{i=1}^n \ln x_i}.
\]
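As a quick sanity check (not part of the original solution), here is a minimal Python sketch, assuming NumPy and SciPy are available: it draws samples from Beta(alpha, 1), evaluates the closed-form estimator -n / sum(ln x_i), and compares it with a direct numerical maximization of the log-likelihood.

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(0)
    alpha_true = 2.5
    x = rng.beta(alpha_true, 1.0, size=10_000)   # Beta(alpha, beta=1) samples

    # Closed-form MLE derived above: alpha_hat = -n / sum(log x_i)
    alpha_closed = -len(x) / np.sum(np.log(x))

    # Numerical check: maximize ln L(alpha) = n*ln(alpha) + (alpha - 1)*sum(log x_i)
    neg_loglik = lambda a: -(len(x) * np.log(a) + (a - 1.0) * np.sum(np.log(x)))
    alpha_numeric = minimize_scalar(neg_loglik, bounds=(1e-6, 50.0), method="bounded").x

    print(alpha_closed, alpha_numeric)   # both should be close to alpha_true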

Now consider
\[
f(x; \theta) = N(\theta, \theta) = \frac{1}{\sqrt{2\pi\theta}}\, e^{-\frac{(x-\theta)^2}{2\theta}},
\qquad
L = L(\theta) = \prod_{i=1}^n f(x_i; \theta) = (2\pi\theta)^{-\frac{n}{2}} \prod_{i=1}^n e^{-\frac{(x_i-\theta)^2}{2\theta}}.
\]
\[
\ln L = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\theta - \frac{1}{2\theta}\sum_{i=1}^n (x_i-\theta)^2
      = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\theta - \frac{1}{2\theta}\sum_{i=1}^n x_i^2 + \sum_{i=1}^n x_i - \frac{n\theta}{2}.
\]
Let
\[
\frac{\partial \ln L}{\partial \theta} = -\frac{n}{2\theta} + \frac{\sum_{i=1}^n x_i^2}{2\theta^2} - \frac{n}{2} = 0.
\]
Then, multiplying by $2\theta^2$, we have
\[
\sum_{i=1}^n x_i^2 - n\theta - n\theta^2 = 0.
\]
We know $\theta > 0$, so
\[
\hat{\theta} = \frac{-1 + \sqrt{1 + \frac{4}{n}\sum_{i=1}^n x_i^2}}{2}.
\]
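A similar verification sketch (again not from the original solution): it simulates data from a normal whose mean and variance are both theta and checks the quadratic-formula MLE against numerical maximization.

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(1)
    theta_true = 3.0
    x = rng.normal(loc=theta_true, scale=np.sqrt(theta_true), size=10_000)

    # Closed-form MLE from the quadratic n*theta^2 + n*theta - sum(x_i^2) = 0
    n, s2 = len(x), np.sum(x ** 2)
    theta_closed = (-1.0 + np.sqrt(1.0 + 4.0 * s2 / n)) / 2.0

    # Numerical check of the same log-likelihood
    def neg_loglik(t):
        return 0.5 * n * np.log(2 * np.pi * t) + np.sum((x - t) ** 2) / (2 * t)
    theta_numeric = minimize_scalar(neg_loglik, bounds=(1e-6, 50.0), method="bounded").x

    print(theta_closed, theta_numeric)   # both should be close to theta_true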

(b)




\[
\mathbb{E}_{X_1,\dots,X_n}[\hat{f}(x)]
= \mathbb{E}\!\left[\frac{1}{n}\sum_{i=1}^n \frac{1}{h} K\!\left(\frac{x - X_i}{h}\right)\right]
= \mathbb{E}\!\left[\frac{1}{h} K\!\left(\frac{x - X}{h}\right)\right]
= \int \frac{1}{h} K\!\left(\frac{x - t}{h}\right) f(t)\, dt.
\]
By letting $z = \frac{x-t}{h}$, we have
\[
\int \frac{1}{h} K\!\left(\frac{x - t}{h}\right) f(t)\, dt = \int K(z)\, f(x - zh)\, dz.
\]
By Taylor's theorem, we have
\[
f(x - zh) = f(x) - zh f'(x) + \frac{z^2 h^2}{2} f''(x) + \cdots
\]
so
\[
\mathbb{E}[\hat{f}(x)] = \int K(z)\, f(x - zh)\, dz
= \int K(z)\left[f(x) - zh f'(x) + \frac{z^2 h^2}{2} f''(x) + \cdots\right] dz
\]
\[
= f(x)\int K(z)\, dz - f'(x)\, h \int z K(z)\, dz + f''(x)\, \frac{h^2}{2} \int z^2 K(z)\, dz + \cdots
= f(x) + \frac{h^2 \sigma_K^2}{2} f''(x) + o(h^2),
\]
using the standard kernel assumptions $\int K(z)\, dz = 1$, $\int z K(z)\, dz = 0$ (symmetric kernel), and $\int z^2 K(z)\, dz = \sigma_K^2$.
So we know the bias term is
\[
\mathbb{E}[\hat{f}(x)] - f(x) = \frac{h^2 \sigma_K^2}{2} f''(x) + o(h^2).
\]
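The leading bias term can be checked empirically. The sketch below (an illustration, not part of the original solution) uses a Gaussian kernel (so sigma_K^2 = 1) and standard normal data, and compares the average of many KDE estimates at a point with the predicted bias h^2 * sigma_K^2 * f''(x) / 2.

    import numpy as np

    rng = np.random.default_rng(2)
    n, h, x0, trials = 2000, 0.3, 0.0, 400

    def gaussian_kernel(u):
        return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

    estimates = []
    for _ in range(trials):
        data = rng.standard_normal(n)
        estimates.append(np.mean(gaussian_kernel((x0 - data) / h)) / h)

    f = lambda t: np.exp(-0.5 * t**2) / np.sqrt(2 * np.pi)
    f_second = (x0**2 - 1) * f(x0)               # f''(x) for the standard normal density
    empirical_bias = np.mean(estimates) - f(x0)
    predicted_bias = h**2 * 1.0 * f_second / 2   # sigma_K^2 = 1 for the Gaussian kernel

    print(empirical_bias, predicted_bias)        # should be of similar size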
2

2 Histogram Density Estimates

Part (a)
The estimate of the density is equivalent to the fraction of the samples that fall within the given bin, divided by the bin width $h$. Mathematically:
\[
\hat{f}_n(x) = \frac{1}{nh} \sum_{i=1}^{n} \mathbf{1}_{(x_0,\, x_0+h]}(x_i),
\]
where $\mathbf{1}_{(x_0,\, x_0+h]}(x_i)$ is $1$ if $x_i \in (x_0, x_0 + h]$ and $0$ otherwise.


Let $B$ be the number of samples that fall in $(x_0, x_0+h]$, so $B \sim \mathrm{Binomial}(n, p)$ with $p = F(x_0+h) - F(x_0)$. The mean of $B$ is $np$, hence:
\[
\mathbb{E}[\hat{f}_n(x)] = \frac{1}{nh}\,\mathbb{E}(B) = \frac{F(x_0+h) - F(x_0)}{h}.
\]
The variance of $B$ is $np(1-p)$, hence:
\[
\mathrm{Var}(\hat{f}_n(x)) = \frac{1}{n^2h^2}\,\mathrm{Var}(B) = \mathbb{E}[\hat{f}_n(x)]\,\frac{1 - F(x_0+h) + F(x_0)}{nh}.
\]

If we let $h \to 0$ and $n \to \infty$, then
\[
\mathbb{E}[\hat{f}_n(x)] \to f(x_0),
\]
since the pdf is the derivative of the CDF. But since $x$ is between $x_0$ and $x_0+h$, $f(x_0) \to f(x)$. So if we use smaller and smaller bins as we get more data, the histogram density estimate is asymptotically unbiased. We'd also like its variance to shrink as the sample grows. Since $1 - F(x_0+h) + F(x_0) \to 1$ as $h \to 0$, to get the variance to go away we need $nh \to \infty$.
To put this together, our first conclusion is that histogram density estimates will be consistent when $h \to 0$ but $nh \to \infty$ as $n \to \infty$. The bin width $h$ needs to shrink, but more slowly than $1/n$.
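A small simulation sketch of this consistency argument, under the assumption that the true density is standard normal; the bin width shrinks like n^(-1/3), so h goes to 0 while nh goes to infinity.

    import numpy as np

    # Histogram density estimate at a point x, with bin (x0, x0 + h] chosen to contain x.
    def hist_density(data, x, h):
        x0 = np.floor(x / h) * h          # left edge of the bin containing x
        count = np.sum((data > x0) & (data <= x0 + h))
        return count / (len(data) * h)

    rng = np.random.default_rng(3)
    true_pdf = lambda t: np.exp(-0.5 * t**2) / np.sqrt(2 * np.pi)

    # Shrink h more slowly than 1/n (here h ~ n^(-1/3)) so that n*h -> infinity.
    for n in [100, 10_000, 1_000_000]:
        data = rng.standard_normal(n)
        h = n ** (-1 / 3)
        print(n, hist_density(data, 0.5, h), true_pdf(0.5))   # estimate approaches the truth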

3 Naive Bayes

(a) (10 points) Suppose $X = \{X_i\}_{i=1}^D \in \mathbb{R}^D$ represents the features and $Y \in \{0, 1\}$ represents the class labels. Let the following assumptions hold:

(a) The label variable $Y$ follows a Bernoulli distribution, with parameter $\theta = P(Y = 1)$.
(b) For each feature $X_j$, we have $P(X_j \mid Y = y_k)$, which follows a Gaussian distribution $N(\mu_{jk}, \sigma_j)$.

Using the Naive Bayes assumption, that for all $j' \neq j$, $X_j$ and $X_{j'}$ are conditionally independent given $Y$, compute $P(Y = 1 \mid X)$ and show that it can be written in the following form:
\[
P(Y = 1 \mid X) = \frac{1}{1 + \exp(w_0 + w^T X)}.
\]
Specifically, find the explicit form of $w_0$ and $w$ in terms of $\theta$, $\mu_{jk}$, and $\sigma_j$, for $j = 1, \dots, D$ and $k \in \{0, 1\}$.

Solution: For ease of indexing, and without loss of generality, let $y_0 = 0$ and $y_1 = 1$:
\[
P(Y = 1 \mid X) = \frac{P(X \mid Y = 1)\, P(Y = 1)}{P(X)} = P(X \mid Y = 1)\,\frac{\theta}{P(X)},
\]
with $P(Y = 1) = \theta$. Now
\[
P(X) = \sum_y P(Y = y)\, P(X \mid Y = y) = \theta\, P(X \mid Y = 1) + (1 - \theta)\, P(X \mid Y = 0).
\]
Thus
\[
P(Y = 1 \mid X) = \frac{1}{1 + \left(\frac{1}{\theta} - 1\right)\frac{P(X \mid Y = 0)}{P(X \mid Y = 1)}}.
\]
Explicitly, using
\[
P(X \mid Y = y_k) = \prod_{j=1}^D (2\pi\sigma_j^2)^{-\frac{1}{2}} \exp\!\left(-(2\sigma_j^2)^{-1}(x_j - \mu_{jk})^2\right),
\]
we get
\[
P(Y = 1 \mid X) = \frac{1}{1 + \left(\frac{1}{\theta} - 1\right)\prod_{j=1}^D \exp\!\left((2\sigma_j^2)^{-1}\left((x_j - \mu_{j1})^2 - (x_j - \mu_{j0})^2\right)\right)}.
\]
Consider only the second term of the denominator:
\[
\left(\frac{1}{\theta} - 1\right)\prod_{j=1}^D \exp\!\left((2\sigma_j^2)^{-1}\left((x_j - \mu_{j1})^2 - (x_j - \mu_{j0})^2\right)\right)
\]
\[
= \exp\!\left(\log\!\left(\tfrac{1}{\theta} - 1\right)\right)\exp\!\left(\sum_j (2\sigma_j^2)^{-1}\left(x_j^2 - 2x_j\mu_{j1} + \mu_{j1}^2 - x_j^2 + 2x_j\mu_{j0} - \mu_{j0}^2\right)\right)
\]
\[
= \exp\!\left(\log\!\left(\tfrac{1}{\theta} - 1\right) + \sum_j (2\sigma_j^2)^{-1}\left(\mu_{j1}^2 - \mu_{j0}^2\right) + \sum_j (\sigma_j^2)^{-1}(\mu_{j0} - \mu_{j1})\, x_j\right).
\]
Comparing this with the second term in the denominator of
\[
P(Y = 1 \mid X) = \frac{1}{1 + \exp(w_0 + w^T X)}
\]
gives us:
(a) $w_0 = \log\!\left(\frac{1}{\theta} - 1\right) + \sum_j \left[(2\sigma_j^2)^{-1}\left(\mu_{j1}^2 - \mu_{j0}^2\right)\right]$;
(b) $w_j = \frac{\mu_{j0} - \mu_{j1}}{\sigma_j^2}$, for $j \geq 1$.
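The algebra above can be verified numerically. The following sketch (illustrative only; the parameter values are made up) computes the class posterior of a Gaussian Naive Bayes model directly by Bayes' rule and through the logistic form with the derived w_0 and w.

    import numpy as np

    rng = np.random.default_rng(4)
    D = 4
    theta = 0.3
    mu = rng.normal(size=(2, D))          # mu[k, j] = mean of feature j under class k
    sigma2 = rng.uniform(0.5, 2.0, D)     # shared per-feature variances sigma_j^2
    x = rng.normal(size=D)

    def log_gauss(v, m, s2):
        return -0.5 * np.log(2 * np.pi * s2) - (v - m) ** 2 / (2 * s2)

    # Direct Bayes-rule posterior
    log_p1 = np.log(theta) + np.sum(log_gauss(x, mu[1], sigma2))
    log_p0 = np.log(1 - theta) + np.sum(log_gauss(x, mu[0], sigma2))
    posterior_direct = np.exp(log_p1) / (np.exp(log_p1) + np.exp(log_p0))

    # Logistic form with the weights derived above
    w0 = np.log(1 / theta - 1) + np.sum((mu[1] ** 2 - mu[0] ** 2) / (2 * sigma2))
    w = (mu[0] - mu[1]) / sigma2
    posterior_logistic = 1.0 / (1.0 + np.exp(w0 + w @ x))

    print(posterior_direct, posterior_logistic)   # should match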

(b) (10 points) IMPORTANT: if someone solves the problem with $\sigma_{jk}$ or $\sigma_j$, either one is fine. Here are two solutions.
Case 1: $\sigma_{jk}$.
The data is $\mathcal{D} = \{(X_i, Y_i)\}_{i=1}^N$, and the parameters are $\theta = \{p_k, \mu_{jk}, \sigma_{jk}\}_{j=1,\dots,D;\, k=1,\dots,K}$.
By the product rule and the Naive Bayes assumption, we know
\[
P(X_i, Y_i; \theta) = P(Y_i)\, P(X_i \mid Y_i) = P(Y_i) \prod_{j=1}^D P(x_{ij} \mid Y_i)
= p_{Y_i} \prod_{j=1}^D N(\mu_{j,Y_i}, \sigma_{j,Y_i})
= p_{Y_i} \prod_{j=1}^D \frac{1}{\sqrt{2\pi}\,\sigma_{j,Y_i}}\, e^{-\frac{(x_{ij} - \mu_{j,Y_i})^2}{2\sigma_{j,Y_i}^2}}.
\]

The log likelihood is
\[
L = L(\mathcal{D}; \theta) = \ln \prod_{i=1}^N P(X_i, Y_i; \theta) = \sum_{i=1}^N \ln P(X_i, Y_i; \theta)
= \sum_{i=1}^N \ln\!\left(p_{Y_i} \prod_{j=1}^D \frac{1}{\sqrt{2\pi}\,\sigma_{j,Y_i}}\, e^{-\frac{(x_{ij} - \mu_{j,Y_i})^2}{2\sigma_{j,Y_i}^2}}\right)
\]
\[
= \sum_{i=1}^N \ln p_{Y_i} - \sum_{i=1}^N \sum_{j=1}^D \ln\!\left(\sqrt{2\pi}\,\sigma_{j,Y_i}\right) - \sum_{i=1}^N \sum_{j=1}^D \frac{(x_{ij} - \mu_{j,Y_i})^2}{2\sigma_{j,Y_i}^2}
\]
\[
= \sum_{i=1}^N \ln p_{Y_i} - \frac{ND\ln 2\pi}{2} - \frac{1}{2}\sum_{i=1}^N \sum_{j=1}^D \ln \sigma_{j,Y_i}^2 - \sum_{i=1}^N \sum_{j=1}^D \frac{(x_{ij} - \mu_{j,Y_i})^2}{2\sigma_{j,Y_i}^2}.
\]

To solve the MLE problem with the constraint $\sum_{k=1}^K p_k = 1$, we use a Lagrange multiplier:
\[
\tilde{L} = L + \lambda\left(\sum_{k=1}^K p_k - 1\right).
\]
We also denote $N_k = \sum_{i=1}^N I(Y_i = k)$ as the number of samples belonging to class $k$. We let
\[
\frac{\partial \tilde{L}}{\partial p_k} = \sum_{i=1}^N \frac{I(Y_i = k)}{p_k} + \lambda = \frac{N_k}{p_k} + \lambda = 0.
\]

Then, combining with the constraint $\sum_{k=1}^K p_k = 1$ (which gives $\lambda = -N$), we have
\[
\hat{p}_k = \frac{N_k}{N}.
\]
Also, we let
\[
\frac{\partial L}{\partial \mu_{jk}} = \sum_{i=1}^N I(Y_i = k)\,\frac{x_{ij} - \mu_{jk}}{\sigma_{jk}^2}
= \frac{1}{\sigma_{jk}^2}\sum_{i=1}^N I(Y_i = k)(x_{ij} - \mu_{jk}) = 0.
\]
Then we have
\[
\hat{\mu}_{jk} = \frac{\sum_{i=1}^N I(Y_i = k)\, x_{ij}}{N_k}.
\]

Also, we let
\[
\frac{\partial L}{\partial \sigma_{jk}^2}
= -\frac{1}{2\sigma_{jk}^2}\sum_{i=1}^N I(Y_i = k) + \sum_{i=1}^N I(Y_i = k)\,\frac{(x_{ij} - \mu_{jk})^2}{2\sigma_{jk}^4}
= -\frac{N_k}{2\sigma_{jk}^2} + \frac{\sum_{i=1}^N I(Y_i = k)(x_{ij} - \mu_{jk})^2}{2\sigma_{jk}^4} = 0.
\]
Since $\sigma_{jk} > 0$, we have
\[
\hat{\sigma}_{jk}^2 = \frac{\sum_{i=1}^N I(Y_i = k)(x_{ij} - \hat{\mu}_{jk})^2}{N_k}
= \frac{\sum_{i=1}^N I(Y_i = k)\left(x_{ij} - \frac{\sum_{i'=1}^N I(Y_{i'} = k)\, x_{i'j}}{N_k}\right)^{2}}{N_k}.
\]
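For concreteness, a minimal NumPy sketch of these Case 1 estimators on synthetic data (the data-generating values are invented for illustration):

    import numpy as np

    rng = np.random.default_rng(5)
    N, D, K = 500, 3, 2
    y = rng.integers(0, K, size=N)
    true_mu = np.array([[0.0, 1.0, -1.0], [2.0, -2.0, 0.5]])
    X = true_mu[y] + rng.normal(scale=1.5, size=(N, D))

    # MLE formulas derived above (Case 1: per-class, per-feature variances)
    p_hat = np.array([np.mean(y == k) for k in range(K)])                 # N_k / N
    mu_hat = np.array([X[y == k].mean(axis=0) for k in range(K)])         # class means
    var_hat = np.array([((X[y == k] - mu_hat[k]) ** 2).mean(axis=0)       # class variances
                        for k in range(K)])

    print(p_hat)     # close to [0.5, 0.5]
    print(mu_hat)    # close to true_mu
    print(var_hat)   # close to 1.5**2 = 2.25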

Case 2: $\sigma_j$. It has the same answers for $p_k$ and $\mu_{jk}$.
The data is $\{(X_i, Y_i)\}_{i=1}^N$, and the parameters are $\theta = \{p_k, \mu_{jk}, \sigma_j\}_{j=1,\dots,D;\, k \in \{0,1\}}$.
The log likelihood is
\[
L = L(\mathcal{D}; \theta) = \ln \prod_{i=1}^N P(X_i, Y_i; \theta) = \sum_{i=1}^N \ln P(X_i, Y_i; \theta)
= \sum_{i=1}^N \ln\!\left(p_{y_i} \prod_{j=1}^D \frac{1}{\sqrt{2\pi\sigma_j^2}}\, e^{-\frac{(x_{ij} - \mu_{j,y_i})^2}{2\sigma_j^2}}\right)
\]
\[
= \sum_{i=1}^N \ln p_{y_i} - \sum_{i=1}^N \sum_{j=1}^D \ln\sqrt{2\pi\sigma_j^2} - \sum_{i=1}^N \sum_{j=1}^D \frac{(x_{ij} - \mu_{j,y_i})^2}{2\sigma_j^2}
= \sum_{i=1}^N \ln p_{y_i} - \frac{ND\ln 2\pi}{2} - \sum_{i=1}^N \sum_{j=1}^D \ln \sigma_j - \sum_{i=1}^N \sum_{j=1}^D \frac{(x_{ij} - \mu_{j,y_i})^2}{2\sigma_j^2}.
\]

We keep the terms of $L$ that depend on the $\sigma_j$'s:
\[
L(\sigma) = -\sum_{i=1}^N \sum_{j=1}^D \ln \sigma_j - \sum_{i=1}^N \sum_{j=1}^D \frac{(x_{ij} - \mu_{j,y_i})^2}{2\sigma_j^2}.
\]
We let
\[
\frac{\partial L}{\partial \sigma_j^2} = -\frac{N}{2\sigma_j^2} + \frac{\sum_{i=1}^N (x_{ij} - \mu_{j,y_i})^2}{2\sigma_j^4} = 0.
\]
Since $\sigma_j > 0$, we have
\[
\hat{\sigma}_j^2 = \frac{\sum_{i=1}^N (x_{ij} - \mu_{j,y_i})^2}{N}
= \frac{1}{N}\sum_{k}\sum_{i=1}^N I(Y_i = k)(x_{ij} - \hat{\mu}_{jk})^2.
\]
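A corresponding sketch for Case 2 on the same synthetic setup as the previous snippet: the shared per-feature variance pools squared deviations from each sample's own class mean over all N samples.

    import numpy as np

    rng = np.random.default_rng(5)
    N, D, K = 500, 3, 2
    y = rng.integers(0, K, size=N)
    true_mu = np.array([[0.0, 1.0, -1.0], [2.0, -2.0, 0.5]])
    X = true_mu[y] + rng.normal(scale=1.5, size=(N, D))

    mu_hat = np.array([X[y == k].mean(axis=0) for k in range(K)])
    # Case 2: one variance per feature, pooled across classes
    var_shared = ((X - mu_hat[y]) ** 2).mean(axis=0)
    print(var_shared)   # close to 1.5**2 = 2.25 for each feature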

4 Nearest Neighbor

(a) Let us use the following label convention:
Mathematics: 1
Electrical Engineering: 2
Computer Science: 3
Economics: 4
Thus the unnormalized data is:

x-coordinate   y-coordinate   label
      0             49          1
     -7             32          1
     -9             47          1
     29             12          2
     49             31          2
     37             38          2
      8              9          3
     13             -1          3
     -6             -3          3
    -21             12          3
     27            -32          4
     19            -14          4
     27            -20          4

Table 1: Unnormalized labeled data.


The mean and standard deviation are:

        x-coordinate   y-coordinate
mean       12.77          12.31
std        20.7170        25.9306

Table 2: Mean and standard deviation of the labeled data.


The normalized queried student coordinate is:

normalized queried student x-coordinate   normalized queried student y-coordinate
                0.3490                                    -0.2047

Table 3: Normalized queried data.


The L1 and L2 distances between the queried student coordinate and each labeled data point are:

L1 Distance   L2 Distance   label
   2.5851        1.8856       1
   2.2674        1.6211       1
   2.9424        2.0830       1
   0.6272        0.4753       2
   2.3253        1.6781       2
   2.0161        1.4500       2
   0.6564        0.5843       3
   0.6464        0.4575       3
   1.6407        1.3129       3
   2.1719        1.9884       3
   1.8419        1.5415       4
   0.8581        0.8113       4
   1.3791        1.0947       4

Table 4: L1 and L2 distances between the normalized queried data and each of the normalized labeled data points.
If sorted (from minimum to maximum) by L2 distance:

L2 Distance   label
   0.4575       3
   0.4753       2
   0.5843       3
   0.8113       4
   1.0947       4
   1.3129       3
   1.4500       2
   1.5415       4
   1.6211       1
   1.6781       2
   1.8856       1
   1.9884       3
   2.0830       1

Table 5: Sorting based on L2 distance between the normalized queried data and each of the normalized labeled data points.
If sorted (from minimum to maximum) by L1 distance:

L1 Distance   label
   0.6272       2
   0.6464       3
   0.6564       3
   0.8581       4
   1.3791       4
   1.6407       3
   1.8419       4
   2.0161       2
   2.1719       3
   2.2674       1
   2.3253       2
   2.5851       1
   2.9424       1

Table 6: Sorting based on L1 distance between the normalized queried data and each of the normalized labeled data points.
Thus:
If using the L2 distance metric and K = 1, the predicted student major will be label 3 (Computer Science).
If using the L2 distance metric and K = 5, the predicted student major will be label 3 (Computer Science). Actually this is a tie between Computer Science and Economics, but the tie is broken in favor of the labeled data point with the shortest distance.
If using the L1 distance metric and K = 1, the predicted student major will be label 2 (Electrical Engineering).
If using the L1 distance metric and K = 5, the predicted student major will be label 3 (Computer Science). Actually this is a tie between Computer Science and Economics, but the tie is broken in favor of the labeled data point with the shortest distance.
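The table computations above can be reproduced with the short sketch below (not part of the original solution). The query's raw coordinates are not restated here, so the snippet starts from the normalized query of Table 3; the normalization uses the sample standard deviation (ddof = 1), which matches Table 2.

    import numpy as np
    from collections import Counter

    data = np.array([[0, 49], [-7, 32], [-9, 47], [29, 12], [49, 31], [37, 38],
                     [8, 9], [13, -1], [-6, -3], [-21, 12], [27, -32], [19, -14], [27, -20]],
                    dtype=float)
    labels = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4])

    mean, std = data.mean(axis=0), data.std(axis=0, ddof=1)
    normalized = (data - mean) / std
    query = np.array([0.3490, -0.2047])          # normalized query from Table 3

    for name, dist in [("L1", np.sum(np.abs(normalized - query), axis=1)),
                       ("L2", np.sqrt(np.sum((normalized - query) ** 2, axis=1)))]:
        order = np.argsort(dist)
        for k in (1, 5):
            votes = Counter(labels[order[:k]])
            top = max(votes.values())
            # tie-break: among the most frequent labels, pick the one whose neighbor is nearest
            winner = next(lab for lab in labels[order[:k]] if votes[lab] == top)
            print(name, "K =", k, "->", winner)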
(b) Probabilistic K-Nearest Neighbor:
The unconditional density $p(x)$ can be computed as follows:
\[
p(x) = \sum_c p(x \mid Y = c)\, p(Y = c)
= \sum_c \frac{K_c}{N_c V}\cdot\frac{N_c}{N}
= \sum_c \frac{K_c}{NV}
= \frac{K}{NV}.
\]
The posterior probability of class membership $p(Y = c \mid x)$ can be computed as follows:
\[
p(Y = c \mid x) = \frac{p(x \mid Y = c)\, p(Y = c)}{p(x)}
= \frac{\frac{K_c}{N_c V}\cdot\frac{N_c}{N}}{\frac{K}{NV}}
= \frac{\frac{K_c}{NV}}{\frac{K}{NV}}
= \frac{K_c}{K}.
\]
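A small illustrative sketch of this estimator (the dataset and K below are made up): it estimates p(Y = c | x) as K_c / K among the K nearest neighbors.

    import numpy as np
    from collections import Counter

    rng = np.random.default_rng(6)
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
    y = np.array([0] * 100 + [1] * 100)

    def knn_posterior(x, K=15):
        nearest = np.argsort(np.linalg.norm(X - x, axis=1))[:K]
        counts = Counter(y[nearest])
        return {c: counts.get(c, 0) / K for c in np.unique(y)}   # K_c / K for each class c

    print(knn_posterior(np.array([1.5, 1.5])))   # roughly balanced between the two classes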

5 MLE and MAP

(a) (3 points)
(a) The joint probability distribution, $P(X = x, P = p)$:
\[
P(X = x, P = p) = P(X = x \mid P = p)\, P(P = p)
= \binom{n}{x} p^x (1-p)^{n-x}\, I(0 < p < 1)
= \binom{n}{x} p^x (1-p)^{n-x}.
\]
(b) The marginal probability distribution, $P(X = x)$:
\[
P(X = x) = \int P(X = x, p)\, dp
= \int_0^1 \binom{n}{x} p^x (1-p)^{n-x}\, dp
= \binom{n}{x} B(x+1,\, n-x+1).
\]

(c) The posterior distribution, $P(P = p \mid X = x)$:
\[
P(P = p \mid X = x) = \frac{P(P = p, X = x)}{P(X = x)}
= \frac{\binom{n}{x} p^x (1-p)^{n-x}}{\binom{n}{x} B(x+1,\, n-x+1)}
= \frac{p^x (1-p)^{n-x}}{B(x+1,\, n-x+1)}.
\]
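A numerical check (assuming SciPy is available), confirming that normalizing the binomial likelihood under a uniform prior recovers the Beta(x + 1, n - x + 1) density:

    import numpy as np
    from scipy.stats import beta, binom

    n, x = 10, 3
    p_grid = np.linspace(1e-4, 1 - 1e-4, 500)

    joint = binom.pmf(x, n, p_grid)                                 # C(n,x) p^x (1-p)^(n-x)
    posterior = joint / (joint.sum() * (p_grid[1] - p_grid[0]))     # normalize over p (Riemann sum)
    print(np.max(np.abs(posterior - beta.pdf(p_grid, x + 1, n - x + 1))))   # small (grid error only)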

(b) (3 points)

(a) The marginal probability distribution, $P(X = x)$:
\[
P(X = x) = \int_0^1 P(X = x, p)\, dp
= \int_0^1 \binom{n}{x} p^x (1-p)^{n-x}\, \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha, \beta)}\, dp
\]
\[
= \frac{\binom{n}{x}}{B(\alpha, \beta)} \int_0^1 p^{x+\alpha-1} (1-p)^{n-x+\beta-1}\, dp
= \frac{\binom{n}{x}\, B(x+\alpha,\, n-x+\beta)}{B(\alpha, \beta)}.
\]
(b) The posterior distribution, $P(P = p \mid X = x)$:
\[
P(P = p \mid X = x) = \frac{P(P = p, X = x)}{P(X = x)}
= \frac{\binom{n}{x} p^x (1-p)^{n-x}\, \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha,\beta)}}{\frac{\binom{n}{x}\, B(x+\alpha,\, n-x+\beta)}{B(\alpha,\beta)}}
= \frac{p^{x+\alpha-1}(1-p)^{n-x+\beta-1}}{B(x+\alpha,\, n-x+\beta)}.
\]
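The same kind of check with a Beta(alpha, beta) prior (the values below are arbitrary): the normalized product of likelihood and prior should match the Beta(x + alpha, n - x + beta) density.

    import numpy as np
    from scipy.stats import beta, binom

    n, x, a, b = 10, 3, 2.0, 5.0
    p_grid = np.linspace(1e-4, 1 - 1e-4, 500)

    unnorm = binom.pmf(x, n, p_grid) * beta.pdf(p_grid, a, b)
    posterior = unnorm / (unnorm.sum() * (p_grid[1] - p_grid[0]))
    print(np.max(np.abs(posterior - beta.pdf(p_grid, x + a, n - x + b))))   # small (grid error only)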

(c) (9 points)
(a) MLE and MAP of (a):
\[
\frac{\partial P(X = x \mid P = p)}{\partial p} = \frac{\partial}{\partial p}\binom{n}{x} p^x (1-p)^{n-x}
= \binom{n}{x}\left[x p^{x-1}(1-p)^{n-x} - p^x (n-x)(1-p)^{n-x-1}\right]
\]
\[
= \binom{n}{x} p^{x-1}(1-p)^{n-x-1}\left[x(1-p) - (n-x)p\right] = 0
\quad\Longrightarrow\quad
\hat{p} = \frac{x}{n}.
\]
MAP has the same result because the prior does not depend on $p$ (it is uniform on $(0, 1)$). If the prior is independent of the parameter $p$, the MLE and MAP estimates are the same.
(b) MLE and MAP of (b):
The MLE estimate is the same as above. For the MAP estimate,
\[
\frac{\partial P(P = p \mid X = x)}{\partial p}
= \frac{1}{B(x+\alpha,\, n-x+\beta)}\,\frac{\partial}{\partial p}\, p^{x+\alpha-1}(1-p)^{n-x+\beta-1}
\]
\[
= \frac{1}{B(x+\alpha,\, n-x+\beta)}\left[(x+\alpha-1)p^{x+\alpha-2}(1-p)^{n-x+\beta-1} - p^{x+\alpha-1}(n-x+\beta-1)(1-p)^{n-x+\beta-2}\right]
\]
\[
= \frac{p^{x+\alpha-2}(1-p)^{n-x+\beta-2}}{B(x+\alpha,\, n-x+\beta)}\left[(x+\alpha-1)(1-p) - p(n-x+\beta-1)\right] = 0
\quad\Longrightarrow\quad
\hat{p} = \frac{x+\alpha-1}{n+\alpha+\beta-2}.
\]
The MLE and MAP estimates are different because of the prior distribution.
When $x = 2$, $n = 10$, we will say that $\hat{p} = 0.2$ under MLE estimation. However, if we have a plausible prior distribution such as $\alpha = 50$, $\beta = 50$ (i.e., the coin is fair), the MAP estimate is $\hat{p} = \frac{2+50-1}{10+50+50-2} = 0.4722$. The MAP estimate is not sensitive to the small number of exceptional occurrences (2 out of 10). Thus, MAP is more robust than MLE here.
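The arithmetic of this comparison in a couple of lines of Python:

    # Quick numerical check of the MLE vs MAP comparison above.
    x, n = 2, 10
    alpha, beta_ = 50, 50

    p_mle = x / n
    p_map = (x + alpha - 1) / (n + alpha + beta_ - 2)
    print(p_mle, round(p_map, 4))   # 0.2 and 0.4722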

6 Decision Tree

Part (a) We should split on Traffic because it gives a perfect prediction of Accident rate. The others cannot give a perfect prediction. [5 points: just mentioning the fact that Traffic gives a perfect prediction.]
Part (b) We can think about decision trees as partitioning the space of observations along each axis. If every feature is continuous and ordered, we can transform T1 into T2 by taking each decision boundary, subtracting off the appropriate mean, and then dividing by the appropriate variance. Both trees have the same structure and the same accuracy. In other words, a linear transformation does not change the informativeness of the features. [5 points: the argument that informativeness doesn't change with a linear transformation.]
Part (c)
Consider the difference between the Gini Index and the Cross Entropy:
\[
G - CE = \sum_{k=1}^K \left[p_k(1 - p_k)\right] + \sum_{k=1}^K \left[p_k \log p_k\right]
= \sum_{k=1}^K p_k\left(1 - p_k + \log p_k\right).
\]
Now examine the function $f(x) = 1 - x + \log(x)$, where the base of the log is less than or equal to $e$ (the cross entropy is defined with base 2). Note that $f$ is continuous on the positive real line. Now consider the derivative $\frac{d}{dx} f = -1 + \frac{1}{x \ln(a)}$, where $a$ is the base of the log. This function is also continuous on the positive real line. For all $a \leq e$, $\ln(a) \leq 1$, so $\frac{1}{x \ln(a)} \geq \frac{1}{x} > 1$ for all $x \in (0, 1)$, and for $x = 1$, $\frac{1}{x \ln(a)} \geq 1$. This implies that $\frac{d}{dx} f(x) > 0$ for $x \in (0, 1)$ and $a < e$, so $f$ has no critical points in $(0, 1)$.
Note that $f(x) \to -\infty$ as $x \to 0^{+}$, and consider $x = 1$: $f(1) = 0$, and $f$ has no critical points before it, so it cannot take any positive values on $(0, 1)$ (if $f$ were to take a positive value, then since it is continuous and decreases toward $-\infty$ as $x \to 0^{+}$, it would have to have a negative derivative somewhere, meaning its derivative would have a zero, meaning $f$ would have a critical point: contradiction).
Thus $1 - p_k + \log p_k < 0$ for $p_k \in (0, 1)$, meaning that $G - CE < 0$, meaning that the Gini Index is always less than the Cross Entropy. [5 points: any correct proof is acceptable. Some partial credit should be given as well if some of the ideas are correct.]
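A quick numerical spot-check of the inequality on random class distributions (using the natural log, a base less than or equal to e as in the proof):

    import numpy as np

    rng = np.random.default_rng(7)
    for _ in range(5):
        p = rng.dirichlet(np.ones(4))            # random distribution over K = 4 classes
        gini = np.sum(p * (1 - p))
        entropy = -np.sum(p * np.log(p))
        print(round(gini, 4), round(entropy, 4), gini < entropy)   # always True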
