
In the following we assume for simplicity that $\mathbb{P}[Y = 0] = \mathbb{P}[Y = 1] = 1/2$. We denote the class-conditional distributions by $\rho_0(x) = p(x \mid Y = 0)$ and $\rho_1(x) = p(x \mid Y = 1)$. The Bayes optimal classifier is

$$f^*(x) = \begin{cases} 1, & \rho_1(x) > \rho_0(x) \\ 0, & \text{otherwise.} \end{cases}$$

A plug-in classifier is

$$f(x) = \begin{cases} 1, & \tilde\rho_1(x) > \tilde\rho_0(x) \\ 0, & \text{otherwise.} \end{cases}$$

We will use the following upper bound on the excess classification error. The proof is given at the end, after the main argument.

Lemma 1.

$$\mathbb{E}[\mathbf{1}\{f(X) \neq Y\}] - \mathbb{E}[\mathbf{1}\{f^*(X) \neq Y\}] \le \frac{1}{2}\int |\rho_0(x) - \tilde\rho_0(x)|\,dx + \frac{1}{2}\int |\rho_1(x) - \tilde\rho_1(x)|\,dx.$$

Now, our classifier first estimates $\theta$ and then approximates the class-conditional distributions using $\hat\theta$. Therefore, $\tilde\rho_0(x)$ is the density of $\mathcal{N}(-\hat\theta, \sigma^2 I)$ and $\tilde\rho_1(x)$ is the density of $\mathcal{N}(\hat\theta, \sigma^2 I)$. Using Pinsker's inequality we have
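
$$\frac{1}{2}\int |\rho_0(x) - \tilde\rho_0(x)|\,dx \le \sqrt{\frac{D_{\mathrm{KL}}\big(\mathcal{N}(-\theta, \sigma^2 I)\,\|\,\mathcal{N}(-\hat\theta, \sigma^2 I)\big)}{2}} = \frac{\|\theta - \hat\theta\|_2}{2\sigma}.$$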

Similarly,

$$\frac{1}{2}\int |\rho_1(x) - \tilde\rho_1(x)|\,dx \le \frac{\|\theta - \hat\theta\|_2}{2\sigma}.$$

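As a sanity check on the Pinsker step, the following minimal sketch compares the exact value of $\frac{1}{2}\int|p - q|\,dx$ with the bound $|\delta|/(2\sigma)$ in one dimension, where $p$ and $q$ are Gaussians with common variance $\sigma^2$ and means differing by $\delta$. The helper names and the chosen values of `sigma` and `delta` are illustrative assumptions, not part of the original solution; the closed form $2\Phi(|\delta|/(2\sigma)) - 1$ is the standard total variation distance between two equal-variance Gaussians.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def half_l1_distance(delta: float, sigma: float) -> float:
    """Exact (1/2) * integral |p - q| for p = N(m, sigma^2), q = N(m + delta, sigma^2)."""
    return 2.0 * normal_cdf(abs(delta) / (2.0 * sigma)) - 1.0


def pinsker_bound(delta: float, sigma: float) -> float:
    """sqrt(KL / 2) with KL = delta^2 / (2 sigma^2), which equals |delta| / (2 sigma)."""
    kl = delta ** 2 / (2.0 * sigma ** 2)
    return math.sqrt(kl / 2.0)


sigma = 1.5  # illustrative value
for delta in (0.0, 0.1, 0.5, 1.0, 3.0):
    exact = half_l1_distance(delta, sigma)
    bound = pinsker_bound(delta, sigma)
    print(f"delta={delta:4.1f}  exact={exact:.4f}  bound={bound:.4f}")
```

For small $|\delta|/\sigma$ the bound is loose only by a constant factor; for large $|\delta|/\sigma$ it exceeds 1 and becomes vacuous, since the exact quantity never exceeds 1.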
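Therefore, adding the two bounds, applying Lemma 1, and taking the expectation over the training sample, we obtain

$$\mathbb{E}[\mathbf{1}\{f(X) \neq Y\}] - \mathbb{E}[\mathbf{1}\{f^*(X) \neq Y\}] \le \mathbb{E}_{(X_1,Y_1),\dots,(X_n,Y_n)}\left[\frac{\|\theta - \hat\theta((X_1,Y_1),\dots,(X_n,Y_n))\|_2}{\sigma}\right]. \tag{1}$$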

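Now we have $\mathbb{E}[\hat\theta] = \theta$ and $\mathrm{Cov}[\hat\theta] = \frac{\sigma^2 I}{n}$. Hence

$$\begin{aligned}
\mathbb{E}[\|\hat\theta - \theta\|_2^2] &= \mathbb{E}[(\hat\theta - \theta)^T(\hat\theta - \theta)] = \mathbb{E}[\mathrm{tr}((\hat\theta - \theta)^T(\hat\theta - \theta))] = \mathbb{E}[\mathrm{tr}((\hat\theta - \theta)(\hat\theta - \theta)^T)] \\
&= \mathrm{tr}(\mathbb{E}[(\hat\theta - \theta)(\hat\theta - \theta)^T]) = \mathrm{tr}\left(\frac{\sigma^2 I}{n}\right) = \frac{d\sigma^2}{n}.
\end{aligned}$$
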
By Jensen's inequality we get

$$\mathbb{E}[\|\hat\theta - \theta\|_2] \le \sqrt{\mathbb{E}[\|\hat\theta - \theta\|_2^2]} = \sigma\sqrt{\frac{d}{n}}. \tag{2}$$

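The two moment computations above can be checked by simulation. The sketch below draws $\hat\theta - \theta$ from $\mathcal{N}(0, \sigma^2 I / n)$, estimates $\mathbb{E}\|\hat\theta - \theta\|_2^2$ and $\mathbb{E}\|\hat\theta - \theta\|_2$ by Monte Carlo, and compares them with $d\sigma^2/n$ and $\sigma\sqrt{d/n}$. The Gaussian shape of $\hat\theta$ is an assumption made here for illustration (the text only uses its mean and covariance), and the function name and parameter values are hypothetical.

```python
import math
import random


def simulate_estimation_error(d: int, n: int, sigma: float, trials: int, seed: int = 0):
    """Monte Carlo estimates of E||theta_hat - theta||^2 and E||theta_hat - theta||.

    Assumes theta_hat - theta ~ N(0, sigma^2 I / n), so every coordinate is an
    independent Gaussian with standard deviation sigma / sqrt(n).
    """
    rng = random.Random(seed)
    coord_std = sigma / math.sqrt(n)
    sum_sq, sum_norm = 0.0, 0.0
    for _ in range(trials):
        sq = sum(rng.gauss(0.0, coord_std) ** 2 for _ in range(d))
        sum_sq += sq
        sum_norm += math.sqrt(sq)
    return sum_sq / trials, sum_norm / trials


d, n, sigma = 5, 50, 2.0  # illustrative values
mean_sq, mean_norm = simulate_estimation_error(d, n, sigma, trials=20000)
print("E||err||^2 (simulated):", round(mean_sq, 4), " d*sigma^2/n:", d * sigma ** 2 / n)
print("E||err||   (simulated):", round(mean_norm, 4), " sigma*sqrt(d/n):", round(sigma * math.sqrt(d / n), 4))
```

The simulated second moment should sit close to $d\sigma^2/n$, while the simulated first moment stays strictly below $\sigma\sqrt{d/n}$, which is the gap introduced by Jensen's inequality.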
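Combining (1) and (2), and recalling that the two Pinsker bounds above add up to $\|\theta - \hat\theta\|_2/\sigma$, we obtain

$$\mathbb{E}[\mathbf{1}\{f(X) \neq Y\}] - \mathbb{E}[\mathbf{1}\{f^*(X) \neq Y\}] \le \sqrt{\frac{d}{n}},$$

and finally, using the Bayes error that we computed in class, we have

$$\mathbb{E}[\mathbf{1}\{f(X) \neq Y\}] \le \frac{\sigma^2}{\|\theta\|_2^2} + \sqrt{\frac{d}{n}}.$$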
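To see how the excess risk behaves numerically, here is a minimal sketch under the model implied above: $X \mid Y = 1 \sim \mathcal{N}(\theta, \sigma^2 I)$, $X \mid Y = 0 \sim \mathcal{N}(-\theta, \sigma^2 I)$, and a plug-in rule that predicts 1 exactly when $\hat\theta^T x > 0$. Under these assumptions the error of the plug-in rule for a fixed $\hat\theta$ is $\Phi(-\hat\theta^T\theta / (\sigma\|\hat\theta\|_2))$ and the Bayes error is $\Phi(-\|\theta\|_2/\sigma)$. Drawing $\hat\theta \sim \mathcal{N}(\theta, \sigma^2 I/n)$, the function name, and the parameter values are illustrative assumptions.

```python
import math
import random


def normal_cdf(z: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_excess_risk(theta, sigma: float, n: int, trials: int, seed: int = 0):
    """Average excess risk of x -> 1{theta_hat . x > 0} over draws of theta_hat.

    Assumes theta_hat ~ N(theta, sigma^2 I / n). Returns (excess risk, Bayes error).
    """
    rng = random.Random(seed)
    coord_std = sigma / math.sqrt(n)
    theta_norm = math.sqrt(sum(t * t for t in theta))
    bayes_error = normal_cdf(-theta_norm / sigma)
    total = 0.0
    for _ in range(trials):
        theta_hat = [t + rng.gauss(0.0, coord_std) for t in theta]
        hat_norm = math.sqrt(sum(t * t for t in theta_hat))
        inner = sum(a * b for a, b in zip(theta_hat, theta))
        # Exact misclassification probability of the linear rule for this theta_hat.
        error = normal_cdf(-inner / (sigma * hat_norm))
        total += error - bayes_error
    return total / trials, bayes_error


theta = [1.0, 0.0, 0.0, 0.0, 0.0]  # illustrative choice with d = 5
n = 50
excess, bayes = expected_excess_risk(theta, sigma=1.0, n=n, trials=5000)
print("Bayes error:", round(bayes, 4))
print("simulated excess risk:", round(excess, 4))
print("sqrt(d/n):", round(math.sqrt(len(theta) / n), 4))
```

In runs of this kind the simulated excess risk is typically far below $\sqrt{d/n}$, which is consistent with the bound being an upper bound obtained through several loose steps (Pinsker and Jensen) rather than a sharp rate.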

We now prove the lemma we used.

Lemma 2.

$$\mathbb{E}[\mathbf{1}\{f(X) \neq Y\}] - \mathbb{E}[\mathbf{1}\{f^*(X) \neq Y\}] \le \frac{1}{2}\int |\rho_0(x) - \tilde\rho_0(x)|\,dx + \frac{1}{2}\int |\rho_1(x) - \tilde\rho_1(x)|\,dx.$$

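Proof. For any classifier $f$ we have

$$\begin{aligned}
\mathbb{E}[\mathbf{1}\{f(X) \neq Y\}] &= \frac{1}{2}\int \mathbf{1}\{f(x) = 1\}\rho_0(x)\,dx + \frac{1}{2}\int \mathbf{1}\{f(x) = 0\}\rho_1(x)\,dx \\
&= \frac{1}{2}\int \mathbf{1}\{f(x) = 1\}\rho_0(x)\,dx + \frac{1}{2}\int (1 - \mathbf{1}\{f(x) = 1\})\rho_1(x)\,dx \\
&= \frac{1}{2} + \frac{1}{2}\int \mathbf{1}\{f(x) = 1\}\rho_0(x)\,dx - \frac{1}{2}\int \mathbf{1}\{f(x) = 1\}\rho_1(x)\,dx \\
&= \frac{1}{2} + \frac{1}{2}\int \mathbf{1}\{f(x) = 1\}(\rho_0(x) - \rho_1(x))\,dx.
\end{aligned}$$
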
Therefore, for the classifiers $f$ and $f^*$ we have that

$$\mathbb{E}[\mathbf{1}\{f(X) \neq Y\}] - \mathbb{E}[\mathbf{1}\{f^*(X) \neq Y\}] = \frac{1}{2}\int (\mathbf{1}\{f(x) = 1\} - \mathbf{1}\{f^*(x) = 1\})(\rho_0(x) - \rho_1(x))\,dx.$$

Now we prove that, for every $x$,

$$(\mathbf{1}\{f(x) = 1\} - \mathbf{1}\{f^*(x) = 1\})(\rho_0(x) - \rho_1(x)) \le |\rho_0(x) - \tilde\rho_0(x)| + |\rho_1(x) - \tilde\rho_1(x)|. \tag{3}$$

• If $\mathbf{1}\{f(x) = 1\} - \mathbf{1}\{f^*(x) = 1\} = 0$, then the LHS of (3) is 0 and the inequality holds because the RHS is nonnegative.

• If $\mathbf{1}\{f(x) = 1\} - \mathbf{1}\{f^*(x) = 1\} = 1$, then $f(x) = 1$ and $f^*(x) = 0$. This implies that $\rho_1(x) \le \rho_0(x)$ and $\tilde\rho_0(x) < \tilde\rho_1(x)$. Therefore, the LHS of inequality (3) is

$$\rho_0(x) - \rho_1(x) = \rho_0(x) - \tilde\rho_0(x) + \tilde\rho_0(x) - \tilde\rho_1(x) + \tilde\rho_1(x) - \rho_1(x) \le |\rho_0(x) - \tilde\rho_0(x)| + |\tilde\rho_1(x) - \rho_1(x)|,$$

because $\tilde\rho_0(x) - \tilde\rho_1(x) \le 0$.

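• If $\mathbf{1}\{f(x) = 1\} - \mathbf{1}\{f^*(x) = 1\} = -1$, then $f(x) = 0$ and $f^*(x) = 1$, so $\rho_0(x) < \rho_1(x)$ and $\tilde\rho_1(x) \le \tilde\rho_0(x)$. The same argument with the roles of the two classes exchanged gives

$$\rho_1(x) - \rho_0(x) = \rho_1(x) - \tilde\rho_1(x) + \tilde\rho_1(x) - \tilde\rho_0(x) + \tilde\rho_0(x) - \rho_0(x) \le |\rho_1(x) - \tilde\rho_1(x)| + |\tilde\rho_0(x) - \rho_0(x)|,$$

because $\tilde\rho_1(x) - \tilde\rho_0(x) \le 0$.

Integrating (3) over $x$ and multiplying by $1/2$ proves the lemma.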
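The pointwise inequality (3) can also be checked numerically. The sketch below does this on a one-dimensional grid, taking $\rho_0, \rho_1$ to be the densities of $\mathcal{N}(-\theta, \sigma^2)$ and $\mathcal{N}(\theta, \sigma^2)$ and $\tilde\rho_0, \tilde\rho_1$ the same densities with $\hat\theta$ in place of $\theta$. The values of `theta`, `theta_hat`, and `sigma` are illustrative assumptions; `theta_hat` is deliberately given the wrong sign so that $f$ and $f^*$ disagree at every $x \neq 0$ and both nontrivial cases of the proof are exercised.

```python
import math


def normal_pdf(x: float, mean: float, sigma: float) -> float:
    """Density of N(mean, sigma^2) at x."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def worst_violation(theta: float, theta_hat: float, sigma: float, grid) -> float:
    """Largest value of LHS - RHS of inequality (3) over the grid (should be <= 0)."""
    worst = -math.inf
    for x in grid:
        rho0, rho1 = normal_pdf(x, -theta, sigma), normal_pdf(x, theta, sigma)
        est0, est1 = normal_pdf(x, -theta_hat, sigma), normal_pdf(x, theta_hat, sigma)
        f = 1 if est1 > est0 else 0        # plug-in classifier
        f_star = 1 if rho1 > rho0 else 0   # Bayes optimal classifier
        lhs = (f - f_star) * (rho0 - rho1)
        rhs = abs(rho0 - est0) + abs(rho1 - est1)
        worst = max(worst, lhs - rhs)
    return worst


grid = [i * 0.01 for i in range(-500, 501)]
print("max of LHS - RHS:", worst_violation(theta=0.3, theta_hat=-0.2, sigma=1.0, grid=grid))
```

A nonpositive maximum (up to floating-point rounding) is what inequality (3) predicts; a grid check like this is only a spot check and does not replace the case analysis above.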