Discriminant Analysis
Using Bayes' Theorem for Classification

\[
P(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{i=1}^{K} \pi_i f_i(x)}. \tag{1}
\]
I Let p_k(X) = Pr(Y = k | X).
I Instead of directly computing p_k(X), we can simply plug in estimates of π_k and f_k(X) into (1).
I Estimating π_k is easy if we have a random sample of Y's from the population.
I We simply compute the fraction of the training observations that belong to the kth class (see the sketch below).
I However, estimating f_k(X) tends to be more challenging, unless we assume some simple forms for these densities.
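As a concrete illustration, here is a minimal Python sketch of this prior estimation; the label vector y and its values are hypothetical:

```python
import numpy as np

# Hypothetical training labels; pi_hat[k] is the fraction of observations in class k.
y = np.array([0, 0, 1, 2, 1, 0, 2, 2, 1, 0])

classes, counts = np.unique(y, return_counts=True)
pi_hat = counts / y.size  # estimate of pi_k: fraction of training data in class k

for k, p in zip(classes, pi_hat):
    print(f"pi_hat[{k}] = {p:.2f}")
```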
Bayes Classifier
Assume that f_k(x) is a normal density with class-specific mean µ_k and common variance σ²:

\[
f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2} (x - \mu_k)^2 \right\} \tag{2}
\]

Plugging (2) into (1) gives

\[
p_k(x) = \frac{\pi_k \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2} (x - \mu_k)^2 \right\}}{\sum_{i=1}^{K} \pi_i \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2} (x - \mu_i)^2 \right\}}. \tag{3}
\]
LDA (continued)
Taking the log of (3) and discarding terms that do not depend on k, assigning x to the class with the largest p_k(x) is equivalent to assigning it to the class for which

\[
\delta_k(x) = x\,\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k) \tag{4}
\]

is largest (Exercise).
LDA (continued)
\[
\hat{\mu}_k = \frac{1}{n_k} \sum_{i : y_i = k} x_i \tag{5}
\]

\[
\hat{\sigma}^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i : y_i = k} (x_i - \hat{\mu}_k)^2, \tag{6}
\]

where n is the total number of training observations and n_k is the number of training observations in the kth class.
The LDA classifier plugs the estimates given in (5) and (6) into (7),
and assigns an observation X = x to the class for which

\[
\hat{\delta}_k(x) = x\,\frac{\hat{\mu}_k}{\hat{\sigma}^2} - \frac{\hat{\mu}_k^2}{2\hat{\sigma}^2} + \log(\hat{\pi}_k) \tag{7}
\]

is largest.
The word linear in the classifier's name stems from the fact that
the discriminant functions δ_k(x) in (7) are linear functions of x (as
opposed to a more complex function of x).
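To make the plug-in classifier concrete, here is a minimal Python sketch implementing (5)-(7); the one-dimensional toy data are hypothetical:

```python
import numpy as np

# Hypothetical one-dimensional training data with two classes.
x = np.array([1.2, 0.8, 1.1, 3.9, 4.2, 4.0])
y = np.array([0, 0, 0, 1, 1, 1])

classes = np.unique(y)
n, K = x.size, classes.size

pi_hat = np.array([np.mean(y == k) for k in classes])   # class fractions
mu_hat = np.array([x[y == k].mean() for k in classes])  # class means, as in (5)
sigma2_hat = sum(((x[y == k] - mu_hat[j]) ** 2).sum()   # pooled variance, as in (6)
                 for j, k in enumerate(classes)) / (n - K)

def delta_hat(x0):
    # Linear discriminant scores from (7), one per class.
    return x0 * mu_hat / sigma2_hat - mu_hat ** 2 / (2 * sigma2_hat) + np.log(pi_hat)

print("predicted class:", classes[np.argmax(delta_hat(2.0))])
```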
LDA classifier (p > 1)
We can see that using linear regression (Figure 2) to predict class
membership can lead to a masking effect, in which one or more classes
are hidden (masked) by the others. The iris data plotted below
illustrate the setting.
[Figure: scatter plot of the iris data, Petal Width (x-axis) vs. Petal Length (y-axis), with the classes Setosa, Versicolor, and Virginica.]
Iris data with LDA
We instead consider LDA, which has the following assumptions:
For each person, conditional on being in class k, we assume

\[
X \mid G = k \sim N_p(\mu_k, \Sigma).
\]

That is,

\[
f_k(X) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (X - \mu_k)' \Sigma^{-1} (X - \mu_k) \right\}.
\]

LDA assumes Σ_k = Σ for all k (just as was true in the p = 1 setting).
In practice the parameters of the Gaussian distribution are unknown
and must be estimated by:
I π̂_k = N_k / N, where N_k is the number of people of class k
I µ̂_k = Σ_{i : g_i = k} x_i / N_k
I Σ̂ = Σ_{k=1}^{K} Σ_{i : g_i = k} (x_i − µ̂_k)(x_i − µ̂_k)′ / (N − K) (see the numpy sketch below)
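A minimal numpy sketch of these three estimators; the data matrix X (rows are observations) and the label vector g are hypothetical:

```python
import numpy as np

# Hypothetical training data: N = 30 observations in p = 2 dimensions, K = 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
g = rng.integers(0, 3, size=30)

classes = np.unique(g)
N, p = X.shape
K = classes.size

pi_hat = np.array([np.mean(g == k) for k in classes])         # N_k / N
mu_hat = np.array([X[g == k].mean(axis=0) for k in classes])  # per-class mean vectors

Sigma_hat = np.zeros((p, p))  # pooled within-class covariance, divided by N - K
for j, k in enumerate(classes):
    D = X[g == k] - mu_hat[j]
    Sigma_hat += D.T @ D
Sigma_hat /= N - K
```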
LDA Computation
\[
P(G = k \mid X = x) = \frac{P(G = k,\, X = x)}{P(X = x)}
\]
Then

\begin{align*}
\log \frac{P(G = k_1 \mid X = x)}{P(G = k_2 \mid X = x)}
&= \log \frac{f_{k_1}(x)\, \pi_{k_1}}{f_{k_2}(x)\, \pi_{k_2}} \\
&= -\frac{1}{2} (x - \mu_{k_1})' \Sigma^{-1} (x - \mu_{k_1})
  + \frac{1}{2} (x - \mu_{k_2})' \Sigma^{-1} (x - \mu_{k_2})
  + \log \frac{\pi_{k_1}}{\pi_{k_2}} \\
&= (\mu_{k_1} - \mu_{k_2})' \Sigma^{-1} x
  - \frac{1}{2} \mu_{k_1}' \Sigma^{-1} \mu_{k_1}
  + \frac{1}{2} \mu_{k_2}' \Sigma^{-1} \mu_{k_2}
  + \log \frac{\pi_{k_1}}{\pi_{k_2}}
\end{align*}
Boundary Lines for LDA
Setting the log-odds between classes k_1 and k_2 to zero gives the boundary

\[
(\mu_{k_1} - \mu_{k_2})' \Sigma^{-1} x
  - \frac{1}{2} \mu_{k_1}' \Sigma^{-1} \mu_{k_1}
  + \frac{1}{2} \mu_{k_2}' \Sigma^{-1} \mu_{k_2}
  + \log \frac{\pi_{k_1}}{\pi_{k_2}} = 0,
\]

which is linear in x.
Boundary Lines for LDA
I The boundary will be a line for two-dimensional problems.
I The boundary will be a plane for three-dimensional problems, and a hyperplane in general.
The linear log-odds function implies that the decision boundary between classes k_1 and k_2 is the set where

\[
P(G = k_1 \mid X = x) = P(G = k_2 \mid X = x),
\]

which is linear in x. In p dimensions, this is a hyperplane.
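As a sketch, the coefficients of that hyperplane follow directly from the parameters; mu1, mu2, Sigma, pi1, and pi2 below are hypothetical stand-ins for estimated quantities:

```python
import numpy as np

# Hypothetical (estimated) parameters for classes k1 and k2.
mu1, mu2 = np.array([1.0, 2.0]), np.array([3.0, 0.5])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
pi1, pi2 = 0.4, 0.6

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)            # normal vector of the boundary hyperplane
b = (-0.5 * mu1 @ Sigma_inv @ mu1      # intercept, from the log-odds expression
     + 0.5 * mu2 @ Sigma_inv @ mu2
     + np.log(pi1 / pi2))

x = np.array([2.0, 1.0])
print("k1" if w @ x + b > 0 else "k2")  # the sign of the log-odds decides the class
```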
We can then say that class k_1 is more likely than class k_2 if

\begin{align*}
& P(G = k_1 \mid X = x) > P(G = k_2 \mid X = x) \\
\implies \; & \log \frac{P(G = k_1 \mid X = x)}{P(G = k_2 \mid X = x)} > 0 \\
\implies \; & (\mu_{k_1} - \mu_{k_2})' \Sigma^{-1} x
  - \frac{1}{2} \mu_{k_1}' \Sigma^{-1} \mu_{k_1}
  + \frac{1}{2} \mu_{k_2}' \Sigma^{-1} \mu_{k_2}
  + \log \frac{\pi_{k_1}}{\pi_{k_2}} > 0.
\end{align*}
Quadratic Discriminant Function
If we do not assume a common covariance matrix (i.e., each class has its own Σ_k), the discriminant function is quadratic in X:

\[
\delta_k^Q(X) = -\frac{1}{2} \log |\Sigma_k| - \frac{1}{2} (X - \mu_k)' \Sigma_k^{-1} (X - \mu_k) + \log(\pi_k).
\]
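A minimal sketch of evaluating this score; the per-class parameters passed in are hypothetical:

```python
import numpy as np

def delta_q(x, mu_k, Sigma_k, pi_k):
    # Quadratic discriminant score delta_k^Q(x) for one class.
    d = x - mu_k
    _, logdet = np.linalg.slogdet(Sigma_k)  # numerically stable log|Sigma_k|
    return -0.5 * logdet - 0.5 * d @ np.linalg.inv(Sigma_k) @ d + np.log(pi_k)
```

An observation is then assigned to the class whose score δ_k^Q(x) is largest.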
[Figure: two scatter plots of the iris data, Petal Width (x-axis) vs. Petal Length (y-axis), with the classes Setosa, Versicolor, and Virginica.]
We choose a to maximize

\[
\max_a \frac{a' B a}{a' W a} \quad \text{(exercise)},
\]

where B is the between-class covariance matrix and W is the within-class covariance matrix. The solution is the eigenvector of W^{-1} B corresponding to its largest eigenvalue.
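A minimal numpy sketch of this eigenproblem; the B and W matrices below are hypothetical:

```python
import numpy as np

# Hypothetical between-class (B) and within-class (W) covariance matrices.
B = np.array([[2.0, 0.5], [0.5, 1.0]])
W = np.array([[1.5, 0.2], [0.2, 1.0]])

eigvals, eigvecs = np.linalg.eig(np.linalg.inv(W) @ B)
a = eigvecs[:, np.argmax(eigvals.real)].real  # direction maximizing a'Ba / a'Wa
print("a =", a)
```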