Data Reduction
Linear Discriminant Analysis (LDA)
Aly A. Farag
Shireen Y. Elhabian
CVIP Lab
University of Louisville
www.cvip.uofl.edu
October 2, 2008
Outline
LDA objective
Recall PCA
Now LDA
LDA Two Classes
Counter example
LDA C Classes
Illustrative Example
Recall PCA
In PCA, we did not care whether the dataset represents features from one or more classes, i.e. the discrimination power was not taken into consideration.
Eigenvalues and eigenvectors were computed for the data covariance matrix S_x. The new basis vectors are the eigenvectors with the highest eigenvalues, where the number of those vectors was our choice.
Thus, using the new basis, we can project the dataset onto a lower-dimensional space with a more powerful data representation.
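For reference, a minimal NumPy sketch of this PCA projection (an illustration, not part of the original slides), assuming a data matrix X with one sample per column, as used for LDA below:

```python
import numpy as np

def pca_project(X, k):
    """Project the columns of X (m x N, one sample per column)
    onto the k eigenvectors of the covariance matrix Sx with the
    largest eigenvalues."""
    mu = X.mean(axis=1, keepdims=True)      # data mean
    Xc = X - mu                             # center the data
    Sx = Xc @ Xc.T / (X.shape[1] - 1)       # covariance matrix Sx
    eigvals, eigvecs = np.linalg.eigh(Sx)   # Sx is symmetric
    order = np.argsort(eigvals)[::-1]       # sort by decreasing eigenvalue
    W = eigvecs[:, order[:k]]               # new basis: top-k eigenvectors
    return W.T @ Xc                         # k x N projected data
```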
Now LDA
Consider a pattern classification problem where we have C classes, e.g. sea bass, tuna, salmon.
Each class has N_i m-dimensional samples, where i = 1, 2, …, C. Hence we have a set of m-dimensional samples {x_1, x_2, …, x_{N_i}} belonging to class ω_i.
Stacking these samples from the different classes gives one big fat matrix X such that each column represents one sample.
We seek to obtain a transformation of X to Y through projecting the samples in X onto a hyperplane with dimension C−1.
Let's see what this means.
LDA Two Classes
Assume we have a set of m-dimensional samples {x_1, x_2, …, x_N}, N_1 of which belong to class ω_1 and N_2 of which belong to class ω_2.
We seek a scalar y obtained by projecting each sample x onto a line:
y = w^T x, \quad \text{where } x = \begin{pmatrix} x_1 \\ \vdots \\ x_m \end{pmatrix} \text{ and } w = \begin{pmatrix} w_1 \\ \vdots \\ w_m \end{pmatrix}
[Figure: the two classes are not well separated when projected onto this line.]
The mean of the projected samples of class ω_i is
\tilde{\mu}_i = \frac{1}{N_i}\sum_{y \in \omega_i} y = \frac{1}{N_i}\sum_{x \in \omega_i} w^T x = w^T \mu_i
where μ_i is the mean of the original samples of class ω_i.
\tilde{s}_i^2 measures the variability within class ω_i after projecting it onto the y-space.
Thus \tilde{s}_1^2 + \tilde{s}_2^2 measures the variability within the two classes at hand after projection; hence it is called the within-class scatter of the projected samples.
LDA Two Classes
The Fisher linear discriminant is defined as the linear function w^T x that maximizes the criterion function:
J(w) = \frac{|\tilde{\mu}_1 - \tilde{\mu}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}
In order to express J(w) as an explicit function of w, we define the scatter matrices in the original feature space:
S_i = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T
S_W = S_1 + S_2
where S_i is the covariance matrix of class ω_i, and S_W is called the within-class scatter matrix.
LDA Two Classes
The scatter of the projection y can then be expressed in terms of the scatter matrices in the feature space x:
\tilde{s}_i^2 = \sum_{y \in \omega_i} (y - \tilde{\mu}_i)^2 = \sum_{x \in \omega_i} (w^T x - w^T \mu_i)^2 = \sum_{x \in \omega_i} w^T (x - \mu_i)(x - \mu_i)^T w = w^T S_i w
\tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_1 w + w^T S_2 w = w^T (S_1 + S_2) w = w^T S_W w = \tilde{S}_W
Similarly, the separation of the projected means can be written as
(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = (w^T \mu_1 - w^T \mu_2)^2 = w^T \underbrace{(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T}_{S_B} w = w^T S_B w = \tilde{S}_B
Since SB is the outer product of two vectors, its rank is at most one.
LDA Two Classes
We can finally express the Fisher criterion in terms of
SW and SB as:
J(w) = \frac{|\tilde{\mu}_1 - \tilde{\mu}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2} = \frac{w^T S_B w}{w^T S_W w}
To find the maximum of J(w), we differentiate with respect to w and equate to zero:
(w^T S_W w)\,\frac{d}{dw}\!\left(w^T S_B w\right) - (w^T S_B w)\,\frac{d}{dw}\!\left(w^T S_W w\right) = 0
(w^T S_W w)\, 2 S_B w - (w^T S_B w)\, 2 S_W w = 0
Dividing by 2 w^T S_W w:
\frac{w^T S_W w}{w^T S_W w}\, S_B w - \frac{w^T S_B w}{w^T S_W w}\, S_W w = 0
S_B w - J(w)\, S_W w = 0
S_W^{-1} S_B w - J(w)\, w = 0
LDA Two Classes
Solving the generalized eigenvalue problem
S_W^{-1} S_B w = J(w)\, w = \lambda w
yields the projection direction. This is known as Fisher's linear discriminant, although it is not a discriminant but rather a specific choice of direction for the projection of the data down to one dimension.
Using the same notation as PCA, the solution will be the eigenvector(s) of S_X = S_W^{-1} S_B.
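As an illustration (not part of the slides), a minimal NumPy sketch of this two-class solution; X1 and X2 are assumed to hold the samples of ω_1 and ω_2, one sample per column:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Return the Fisher projection vector w* for two classes.
    X1, X2: m x Ni arrays, one sample per column."""
    mu1 = X1.mean(axis=1)
    mu2 = X2.mean(axis=1)
    # class scatter matrices S1, S2 and within-class scatter SW = S1 + S2
    S1 = (X1 - mu1[:, None]) @ (X1 - mu1[:, None]).T
    S2 = (X2 - mu2[:, None]) @ (X2 - mu2[:, None]).T
    SW = S1 + S2
    # SB has rank one, so the leading eigenvector of inv(SW) @ SB
    # is proportional to inv(SW) @ (mu1 - mu2)
    w = np.linalg.solve(SW, mu1 - mu2)
    return w / np.linalg.norm(w)            # unit-length direction
```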
LDA Two Classes - Example
Compute the Linear Discriminant projection for the following two-
dimensional dataset.
Samples for class ω_1: X_1 = (x_1, x_2) = {(4,2), (2,4), (2,3), (3,6), (4,4)}
Samples for class ω_2: X_2 = (x_1, x_2) = {(9,10), (6,8), (9,5), (8,7), (10,8)}
[Figure: scatter plot of the two classes in the (x_1, x_2) plane.]
LDA Two Classes - Example
The class means are:
\mu_1 = \frac{1}{N_1}\sum_{x \in \omega_1} x = \frac{1}{5}\left[\begin{pmatrix}4\\2\end{pmatrix} + \begin{pmatrix}2\\4\end{pmatrix} + \begin{pmatrix}2\\3\end{pmatrix} + \begin{pmatrix}3\\6\end{pmatrix} + \begin{pmatrix}4\\4\end{pmatrix}\right] = \begin{pmatrix}3\\3.8\end{pmatrix}
\mu_2 = \frac{1}{N_2}\sum_{x \in \omega_2} x = \frac{1}{5}\left[\begin{pmatrix}9\\10\end{pmatrix} + \begin{pmatrix}6\\8\end{pmatrix} + \begin{pmatrix}9\\5\end{pmatrix} + \begin{pmatrix}8\\7\end{pmatrix} + \begin{pmatrix}10\\8\end{pmatrix}\right] = \begin{pmatrix}8.4\\7.6\end{pmatrix}
LDA Two Classes - Example
Covariance matrix of the first class:
S_1 = \frac{1}{N_1 - 1}\sum_{x \in \omega_1} (x - \mu_1)(x - \mu_1)^T = \begin{pmatrix}1 & -0.25\\ -0.25 & 2.2\end{pmatrix}
LDA Two Classes - Example
Covariance matrix of the second class:
S_2 = \frac{1}{N_2 - 1}\sum_{x \in \omega_2} (x - \mu_2)(x - \mu_2)^T = \begin{pmatrix}2.3 & -0.05\\ -0.05 & 3.3\end{pmatrix}
LDA Two Classes - Example
Within-class scatter matrix:
S_W = S_1 + S_2 = \begin{pmatrix}3.3 & -0.3\\ -0.3 & 5.5\end{pmatrix}
Between-class scatter matrix:
S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T = \begin{pmatrix}3 - 8.4\\ 3.8 - 7.6\end{pmatrix}\begin{pmatrix}3 - 8.4\\ 3.8 - 7.6\end{pmatrix}^{T} = \begin{pmatrix}-5.4\\ -3.8\end{pmatrix}\begin{pmatrix}-5.4 & -3.8\end{pmatrix} = \begin{pmatrix}29.16 & 20.52\\ 20.52 & 14.44\end{pmatrix}
LDA Two Classes - Example
The LDA projection is then obtained as the solution of the generalized eigenvalue problem:
S_W^{-1} S_B w = \lambda w \quad\Rightarrow\quad \left| S_W^{-1} S_B - \lambda I \right| = 0
\left| \begin{pmatrix}3.3 & -0.3\\ -0.3 & 5.5\end{pmatrix}^{-1} \begin{pmatrix}29.16 & 20.52\\ 20.52 & 14.44\end{pmatrix} - \lambda \begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix} \right| = 0
\left| \begin{pmatrix}0.3045 & 0.0166\\ 0.0166 & 0.1827\end{pmatrix} \begin{pmatrix}29.16 & 20.52\\ 20.52 & 14.44\end{pmatrix} - \lambda I \right| = 0
\begin{vmatrix}9.2213 - \lambda & 6.489\\ 4.2339 & 2.9794 - \lambda\end{vmatrix} = (9.2213 - \lambda)(2.9794 - \lambda) - 6.489 \times 4.2339 = 0
\lambda^2 - 12.2007\,\lambda = 0 \quad\Rightarrow\quad \lambda(\lambda - 12.2007) = 0
\lambda_1 = 0, \qquad \lambda_2 = 12.2007
LDA Two Classes - Example
Hence
\begin{pmatrix}9.2213 & 6.489\\ 4.2339 & 2.9794\end{pmatrix} w_1 = \lambda_1 w_1 = 0 \cdot w_1
and
\begin{pmatrix}9.2213 & 6.489\\ 4.2339 & 2.9794\end{pmatrix} w_2 = \lambda_2 w_2 = 12.2007\, w_2
Thus,
w_1 = \begin{pmatrix}0.5755\\ -0.8178\end{pmatrix} \quad\text{and}\quad w_2 = \begin{pmatrix}0.9088\\ 0.4173\end{pmatrix} = w^*
The optimal projection is the one that gives the maximum \lambda = J(w).
LDA Two Classes - Example
Or directly:
w^* = S_W^{-1}(\mu_1 - \mu_2) = \begin{pmatrix}3.3 & -0.3\\ -0.3 & 5.5\end{pmatrix}^{-1} \begin{pmatrix}3 - 8.4\\ 3.8 - 7.6\end{pmatrix} \approx \begin{pmatrix}-1.71\\ -0.78\end{pmatrix}
which is the same direction as w_2 = (0.9088, 0.4173)^T above, up to scale and sign.
LDA - Projection
Classes PDF: using the LDA projection vector with the lower eigenvalue (λ_1 = 0).
[Figure: class-conditional densities p(y|ω_i) of the projected samples and the corresponding projection direction in the (x_1, x_2) plane.]
Using this vector leads to bad separability between the two classes.
LDA - Projection
Classes PDF: using the LDA projection vector with the highest eigenvalue (λ_2 = 12.2007).
[Figure: class-conditional densities p(y|ω_i) of the projected samples and the corresponding projection direction in the (x_1, x_2) plane.]
Using this vector leads to good separability between the two classes.
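A quick NumPy check of the numbers in this example (a sketch, not from the slides; it uses the class samples listed above and, like the slides, normalizes the class covariance matrices by N_i − 1):

```python
import numpy as np

X1 = np.array([[4, 2, 2, 3, 4],
               [2, 4, 3, 6, 4]], dtype=float)   # class omega_1, one sample per column
X2 = np.array([[9, 6, 9, 8, 10],
               [10, 8, 5, 7, 8]], dtype=float)  # class omega_2

mu1, mu2 = X1.mean(axis=1), X2.mean(axis=1)     # [3, 3.8] and [8.4, 7.6]
S1 = np.cov(X1)                                 # [[1, -0.25], [-0.25, 2.2]]
S2 = np.cov(X2)                                 # [[2.3, -0.05], [-0.05, 3.3]]
SW = S1 + S2                                    # [[3.3, -0.3], [-0.3, 5.5]]
SB = np.outer(mu1 - mu2, mu1 - mu2)             # [[29.16, 20.52], [20.52, 14.44]]

vals, vecs = np.linalg.eig(np.linalg.inv(SW) @ SB)
print(vals)                                     # ~12.2007 and ~0 (order may vary)
print(vecs[:, np.argmax(vals)])                 # ~ +/- [0.9088, 0.4173]
print(np.linalg.solve(SW, mu1 - mu2))           # proportional to w* (up to sign/scale)
```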
LDA C-Classes
Now we have C classes instead of just two.
We are now seeking (C−1) projections [y_1, y_2, …, y_{C−1}] by means of (C−1) projection vectors w_i.
The w_i can be arranged by columns into a projection matrix W = [w_1 | w_2 | … | w_{C−1}] such that:
y_i = w_i^T x \quad\Rightarrow\quad y = W^T x
where
x_{m \times 1} = \begin{pmatrix}x_1\\ \vdots\\ x_m\end{pmatrix}, \quad y_{(C-1) \times 1} = \begin{pmatrix}y_1\\ \vdots\\ y_{C-1}\end{pmatrix}, \quad W_{m \times (C-1)} = [\,w_1 \,|\, w_2 \,|\, \dots \,|\, w_{C-1}\,]
LDA C-Classes
If we have n feature vectors, we can stack them into one matrix as follows:
Y = W^T X
where each column of X is one sample x_j and each column of Y is the corresponding projection y_j = W^T x_j.
LDA C-Classes
[Figure: example of two-dimensional features (m = 2) with three classes (C = 3), showing the class scatters S_{w1}, S_{w2}, … in the (x_1, x_2) plane.]
Recall that in the two-classes case the between-class scatter was computed as:
S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T
For C classes, the within-class scatter becomes S_W = \sum_{i=1}^{C} S_i, and the between-class scatter is measured with respect to the overall mean:
S_B = \sum_{i=1}^{C} N_i (\mu_i - \mu)(\mu_i - \mu)^T
where \mu = \frac{1}{N}\sum_{x} x = \frac{1}{N}\sum_{i} N_i \mu_i \quad\text{and}\quad \mu_i = \frac{1}{N_i}\sum_{x \in \omega_i} x
N: number of all data samples; N_i: number of data samples in class ω_i.
LDA C-Classes
Similarly, we can define the mean vectors for the projected samples y as:
\tilde{\mu}_i = \frac{1}{N_i}\sum_{y \in \omega_i} y \quad\text{and}\quad \tilde{\mu} = \frac{1}{N}\sum_{y} y
while the scatter matrices for the projected samples y will be:
\tilde{S}_W = \sum_{i=1}^{C} \tilde{S}_i = \sum_{i=1}^{C} \sum_{y \in \omega_i} (y - \tilde{\mu}_i)(y - \tilde{\mu}_i)^T
\tilde{S}_B = \sum_{i=1}^{C} N_i (\tilde{\mu}_i - \tilde{\mu})(\tilde{\mu}_i - \tilde{\mu})^T
LDA C-Classes
Recall that in the two-classes case we expressed the scatter matrices of the projected samples in terms of those of the original samples as:
\tilde{S}_W = W^T S_W W
\tilde{S}_B = W^T S_B W
This still holds in the C-classes case.
Recall that we are looking for a projection that maximizes the ratio of between-class to within-class scatter. Since the projection is no longer a scalar (it has C−1 dimensions), we use the determinants of the scatter matrices to obtain a scalar objective function:
J(W) = \frac{|\tilde{S}_B|}{|\tilde{S}_W|} = \frac{|W^T S_B W|}{|W^T S_W W|}
And we will seek the projection W* that maximizes this ratio.
LDA C-Classes
To find the maximum of J(W), we differentiate with respect to W and equate to zero. As in the two-classes case, this leads to a generalized eigenvalue problem: the columns of the optimal W* are the eigenvectors of S_W^{-1} S_B corresponding to the largest eigenvalues.
Illustrative Example
Consider a two-dimensional, three-class example (m = 2, C = 3) with the following class means and covariance matrices:
\mu_1 = \begin{pmatrix}3\\7\end{pmatrix}, \quad \mu_2 = \begin{pmatrix}2.5\\3.5\end{pmatrix}, \quad \mu_3 = \begin{pmatrix}7\\5\end{pmatrix}
S_1 = \begin{pmatrix}5 & -1\\ -1 & 3\end{pmatrix}: negative covariance, so the class-1 samples are distributed along the y = −x direction.
S_2 = \begin{pmatrix}4 & 0\\ 0 & 4\end{pmatrix}: zero covariance, so the two features of the class-2 samples are uncorrelated.
S_3 = \begin{pmatrix}3.5 & 1\\ 1 & 2.5\end{pmatrix}: positive covariance, so the class-3 samples are distributed along the y = x direction.
In Matlab (it's working):
[Figure: scatter plot of the generated samples of the three classes, X_1 (the first feature) vs. X_2 (the second feature).]
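A rough Python/NumPy analogue of that Matlab data generation (a sketch only; the class means and covariances are the reconstructed values above, and the 200-samples-per-class size is an arbitrary assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed class parameters (reconstructed from the slide; illustrative only)
means = [np.array([3.0, 7.0]),    # class 1
         np.array([2.5, 3.5]),    # class 2
         np.array([7.0, 5.0])]    # class 3
covs = [np.array([[5.0, -1.0], [-1.0, 3.0]]),   # negative covariance: along y = -x
        np.array([[4.0,  0.0], [ 0.0, 4.0]]),   # zero covariance
        np.array([[3.5,  1.0], [ 1.0, 2.5]])]   # positive covariance: along y = x
N_i = 200                                       # samples per class (arbitrary choice)

# One m x Ni matrix per class, samples stored as columns as in the slides
classes = [rng.multivariate_normal(mu, S, size=N_i).T for mu, S in zip(means, covs)]
```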
Computing LDA Projection Vectors
Recall:
S_W = \sum_{i=1}^{C} S_i, \quad\text{where}\quad S_i = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T \quad\text{and}\quad \mu_i = \frac{1}{N_i}\sum_{x \in \omega_i} x
S_B = \sum_{i=1}^{C} N_i (\mu_i - \mu)(\mu_i - \mu)^T, \quad\text{where}\quad \mu = \frac{1}{N}\sum_{x} x = \frac{1}{N}\sum_{i} N_i \mu_i
The LDA projection vectors are the eigenvectors of S_W^{-1} S_B with the largest eigenvalues.
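Putting the recalled formulas together, a minimal NumPy sketch of the C-class computation (an illustration, not the authors' code; it assumes each class is given as an m x N_i array of column samples, e.g. the classes list from the generation sketch above):

```python
import numpy as np

def lda_projections(classes, n_components=None):
    """Compute LDA projection vectors for a list of m x Ni class matrices.
    Returns W with at most C-1 columns, ordered by decreasing eigenvalue."""
    C = len(classes)
    m = classes[0].shape[0]
    N = sum(Xi.shape[1] for Xi in classes)
    mu = sum(Xi.sum(axis=1) for Xi in classes) / N       # overall mean
    SW = np.zeros((m, m))
    SB = np.zeros((m, m))
    for Xi in classes:
        Ni = Xi.shape[1]
        mui = Xi.mean(axis=1)
        Xc = Xi - mui[:, None]
        SW += Xc @ Xc.T                                  # within-class scatter
        SB += Ni * np.outer(mui - mu, mui - mu)          # between-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(SW) @ SB)
    order = np.argsort(eigvals.real)[::-1]               # decreasing eigenvalues
    k = n_components or (C - 1)                          # at most C-1 useful directions
    return eigvecs[:, order[:k]].real                    # projection matrix W

# A sample x is then projected as y = W.T @ x
```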
Let's visualize the projection vectors W:
[Figure: the two LDA projection vectors overlaid on the scatter plot of the three classes, X_1 (the first feature) vs. X_2 (the second feature).]
Projection y = W^T x
Along the first projection vector
Classes PDF: using the first projection vector, with eigenvalue = 4508.2089.
[Figure: class-conditional densities p(y|ω_i) of the samples projected onto the first projection vector.]
Projection y = W^T x
Along the second projection vector
Classes PDF: using the second projection vector, with eigenvalue = 1878.8511.
[Figure: class-conditional densities p(y|ω_i) of the samples projected onto the second projection vector.]
Which is Better?!!!
Apparently, the projection vector that has the highest eigenvalue provides higher discrimination power between the classes.
[Figure: side-by-side class PDFs p(y|ω_i) using the first projection vector (eigenvalue = 4508.2089) and the second projection vector (eigenvalue = 1878.8511).]
PCA vs LDA
Limitations of LDA
LDA produces at most C-1 feature projections
If the classification error estimates establish that more features are needed, some other method must be
employed to provide those additional features