
Chapter 4: Linear Methods for Classification

DD3364

March 23, 2012

Introduction

Focus on linear classification

Want to learn a predictor G : R^p -> G = {1, . . . , K}.

G divides the input space into regions labelled according to their classification.

The boundaries between these regions are termed the decision boundaries.

When these decision boundaries are linear we term the classification method linear.


An example where a linear decision boundary arises

Learn a discriminant function δ_k(x) for each class k and set

G(x) = arg max_k δ_k(x)

This generates a linear decision boundary when there is some monotone transformation g of δ_k(x) that is linear.

That is, g is a monotone function s.t.

g(δ_k(x)) = β_k0 + β_k^t x


Examples of discriminant functions

Example 1: Fit a linear regression model to the class indicator variables. Then the discriminant functions are

δ_k(x) = β_k0 + β_k^t x

Example 2: Use the posterior probabilities P(G = k | X = x) as the discriminant functions δ_k(x).

A popular model when there are two classes is:

P(G = 1 | X = x) = exp(β_0 + β^t x) / (1 + exp(β_0 + β^t x))

P(G = 2 | X = x) = 1 / (1 + exp(β_0 + β^t x))

g(p) = log(p/(1 - p)) can be applied as a monotone function to δ_1(x) = P(G = 1 | X = x) to make it linear.
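As a quick sanity check, the two-class posterior model above can be evaluated directly and its log-odds compared with the linear function β_0 + β^t x. The coefficients below are hypothetical, chosen only for illustration; a minimal sketch:

```python
import math

def posterior_class1(x, beta0, beta):
    """P(G=1 | X=x) under the two-class logit model from the slide."""
    eta = beta0 + sum(b * xj for b, xj in zip(beta, x))
    return math.exp(eta) / (1.0 + math.exp(eta))

def log_odds(x, beta0, beta):
    """g(p) = log(p / (1 - p)) applied to P(G=1 | X=x)."""
    p = posterior_class1(x, beta0, beta)
    return math.log(p / (1.0 - p))

# Hypothetical coefficients (not from the slides).
beta0, beta = -1.0, [2.0, -0.5]
x = [1.0, 2.0]

# The log-odds recovers exactly the linear function beta0 + beta^t x.
eta = beta0 + sum(b * xj for b, xj in zip(beta, x))
assert abs(log_odds(x, beta0, beta) - eta) < 1e-9
```

This is the sense in which the logit transformation makes the posterior-based discriminant linear in x.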


Can directly learn the linear decision boundary

For a two-class problem with p-dimensional inputs this amounts to modelling the decision boundary as a hyperplane.

This chapter looks at two methods which explicitly look for the separating hyperplane. These are

the Perceptron model and algorithm of Rosenblatt,

the SVM model and algorithm of Vapnik.

In the forms quoted, both these algorithms find separating hyperplanes if they exist and fail if the points are not linearly separable.

There are fixes for the non-separable case but we will not consider these today.


Linear decision boundaries can be made non-linear

Can expand the variable set X_1, X_2, . . . , X_p by including their squares and cross-products X_1², X_2², . . . , X_p², X_1 X_2, X_1 X_3, . . .

This adds p(p + 1)/2 additional variables.

Linear decision boundaries in the augmented space correspond to quadratic decision boundaries in the original space.
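The augmentation described above is mechanical: append all squares and pairwise products to the input vector. A small sketch (the helper name quadratic_expand is hypothetical):

```python
from itertools import combinations_with_replacement

def quadratic_expand(x):
    """Augment (x1,...,xp) with all squares and cross-products xi*xj, i <= j.
    This adds p(p+1)/2 variables, as noted on the slide."""
    pairs = combinations_with_replacement(range(len(x)), 2)
    return list(x) + [x[i] * x[j] for i, j in pairs]

p = 3
z = quadratic_expand([1.0, 2.0, 3.0])
# Original p coordinates plus p(p+1)/2 quadratic ones.
assert len(z) == p + p * (p + 1) // 2
```

A linear classifier fitted on z then induces a quadratic decision boundary in the original coordinates.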
[Figure 4.1 (ESL p. 103, §4.2 Linear Regression of an Indicator Matrix): two scatter plots of data from three classes, labelled 1, 2, 3. Caption: "The left plot shows some data from three classes, with linear ..."]


Linear Regression of an Indicator Matrix

Use linear regression to find discriminant functions

Have training data {(x_i, g_i)}_{i=1}^n where each x_i ∈ R^p and g_i ∈ {1, . . . , K}.

For each k construct a linear discriminant δ_k(x) via:

1. For i = 1, . . . , n set

   y_i = 1 if g_i = k, and y_i = 0 if g_i ≠ k

2. Compute (β̂_0k, β̂_k) = arg min_{β_0, β} Σ_{i=1}^n (y_i - β_0 - β^t x_i)²

Define

δ_k(x) = β̂_0k + β̂_k^t x

Classify a new point x with

G(x) = arg max_k δ_k(x)
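The two-step recipe above (build indicator responses, solve one least-squares problem per class, classify by the largest discriminant) can be sketched as follows; the function names and toy data are illustrative, not from the lecture:

```python
import numpy as np

def fit_indicator_regression(X, g, K):
    """Least-squares fit of one linear discriminant per class (steps 1-2).
    X: (n, p) inputs; g: length-n labels in {1,...,K}.
    Returns a (K, p+1) array whose row k-1 holds (beta_0k, beta_k)."""
    n = X.shape[0]
    Xa = np.hstack([np.ones((n, 1)), X])        # prepend intercept column
    Y = np.array([[1.0 if gi == k else 0.0 for k in range(1, K + 1)] for gi in g])
    B, *_ = np.linalg.lstsq(Xa, Y, rcond=None)  # one regression per indicator column
    return B.T

def classify(B, x):
    """G(x) = argmax_k delta_k(x), with labels in {1,...,K}."""
    deltas = B @ np.concatenate([[1.0], x])
    return int(np.argmax(deltas)) + 1

# Tiny hypothetical 2-class example, well separated along the coordinate.
X = np.array([[0.0], [0.2], [1.0], [1.2]])
g = [1, 1, 2, 2]
B = fit_indicator_regression(X, g, K=2)
assert classify(B, np.array([0.1])) == 1
assert classify(B, np.array([1.1])) == 2
```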


3 class example

Use linear regression of an indicator matrix to find the discriminant functions for the above 3 classes.

Construct K linear regression problems

For each k construct the response vectors from the class labels.
[Figure: three panels of 0/1 indicator responses plotted over the input domain, one per class.]

For each k fit a hyperplane that minimizes the RSS.
[Figure: the three fitted least-squares hyperplanes.]

Construct K discriminant functions

For each k construct the response vectors from the class labels.
[Figure: the three indicator-response panels, as above.]

The K discriminant functions defined by the least-squares hyperplanes.
[Figure: three panels showing δ_1(x), δ_2(x), δ_3(x).]

The decision boundary defined by these discriminant functions.

This approach will fail in this case

The training data from 3 classes.
[Figure: scatter plot of the three-class training data.]

The discriminant functions learnt via regression.
[Figure: three panels showing δ_1(x), δ_2(x), δ_3(x).]

The resulting decision boundary.
[Figure: the induced partition of the input space.]


In this last example masking has occurred.

This occurs because of the rigid nature of the linear discriminant functions.

This example is extreme, but for large K and small p such maskings occur naturally.

The other methods in this chapter are based on linear decision functions of x, but they are learnt in a smarter way...
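Masking can be reproduced on a tiny synthetic example (the data below are hypothetical, not the slides'): with three classes spread along a line, the middle class's fitted indicator function is flat by symmetry, so its discriminant never attains the maximum and the middle class is never predicted:

```python
import numpy as np

# Three 1-D classes with centres at 0, 5 and 10 (two points each).
X = np.array([-1.0, 1.0, 4.0, 6.0, 9.0, 11.0])
g = np.array([1, 2 - 1 + 1, 2, 2, 3, 3])  # labels 1,1 would be clearer; fixed below
g = np.array([1, 1, 2, 2, 3, 3])
Xa = np.column_stack([np.ones_like(X), X])

# One least-squares indicator regression per class.
betas = []
for k in (1, 2, 3):
    y = (g == k).astype(float)
    beta, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    betas.append(beta)

# Predict every training point by the largest fitted discriminant.
pred = [1 + int(np.argmax([b @ np.array([1.0, x]) for b in betas])) for x in X]
assert 2 not in pred   # the middle class is masked: never predicted
```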


Linear Discriminant Analysis

Optimal classification requires the posterior

To perform optimal classification we need to know P(G | X). Let
f_k(x) represent the class-conditional density P(X | G = k) and
π_k be the prior probability of class k, with Σ_{k=1}^K π_k = 1.

A simple application of Bayes' Rule gives

P(G = k | X = x) = f_k(x) π_k / Σ_{l=1}^K f_l(x) π_l

Therefore for classification, having the f_k(x) is almost equivalent to having P(G = k | X = x).
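Bayes' rule above turns class-conditional densities and priors into posteriors. A minimal sketch, with hypothetical 1-D Gaussian class-conditionals (everything here is illustrative):

```python
import math

def posterior(x, densities, priors):
    """P(G=k | X=x) = f_k(x) pi_k / sum_l f_l(x) pi_l (Bayes' rule).
    densities: list of callables f_k; priors: list of pi_k summing to 1."""
    num = [f(x) * p for f, p in zip(densities, priors)]
    total = sum(num)
    return [v / total for v in num]

def gauss(mu, sigma):
    """1-D Gaussian density, used as a hypothetical class-conditional."""
    return lambda x: math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Two classes with equal priors; at the midpoint the posteriors tie.
post = posterior(0.0, [gauss(-1.0, 1.0), gauss(1.0, 1.0)], [0.5, 0.5])
assert abs(sum(post) - 1.0) < 1e-12
assert abs(post[0] - post[1]) < 1e-12
```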


How to model the class densities

Many methods are based on specific models of f_k(x):

linear and quadratic discriminant functions use Gaussian distributions,

mixtures of Gaussian distributions produce non-linear decision boundaries,

non-parametric density estimates allow the most flexibility,

Naive Bayes, where f_k(X) = Π_{j=1}^p f_kj(X_j).


Multivariate Gaussian class densities

Model each f_k(x) as a multivariate Gaussian:

f_k(x) = 1 / ((2π)^(p/2) |Σ_k|^(1/2)) exp{-.5 (x - μ_k)^t Σ_k^(-1) (x - μ_k)}

Linear Discriminant Analysis (LDA) arises in the special case when Σ_k = Σ for all k, that is, each p(x | G = k) is Normally distributed with equal covariance matrices.

One gets linear decision boundaries.

[Figure: panels labelled class distributions, decision boundary, partition.]


LDA

Can see this as

log [ P(G = k | X = x) / P(G = l | X = x) ]
  = log(f_k(x)/f_l(x)) + log(π_k/π_l)
  = log(π_k/π_l) - .5 μ_k^t Σ^(-1) μ_k + .5 μ_l^t Σ^(-1) μ_l + x^t Σ^(-1) (μ_k - μ_l)
  = x^t a + b

a linear function of x.

The equal covariance matrices allow the x^t Σ_k^(-1) x and x^t Σ_l^(-1) x terms to cancel out.

From the log-odds function we see that the linear discriminant functions

δ_k(x) = x^t Σ^(-1) μ_k - .5 μ_k^t Σ^(-1) μ_k + log π_k

are an equivalent description of the decision rule, with

G(x) = arg max_k δ_k(x)
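The discriminant functions δ_k and the linearity of the log-odds can be checked numerically. The means, shared covariance and priors below are hypothetical, chosen only to exercise the formulas:

```python
import numpy as np

def lda_discriminant(x, mu_k, Sigma_inv, pi_k):
    """delta_k(x) = x^t Sigma^{-1} mu_k - 0.5 mu_k^t Sigma^{-1} mu_k + log pi_k."""
    return x @ Sigma_inv @ mu_k - 0.5 * mu_k @ Sigma_inv @ mu_k + np.log(pi_k)

# Hypothetical two-class setup with a shared covariance matrix.
Sigma_inv = np.linalg.inv(np.array([[1.0, 0.3], [0.3, 1.0]]))
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 2.0])

x = np.array([0.2, 0.1])
d1 = lda_discriminant(x, mu1, Sigma_inv, 0.5)
d2 = lda_discriminant(x, mu2, Sigma_inv, 0.5)
assert d1 > d2   # x is much closer to mu1

# With equal priors the log-odds d1 - d2 equals the linear function x^t a + b.
a = Sigma_inv @ (mu1 - mu2)
b = -0.5 * mu1 @ Sigma_inv @ mu1 + 0.5 * mu2 @ Sigma_inv @ mu2
assert abs((d1 - d2) - (x @ a + b)) < 1e-9
```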


LDA: Some practicalities

In practice we don't know the parameters of the Gaussian distributions and estimate these from the training data.

Let n_k be the number of class k observations. Then

π̂_k = n_k / n

μ̂_k = Σ_{g_i = k} x_i / n_k

Σ̂ = Σ_{k=1}^K Σ_{g_i = k} (x_i - μ̂_k)(x_i - μ̂_k)^t / (n - K)

[Figure (ESL p. 109, §4.3 Linear Discriminant Analysis): scatter plot of three-class data with the estimated LDA decision boundaries.]
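The plug-in estimates above can be computed in a few lines. The helper name lda_estimates and the toy data are illustrative:

```python
import numpy as np

def lda_estimates(X, g, K):
    """Plug-in estimates from the slide: pi_hat_k = n_k/n,
    mu_hat_k = class-k sample mean, pooled Sigma_hat with divisor (n - K)."""
    n, p = X.shape
    pis, mus = [], []
    Sigma = np.zeros((p, p))
    for k in range(1, K + 1):
        Xk = X[g == k]
        pis.append(len(Xk) / n)
        mu = Xk.mean(axis=0)
        mus.append(mu)
        D = Xk - mu
        Sigma += D.T @ D          # accumulate within-class scatter
    return np.array(pis), np.array(mus), Sigma / (n - K)

# Hypothetical toy data: two classes, two points each.
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 3.0]])
g = np.array([1, 1, 2, 2])
pi_hat, mu_hat, Sigma_hat = lda_estimates(X, g, K=2)
assert np.allclose(pi_hat, [0.5, 0.5])
assert np.allclose(mu_hat, [[0.5, 0.0], [3.5, 3.0]])
```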

When the Σ_k's are not all equal

If the Σ_k are not assumed to be equal then the quadratic terms remain and we get quadratic discriminant functions (QDA):

δ_k(x) = -.5 log |Σ_k| - .5 (x - μ_k)^t Σ_k^(-1) (x - μ_k) + log π_k

In this case the decision boundary between classes is described by a quadratic equation {x : δ_k(x) = δ_l(x)}.

Bivariate example

[Figure: a two-class bivariate example with specified means and unequal covariance matrices; panels show class distributions, decision boundaries, partition.]
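A sketch of the QDA discriminant, with hypothetical parameters for two classes with unequal covariances (not the slide's bivariate example):

```python
import numpy as np

def qda_discriminant(x, mu, Sigma, pi):
    """delta_k(x) = -0.5 log|Sigma_k| - 0.5 (x-mu_k)^t Sigma_k^{-1} (x-mu_k) + log pi_k."""
    d = x - mu
    _, logdet = np.linalg.slogdet(Sigma)          # stable log-determinant
    return -0.5 * logdet - 0.5 * d @ np.linalg.solve(Sigma, d) + np.log(pi)

# Hypothetical two-class setup with unequal covariance matrices.
mu1, S1 = np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]])
mu2, S2 = np.array([3.0, 0.0]), np.array([[2.0, 0.0], [0.0, 0.5]])

x = np.array([0.5, 0.0])
assert qda_discriminant(x, mu1, S1, 0.5) > qda_discriminant(x, mu2, S2, 0.5)
```

Because the -0.5 x^t Σ_k^(-1) x terms no longer cancel between classes, the set {x : δ_k(x) = δ_l(x)} is quadratic in x.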


Best way to compute a quadratic discriminant function?

The left plot shows the quadratic decision boundaries found using LDA in the five-dimensional space X1, X2, X1X2, X1², X2². (ESL p. 111, §4.3 Linear Discriminant Analysis)

[Figure 4.6: two scatter plots of the three-class data with quadratic decision boundaries.]

FIGURE 4.6. Two methods for fitting quadratic boundaries. The left plot shows the quadratic decision boundaries for the data in Figure 4.1 (obtained using LDA in the five-dimensional space X1, X2, X1X2, X1², X2²). The right plot shows the quadratic decision boundaries found by QDA. The differences are small, as is usually the case.

The right plot shows the quadratic decision boundaries found by QDA.

LDA and QDA summary


These methods can be surprisingly effective.
Can explain this

Reduced-Rank Linear Discriminant Analysis

Affine subspace defined by centroids of the classes

Have K centroids in a p-dimensional input space: μ_1, . . . , μ_K.

These centroids define a K - 1 dimensional affine subspace H_{K-1} where if u ∈ H_{K-1} then

u = μ_1 + α_1 (μ_2 - μ_1) + α_2 (μ_3 - μ_1) + · · · + α_{K-1} (μ_K - μ_1)
  = μ_1 + α_1 d_1 + α_2 d_2 + · · · + α_{K-1} d_{K-1}

If x ∈ R^p then it can be written as

x = μ_1 + α_1 d_1 + α_2 d_2 + · · · + α_{K-1} d_{K-1} + x_⊥,

where x_⊥ ⊥ H_{K-1}.

If x has been whitened with respect to the common covariance matrix then the Mahalanobis distance to centroid μ_j is

‖x - μ_j‖ = ‖μ_1 + α_1 d_1 + α_2 d_2 + · · · + α_{K-1} d_{K-1} + x_⊥ - μ_j‖
          = ‖α_1 d_1 + · · · + (α_{j-1} - 1) d_{j-1} + · · · + α_{K-1} d_{K-1} + x_⊥‖

x_⊥ does not change with j, therefore to locate the closest centroid we can ignore it.


To summarize

The K centroids in a p-dimensional input space lie in an affine subspace of dimension K − 1. If p ≫ K this is a big drop in dimension.

To locate the closest centroid we can ignore the directions orthogonal to this subspace, provided the data has been sphered. Therefore we can just project X onto this centroid-spanning subspace H_{K−1} and make comparisons there.

LDA thus performs dimensionalityity reduction: one need only consider the data in a subspace of dimension at most K − 1.

What about a subspace of dimension L < K − 1?

If K > 3 we can ask the question: which subspace of dimension L < K − 1 should we project onto for optimality w.r.t. LDA?

Fisher defined optimal to mean that the projected centroids are spread out as much as possible in terms of variance (cf. ESL §4.3, Linear Discriminant Analysis).

Find the principal component subspace of the centroids.


[Figure: linear discriminant analysis projection of the training data onto Coordinate 1 vs. Coordinate 2. In this example there are 11 classes with 10-dimensional input vectors. The bold dots correspond to the centroids projected onto the top 2 principal directions.]


The optimal sequence of subspaces

To find the sequence of optimal subspaces for LDA:

1. Compute the K × p matrix of class centroids M and the common covariance matrix W (the within-class variance).
2. Compute M* = M W^{−1/2}, using the eigen-decomposition of W.
3. Compute B*, the covariance matrix of M* (the between-class variance). B*'s eigen-decomposition is B* = V* D_B V*^t. The columns v_l* of V* define the basis of the optimal subspaces.

The l-th discriminant variable is given by Z_l = (v_l*)^t W^{−1/2} X.
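The three-step recipe above can be sketched numerically. The helper name `lda_subspace` and the synthetic data in the usage are illustrative rather than from the slides; the sketch assumes each class has enough samples for the pooled covariance W to be positive definite.

```python
import numpy as np

def lda_subspace(X, y):
    """Steps 1-3 above: returns the discriminant directions a_l = W^{-1/2} v_l*
    as columns (illustrative helper, not from the slides)."""
    classes = np.unique(y)
    K = len(classes)
    M = np.array([X[y == k].mean(axis=0) for k in classes])        # K x p centroid matrix
    # Step 1: pooled within-class covariance W.
    W = sum(np.cov(X[y == k].T) * (np.sum(y == k) - 1) for k in classes) / (len(y) - K)
    evals, evecs = np.linalg.eigh(W)
    W_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T          # Step 2: W^{-1/2}
    B_star = np.cov((M @ W_inv_sqrt).T, bias=True)                 # Step 3: between-class cov of M*
    b_evals, V_star = np.linalg.eigh(B_star)
    order = np.argsort(b_evals)[::-1]                              # largest centroid spread first
    return W_inv_sqrt @ V_star[:, order[: K - 1]]                  # columns a_1, ..., a_{K-1}

# Usage on synthetic data: 3 classes in R^3 project onto a (K-1)=2 dimensional subspace.
rng = np.random.default_rng(0)
means = np.array([[0.0, 0, 0], [3, 0, 0], [0, 3, 0]])
X = np.vstack([rng.normal(m, 1.0, size=(50, 3)) for m in means])
y = np.repeat([0, 1, 2], 50)
A = lda_subspace(X, y)
Z = X @ A          # the discriminant variables Z_l for each observation
```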

The optimal sequence of subspaces


To find the sequences of optimal subspaces for LDA:
1

Compute the K p matrix of class centroids M and the


common covariance matrix W - the within-class variance.
1

2
3

Compute M = M W 2 using the eigen-decomposition of W


Compute B the covariance matrix of M - the between-class
variance.
B s eigen-decomposition is B = V DB V . The columns of
vl of V define basis of the optimal subspace.
1

The lth discriminant variable is given by Zl = vl W 2 X

[Figure 4.8 (ESL): four projections of the vowel data onto pairs of canonical variates (coordinates 1, 2, 3, 7, 9 and 10). Notice that as the rank of the canonical variates increases, the projected centroids become less spread out. In the lower-right panel they appear to be superimposed, and the classes are most confused.]

LDA via the Fisher criterion

Fisher arrived at this decomposition via a different route. He posed the problem:

Find the linear combination Z = a^t X such that the between-class variance is maximized relative to the within-class variance.

[Figure 4.9 (ESL): why this criterion makes sense. Although the line joining the centroids defines the direction of greatest centroid spread, the projected data overlap because of the covariance (left panel). The discriminant direction minimizes this overlap for Gaussian data (right panel).]


The Fisher criterion

W is the common (within-class) covariance matrix of the original data X.
B is the covariance matrix of the centroid matrix M.
Then for the projected data Z = a^t X:

The between-class variance of Z is a^t B a.
The within-class variance of Z is a^t W a.

Fisher's problem amounts to maximizing the Rayleigh quotient

max_a (a^t B a) / (a^t W a)

or equivalently

max_a a^t B a subject to a^t W a = 1


The Fisher criterion

Fisher's problem amounts to maximizing the Rayleigh quotient:

a_1 = arg max_a a^t B a subject to a^t W a = 1

This is a generalized eigenvalue problem, with a_1 given by the eigenvector corresponding to the largest eigenvalue of W^{−1} B.

It can be shown that a_1 is equal to W^{−1/2} v_1* defined earlier.

One can then find the next direction a_2:

a_2 = arg max_a (a^t B a) / (a^t W a) subject to a_2^t W a_1 = 0

Once again a_2 = W^{−1/2} v_2*. In a similar fashion one can find a_3, a_4, ...
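The generalized eigenvalue problem can be verified on toy matrices. The W and B below are illustrative values, not from the slides; the sketch assumes W is positive definite.

```python
import numpy as np

# Toy within-class (W) and between-class (B) covariance matrices (illustrative values).
W = np.array([[2.0, 0.3], [0.3, 1.0]])
B = np.array([[1.0, 0.8], [0.8, 1.5]])

# B a = lambda W a is equivalent to the ordinary eigenproblem for W^{-1} B.
evals, evecs = np.linalg.eig(np.linalg.inv(W) @ B)
a1 = evecs[:, np.argmax(evals.real)].real

# Rescale so that a1^t W a1 = 1, matching the constraint above.
a1 = a1 / np.sqrt(a1 @ W @ a1)

# a1 attains the maximal Rayleigh quotient a^t B a / a^t W a over all directions.
rng = np.random.default_rng(0)
for _ in range(100):
    a = rng.normal(size=2)
    assert a1 @ B @ a1 >= (a @ B @ a) / (a @ W @ a) - 1e-9
```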


Classification in the reduced subspace

The a_l's are referred to as discriminant coordinates or canonical variates.

[Figure 4.11 (ESL): decision boundaries for the vowel training data in the two-dimensional subspace spanned by the first two canonical variates (Canonical Coordinate 1 vs. Canonical Coordinate 2). Note that in any higher-dimensional subspace, the decision boundaries are higher-dimensional affine planes, and could not be represented as lines.]

In this example there are 11 classes with 10-dimensional input vectors. The decision boundaries are obtained by basic linear discrimination in the low-dimensional space given by the first 2 canonical variates.

Logistic Regression

Logistic regression

Arises from trying to model the posterior probabilities of the K classes using linear functions in x while ensuring they sum to one.

The simple model used is, for k = 1, ..., K − 1,

P(G = k | X = x) = exp(β_{k0} + β_k^t x) / (1 + Σ_{l=1}^{K−1} exp(β_{l0} + β_l^t x))

and for k = K,

P(G = K | X = x) = 1 / (1 + Σ_{l=1}^{K−1} exp(β_{l0} + β_l^t x))

These posterior probabilities clearly sum to one.


Logistic regression

This model, for k = 1, ..., K − 1,

P(G = k | X = x) = exp(β_{k0} + β_k^t x) / (1 + Σ_{l=1}^{K−1} exp(β_{l0} + β_l^t x))

and for k = K,

P(G = K | X = x) = 1 / (1 + Σ_{l=1}^{K−1} exp(β_{l0} + β_l^t x))

induces linear decision boundaries between classes, as

{x : P(G = k | X = x) = P(G = l | X = x)}

is the same as

{x : (β_{k0} − β_{l0}) + (β_k − β_l)^t x = 0}

for any pair of classes 1 ≤ k, l ≤ K.
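The posterior formulas translate directly into code. The helper `posteriors` and the numbers below are illustrative, not from the slides; class K is the reference class with no parameters of its own.

```python
import numpy as np

def posteriors(x, beta0, beta):
    """Multiclass logistic posteriors (illustrative helper).
    beta0: (K-1,) intercepts beta_{k0}; beta: (K-1, p) rows beta_k^t."""
    logits = beta0 + beta @ x                     # beta_{k0} + beta_k^t x, k = 1..K-1
    e = np.exp(logits)
    denom = 1.0 + e.sum()
    return np.append(e / denom, 1.0 / denom)      # probabilities for classes 1..K

# Usage with hypothetical parameter values for K = 3 classes in R^2.
p = posteriors(np.array([1.0, -2.0]),
               np.array([0.5, -0.3]),
               np.array([[1.0, 0.2], [-0.4, 0.8]]))
```

By construction the K probabilities are positive and sum to one.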

Fitting logistic regression models

To simplify notation let

θ = {β_{10}, β_1^t, β_{20}, β_2^t, ...} and P(G = k | X = x) = p_k(x; θ)

Given training data {(x_i, g_i)}_{i=1}^n one usually fits the logistic regression model by maximum likelihood¹.

The log-likelihood for the n observations is

ℓ(θ) = log ∏_{i=1}^n p_{g_i}(x_i; θ) = Σ_{i=1}^n log p_{g_i}(x_i; θ)

¹ In my opinion this is an abuse of terminology, as the posterior probabilities are being used...

Fitting logistic regression models: the two-class case

p_1(x; θ) = exp(β^t x) / (1 + exp(β^t x)) and p_2(x; θ) = 1 − p_1(x; θ)

Let θ = β = (β_{10}, β_1^t)^t and assume the x_i's include the constant term 1.

A convenient way to write the likelihood for one sample (x_i, g_i) is to code the two-class g_i as a {0, 1} response y_i, where

y_i = 1 if g_i = 1, and y_i = 0 if g_i = 2.

Then one can write

p_{g_i}(x_i; β) = y_i p_1(x_i; β) + (1 − y_i)(1 − p_1(x_i; β))

Fitting logistic regression models: the two-class case

Similarly

log p_{g_i}(x_i; β) = y_i log p_1(x_i; β) + (1 − y_i) log(1 − p_1(x_i; β))

The log-likelihood of the data becomes

ℓ(β) = Σ_{i=1}^n [y_i log p_1(x_i; β) + (1 − y_i) log(1 − p_1(x_i; β))]
     = Σ_{i=1}^n [y_i β^t x_i − y_i log(1 + e^{β^t x_i}) − (1 − y_i) log(1 + e^{β^t x_i})]
     = Σ_{i=1}^n [y_i β^t x_i − log(1 + e^{β^t x_i})]

Fitting logistic regression models: the two-class case

ℓ(β) = Σ_{i=1}^n [y_i β^t x_i − log(1 + e^{β^t x_i})]

To maximize the log-likelihood, set its derivatives to zero:

∂ℓ(β)/∂β = Σ_{i=1}^n (x_i y_i − x_i exp(β^t x_i) / (1 + exp(β^t x_i)))
         = Σ_{i=1}^n x_i (y_i − exp(β^t x_i) / (1 + exp(β^t x_i)))
         = Σ_{i=1}^n x_i (y_i − p_1(x_i; β)) = 0

These are (p + 1) non-linear equations in β.

They must be solved iteratively; the book uses the Newton-Raphson algorithm.

The two-class case: iterative optimization

Newton-Raphson requires both the gradient

∂ℓ(β)/∂β = Σ_{i=1}^n x_i (y_i − p_1(x_i; β))

and the Hessian matrix

∂²ℓ(β)/∂β∂β^t = − Σ_{i=1}^n x_i x_i^t p_1(x_i; β)(1 − p_1(x_i; β))

Starting with β_old, a single Newton update step is

β_new = β_old − (∂²ℓ(β)/∂β∂β^t)^{−1} ∂ℓ(β)/∂β

where the derivatives are calculated at β_old.

Iterative optimization in matrix notation

Write the Hessian and gradient in matrix notation. Let

X be the n × (p + 1) matrix with (1, x_i^t) on each row,
p = (p_1(x_1; β_old), p_1(x_2; β_old), ..., p_1(x_n; β_old))^t,
W be the n × n diagonal matrix with i-th diagonal element p_1(x_i; β_old)(1 − p_1(x_i; β_old)).

Then

∂ℓ(β)/∂β = X^t (y − p) and ∂²ℓ(β)/∂β∂β^t = −X^t W X

Iterative optimization as iteratively reweighted least squares

The Newton step is then

β_new = β_old + (X^t W X)^{−1} X^t (y − p)
      = (X^t W X)^{−1} X^t W (X β_old + W^{−1}(y − p))
      = (X^t W X)^{−1} X^t W z

We have re-expressed the Newton step as a weighted least-squares step

β_new = arg min_β (z − Xβ)^t W (z − Xβ)

with response

z = X β_old + W^{−1}(y − p)

known as the adjusted response. Note that at each iteration W, p and z change.
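The iteratively reweighted least-squares view translates almost line for line into code. This is a minimal sketch: the function name `logreg_irls` and the simulated data are illustrative, not from the slides, and the sketch assumes the classes overlap (for perfectly separable data the MLE does not exist and the iterations diverge).

```python
import numpy as np

def logreg_irls(X, y, iters=25):
    """Two-class logistic regression via the Newton/IRLS steps above (minimal sketch).
    X: n x (p+1) matrix with a leading column of ones; y: {0,1} responses."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))               # p_1(x_i; beta_old)
        w = p * (1.0 - p)                                  # diagonal of W
        z = X @ beta + (y - p) / w                         # adjusted response
        # Weighted least squares step: beta_new = (X^t W X)^{-1} X^t W z
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return beta

# Usage on simulated data with true intercept 1 and slope 2 (illustrative).
rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(1.0 + 2.0 * x)))).astype(float)
X = np.column_stack([np.ones(n), x])
beta_hat = logreg_irls(X, y)
```

The recovered coefficients should land close to the true (1, 2) used to simulate the labels.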

A toy example

Two-class problem with 2-dimensional input vectors. Use logistic regression to find a decision boundary.

Illustration of the optimization process

At each step, given the current estimate β_cur, the quantities involved in the weighted least squares are p_1(x_i; β_cur), 1/W_ii, and W_ii = p_1(x_i; β_cur)(1 − p_1(x_i; β_cur)) (shown as marker sizes in the figures); the estimate β_cur is then updated.

Logistic regression converges to this decision boundary.

L1 regularized logistic regression

The L1 penalty can be used for variable selection in logistic regression by maximizing a penalized version of the log-likelihood:

max_{β_0, β} Σ_{i=1}^n [y_i (β_0 + β^t x_i) − log(1 + e^{β_0 + β^t x_i})] − λ Σ_{j=1}^p |β_j|

Note:

the intercept, β_0, is not included in the penalty term,

the predictors should be standardized to ensure the penalty is meaningful,

the above cost function is concave and a solution can be found using non-linear programming methods.
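One concrete optimizer choice for this concave problem is proximal gradient descent with soft-thresholding. This particular solver, the helper `l1_logreg`, and the simulated data are all assumptions for illustration; the slides only say that non-linear programming methods apply.

```python
import numpy as np

def l1_logreg(X, y, lam, lr=0.5, iters=3000):
    """L1-penalized two-class logistic regression via proximal gradient descent
    (an illustrative solver choice, not from the slides).
    X: n x p standardized predictors; y: {0,1}. The intercept is unpenalized."""
    n, p = X.shape
    beta0, beta = 0.0, np.zeros(p)
    for _ in range(iters):
        pr = 1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))
        beta0 += lr * np.sum(y - pr) / n                   # plain gradient step (no penalty)
        beta = beta + lr * (X.T @ (y - pr)) / n            # gradient step on beta ...
        beta = np.sign(beta) * np.maximum(np.abs(beta) - lr * lam / n, 0.0)  # ... then soft-threshold
    return beta0, beta

# Usage: column 0 drives the labels, column 1 is pure noise, so with a large
# enough lambda the soft-thresholding should zero out the noise coefficient.
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 2))
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-2.0 * X[:, 0]))).astype(float)
b0, b = l1_logreg(X, y, lam=60.0)
```

The soft-thresholding step is what produces exact zeros, i.e. variable selection, rather than merely small coefficients.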

Separating Hyperplanes

Directly estimating separating hyperplanes

In this section we describe separating hyperplane classifiers; we will only consider separable training data.

These construct linear decision boundaries that explicitly try to separate the data into different classes as well as possible.

A hyperplane is defined as

{x : β_0 + β^t x = 0}


Review of some vector algebra

[Figure 4.15 (ESL): the linear algebra of a hyperplane (affine set) defined by β_0 + β^t x = 0.]

Separating hyperplane classifiers date to the perceptron literature of the late 1950s (Rosenblatt, 1958), which set the foundations for the neural network models of the 1980s and 1990s. Before we continue, let us digress slightly and review some vector algebra. Figure 4.15 depicts a hyperplane or affine set L defined by the equation f(x) = β_0 + β^t x = 0; since we are in R² this is a line. Some properties:

1. For any two points x_1 and x_2 lying in L, β^t (x_1 − x_2) = 0, and hence β* = β/‖β‖ is the vector normal to the surface of L.
2. For any point x_0 in L, β^t x_0 = −β_0.
3. The signed distance of any point x to L is given by

   β*^t (x − x_0) = (β^t x + β_0)/‖β‖ = f(x)/‖f′(x)‖
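The signed-distance property is easy to verify numerically; the hyperplane coefficients below are illustrative values.

```python
import numpy as np

# Hyperplane {x : beta0 + beta^t x = 0} in R^2 (illustrative coefficients).
beta0, beta = -1.0, np.array([3.0, 4.0])

def signed_distance(x):
    # (beta^t x + beta0) / ||beta||: positive on the side beta points toward.
    return (beta @ x + beta0) / np.linalg.norm(beta)

# A point on the hyperplane has signed distance 0: 3*1 + 4*(-0.5) - 1 = 0.
x_on = np.array([1.0, -0.5])
assert np.isclose(signed_distance(x_on), 0.0)
# Moving one unit along the unit normal beta* gives signed distance 1.
assert np.isclose(signed_distance(x_on + beta / np.linalg.norm(beta)), 1.0)
```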


Perceptron Learning

Rosenblatt's Perceptron Learning Algorithm

The perceptron learning algorithm tries to find a separating hyperplane by minimizing the distance of misclassified points to the decision boundary.

The Objective Function
We have labelled training data {(x_i, y_i)} with x_i ∈ R^p and y_i ∈ {−1, 1}.
A point x_i is misclassified if sign(β_0 + β^t x_i) ≠ y_i.
This can be re-stated as: a point x_i is misclassified if y_i (β_0 + β^t x_i) < 0.
The goal is to find β_0 and β which minimize

D(β, β_0) = − Σ_{i∈M} y_i (β^t x_i + β_0)

where M is the index set of the misclassified points.


Perceptron Learning: The Objective Function

Want to find β_0 and β which minimize

D(β, β_0) = − Σ_{i∈M} y_i (β^t x_i + β_0) = − Σ_{i∈M} y_i f_{β,β_0}(x_i)

D(β, β_0) is non-negative.
D(β, β_0) is proportional to the distance of the misclassified points to the decision boundary defined by β_0 + β^t x = 0.

Questions:
Is there a unique β, β_0 which minimizes D(β, β_0) (disregarding re-scaling of β and β_0)?
Can we say anything about the form of D(β, β_0)?


Perceptron Learning: Optimizing the Objective Function

The gradient, assuming a fixed M, is given by

∂D(β, β₀)/∂β = − Σ_{i∈M} yᵢxᵢ,   ∂D(β, β₀)/∂β₀ = − Σ_{i∈M} yᵢ

Stochastic gradient descent is used to minimize D(β, β₀), so an update step is made after each observation is visited.

Identify a misclassified example wrt the current estimate of β and β₀ and make the update

β ← β + ρ yᵢxᵢ   and   β₀ ← β₀ + ρ yᵢ

where ρ is the learning rate.

Repeat this step until no points are misclassified.
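The full loop can be sketched as follows (a minimal pure-Python version on toy data of my own, not the lecture's code): sweep through the data and apply the update above to each misclassified point, stopping after the first complete pass with no mistakes.

```python
def perceptron_train(X, y, rho=1.0, max_epochs=100):
    """Perceptron learning via stochastic gradient steps on D(beta, beta0)."""
    p = len(X[0])
    beta, beta0 = [0.0] * p, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            score = sum(b * xj for b, xj in zip(beta, xi)) + beta0
            if yi * score <= 0:               # misclassified (or on the boundary)
                beta = [b + rho * yi * xj for b, xj in zip(beta, xi)]
                beta0 += rho * yi
                mistakes += 1
        if mistakes == 0:                     # no point misclassified: done
            break
    return beta, beta0                        # may not have converged if non-separable

# Linearly separable toy data: positives have x1 > 0, negatives x1 < 0.
X = [(2.0, 1.0), (1.0, -1.0), (-1.5, 0.5), (-2.0, -1.0)]
y = [+1, +1, -1, -1]
beta, beta0 = perceptron_train(X, y)
# Every training point now satisfies y_i * (beta^t x_i + beta0) > 0.
```

Note the `max_epochs` cap: as the slides point out, on non-separable data the loop would otherwise never terminate.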


Perceptron Learning: An Example

Want to find a separating hyperplane between the red and blue points.

Perceptron Learning: One Iteration

[Figure: the current estimate β⁽⁰⁾; a point misclassified by β⁽⁰⁾; the gradient at that point is used to get β⁽¹⁾.]

Perceptron Learning: Sequence of iterations

[Figures: the successive estimates β⁽²⁾, β⁽³⁾, . . . , β⁽¹⁷⁾, one per update, until all points are correctly classified.]

Is this the best separating hyperplane we could have found?

Perceptron Learning Algorithm: Properties

Pros

If the classes are linearly separable, the algorithm converges to a separating hyperplane in a finite number of steps.

Cons

All separating hyperplanes are considered equally valid.
The one found depends on the initial guess for β and β₀.
The finite number of steps can be very large.
If the data is non-separable, the algorithm will not converge.


Optimal Separating Hyperplanes

Consider the problem of finding a separating hyperplane for a linearly separable dataset {(x₁, y₁), (x₂, y₂), . . . , (xₙ, yₙ)} with xᵢ ∈ ℝᵖ and yᵢ ∈ {−1, 1}. Which of the infinite number of separating hyperplanes should we choose?

Intuitively, a hyperplane that passes too close to the training examples will be sensitive to noise and, therefore, less likely to generalize well for data outside the training set. Instead, it seems reasonable to expect that a hyperplane that is farthest from all the training examples will have better generalization capabilities.

Therefore, the optimal separating hyperplane is the one with the largest margin: it separates the two classes and maximizes the distance to the closest point from either class [Vapnik, 1996]. This provides a unique definition of the separating hyperplane and a decision boundary that generalizes well.

[Figure: a bad hyperplane passing too close to both classes versus a better hyperplane far away from all training points, with the maximum margin marked.]


Stating the optimization problem

A first attempt:

max_{β, β₀, ‖β‖=1} M   subject to   yᵢ(βᵗxᵢ + β₀) ≥ M, i = 1, . . . , n

The conditions ensure all the training points are a signed distance at least M from the decision boundary defined by β and β₀.

Want to find the largest such M and its associated β and β₀.

Stating the optimization problem

Remove the constraint ‖β‖ = 1 by adjusting the constraints on the training data as follows:

max_{β, β₀} M   subject to   yᵢ(βᵗxᵢ + β₀) ≥ M ‖β‖, i = 1, . . . , n

For any β and β₀ fulfilling the above constraints, λβ and λβ₀ with λ > 0 also fulfill the constraints.

Therefore we can arbitrarily set ‖β‖ = 1/M.

Then the above optimization problem is equivalent to

min_{β, β₀} ½ ‖β‖²   subject to   yᵢ(βᵗxᵢ + β₀) ≥ 1, i = 1, . . . , n
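The scale-invariance claim is easy to check numerically. In this sketch (toy data and names are my own assumptions), the geometric margin min_i yᵢ(βᵗxᵢ + β₀)/‖β‖ is computed before and after rescaling (β, β₀) by λ = 3:

```python
from math import sqrt

def geometric_margin(X, y, beta, beta0):
    """Smallest signed distance of any training point to the hyperplane."""
    norm = sqrt(sum(b * b for b in beta))
    return min(yi * (sum(b * xj for b, xj in zip(beta, xi)) + beta0) / norm
               for xi, yi in zip(X, y))

X = [(2.0, 1.0), (-1.0, -0.5), (1.0, 2.0)]
y = [+1, -1, +1]
beta, beta0 = (1.0, 0.5), -0.25     # some hyperplane separating the toy data

m1 = geometric_margin(X, y, beta, beta0)
m2 = geometric_margin(X, y, [3.0 * b for b in beta], 3.0 * beta0)
print(m1, m2)   # identical: the margin is invariant under rescaling
```

This is exactly why the normalization ‖β‖ = 1/M can be imposed without loss of generality.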


"A*B0?-$?/,"-1*CDE

Stating the optimization problem

"+"-%&).%&+(/0"#1&2%)34&%,5/%44&")&(4&(&67#$)"*#&*6&
With this formulation of the problem
&:"(4&*6&).%&4%5(/()"#0&.;5%/52(#%

Optimal separating hyperplanes

.+"/0%+1.%2)(+'-*.%&.+3..-%'%4#)-+%5%'-2%'%46'-.%730&8%)(
=
3 min
5 !&

, 0

1
kk2 subject to yi ( t xi + 0 ) 1, i = 1, . . . , n
2

ress the margin in terms of w and b of the separating hyperplane.

$'6%1/4."46'-.%1'(%)-:)-)+.%(#6;+)#-(%&/%()$46/%(*'6)-,%+1.%
(0%3.%*1##(.%+1.%(#6;+)#-%:#"%31)*1%+1.%2)(*")$)-'-+%
The margin has thickness 1/kk as shown
distance between a point x and a plane (w, b)
.%:#"%+1.%+"')-)-,%.5'$46.(%*6#(.(+%+#%+1.%&#;-2'"/
slightly different).
3 5) ! & " @
=

,=

1.%!"#$#%!"&'1/4."46'-.

A
3

*.%:"#$%+1.%*6#(.(+%
2'"/%)(
3=5 ! &
3

"

$.(
$"

A
3

@
3

&
3

3=5 ! &
3

,<

in figure
(notation
|wT x+b|

is

The solution to this constrained optimization problem

min_{β, β₀} ½ ‖β‖²   subject to   yᵢ(βᵗxᵢ + β₀) ≥ 1, i = 1, . . . , n

This is a convex optimization problem: a quadratic objective function with linear inequality constraints.

Its associated primal Lagrangian function is

L_p(β, β₀, α) = ½ ‖β‖² + Σᵢ₌₁ⁿ αᵢ (1 − yᵢ(xᵢᵗβ + β₀))

and β∗ and β₀∗ form a minimum point of the cost function stated at the top if...


The solution to this constrained optimization problem

min_{β, β₀} ½ ‖β‖²   subject to   yᵢ(βᵗxᵢ + β₀) ≥ 1, i = 1, . . . , n

The Karush–Kuhn–Tucker conditions state that β₁ = (β₀∗, β∗) is a minimum of this cost function iff there exists a unique α∗ s.t.

1. ∇_{β₁} L_p(β₁, α∗) = 0
2. αⱼ∗ ≥ 0 for j = 1, . . . , n
3. αⱼ∗ (1 − yⱼ(β₀∗ + xⱼᵗβ∗)) = 0 for j = 1, . . . , n
4. (1 − yⱼ(β₀∗ + xⱼᵗβ∗)) ≤ 0 for j = 1, . . . , n

Plus positive definiteness constraints on ∇_{β₁}∇_{β₁} L_p(β₁, α∗).

Let's check what the KKT conditions imply

Active constraints and inactive constraints:

Let A be the set of indices with αⱼ∗ > 0, then

L_p(β₁, α∗) = ½ ‖β∗‖² + Σ_{j∈A} αⱼ∗ (1 − yⱼ(β₀∗ + xⱼᵗβ∗)).

Condition KKT 1, ∇_{β₁} L_p(β₁, α∗) = 0, implies

β∗ = Σ_{j∈A} αⱼ∗ yⱼ xⱼ   and   0 = Σ_{j∈A} αⱼ∗ yⱼ

Condition KKT 3, αⱼ∗ (1 − yⱼ(β₀∗ + xⱼᵗβ∗)) = 0, implies

yⱼ(β₀∗ + xⱼᵗβ∗) = 1 for all j ∈ A,
and if yᵢ(β₀∗ + xᵢᵗβ∗) > 1 then αᵢ∗ = 0, i.e. i ∉ A.

Therefore L_p(β₁, α∗) = ½ ‖β∗‖².
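As a sanity check (the toy data and the claimed optimum here are my own, not the lecture's), the conditions above can be verified numerically for the two-point problem x₁ = (1, 0), y₁ = +1, x₂ = (−1, 0), y₂ = −1, whose maximal margin hyperplane is β∗ = (1, 0), β₀∗ = 0 with multipliers α∗ = (½, ½):

```python
X = [(1.0, 0.0), (-1.0, 0.0)]
y = [+1, -1]
alpha = [0.5, 0.5]                 # claimed multipliers; both constraints active, A = {1, 2}

# KKT 1: beta* = sum_j alpha_j y_j x_j
beta = [sum(a * yi * xj[d] for a, yi, xj in zip(alpha, y, X)) for d in range(2)]
beta0 = 0.0

assert beta == [1.0, 0.0]
assert sum(a * yi for a, yi in zip(alpha, y)) == 0.0    # sum_j alpha_j y_j = 0
for xi, yi, a in zip(X, y, alpha):
    slack = 1 - yi * (beta0 + sum(b * xj for b, xj in zip(beta, xi)))
    assert a >= 0 and slack <= 0   # KKT 2 and KKT 4
    assert a * slack == 0.0        # KKT 3 (complementary slackness)
```

Both points sit exactly on the margin boundary yᵢ(β₀∗ + xᵢᵗβ∗) = 1, so both are support vectors.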


To summarize

As we have a convex optimization problem, it has one local minimum.

At this minimum β₁ there exists a unique α∗ s.t. β₁ and α∗ fulfill the KKT conditions.

Let A be the set of indices with αⱼ∗ > 0, then

if i ∈ A then yᵢ(β₀∗ + xᵢᵗβ∗) = 1 and therefore xᵢ lies on the boundary of the margin. Such an xᵢ is called a support vector.

And if i ∉ A then yᵢ(β₀∗ + xᵢᵗβ∗) > 1 and xᵢ lies outside of the margin.

β∗ is a linear combination of the support vectors:

β∗ = Σ_{j∈A} αⱼ∗ yⱼ xⱼ


To summarize

[Figure: the maximal margin hyperplane with its support vectors marked.]

Thus the SVM in fact depends only on a small subset of the training points, the support vectors:

β∗ = Σ_{j∈A} αⱼ∗ yⱼ xⱼ

How do I calculate α∗?

You have seen that the optimal β∗ is a weighted sum of the support vectors. But how can we calculate these weights?

The most common approach is to solve the dual Lagrange problem as opposed to the primal Lagrange problem. (The solutions to these problems are the same because of the original quadratic cost function and linear inequality constraints.)

This dual problem is an easier constrained optimization problem and is also convex. It has the form

max_α Σᵢ₌₁ⁿ αᵢ − ½ Σᵢ₌₁ⁿ Σₖ₌₁ⁿ αᵢ αₖ yᵢ yₖ xᵢᵗxₖ   subject to   αᵢ ≥ 0 ∀i and Σᵢ₌₁ⁿ αᵢ yᵢ = 0
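For intuition, the dual can be solved by hand on the same hypothetical two-point problem used earlier (my own toy example, not the lecture's): with x₁ = (1, 0), y₁ = +1 and x₂ = (−1, 0), y₂ = −1, the constraint Σᵢ αᵢyᵢ = 0 forces α₁ = α₂ = a, the objective reduces to W(a) = 2a − 2a², and a brute-force scan recovers the maximizer a = ½ and with it β∗:

```python
X = [(1.0, 0.0), (-1.0, 0.0)]
y = [+1, -1]

def dual(alpha):
    """Dual objective: sum_i alpha_i - 1/2 sum_i sum_k alpha_i alpha_k y_i y_k x_i^t x_k."""
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    quad = sum(alpha[i] * alpha[k] * y[i] * y[k] * dot(X[i], X[k])
               for i in range(2) for k in range(2))
    return sum(alpha) - 0.5 * quad

grid = [i / 1000 for i in range(1001)]            # scan a in [0, 1]
best = max(grid, key=lambda a: dual([a, a]))      # alpha_1 = alpha_2 = a
beta = [sum(best * y[i] * X[i][d] for i in range(2)) for d in range(2)]

print(best, beta)    # the maximizing multiplier and the recovered beta*
```

In practice the dual is solved with a quadratic programming solver rather than a grid, but the recovered solution is the same: α∗ = (½, ½) and β∗ = Σⱼ αⱼ∗ yⱼ xⱼ = (1, 0).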
