
Chapter 4: Linear Methods for Classification

DD3364

March 23, 2012

Introduction

Focus on linear classification

Want to learn a predictor G : R^p -> G = {1, . . . , K}.

G divides the input space into regions labelled according to their classification.

The boundaries between these regions are termed the decision boundaries.

When these decision boundaries are linear we term the classification method linear.


An example where a linear decision boundary arises

Learn a discriminant function δ_k(x) for each class k and set

G(x) = arg max_k δ_k(x)

This generates a linear decision boundary when there is some monotone transformation g of δ_k(x) that is linear.

That is, g is a monotone function s.t.

g(δ_k(x)) = β_k0 + β_k^t x


Examples of discriminant functions

Example 1: Fit a linear regression model to the class indicator variables. Then the discriminant functions are

δ_k(x) = β_k0 + β_k^t x

Example 2: Use the posterior probabilities P(G = k | X = x) as the discriminant functions δ_k(x).

A popular model when there are two classes is:

P(G = 1 | X = x) = exp(β_0 + β^t x) / (1 + exp(β_0 + β^t x))

P(G = 2 | X = x) = 1 / (1 + exp(β_0 + β^t x))

g(p) = log(p/(1 - p)) can be applied as a monotone function to δ_1(x) = P(G = 1 | X = x) to make it linear.
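As a quick sanity check, the two-class posterior model above can be evaluated directly and its log-odds compared with the linear function β_0 + β^t x. The coefficients below are hypothetical, chosen only for illustration; a minimal sketch:

```python
import math

def posterior_class1(x, beta0, beta):
    """P(G=1 | X=x) under the two-class logit model from the slide."""
    eta = beta0 + sum(b * xj for b, xj in zip(beta, x))
    return math.exp(eta) / (1.0 + math.exp(eta))

def log_odds(x, beta0, beta):
    """g(p) = log(p / (1 - p)) applied to P(G=1 | X=x)."""
    p = posterior_class1(x, beta0, beta)
    return math.log(p / (1.0 - p))

# Hypothetical coefficients (not from the slides).
beta0, beta = -1.0, [2.0, -0.5]
x = [1.0, 2.0]

# The log-odds recovers exactly the linear function beta0 + beta^t x.
eta = beta0 + sum(b * xj for b, xj in zip(beta, x))
assert abs(log_odds(x, beta0, beta) - eta) < 1e-9
```

This is the sense in which the logit transformation makes the posterior-based discriminant linear in x.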


Can directly learn the linear decision boundary

For a two-class problem with p-dimensional inputs this amounts to modelling the decision boundary as a hyperplane.

This chapter looks at two methods which explicitly look for the separating hyperplane. These are

the Perceptron model and algorithm of Rosenblatt,

the SVM model and algorithm of Vapnik.

In the forms quoted, both these algorithms find separating hyperplanes if they exist and fail if the points are not linearly separable.

There are fixes for the non-separable case but we will not consider these today.


Linear decision boundaries can be made non-linear

Can expand the variable set X_1, X_2, . . . , X_p by including their squares and cross-products X_1², X_2², . . . , X_p², X_1 X_2, X_1 X_3, . . .

This adds p(p + 1)/2 additional variables.

Linear decision boundaries in the augmented space correspond to quadratic decision boundaries in the original space.
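The augmentation described above is mechanical: append all squares and pairwise products to the input vector. A small sketch (the helper name quadratic_expand is hypothetical):

```python
from itertools import combinations_with_replacement

def quadratic_expand(x):
    """Augment (x1,...,xp) with all squares and cross-products xi*xj, i <= j.
    This adds p(p+1)/2 variables, as noted on the slide."""
    pairs = combinations_with_replacement(range(len(x)), 2)
    return list(x) + [x[i] * x[j] for i, j in pairs]

p = 3
z = quadratic_expand([1.0, 2.0, 3.0])
# Original p coordinates plus p(p+1)/2 quadratic ones.
assert len(z) == p + p * (p + 1) // 2
```

A linear classifier fitted on z then induces a quadratic decision boundary in the original coordinates.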
[Figure 4.1 (ESL p. 103, §4.2 Linear Regression of an Indicator Matrix): two scatter plots of data from three classes, labelled 1, 2, 3. Caption: "The left plot shows some data from three classes, with linear ..."]


Linear Regression of an Indicator Matrix

Use linear regression to find discriminant functions

Have training data {(x_i, g_i)}_{i=1}^n where each x_i ∈ R^p and g_i ∈ {1, . . . , K}.

For each k construct a linear discriminant δ_k(x) via:

1. For i = 1, . . . , n set

   y_i = 1 if g_i = k, and y_i = 0 if g_i ≠ k

2. Compute (β̂_0k, β̂_k) = arg min_{β_0, β} Σ_{i=1}^n (y_i - β_0 - β^t x_i)²

Define

δ_k(x) = β̂_0k + β̂_k^t x

Classify a new point x with

G(x) = arg max_k δ_k(x)
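The two-step recipe above (build indicator responses, solve one least-squares problem per class, classify by the largest discriminant) can be sketched as follows; the function names and toy data are illustrative, not from the lecture:

```python
import numpy as np

def fit_indicator_regression(X, g, K):
    """Least-squares fit of one linear discriminant per class (steps 1-2).
    X: (n, p) inputs; g: length-n labels in {1,...,K}.
    Returns a (K, p+1) array whose row k-1 holds (beta_0k, beta_k)."""
    n = X.shape[0]
    Xa = np.hstack([np.ones((n, 1)), X])        # prepend intercept column
    Y = np.array([[1.0 if gi == k else 0.0 for k in range(1, K + 1)] for gi in g])
    B, *_ = np.linalg.lstsq(Xa, Y, rcond=None)  # one regression per indicator column
    return B.T

def classify(B, x):
    """G(x) = argmax_k delta_k(x), with labels in {1,...,K}."""
    deltas = B @ np.concatenate([[1.0], x])
    return int(np.argmax(deltas)) + 1

# Tiny hypothetical 2-class example, well separated along the coordinate.
X = np.array([[0.0], [0.2], [1.0], [1.2]])
g = [1, 1, 2, 2]
B = fit_indicator_regression(X, g, K=2)
assert classify(B, np.array([0.1])) == 1
assert classify(B, np.array([1.1])) == 2
```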


3 class example

Use linear regression of an indicator matrix to find the discriminant functions for the above 3 classes.

Construct K linear regression problems

For each k construct the response vectors from the class labels.
[Figure: three panels of 0/1 indicator responses plotted over the input domain, one per class.]

For each k fit a hyperplane that minimizes the RSS.
[Figure: the three fitted least-squares hyperplanes.]

Construct K discriminant functions

For each k construct the response vectors from the class labels.
[Figure: the three indicator-response panels, as above.]

The K discriminant functions defined by the least-squares hyperplanes.
[Figure: three panels showing δ_1(x), δ_2(x), δ_3(x).]

The decision boundary defined by these discriminant functions.

This approach will fail in this case

The training data from 3 classes.
[Figure: scatter plot of the three-class training data.]

The discriminant functions learnt via regression.
[Figure: three panels showing δ_1(x), δ_2(x), δ_3(x).]

The resulting decision boundary.
[Figure: the induced partition of the input space.]


In this last example masking has occurred.

This occurs because of the rigid nature of the linear discriminant functions.

This example is extreme, but for large K and small p such maskings occur naturally.

The other methods in this chapter are based on linear decision functions of x, but they are learnt in a smarter way...
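Masking can be reproduced on a tiny synthetic example (the data below are hypothetical, not the slides'): with three classes spread along a line, the middle class's fitted indicator function is flat by symmetry, so its discriminant never attains the maximum and the middle class is never predicted:

```python
import numpy as np

# Three 1-D classes with centres at 0, 5 and 10 (two points each).
X = np.array([-1.0, 1.0, 4.0, 6.0, 9.0, 11.0])
g = np.array([1, 2 - 1 + 1, 2, 2, 3, 3])  # labels 1,1 would be clearer; fixed below
g = np.array([1, 1, 2, 2, 3, 3])
Xa = np.column_stack([np.ones_like(X), X])

# One least-squares indicator regression per class.
betas = []
for k in (1, 2, 3):
    y = (g == k).astype(float)
    beta, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    betas.append(beta)

# Predict every training point by the largest fitted discriminant.
pred = [1 + int(np.argmax([b @ np.array([1.0, x]) for b in betas])) for x in X]
assert 2 not in pred   # the middle class is masked: never predicted
```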


Linear Discriminant Analysis

Optimal classification requires the posterior

To perform optimal classification we need to know P(G | X). Let
f_k(x) represent the class-conditional density P(X | G = k) and
π_k be the prior probability of class k, with Σ_{k=1}^K π_k = 1.

A simple application of Bayes' Rule gives

P(G = k | X = x) = f_k(x) π_k / Σ_{l=1}^K f_l(x) π_l

Therefore for classification, having the f_k(x) is almost equivalent to having P(G = k | X = x).
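Bayes' rule above turns class-conditional densities and priors into posteriors. A minimal sketch, with hypothetical 1-D Gaussian class-conditionals (everything here is illustrative):

```python
import math

def posterior(x, densities, priors):
    """P(G=k | X=x) = f_k(x) pi_k / sum_l f_l(x) pi_l (Bayes' rule).
    densities: list of callables f_k; priors: list of pi_k summing to 1."""
    num = [f(x) * p for f, p in zip(densities, priors)]
    total = sum(num)
    return [v / total for v in num]

def gauss(mu, sigma):
    """1-D Gaussian density, used as a hypothetical class-conditional."""
    return lambda x: math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Two classes with equal priors; at the midpoint the posteriors tie.
post = posterior(0.0, [gauss(-1.0, 1.0), gauss(1.0, 1.0)], [0.5, 0.5])
assert abs(sum(post) - 1.0) < 1e-12
assert abs(post[0] - post[1]) < 1e-12
```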


How to model the class densities

Many methods are based on specific models of f_k(x):

linear and quadratic discriminant functions use Gaussian distributions,

mixtures of Gaussian distributions produce non-linear decision boundaries,

non-parametric density estimates allow the most flexibility,

Naive Bayes, where f_k(X) = Π_{j=1}^p f_kj(X_j).


Multivariate Gaussian class densities

Model each f_k(x) as a multivariate Gaussian:

f_k(x) = 1 / ((2π)^(p/2) |Σ_k|^(1/2)) exp{-.5 (x - μ_k)^t Σ_k^(-1) (x - μ_k)}

Linear Discriminant Analysis (LDA) arises in the special case when Σ_k = Σ for all k, that is, each p(x | G = k) is Normally distributed with equal covariance matrices.

One gets linear decision boundaries.

[Figure: panels labelled class distributions, decision boundary, partition.]


LDA

Can see this as

log [ P(G = k | X = x) / P(G = l | X = x) ]
  = log(f_k(x)/f_l(x)) + log(π_k/π_l)
  = log(π_k/π_l) - .5 μ_k^t Σ^(-1) μ_k + .5 μ_l^t Σ^(-1) μ_l + x^t Σ^(-1) (μ_k - μ_l)
  = x^t a + b

a linear function of x.

The equal covariance matrices allow the x^t Σ_k^(-1) x and x^t Σ_l^(-1) x terms to cancel out.

From the log-odds function we see that the linear discriminant functions

δ_k(x) = x^t Σ^(-1) μ_k - .5 μ_k^t Σ^(-1) μ_k + log π_k

are an equivalent description of the decision rule, with

G(x) = arg max_k δ_k(x)
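The discriminant functions δ_k and the linearity of the log-odds can be checked numerically. The means, shared covariance and priors below are hypothetical, chosen only to exercise the formulas:

```python
import numpy as np

def lda_discriminant(x, mu_k, Sigma_inv, pi_k):
    """delta_k(x) = x^t Sigma^{-1} mu_k - 0.5 mu_k^t Sigma^{-1} mu_k + log pi_k."""
    return x @ Sigma_inv @ mu_k - 0.5 * mu_k @ Sigma_inv @ mu_k + np.log(pi_k)

# Hypothetical two-class setup with a shared covariance matrix.
Sigma_inv = np.linalg.inv(np.array([[1.0, 0.3], [0.3, 1.0]]))
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 2.0])

x = np.array([0.2, 0.1])
d1 = lda_discriminant(x, mu1, Sigma_inv, 0.5)
d2 = lda_discriminant(x, mu2, Sigma_inv, 0.5)
assert d1 > d2   # x is much closer to mu1

# With equal priors the log-odds d1 - d2 equals the linear function x^t a + b.
a = Sigma_inv @ (mu1 - mu2)
b = -0.5 * mu1 @ Sigma_inv @ mu1 + 0.5 * mu2 @ Sigma_inv @ mu2
assert abs((d1 - d2) - (x @ a + b)) < 1e-9
```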


LDA: Some practicalities

In practice we don't know the parameters of the Gaussian distributions and estimate these from the training data.

Let n_k be the number of class k observations. Then

π̂_k = n_k / n

μ̂_k = Σ_{g_i = k} x_i / n_k

Σ̂ = Σ_{k=1}^K Σ_{g_i = k} (x_i - μ̂_k)(x_i - μ̂_k)^t / (n - K)

[Figure (ESL p. 109, §4.3 Linear Discriminant Analysis): scatter plot of three-class data with the estimated LDA decision boundaries.]
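The plug-in estimates above can be computed in a few lines. The helper name lda_estimates and the toy data are illustrative:

```python
import numpy as np

def lda_estimates(X, g, K):
    """Plug-in estimates from the slide: pi_hat_k = n_k/n,
    mu_hat_k = class-k sample mean, pooled Sigma_hat with divisor (n - K)."""
    n, p = X.shape
    pis, mus = [], []
    Sigma = np.zeros((p, p))
    for k in range(1, K + 1):
        Xk = X[g == k]
        pis.append(len(Xk) / n)
        mu = Xk.mean(axis=0)
        mus.append(mu)
        D = Xk - mu
        Sigma += D.T @ D          # accumulate within-class scatter
    return np.array(pis), np.array(mus), Sigma / (n - K)

# Hypothetical toy data: two classes, two points each.
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 3.0]])
g = np.array([1, 1, 2, 2])
pi_hat, mu_hat, Sigma_hat = lda_estimates(X, g, K=2)
assert np.allclose(pi_hat, [0.5, 0.5])
assert np.allclose(mu_hat, [[0.5, 0.0], [3.5, 3.0]])
```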

When the Σ_k's are not all equal

If the Σ_k are not assumed to be equal then the quadratic terms remain and we get quadratic discriminant functions (QDA):

δ_k(x) = -.5 log |Σ_k| - .5 (x - μ_k)^t Σ_k^(-1) (x - μ_k) + log π_k

In this case the decision boundary between classes is described by a quadratic equation {x : δ_k(x) = δ_l(x)}.

Bivariate example

[Figure: a two-class bivariate example with specified means and unequal covariance matrices; panels show class distributions, decision boundaries, partition.]
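A sketch of the QDA discriminant, with hypothetical parameters for two classes with unequal covariances (not the slide's bivariate example):

```python
import numpy as np

def qda_discriminant(x, mu, Sigma, pi):
    """delta_k(x) = -0.5 log|Sigma_k| - 0.5 (x-mu_k)^t Sigma_k^{-1} (x-mu_k) + log pi_k."""
    d = x - mu
    _, logdet = np.linalg.slogdet(Sigma)          # stable log-determinant
    return -0.5 * logdet - 0.5 * d @ np.linalg.solve(Sigma, d) + np.log(pi)

# Hypothetical two-class setup with unequal covariance matrices.
mu1, S1 = np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]])
mu2, S2 = np.array([3.0, 0.0]), np.array([[2.0, 0.0], [0.0, 0.5]])

x = np.array([0.5, 0.0])
assert qda_discriminant(x, mu1, S1, 0.5) > qda_discriminant(x, mu2, S2, 0.5)
```

Because the -0.5 x^t Σ_k^(-1) x terms no longer cancel between classes, the set {x : δ_k(x) = δ_l(x)} is quadratic in x.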


Best way to compute a quadratic discriminant function?

The left plot shows the quadratic decision boundaries found using LDA in the five-dimensional space X1, X2, X1X2, X1², X2². (ESL p. 111, §4.3 Linear Discriminant Analysis)

[Figure 4.6: two scatter plots of the three-class data with quadratic decision boundaries.]

FIGURE 4.6. Two methods for fitting quadratic boundaries. The left plot shows the quadratic decision boundaries for the data in Figure 4.1 (obtained using LDA in the five-dimensional space X1, X2, X1X2, X1², X2²). The right plot shows the quadratic decision boundaries found by QDA. The differences are small, as is usually the case.

The right plot shows the quadratic decision boundaries found by QDA.

LDA and QDA summary


These methods can be surprisingly effective.
Can explain this

Reduced-Rank Linear Discriminant Analysis

Affine subspace defined by centroids of the classes

Have K centroids in a p-dimensional input space: μ_1, . . . , μ_K.

These centroids define a K - 1 dimensional affine subspace H_{K-1} where if u ∈ H_{K-1} then

u = μ_1 + α_1 (μ_2 - μ_1) + α_2 (μ_3 - μ_1) + · · · + α_{K-1} (μ_K - μ_1)
  = μ_1 + α_1 d_1 + α_2 d_2 + · · · + α_{K-1} d_{K-1}

If x ∈ R^p then it can be written as

x = μ_1 + α_1 d_1 + α_2 d_2 + · · · + α_{K-1} d_{K-1} + x_⊥,

where x_⊥ ⊥ H_{K-1}.

If x has been whitened with respect to the common covariance matrix then the Mahalanobis distance to centroid μ_j is

‖x - μ_j‖ = ‖μ_1 + α_1 d_1 + α_2 d_2 + · · · + α_{K-1} d_{K-1} + x_⊥ - μ_j‖
          = ‖α_1 d_1 + · · · + (α_{j-1} - 1) d_{j-1} + · · · + α_{K-1} d_{K-1} + x_⊥‖

x_⊥ does not change with j, therefore to locate the closest centroid we can ignore it.


To summarize

The K centroids in a p-dimensional input space lie in an affine subspace of dimension K − 1. If p ≫ K this is a big drop in dimension.

To locate the closest centroid we can ignore the directions orthogonal to this subspace, provided the data has been sphered. Therefore we can just project X onto this centroid-spanning subspace H_{K−1} and make comparisons there.

LDA thus performs dimensionalityity reduction: one need only consider the data in a subspace of dimension at most K − 1.

What about a subspace of dimension L < K − 1?

If K > 3 we can ask the question: which subspace of dimension L < K − 1 should we project onto for optimality w.r.t. LDA?

Fisher defined optimal to mean that the projected centroids are spread out as much as possible in terms of variance (cf. ESL §4.3, Linear Discriminant Analysis).

Find the principal component subspace of the centroids.


[Figure: linear discriminant analysis projection of the training data onto Coordinate 1 vs. Coordinate 2. In this example there are 11 classes with 10-dimensional input vectors. The bold dots correspond to the centroids projected onto the top 2 principal directions.]


The optimal sequence of subspaces

To find the sequence of optimal subspaces for LDA:

1. Compute the K × p matrix of class centroids M and the common covariance matrix W (the within-class variance).
2. Compute M* = M W^{−1/2}, using the eigen-decomposition of W.
3. Compute B*, the covariance matrix of M* (the between-class variance). B*'s eigen-decomposition is B* = V* D_B V*^t. The columns v_l* of V* define the basis of the optimal subspaces.

The l-th discriminant variable is given by Z_l = (v_l*)^t W^{−1/2} X.
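The three-step recipe above can be sketched numerically. The helper name `lda_subspace` and the synthetic data in the usage are illustrative rather than from the slides; the sketch assumes each class has enough samples for the pooled covariance W to be positive definite.

```python
import numpy as np

def lda_subspace(X, y):
    """Steps 1-3 above: returns the discriminant directions a_l = W^{-1/2} v_l*
    as columns (illustrative helper, not from the slides)."""
    classes = np.unique(y)
    K = len(classes)
    M = np.array([X[y == k].mean(axis=0) for k in classes])        # K x p centroid matrix
    # Step 1: pooled within-class covariance W.
    W = sum(np.cov(X[y == k].T) * (np.sum(y == k) - 1) for k in classes) / (len(y) - K)
    evals, evecs = np.linalg.eigh(W)
    W_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T          # Step 2: W^{-1/2}
    B_star = np.cov((M @ W_inv_sqrt).T, bias=True)                 # Step 3: between-class cov of M*
    b_evals, V_star = np.linalg.eigh(B_star)
    order = np.argsort(b_evals)[::-1]                              # largest centroid spread first
    return W_inv_sqrt @ V_star[:, order[: K - 1]]                  # columns a_1, ..., a_{K-1}

# Usage on synthetic data: 3 classes in R^3 project onto a (K-1)=2 dimensional subspace.
rng = np.random.default_rng(0)
means = np.array([[0.0, 0, 0], [3, 0, 0], [0, 3, 0]])
X = np.vstack([rng.normal(m, 1.0, size=(50, 3)) for m in means])
y = np.repeat([0, 1, 2], 50)
A = lda_subspace(X, y)
Z = X @ A          # the discriminant variables Z_l for each observation
```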

The optimal sequence of subspaces


To find the sequences of optimal subspaces for LDA:
1

Compute the K p matrix of class centroids M and the


common covariance matrix W - the within-class variance.
1

2
3

Compute M = M W 2 using the eigen-decomposition of W


Compute B the covariance matrix of M - the between-class
variance.
B s eigen-decomposition is B = V DB V . The columns of
vl of V define basis of the optimal subspace.
1

The lth discriminant variable is given by Zl = vl W 2 X

[Figure 4.8 (ESL): four projections of the vowel data onto pairs of canonical variates (coordinates 1, 2, 3, 7, 9 and 10). Notice that as the rank of the canonical variates increases, the projected centroids become less spread out. In the lower-right panel they appear to be superimposed, and the classes are most confused.]

LDA via the Fisher criterion

Fisher arrived at this decomposition via a different route. He posed the problem:

Find the linear combination Z = a^t X such that the between-class variance is maximized relative to the within-class variance.

[Figure 4.9 (ESL): why this criterion makes sense. Although the line joining the centroids defines the direction of greatest centroid spread, the projected data overlap because of the covariance (left panel). The discriminant direction minimizes this overlap for Gaussian data (right panel).]


The Fisher criterion

W is the common (within-class) covariance matrix of the original data X.
B is the covariance matrix of the centroid matrix M.
Then for the projected data Z = a^t X:

The between-class variance of Z is a^t B a.
The within-class variance of Z is a^t W a.

Fisher's problem amounts to maximizing the Rayleigh quotient

max_a (a^t B a) / (a^t W a)

or equivalently

max_a a^t B a subject to a^t W a = 1


The Fisher criterion

Fisher's problem amounts to maximizing the Rayleigh quotient:

a_1 = arg max_a a^t B a subject to a^t W a = 1

This is a generalized eigenvalue problem, with a_1 given by the eigenvector corresponding to the largest eigenvalue of W^{−1} B.

It can be shown that a_1 is equal to W^{−1/2} v_1* defined earlier.

One can then find the next direction a_2:

a_2 = arg max_a (a^t B a) / (a^t W a) subject to a_2^t W a_1 = 0

Once again a_2 = W^{−1/2} v_2*. In a similar fashion one can find a_3, a_4, ...
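The generalized eigenvalue problem can be verified on toy matrices. The W and B below are illustrative values, not from the slides; the sketch assumes W is positive definite.

```python
import numpy as np

# Toy within-class (W) and between-class (B) covariance matrices (illustrative values).
W = np.array([[2.0, 0.3], [0.3, 1.0]])
B = np.array([[1.0, 0.8], [0.8, 1.5]])

# B a = lambda W a is equivalent to the ordinary eigenproblem for W^{-1} B.
evals, evecs = np.linalg.eig(np.linalg.inv(W) @ B)
a1 = evecs[:, np.argmax(evals.real)].real

# Rescale so that a1^t W a1 = 1, matching the constraint above.
a1 = a1 / np.sqrt(a1 @ W @ a1)

# a1 attains the maximal Rayleigh quotient a^t B a / a^t W a over all directions.
rng = np.random.default_rng(0)
for _ in range(100):
    a = rng.normal(size=2)
    assert a1 @ B @ a1 >= (a @ B @ a) / (a @ W @ a) - 1e-9
```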


Classification in the reduced subspace

The a_l's are referred to as discriminant coordinates or canonical variates.

[Figure 4.11 (ESL): decision boundaries for the vowel training data in the two-dimensional subspace spanned by the first two canonical variates (Canonical Coordinate 1 vs. Canonical Coordinate 2). Note that in any higher-dimensional subspace, the decision boundaries are higher-dimensional affine planes, and could not be represented as lines.]

In this example there are 11 classes with 10-dimensional input vectors. The decision boundaries are obtained by basic linear discrimination in the low-dimensional space given by the first 2 canonical variates.

Logistic Regression

Logistic regression

Arises from trying to model the posterior probabilities of the K classes using linear functions in x while ensuring they sum to one.

The simple model used is, for k = 1, ..., K − 1,

P(G = k | X = x) = exp(β_{k0} + β_k^t x) / (1 + Σ_{l=1}^{K−1} exp(β_{l0} + β_l^t x))

and for k = K,

P(G = K | X = x) = 1 / (1 + Σ_{l=1}^{K−1} exp(β_{l0} + β_l^t x))

These posterior probabilities clearly sum to one.


Logistic regression

This model, for k = 1, ..., K − 1,

P(G = k | X = x) = exp(β_{k0} + β_k^t x) / (1 + Σ_{l=1}^{K−1} exp(β_{l0} + β_l^t x))

and for k = K,

P(G = K | X = x) = 1 / (1 + Σ_{l=1}^{K−1} exp(β_{l0} + β_l^t x))

induces linear decision boundaries between classes, as

{x : P(G = k | X = x) = P(G = l | X = x)}

is the same as

{x : (β_{k0} − β_{l0}) + (β_k − β_l)^t x = 0}

for any pair of classes 1 ≤ k, l ≤ K.
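The posterior formulas translate directly into code. The helper `posteriors` and the numbers below are illustrative, not from the slides; class K is the reference class with no parameters of its own.

```python
import numpy as np

def posteriors(x, beta0, beta):
    """Multiclass logistic posteriors (illustrative helper).
    beta0: (K-1,) intercepts beta_{k0}; beta: (K-1, p) rows beta_k^t."""
    logits = beta0 + beta @ x                     # beta_{k0} + beta_k^t x, k = 1..K-1
    e = np.exp(logits)
    denom = 1.0 + e.sum()
    return np.append(e / denom, 1.0 / denom)      # probabilities for classes 1..K

# Usage with hypothetical parameter values for K = 3 classes in R^2.
p = posteriors(np.array([1.0, -2.0]),
               np.array([0.5, -0.3]),
               np.array([[1.0, 0.2], [-0.4, 0.8]]))
```

By construction the K probabilities are positive and sum to one.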

Fitting logistic regression models

To simplify notation let

θ = {β_{10}, β_1^t, β_{20}, β_2^t, ...} and P(G = k | X = x) = p_k(x; θ)

Given training data {(x_i, g_i)}_{i=1}^n one usually fits the logistic regression model by maximum likelihood¹.

The log-likelihood for the n observations is

ℓ(θ) = log ∏_{i=1}^n p_{g_i}(x_i; θ) = Σ_{i=1}^n log p_{g_i}(x_i; θ)

¹ In my opinion this is an abuse of terminology, as the posterior probabilities are being used...

Fitting logistic regression models: the two-class case

p_1(x; θ) = exp(β^t x) / (1 + exp(β^t x)) and p_2(x; θ) = 1 − p_1(x; θ)

Let θ = β = (β_{10}, β_1^t)^t and assume the x_i's include the constant term 1.

A convenient way to write the likelihood for one sample (x_i, g_i) is to code the two-class g_i as a {0, 1} response y_i, where

y_i = 1 if g_i = 1, and y_i = 0 if g_i = 2.

Then one can write

p_{g_i}(x_i; β) = y_i p_1(x_i; β) + (1 − y_i)(1 − p_1(x_i; β))

Fitting logistic regression models: the two-class case

Similarly

log p_{g_i}(x_i; β) = y_i log p_1(x_i; β) + (1 − y_i) log(1 − p_1(x_i; β))

The log-likelihood of the data becomes

ℓ(β) = Σ_{i=1}^n [y_i log p_1(x_i; β) + (1 − y_i) log(1 − p_1(x_i; β))]
     = Σ_{i=1}^n [y_i β^t x_i − y_i log(1 + e^{β^t x_i}) − (1 − y_i) log(1 + e^{β^t x_i})]
     = Σ_{i=1}^n [y_i β^t x_i − log(1 + e^{β^t x_i})]

Fitting logistic regression models: the two-class case

ℓ(β) = Σ_{i=1}^n [y_i β^t x_i − log(1 + e^{β^t x_i})]

To maximize the log-likelihood, set its derivatives to zero:

∂ℓ(β)/∂β = Σ_{i=1}^n (x_i y_i − x_i exp(β^t x_i) / (1 + exp(β^t x_i)))
         = Σ_{i=1}^n x_i (y_i − exp(β^t x_i) / (1 + exp(β^t x_i)))
         = Σ_{i=1}^n x_i (y_i − p_1(x_i; β)) = 0

These are (p + 1) non-linear equations in β.

They must be solved iteratively; the book uses the Newton-Raphson algorithm.

The two-class case: iterative optimization

Newton-Raphson requires both the gradient

∂ℓ(β)/∂β = Σ_{i=1}^n x_i (y_i − p_1(x_i; β))

and the Hessian matrix

∂²ℓ(β)/∂β∂β^t = − Σ_{i=1}^n x_i x_i^t p_1(x_i; β)(1 − p_1(x_i; β))

Starting with β_old, a single Newton update step is

β_new = β_old − (∂²ℓ(β)/∂β∂β^t)^{−1} ∂ℓ(β)/∂β

where the derivatives are calculated at β_old.

Iterative optimization in matrix notation

Write the Hessian and gradient in matrix notation. Let

X be the n × (p + 1) matrix with (1, x_i^t) on each row,
p = (p_1(x_1; β_old), p_1(x_2; β_old), ..., p_1(x_n; β_old))^t,
W be the n × n diagonal matrix with i-th diagonal element p_1(x_i; β_old)(1 − p_1(x_i; β_old)).

Then

∂ℓ(β)/∂β = X^t (y − p) and ∂²ℓ(β)/∂β∂β^t = −X^t W X

Iterative optimization as iteratively reweighted least squares

The Newton step is then

β_new = β_old + (X^t W X)^{−1} X^t (y − p)
      = (X^t W X)^{−1} X^t W (X β_old + W^{−1}(y − p))
      = (X^t W X)^{−1} X^t W z

We have re-expressed the Newton step as a weighted least-squares step

β_new = arg min_β (z − Xβ)^t W (z − Xβ)

with response

z = X β_old + W^{−1}(y − p)

known as the adjusted response. Note that at each iteration W, p and z change.
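The iteratively reweighted least-squares view translates almost line for line into code. This is a minimal sketch: the function name `logreg_irls` and the simulated data are illustrative, not from the slides, and the sketch assumes the classes overlap (for perfectly separable data the MLE does not exist and the iterations diverge).

```python
import numpy as np

def logreg_irls(X, y, iters=25):
    """Two-class logistic regression via the Newton/IRLS steps above (minimal sketch).
    X: n x (p+1) matrix with a leading column of ones; y: {0,1} responses."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))               # p_1(x_i; beta_old)
        w = p * (1.0 - p)                                  # diagonal of W
        z = X @ beta + (y - p) / w                         # adjusted response
        # Weighted least squares step: beta_new = (X^t W X)^{-1} X^t W z
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return beta

# Usage on simulated data with true intercept 1 and slope 2 (illustrative).
rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(1.0 + 2.0 * x)))).astype(float)
X = np.column_stack([np.ones(n), x])
beta_hat = logreg_irls(X, y)
```

The recovered coefficients should land close to the true (1, 2) used to simulate the labels.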

A toy example

Two-class problem with 2-dimensional input vectors. Use logistic regression to find a decision boundary.

Illustration of the optimization process

At each step, given the current estimate β_cur, the quantities involved in the weighted least squares are p_1(x_i; β_cur), 1/W_ii, and W_ii = p_1(x_i; β_cur)(1 − p_1(x_i; β_cur)) (shown as marker sizes in the figures); the estimate β_cur is then updated.

Logistic regression converges to this decision boundary.

L1 regularized logistic regression

The L1 penalty can be used for variable selection in logistic regression by maximizing a penalized version of the log-likelihood:

max_{β_0, β} Σ_{i=1}^n [y_i (β_0 + β^t x_i) − log(1 + e^{β_0 + β^t x_i})] − λ Σ_{j=1}^p |β_j|

Note:

the intercept, β_0, is not included in the penalty term,

the predictors should be standardized to ensure the penalty is meaningful,

the above cost function is concave and a solution can be found using non-linear programming methods.
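One concrete optimizer choice for this concave problem is proximal gradient descent with soft-thresholding. This particular solver, the helper `l1_logreg`, and the simulated data are all assumptions for illustration; the slides only say that non-linear programming methods apply.

```python
import numpy as np

def l1_logreg(X, y, lam, lr=0.5, iters=3000):
    """L1-penalized two-class logistic regression via proximal gradient descent
    (an illustrative solver choice, not from the slides).
    X: n x p standardized predictors; y: {0,1}. The intercept is unpenalized."""
    n, p = X.shape
    beta0, beta = 0.0, np.zeros(p)
    for _ in range(iters):
        pr = 1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))
        beta0 += lr * np.sum(y - pr) / n                   # plain gradient step (no penalty)
        beta = beta + lr * (X.T @ (y - pr)) / n            # gradient step on beta ...
        beta = np.sign(beta) * np.maximum(np.abs(beta) - lr * lam / n, 0.0)  # ... then soft-threshold
    return beta0, beta

# Usage: column 0 drives the labels, column 1 is pure noise, so with a large
# enough lambda the soft-thresholding should zero out the noise coefficient.
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 2))
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-2.0 * X[:, 0]))).astype(float)
b0, b = l1_logreg(X, y, lam=60.0)
```

The soft-thresholding step is what produces exact zeros, i.e. variable selection, rather than merely small coefficients.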

Separating Hyperplanes

Directly estimating separating hyperplanes

In this section we describe separating hyperplane classifiers; we will only consider separable training data.

These construct linear decision boundaries that explicitly try to separate the data into different classes as well as possible.

A hyperplane is defined as

{x : β_0 + β^t x = 0}


Review of some vector algebra

[Figure 4.15 (ESL): the linear algebra of a hyperplane (affine set) defined by β_0 + β^t x = 0.]

Separating hyperplane classifiers date to the perceptron literature of the late 1950s (Rosenblatt, 1958), which set the foundations for the neural network models of the 1980s and 1990s. Before we continue, let us digress slightly and review some vector algebra. Figure 4.15 depicts a hyperplane or affine set L defined by the equation f(x) = β_0 + β^t x = 0; since we are in R² this is a line. Some properties:

1. For any two points x_1 and x_2 lying in L, β^t (x_1 − x_2) = 0, and hence β* = β/‖β‖ is the vector normal to the surface of L.
2. For any point x_0 in L, β^t x_0 = −β_0.
3. The signed distance of any point x to L is given by

   β*^t (x − x_0) = (β^t x + β_0)/‖β‖ = f(x)/‖f′(x)‖
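The signed-distance property is easy to verify numerically; the hyperplane coefficients below are illustrative values.

```python
import numpy as np

# Hyperplane {x : beta0 + beta^t x = 0} in R^2 (illustrative coefficients).
beta0, beta = -1.0, np.array([3.0, 4.0])

def signed_distance(x):
    # (beta^t x + beta0) / ||beta||: positive on the side beta points toward.
    return (beta @ x + beta0) / np.linalg.norm(beta)

# A point on the hyperplane has signed distance 0: 3*1 + 4*(-0.5) - 1 = 0.
x_on = np.array([1.0, -0.5])
assert np.isclose(signed_distance(x_on), 0.0)
# Moving one unit along the unit normal beta* gives signed distance 1.
assert np.isclose(signed_distance(x_on + beta / np.linalg.norm(beta)), 1.0)
```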


Perceptron Learning

Rosenblatt's Perceptron Learning Algorithm

The perceptron learning algorithm tries to find a separating hyperplane by minimizing the distance of misclassified points to the decision boundary.

The Objective Function
We have labelled training data {(x_i, y_i)} with x_i ∈ R^p and y_i ∈ {−1, 1}.
A point x_i is misclassified if sign(β_0 + β^t x_i) ≠ y_i.
This can be re-stated as: a point x_i is misclassified if y_i (β_0 + β^t x_i) < 0.
The goal is to find β_0 and β which minimize

D(β, β_0) = − Σ_{i∈M} y_i (β^t x_i + β_0)

where M is the index set of the misclassified points.


Perceptron Learning: The Objective Function

Want to find β_0 and β which minimize

D(β, β_0) = − Σ_{i∈M} y_i (β^t x_i + β_0) = − Σ_{i∈M} y_i f_{β,β_0}(x_i)

D(β, β_0) is non-negative.
D(β, β_0) is proportional to the distance of the misclassified points to the decision boundary defined by β_0 + β^t x = 0.

Questions:
Is there a unique β, β_0 which minimizes D(β, β_0) (disregarding re-scaling of β and β_0)?
Can we say anything about the form of D(β, β_0)?


Perceptron Learning: Optimizing the Objective Function

The gradient, assuming a fixed M, is given by

∂D(β, β₀)/∂β = − Σ_{i∈M} yᵢxᵢ,   ∂D(β, β₀)/∂β₀ = − Σ_{i∈M} yᵢ

Stochastic gradient descent is used to minimize D(β, β₀), so an update step is made after each observation is visited.

Identify a misclassified example wrt the current estimate of β and β₀ and make the update

β ← β + ρ yᵢxᵢ   and   β₀ ← β₀ + ρ yᵢ

where ρ is the learning rate.

Repeat this step until no points are misclassified.
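The full loop can be sketched as follows (a minimal pure-Python version on toy data of my own, not the lecture's code): sweep through the data and apply the update above to each misclassified point, stopping after the first complete pass with no mistakes.

```python
def perceptron_train(X, y, rho=1.0, max_epochs=100):
    """Perceptron learning via stochastic gradient steps on D(beta, beta0)."""
    p = len(X[0])
    beta, beta0 = [0.0] * p, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            score = sum(b * xj for b, xj in zip(beta, xi)) + beta0
            if yi * score <= 0:               # misclassified (or on the boundary)
                beta = [b + rho * yi * xj for b, xj in zip(beta, xi)]
                beta0 += rho * yi
                mistakes += 1
        if mistakes == 0:                     # no point misclassified: done
            break
    return beta, beta0                        # may not have converged if non-separable

# Linearly separable toy data: positives have x1 > 0, negatives x1 < 0.
X = [(2.0, 1.0), (1.0, -1.0), (-1.5, 0.5), (-2.0, -1.0)]
y = [+1, +1, -1, -1]
beta, beta0 = perceptron_train(X, y)
# Every training point now satisfies y_i * (beta^t x_i + beta0) > 0.
```

Note the `max_epochs` cap: as the slides point out, on non-separable data the loop would otherwise never terminate.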


Perceptron Learning: An Example

Want to find a separating hyperplane between the red and blue points.

Perceptron Learning: One Iteration

[Figure: the current estimate β⁽⁰⁾; a point misclassified by β⁽⁰⁾; the gradient at that point is used to get β⁽¹⁾.]

Perceptron Learning: Sequence of iterations

[Figures: the successive estimates β⁽²⁾, β⁽³⁾, . . . , β⁽¹⁷⁾, one per update, until all points are correctly classified.]

Is this the best separating hyperplane we could have found?

Perceptron Learning Algorithm: Properties

Pros

If the classes are linearly separable, the algorithm converges to a separating hyperplane in a finite number of steps.

Cons

All separating hyperplanes are considered equally valid.
The one found depends on the initial guess for β and β₀.
The finite number of steps can be very large.
If the data is non-separable, the algorithm will not converge.


Optimal Separating Hyperplanes

Consider the problem of finding a separating hyperplane for a linearly separable dataset {(x₁, y₁), (x₂, y₂), . . . , (xₙ, yₙ)} with xᵢ ∈ ℝᵖ and yᵢ ∈ {−1, 1}. Which of the infinite number of separating hyperplanes should we choose?

Intuitively, a hyperplane that passes too close to the training examples will be sensitive to noise and, therefore, less likely to generalize well for data outside the training set. Instead, it seems reasonable to expect that a hyperplane that is farthest from all the training examples will have better generalization capabilities.

Therefore, the optimal separating hyperplane is the one with the largest margin: it separates the two classes and maximizes the distance to the closest point from either class [Vapnik, 1996]. This provides a unique definition of the separating hyperplane and a decision boundary that generalizes well.

[Figure: a bad hyperplane passing too close to both classes versus a better hyperplane far away from all training points, with the maximum margin marked.]


Stating the optimization problem

A first attempt:

max_{β, β₀, ‖β‖=1} M   subject to   yᵢ(βᵗxᵢ + β₀) ≥ M, i = 1, . . . , n

The conditions ensure all the training points are a signed distance at least M from the decision boundary defined by β and β₀.

Want to find the largest such M and its associated β and β₀.

Stating the optimization problem

Remove the constraint ‖β‖ = 1 by adjusting the constraints on the training data as follows:

max_{β, β₀} M   subject to   yᵢ(βᵗxᵢ + β₀) ≥ M ‖β‖, i = 1, . . . , n

For any β and β₀ fulfilling the above constraints, λβ and λβ₀ with λ > 0 also fulfill the constraints.

Therefore we can arbitrarily set ‖β‖ = 1/M.

Then the above optimization problem is equivalent to

min_{β, β₀} ½ ‖β‖²   subject to   yᵢ(βᵗxᵢ + β₀) ≥ 1, i = 1, . . . , n
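The scale-invariance claim is easy to check numerically. In this sketch (toy data and names are my own assumptions), the geometric margin min_i yᵢ(βᵗxᵢ + β₀)/‖β‖ is computed before and after rescaling (β, β₀) by λ = 3:

```python
from math import sqrt

def geometric_margin(X, y, beta, beta0):
    """Smallest signed distance of any training point to the hyperplane."""
    norm = sqrt(sum(b * b for b in beta))
    return min(yi * (sum(b * xj for b, xj in zip(beta, xi)) + beta0) / norm
               for xi, yi in zip(X, y))

X = [(2.0, 1.0), (-1.0, -0.5), (1.0, 2.0)]
y = [+1, -1, +1]
beta, beta0 = (1.0, 0.5), -0.25     # some hyperplane separating the toy data

m1 = geometric_margin(X, y, beta, beta0)
m2 = geometric_margin(X, y, [3.0 * b for b in beta], 3.0 * beta0)
print(m1, m2)   # identical: the margin is invariant under rescaling
```

This is exactly why the normalization ‖β‖ = 1/M can be imposed without loss of generality.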


"A*B0?-$?/,"-1*CDE

Stating the optimization problem

"+"-%&).%&+(/0"#1&2%)34&%,5/%44&")&(4&(&67#$)"*#&*6&
With this formulation of the problem
&:"(4&*6&).%&4%5(/()"#0&.;5%/52(#%

Optimal separating hyperplanes

.+"/0%+1.%2)(+'-*.%&.+3..-%'%4#)-+%5%'-2%'%46'-.%730&8%)(
=
3 min
5 !&

, 0

1
kk2 subject to yi ( t xi + 0 ) 1, i = 1, . . . , n
2

ress the margin in terms of w and b of the separating hyperplane.

$'6%1/4."46'-.%1'(%)-:)-)+.%(#6;+)#-(%&/%()$46/%(*'6)-,%+1.%
(0%3.%*1##(.%+1.%(#6;+)#-%:#"%31)*1%+1.%2)(*")$)-'-+%
The margin has thickness 1/kk as shown
distance between a point x and a plane (w, b)
.%:#"%+1.%+"')-)-,%.5'$46.(%*6#(.(+%+#%+1.%&#;-2'"/
slightly different).
3 5) ! & " @
=

,=

1.%!"#$#%!"&'1/4."46'-.

A
3

*.%:"#$%+1.%*6#(.(+%
2'"/%)(
3=5 ! &
3

"

$.(
$"

A
3

@
3

&
3

3=5 ! &
3

,<

in figure
(notation
|wT x+b|

is

The solution to this constrained optimization problem

min_{β, β₀} ½ ‖β‖²   subject to   yᵢ(βᵗxᵢ + β₀) ≥ 1, i = 1, . . . , n

This is a convex optimization problem: a quadratic objective function with linear inequality constraints.

Its associated primal Lagrangian function is

L_p(β, β₀, α) = ½ ‖β‖² + Σᵢ₌₁ⁿ αᵢ (1 − yᵢ(xᵢᵗβ + β₀))

and β∗ and β₀∗ form a minimum point of the cost function stated at the top if...


The solution to this constrained optimization problem

min_{β, β₀} ½ ‖β‖²   subject to   yᵢ(βᵗxᵢ + β₀) ≥ 1, i = 1, . . . , n

The Karush–Kuhn–Tucker conditions state that β₁ = (β₀∗, β∗) is a minimum of this cost function iff there exists a unique α∗ s.t.

1. ∇_{β₁} L_p(β₁, α∗) = 0
2. αⱼ∗ ≥ 0 for j = 1, . . . , n
3. αⱼ∗ (1 − yⱼ(β₀∗ + xⱼᵗβ∗)) = 0 for j = 1, . . . , n
4. (1 − yⱼ(β₀∗ + xⱼᵗβ∗)) ≤ 0 for j = 1, . . . , n

Plus positive definiteness constraints on ∇_{β₁}∇_{β₁} L_p(β₁, α∗).

Let's check what the KKT conditions imply

Active constraints and inactive constraints:

Let A be the set of indices with αⱼ∗ > 0, then

L_p(β₁, α∗) = ½ ‖β∗‖² + Σ_{j∈A} αⱼ∗ (1 − yⱼ(β₀∗ + xⱼᵗβ∗)).

Condition KKT 1, ∇_{β₁} L_p(β₁, α∗) = 0, implies

β∗ = Σ_{j∈A} αⱼ∗ yⱼ xⱼ   and   0 = Σ_{j∈A} αⱼ∗ yⱼ

Condition KKT 3, αⱼ∗ (1 − yⱼ(β₀∗ + xⱼᵗβ∗)) = 0, implies

yⱼ(β₀∗ + xⱼᵗβ∗) = 1 for all j ∈ A,
and if yᵢ(β₀∗ + xᵢᵗβ∗) > 1 then αᵢ∗ = 0, i.e. i ∉ A.

Therefore L_p(β₁, α∗) = ½ ‖β∗‖².
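As a sanity check (the toy data and the claimed optimum here are my own, not the lecture's), the conditions above can be verified numerically for the two-point problem x₁ = (1, 0), y₁ = +1, x₂ = (−1, 0), y₂ = −1, whose maximal margin hyperplane is β∗ = (1, 0), β₀∗ = 0 with multipliers α∗ = (½, ½):

```python
X = [(1.0, 0.0), (-1.0, 0.0)]
y = [+1, -1]
alpha = [0.5, 0.5]                 # claimed multipliers; both constraints active, A = {1, 2}

# KKT 1: beta* = sum_j alpha_j y_j x_j
beta = [sum(a * yi * xj[d] for a, yi, xj in zip(alpha, y, X)) for d in range(2)]
beta0 = 0.0

assert beta == [1.0, 0.0]
assert sum(a * yi for a, yi in zip(alpha, y)) == 0.0    # sum_j alpha_j y_j = 0
for xi, yi, a in zip(X, y, alpha):
    slack = 1 - yi * (beta0 + sum(b * xj for b, xj in zip(beta, xi)))
    assert a >= 0 and slack <= 0   # KKT 2 and KKT 4
    assert a * slack == 0.0        # KKT 3 (complementary slackness)
```

Both points sit exactly on the margin boundary yᵢ(β₀∗ + xᵢᵗβ∗) = 1, so both are support vectors.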


To summarize

As we have a convex optimization problem, it has one local minimum.

At this minimum β₁ there exists a unique α∗ s.t. β₁ and α∗ fulfill the KKT conditions.

Let A be the set of indices with αⱼ∗ > 0, then

if i ∈ A then yᵢ(β₀∗ + xᵢᵗβ∗) = 1 and therefore xᵢ lies on the boundary of the margin. Such an xᵢ is called a support vector.

And if i ∉ A then yᵢ(β₀∗ + xᵢᵗβ∗) > 1 and xᵢ lies outside of the margin.

β∗ is a linear combination of the support vectors:

β∗ = Σ_{j∈A} αⱼ∗ yⱼ xⱼ


To summarize

[Figure: the maximal margin hyperplane with its support vectors marked.]

Thus the SVM in fact depends only on a small subset of the training points, the support vectors:

β∗ = Σ_{j∈A} αⱼ∗ yⱼ xⱼ

How do I calculate α∗?

You have seen that the optimal β∗ is a weighted sum of the support vectors. But how can we calculate these weights?

The most common approach is to solve the dual Lagrange problem as opposed to the primal Lagrange problem. (The solutions to these problems are the same because of the original quadratic cost function and linear inequality constraints.)

This dual problem is an easier constrained optimization problem and is also convex. It has the form

max_α Σᵢ₌₁ⁿ αᵢ − ½ Σᵢ₌₁ⁿ Σₖ₌₁ⁿ αᵢ αₖ yᵢ yₖ xᵢᵗxₖ   subject to   αᵢ ≥ 0 ∀i and Σᵢ₌₁ⁿ αᵢ yᵢ = 0
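For intuition, the dual can be solved by hand on the same hypothetical two-point problem used earlier (my own toy example, not the lecture's): with x₁ = (1, 0), y₁ = +1 and x₂ = (−1, 0), y₂ = −1, the constraint Σᵢ αᵢyᵢ = 0 forces α₁ = α₂ = a, the objective reduces to W(a) = 2a − 2a², and a brute-force scan recovers the maximizer a = ½ and with it β∗:

```python
X = [(1.0, 0.0), (-1.0, 0.0)]
y = [+1, -1]

def dual(alpha):
    """Dual objective: sum_i alpha_i - 1/2 sum_i sum_k alpha_i alpha_k y_i y_k x_i^t x_k."""
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    quad = sum(alpha[i] * alpha[k] * y[i] * y[k] * dot(X[i], X[k])
               for i in range(2) for k in range(2))
    return sum(alpha) - 0.5 * quad

grid = [i / 1000 for i in range(1001)]            # scan a in [0, 1]
best = max(grid, key=lambda a: dual([a, a]))      # alpha_1 = alpha_2 = a
beta = [sum(best * y[i] * X[i][d] for i in range(2)) for d in range(2)]

print(best, beta)    # the maximizing multiplier and the recovered beta*
```

In practice the dual is solved with a quadratic programming solver rather than a grid, but the recovered solution is the same: α∗ = (½, ½) and β∗ = Σⱼ αⱼ∗ yⱼ xⱼ = (1, 0).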
