Linear Methods for Classification


DD3364

Introduction

Want to learn a predictor G : R^p → 𝒢 = {1, . . . , K}.

G divides the input space into regions labelled according to their classification. The boundaries between these regions are the decision boundaries. Here we consider methods for which these decision boundaries are linear.


Learn a discriminant function δ_k(x) for each class k and set

G(x) = arg max_k δ_k(x)

The decision boundaries are linear if, for some monotone transformation g,

g(δ_k(x)) = β_{k0} + β_k^T x

Example 1: Fit a linear regression model to the class indicator variables and set

δ_k(x) = β_{k0} + β_k^T x

Example 2: In two-class logistic regression the posteriors are

P(G = 1 | X = x) = exp(β_0 + β^T x) / (1 + exp(β_0 + β^T x))

P(G = 2 | X = x) = 1 / (1 + exp(β_0 + β^T x))

The log-odds log[P(G = 1 | X = x) / P(G = 2 | X = x)] = β_0 + β^T x is linear in x, so the decision boundary is the hyperplane β_0 + β^T x = 0.

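The two-class logistic posteriors above can be evaluated directly. A minimal sketch with arbitrary (made-up) coefficients β_0 and β:

```python
import numpy as np

def posterior_two_class(x, beta0, beta):
    """P(G=1|X=x) and P(G=2|X=x) under the two-class logistic model."""
    z = beta0 + beta @ x
    p1 = np.exp(z) / (1.0 + np.exp(z))
    return p1, 1.0 - p1

# With these coefficients z = -1 + 0.5*1 + 0.25*2 = 0, so both posteriors are 0.5.
p1, p2 = posterior_two_class(np.array([1.0, 2.0]), -1.0, np.array([0.5, 0.25]))
print(p1, p2)   # → 0.5 0.5
```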

For a two-class problem with p-dimensional inputs, two classical approaches are the perceptron learning algorithm of Rosenblatt and the SVM model and algorithm of Vapnik.

In the forms quoted, both these algorithms find separating hyperplanes if they exist, and fail if the points are not linearly separable.

There are fixes for the non-separable case, but we will not cover them here.

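As a sketch of the separating-hyperplane idea (this is my own minimal construction, not the lecture's algorithm), Rosenblatt-style perceptron updates find a separating hyperplane w·x + b = 0 when one exists, and never terminate with a solution otherwise:

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Find a separating hyperplane for labels y in {-1, +1}.

    Returns (w, b), or None if no separator is found within max_epochs
    (e.g. when the data are not linearly separable)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:      # misclassified: nudge the hyperplane
                w += yi * xi
                b += yi
                mistakes += 1
        if mistakes == 0:                   # a full pass with no mistakes: separated
            return w, b
    return None

# Toy separable data: two clusters on either side of the origin.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
print(all(yi * (w @ xi + b) > 0 for xi, yi in zip(X, y)))   # → True
```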

Can expand the variable set X1, X2, . . . , Xp by including their squares and cross-products. Linear decision boundaries in the augmented space map down to quadratic decision boundaries in the original space.

FIGURE 4.1. The left plot shows some data from three classes, with linear decision boundaries found by linear discriminant analysis. The right plot shows quadratic decision boundaries in the original space, obtained from linear boundaries in the augmented five-dimensional space.


Have training data {(x_i, g_i)}_{i=1}^n where each x_i ∈ R^p and g_i ∈ {1, . . . , K}.

For each class k = 1, . . . , K:

1. For i = 1, . . . , n set

   y_i = 0 if g_i ≠ k,  y_i = 1 if g_i = k

2. Compute (β̂_{k0}, β̂_k) = arg min_{β_0, β_k} Σ_{i=1}^n (y_i − β_0 − β_k^T x_i)²

3. Define δ̂_k(x) = β̂_{k0} + β̂_k^T x

Classify a new point x with

Ĝ(x) = arg max_k δ̂_k(x)

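The steps above translate directly into one least-squares fit per class indicator. A minimal sketch (the helper names are mine):

```python
import numpy as np

def fit_indicator_regression(X, g, K):
    """One least-squares fit per class indicator; returns a (K, p+1) coefficient matrix."""
    A = np.column_stack([np.ones(len(X)), X])      # prepend intercept column
    B = np.empty((K, A.shape[1]))
    for k in range(1, K + 1):
        y = (g == k).astype(float)                 # y_i = 1 iff g_i = k
        B[k - 1], *_ = np.linalg.lstsq(A, y, rcond=None)
    return B

def classify(B, x):
    """G_hat(x) = argmax_k delta_hat_k(x)."""
    scores = B @ np.concatenate([[1.0], x])
    return 1 + int(np.argmax(scores))

# Tiny two-class example: clusters near the origin (class 1) and near (5, 5) (class 2).
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
g = np.array([1, 1, 2, 2])
B = fit_indicator_regression(X, g, K=2)
c1 = classify(B, np.array([0.2, 0.5]))
c2 = classify(B, np.array([5.5, 5.0]))
print(c1, c2)   # → 1 2
```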

3-class example

The training data come from 3 classes. For each k construct the response vectors from the class labels, and fit the discriminant functions δ_1(x), δ_2(x), δ_3(x) for the above 3 classes. [Plots of the training data and the fitted δ_k(x) not reproduced.]

In this last example masking has occurred.

This occurs because of the rigid nature of the linear

discriminant functions.

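Masking is easy to reproduce numerically. In this sketch (synthetic data, not the lecture's example) three classes are strung along a line; the middle class's fitted indicator-regression line is nearly flat and is almost never the maximum, so class 2 is rarely predicted even on its own points:

```python
import numpy as np

rng = np.random.default_rng(0)
# Three classes along a line; class 2 sits between classes 1 and 3.
X = np.concatenate([rng.normal(-4, 0.5, 50), rng.normal(0, 0.5, 50), rng.normal(4, 0.5, 50)])
g = np.repeat([1, 2, 3], 50)
A = np.column_stack([np.ones_like(X), X])            # design matrix [1, x]

deltas = []
for k in (1, 2, 3):
    y = (g == k).astype(float)                       # indicator response for class k
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    deltas.append(A @ beta)                          # fitted delta_k at every x_i

pred = 1 + np.argmax(np.column_stack(deltas), axis=1)
masked_frac = (pred[g == 2] == 2).mean()             # how often class 2 wins on its own points
print(masked_frac)                                   # close to 0: class 2 is masked
```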

To perform optimal classification we need to know P(G | X). Let f_k(x) represent the class-conditional density P(X | G = k) and π_k be the prior probability of class k, with Σ_{k=1}^K π_k = 1. Bayes' theorem gives

P(G = k | X = x) = f_k(x) π_k / Σ_{l=1}^K f_l(x) π_l

so having the f_k(x) is almost equivalent to having P(G = k | X = x).

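Bayes' rule above is just a prior-weighted normalization of the class-conditional densities. A sketch (the 1-D Gaussian class-conditionals here are hypothetical):

```python
import numpy as np
from math import exp, pi, sqrt

def posterior(x, class_densities, priors):
    """P(G=k | X=x) = f_k(x) pi_k / sum_l f_l(x) pi_l."""
    w = np.array([f(x) * p for f, p in zip(class_densities, priors)])
    return w / w.sum()

# Two unit-variance Gaussian class-conditionals centred at -1 and +1, equal priors.
norm = lambda mu: (lambda x: exp(-0.5 * (x - mu) ** 2) / sqrt(2 * pi))
post = posterior(0.0, [norm(-1.0), norm(1.0)], [0.5, 0.5])
print(post)   # x = 0 is equidistant from both means, so both posteriors are 0.5
```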

Many methods are based on specific models for f_k(x):

- linear and quadratic discriminant analysis use Gaussian densities,
- mixtures of Gaussians allow for nonlinear decision boundaries,
- nonparametric density estimates allow the most flexibility,
- naive Bayes assumes the class densities factorize: f_k(X) = Π_{j=1}^p f_kj(X_j).


Model each f_k(x) as a multivariate Gaussian

f_k(x) = (2π)^{−p/2} |Σ_k|^{−1/2} exp{−½ (x − μ_k)^T Σ_k^{−1} (x − μ_k)}

Linear Discriminant Analysis (LDA) arises in the special case when the classes are normally distributed with equal covariance matrices, Σ_k = Σ for all k. One then gets linear decision boundaries. [Figure panels: class distributions, decision boundary, partition.]


LDA

Can see this as

log [ P(G = k | X = x) / P(G = l | X = x) ]
  = log( f_k(x) / f_l(x) ) + log( π_k / π_l )
  = log( π_k / π_l ) − ½ μ_k^T Σ^{−1} μ_k + ½ μ_l^T Σ^{−1} μ_l + x^T Σ^{−1} (μ_k − μ_l)
  = x^T a + b,

a linear function of x. The equal covariance matrices allow the x^T Σ_k^{−1} x and x^T Σ_l^{−1} x terms to cancel. The linear discriminant functions

δ_k(x) = x^T Σ^{−1} μ_k − ½ μ_k^T Σ^{−1} μ_k + log π_k

are an equivalent description of the decision rule with

G(x) = arg max_k δ_k(x)

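The discriminant functions δ_k(x) translate directly into code. A sketch, assuming the parameters μ_k, Σ, π_k are known:

```python
import numpy as np

def lda_discriminants(x, mus, Sigma, priors):
    """delta_k(x) = x' S^-1 mu_k - 0.5 mu_k' S^-1 mu_k + log pi_k, for each k."""
    Sinv = np.linalg.inv(Sigma)
    return np.array([x @ Sinv @ m - 0.5 * m @ Sinv @ m + np.log(p)
                     for m, p in zip(mus, priors)])

# Two spherical classes centred at (-1, 0) and (1, 0), equal priors.
mus = [np.array([-1.0, 0.0]), np.array([1.0, 0.0])]
d = lda_discriminants(np.array([2.0, 0.0]), mus, np.eye(2), [0.5, 0.5])
print(1 + int(np.argmax(d)))   # → 2: x = (2, 0) is on class 2's side
```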

In practice we don't know the parameters of the Gaussian distributions and estimate these from the training data. Let n_k be the number of class k observations; then

π̂_k = n_k / n

μ̂_k = Σ_{g_i = k} x_i / n_k

Σ̂ = Σ_{k=1}^K Σ_{g_i = k} (x_i − μ̂_k)(x_i − μ̂_k)^T / (n − K)
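These plug-in estimates take only a few lines. A sketch, assuming labels in {1, . . . , K}:

```python
import numpy as np

def estimate_lda_params(X, g, K):
    """Plug-in estimates: priors pi_k, class means mu_k, pooled covariance Sigma."""
    n, p = X.shape
    priors = np.array([(g == k).mean() for k in range(1, K + 1)])   # pi_k = n_k / n
    mus = [X[g == k].mean(axis=0) for k in range(1, K + 1)]         # class means
    Sigma = np.zeros((p, p))
    for k in range(1, K + 1):
        D = X[g == k] - mus[k - 1]                                  # centred class-k points
        Sigma += D.T @ D
    return priors, mus, Sigma / (n - K)                             # pooled, divided by n - K

# Tiny check: two classes, each varying only in the second coordinate.
X = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
g = np.array([1, 1, 2, 2])
priors, mus, Sigma = estimate_lda_params(X, g, K=2)
print(priors, mus[0], Sigma[1, 1])   # → [0.5 0.5] [0. 1.] 2.0
```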

Bivariate example (QDA)

If the Σ_k are not assumed equal, we get the quadratic discriminant functions (QDA)

δ_k(x) = −½ log|Σ_k| − ½ (x − μ_k)^T Σ_k^{−1} (x − μ_k) + log π_k

Have a two-class problem with specified Gaussian means, covariances and priors. [Figure panels: class distributions, decision boundaries, partition.]

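The quadratic discriminant function is a one-liner per class. A sketch with an invented two-class setup (unequal covariances, so the boundary is genuinely quadratic):

```python
import numpy as np

def qda_discriminant(x, mu, Sigma, prior):
    """delta_k(x) = -0.5 log|S_k| - 0.5 (x-mu_k)' S_k^-1 (x-mu_k) + log pi_k."""
    d = x - mu
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * d @ np.linalg.inv(Sigma) @ d
            + np.log(prior))

# Class 1: mean (0,0), identity covariance.  Class 2: mean (3,0), twice the spread.
x = np.array([0.0, 0.0])
d1 = qda_discriminant(x, np.array([0.0, 0.0]), np.eye(2), 0.5)
d2 = qda_discriminant(x, np.array([3.0, 0.0]), 2 * np.eye(2), 0.5)
print(int(d1 > d2))   # → 1: x sits at class 1's mean, so class 1 wins
```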

Left plot shows the quadratic decision boundaries found using LDA in the five-dimensional space X1, X2, X1², X2², X1X2.

FIGURE 4.6. Two methods for fitting quadratic boundaries. The left plot shows the quadratic decision boundaries for the data in Figure 4.1 (obtained using LDA in the five-dimensional space X1, X2, X1X2, X1², X2²). The right plot shows the quadratic decision boundaries found by QDA. The differences are small, as is usually the case.

For LDA we only need the differences δ_k(x) − δ_K(x) between the discriminant functions for some fixed class K (we have chosen the last), and each difference requires p + 1 parameters. These methods can be surprisingly effective.

Can explain this as follows.

Have K centroids in a p-dimensional input space: μ_1, . . . , μ_K.

These centroids define a (K − 1)-dimensional affine subspace H_{K−1}:

u = μ_1 + α_1 (μ_2 − μ_1) + α_2 (μ_3 − μ_1) + · · · + α_{K−1} (μ_K − μ_1)
  = μ_1 + α_1 d_1 + α_2 d_2 + · · · + α_{K−1} d_{K−1}

Any x can be written as

x = μ_1 + α_1 d_1 + α_2 d_2 + · · · + α_{K−1} d_{K−1} + x_⊥,  where x_⊥ ⊥ H_{K−1}.

Then the distance to each centroid is

‖x − μ_j‖ = ‖μ_1 + α_1 d_1 + α_2 d_2 + · · · + α_{K−1} d_{K−1} + x_⊥ − μ_j‖

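The claim that the orthogonal component x_⊥ does not change which centroid is closest can be checked numerically. A sketch with made-up centroids spanning the z = 0 plane:

```python
import numpy as np

def project_to_centroid_subspace(x, mus):
    """Orthogonal projection of x onto the affine subspace through the centroids."""
    mu1 = mus[0]
    D = np.array([m - mu1 for m in mus[1:]]).T     # direction vectors d_1 .. d_{K-1}
    Q, _ = np.linalg.qr(D)                         # orthonormal basis of span(D)
    return mu1 + Q @ Q.T @ (x - mu1)

# Three centroids spanning the z = 0 plane in R^3.
mus = [np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
x = np.array([0.3, 0.2, 5.0])                      # large component off the subspace
xp = project_to_centroid_subspace(x, mus)

nearest = lambda y: np.argmin([np.linalg.norm(y - m) for m in mus])
print(nearest(x) == nearest(xp))   # → True: x_perp doesn't affect the closest centroid
```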

To summarize:

K centroids in a p-dimensional input space lie in an affine subspace of dimension at most K − 1. To locate the closest centroid we can ignore the directions orthogonal to this subspace, so the effective dimension drops from p to K − 1.

If K > 3 we can ask the question: which subspace of dimension less than K − 1 should we project onto for optimality w.r.t. LDA?

[Figure: data from classes with 10-dimensional input vectors, projected onto two coordinates. The bold dots correspond to the centroids projected onto the top 2 principal directions.]


To find the sequence of optimal subspaces for LDA:

1. Compute the K × p matrix of class centroids M and the common covariance matrix W (the within-class variance).

2. Compute M* = M W^{−1/2} using the eigen-decomposition of W.

3. Compute B*, the covariance matrix of M* (the between-class variance). B*'s eigen-decomposition is B* = V* D_B V*^T. The columns v*_l of V* define the basis of the optimal subspace.

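The three steps can be sketched as follows (my construction; W is assumed symmetric positive definite, and the returned columns are the discriminant directions v_l = W^{−1/2} v*_l expressed in the original coordinates):

```python
import numpy as np

def lda_directions(mus, W):
    """Discriminant directions from the whitened between-class covariance."""
    M = np.array(mus)                                    # K x p centroid matrix
    vals, vecs = np.linalg.eigh(W)
    W_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T   # W^(-1/2), W assumed SPD
    Mstar = M @ W_inv_sqrt                               # whitened centroids M*
    Bstar = np.cov(Mstar.T, bias=True)                   # between-class covariance B*
    evals, V = np.linalg.eigh(Bstar)
    order = np.argsort(evals)[::-1]                      # largest eigenvalue first
    return W_inv_sqrt @ V[:, order]                      # columns: v_l = W^(-1/2) v*_l

# Three collinear centroids along the x-axis, identity within-class covariance:
# the leading discriminant direction should be the x-axis itself.
mus = [np.array([0.0, 0.0]), np.array([2.0, 0.0]), np.array([4.0, 0.0])]
V = lda_directions(mus, np.eye(2))
print(np.round(np.abs(V[:, 0]), 3))   # leading direction ≈ the x-axis
```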


FIGURE 4.8. Four projections onto pairs of canonical variates. Notice that as the rank of the canonical variates increases, the projected centroids become less spread out. In the lower right panel they appear to be superimposed, and the classes most confused.

Fisher arrived at this decomposition via a different route. He posed the problem:

Find the linear combination Z = a^T X such that the between-class variance is maximized relative to the within-class variance.

[Figure: two panels showing two class clouds with centroids marked +.] Although the line joining the centroids defines the direction of greatest centroid spread, the projected data overlap because of the covariance (left panel). The discriminant direction minimizes this overlap for Gaussian data (right panel).

W is the common (pooled within-class) covariance matrix of the original data X.

B is the covariance matrix of the centroid matrix M.

Then for the projected data Z, Fisher's problem is

max_a (a^T B a) / (a^T W a)

or equivalently

max_a a^T B a   subject to   a^T W a = 1

Fisher's problem amounts to maximizing the Rayleigh quotient

a_1 = arg max_a (a^T B a) / (a^T W a)

The solution is a_1 = W^{-1/2} v_1, where v_1 is the eigenvector corresponding to the largest eigenvalue of W^{-1/2} B W^{-1/2} (equivalently, of W^{-1} B).

Can find the next direction a_2:

a_2 = arg max_a (a^T B a) / (a^T W a)   subject to   a^T W a_1 = 0

Once again a_2 = W^{-1/2} v_2.

In a similar fashion can find a_3, a_4, ...
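The recipe above is easy to put into code. The following is a minimal numpy sketch (an illustration, not from the lecture; the function name is a choice made here): it forms W and B from labelled data, whitens by W^{-1/2}, and takes the top eigenvectors of W^{-1/2} B W^{-1/2} to obtain a_1, a_2, ...

```python
import numpy as np

def fisher_directions(X, g, k=2):
    """Sketch: compute the first k discriminant (canonical) directions by
    maximizing the Rayleigh quotient a^T B a / a^T W a, via a_m = W^{-1/2} v_m
    with v_m the top eigenvectors of W^{-1/2} B W^{-1/2}.

    X : (n, p) data matrix, g : (n,) integer class labels.
    """
    classes = np.unique(g)
    mu = X.mean(axis=0)
    W = np.zeros((X.shape[1],) * 2)      # pooled within-class covariance
    B = np.zeros_like(W)                 # between-class (centroid) covariance
    for c in classes:
        Xc = X[g == c]
        d = Xc - Xc.mean(axis=0)
        W += d.T @ d
        m = Xc.mean(axis=0) - mu
        B += len(Xc) * np.outer(m, m)
    W /= len(X)
    B /= len(X)
    # symmetric inverse square root of W via its eigendecomposition
    ew, Ew = np.linalg.eigh(W)
    W_isqrt = Ew @ np.diag(1.0 / np.sqrt(ew)) @ Ew.T
    evals, V = np.linalg.eigh(W_isqrt @ B @ W_isqrt)
    order = np.argsort(evals)[::-1]      # largest eigenvalues first
    return W_isqrt @ V[:, order[:k]]     # columns are a_1, ..., a_k
```

On data whose class centroids differ only along the first coordinate, the leading direction returned is essentially that coordinate axis.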

Classification in Reduced Subspace

[Figure: scatter plot of the vowel training data in the subspace of the first two canonical variates (Canonical Coordinate 1 vs Canonical Coordinate 2), overlaid with linear decision boundaries.]

FIGURE 4.11. Decision boundaries for the vowel training data, in the two-dimensional subspace spanned by the first two canonical variates. Note that in any higher-dimensional subspace, the decision boundaries are higher-dimensional affine planes, and could not be represented as lines.

There are 11 classes with 10-dimensional input vectors. The decision boundaries shown are based on using basic linear discrimination in the low-dimensional space given by the first 2 canonical variates.

Logistic Regression

Arises from trying to model the posterior probabilities of the K classes via linear functions in x, while ensuring they remain valid probabilities that sum to one.

The model: for k = 1, ..., K − 1,

P(G = k | X = x) = exp(β_k0 + β_k^T x) / (1 + Σ_{l=1}^{K−1} exp(β_l0 + β_l^T x))

and for k = K,

P(G = K | X = x) = 1 / (1 + Σ_{l=1}^{K−1} exp(β_l0 + β_l^T x))
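The two formulas above can be checked numerically. Here is a small sketch (illustrative, with argument names chosen here) that evaluates all K posterior probabilities and confirms they sum to one:

```python
import numpy as np

def posterior_probs(x, B0, B):
    """Sketch of the K-class logistic model above.

    B0 : (K-1,) intercepts beta_k0;  B : (K-1, p) rows beta_k.
    Returns the length-K vector of P(G = k | X = x), which sums to one.
    """
    u = np.exp(B0 + B @ x)        # exp(beta_k0 + beta_k^T x) for k = 1..K-1
    denom = 1.0 + u.sum()
    return np.append(u / denom, 1.0 / denom)   # class K gets 1/denom
```

With all parameters zero, every class receives probability 1/K, as expected.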

The decision boundary between classes k and l,

{x : P(G = k | X = x) = P(G = l | X = x)},

is the same as

{x : (β_k0 − β_l0) + (β_k − β_l)^T x = 0}

for 1 ≤ k < K and 1 ≤ l < K.

To simplify notation let

θ = {β_10, β_1^T, β_20, β_2^T, ...}

and write

P(G = k | X = x) = p_k(x; θ)

Given n training samples (x_i, g_i), one usually fits the logistic regression model by maximum likelihood:

ℓ(θ) = log Π_{i=1}^n p_{g_i}(x_i; θ) = Σ_{i=1}^n log p_{g_i}(x_i; θ)

In the two-class case the following probabilities are being used (here x includes the constant term 1 to accommodate the intercept):

p_1(x; θ) = exp(θ^T x) / (1 + exp(θ^T x))   and   p_2(x; θ) = 1 − p_1(x; θ)

A convenient way to write the likelihood for one sample (x_i, g_i) is: code the two-class g_i as a {0, 1} response y_i where

y_i = 1 if g_i = 1,   y_i = 0 if g_i = 2

Then one can write

log p_{g_i}(x_i; θ) = y_i log p_1(x_i; θ) + (1 − y_i) log(1 − p_1(x_i; θ))

The log-likelihood of the data becomes

ℓ(θ) = Σ_{i=1}^n [ y_i log p_1(x_i; θ) + (1 − y_i) log(1 − p_1(x_i; θ)) ]
     = Σ_{i=1}^n [ y_i θ^T x_i − y_i log(1 + e^{θ^T x_i}) − (1 − y_i) log(1 + e^{θ^T x_i}) ]
     = Σ_{i=1}^n [ y_i θ^T x_i − log(1 + e^{θ^T x_i}) ]

To maximize, set the derivative to zero:

∂ℓ(θ)/∂θ = Σ_{i=1}^n [ x_i y_i − x_i exp(θ^T x_i) / (1 + exp(θ^T x_i)) ]
         = Σ_{i=1}^n x_i (y_i − p_1(x_i; θ)) = 0

Must solve iteratively, and in the book they use the Newton–Raphson algorithm.

Newton–Raphson requires both the gradient

∂ℓ(θ)/∂θ = Σ_{i=1}^n x_i (y_i − p_1(x_i; θ))

and the Hessian

∂²ℓ(θ)/∂θ ∂θ^T = − Σ_{i=1}^n x_i x_i^T p_1(x_i; θ)(1 − p_1(x_i; θ))

A Newton step is

θ^new = θ^old − ( ∂²ℓ(θ)/∂θ ∂θ^T )^{−1} ∂ℓ(θ)/∂θ

Write the Hessian and gradient in matrix notation. Let

X be the n × (p + 1) matrix with (1, x_i^T) on each row,
p = (p_1(x_1; θ^old), p_1(x_2; θ^old), ..., p_1(x_n; θ^old))^T,
W be the n × n diagonal matrix with ith diagonal element p_1(x_i; θ^old)(1 − p_1(x_i; θ^old)).

Then

∂ℓ(θ)/∂θ = X^T (y − p)   and   ∂²ℓ(θ)/∂θ ∂θ^T = −X^T W X

The Newton step is then

θ^new = θ^old + (X^T W X)^{−1} X^T (y − p)
      = (X^T W X)^{−1} X^T W (X θ^old + W^{−1}(y − p))
      = (X^T W X)^{−1} X^T W z

That is, each Newton step is a weighted least-squares fit

θ^new = arg min_θ (z − Xθ)^T W (z − Xθ)

with response

z = X θ^old + W^{−1}(y − p)

known as the adjusted response. Note that at each iteration W, p and z change; the algorithm is therefore called iteratively reweighted least squares (IRLS).
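The Newton/IRLS update above fits in a few lines of numpy. This is an illustrative sketch (the function name and arguments are chosen here, not from the lecture), assuming a two-class response coded 0/1:

```python
import numpy as np

def fit_logistic_irls(X, y, n_iter=20):
    """Fit two-class logistic regression by Newton-Raphson / IRLS.

    X : (n, p) design matrix WITHOUT the intercept column (added here).
    y : (n,) array of 0/1 responses.
    Returns theta of length p + 1 (intercept first).
    """
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the constant 1
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Xb @ theta)))     # p_1(x_i; theta)
        W = p * (1.0 - p)                           # diagonal of W
        z = Xb @ theta + (y - p) / np.maximum(W, 1e-12)  # adjusted response
        XtW = Xb.T * W                              # X^T W
        theta = np.linalg.solve(XtW @ Xb, XtW @ z)  # weighted least squares
    return theta
```

On data simulated from a logistic model the fitted coefficients recover the signs (and roughly the magnitudes) of the true parameters. Note that for perfectly separable data the MLE does not exist and the iterates diverge.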

A toy example

Use logistic regression to find a decision boundary.

[Figure: successive IRLS iterations; each point is drawn with size proportional to 1/W_ii.]

Logistic regression converges to this decision boundary.

The L1 penalty can be used for variable selection in logistic regression by maximizing a penalized version of the log-likelihood

max_{β_0, β}  Σ_{i=1}^n [ y_i (β_0 + β^T x_i) − log(1 + e^{β_0 + β^T x_i}) ] − λ Σ_{j=1}^p |β_j|

Note: the predictors should be standardized to ensure the penalty is meaningful.
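For illustration, the penalized objective above can be maximized with a proximal-gradient (ISTA-style) iteration, where the L1 penalty becomes a soft-thresholding step. This is a sketch of one possible solver, not the method the lecture prescribes (coordinate descent is more common in practice); it leaves the intercept unpenalized and works with the averaged log-likelihood:

```python
import numpy as np

def fit_l1_logistic(X, y, lam=0.08, lr=0.5, n_iter=3000):
    """ISTA-style sketch for L1-penalized logistic regression: maximizes
    (1/n) sum_i [y_i(b0 + b.x_i) - log(1 + e^{b0 + b.x_i})] - lam * ||b||_1.
    The intercept b0 is not penalized; X is assumed standardized.
    """
    n, p = X.shape
    b0, b = 0.0, np.zeros(p)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(b0 + X @ b)))   # fitted probabilities
        b0 += lr * np.sum(y - mu) / n              # plain gradient step
        b += lr * (X.T @ (y - mu)) / n             # gradient step on coefs
        b = np.sign(b) * np.maximum(np.abs(b) - lr * lam, 0.0)  # soft threshold
    return b0, b
```

The soft-thresholding step is what sets irrelevant coefficients exactly to zero, giving the variable selection effect.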

Separating Hyperplanes

In this section we describe separating hyperplane classifiers; we will only consider separable training data.

These classifiers construct linear decision boundaries that explicitly try to separate the data into different classes as well as possible.

A hyperplane is defined as

{x : β_0 + β^T x = 0}

Perceptrons

Perceptrons (Rosenblatt, 1958) set the foundations for the neural network models of the 1980s and 1990s.

Before we continue, let us digress slightly and review some vector algebra. Figure 4.15 depicts a hyperplane or affine set L defined by the equation f(x) = β_0 + β^T x = 0; since we are in R^2 this is a line.

[Figure 4.15: a point x, its projection x_0 onto the hyperplane β_0 + β^T x = 0, and the normal direction.]

Here we list some properties:

1. For any two points x_1 and x_2 lying in L, β^T (x_1 − x_2) = 0, and hence β* = β/‖β‖ is the vector normal to the surface of L.

2. For any point x_0 in L, β^T x_0 = −β_0.

3. The signed distance of any point x to L is given by

β*^T (x − x_0) = (β^T x + β_0)/‖β‖ = f(x)/‖f′(x)‖

Hence f(x) is proportional to the signed distance from x to the hyperplane defined by f(x) = 0.
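Property 3 is a one-liner in code. The sketch below (illustrative names chosen here) computes the signed distance of a point to the hyperplane {x : β_0 + β^T x = 0}:

```python
import numpy as np

def signed_distance(x, beta, beta0):
    """Signed distance from x to {x : beta0 + beta.x = 0}, i.e.
    (beta.x + beta0) / ||beta||; the sign tells which side x lies on."""
    return (beta @ x + beta0) / np.linalg.norm(beta)
```

For example, with beta = (0, 2) and beta0 = −2 (the line y = 1, with a non-unit normal), a point at height 3 is at signed distance +2 and a point on the line is at distance 0.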

Perceptron Learning

The perceptron learning algorithm tries to find a separating hyperplane by minimizing the distance of misclassified points to the decision boundary.

The Objective Function

Have labelled training data {(x_i, y_i)} with x_i ∈ R^p and y_i ∈ {−1, 1}.

A point x_i is misclassified if sign(β_0 + β^T x_i) ≠ y_i.

This can be re-stated as: a point x_i is misclassified if

y_i (β_0 + β^T x_i) < 0

The goal is to find β_0 and β which minimize

D(β, β_0) = − Σ_{i∈M} y_i (x_i^T β + β_0)

where M indexes the misclassified points.

Want to find β_0 and β which minimize

D(β, β_0) = − Σ_{i∈M} y_i (x_i^T β + β_0) = − Σ_{i∈M} y_i f_{β,β_0}(x_i)

D(β, β_0) is non-negative.

D(β, β_0) is proportional to the distance of the misclassified points to the decision boundary.

Questions: Is there a unique β, β_0 which minimizes D(β, β_0) (disregarding re-scaling of β and β_0)?

The gradient, assuming a fixed M, is given by

∂D(β, β_0)/∂β = − Σ_{i∈M} y_i x_i,   ∂D(β, β_0)/∂β_0 = − Σ_{i∈M} y_i

Rather than taking a full gradient step, the algorithm visits one misclassified point at a time and updates

β ← β + y_i x_i   and   β_0 ← β_0 + y_i

Repeat this step until no points are misclassified.
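The update rule above can be sketched as follows (an illustration; the function name and the cyclic visiting order are choices made here, and the loop terminates only for linearly separable data):

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Perceptron learning sketch: cycle through the data and, for each
    misclassified point (y_i * (b0 + b.x_i) <= 0), apply the update
    b <- b + y_i x_i,  b0 <- b0 + y_i, until an epoch makes no mistakes.

    X : (n, p) inputs, y : (n,) labels in {-1, +1}.
    """
    b, b0 = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (b0 + xi @ b) <= 0:   # misclassified (or on the boundary)
                b += yi * xi
                b0 += yi
                mistakes += 1
        if mistakes == 0:                 # converged: everything classified
            break
    return b, b0
```

On separable data the returned hyperplane classifies every training point correctly, but which hyperplane is found depends on the starting values and visiting order, as discussed next.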

[Figure: a run of the perceptron algorithm. Starting from the current estimate β(0), a point misclassified by β(0) is used to update to β(1), and so on through β(2), ..., β(17), until all points are correctly classified.]

Is this the best separating hyperplane we could have found?

Pros

If the classes are linearly separable, the algorithm converges to a separating hyperplane in a finite number of steps.

Cons

All separating hyperplanes are considered equally valid.
The one found depends on the initial guess for β and β_0.
The finite number of steps can be very large.
If the data is non-separable, the algorithm will not converge.

Optimal Separating Hyperplane

Consider the problem of finding a separating hyperplane for a linearly separable dataset {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} with x_i ∈ R^p and y_i ∈ {−1, 1}. Which of the infinitely many separating hyperplanes should we choose?

Bad: intuitively, a hyperplane that passes too close to the training examples will be sensitive to noise and, therefore, less likely to generalize well for data outside the training set.

Better: it seems reasonable to expect that a hyperplane that is far away from all training examples will have better generalization capabilities.

The optimal separating hyperplane separates the two classes and maximizes the distance to the closest point from either class [Vapnik, 1996]. This provides a unique definition of the separating hyperplane and leads to a decision boundary that generalizes well: choose the hyperplane which maximizes the margin, defined as the minimum distance of a training example to the decision boundary.

A first attempt:

max_{β, β_0, ‖β‖=1} M   subject to   y_i (β^T x_i + β_0) ≥ M, i = 1, ..., n

Remove the constraint ‖β‖ = 1 by adjusting the constraints to y_i (β^T x_i + β_0) ≥ M‖β‖; setting ‖β‖ = 1/M, the above optimization problem is equivalent to

min_{β, β_0} (1/2)‖β‖²   subject to   y_i (β^T x_i + β_0) ≥ 1, i = 1, ..., n

"A*B0?-$?/,"-1*CDE

"+"-%&).%&+(/0"#1&2%)34&%,5/%44&")&(4&(&67#$)"*#&*6&

With this formulation of the problem

&:"(4&*6&).%&4%5(/()"#0&.;5%/52(#%

.+"/0%+1.%2)(+'-*.%&.+3..-%'%4#)-+%5%'-2%'%46'-.%730&8%)(

=

3 min

5 !&

, 0

1

kk2 subject to yi ( t xi + 0 ) 1, i = 1, . . . , n

2

$'6%1/4."46'-.%1'(%)-:)-)+.%(#6;+)#-(%&/%()$46/%(*'6)-,%+1.%

(0%3.%*1##(.%+1.%(#6;+)#-%:#"%31)*1%+1.%2)(*")$)-'-+%

The margin has thickness 1/kk as shown

distance between a point x and a plane (w, b)

.%:#"%+1.%+"')-)-,%.5'$46.(%*6#(.(+%+#%+1.%&#;-2'"/

slightly different).

3 5) ! & " @

=

,=

1.%!"#$#%!"&'1/4."46'-.

A

3

*.%:"#$%+1.%*6#(.(+%

2'"/%)(

3=5 ! &

3

"

$.(

$"

A

3

@

3

&

3

3=5 ! &

3

,<

in figure

(notation

|wT x+b|

is

min_{β, β_0} (1/2)‖β‖²   subject to   y_i (β^T x_i + β_0) ≥ 1, i = 1, ..., n

This is a convex optimization problem: a quadratic objective with linear inequality constraints. The Lagrange (primal) function is

L_p(β, β_0, α) = (1/2)‖β‖² + Σ_{i=1}^n α_i (1 − y_i (β^T x_i + β_0))

The Karush–Kuhn–Tucker conditions state that β* = (β_0*, β*) is a minimum of this cost function if there exists a unique α* such that

1. ∇_{β*} L_p(β*, α*) = 0
2. α_j* ≥ 0 for j = 1, ..., n
3. α_j* (1 − y_j (β_0* + x_j^T β*)) = 0 for j = 1, ..., n
4. (1 − y_j (β_0* + x_j^T β*)) ≤ 0 for j = 1, ..., n

Active constraints and inactive constraints:

Let A be the set of indices with α_j* > 0 (the active constraints). Then

L_p(β*, α*) = (1/2)‖β*‖² + Σ_{j∈A} α_j* (1 − y_j (β_0* + x_j^T β*))

Setting the derivatives with respect to β and β_0 to zero gives

β* = Σ_{j∈A} α_j* y_j x_j   and   0 = Σ_{j∈A} α_j* y_j

For j ∉ A, α_j* = 0, and each active term vanishes by complementarity, so L_p(β*, α*) = (1/2)‖β*‖².

To summarize:

As we have a convex optimization problem, it has one local (hence global) minimum.

If i ∈ A then y_i (β_0* + x_i^T β*) = 1 and x_i lies on the boundary of the margin; x_i is called a support vector.

If i ∉ A then y_i (β_0* + x_i^T β*) > 1 and x_i lies outside of the margin.

β* is a linear combination of the support vectors:

β* = Σ_{j∈A} α_j* y_j x_j

Thus the SVM solution in fact depends only on a small subset of the training data, the support vectors:

β* = Σ_{j∈A} α_j* y_j x_j

How do I calculate α*?

You have seen that the optimal solution β* is a weighted sum of the training points. The most common approach is to solve the dual Lagrange problem (the solutions to the primal and dual problems coincide here because the original cost function is quadratic with linear inequality constraints):

max_α  Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{k=1}^n α_i α_k y_i y_k x_i^T x_k

subject to α_i ≥ 0 for all i (together with Σ_i α_i y_i = 0, which follows from the stationarity condition on β_0).
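As a toy illustration of the dual, the sketch below runs projected gradient ascent on it. To keep the projection trivial it makes a simplifying assumption not in the slides: the intercept β_0 is fixed at 0, which removes the equality constraint Σ_i α_i y_i = 0 and leaves only α_i ≥ 0:

```python
import numpy as np

def svm_dual(X, y, lr=0.01, n_iter=5000):
    """Projected-gradient sketch of the dual problem above, with the
    simplification beta_0 = 0 (so only alpha_i >= 0 remains).

    Maximizes sum_i a_i - 0.5 * sum_ik a_i a_k y_i y_k x_i.x_k.
    Returns (alpha, beta) with beta = sum_i a_i y_i x_i.
    """
    Z = y[:, None] * X                 # rows z_i = y_i x_i
    Q = Z @ Z.T                        # Q_ik = y_i y_k x_i.x_k
    a = np.zeros(len(y))
    for _ in range(n_iter):
        grad = 1.0 - Q @ a             # gradient of the dual objective
        a = np.maximum(a + lr * grad, 0.0)   # ascent step, then project
    beta = X.T @ (a * y)               # recover beta from the multipliers
    return a, beta
```

On a small separable dataset the recovered β gives every training point margin y_i x_i^T β ≥ 1 (up to numerical tolerance), and only the support vectors end up with α_i > 0.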

- Probabilistic Latent Factor Induction and Statistical Factor AnalysisUploaded byStefan Conrady
- TAP Mimo ReviewUploaded byMarwan Hammouda
- Probability Random Variables and Random Processes Part 1Uploaded bytechnocrunch
- Blind Channel EstimatorUploaded byManikandan Arunachalam
- Ordination MethodsUploaded byuvozone
- Review of Probability and Statistics1Uploaded byAbdkabeer Akande
- vc101Uploaded byhamed2001ym
- calm.pdfUploaded byAtiqur Rahaman
- Patent CN102168980AUploaded byVictor Von Doom
- lec2Uploaded byVanitha Shah