
Lecture 5: Multi-class Classification and Regression

C19 Machine Learning Hilary 2013 A. Zisserman

Multi-class Classification
- Using binary classifiers
- Random Forests

Regression
- Ridge regression
- Basis functions

Multi-class Classification

Multi-class Classification: what we would like


- Assign input vector x to one of K classes Ck
- Goal: a decision rule that divides input space into K decision regions separated by decision boundaries

Reminder: K Nearest Neighbour (K-NN) Classifier


Algorithm
- For each test point, x, to be classified, find the K nearest samples in the training data
- Classify the point, x, according to the majority vote of their class labels

e.g. K = 3

Naturally applicable to the multi-class case.
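A minimal sketch of this rule in Python (NumPy only; the function name, default K and data layout are illustrative, not from the lecture):

```python
import numpy as np

def knn_classify(X_train, y_train, x, K=3):
    """Classify one test point x by majority vote of its K nearest training samples."""
    # Euclidean distance from x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the K nearest training samples
    nearest = np.argsort(dists)[:K]
    # Majority vote over their class labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```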

Build from binary classifiers I


- Learn: two-class classifiers for all pairs, K(K-1)/2 of these
- Classification: choose class with majority vote

[Figure: pairwise decision boundaries between classes C1, C2, C3 giving regions R1, R2, R3; a region marked ? is one where the pairwise votes tie]
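A brief sketch of this pairwise scheme, assuming scikit-learn is available (OneVsOneClassifier trains the K(K-1)/2 pairwise classifiers and resolves the majority vote; the dataset is just an example):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)                        # K = 3 classes
clf = OneVsOneClassifier(LinearSVC(dual=False)).fit(X, y)

# K(K-1)/2 = 3 pairwise two-class classifiers; prediction is by majority vote
print(len(clf.estimators_), clf.predict(X[:5]))
```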

Build from binary classifiers II


- Learn: K two-class 1-vs-the-rest classifiers fk(x)

[Figure: the three 1-vs-the-rest classifiers (1 vs 2 & 3, 2 vs 1 & 3, 3 vs 1 & 2) for classes C1, C2, C3; regions marked ? are claimed by more than one classifier or by none]

Build from binary classifiers II continued


- Learn: K two-class 1-vs-the-rest classifiers fk(x)
- Classification: choose the class with the most positive score

Decision rule:  f(x) = max_k fk(x)

[Figure: the three 1-vs-the-rest classifiers (1 vs 2 & 3, 2 vs 1 & 3, 3 vs 1 & 2) with the input space partitioned among C1, C2, C3 by the max-score rule]


Application: handwritten digit recognition

Feature vectors: each image is 28 x 28 pixels. Rearrange as a 784-vector x

- Training: learn K = 10 two-class 1-vs-the-rest SVM classifiers fk(x)
- Classification: choose the class with the most positive score

f(x) = max_k fk(x)
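A hedged sketch of this pipeline with scikit-learn (the 8 x 8 load_digits images stand in for the 28 x 28 images described above; nothing here is specific to the lecture's experiments):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Each image is flattened into a feature vector (8 x 8 = 64 here, 28 x 28 = 784 in the lecture)
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train K = 10 one-vs-the-rest linear SVMs, one per digit class
clf = OneVsRestClassifier(LinearSVC(dual=False)).fit(X_train, y_train)

# decision_function returns the K scores f_k(x); predict takes the class with the largest score
scores = clf.decision_function(X_test)    # shape (n_test, 10)
print(clf.predict(X_test[:5]))
```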

Example

[Figure: hand-drawn test digits with the classifier's predicted label shown beneath each one]

Why not learn a multi-class SVM directly?


For example, for three classes, learn w = (w1, w2, w3)ᵀ using the cost function

  min_w ||w||²   subject to

  w1ᵀxi ≥ w2ᵀxi  and  w1ᵀxi ≥ w3ᵀxi    for i in class 1
  w2ᵀxi ≥ w3ᵀxi  and  w2ᵀxi ≥ w1ᵀxi    for i in class 2
  w3ᵀxi ≥ w1ᵀxi  and  w3ᵀxi ≥ w2ᵀxi    for i in class 3

This is a quadratic optimization problem subject to linear constraints, and it has a unique minimum. Note that a margin can also be included in the constraints.

In practice there is little or no improvement over the binary case.
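For reference, both options exist in scikit-learn; a hedged comparison sketch (LinearSVC's crammer_singer option implements a joint multi-class objective of this flavour, not the exact three-class program written above):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)

ovr = LinearSVC(multi_class="ovr", dual=False)       # K one-vs-the-rest classifiers
direct = LinearSVC(multi_class="crammer_singer")     # direct multi-class SVM
print(cross_val_score(ovr, X, y).mean(), cross_val_score(direct, X, y).mean())
```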

Random Forests

Random Forest Overview


- A natural multi-class classifier
- Fast at test time
- Start with a single tree, then move on to forests

Generic trees and decision trees


[Figure: a general tree structure - a root node, internal (split) nodes and terminal (leaf) nodes]

[Figure: a decision tree on 2D data (x1, x2). The root applies the test x2 > 1; one branch classifies as the green class, the other applies a second test x1 > 1 whose branches classify as the blue and red classes. The corresponding axis-aligned splits partition the (x1, x2) plane.]

Testing

Choose the class with the maximum posterior at the leaf node reached by the test point.

Choosing the node test: information gain


- Node training: choose the test with the largest information gain, i.e. the reduction in Shannon entropy from the class distribution before the split to the (size-weighted) distributions after it

[Figure: two candidate splits (Split 1, Split 2) of the same 2D data (x1, x2); the split producing purer child nodes has the higher information gain]
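A minimal sketch of the information-gain criterion (assuming class labels are stored as integer arrays; the function names are illustrative):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Entropy before the split minus the size-weighted entropy of the two children."""
    n = len(parent)
    after = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - after
```

At each node, training picks the test whose split of the node's samples gives the largest information_gain.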

Trees vs Forests

[Figure: a single tree's decision boundary can hug the training points, showing a lack of margin]

- A single tree may overfit to the training data
- Instead, train multiple trees and combine their predictions
- Each tree can differ in both its training data and its node tests
- Achieve this by injecting randomness into the training algorithm

Forests and trees

A forest is an ensemble of trees. The trees are all slightly different from one another.

Classification forest - injecting randomness I


(1) Bagging (randomizing the training set)
[Figure: forest training - from the full training set, a randomly sampled subset of the training data is made available to each tree t]

Classification forest - injecting randomness II


(2) Node training - random subsets of node tests available at each node
[Figure: forest training - at each node, only a random subset of the node-test (weak learner) parameters is made available, so the resulting trees T1, T2, T3 differ]
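These two sources of randomness map directly onto standard implementations; a hedged sketch with scikit-learn's RandomForestClassifier (the parameter names are scikit-learn's, not the lecture's):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    bootstrap=True,        # (1) bagging: each tree sees a random sample of the training set
    max_features="sqrt",   # (2) node training: only a random subset of features is tried at each node
    random_state=0,
).fit(X, y)
```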

From a single tree classification...

- What do we do at the leaf?
- Prediction model: probabilistic - each leaf stores the class posterior estimated from the training points that reach it

[Figure: a tree with the class distribution stored at each leaf]

How to combine the predictions of the trees?

The ensemble model: the forest output probability is the average of the per-tree posteriors,

  p(c | x) = (1/T) Σ_{t=1}^{T} p_t(c | x)

[Figure: posteriors from trees t = 1, 2, 3 combined into the forest posterior]
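The averaging can be checked directly in a fitted scikit-learn forest, where each tree reports the class posterior of the leaf a point falls into (a sketch; the dataset is just an example):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Per-tree posteriors p_t(c | x) for a few points
per_tree = np.stack([tree.predict_proba(X[:3]) for tree in forest.estimators_])

# Forest output probability: the average over the T trees
p_forest = per_tree.mean(axis=0)
assert np.allclose(p_forest, forest.predict_proba(X[:3]))
```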

Random Forests - summary


- Node tests are usually chosen to be cheap to evaluate (weak classifiers), e.g. comparing pairs of feature components
- At run time, classification is very fast (like AdaBoost)
- The classifier is non-linear in feature space
- Training can be quite fast as well if node tests are chosen completely randomly
- Many parameters: depth of trees, number of trees, type of node tests, random sampling
- Requires a lot of training data
- Large memory footprint (cf. k-NN)

Application:
Body tracking in Microsoft Kinect for XBox 360

Kinect random forest classifier


- Input data for each frame: an RGB image and a depth image
- Output: a multi-class classification of the depth image into inferred body parts
- Goal: train a random forest classifier to predict body parts, then fit a stickman model and track the skeleton

Training/test data
- From a motion capture system
- e.g. 1 million training examples

Node tests

- Input: a depth image
- Input data point: a pixel position in the 2D image
- Output (feature response): the depth difference between two points
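A hedged sketch of such a node test (the offsets and clipping are illustrative; the published Kinect system additionally scales the offsets by 1/depth at the input pixel so the feature is depth-invariant):

```python
import numpy as np

def depth_difference_feature(depth, pixel, u, v):
    """Feature response for one pixel: depth difference between two offset points."""
    rows, cols = depth.shape

    def sample(offset):
        # Clamp the offset position to the image so the lookup stays valid
        r = int(np.clip(pixel[0] + offset[0], 0, rows - 1))
        c = int(np.clip(pixel[1] + offset[1], 0, cols - 1))
        return depth[r, c]

    return sample(u) - sample(v)
```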

Example result

[Figure: input depth image (background removed), inferred body parts (per-pixel posterior), and the body parts rendered in 3D]

Regression
Suppose we are given a training set of N observations ((x1, y1), . . . , (xN, yN)) with xi ∈ Rd, yi ∈ R. The regression problem is to estimate f(x) from this data such that yi = f(xi).

Learning by optimization
As in the case of classification, learning a regressor can be formulated as an optimization:

Minimize with respect to f ∈ F:

  Σ_{i=1}^{N} l(f(xi), yi) + λ R(f)

where the first term is the loss function and the second the regularization.

There is a choice of both loss function and regularizer, e.g. squared loss or an SVM hinge-like loss; a squared regularizer or a lasso regularizer.

Primal and dual forms exist, and the dual form can be kernelized.

Choice of regression function: non-linear basis functions


Function for regression: f(x, w) is a non-linear function of x, but linear in w:

  f(x, w) = w0 + w1 φ1(x) + w2 φ2(x) + . . . + wM φM(x) = wᵀ φ(x)

For example, for x ∈ R, polynomial regression with φj(x) = x^j:

  f(x, w) = Σ_{j=0}^{M} wj x^j

e.g. for M = 3,

  f(x, w) = (w0, w1, w2, w3) (1, x, x², x³)ᵀ = wᵀ φ(x),    φ : x → φ(x),  R → R⁴

Or the basis functions can be Gaussians centred on the training data, φj(x) = exp(−(x − xj)²/2σ²). e.g. for 3 points,

  f(x, w) = (w1, w2, w3) (e^{−(x−x1)²/2σ²}, e^{−(x−x2)²/2σ²}, e^{−(x−x3)²/2σ²})ᵀ = wᵀ φ(x),    φ : x → φ(x),  R → R³
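A minimal sketch of these two feature maps in NumPy (function names are illustrative):

```python
import numpy as np

def poly_features(x, M):
    """Polynomial basis: phi_j(x) = x**j for j = 0..M, mapping R -> R^(M+1)."""
    return np.vander(x, M + 1, increasing=True)

def gaussian_features(x, centres, sigma):
    """Gaussian basis centred on the training points: phi_j(x) = exp(-(x - x_j)^2 / (2 sigma^2))."""
    return np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * sigma ** 2))
```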

Least squares ridge regression


Cost function - squared loss:

  Ẽ(w) = (1/2) Σ_{i=1}^{N} {f(xi, w) − yi}² + (λ/2) ||w||²

where yi is the target value at xi, the first term is the loss function and the second is the regularization.

Regression function for x (1D):

  f(x, w) = w0 + w1 φ1(x) + w2 φ2(x) + . . . + wM φM(x) = wᵀ φ(x)

NB: the squared loss arises in Maximum Likelihood estimation for the error model

  yi = ȳi + ni,    ni ~ N(0, σ²)

where yi is the measured value and ȳi the true value.

Solving for the weights w


Notation: write the target and regressed values as N-vectors,

  y = (y1, y2, . . . , yN)ᵀ    and    Φw, whose i-th entry is φ(xi)ᵀw,

with

  Φ = [ 1  φ1(x1)  . . .  φM(x1)
        1  φ1(x2)  . . .  φM(x2)
        .     .              .
        1  φ1(xN)  . . .  φM(xN) ]

an N × M design matrix (one row per data point, one column per basis function) and w = (w0, w1, . . . , wM)ᵀ.

e.g. for polynomial regression with basis functions up to x²:

  Φw = [ 1  x1  x1²
         1  x2  x2²
         .   .    .
         1  xN  xN² ] (w0, w1, w2)ᵀ

  Ẽ(w) = (1/2) Σ_{i=1}^{N} {f(xi, w) − yi}² + (λ/2) ||w||²

       = (1/2) Σ_{i=1}^{N} (yi − wᵀφ(xi))² + (λ/2) ||w||²

       = (1/2) ||y − Φw||² + (λ/2) ||w||²

Now compute where the derivative w.r.t. w is zero for the minimum:

  dẼ(w)/dw = −Φᵀ(y − Φw) + λw = 0

Hence

  (ΦᵀΦ + λI) w = Φᵀ y,    so    w = (ΦᵀΦ + λI)⁻¹ Φᵀ y

With M basis functions and N data points (assume N > M), the dimensions are

   w  =  (ΦᵀΦ + λI)⁻¹  Φᵀ   y
  M×1       M×M        M×N  N×1

This shows that there is a unique solution. If λ = 0 (no regularization), then

  w = (ΦᵀΦ)⁻¹ Φᵀ y = Φ⁺ y

where Φ⁺ is the pseudo-inverse of Φ (pinv in Matlab). Adding the term λI improves the conditioning of the inverse: if Φ is not full rank, then (ΦᵀΦ + λI) will be (for sufficiently large λ).

As λ → ∞, w → λ⁻¹ Φᵀ y → 0.

Often the regularization is applied only to the inhomogeneous part of w, i.e. to w̃, where w = (w0, w̃).

Substituting the solution back into the regression function,

  f(x, w) = wᵀ φ(x) = φ(x)ᵀ w = φ(x)ᵀ (ΦᵀΦ + λI)⁻¹ Φᵀ y = b(x)ᵀ y

so the output is a linear blend, b(x), of the training values {yi}.
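A minimal NumPy sketch of this closed-form solution (function names are illustrative; for λ = 0 one would fall back on the pseudo-inverse, e.g. np.linalg.pinv or Matlab's pinv):

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Solve w = (Phi^T Phi + lam * I)^(-1) Phi^T y without forming the inverse explicitly."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

def ridge_predict(Phi_test, w):
    """f(x, w) = w^T phi(x) for each test point (one row of Phi_test per point)."""
    return Phi_test @ w
```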

Example 1: polynomial basis functions


- The red curve is the true function (which is not a polynomial)
- The data points are samples from the curve with added noise in y
- There is a choice of both the degree, M, of the basis functions used, and the strength of the regularization

[Figure: ideal fit - the sample points and the true curve over x ∈ [0, 1]]

  f(x, w) = Σ_{j=0}^{M} wj x^j = wᵀ φ(x),    φ : x → φ(x),  R → R^{M+1}

w is an (M+1)-dimensional vector.

N = 9 samples, M = 7
[Figure: fits for four regularization strengths, lambda = 100, 0.001, 1e-010 and 1e-015; each panel shows the sample points, the ideal fit and the fitted polynomial]

M = 3 and M = 5: least-squares fits

[Figure: least-squares fits for M = 3 and M = 5 (sample points, ideal fit, least-squares solution), together with plots of the corresponding weighted polynomial basis functions; note the very different vertical scales of the two basis-function plots (roughly ±25 and ±400)]

Example 2: Gaussian basis functions


- The red curve is the true function (which is not a polynomial)
- The data points are samples from the curve with added noise in y
- Basis functions are centred on the training data (N points)
- There is a choice of both the scale, sigma, of the basis functions used, and the strength of the regularization

[Figure: ideal fit - the sample points and the true curve]

  f(x, w) = Σ_{i=1}^{N} wi exp(−(x − xi)²/2σ²) = wᵀ φ(x),    φ : x → φ(x),  R → R^N

w is an N-vector.

N = 9 samples, sigma = 0.334


[Figure: fits for four regularization strengths, lambda = 100, 0.001, 1e-010 and 1e-015; each panel shows the sample points, the ideal fit and the fitted Gaussian-basis regression]

Choosing lambda using a validation set


[Figure: training and validation error norms as a function of log10(lambda), with the minimum validation error marked; alongside, the fit obtained with the lambda selected on the validation set]
Sigma = 0.334 and Sigma = 0.1

[Figure: validation-set fits for sigma = 0.334 and sigma = 0.1, together with plots of the corresponding weighted Gaussian basis functions; note the very different vertical scales of the two basis-function plots]

Background reading and more


Other multi-class classifiers (not covered here):
- Neural networks

- Bishop, chapters 3, 4.1-4.3 and 14.3
- Hastie et al., chapters 10.1-10.6
- More on the web page: http://www.robots.ox.ac.uk/~az/lectures/ml
