
Linear Methods for Classification

Assoc. Prof. Dr. Abdulhamit Subasi

By Hakan

Machine Learning Presentation

Introduction

Basic setup of a classification problem.
Understanding the Bayes classification rule.
Understanding the classification approach by linear regression of an indicator matrix.
Understanding the phenomenon of masking.


Setup for Supervised Learning


Training data: {(x1, g1), (x2, g2), ..., (xN, gN)}.
The feature vector X = (X1, X2, ..., Xp), where each variable Xj is quantitative.
The response variable G is categorical and takes values in {1, 2, ..., K}.
Form a predictor G(x) to predict G based on X.

Setup for Supervised Learning

G(x) divides the input space (the feature vector space) into a collection of regions, each labeled by one class (see the figure).


Linear Methods

Decision boundaries are linear: linear methods for classification. Two-class problem: the decision boundary between the two classes is a hyperplane in the feature vector space. A hyperplane in the p-dimensional input space is the set shown below.
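In the usual notation, with coefficients β0 and β = (β1, ..., βp), this set can be written as

\[
L = \{\, x \in \mathbb{R}^{p} : \beta_0 + \beta^{T}x = 0 \,\}.
\]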

Linear Methods

The two regions separated by a hyperplane:
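In the notation above, these are the two half-spaces

\[
\{\, x : \beta_0 + \beta^{T}x > 0 \,\}
\quad\text{and}\quad
\{\, x : \beta_0 + \beta^{T}x < 0 \,\}.
\]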

More than two classes: the decision boundary between any pair of classes k and l is a hyperplane. How do you choose the hyperplane?

Linear Methods

Example methods for deciding the hyperplane:


Linear regression of an indicator matrix
Linear discriminant analysis
Logistic regression
Rosenblatt's perceptron learning algorithm

Note: Linear decision boundaries are not necessarily ...

The Bayes Classification Rule
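In its standard form, with class-conditional densities f_k(x) and prior probabilities π_k, the Bayes classification rule assigns x to the most probable class given X = x:

\[
\hat{G}(x) \;=\; \arg\max_{k}\, \Pr(G = k \mid X = x)
          \;=\; \arg\max_{k}\, \pi_k\, f_k(x).
\]

No other rule achieves a lower expected misclassification (0-1 loss) rate; this minimum is the Bayes rate.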


Linear Regression of an Indicator Matrix


g    Y1  Y2  Y3  Y4
1     1   0   0   0
3     0   0   1   0
2     0   1   0   0
4     0   0   0   1
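As a concrete illustration, the sketch below (with illustrative variable names and toy data) builds the indicator matrix Y, fits one least-squares regression per indicator column, and classifies a new point by the largest fitted value; note that the fitted values sum to one, as verified on the following slides.

import numpy as np

# Toy training data: N points in p = 2 dimensions, labels in {1, 2, 3, 4}
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
g = rng.integers(1, 5, size=40)          # class labels 1..4
K = 4

# Indicator response matrix Y (N x K): Y[i, k-1] = 1 if g[i] == k
Y = np.zeros((len(g), K))
Y[np.arange(len(g)), g - 1] = 1.0

# Augment X with a column of ones and fit B_hat by least squares
Xa = np.column_stack([np.ones(len(g)), X])
B_hat, *_ = np.linalg.lstsq(Xa, Y, rcond=None)

# Classify a new point: the largest fitted indicator value wins
x_new = np.array([1.0, 0.5, -0.2])       # (1, x1, x2)
f_hat = x_new @ B_hat
print(f_hat, f_hat.sum())                 # fitted values sum to 1
print("predicted class:", np.argmax(f_hat) + 1)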


Linear Regression Fit to the Class Indicator Variables

Verification that the fitted values sum to one over the classes.
The fit is f(x)^T = (1, x^T) B with B = (X^T X)^{-1} X^T Y, where X here denotes the augmented input matrix with a leading column of ones.
We want to prove that sum_k f_k(x) = 1 for every x, which is equivalent to proving (1, x^T) B 1_K = 1.   (Eq. 1)
Notice that Y 1_K = 1_N, since each row of the indicator matrix contains exactly one 1.   (Eq. 2)

Linear Regression Fit to the Class Indicator Variables


And the augmented X has 1_N as its first column, so X e1 = 1_N with e1 = (1, 0, ..., 0)^T.
From Eq. 2 we can see that B 1_K = (X^T X)^{-1} X^T Y 1_K = (X^T X)^{-1} X^T 1_N = (X^T X)^{-1} X^T X e1 = e1.

Linear Regression Fit to the Class Indicator Variables

Eq. 1 becomes (1, x^T) e1 = 1, which is true for any x.



Linear discriminant analysis


f_k(x): the density of X in class G = k; π_k: the prior probability of class k.
If each class density is Gaussian and the classes have a common covariance matrix, the log-ratio log[ Pr(G = k | X = x) / Pr(G = l | X = x) ] is linear in x, so the decision boundaries are linear.
This yields linear discriminant functions δ_k(x) and the classification rule: assign x to the class with the largest δ_k(x) (see below).
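With Gaussian class densities sharing a common covariance matrix Σ, the discriminant functions and the classification rule take the standard form

\[
\delta_k(x) = x^{T}\Sigma^{-1}\mu_k \;-\; \tfrac{1}{2}\,\mu_k^{T}\Sigma^{-1}\mu_k \;+\; \log \pi_k,
\qquad
\hat{G}(x) = \arg\max_{k}\, \delta_k(x),
\]

where μ_k is the class mean; in practice Σ, μ_k and π_k are replaced by their sample estimates.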

Remarks

With 2 classes, linear discriminant analysis coincides with classification by linear least squares (the LDA direction is proportional to the least-squares coefficient vector).
With more than 2 classes, LDA avoids the masking problems of the regression approach.
If the classes do not share a common covariance matrix, quadratic discriminant analysis (QDA) is used instead.

Regularized discriminant analysis (RDA)

A compromise between linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA), obtained by regularizing the covariance matrices.
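The regularized class covariance is a convex combination of the per-class (QDA) estimate and the pooled (LDA) estimate:

\[
\hat{\Sigma}_k(\alpha) \;=\; \alpha\,\hat{\Sigma}_k \;+\; (1-\alpha)\,\hat{\Sigma},
\qquad \alpha \in [0, 1].
\]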
Here Σ is the pooled covariance matrix used in LDA, and α is determined by cross-validation.

Computations

Simplified by diagonalisation of covariance matrices

(eigen-decomposition). Algorithm:

Sphere the data X using the eigendecomposition of the common covariance matrix (see the sketch below).
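A small numpy sketch of the sphering step (function and variable names are illustrative): the pooled covariance is eigendecomposed as U D U^T, and the data are mapped to X* = X U D^{-1/2}, so that the common covariance in the transformed space is the identity.

import numpy as np

def sphere(X, g):
    """Sphere X using the eigendecomposition of the pooled within-class covariance."""
    classes = np.unique(g)
    Xc = np.vstack([X[g == k] - X[g == k].mean(axis=0) for k in classes])
    sigma = Xc.T @ Xc / (len(X) - len(classes))   # pooled covariance estimate
    evals, U = np.linalg.eigh(sigma)              # sigma = U diag(evals) U^T
    W = U / np.sqrt(evals)                        # columns scaled by D^{-1/2}
    return X @ W                                  # sphered data X*

# After sphering, LDA amounts to classifying to the nearest class centroid
# in the transformed space, after adjusting for the class priors log(pi_k).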

Reduced-rank linear discriminant analysis

Fisher: find the linear combination Z = a^T X such that the between-class variance is maximized relative to the within-class variance.

This means maximizing the Rayleigh quotient a^T B a / a^T W a, where B is the between-class covariance and W is the within-class covariance (a sketch of the computation follows).
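A brief sketch (assuming scipy is available; names are illustrative) of how the discriminant directions can be computed: the maximizer of the Rayleigh quotient is the leading eigenvector of the generalized eigenproblem B a = λ W a.

import numpy as np
from scipy.linalg import eigh

def fisher_directions(X, g):
    """Discriminant directions maximizing between- over within-class variance."""
    classes = np.unique(g)
    mu = X.mean(axis=0)
    B = np.zeros((X.shape[1], X.shape[1]))        # between-class covariance
    W = np.zeros_like(B)                          # within-class covariance (must be positive definite)
    for k in classes:
        Xk = X[g == k]
        dk = (Xk.mean(axis=0) - mu)[:, None]
        B += len(Xk) * (dk @ dk.T)
        W += (Xk - Xk.mean(axis=0)).T @ (Xk - Xk.mean(axis=0))
    # Generalized eigenproblem B a = lambda W a; sort by decreasing eigenvalue
    lam, A = eigh(B, W)
    return A[:, ::-1], lam[::-1]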



Logistic regression

model specified by K-1 log-odds or logit transformations :
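In standard notation, with class K used as the reference class:

\[
\log\frac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)}
\;=\; \beta_{k0} + \beta_k^{T}x,
\qquad k = 1, \dots, K-1.
\]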


Fitting logistic regression model

Usually fitted by maximum likelihood (the Newton-Raphson algorithm is used to solve the score equations).
Example: K = 2 (two classes). Encode the two-class response g_i as a 0/1 variable y_i (y_i = 1 when g_i = 1, y_i = 0 when g_i = 2) and write the log-likelihood:
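With x_i including the constant term 1, the two-class log-likelihood can be written as

\[
\ell(\beta) \;=\; \sum_{i=1}^{N}\Bigl\{ y_i\,\beta^{T}x_i \;-\; \log\bigl(1 + e^{\beta^{T}x_i}\bigr) \Bigr\},
\]

and setting its derivatives to zero gives the score equations solved by Newton-Raphson.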

Example: South African heart disease


There is correlation among the predictors, which leads to surprising results: some variables are not included in the fitted logistic model.

Quadratic approximations and inference


The Newton-Raphson step is a weighted least-squares fit with weights w_i = p_i(1 - p_i), i.e. iteratively reweighted least squares (IRLS).
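A compact sketch of the Newton-Raphson / IRLS fit for the two-class case (illustrative names; y coded 0/1): each iteration solves a weighted least-squares problem with weights p_i(1 - p_i).

import numpy as np

def logistic_newton(X, y, n_iter=25):
    """Two-class logistic regression by Newton-Raphson / IRLS. y in {0, 1}."""
    Xa = np.column_stack([np.ones(len(y)), X])    # add intercept
    beta = np.zeros(Xa.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xa @ beta))      # current probabilities
        w = p * (1.0 - p)                         # IRLS weights
        # Newton step: beta <- beta + (X^T W X)^{-1} X^T (y - p)
        H = Xa.T @ (Xa * w[:, None])
        grad = Xa.T @ (y - p)
        beta = beta + np.linalg.solve(H, grad)
    return beta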


Differences between LDA and logistic regression

The models have the same linear form, but the coefficients are estimated differently. Logistic regression is more general and makes fewer assumptions (it leaves the marginal density of X arbitrary), so it is more robust; in practice, however, the two methods give very similar results.


Separating hyperplanes

Perceptrons are classifiers of the form G(x) = sign(β0 + β^T x).
The hyperplane (affine set) L is defined by the equation f(x) = β0 + β^T x = 0.


Properties

β/||β|| is the unit vector normal to the surface L. For any point x0 in L, β^T x0 = -β0, and the signed distance of any point x to L is given by the expression below.
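In the notation above, with f(x) = β0 + β^T x:

\[
\beta^{*} = \frac{\beta}{\lVert \beta \rVert} \ \text{is normal to } L,
\qquad
d(x, L) \;=\; \frac{1}{\lVert \beta \rVert}\bigl(\beta^{T}x + \beta_0\bigr)
        \;=\; \frac{f(x)}{\lVert f'(x) \rVert}.
\]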

Rosenblatt's perceptron learning algorithm

Tries to find a separating hyperplane by minimizing the distance of misclassified points to the decision boundary: if a point with y_i = 1 is misclassified then x_i^T β + β0 < 0, and if a point with y_i = -1 is misclassified then x_i^T β + β0 > 0.
The goal is therefore to minimize D(β, β0) = - sum_{i in M} y_i (x_i^T β + β0), where M is the index set of misclassified points.
The algorithm uses stochastic gradient descent to minimize this piecewise linear criterion (see the sketch below).
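A minimal sketch of the algorithm (illustrative names; labels coded -1/+1, learning rate rho): each misclassified point produces a stochastic gradient step on (β, β0).

import numpy as np

def perceptron(X, y, rho=1.0, n_epochs=100):
    """Rosenblatt's perceptron; y must be coded as -1 / +1."""
    beta = np.zeros(X.shape[1])
    beta0 = 0.0
    for _ in range(n_epochs):
        n_mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ beta + beta0) <= 0:     # misclassified
                beta += rho * yi * xi             # stochastic gradient step
                beta0 += rho * yi
                n_mistakes += 1
        if n_mistakes == 0:                       # converged: data separated
            break
    return beta, beta0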

Optimal separating hyperplanes

Find the hyperplane that separates the two classes and maximizes the margin, i.e. the distance from the hyperplane to the closest training point of either class. Advantages over Rosenblatt's algorithm:

the solution is unique
better classification performance on test data

Figure: the least-squares solution and two solutions found by the perceptron algorithm with different random starts.
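In practice the optimal separating hyperplane can be computed as a hard-margin linear support vector machine; the sketch below (assuming scikit-learn is installed, with toy data and illustrative names) approximates the hard margin with a very large cost parameter C.

import numpy as np
from sklearn.svm import SVC

# Two linearly separable clouds (illustrative toy data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# Hard-margin behaviour is approximated with a very large C
svm = SVC(kernel="linear", C=1e6).fit(X, y)
beta, beta0 = svm.coef_[0], svm.intercept_[0]
print("margin width:", 2.0 / np.linalg.norm(beta))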

Resources

Céline Bugli, "The Elements of Statistical Learning" (Hastie, Tibshirani & Friedman)


Thank you

