
Model Building Training

Max Kuhn
Kjell Johnson
Global Nonclinical Statistics

Overview
Typical data scenarios
Examples we'll be using

General approaches to model building


Data pre-processing
Regression-type models
Classification-type models
Other considerations

Typical Data

Response may be continuous or categorical


Predictors may be
continuous, count, and/or binary
dense or sparse
observed and/or calculated

Predictive Models
What is a predictive model?
A model whose primary purpose is for prediction
(as opposed to inference)

We would like to know why the model works, as


well as the relationship between predictors and
the outcome, but these are secondary
Examples: blood-glucose monitoring, spam
detection, computational chemistry, etc.
4

What Are They Not Good For?


They are not a substitute for subject-specific knowledge
Science: Hard

(yikes)

Models: Easy

(let's do these instead!)

To make a good model that predicts well on


future samples, you need to know a lot about
Your predictors and how they relate to each other
The mechanism that generated the data (sampling, technology, etc.)

What Are They Not Good For?


An example:
An oncologist collects some data from a small clinical
trial and wants a model that would use gene expression
data to predict therapeutic response (beneficial or not)
in 4 types of cancer
There were about 54K predictors and data was
collected on ~20 subjects

If there is a lot of knowledge of how the therapy


works (pathways etc), some effort must be put into
using that information to help build the model
6

The Big Picture

"In the end, [predictive modeling] is not a substitute for intuition, but a complement."
— Ian Ayres, in Super Crunchers

References
"Statistical Modeling: The Two Cultures" by Leo Breiman (Statistical Science, Vol. 16, No. 3 (2001), 199-231)
The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman
Regression Modeling Strategies by Harrell
Super Crunchers by Ayres
8

Regression Methods
Multiple linear regression
Partial least squares
Neural networks
Multivariate adaptive regression splines
Support vector machines
Regression trees
Ensembles of trees:
Bagging, boosting, and random forests
9

Classification Methods
Discriminant analysis framework
Linear, quadratic, regularized, flexible, and partial least squares
discriminant analysis

Modern classification methods


Classification trees
Ensembles of trees
Boosting and random forests

Neural networks
Support vector machines
k-nearest neighbors
Naive Bayes
10

Interesting Models We Don't Have Time For


L1 Penalty methods
The lasso, the elastic net, nearest shrunken centroids

Other Boosted Models


linear models, generalized additive models, etc

Other Models:
Conditional inference trees, C4.5, C5, Cubist, other tree models
Learning vector quantization
Self-organizing maps
Active learning techniques
11

Example Data Sets

12

Boston Housing Data


This is a classic benchmark data set for regression. It
includes housing data for 506 census tracts of Boston
from the 1970 census.

crim: per capita crime rate
indus: proportion of non-retail business acres per town
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox: nitric oxides concentration
rm: average number of rooms per dwelling
age: proportion of owner-occupied units built prior to 1940
dis: weighted distances to five Boston employment centers
rad: index of accessibility to radial highways
tax: full-value property-tax rate
ptratio: pupil-teacher ratio by town
b: proportion of minorities
medv: median value of homes (outcome)

Toy Classification Example

A simulated data set will be


used to demonstrate
classification models
two predictors with a correlation
coefficient of 0.5 were simulated
two classes were simulated
(active and inactive)

A probability model was used to


assign a probability of being
active to each sample
the 25%, 50% and 75%
probability lines are shown on
the right

14

Toy Classification Example


The classes were randomly
assigned based on the probability
The training data had 250
compounds (plot on right)
the test set also contained 250
compounds

With two predictors, the class


boundaries can be shown for
each model
this can be a significant aid in
understanding how the models
work
but we acknowledge how
unrealistic this situation is

15

Model Building Training

General Strategies

16

Objective

To construct a model of predictors that


can be used to predict a response
Data
Model
Prediction
17

Model Building Steps


Common steps during model building are:
estimating model parameters (i.e. training models)
determining the values of tuning parameters that
cannot be directly calculated from the data
calculating the performance of the final model that will
generalize to new data

The modeler has a finite amount of data, which


they must "spend" to accomplish these steps
How do we spend the data to find an optimal model?

18

Spending Data
We typically spend data on training and test data sets
Training Set: these data are used to estimate model parameters
and to pick the values of the complexity parameter(s) for the
model.
Test Set (aka validation set): these data can be used to get an
independent assessment of model efficacy. They should not be
used during model training.

The more data we spend, the better estimates we'll get (provided the data is accurate). Given a fixed amount of data,
too much spent in training won't allow us to get a good assessment of predictive performance. We may find a model that fits the training data very well, but is not generalizable (over-fitting)
too much spent in testing won't allow us to get a good assessment of model parameters
19

Methods for Creating a Test Set


How should we split the data into a training and
test set?
Often, there will be a scientific rationale for the split
and in other cases, the splits can be made
empirically.
Several empirical splitting options:
completely random
stratified random
maximum dissimilarity in predictor space
20

Creating a Test Set: Completely Random Splits

A completely random (CR) split randomly partitions the


data into a training and test set
For large data sets, a CR split has very low bias towards
any characteristic (predictor or response)
For classification problems, a CR split is appropriate for
data that is balanced in the response
However, a CR split is not appropriate for unbalanced
data
A CR split may select too few observations (and perhaps none) of
the less frequent class into one of the splits.

21

Creating a Test Set: Stratified Random Splits


A stratified random split makes a random split
within stratification groups
in classification, the classes are used as strata
in regression, groups based on the quantiles of the
response are used as strata

Stratification attempts to preserve the distribution


of the outcome between the training and test
sets
A SR split is more appropriate for unbalanced data
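
A minimal sketch of a stratified random split, assuming scikit-learn in Python (not the course's own software); the data here are simulated stand-ins.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                                    # hypothetical predictors
y = rng.choice(["active", "inactive"], size=1000, p=[0.1, 0.9])   # unbalanced classes

# stratify=y preserves the class proportions in both the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1)
```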
22

Over-Fitting
Over-fitting occurs when a model has extremely good
prediction for the training data but predicts poorly when
the data are slightly perturbed
new data (i.e. test data) are used

Complex regression and classification models assume


that there are patterns in the data.
Without some control many models can find very intricate
relationships between the predictor and the response
These patterns may not be valid for the entire population.

23

Over-Fitting Example
The plots below show classification boundaries for two models built on the same data
one of them is over-fit

[Figure: class boundaries for each model plotted over Predictor A and Predictor B]

Over-Fitting in Regression
Historically, we evaluate the quality of a
regression model by its mean squared error.
Suppose that our prediction function is
parameterized by some vector

25

Over-Fitting in Regression
MSE can be decomposed into three terms:

$\mathrm{MSE} = \sigma^2 + \mathrm{Bias}^2\big(\hat{f}\big) + \mathrm{Var}\big(\hat{f}\big)$

irreducible noise
squared bias of the estimator from its expected value
the variance of the estimator

The bias and variance are inversely related


as one increases, the other decreases
different rates of change

26

Over-Fitting in Regression
When the model under-fits,
the bias is generally high and
the variance is low
Over-fitting is typically
characterized by high
variance, low bias estimators
In many cases, small
increases in bias result in
large decreases in variance
27

Over-Fitting in Regression
Generally, controlling the MSE yields a good
trade-off between over- and under-fitting
a similar statement can be made about classification
models, although the metrics are different (i.e. not
MSE)

How can we accurately estimate the MSE from


the training data?
the naive MSE from the training data can be a very
poor estimate

Resampling can help estimate these metrics


28

How Do We Estimate Over-Fitting?


Some models have specific knobs to control
over-fitting
neighborhood size in nearest neighbor models is an
example
the number of splits in a tree model

Often, poor choices for these parameters can


result in over-fitting
Resampling the training compounds allows us
to know when we are making poor choices for the
values of these parameters
29

How Do We Estimate Over-Fitting?


Resampling only affects the training data
the test set is not used in this procedure

Resampling methods try to embed variation in


the data to approximate the model's performance
on future compounds
Common resampling methods:
K-fold cross validation
Leave group out cross validation
Bootstrapping
30

K-fold Cross Validation


Here, we randomly split the data into K blocks of
roughly equal size
We leave out the first block of data and fit a
model.
This model is used to predict the held-out block
We continue this process until we've predicted all
K hold-out blocks
The final performance is based on the hold-out
predictions
31

K-fold Cross Validation


The schematic below shows the process for K = 3
groups.
K is usually taken to be 5 or 10
leave one out cross-validation has each sample as a
block
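
A sketch of K-fold cross-validation (K = 3 here) using scikit-learn; the model and data are placeholders, not the course's actual example.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X @ [1.0, -2.0, 0.5, 0.0] + rng.normal(size=120)

fold_rmse = []
for train_idx, hold_idx in KFold(n_splits=3, shuffle=True, random_state=1).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[hold_idx])                 # predict the held-out block
    fold_rmse.append(mean_squared_error(y[hold_idx], pred) ** 0.5)

print(np.mean(fold_rmse))                             # performance from the hold-out predictions
```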

32

Leave Group Out Cross Validation


A random proportion
of data (say 80%) are
used to train a model
The remainder is
used to predict
performance
This process is
repeated many times
and the average
performance is used
33

Bootstrapping
Bootstrapping takes a random sample with
replacement
the random sample is the same size as the original data
set
compounds may be selected more than once
each compound has a 63.2% chance of showing up at
least once

Some samples won't be selected


these samples will be used to predict performance

The process is repeated multiple times (say 30)
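
A sketch of bootstrap resampling for performance estimation, assuming NumPy and scikit-learn; the simple linear model stands in for whatever model is being assessed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ [2.0, 0.0, -1.0] + rng.normal(size=200)

rmse = []
n = len(y)
for _ in range(30):                                   # repeat the process (say 30 times)
    boot = rng.integers(0, n, size=n)                 # sample with replacement, same size as the data
    out = np.setdiff1d(np.arange(n), boot)            # samples never selected ("out-of-bag")
    fit = LinearRegression().fit(X[boot], y[boot])
    rmse.append(mean_squared_error(y[out], fit.predict(X[out])) ** 0.5)

print(np.mean(rmse))                                  # average held-out performance
```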


34

The Bootstrap
With bootstrapping,
the number of held-out samples is
random
Some models, such
as random forest, use
bootstrapping within
the modeling process
to reduce over-fitting

35

Training Models with Tuning Parameters


A single training/test split is
often not enough for models
with tuning parameters
We must use resampling
techniques to get good
estimates of model
performance over multiple
values of these parameters
We pick the complexity
parameter(s) with the best
performance and re-fit the
model using all of the data
36

Simulated Data Example


Let's fit a nearest neighbors model to the
simulated classification data.
The optimal number of neighbors must be chosen
If we use leave group out cross-validation and set
aside 20%, we will fit models to a random 200
samples and predict 50 samples
30 iterations were used

We'll train over 11 odd values for the number of


neighbors
we also have a 250 point test set
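
A roughly analogous sketch: tune the number of neighbors with leave-group-out cross-validation (ShuffleSplit) and a grid search in scikit-learn. The simulated two-predictor data are only a stand-in for the toy example.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(250, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=250) > 0).astype(int)

cv = ShuffleSplit(n_splits=30, test_size=0.2, random_state=1)   # 30 iterations, hold out 20%
grid = {"n_neighbors": list(range(1, 23, 2))}                   # 11 odd values of k
search = GridSearchCV(KNeighborsClassifier(), grid, cv=cv, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))        # resampled accuracy for the best k
```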
37

Toy Data Example


The plot on the right shows the
classification accuracy for each
value of the tuning parameter
The grey points are the 30
resampled estimates
The black line shows the average
accuracy
The blue line is the 250 sample
test set

It looks like 7 or more


neighbors is optimal with an
estimated accuracy of 86%

38

Toy Data Example


What if we didn't resample
and used the whole data
set?
The plot on the right
shows the accuracy
across the tuning
parameters
This would pick a model
that over-fits and has
optimistic performance
39

Model Building Training

Data Pre-Processing

40

Why Pre-Process?
In order to get effective and stable results, many
models require certain assumptions about the
data
this is model dependent

We will list each model's pre-processing


requirements at the end
In general, pre-processing rarely hurts model
performance, but could make model
interpretation more difficult

41

Common Pre-Processing Steps


For most models, we apply three pre-processing
procedures:
Removal of predictors with variance close to zero
Elimination of highly correlated predictors
Centering and scaling of each predictor

42

Zero Variance Predictors


Most models require that each predictor have at
least two unique values
Why?
A predictor with only one unique value has a variance
of zero and contains no information about the
response.

It is generally a good idea to remove them.

43

Near Zero Variance Predictors


Additionally, if the distributions of the predictors
are very sparse,
this can have a drastic effect on the stability of the
model solution
zero variance descriptors could be induced during
resampling

But what does a near zero variance predictor


look like?

44

Near Zero Variance Predictor


There are two conditions for an NZV predictor
a low number of possible values, and
a high imbalance in the frequency of the values

For example, a low number of possible values


could occur by using fingerprints as predictors
only two possible values can occur (0 or 1)

But what if there are 999 zero values in the data


and a single value of 1?
this is a highly unbalanced case and could be trouble
45

NZV Example
In computational chemistry we
created predictors based on
structural characteristics of
compounds.
As an example, the descriptor
nR11 is the number of 11-member rings
The table to the right is the
distribution of nR11 from a
training set
the distinct value percentage is
5/535 = 0.0093
the frequency ratio is 501/23 = 21.8
46

[Table: distribution of nR11 ("# 11-member rings") in the training set — the most common value appears 501 times; the second most common appears 23 times]

Detecting NZVs
Two criteria for detecting NZVs are the
Discrete value percentage
Defined as the number of unique values divided by the number of
observations
Rule-of-thumb: discrete value percentage < 20% could indicate a
problem

Frequency ratio
Defined as the frequency of the most common value divided by the
frequency of the second most common value
Rule-of-thumb: > 19 could indicate a problem

If both criteria are violated, then eliminate the predictor
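
A sketch of the two near-zero variance criteria for a single predictor, written with NumPy; the thresholds follow the rules of thumb above, and the example vector is hypothetical.

```python
import numpy as np

def near_zero_variance(x, freq_cut=19.0, unique_cut=0.20):
    values, counts = np.unique(x, return_counts=True)
    counts = np.sort(counts)[::-1]
    freq_ratio = counts[0] / counts[1] if len(counts) > 1 else np.inf
    pct_unique = len(values) / len(x)              # discrete value percentage
    return freq_ratio > freq_cut and pct_unique < unique_cut

x = np.array([0] * 501 + [1] * 23 + [2, 3, 4] * 4)  # hypothetical sparse count predictor
print(near_zero_variance(x))                         # True -> flag for removal
```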


47

Highly Correlated Predictors

Some models can be negatively affected by


highly correlated predictors
certain calculations (e.g. matrix inversion) can become
severely unstable

How can we detect these predictors?


Variance inflation factor (VIF) in linear regression

or, alternatively
1. Compute the correlation matrix of the predictors
2. Predictors with (absolute) pair-wise correlations above
a threshold can be flagged for removal
3. Rule-of-thumb threshold: 0.85
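
A sketch of a pair-wise correlation filter at the 0.85 threshold; this simple greedy version is an assumption, not necessarily the exact algorithm used in the course.

```python
import numpy as np

def find_correlated(X, cutoff=0.85):
    corr = np.abs(np.corrcoef(X, rowvar=False))
    drop = set()
    p = corr.shape[0]
    for i in range(p):
        for j in range(i + 1, p):
            if corr[i, j] > cutoff and i not in drop and j not in drop:
                drop.add(j)                     # flag one member of each offending pair
    return sorted(drop)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 3] = X[:, 0] + rng.normal(scale=0.05, size=100)   # make two columns highly correlated
print(find_correlated(X))                               # e.g. [3]
```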
48

Highly Correlated Predictors and Resampling
Recall that resampling slightly perturbs the
training data set to increase variation
If a model is adversely affected by high
correlations between predictors, the resampling
performance estimates can be poor in
comparison to the test set
In this case, resampling does a better job at predicting
how the model works on future samples

49

Centering and Scaling


Standardizing the predictors can greatly improve
the stability of model calculations.
More importantly, there are several models (e.g.
partial least squares) that implicitly assume that
all of the predictors are on the same scale
Apart from the loss of the original units, there is
no real downside of centering and scaling
50

Model Building Training


Regression-type Models

51

Setting

Response is continuous

52

Objective

To construct a model of predictors that


can be used to predict a response
Data
Model
Prediction
53

Regression Methods
Multiple linear regression
Partial least squares
Neural networks
Multivariate adaptive regression splines
Support vector machines
Regression trees
Ensembles of trees:
Bagging, boosting, and random forests

Each of these methods seek to find a relationship


between the predictors and response that minimizes
error between the observed and predicted response
54

Additive Models
In the beginning there were linear models:

$E[Y] = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$

And Nelder and Wedderburn (1972) said, "Let there be Generalized Linear Models":

$g(E[Y]) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$
and link functions appeared.

And Hastie and Tibshirani (1990) said, "Let there be Generalized Additive Models":

$E[Y] = f_0 + f_1(X_1) + \cdots + f_p(X_p)$

and scatterplot smoothers and backfitting


algorithms appeared.

Families of Additive Models

[Diagram: families of additive models arranged by increasing flexibility — GLM, PLS, GAM, recursive partitioning (trees), multivariate adaptive regression splines*, bagging, boosting, random forests, neural nets, support vector machines*]

* Additivity depends on model parameters

Assessing Model Performance

57

Assessing Model Performance


How well does a regression model perform? Answering this
question depends on how we want to use the model.
Possible goals are:
To understand the relationship between the predictor and the
response.
To use the model to predict future observations' responses.

In either case, we can use several different measures to evaluate model performance. We will focus on two:
Coefficient of determination (R2)
Root mean square error (RMSE)

However, the set of data that we use to evaluate


performance will change depending on our purpose.
58

Which Set of Data to Use to Evaluate Performance?


If we are only interested in understanding the underlying
relationship between the predictor and the response, then
we can compute R2 and RMSE on the data for which the
model was built (i.e. the training data).
However, these values will be overly optimistic of the model's
ability to predict future observations.

If we are interested in understanding the model's ability to


predict future observations, then we need to compute R2
and RMSE on data for which the model was not built (i.e.
a test set or cross-validation set).
For a held-out set of data, R2 is commonly referred to as Q2 and
RMSE is commonly referred to as root mean squared prediction
error (RMSPE)

59

Root Mean Squared Error (RMSE) and Root Mean Squared Prediction Error (RMSPE)

RMSE measures the average deviation of an observation to the best-fit plane:

$\mathrm{RMSE} = \sqrt{\dfrac{SSE}{n - p - 1}}$

RMSPE measures the average deviation of an observation to its predicted value for the test or cross-validation set:

$\mathrm{RMSPE} = \sqrt{\dfrac{1}{n^*}\sum_{i=1}^{n^*} \big(y_i - \hat{y}_i\big)^2}$

n* = the number of observations in the test or cross-validation set

Computing Q2
Process:
Partition the data into
a training and testing set, or
blocks to be used for training and testing

Build the model on the training data and predict the


testing data

Q2 = R2 of the relationship between the observed


and predicted values for the testing data.
61

Multiple Linear Regression: A Quick Review

62

Multiple Linear Regression

Objective: Find the plane through the data that minimizes the
sum-of-squares error.
63

The Best Plane


To find the best plane, we solve:

$\min_{\beta} \; \lVert Y - X\beta \rVert^2$

where $Y$ is $n \times 1$, $X$ is $n \times (p+1)$, and $\beta$ is $(p+1) \times 1$

The best $\hat{\beta}$ is:

$\hat{\beta} = (X^{T}X)^{-1}X^{T}Y$
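
A sketch of the least-squares solution via the normal equations in NumPy, for illustration only; in practice a QR- or SVD-based solver (e.g. np.linalg.lstsq) is preferred, since inverting X'X fails when predictors are collinear.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # include an intercept column
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y                  # (X'X)^{-1} X'Y
print(np.round(beta_hat, 2))
```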

Aside: A Bit More About (XTX)


(XTX) is a critical matrix for many statistical
modeling techniques
A few fun facts
(XTX) is proportional to the covariance matrix, S
S contains the variances and covariances of all
predictors
Techniques that depend on (XTX) also require that it is
invertible

65

Assumptions: Diagnostic Plots

66

When Does Regression Fail?


When a plane does not capture the structure in the data
When the variance/covariance matrix is overdetermined
Recall, the plane that minimizes SSE is:

$\hat{\beta} = (X^{T}X)^{-1}X^{T}Y$

To find the best plane, we must compute the inverse of the


variance/covariance matrix
The variance/covariance matrix is not always invertible. Two
common conditions that cause it to be non-invertible are:
Two or more of the predictors are correlated (multicollinearity)
There are more predictors than observations
67

A (Trivial) Example of Multicollinearity


Suppose that we have one observation (3,5), and we wish to find the best line for the
data. In this example, the number of observations (1) is less than the number of
parameters (2: slope and intercept). When the number of parameters is greater than
the number of observations, we can find an infinite number of best solutions.
[Figure: three different "best" lines through the single point (3, 5)]

In the presence of multicollinearity, the best solution will be unstable.

68

Boston Housing Data


Let's use a linear regression model to predict the median
house price in Boston.
Process:
Split the data into a training set (n = 337) and testing set (n = 169)
For the training set, use the bootstrap to determine the RMSPE
and Q2
For the test data determine RMSPE and Q2

If the underlying model is stable, the values of RMSPE


and Q2 should be similar between the bootstrap and
testing data

69

Results

              Training Data (bootstrap)    Test Data
              RMSE      Q2                 RMSE      R2
Linear Reg    5.23      0.691              4.53      0.742

The results are fairly similar, at least within the variation of


resampling
One reason you may see differences: multicollinearity
Multicollinearity in the predictors can produce somewhat unstable
solutions for each resample
When the data are slightly changed, the model can drastically
change

The test set is a single, static set of data for verification


The bootstrap estimate of performance may be better with
collinearity
70

Partial Least Squares Regression

71

Solutions for Overdetermined Covariance Matrices


Variable reduction
Try to accomplish this through the pre-processing
steps

Partial least squares (PLS)


Other methods
Apply a generalized inverse
Ridge regression: Adjusts the variance/covariance
matrix so that we can find a unique inverse.
Principal component regression (PCR)
not recommended, but it's a good way to understand PLS
72

Understanding Partial Least Squares:


Principal Components Analysis

PCA seeks to find linear combinations of the


original variables that summarize the maximum
amount of variability in the original data
These linear combinations are often called principal
components or scores.
A principal direction is a vector that points in the
direction of maximum variance.

73

Principal Components Analysis


PCA is inherently an optimization problem, which
is subject to two constraints
1. The principal directions have unit length
2. Either
a.Successively derived scores are uncorrelated to previously
derived scores, OR
b.Successively derived directions are required to be orthogonal
to previously derived directions
In the mathematical formulation, either constraint implies the
other constraint

74

Principal Components Analysis


[Figure: scatter of Predictor 1 vs. Predictor 2 with the first principal direction (Direction 1) and a projected score highlighted]

http://pfizerpedia/index.php/Image:PCAmovie.gif

Mathematically Speaking
The optimization problem defined by PCA can be solved through the following formulation:

$\arg\max_{a} \; \dfrac{\mathrm{Var}\big(a^{T}X\big)}{a^{T}a}$

subject to constraints 2a. or 2b.

Facts
the ith principal direction, $a_i$, is the eigenvector corresponding to the ith largest eigenvalue of $X^{T}X$.
the ith largest eigenvalue is the amount of variability summarized by the ith principal component.
$a_i^{T}X$ are the ith scores


76

PCA Benefits and Drawbacks


Benefits
Dimension reduction
We can often summarize a large percentage of original variability
with only a few directions

Uncorrelated scores
The new scores are not linearly related to each other

Drawbacks
PCA chases variability
PCA directions will be drawn to predictors with the most variability
Outliers may have significant influence on the directions and
resulting scores.

77

Principal Component Regression

Procedure:
1. Reduce dimension of predictors using PCA
2. Regress scores on response
Notice: The procedure is sequential
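
A sketch of principal component regression as the two sequential steps above, using a scikit-learn pipeline (an assumed implementation, not the course's code).

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

pcr = make_pipeline(
    StandardScaler(),          # center and scale first
    PCA(n_components=3),       # 1. reduce dimension with PCA
    LinearRegression(),        # 2. regress the scores on the response
)
pcr.fit(X, y)
print(round(pcr.score(X, y), 3))
```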

78

Principal Component Regression


Dimension reduction is
independent of the objective

[Diagram: Predictor Variables → PCA → PC Scores → MLR → Response Variable]
79

First Principal Direction

80

Relationship of First Direction with Response


[Figure: scatter of the first PCA scores against the response; R2 = 0.001]

PLS History

H. Wold (1966, 1975)


S. Wold and H. Martens (1983)
Stone and Brooks (1990)
Frank and Friedman (1991, 1993)
Hinkle and Rayens (1994)

82

Latent Variable Model


[Diagram: predictors 1-6 feed into latent variables, which in turn feed into responses 1-3]

Note: PLS can handle multiple response variables


83

Comparison with Regression

[Diagram: predictors 1-5 map directly to Response1, as in ordinary regression]

PLS Optimization
(many predictors, one response)

PLS seeks to find linear combinations of the


independent variables that summarize the
maximum amount of co-variability with the
response.
These linear combinations are often called PLS
components or PLS scores.
A PLS direction is a vector that points in the direction
of maximum co-variance.

85

PLS Optimization
(many predictors, one response)

PLS is inherently an optimization problem, which


is subject to two constraints
1. The PLS directions have unit length
2. Either
a.Successively derived scores are uncorrelated to previously
derived scores, OR
b.Successively derived directions are orthogonal to previously
derived directions
Unlike PCA, either constraint does NOT imply the other
constraint
Constraint 2.a. is most commonly implemented
86

Mathematically Speaking
The optimization problem defined by PLS can be solved through the following formulation:

$\arg\max_{a} \; \dfrac{\mathrm{Cov}^2\big(a^{T}X,\, Y\big)}{a^{T}a}$

subject to constraints 2a. or 2b.

Facts
the ith PLS direction, $a_i$, is the eigenvector corresponding to the ith largest eigenvalue of $Z^{T}Z$, where $Z = X^{T}y$.
the ith largest eigenvalue is the amount of co-variability summarized by the ith PLS component.
$a_i^{T}X$ are the ith scores

PLS is Simultaneous Dimension Reduction and Regression

$\arg\max_{a} \dfrac{\mathrm{Cov}^2(a^{T}X,\, Y)}{a^{T}a}
= \arg\max_{a} \dfrac{\mathrm{var}(a^{T}X)\,\mathrm{var}(Y)\,\mathrm{corr}^2(a^{T}X,\, Y)}{a^{T}a}
= \mathrm{var}(Y)\, \arg\max_{a} \dfrac{\mathrm{var}(a^{T}X)\,\mathrm{corr}^2(a^{T}X,\, Y)}{a^{T}a}
= \mathrm{var(response)}\, \arg\max_{a} \dfrac{\mathrm{var(scores)}\,\mathrm{corr}^2(\mathrm{scores,\, response})}{a^{T}a}$
88

PLS is Simultaneous Dimension Reduction and Regression

max  Var(scores) × Corr²(response, scores)

where the Var(scores) term corresponds to dimension reduction (PCA) and the Corr²(response, scores) term corresponds to regression.

PLS Benefits and Drawbacks


Benefit
Simultaneous dimension reduction and regression

Drawbacks
Similar to PCA, PLS chases co-variability
PLS directions will be drawn to independent variables with the most
variability (although this will be tempered by the need to also be
related to the response)
Outliers may have significant influence on the directions, resulting
scores, and relationship with the response. Specifically, outliers can
make it appear that there is no relationship between the
predictors and response when there truly is a relationship, or
make it appear that there is a relationship between the
predictors and response when there truly is no relationship
90

Partial Least Squares

[Diagram: Predictor Variables → PLS → Response Variable; dimension reduction and regression happen simultaneously]

First PLS Direction

92

Relationship of First Direction with Response


[Figure: scatter of the first PLS scores against the response; R2 = 0.93]

PLS in Practice
PLS seeks to find latent variables (LVs) that
summarize variability and are highly predictive of
the response.
How do we determine the number of LVs to
compute?
Evaluate RMSPE (or Q2)

The optimal number of components is the


number of components that minimizes RMSPE
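
A sketch of choosing the number of PLS components by cross-validated RMSPE, using scikit-learn's PLSRegression (implementation details differ from the course's software); data are simulated placeholders.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 12))
y = X[:, :3] @ [1.0, -1.0, 0.5] + rng.normal(size=150)

for ncomp in range(1, 7):
    mse = -cross_val_score(PLSRegression(n_components=ncomp), X, y,
                           cv=10, scoring="neg_mean_squared_error")
    print(ncomp, round(np.mean(mse) ** 0.5, 3))     # cross-validated RMSPE per model size
```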

94

PLS for the Boston housing data:


Training the PLS Model
Since PLS can handle
highly correlated
variables, we fit the model
using all 12 predictors
The model was trained
with up to 6 components
RMSE drops noticeably
from 1 to 2 components
and some for 2 to 3
components.
Models with 3 or more
components might be
sufficient for these data
95

Training the PLS Model


Roughly the same
profile is seen when
the models are judged
on R2

96

Boston Housing Results


Using the two component model, we can predict
the test set
PLS training statistics are similar to those from
linear regression
Both methods perform about the same in the test
set
              Training Data (bootstrap)    Test Data
              RMSE      Q2                 RMSE      R2
Linear Reg    5.23      0.691              4.53      0.742
PLS           5.25      0.689              4.56      0.739

PLS Model Fit Test Set Results

98

PLS Optimization (2)


(many predictors, many responses)

PLS seeks to find linear combinations of the


independent variables and a linear combination
of the dependent variables that summarize the
maximum amount of co-variability between the
combinations.
These linear combinations are often called PLS X-space and Y-space components, or PLS X-space and Y-space scores.
Likewise, X-space and Y-space PLS directions point in the direction of maximum co-variance between the spaces.
99

PLS Optimization (2)


(many predictors, many responses)

PLS is inherently an optimization problem, which


is subject to two constraints
1. The X-space and Y-space PLS directions have unit
length
2. Either
a.Successively derived scores in each space are uncorrelated
to previously derived scores, OR
b.Successively derived directions in each space are orthogonal
to previously derived directions
Constraint 2.a. is most commonly implemented

100

Mathematically Speaking
The optimization problem defined by PLS can be solved through the following formulation:

$\arg\max_{a,\, b} \; \dfrac{\mathrm{Cov}^2\big(a^{T}X,\, b^{T}Y\big)}{\big(a^{T}a\big)\big(b^{T}b\big)}$

subject to constraints 2a. or 2b.

$= \arg\max_{a,\, b} \; \dfrac{\mathrm{var}\big(a^{T}X\big)\,\mathrm{var}\big(b^{T}Y\big)\,\mathrm{corr}^2\big(a^{T}X,\, b^{T}Y\big)}{\big(a^{T}a\big)\big(b^{T}b\big)}$
101

PLS is Simultaneous Dimension Reduction and Regression

max  Var(X-scores) × Corr²(X-scores, Y-scores) × Var(Y-scores)

where Var(X-scores) is X-space dimension reduction (PCA), Corr²(X-scores, Y-scores) is regression, and Var(Y-scores) is Y-space dimension reduction (PCA).

Neural Networks

103

Neural Networks
Like PLS or PCR, these models create
intermediary latent variables that are used to
predict the outcome
Neural networks differ from PLS or PCR in a few
ways
the objective function used to derive the new variables
is different
The latent variables are created using flexible, highly
nonlinear functions
The latent variables usually do not have any meaning
104

Network Structures
There are many types of neural network structures
we will concentrate on the single layer, feed-forward network
[Diagram: predictors 1-5 feed into one hidden layer of latent variables (hidden units 1 through k), which feed into Response1]

From Predictors to Hidden Units


The transition from this
sub-model to the hidden
units is nonlinear
sigmoidal functions, such
as the logistic function, are
typically used

106

From Hidden Units to the Outcome


The hidden units are then
used to predict the
outcome using simple
linear combinations

Clearly, the parameters are not identifiable and


the hidden units have no real meaning (unlike
PCA)
107

Training Networks
It is highly recommended that the predictors are
centered and scaled prior to training
The number of hidden units is a tuning
parameter
With many predictors and hidden units, the
number of estimated parameters can become
very large
with a large number of hidden units, these models can
quickly start to overfit

Random starting values are typically used to


initialize the parameter estimates
108

Weight Decay
This is a training technique that attempts to
shrink the parameter estimates towards zero
large parameter estimates are penalized in the model
training

This leads to smoother, less extreme models


the effect of weight decay is demonstrated for
classification models
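
A sketch of a single-hidden-layer network with weight decay, assuming scikit-learn's MLPRegressor, where the decay penalty is the `alpha` argument (L2 regularization); the data are simulated stand-ins.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.2, size=300)

net = make_pipeline(
    StandardScaler(),                              # center and scale before training
    MLPRegressor(hidden_layer_sizes=(5,),          # number of hidden units (tuning parameter)
                 alpha=0.1,                        # weight decay penalty
                 max_iter=5000, random_state=1),
)
net.fit(X, y)
print(round(net.score(X, y), 3))
```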

109

Boston Housing Data


The model seems to
do well with fewer
components (not
typical)
For these data, larger amounts of weight decay are better for the model fit

110

Boston Housing Results


The final model used a high value for weight decay and 1 hidden unit
This model seems to be an improvement
compared to the others
              Training Data (bootstrap)    Test Data
              RMSE      Q2                 RMSE      R2
Linear Reg    5.23      0.691              4.53      0.742
PLS           5.25      0.689              4.56      0.739
Neural Net    4.60      0.757              4.20      0.780

Support Vector Machines

112

Support Vector Machines (SVMs)


SVMs are predictive statistical models developed
in 1963 by Vapnik that were significantly
expanded in the 90s
These models were initially developed for
classification models, but were later adapted for
regression models

113

Objective Functions
Recall that linear
regression estimates
parameters by
calculating:
the model residuals
the total sum of the
squared residuals (SSR)

The parameters with


the smallest SSR are
optimal
114

Objective Functions
Support vector machine
regression models create a
funnel around the
regression line
residuals within the funnel are
not counted in the parameter
estimation
the sum of the residuals
outside the funnel are used as
the objective function (no
squared term)

A funnel size set to 1 SD of the outcome is not a bad place to start
115

The SVM Model Optimization


Like Huber-type robust
regression, outliers have a
linear effect on the
objective function
Overfitting can be
controlled by using a
penalized objective
function (more later)
Quadratic programming
methods are needed to
solve these equations
116

Support Vectors and Data Reduction


The points that are outside
the funnel (or on its
boundary) are the support
vectors
It turns out that the prediction
function only uses the
support vectors
the prediction equation is more
compact and efficient
the model may be more robust
to outliers

117

Support Vectors and Data Reduction


The model fitting routine produces values (α) that are non-zero for all of the support vectors
To predict a new sample, the original training data for the non-zero α values are needed:

118

Nonlinear Boundaries
Nonlinear boundaries can be computed using the
kernel trick
The predictor space can be expanded by adding
nonlinear functions of the predictors
Common kernel functions are:

119

Nonlinear Boundaries
The trick is that the computations can operate
only on the inner-products of the extended
predictor set

In this way, the predictor space dimension can be


greatly expanded without much computational
impact
120

Cost functions
Support vector machines also include a regularization
parameter that controls how much the regression line can
adapt to the data
smaller values result in more linear (i.e. flat) surfaces

This parameter is generally referred to as Cost


For example, this link shows the effect of the cost function for a highly nonlinear problem
SvmRegMovieA.gif
This one shows the robustness of SVM regression
models
SvmRegMovieB.gif
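
A sketch of epsilon-insensitive SVM regression with an RBF kernel, assuming scikit-learn's SVR; epsilon plays the role of the funnel half-width and C is the cost parameter.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(scale=0.2, size=200)

# epsilon sets the funnel width; C controls how much the surface can adapt to the data
svm = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma="scale").fit(X, y)
print(len(svm.support_), "support vectors define the prediction equation")
```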
121

Boston Housing Data


As previously
mentioned, there is a
way to analytically
estimate the tuning
parameter for the RBF
here, a fixed value of
0.0219 is used

The remaining
parameter (cost) shows
a clear optimum
122

Summary
Currently, the SVM model is best at prediction (but
worst at interpretation)

              Training Data (bootstrap)    Test Data
              RMSE      Q2                 RMSE      R2
Linear Reg    5.23      0.691              4.53      0.742
PLS           5.25      0.689              4.56      0.739
Neural Net    4.60      0.757              4.20      0.780
SVM (radial)  3.79      0.834              3.28      0.861

Multivariate Adaptive Regression Splines

124

Multivariate Adaptive Regression Splines


MARS is a nonlinear statistical model
The model does an exhaustive search across the
predictors (and each distinct value of the
predictor) to find the best way to sub-divide the
data
Based on this split value, MARS creates new
features based on that variable
These artificial features are used to model the
outcome
125

MARS Features
MARS uses hinge functions
that are two connected lines
For a split value c of a predictor x, MARS creates a pair of functions that model the data on each side of c:
$h(x - c) = \max(0,\, x - c)$ and $h(c - x) = \max(0,\, c - x)$

These features are created in


sets of two (switching which
side is zeroed)

[Figure: the paired hinge functions h(x − 6) and h(6 − x) plotted for a split at x = 6]
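
A small sketch of the paired hinge features MARS creates at a split value c = 6, built by hand with NumPy (not a full MARS implementation).

```python
import numpy as np

def hinge_pair(x, c):
    # h(x - c) and h(c - x): each is zero on one side of the split
    return np.maximum(0, x - c), np.maximum(0, c - x)

x = np.linspace(0, 10, 11)
right, left = hinge_pair(x, 6.0)
print(right)   # zero below 6, rises above 6
print(left)    # rises below 6, zero above 6
```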

Prediction Equation and Model Selection


The model iteratively adds the two new features and uses
ordinary regression methods to create a prediction
equation. The process then continues iteratively.
MARS also includes a built-in
feature selection routine that
can remove model terms
the maximum number of retained
features (and the feature degree)
are the tuning parameters

The Generalized Cross-Validation statistic (GCV) is used to select the most important terms
127

Sine Wave Example


As an example, we can use
MARS to model one predictor
with a sinusoidal pattern
The first MARS iteration
produces a split at 4.3
two new features are created
a regression model is fit with
these features
the red line shows the fit

128

Sine Wave Example


On the second iteration, a split
was found at 7.9
two new features are created

However, the model fit on the left


side was already pretty good
one of the new surrogate predictors
was removed by the automatic
feature selection

The model now has three


features

129

Sine Wave Example


The third split occurred at 5.5
Again, only the right-hand
feature was retained in the model
This process would continue until no more important features are found or the user-defined limit is achieved

130

Higher Order Features


Higher degree features
can also be used
two or more hinge functions can be multiplied together to form a new feature
in two dimensions, this
means that three of four
quadrants of the feature can
be zero if some features are
discarded

131

Boston Housing Data


We tried only additive
models
the model could retain
from 4 to 36 model terms

The best model used


18 terms

132

Boston Housing Data


Since the model is additive, we can look at the
prediction profile of each factor while keeping the
others constant

133

Summary
SVMs are still optimal, but the respectable
performance and interpretability of MARS might
make us reconsider
              Training Data (bootstrap)    Test Data
              RMSE      Q2                 RMSE      R2
Linear Reg    5.23      0.691              4.53      0.742
PLS           5.25      0.689              4.56      0.739
Neural Net    4.60      0.757              4.20      0.780
SVM (radial)  3.79      0.834              3.28      0.861
MARS          4.29      0.791              3.98      0.804

Regression Trees

135

Regression Trees
A regression tree searches through each
predictor to find a value of single predictor that
best splits the data into two groups.
the best split minimizes the mean squared error of the
model.

For the two resulting groups, the process is


repeated until a hierarchical structure (a "tree") is
created.
in effect, trees partition the predictor space into
rectangular sections that assign a single average to
compounds within the rectangle.
136

Computational Difficulties
Suppose we have n observations and p
predictors.
For each level of the tree, there are at most p(n-1)
possible splits

As tree depth increases, the number of possible


split combinations multiplies
The total number of possible split combinations is bounded above by [p(n−1)]^depth
Suppose we have 100 observations and 100 dimensions.
The number of possible trees is bounded above by 10^400!
137

A Greedy Approach
Instead of trying to find the best global set of
regions for which the responses are similar, we
recursively partition the data to find an optimal
set of decision rules.
A regression tree searches through each
predictor to find a value of a single predictor that
best splits the data into two groups.

138

Objective at Each Split


Let $[X_{n \times p} \,|\, Y_{n \times 1}]$ represent the data matrix
We seek a predictor, $X_j$, and split point, $s$, that solve:

$\min_{c_1,\, c_2} \left[ \sum_{x_{ij} \in R_1} \big(y_i - c_1\big)^2 + \sum_{x_{ij} \in R_2} \big(y_i - c_2\big)^2 \right]$

where $x_{ij} \in R_1$ if $x_{ij} \le s$ and $x_{ij} \in R_2$ if $x_{ij} > s$, for $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, p$.

The best $c_1$ and $c_2$ are the average responses for the observations in each region
For the two resulting groups, the process is repeated
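
A sketch of a regression tree that recursively applies this split criterion, via scikit-learn; max_depth controls the complexity, and the data are simulated placeholders.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = np.where(X[:, 0] < 5, 2.0, 8.0) + rng.normal(scale=0.5, size=300)

tree = DecisionTreeRegressor(max_depth=4, min_samples_leaf=10).fit(X, y)
# first split: which predictor was used and at what value
print(tree.tree_.feature[0], tree.tree_.threshold[0])
```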
139

Splitting Example Boston Housing


We start with all of the
training data
Searching through all
the data yields the first
split
a lower status value of
9.6% provides the best
decrease in MSE

140

Splitting Example Boston Housing


Searching through the first left split, the best split again uses the lower status %
In the initial right split, the split was based on the mean number of rooms
Now, there are 4
possible predicted
values

141

Tree Fitting Process


This process would continue until some criterion
for stopping is met
such as the minimum number of compounds in a node

The largest possible tree may over-fit


Pruning is the process of iteratively removing
terminal nodes
looking for drops in resampling performance
142

Tree Fitting Process


There are many possible pruning paths
how many possible trees are there with 6 terminal
nodes?

We can index the possible trees by a complexity


parameter, Cp.
Cp = 0 is the largest tree possible
as Cp increases, the tree shrinks
there are a discrete set of Cp values for a data set

Algorithmically, we can control the complexity by


setting the maximum tree depth
143

Comparison
For these data, we tried 6
possible tree sizes
For each value, resample the
data and calculate
performance
After a depth of 4, the model
cannot improve performance

              Training Data (bootstrap)    Test
              RMSE      Q2                 RMSE      R2
Single Tree   5.18      0.700              4.28      0.780

Boston Housing Example


A depth of 4 was optimal (see right-hand branch)
This model has a test
set performance of
0.78
so far the best is 0.86

However, we can
clearly get a sense of
what the model is
saying
145

Single Trees
Advantages
can be computed very quickly and have simple
interpretations.
have built-in predictor selection: if a predictor was not
used in any split, the model is completely independent
of that data.

Disadvantages
instability due to high variance: small changes in the
data can drastically affect the structure of a tree
data fragmentation
high order interactions
146

Ensemble Methods

147

Ensemble Methods
Ensembles of trees have been shown to provide
more predictive models than individual trees and
are less variable than individual trees
Common ensemble methods are:
Bagging
Random forests, and
Boosting

148

Bagging Trees
Bootstrap Aggregation
Breiman (1994, 1996)
Bagging is the process of
1. creating bootstrap samples
of the data,
2. fitting models to each
sample
3. aggregating the model
predictions

The largest possible tree is


built for each bootstrap
sample

149

Bagging Model

Prediction of an observation, x:

$F(x) = \dfrac{1}{M}\sum_{m=1}^{M} f_m(x)$
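
A sketch of bagged regression trees (bootstrap samples, one large tree per sample, predictions averaged), assuming scikit-learn's BaggingRegressor, whose default base learner is a regression tree.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.3, size=300)

# 50 bootstrap samples, one unpruned tree per sample, predictions averaged
bag = BaggingRegressor(n_estimators=50, bootstrap=True, random_state=1).fit(X, y)
print(round(bag.score(X, y), 3))
```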

Comparison
Bagging can significantly increase performance of trees
from resampling:
              Training Data (bootstrap)    Test
              RMSE      Q2                 RMSE      R2
Single Tree   5.18      0.700              4.28      0.780
Bagging       4.32      0.786              3.69      0.825

The cost is computing time and the loss of interpretation


One reason that bagging works is that single trees are
unstable
small changes in the data may drastically change the tree
151

Random Forests
Random forests models are similar to bagging
separate models are built for each bootstrap sample
the largest tree possible is fit for each bootstrap sample

However, when random forests starts to make a


new split, it only considers a random subset of
predictors
The subset size is the (optional) tuning parameter

Random forests defaults to a subset size that is the


square root of the number of predictors and is
typically robust to this parameter
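
A sketch of a random forest with the default subset-size heuristic, assuming scikit-learn, where max_features controls the random subset of predictors considered at each split.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 9))
y = X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(scale=0.3, size=300)

rf = RandomForestRegressor(n_estimators=500,
                           max_features="sqrt",   # subset size = sqrt(number of predictors)
                           oob_score=True, random_state=1)
rf.fit(X, y)
print(round(rf.oob_score_, 3))                    # out-of-bag R^2
```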
152

Random Predictor Illustration

[Diagram: for each of the M bootstrap datasets, a random subset of variables is selected from the original data, a tree is built, and each tree makes a prediction; the individual predictions are combined into the final prediction]

Random Forests Model

Prediction of an observation, x:

$F(x) = \dfrac{1}{M}\sum_{m=1}^{M} f_m(x)$

Properties of Random Forests


Variance reduction
Averaging predictions across many models provides
more stable predictions and model accuracy
(Breiman, 1996)

Robustness to noise
All observations have an equal chance to influence
each model in the ensemble
Hence, outliers have less of an effect on individual
models for the overall predicted values

155

Comparison
Comparing the three methods using resampling:
              Training Data (bootstrap)    Test
              RMSE      Q2                 RMSE      R2
Single Tree   5.18      0.700              4.28      0.780
Bagging       4.32      0.786              3.69      0.825
Rand Forest   3.55      0.857              3.00      0.885

Both bagging and random forests are memoryless


each bootstrap sample doesn't know anything about the other samples

156

Boosting Trees
A method to boost weak learning algorithms
(small trees) into strong learning algorithms
Kearns and Valiant (1989), Schapire (1990), Freund
(1995), Freund and Schapire (1996a)

Boosted trees try to improve the model fit over


different trees by considering past fits

157

Boosting Trees
First, an initial tree model is fit (the size of the
tree is controlled by the modeler, but usually the
trees are small (depth < 8))
if a sample was not predicted well, the model residual
will be different from zero
samples that were predicted poorly in the last tree will
be given more weight in the next tree (and vice-versa)

After many iterations, the final prediction is a weighted average of the predictions from each tree
158

Boosting Illustration
[Diagram: boosting stages 1, 2, ..., M. At each stage a weighted tree is built (for example, splits such as X1 > 5.2, X27 > 22.4, X6 > 0), the error is computed (e.g. Σ eᵢ = 32.9, 26.7, 29.5), a stage weight is computed as a function of that error (stage m = f(error)), and the observations are reweighted (wᵢ, i = 1, 2, ..., n) so that the larger an observation's error, the higher its weight in the next tree.]

Boosting Trees
Boosting has three tuning parameters:
number of iterations (i.e. trees)
complexity of the tree (i.e. number of splits)
learning rate: how quickly the algorithm adapts

This implementation is the most computationally


taxing of the tree methods shown here
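
A sketch of gradient-boosted trees with the three tuning parameters named above, assuming scikit-learn's GradientBoostingRegressor; the data are simulated placeholders.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = np.sin(X[:, 0]) + X[:, 1] + rng.normal(scale=0.3, size=400)

gbm = GradientBoostingRegressor(n_estimators=500,   # number of iterations (trees)
                                max_depth=3,        # complexity of each tree
                                learning_rate=0.1,  # how quickly the algorithm adapts
                                random_state=1)
gbm.fit(X, y)
print(round(gbm.score(X, y), 3))
```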

160

Final Boosting Model

Prediction of an observation, x:

$F(x) = \sum_{m=1}^{M} \beta_m f_m(x)$

where the $\beta_m$ are constrained to sum to 1.

161

Properties of Boosting
Robust to overfitting
As the number of iterations increases, the test set
error does not increase
Schapire, et al. (1998), Friedman, et al. (2000),
Freund, et al. (2001)

Can be misled by noise in the response


Boosting will be unable to find a predictive model if the
response is too noisy.
Krieger, et al. (2002), Wyner (2002), Schapire (2002), Opitz and Maclin (1999)
162

Boosting Trees
One approach to training is
to set the learning rate to a
high value (0.1) and tune
the other two parameters
In the plot to the right, a grid
of 9 combinations of the 2
tuning parameters were
used to optimize the model
The optimal settings were:
500 trees with high complexity

163

Comparison Summary
Comparing the four methods:
              Training Data (bootstrap)    Test
              RMSE      Q2                 RMSE      R2
Single Tree   5.18      0.700              4.28      0.780
Bagging       4.32      0.786              3.69      0.825
Rand Forest   3.55      0.857              3.00      0.885
Boosting      3.64      0.847              3.19      0.870

Model Building Training


Model Comparisons

165

Which Model is Best?


The No Free Lunch Theorem:
over the set of all possible problems, each algorithm
will do on average as well as any other
or, in other words,
if one model is better than another, it is because of the
particular problem at hand; no one method is uniformly
best
Despite this statement, the next slide has some
(subjective) ratings of models
166

Top Level Comparisons

[Table: subjective ratings of each model, on a scale of Excellent, Very Good, Average, Fair, Poor]

Top Level Comparisons

[Table: pre-processing requirements and characteristics for each model]

ZV = zero var predictor, NZV = near-zero var predictor, CS = center+scale, HCP = highly correlated predictor
* Depends on implementation
168

Boston Housing Data

The correlation between the results on the training set


(n=337) via cross-validation and the results from the test
set (n=169) were 0.971 (RMSE) and 0.965 (R2)
169

Some Advice
There is an inverse relationship between
performance and interpretability
We want the best of both worlds: great
performance and a simple, intuitive model

[Diagram: models arranged along an interpretability-performance trade-off — single trees, linear regression, PLS, and MARS are the most interpretable; neural nets, boosted trees, SVMs, random forests, and bagging give the best performance]

Try this:
Fit a high performance model to get an idea of the best possible performance
Move up the line and see if a less complex model can keep performance up with some interpretability

Regression Datasets

171

Internet Movie Database


IMDB is an on-line resource that catalogs movies and TV
programs from many countries.
Basic information about the program is maintained and
users can rate each program on a five point scale.
We extracted information about movies and captured:
the average vote
the number of votes
basic information: run time, rating (if any), year of release, etc
genre: drama, comedy etc and
keywords: based on novel, female lead, title spoken by character

Can we predict the movie rating based on these data?


172

Tecator Spectroscopy Data


From Statlib:
These data are recorded on a Tecator Infratec Food and Feed
Analyzer working in the wavelength range 850 - 1050 nm by the
Near Infrared Transmission (NIT) principle.
Each sample contains finely chopped pure meat with different
moisture, fat and protein contents.
For each meat sample the data consists of a 100 channel
spectrum of absorbances and the contents of moisture (water), fat
and protein.
The absorbance is -log10 of the transmittance measured by the
spectrometer.
The three contents, measured in percent, are determined by
analytic chemistry.
173

Tecator Spectroscopy Data


The variables are spectral
measurements at specific
wavelengths and are
highly autocorrelated.
We wish to predict the
percent fat for each
sample.

174

Towson Home Sales


Information about homes sold in the Towson, Maryland area (north of
Baltimore) were collected.
The area encompasses the northern border of Baltimore city
(Idlewydle), suburban areas (Annelsie, Rodgers Forge, Wiltondale)
and more expensive areas (Stoneleigh, Ruxton).
Variables include:

The lot size

The sale date and

Square footage

The year built

Number of baths

Can we accurately predict the sale price of a home?


175

Regression Backup Slides

176

SVM Model Fit Test Set Results

177

MARS Model Fit Test Set Results

178

Regression Tree Model Fit Test Set Results

179

Boosting Tree Model Fit Test Set Results

180

Variable Importance for PLS


To understand the
importance of each factor,
we can look at a weighted
sum of the absolute
regression coefficients
the weights are based on
the decrease in error as
more components are
added

We can also look at the


loadings to get a more
detailed assessment
181

Variable Importance for MARS


Here, we can look at the
increase in R2 as model
terms are added
If the variable is never
used in a term, it has an
importance of zero

182

Variable Importance for Regression Trees


Here, we can look at the
decrease in MSE as
model terms are added
If the variable is never
used in a split, it has an
importance of zero

183

Variable Importance for Random Forests


A permutation approach is
used
The training data for each variable is scrambled in turn and the % increase in the out-of-bag MSE is tracked

184

Boosting, Formally
Boosting fits a forward stagewise additive model (Hastie, Tibshirani and Friedman, 2001) through the following steps:

1. Let $f_0(x) = 0$
2. For $m = 1, 2, \ldots, M$, do steps a and b:
   a. $(\beta_m, h_m) = \arg\min_{\beta,\, h} \sum_{i=1}^{N} L\big(y_i,\; f_{m-1}(x_i) + \beta\, h(x_i)\big)$, where $\beta \in \mathbb{R}$ and $h$ is a tree.
   b. $f_m(x) = f_{m-1}(x) + \beta_m h_m(x)$
185

Boosting's Underlying Model

ν acts as a shrinkage parameter and is called the learning rate.
a parameter that controls the rate of learning of observations that overlap on a decision boundary (Friedman, 2001)

Shrinkage boosting can be viewed as fitting this additive model:

$f_M(x) = \sum_{m=1}^{M} \beta_m h_m(x)$

where $h_m(x) \in H_d$, and $H_d$ represents a dictionary of trees of depth d. (Hastie, 2001)
186

Linear Regression Pre-Processing


Linear regression models will fail if there are zero-variance predictors included
They will also fail during cross-validation if any near-zero variance predictors are in the data

As just discussed, removing highly correlated


predictors is strongly suggested
Centering and scaling are not required, but can
greatly increase the numerical stability of the
model

187

PLS Pre-Processing
Because of its dimension reduction abilities, PLS
is resistant to zero- and near-zero variance
predictors
Also, since PLS can handle (and perhaps exploit)
correlated predictors, it is not necessary to
remove them
Centering and scaling are extremely important for
PLS models
otherwise, the predictors with large variability can
dominate the selection of components
188

Neural Network Pre-Processing


Neural network models will not fail with zero-variance
predictors
However, these models use a large number of parameters
and near-zero variance predictors may lead to numerical
issues such as a failure to converge
Highly correlated predictors should be removed;
multicollinearity can have a significant effect on model
performance
Centering and scaling are required

189

MARS Pre-Processing
MARS models are resistant to zero- and near-zero
variance predictors
Highly correlated predictors are allowed, but this can lead to a significant amount of randomness during the predictor selection process
The split choice between two highly correlated predictors becomes
a toss-up

Centering and scaling are not required but are suggested

190

Tree Pre-Processing
A basic regression tree requires very little pre-processing
missing predictor values are allowed
centering and scaling are not required
centering and scaling do not affect results

highly correlated predictors are allowed


Including highly correlated descriptors can cause instability
and make descriptor importance rankings somewhat random

zero- and near-zero variance predictors are allowed

191

Model Building Training

Classification-type Models

192

Setting

Response is categorical
Response may have more than two categories
193

Objective

To construct a model of predictors that


can be used to predict a response
Data
Model
Prediction
194

Classification Methods
Discriminant analysis framework
Linear, quadratic, regularized, flexible, and partial least squares
discriminant analysis

Modern classification methods


Tree-based ensemble methods
Boosting and random forests

Neural networks
Support vector machines
k-nearest neighbors
Naive Bayes

Each of these methods seek to find a partitioning of the


data that minimizes classification error
195

Evaluating Classification Model Performance


Like regression models, we desire to understand the predictive ability of a classification model.
We can evaluate a model's performance by using cross-validation or a test set of data.
For regression models, the measure of performance was RMSE (or RMSPE), a function of the deviation of the observed value from the predicted value.
This is a valid measure of performance when the response is continuous, but not when the response is categorical.

Instead, we need a measure of predictive ability that is


appropriate for categorical data.
196

Objective
Minimize classification error (or maximize accuracy)
Determine how well the model prediction agrees with the
actual classification of observations.

                            Actual
Predicted         Active      Inactive      Total
Active            A           B             A + B
Inactive          C           D             C + D
Total             A + C       B + D         N = A + B + C + D

Intuition
An intuitive measure of accuracy is
(A + D) / N
When the actual classes are balanced, this is an
appropriate measure of model performance.

But, this measure produces the same values for


different tables:
                  Actual                                   Actual
Predicted     Active    Inactive          Predicted     Active    Inactive
Active        50        50           vs   Active        95        5
Inactive      50        4850              Inactive      95        4805

Accuracy for both tables is 0.98

Does one table show more agreement than the other?

Another Measure: Kappa


To provide a measure of agreement for unbalanced
tables, Cohen (1960) proposed comparing the observed
agreement to the expected agreement
To compute Kappa, we need
The observed agreement: O = (A + D) / N
The expected agreement:

$E = \dfrac{(A + C)(A + B) + (B + D)(C + D)}{N^2}$

Kappa is defined as: $k = \dfrac{O - E}{1 - E}$
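
A sketch of the accuracy and Kappa calculations for a 2×2 table, following the formulas above with the A, B, C, D layout shown earlier; the two example tables are the ones compared on the next slide.

```python
def kappa_2x2(A, B, C, D):
    N = A + B + C + D
    O = (A + D) / N                                        # observed agreement
    E = ((A + C) * (A + B) + (B + D) * (C + D)) / N ** 2   # expected agreement
    return (O - E) / (1 - E)

print(round(kappa_2x2(50, 50, 50, 4850), 2))   # ~0.49
print(round(kappa_2x2(95, 5, 95, 4805), 2))    # ~0.65
```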


199

Kappa Properties
Generally: −1 ≤ k ≤ 1
values close to 0 indicate poor agreement
values close to 1 indicate near perfect agreement
for complete disagreement, k = -1

Values of 0.4 or above are considered to indicate moderate


agreement, and values of 0.8 or higher indicate excellent
agreement. (Stokes, Davis, and Koch, 2001)

Can be generalized to > 2 classes


k = 0.49                                  k = 0.65
              Active    Inactive                        Active    Inactive
Active        50        50                Active        95        5
Inactive      50        4850              Inactive      95        4805

Note: When the observed classes are balanced, kappa = accuracy


200

Another Measure:
Receiver Operating Characteristic (ROC) Curves
ROC curves can be used to assess a classification model's performance or to compare several models' performance
Building an ROC curve requires that the model
produces a continuous prediction
For each predicted value of the response, we
construct a 2x2 table using the predicted value
as the cutoff.

201

ROC Curves

Terminology:
Sensitivity = True Positive Rate = TP / (TP + FN)
Specificity = True Negative Rate = TN / (FP + TN)

An ROC curve is a plot of 1 − specificity versus sensitivity for each predicted value of the response
false positive rate versus true positive rate

A perfect classification model has both a sensitivity and


specificity of 1.
202
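A rough sketch of how the curve is built, assuming a vector of predicted probabilities and 0/1 class labels (the function name roc_points and the toy data are ours):

    import numpy as np

    def roc_points(prob, actual):
        # for each candidate cutoff, classify samples with predicted probability
        # at or above the cutoff as positive and record (1 - specificity, sensitivity)
        prob, actual = np.asarray(prob), np.asarray(actual)
        points = []
        for cutoff in np.unique(prob):
            pred = (prob >= cutoff).astype(int)
            tp = np.sum((pred == 1) & (actual == 1))
            fn = np.sum((pred == 0) & (actual == 1))
            tn = np.sum((pred == 0) & (actual == 0))
            fp = np.sum((pred == 1) & (actual == 0))
            sens = tp / (tp + fn)          # true positive rate
            spec = tn / (tn + fp)          # true negative rate
            points.append((1 - spec, sens))
        return sorted(points)

    # toy usage with made-up probabilities and labels
    print(roc_points([0.9, 0.8, 0.6, 0.4, 0.2], [1, 1, 0, 1, 0]))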

ROC Example

All observations with predicted probabilities below the cutoff are classified as negative.
203

Classification Model Predictions


Several classification models generate a predicted value
for each class in the original data
PLSDA, FDA, and NN

The class with the largest predicted outcome is the


predicted class
Predictions from the model are generally between 0 and 1, but are
not guaranteed to be within this range.

The softmax technique is used to transform the predicted


outcomes to probability-like values that can be
interpreted as class probabilities
On the [0, 1] scale and add up to 1

204

Softmax Function
Let g_ik be the classification score of the ith
observation into group k.
The probability that the observation is in group k is:

    exp(g_ik) / Σ_p exp(g_ip),   p = 1, ..., K

where K is the total number of groups

205
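A minimal sketch of the calculation in plain NumPy (the raw scores below are made up):

    import numpy as np

    def softmax(scores):
        # transform raw classification scores g_i1, ..., g_iK into
        # probability-like values on [0, 1] that sum to 1
        scores = np.asarray(scores, dtype=float)
        scores = scores - scores.max()        # stabilize the exponentials
        expd = np.exp(scores)
        return expd / expd.sum()

    # e.g. raw scores for three groups
    print(softmax([2.0, 1.0, 0.1]))   # roughly [0.66, 0.24, 0.10]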

Discriminant Models

206

Classical Discriminant Models


These models form a discriminant function that
can be used to classify samples
The discriminant function is a linear function of the
predictors that attempts to maximize the separation
between the groups relative to the within-group variability

This is a latent variable method similar to PLS and


others that we have seen
how the latent variable is created differs between
methods
207

Linear Discriminant Analysis


Assumption: the within group variability is the same for
each group.
For a two-class problem, the classification boundary is a
straight line

The function uses the within-class means and the overall


covariance structure to create the latent variable

Because it uses the covariance matrix, there must be


at least as many compounds as predictors
no zero-variance or linearly dependent predictors

LDA is not optimal for groups separated by curvature


208
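For illustration, a small sketch using scikit-learn's LinearDiscriminantAnalysis on simulated two-class data (the data and class labels are invented for the example):

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # two simulated classes that differ only in their means
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
    y = np.repeat(["inactive", "active"], 50)

    lda = LinearDiscriminantAnalysis()
    lda.fit(X, y)

    # the boundary is a straight line; predictions use the linear discriminant
    print(lda.predict([[1.0, 1.0]]))
    print(lda.predict_proba([[1.0, 1.0]]))   # class probabilities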

Example where LDA works

The plot on the right


shows a three class
example where a linear
method like LDA is most
effective

209

Aside: LDA and Logistic Regression

It turns out that LDA and logistic regression are fitting models that are
very similar
LDA assumes that the predictors are measured with error and that the
classification of the observations is known
LR assumes that the predictors are known and that the classification of
the observations are measured with error

210

Assuming that the response error is Normal, the optimal separating
plane for logistic regression has the same linear form as the LDA
boundary

LDA estimates a large number of parameters and has fairly strict


constraints on the data

Also, logistic models may be more forgiving of skewed predictor


distributions

Example Data
For our example data
set, LDA doesn't do a
very good job since
the boundary is
nonlinear
The linear predictor is
determined to be
(1.18 × Predictor A) −
(0.25 × Predictor B)

211

Aside: LDA and Large Number of Predictors


Some classification models are not drastically
affected by large numbers of predictors
In many cases, a number of predictors will be noise

LDA has the potential to overfit


LDA class probability estimates become more extreme
as the number of predictors becomes large even when
there is no underlying difference

A similar issue occurs in LR


For LR, at some point a random predictor will perfectly
split the classes
212

Aside: LDA and Large Number of Predictors


For example, we simulated a
data set that was complete noise
For a small number of predictors,
the posterior probabilities were
grouped around 0.50
As the number of predictors was
increased, these probabilities
became more and more extreme
(i.e., overly certain)
213

PLS for Discrimination


In regression PLS seeks to find linear
combinations of the original variables
(scores) that are highly correlated with
the response.
For classification problems we can use
PLS to find linear combinations of the
original variables that optimally
separate the data.
Unlike regression, the response for
classification is a binary matrix, with each
column indicating the class of the
observation

214

Response Matrix

1 0 0
1 0 0

0 1 0

0 1 0
0 0 1
0 0 1
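A sketch of the idea using scikit-learn's PLSRegression with the binary response matrix shown above (the simulated data, number of components, and class coding are our own choices for illustration):

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(1)
    X = rng.normal(size=(90, 5))
    classes = np.repeat([0, 1, 2], 30)
    X[:, 0] += classes                       # make the first predictor informative

    # binary response matrix: one column per class, as on the slide above
    Y = np.eye(3)[classes]

    pls = PLSRegression(n_components=2)
    pls.fit(X, Y)

    scores = pls.predict(X)                  # continuous, not guaranteed to be in [0, 1]
    predicted_class = scores.argmax(axis=1)  # the largest predicted outcome wins
    print((predicted_class == classes).mean())

The raw predicted outcomes could also be passed through the softmax function described earlier to obtain probability-like values.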

PLS Optimization
(many predictors, many responses)

Like the regression setting, we must solve an


optimization problem that is subject to
constraints:
1. The X-space and Y-space PLS directions have unit
length
2. Either
a.Successively derived scores in each space are uncorrelated
to previously derived scores, OR
b.Successively derived directions in each space are orthogonal
to previously derived directions

215

Solution:
Same as PLS for Regression

The optimization problem defined by PLS can be
solved through the following formulation:

    argmax over a, b of   Cov(aᵀX, bᵀY)² / (aᵀa bᵀb)

subject to constraints 2a. or 2b.

Because Cov² = Var × Var × Corr², this is equivalent to:

    argmax over a, b of   Var(aᵀX) Var(bᵀY) Corr(aᵀX, bᵀY)² / (aᵀa bᵀb)
216

Facts
Barker and Rayens (2003) showed:
The PLS directions are the eigenvectors of a modified
between-class covariance matrix, B.
Coding of the response matrix does not matter
either g columns or g-1 columns provides the same answer

The constraint in the Y-space does not make sense


Why constrain a response that denotes class membership?

If the Y-space constraint is removed, the PLS


directions are exactly the eigenvectors of the between-class covariance matrix, B.
LDA is optimal if dimension reduction is not necessary
The optimal directions for LDA are the eigenvectors of W⁻¹B.
217

PLS Discriminant Analysis Example 1

The softmax function is used to determine classification boundaries.


218

PLS Discriminant Analysis Example 2

[Figure: class boundaries from PLSDA and LDA for the example data]

219

Quadratic Discriminant Analysis


Assumption: the within group variability is different for
each group.
The decision rule assigns an observation x to the group with the largest score:

    d_k(x) = −0.5 log|Σ_k| − 0.5 (x − μ_k)ᵀ Σ_k⁻¹ (x − μ_k) + log π_k

where μ_k, Σ_k, and π_k are the mean, covariance matrix, and prior probability of group k.

The class with the largest score is the predicted class
The score is a function of the squared distance of each observation from each
group's center

The decision rule depends on the covariance matrix for


each group
220

Quadratic Discriminant Analysis


QDA extends the LDA
model by using quadratic
(i.e., nonlinear) classification
boundaries
However, the data
requirements are more
stringent
at least as many compounds
as predictors in each class
no zero-variance or linearly
dependent predictors
221

Regularized Discriminant Analysis


The method tries to split the difference between LDA and
QDA.
It uses two tuning parameters, gamma and lambda:
gamma controls the correlation assumption for the predictors
as gamma → 1 the model assumes less correlation among the predictors

lambda toggles between linear and quadratic boundaries

gamma = 0 & lambda = 1 → LDA
gamma = 0 & lambda = 0 → QDA

Other combinations of gamma and lambda produce


models that are compromises between LDA and QDA
222
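To make the roles of gamma and lambda concrete, here is a sketch of one common way the regularized group covariance matrix can be formed; the exact formula varies between implementations, and the matrices below are invented for the example:

    import numpy as np

    def rda_covariance(S_k, S_pooled, lam, gamma):
        # lambda moves between the group covariance (QDA, lambda = 0) and the
        # pooled covariance (LDA, lambda = 1); gamma shrinks the result toward a
        # diagonal matrix, weakening the correlation structure as gamma -> 1
        S = lam * S_pooled + (1.0 - lam) * S_k
        p = S.shape[0]
        return (1.0 - gamma) * S + gamma * (np.trace(S) / p) * np.eye(p)

    S_k = np.array([[2.0, 0.8], [0.8, 1.0]])
    S_pooled = np.array([[1.5, 0.3], [0.3, 1.2]])
    print(rda_covariance(S_k, S_pooled, lam=1.0, gamma=0.0))  # pooled covariance, LDA-like
    print(rda_covariance(S_k, S_pooled, lam=0.0, gamma=1.0))  # diagonal, no correlation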

Regularized Discriminant Analysis


To see the effect of changing gamma:
RdaMovieA.gif

To see the effect of changing lambda:


RdaMovieB.gif

We can find the optimal gamma and lambda by


cross-validation

223

Flexible Discriminant Analysis


FDA generalizes LDA to highly nonlinear boundaries
In addition to the original predictors, nonlinear functions of
the predictors are added to the data
This is known as a basis expansion of the original data

This procedure essentially builds a set of one versus all


classification models
a 0/1 outcome is used for each model
the softmax function is used to convert the model output to class
probabilities

224

Flexible Discriminant Analysis


For example, the MARS hinge functions can be
used
For each 0/1 outcome, the best predictor/split of
the data is determined and two hinge functions
are added
Hinge functions are added until a pre-specified
number of terms is reached
Like the MARS model, the number of features is
reduced until the fit begins to suffer
225

FDA Example
FDA uses the MARS procedure to determine new
hinge features
for these data, 3 sets of features were used to
discriminate the classes

226

Modern Classification Methods

227

Classification Trees
Like regression trees, classification trees search
through each predictor to find a value of a single
predictor that splits the data into two (or more)
groups that are more pure than the original
group.
For each partition, each predictor is evaluated at
all possible split points and the best predictor
and split are selected.
Process continues until some criterion for stopping is
met (like minimum number of observations in a node)
228

Splitting Example
[Diagram: an example tree. The root splits on Pred A at Thresh 1; the resulting
nodes split on Pred B at Thresh 2 and on Pred D at Thresh 4; a node deeper in the
tree splits on Pred A again at Thresh 3. Each split sends observations above the
threshold down one branch and observations at or below it down the other.]

229

Impurity Measures
There are several measures for determining the
purity of the split. For a two-class problem, two common
measures are
Misclassification error

Gini index

230

Impurity Measure Definitions

Suppose a split sends a observations of class 1 and c of class 2 to the first node,
and b observations of class 1 and d of class 2 to the second node (n = a + b + c + d):

    p1 = min( a / (a + c), c / (a + c) )
    p2 = min( b / (b + d), d / (b + d) )
    w1 = (a + c) / n,   w2 = (b + d) / n

Misclassification error: w1 p1 + w2 p2
When w1 = w2 = 0.5, ME = 0.5 (p1 + p2)

Gini index: w1 p1 (1 − p1) + w2 p2 (1 − p2)
When w1 = w2 = 0.5, GI = 0.5 (p1 (1 − p1) + p2 (1 − p2))
231
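A small sketch of these calculations (the counts in the two example splits are made up):

    def impurity(a, b, c, d):
        # misclassification error and Gini index for a split that sends
        # (a of class 1, c of class 2) to node 1 and (b, d) to node 2
        n = a + b + c + d
        p1 = min(a / (a + c), c / (a + c))
        p2 = min(b / (b + d), d / (b + d))
        w1, w2 = (a + c) / n, (b + d) / n
        error = w1 * p1 + w2 * p2
        gini = w1 * p1 * (1 - p1) + w2 * p2 * (1 - p2)
        return error, gini

    # a nearly pure split versus a poor one
    print(impurity(a=40, c=2, b=3, d=45))    # low error and Gini
    print(impurity(a=20, c=25, b=23, d=22))  # both measures near their maxima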

Impurity Measure Comparison

232

Simple Example

[Figure: scatter plot of predictors x1 and x2 for the two classes]

In this example a few possible partitions clearly stand out: x1 = 5, x2 = 1.5, or x2 = 7.5
How does each impurity measure rank these partitions?

233

Classification Results

234

Ensemble Methods
Like individual regression trees, single
classification trees
are not optimal classification methods.
have high variability: small changes in the data can
drastically affect the structure of the tree.

Bagging, random forests, and boosting can also


be implemented for classification problems

235

Bagging, Random Forests, and Boosting


Each of these ensemble methods is
implemented in the same way as in regression.
The objective is to minimize misclassification
error
The loss function changes to exponential loss rather
than squared error loss.

Tuning parameters for these methods are the


same as in regression
236
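A sketch using scikit-learn's random forest and gradient boosting classifiers on simulated data (the data and the settings such as the number of trees are illustrative, not recommendations):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 4))
    y = (X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=200) > 1).astype(int)

    # random forest: the main tuning parameter is the number of predictors
    # sampled at each split (max_features), as in regression
    rf = RandomForestClassifier(n_estimators=500, max_features=2, random_state=0).fit(X, y)

    # boosting: number of trees, tree depth, and learning rate, as in regression
    gbm = GradientBoostingClassifier(n_estimators=200, max_depth=2, learning_rate=0.1).fit(X, y)

    print(rf.predict_proba(X[:3]))
    print(gbm.predict_proba(X[:3]))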

Neural Networks
Like PLS, neural networks for classification
translate the classes to a set of binary (zero/one)
variables.
The binary variables are modeled using the
predictors and the softmax technique is used to
make sure that the model outputs behave like
probabilities

237

Fitting Neural Networks


As in regression models, there are two
complexity parameters:
The number of hidden units
The amount of weight decay

The second parameter helps determine the


smoothness of the classification boundaries
For the example data:
nnetMovie.gif

238
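A sketch using scikit-learn's MLPClassifier, where hidden_layer_sizes plays the role of the number of hidden units and alpha plays the role of weight decay (the simulated data and settings are only illustrative):

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(3)
    X = rng.normal(size=(300, 2))
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)   # a nonlinear boundary

    # hidden_layer_sizes ~ number of hidden units; alpha ~ weight decay,
    # which controls how smooth the classification boundary is
    nnet = MLPClassifier(hidden_layer_sizes=(5,), alpha=0.1, max_iter=2000, random_state=0)
    nnet.fit(X, y)
    print(nnet.predict_proba(X[:3]))   # outputs behave like class probabilities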

Support Vector Machines (SVMs)


SVMs for classification use
a completely different
objective function:
the margin

Suppose we have two


predictors and a bunch of
compounds
We may want to classify
compounds as active or
inactive
Let's further suppose that
these two predictors
completely separate these
classes
239

The Margin
There are an infinite
number of straight lines
that we can use to
separate these two
groups
some must be better than
others

The margin is defined


by equally spaced
boundaries on each side
of the line
240

The Margin
To maximize the
margin, we try to make
it as large as possible
without capturing any
compounds

As the margin
increases, the solution
becomes more robust
SVMs maximize the
margin to estimate
parameters
241

Support Vectors and Data Reduction


When the classes overlap, points are allowed within the
margin
the number of points is controlled by a cost parameter

The points that are within the margin (or on its


boundary) are the support vectors
It turns out that the prediction function only uses the
support vectors
the prediction equation is more compact and efficient
the model may be more robust to outliers
242

Nonlinear Boundaries
Similar to regression models, the kernel trick
can be used to generate highly nonlinear class
boundaries
For classification, there are two common kernel
functions
polynomial (3 tuning variables)
radial basis functions (2 parameters)

243
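A sketch using scikit-learn's SVC with a radial basis function kernel (the simulated data and the values of C and gamma are illustrative; in practice both would be chosen by cross-validation):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(4)
    X = rng.normal(size=(250, 2))
    y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1, "active", "inactive")

    # radial basis function kernel: two parameters, the cost C and the kernel width gamma
    svm = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)

    print(len(svm.support_))          # number of support vectors used by the model
    print(svm.predict([[0.0, 0.0]]))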

SVM Example Class Boundary

[Figure: class boundary for an RBF kernel SVM; 79 SVs (31.6%)]

244

The Effect of the Cost Parameter


As the cost parameter is increased, the model will
work very hard to correctly classify the compounds
This can lead to over-fitting

To see the effect of the cost parameter, the link


below shows an animation for a radial basis
function SVM
SvmMovieB.gif

Note that, as the boundary becomes more


complicated, the #SV decreases
The margin is becoming very small

245

Nearest Neighbor Classifiers


To predict the class of a new compound, this
procedure uses the most frequent class of the
closest k neighbors
if a tie, randomly pick from the most frequent classes

k, the number of neighbors, is the tuning


parameter
Since distance is used to define the nearest
points, the predictors should be centered and
scaled

246
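A sketch with scikit-learn, centering and scaling inside a pipeline so the distance calculation is not dominated by one predictor (the simulated data are invented; k = 7 echoes the example on the next slide):

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(5)
    X = rng.normal(size=(150, 3)) * [1, 10, 100]        # predictors on very different scales
    y = (X[:, 0] > 0).astype(int)

    # distances drive the predictions, so center and scale the predictors first
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))
    knn.fit(X, y)
    print(knn.predict(X[:5]))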

Nearest Neighbor Classifiers


For the simulated data,
the model was tuned
across k values from 1 to
20
7 neighbors was found to
be optimal

k-NN class boundaries


tend to be somewhat
jagged but smooth out as
k increases
247

Naïve Bayes
Recall Bayes' theorem:

    P(class | predictors) = P(predictors | class) × P(class) / P(predictors)

Of course, the predictor distributions are usually


multivariate and these probabilities would involve
multidimensional integration
248

Naïve Bayes
In naïve Bayes, a.k.a. Idiot's Bayes, the
relationships between predictors are ignored
i.e., all predictors are treated as uncorrelated

249

Naïve Bayes
Despite this assumption, this model usually is
very competitive, even with strong correlations
How do we estimate continuous predictor
distributions?
parametrically: assume normality and use the sample
mean and variance
non-parametrically: use a nonparametric density
estimator

250
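A sketch of the parametric (normal) version using scikit-learn's GaussianNB on simulated two-predictor data (the class shifts are invented for the example):

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(6)
    # each predictor's distribution is shifted slightly between the classes
    X = np.vstack([rng.normal(-0.5, 1, (100, 2)), rng.normal(0.5, 1, (100, 2))])
    y = np.repeat(["active", "inactive"], 100)

    # parametric version: each predictor is modeled with a normal density, and the
    # per-predictor class probabilities are multiplied together
    nb = GaussianNB().fit(X, y)
    print(nb.predict_proba([[-1.0, -1.0]]))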

Naïve Bayes
For example, looking at only the distribution of
predictor A in our example, we see a slight shift
between the distributions of the predictor for
each class:

251

Naïve Bayes

If a new sample has a


value of predictor A =
-1, it is more likely to
be active
active density ~ 0.40
inactive density ~ 0.17

252

Naïve Bayes
For predictor B, the
inactive probability is much
larger for values between
-0.5 and 0.5
For each predictor, the
distributions are modeled
class probabilities can be
computed for each predictor

The final class probability


is calculated by multiplying
all the probabilities
together
253

A Tale of Two Samples

Per-predictor densities and their product (Total) for two new samples,
both with Predictor A = −1:

                   Sample 1                        Sample 2
             Pred A   Pred B   Total         Pred A   Pred B   Total
Active        0.40     0.14     0.06          0.40     0.30     0.12
Inactive      0.17     0.62     0.10          0.17     0.08     0.01

254

Naïve Bayes and Many Predictors


Like LDA, naïve Bayes
models can overfit when
many noisy predictors are
included in the model
As with LDA, we simulated
noise data and were able
to see class separation
increase as the number of
predictors went up

255

Naïve Bayes Classifiers


Class boundaries for
naïve Bayes models
can show circular or
elliptical islands
Since the predictors
are treated as
uncorrelated, there
cannot be any
diagonal ellipses

256

Example: Prediction of Spam


These data were collected by HP. 4,601 e-mails were
classified as spam or not spam.
Predictor variables are derived from the e-mails and relate to
the frequency of words or characters in the e-mail.
Variables include:
A set of word frequency variables. For example, the variable make
measures the relative frequency of that word in the email
Variables related to numbers: words that start with numbers are
also measured. For example, the variable num415 measures how
often the number 415 appears
Other variables relate to special characters (e.g. the variable
charExclamation) or capital letters (capitalAve)
257

Example: Prediction of Spam


We would like to classify e-mails as being spam with an
emphasis on high specificity, i.e., a low probability of non-spam being labeled as spam
For training, an 80% split was used via stratified random
sampling

258
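A sketch of such a split using scikit-learn's train_test_split with stratified sampling (the arrays below are random stand-ins for the real spam predictors and labels, and the 57 predictor columns are only a placeholder):

    import numpy as np
    from sklearn.model_selection import train_test_split

    # stand-ins for the spam predictors and the spam / nonspam labels
    X = np.random.rand(4601, 57)
    y = np.random.choice(["spam", "nonspam"], size=4601, p=[0.4, 0.6])

    # 80% of the data for training, sampled within each class (stratified)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.80, stratify=y, random_state=0)
    print(len(X_train), len(X_test))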

Method Comparison

259

Method Comparison

260

ROC Comparison

261

Classification Datasets

262

Glaucoma Data
62 variables are derived from a confocal laser scanning
image of the optic nerve head, describing its morphology.
Observations are from normal and glaucomatous eyes.
Examples of variables are:
as: superior area
vbss: volume below surface temporal
mhcn: mean height contour nasal
vari: volume above reference inferior, etc

We would like to predict whether a subject has glaucoma


given their imaging data
263

Predicting Diabetes in Pima Indians


These data are from Pima Indian women living in Arizona.
Several variables were collected, such as:
pregnant: number of pregnancies
glucose: plasma glucose levels
pressure: diastolic BP
triceps: skin fold thickness
insulin: serum insulin
mass: body mass index
pedigree: diabetic pedigree function
age
diabetes: negative or positive

We would like to predict a new Indian woman's diabetic
status given her other information.
264

Classification Backup Slides

265

FDA Pre-Processing
FDA models often use the MARS hinge functions, so they
share similar properties.
FDA models are resistant to zero- and near-zero variance
predictors
Highly correlated predictors are allowed, but this can lead
to a significant amount of randomness during the predictor
selection process
The split choice between two highly correlated predictors becomes
a toss-up

Centering and scaling are not required but are suggested

266

Tree Pre-Processing
Same as for regression
missing predictor values are allowed
centering and scaling are not required
centering and scaling do not affect results

highly correlated predictors are allowed


Including highly correlated predictors can cause
instability and make predictor importance rankings
somewhat random

zero- and near-zero variance predictors are


allowed
267

RDA Pre-Processing
RDA models cannot deal with zero- and near-zero
variance predictors
they must be removed

Highly correlated predictors are allowed, but not


suggested
However, perfectly correlated predictors will cause the model to fail

Centering and scaling are not required but are suggested


Additionally, there cannot be linear dependencies between
predictors

268

Neural Network Pre-Processing


Neural network models will not fail with zero-variance
predictors
However, these models use a large number of parameters
and near-zero variance predictors may lead to numerical
issues such as a failure to converge
Highly correlated predictors should be removed.
Centering and scaling are required

269

Nearest Neighbor Pre-Processing


These models are resistant to zero- and near-zero
variance predictors as well as highly correlated predictors
Centering and scaling are required

270

Naïve Bayes Pre-Processing


These models will not fail with zero-variance predictors
Highly correlated predictors are also allowed.
Centering and scaling are not required

271

Model Building Training


Other Considerations

272

Variables to Select
Variables thought to be related to the response
should be included in the model
Sometimes we don't know whether a set of variables is
related to the response
Should these be included in the analysis?
If the variables are not related to the response,
then we are including noise into our predictor set
What happens to the performance of the
techniques when noise is added?
Can we still find signal?
273

Illustration
To the blood brain barrier data of Mente and Lombardo
(2005), we have added 10, 50, 100, and 200 random
predictors
For each of these new data sets, we have built each
regression model, using cross-validation to determine the
optimal parameter settings
The results are on the following slides
Keep in mind that these results are for one example
Methods may have different rankings for other examples

274

Performance Comparison

[Figure: R², estimated by cross-validation on the training set, for each model
versus the number of added noise predictors (0 to 200); the vertical axis runs
from 0 to 0.5]

275

Performance Comparison

[Figure: R² on the test set for each model versus the number of added noise
predictors (0 to 200); the vertical axis runs from 0.1 to 0.5]

276

Variables to Select
Hopefully, we've demonstrated that resampling is
a good way to avoid over-fitting
Realize that predictor selection is part of the
modeling process
Doing predictor selection outside of cross-validation can lead to severe predictor selection
bias
and potential over-fitting (but you won't know until you
evaluate a test set)

277

Effects of Categorizing a Continuous Response


A majority of responses are measured on a continuous
scale
The continuous scale allows us to compare observations
on their original scale
Sometimes the continuous response naturally falls into
two or more modes
If the relative distance between these modes is not relevant, then
the response can be binned
However, if the distance between modes is relevant, then we lose
information by binning the response

Binning a continuous response that does not have natural


modes will make us lose even more information and will
degrade model performance

278

Thanks
Thanks for sitting through all this
More thanks to:
Benevolent overlords David Potter and Ed
Kadyszewski
Nathan Coulter and Gautam Bhola for computing
support
Pfizer Chemistry for feedback on earlier versions of
this training

279
