
Model Building Training

Max Kuhn
Kjell Johnson
Global Nonclinical Statistics

Overview
Typical data scenarios
Examples we'll be using

General approaches to model building


Data pre-processing
Regression-type models
Classification-type models
Other considerations

Typical Data

Response may be continuous or categorical


Predictors may be
continuous, count, and/or binary
dense or sparse
observed and/or calculated

Predictive Models
What is a predictive model?
A model whose primary purpose is for prediction
(as opposed to inference)

We would like to know why the model works, as


well as the relationship between predictors and
the outcome, but these are secondary
Examples: blood-glucose monitoring, spam
detection, computational chemistry, etc.
4

What Are They Not Good For?


They are not a substitute for subject-specific knowledge
Science: Hard

(yikes)

Models: Easy

(let's do these instead!)

To make a good model that predicts well on


future samples, you need to know a lot about
Your predictors and how they relate to each other
The mechanism that generated the data (sampling, technology, etc.)

What Are They Not Good For?


An example:
An oncologist collects some data from a small clinical
trial and wants a model that would use gene expression
data to predict therapeutic response (beneficial or not)
in 4 types of cancer
There were about 54K predictors and data was
collected on ~20 subjects

If there is a lot of knowledge of how the therapy


works (pathways etc), some effort must be put into
using that information to help build the model
6

The Big Picture

"In the end, [predictive modeling] is not a substitute for intuition, but a complement."
— Ian Ayres, in Super Crunchers

References
"Statistical Modeling: The Two Cultures" by Leo Breiman (Statistical Science, Vol. 16, No. 3 (2001), 199-231)
The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman
Regression Modeling Strategies by Harrell
Super Crunchers by Ayres
8

Regression Methods
Multiple linear regression
Partial least squares
Neural networks
Multivariate adaptive regression splines
Support vector machines
Regression trees
Ensembles of trees:
Bagging, boosting, and random forests
9

Classification Methods
Discriminant analysis framework
Linear, quadratic, regularized, flexible, and partial least squares
discriminant analysis

Modern classification methods


Classification trees
Ensembles of trees
Boosting and random forests

Neural networks
Support vector machines
k-nearest neighbors
Naive Bayes
10

Interesting Models We Don't Have Time For


L1 Penalty methods
The lasso, the elastic net, nearest shrunken centroids

Other Boosted Models


linear models, generalized additive models, etc

Other Models:
Conditional inference trees, C4.5, C5, Cubist, other tree models
Learning vector quantization
Self-organizing maps
Active learning techniques
11

Example Data Sets

12

Boston Housing Data


This is a classic benchmark data set for regression. It
includes housing data for 506 census tracts of Boston
from the 1970 census.

crim: per capita crime rate
indus: proportion of non-retail business acres per town
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox: nitric oxides concentration
rm: average number of rooms per dwelling
age: proportion of owner-occupied units built prior to 1940
dis: weighted distances to five Boston employment centers
rad: index of accessibility to radial highways
tax: full-value property-tax rate
ptratio: pupil-teacher ratio by town
b: proportion of minorities
medv: median value of homes (outcome)

Toy Classification Example

A simulated data set will be


used to demonstrate
classification models
two predictors with a correlation
coefficient of 0.5 were simulated
two classes were simulated
(active and inactive)

A probability model was used to


assign a probability of being
active to each sample
the 25%, 50% and 75%
probability lines are shown on
the right

14

Toy Classification Example


The classes were randomly
assigned based on the probability
The training data had 250
compounds (plot on right)
the test set also contained 250
compounds

With two predictors, the class


boundaries can be shown for
each model
this can be a significant aid in
understanding how the models
work
but we acknowledge how
unrealistic this situation is

15

Model Building Training

General Strategies

16

Objective

To construct a model of predictors that


can be used to predict a response
Data
Model
Prediction
17

Model Building Steps


Common steps during model building are:
estimating model parameters (i.e. training models)
determining the values of tuning parameters that
cannot be directly calculated from the data
calculating the performance of the final model that will
generalize to new data

The modeler has a finite amount of data, which


they must "spend" to accomplish these steps
How do we spend the data to find an optimal model?

18

Spending Data
We typically spend data on training and test data sets
Training Set: these data are used to estimate model parameters
and to pick the values of the complexity parameter(s) for the
model.
Test Set (aka validation set): these data can be used to get an
independent assessment of model efficacy. They should not be
used during model training.

The more data we spend, the better estimates we'll get (provided the data is accurate). Given a fixed amount of data,
too much spent in training won't allow us to get a good assessment of predictive performance. We may find a model that fits the training data very well, but is not generalizable (over-fitting)
too much spent in testing won't allow us to get a good assessment of model parameters
19

Methods for Creating a Test Set


How should we split the data into a training and
test set?
Often, there will be a scientific rationale for the split
and in other cases, the splits can be made
empirically.
Several empirical splitting options:
completely random
stratified random
maximum dissimilarity in predictor space
20

Creating a Test Set: Completely Random Splits

A completely random (CR) split randomly partitions the


data into a training and test set
For large data sets, a CR split has very low bias towards
any characteristic (predictor or response)
For classification problems, a CR split is appropriate for
data that is balanced in the response
However, a CR split is not appropriate for unbalanced
data
A CR split may select too few observations (and perhaps none) of
the less frequent class into one of the splits.

21

Creating a Test Set: Stratified Random Splits


A stratified random split makes a random split
within stratification groups
in classification, the classes are used as strata
in regression, groups based on the quantiles of the
response are used as strata

Stratification attempts to preserve the distribution


of the outcome between the training and test
sets
A SR split is more appropriate for unbalanced data
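
A minimal sketch of a stratified random split, assuming scikit-learn in Python (not the course's own software); the data here are simulated stand-ins.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                                    # hypothetical predictors
y = rng.choice(["active", "inactive"], size=1000, p=[0.1, 0.9])   # unbalanced classes

# stratify=y preserves the class proportions in both the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1)
```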
22

Over-Fitting
Over-fitting occurs when a model has extremely good
prediction for the training data but predicts poorly when
the data are slightly perturbed
new data (i.e. test data) are used

Complex regression and classification models assume


that there are patterns in the data.
Without some control many models can find very intricate
relationships between the predictor and the response
These patterns may not be valid for the entire population.

23

Over-Fitting Example
The plots below show classification boundaries for two models built on the same data
one of them is over-fit

[Figure: class boundaries for each model plotted over Predictor A and Predictor B]

Over-Fitting in Regression
Historically, we evaluate the quality of a
regression model by its mean squared error.
Suppose that our prediction function is
parameterized by some vector

25

Over-Fitting in Regression
MSE can be decomposed into three terms:

$\mathrm{MSE} = \sigma^2 + \mathrm{Bias}^2\big(\hat{f}\big) + \mathrm{Var}\big(\hat{f}\big)$

irreducible noise
squared bias of the estimator from its expected value
the variance of the estimator

The bias and variance are inversely related


as one increases, the other decreases
different rates of change

26

Over-Fitting in Regression
When the model under-fits,
the bias is generally high and
the variance is low
Over-fitting is typically
characterized by high
variance, low bias estimators
In many cases, small
increases in bias result in
large decreases in variance
27

Over-Fitting in Regression
Generally, controlling the MSE yields a good
trade-off between over- and under-fitting
a similar statement can be made about classification
models, although the metrics are different (i.e. not
MSE)

How can we accurately estimate the MSE from


the training data?
the naive MSE from the training data can be a very
poor estimate

Resampling can help estimate these metrics


28

How Do We Estimate Over-Fitting?


Some models have specific knobs to control
over-fitting
neighborhood size in nearest neighbor models is an
example
the number of splits in a tree model

Often, poor choices for these parameters can


result in over-fitting
Resampling the training compounds allows us
to know when we are making poor choices for the
values of these parameters
29

How Do We Estimate Over-Fitting?


Resampling only affects the training data
the test set is not used in this procedure

Resampling methods try to embed variation in


the data to approximate the model's performance
on future compounds
Common resampling methods:
K-fold cross validation
Leave group out cross validation
Bootstrapping
30

K-fold Cross Validation


Here, we randomly split the data into K blocks of
roughly equal size
We leave out the first block of data and fit a
model.
This model is used to predict the held-out block
We continue this process until we've predicted all
K hold-out blocks
The final performance is based on the hold-out
predictions
31

K-fold Cross Validation


The schematic below shows the process for K = 3
groups.
K is usually taken to be 5 or 10
leave one out cross-validation has each sample as a
block
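
A sketch of K-fold cross-validation (K = 3 here) using scikit-learn; the model and data are placeholders, not the course's actual example.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X @ [1.0, -2.0, 0.5, 0.0] + rng.normal(size=120)

fold_rmse = []
for train_idx, hold_idx in KFold(n_splits=3, shuffle=True, random_state=1).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[hold_idx])                 # predict the held-out block
    fold_rmse.append(mean_squared_error(y[hold_idx], pred) ** 0.5)

print(np.mean(fold_rmse))                             # performance from the hold-out predictions
```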

32

Leave Group Out Cross Validation


A random proportion
of data (say 80%) are
used to train a model
The remainder is
used to predict
performance
This process is
repeated many times
and the average
performance is used
33

Bootstrapping
Bootstrapping takes a random sample with
replacement
the random sample is the same size as the original data
set
compounds may be selected more than once
each compound has a 63.2% chance of showing up at
least once

Some samples won't be selected


these samples will be used to predict performance

The process is repeated multiple times (say 30)
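
A sketch of bootstrap resampling for performance estimation, assuming NumPy and scikit-learn; the simple linear model stands in for whatever model is being assessed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ [2.0, 0.0, -1.0] + rng.normal(size=200)

rmse = []
n = len(y)
for _ in range(30):                                   # repeat the process (say 30 times)
    boot = rng.integers(0, n, size=n)                 # sample with replacement, same size as the data
    out = np.setdiff1d(np.arange(n), boot)            # samples never selected ("out-of-bag")
    fit = LinearRegression().fit(X[boot], y[boot])
    rmse.append(mean_squared_error(y[out], fit.predict(X[out])) ** 0.5)

print(np.mean(rmse))                                  # average held-out performance
```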


34

The Bootstrap
With bootstrapping,
the number of held-out samples is
random
Some models, such
as random forest, use
bootstrapping within
the modeling process
to reduce over-fitting

35

Training Models with Tuning Parameters


A single training/test split is
often not enough for models
with tuning parameters
We must use resampling
techniques to get good
estimates of model
performance over multiple
values of these parameters
We pick the complexity
parameter(s) with the best
performance and re-fit the
model using all of the data
36

Simulated Data Example


Let's fit a nearest neighbors model to the
simulated classification data.
The optimal number of neighbors must be chosen
If we use leave group out cross-validation and set
aside 20%, we will fit models to a random 200
samples and predict 50 samples
30 iterations were used

We'll train over 11 odd values for the number of


neighbors
we also have a 250 point test set
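
A roughly analogous sketch: tune the number of neighbors with leave-group-out cross-validation (ShuffleSplit) and a grid search in scikit-learn. The simulated two-predictor data are only a stand-in for the toy example.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(250, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=250) > 0).astype(int)

cv = ShuffleSplit(n_splits=30, test_size=0.2, random_state=1)   # 30 iterations, hold out 20%
grid = {"n_neighbors": list(range(1, 23, 2))}                   # 11 odd values of k
search = GridSearchCV(KNeighborsClassifier(), grid, cv=cv, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))        # resampled accuracy for the best k
```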
37

Toy Data Example


The plot on the right shows the
classification accuracy for each
value of the tuning parameter
The grey points are the 30
resampled estimates
The black line shows the average
accuracy
The blue line is the 250 sample
test set

It looks like 7 or more


neighbors is optimal with an
estimated accuracy of 86%

38

Toy Data Example


What if we didn't resample
and used the whole data
set?
The plot on the right
shows the accuracy
across the tuning
parameters
This would pick a model
that over-fits and has
optimistic performance
39

Model Building Training

Data Pre-Processing

40

Why Pre-Process?
In order to get effective and stable results, many
models require certain assumptions about the
data
this is model dependent

We will list each model's pre-processing


requirements at the end
In general, pre-processing rarely hurts model
performance, but could make model
interpretation more difficult

41

Common Pre-Processing Steps


For most models, we apply three pre-processing
procedures:
Removal of predictors with variance close to zero
Elimination of highly correlated predictors
Centering and scaling of each predictor

42

Zero Variance Predictors


Most models require that each predictor have at
least two unique values
Why?
A predictor with only one unique value has a variance
of zero and contains no information about the
response.

It is generally a good idea to remove them.

43

Near Zero Variance Predictors


Additionally, if the distributions of the predictors
are very sparse,
this can have a drastic effect on the stability of the
model solution
zero variance descriptors could be induced during
resampling

But what does a near zero variance predictor


look like?

44

Near Zero Variance Predictor


There are two conditions for an NZV predictor
a low number of possible values, and
a high imbalance in the frequency of the values

For example, a low number of possible values


could occur by using fingerprints as predictors
only two possible values can occur (0 or 1)

But what if there are 999 zero values in the data


and a single value of 1?
this is a highly unbalanced case and could be trouble
45

NZV Example
In computational chemistry we
created predictors based on
structural characteristics of
compounds.
As an example, the descriptor
nR11 is the number of 11-member rings
The table to the right is the
distribution of nR11 from a
training set
the distinct value percentage is
5/535 = 0.0093
the frequency ratio is 501/23 = 21.8
46

[Table: distribution of nR11 ("# 11-member rings") in the training set — the most common value appears 501 times; the second most common appears 23 times]

Detecting NZVs
Two criteria for detecting NZVs are the
Discrete value percentage
Defined as the number of unique values divided by the number of
observations
Rule-of-thumb: discrete value percentage < 20% could indicate a
problem

Frequency ratio
Defined as the frequency of the most common value divided by the
frequency of the second most common value
Rule-of-thumb: > 19 could indicate a problem

If both criteria are violated, then eliminate the predictor
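
A sketch of the two near-zero variance criteria for a single predictor, written with NumPy; the thresholds follow the rules of thumb above, and the example vector is hypothetical.

```python
import numpy as np

def near_zero_variance(x, freq_cut=19.0, unique_cut=0.20):
    values, counts = np.unique(x, return_counts=True)
    counts = np.sort(counts)[::-1]
    freq_ratio = counts[0] / counts[1] if len(counts) > 1 else np.inf
    pct_unique = len(values) / len(x)              # discrete value percentage
    return freq_ratio > freq_cut and pct_unique < unique_cut

x = np.array([0] * 501 + [1] * 23 + [2, 3, 4] * 4)  # hypothetical sparse count predictor
print(near_zero_variance(x))                         # True -> flag for removal
```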


47

Highly Correlated Predictors

Some models can be negatively affected by


highly correlated predictors
certain calculations (e.g. matrix inversion) can become
severely unstable

How can we detect these predictors?


Variance inflation factor (VIF) in linear regression

or, alternatively
1. Compute the correlation matrix of the predictors
2. Predictors with (absolute) pair-wise correlations above
a threshold can be flagged for removal
3. Rule-of-thumb threshold: 0.85
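
A sketch of a pair-wise correlation filter at the 0.85 threshold; this simple greedy version is an assumption, not necessarily the exact algorithm used in the course.

```python
import numpy as np

def find_correlated(X, cutoff=0.85):
    corr = np.abs(np.corrcoef(X, rowvar=False))
    drop = set()
    p = corr.shape[0]
    for i in range(p):
        for j in range(i + 1, p):
            if corr[i, j] > cutoff and i not in drop and j not in drop:
                drop.add(j)                     # flag one member of each offending pair
    return sorted(drop)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 3] = X[:, 0] + rng.normal(scale=0.05, size=100)   # make two columns highly correlated
print(find_correlated(X))                               # e.g. [3]
```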
48

Highly Correlated Predictors and Resampling
Recall that resampling slightly perturbs the
training data set to increase variation
If a model is adversely affected by high
correlations between predictors, the resampling
performance estimates can be poor in
comparison to the test set
In this case, resampling does a better job at predicting
how the model works on future samples

49

Centering and Scaling


Standardizing the predictors can greatly improve
the stability of model calculations.
More importantly, there are several models (e.g.
partial least squares) that implicitly assume that
all of the predictors are on the same scale
Apart from the loss of the original units, there is
no real downside of centering and scaling
50

Model Building Training


Regression-type Models

51

Setting

Response is continuous

52

Objective

To construct a model of predictors that


can be used to predict a response
Data
Model
Prediction
53

Regression Methods
Multiple linear regression
Partial least squares
Neural networks
Multivariate adaptive regression splines
Support vector machines
Regression trees
Ensembles of trees:
Bagging, boosting, and random forests

Each of these methods seek to find a relationship


between the predictors and response that minimizes
error between the observed and predicted response
54

Additive Models
In the beginning there were linear models:

$E[Y] = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$

And Nelder and Wedderburn (1972) said, "Let there be Generalized Linear Models":

$g(E[Y]) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$
and link functions appeared.

And Hastie and Tibshirani (1990) said, "Let there be Generalized Additive Models":

$E[Y] = f_0 + f_1(X_1) + \cdots + f_p(X_p)$

and scatterplot smoothers and backfitting


algorithms appeared.

Families of Additive Models

[Diagram: families of additive models arranged by increasing flexibility — GLM, PLS, GAM, recursive partitioning (trees), multivariate adaptive regression splines*, bagging, boosting, random forests, neural nets, support vector machines*]

* Additivity depends on model parameters

Assessing Model Performance

57

Assessing Model Performance


How well does a regression model perform? Answering this
question depends on how we want to use the model.
Possible goals are:
To understand the relationship between the predictor and the
response.
To use the model to predict future observations' responses.

In either case, we can use several different measures to evaluate model performance. We will focus on two:
Coefficient of determination (R2)
Root mean square error (RMSE)

However, the set of data that we use to evaluate


performance will change depending on our purpose.
58

Which Set of Data to Use to Evaluate Performance?


If we are only interested in understanding the underlying
relationship between the predictor and the response, then
we can compute R2 and RMSE on the data for which the
model was built (i.e. the training data).
However, these values will be overly optimistic of the model's
ability to predict future observations.

If we are interested in understanding the model's ability to


predict future observations, then we need to compute R2
and RMSE on data for which the model was not built (i.e.
a test set or cross-validation set).
For a held-out set of data, R2 is commonly referred to as Q2 and
RMSE is commonly referred to as root mean squared prediction
error (RMSPE)

59

Root Mean Squared Error (RMSE) and Root Mean Squared Prediction Error (RMSPE)

RMSE measures the average deviation of an observation to the best-fit plane:

$\mathrm{RMSE} = \sqrt{\dfrac{SSE}{n - p - 1}}$

RMSPE measures the average deviation of an observation to its predicted value for the test or cross-validation set:

$\mathrm{RMSPE} = \sqrt{\dfrac{1}{n^*}\sum_{i=1}^{n^*} \big(y_i - \hat{y}_i\big)^2}$

n* = the number of observations in the test or cross-validation set

Computing Q2
Process:
Partition the data into
a training and testing set, or
blocks to be used for training and testing

Build the model on the training data and predict the


testing data

Q2 = R2 of the relationship between the observed


and predicted values for the testing data.
61

Multiple Linear Regression: A Quick Review

62

Multiple Linear Regression

Objective: Find the plane through the data that minimizes the
sum-of-squares error.
63

The Best Plane


To find the best plane, we solve:

$\min_{\beta} \; \lVert Y - X\beta \rVert^2$

where $Y$ is $n \times 1$, $X$ is $n \times (p+1)$, and $\beta$ is $(p+1) \times 1$

The best $\hat{\beta}$ is:

$\hat{\beta} = (X^{T}X)^{-1}X^{T}Y$
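
A sketch of the least-squares solution via the normal equations in NumPy, for illustration only; in practice a QR- or SVD-based solver (e.g. np.linalg.lstsq) is preferred, since inverting X'X fails when predictors are collinear.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # include an intercept column
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y                  # (X'X)^{-1} X'Y
print(np.round(beta_hat, 2))
```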

Aside: A Bit More About (XTX)


(XTX) is a critical matrix for many statistical
modeling techniques
A few fun facts
(XTX) is proportional to the covariance matrix, S
S contains the variances and covariances of all
predictors
Techniques that depend on (XTX) also require that it is
invertible

65

Assumptions: Diagnostic Plots

66

When Does Regression Fail?


When a plane does not capture the structure in the data
When the variance/covariance matrix is overdetermined
Recall, the plane that minimizes SSE is:

$\hat{\beta} = (X^{T}X)^{-1}X^{T}Y$

To find the best plane, we must compute the inverse of the


variance/covariance matrix
The variance/covariance matrix is not always invertible. Two
common conditions that cause it to be non-invertible are:
Two or more of the predictors are correlated (multicollinearity)
There are more predictors than observations
67

A (Trivial) Example of Multicollinearity


Suppose that we have one observation (3,5), and we wish to find the best line for the
data. In this example, the number of observations (1) is less than the number of
parameters (2: slope and intercept). When the number of parameters is greater than
the number of observations, we can find an infinite number of best solutions.
[Figure: three different "best" lines through the single point (3, 5)]

In the presence of multicollinearity, the best solution will be unstable.

68

Boston Housing Data


Let's use a linear regression model to predict the median
house price in Boston.
Process:
Split the data into a training set (n = 337) and testing set (n = 169)
For the training set, use the bootstrap to determine the RMSPE
and Q2
For the test data determine RMSPE and Q2

If the underlying model is stable, the values of RMSPE


and Q2 should be similar between the bootstrap and
testing data

69

Results

              Training Data (bootstrap)    Test Data
              RMSE      Q2                 RMSE      R2
Linear Reg    5.23      0.691              4.53      0.742

The results are fairly similar, at least within the variation of


resampling
One reason you may see differences: multicollinearity
Multicollinearity in the predictors can produce somewhat unstable
solutions for each resample
When the data are slightly changed, the model can drastically
change

The test set is a single, static set of data for verification


The bootstrap estimate of performance may be better with
collinearity
70

Partial Least Squares Regression

71

Solutions for Overdetermined Covariance Matrices


Variable reduction
Try to accomplish this through the pre-processing
steps

Partial least squares (PLS)


Other methods
Apply a generalized inverse
Ridge regression: Adjusts the variance/covariance
matrix so that we can find a unique inverse.
Principal component regression (PCR)
not recommended, but it's a good way to understand PLS
72

Understanding Partial Least Squares:


Principal Components Analysis

PCA seeks to find linear combinations of the


original variables that summarize the maximum
amount of variability in the original data
These linear combinations are often called principal
components or scores.
A principal direction is a vector that points in the
direction of maximum variance.

73

Principal Components Analysis


PCA is inherently an optimization problem, which
is subject to two constraints
1. The principal directions have unit length
2. Either
a.Successively derived scores are uncorrelated to previously
derived scores, OR
b.Successively derived directions are required to be orthogonal
to previously derived directions
In the mathematical formulation, either constraint implies the
other constraint

74

Principal Components Analysis


[Figure: scatter of Predictor 1 vs. Predictor 2 with the first principal direction (Direction 1) and a projected score highlighted]

http://pfizerpedia/index.php/Image:PCAmovie.gif

Mathematically Speaking
The optimization problem defined by PCA can be solved through the following formulation:

$\arg\max_{a} \; \dfrac{\mathrm{Var}\big(a^{T}X\big)}{a^{T}a}$

subject to constraints 2a. or 2b.

Facts
the ith principal direction, $a_i$, is the eigenvector corresponding to the ith largest eigenvalue of $X^{T}X$.
the ith largest eigenvalue is the amount of variability summarized by the ith principal component.
$a_i^{T}X$ are the ith scores


76

PCA Benefits and Drawbacks


Benefits
Dimension reduction
We can often summarize a large percentage of original variability
with only a few directions

Uncorrelated scores
The new scores are not linearly related to each other

Drawbacks
PCA chases variability
PCA directions will be drawn to predictors with the most variability
Outliers may have significant influence on the directions and
resulting scores.

77

Principal Component Regression

Procedure:
1. Reduce dimension of predictors using PCA
2. Regress scores on response
Notice: The procedure is sequential
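
A sketch of principal component regression as the two sequential steps above, using a scikit-learn pipeline (an assumed implementation, not the course's code).

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

pcr = make_pipeline(
    StandardScaler(),          # center and scale first
    PCA(n_components=3),       # 1. reduce dimension with PCA
    LinearRegression(),        # 2. regress the scores on the response
)
pcr.fit(X, y)
print(round(pcr.score(X, y), 3))
```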

78

Principal Component Regression


Dimension reduction is
independent of the objective

[Diagram: Predictor Variables → PCA → PC Scores → MLR → Response Variable]
79

First Principal Direction

80

Relationship of First Direction with Response


[Figure: scatter of the first PCA scores against the response; R2 = 0.001]

PLS History

H. Wold (1966, 1975)


S. Wold and H. Martens (1983)
Stone and Brooks (1990)
Frank and Friedman (1991, 1993)
Hinkle and Rayens (1994)

82

Latent Variable Model


[Diagram: predictors 1-6 feed into latent variables, which in turn feed into responses 1-3]

Note: PLS can handle multiple response variables


83

Comparison with Regression

[Diagram: predictors 1-5 map directly to Response1, as in ordinary regression]

PLS Optimization
(many predictors, one response)

PLS seeks to find linear combinations of the


independent variables that summarize the
maximum amount of co-variability with the
response.
These linear combinations are often called PLS
components or PLS scores.
A PLS direction is a vector that points in the direction
of maximum co-variance.

85

PLS Optimization
(many predictors, one response)

PLS is inherently an optimization problem, which


is subject to two constraints
1. The PLS directions have unit length
2. Either
a.Successively derived scores are uncorrelated to previously
derived scores, OR
b.Successively derived directions are orthogonal to previously
derived directions
Unlike PCA, either constraint does NOT imply the other
constraint
Constraint 2.a. is most commonly implemented
86

Mathematically Speaking
The optimization problem defined by PLS can be solved through the following formulation:

$\arg\max_{a} \; \dfrac{\mathrm{Cov}^2\big(a^{T}X,\, Y\big)}{a^{T}a}$

subject to constraints 2a. or 2b.

Facts
the ith PLS direction, $a_i$, is the eigenvector corresponding to the ith largest eigenvalue of $Z^{T}Z$, where $Z = X^{T}y$.
the ith largest eigenvalue is the amount of co-variability summarized by the ith PLS component.
$a_i^{T}X$ are the ith scores

PLS is Simultaneous Dimension Reduction and Regression

$\arg\max_{a} \dfrac{\mathrm{Cov}^2(a^{T}X,\, Y)}{a^{T}a}
= \arg\max_{a} \dfrac{\mathrm{var}(a^{T}X)\,\mathrm{var}(Y)\,\mathrm{corr}^2(a^{T}X,\, Y)}{a^{T}a}
= \mathrm{var}(Y)\, \arg\max_{a} \dfrac{\mathrm{var}(a^{T}X)\,\mathrm{corr}^2(a^{T}X,\, Y)}{a^{T}a}
= \mathrm{var(response)}\, \arg\max_{a} \dfrac{\mathrm{var(scores)}\,\mathrm{corr}^2(\mathrm{scores,\, response})}{a^{T}a}$
88

PLS is Simultaneous Dimension Reduction and Regression

max  Var(scores) × Corr²(response, scores)

where the Var(scores) term corresponds to dimension reduction (PCA) and the Corr²(response, scores) term corresponds to regression.

PLS Benefits and Drawbacks


Benefit
Simultaneous dimension reduction and regression

Drawbacks
Similar to PCA, PLS chases co-variability
PLS directions will be drawn to independent variables with the most
variability (although this will be tempered by the need to also be
related to the response)
Outliers may have significant influence on the directions, resulting
scores, and relationship with the response. Specifically, outliers can
make it appear that there is no relationship between the
predictors and response when there truly is a relationship, or
make it appear that there is a relationship between the
predictors and response when there truly is no relationship
90

Partial Least Squares

[Diagram: Predictor Variables → PLS → Response Variable; dimension reduction and regression happen simultaneously]

First PLS Direction

92

Relationship of First Direction with Response


[Figure: scatter of the first PLS scores against the response; R2 = 0.93]

PLS in Practice
PLS seeks to find latent variables (LVs) that
summarize variability and are highly predictive of
the response.
How do we determine the number of LVs to
compute?
Evaluate RMSPE (or Q2)

The optimal number of components is the


number of components that minimizes RMSPE
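
A sketch of choosing the number of PLS components by cross-validated RMSPE, using scikit-learn's PLSRegression (implementation details differ from the course's software); data are simulated placeholders.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 12))
y = X[:, :3] @ [1.0, -1.0, 0.5] + rng.normal(size=150)

for ncomp in range(1, 7):
    mse = -cross_val_score(PLSRegression(n_components=ncomp), X, y,
                           cv=10, scoring="neg_mean_squared_error")
    print(ncomp, round(np.mean(mse) ** 0.5, 3))     # cross-validated RMSPE per model size
```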

94

PLS for the Boston housing data:


Training the PLS Model
Since PLS can handle
highly correlated
variables, we fit the model
using all 12 predictors
The model was trained
with up to 6 components
RMSE drops noticeably
from 1 to 2 components
and some for 2 to 3
components.
Models with 3 or more
components might be
sufficient for these data
95

Training the PLS Model


Roughly the same
profile is seen when
the models are judged
on R2

96

Boston Housing Results


Using the two component model, we can predict
the test set
PLS training statistics are similar to those from
linear regression
Both methods perform about the same in the test
set
              Training Data (bootstrap)    Test Data
              RMSE      Q2                 RMSE      R2
Linear Reg    5.23      0.691              4.53      0.742
PLS           5.25      0.689              4.56      0.739

PLS Model Fit Test Set Results

98

PLS Optimization (2)


(many predictors, many responses)

PLS seeks to find linear combinations of the


independent variables and a linear combination
of the dependent variables that summarize the
maximum amount of co-variability between the
combinations.
These linear combinations are often called PLS X-space and Y-space components, or PLS X-space and Y-space scores.
Likewise, X-space and Y-space PLS directions point in the direction of maximum co-variance between the spaces.
99

PLS Optimization (2)


(many predictors, many responses)

PLS is inherently an optimization problem, which


is subject to two constraints
1. The X-space and Y-space PLS directions have unit
length
2. Either
a.Successively derived scores in each space are uncorrelated
to previously derived scores, OR
b.Successively derived directions in each space are orthogonal
to previously derived directions
Constraint 2.a. is most commonly implemented

100

Mathematically Speaking
The optimization problem defined by PLS can be solved through the following formulation:

$\arg\max_{a,\, b} \; \dfrac{\mathrm{Cov}^2\big(a^{T}X,\, b^{T}Y\big)}{\big(a^{T}a\big)\big(b^{T}b\big)}$

subject to constraints 2a. or 2b.

$= \arg\max_{a,\, b} \; \dfrac{\mathrm{var}\big(a^{T}X\big)\,\mathrm{var}\big(b^{T}Y\big)\,\mathrm{corr}^2\big(a^{T}X,\, b^{T}Y\big)}{\big(a^{T}a\big)\big(b^{T}b\big)}$
101

PLS is Simultaneous Dimension Reduction and Regression

max  Var(X-scores) × Corr²(X-scores, Y-scores) × Var(Y-scores)

where Var(X-scores) is X-space dimension reduction (PCA), Corr²(X-scores, Y-scores) is regression, and Var(Y-scores) is Y-space dimension reduction (PCA).

Neural Networks

103

Neural Networks
Like PLS or PCR, these models create
intermediary latent variables that are used to
predict the outcome
Neural networks differ from PLS or PCR in a few
ways
the objective function used to derive the new variables
is different
The latent variables are created using flexible, highly
nonlinear functions
The latent variables usually do not have any meaning
104

Network Structures
There are many types of neural network structures
we will concentrate on the single layer, feed-forward network
[Diagram: predictors 1-5 feed into one hidden layer of latent variables (hidden units 1 through k), which feed into Response1]

From Predictors to Hidden Units


The transition from this
sub-model to the hidden
units is nonlinear
sigmoidal functions, such
as the logistic function, are
typically used

106

From Hidden Units to the Outcome


The hidden units are then
used to predict the
outcome using simple
linear combinations

Clearly, the parameters are not identifiable and


the hidden units have no real meaning (unlike
PCA)
107

Training Networks
It is highly recommended that the predictors are
centered and scaled prior to training
The number of hidden units is a tuning
parameter
With many predictors and hidden units, the
number of estimated parameters can become
very large
with a large number of hidden units, these models can
quickly start to overfit

Random starting values are typically used to


initialize the parameter estimates
108

Weight Decay
This is a training technique that attempts to
shrink the parameter estimates towards zero
large parameter estimates are penalized in the model
training

This leads to smoother, less extreme models


the effect of weight decay is demonstrated for
classification models
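
A sketch of a single-hidden-layer network with weight decay, assuming scikit-learn's MLPRegressor, where the decay penalty is the `alpha` argument (L2 regularization); the data are simulated stand-ins.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.2, size=300)

net = make_pipeline(
    StandardScaler(),                              # center and scale before training
    MLPRegressor(hidden_layer_sizes=(5,),          # number of hidden units (tuning parameter)
                 alpha=0.1,                        # weight decay penalty
                 max_iter=5000, random_state=1),
)
net.fit(X, y)
print(round(net.score(X, y), 3))
```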

109

Boston Housing Data


The model seems to
do well with fewer
components (not
typical)
For these data, larger amounts of weight decay are better for the model fit

110

Boston Housing Results


The final model used a high value for weight decay and 1 hidden unit
This model seems to be an improvement
compared to the others
              Training Data (bootstrap)    Test Data
              RMSE      Q2                 RMSE      R2
Linear Reg    5.23      0.691              4.53      0.742
PLS           5.25      0.689              4.56      0.739
Neural Net    4.60      0.757              4.20      0.780

Support Vector Machines

112

Support Vector Machines (SVMs)


SVMs are predictive statistical models developed
in 1963 by Vapnik that were significantly
expanded in the 90s
These models were initially developed for
classification models, but were later adapted for
regression models

113

Objective Functions
Recall that linear
regression estimates
parameters by
calculating:
the model residuals
the total sum of the
squared residuals (SSR)

The parameters with


the smallest SSR are
optimal
114

Objective Functions
Support vector machine
regression models create a
funnel around the
regression line
residuals within the funnel are
not counted in the parameter
estimation
the sum of the residuals
outside the funnel are used as
the objective function (no
squared term)

A funnel size set to 1 SD of the outcome is not a bad place to start
115

The SVM Model Optimization


Like Huber-type robust
regression, outliers have a
linear effect on the
objective function
Overfitting can be
controlled by using a
penalized objective
function (more later)
Quadratic programming
methods are needed to
solve these equations
116

Support Vectors and Data Reduction


The points that are outside
the funnel (or on its
boundary) are the support
vectors
It turns out that the prediction
function only uses the
support vectors
the prediction equation is more
compact and efficient
the model may be more robust
to outliers

117

Support Vectors and Data Reduction


The model fitting routine produces values (α) that are non-zero for all of the support vectors
To predict a new sample, the original training data for the non-zero α values are needed:

118

Nonlinear Boundaries
Nonlinear boundaries can be computed using the
kernel trick
The predictor space can be expanded by adding
nonlinear functions of the predictors
Common kernel functions are:

119

Nonlinear Boundaries
The trick is that the computations can operate
only on the inner-products of the extended
predictor set

In this way, the predictor space dimension can be


greatly expanded without much computational
impact
120

Cost functions
Support vector machines also include a regularization
parameter that controls how much the regression line can
adapt to the data
smaller values result in more linear (i.e. flat) surfaces

This parameter is generally referred to as Cost


For example, this link shows the effect of the cost function for a highly nonlinear problem
SvmRegMovieA.gif
This one shows the robustness of SVM regression
models
SvmRegMovieB.gif
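
A sketch of epsilon-insensitive SVM regression with an RBF kernel, assuming scikit-learn's SVR; epsilon plays the role of the funnel half-width and C is the cost parameter.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(scale=0.2, size=200)

# epsilon sets the funnel width; C controls how much the surface can adapt to the data
svm = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma="scale").fit(X, y)
print(len(svm.support_), "support vectors define the prediction equation")
```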
121

Boston Housing Data


As previously
mentioned, there is a
way to analytically
estimate the tuning
parameter for the RBF
here, a fixed value of
0.0219 is used

The remaining
parameter (cost) shows
a clear optimum
122

Summary
Currently, the SVM model is best at prediction (but
worst at interpretation)

              Training Data (bootstrap)    Test Data
              RMSE      Q2                 RMSE      R2
Linear Reg    5.23      0.691              4.53      0.742
PLS           5.25      0.689              4.56      0.739
Neural Net    4.60      0.757              4.20      0.780
SVM (radial)  3.79      0.834              3.28      0.861

Multivariate Adaptive Regression Splines

124

Multivariate Adaptive Regression Splines


MARS is a nonlinear statistical model
The model does an exhaustive search across the
predictors (and each distinct value of the
predictor) to find the best way to sub-divide the
data
Based on this split value, MARS creates new
features based on that variable
These artificial features are used to model the
outcome
125

MARS Features
MARS uses hinge functions
that are two connected lines
For a split value c of a predictor x, MARS creates a pair of functions that model the data on each side of c:
$h(x - c) = \max(0,\, x - c)$ and $h(c - x) = \max(0,\, c - x)$

These features are created in


sets of two (switching which
side is zeroed)

[Figure: the paired hinge functions h(x − 6) and h(6 − x) plotted for a split at x = 6]
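
A small sketch of the paired hinge features MARS creates at a split value c = 6, built by hand with NumPy (not a full MARS implementation).

```python
import numpy as np

def hinge_pair(x, c):
    # h(x - c) and h(c - x): each is zero on one side of the split
    return np.maximum(0, x - c), np.maximum(0, c - x)

x = np.linspace(0, 10, 11)
right, left = hinge_pair(x, 6.0)
print(right)   # zero below 6, rises above 6
print(left)    # rises below 6, zero above 6
```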

Prediction Equation and Model Selection


The model iteratively adds the two new features and uses
ordinary regression methods to create a prediction
equation. The process then continues iteratively.
MARS also includes a built-in
feature selection routine that
can remove model terms
the maximum number of retained
features (and the feature degree)
are the tuning parameters

The Generalized Cross-Validation statistic (GCV) is used to select the most important terms
127

Sine Wave Example


As an example, we can use
MARS to model one predictor
with a sinusoidal pattern
The first MARS iteration
produces a split at 4.3
two new features are created
a regression model is fit with
these features
the red line shows the fit

128

Sine Wave Example


On the second iteration, a split
was found at 7.9
two new features are created

However, the model fit on the left


side was already pretty good
one of the new surrogate predictors
was removed by the automatic
feature selection

The model now has three


features

129

Sine Wave Example


The third split occurred at 5.5
Again, only the right-hand
feature was retained in the model
This process would continue until no more important features are found or the user-defined limit is achieved

130

Higher Order Features


Higher degree features
can also be used
two or more hinge functions can be multiplied together to form a new feature
in two dimensions, this
means that three of four
quadrants of the feature can
be zero if some features are
discarded

131

Boston Housing Data


We tried only additive
models
the model could retain
from 4 to 36 model terms

The best model used


18 terms

132

Boston Housing Data


Since the model is additive, we can look at the
prediction profile of each factor while keeping the
others constant

133

Summary
SVMs are still optimal, but the respectable
performance and interpretability of MARS might
make us reconsider
              Training Data (bootstrap)    Test Data
              RMSE      Q2                 RMSE      R2
Linear Reg    5.23      0.691              4.53      0.742
PLS           5.25      0.689              4.56      0.739
Neural Net    4.60      0.757              4.20      0.780
SVM (radial)  3.79      0.834              3.28      0.861
MARS          4.29      0.791              3.98      0.804

Regression Trees

135

Regression Trees
A regression tree searches through each
predictor to find a value of single predictor that
best splits the data into two groups.
the best split minimizes the mean squared error of the
model.

For the two resulting groups, the process is


repeated until a hierarchical structure (a "tree") is
created.
in effect, trees partition the predictor space into
rectangular sections that assign a single average to
compounds within the rectangle.
136

Computational Difficulties
Suppose we have n observations and p
predictors.
For each level of the tree, there are at most p(n-1)
possible splits

As tree depth increases, the number of possible


split combinations multiplies
The total number of possible split combinations is bounded above by [p(n−1)]^depth
Suppose we have 100 observations and 100 dimensions.
The number of possible trees is bounded above by 10^400!
137

A Greedy Approach
Instead of trying to find the best global set of
regions for which the responses are similar, we
recursively partition the data to find an optimal
set of decision rules.
A regression tree searches through each
predictor to find a value of a single predictor that
best splits the data into two groups.

138

Objective at Each Split


Let $[X_{n \times p} \,|\, Y_{n \times 1}]$ represent the data matrix
We seek a predictor, $X_j$, and split point, $s$, that solve:

$\min_{c_1,\, c_2} \left[ \sum_{x_{ij} \in R_1} \big(y_i - c_1\big)^2 + \sum_{x_{ij} \in R_2} \big(y_i - c_2\big)^2 \right]$

where $x_{ij} \in R_1$ if $x_{ij} \le s$ and $x_{ij} \in R_2$ if $x_{ij} > s$, for $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, p$.

The best $c_1$ and $c_2$ are the average responses for the observations in each region
For the two resulting groups, the process is repeated
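
A sketch of a regression tree that recursively applies this split criterion, via scikit-learn; max_depth controls the complexity, and the data are simulated placeholders.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = np.where(X[:, 0] < 5, 2.0, 8.0) + rng.normal(scale=0.5, size=300)

tree = DecisionTreeRegressor(max_depth=4, min_samples_leaf=10).fit(X, y)
# first split: which predictor was used and at what value
print(tree.tree_.feature[0], tree.tree_.threshold[0])
```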
139

Splitting Example Boston Housing


We start with all of the
training data
Searching through all
the data yields the first
split
a lower status value of
9.6% provides the best
decrease in MSE

140

Splitting Example Boston Housing


Searching through the first left split, the best split again uses the lower status %
In the initial right split, the split was based on the mean number of rooms
Now, there are 4
possible predicted
values

141

Tree Fitting Process


This process would continue until some criterion
for stopping is met
such as the minimum number of compounds in a node

The largest possible tree may over-fit


Pruning is the process of iteratively removing
terminal nodes
looking for drops in resampling performance
142

Tree Fitting Process


There are many possible pruning paths
how many possible trees are there with 6 terminal
nodes?

We can index the possible trees by a complexity


parameter, Cp.
Cp = 0 is the largest tree possible
as Cp increases, the tree shrinks
there are a discrete set of Cp values for a data set

Algorithmically, we can control the complexity by


setting the maximum tree depth
143

Comparison
For these data, we tried 6
possible tree sizes
For each value, resample the
data and calculate
performance
After a depth of 4, the model
cannot improve performance

              Training Data (bootstrap)    Test
              RMSE      Q2                 RMSE      R2
Single Tree   5.18      0.700              4.28      0.780

Boston Housing Example


A depth of 4 was optimal (see right-hand branch)
This model has a test
set performance of
0.78
so far the best is 0.86

However, we can
clearly get a sense of
what the model is
saying
145

Single Trees
Advantages
can be computed very quickly and have simple
interpretations.
have built-in predictor selection: if a predictor was not
used in any split, the model is completely independent
of that data.

Disadvantages
instability due to high variance: small changes in the
data can drastically affect the structure of a tree
data fragmentation
high order interactions
146

Ensemble Methods

147

Ensemble Methods
Ensembles of trees have been shown to provide
more predictive models than individual trees and
are less variable than individual trees
Common ensemble methods are:
Bagging
Random forests, and
Boosting

148

Bagging Trees
Bootstrap Aggregation
Breiman (1994, 1996)
Bagging is the process of
1. creating bootstrap samples
of the data,
2. fitting models to each
sample
3. aggregating the model
predictions

The largest possible tree is


built for each bootstrap
sample

149

Bagging Model

Prediction of an observation, x:

$F(x) = \dfrac{1}{M}\sum_{m=1}^{M} f_m(x)$
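
A sketch of bagged regression trees (bootstrap samples, one large tree per sample, predictions averaged), assuming scikit-learn's BaggingRegressor, whose default base learner is a regression tree.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.3, size=300)

# 50 bootstrap samples, one unpruned tree per sample, predictions averaged
bag = BaggingRegressor(n_estimators=50, bootstrap=True, random_state=1).fit(X, y)
print(round(bag.score(X, y), 3))
```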

Comparison
Bagging can significantly increase performance of trees
from resampling:
              Training Data (bootstrap)    Test
              RMSE      Q2                 RMSE      R2
Single Tree   5.18      0.700              4.28      0.780
Bagging       4.32      0.786              3.69      0.825

The cost is computing time and the loss of interpretation


One reason that bagging works is that single trees are
unstable
small changes in the data may drastically change the tree
151

Random Forests
Random forests models are similar to bagging
separate models are built for each bootstrap sample
the largest tree possible is fit for each bootstrap sample

However, when random forests starts to make a


new split, it only considers a random subset of
predictors
The subset size is the (optional) tuning parameter

Random forests defaults to a subset size that is the


square root of the number of predictors and is
typically robust to this parameter
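
A sketch of a random forest with the default subset-size heuristic, assuming scikit-learn, where max_features controls the random subset of predictors considered at each split.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 9))
y = X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(scale=0.3, size=300)

rf = RandomForestRegressor(n_estimators=500,
                           max_features="sqrt",   # subset size = sqrt(number of predictors)
                           oob_score=True, random_state=1)
rf.fit(X, y)
print(round(rf.oob_score_, 3))                    # out-of-bag R^2
```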
152

Random Predictor Illustration

[Diagram: for each of the M bootstrap datasets, a random subset of variables is selected from the original data, a tree is built, and each tree makes a prediction; the individual predictions are combined into the final prediction]

Random Forests Model

Prediction of an observation, x:

$F(x) = \dfrac{1}{M}\sum_{m=1}^{M} f_m(x)$

Properties of Random Forests


Variance reduction
Averaging predictions across many models provides
more stable predictions and model accuracy
(Breiman, 1996)

Robustness to noise
All observations have an equal chance to influence
each model in the ensemble
Hence, outliers have less of an effect on individual
models for the overall predicted values

155

Comparison
Comparing the three methods using resampling:
              Training Data (bootstrap)    Test
              RMSE      Q2                 RMSE      R2
Single Tree   5.18      0.700              4.28      0.780
Bagging       4.32      0.786              3.69      0.825
Rand Forest   3.55      0.857              3.00      0.885

Both bagging and random forests are memoryless


each bootstrap sample doesn't know anything about the other samples

156

Boosting Trees
A method to boost weak learning algorithms
(small trees) into strong learning algorithms
Kearns and Valiant (1989), Schapire (1990), Freund
(1995), Freund and Schapire (1996a)

Boosted trees try to improve the model fit over


different trees by considering past fits

157

Boosting Trees
First, an initial tree model is fit (the size of the
tree is controlled by the modeler, but usually the
trees are small (depth < 8))
if a sample was not predicted well, the model residual
will be different from zero
samples that were predicted poorly in the last tree will
be given more weight in the next tree (and vice-versa)

After many iterations, the final prediction is a weighted average of the predictions from each tree
158

Boosting Illustration
[Diagram: boosting stages 1, 2, ..., M. At each stage a weighted tree is built (for example, splits such as X1 > 5.2, X27 > 22.4, X6 > 0), the error is computed (e.g. Σ eᵢ = 32.9, 26.7, 29.5), a stage weight is computed as a function of that error (stage m = f(error)), and the observations are reweighted (wᵢ, i = 1, 2, ..., n) so that the larger an observation's error, the higher its weight in the next tree.]

Boosting Trees
Boosting has three tuning parameters:
number of iterations (i.e. trees)
complexity of the tree (i.e. number of splits)
learning rate: how quickly the algorithm adapts

This implementation is the most computationally


taxing of the tree methods shown here
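
A sketch of gradient-boosted trees with the three tuning parameters named above, assuming scikit-learn's GradientBoostingRegressor; the data are simulated placeholders.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = np.sin(X[:, 0]) + X[:, 1] + rng.normal(scale=0.3, size=400)

gbm = GradientBoostingRegressor(n_estimators=500,   # number of iterations (trees)
                                max_depth=3,        # complexity of each tree
                                learning_rate=0.1,  # how quickly the algorithm adapts
                                random_state=1)
gbm.fit(X, y)
print(round(gbm.score(X, y), 3))
```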

160

Final Boosting Model

Prediction of an observation, x:

$F(x) = \sum_{m=1}^{M} \beta_m f_m(x)$

where the $\beta_m$ are constrained to sum to 1.

161

Properties of Boosting
Robust to overfitting
As the number of iterations increases, the test set
error does not increase
Schapire, et al. (1998), Friedman, et al. (2000),
Freund, et al. (2001)

Can be misled by noise in the response


Boosting will be unable to find a predictive model if the
response is too noisy.
Krieger, et al. (2002), Wyner (2002), Schapire (2002), Opitz and Maclin (1999)
162

Boosting Trees
One approach to training is
to set the learning rate to a
high value (0.1) and tune
the other two parameters
In the plot to the right, a grid
of 9 combinations of the 2
tuning parameters were
used to optimize the model
The optimal settings were:
500 trees with high complexity

163

Comparison Summary
Comparing the four methods:
              Training Data (bootstrap)    Test
              RMSE      Q2                 RMSE      R2
Single Tree   5.18      0.700              4.28      0.780
Bagging       4.32      0.786              3.69      0.825
Rand Forest   3.55      0.857              3.00      0.885
Boosting      3.64      0.847              3.19      0.870

Model Building Training


Model Comparisons

165

Which Model is Best?


The No Free Lunch Theorem:
over the set of all possible problems, each algorithm
will do on average as well as any other
or, in other words,
if one model is better than another, it is because of the
particular problem at hand; no one method is uniformly
best
Despite this statement, the next slide has some
(subjective) ratings of models
166

Top Level Comparisons

[Table: subjective ratings of each model, on a scale of Excellent, Very Good, Average, Fair, Poor]

Top Level Comparisons

[Table: pre-processing requirements and characteristics for each model]

ZV = zero var predictor, NZV = near-zero var predictor, CS = center+scale, HCP = highly correlated predictor
* Depends on implementation
168

Boston Housing Data

The correlation between the results on the training set


(n=337) via cross-validation and the results from the test
set (n=169) were 0.971 (RMSE) and 0.965 (R2)
169

Some Advice
There is an inverse relationship between
performance and interpretability
We want the best of both worlds: great
performance and a simple, intuitive model

[Diagram: models arranged along an interpretability-performance trade-off — single trees, linear regression, PLS, and MARS are the most interpretable; neural nets, boosted trees, SVMs, random forests, and bagging give the best performance]

Try this:
Fit a high performance model to get an idea of the best possible performance
Move up the line and see if a less complex model can keep performance up with some interpretability

Regression Datasets

171

Internet Movie Database


IMDB is an on-line resource that catalogs movies and TV
programs from many countries.
Basic information about the program is maintained and
users can rate each program on a five point scale.
We extracted information about movies and captured:
the average vote
the number of votes
basic information: run time, rating (if any), year of release, etc
genre: drama, comedy etc and
keywords: based on novel, female lead, title spoken by character

Can we predict the movie rating based on these data?


172

Tecator Spectroscopy Data


From Statlib:
These data are recorded on a Tecator Infratec Food and Feed
Analyzer working in the wavelength range 850 - 1050 nm by the
Near Infrared Transmission (NIT) principle.
Each sample contains finely chopped pure meat with different
moisture, fat and protein contents.
For each meat sample the data consists of a 100 channel
spectrum of absorbances and the contents of moisture (water), fat
and protein.
The absorbance is -log10 of the transmittance measured by the
spectrometer.
The three contents, measured in percent, are determined by
analytic chemistry.
173

Tecator Spectroscopy Data


The variables are spectral
measurements at specific
wavelengths and are
highly autocorrelated.
We wish to predict the
percent fat for each
sample.

174

Towson Home Sales


Information about homes sold in the Towson, Maryland area (north of
Baltimore) were collected.
The area encompasses the northern border of Baltimore city
(Idlewydle), suburban areas (Annelsie, Rodgers Forge, Wiltondale)
and more expensive areas (Stoneleigh, Ruxton).
Variables include:

The lot size

The sale date and

Square footage

The year built

Number of baths

Can we accurately predict the sale price of a home?


175

Regression Backup Slides

176

SVM Model Fit Test Set Results

177

MARS Model Fit Test Set Results

178

Regression Tree Model Fit Test Set Results

179

Boosting Tree Model Fit Test Set Results

180

Variable Importance for PLS


To understand the
importance of each factor,
we can look at a weighted
sum of the absolute
regression coefficients
the weights are based on
the decrease in error as
more components are
added

We can also look at the


loadings to get a more
detailed assessment
181

Variable Importance for MARS


Here, we can look at the
increase in R2 as model
terms are added
If the variable is never
used in a term, it has an
importance of zero

182

Variable Importance for Regression Trees


Here, we can look at the
decrease in MSE as
model terms are added
If the variable is never
used in a split, it has an
importance of zero

183

Variable Importance for Random Forests


A permutation approach is
used
The training data for each variable is scrambled in turn and the % increase in the out-of-bag MSE is tracked

184

Boosting, Formally
Boosting fits a forward stagewise additive model (Hastie, Tibshirani and Friedman, 2001) through the following steps:

1. Let $f_0(x) = 0$
2. For $m = 1, 2, \ldots, M$, do steps a and b:
   a. $(\beta_m, h_m) = \arg\min_{\beta,\, h} \sum_{i=1}^{N} L\big(y_i,\; f_{m-1}(x_i) + \beta\, h(x_i)\big)$, where $\beta \in \mathbb{R}$ and $h$ is a tree.
   b. $f_m(x) = f_{m-1}(x) + \beta_m h_m(x)$
185

Boosting's Underlying Model

ν acts as a shrinkage parameter and is called the learning rate.
a parameter that controls the rate of learning of observations that overlap on a decision boundary (Friedman, 2001)

Shrinkage boosting can be viewed as fitting this additive model:

$f_M(x) = \sum_{m=1}^{M} \beta_m h_m(x)$

where $h_m(x) \in H_d$, and $H_d$ represents a dictionary of trees of depth d. (Hastie, 2001)
186

Linear Regression Pre-Processing


Linear regression models will fail if there are zero-variance predictors included
They will also fail during cross-validation if any near-zero variance predictors are in the data

As just discussed, removing highly correlated


predictors is strongly suggested
Centering and scaling are not required, but can
greatly increase the numerical stability of the
model

187

PLS Pre-Processing
Because of its dimension reduction abilities, PLS
is resistant to zero- and near-zero variance
predictors
Also, since PLS can handle (and perhaps exploit)
correlated predictors, it is not necessary to
remove them
Centering and scaling are extremely important for
PLS models
otherwise, the predictors with large variability can
dominate the selection of components
188

Neural Network Pre-Processing


Neural network models will not fail with zero-variance
predictors
However, these models use a large number of parameters
and near-zero variance predictors may lead to numerical
issues such as a failure to converge
Highly correlated predictors should be removed;
multicollinearity can have a significant effect on model
performance
Centering and scaling are required

189

MARS Pre-Processing
MARS models are resistant to zero- and near-zero
variance predictors
Highly correlated predictors are allowed, but this can lead to a significant amount of randomness during the predictor selection process
The split choice between two highly correlated predictors becomes
a toss-up

Centering and scaling are not required but are suggested

190

Tree Pre-Processing
A basic regression tree requires very little pre-processing
missing predictor values are allowed
centering and scaling are not required
centering and scaling do not affect results

highly correlated predictors are allowed


Including highly correlated descriptors can cause instability
and make descriptor importance rankings somewhat random

zero- and near-zero variance predictors are allowed

191

Model Building Training

Classification-type Models

192

Setting

Response is categorical
Response may have more than two categories
193

Objective

To construct a model of predictors that


can be used to predict a response
Data
Model
Prediction
194

Classification Methods
Discriminant analysis framework
Linear, quadratic, regularized, flexible, and partial least squares
discriminant analysis

Modern classification methods


Tree-based ensemble methods
Boosting and random forests

Neural networks
Support vector machines
k-nearest neighbors
Naive Bayes

Each of these methods seek to find a partitioning of the


data that minimizes classification error
195

Evaluating Classification Model Performance


Like regression models, we desire to understand the predictive ability of a classification model.
We can evaluate a model's performance by using cross-validation or a test set of data.
For regression models, the measure of performance was RMSE (or RMSPE), a function of the deviation of the observed value from the predicted value.
This is a valid measure of performance when the response is continuous, but not when the response is categorical.

Instead, we need a measure of predictive ability that is


appropriate for categorical data.
196

Objective
Minimize classification error (or maximize accuracy)
Determine how well the model prediction agrees with the
actual classification of observations.

                            Actual
Predicted         Active      Inactive      Total
Active            A           B             A + B
Inactive          C           D             C + D
Total             A + C       B + D         N = A + B + C + D

Intuition
An intuitive measure of accuracy is
(A + D) / N
When the actual classes are balanced, this is an
appropriate measure of model performance.

But, this measure produces the same values for


different tables:
                  Actual                                   Actual
Predicted     Active    Inactive          Predicted     Active    Inactive
Active        50        50           vs   Active        95        5
Inactive      50        4850              Inactive      95        4805

Accuracy for both tables is 0.98

Does one table show more agreement than the other?

Another Measure: Kappa


To provide a measure of agreement for unbalanced
tables, Cohen (1960) proposed comparing the observed
agreement to the expected agreement
To compute Kappa, we need
The observed agreement: O = (A + D) / N
The expected agreement:

$E = \dfrac{(A + C)(A + B) + (B + D)(C + D)}{N^2}$

Kappa is defined as: $k = \dfrac{O - E}{1 - E}$
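
A sketch of the accuracy and Kappa calculations for a 2×2 table, following the formulas above with the A, B, C, D layout shown earlier; the two example tables are the ones compared on the next slide.

```python
def kappa_2x2(A, B, C, D):
    N = A + B + C + D
    O = (A + D) / N                                        # observed agreement
    E = ((A + C) * (A + B) + (B + D) * (C + D)) / N ** 2   # expected agreement
    return (O - E) / (1 - E)

print(round(kappa_2x2(50, 50, 50, 4850), 2))   # ~0.49
print(round(kappa_2x2(95, 5, 95, 4805), 2))    # ~0.65
```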


199

Kappa Properties
Generally: −1 ≤ k ≤ 1
values close to 0 indicate poor agreement
values close to 1 indicate near perfect agreement
for complete disagreement, k = -1

Values of 0.4 or above are considered to indicate moderate


agreement, and values of 0.8 or higher indicate excellent
agreement. (Stokes, Davis, and Koch, 2001)

Can be generalized to > 2 classes


k = 0.49                                  k = 0.65
              Active    Inactive                        Active    Inactive
Active        50        50                Active        95        5
Inactive      50        4850              Inactive      95        4805

Note: When the observed classes are balanced, kappa = accuracy


200

Another Measure:
Receiver Operating Characteristic (ROC) Curves
ROC curves can be used to assess a classification model's performance or to compare several models' performance
Building an ROC curve requires that the model
produces a continuous prediction
For each predicted value of the response, we
construct a 2x2 table using the predicted value
as the cutoff.

201

ROC Curves

Terminology:
Sensitivity = True Positive Rate = TP / (TP + FN)
Specificity = True Negative Rate = TN / (FP + TN)

An ROC curve is a plot of 1 − specificity versus sensitivity for each predicted value of the response
false positive rate versus true positive rate

A perfect classification model has both a sensitivity and


specificity of 1.
202
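A rough sketch of how the curve is built, assuming a vector of predicted probabilities and 0/1 class labels (the function name roc_points and the toy data are ours):

    import numpy as np

    def roc_points(prob, actual):
        # for each candidate cutoff, classify samples with predicted probability
        # at or above the cutoff as positive and record (1 - specificity, sensitivity)
        prob, actual = np.asarray(prob), np.asarray(actual)
        points = []
        for cutoff in np.unique(prob):
            pred = (prob >= cutoff).astype(int)
            tp = np.sum((pred == 1) & (actual == 1))
            fn = np.sum((pred == 0) & (actual == 1))
            tn = np.sum((pred == 0) & (actual == 0))
            fp = np.sum((pred == 1) & (actual == 0))
            sens = tp / (tp + fn)          # true positive rate
            spec = tn / (tn + fp)          # true negative rate
            points.append((1 - spec, sens))
        return sorted(points)

    # toy usage with made-up probabilities and labels
    print(roc_points([0.9, 0.8, 0.6, 0.4, 0.2], [1, 1, 0, 1, 0]))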

ROC Example

All observations with predicted probabilities below the cutoff are classified as negative.
203

Classification Model Predictions


Several classification models generate a predicted value
for each class in the original data
PLSDA, FDA, and NN

The class with the largest predicted outcome is the


predicted class
Predictions from the model are generally between 0 and 1, but are
not guaranteed to be within this range.

The softmax technique is used to transform the predicted


outcomes to probability-like values that can be
interpreted as class probabilities
On the [0, 1] scale and add up to 1

204

Softmax Function
Let g_ik be the classification score of the ith
observation into group k.
The probability that the observation is in group k is:

    exp(g_ik) / Σ_p exp(g_ip),   p = 1, ..., K

where K is the total number of groups

205
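A minimal sketch of the calculation in plain NumPy (the raw scores below are made up):

    import numpy as np

    def softmax(scores):
        # transform raw classification scores g_i1, ..., g_iK into
        # probability-like values on [0, 1] that sum to 1
        scores = np.asarray(scores, dtype=float)
        scores = scores - scores.max()        # stabilize the exponentials
        expd = np.exp(scores)
        return expd / expd.sum()

    # e.g. raw scores for three groups
    print(softmax([2.0, 1.0, 0.1]))   # roughly [0.66, 0.24, 0.10]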

Discriminant Models

206

Classical Discriminant Models


These models form a discriminant function that
can be used to classify samples
The discriminant function is a linear function of the
predictors that attempts to maximize the separation
between the groups relative to the within-group variability

This is a latent variable method similar to PLS and


others that we have seen
how the latent variable is created differs between
methods
207

Linear Discriminant Analysis


Assumption: the within group variability is the same for
each group.
For a two-class problem, the classification boundary is a
straight line

The function uses the within-class means and the overall


covariance structure to create the latent variable

Because it uses the covariance matrix, there must be


at least as many compounds as predictors
no zero-variance or linearly dependent predictors

LDA is not optimal for groups separated by curvature


208
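For illustration, a small sketch using scikit-learn's LinearDiscriminantAnalysis on simulated two-class data (the data and class labels are invented for the example):

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # two simulated classes that differ only in their means
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
    y = np.repeat(["inactive", "active"], 50)

    lda = LinearDiscriminantAnalysis()
    lda.fit(X, y)

    # the boundary is a straight line; predictions use the linear discriminant
    print(lda.predict([[1.0, 1.0]]))
    print(lda.predict_proba([[1.0, 1.0]]))   # class probabilities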

Example where LDA works

The plot on the right


shows a three class
example where a linear
method like LDA is most
effective

209

Aside: LDA and Logistic Regression

It turns out that LDA and logistic regression are fitting models that are
very similar
LDA assumes that the predictors are measured with error and that the
classification of the observations is known
LR assumes that the predictors are known and that the classification of
the observations are measured with error

210

Assuming that the response error is Normal, the optimal separating
plane for logistic regression has the same linear form as the LDA
boundary

LDA estimates a large number of parameters and has fairly strict


constraints on the data

Also, logistic models may be more forgiving of skewed predictor


distributions

Example Data
For our example data
set, LDA doesn't do a
very good job since
the boundary is
nonlinear
The linear predictor is
determined to be
(1.18 × Predictor A) −
(0.25 × Predictor B)

211

Aside: LDA and Large Number of Predictors


Some classification models are not drastically
affected by large numbers of predictors
In many cases, a number of predictors will be noise

LDA has the potential to overfit


LDA class probability estimates become more extreme
as the number of predictors becomes large even when
there is no underlying difference

A similar issue occurs in LR


For LR, at some point a random predictor will perfectly
split the classes
212

Aside: LDA and Large Number of Predictors


For example, we simulated a
data set that was complete noise
For a small number of predictors,
the posterior probabilities were
grouped around 0.50
As the number of predictors was
increased, these probabilities
became more and more extreme
(i.e., overly certain)
213

PLS for Discrimination


In regression PLS seeks to find linear
combinations of the original variables
(scores) that are highly correlated with
the response.
For classification problems we can use
PLS to find linear combinations of the
original variables that optimally
separate the data.
Unlike regression, the response for
classification is a binary matrix, with each
column indicating the class of the
observation

214

Response Matrix

1 0 0
1 0 0

0 1 0

0 1 0
0 0 1
0 0 1
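A sketch of the idea using scikit-learn's PLSRegression with the binary response matrix shown above (the simulated data, number of components, and class coding are our own choices for illustration):

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(1)
    X = rng.normal(size=(90, 5))
    classes = np.repeat([0, 1, 2], 30)
    X[:, 0] += classes                       # make the first predictor informative

    # binary response matrix: one column per class, as on the slide above
    Y = np.eye(3)[classes]

    pls = PLSRegression(n_components=2)
    pls.fit(X, Y)

    scores = pls.predict(X)                  # continuous, not guaranteed to be in [0, 1]
    predicted_class = scores.argmax(axis=1)  # the largest predicted outcome wins
    print((predicted_class == classes).mean())

The raw predicted outcomes could also be passed through the softmax function described earlier to obtain probability-like values.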

PLS Optimization
(many predictors, many responses)

Like the regression setting, we must solve an


optimization problem that is subject to
constraints:
1. The X-space and Y-space PLS directions have unit
length
2. Either
a.Successively derived scores in each space are uncorrelated
to previously derived scores, OR
b.Successively derived directions in each space are orthogonal
to previously derived directions

215

Solution:
Same as PLS for Regression

The optimization problem defined by PLS can be
solved through the following formulation:

    argmax over a, b of   Cov(aᵀX, bᵀY)² / (aᵀa bᵀb)

subject to constraints 2a. or 2b.

Because Cov² = Var × Var × Corr², this is equivalent to:

    argmax over a, b of   Var(aᵀX) Var(bᵀY) Corr(aᵀX, bᵀY)² / (aᵀa bᵀb)
216

Facts
Barker and Rayens (2003) showed:
The PLS directions are the eigenvectors of a modified
between-class covariance matrix, B.
Coding of the response matrix does not matter
either g columns or g-1 columns provides the same answer

The constraint in the Y-space does not make sense


Why constrain a response that denotes class membership?

If the Y-space constraint is removed, the PLS


directions are exactly the eigenvectors of the between-class covariance matrix, B.
LDA is optimal if dimension reduction is not necessary
The optimal directions for LDA are the eigenvectors of W⁻¹B.
217

PLS Discriminant Analysis Example 1

The softmax function is used to determine classification boundaries.


218

PLS Discriminant Analysis Example 2

[Figure: class boundaries from PLSDA and LDA for the example data]

219

Quadratic Discriminant Analysis


Assumption: the within group variability is different for
each group.
The decision rule assigns an observation x to the group with the largest score:

    d_k(x) = −0.5 log|Σ_k| − 0.5 (x − μ_k)ᵀ Σ_k⁻¹ (x − μ_k) + log π_k

where μ_k, Σ_k, and π_k are the mean, covariance matrix, and prior probability of group k.

The class with the largest score is the predicted class
The score is a function of the squared distance of each observation from each
group's center

The decision rule depends on the covariance matrix for


each group
220

Quadratic Discriminant Analysis


QDA extends the LDA
model by using quadratic
(i.e., nonlinear) classification
boundaries
However, the data
requirements are more
stringent
at least as many compounds
as predictors in each class
no zero-variance or linearly
dependent predictors
221

Regularized Discriminant Analysis


The method tries to split the difference between LDA and
QDA.
It uses two tuning parameters, gamma and lambda:
gamma controls the correlation assumption for the predictors
as gamma → 1 the model assumes less correlation among the predictors

lambda toggles between linear and quadratic boundaries

gamma = 0 & lambda = 1 → LDA
gamma = 0 & lambda = 0 → QDA

Other combinations of gamma and lambda produce


models that are compromises between LDA and QDA
222
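To make the roles of gamma and lambda concrete, here is a sketch of one common way the regularized group covariance matrix can be formed; the exact formula varies between implementations, and the matrices below are invented for the example:

    import numpy as np

    def rda_covariance(S_k, S_pooled, lam, gamma):
        # lambda moves between the group covariance (QDA, lambda = 0) and the
        # pooled covariance (LDA, lambda = 1); gamma shrinks the result toward a
        # diagonal matrix, weakening the correlation structure as gamma -> 1
        S = lam * S_pooled + (1.0 - lam) * S_k
        p = S.shape[0]
        return (1.0 - gamma) * S + gamma * (np.trace(S) / p) * np.eye(p)

    S_k = np.array([[2.0, 0.8], [0.8, 1.0]])
    S_pooled = np.array([[1.5, 0.3], [0.3, 1.2]])
    print(rda_covariance(S_k, S_pooled, lam=1.0, gamma=0.0))  # pooled covariance, LDA-like
    print(rda_covariance(S_k, S_pooled, lam=0.0, gamma=1.0))  # diagonal, no correlation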

Regularized Discriminant Analysis


To see the effect of changing gamma:
RdaMovieA.gif

To see the effect of changing lambda:


RdaMovieB.gif

We can find the optimal gamma and lambda by


cross-validation

223

Flexible Discriminant Analysis


FDA generalizes LDA to highly nonlinear boundaries
In addition to the original predictors, nonlinear functions of
the predictors are added to the data
This is known as a basis expansion of the original data

This procedure essentially builds a set of one versus all


classification models
a 0/1 outcome is used for each model
the softmax function is used to convert the model output to class
probabilities

224

Flexible Discriminant Analysis


For example, the MARS hinge functions can be
used
For each 0/1 outcome, the best predictor/split of
the data is determined and two hinge functions
are added
Hinge functions are added until a pre-specified
number of terms is reached
Like the MARS model, the number of features is
reduced until the fit begins to suffer
225

FDA Example
FDA uses the MARS procedure to determine new
hinge features
for these data, 3 sets of features were used to
discriminate the classes

226

Modern Classification Methods

227

Classification Trees
Like regression trees, classification trees search
through each predictor to find a value of a single
predictor that splits the data into two (or more)
groups that are more pure than the original
group.
For each partition, each predictor is evaluated at
all possible split points and the best predictor
and split are selected.
Process continues until some criterion for stopping is
met (like minimum number of observations in a node)
228

Splitting Example
[Diagram: an example tree. The root splits on Pred A at Thresh 1; the resulting
nodes split on Pred B at Thresh 2 and on Pred D at Thresh 4; a node deeper in the
tree splits on Pred A again at Thresh 3. Each split sends observations above the
threshold down one branch and observations at or below it down the other.]

229

Impurity Measures
There are several measures for determining the
purity of the split. For a two-class problem, two common
measures are
Misclassification error

Gini index

230

Impurity Measure Definitions

Suppose a split sends a observations of class 1 and c of class 2 to the first node,
and b observations of class 1 and d of class 2 to the second node (n = a + b + c + d):

    p1 = min( a / (a + c), c / (a + c) )
    p2 = min( b / (b + d), d / (b + d) )
    w1 = (a + c) / n,   w2 = (b + d) / n

Misclassification error: w1 p1 + w2 p2
When w1 = w2 = 0.5, ME = 0.5 (p1 + p2)

Gini index: w1 p1 (1 − p1) + w2 p2 (1 − p2)
When w1 = w2 = 0.5, GI = 0.5 (p1 (1 − p1) + p2 (1 − p2))
231
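A small sketch of these calculations (the counts in the two example splits are made up):

    def impurity(a, b, c, d):
        # misclassification error and Gini index for a split that sends
        # (a of class 1, c of class 2) to node 1 and (b, d) to node 2
        n = a + b + c + d
        p1 = min(a / (a + c), c / (a + c))
        p2 = min(b / (b + d), d / (b + d))
        w1, w2 = (a + c) / n, (b + d) / n
        error = w1 * p1 + w2 * p2
        gini = w1 * p1 * (1 - p1) + w2 * p2 * (1 - p2)
        return error, gini

    # a nearly pure split versus a poor one
    print(impurity(a=40, c=2, b=3, d=45))    # low error and Gini
    print(impurity(a=20, c=25, b=23, d=22))  # both measures near their maxima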

Impurity Measure Comparison

232

Simple Example

[Figure: scatter plot of predictors x1 and x2 for the two classes]

In this example a few possible partitions clearly stand out: x1 = 5, x2 = 1.5, or x2 = 7.5
How does each impurity measure rank these partitions?

233

Classification Results

234

Ensemble Methods
Like individual regression trees, single
classification trees
are not optimal classification methods.
have high variability: small changes in the data can
drastically affect the structure of the tree.

Bagging, random forests, and boosting can also


be implemented for classification problems

235

Bagging, Random Forests, and Boosting


Each of these ensemble methods is
implemented in the same way as in regression.
The objective is to minimize misclassification
error
The loss function changes to exponential loss rather
than squared error loss.

Tuning parameters for these methods are the


same as in regression
236
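A sketch using scikit-learn's random forest and gradient boosting classifiers on simulated data (the data and the settings such as the number of trees are illustrative, not recommendations):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 4))
    y = (X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=200) > 1).astype(int)

    # random forest: the main tuning parameter is the number of predictors
    # sampled at each split (max_features), as in regression
    rf = RandomForestClassifier(n_estimators=500, max_features=2, random_state=0).fit(X, y)

    # boosting: number of trees, tree depth, and learning rate, as in regression
    gbm = GradientBoostingClassifier(n_estimators=200, max_depth=2, learning_rate=0.1).fit(X, y)

    print(rf.predict_proba(X[:3]))
    print(gbm.predict_proba(X[:3]))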

Neural Networks
Like PLS, neural networks for classification
translate the classes to a set of binary (zero/one)
variables.
The binary variables are modeled using the
predictors and the softmax technique is used to
make sure that the model outputs behave like
probabilities

237

Fitting Neural Networks


As in regression models, there are two
complexity parameters:
The number of hidden units
The amount of weight decay

The second parameter helps determine the


smoothness of the classification boundaries
For the example data:
nnetMovie.gif

238
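A sketch using scikit-learn's MLPClassifier, where hidden_layer_sizes plays the role of the number of hidden units and alpha plays the role of weight decay (the simulated data and settings are only illustrative):

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(3)
    X = rng.normal(size=(300, 2))
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)   # a nonlinear boundary

    # hidden_layer_sizes ~ number of hidden units; alpha ~ weight decay,
    # which controls how smooth the classification boundary is
    nnet = MLPClassifier(hidden_layer_sizes=(5,), alpha=0.1, max_iter=2000, random_state=0)
    nnet.fit(X, y)
    print(nnet.predict_proba(X[:3]))   # outputs behave like class probabilities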

Support Vector Machines (SVMs)


SVMs for classification use
a completely different
objective function:
the margin

Suppose we have two


predictors and a bunch of
compounds
We may want to classify
compounds as active or
inactive
Let's further suppose that
these two predictors
completely separate these
classes
239

The Margin
There are an infinite
number of straight lines
that we can use to
separate these two
groups
some must be better than
others

The margin is defined


by equally spaced
boundaries on each side
of the line
240

The Margin
To maximize the
margin, we try to make
it as large as possible
without capturing any
compounds

As the margin
increases, the solution
becomes more robust
SVMs maximize the
margin to estimate
parameters
241

Support Vectors and Data Reduction


When the classes overlap, points are allowed within the
margin
the number of points is controlled by a cost parameter

The points that are within the margin (or on its


boundary) are the support vectors
It turns out that the prediction function only uses the
support vectors
the prediction equation is more compact and efficient
the model may be more robust to outliers
242

Nonlinear Boundaries
Similar to regression models, the kernel trick
can be used to generate highly nonlinear class
boundaries
For classification, there are two common kernel
functions
polynomial (3 tuning variables)
radial basis functions (2 parameters)

243
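A sketch using scikit-learn's SVC with a radial basis function kernel (the simulated data and the values of C and gamma are illustrative; in practice both would be chosen by cross-validation):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(4)
    X = rng.normal(size=(250, 2))
    y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1, "active", "inactive")

    # radial basis function kernel: two parameters, the cost C and the kernel width gamma
    svm = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)

    print(len(svm.support_))          # number of support vectors used by the model
    print(svm.predict([[0.0, 0.0]]))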

SVM Example Class Boundary

[Figure: class boundary for an RBF kernel SVM; 79 SVs (31.6%)]

244

The Effect of the Cost Parameter


As the cost parameter is increased, the model will
work very hard to correctly classify the compounds
This can lead to over-fitting

To see the effect of the cost parameter, the link


below shows an animation for a radial basis
function SVM
SvmMovieB.gif

Note that, as the boundary becomes more


complicated, the #SV decreases
The margin is becoming very small

245

Nearest Neighbor Classifiers


To predict the class of a new compound, this
procedure uses the most frequent class of the
closest k neighbors
if a tie, randomly pick from the most frequent classes

k, the number of neighbors, is the tuning


parameter
Since distance is used to define the nearest
points, the predictors should be centered and
scaled

246
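A sketch with scikit-learn, centering and scaling inside a pipeline so the distance calculation is not dominated by one predictor (the simulated data are invented; k = 7 echoes the example on the next slide):

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(5)
    X = rng.normal(size=(150, 3)) * [1, 10, 100]        # predictors on very different scales
    y = (X[:, 0] > 0).astype(int)

    # distances drive the predictions, so center and scale the predictors first
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))
    knn.fit(X, y)
    print(knn.predict(X[:5]))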

Nearest Neighbor Classifiers


For the simulated data,
the model was tuned
across k values from 1 to
20
7 neighbors was found to
be optimal

k-NN class boundaries


tend to be somewhat
jagged but smooth out as
k increases
247

Naïve Bayes
Recall Bayes' theorem:

    P(class | predictors) = P(predictors | class) × P(class) / P(predictors)

Of course, the predictor distributions are usually


multivariate and these probabilities would involve
multidimensional integration
248

Naïve Bayes
In naïve Bayes, a.k.a. Idiot's Bayes, the
relationships between predictors are ignored
i.e., all predictors are treated as uncorrelated

249

Naïve Bayes
Despite this assumption, this model usually is
very competitive, even with strong correlations
How do we estimate continuous predictor
distributions?
parametrically: assume normality and use the sample
mean and variance
non-parametrically: use a nonparametric density
estimator

250
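A sketch of the parametric (normal) version using scikit-learn's GaussianNB on simulated two-predictor data (the class shifts are invented for the example):

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(6)
    # each predictor's distribution is shifted slightly between the classes
    X = np.vstack([rng.normal(-0.5, 1, (100, 2)), rng.normal(0.5, 1, (100, 2))])
    y = np.repeat(["active", "inactive"], 100)

    # parametric version: each predictor is modeled with a normal density, and the
    # per-predictor class probabilities are multiplied together
    nb = GaussianNB().fit(X, y)
    print(nb.predict_proba([[-1.0, -1.0]]))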

Naïve Bayes
For example, looking at only the distribution of
predictor A in our example, we see a slight shift
between the distributions of the predictor for
each class:

251

Naïve Bayes

If a new sample has a


value of predictor A =
-1, it is more likely to
be active
active density ~ 0.40
inactive density ~ 0.17

252

Naïve Bayes
For predictor B, the
inactive probability is much
larger for values between
-0.5 and 0.5
For each predictor, the
distributions are modeled
class probabilities can be
computed for each predictor

The final class probability


is calculated by multiplying
all the probabilities
together
253

A Tale of Two Samples

Per-predictor densities and their product (Total) for two new samples,
both with Predictor A = −1:

                   Sample 1                        Sample 2
             Pred A   Pred B   Total         Pred A   Pred B   Total
Active        0.40     0.14     0.06          0.40     0.30     0.12
Inactive      0.17     0.62     0.10          0.17     0.08     0.01

254

Naïve Bayes and Many Predictors


Like LDA, naïve Bayes
models can overfit when
many noisy predictors are
included in the model
As with LDA, we simulated
noise data and were able
to see class separation
increase as the number of
predictors went up

255

Naïve Bayes Classifiers


Class boundaries for
naïve Bayes models
can show circular or
elliptical islands
Since the predictors
are treated as
uncorrelated, there
cannot be any
diagonal ellipses

256

Example: Prediction of Spam


These data were collected by HP. 4,601 e-mails were
classified as spam or not spam.
Predictor variables are derived from the e-mails and relate to
the frequency of words or characters in the e-mail.
Variables include:
A set of word frequency variables. For example, the variable make
measures the relative frequency of that word in the email
Variables related to numbers: words that start with numbers are
also measured. For example, the variable num415 measures how
often the number 415 appears
Other variables relate to special characters (e.g. the variable
charExclamation) or capital letters (capitalAve)
257

Example: Prediction of Spam


We would like to classify e-mails as being spam with an
emphasis on high specificity, i.e., a low probability of non-spam being labeled as spam
For training, an 80% split was used via stratified random
sampling

258
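A sketch of such a split using scikit-learn's train_test_split with stratified sampling (the arrays below are random stand-ins for the real spam predictors and labels, and the 57 predictor columns are only a placeholder):

    import numpy as np
    from sklearn.model_selection import train_test_split

    # stand-ins for the spam predictors and the spam / nonspam labels
    X = np.random.rand(4601, 57)
    y = np.random.choice(["spam", "nonspam"], size=4601, p=[0.4, 0.6])

    # 80% of the data for training, sampled within each class (stratified)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.80, stratify=y, random_state=0)
    print(len(X_train), len(X_test))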

Method Comparison

259

Method Comparison

260

ROC Comparison

261

Classification Datasets

262

Glaucoma Data
62 variables are derived from a confocal laser scanning
image of the optic nerve head, describing its morphology.
Observations are from normal and glaucomatous eyes.
Examples of variables are:
as: superior area
vbss: volume below surface temporal
mhcn: mean height contour nasal
vari: volume above reference inferior, etc

We would like to predict whether a subject has glaucoma


given their imaging data
263

Predicting Diabetes in Pima Indians


These data are from Pima Indian women living in Arizona.
Several variables were collected, such as:
pregnant: number of pregnancies
glucose: plasma glucose levels
pressure: diastolic BP
triceps: skin fold thickness
insulin: serum insulin
mass: body mass index
pedigree: diabetic pedigree function
age
diabetes: negative or positive

We would like to predict a new Indian woman's diabetic
status given her other information.
264

Classification Backup Slides

265

FDA Pre-Processing
FDA models often use the MARS hinge functions, so they
share similar properties.
FDA models are resistant to zero- and near-zero variance
predictors
Highly correlated predictors are allowed, but this can lead
to a significant amount of randomness during the predictor
selection process
The split choice between two highly correlated predictors becomes
a toss-up

Centering and scaling are not required but are suggested

266

Tree Pre-Processing
Same as for regression
missing predictor values are allowed
centering and scaling are not required
centering and scaling do not affect results

highly correlated predictors are allowed


Including highly correlated predictors can cause
instability and make predictor importance rankings
somewhat random

zero- and near-zero variance predictors are


allowed
267

RDA Pre-Processing
RDA models cannot deal with zero- and near-zero
variance predictors
they must be removed

Highly correlated predictors are allowed, but not


suggested
However, perfectly correlated predictors will cause the model to fail

Centering and scaling are not required but are suggested


Additionally, there cannot be linear dependencies between
predictors

268

Neural Network Pre-Processing


Neural network models will not fail with zero-variance
predictors
However, these models use a large number of parameters
and near-zero variance predictors may lead to numerical
issues such as a failure to converge
Highly correlated predictors should be removed.
Centering and scaling are required

269

Nearest Neighbor Pre-Processing


These models are resistant to zero- and near-zero
variance predictors as well as highly correlated predictors
Centering and scaling are required

270

Naïve Bayes Pre-Processing


These models will not fail with zero-variance predictors
Highly correlated predictors are also allowed.
Centering and scaling are not required

271

Model Building Training


Other Considerations

272

Variables to Select
Variables thought to be related to the response
should be included in the model
Sometimes we don't know whether a set of variables is
related to the response
Should these be included in the analysis?
If the variables are not related to the response,
then we are including noise into our predictor set
What happens to the performance of the
techniques when noise is added?
Can we still find signal?
273

Illustration
To the blood brain barrier data of Mente and Lombardo
(2005), we have added 10, 50, 100, and 200 random
predictors
For each of these new data sets, we have built each
regression model, using cross-validation to determine the
optimal parameter settings
The results are on the following slides
Keep in mind that these results are for one example
Methods may have different rankings for other examples

274

Performance Comparison

[Figure: R², estimated by cross-validation on the training set, for each model
versus the number of added noise predictors (0 to 200); the vertical axis runs
from 0 to 0.5]

275

Performance Comparison

[Figure: R² on the test set for each model versus the number of added noise
predictors (0 to 200); the vertical axis runs from 0.1 to 0.5]

276

Variables to Select
Hopefully, we've demonstrated that resampling is
a good way to avoid over-fitting
Realize that predictor selection is part of the
modeling process
Doing predictor selection outside of cross-validation can lead to severe predictor selection
bias
and potential over-fitting (but you won't know until you
evaluate a test set)

277

Effects of Categorizing a Continuous Response


A majority of responses are measured on a continuous
scale
The continuous scale allows us to compare observations
on their original scale
Sometimes the continuous response naturally falls into
two or more modes
If the relative distance between these modes is not relevant, then
the response can be binned
However, if the distance between modes is relevant, then we lose
information by binning the response

Binning a continuous response that does not have natural


modes will make us lose even more information and will
degrade model performance

278

Thanks
Thanks for sitting through all this
More thanks to:
Benevolent overlords David Potter and Ed
Kadyszewski
Nathan Coulter and Gautam Bhola for computing
support
Pfizer Chemistry for feedback on earlier versions of
this training

279
