Max Kuhn
Kjell Johnson
Global Nonclinical Statistics
Overview
Typical data scenarios
Examples we'll be using
Typical Data
Predictive Models
What is a predictive model?
A model whose primary purpose is for prediction
(as opposed to inference)
(yikes)
References
Breiman, L. (2001). "Statistical Modeling: The Two Cultures." Statistical Science, 16(3), 199-231.
Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning.
Harrell, F. Regression Modeling Strategies.
Ayres, I. Super Crunchers.
Regression Methods
Multiple linear regression
Partial least squares
Neural networks
Multivariate adaptive regression splines
Support vector machines
Regression trees
Ensembles of trees:
Bagging, boosting, and random forests
Classification Methods
Discriminant analysis framework
Linear, quadratic, regularized, flexible, and partial least squares
discriminant analysis
Neural networks
Support vector machines
k-nearest neighbors
Naive Bayes
Other Models:
Conditional inference trees, C4.5, C5, Cubist, other tree models
Learning vector quantization
Self-organizing maps
Active learning techniques
General Strategies
Spending Data
We typically "spend" data by splitting it into training and test sets:
Training set: these data are used to estimate model parameters and to pick the values of the complexity parameter(s) for the model.
Test set (a.k.a. validation set): these data are used to get an independent assessment of model efficacy. They should not be used during model training.
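To make data spending concrete, here is a minimal Python sketch using scikit-learn's train_test_split; the simulated data set and the 80/20 split are illustrative assumptions, not values from the slides.

```python
# Illustrative only: split data into training and test sets before any modeling.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Simulated data standing in for a real data set (an assumption).
X, y = make_regression(n_samples=535, n_features=20, noise=5.0, random_state=1)

# Hold out 20% as a test set; it is not touched during model training or tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1
)
print(X_train.shape, X_test.shape)  # (428, 20) (107, 20)
```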
Over-Fitting
Over-fitting occurs when a model predicts the training data extremely well but predicts poorly when the data are slightly perturbed or when new data (i.e., test data) are used.
Over-Fitting Example
The plots below show classification boundaries
for two models built on the same data
[Figure: two classification boundaries plotted over Predictor A vs. Predictor B]
Over-Fitting in Regression
Historically, we evaluate the quality of a regression model by its mean squared error (MSE). Suppose that our prediction function $\hat{f}$ is parameterized by some vector $\theta$.
Over-Fitting in Regression
MSE can be decomposed into three terms:

$E\big[(y - \hat{f}(x))^2\big] = \sigma^2 + \big[f(x) - E\hat{f}(x)\big]^2 + \mathrm{Var}\big[\hat{f}(x)\big]$

irreducible noise ($\sigma^2$)
the squared bias of the estimator
the variance of the estimator
Over-Fitting in Regression
When the model under-fits, the bias is generally high and the variance is low.
Over-fitting is typically characterized by high variance, low bias estimators.
In many cases, small increases in bias result in large decreases in variance.
Over-Fitting in Regression
Generally, controlling the MSE yields a good trade-off between over- and under-fitting.
A similar statement can be made about classification models, although the metrics are different (i.e., not MSE).
Bootstrapping
Bootstrapping takes a random sample with replacement:
the random sample is the same size as the original data set
compounds may be selected more than once
each compound has a 63.2% chance of showing up at least once, since the probability of appearing at least once is $1 - (1 - 1/n)^n \approx 1 - e^{-1} \approx 0.632$ for large $n$
The Bootstrap
With bootstrapping, the number of held-out samples is random.
Some models, such as random forests, use bootstrapping within the modeling process to reduce over-fitting.
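The 63.2% figure is easy to verify empirically. A small numpy sketch (the data size n is an arbitrary assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # size of the original data set (illustrative)

# A bootstrap sample: draw n rows with replacement from the original n rows.
boot_idx = rng.integers(0, n, size=n)

in_bag = np.unique(boot_idx)
print(len(in_bag) / n)        # ~0.632: fraction of rows appearing at least once
print(1 - (1 - 1 / n) ** n)   # theoretical value, approaching 1 - 1/e
```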
Data Pre-Processing
Why Pre-Process?
In order to get effective and stable results, many models require certain assumptions about the data; this is model dependent.
NZV Example
In computational chemistry, we created predictors based on structural characteristics of compounds.
As an example, the descriptor nR11 is the number of 11-member rings.
The table below shows the distribution of nR11 from a training set:
the distinct value percentage is 5/535 = 0.0093
the frequency ratio is 501/23 = 21.8

[Table: distribution of nR11 (Value / Frequency); the most common value occurs 501 times and the second most common occurs 23 times]
Detecting NZVs
Two criteria for detecting NZVs are the:
Distinct value percentage
Defined as the number of unique values divided by the number of observations
Rule-of-thumb: a distinct value percentage < 20% could indicate a problem
Frequency ratio
Defined as the frequency of the most common value divided by the frequency of the second most common value
Rule-of-thumb: a ratio > 19 could indicate a problem

A related filter flags highly correlated predictors:
1. Compute the correlation matrix of the predictors
2. Predictors with (absolute) pair-wise correlations above a threshold can be flagged for removal
3. Rule-of-thumb threshold: 0.85
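Both rules, and the correlation filter, are simple to implement. The sketch below is a Python approximation of caret-style nearZeroVar/findCorrelation checks, assuming the predictors live in a pandas DataFrame; the nR11 frequencies other than 501 and 23 are made-up values chosen only to total 535 rows.

```python
import numpy as np
import pandas as pd

def nzv_flags(df: pd.DataFrame, freq_ratio_cut=19.0, unique_pct_cut=0.20):
    """Flag near-zero variance predictors using the two slide criteria."""
    flagged = []
    for col in df.columns:
        counts = df[col].value_counts()          # sorted, most common first
        unique_pct = len(counts) / len(df)       # distinct value percentage
        if len(counts) > 1:
            freq_ratio = counts.iloc[0] / counts.iloc[1]
        else:
            freq_ratio = np.inf                  # a constant column
        if freq_ratio > freq_ratio_cut and unique_pct < unique_pct_cut:
            flagged.append(col)
    return flagged

def high_corr_flags(df: pd.DataFrame, threshold=0.85):
    """Flag predictors with absolute pairwise correlation above the threshold."""
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [c for c in upper.columns if (upper[c] > threshold).any()]

# nR11-like example: 501 and 23 come from the slide; 7, 3, 1 are assumptions.
df = pd.DataFrame({"nR11": [0] * 501 + [1] * 23 + [2] * 7 + [3] * 3 + [4] * 1})
print(nzv_flags(df))  # ['nR11']
```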
Setting
Response is continuous
Objective
Accurately predict the response for new observations (i.e., minimize prediction error).
Regression Methods
Multiple linear regression
Partial least squares
Neural networks
Multivariate adaptive regression splines
Support vector machines
Regression trees
Ensembles of trees:
Bagging, boosting, and random forests
Additive Models
In the beginning there were linear models:

$E[Y] = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$

and link functions appeared:

$g(E[Y]) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$

and then additive models:

$E[Y] = f_0 + f_1(X_1) + \cdots + f_p(X_p)$
[Figure: models arranged along a flexibility axis: GLM, PLS, GAM, multivariate adaptive regression splines*, recursive partitioning (trees), bagging, random forests, boosting, neural nets, support vector machines*]
$RMSE = \sqrt{\dfrac{SSE}{n - p - 1}}$

RMSPE measures the average deviation of an observation to its predicted value for the test or cross-validation set:

$RMSPE = \sqrt{\dfrac{1}{n^*} \sum_{i=1}^{n^*} (y_i - \hat{y}_i)^2}$
Computing Q2
Process:
Partition the data into
a training and testing set, or
blocks to be used for training and testing
Objective: Find the plane through the data that minimizes the
sum-of-squares error.
$\min_{\beta} \lVert Y - X\beta \rVert^2$

Solution (when $X^TX$ is invertible):

$\hat{\beta} = (X^TX)^{-1}X^TY$
In the presence of multicollinearity, this least squares solution will be unstable.
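As a sketch of the closed-form solution, and of why a dedicated least-squares solver is preferred when $X^TX$ is ill-conditioned, using numpy with simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + predictors
beta_true = np.array([2.0, 1.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Closed-form OLS: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# With collinear columns, X'X is ill-conditioned and the solve above becomes
# numerically unstable; a least-squares solver is the safer route.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat, beta_lstsq)
```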
Results

              Training Data (bootstrap)    Test Data
              RMSE      Q2                 RMSE      R2
Linear Reg    5.23      0.691              4.53      0.742
[Figure: the first PCA direction overlaid on a scatter plot of Predictor 1 vs. Predictor 2; animation: http://pfizerpedia/index.php/Image:PCAmovie.gif]
Mathematically Speaking
The optimization problem defined by PCA can be solved
through the following formulation:
$\arg\max_{a} \dfrac{\mathrm{Var}(a^TX)}{a^Ta}$
Uncorrelated scores
The new scores are not linearly related to each other
Drawbacks
PCA chases variability:
PCA directions will be drawn to predictors with the most variability
Outliers may have significant influence on the directions and resulting scores
Procedure:
1. Reduce dimension of predictors using PCA
2. Regress scores on response
Notice: The procedure is sequential
Predictor Variables → PCA → PC Scores → MLR → Response Variable
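A hedged sketch of this sequential procedure, using a scikit-learn pipeline on simulated data (the 5 retained components are an arbitrary choice):

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=30, noise=10.0, random_state=1)

# Sequential: scale, compute PCA scores, then regress the response on the scores.
pcr = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
pcr.fit(X, y)
print(pcr.score(X, y))  # training R^2
```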
[Figure: response plotted against the first PCA score; R2 = 0.001]
[Diagram: Predictors (Predictor1-Predictor6) → Latent Variables → Responses (Response1-Response3)]
PLS Optimization
(many predictors, one response)

Mathematically Speaking
The optimization problem defined by PLS can be solved through the following formulation:

$\arg\max_{a} \dfrac{\mathrm{Cov}^2(a^TX, Y)}{a^Ta}$

subject to constraints 2a. or b.
Facts
the ith PLS direction, $a_i$, is the eigenvector corresponding to the ith largest eigenvalue of $Z^TZ$, where $Z = X^Ty$
the ith largest eigenvalue is the amount of co-variability summarized by the ith PLS component

$\arg\max_{a} \dfrac{\mathrm{Cov}^2(a^TX, Y)}{a^Ta} = \arg\max_{a} \dfrac{\mathrm{Var}(a^TX)\,\mathrm{Var}(Y)\,\mathrm{corr}^2(a^TX, Y)}{a^Ta}$

$= \mathrm{Var}(Y)\,\arg\max_{a} \dfrac{\mathrm{Var}(a^TX)\,\mathrm{corr}^2(a^TX, Y)}{a^Ta}$

$= \mathrm{Var}(\text{response})\,\arg\max_{a} \dfrac{\mathrm{Var}(\text{scores})\,\mathrm{corr}^2(\text{scores}, \text{response})}{a^Ta}$
[Diagram: PCR performs dimension reduction (PCA) and then regression, sequentially]
Drawbacks
Similar to PCA, PLS chases co-variability:
PLS directions will be drawn to independent variables with the most variability (although this will be tempered by the need to also be related to the response)
Outliers may have significant influence on the directions, resulting scores, and relationship with the response. Specifically, outliers can
make it appear that there is no relationship between the predictors and response when there truly is a relationship, or
make it appear that there is a relationship between the predictors and response when there truly is no relationship
Simultaneous dimension reduction and regression:
Predictor Variables → PLS → Response Variable
[Figure: response plotted against the first PLS score; R2 = 0.93]
PLS in Practice
PLS seeks to find latent variables (LVs) that
summarize variability and are highly predictive of
the response.
How do we determine the number of LVs to
compute?
Evaluate RMSPE (or Q2)
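One way to do this in practice is a cross-validated grid search over the number of components; a minimal scikit-learn sketch on simulated data (the candidate range of 1 to 10 components is an assumption):

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=30, noise=10.0, random_state=1)

# Tune the number of latent variables by cross-validated RMSE.
grid = GridSearchCV(
    PLSRegression(scale=True),
    param_grid={"n_components": range(1, 11)},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```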
              Training Data (bootstrap)    Test Data
              RMSE      Q2                 RMSE      R2
Linear Reg    5.23      0.691              4.53      0.742
PLS           5.25      0.689              4.56      0.739
Mathematically Speaking
The optimization problem defined by PLS can be solved through the following formulation:

$\arg\max_{a,\, b} \dfrac{\mathrm{Cov}^2(a^TX, b^TY)}{(a^Ta)(b^Tb)}$
[Diagram: PLS performs X-space dimension reduction (PCA-like), Y-space dimension reduction (PCA-like), and regression simultaneously]
Neural Networks
Neural Networks
Like PLS or PCR, these models create intermediary latent variables that are used to predict the outcome.
Neural networks differ from PLS or PCR in a few ways:
the objective function used to derive the new variables is different
the latent variables are created using flexible, highly nonlinear functions
the latent variables usually do not have any meaning
Network Structures
There are many types of neural network structures; we will concentrate on the single-layer, feed-forward network.
[Diagram: predictors (Predictor1-Predictor5) feed one hidden layer of latent variables (Hidden Unit 1, 2, ..., k), which feeds Response1]
Training Networks
It is highly recommended that the predictors are centered and scaled prior to training.
The number of hidden units is a tuning parameter.
With many predictors and hidden units, the number of estimated parameters can become very large;
with a large number of hidden units, these models can quickly start to over-fit.
Weight Decay
This is a training technique that attempts to shrink the parameter estimates towards zero:
large parameter estimates are penalized during model training
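A short scikit-learn sketch tying these points together: the predictors are centered and scaled, the hidden-unit count is a tuning parameter, and the alpha argument plays the role of the weight-decay (L2) penalty. The data and settings are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=20, noise=5.0, random_state=1)

# Center/scale the predictors, then fit a single-hidden-layer network.
# `alpha` is an L2 penalty on the weights, analogous to weight decay.
net = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(5,), alpha=0.01,
                 max_iter=5000, random_state=1),
)
net.fit(X, y)
print(net.score(X, y))  # training R^2
```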
              Training Data (bootstrap)    Test Data
              RMSE      Q2                 RMSE      R2
Linear Reg    5.23      0.691              4.53      0.742
PLS           5.25      0.689              4.56      0.739
Neural Net    4.60      0.757              4.20      0.780
Objective Functions
Recall that linear regression estimates parameters by calculating:
the model residuals
the total sum of the squared residuals (SSR)
Objective Functions
Support vector machine regression models create a "funnel" around the regression line:
residuals within the funnel are not counted in the parameter estimation
the sum of the residuals outside the funnel is used as the objective function (no squared term)
Nonlinear Boundaries
Nonlinear boundaries can be computed using the "kernel trick."
The predictor space can be expanded by adding nonlinear functions of the predictors.
Common kernel functions are the polynomial and radial basis function kernels.
Nonlinear Boundaries
The "trick" is that the computations can operate only on the inner products of the extended predictor set.
Cost functions
Support vector machines also include a regularization parameter (the cost) that controls how much the regression line can adapt to the data;
smaller values result in more linear (i.e., flat) surfaces.
In the tuning profile for these data, the remaining parameter (cost) shows a clear optimum.
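A hedged sketch of tuning the cost for a radial basis SVM regression in scikit-learn (the candidate C values and simulated data are assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=1)

# Radial basis SVM regression; C is the cost parameter and epsilon the
# half-width of the "funnel" inside which residuals are ignored.
pipe = make_pipeline(StandardScaler(), SVR(kernel="rbf", epsilon=0.1))
grid = GridSearchCV(pipe, {"svr__C": [0.25, 1, 4, 16, 64]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```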
Summary
Currently, the SVM model is best at prediction (but worst at interpretation).

              Training Data (bootstrap)    Test Data
              RMSE      Q2                 RMSE      R2
Linear Reg    5.23      0.691              4.53      0.742
PLS           5.25      0.689              4.56      0.739
Neural Net    4.60      0.757              4.20      0.780
SVM (radial)  3.79      0.834              3.28      0.861
MARS Features
MARS uses hinge functions, which are two connected lines.
For a data point x of a predictor, MARS creates a pair of functions that model the data on each side of x:
$h(x - a) = \max(0, x - a)$ and $h(a - x) = \max(0, a - x)$
[Figure: the hinge functions h(x - 6) and h(6 - x), plotted over x from 0 to 10]
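The hinge pair is a one-liner; a small numpy sketch with the knot at 6, as in the figure:

```python
import numpy as np

def h(u: np.ndarray) -> np.ndarray:
    """MARS hinge: h(u) = max(0, u), applied elementwise."""
    return np.maximum(0.0, u)

x = np.linspace(0, 10, 5)
print(h(x - 6))   # nonzero to the right of the knot at 6
print(h(6 - x))   # nonzero to the left of the knot at 6
```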
Summary
SVMs are still optimal, but the respectable performance and interpretability of MARS might make us reconsider.

              Training Data (bootstrap)    Test Data
              RMSE      Q2                 RMSE      R2
Linear Reg    5.23      0.691              4.53      0.742
PLS           5.25      0.689              4.56      0.739
Neural Net    4.60      0.757              4.20      0.780
SVM (radial)  3.79      0.834              3.28      0.861
MARS          4.29      0.791              3.98      0.804
Regression Trees
A regression tree searches through each predictor to find the value of a single predictor that best splits the data into two groups:
the best split minimizes the mean squared error of the model.
Computational Difficulties
Suppose we have n observations and p
predictors.
For each level of the tree, there are at most p(n-1)
possible splits
A Greedy Approach
Instead of trying to find the best global set of regions for which the responses are similar, we recursively partition the data to find an optimal set of decision rules.
A regression tree searches through each predictor to find the value of a single predictor that best splits the data into two groups:
$\min_{c_1} \sum_{x_{ij} \in R_1} (y_i - c_1)^2 + \min_{c_2} \sum_{x_{ij} \in R_2} (y_i - c_2)^2$
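CART-style implementations perform this greedy, SSE-minimizing search internally. A sketch of tuning the tree depth by resampling, using scikit-learn on simulated data (the six candidate depths mirror the comparison that follows):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=1)

# Resample over candidate tree depths and pick the best by cross-validated RMSE.
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=1),
    {"max_depth": [1, 2, 3, 4, 5, 6]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```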
Comparison
For these data, we tried 6 possible tree sizes.
For each value, resample the data and calculate performance.
After a depth of 4, the model cannot improve performance.

              Training Data (bootstrap)    Test Data
              RMSE      Q2                 RMSE      R2
Single Tree   5.18      0.700              4.28      0.780
However, we can clearly get a sense of what the model is saying.
Single Trees
Advantages:
can be computed very quickly and have simple interpretations
have built-in predictor selection: if a predictor was not used in any split, the model is completely independent of that data
Disadvantages:
instability due to high variance: small changes in the data can drastically affect the structure of the tree
data fragmentation
high-order interactions
Ensemble Methods
Ensembles of trees have been shown to provide more predictive models than individual trees, and they are less variable than individual trees.
Common ensemble methods are:
Bagging
Random forests
Boosting
Bagging Trees
Bootstrap Aggregation
Breiman (1994, 1996)
Bagging is the process of:
1. creating bootstrap samples of the data,
2. fitting models to each sample, and
3. aggregating the model predictions
Bagging Model
Prediction of an observation, x:

$F(x) = \dfrac{1}{M} \sum_{m=1}^{M} f_m(x)$
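A minimal bagging sketch in scikit-learn (recent versions use the estimator argument; the data and settings are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=1)

# Bagging: fit trees to bootstrap samples and average their predictions.
bag = BaggingRegressor(
    estimator=DecisionTreeRegressor(), n_estimators=50,
    bootstrap=True, random_state=1,
)
bag.fit(X, y)
print(bag.predict(X[:3]))
```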
Comparison
Bagging can significantly increase performance of trees from resampling:

              Training Data (bootstrap)    Test Data
              RMSE      Q2                 RMSE      R2
Single Tree   5.18      0.700              4.28      0.780
Bagging       4.32      0.786              3.69      0.825
Random Forests
Random forests models are similar to bagging:
separate models are built for each bootstrap sample
the largest tree possible is fit for each bootstrap sample
[Diagram: for each of Dataset 1 ... Dataset M, randomly select a subset of variables, build a tree, and predict; the final prediction aggregates the individual predictions]
Prediction of an observation, x:

$F(x) = \dfrac{1}{M} \sum_{m=1}^{M} f_m(x)$
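A matching random forest sketch; the max_features fraction, which is what distinguishes random forests from plain bagging, is an illustrative choice:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=1)

# Unlike plain bagging, each split considers only a random subset of
# predictors (max_features), which de-correlates the trees.
rf = RandomForestRegressor(n_estimators=500, max_features=1.0 / 3, random_state=1)
rf.fit(X, y)
print(rf.predict(X[:3]))
```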
Robustness to noise
All observations have an equal chance to influence each model in the ensemble.
Hence, outliers have less of an effect on individual models and, therefore, on the overall predicted values.
Comparison
Comparing the three methods using resampling:

              Training Data (bootstrap)    Test Data
              RMSE      Q2                 RMSE      R2
Single Tree   5.18      0.700              4.28      0.780
Bagging       4.32      0.786              3.69      0.825
Rand Forest   3.55      0.857              3.00      0.885
Boosting Trees
A method to boost weak learning algorithms (small trees) into strong learning algorithms.
Kearns and Valiant (1989), Schapire (1990), Freund (1995), Freund and Schapire (1996a)
Boosting Trees
First, an initial tree model is fit; the size of the tree is controlled by the modeler, but usually the trees are small (depth < 8).
if a sample was not predicted well, the model residual will be different from zero
samples that were predicted poorly in the last tree will be given more weight in the next tree (and vice versa)
Boosting Illustration
[Diagram: boosting stages. At each stage, a weighted tree is built (stage 1 splits on X1 at 5.2 with n = 200; stage 2 on X6 at 0; the final stage on X27 at 22.4), the stage error is computed (sum of e_i: 32.9, 26.7, ..., 29.5), the stage weight is computed as a function of that error (stage 1 = f(32.9), etc.), and the observation weights w_i are updated for the next stage]
Boosting Trees
Boosting has three tuning parameters:
number of iterations (i.e. trees)
complexity of the tree (i.e. number of splits)
learning rate: how quickly the algorithm adapts
Prediction of an observation, x:

$F(x) = \sum_{m=1}^{M} \beta_m f_m(x)$
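A sketch of the three tuning parameters in scikit-learn's gradient boosting implementation (a stagewise boosting variant; the settings echo the 500-tree example below but are otherwise illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=1)

# The three tuning parameters named above: number of trees (iterations),
# tree complexity (depth), and the learning rate.
boost = GradientBoostingRegressor(
    n_estimators=500, max_depth=3, learning_rate=0.1, random_state=1
)
boost.fit(X, y)
print(boost.predict(X[:3]))
```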
Properties of Boosting
Robust to over-fitting:
as the number of iterations increases, the test set error does not increase
Schapire, et al. (1998); Friedman, et al. (2000); Freund, et al. (2001)
Boosting Trees
One approach to training is to set the learning rate to a high value (0.1) and tune the other two parameters.
In the plot to the right, a grid of 9 combinations of the 2 tuning parameters was used to optimize the model.
The optimal settings were 500 trees with high complexity.
Comparison Summary
Comparing the four methods:

              Training Data (bootstrap)    Test Data
              RMSE      Q2                 RMSE      R2
Single Tree   5.18      0.700              4.28      0.780
Bagging       4.32      0.786              3.69      0.825
Rand Forest   3.55      0.857              3.00      0.885
Boosting      3.64      0.847              3.19      0.870
Some Advice
There is an inverse relationship between performance and interpretability.
We want the best of both worlds: great performance and a simple, intuitive model.
[Figure: models arranged along an interpretability/performance trade-off line: Tree, Regression, PLS, and MARS toward the interpretable end; NNet, Boosted Tree, SVM, and RF/Bagging toward the high-performance end]
Try this:
Fit a high performance model to get an idea of the best possible performance.
Move up the line and see if a less complex model can keep performance up with some interpretability.
Regression Datasets
[Slides describing the example regression data sets, including a housing data set with predictors such as square footage and number of baths]
Boosting, Formally
Boosting fits a forward stagewise additive model (Hastie, Tibshirani and Friedman, 2001) through the following steps:
1. Let $f_0(x) = 0$
2. For m = 1 to M:
   a. $(\beta_m, h_m) = \arg\min_{\beta,\, h \in H_d} \sum_{i=1}^{n} L\big(y_i,\; f_{m-1}(x_i) + \beta\, h(x_i)\big)$
   b. $f_m(x) = f_{m-1}(x) + \beta_m h_m(x)$
3. The final model is $f_M(x) = \sum_{m=1}^{M} \beta_m h_m(x)$, where each $h_m \in H_d$, the class of trees of depth d.
PLS Pre-Processing
Because of its dimension reduction abilities, PLS is resistant to zero- and near-zero variance predictors.
Also, since PLS can handle (and perhaps exploit) correlated predictors, it is not necessary to remove them.
Centering and scaling are extremely important for PLS models;
otherwise, the predictors with large variability can dominate the selection of components.
MARS Pre-Processing
MARS models are resistant to zero- and near-zero variance predictors.
Highly correlated predictors are allowed, but this can lead to a significant amount of randomness during the predictor selection process:
the split choice between two highly correlated predictors becomes a toss-up.
Tree Pre-Processing
A basic regression tree requires very little pre-processing:
missing predictor values are allowed
centering and scaling are not required
centering and scaling do not affect results
Classification-type Models
Setting
Response is categorical
Response may have more than two categories
Classification Methods
Discriminant analysis framework
Linear, quadratic, regularized, flexible, and partial least squares
discriminant analysis
Neural networks
Support vector machines
k-nearest neighbors
Naive Bayes
Objective
Minimize classification error (or maximize accuracy)
Determine how well the model prediction agrees with the actual classification of observations:
                     Actual
Predicted      Active    Inactive    Total
Active         A         B           A+B
Inactive       C         D           C+D
Total          A+C       B+D         N = A+B+C+D
Intuition
An intuitive measure of accuracy is (A + D) / N.
When the actual classes are balanced, this is an appropriate measure of model performance. But compare these two results, which both have (A + D) / N = 0.98:

             Actual                            Actual
Predicted    Active   Inactive    vs    Predicted    Active   Inactive
Active       50       50                Active       95       5
Inactive     50       4850              Inactive     95       4805
Kappa compares the observed accuracy, O = (A + D)/N, with the accuracy expected by chance:

$E = \dfrac{(A+C)(A+B) + (B+D)(C+D)}{N^2}, \qquad \kappa = \dfrac{O - E}{1 - E}$
Kappa Properties
Generally, $-1 \le \kappa \le 1$:
values close to 0 indicate poor agreement
values close to 1 indicate near perfect agreement
for complete disagreement, $\kappa = -1$

             Actual                              Actual
Predicted    Active   Inactive      Predicted    Active   Inactive
Active       50       50            Active       95       5
Inactive     50       4850          Inactive     95       4805
             κ ≈ 0.49                            κ = 0.65
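The kappa statistic is easy to compute directly. A minimal Python sketch, assuming the two flattened tables above are [[50, 50], [50, 4850]] and [[95, 5], [95, 4805]] (the cell value 5 is inferred so that N = 5000 in both tables); it reproduces the κ = 0.65 shown on the slide:

```python
import numpy as np

def kappa(table: np.ndarray) -> float:
    """Cohen's kappa from a KxK confusion table (rows: predicted, cols: actual)."""
    n = table.sum()
    observed = np.trace(table) / n
    expected = (table.sum(axis=0) * table.sum(axis=1)).sum() / n**2
    return (observed - expected) / (1 - expected)

print(round(kappa(np.array([[50, 50], [50, 4850]])), 2))  # ~0.49
print(round(kappa(np.array([[95, 5], [95, 4805]])), 2))   # ~0.65
```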
Another Measure:
Receiver Operating Characteristic (ROC) Curves
ROC curves can be used to assess a classification model's performance or to compare several models' performance.
Building an ROC curve requires that the model produce a continuous prediction.
For each predicted value of the response, we construct a 2x2 table using the predicted value as the cutoff.
ROC Curves
Terminology:
Sensitivity = True Positive Rate = TP / (TP + FN)
Specificity = True Negative Rate = TN / (FP + TN)
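To make the table-per-cutoff idea concrete, here is a short scikit-learn sketch; the simulated data and the logistic regression model are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, cutoffs = roc_curve(y_te, probs)   # one 2x2 table per cutoff
print(roc_auc_score(y_te, probs))            # area under the ROC curve
```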
ROC Example
All observations with predicted probabilities at or below the cutoff are classified as negative.
Softmax Function
Let $g_{ik}$ be the classification score of the ith observation for group k.
The probability that the observation is in group k is:

$\Pr(\text{group } k) = \dfrac{e^{g_{ik}}}{\sum_{p=1}^{K} e^{g_{ip}}}$
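A direct numpy translation (the three scores are arbitrary):

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Convert classification scores g_ik into class probabilities."""
    shifted = scores - scores.max()   # subtract the max for numerical stability
    e = np.exp(shifted)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities that sum to 1
```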
Discriminant Models
It turns out that LDA and logistic regression fit models that are very similar:
LDA assumes that the predictors are measured with error and that the classification of the observations is known
LR assumes that the predictors are known and that the classification of the observations is measured with error
Example Data
For our example data set, LDA doesn't do a very good job since the boundary is nonlinear.
The linear predictor is determined to be a combination of Predictor A (coefficient 1.18) and Predictor B (coefficient 0.25).
Response Matrix
1 0 0
1 0 0
0 1 0
0 1 0
0 0 1
0 0 1
PLS Optimization
(many predictors, many responses)

Solution: same as PLS for regression:

$\arg\max_{a,\, b} \dfrac{\mathrm{Cov}^2(a^TX, b^TY)}{(a^Ta)(b^Tb)}$
Facts
Barker and Rayens (2003) showed:
the PLS directions are the eigenvectors of a modified between-class covariance matrix, B
coding of the response matrix does not matter: either g columns or g - 1 columns provides the same answer
FDA Example
FDA uses the MARS procedure to determine new hinge features;
for these data, 3 sets of features were used to discriminate the classes.
Classification Trees
Like regression trees, classification trees search through each predictor to find the value of a single predictor that splits the data into two (or more) groups that are more "pure" than the original group.
For each partition, each predictor is evaluated at all possible split points, and the best predictor and split are selected.
The process continues until some criterion for stopping is met (like a minimum number of observations in a node).
Splitting Example
[Diagram: an example tree. The root splits on Pred A at Thresh 1; subsequent nodes split on Pred B at Thresh 2, Pred D at Thresh 4, and Pred A again at Thresh 3]
Impurity Measures
There are several measures for determining the purity of a split. For a two-class problem, two common measures are the misclassification error and the Gini index.

For a split where node 1 contains counts a and c of the two classes, node 2 contains counts b and d, and n = a + b + c + d:

$p_1 = \min\left(\dfrac{a}{a+c},\ \dfrac{c}{a+c}\right), \qquad p_2 = \min\left(\dfrac{b}{b+d},\ \dfrac{d}{b+d}\right)$

$w_1 = \dfrac{a+c}{n}, \qquad w_2 = \dfrac{b+d}{n}$

and the total misclassification error of the split is $w_1 p_1 + w_2 p_2$.
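A plain-Python sketch of both measures using the cell counts a, b, c, d defined above (the example counts are made up):

```python
def misclassification(a, b, c, d):
    """Weighted misclassification error of a two-node, two-class split."""
    n = a + b + c + d
    p1 = min(a, c) / (a + c)          # impurity in node 1
    p2 = min(b, d) / (b + d)          # impurity in node 2
    return (a + c) / n * p1 + (b + d) / n * p2

def gini(a, b, c, d):
    """Weighted Gini index of the same split."""
    n = a + b + c + d
    g1 = 2 * (a / (a + c)) * (c / (a + c))
    g2 = 2 * (b / (b + d)) * (d / (b + d))
    return (a + c) / n * g1 + (b + d) / n * g2

print(misclassification(40, 10, 5, 45), gini(40, 10, 5, 45))
```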
Simple Example
[Figure: scatter plot of x2 vs. x1 (each ranging 0 to 10) with candidate splits at x1 = 5, x2 = 1.5, or x2 = 7.5]
Classification Results
[Figure: the resulting classification regions]
Ensemble Methods
Like individual regression trees, single classification trees:
are not optimal classification methods
have high variability: small changes in the data can drastically affect the structure of the tree
Neural Networks
Like PLS, neural networks for classification translate the classes to a set of binary (zero/one) variables.
The binary variables are modeled using the predictors, and the softmax technique is used to make sure that the model outputs behave like probabilities.
The Margin
There are an infinite number of straight lines that we can use to separate these two groups;
some must be better than others.
The Margin
To maximize the margin, we try to make it as large as possible without capturing any compounds.
As the margin increases, the solution becomes more robust.
SVMs maximize the margin to estimate parameters.
Nonlinear Boundaries
Similar to regression models, the "kernel trick" can be used to generate highly nonlinear class boundaries.
For classification, there are two common kernel functions:
polynomial (3 tuning parameters)
radial basis function (2 parameters)
[Figure: a radial basis SVM class boundary using 79 support vectors, i.e., 31.6% of the training set]
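A hedged scikit-learn sketch of tuning a radial basis SVM classifier over the cost and kernel width; the grids and simulated data are assumptions, and n_support_ reports the per-class support-vector counts, as in the figure above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=250, n_features=5, random_state=1)

# Radial basis kernel: tune the cost C and the kernel width gamma.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(
    pipe, {"svc__C": [0.25, 1, 4, 16], "svc__gamma": [0.01, 0.1, 1]}, cv=5
)
grid.fit(X, y)
print(grid.best_params_, grid.best_estimator_["svc"].n_support_)
```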
Naïve Bayes
Recall Bayes' theorem:

$\Pr(C \mid X) = \dfrac{\Pr(X \mid C)\,\Pr(C)}{\Pr(X)}$

Naïve Bayes
In naïve Bayes, a.k.a. "Idiot's Bayes," the relationships between predictors are ignored:
i.e., all predictors are treated as uncorrelated
Naïve Bayes
Despite this assumption, this model is usually very competitive, even with strong correlations.
How do we estimate continuous predictor distributions?
parametrically: assume normality and use the sample mean and variance
non-parametrically: use a nonparametric density estimator
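A minimal sketch of the parametric option using scikit-learn's GaussianNB, which fits per-class normal densities and treats the predictors as independent (the simulated data are an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=250, n_features=4, random_state=1)

# Parametric naive Bayes: each predictor gets a normal density per class,
# estimated from the sample mean and variance; predictors are treated
# as independent when the densities are multiplied together.
nb = GaussianNB()
nb.fit(X, y)
print(nb.predict_proba(X[:2]))
```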
Naïve Bayes
For example, looking only at the distribution of predictor A in our example, we see a slight shift between the distributions of the predictor for each class.
[Figures: class-specific distributions of predictor A]
Naïve Bayes
For predictor B, the inactive probability is much larger for values between -0.5 and 0.5.
For each predictor, the distributions are modeled so that class probabilities can be computed for each predictor:

            Sample 1                       Sample 2
            Pred A   Pred B   Total        Pred A   Pred B   Total
Active      0.40     0.14     0.06         0.40     0.30     0.12
Inactive    0.17     0.62     0.10         0.17     0.08     0.01

(each Total is the product of the per-predictor probabilities)
Method Comparison
[Figures: performance comparison across the classification methods]
ROC Comparison
[Figure: ROC curves for the competing models]
Classification Datasets
Glaucoma Data
62 variables are derived from a confocal laser scanning image of the optic nerve head, describing its morphology. Observations are from normal and glaucomatous eyes. Examples of variables are:
as: superior area
vbss: volume below surface temporal
mhcn: mean height contour nasal
vari: volume above reference inferior, etc.
FDA Pre-Processing
FDA models often use the MARS hinge functions, so they share similar properties:
FDA models are resistant to zero- and near-zero variance predictors
highly correlated predictors are allowed, but this can lead to a significant amount of randomness during the predictor selection process: the split choice between two highly correlated predictors becomes a toss-up
Tree Pre-Processing
Same as for regression
missing predictor values are allowed
centering and scaling are not required
centering and scaling do not affect results
RDA Pre-Processing
RDA models cannot deal with zero- and near-zero variance predictors;
they must be removed.
Variables to Select
Variables thought to be related to the response should be included in the model.
Sometimes we don't know if a set of variables is related to the response. Should these be included in the analysis?
If the variables are not related to the response, then we are including noise in our predictor set.
What happens to the performance of the techniques when noise is added? Can we still find signal?
Illustration
To the blood-brain barrier data of Mente and Lombardo (2005), we have added 10, 50, 100, and 200 random predictors.
For each of these new data sets, we have built each regression model, using cross-validation to determine the optimal parameter settings.
The results are on the following slides.
Keep in mind that these results are for one example; methods may have different rankings for other examples.
Performance Comparison
R2: CV for Training Set
[Figure: R2 (0 to 0.5) vs. number of added noise predictors (0 to 200)]

Performance Comparison
R2: Test Set
[Figure: R2 (0 to 0.5) vs. number of added noise predictors (0 to 200)]
Variables to Select
Hopefully, we've demonstrated that resampling is a good way to avoid over-fitting.
Realize that predictor selection is part of the modeling process.
Doing predictor selection outside of cross-validation can lead to severe predictor selection bias
and potential over-fitting (but you won't know until you evaluate on a test set).
Thanks
Thanks for sitting through all this.
More thanks to:
benevolent overlords David Potter and Ed Kadyszewski
Nathan Coulter and Gautam Bhola for computing support
Pfizer Chemistry for feedback on earlier versions of this training