

Resampling Methods
1. Cross-Validation
2. Bootstrap
Cross-validation and the bootstrap are both ways of resampling from the data. Their purpose is to obtain additional
information about the fitted model.
Cross-validation is a very important tool to get a good idea of the test set error of a model. Also, it allows us to
pick a good model from a set of competing models.
The bootstrap is most useful for getting an idea of the variability (standard deviation) of an estimate and of its bias.
Cross-Validation
Training Error vs Test Error
Training error - the error we get when applying the model to the same data on which it was trained.
Test error - the error we get on previously unseen data.
Training error is too optimistic. The more we fit to the data, the lower the training error. But the test error can get
higher if we overfit.
Training- versus Test-Set Performance
Model complexity - the number of features, or the number of coefficients that we fit in the model.
Training error curve:
On the left, the model complexity is low; we're fitting a small number of parameters. The training error is high.
As we increase the model complexity, fitting more and more features or a higher-order polynomial, the training
error goes down.
Test error curve:
The test error (the red curve) does not consistently go down. It starts off high like the training error, comes down for
a while, but then starts to come up again. This is an example of overfitting. On the left, we've added complexity,
some features that actually are important for predicting the response, so they reduce the test error. But at some
point, we seem to have fit all the important features, and now we're putting in things which are just noise. The
training error goes down, as it has to, but the test error is starting to go up.
We don't want to overfit, because we'll increase the test error. The training error has not told us anything about
overfitting, because it uses the same data to measure error. The more parameters, the better it looks. So it does
not give us a good idea of the test error.
The ingredients of prediction error are bias and variance. Bias is how far off, on average, the model is from the
truth. Variance is how much the estimate varies around its average.
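For squared-error loss this can be written more precisely (a sketch of the standard identity, where $x_0$ is a test point and $\varepsilon$ is the irreducible noise):

$$E\left[(y_0 - \hat{f}(x_0))^2\right] = \mathrm{Var}\big(\hat{f}(x_0)\big) + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \mathrm{Var}(\varepsilon)$$

The last term cannot be reduced by any model; the first two are what the complexity trade-off below controls.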
When we don't fit very hard, the bias is high. The variance is low, because there are few parameters being fit. As
we increase the amount of complexity moving to the right, the bias goes down because the model can adapt to
more and more subtleties in the data. But the variance goes up, because we have more and more parameters to
estimate from the same amount of data.
Bias and variance together give us prediction error, and there's a trade-off. So we can't use training error to
estimate test error, as the previous picture shows. What do we do?
The best solution is to have a large test set and use it. But very often, we don't have a large test set.
Mathematical methods (Cp statistic, AIC and BIC) - these methods adjust the training error by increasing it
by a factor that involves the amount of fitting that we've done to the data and the variance (formulas sketched below).
Validation-set approach: estimate the test error by holding out a subset of the training observations from
the fitting process, and then applying the statistical learning method to those held-out observations.
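For the adjustment methods mentioned above, a sketch of the standard form (following ISLR, for a least-squares model with $d$ fitted parameters and error-variance estimate $\hat{\sigma}^2$):

$$C_p = \frac{1}{n}\left(\mathrm{RSS} + 2\,d\,\hat{\sigma}^2\right)$$

AIC is proportional to $C_p$ for least-squares fits, and BIC replaces the factor 2 with $\log(n)$, so the penalty grows with the amount of fitting done.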
Validation-Set Approach
Here we randomly divide the available set of samples into two parts (of roughly equal size): a training set
and a validation or hold-out set.
We take the model, fit it on the training half, and then apply the fitted model to the other half, the
validation or hold-out set.
The resulting validation-set error provides an estimate of the test error. This is typically assessed using
MSE (mean squared error) in the case of a quantitative response and misclassification rate in the case of a
qualitative (discrete) response.
This is wasteful if you've got a very small data set. Cross-validation removes that waste and is more efficient.
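A minimal sketch of the validation-set approach, assuming scikit-learn and NumPy are available; the data here are synthetic stand-ins:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data: one predictor, quadratic signal plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=1.0, size=200)
X = x.reshape(-1, 1)

# Randomly split into a training half and a validation (hold-out) half
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=1)

# Fit on the training half, then evaluate on the held-out half
model = LinearRegression().fit(X_train, y_train)
val_mse = mean_squared_error(y_val, model.predict(X_val))
print(f"Validation-set MSE: {val_mse:.2f}")
```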
Example:
Left panel shows single split; right panel shows multiple splits
We're comparing the linear model to higher-order polynomials in regression. We have 392 observations divided
into two parts at random, 196 in one part and 196 in the other, the first part being the training set and the
other part the validation set. If we do this once, a single split, and record the mean squared error, we
get the red curve as a function of the degree of the polynomial. The minimum occurs at around 2, meaning that the
best model is quadratic.
But look what happens when we repeat this process with more and more splits at random into two parts (right
panel). We get a lot of variability. The minimum does tend to occur around 2 generally, but the error varies
from about 16 up to 24, depending on the split. This is a consequence of the fact that we divided the data into
two parts, and when you divide data in two, you get a lot of variability depending on the split. The training set is
half as big as it was originally.
There are two things we want to use cross-validation for: to pick the best size of the model and also to give us an
idea of how good the error is.
This procedure is successful at the first thing: the minimum is around 2 pretty consistently. But the actual level of
the curve varies a lot, so it wouldn't be so good at giving an idea of the error.
We're throwing away half the data each time in training. We actually want the test error for a training set of size
n, but we are getting an idea of the test error for a training set of size n/2. And that's likely to be quite a bit higher
than the error for a training set of size n.
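A sketch of the repeated-split experiment described above, with synthetic data standing in for the 392-observation data set (scikit-learn assumed); the exact numbers will differ, but the split-to-split variability should show up:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n = 392
x = rng.uniform(-3, 3, size=n)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=2.0, size=n)
X = x.reshape(-1, 1)

degrees = range(1, 11)
for split in range(10):                      # ten different random 196/196 splits
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5, random_state=split)
    mses = []
    for d in degrees:                        # polynomials of increasing degree
        fit = make_pipeline(PolynomialFeatures(d), LinearRegression()).fit(X_tr, y_tr)
        mses.append(mean_squared_error(y_va, fit.predict(X_va)))
    best = degrees[int(np.argmin(mses))]
    print(f"split {split}: best degree = {best}, MSE at best = {min(mses):.2f}")
```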
K-Fold Cross-Validation
Widely used approach for estimating test error. Estimates can be used to select the best model, and to give an idea of
the test error of the final chosen model.
The idea is to randomly divide the data into K equal-sized parts. We leave out part k, fit the model to the other K - 1
parts (combined), and then obtain predictions for the left-out part. This is done in turn for each part k = 1, 2,
..., K, and then the results are combined.
The best choice for K, the number of folds, is usually about 5 or 10.
Let's take K = 5:
Take the data set and divide it at random into five (equal) parts. The first part is the validation set. We train the model
on the rest of the data, take the fit of the model, predict on the validation part, and record the error.
That's phase one. In phase two, the validation set will be part two, and the other four parts will be the training set.
We fit the model to the training set and then apply it to this validation part.
We keep doing this until all K parts play the role of validation set. We take the prediction errors from all five
parts, add them together, and that gives us what's called the cross-validation error.
$$CV_{(K)} = \sum_{k=1}^{K} \frac{n_k}{n}\, MSE_k, \qquad MSE_k = \frac{1}{n_k} \sum_{i \in C_k} (y_i - \hat{y}_i)^2$$

where $n_k$ is the number of observations in part $k$, $MSE_k$ is the mean squared error obtained on validation part $k$, $C_k$ is the set of observations in part $k$, and $\hat{y}_i$ is the fit for observation $i$, obtained from the data with part $k$ removed.
Since each training set is only (K - 1)/K as big as the original training set, the estimates of prediction error
will typically be biased upward.
This bias is minimized when K = n (LOOCV), but this estimate has high variance.
K = 5 or 10 provides a good compromise for this bias-variance tradeoff.
This is cross-validation for a quantitative response. For classification problems, the only thing that changes is the
measure of error (no longer squared error, but misclassification error).
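A minimal sketch of 5-fold cross-validation written out by hand so the formula above is visible, assuming NumPy and scikit-learn with synthetic stand-in data:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 0] ** 2 + rng.normal(scale=1.0, size=200)

K = 5
kf = KFold(n_splits=K, shuffle=True, random_state=1)

cv_error = 0.0
for train_idx, val_idx in kf.split(X):
    # Fit with part k removed, then predict the left-out part
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    mse_k = mean_squared_error(y[val_idx], model.predict(X[val_idx]))
    # Weight each fold's MSE by n_k / n, as in the CV formula
    cv_error += (len(val_idx) / len(y)) * mse_k

print(f"{K}-fold CV estimate of test MSE: {cv_error:.2f}")
```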
The Bootstrap
The bootstrap is a powerful method for assessing uncertainty in estimates. For example, it can provide an
estimate of the standard error of a coefficient, or a confidence interval for that coefficient.
Example:
Suppose we have a fixed sum of money which we want to invest in two assets that yield returns X and Y, where X
and Y are random quantities.
We want to invest a fraction $\alpha$ of our money in X, and the remaining $1 - \alpha$ in Y, and we want to choose $\alpha$ so as to
minimize the total risk (that is, the variance) of our investment. In other words, we want to minimize
$\mathrm{Var}(\alpha X + (1 - \alpha) Y)$.
Mathematically, the minimizing value is

$$\alpha = \frac{\mathrm{Var}(Y) - \mathrm{Cov}(X, Y)}{\mathrm{Var}(X) + \mathrm{Var}(Y) - 2\,\mathrm{Cov}(X, Y)} = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\,\sigma_{XY}},$$

but the values $\sigma_X^2$, $\sigma_Y^2$ and $\sigma_{XY}$ are unknown.
These quantities are not known in general, because they are population quantities. But if we have a data set from
the population under study, we can get an idea of these quantities, the variances and the covariance. From the
sample values in the data set, we can compute estimates $\hat{\sigma}_X^2$, $\hat{\sigma}_Y^2$ and $\hat{\sigma}_{XY}$.
We can then estimate the value of $\alpha$ that minimizes the variance of our investment using these estimates:

$$\hat{\alpha} = \frac{\hat{\sigma}_Y^2 - \hat{\sigma}_{XY}}{\hat{\sigma}_X^2 + \hat{\sigma}_Y^2 - 2\,\hat{\sigma}_{XY}}$$
Now, imagine we simulate investment returns, generating a data set of 100 pairs of X and Y, and calculate $\hat{\alpha}$; we then repeat
this simulation 1,000 times. We will have 1,000 estimates of $\alpha$. The mean of those 1,000 estimates will be very
close to the true $\alpha$.
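A sketch of that simulation (NumPy assumed; the "true" variances and covariance below are made-up illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed "true" population parameters (made up for illustration)
true_cov = np.array([[1.0, 0.5],
                     [0.5, 1.25]])      # Var(X)=1, Var(Y)=1.25, Cov(X,Y)=0.5

def alpha_hat(x, y):
    """Plug-in estimate of alpha from sample variances and covariance."""
    var_x, var_y = np.var(x, ddof=1), np.var(y, ddof=1)
    cov_xy = np.cov(x, y, ddof=1)[0, 1]
    return (var_y - cov_xy) / (var_x + var_y - 2 * cov_xy)

# Simulate 1,000 data sets of 100 (X, Y) pairs and estimate alpha in each
estimates = []
for _ in range(1000):
    xy = rng.multivariate_normal(mean=[0, 0], cov=true_cov, size=100)
    estimates.append(alpha_hat(xy[:, 0], xy[:, 1]))

true_alpha = (1.25 - 0.5) / (1.0 + 1.25 - 2 * 0.5)
print(f"true alpha      = {true_alpha:.3f}")
print(f"mean of est.    = {np.mean(estimates):.3f}")
print(f"sd of estimates = {np.std(estimates, ddof=1):.3f}")
```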
We can't apply this with real data, because we don't actually have the ability to get 1,000 samples from the
population. We have a single sample, and we don't know the population from which that data arose.
The bootstrap approach allows us to mimic the process of obtaining new data sets, so that we can estimate the
variability of our estimate without generating additional samples.
Rather than repeatedly obtaining independent data sets from the population, we instead obtain distinct data sets
by repeatedly sampling observations from the original data set with replacement (we treat the data sample as a
population).
Each of these bootstrap data sets is created by sampling with replacement, and is the same size as our original
data set. As a result, some observations may appear more than once in a given bootstrap data set and some not at all.
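A sketch of the bootstrap standard error for $\hat{\alpha}$, assuming a single observed sample of (X, Y) pairs (synthetic here) and NumPy; the alpha_hat helper from the simulation sketch is repeated so the snippet is self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)

# A single observed sample (synthetic here, standing in for real data)
xy = rng.multivariate_normal(mean=[0, 0],
                             cov=[[1.0, 0.5], [0.5, 1.25]], size=100)
x, y = xy[:, 0], xy[:, 1]

def alpha_hat(x, y):
    var_x, var_y = np.var(x, ddof=1), np.var(y, ddof=1)
    cov_xy = np.cov(x, y, ddof=1)[0, 1]
    return (var_y - cov_xy) / (var_x + var_y - 2 * cov_xy)

# Bootstrap: resample observations with replacement, same size as the original sample
B = 1000
n = len(x)
boot_estimates = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)        # indices drawn with replacement
    boot_estimates[b] = alpha_hat(x[idx], y[idx])

se_boot = np.std(boot_estimates, ddof=1)    # bootstrap standard error of alpha-hat
print(f"alpha-hat = {alpha_hat(x, y):.3f}, bootstrap SE = {se_boot:.3f}")
```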
Block bootstrap
In the previous example we assumed the observations were IID. But in a time series the observations are not
independent; they are correlated across time, so the block bootstrap is used.
The block bootstrap divides the data up into blocks, assumed to be independent. Our sampling units are not
individual observations, but entire blocks. So we would sample with replacement from all the blocks and then
paste them together into a new time series.
The central point here is that you have to sample things that are uncorrelated. Here it is assumed that beyond a
time lag of one block, observations are uncorrelated. But within a block we expect correlation, so we keep the
blocks intact and sample them as units.
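A rough sketch of the resampling step of a non-overlapping block bootstrap, with an arbitrary illustrative block length and a synthetic AR(1) series standing in for real time-series data (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

# A synthetic autocorrelated series (AR(1)) standing in for real time-series data
n = 200
series = np.zeros(n)
for t in range(1, n):
    series[t] = 0.8 * series[t - 1] + rng.normal()

block_len = 20                               # illustrative choice of block length
blocks = series.reshape(-1, block_len)       # non-overlapping blocks, kept intact

# Sample whole blocks with replacement and paste them into a new series
chosen = rng.integers(0, blocks.shape[0], size=blocks.shape[0])
boot_series = blocks[chosen].ravel()         # bootstrap series, same length as original
```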
Uses of the bootstrap
Primarily used to obtain standard errors of an estimate.
Another very common use of the bootstrap is to form a confidence interval for a population parameter (called the
bootstrap percentile confidence interval).
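A sketch of the percentile interval, assuming boot_estimates is an array of bootstrap estimates like the one computed in the earlier sketch (a placeholder array is generated here so the snippet runs on its own):

```python
import numpy as np

# boot_estimates would normally come from the bootstrap loop sketched above;
# a placeholder array of simulated estimates is used here so the snippet runs.
rng = np.random.default_rng(0)
boot_estimates = rng.normal(loc=0.6, scale=0.08, size=1000)

# 95% bootstrap percentile confidence interval: the 2.5th and 97.5th percentiles
lower, upper = np.percentile(boot_estimates, [2.5, 97.5])
print(f"95% bootstrap percentile CI: ({lower:.3f}, {upper:.3f})")
```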
Resampling Methods
Gabriela Hromis
Notes are based on different books and class notes from different universities.
Images are from Statistical Learning, Hastie & Tibshirani