Professional Documents
Culture Documents
About Me
PhD from ETH
Used to be a statistician at Link, now
Senior Business Analyst at Expedia
04.10.201
3 | 3|
Cross Validation
Overview
What is R2?
Model error
Variance in the dependent variable
=1
55.001
2 = 0.6624
2
=1
55.001
2 = 0.72
2
Overfitting
Left to its own devices, Multiple Regression will fit too many
patterns.
Overfitting
Error on the dataset used
to fit the model can be
misleading
0.20
0.15
train data
test data
0.05
Bias-Variance Tradeoff.
low bias
high variance
----->
0.10
performance.
prediction error
high bias
low variance
<-----
40
60
80
100
120
140
160
Overfitting Cont.
What are the consequences of overfitting?
Overfitted models will have high R2 values, but will perform poorly in
predicting out-of-sample cases
Test
Validation
Test
3. See how well the model can predict the test set
Model
Validation
4.
For each fold estimate a model on all the subsets except one
Use the left out subset to test the model, by calculating a CV metric of
choice
1
Train
1
Train
1
Train
2
Train
2
Train
2
Train
3 Test
4
Train
3
Train
4
Train
5
Train
5
Train
5
Train
3 Test
4 Test
Mean Absolute
Percentage Error (MAPE)
(1/|
120%
|)*100
Simulation Example
Simulated 8 variables from a standard gaussian. The first
three variables where related to the dependent variable,
the other variables where random noise.
Simulation Example
Conclusion