Variance
Alaa Tharwat
Email: engalaatharwat@hotmail.com
Alaa Tharwat 1 / 21
Classification and regression errors
The mean squared error (MSE) measures the difference between the target function, which is represented by the training data, and the approximation function:

MSE = \frac{1}{N} \sum_{i=1}^{N} (t_i - y_i)^2 \quad (1)

where,
t_i is the target value for the i-th training sample,
N is the number of training samples, and
y_i is the predicted or approximated value.
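As a quick numerical check, the MSE above can be computed directly; a minimal Python sketch (the function name and sample values are illustrative):

```python
# MSE between targets t_i and predictions y_i, as in Equation (1).
def mse(t, y):
    assert len(t) == len(y), "need one prediction per target"
    return sum((ti - yi) ** 2 for ti, yi in zip(t, y)) / len(t)

# three samples: squared errors 0.0, 0.25, 1.0 -> mean is 1.25 / 3
print(mse([1.0, 2.0, 3.0], [1.0, 2.5, 2.0]))  # ≈ 0.4167
```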
Risk = E(MSE) = E\left\{ \frac{1}{N} \sum_{i=1}^{N} (t_i - y_i)^2 \right\} = \frac{1}{N} \sum_{i=1}^{N} E\left\{ (t_i - y_i)^2 \right\} \quad (2)
E (ti − yi )2 = E (ti − fi + fi − yi )2
(3)
where,
E fi = fi2 because f is deterministic.
2
E (ti − yi )2 = E 2 + E (fi − yi )2
(5)
where,
E 2 represents
the variance noise or conditional variance,
2
E (fi − yi ) is the conditional mean (the MSE between the true
function and the predicted function).
The first term can not be minimized, it is called irreducible error.
the second term can be simplified (next slide).
Expanding the second term around the mean prediction E\{y_i\},

E\{(f_i - y_i)^2\} = E\{(f_i - E\{y_i\} + E\{y_i\} - y_i)^2\} \quad (6)

where,
E\{f_i E\{y_i\}\} = f_i E\{y_i\} because f is deterministic as mentioned before,
E\{E\{y_i\}^2\} = E\{y_i\}^2 since E\{E\{z\}\} = E\{z\},
E\{y_i f_i\} = f_i E\{y_i\},
E\{y_i E\{y_i\}\} = E\{y_i\}^2.
Using these identities, the cross term in the expansion of E\{(f_i - E\{y_i\} + E\{y_i\} - y_i)^2\} vanishes:

2\left( E\{f_i E\{y_i\}\} - E\{E\{y_i\}^2\} - E\{y_i f_i\} + E\{y_i E\{y_i\}\} \right)
= 2\left( f_i E\{y_i\} - E\{y_i\}^2 - f_i E\{y_i\} + E\{y_i\}^2 \right) = 0 \quad (7)

Hence,

E\{(t_i - y_i)^2\} = E\{\varepsilon^2\} + E\{(f_i - E\{y_i\})^2\} + E\{(E\{y_i\} - y_i)^2\} \quad (8)

i.e., noise + Bias^2 + Variance.
The third term in Equation (8), the variance of the learning model (Variance\{y_i\}), indicates how much the learning model y_i moves around its mean.
This type of error is caused by sensitivity to small fluctuations in the training data, which in turn results from estimating the model from finite samples.
High variance indicates that the learning model interpolates the training data perfectly, including the noise or outliers in the data; this is called overfitting.
Assume that one of the learning models perfectly interpolates all training data. This means that E\{y_i\} = f_i; hence, the bias error, E\{(f_i - E\{y_i\})^2\}, is zero.
The variance becomes E\{(E\{y_i\} - y_i)^2\} = E\{(f_i - y_i)^2\}, and y_i = t_i = f_i + \varepsilon; thus, E\{(f_i - y_i)^2\} = E\{(f_i - t_i)^2\} = E\{(f_i - (f_i + \varepsilon))^2\} = E\{\varepsilon^2\}, which is the squared noise of the training data, while the bias error is zero.
When the classification or regression model can perfectly interpolate the training data, i.e., y = t and E(y) = f for all data samples, the bias term can be neglected, but the variance term increases and, in this case, equals the noise variance.
Finally, all three terms in Equation (8) are non-negative, and the irreducible noise term forms a lower bound on the expected error on unseen data. Moreover, the last two terms in Equation (8), i.e., bias and variance, can be minimized.
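The claim that a perfectly interpolating model has zero bias and variance equal to the noise variance can be checked numerically; a minimal sketch, assuming Gaussian noise (the values of f and sigma are illustrative):

```python
import random
import statistics

random.seed(1)
f = 1.5        # deterministic true value at a training input
sigma = 0.3    # noise standard deviation, so E{eps^2} = 0.09
trials = 200_000

# A model that interpolates the training data predicts y_i = t_i = f + eps,
# so each prediction is one noisy draw of the target.
ys = [f + random.gauss(0, sigma) for _ in range(trials)]

Ey = statistics.fmean(ys)
bias2 = (f - Ey) ** 2                                # -> close to 0
var = statistics.fmean((y - Ey) ** 2 for y in ys)    # -> close to sigma^2
print(bias2, var)
```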
Illustrative example
[Figure: plot of f(x) versus x on [−1, 1], showing the original function f and the target data t.]
Figure: Illustration of the original function and the target data, which are generated from it.
[Figure: plot of f(x) versus x on [−1, 1], with curves f(x), t(x), y1(x), y5(x), and y9(x).]
Figure: Illustration of fitting models (y1, y5, and y9) for the target data (t).
The figure below shows the model y1 fitted on 10 training datasets, each consisting of 20 samples.
y1 achieved a high bias error and a high MSE between f(x) and the mean of all fits; most of the error is bias error, i.e., the error between the mean of all fits/approximations and f(x) is high.
[Figure: plot over x on [−1, 1] showing the individual y1(x) fits, the mean of all fits, f(x), and the squared error.]
Figure: Illustration of the first model (y1) with ten target datasets, each of size 20.
The figure below shows the model y5 fitted on 10 training datasets, each consisting of 20 samples.
y5 achieved better results than y1.
[Figure: plot over x on [−1, 1] showing the individual y5(x) fits, the mean of all fits, f(x), and the squared error.]
Figure: Illustration of the second model (y5) with ten target datasets, each of size 20.
The figure below shows the model y9 fitted on 10 training datasets, each consisting of 20 samples.
y9 achieved the minimum bias (lower than y1 and y5); this third model (y9) is sharp and perfectly fits all the training samples.
[Figure: plot over x on [−1, 1] showing the individual y9(x) fits, the mean of all fits, f(x), and the squared error.]
Figure: Illustration of the third model (y9) with ten target datasets (t), each of size 20.
Some models achieve high bias and low variance, such as y1 in our example, and a decrease in bias is followed by an increase in variance.
However, there is some intermediate model complexity that balances bias and variance and achieves the minimum expected test error.
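This trade-off can be illustrated without polynomial fitting; a sketch assuming a toy shrinkage estimator y = lam * (sample mean), where the factor lam plays the role of model complexity (all names and values here are illustrative, not from the slides):

```python
import random
import statistics

random.seed(2)
f, sigma, N, trials = 2.0, 1.0, 5, 50_000

def bias2_and_var(lam):
    """Estimate Bias^2 and Variance of the estimator y = lam * mean(train)."""
    preds = []
    for _ in range(trials):
        train = [f + random.gauss(0, sigma) for _ in range(N)]
        preds.append(lam * statistics.fmean(train))
    Ey = statistics.fmean(preds)
    return (f - Ey) ** 2, statistics.fmean((y - Ey) ** 2 for y in preds)

# small lam: high bias, low variance; lam = 1: zero bias, maximal variance
for lam in (0.5, 0.8, 0.95, 1.0):
    b2, v = bias2_and_var(lam)
    print(f"lam={lam:.2f}  bias^2={b2:.3f}  variance={v:.3f}  sum={b2 + v:.3f}")
```

Analytically, Bias^2 = ((1 - lam) f)^2 and Variance = lam^2 sigma^2 / N, so their sum is minimized at an intermediate lam, mirroring the intermediate model complexity described above.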
[Figure: variance, squared bias, and their sum versus model complexity (polynomial order 2-10).]
Figure: Plot of the variance and squared bias together with their sum, as well as the average test error and the model complexity that achieves the minimum test error.
The figure below compares the training and testing errors as the model complexity changes.
The training error is not a good estimate of the test error: the training error decreases as the model complexity increases and reaches zero when a sufficiently complex model is used, i.e., overfitting.
[Figure: train error, test error, and best model versus model complexity (polynomial order 2-10).]
Figure: Illustration of the training and testing errors with respect to the model complexity.