
Classification and regression errors: Bias and Variance

Alaa Tharwat

Email: engalaatharwat@hotmail.com

Classification and regression errors

In classification or regression problems, not all errors are equal.
The goal of any regression or classification model is to find a function $y$ that approximates the true function $f$, where the target is $t = f + \epsilon$,
where $\epsilon$ is the noise, $\epsilon$ has zero mean, i.e. $E\{\epsilon\} = 0$, and its variance is $\sigma^2$.
The function $t$ represents the training data (the target function); different training data $t$ can be generated from the same function $f$.

Classification and regression errors

A good learning algorithm can be obtained by minimizing the Mean Square Error ($MSE$) for both the training data and the testing (unseen) data.
$$MSE = \frac{1}{N} \sum_{i=1}^{N} (t_i - y_i)^2 \qquad (1)$$

where,
$MSE$ represents the difference between the target function ($t_i$), which is represented by the training data, and the approximation function ($y_i$),
$t_i$ is the target value of the $i$th training sample,
$N$ is the number of training samples, and
$y_i$ is the predicted (approximated) value.
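As a small illustration (not part of the original slides; the array names and toy numbers are placeholders), Equation (1) can be computed directly in Python:

```python
import numpy as np

def mse(t, y):
    """Mean squared error between targets t and predictions y (Equation (1))."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((t - y) ** 2)

# Toy numbers for illustration only.
print(mse([1.0, 0.5, -0.2], [0.9, 0.7, -0.1]))  # 0.02
```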

Classification and regression errors

To evaluate the model at arbitrarily many test points, the expectation of the MSE (the risk) is calculated as follows:

$$Risk = E(MSE) = E\left\{\frac{1}{N}\sum_{i=1}^{N}(t_i - y_i)^2\right\} = \frac{1}{N}\sum_{i=1}^{N} E\left\{(t_i - y_i)^2\right\} \qquad (2)$$

Classification and regression errors

$E\{(t_i - y_i)^2\}$ can be simplified as follows:

$$E\{(t_i - y_i)^2\} = E\{(t_i - f_i + f_i - y_i)^2\}$$
$$= E\{(t_i - f_i)^2\} + E\{(f_i - y_i)^2\} + 2E\{(t_i - f_i)(f_i - y_i)\}$$
$$= E\{\epsilon^2\} + E\{(f_i - y_i)^2\} + 2E\{t_i f_i - f_i^2 - y_i t_i + y_i f_i\} \qquad (3)$$

where,
$E\{f_i^2\} = f_i^2$ because $f$ is deterministic,
$E\{f_i t_i\} = f_i^2$ because $f$ is deterministic and $E\{t_i\} = E\{f_i + \epsilon\} = E\{f_i\} = f_i$,
$E\{y_i t_i\} = E\{y_i (f_i + \epsilon)\} = E\{y_i f_i\} + E\{y_i \epsilon\} = E\{y_i f_i\}$, since the noise $\epsilon$ is zero-mean and independent of the prediction $y_i$.
the last term in Equation (3) is as follows:

$$2E\{t_i f_i - f_i^2 - y_i t_i + y_i f_i\} = 2\left(f_i^2 - f_i^2 - E\{y_i f_i\} + E\{y_i f_i\}\right) = 0 \qquad (4)$$
Classification and regression errors

The final form of $E\{(t_i - y_i)^2\}$ is as follows:

$$E\{(t_i - y_i)^2\} = E\{\epsilon^2\} + E\{(f_i - y_i)^2\} \qquad (5)$$

where,
$E\{\epsilon^2\}$ represents the noise variance (conditional variance),
$E\{(f_i - y_i)^2\}$ is the conditional mean squared error between the true function and the predicted function.
The first term cannot be minimized; it is called the irreducible error.
The second term can be simplified further (next slide).

Classification and regression errors

$$E\{(f_i - y_i)^2\} = E\{(f_i - E\{y_i\} + E\{y_i\} - y_i)^2\}$$
$$= E\{(f_i - E\{y_i\})^2\} + E\{(E\{y_i\} - y_i)^2\} + 2E\{(f_i - E\{y_i\})(E\{y_i\} - y_i)\}$$
$$= bias^2 + Variance\{y_i\} + 2\left(E\{f_i E\{y_i\}\} - E\{E\{y_i\}^2\} - E\{y_i f_i\} + E\{y_i E\{y_i\}\}\right) \qquad (6)$$

where,
$E\{f_i E\{y_i\}\} = f_i E\{y_i\}$ because $f$ is deterministic, as mentioned before,
$E\{E\{y_i\}^2\} = E\{y_i\}^2$ since $E\{E\{z\}\} = E\{z\}$,
$E\{y_i f_i\} = f_i E\{y_i\}$,
$E\{y_i E\{y_i\}\} = E\{y_i\}^2$.
Classification and regression errors

The last term in Equation (6) is

$$2\left(E\{f_i E\{y_i\}\} - E\{E\{y_i\}^2\} - E\{y_i f_i\} + E\{y_i E\{y_i\}\}\right) = 2\left(f_i E\{y_i\} - E\{y_i\}^2 - f_i E\{y_i\} + E\{y_i\}^2\right) = 0 \qquad (7)$$

therefore, $E\{(f_i - y_i)^2\} = bias^2 + Variance\{y_i\}$, and

$$E\{(t_i - y_i)^2\} = E\{\epsilon^2\} + E\{(f_i - y_i)^2\}$$
$$= E\{\epsilon^2\} + E\{(f_i - E\{y_i\})^2\} + E\{(E\{y_i\} - y_i)^2\}$$
$$= Variance\{noise\} + bias^2 + Variance\{y_i\} \qquad (8)$$
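To make Equation (8) concrete, the following sketch (not part of the original slides; the true function, noise level, and model degree are assumed for illustration) estimates each term by repeatedly fitting a simple model and compares their sum with the directly measured test MSE:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)     # assumed true function
sigma = 0.1                         # assumed noise standard deviation
n_datasets, n_train, degree = 200, 20, 1

x_test = np.linspace(-1, 1, 50)     # fixed test inputs
preds = np.empty((n_datasets, x_test.size))
for d in range(n_datasets):
    x = rng.uniform(-1, 1, n_train)
    t = f(x) + rng.normal(0, sigma, n_train)   # t = f + eps
    preds[d] = np.polyval(np.polyfit(x, t, degree), x_test)

bias2 = np.mean((f(x_test) - preds.mean(axis=0)) ** 2)   # E{(f - E{y})^2}
variance = np.mean(preds.var(axis=0))                    # E{(E{y} - y)^2}
noise = sigma ** 2                                       # E{eps^2}

# Directly measured expected test MSE, for comparison with the decomposition.
t_test = f(x_test) + rng.normal(0, sigma, preds.shape)
measured = np.mean((t_test - preds) ** 2)
print(f"noise={noise:.4f} bias^2={bias2:.4f} variance={variance:.4f} "
      f"sum={noise + bias2 + variance:.4f} measured={measured:.4f}")
```

The two printed totals should agree up to simulation noise, which is the content of Equation (8).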

Classification and regression errors

The first term in Equation (8), the variance of the noise ($Variance\{noise\}$), cannot be minimized; it is independent of the classification or regression model and is called the irreducible error.
Even an infinite amount of training data will not reduce this type of error.

Classification and regression errors

The second term in Equation (8), the bias of the learning model, represents the error caused by the simplifying assumptions built into the model.
E.g., assume the learning model tries to approximate a nonlinear function using a constant or linear model; hence, there will be an error in the approximated function due to this assumption.
On the contrary, assume the learning model is complex; the training data will then be interpolated almost perfectly, so a lower bias is obtained but the variance will be high.
The bias is therefore the error caused by erroneous assumptions in the learning model.
Hence, a high bias indicates that the learning model cannot find the relevant relations between the given data and the target outputs; this is called underfitting. More complex models are able to represent the training data more accurately, with low bias.

Classification and regression errors

Assume that the output of the learning model is constant, i.e. $y_i = c$, which is a very simple model. Hence,

$$E\{(f_i - c)^2\} = E\{(f_i - E\{c\} + E\{c\} - c)^2\}$$
$$= E\{(f_i - E\{c\})^2\} + E\{(E\{c\} - c)^2\} + 2E\{(f_i - E\{c\})(E\{c\} - c)\}$$
$$= bias^2 + Variance\{c\} + 2\left(E\{f_i E\{c\}\} - E\{E\{c\}^2\} - E\{c f_i\} + E\{c E\{c\}\}\right) \qquad (9)$$

where $E\{c\} = c$ and $Variance\{c\} = 0$ since $c$ is constant. Hence, the variance error is zero but the bias error is high.
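A quick numerical check (an illustrative sketch, not from the slides; it borrows the $\sin(\pi x)$ example used later and a fixed constant $c = 0$):

```python
import numpy as np

f = lambda x: np.sin(np.pi * x)   # assumed true function
x_test = np.linspace(-1, 1, 200)

c = 0.0                           # constant model y_i = c, independent of the training data
bias2 = np.mean((f(x_test) - c) ** 2)
variance = 0.0                    # the prediction never changes, so Variance{y} = 0
print(f"bias^2 ~ {bias2:.3f}, variance = {variance}")   # bias^2 ~ 0.5, variance = 0
```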

Classification and regression errors

The third term in Equation (8), the variance of the learning model ($Variance\{y_i\}$), indicates how much the learning model $y_i$ moves around its mean.
This type of error is caused by sensitivity to small fluctuations in the training data, which is also due to the estimation from finite samples.
A high variance indicates that the learning model interpolates the training data perfectly, including the noise or outliers in the data; this is called overfitting.

Classification and regression errors

Assume that one of the learning models perfectly interpolates all training data. This means that $E\{y_i\} = f_i$; hence, the bias error, $E\{(f_i - E\{y_i\})^2\}$, is zero.
The variance is $E\{(E\{y_i\} - y_i)^2\} = E\{(f_i - y_i)^2\}$ and $y_i = t_i = f_i + \epsilon$; thus, $E\{(f_i - y_i)^2\} = E\{(f_i - t_i)^2\} = E\{(f_i - (f_i + \epsilon))^2\} = E\{\epsilon^2\}$, which is the variance of the noise in the training data, while the bias error is zero (a small numerical sketch follows below).
When the classification or regression model can perfectly interpolate the training data, i.e., $y = t$ and $E\{y\} = f$ for all data samples, the bias term can be neglected, but the variance term increases and in this case equals the variance of the noise.
Finally, all three terms in Equation (8) are non-negative, and this forms a lower bound on the expected error on unseen data. Moreover, the last two terms in Equation (8), i.e., the bias and the variance, can be minimized.
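The following sketch (an illustration, not from the slides; it assumes the $\sin(\pi x)$ example and $\sigma = 0.1$) shows that a model which reproduces every training target exactly has zero training error, while its error with respect to the true function $f$ is roughly the noise variance:

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(np.pi * x)
x = rng.uniform(-1, 1, 20)
t = f(x) + rng.normal(0, 0.1, 20)   # noisy targets

y = t.copy()                        # a model that interpolates every training point: y_i = t_i
train_mse = np.mean((t - y) ** 2)   # exactly 0: the fit looks perfect on the training data
true_mse = np.mean((f(x) - y) ** 2) # ~ sigma^2 = 0.01: the error w.r.t. f is just the noise
print(train_mse, true_mse)
```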

Illustrative example

Given the original function $f(x) = \sin(\pi x)$, where $t = f + \epsilon$ and the noise level is $\sigma = 0.1$.
The target function $t$ is represented by 20 points which are generated from $f(x)$; a small data-generation sketch follows the figure.

Figure: Illustration of the original function and the target function which are generated from it.
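A minimal sketch (not the author's code; the variable names and the noise standard deviation of 0.1 are assumptions) to generate and plot such a dataset:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)

x = np.sort(rng.uniform(-1, 1, 20))    # 20 sample locations in [-1, 1]
t = f(x) + rng.normal(0, 0.1, 20)      # noisy targets t = f + eps

xs = np.linspace(-1, 1, 200)
plt.plot(xs, f(xs), label="f")
plt.scatter(x, t, label="t")
plt.xlabel("x"); plt.ylabel("f(x)")
plt.legend(); plt.show()
```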

Illustrative example

Assume we have three different polynomial learning models $y_1(x)$, $y_5(x)$, and $y_9(x)$, where $y_1(x)$ is the first model with degree one, $y_5(x)$ is the second model with degree five, and $y_9(x)$ is the third model with degree nine (a fitting sketch follows the figure).

Figure: Illustration of the fitting models ($y_1$, $y_5$, and $y_9$) for the target data ($t$).
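A minimal fitting sketch (an illustration under assumed settings, not the author's original code), using least-squares polynomial fits of the three degrees:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)
x = np.sort(rng.uniform(-1, 1, 20))
t = f(x) + rng.normal(0, 0.1, 20)

xs = np.linspace(-1, 1, 200)
plt.plot(xs, f(xs), "k", label="f(x)")
plt.scatter(x, t, label="t(x)")
for degree in (1, 5, 9):
    coeffs = np.polyfit(x, t, degree)          # least-squares polynomial fit
    plt.plot(xs, np.polyval(coeffs, xs), label=f"y{degree}(x)")
plt.legend(); plt.show()
```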

Illustrative example

The figure below shows the model $y_1$ fitted to 10 training datasets, each consisting of 20 samples.
$y_1$ achieved a high bias error: the MSE between $f(x)$ and the mean of all fits is high, and most of the error is bias error.
Figure: Illustration of the first model ($y_1$) with ten target datasets, each of size 20, showing the individual fits, the mean of all fits, $f(x)$, and the squared error.
Illustrative example

The figure below shows the second model $y_5$ fitted to 10 training datasets, each consisting of 20 samples.
$y_5$ achieved better results than $y_1$.

Figure: Illustration of the second model ($y_5$) with ten target datasets, each of size 20, showing the individual fits, the mean of all fits, $f(x)$, and the squared error.

Illustrative example

The figure below shows the model $y_9$ fitted to 10 training datasets, each consisting of 20 samples.
$y_9$ achieved the minimum bias (lower than $y_1$ and $y_5$); the individual fits of the third model ($y_9$) were sharp and fit all the training samples almost perfectly.

Figure: Illustration of the third model ($y_9$) with ten target datasets ($t$), each of size 20, showing the individual fits, the mean of all fits, $f(x)$, and the squared error.
Illustrative example

Some models achieve high bias and low variance, such as $y_1$ in our example; a decrease in bias is followed by an increase in variance.
However, there is some intermediate model complexity that can balance the bias and variance and achieve the minimum expected test error.

Illustrative example

In the figure below, the model complexity is represented by the degree (polynomial order) of the model.
The best model, i.e., the one achieving the minimum testing error, is obtained at an intermediate model complexity where the trade-off between the bias and variance is achieved; a simulation sketch follows the figure.
Figure: Plot of the variance and squared bias together with their sum, against the model complexity (polynomial order). Also shown are the average test error and the model complexity which achieved the minimum testing error.
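The curves in this figure can be reproduced approximately with the following sketch (an illustration under assumed settings: the $\sin(\pi x)$ target, $\sigma = 0.1$, 20 samples per dataset, and many repeated datasets; it is not the author's original code):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)
sigma, n_train, n_datasets = 0.1, 20, 200
x_test = np.linspace(-1, 1, 50)

for degree in range(1, 11):
    preds = np.empty((n_datasets, x_test.size))
    for d in range(n_datasets):
        x = rng.uniform(-1, 1, n_train)
        t = f(x) + rng.normal(0, sigma, n_train)
        preds[d] = np.polyval(np.polyfit(x, t, degree), x_test)
    bias2 = np.mean((f(x_test) - preds.mean(axis=0)) ** 2)
    variance = np.mean(preds.var(axis=0))
    test_error = bias2 + variance + sigma ** 2    # expected test MSE via Equation (8)
    print(f"degree {degree:2d}: bias^2={bias2:.4f}  variance={variance:.4f}  "
          f"bias^2+var={bias2 + variance:.4f}  test={test_error:.4f}")
```

The degree with the smallest printed test value plays the role of the "Best Model" marked in the figure.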
Illustrative example

The figure below compares the training and testing errors as the model complexity is changed; a small sketch follows the figure.
The training error is not a good estimate of the test error: the training error decreases as the model complexity grows and approaches zero when a very complex model is used, i.e., overfitting.
Figure: Illustration of the training and testing errors with respect to the model complexity (polynomial order).
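A rough way to reproduce this comparison (an illustrative sketch with assumed settings, not the author's code): fit polynomials of increasing degree on one training set and evaluate on a large held-out test set.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)
sigma = 0.1

x_tr = rng.uniform(-1, 1, 20);   t_tr = f(x_tr) + rng.normal(0, sigma, 20)
x_te = rng.uniform(-1, 1, 1000); t_te = f(x_te) + rng.normal(0, sigma, 1000)

for degree in range(1, 11):
    coeffs = np.polyfit(x_tr, t_tr, degree)
    train_err = np.mean((t_tr - np.polyval(coeffs, x_tr)) ** 2)
    test_err = np.mean((t_te - np.polyval(coeffs, x_te)) ** 2)
    print(f"degree {degree:2d}: train={train_err:.4f}  test={test_err:.4f}")
```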
