
Classification and regression errors: Bias and Variance

Alaa Tharwat

Email: engalaatharwat@hotmail.com

Classification and regression errors

In classification or regression problems, not all errors are equal.
The goal of any regression or classification model is to find a function $y$ that approximates the true function $f$, where the target is $t = f + \epsilon$,
where $\epsilon$ is the noise, $\epsilon$ has zero mean, i.e. $E\{\epsilon\} = 0$, and its variance is $\sigma^2$.
The function $t$ represents the training data (the target function); different training data $t$ can be generated from the same function $f$.

Classification and regression errors

A good learning algorithm can be obtained by minimizing the Mean Square Error ($MSE$) for both the training data and the testing (unseen) data.
$$MSE = \frac{1}{N} \sum_{i=1}^{N} (t_i - y_i)^2 \qquad (1)$$

where,
$MSE$ represents the difference between the target function ($t_i$), which is represented by the training data, and the approximation function ($y_i$),
$t_i$ is the target value of the $i$th training sample,
$N$ is the number of training samples, and
$y_i$ is the predicted (approximated) value.
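As a small illustration (not part of the original slides; the array names and toy numbers are placeholders), Equation (1) can be computed directly in Python:

```python
import numpy as np

def mse(t, y):
    """Mean squared error between targets t and predictions y (Equation (1))."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((t - y) ** 2)

# Toy numbers for illustration only.
print(mse([1.0, 0.5, -0.2], [0.9, 0.7, -0.1]))  # 0.02
```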

Classification and regression errors

To evaluate the model at arbitrarily many test points, the expectation of the MSE (the risk) is calculated as follows:

$$Risk = E(MSE) = E\left\{\frac{1}{N}\sum_{i=1}^{N}(t_i - y_i)^2\right\} = \frac{1}{N}\sum_{i=1}^{N} E\left\{(t_i - y_i)^2\right\} \qquad (2)$$

Classification and regression errors

$E\{(t_i - y_i)^2\}$ can be simplified as follows:

$$E\{(t_i - y_i)^2\} = E\{(t_i - f_i + f_i - y_i)^2\}$$
$$= E\{(t_i - f_i)^2\} + E\{(f_i - y_i)^2\} + 2E\{(t_i - f_i)(f_i - y_i)\}$$
$$= E\{\epsilon^2\} + E\{(f_i - y_i)^2\} + 2E\{t_i f_i - f_i^2 - y_i t_i + y_i f_i\} \qquad (3)$$

where,
$E\{f_i^2\} = f_i^2$ because $f$ is deterministic,
$E\{f_i t_i\} = f_i^2$ because $f$ is deterministic and $E\{t_i\} = E\{f_i + \epsilon\} = E\{f_i\} = f_i$,
$E\{y_i t_i\} = E\{y_i (f_i + \epsilon)\} = E\{y_i f_i\} + E\{y_i \epsilon\} = E\{y_i f_i\}$, since the noise $\epsilon$ is zero-mean and independent of the prediction $y_i$.
the last term in Equation (3) is as follows:

$$2E\{t_i f_i - f_i^2 - y_i t_i + y_i f_i\} = 2\left(f_i^2 - f_i^2 - E\{y_i f_i\} + E\{y_i f_i\}\right) = 0 \qquad (4)$$
Classification and regression errors

The final form of $E\{(t_i - y_i)^2\}$ is as follows:

$$E\{(t_i - y_i)^2\} = E\{\epsilon^2\} + E\{(f_i - y_i)^2\} \qquad (5)$$

where,
$E\{\epsilon^2\}$ represents the noise variance (conditional variance),
$E\{(f_i - y_i)^2\}$ is the conditional mean squared error between the true function and the predicted function.
The first term cannot be minimized; it is called the irreducible error.
The second term can be simplified further (next slide).

Classification and regression errors

$$E\{(f_i - y_i)^2\} = E\{(f_i - E\{y_i\} + E\{y_i\} - y_i)^2\}$$
$$= E\{(f_i - E\{y_i\})^2\} + E\{(E\{y_i\} - y_i)^2\} + 2E\{(f_i - E\{y_i\})(E\{y_i\} - y_i)\}$$
$$= bias^2 + Variance\{y_i\} + 2\left(E\{f_i E\{y_i\}\} - E\{E\{y_i\}^2\} - E\{y_i f_i\} + E\{y_i E\{y_i\}\}\right) \qquad (6)$$

where,
$E\{f_i E\{y_i\}\} = f_i E\{y_i\}$ because $f$ is deterministic, as mentioned before,
$E\{E\{y_i\}^2\} = E\{y_i\}^2$ since $E\{E\{z\}\} = E\{z\}$,
$E\{y_i f_i\} = f_i E\{y_i\}$,
$E\{y_i E\{y_i\}\} = E\{y_i\}^2$.
Classification and regression errors

The last term in Equation (6) is

$$2\left(E\{f_i E\{y_i\}\} - E\{E\{y_i\}^2\} - E\{y_i f_i\} + E\{y_i E\{y_i\}\}\right) = 2\left(f_i E\{y_i\} - E\{y_i\}^2 - f_i E\{y_i\} + E\{y_i\}^2\right) = 0 \qquad (7)$$

therefore, $E\{(f_i - y_i)^2\} = bias^2 + Variance\{y_i\}$, and

$$E\{(t_i - y_i)^2\} = E\{\epsilon^2\} + E\{(f_i - y_i)^2\}$$
$$= E\{\epsilon^2\} + E\{(f_i - E\{y_i\})^2\} + E\{(E\{y_i\} - y_i)^2\}$$
$$= Variance\{noise\} + bias^2 + Variance\{y_i\} \qquad (8)$$
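To make Equation (8) concrete, the following sketch (not part of the original slides; the true function, noise level, and model degree are assumed for illustration) estimates each term by repeatedly fitting a simple model and compares their sum with the directly measured test MSE:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)     # assumed true function
sigma = 0.1                         # assumed noise standard deviation
n_datasets, n_train, degree = 200, 20, 1

x_test = np.linspace(-1, 1, 50)     # fixed test inputs
preds = np.empty((n_datasets, x_test.size))
for d in range(n_datasets):
    x = rng.uniform(-1, 1, n_train)
    t = f(x) + rng.normal(0, sigma, n_train)   # t = f + eps
    preds[d] = np.polyval(np.polyfit(x, t, degree), x_test)

bias2 = np.mean((f(x_test) - preds.mean(axis=0)) ** 2)   # E{(f - E{y})^2}
variance = np.mean(preds.var(axis=0))                    # E{(E{y} - y)^2}
noise = sigma ** 2                                       # E{eps^2}

# Directly measured expected test MSE, for comparison with the decomposition.
t_test = f(x_test) + rng.normal(0, sigma, preds.shape)
measured = np.mean((t_test - preds) ** 2)
print(f"noise={noise:.4f} bias^2={bias2:.4f} variance={variance:.4f} "
      f"sum={noise + bias2 + variance:.4f} measured={measured:.4f}")
```

The two printed totals should agree up to simulation noise, which is the content of Equation (8).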

Classification and regression errors

The first term in Equation (8), the variance of the noise ($Variance\{noise\}$), cannot be minimized; it is independent of the classification or regression model and is called the irreducible error.
Even an infinite amount of training data will not reduce this type of error.

Classification and regression errors

The second term in Equation (8), the bias of the learning model, represents the error caused by the simplifying assumptions built into the model.
E.g., assume the learning model tries to approximate a nonlinear function using a constant or linear model; hence, there will be an error in the approximated function due to this assumption.
On the contrary, assume the learning model is complex; the training data will then be interpolated almost perfectly, so a lower bias is obtained but the variance will be high.
The bias is therefore the error caused by erroneous assumptions in the learning model.
Hence, a high bias indicates that the learning model cannot find the relevant relations between the given data and the target outputs; this is called underfitting. More complex models are able to represent the training data more accurately, with low bias.

Classification and regression errors

Assume that the output of the learning model is constant, i.e. $y_i = c$, which is a very simple model. Hence,

$$E\{(f_i - c)^2\} = E\{(f_i - E\{c\} + E\{c\} - c)^2\}$$
$$= E\{(f_i - E\{c\})^2\} + E\{(E\{c\} - c)^2\} + 2E\{(f_i - E\{c\})(E\{c\} - c)\}$$
$$= bias^2 + Variance\{c\} + 2\left(E\{f_i E\{c\}\} - E\{E\{c\}^2\} - E\{c f_i\} + E\{c E\{c\}\}\right) \qquad (9)$$

where $E\{c\} = c$ and $Variance\{c\} = 0$ since $c$ is constant. Hence, the variance error is zero but the bias error is high.
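A quick numerical check (an illustrative sketch, not from the slides; it borrows the $\sin(\pi x)$ example used later and a fixed constant $c = 0$):

```python
import numpy as np

f = lambda x: np.sin(np.pi * x)   # assumed true function
x_test = np.linspace(-1, 1, 200)

c = 0.0                           # constant model y_i = c, independent of the training data
bias2 = np.mean((f(x_test) - c) ** 2)
variance = 0.0                    # the prediction never changes, so Variance{y} = 0
print(f"bias^2 ~ {bias2:.3f}, variance = {variance}")   # bias^2 ~ 0.5, variance = 0
```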

Classification and regression errors

The third term in Equation (8), the variance of the learning model ($Variance\{y_i\}$), indicates how much the learning model $y_i$ moves around its mean.
This type of error is caused by sensitivity to small fluctuations in the training data, which is also due to the estimation from finite samples.
A high variance indicates that the learning model interpolates the training data perfectly, including the noise or outliers in the data; this is called overfitting.

Classification and regression errors

Assume that one of the learning models perfectly interpolates all training data. This means that $E\{y_i\} = f_i$; hence, the bias error, $E\{(f_i - E\{y_i\})^2\}$, is zero.
The variance is $E\{(E\{y_i\} - y_i)^2\} = E\{(f_i - y_i)^2\}$ and $y_i = t_i = f_i + \epsilon$; thus, $E\{(f_i - y_i)^2\} = E\{(f_i - t_i)^2\} = E\{(f_i - (f_i + \epsilon))^2\} = E\{\epsilon^2\}$, which is the variance of the noise in the training data, while the bias error is zero (a small numerical sketch follows below).
When the classification or regression model can perfectly interpolate the training data, i.e., $y = t$ and $E\{y\} = f$ for all data samples, the bias term can be neglected, but the variance term increases and in this case equals the variance of the noise.
Finally, all three terms in Equation (8) are non-negative, and this forms a lower bound on the expected error on unseen data. Moreover, the last two terms in Equation (8), i.e., the bias and the variance, can be minimized.
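The following sketch (an illustration, not from the slides; it assumes the $\sin(\pi x)$ example and $\sigma = 0.1$) shows that a model which reproduces every training target exactly has zero training error, while its error with respect to the true function $f$ is roughly the noise variance:

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(np.pi * x)
x = rng.uniform(-1, 1, 20)
t = f(x) + rng.normal(0, 0.1, 20)   # noisy targets

y = t.copy()                        # a model that interpolates every training point: y_i = t_i
train_mse = np.mean((t - y) ** 2)   # exactly 0: the fit looks perfect on the training data
true_mse = np.mean((f(x) - y) ** 2) # ~ sigma^2 = 0.01: the error w.r.t. f is just the noise
print(train_mse, true_mse)
```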

Illustrative example

Given the original function $f(x) = \sin(\pi x)$, where $t = f + \epsilon$ and the noise level is $\sigma = 0.1$.
The target function $t$ is represented by 20 points which are generated from $f(x)$; a small data-generation sketch follows the figure.

Figure: Illustration of the original function and the target function which are generated from it.
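A minimal sketch (not the author's code; the variable names and the noise standard deviation of 0.1 are assumptions) to generate and plot such a dataset:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)

x = np.sort(rng.uniform(-1, 1, 20))    # 20 sample locations in [-1, 1]
t = f(x) + rng.normal(0, 0.1, 20)      # noisy targets t = f + eps

xs = np.linspace(-1, 1, 200)
plt.plot(xs, f(xs), label="f")
plt.scatter(x, t, label="t")
plt.xlabel("x"); plt.ylabel("f(x)")
plt.legend(); plt.show()
```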

Illustrative example

Assume we have three different polynomial learning models $y_1(x)$, $y_5(x)$, and $y_9(x)$, where $y_1(x)$ is the first model with degree one, $y_5(x)$ is the second model with degree five, and $y_9(x)$ is the third model with degree nine (a fitting sketch follows the figure).

Figure: Illustration of the fitting models ($y_1$, $y_5$, and $y_9$) for the target data ($t$).
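A minimal fitting sketch (an illustration under assumed settings, not the author's original code), using least-squares polynomial fits of the three degrees:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)
x = np.sort(rng.uniform(-1, 1, 20))
t = f(x) + rng.normal(0, 0.1, 20)

xs = np.linspace(-1, 1, 200)
plt.plot(xs, f(xs), "k", label="f(x)")
plt.scatter(x, t, label="t(x)")
for degree in (1, 5, 9):
    coeffs = np.polyfit(x, t, degree)          # least-squares polynomial fit
    plt.plot(xs, np.polyval(coeffs, xs), label=f"y{degree}(x)")
plt.legend(); plt.show()
```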

Illustrative example

The figure below shows the model $y_1$ fitted to 10 training datasets, each consisting of 20 samples.
$y_1$ achieved a high bias error: the MSE between $f(x)$ and the mean of all fits is high, and most of the error is bias error.
Figure: Illustration of the first model ($y_1$) with ten target datasets, each of size 20, showing the individual fits, the mean of all fits, $f(x)$, and the squared error.
Illustrative example

The figure below shows the second model $y_5$ fitted to 10 training datasets, each consisting of 20 samples.
$y_5$ achieved better results than $y_1$.

Figure: Illustration of the second model ($y_5$) with ten target datasets, each of size 20, showing the individual fits, the mean of all fits, $f(x)$, and the squared error.

Illustrative example

The figure below shows the model $y_9$ fitted to 10 training datasets, each consisting of 20 samples.
$y_9$ achieved the minimum bias (lower than $y_1$ and $y_5$); the individual fits of the third model ($y_9$) were sharp and fit all the training samples almost perfectly.

Figure: Illustration of the third model ($y_9$) with ten target datasets ($t$), each of size 20, showing the individual fits, the mean of all fits, $f(x)$, and the squared error.
Illustrative example

Some models achieve high bias and low variance, such as $y_1$ in our example; a decrease in bias is followed by an increase in variance.
However, there is some intermediate model complexity that can balance the bias and variance and achieve the minimum expected test error.

Illustrative example

In the figure below, the model complexity is represented by the degree (polynomial order) of the model.
The best model, i.e., the one achieving the minimum testing error, is obtained at an intermediate model complexity where the trade-off between the bias and variance is achieved; a simulation sketch follows the figure.
Figure: Plot of the variance and squared bias together with their sum, against the model complexity (polynomial order). Also shown are the average test error and the model complexity which achieved the minimum testing error.
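The curves in this figure can be reproduced approximately with the following sketch (an illustration under assumed settings: the $\sin(\pi x)$ target, $\sigma = 0.1$, 20 samples per dataset, and many repeated datasets; it is not the author's original code):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)
sigma, n_train, n_datasets = 0.1, 20, 200
x_test = np.linspace(-1, 1, 50)

for degree in range(1, 11):
    preds = np.empty((n_datasets, x_test.size))
    for d in range(n_datasets):
        x = rng.uniform(-1, 1, n_train)
        t = f(x) + rng.normal(0, sigma, n_train)
        preds[d] = np.polyval(np.polyfit(x, t, degree), x_test)
    bias2 = np.mean((f(x_test) - preds.mean(axis=0)) ** 2)
    variance = np.mean(preds.var(axis=0))
    test_error = bias2 + variance + sigma ** 2    # expected test MSE via Equation (8)
    print(f"degree {degree:2d}: bias^2={bias2:.4f}  variance={variance:.4f}  "
          f"bias^2+var={bias2 + variance:.4f}  test={test_error:.4f}")
```

The degree with the smallest printed test value plays the role of the "Best Model" marked in the figure.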
Illustrative example

The figure below compares the training and testing errors as the model complexity is changed; a small sketch follows the figure.
The training error is not a good estimate of the test error: the training error decreases as the model complexity grows and approaches zero when a very complex model is used, i.e., overfitting.
Figure: Illustration of the training and testing errors with respect to the model complexity (polynomial order).
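A rough way to reproduce this comparison (an illustrative sketch with assumed settings, not the author's code): fit polynomials of increasing degree on one training set and evaluate on a large held-out test set.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)
sigma = 0.1

x_tr = rng.uniform(-1, 1, 20);   t_tr = f(x_tr) + rng.normal(0, sigma, 20)
x_te = rng.uniform(-1, 1, 1000); t_te = f(x_te) + rng.normal(0, sigma, 1000)

for degree in range(1, 11):
    coeffs = np.polyfit(x_tr, t_tr, degree)
    train_err = np.mean((t_tr - np.polyval(coeffs, x_tr)) ** 2)
    test_err = np.mean((t_te - np.polyval(coeffs, x_te)) ** 2)
    print(f"degree {degree:2d}: train={train_err:.4f}  test={test_err:.4f}")
```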
