
Supervised Learning

Evaluation

Evaluation Issues
What measures should we use?
Classification accuracy might not be enough

How reliable are the predicted values?

Are errors on the training data a good indicator of performance on future
data?
If the classifier is computed from the very same training data, any estimate
based on that data will be optimistic
New data is probably not exactly the same as the training data

Road Map

1. Evaluation Metrics
2. Evaluation Methods
3. How to Improve Classifier Accuracy

Confusion Matrix
The confusion matrix is a table of size at least m x m. An entry CM(i,j)
indicates the number of tuples of class i that were labeled as class j

Real class \ Predicted class | Class 1 | Class 2 | ... | Class m
Class 1                      | CM(1,1) | CM(1,2) | ... | CM(1,m)
Class 2                      | CM(2,1) | CM(2,2) | ... | CM(2,m)
...                          | ...     | ...     | ... | ...
Class m                      | CM(m,1) | CM(m,2) | ... | CM(m,m)

Ideally, most of the tuples would be represented along the diagonal of the
confusion matrix
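For concreteness, here is a minimal Python sketch (not part of the original slides) that builds such a confusion matrix from hypothetical true and predicted labels:

```python
from collections import defaultdict

def confusion_matrix(y_true, y_pred):
    """CM[real][pred] = number of tuples of real class 'real' labeled as 'pred'."""
    cm = defaultdict(lambda: defaultdict(int))
    for real, pred in zip(y_true, y_pred):
        cm[real][pred] += 1
    return cm

# Hypothetical labels for a 3-class problem (illustrative only).
y_true = ["c1", "c1", "c2", "c2", "c3", "c3", "c3"]
y_pred = ["c1", "c2", "c2", "c2", "c3", "c1", "c3"]

cm = confusion_matrix(y_true, y_pred)
for real in sorted(cm):
    print(real, dict(cm[real]))
# Most counts lie on the diagonal (real class == predicted class).
```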

Confusion Matrix
For a targeted class:

Target class \ Predicted class | Yes             | No
Yes                            | True positives  | False negatives
No                             | False positives | True negatives

True positives: positive tuples correctly labeled as positive

True negatives: negative tuples correctly labeled as negative

False positives: negative tuples incorrectly labeled as positive

False negatives: positive tuples incorrectly labeled as negative

Accuracy
Target class \ Predicted class | Yes                  | No
Yes                            | True positives (TP)  | False negatives (FN)
No                             | False positives (FP) | True negatives (TN)

Most widely used metric:

Accuracy = (# correctly classified tuples) / (total # tuples)

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy
Accuracy is better measured using test data that was not used to
build the classifier

Referred to as the overall recognition rate of the classifier

Error rate or misclassification rate: 1 - Accuracy

When training data are used to compute the accuracy, the error rate is called
the resubstitution error
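A minimal sketch of the accuracy and error-rate formulas above, using made-up TP/TN/FP/FN counts:

```python
# Hypothetical counts from a 2-class confusion matrix (illustrative only).
TP, FN, FP, TN = 150, 60, 40, 250

accuracy = (TP + TN) / (TP + TN + FP + FN)   # correctly classified / total
error_rate = 1 - accuracy                    # misclassification rate

print(f"accuracy = {accuracy:.1%}, error rate = {error_rate:.1%}")
# accuracy = 80.0%, error rate = 20.0%
```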

Sometimes Accuracy is not Enough


Consider a 2-class problem with 30 tuples: 28 of class a and 2 of class b

If a model always predicts class a, its accuracy is 28/30 = 93%

Accuracy is misleading because the model does not detect any tuple of class b

Cost Matrix

Useful when specific classification errors are more severe than others

Target class \ Predicted class | Yes        | No
Yes                            | C(Yes|Yes) | C(No|Yes)
No                             | C(Yes|No)  | C(No|No)

C(i|j): cost of misclassifying a class j tuple as class i

Computing the Cost of Classification

Cost matrix (customer satisfied):

Target class \ Predicted class | Yes | No
Yes                            | 0   | 1
No                             | 120 | 0

Confusion matrix, Model M1:

Target class \ Predicted class | Yes | No
Yes                            | 150 | 60
No                             | 40  | 250

Accuracy = 80%    Cost = 150*0 + 60*1 + 40*120 + 250*0 = 4860

Confusion matrix, Model M2:

Target class \ Predicted class | Yes | No
Yes                            | 250 | 5
No                             | 45  | 200

Accuracy = 90%    Cost = 250*0 + 5*1 + 45*120 + 200*0 = 5405

M2 has the higher accuracy, but M1 has the lower total cost
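The cost computation can be sketched in a few lines of Python; the matrices below use the values reconstructed on this slide:

```python
# Cost matrix: cost[(actual, predicted)], i.e. C(predicted | actual).
cost = {("Yes", "Yes"): 0, ("Yes", "No"): 1,
        ("No", "Yes"): 120, ("No", "No"): 0}

def total_cost(confusion, cost):
    """Sum count * cost over all (actual, predicted) cells."""
    return sum(confusion[cell] * cost[cell] for cell in confusion)

def accuracy(confusion):
    total = sum(confusion.values())
    correct = sum(n for (a, p), n in confusion.items() if a == p)
    return correct / total

m1 = {("Yes", "Yes"): 150, ("Yes", "No"): 60, ("No", "Yes"): 40, ("No", "No"): 250}
m2 = {("Yes", "Yes"): 250, ("Yes", "No"): 5, ("No", "Yes"): 45, ("No", "No"): 200}

for name, cm in [("M1", m1), ("M2", m2)]:
    print(name, f"accuracy={accuracy(cm):.0%}", f"cost={total_cost(cm, cost)}")
# M1 accuracy=80% cost=4860
# M2 accuracy=90% cost=5405
```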

Cost-Sensitive Measures

Target class \ Predicted class | Yes                  | No
Yes                            | True positives (TP)  | False negatives (FN)
No                             | False positives (FP) | True negatives (TN)

Precision(p) = TP / (TP + FP)
Biased towards C(Yes|Yes) & C(Yes|No)
The higher the precision, the lower the FPs

Recall(r) = TP / (TP + FN)
Biased towards C(Yes|Yes) & C(No|Yes)
The higher the recall, the lower the FNs

F1-measure = 2rp / (r + p)
Biased towards all except C(No|No)
It is high when both p and r are high
The higher the F1, the lower the FPs & FNs

Weighted accuracy = (w_TP*TP + w_TN*TN) / (w_TP*TP + w_FN*FN + w_FP*FP + w_TN*TN)
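A minimal sketch of these cost-sensitive measures, with made-up counts and weights (all weights default to 1 here, which reduces weighted accuracy to plain accuracy):

```python
def precision(TP, FP):
    return TP / (TP + FP)

def recall(TP, FN):
    return TP / (TP + FN)

def f1(p, r):
    return 2 * r * p / (r + p)

def weighted_accuracy(TP, TN, FP, FN, w_TP=1.0, w_TN=1.0, w_FP=1.0, w_FN=1.0):
    return (w_TP * TP + w_TN * TN) / (w_TP * TP + w_FN * FN + w_FP * FP + w_TN * TN)

# Hypothetical counts (illustrative only).
TP, FN, FP, TN = 150, 60, 40, 250
p, r = precision(TP, FP), recall(TP, FN)
print(f"precision={p:.3f} recall={r:.3f} F1={f1(p, r):.3f}")
print(f"weighted accuracy (all weights 1) = {weighted_accuracy(TP, TN, FP, FN):.3f}")
```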

Other Specific Measures

Other measures can be used when the accuracy measure is not acceptable

Sensitivity(sn) = TP / # Positives

Specificity(sp) = TN / # Negatives

Accuracy = sn * (# Positives / (# Positives + # Negatives))
         + sp * (# Negatives / (# Positives + # Negatives))
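A quick numeric check of the identity above, with made-up counts: accuracy equals sensitivity and specificity weighted by the class proportions:

```python
# Hypothetical counts (illustrative only).
TP, FN, FP, TN = 150, 60, 40, 250
pos, neg = TP + FN, TN + FP            # # Positives, # Negatives

sn = TP / pos                          # sensitivity (true positive rate)
sp = TN / neg                          # specificity (true negative rate)

acc_direct = (TP + TN) / (pos + neg)
acc_weighted = sn * pos / (pos + neg) + sp * neg / (pos + neg)

print(sn, sp)
print(acc_direct, acc_weighted)        # both 0.8: the two formulas agree
```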

Predictor Error Measures

The predictor returns continuous values
Measure how far the predicted value is from the known value

Compute loss functions:

Absolute error = |y_i - y'_i|

Squared error = (y_i - y'_i)^2

y_i: the true value
y'_i: the predicted value

The test error or generalization error is the average loss over the test dataset:

Mean absolute error = (1/N) * sum_{i=1..N} |y_i - y'_i|

Mean squared error = (1/N) * sum_{i=1..N} (y_i - y'_i)^2

N is the size of the test dataset

Predictor Error Measures

The total loss can be normalized: divide by the total loss incurred from
always predicting the mean

Relative absolute error = sum_{i=1..N} |y_i - y'_i| / sum_{i=1..N} |y_i - y_mean|

Relative squared error = sum_{i=1..N} (y_i - y'_i)^2 / sum_{i=1..N} (y_i - y_mean)^2

y_mean is the mean of the true values; N is the size of the test dataset
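A minimal sketch of the four predictor error measures, evaluated on made-up true and predicted values:

```python
def mean_absolute_error(y, y_pred):
    return sum(abs(a - b) for a, b in zip(y, y_pred)) / len(y)

def mean_squared_error(y, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y, y_pred)) / len(y)

def relative_absolute_error(y, y_pred):
    y_mean = sum(y) / len(y)
    return (sum(abs(a - b) for a, b in zip(y, y_pred))
            / sum(abs(a - y_mean) for a in y))

def relative_squared_error(y, y_pred):
    y_mean = sum(y) / len(y)
    return (sum((a - b) ** 2 for a, b in zip(y, y_pred))
            / sum((a - y_mean) ** 2 for a in y))

# Hypothetical true and predicted values (illustrative only).
y      = [3.0, 5.0, 2.5, 7.0, 4.5]
y_pred = [2.5, 5.5, 2.0, 8.0, 4.0]

print(mean_absolute_error(y, y_pred), mean_squared_error(y, y_pred))
print(relative_absolute_error(y, y_pred), relative_squared_error(y, y_pred))
```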

Road Map

1. Evaluation Metrics
2. Evaluation Methods
3. How to Improve Classifier Accuracy

Step 1: Prepare Training and Test Sets

The available data is partitioned into a training set and a test set

Step 2: Build the Model

The model is built from the training set only

Step 3: Evaluate the Model

The model is applied to the test set, and its predictions are compared with
the known labels to evaluate it

Note on Parameter Tuning

Some learning schemes operate in two stages:
Stage 1: Build the basic structure of the model
Stage 2: Optimize parameter settings

It is important not to use the test data to build the model

The test data should not be used for parameter tuning either

Instead, the data is split into a training set (to build the model), a
validation set (for parameter tuning), and a test set (for the final
evaluation of the model's predictions)

Methods for Performance Evaluation

How to obtain a reliable estimate of performance?

Performance of a model may depend on other factors besides the learning
algorithm:
Class distribution
Cost of misclassification
Size of training and test sets

Holdout

Typically, two-thirds of the data are allocated to the training set and
one-third to the test set

The estimate is pessimistic because only a portion of the initial data is
used to derive the model

For small or unbalanced datasets, samples might not be representative

Stratified sampling: make sure that each class is represented with
approximately equal proportions in both subsets
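A minimal sketch of a stratified holdout split in plain Python (the 2/3 : 1/3 split and the tiny unbalanced dataset are illustrative assumptions):

```python
import random
from collections import defaultdict

def stratified_holdout(labels, train_fraction=2/3, seed=0):
    """Split tuple indices so each class keeps ~train_fraction of its tuples for training."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = int(round(train_fraction * len(idxs)))
        train_idx.extend(idxs[:cut])
        test_idx.extend(idxs[cut:])
    return train_idx, test_idx

# Hypothetical unbalanced dataset: 28 tuples of class a, 2 of class b (illustrative only).
labels = ["a"] * 28 + ["b"] * 2
train_idx, test_idx = stratified_holdout(labels)
print(len(train_idx), len(test_idx))  # roughly 2/3 vs 1/3, both classes in both subsets
```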

Random Subsampling

The holdout estimate can be made more reliable by repeating the process with
different subsamples (k times)

In each iteration, a certain proportion is randomly selected for training
(possibly with stratification)

The error rates on the different iterations are averaged to yield an overall
error rate

Still not optimal, since the different test sets overlap

Cross Validation

Avoids overlapping test sets

Data is split into k mutually exclusive subsets, or folds, of equal size:
D1, D2, ..., Dk

Each subset in turn is used for testing and the remainder for training:
First iteration: use D2, ..., Dk for training and D1 for testing
Second iteration: use D1, D3, ..., Dk for training and D2 for testing

This is called k-fold cross validation

Often the subsets are stratified before cross-validation is performed

The error estimates are averaged to yield an overall error estimate
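A minimal sketch of k-fold cross-validation in plain Python; the majority-class learner in the usage example is only a placeholder for a real classifier:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k mutually exclusive folds of (nearly) equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validation_error(n, k, train_and_test):
    """train_and_test(train_idx, test_idx) must return the error rate on the test fold."""
    folds = k_fold_indices(n, k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f in range(k) if f != i for j in folds[f]]
        errors.append(train_and_test(train_idx, test_idx))
    return sum(errors) / k   # overall estimate = average over the k folds

# Usage sketch with a trivial majority-class learner (a placeholder, not a real model).
labels = ["a"] * 20 + ["b"] * 10

def majority_learner(train_idx, test_idx):
    classes = set(labels[j] for j in train_idx)
    majority = max(classes, key=lambda c: sum(labels[j] == c for j in train_idx))
    return sum(labels[j] != majority for j in test_idx) / len(test_idx)

print(cross_validation_error(len(labels), k=10, train_and_test=majority_learner))
```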

Cross Validation

Standard method for evaluation: stratified 10-fold cross-validation

Why 10? Extensive experiments have shown that this is the best choice to get
an accurate estimate

Stratification reduces the estimate's variance

Even better: repeated cross-validation, e.g. ten-fold cross-validation
repeated ten times with the results averaged (further reduces the variance)

Leave-One-Out Cross Validation

A special case of k-fold cross-validation
k is set to the initial number of tuples
Only one sample is left out at a time for the test set

Makes best use of the data
Involves no random subsampling
Computationally expensive

Disadvantage: stratification is not possible
Extreme example: a random dataset split equally into two classes
A model that predicts the majority class achieves 50% accuracy on the whole data
The Leave-One-Out CV error estimate is 100%: removing the test tuple makes its
class the minority in the training data, so the majority-class model always
predicts the other class

The 0.632 Bootstrap

Sample the n training tuples uniformly with replacement (n draws): each time a
tuple is selected, it is equally likely to be selected again and re-added to
the training set

In a single draw, an instance has a probability of 1 - 1/n of not being picked

Thus its probability of ending up in the test data (never being picked) is:

(1 - 1/n)^n ≈ e^(-1) = 0.368

This means the training data will contain approximately 63.2% of the distinct
instances
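A quick numeric check of this probability for a few values of n:

```python
import math

# Probability that a given tuple is never picked in n draws with replacement.
for n in (10, 100, 1000, 10000):
    print(n, (1 - 1 / n) ** n)

print("e^-1 =", math.exp(-1))  # ~0.368, so ~63.2% of the distinct tuples land in training
```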

Estimating Error using Bootstrap

The error estimate on the test data will be very pessimistic, since training
was done on just ~63% of the instances

Therefore, combine it with the resubstitution error:

Err(M) = (1/k) * sum_{i=1..k} ( 0.632 * Err(Mi)_test_set + 0.368 * Err(Mi)_train_set )

The resubstitution error gets less weight than the error on the test data

The process is repeated k times with different replacement samples, and the
results are averaged (the 1/k factor above)
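A minimal sketch of the 0.632 bootstrap estimate, assuming a caller-supplied train_and_eval function that trains on the first index list and returns an error rate on the second (the function name and interface are just an illustration):

```python
import random

def bootstrap_632_error(n, k, train_and_eval, seed=0):
    """train_and_eval(train_idx, eval_idx) -> error rate of a model built on train_idx."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(k):
        train_idx = [rng.randrange(n) for _ in range(n)]        # n draws with replacement
        picked = set(train_idx)
        test_idx = [i for i in range(n) if i not in picked]     # the ~36.8% never picked
        err_test = train_and_eval(train_idx, test_idx)
        err_train = train_and_eval(train_idx, train_idx)        # resubstitution error
        estimates.append(0.632 * err_test + 0.368 * err_train)
    return sum(estimates) / k                                   # average over k resamples
```

Any learner with this train_and_eval interface can be plugged in, for example the majority-class placeholder from the cross-validation sketch above.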

More on Bootstrap

Probably the best way of estimating performance for very small datasets

However, it has some problems
Consider a random dataset with a 50% class distribution
A model that memorizes all the training data will achieve 0% resubstitution
error on the training data and ~50% error on the test data
Bootstrap estimate for this classifier: Err = 0.632 * 0.5 + 0.368 * 0 = 31.6%
True expected error: 50%

Road Map

1. Evaluation Metrics
2. Evaluation Methods
3. How to Improve Classifier Accuracy

Increasing the Accuracy

We have seen that pruning improves the accuracy of decision trees by reducing
the overfitting effect

There are some general strategies for improving the accuracy of classifiers
and predictors

Bagging and Boosting are two of these strategies

Ensemble methods: use a combination of models
Combine a series of learned classifiers M1, M2, ..., Mk
Find an improved composite model M*

Bagging

Intuition: instead of asking a single doctor for a diagnosis (how accurate is
that one diagnosis?), the patient asks several doctors and collects
diagnosis_1, diagnosis_2, diagnosis_3, ...

Choose the diagnosis that occurs more often than any of the others

Bagging

K iterations
At each iteration a training set Di is sampled from the data with replacement
and a model Mi is learned from it
The combined model M* returns the most frequent class among M1, ..., Mk in
case of classification, and the average value in case of prediction
To classify a new data sample, each Mi makes a prediction and M* combines
them, as sketched below
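A minimal sketch of bagging by majority vote; the learn-function interface and the trivial majority-class learner in the usage example are illustrative assumptions, not part of the slides:

```python
import random
from collections import Counter

def bagging_predict(data, labels, k, learn, x, seed=0):
    """Learn k models on bootstrap samples and combine their votes on a new sample x.

    learn(train_data, train_labels) must return a model: a function mapping x to a class.
    """
    rng = random.Random(seed)
    n = len(data)
    votes = []
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]              # sample D_i with replacement
        model = learn([data[i] for i in idx], [labels[i] for i in idx])
        votes.append(model(x))
    return Counter(votes).most_common(1)[0][0]                  # most frequent class wins

# Usage sketch with a trivial learner that always predicts its sample's majority class.
def majority_learner(train_data, train_labels):
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda sample: majority

labels = ["a"] * 20 + ["b"] * 10
data = list(range(len(labels)))
print(bagging_predict(data, labels, k=25, learn=majority_learner, x=None))  # prints "a"
```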

Boosting

Intuition: the patient again collects several diagnoses (diagnosis_1,
diagnosis_2, diagnosis_3), but now they are combined with different weights
(e.g. 0.4, 0.5, 0.1)

Assign different weights to the doctors based on the accuracy of their
previous diagnoses

Boosting
Weights are assigned to each training tuple
A series of k classifiers is iteratively learned
After a classifier Mi is learned, the weights are adjusted to allow the
subsequent classifier to pay more attention to training tuples
misclassified by Mi
The final boosted classifier M* combines the votes of each individual
classifier where the weight of each classifier is a function of its
accuracy
This strategy can be extended for the prediction of continuous
values

Example: The AdaBoost Algorithm

Given a set of d class-labeled tuples (X1, y1), ..., (Xd, yd)

Initially, all tuple weights are the same: 1/d

Generate k classifiers in k rounds

At round i, tuples from D are sampled (with replacement) to form a training
set Di of the same size

Each tuple's chance of being selected depends on its weight

A classification model Mi is derived and tested using Di

If a tuple is misclassified, its weight increases, otherwise it decreases
(the weight of a correctly classified tuple is multiplied by
error(Mi)/(1 - error(Mi)), then all weights are normalized)

Example: The AdaBoost Algorithm

err(Xj) is the misclassification error of tuple Xj:
Tuple correctly classified: err(Xj) = 0
Tuple incorrectly classified: err(Xj) = 1

Classifier Mi's error rate is the weighted sum of the errors of the
misclassified tuples:

error(Mi) = sum_{j=1..d} w_j * err(Xj)

The weight of classifier Mi's vote is:

log( (1 - error(Mi)) / error(Mi) )
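A minimal sketch of one boosting round's weight update, following the formulas on this slide; the 5-tuple example is made up, and the exact re-weighting of correctly classified tuples by error(Mi)/(1 - error(Mi)) followed by normalization reflects the usual AdaBoost description and is an assumption here:

```python
import math

def adaboost_round_update(weights, misclassified):
    """One AdaBoost weight update (assumes 0 < error(Mi) < 0.5).

    weights:       current tuple weights w_j (summing to 1)
    misclassified: booleans, err(X_j) = 1 if tuple j was misclassified by M_i
    Returns (new_weights, classifier_vote_weight).
    """
    error = sum(w for w, bad in zip(weights, misclassified) if bad)   # error(M_i)
    vote = math.log((1 - error) / error)                              # weight of M_i's vote
    factor = error / (1 - error)                                      # < 1 when error < 0.5
    new_w = [w * (1.0 if bad else factor) for w, bad in zip(weights, misclassified)]
    total = sum(new_w)
    return [w / total for w in new_w], vote                           # normalize weights

# Hypothetical round: 5 tuples with uniform weights, tuples 1 and 3 misclassified.
weights = [0.2] * 5
new_weights, vote = adaboost_round_update(weights, [False, True, False, True, False])
print(new_weights, vote)  # misclassified tuples end up with larger relative weights
```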

Summary

Accuracy is used to assess classifiers

Error measures are used to assess predictors

Stratified 10-fold cross validation is recommended for estimating accuracy

Bagging and boosting are used to improve the accuracy of classifiers and
predictors
