
Supervised Learning

Evaluation

Evaluation Issues
What measures should we use?
Classification accuracy might not be enough

How reliable are the predicted values?

Are errors on the training data a good indicator of performance on future
data?
If the classifier is computed from the very same training data, any estimate
based on that data will be optimistic
New data is probably not exactly the same as the training data

Road Map

1. Evaluation Metrics
2. Evaluation Methods
3. How to Improve Classifier Accuracy

Confusion Matrix
The confusion matrix is a table of size at least m x m. An entry CM(i,j)
indicates the number of tuples of class i that were labeled as class j

Real class \ Predicted class | Class 1 | Class 2 | ... | Class m
Class 1                      | CM(1,1) | CM(1,2) | ... | CM(1,m)
Class 2                      | CM(2,1) | CM(2,2) | ... | CM(2,m)
...                          | ...     | ...     | ... | ...
Class m                      | CM(m,1) | CM(m,2) | ... | CM(m,m)

Ideally, most of the tuples would be represented along the diagonal of the
confusion matrix
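For concreteness, here is a minimal Python sketch (not part of the original slides) that builds such a confusion matrix from hypothetical true and predicted labels:

```python
from collections import defaultdict

def confusion_matrix(y_true, y_pred):
    """CM[real][pred] = number of tuples of real class 'real' labeled as 'pred'."""
    cm = defaultdict(lambda: defaultdict(int))
    for real, pred in zip(y_true, y_pred):
        cm[real][pred] += 1
    return cm

# Hypothetical labels for a 3-class problem (illustrative only).
y_true = ["c1", "c1", "c2", "c2", "c3", "c3", "c3"]
y_pred = ["c1", "c2", "c2", "c2", "c3", "c1", "c3"]

cm = confusion_matrix(y_true, y_pred)
for real in sorted(cm):
    print(real, dict(cm[real]))
# Most counts lie on the diagonal (real class == predicted class).
```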

Confusion Matrix
For a targeted class:

Target class \ Predicted class | Yes             | No
Yes                            | True positives  | False negatives
No                             | False positives | True negatives

True positives: positive tuples correctly labeled as positive

True negatives: negative tuples correctly labeled as negative

False positives: negative tuples incorrectly labeled as positive

False negatives: positive tuples incorrectly labeled as negative

Accuracy
Target class \ Predicted class | Yes                  | No
Yes                            | True positives (TP)  | False negatives (FN)
No                             | False positives (FP) | True negatives (TN)

Most widely used metric:

Accuracy = (# correctly classified tuples) / (total # tuples)

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy
Accuracy is better measured using test data that was not used to
build the classifier

Referred to as the overall recognition rate of the classifier

Error rate or misclassification rate: 1 - Accuracy

When training data are used to compute the accuracy, the error rate is called
the resubstitution error
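A minimal sketch of the accuracy and error-rate formulas above, using made-up TP/TN/FP/FN counts:

```python
# Hypothetical counts from a 2-class confusion matrix (illustrative only).
TP, FN, FP, TN = 150, 60, 40, 250

accuracy = (TP + TN) / (TP + TN + FP + FN)   # correctly classified / total
error_rate = 1 - accuracy                    # misclassification rate

print(f"accuracy = {accuracy:.1%}, error rate = {error_rate:.1%}")
# accuracy = 80.0%, error rate = 20.0%
```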

Sometimes Accuracy is not Enough


Consider a 2-class problem with 30 tuples: 28 of class a and 2 of class b

If a model always predicts class a, its accuracy is 28/30 = 93%

Accuracy is misleading because the model does not detect any tuple of class b

Cost Matrix

Useful when specific classification errors are more severe than others

Target class \ Predicted class | Yes        | No
Yes                            | C(Yes|Yes) | C(No|Yes)
No                             | C(Yes|No)  | C(No|No)

C(i|j): cost of misclassifying a class j tuple as class i

Computing the Cost of Classification

Cost matrix (customer satisfied):

Target class \ Predicted class | Yes | No
Yes                            | 0   | 1
No                             | 120 | 0

Confusion matrix, Model M1:

Target class \ Predicted class | Yes | No
Yes                            | 150 | 60
No                             | 40  | 250

Accuracy = 80%    Cost = 150*0 + 60*1 + 40*120 + 250*0 = 4860

Confusion matrix, Model M2:

Target class \ Predicted class | Yes | No
Yes                            | 250 | 5
No                             | 45  | 200

Accuracy = 90%    Cost = 250*0 + 5*1 + 45*120 + 200*0 = 5405

M2 has the higher accuracy, but M1 has the lower total cost
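The cost computation can be sketched in a few lines of Python; the matrices below use the values reconstructed on this slide:

```python
# Cost matrix: cost[(actual, predicted)], i.e. C(predicted | actual).
cost = {("Yes", "Yes"): 0, ("Yes", "No"): 1,
        ("No", "Yes"): 120, ("No", "No"): 0}

def total_cost(confusion, cost):
    """Sum count * cost over all (actual, predicted) cells."""
    return sum(confusion[cell] * cost[cell] for cell in confusion)

def accuracy(confusion):
    total = sum(confusion.values())
    correct = sum(n for (a, p), n in confusion.items() if a == p)
    return correct / total

m1 = {("Yes", "Yes"): 150, ("Yes", "No"): 60, ("No", "Yes"): 40, ("No", "No"): 250}
m2 = {("Yes", "Yes"): 250, ("Yes", "No"): 5, ("No", "Yes"): 45, ("No", "No"): 200}

for name, cm in [("M1", m1), ("M2", m2)]:
    print(name, f"accuracy={accuracy(cm):.0%}", f"cost={total_cost(cm, cost)}")
# M1 accuracy=80% cost=4860
# M2 accuracy=90% cost=5405
```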

Cost-Sensitive Measures

Target class \ Predicted class | Yes                  | No
Yes                            | True positives (TP)  | False negatives (FN)
No                             | False positives (FP) | True negatives (TN)

Precision(p) = TP / (TP + FP)
Biased towards C(Yes|Yes) & C(Yes|No)
The higher the precision, the lower the FPs

Recall(r) = TP / (TP + FN)
Biased towards C(Yes|Yes) & C(No|Yes)
The higher the recall, the lower the FNs

F1-measure = 2rp / (r + p)
Biased towards all except C(No|No)
It is high when both p and r are high
The higher the F1, the lower the FPs & FNs

Weighted accuracy = (w_TP*TP + w_TN*TN) / (w_TP*TP + w_FN*FN + w_FP*FP + w_TN*TN)
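A minimal sketch of these cost-sensitive measures, with made-up counts and weights (all weights default to 1 here, which reduces weighted accuracy to plain accuracy):

```python
def precision(TP, FP):
    return TP / (TP + FP)

def recall(TP, FN):
    return TP / (TP + FN)

def f1(p, r):
    return 2 * r * p / (r + p)

def weighted_accuracy(TP, TN, FP, FN, w_TP=1.0, w_TN=1.0, w_FP=1.0, w_FN=1.0):
    return (w_TP * TP + w_TN * TN) / (w_TP * TP + w_FN * FN + w_FP * FP + w_TN * TN)

# Hypothetical counts (illustrative only).
TP, FN, FP, TN = 150, 60, 40, 250
p, r = precision(TP, FP), recall(TP, FN)
print(f"precision={p:.3f} recall={r:.3f} F1={f1(p, r):.3f}")
print(f"weighted accuracy (all weights 1) = {weighted_accuracy(TP, TN, FP, FN):.3f}")
```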

Other Specific Measures

Other measures can be used when the accuracy measure is not acceptable

Sensitivity(sn) = TP / # Positives

Specificity(sp) = TN / # Negatives

Accuracy = sn * (# Positives / (# Positives + # Negatives))
         + sp * (# Negatives / (# Positives + # Negatives))
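A quick numeric check of the identity above, with made-up counts: accuracy equals sensitivity and specificity weighted by the class proportions:

```python
# Hypothetical counts (illustrative only).
TP, FN, FP, TN = 150, 60, 40, 250
pos, neg = TP + FN, TN + FP            # # Positives, # Negatives

sn = TP / pos                          # sensitivity (true positive rate)
sp = TN / neg                          # specificity (true negative rate)

acc_direct = (TP + TN) / (pos + neg)
acc_weighted = sn * pos / (pos + neg) + sp * neg / (pos + neg)

print(sn, sp)
print(acc_direct, acc_weighted)        # both 0.8: the two formulas agree
```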

Predictor Error Measures

The predictor returns continuous values
Measure how far the predicted value is from the known value

Compute loss functions:

Absolute error = |y_i - y'_i|

Squared error = (y_i - y'_i)^2

y_i: the true value
y'_i: the predicted value

The test error or generalization error is the average loss over the test dataset:

Mean absolute error = (1/N) * sum_{i=1..N} |y_i - y'_i|

Mean squared error = (1/N) * sum_{i=1..N} (y_i - y'_i)^2

N is the size of the test dataset

Predictor Error Measures

The total loss can be normalized: divide by the total loss incurred from
always predicting the mean

Relative absolute error = sum_{i=1..N} |y_i - y'_i| / sum_{i=1..N} |y_i - y_mean|

Relative squared error = sum_{i=1..N} (y_i - y'_i)^2 / sum_{i=1..N} (y_i - y_mean)^2

y_mean is the mean of the true values; N is the size of the test dataset
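A minimal sketch of the four predictor error measures, evaluated on made-up true and predicted values:

```python
def mean_absolute_error(y, y_pred):
    return sum(abs(a - b) for a, b in zip(y, y_pred)) / len(y)

def mean_squared_error(y, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y, y_pred)) / len(y)

def relative_absolute_error(y, y_pred):
    y_mean = sum(y) / len(y)
    return (sum(abs(a - b) for a, b in zip(y, y_pred))
            / sum(abs(a - y_mean) for a in y))

def relative_squared_error(y, y_pred):
    y_mean = sum(y) / len(y)
    return (sum((a - b) ** 2 for a, b in zip(y, y_pred))
            / sum((a - y_mean) ** 2 for a in y))

# Hypothetical true and predicted values (illustrative only).
y      = [3.0, 5.0, 2.5, 7.0, 4.5]
y_pred = [2.5, 5.5, 2.0, 8.0, 4.0]

print(mean_absolute_error(y, y_pred), mean_squared_error(y, y_pred))
print(relative_absolute_error(y, y_pred), relative_squared_error(y, y_pred))
```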

Road Map

1. Evaluation Metrics
2. Evaluation Methods
3. How to Improve Classifier Accuracy

Step 1: Prepare Training and Test Sets

The available data is partitioned into a training set and a test set

Step 2: Build the Model

The model is built from the training set only

Step 3: Evaluate the Model

The model is applied to the test set, and its predictions are compared with
the known labels to evaluate it

Note on Parameter Tuning

Some learning schemes operate in two stages:
Stage 1: Build the basic structure of the model
Stage 2: Optimize parameter settings

It is important not to use the test data to build the model

The test data should not be used for parameter tuning either

Instead, the data is split into a training set (to build the model), a
validation set (for parameter tuning), and a test set (for the final
evaluation of the model's predictions)

Methods for Performance Evaluation

How to obtain a reliable estimate of performance?

Performance of a model may depend on other factors besides the learning
algorithm:
Class distribution
Cost of misclassification
Size of training and test sets

Holdout

Typically, two-thirds of the data are allocated to the training set and
one-third to the test set

The estimate is pessimistic because only a portion of the initial data is
used to derive the model

For small or unbalanced datasets, samples might not be representative

Stratified sampling: make sure that each class is represented with
approximately equal proportions in both subsets
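A minimal sketch of a stratified holdout split in plain Python (the 2/3 : 1/3 split and the tiny unbalanced dataset are illustrative assumptions):

```python
import random
from collections import defaultdict

def stratified_holdout(labels, train_fraction=2/3, seed=0):
    """Split tuple indices so each class keeps ~train_fraction of its tuples for training."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = int(round(train_fraction * len(idxs)))
        train_idx.extend(idxs[:cut])
        test_idx.extend(idxs[cut:])
    return train_idx, test_idx

# Hypothetical unbalanced dataset: 28 tuples of class a, 2 of class b (illustrative only).
labels = ["a"] * 28 + ["b"] * 2
train_idx, test_idx = stratified_holdout(labels)
print(len(train_idx), len(test_idx))  # roughly 2/3 vs 1/3, both classes in both subsets
```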

Random Subsampling

The holdout estimate can be made more reliable by repeating the process with
different subsamples (k times)

In each iteration, a certain proportion is randomly selected for training
(possibly with stratification)

The error rates on the different iterations are averaged to yield an overall
error rate

Still not optimal, since the different test sets overlap

Cross Validation

Avoids overlapping test sets

Data is split into k mutually exclusive subsets, or folds, of equal size:
D1, D2, ..., Dk

Each subset in turn is used for testing and the remainder for training:
First iteration: use D2, ..., Dk for training and D1 for testing
Second iteration: use D1, D3, ..., Dk for training and D2 for testing

This is called k-fold cross validation

Often the subsets are stratified before cross-validation is performed

The error estimates are averaged to yield an overall error estimate
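A minimal sketch of k-fold cross-validation in plain Python; the majority-class learner in the usage example is only a placeholder for a real classifier:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k mutually exclusive folds of (nearly) equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validation_error(n, k, train_and_test):
    """train_and_test(train_idx, test_idx) must return the error rate on the test fold."""
    folds = k_fold_indices(n, k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f in range(k) if f != i for j in folds[f]]
        errors.append(train_and_test(train_idx, test_idx))
    return sum(errors) / k   # overall estimate = average over the k folds

# Usage sketch with a trivial majority-class learner (a placeholder, not a real model).
labels = ["a"] * 20 + ["b"] * 10

def majority_learner(train_idx, test_idx):
    classes = set(labels[j] for j in train_idx)
    majority = max(classes, key=lambda c: sum(labels[j] == c for j in train_idx))
    return sum(labels[j] != majority for j in test_idx) / len(test_idx)

print(cross_validation_error(len(labels), k=10, train_and_test=majority_learner))
```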

Cross Validation

Standard method for evaluation: stratified 10-fold cross-validation

Why 10? Extensive experiments have shown that this is the best choice to get
an accurate estimate

Stratification reduces the estimate's variance

Even better: repeated cross-validation, e.g. ten-fold cross-validation
repeated ten times with the results averaged (further reduces the variance)

Leave-One-Out Cross Validation

A special case of k-fold cross-validation
k is set to the initial number of tuples
Only one sample is left out at a time for the test set

Makes best use of the data
Involves no random subsampling
Computationally expensive

Disadvantage: stratification is not possible
Extreme example: a random dataset split equally into two classes
A model that predicts the majority class achieves 50% accuracy on the whole data
The Leave-One-Out CV error estimate is 100%: removing the test tuple makes its
class the minority in the training data, so the majority-class model always
predicts the other class

The 0.632 Bootstrap

Sample the n training tuples uniformly with replacement (n draws): each time a
tuple is selected, it is equally likely to be selected again and re-added to
the training set

In a single draw, an instance has a probability of 1 - 1/n of not being picked

Thus its probability of ending up in the test data (never being picked) is:

(1 - 1/n)^n ≈ e^(-1) = 0.368

This means the training data will contain approximately 63.2% of the distinct
instances
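A quick numeric check of this probability for a few values of n:

```python
import math

# Probability that a given tuple is never picked in n draws with replacement.
for n in (10, 100, 1000, 10000):
    print(n, (1 - 1 / n) ** n)

print("e^-1 =", math.exp(-1))  # ~0.368, so ~63.2% of the distinct tuples land in training
```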

Estimating Error using Bootstrap

The error estimate on the test data will be very pessimistic, since training
was done on just ~63% of the instances

Therefore, combine it with the resubstitution error:

Err(M) = (1/k) * sum_{i=1..k} ( 0.632 * Err(Mi)_test_set + 0.368 * Err(Mi)_train_set )

The resubstitution error gets less weight than the error on the test data

The process is repeated k times with different replacement samples, and the
results are averaged (the 1/k factor above)
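A minimal sketch of the 0.632 bootstrap estimate, assuming a caller-supplied train_and_eval function that trains on the first index list and returns an error rate on the second (the function name and interface are just an illustration):

```python
import random

def bootstrap_632_error(n, k, train_and_eval, seed=0):
    """train_and_eval(train_idx, eval_idx) -> error rate of a model built on train_idx."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(k):
        train_idx = [rng.randrange(n) for _ in range(n)]        # n draws with replacement
        picked = set(train_idx)
        test_idx = [i for i in range(n) if i not in picked]     # the ~36.8% never picked
        err_test = train_and_eval(train_idx, test_idx)
        err_train = train_and_eval(train_idx, train_idx)        # resubstitution error
        estimates.append(0.632 * err_test + 0.368 * err_train)
    return sum(estimates) / k                                   # average over k resamples
```

Any learner with this train_and_eval interface can be plugged in, for example the majority-class placeholder from the cross-validation sketch above.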

More on Bootstrap

Probably the best way of estimating performance for very small datasets

However, it has some problems
Consider a random dataset with a 50% class distribution
A model that memorizes all the training data will achieve 0% resubstitution
error on the training data and ~50% error on the test data
Bootstrap estimate for this classifier: Err = 0.632 * 0.5 + 0.368 * 0 = 31.6%
True expected error: 50%

Road Map

1. Evaluation Metrics
2. Evaluation Methods
3. How to Improve Classifier Accuracy

Increasing the Accuracy

We have seen that pruning improves the accuracy of decision trees by reducing
the overfitting effect

There are some general strategies for improving the accuracy of classifiers
and predictors

Bagging and Boosting are two of these strategies

Ensemble methods: use a combination of models
Combine a series of learned classifiers M1, M2, ..., Mk
Find an improved composite model M*

Bagging

Intuition: instead of asking a single doctor for a diagnosis (how accurate is
that one diagnosis?), the patient asks several doctors and collects
diagnosis_1, diagnosis_2, diagnosis_3, ...

Choose the diagnosis that occurs more often than any of the others

Bagging

K iterations
At each iteration a training set Di is sampled from the data with replacement
and a model Mi is learned from it
The combined model M* returns the most frequent class among M1, ..., Mk in
case of classification, and the average value in case of prediction
To classify a new data sample, each Mi makes a prediction and M* combines
them, as sketched below
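A minimal sketch of bagging by majority vote; the learn-function interface and the trivial majority-class learner in the usage example are illustrative assumptions, not part of the slides:

```python
import random
from collections import Counter

def bagging_predict(data, labels, k, learn, x, seed=0):
    """Learn k models on bootstrap samples and combine their votes on a new sample x.

    learn(train_data, train_labels) must return a model: a function mapping x to a class.
    """
    rng = random.Random(seed)
    n = len(data)
    votes = []
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]              # sample D_i with replacement
        model = learn([data[i] for i in idx], [labels[i] for i in idx])
        votes.append(model(x))
    return Counter(votes).most_common(1)[0][0]                  # most frequent class wins

# Usage sketch with a trivial learner that always predicts its sample's majority class.
def majority_learner(train_data, train_labels):
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda sample: majority

labels = ["a"] * 20 + ["b"] * 10
data = list(range(len(labels)))
print(bagging_predict(data, labels, k=25, learn=majority_learner, x=None))  # prints "a"
```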

Boosting

Intuition: the patient again collects several diagnoses (diagnosis_1,
diagnosis_2, diagnosis_3), but now they are combined with different weights
(e.g. 0.4, 0.5, 0.1)

Assign different weights to the doctors based on the accuracy of their
previous diagnoses

Boosting
Weights are assigned to each training tuple
A series of k classifiers is iteratively learned
After a classifier Mi is learned, the weights are adjusted to allow the
subsequent classifier to pay more attention to training tuples
misclassified by Mi
The final boosted classifier M* combines the votes of each individual
classifier where the weight of each classifier is a function of its
accuracy
This strategy can be extended for the prediction of continuous
values

Example: The AdaBoost Algorithm

Given a set of d class-labeled tuples (X1, y1), ..., (Xd, yd)

Initially, all tuple weights are the same: 1/d

Generate k classifiers in k rounds

At round i, tuples from D are sampled (with replacement) to form a training
set Di of the same size

Each tuple's chance of being selected depends on its weight

A classification model Mi is derived and tested using Di

If a tuple is misclassified, its weight increases, otherwise it decreases
(the weight of a correctly classified tuple is multiplied by
error(Mi)/(1 - error(Mi)), then all weights are normalized)

Example: The AdaBoost Algorithm

err(Xj) is the misclassification error of tuple Xj:
Tuple correctly classified: err(Xj) = 0
Tuple incorrectly classified: err(Xj) = 1

Classifier Mi's error rate is the weighted sum of the errors of the
misclassified tuples:

error(Mi) = sum_{j=1..d} w_j * err(Xj)

The weight of classifier Mi's vote is:

log( (1 - error(Mi)) / error(Mi) )
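A minimal sketch of one boosting round's weight update, following the formulas on this slide; the 5-tuple example is made up, and the exact re-weighting of correctly classified tuples by error(Mi)/(1 - error(Mi)) followed by normalization reflects the usual AdaBoost description and is an assumption here:

```python
import math

def adaboost_round_update(weights, misclassified):
    """One AdaBoost weight update (assumes 0 < error(Mi) < 0.5).

    weights:       current tuple weights w_j (summing to 1)
    misclassified: booleans, err(X_j) = 1 if tuple j was misclassified by M_i
    Returns (new_weights, classifier_vote_weight).
    """
    error = sum(w for w, bad in zip(weights, misclassified) if bad)   # error(M_i)
    vote = math.log((1 - error) / error)                              # weight of M_i's vote
    factor = error / (1 - error)                                      # < 1 when error < 0.5
    new_w = [w * (1.0 if bad else factor) for w, bad in zip(weights, misclassified)]
    total = sum(new_w)
    return [w / total for w in new_w], vote                           # normalize weights

# Hypothetical round: 5 tuples with uniform weights, tuples 1 and 3 misclassified.
weights = [0.2] * 5
new_weights, vote = adaboost_round_update(weights, [False, True, False, True, False])
print(new_weights, vote)  # misclassified tuples end up with larger relative weights
```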

Summary

Accuracy is used to assess classifiers

Error measures are used to assess predictors

Stratified 10-fold cross validation is recommended for estimating accuracy

Bagging and boosting are used to improve the accuracy of classifiers and
predictors
