
COMP-424 - Artificial Intelligence
Lecture 16: Decision Trees

Instructors:

Joelle Pineau (jpineau@cs.mcgill.ca) Sylvie Ong (song@cs.mcgill.ca)

Class web page: www.cs.mcgill.ca/~jpineau/comp424

Today's overview
Topic: Decision Trees
- What are they?
- Decision tree construction
- Brief interlude on information theory
- Variations on the standard algorithm


Recall examples of supervised learning


- Handwritten digit recognition
- Object segmentation
- Face recognition
- Spam filtering
- Fraud detection
- Weather prediction
- Customer segmentation
- Categorization of news articles by topic

Which of these are classification tasks?



Example: Breast Cancer Outcome Prediction


Researchers computed 30 different features (or attributes) of the cancer cell nuclei:
- Features relate to radius, texture, area, smoothness, concavity, etc. of the nuclei
- For each image: mean, standard error, and max of these properties across nuclei
Vectors are of the form ((x1, x2, x3, ...), y) = (X, y)
The data set D takes the form of a data table:


Decision tree example


What does a node represent?
A partitioning of the input space.

Internal nodes are tests on the values of different attributes


Tests don't need to be binary.
R = recurrence N = no recurrence

Leaf nodes also include the set of training examples that satisfy the tests along the branch.
Each training example falls in precisely one leaf. Each leaf typically contains more than one example.


Using decision trees for classification


Suppose we get a new instance: radius = 18, texture = 12, ... How do we classify it?

R = recurrence N = no recurrence

Simple procedure (see the sketch below):
- At every node, test the corresponding attribute.
- Follow the appropriate branch of the tree.
- At a leaf, either predict the class of the majority of the examples for that leaf, or sample from the probabilities of the two classes.
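To make the procedure concrete, here is a minimal Python sketch (not taken from the slides); the Node class, feature names, and threshold values below are invented for illustration.

```python
# Minimal sketch: classify an instance by walking a tree of threshold tests to a leaf.

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, label=None):
        self.feature = feature      # feature tested at this internal node
        self.threshold = threshold  # go left if value <= threshold, else right
        self.left = left
        self.right = right
        self.label = label          # class label at a leaf (None for internal nodes)

def classify(node, instance):
    """Follow the tests from the root to a leaf and return its class label."""
    while node.label is None:
        node = node.left if instance[node.feature] <= node.threshold else node.right
    return node.label

# Hypothetical toy tree: R = recurrence, N = no recurrence
tree = Node(feature="radius", threshold=17.5,
            left=Node(label="N"),
            right=Node(feature="texture", threshold=20.0,
                       left=Node(label="R"), right=Node(label="N")))

print(classify(tree, {"radius": 18, "texture": 12}))  # -> "R"
```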


Interpreting decision trees


We can convert a decision tree into an equivalent set of if-then rules.

R = recurrence N = no recurrence


Interpreting decision trees


We can convert a decision tree into an equivalent set of if-then rules.

We can also calculate an estimated probability of recurrence.


R = recurrence N = no recurrence


More formally: Tests


Each internal node contains a test that typically depends on one feature.
- For discrete features, we typically branch on all possible values.
- For real-valued features, we typically branch on a threshold value.

A test returns a discrete outcome, e.g. true/false for a threshold test, or one of the attribute's values for a discrete test.

Learning = choosing the tests at every node and the shape of the tree.
A finite set of candidate tests is usually chosen before learning the tree.


Expressivity of decision trees


What kind of functions can be learned via decision trees?
Every Boolean function can be fully expressed using a decision tree:
- Each entry in the truth table could be one path -> very inefficient.
- Most Boolean functions can be encoded compactly.
Many other functions can be approximated by a Boolean function. Some functions are harder to encode:
- Parity function: returns 1 iff an even number of inputs are 1.
  An exponentially big decision tree, O(2^M), would be needed.
- Majority function: returns 1 if more than half the inputs are 1.


Learning Decision trees


We could enumerate all possible trees (assuming the number of possible tests is finite):
- Each tree could be evaluated using the training set or, better yet, a test set.
- But there are many possible trees!
- We'd probably overfit the data anyway.

Usually, decision trees are constructed in two phases:


1. A recursive, top-down procedure grows a tree (possibly until the training data is completely fit).
2. The tree is pruned back to avoid overfitting.


Top-down (recursive) induction of decision trees


Given a set of labeled training instances:
1. If all the training instances have the same class, create a leaf with that class label and exit.
2. Pick the best test to split the data on.
3. Split the training set according to the outcome of the test.
4. Recursively repeat steps 1-3 on each subset of the training data.

How do we pick the best test?


What is a good Test?


The test should provide information about the class label.
E.g. you are given 40 examples: 30 positive, 10 negative.
Consider two tests that would split the examples as follows. Which is best?

Intuitively, we prefer a test that separates the training instances as well as possible. How do we quantify this (mathematically)?

What is information?
Consider three cases:
- You are about to observe the outcome of a dice roll
- You are about to observe the outcome of a coin flip
- You are about to observe the outcome of a biased coin flip

Intuitively, in each situation, you have a different amount of uncertainty as to what outcome / message you will observe.


Information content
Let E be an event that occurs with probability P(E). If we are told that E has occurred with certainty, then we have received I(E) = log2(1 / P(E)) bits of information.

You can also think of information as the amount of surprise in the outcome (e.g., if P(E) = 1, then I(E) = 0).
E.g.
- a fair coin flip provides log2 2 = 1 bit of information
- a fair dice roll provides log2 6 ≈ 2.58 bits of information
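As a quick sanity check of these numbers, here is a small Python sketch of the information-content formula (the function name is ours, not the slides'):

```python
import math

def information_content(p):
    """Information, in bits, received when an event of probability p occurs: log2(1/p)."""
    return math.log2(1.0 / p)

print(information_content(1 / 2))  # fair coin flip -> 1.0 bit
print(information_content(1 / 6))  # fair dice roll -> ~2.58 bits
print(information_content(1.0))    # certain event  -> 0.0 bits
```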


Entropy
Suppose we have an information source S which emits symbols from an alphabet {s1, ..., sk} with probabilities {p1, ..., pk}. Each emission is independent of the others. What is the average amount of information we expect from the output of S?

We call this quantity the entropy of S:

H(S) = Σi pi log2(1 / pi) = - Σi pi log2 pi

Note that this depends only on the probability distribution, and not on the actual alphabet, so we can also write H(P).
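As a small illustrative Python sketch (assuming the standard definition above), the entropy of the three sources from the earlier slide:

```python
import math

def entropy(probs):
    """Entropy, in bits, of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1 / 2, 1 / 2]))  # fair coin   -> 1.0
print(entropy([1 / 6] * 6))     # fair dice   -> ~2.58
print(entropy([0.9, 0.1]))      # biased coin -> ~0.47
```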


Entropy

How to think about entropy:


- Average amount of information per symbol
- Average amount of surprise when observing a symbol
- Uncertainty the observer has before seeing the symbol
- Average number of bits needed to communicate the symbol


Binary Classification
We try to classify sample data using a decision tree. Suppose we have p positive samples and n negative samples. What is the entropy of this data set?

H(D) = H(p / (p+n), n / (p+n)) = - (p / (p+n)) log2(p / (p+n)) - (n / (p+n)) log2(n / (p+n))
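For instance, reusing the earlier 40-example set (our own worked example, not from the slides):

```python
import math

# Entropy of a data set with p = 30 positive and n = 10 negative examples.
p, n = 30, 10
h = -(p / (p + n)) * math.log2(p / (p + n)) - (n / (p + n)) * math.log2(n / (p + n))
print(h)  # -> ~0.811 bits
```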


Entropy of an attribute/feature
Suppose an attribute x can take k distinct values that split the data set D into subsets S1, ..., Sk. Each subset Si has pi positive and ni negative samples.

After testing x, we will need R(x) bits of information to classify a sample, where

R(x) = Σi ((pi + ni) / (p + n)) H(pi / (pi + ni), ni / (pi + ni))

The remainder R(x) is the conditional entropy H(D | x ).


H(D | x) = 0 means that the data is perfectly separated.


Information gain
How much information do we gain when testing an attribute x?

Gain(x) = H(D) - R(x)

Or, in terms of the conditional entropy:

Gain(x) = H(D) - H(D | x)

Alternative interpretation: how much reduction in entropy we get if we know x.
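A small Python sketch covering both the remainder and the resulting gain; the two candidate splits of a 30-positive / 10-negative set below are made up for illustration:

```python
import math

def entropy(probs):
    return -sum(q * math.log2(q) for q in probs if q > 0)

def remainder(splits):
    """Conditional entropy H(D | x), given (p_i, n_i) counts for each subset S_i."""
    total = sum(p + n for p, n in splits)
    return sum((p + n) / total * entropy([p / (p + n), n / (p + n)]) for p, n in splits)

def info_gain(p, n, splits):
    """Gain(x) = H(D) - R(x) for a data set with p positive and n negative examples."""
    return entropy([p / (p + n), n / (p + n)]) - remainder(splits)

# Hypothetical tests splitting 30 positive / 10 negative examples:
print(info_gain(30, 10, [(25, 0), (5, 10)]))  # informative split   -> ~0.47 bits
print(info_gain(30, 10, [(15, 5), (15, 5)]))  # uninformative split -> 0.0 bits
```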


Information gain to determine best test


If tests are binary:

In this case, which test is chosen?


Top-down (recursive) induction of decision trees


Given a set of labeled training instances:
1. If all the training instances have the same class, create a leaf with that class label and exit.
2. Pick the best test to split the data on.
3. Split the training set according to the outcome of the test.
4. Recursively repeat steps 1-3 on each subset of the training data.

How do we pick the best test? Answer: we choose the test with the highest information gain.
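One way to turn this procedure into code, as a rough Python sketch (this is not C4.5; the data set, tests, and dictionary-based tree representation are invented for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, pred):
    """Label entropy minus the weighted entropy after applying the binary test `pred`."""
    yes = [y for x, y in examples if pred(x)]
    no = [y for x, y in examples if not pred(x)]
    labels = [y for _, y in examples]
    rem = sum(len(s) / len(labels) * entropy(s) for s in (yes, no) if s)
    return entropy(labels) - rem

def build_tree(examples, tests):
    """Recursive top-down induction: stop at pure nodes, otherwise split on the best test."""
    labels = [y for _, y in examples]
    if len(set(labels)) == 1 or not tests:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    name, pred = max(tests, key=lambda t: info_gain(examples, t[1]))
    yes = [(x, y) for x, y in examples if pred(x)]
    no = [(x, y) for x, y in examples if not pred(x)]
    if not yes or not no:  # best test fails to split the data: make a leaf
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    rest = [t for t in tests if t[0] != name]
    return {"test": name, "yes": build_tree(yes, rest), "no": build_tree(no, rest)}

# Hypothetical usage with two threshold tests:
data = [({"radius": 20, "texture": 25}, "R"), ({"radius": 12, "texture": 15}, "N"),
        ({"radius": 19, "texture": 10}, "N"), ({"radius": 22, "texture": 30}, "R")]
tests = [("radius>17.5", lambda x: x["radius"] > 17.5),
         ("texture>20", lambda x: x["texture"] > 20)]
print(build_tree(data, tests))  # splits on "texture>20", giving pure leaves
```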


Caveats on tests with multiple values


If the outcome of a test is not binary, the number of possible values influences the information gain.
- The more possible values, the higher the gain! (It is more likely to form small but pure partitions.)
- Nonetheless, the attribute could be irrelevant.

It is possible to transform a real-valued attribute into one (or many) binary attributes.

C4.5 (the most popular decision tree construction algorithm in the ML community) uses only binary tests: Attribute = Value (discrete) or Attribute < Value (continuous).

Other approaches consider smarter metrics (e.g. gain ratio), which account for the number of possible outcomes.


Tests for real-valued features


Suppose feature j is real-valued. How do we choose a finite set of candidate thresholds τ for tests of the form xj > τ?
- Choose midpoints of the observed data values x1,j, x2,j, ..., xm,j
- Choose midpoints of data values with different y values (see the sketch below)
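A small sketch of the second option in Python (the feature values and labels below are made up for illustration):

```python
def candidate_thresholds(values, labels):
    """Midpoints between consecutive sorted feature values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    thresholds = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 != y2 and v1 != v2:
            thresholds.append((v1 + v2) / 2)
    return thresholds

# sorted pairs: (11, N) (14, N) (18, N) (22, R) (25, R) -> one candidate threshold, 20.0
print(candidate_thresholds([14, 22, 18, 11, 25], ["N", "R", "N", "N", "R"]))  # -> [20.0]
```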


A complete (artificial) example


An artificial binary classification problem with two real-valued input features:


A complete (artificial) example


The decision tree, graphically:


Assessing Performance
- Split the data into a training set and a test set
- Apply the learning algorithm to the training set
- Measure error on the test set

From R&N, p. 703


Overfitting in decision trees


Remember, decision tree construction proceeds until all leaves are pure, i.e. all examples are from the same class. As the tree grows, generalization performance can start to degrade, because the algorithm starts including irrelevant attributes, tests, or outliers.

How can we avoid this?


Avoiding Overfitting
Objective: remove some nodes to get better generalization.
- Early stopping: stop growing the tree when further splitting the data does not improve information gain on the test set.
- Post pruning: grow a full tree, then prune it by eliminating lower nodes that have low information gain on the test set.
In general, post pruning is better! It allows you to deal with cases where a single attribute is not informative, but a combination of attributes is informative.


Reduced-error pruning (aka post pruning)


- Split the data set into a training set and a test set.
- Grow a large tree (e.g. until each leaf is pure).
- For each node, evaluate the test-set accuracy of pruning the subtree rooted at that node.
- Greedily remove the node, with its corresponding subtree, whose removal most improves test-set accuracy, and replace it by a leaf with the majority class of the corresponding examples.
- Stop when pruning starts hurting the accuracy on the test set (a rough code sketch follows below).
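A rough Python sketch of this procedure; it assumes the Node class and classify function from the earlier classification sketch, plus a majority attribute storing each node's majority class (all of these are our assumptions, not code from the slides):

```python
def internal_nodes(node):
    """Yield every internal (non-leaf) node of the tree."""
    if node.label is None:
        yield node
        yield from internal_nodes(node.left)
        yield from internal_nodes(node.right)

def accuracy(tree, test_set):
    """Fraction of (instance, label) pairs in test_set that the tree classifies correctly."""
    return sum(classify(tree, x) == y for x, y in test_set) / len(test_set)

def reduced_error_prune(tree, test_set):
    """Greedily replace subtrees by majority-class leaves while test accuracy does not drop."""
    while True:
        baseline = accuracy(tree, test_set)
        best_gain, best_node = 0.0, None
        for node in internal_nodes(tree):
            saved = (node.label, node.left, node.right)
            node.label, node.left, node.right = node.majority, None, None  # try pruning here
            gain = accuracy(tree, test_set) - baseline
            node.label, node.left, node.right = saved                      # undo
            if gain >= best_gain:
                best_gain, best_node = gain, node
        if best_node is None:
            return tree  # every remaining pruning step would hurt test accuracy: stop
        best_node.label = best_node.majority            # commit the best pruning step
        best_node.left = best_node.right = None
```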


Example: Effect of reduced-error pruning


Regression trees
Similar to classification trees, but for regression problems: we try to predict a numerical value.
- Tests can be the same as before.
- At the leaves, instead of predicting the majority of the examples, we define a linear function of some subset of the numerical values (e.g. predict the mean). A tiny sketch follows below.
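As a tiny illustrative sketch (the threshold and target values below are invented), a one-split regression tree whose leaves predict the mean of their training targets:

```python
def predict(x, threshold, left_targets, right_targets):
    """Single threshold test; each leaf predicts the mean of its training targets."""
    targets = left_targets if x <= threshold else right_targets
    return sum(targets) / len(targets)

print(predict(12.0, 15.0, [3.1, 2.9, 3.4], [7.8, 8.2]))  # -> ~3.13 (mean of the left leaf)
```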


Applications of Decision Trees


Simple things:
- How-to guides
- Failure diagnosis
- And many more

More sophisticated example: classification of financial data
- Explanation of stock dynamics using publicly available information
- Classify stocks as undervalued, overvalued, or neutral


Application of Decision Trees


[Figure: decision tree mapping the input features to an output class of Short, Long, or Neutral]

http://sfb649.wiwi.hu-berlin.de/papers/pdf/SFB649DP2008-009.pdf

Application of Decision Trees


Advantages of decision trees


- Provide a general representation of classification rules.
- Scaling / normalization is not needed, as we use no notion of distance between examples.
- The learned function, y = h(x), of a decision tree is easy to interpret.
- Easy to implement.
- Fast learning algorithms (e.g. C4.5, CART).
- Good accuracy in practice: many applications in industry!


Limitations
Sensitivity:
- The exact tree output may be sensitive to small changes in the data.
- With many features, tests may not be meaningful.

In standard form, good for learning (nonlinear) piecewise axis-orthogonal decision boundaries, but not for learning functions with smooth, curvilinear boundaries.


What you should know


- How to use a decision tree to classify a new example.
- How to build a decision tree using an information-theoretic approach.
- How to detect (and fix) overfitting in decision trees.
- How to handle both discrete and continuous attributes and outputs.

