
COMP-424 - Artificial Intelligence
Lecture 16: Decision Trees

Instructors:

Joelle Pineau (jpineau@cs.mcgill.ca) Sylvie Ong (song@cs.mcgill.ca)

Class web page: www.cs.mcgill.ca/~jpineau/comp424

Today's overview
Topic: Decision Trees
- What are they?
- Decision tree construction
- Brief interlude on information theory
- Variations on the standard algorithm


Recall examples of supervised learning


- Handwritten digit recognition
- Object segmentation
- Face recognition
- Spam filtering
- Fraud detection
- Weather prediction
- Customer segmentation
- Categorization of news articles by topic

Which of these are classification tasks?



Example: Breast Cancer Outcome Prediction


Researchers computed 30 different features (or attributes) of the cancer cell nuclei:
- Features relate to radius, texture, area, smoothness, concavity, etc. of the nuclei
- For each image: mean, standard error, and max of these properties across nuclei
Vectors are of the form ((x1, x2, x3, ...), y) = (X, y)
The data set D takes the form of a data table:


Decision tree example


What does a node represent?
A partitioning of the input space.

Internal nodes are tests on the values of different attributes


Tests don't need to be binary.
R = recurrence N = no recurrence

Leaf nodes also include the set of training examples that satisfy the tests along the branch.
Each training example falls in precisely one leaf. Each leaf typically contains more than one example.


Using decision trees for classification


Suppose we get a new instance: radius = 18, texture = 12, ... How do we classify it?

R = recurrence N = no recurrence

Simple procedure (see the sketch below):
- At every node, test the corresponding attribute.
- Follow the appropriate branch of the tree.
- At a leaf, either predict the class of the majority of the examples for that leaf, or sample from the probabilities of the two classes.
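To make the procedure concrete, here is a minimal Python sketch (not taken from the slides); the Node class, feature names, and threshold values below are invented for illustration.

```python
# Minimal sketch: classify an instance by walking a tree of threshold tests to a leaf.

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, label=None):
        self.feature = feature      # feature tested at this internal node
        self.threshold = threshold  # go left if value <= threshold, else right
        self.left = left
        self.right = right
        self.label = label          # class label at a leaf (None for internal nodes)

def classify(node, instance):
    """Follow the tests from the root to a leaf and return its class label."""
    while node.label is None:
        node = node.left if instance[node.feature] <= node.threshold else node.right
    return node.label

# Hypothetical toy tree: R = recurrence, N = no recurrence
tree = Node(feature="radius", threshold=17.5,
            left=Node(label="N"),
            right=Node(feature="texture", threshold=20.0,
                       left=Node(label="R"), right=Node(label="N")))

print(classify(tree, {"radius": 18, "texture": 12}))  # -> "R"
```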


Interpreting decision trees


We can convert a decision tree into an equivalent set of if-then rules.

R = recurrence N = no recurrence


Interpreting decision trees


We can convert a decision tree into an equivalent set of if-then rules.

We can also calculate an estimated probability of recurrence.


R = recurrence N = no recurrence


More formally: Tests


Each internal node contains a test that typically depends on one feature.
- For discrete features, we typically branch on all possible values.
- For real-valued features, we typically branch on a threshold value.

A test returns a discrete outcome, e.g. true/false for a threshold test, or one of the attribute's values for a discrete test.

Learning = choosing the tests at every node and the shape of the tree.
A finite set of candidate tests is usually chosen before learning the tree.


Expressivity of decision trees


What kind of functions can be learned via decision trees?
Every Boolean function can be fully expressed using a decision tree:
- Each entry in the truth table could be one path -> very inefficient.
- Most Boolean functions can be encoded compactly.
Many other functions can be approximated by a Boolean function. Some functions are harder to encode:
- Parity function: returns 1 iff an even number of inputs are 1.
  An exponentially big decision tree, O(2^M), would be needed.
- Majority function: returns 1 if more than half the inputs are 1.


Learning Decision trees


We could enumerate all possible trees (assuming the number of possible tests is finite):
- Each tree could be evaluated using the training set or, better yet, a test set.
- But there are many possible trees!
- We'd probably overfit the data anyway.

Usually, decision trees are constructed in two phases:


1. A recursive, top-down procedure grows a tree (possibly until the training data is completely fit).
2. The tree is pruned back to avoid overfitting.


Top-down (recursive) induction of decision trees


Given a set of labeled training instances:
1. If all the training instances have the same class, create a leaf with that class label and exit.
2. Pick the best test to split the data on.
3. Split the training set according to the outcome of the test.
4. Recursively repeat steps 1-3 on each subset of the training data.

How do we pick the best test?


What is a good Test?


The test should provide information about the class label.
E.g. you are given 40 examples: 30 positive, 10 negative.
Consider two tests that would split the examples as follows. Which is best?

Intuitively, we prefer a test that separates the training instances as well as possible. How do we quantify this (mathematically)?

What is information?
Consider three cases:
- You are about to observe the outcome of a dice roll
- You are about to observe the outcome of a coin flip
- You are about to observe the outcome of a biased coin flip

Intuitively, in each situation, you have a different amount of uncertainty as to what outcome / message you will observe.


Information content
Let E be an event that occurs with probability P(E). If we are told that E has occurred with certainty, then we have received I(E) = log2(1 / P(E)) bits of information.

You can also think of information as the amount of surprise in the outcome (e.g., if P(E) = 1, then I(E) = 0).
E.g.
- a fair coin flip provides log2 2 = 1 bit of information
- a fair dice roll provides log2 6 ≈ 2.58 bits of information
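As a quick sanity check of these numbers, here is a small Python sketch of the information-content formula (the function name is ours, not the slides'):

```python
import math

def information_content(p):
    """Information, in bits, received when an event of probability p occurs: log2(1/p)."""
    return math.log2(1.0 / p)

print(information_content(1 / 2))  # fair coin flip -> 1.0 bit
print(information_content(1 / 6))  # fair dice roll -> ~2.58 bits
print(information_content(1.0))    # certain event  -> 0.0 bits
```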


Entropy
Suppose we have an information source S which emits symbols from an alphabet {s1, ..., sk} with probabilities {p1, ..., pk}. Each emission is independent of the others. What is the average amount of information we expect from the output of S?

We call this quantity the entropy of S:

H(S) = Σi pi log2(1 / pi) = - Σi pi log2 pi

Note that this depends only on the probability distribution, and not on the actual alphabet, so we can also write H(P).
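As a small illustrative Python sketch (assuming the standard definition above), the entropy of the three sources from the earlier slide:

```python
import math

def entropy(probs):
    """Entropy, in bits, of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1 / 2, 1 / 2]))  # fair coin   -> 1.0
print(entropy([1 / 6] * 6))     # fair dice   -> ~2.58
print(entropy([0.9, 0.1]))      # biased coin -> ~0.47
```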


Entropy

How to think about entropy:


- Average amount of information per symbol
- Average amount of surprise when observing a symbol
- Uncertainty the observer has before seeing the symbol
- Average number of bits needed to communicate the symbol


Binary Classification
We try to classify sample data using a decision tree. Suppose we have p positive samples and n negative samples. What is the entropy of this data set?

H(D) = H(p / (p+n), n / (p+n)) = - (p / (p+n)) log2(p / (p+n)) - (n / (p+n)) log2(n / (p+n))
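For instance, reusing the earlier 40-example set (our own worked example, not from the slides):

```python
import math

# Entropy of a data set with p = 30 positive and n = 10 negative examples.
p, n = 30, 10
h = -(p / (p + n)) * math.log2(p / (p + n)) - (n / (p + n)) * math.log2(n / (p + n))
print(h)  # -> ~0.811 bits
```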


Entropy of an attribute/feature
Suppose an attribute x can take k distinct values that split the data set D into subsets S1, ..., Sk. Each subset Si has pi positive and ni negative samples.

After testing x, we will need R(x) bits of information to classify a sample, where

R(x) = Σi ((pi + ni) / (p + n)) H(pi / (pi + ni), ni / (pi + ni))

The remainder R(x) is the conditional entropy H(D | x ).


H(D | x) = 0 means that the data is perfectly separated.


Information gain
How much information do we gain when testing an attribute x?

Gain(x) = H(D) - R(x)

Or, in terms of the conditional entropy:

Gain(x) = H(D) - H(D | x)

Alternative interpretation: how much reduction in entropy we get if we know x.
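A small Python sketch covering both the remainder and the resulting gain; the two candidate splits of a 30-positive / 10-negative set below are made up for illustration:

```python
import math

def entropy(probs):
    return -sum(q * math.log2(q) for q in probs if q > 0)

def remainder(splits):
    """Conditional entropy H(D | x), given (p_i, n_i) counts for each subset S_i."""
    total = sum(p + n for p, n in splits)
    return sum((p + n) / total * entropy([p / (p + n), n / (p + n)]) for p, n in splits)

def info_gain(p, n, splits):
    """Gain(x) = H(D) - R(x) for a data set with p positive and n negative examples."""
    return entropy([p / (p + n), n / (p + n)]) - remainder(splits)

# Hypothetical tests splitting 30 positive / 10 negative examples:
print(info_gain(30, 10, [(25, 0), (5, 10)]))  # informative split   -> ~0.47 bits
print(info_gain(30, 10, [(15, 5), (15, 5)]))  # uninformative split -> 0.0 bits
```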


Information gain to determine best test


If tests are binary:

In this case, which test is chosen?


Top-down (recursive) induction of decision trees


Given a set of labeled training instances:
1. If all the training instances have the same class, create a leaf with that class label and exit.
2. Pick the best test to split the data on.
3. Split the training set according to the outcome of the test.
4. Recursively repeat steps 1-3 on each subset of the training data.

How do we pick the best test? Answer: we choose the test with the highest information gain.
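One way to turn this procedure into code, as a rough Python sketch (this is not C4.5; the data set, tests, and dictionary-based tree representation are invented for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, pred):
    """Label entropy minus the weighted entropy after applying the binary test `pred`."""
    yes = [y for x, y in examples if pred(x)]
    no = [y for x, y in examples if not pred(x)]
    labels = [y for _, y in examples]
    rem = sum(len(s) / len(labels) * entropy(s) for s in (yes, no) if s)
    return entropy(labels) - rem

def build_tree(examples, tests):
    """Recursive top-down induction: stop at pure nodes, otherwise split on the best test."""
    labels = [y for _, y in examples]
    if len(set(labels)) == 1 or not tests:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    name, pred = max(tests, key=lambda t: info_gain(examples, t[1]))
    yes = [(x, y) for x, y in examples if pred(x)]
    no = [(x, y) for x, y in examples if not pred(x)]
    if not yes or not no:  # best test fails to split the data: make a leaf
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    rest = [t for t in tests if t[0] != name]
    return {"test": name, "yes": build_tree(yes, rest), "no": build_tree(no, rest)}

# Hypothetical usage with two threshold tests:
data = [({"radius": 20, "texture": 25}, "R"), ({"radius": 12, "texture": 15}, "N"),
        ({"radius": 19, "texture": 10}, "N"), ({"radius": 22, "texture": 30}, "R")]
tests = [("radius>17.5", lambda x: x["radius"] > 17.5),
         ("texture>20", lambda x: x["texture"] > 20)]
print(build_tree(data, tests))  # splits on "texture>20", giving pure leaves
```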


Caveats on tests with multiple values


If the outcome of a test is not binary, the number of possible values influences the information gain.
- The more possible values, the higher the gain! (It is more likely to form small but pure partitions.)
- Nonetheless, the attribute could be irrelevant.

It is possible to transform a real-valued attribute into one (or many) binary attributes.

C4.5 (the most popular decision tree construction algorithm in the ML community) uses only binary tests: Attribute = Value (discrete) or Attribute < Value (continuous).

Other approaches consider smarter metrics (e.g. gain ratio), which account for the number of possible outcomes.


Tests for real-valued features


Suppose feature j is real-valued. How do we choose a finite set of candidate thresholds τ for tests of the form xj > τ?
- Choose midpoints of the observed data values x1,j, x2,j, ..., xm,j
- Choose midpoints of data values with different y values (see the sketch below)
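A small sketch of the second option in Python (the feature values and labels below are made up for illustration):

```python
def candidate_thresholds(values, labels):
    """Midpoints between consecutive sorted feature values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    thresholds = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 != y2 and v1 != v2:
            thresholds.append((v1 + v2) / 2)
    return thresholds

# sorted pairs: (11, N) (14, N) (18, N) (22, R) (25, R) -> one candidate threshold, 20.0
print(candidate_thresholds([14, 22, 18, 11, 25], ["N", "R", "N", "N", "R"]))  # -> [20.0]
```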


A complete (artificial) example


An artificial binary classification problem with two real-valued input features:


A complete (artificial) example


The decision tree, graphically:


Assessing Performance
- Split the data into a training set and a test set
- Apply the learning algorithm to the training set
- Measure error on the test set

From R&N, p. 703


Overfitting in decision trees


Remember, decision tree construction proceeds until all leaves are pure, i.e. all examples are from the same class. As the tree grows, generalization performance can start to degrade, because the algorithm starts including irrelevant attributes, tests, or outliers.

How can we avoid this?


Avoiding Overfitting
Objective: remove some nodes to get better generalization.
- Early stopping: stop growing the tree when further splitting the data does not improve information gain on the test set.
- Post pruning: grow a full tree, then prune it by eliminating lower nodes that have low information gain on the test set.
In general, post pruning is better! It allows you to deal with cases where a single attribute is not informative, but a combination of attributes is informative.


Reduced-error pruning (aka post pruning)


- Split the data set into a training set and a test set.
- Grow a large tree (e.g. until each leaf is pure).
- For each node, evaluate the test-set accuracy of pruning the subtree rooted at that node.
- Greedily remove the node, with its corresponding subtree, whose removal most improves test-set accuracy, and replace it by a leaf with the majority class of the corresponding examples.
- Stop when pruning starts hurting the accuracy on the test set (a rough code sketch follows below).
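A rough Python sketch of this procedure; it assumes the Node class and classify function from the earlier classification sketch, plus a majority attribute storing each node's majority class (all of these are our assumptions, not code from the slides):

```python
def internal_nodes(node):
    """Yield every internal (non-leaf) node of the tree."""
    if node.label is None:
        yield node
        yield from internal_nodes(node.left)
        yield from internal_nodes(node.right)

def accuracy(tree, test_set):
    """Fraction of (instance, label) pairs in test_set that the tree classifies correctly."""
    return sum(classify(tree, x) == y for x, y in test_set) / len(test_set)

def reduced_error_prune(tree, test_set):
    """Greedily replace subtrees by majority-class leaves while test accuracy does not drop."""
    while True:
        baseline = accuracy(tree, test_set)
        best_gain, best_node = 0.0, None
        for node in internal_nodes(tree):
            saved = (node.label, node.left, node.right)
            node.label, node.left, node.right = node.majority, None, None  # try pruning here
            gain = accuracy(tree, test_set) - baseline
            node.label, node.left, node.right = saved                      # undo
            if gain >= best_gain:
                best_gain, best_node = gain, node
        if best_node is None:
            return tree  # every remaining pruning step would hurt test accuracy: stop
        best_node.label = best_node.majority            # commit the best pruning step
        best_node.left = best_node.right = None
```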


Example: Effect of reduced-error pruning


Regression trees
Similar to classification trees, but for regression problems: we try to predict a numerical value.
- Tests can be the same as before.
- At the leaves, instead of predicting the majority of the examples, we define a linear function of some subset of the numerical values (e.g. predict the mean). A tiny sketch follows below.
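As a tiny illustrative sketch (the threshold and target values below are invented), a one-split regression tree whose leaves predict the mean of their training targets:

```python
def predict(x, threshold, left_targets, right_targets):
    """Single threshold test; each leaf predicts the mean of its training targets."""
    targets = left_targets if x <= threshold else right_targets
    return sum(targets) / len(targets)

print(predict(12.0, 15.0, [3.1, 2.9, 3.4], [7.8, 8.2]))  # -> ~3.13 (mean of the left leaf)
```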


Applications of Decision Trees


Simple things:
- How-to guides
- Failure diagnosis
- And many more

More sophisticated example: classification of financial data
- Explanation of stock dynamics using publicly available information
- Classify stocks as undervalued, overvalued, or neutral


Application of Decision Trees


[Figure: decision tree mapping the input features to an output class of Short, Long, or Neutral]

http://sfb649.wiwi.hu-berlin.de/papers/pdf/SFB649DP2008-009.pdf

Application of Decision Trees


Advantages of decision trees


- Provide a general representation of classification rules.
- Scaling / normalization is not needed, as we use no notion of distance between examples.
- The learned function, y = h(x), of a decision tree is easy to interpret.
- Easy to implement.
- Fast learning algorithms (e.g. C4.5, CART).
- Good accuracy in practice: many applications in industry!


Limitations
Sensitivity:
- The exact tree output may be sensitive to small changes in the data.
- With many features, tests may not be meaningful.

In standard form, good for learning (nonlinear) piecewise axis-orthogonal decision boundaries, but not for learning functions with smooth, curvilinear boundaries.


What you should know


- How to use a decision tree to classify a new example.
- How to build a decision tree using an information-theoretic approach.
- How to detect (and fix) overfitting in decision trees.
- How to handle both discrete and continuous attributes and outputs.

