Instructor: Joelle Pineau
Today's overview
Topic: Decision Trees
- What are they?
- Decision tree construction
- Brief interlude on information theory
- Variations on the standard algorithm
Leaf nodes also include the set of training examples that satisfy the tests along the branch.
Each training example falls in precisely one leaf. Each leaf typically contains more than one example.
[Figure: example decision tree; legend: R = recurrence, N = no recurrence]
Simple procedure for classifying a new example (a minimal sketch follows below):
- At every internal node, test the corresponding attribute.
- Follow the appropriate branch of the tree.
- At a leaf, either predict the class of the majority of the examples for that leaf, or sample from the probabilities of the two classes.
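The sketch below illustrates this procedure; the `Node` structure and `classify` function are hypothetical names introduced here for illustration, not code from the course.

```python
import random

class Node:
    """A hypothetical decision-tree node: either a test node or a leaf with class counts."""
    def __init__(self, attribute=None, children=None, class_counts=None):
        self.attribute = attribute        # attribute tested at this node (None at a leaf)
        self.children = children or {}    # attribute value -> child Node
        self.class_counts = class_counts  # at a leaf: {class label: number of training examples}

def classify(node, example, sample=False):
    """Follow the branches matching the example's attribute values down to a leaf."""
    while node.attribute is not None:
        node = node.children[example[node.attribute]]
    if not sample:
        # Predict the majority class of the training examples in this leaf.
        return max(node.class_counts, key=node.class_counts.get)
    # Alternatively, sample a class in proportion to the leaf's class frequencies.
    labels, counts = zip(*node.class_counts.items())
    return random.choices(labels, weights=counts)[0]
```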
Learning = choosing the tests at every node and the shape of the tree.
A finite set of candidate tests is usually chosen before learning the tree.
Many other functions can be approximated by a boolean function, but some functions are hard to encode as a decision tree.
Parity function: returns 1 iff an even number of inputs are 1.
An exponentially big decision tree, with O(2^M) nodes for M inputs, would be needed.
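As an illustration (not from the slides), a direct implementation of the parity function shows why no small tree exists: flipping any one input flips the output, so every attribute must be tested on every path.

```python
def parity(bits):
    """Returns 1 iff an even number of inputs are 1 (the parity function from the slide)."""
    return 1 if sum(bits) % 2 == 0 else 0

# Flipping any single input always flips the output, so a decision tree must test
# all M attributes along every path, giving on the order of 2^M leaves.
print(parity([1, 0, 1]))  # 1 (two inputs are 1 -> even)
print(parity([1, 0, 0]))  # 0 (one input is 1 -> odd)
```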
Intuitively, we prefer a test that separates the training instances as well as possible. How do we quantify this (mathematically)?
What is information?
Consider three cases:
- You are about to observe the outcome of a dice roll.
- You are about to observe the outcome of a coin flip.
- You are about to observe the outcome of a biased coin flip.
Intuitively, in each situation, you have a different amount of uncertainty as to what outcome / message you will observe.
Information content
Let E be an event that occurs with probability P(E). If we are told that E has occurred with certainty, then we have received I(E) = log2(1/P(E)) bits of information.
You can also think of information as the amount of surprise in the outcome (e.g., if P(E) = 1, then I(E) = 0: a certain event carries no surprise).
E.g.:
- A fair coin flip provides log2 2 = 1 bit of information.
- A fair dice roll provides log2 6 ≈ 2.58 bits of information.
Entropy
Suppose we have an information source S which emits symbols from an alphabet {s1, ..., sk} with probabilities {p1, ..., pk}. Each emission is independent of the others. What is the average amount of information we expect from the output of S?
This average, H(S) = Σi pi I(si) = -Σi pi log2 pi, is called the entropy of S. Note that it depends only on the probability distribution, and not on the actual alphabet, so we can write H(P).
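A minimal sketch (not the course's own code) of this computation, assuming the distribution is given as a list of probabilities:

```python
import math

def entropy(probabilities):
    """Entropy H(P) = -sum_i p_i log2 p_i: the expected information per emitted symbol."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([1/6] * 6))    # fair die: ~2.58 bits
print(entropy([0.9, 0.1]))   # biased coin: ~0.47 bits (less uncertainty)
```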
Binary Classification
We try to classify sample data using a decision tree. Suppose we have p positive samples and n negative samples. What is the entropy of this data set? Treating the class label as the emitted symbol, it is H(p/(p+n), n/(p+n)) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n)).
Entropy of an attribute/feature
Suppose an attribute x can take k distinct values that split the dataset D into subsets S1, ..., Sk. Each subset Si has pi positive and ni negative samples.
After testing x, we will still need, on average, R(x) = Σi ((pi + ni)/(p + n)) H(pi/(pi + ni), ni/(pi + ni)) bits of information to classify a sample.
Information gain
How much information do we gain when testing an attribute x? The gain is the entropy before the test minus the expected entropy remaining after it: Gain(x) = H(p/(p+n), n/(p+n)) - R(x).
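A minimal sketch of these two quantities, assuming each attribute value's split is summarized by its positive/negative counts (the helper names `binary_entropy`, `remainder`, and `info_gain` are illustrative, not the lecture's):

```python
import math

def binary_entropy(p, n):
    """Entropy H(p/(p+n), n/(p+n)) of a set with p positive and n negative examples."""
    total = p + n
    h = 0.0
    for count in (p, n):
        if count > 0:
            q = count / total
            h -= q * math.log2(q)
    return h

def remainder(splits):
    """R(x): expected entropy after testing x, where splits = [(p_i, n_i), ...] per value of x."""
    p = sum(pi for pi, _ in splits)
    n = sum(ni for _, ni in splits)
    return sum((pi + ni) / (p + n) * binary_entropy(pi, ni) for pi, ni in splits)

def info_gain(splits):
    """Gain(x) = entropy of the whole set minus R(x)."""
    p = sum(pi for pi, _ in splits)
    n = sum(ni for _, ni in splits)
    return binary_entropy(p, n) - remainder(splits)

# Example: an attribute whose three values split 6 positives / 6 negatives as below.
print(info_gain([(3, 0), (2, 2), (1, 4)]))  # higher gain = better test
```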
How do we pick the best test? Answer: We choose the test with the highest information gain.
C4.5 (the most popular decision tree construction algorithm in the ML community) uses only binary tests: Attribute = Value for discrete attributes, or Attribute < Value for continuous attributes.
Other approaches consider smarter metrics (e.g. gain ratio), which account for the number of possible outcomes.
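As a sketch of one such metric, here is the standard gain-ratio definition (not necessarily the exact form used in these slides): the gain is divided by the "split information", the entropy of the partition sizes themselves, which penalizes tests with many outcomes.

```python
import math

def _entropy(counts):
    """Entropy of a discrete distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(splits):
    """C4.5-style gain ratio for a test whose outcomes split the data as splits = [(p_i, n_i), ...]."""
    p = sum(pi for pi, _ in splits)
    n = sum(ni for _, ni in splits)
    sizes = [pi + ni for pi, ni in splits]
    gain = _entropy([p, n]) - sum(s / (p + n) * _entropy([pi, ni]) for s, (pi, ni) in zip(sizes, splits))
    split_information = _entropy(sizes)  # larger when the test has many (small) outcomes
    return gain / split_information if split_information > 0 else 0.0

# A many-valued test gets a lower ratio than its raw information gain would suggest.
print(gain_ratio([(3, 0), (2, 2), (1, 4)]))
```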
Assessing Performance
- Split the data into a training set and a test set.
- Apply the learning algorithm to the training set.
- Measure the error on the test set (see the sketch below).
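A minimal sketch of this workflow, assuming scikit-learn is available (the course does not prescribe a particular library, and the built-in dataset here is used purely for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Any labelled dataset will do; this one ships with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)

# 1. Split the data into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 2. Apply the learning algorithm (a CART-style decision tree) to the training set.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X_train, y_train)

# 3. Measure the error on the test set.
test_error = 1 - accuracy_score(y_test, tree.predict(X_test))
print(f"test error: {test_error:.3f}")
```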
Avoiding Overfitting
Objective: remove some nodes to get better generalization.
- Early stopping: stop growing the tree when further splitting the data does not improve information gain on the test set.
- Post-pruning: grow a full tree, then prune it by eliminating lower nodes that have low information gain on the test set.
In general, post-pruning is better! It can deal with cases where a single attribute is not informative on its own, but a combination of attributes is informative. (A pruning sketch follows below.)
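A minimal sketch of the grow-then-prune workflow, assuming scikit-learn. Note that scikit-learn prunes via cost-complexity pruning (the `ccp_alpha` parameter) rather than the information-gain criterion described above, so this illustrates the general workflow, not the exact rule from the slides.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Grow a full tree first.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Then prune: try increasing amounts of cost-complexity pruning and keep the
# tree that scores best on the held-out data.
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = pruned.score(X_test, y_test)
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"best ccp_alpha={best_alpha:.5f}, held-out accuracy={best_score:.3f}")
```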
Regression trees
Similar to classification trees, but for regression problems: we try to predict a numerical value. Tests can be the same as before. At the leaves, instead of predicting the majority class of the examples, we use a simple function of the examples' target values, e.g. predict their mean, or fit a linear function of some subset of the numerical attributes.
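A minimal sketch, assuming scikit-learn's DecisionTreeRegressor (which predicts the mean target at each leaf); the synthetic data is illustrative only:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative 1-D regression problem: a noisy sine curve.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# Each leaf predicts the mean of its training targets, giving a piecewise-constant fit.
reg = DecisionTreeRegressor(max_depth=4).fit(X, y)
print(reg.predict([[1.5], [4.0]]))
```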
More sophisticated example: classification of financial data.
- Explanation of stock dynamics using publicly available information.
- Classify stocks as undervalued, overvalued, or neutral.
http://sfb649.wiwi.hu-berlin.de/papers/pdf/SFB649DP2008-009.pdf
Advantages
- The learned function, y = h(x), of a decision tree is easy to interpret.
- Easy to implement.
- Fast learning algorithms (e.g. C4.5, CART).
- Good accuracy in practice: many applications in industry!
Limitations
- Sensitivity: the exact tree output may be sensitive to small changes in the data.
- With many features, individual tests may not be meaningful.
- In standard form, good for learning (nonlinear) piecewise, axis-orthogonal decision boundaries, but not for learning functions with smooth, curvilinear boundaries.