Classification and Regression Trees

Professor Margrét Vilborg Bjarnadóttir

Today’s Lecture
 Adding to our toolbox: Classification and regression trees
 The main ideas behind classification trees
 The key methodological questions
 Case Study

Our goal this morning:


• To have a clear idea of what happens behind the scenes when XLMiner runs Classification trees
• To understand the value of pruning back the trees with our validation data
• To learn about applying Classification Trees in practice
Bonus goal:
• Be able to apply Regression Trees


Classification Trees
…and Regression Trees

Motivating Example: Beer Preference


 Hacker Pschorr
 One of the oldest beer brewing companies in Munich
 Collects data on beer preference (light/regular) and demographic info
 Goal: determine demographic factors for preferring light beer
 We will first focus on two predictors: Income and Age


The Underlying Idea


 Recursively separating the records into subgroups by creating splits on the predictors
 This splitting of the data set can be visualized as trees.

[Figure: the full data set at the root ("All data") is split on Income at 38,562 (≤38,562 vs. >38,562); the Beer Preferences scatter plot of Age vs. Income shows Regular and Light drinkers with the corresponding vertical split line.]


The Underlying Idea (continued)

[Figure: the same scatter plot of Age vs. Income with the tree grown one level deeper: after the Income split at 38,562, the low-income branch is split on Age at 37.5 and the high-income branch on Age at 48.5, carving the plot into rectangular regions of mostly Regular or mostly Light drinkers.]


In-class Exercise
 The following tree was obtained using a random 70/30 split. What does the decision boundary look like?


The Key Questions


[Figure: the beer-preference tree: Income split at 38,562, then Age splits at 37.5 and 48.5, with leaves labeled Light and Regular.]

1. How do we choose the split variable and the split value?
2. When should we stop?
3. What rule do we use for classification/prediction in the end nodes?
4. How do we classify a new record?


Determining the Best Split


 The CART algorithm (Classification And Regression Trees) evaluates all possible binary splits (exhaustive search):
   For each variable
     For each possible split value (on that variable)
       Calculate the impurity of the resulting sub-nodes
       Summarize the impurity of the split as the weighted average of the impurities of the sub-nodes
 Select the best “variable-value” split
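XLMiner runs this search internally. For intuition only, here is a minimal Python sketch (not XLMiner's actual code) of the exhaustive "variable-value" search for a single node, using the Gini index (defined a few slides below) as the impurity measure; the function names and the sample records are illustrative, not course data:

# Exhaustive CART-style split search for one node (illustrative sketch).

def gini(labels):
    """Gini impurity of a collection of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1 - p1 ** 2 - (1 - p1) ** 2

def best_split(records, predictors, target):
    """Return (variable, split value, weighted Gini) of the best binary split."""
    n = len(records)
    best = None
    for var in predictors:
        values = sorted({r[var] for r in records})
        # Candidate split points: midpoints between consecutive distinct values
        midpoints = [(a + b) / 2 for a, b in zip(values, values[1:])]
        for v in midpoints:
            left = [r[target] for r in records if r[var] <= v]
            right = [r[target] for r in records if r[var] > v]
            w_gini = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or w_gini < best[2]:
                best = (var, v, w_gini)
    return best

records = [  # made-up beer-preference records (1 = Light, 0 = Regular)
    {"Income": 30000, "Age": 25, "Light": 1},
    {"Income": 28000, "Age": 33, "Light": 1},
    {"Income": 45000, "Age": 55, "Light": 0},
    {"Income": 52000, "Age": 40, "Light": 0},
]
print(best_split(records, ["Income", "Age"], "Light"))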


Determining the Best Split: Example


 Searching for the best split value for the income variable

[Figure: the Beer Preferences scatter plot of Age vs. Income (Regular vs. Light), used to illustrate scanning candidate split values along the Income axis.]

A numerical variable takes 5 values: 1, 2, 3, 4, 5.

How many possible splits will CART consider?

1. 0
2. 1
3. 2
4. 3
5. 4
6. 5
7. 6


Determining the Best Split


 What do we mean by “best”?
 We want to find the split that best discriminates between
records with different outcomes
 After the split we want the new sub-nodes to be more
homogenous in the outcome variable
 We need a measure of “homogeneousness”!
 There are (at least) three commonly used impurity measures:
   Entropy
   The Gini index (the one we will use for demonstration purposes)
   Twoing (the one used by XLMiner; it emphasizes equal splits over very narrow splits; for details see page 316 of Breiman et al., 1984, Classification and Regression Trees)


The Gini Index


 Is a measure of impurity for a node
 Is zero for pure nodes (all class 1 or all class 0)
 Is maximized when ___________________

[Figure: the Gini index as a function of the proportion of observations in class 1; the impurity is 0 at the pure ends and rises to a single peak in between.]

The Gini Index

 The algorithm will consider a split when the weighted Gini Index of the child nodes is smaller than the Gini Index of the parent node
 The weights for the child nodes are the proportions of the parent node's records that each child node “gets”
 After calculating the Gini index for all possible splits for all variables, it will pick the split that minimizes the weighted Gini Index of the child nodes (if it is an improvement)

[Diagram: a parent node with n records is split into child node 1 with n1 records and child node 2 with n2 records; the weights would be n1/n and n2/n.]
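To make the acceptance rule concrete, here is a small sketch in Python (assumed counts chosen for illustration, not the in-class example) using the two-class Gini formula given on the next slide:

# Accept a split only if the weighted Gini of the children beats the parent's.

def gini_from_counts(n1, n0):
    n = n1 + n0
    if n == 0:
        return 0.0
    p1 = n1 / n
    return 1 - p1 ** 2 - (1 - p1) ** 2

def consider_split(parent, child1, child2):
    """Each argument is a (class-1 count, class-0 count) pair."""
    n1, n2 = sum(child1), sum(child2)
    n = n1 + n2
    weighted = n1 / n * gini_from_counts(*child1) + n2 / n * gini_from_counts(*child2)
    return weighted < gini_from_counts(*parent), weighted

# A 12/12 parent split into a mostly-class-1 child and a mostly-class-0 child:
print(consider_split(parent=(12, 12), child1=(10, 2), child2=(2, 10)))
# -> (True, 0.278): the weighted Gini improves on the parent's 0.5, so split.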


The Gini Index


 Notation
 K - the number of classes
 In our case we will consider two classes
 p_i - the proportion of records belonging to class i
 The Gini Index is defined as:

GI = 1 - \sum_{i=1}^{K} p_i^2

 In our case (two classes) this simplifies to:

GI = 1 - p_1^2 - p_2^2
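As a quick numerical check of the two-class formula (a sketch, not course material):

# GI = 1 - p1^2 - p2^2 for a two-class node, evaluated at a few proportions.

def gini_two_class(p1):
    return 1 - p1 ** 2 - (1 - p1) ** 2

for p1 in (0.0, 0.1, 0.3, 0.5, 0.7, 1.0):
    print(f"p1 = {p1:.1f} -> GI = {gini_two_class(p1):.2f}")
# Pure nodes (p1 = 0 or 1) give GI = 0; the maximum, GI = 0.5, occurs at p1 = 0.5.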


The Gini Index

[Diagram: a parent node split into child node 1 and child node 2.]

 Assume we have 2 classes, class 1 and class 0
 We have a parent node that has 20 records; 50% is class 1 and 50% is class 0
 We are considering a split that would result in two nodes:
   Child node 1: 5 class 1 records
   Child node 2: 5 class 1, 10 class 0 records
 Using the Gini index, would we make the split?
   Step 1: The current Gini Index:
   Step 2: The Gini index of child node 1:
           The Gini index of child node 2:
   Step 3: The weighted average of the Gini index for child nodes 1 and 2:
   Step 4: The algorithm would/would not consider the split

Reminder: GI = 1 - \sum_{i=1}^{K} p_i^2


Finding the Best Split: Summary


 Consider all possible splits
 Calculate the weighted impurity measure of the child nodes
 Select the split that minimizes the weighted impurity measure of the child nodes


The Key Questions


[Figure: the beer-preference tree: Income split at 38,562, then Age splits at 37.5 and 48.5, with leaves labeled Light and Regular.]

1. How do we choose the split variable and the split value?
2. When should we stop?
3. What rule do we use for classification/prediction in the end nodes?
4. How do we classify a new record?


When should we stop growing the tree?


 One option is to stop when we can no longer find a split that improves the impurity measures
 There is a (large) chance of overfitting if we keep splitting the data until we only have very few points at each node
The goal is to arrive at a tree that captures the patterns but not the noise in the training data, therefore maximizing the prediction accuracy on new data.

[Figure: error on new data as the tree grows, annotated "Stop here" at the best tree size; to the left of that point the tree models the relationship between the outcome and the predictors, to the right it models noise in the training set.]

Avoiding Overfitting: Stopping Rules


 There are a number of stopping rules that one can use to avoid overfitting:
   Set a minimum number of records at a node
   Set a maximum number of splits
   Statistical significance of the split
 There is no simple good way to determine the right stopping point (it depends on the dataset)
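If you want to experiment with these stopping rules outside XLMiner, scikit-learn's DecisionTreeClassifier exposes analogous knobs. A sketch on synthetic data; the parameter values and data are illustrative only, not a recommendation:

# Stopping rules as tree-growing constraints in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=4, random_state=1)
tree = DecisionTreeClassifier(
    min_samples_split=20,  # minimum number of records at a node before it may be split
    max_depth=5,           # caps the number of levels, and so the number of splits
    random_state=1,
)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())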


Avoiding Overfitting: Pruning


 Pruning refers to using the validation sample to prune back (cut branches off) the fully grown tree

[Figure: validation error across the pruning sequence, annotated "Prune here" at the subtree with the lowest validation error.]

Pruning has been proven more successful in practice than stopping rules.

 Note: pruning uses the validation sample to select the best tree, so the performance of the pruned tree on the validation data is not fully reflective of the performance on completely new data
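XLMiner prunes against the validation sample directly. To reproduce the idea outside Excel, scikit-learn offers cost-complexity pruning instead; the sketch below (synthetic data, illustrative only) grows a full tree and then picks the pruning level that scores best on a held-out validation set:

# Grow a full tree, then choose the cost-complexity pruning level (ccp_alpha)
# whose pruned tree performs best on a held-out validation set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=6, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
candidates = [
    # max(a, 0.0) guards against tiny negative alphas from floating-point noise
    DecisionTreeClassifier(random_state=0, ccp_alpha=max(a, 0.0)).fit(X_tr, y_tr)
    for a in path.ccp_alphas
]
best = max(candidates, key=lambda t: t.score(X_val, y_val))
print(best.get_n_leaves(), round(best.score(X_val, y_val), 3))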

The Beer Example


 Selecting the “right” tree

How many decision nodes in the selected tree?


The Key Questions


[Figure: the beer-preference tree: Income split at 38,562, then Age splits at 37.5 and 48.5, with leaves labeled Light and Regular.]

1. How do we choose the split variable and the split value?
2. When should we stop?
3. What rule do we use for classification/prediction in the end nodes?
4. How do we classify a new record?


Decision Rules in the End Nodes


 Default: majority vote
 In the 2-class case, majority vote corresponds to setting the cut-off to 0.5
 Changing the cut-off:
   If you change the cut-off for the algorithm, the labeling will change
   For each end node, the probability p of class 1 is calculated based on the training data
   If p is above the cut-off, all members are labeled as 1; otherwise as 0
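Outside XLMiner, the same cut-off logic can be applied by thresholding the leaf's class-1 probability yourself. A sketch with synthetic data and an arbitrary 0.3 cut-off:

# Apply a custom cut-off to the class-1 probability of each record's end node.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=2)
tree = DecisionTreeClassifier(max_depth=3, random_state=2).fit(X, y)

p_class1 = tree.predict_proba(X)[:, 1]   # proportion of class 1 in each record's leaf
cutoff = 0.3                             # illustrative; the default majority vote is 0.5
labels = (p_class1 > cutoff).astype(int)
print(labels[:10])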


The Key Questions


[Figure: the beer-preference tree: Income split at 38,562, then Age splits at 37.5 and 48.5, with leaves labeled Light and Regular.]

1. How do we choose the split variable and the split value?
2. When should we stop?
3. What rule do we use for classification/prediction in the end nodes?
4. How do we classify a new record?
 Answer: We walk down the tree

How would you classify a 40-year-old person with $40,000 in annual income?


Converting a Tree into Rules


 We can translate the tree into decision rules
   If Income is less than or equal to $38,562 and Age ≤ 37.5, then we predict the person to like ________ beer
   If Income is less than or equal to $38,562 and Age is greater than 37.5, then we predict the person to prefer ________ beer
   etc.

[Figure: the beer-preference tree: Income split at 38,562, then Age splits at 37.5 and 48.5, with leaves labeled Light and Regular.]
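This tree-to-rules translation can also be automated. For example, scikit-learn's export_text prints one IF/THEN path per leaf; the sketch below uses synthetic two-predictor data, with "Income" and "Age" as purely illustrative column names:

# Print a fitted tree as nested IF/THEN rules.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=3)
tree = DecisionTreeClassifier(max_depth=2, random_state=3).fit(X, y)
print(export_text(tree, feature_names=["Income", "Age"]))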


Running Classification Trees in XLMiner


 We create the model in the same manner as before
[Screenshot of the XLMiner tree-growing dialog, annotated:]
 A stopping rule: the algorithm will not split up nodes with #records ≤ the value provided
 Stopping rules: the algorithm will NOT grow the tree beyond the # of levels/splits/nodes provided
 If these options are selected, XLMiner will draw pictures of these trees (up to seven levels deep)

Running Classification Trees in XLMiner


 In XLMiner 2018:

[Screenshot of the XLMiner 2018 dialog, annotated:]
 You want to make sure that you use either the best pruned or the minimum error tree for scoring if you have a validation sample
 Possible stopping rules if you do not have a validation set
 Select the trees that you want a picture of!
 Summarizes variable importance (new in XLMiner 2018)


Regression Trees (Numerical Outcome)


 Regression trees extend classification trees to a numerical outcome variable
 The trees are built in the same manner, except:
   Labels of leaf nodes are the averages of the observations in the node
   The impurity measures are different, to reflect the numerical values of the outcome
 A nice alternative to linear regression; works especially well with large datasets and many variables
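For readers working outside XLMiner, a regression-tree sketch in scikit-learn (synthetic data, not the airline example); each leaf predicts the mean of its training records, and feature_importances_ gives a ranking like the one on the next slide:

# Fit a regression tree and inspect its predictions and feature importances.
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=4)
reg = DecisionTreeRegressor(max_depth=4, min_samples_leaf=20, random_state=4)
reg.fit(X, y)

print(reg.predict(X[:3]))          # leaf means for the first three records
print(reg.feature_importances_)    # relative importance of each predictor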


Regression Tree – Airline Example


Feature     | Importance
DISTANCE    | 100.00
COUPON      |  38.00
VACATION    |  31.33
SW          |  27.13
S_POP       |  24.07
PAX         |  21.20
S_INCOME    |  17.69
HI          |  17.14
GATE        |   9.43
E_POP       |   8.11
NEW         |   7.50
E_INCOME    |   6.81
SLOT        |   0.00


Advantages and Disadvantages


 The good:
 Interpretability: Easy to explain and interpret trees (they represent rules)
 Tree growing is highly automated (no need for variable selection)
 Predictors can be nominal, ordinal, or continuous
 Robust to outliers
 No distributional assumptions
 Can be used as an exploratory tool
 The less good:
 Can require a large number of records
 Not useful if “rectangles” cannot capture the data structure


Case Study: Predicting the Outcome of the Supreme Court


Data Models vs. Experts


 The outcomes of the Supreme Court are of interest to non-profits, voters, and companies alike; anybody interested in long-term planning can benefit from knowing the outcomes of the court
 Legal experts regularly make predictions of the Court’s
decisions
 In 2002 a group of political scientists and law scholars
decided to test if they could build a prediction model that
could outperform a group of experts
Sources
1. Theodore W. Ruger, Pauline T. Kim, Andrew D. Martin, and Kevin M. Quinn, Competing approaches to predicting supreme
court decision making, Perspectives on Politics Symposium 2 (2004), no. 4, 761-767. Available at:
http://wusct.wustl.edu/media/man1.pdf
2. Theodore W. Ruger, Pauline T. Kim, Andrew D. Martin, and Kevin M. Quinn. The supreme court forecasting project: Legal and
political science approaches to predicting supreme court decisionmaking, The Columbia Law Review 104 (2004), no. 4, 1150-
1210. Available at: http://wusct.wustl.edu/media/man2.pdf

The US Supreme Court


 The highest court in the United States
 It has the ultimate jurisdiction over all federal courts and over state court cases involving issues of federal law, and more
 The Court consists of a chief justice and eight associate justices, who are nominated by the President and confirmed by the United States Senate
 Once appointed, justices have life tenure


The Path to the Supreme Court


 A case starts at a district court, where the initial decision is made
 The circuit courts hear appeals from the district courts and can change the decision that was made at the district court level
 If a circuit court decision is appealed, then it may make its way to the Supreme Court
 The cases often involve an interpretation of the Constitution and may have social, political, and economic consequences


How Would You Build the Model to Predict the Outcome of the Supreme Court?
 Data?

 Methods?

 Dependent variable?


Data Models vs. Experts


 The match-up: classification trees vs. the collective opinions of experts
 A team of 83 experts in the field of law, selected based on writings, training, referrals, and experience with the Supreme Court, including chaired professors (33) and former law school deans (5)
 Each expert only predicted cases within their area of expertise
 When the Supreme Court term started in October 2002, the model had been run and the experts had reached a decision
 The predictions were posted publicly on a website


A Two Stage Tree Model


 The first stage:
   Two trees: one to classify the ruling as a unanimous liberal decision and one to predict a unanimous conservative decision
   If these gave conflicting responses or both predicted “no”, the second stage was used to determine the prediction
 The second stage:
   A tree was built to predict the decision of each individual justice
   The majority decision of the nine trees is used as the final prediction
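A paraphrase of that two-stage logic as a pseudocode-style Python sketch (my sketch, not the authors' code; unanimous_liberal, unanimous_conservative, and justice_trees are hypothetical, previously fitted classifiers):

# Sketch of the two-stage prediction logic described above (hypothetical helpers).

def predict_case(case, unanimous_liberal, unanimous_conservative, justice_trees):
    lib = unanimous_liberal.predict(case)        # stage 1: unanimous-liberal tree
    con = unanimous_conservative.predict(case)   # stage 1: unanimous-conservative tree
    if lib == "yes" and con == "no":
        return "unanimous liberal decision"
    if con == "yes" and lib == "no":
        return "unanimous conservative decision"
    # Conflicting answers, or both "no": fall back to the nine per-justice trees
    votes = [tree.predict(case) for tree in justice_trees]
    return "reverse" if votes.count("reverse") > len(votes) / 2 else "affirm"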


The Data
 Cases from 1994 through 2001
 The Supreme Court had the same nine justices throughout this period: Breyer, Ginsburg, Kennedy, O'Connor, Chief Justice Rehnquist, Scalia, Souter, Stevens, and Thomas
 The independent variables:
   There are 6 in total: circuit court of origin, the issue of the case (for example civil rights or federal taxation), the type of petitioner, the type of respondent, the ideological direction of the lower court decision (liberal vs. conservative), and an indicator variable noting whether or not the petitioner argued that a law or practice was unconstitutional
 The dependent variables:
   There are 11 of them! A separate model is built for each justice; the dependent variable is 1 if the justice decided to reverse the lower court decision and 0 otherwise (affirm or maintain the lower court decision), plus two dependent variables for unanimous decisions


A Really Cool Data Modeling Trick

 In their study, not only did they use the 6 independent variables, they actually made it possible to have the prediction for one justice depend on another justice's prediction!

[Figure: the classification tree for Justice Thomas. Source: Figure 9 in [2]. Available at: http://wusct.wustl.edu/media/man2.pdf]


Trees for Individual Justices


[Figure: the classification trees for Justice O'Connor and Justice Rehnquist.]

The Results
 To the surprise of almost everyone, the models outperformed the experts
 The models and experts were compared using 68 cases
 The models predicted correctly in 75% of the cases, while the experts predicted 59% (avg. individual performance) to 66% (majority rule) of the cases correctly


Predictions of Individual Justices


 Neither the models nor the experts dominate the predictions for individual justices

[Figure: per-justice prediction accuracy for models vs. experts. Source: Figure 2 in [2]. Available at: http://wusct.wustl.edu/media/man2.pdf]

Data Models vs. Experts


 William Grove and co-authors completed a meta-analysis of 136 man vs. machine studies of “human health and behavior”
 In 128 out of 136 studies, the models outperformed the experts
 However, the best models use human expertise for model building and evaluation
 Combine human intuition and reasoning with consistent and unemotional models


Summary

Where are we?


 We started by discussing the data mining process and
different data mining tasks
 Our focus since has been on prediction and classification
Data mining task | Methods (examples)                              | Performance measures (examples)                                      | Selection criteria: variable selection (regression)
Prediction       | Linear Regression, k-NN, Regression Trees       | RMSE, MAD, Adjusted R2, “business impact”                            | Adjusted R2, Mallow's Cp, probability
Classification   | Logistic Regression, k-NN, Classification Trees | Sensitivity, specificity, misclassification costs, “business impact” | Mallow's Cp, probability

And where are we going?

 In the next lecture we will introduce a very powerful idea: ensemble methods, how to combine multiple models to boost predictive performance
 On the horizon is a collection of topics that explore other aspects of data mining
 Clustering
 Association Rules
 Time Series


On the horizon
 Tomorrow:
 R tutorial II
 Wednesday:
 Ensemble methods
 Friday:
 Office hours
 Individual Assignment III
 Team Projects:
 Start collecting the data and running some exploratory analysis
on it, even fitting first models to get a sense of what you have


Appendix

Notes on the “Best Pruned Tree”*


 The XLMiner output from the pruning phase highlights another tree
besides the Minimum Error Tree. This is the Best Pruned Tree. The reason
this tree is important is that it is the smallest tree in the pruning
sequence that has an error that is within one standard error of the
Minimum Error Tree.
 The estimate of error that we get from the validation data is just that: it
is an estimate. If we had had another set of validation data the minimum
error would have been different. The minimum error rate we have
computed can be viewed as an observed value of a random variable with
standard error equal to:
Emin (1  Emin )
N val
where Emin is the error rate (as a fraction) for the minimum error tree
and Nval is the number of observations in the validation data set
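In numbers, with illustrative values (not the beer-example output):

# Standard error of the minimum validation error, and the one-standard-error rule
# that defines the Best Pruned Tree. E_min and N_val below are made-up values.
import math

e_min, n_val = 0.18, 600
se = math.sqrt(e_min * (1 - e_min) / n_val)
threshold = e_min + se   # Best Pruned Tree: smallest tree with validation error <= this
print(f"SE = {se:.4f}; keep the smallest tree with validation error <= {threshold:.4f}")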



* Adapted with minor changes from MIT OpenCourseWare
