Classification and Regression Trees

Professor Margrét Vilborg Bjarnadóttir

Today’s Lecture
 Adding to our toolbox: Classification and regression trees
 The main ideas behind classification trees
 The key methodological questions
 Case Study

Our goal this morning:


• To have a clear idea of what happens behind the scenes when XLMiner runs Classification trees
• To understand the value of pruning back the trees with our validation data
• To learn about applying Classification Trees in practice
Bonus goal:
• Be able to apply Regression Trees


Classification Trees
…and Regression Trees

Motivating Example: Beer Preference


 Hacker Pschorr
 One of the oldest beer brewing companies in Munich
 Collects data on beer preference (light/regular) and demographic info
 Goal: determine demographic factors for preferring light beer
 We will first focus on two predictors: Income and Age


The Underlying Idea


 Recursively separating the records into subgroups by creating splits on the predictors
 This splitting of the data set can be visualized as trees.

[Figure: the full data set at the root ("All data") is split on Income at 38,562 (≤38,562 vs. >38,562); the Beer Preferences scatter plot of Age vs. Income shows Regular and Light drinkers with the corresponding vertical split line.]


The Underlying Idea (continued)

[Figure: the same scatter plot of Age vs. Income with the tree grown one level deeper: after the Income split at 38,562, the low-income branch is split on Age at 37.5 and the high-income branch on Age at 48.5, carving the plot into rectangular regions of mostly Regular or mostly Light drinkers.]


In-class Exercise
 The following tree was obtained using a random 70/30 split. What does the decision boundary look like?


The Key Questions


[Figure: the beer-preference tree: Income split at 38,562, then Age splits at 37.5 and 48.5, with leaves labeled Light and Regular.]

1. How do we choose the split variable and the split value?
2. When should we stop?
3. What rule do we use for classification/prediction in the end nodes?
4. How do we classify a new record?


Determining the Best Split


 The CART algorithm (Classification And Regression Trees) evaluates all possible binary splits (exhaustive search):
   For each variable
     For each possible split value (on that variable)
       Calculate the impurity of the resulting sub-nodes
       Summarize the impurity of the split as the weighted average of the impurities of the sub-nodes
 Select the best “variable-value” split
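XLMiner runs this search internally. For intuition only, here is a minimal Python sketch (not XLMiner's actual code) of the exhaustive "variable-value" search for a single node, using the Gini index (defined a few slides below) as the impurity measure; the function names and the sample records are illustrative, not course data:

# Exhaustive CART-style split search for one node (illustrative sketch).

def gini(labels):
    """Gini impurity of a collection of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1 - p1 ** 2 - (1 - p1) ** 2

def best_split(records, predictors, target):
    """Return (variable, split value, weighted Gini) of the best binary split."""
    n = len(records)
    best = None
    for var in predictors:
        values = sorted({r[var] for r in records})
        # Candidate split points: midpoints between consecutive distinct values
        midpoints = [(a + b) / 2 for a, b in zip(values, values[1:])]
        for v in midpoints:
            left = [r[target] for r in records if r[var] <= v]
            right = [r[target] for r in records if r[var] > v]
            w_gini = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or w_gini < best[2]:
                best = (var, v, w_gini)
    return best

records = [  # made-up beer-preference records (1 = Light, 0 = Regular)
    {"Income": 30000, "Age": 25, "Light": 1},
    {"Income": 28000, "Age": 33, "Light": 1},
    {"Income": 45000, "Age": 55, "Light": 0},
    {"Income": 52000, "Age": 40, "Light": 0},
]
print(best_split(records, ["Income", "Age"], "Light"))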


Determining the Best Split: Example


 Searching for the best split value for the income variable

[Figure: the Beer Preferences scatter plot of Age vs. Income (Regular vs. Light), used to illustrate scanning candidate split values along the Income axis.]

A numerical variable takes 5 values: 1, 2, 3, 4, 5.

How many possible splits will CART consider?

1. 0
2. 1
3. 2
4. 3
5. 4
6. 5
7. 6


Determining the Best Split


 What do we mean by “best”?
 We want to find the split that best discriminates between
records with different outcomes
 After the split we want the new sub-nodes to be more
homogenous in the outcome variable
 We need a measure of “homogeneousness”!
 There are (at least) three commonly used impurity measures:
   Entropy
   The Gini index (the one we will use for demonstration purposes)
   Twoing (the one used by XLMiner; it emphasizes equal splits over very narrow splits; for details see page 316 of Breiman et al., 1984, Classification and Regression Trees)


The Gini Index


 Is a measure of impurity for a node
 Is zero for pure nodes (all class 1 or all class 0)
 Is maximized when ___________________

[Figure: the Gini index as a function of the proportion of observations in class 1; the impurity is 0 at the pure ends and rises to a single peak in between.]

The Gini Index

 The algorithm will consider a split when the weighted Gini Index of the child nodes is smaller than the Gini Index of the parent node
 The weights for the child nodes are the proportions of the parent node's records that each child node “gets”
 After calculating the Gini index for all possible splits for all variables, it will pick the split that minimizes the weighted Gini Index of the child nodes (if it is an improvement)

[Diagram: a parent node with n records is split into child node 1 with n1 records and child node 2 with n2 records; the weights would be n1/n and n2/n.]
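To make the acceptance rule concrete, here is a small sketch in Python (assumed counts chosen for illustration, not the in-class example) using the two-class Gini formula given on the next slide:

# Accept a split only if the weighted Gini of the children beats the parent's.

def gini_from_counts(n1, n0):
    n = n1 + n0
    if n == 0:
        return 0.0
    p1 = n1 / n
    return 1 - p1 ** 2 - (1 - p1) ** 2

def consider_split(parent, child1, child2):
    """Each argument is a (class-1 count, class-0 count) pair."""
    n1, n2 = sum(child1), sum(child2)
    n = n1 + n2
    weighted = n1 / n * gini_from_counts(*child1) + n2 / n * gini_from_counts(*child2)
    return weighted < gini_from_counts(*parent), weighted

# A 12/12 parent split into a mostly-class-1 child and a mostly-class-0 child:
print(consider_split(parent=(12, 12), child1=(10, 2), child2=(2, 10)))
# -> (True, 0.278): the weighted Gini improves on the parent's 0.5, so split.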


The Gini Index


 Notation
 K - the number of classes
 In our case we will consider two classes
 p_i - the proportion of records belonging to class i
 The Gini Index is defined as:

GI = 1 - \sum_{i=1}^{K} p_i^2

 In our case (two classes) this simplifies to:

GI = 1 - p_1^2 - p_2^2
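As a quick numerical check of the two-class formula (a sketch, not course material):

# GI = 1 - p1^2 - p2^2 for a two-class node, evaluated at a few proportions.

def gini_two_class(p1):
    return 1 - p1 ** 2 - (1 - p1) ** 2

for p1 in (0.0, 0.1, 0.3, 0.5, 0.7, 1.0):
    print(f"p1 = {p1:.1f} -> GI = {gini_two_class(p1):.2f}")
# Pure nodes (p1 = 0 or 1) give GI = 0; the maximum, GI = 0.5, occurs at p1 = 0.5.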


The Gini Index

[Diagram: a parent node split into child node 1 and child node 2.]

 Assume we have 2 classes, class 1 and class 0
 We have a parent node that has 20 records; 50% is class 1 and 50% is class 0
 We are considering a split that would result in two nodes:
   Child node 1: 5 class 1 records
   Child node 2: 5 class 1, 10 class 0 records
 Using the Gini index, would we make the split?
   Step 1: The current Gini Index:
   Step 2: The Gini index of child node 1:
           The Gini index of child node 2:
   Step 3: The weighted average of the Gini index for child nodes 1 and 2:
   Step 4: The algorithm would/would not consider the split

Reminder: GI = 1 - \sum_{i=1}^{K} p_i^2


Finding the Best Split: Summary


 Consider all possible splits
 Calculate the weighted impurity measure of the child nodes
 Select the split that minimizes the weighted impurity measure of the child nodes


The Key Questions


[Figure: the beer-preference tree: Income split at 38,562, then Age splits at 37.5 and 48.5, with leaves labeled Light and Regular.]

1. How do we choose the split variable and the split value?
2. When should we stop?
3. What rule do we use for classification/prediction in the end nodes?
4. How do we classify a new record?


When should we stop growing the tree?


 One option is to stop when we can no longer find a split that improves the impurity measures
 There is a (large) chance of overfitting if we keep splitting the data until we only have very few points at each node
The goal is to arrive at a tree that captures the patterns but not the noise in the training data, therefore maximizing the prediction accuracy on new data.

[Figure: error on new data as the tree grows, annotated "Stop here" at the best tree size; to the left of that point the tree models the relationship between the outcome and the predictors, to the right it models noise in the training set.]

Avoiding Overfitting: Stopping Rules


 There are a number of stopping rules that one can use to avoid overfitting:
   Set a minimum number of records at a node
   Set a maximum number of splits
   Statistical significance of the split
 There is no simple good way to determine the right stopping point (it depends on the dataset)
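If you want to experiment with these stopping rules outside XLMiner, scikit-learn's DecisionTreeClassifier exposes analogous knobs. A sketch on synthetic data; the parameter values and data are illustrative only, not a recommendation:

# Stopping rules as tree-growing constraints in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=4, random_state=1)
tree = DecisionTreeClassifier(
    min_samples_split=20,  # minimum number of records at a node before it may be split
    max_depth=5,           # caps the number of levels, and so the number of splits
    random_state=1,
)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())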


Avoiding Overfitting: Pruning


 Pruning refers to using the validation sample to prune back (cut branches off) the fully grown tree

[Figure: validation error across the pruning sequence, annotated "Prune here" at the subtree with the lowest validation error.]

Pruning has been proven more successful in practice than stopping rules.

 Note: pruning uses the validation sample to select the best tree, so the performance of the pruned tree on the validation data is not fully reflective of the performance on completely new data
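XLMiner prunes against the validation sample directly. To reproduce the idea outside Excel, scikit-learn offers cost-complexity pruning instead; the sketch below (synthetic data, illustrative only) grows a full tree and then picks the pruning level that scores best on a held-out validation set:

# Grow a full tree, then choose the cost-complexity pruning level (ccp_alpha)
# whose pruned tree performs best on a held-out validation set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=6, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
candidates = [
    # max(a, 0.0) guards against tiny negative alphas from floating-point noise
    DecisionTreeClassifier(random_state=0, ccp_alpha=max(a, 0.0)).fit(X_tr, y_tr)
    for a in path.ccp_alphas
]
best = max(candidates, key=lambda t: t.score(X_val, y_val))
print(best.get_n_leaves(), round(best.score(X_val, y_val), 3))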

The Beer Example


 Selecting the “right” tree

How many decision nodes in the selected tree?


The Key Questions


[Figure: the beer-preference tree: Income split at 38,562, then Age splits at 37.5 and 48.5, with leaves labeled Light and Regular.]

1. How do we choose the split variable and the split value?
2. When should we stop?
3. What rule do we use for classification/prediction in the end nodes?
4. How do we classify a new record?


Decision Rules in the End Nodes


 Default: majority vote
 In the 2-class case, majority vote corresponds to setting the cut-off to 0.5
 Changing the cut-off:
   If you change the cut-off for the algorithm, the labeling will change
   For each end node, the probability p of class 1 is calculated based on the training data
   If p is above the cut-off, all members are labeled as 1; otherwise as 0
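Outside XLMiner, the same cut-off logic can be applied by thresholding the leaf's class-1 probability yourself. A sketch with synthetic data and an arbitrary 0.3 cut-off:

# Apply a custom cut-off to the class-1 probability of each record's end node.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=2)
tree = DecisionTreeClassifier(max_depth=3, random_state=2).fit(X, y)

p_class1 = tree.predict_proba(X)[:, 1]   # proportion of class 1 in each record's leaf
cutoff = 0.3                             # illustrative; the default majority vote is 0.5
labels = (p_class1 > cutoff).astype(int)
print(labels[:10])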


The Key Questions


[Figure: the beer-preference tree: Income split at 38,562, then Age splits at 37.5 and 48.5, with leaves labeled Light and Regular.]

1. How do we choose the split variable and the split value?
2. When should we stop?
3. What rule do we use for classification/prediction in the end nodes?
4. How do we classify a new record?
 Answer: We walk down the tree

How would you classify a 40-year-old person with $40,000 in annual income?


Converting a Tree into Rules


 We can translate the tree into decision rules
   If Income is less than or equal to $38,562 and Age ≤ 37.5, then we predict the person to like ________ beer
   If Income is less than or equal to $38,562 and Age is greater than 37.5, then we predict the person to prefer ________ beer
   etc.

[Figure: the beer-preference tree: Income split at 38,562, then Age splits at 37.5 and 48.5, with leaves labeled Light and Regular.]
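This tree-to-rules translation can also be automated. For example, scikit-learn's export_text prints one IF/THEN path per leaf; the sketch below uses synthetic two-predictor data, with "Income" and "Age" as purely illustrative column names:

# Print a fitted tree as nested IF/THEN rules.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=3)
tree = DecisionTreeClassifier(max_depth=2, random_state=3).fit(X, y)
print(export_text(tree, feature_names=["Income", "Age"]))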


Running Classification Trees in XLMiner


 We create the model in the same manner as before
[Screenshot of the XLMiner tree-growing dialog, annotated:]
 A stopping rule: the algorithm will not split up nodes with #records ≤ the value provided
 Stopping rules: the algorithm will NOT grow the tree beyond the # of levels/splits/nodes provided
 If these options are selected, XLMiner will draw pictures of these trees (up to seven levels deep)

Running Classification Trees in XLMiner


 In XLMiner 2018:

[Screenshot of the XLMiner 2018 dialog, annotated:]
 You want to make sure that you use either the best pruned or the minimum error tree for scoring if you have a validation sample
 Possible stopping rules if you do not have a validation set
 Select the trees that you want a picture of!
 Summarizes variable importance (new in XLMiner 2018)


Regression Trees (Numerical Outcome)


 Regression trees extend classification trees to a numerical outcome variable
 The trees are built in the same manner, except:
   Labels of leaf nodes are the averages of the observations in the node
   The impurity measures are different, to reflect the numerical values of the outcome
 A nice alternative to linear regression; works especially well with large datasets and many variables
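For readers working outside XLMiner, a regression-tree sketch in scikit-learn (synthetic data, not the airline example); each leaf predicts the mean of its training records, and feature_importances_ gives a ranking like the one on the next slide:

# Fit a regression tree and inspect its predictions and feature importances.
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=4)
reg = DecisionTreeRegressor(max_depth=4, min_samples_leaf=20, random_state=4)
reg.fit(X, y)

print(reg.predict(X[:3]))          # leaf means for the first three records
print(reg.feature_importances_)    # relative importance of each predictor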


Regression Tree – Airline Example


Feature     | Importance
DISTANCE    | 100.00
COUPON      |  38.00
VACATION    |  31.33
SW          |  27.13
S_POP       |  24.07
PAX         |  21.20
S_INCOME    |  17.69
HI          |  17.14
GATE        |   9.43
E_POP       |   8.11
NEW         |   7.50
E_INCOME    |   6.81
SLOT        |   0.00


Advantages and Disadvantages


 The good:
 Interpretability: Easy to explain and interpret trees (they represent rules)
 Tree growing is highly automated (no need for variable selection)
 Predictors can be nominal, ordinal, or continuous
 Robust to outliers
 No distributional assumptions
 Can be used as an exploratory tool
 The less good:
 Can require a large number of records
 Not useful if “rectangles” cannot capture the data structure


Case Study: Predicting the Outcome of the Supreme Court


Data Models vs. Experts


 The outcomes of the Supreme Court are of interest to non-profits, voters, and companies alike; anybody interested in long-term planning can benefit from knowing the outcomes of the court
 Legal experts regularly make predictions of the Court’s
decisions
 In 2002 a group of political scientists and law scholars
decided to test if they could build a prediction model that
could outperform a group of experts
Sources
1. Theodore W. Ruger, Pauline T. Kim, Andrew D. Martin, and Kevin M. Quinn, Competing approaches to predicting supreme
court decision making, Perspectives on Politics Symposium 2 (2004), no. 4, 761-767. Available at:
http://wusct.wustl.edu/media/man1.pdf
2. Theodore W. Ruger, Pauline T. Kim, Andrew D. Martin, and Kevin M. Quinn. The supreme court forecasting project: Legal and
political science approaches to predicting supreme court decisionmaking, The Columbia Law Review 104 (2004), no. 4, 1150-
1210. Available at: http://wusct.wustl.edu/media/man2.pdf

The US Supreme Court


 The highest court in the United States
 It has the ultimate jurisdiction over all federal courts and over state court cases involving issues of federal law, and more
 The Court consists of a chief justice and eight associate justices, who are nominated by the President and confirmed by the United States Senate
 Once appointed, justices have life tenure


The Path to the Supreme Court


 A case starts at a district court, where the initial decision is made
 The circuit courts hear appeals from the district courts and can change the decision that was made at the district court level
 If a circuit court decision is appealed, then it may make its way to the Supreme Court
 The cases often involve an interpretation of the Constitution and may have social, political, and economic consequences


How Would You Build the Model to Predict the Outcome of the Supreme Court?
 Data?

 Methods?

 Dependent variable?


Data Models vs. Experts


 The match-up: classification trees vs. the collective opinions of experts
 A team of 83 experts in the field of law, selected based on writings, training, referrals, and experience with the Supreme Court, including chaired professors (33) and former law school deans (5)
 Each expert only predicted cases within their area of expertise
 When the Supreme Court term started in October 2002, the model had been run and the experts had reached a decision
 The predictions were posted publicly on a website


A Two Stage Tree Model


 The first stage:
   Two trees: one to classify the ruling as a unanimous liberal decision and one to predict a unanimous conservative decision
   If these gave conflicting responses or both predicted “no”, the second stage was used to determine the prediction
 The second stage:
   A tree was built to predict the decision of each individual justice
   The majority decision of the nine trees is used as the final prediction
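A paraphrase of that two-stage logic as a pseudocode-style Python sketch (my sketch, not the authors' code; unanimous_liberal, unanimous_conservative, and justice_trees are hypothetical, previously fitted classifiers):

# Sketch of the two-stage prediction logic described above (hypothetical helpers).

def predict_case(case, unanimous_liberal, unanimous_conservative, justice_trees):
    lib = unanimous_liberal.predict(case)        # stage 1: unanimous-liberal tree
    con = unanimous_conservative.predict(case)   # stage 1: unanimous-conservative tree
    if lib == "yes" and con == "no":
        return "unanimous liberal decision"
    if con == "yes" and lib == "no":
        return "unanimous conservative decision"
    # Conflicting answers, or both "no": fall back to the nine per-justice trees
    votes = [tree.predict(case) for tree in justice_trees]
    return "reverse" if votes.count("reverse") > len(votes) / 2 else "affirm"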


The Data
 Cases from 1994 through 2001
 The Supreme Court had the same nine justices throughout this period: Breyer, Ginsburg, Kennedy, O'Connor, Chief Justice Rehnquist, Scalia, Souter, Stevens, and Thomas
 The independent variables:
   There are 6 in total: circuit court of origin, the issue of the case (for example civil rights or federal taxation), the type of petitioner, the type of respondent, the ideological direction of the lower court decision (liberal vs. conservative), and an indicator variable noting whether or not the petitioner argued that a law or practice was unconstitutional
 The dependent variables:
   There are 11 of them! A separate model is built for each justice; the dependent variable is 1 if the justice decided to reverse the lower court decision and 0 otherwise (affirm or maintain the lower court decision), plus two dependent variables for unanimous decisions


A Really Cool Data Modeling Trick

 In their study, not only did they use the 6 independent variables, they actually made it possible to have the prediction for one justice depend on another justice's prediction!

[Figure: the classification tree for Justice Thomas. Source: Figure 9 in [2]. Available at: http://wusct.wustl.edu/media/man2.pdf]


Trees for Individual Justices


[Figure: the classification trees for Justice O'Connor and Justice Rehnquist.]

The Results
 To the surprise of almost everyone, the models outperformed the experts
 The models and experts were compared using 68 cases
 The models predicted correctly in 75% of the cases, while the experts predicted 59% (avg. individual performance) to 66% (majority rule) of the cases correctly


Predictions of Individual Justices


 Neither the models nor the experts dominate the predictions for individual justices

[Figure: per-justice prediction accuracy for models vs. experts. Source: Figure 2 in [2]. Available at: http://wusct.wustl.edu/media/man2.pdf]

Data Models vs. Experts


 William Grove and co-authors completed a meta-analysis of 136 man vs. machine studies of “human health and behavior”
 In 128 out of 136 studies, the models outperformed the experts
 However, the best models use human expertise for model building and evaluation
 Combine human intuition and reasoning with consistent and unemotional models


Summary

Where are we?


 We started by discussing the data mining process and
different data mining tasks
 Our focus since has been on prediction and classification
Data mining task | Methods (examples)                              | Performance measures (examples)                                      | Selection criteria: variable selection (regression)
Prediction       | Linear Regression, k-NN, Regression Trees       | RMSE, MAD, Adjusted R2, “business impact”                            | Adjusted R2, Mallow's Cp, probability
Classification   | Logistic Regression, k-NN, Classification Trees | Sensitivity, specificity, misclassification costs, “business impact” | Mallow's Cp, probability

And where are we going?

 In the next lecture we will introduce a very powerful idea: ensemble methods, how to combine multiple models to boost predictive performance
 On the horizon is a collection of topics that explore other aspects of data mining
 Clustering
 Association Rules
 Time Series


On the horizon
 Tomorrow:
 R tutorial II
 Wednesday:
 Ensemble methods
 Friday:
 Office hours
 Individual Assignment III
 Team Projects:
 Start collecting the data and running some exploratory analysis
on it, even fitting first models to get a sense of what you have


Appendix

Notes on the “Best Pruned Tree”*


 The XLMiner output from the pruning phase highlights another tree
besides the Minimum Error Tree. This is the Best Pruned Tree. The reason
this tree is important is that it is the smallest tree in the pruning
sequence that has an error that is within one standard error of the
Minimum Error Tree.
 The estimate of error that we get from the validation data is just that: it
is an estimate. If we had had another set of validation data the minimum
error would have been different. The minimum error rate we have
computed can be viewed as an observed value of a random variable with
standard error equal to:
Emin (1  Emin )
N val
where Emin is the error rate (as a fraction) for the minimum error tree
and Nval is the number of observations in the validation data set
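In numbers, with illustrative values (not the beer-example output):

# Standard error of the minimum validation error, and the one-standard-error rule
# that defines the Best Pruned Tree. E_min and N_val below are made-up values.
import math

e_min, n_val = 0.18, 600
se = math.sqrt(e_min * (1 - e_min) / n_val)
threshold = e_min + se   # Best Pruned Tree: smallest tree with validation error <= this
print(f"SE = {se:.4f}; keep the smallest tree with validation error <= {threshold:.4f}")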



* Adapted with minor changes from MIT OpenCourseWare
