Today’s Lecture
Adding to our toolbox: Classification and regression trees
The main ideas behind classification trees
The key methodological questions
Case Study
2/8/2019
Classification Trees
…and Regression Trees
[Figure: "Beer Preferences" scatter plot of Age (0 to 100) against Income ($0 to $80,000), with classes Regular and Light. Tree so far: All data, split into Income ≤ 38,562 and Income > 38,562.]
[Figure: the same plot after a second level of splits. Tree: All data, split into Income ≤ 38,562 and Income > 38,562; each income branch is then split on Age. Reading the leaves left to right: Age ≤ 37.5, Age > 37.5, Age ≤ 48.5, Age > 48.5.]
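The two-level tree in the figure can be read as nested if/else rules, each leaf being a rectangle in the Income/Age plane. The sketch below is only an illustration: it assumes the 37.5 age threshold belongs to the lower-income branch and the 48.5 threshold to the higher-income branch (the left-to-right order in the figure), and the function name `leaf_region` is made up for this example. The class predicted in each leaf is not given in the figure, so the function returns only the region.

```python
# Thresholds as they appear in the figure.
INCOME_SPLIT = 38_562
AGE_SPLIT_LOW_INCOME = 37.5   # assumption: applies to the Income <= 38,562 branch
AGE_SPLIT_HIGH_INCOME = 48.5  # assumption: applies to the Income > 38,562 branch


def leaf_region(income: float, age: float) -> str:
    """Return the leaf (a rectangle in the Income/Age plane) a record falls into."""
    if income <= INCOME_SPLIT:
        if age <= AGE_SPLIT_LOW_INCOME:
            return "Income <= 38,562 and Age <= 37.5"
        return "Income <= 38,562 and Age > 37.5"
    if age <= AGE_SPLIT_HIGH_INCOME:
        return "Income > 38,562 and Age <= 48.5"
    return "Income > 38,562 and Age > 48.5"


print(leaf_region(30_000, 25))
print(leaf_region(60_000, 55))
```

Every split tests one variable against one threshold, so the boundaries between leaves are always horizontal or vertical line segments.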
In-class Exercise
The following tree was obtained using a random 70/30
split. What does the decision boundary look like?
[Figure: "Beer Preferences" scatter plot of Age (0 to 100) against Income ($0 to $80,000), with classes Regular and Light.]
Data Analytics - Spring 2019
Twoing
The criterion used by XLMiner. It favors roughly equal splits
over very narrow splits. For details, see page 316
of Breiman et al. (1984), Classification and Regression Trees
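To make the preference for balanced splits concrete, here is a minimal sketch of the twoing value as defined in Breiman et al.: (P_L · P_R / 4) · (Σ_j |p(j|left) − p(j|right)|)², where P_L and P_R are the fractions of records sent left and right. Whether XLMiner computes exactly this quantity is an assumption; the function name `twoing` and the toy label lists are made up for illustration.

```python
from collections import Counter


def twoing(left_labels, right_labels):
    """Twoing value of a binary split (larger is better)."""
    n_left, n_right = len(left_labels), len(right_labels)
    if n_left == 0 or n_right == 0:
        return 0.0  # a split that sends everything one way separates nothing
    n = n_left + n_right
    left_counts, right_counts = Counter(left_labels), Counter(right_labels)
    classes = set(left_counts) | set(right_counts)
    # Sum over classes of the absolute difference in class proportions.
    diff = sum(
        abs(left_counts[c] / n_left - right_counts[c] / n_right) for c in classes
    )
    return (n_left / n) * (n_right / n) / 4 * diff ** 2


# Toy data: 20 records, 10 of class 1 and 10 of class 0.
balanced = twoing([1] * 10, [0] * 10)          # 10 / 10 split
narrow = twoing([1], [1] * 9 + [0] * 10)       # 1 / 19 split, left child still pure
print(balanced, narrow)
```

Both splits produce a pure left child, but the P_L · P_R factor rewards the 10/10 split far more than the 1/19 split.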
Gini Index
Is maximized when ___________________
[Figure: plot of the Gini index; the horizontal axis runs from 0 to 0.96 in steps of 0.06.]
$GI = 1 - p_1^2 - p_2^2$
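A quick numerical check of the two-class formula, assuming p2 = 1 − p1: the sketch below evaluates the Gini index on a grid of p1 values and reports where it peaks. The helper name `gini` and the grid resolution are choices made for this example.

```python
def gini(p1: float) -> float:
    """Two-class Gini index, with p2 = 1 - p1."""
    p2 = 1.0 - p1
    return 1.0 - p1 ** 2 - p2 ** 2


# Evaluate on p1 = 0.00, 0.01, ..., 1.00 and find the maximizer.
grid = [i / 100 for i in range(101)]
best_p1 = max(grid, key=gini)
print(best_p1, gini(best_p1))  # peak at p1 = 0.5, where the index equals 0.5
print(gini(0.0), gini(1.0))    # a pure node has index 0
```

The index is largest when the two classes are equally represented and falls to zero when the node contains a single class.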
The parent node
We have a parent node with 20 records: 50% are class 1 and 50%
are class 0
We are considering a split that would result in two child nodes:
Node A: 5 class 1 records
Node B: 5 class 1, 10 class 0 records
Using the Gini index, would we make the split?
Step 1: The current Gini index:
Step 2: The Gini index of node A:
The Gini index of node B:
Step 3: The weighted average of the Gini index for nodes A and B:
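The three steps can be worked through in a few lines. This is a sketch of the arithmetic only; the helper name `gini_from_counts` is made up, and the class counts are taken from the exercise (parent: 10 and 10; first child: 5 class 1 records; second child: 5 class 1 and 10 class 0).

```python
def gini_from_counts(counts):
    """Gini index of a node given its per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


parent = [10, 10]  # 20 records, half class 1 and half class 0
node_a = [5, 0]    # first child: 5 class 1 records
node_b = [5, 10]   # second child: 5 class 1, 10 class 0 records

# Step 1: Gini index of the current (parent) node.
gi_parent = gini_from_counts(parent)
# Step 2: Gini index of each child node.
gi_a = gini_from_counts(node_a)
gi_b = gini_from_counts(node_b)
# Step 3: average of the child indices, weighted by node size.
n_a, n_b = sum(node_a), sum(node_b)
gi_split = (n_a * gi_a + n_b * gi_b) / (n_a + n_b)

print(gi_parent, gi_a, round(gi_b, 3), round(gi_split, 3))
print("make the split:", gi_split < gi_parent)
```

The parent has index 0.5, node A is pure (0), node B has 4/9 ≈ 0.444, and the weighted average is 1/3 ≈ 0.333. Since 0.333 is below 0.5, the split reduces impurity.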
The tree models the relationship between the outcome and the predictors
The tree models noise in the training set
[Figure labels: "Prune here", "Validation"]
Pruning has proven more successful in practice than stopping rules
Note: pruning uses the validation sample to select the best tree, so
the performance of the pruned tree on the validation data is not
fully reflective of its performance on completely new data
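The optimism described in the note can be seen in a small simulation. This is not a pruning algorithm: the "candidate trees" are stand-ins that guess at random, and the labels are coin flips, so no candidate can truly do better than 50%. All sizes (100 validation records, 2,000 new records, 200 candidates) are arbitrary choices for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Coin-flip labels: the true accuracy of any model here is 50%.
val_labels = [random.randint(0, 1) for _ in range(100)]
new_labels = [random.randint(0, 1) for _ in range(2_000)]

# Stand-ins for candidate trees: each one predicts at random.
candidates = []
for _ in range(200):
    val_preds = [random.randint(0, 1) for _ in val_labels]
    new_preds = [random.randint(0, 1) for _ in new_labels]
    candidates.append(
        (accuracy(val_preds, val_labels), accuracy(new_preds, new_labels))
    )

# Select the candidate with the best validation accuracy, as pruning does.
best_val_acc, best_new_acc = max(candidates, key=lambda c: c[0])
print("validation accuracy of the selected candidate:", best_val_acc)
print("its accuracy on new data:", best_new_acc)
```

The selected candidate looks good on the validation data only because it was chosen for that; its accuracy on the fresh records is the honest estimate, which is why a separate test set is needed after pruning.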
If these options are selected, XLMiner will draw pictures of these trees, up to seven levels deep
Methods?
Dependent variable?
The Data
Cases from 1994 through 2001
The Supreme Court had the same nine justices throughout this period:
Breyer, Ginsburg, Kennedy, O’Connor, Chief Justice Rehnquist, Scalia, Souter,
Stevens and Thomas
The independent variables:
There are 6 in total: circuit court of origin, the issue of the case (for
example, civil rights or federal taxation), the type of petitioner, the type of
respondent, the ideological direction of the lower court decision (liberal vs.
conservative), and an indicator variable noting whether or not the petitioner
argued that a law or practice was unconstitutional
The dependent variable:
There are 11 of them! A separate model is built for each of the nine justices: the
dependent variable is 1 if the justice decided to reverse the lower court
decision and 0 otherwise (affirm or maintain the lower court decision). Two more
dependent variables cover unanimous decisions
Modeling Trick
In their study, they did not only use the 6 independent variables: they also made it possible for the prediction for one justice to depend on another justice’s prediction!
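One way to picture this trick is to run the per-justice models in a fixed order and let each model see the predictions already made. The sketch below shows only that structure. Everything in it is hypothetical: the field names, the placeholder justices "Justice A" and "Justice B", the example case, and both toy rules are made up for illustration and are not the rules from the study.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Case:
    """The 6 independent variables (field names are hypothetical)."""
    circuit: str
    issue: str
    petitioner_type: str
    respondent_type: str
    lower_court_direction: str  # "liberal" or "conservative"
    unconstitutional_claim: bool


# A model maps a case plus the predictions made so far to 1 (reverse) or 0 (affirm).
Model = Callable[[Case, Dict[str, int]], int]


def predict_in_order(case: Case, ordered_models: List[Tuple[str, Model]]) -> Dict[str, int]:
    """Run the models in order, passing earlier predictions to later models."""
    predictions: Dict[str, int] = {}
    for justice, model in ordered_models:
        predictions[justice] = model(case, predictions)
    return predictions


def model_a(case: Case, earlier: Dict[str, int]) -> int:
    # Hypothetical rule that uses only the case variables.
    return 1 if case.lower_court_direction == "liberal" else 0


def model_b(case: Case, earlier: Dict[str, int]) -> int:
    # Hypothetical rule that also uses the prediction for "Justice A".
    if case.unconstitutional_claim:
        return 1
    return earlier["Justice A"]


example = Case("9th", "civil rights", "individual", "state", "liberal", False)
print(predict_in_order(example, [("Justice A", model_a), ("Justice B", model_b)]))
```

The order matters: a model can depend only on justices whose predictions come earlier in the list.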
The Results
To the surprise of almost everyone, the models outperformed
the experts
The models and experts were compared using 68 cases
The models predicted 75% of the cases correctly, while the
experts predicted between 59% (average individual performance) and 66%
(majority rule) of the cases correctly
Summary
On the horizon
Tomorrow:
R tutorial II
Wednesday:
Ensemble methods
Friday:
Office hours
Individual Assignment III
Team Projects:
Start collecting the data and running some exploratory analysis
on it, even fitting first models to get a sense of what you have
Appendix