Prof U Dinesh Kumar, IIMB
Decision Trees
Introduction to CHAID
CHAID (Chi-squared Automatic Interaction Detection)
CHAID partitions the data into mutually exclusive, exhaustive subsets that best describe the dependent categorical variable.
CHAID is an iterative procedure that examines the predictors (or
classification variables) and uses them in the order of their statistical
significance.
The splitting criterion in CHAID depends on the type of the dependent variable (or target variable):
For a continuous dependent variable, the F-test is used.
For a categorical dependent variable, the chi-square test of independence is used.
CHAID Example
Contingency Table

Checking account balance | Default = 1 | Default = 0 | Total
0 DM                     |     135     |     139     |   274
Other than 0 DM          |     165     |     561     |   726
Total                    |     300     |     700     |  1000
The chi-square statistic is $\chi^2 = \sum \frac{(O - E)^2}{E}$, where $O$ is the observed frequency and $E$ is the expected frequency of a cell.

Cell                     | Observed (O) | Expected (E) | (O - E)^2 / E
0 DM, Default = 1        |     135      |     82.2     |     33.91
0 DM, Default = 0        |     139      |    191.8     |     14.53
Not 0 DM, Default = 1    |     165      |    217.8     |     12.80
Not 0 DM, Default = 0    |     561      |    508.2     |      5.48
Total                    |    1000      |   1000       |     66.73

P-value = 3.10E-16
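A minimal Python sketch reproducing the chi-square computation above, assuming SciPy is available; the array layout mirrors the contingency table:

```python
# Chi-square test of independence on the checking-account contingency table.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[135, 139],    # 0 DM:            Default = 1, Default = 0
                     [165, 561]])   # other than 0 DM: Default = 1, Default = 0

# correction=False gives the uncorrected statistic shown in the table above.
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, p)     # ~66.7 and p ~ 3.1e-16, matching the slide
print(expected)    # [[ 82.2 191.8], [217.8 508.2]]
```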
CHAID Procedure
Step 1: Examine each predictor variable for its statistical significance with the dependent variable, using the F-test (for a continuous dependent variable) or the chi-square test (for a categorical dependent variable).
Step 2: Determine the most significant predictor, i.e. the predictor with the smallest p-value after Bonferroni correction (see the sketch after this list).
Step 3: Divide the data by the levels of the most significant predictor. Each of these groups will be examined further individually.
Step 4: For each sub-group, determine the most significant variable among the remaining predictors and divide the data again.
Step 5: Repeat Step 4 until a stopping criterion is reached.
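A minimal Python sketch of Steps 1 and 2, assuming a pandas DataFrame `df` with categorical predictor columns and a categorical target column (all names are hypothetical); the category merging that full CHAID performs before testing is omitted:

```python
# Pick the most significant predictor by Bonferroni-corrected chi-square p-value.
import pandas as pd
from scipy.stats import chi2_contingency

def best_chaid_split(df, predictors, target, alpha=0.05):
    best = None
    for col in predictors:
        table = pd.crosstab(df[col], df[target])      # contingency table
        _, p, _, _ = chi2_contingency(table, correction=False)
        p_adj = min(p * len(predictors), 1.0)         # Bonferroni correction
        if p_adj < alpha and (best is None or p_adj < best[1]):
            best = (col, p_adj)
    return best   # None means no significant split: stop growing this node
```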
Merging - CHAID
Categories of a predictor whose outcome distributions do not differ significantly are merged before the predictor is used for splitting.
CHAID Input
Significance level for partitioning a variable.
Observed frequencies:

Predictor | Y=1 | Y=0 | Total
0DM = 1   |  99 | 110 |  209
0DM = 0   | 140 | 451 |  591
Total     | 239 | 561 |  800
CHAID Tree
Expected frequencies for the response variable (the calculation is sketched below):

Predictor | Y=1    | Y=0    | Total
0DM = 1   |  62.44 | 146.56 |  209
0DM = 0   | 176.56 | 414.44 |  591
Total     | 239    | 561    |  800
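The expected counts follow from E = (row total x column total) / grand total; a small sketch of the calculation:

```python
# Expected frequencies under independence for the 800-observation table.
import numpy as np

observed = np.array([[ 99, 110],    # 0DM = 1: Y = 1, Y = 0
                     [140, 451]])   # 0DM = 0: Y = 1, Y = 0

row_totals = observed.sum(axis=1, keepdims=True)   # [[209], [591]]
col_totals = observed.sum(axis=0, keepdims=True)   # [[239, 561]]
expected = row_totals * col_totals / observed.sum()
print(expected)   # [[ 62.44 146.56], [176.56 414.44]]
```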
Business Rules
Business rules are always derived for leaf nodes (nodes without any branches).
Node 1: If 0DM = 0, then classify the observation as 0 (good credit); this rule is accurate 76% of the time (451 of the 591 observations with 0DM = 0 have Y = 0).
CHAID Tree
Business Rule
Node 7: If the checking account balance is not less than 200 DM and there are no other installment plans, then classify the observation as Y = 0.
Tree Pruning
$\text{Gini}(k) = \sum_{j=1}^{n} P(j \mid k)\,\bigl(1 - P(j \mid k)\bigr)$

where $P(j \mid k)$ is the proportion of class-$j$ observations in node $k$.
Entropy Calculation
$\text{Entropy}(k) = -\sum_{j=1}^{n} P(j \mid k)\,\log P(j \mid k)$
[Figure: node $t$ is split into a left child $t_L$ and a right child $t_R$.]

Reduction in impurity: $\Delta i(t) = i(t) - p_L\, i(t_L) - p_R\, i(t_R)$, where $p_L$ and $p_R$ are the proportions of observations in node $t$ sent to $t_L$ and $t_R$.
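A short Python sketch of the two impurity measures and the reduction in impurity; log base 2 is assumed for entropy, since the slides leave the base unspecified:

```python
import numpy as np

def gini(p):
    """Gini(k) = sum_j P(j|k) * (1 - P(j|k)) for class proportions p."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1 - p)))

def entropy(p):
    """Entropy(k) = -sum_j P(j|k) * log2 P(j|k); 0*log(0) treated as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def impurity_reduction(parent, left, right, n_left, n_right, measure=gini):
    """Delta i(t) = i(t) - pL * i(tL) - pR * i(tR)."""
    p_left = n_left / (n_left + n_right)
    return measure(parent) - p_left * measure(left) - (1 - p_left) * measure(right)
```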
CART Tree
Ensemble Methods
Random Forest
Social media text sources: Facebook, blogs and microsites, YouTube, Twitter, Instagram.
Classes: C1 = Positive, C2 = Neutral, C3 = Negative.
Output: a learned classifier $Y: d_{new} \to c$ that assigns a class $c$ to a new document $d_{new}$.
Vocabulary
V = {awesome, humor, action, nonsense, romance, stereotype, mindless, boring, overacting, trash}
Document Model
"Quite shocking to see an actor of Shahrukh's caliber and seniority doing such an ordinary role. Performance-wise much more was expected from Shahrukh. In fact his sarcasm felt very repetitive and far from funny. The only good thing about the movie is Deepika's performance. She has done an extraordinary job and should be given full credit for the success of the movie."
Ref: Hiroshi Shimodaira, Text Classification Using Naïve Bayes, Learning and Data Note, University of Edinburgh.
$P(C \mid D) = \dfrac{P(D \mid C)\,P(C)}{P(D)}$
$P(C_k) = \dfrac{N_k}{N}$

where $N_k$ is the number of training documents in class $C_k$ and $N$ is the total number of training documents.
$P(C_k) = \dfrac{N_k}{N} = \dfrac{200}{500} = 0.4$
Documents are converted into feature vectors over the vocabulary, e.g. $b_i = (0, 1, 0, 0, 1, 0, 0, 0, 0, 0)$, as shown in the sketch below.
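A small sketch of this conversion for the vocabulary V above (the simple whitespace tokenizer is an assumption):

```python
# Convert a document into a 0/1 Bernoulli feature vector over V.
V = ["awesome", "humor", "action", "nonsense", "romance",
     "stereotype", "mindless", "boring", "overacting", "trash"]

def to_feature_vector(text, vocab=V):
    """b_it = 1 if vocabulary word w_t appears in the document, else 0."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocab]

print(to_feature_vector("full of humor and romance"))
# -> [0, 1, 0, 0, 1, 0, 0, 0, 0, 0], the b_i shown above
```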
$P(D_i \mid C)$ in terms of the individual word likelihoods $P(w_t \mid C)$ is given, under the Bernoulli document model, by:

$P(D_i \mid C) = \prod_{t=1}^{|V|} \bigl[\, b_{it}\, P(w_t \mid C) + (1 - b_{it})\,\bigl(1 - P(w_t \mid C)\bigr) \bigr]$

The probability of finding word $w_t$ given class $C_k$ is estimated as

$P(w_t \mid C_k) = \dfrac{n_k(w_t)}{N_k}$
$P(w_t \mid C_k) = \dfrac{n_k(w_t)}{N_k} = \dfrac{50}{200} = 0.25$
V = {awesome, humor, action, nonsense, romance, stereotype, mindless, boring, overacting, trash}
[Table: 0/1 document-term matrix over the 10 vocabulary words for six positive (P) and six negative (N) training documents; entry $b_{it} = 1$ when word $w_t$ appears in document $i$.]
To classify a document $D$, compute the Bayes numerator $P(D \mid C)\,P(C)$ for each class and assign the class with the larger value.
$N_P(w_t)$ = number of times word $w_t$ has occurred in positive documents in the training data:
N_P(w1) = 4, N_P(w2) = 1, N_P(w3) = 2, N_P(w4) = 3, N_P(w5) = 3, N_P(w6) = 4, N_P(w7) = 4, N_P(w8) = 4, N_P(w9) = 5, N_P(w10) = 4
P(w1|P) = 4/6, P(w2|P) = 1/6, P(w3|P) = 2/6, P(w4|P) = 3/6, P(w5|P) = 3/6, P(w6|P) = 4/6, P(w7|P) = 4/6, P(w8|P) = 4/6, P(w9|P) = 5/6, P(w10|P) = 4/6
$N_N(w_t)$ = number of times word $w_t$ has occurred in negative documents in the training data:
N_N(w1) = 2, N_N(w2) = 3, N_N(w3) = 3, N_N(w4) = 1, N_N(w5) = 1, N_N(w6) = 1, N_N(w7) = 3, N_N(w8) = 1, N_N(w9) = 2, N_N(w10) = 1
P(w1|N) = 2/6, P(w2|N) = 3/6, P(w3|N) = 3/6, P(w4|N) = 1/6, P(w5|N) = 1/6, P(w6|N) = 1/6, P(w7|N) = 3/6, P(w8|N) = 1/6, P(w9|N) = 2/6, P(w10|N) = 1/6
Test document: D1 = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
$P(D_1 \mid P)\,P(P) = \frac{6}{12}\times\frac{4}{6}\times\frac{1}{6}\times\frac{2}{6}\times\frac{3}{6}\times\frac{3}{6}\times\frac{4}{6}\times\frac{4}{6}\times\frac{4}{6}\times\frac{5}{6}\times\frac{4}{6} \approx 7.6\times 10^{-4}$
$P(D_1 \mid N)\,P(N) = \frac{6}{12}\times\frac{2}{6}\times\frac{3}{6}\times\frac{3}{6}\times\frac{1}{6}\times\frac{1}{6}\times\frac{1}{6}\times\frac{3}{6}\times\frac{1}{6}\times\frac{2}{6}\times\frac{1}{6} \approx 8.9\times 10^{-7}$
The document will be classified as P since it has the higher probability
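A Python sketch reproducing the two scores under the Bernoulli document model, using exact fractions to avoid rounding:

```python
# Naive Bayes scores P(D|C)P(C) for the test document D1.
from fractions import Fraction as F

p_pos = [F(n, 6) for n in (4, 1, 2, 3, 3, 4, 4, 4, 5, 4)]   # P(w_t | P)
p_neg = [F(n, 6) for n in (2, 3, 3, 1, 1, 1, 3, 1, 2, 1)]   # P(w_t | N)
prior = F(6, 12)              # 6 of the 12 training documents in each class
d1 = [1] * 10                 # D1 contains every vocabulary word

def score(likelihoods, doc):
    """prior * product over words of b*P(w|C) + (1-b)*(1-P(w|C))."""
    s = prior
    for b, p in zip(doc, likelihoods):
        s *= p if b else 1 - p
    return s

print(float(score(p_pos, d1)))   # ~7.6e-04
print(float(score(p_neg, d1)))   # ~8.9e-07  -> classify D1 as P
```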
Challenges
It is difficult to classify sarcastic comments, e.g.:
"All my life I thought air was free, until I bought a bag of chips."
"You have been so incredibly helpful, and thanks (for nothing)."
Text Pre-Processing
Feature Extraction
A feature extractor is used to convert each comment into a feature set; it can also be used for creating the vocabulary.
Stemming
Different forms of the same word carry the same information. Stemming is the process of transforming a word into its stem (root form), e.g. "acting" and "acted" both reduce to "act".
N-Gram Analysis
An n-gram is a sequence of n consecutive words (e.g. "machine learning" is a 2-gram); see the pre-processing sketch below.
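A minimal pre-processing sketch combining stemming and n-gram extraction, assuming the NLTK library is installed (NLTK's PorterStemmer is one standard stemmer; the whitespace tokenizer is a simplification):

```python
from nltk.stem import PorterStemmer

def preprocess(comment, n=2):
    """Lowercase, tokenize, stem each token, and return stems plus n-grams."""
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(w) for w in comment.lower().split()]
    ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return tokens + ngrams

print(preprocess("machine learning is fun"))
# stems: ['machin', 'learn', 'is', 'fun'] plus 2-grams such as 'machin learn'
```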