Data Mining
An example: building a telecom customer retention model
Given a customer's telecom behavior, predict whether the customer will stay or leave.
External data sources: market surveys, GIS systems
Independent variables: Outlook, Temp, Humidity, Windy. Dependent variable: Play.

Outlook    Temp  Humidity  Windy  Play
sunny      85    85        FALSE  no
sunny      80    90        TRUE   no
overcast   83    86        FALSE  yes
rainy      70    96        FALSE  yes
rainy      68    80        FALSE  yes
rainy      65    70        TRUE   no
overcast   64    65        TRUE   yes
sunny      72    95        FALSE  no
sunny      69    70        FALSE  yes
rainy      75    80        FALSE  yes
sunny      75    70        TRUE   yes
overcast   72    90        TRUE   yes
overcast   81    75        FALSE  yes
rainy      71    91        TRUE   no
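For concreteness, the table above can be held as plain Python data; this is a sketch, with rows copied directly from the table:

```python
# The weather dataset from the table above: each row is
# (outlook, temp, humidity, windy, play).
weather = [
    ("sunny",    85, 85, False, "no"),
    ("sunny",    80, 90, True,  "no"),
    ("overcast", 83, 86, False, "yes"),
    ("rainy",    70, 96, False, "yes"),
    ("rainy",    68, 80, False, "yes"),
    ("rainy",    65, 70, True,  "no"),
    ("overcast", 64, 65, True,  "yes"),
    ("sunny",    72, 95, False, "no"),
    ("sunny",    69, 70, False, "yes"),
    ("rainy",    75, 80, False, "yes"),
    ("sunny",    75, 70, True,  "yes"),
    ("overcast", 72, 90, True,  "yes"),
    ("overcast", 81, 75, False, "yes"),
    ("rainy",    71, 91, True,  "no"),
]

# Class distribution of the dependent variable Play.
n_yes = sum(1 for row in weather if row[4] == "yes")
n_no = len(weather) - n_yes
print(n_yes, n_no)  # 9 yes, 5 no
```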
Measures of Dispersion
Variance:
    s^2 = \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})^2
Standard deviation:
    s = \left[ \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})^2 \right]^{1/2}
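As a quick sketch, both formulas can be computed directly; the data values here are illustrative, and note the m-1 denominator (the sample variance):

```python
import math

def sample_variance(x):
    """Variance with the 1/(m-1) (Bessel-corrected) denominator."""
    m = len(x)
    mean = sum(x) / m
    return sum((xi - mean) ** 2 for xi in x) / (m - 1)

def sample_std(x):
    """Standard deviation: the square root of the sample variance."""
    return math.sqrt(sample_variance(x))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values
print(round(sample_variance(data), 3))  # 4.571
print(round(sample_std(data), 3))       # 2.138
```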
Heterogeneity Measures
The Gini index (Wikipedia: "The Gini coefficient (also known as the Gini index or Gini ratio) is a measure of statistical dispersion developed by the Italian statistician and sociologist Corrado Gini and published in his 1912 paper 'Variability and Mutability' (Italian: Variabilità e mutabilità)"):
    G = 1 - \sum_{h=1}^{k} f_h^2
Entropy:
    E = - \sum_{h=1}^{k} f_h \log_2 f_h
where f_h is the relative frequency of class h among the k classes.
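Applied to the Play column of the weather table above (9 yes, 5 no), the two measures can be sketched as:

```python
import math

def gini(freqs):
    """Gini index: G = 1 - sum of squared class frequencies."""
    return 1 - sum(f ** 2 for f in freqs)

def entropy(freqs):
    """Entropy: E = -sum of f * log2(f) over nonzero class frequencies."""
    return -sum(f * math.log2(f) for f in freqs if f > 0)

# Class frequencies for Play in the weather dataset: 9 yes, 5 no.
freqs = [9 / 14, 5 / 14]
print(round(gini(freqs), 3))     # 0.459
print(round(entropy(freqs), 3))  # 0.940
```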
Test of Significance
Given two models:
    Model M1: accuracy = 85%, tested on 30 instances
    Model M2: accuracy = 75%, tested on 5000 instances
Which accuracy estimate can we trust more?
Confidence Intervals
Given an observed frequency f of 25%, how close is this to the true probability p?
Prediction is just like tossing a biased coin: head is a success, tail is an error.
We can say: p lies within a certain specified interval with a certain specified confidence.
Example: S = 750 successes in N = 1000 trials, so the estimated success rate is f = 75%. How close is this to the true success rate p? Answer: with 80% confidence, p ∈ [73.2%, 76.7%].
For a standard normal variable X:
    \Pr[-z \le X \le z] = c
    \Pr[-z \le X \le z] = 1 - 2 \Pr[X \ge z]
[Figure: standard normal density with tail cutoffs at -z_{\alpha/2} and z_{1-\alpha/2}]
Transforming f
Transformed value for f:
    \frac{f - p}{\sqrt{p(1-p)/N}}
Resulting equation:
    \Pr\left[ -z \le \frac{f - p}{\sqrt{p(1-p)/N}} \le z \right] = c
Solving for p:
    p = \left( f + \frac{z^2}{2N} \pm z \sqrt{\frac{f}{N} - \frac{f^2}{N} + \frac{z^2}{4N^2}} \right) \bigg/ \left( 1 + \frac{z^2}{N} \right)
Interval limits as the number of trials N grows (these values are consistent with f = 80% at 95% confidence, z = 1.96):

N          50     100    500    1000   5000
p(lower)   0.670  0.711  0.763  0.774  0.789
p(upper)   0.888  0.866  0.833  0.824  0.811

Selected confidence levels and their z values:

1-α   0.99  0.98  0.95  0.90
z     2.58  2.33  1.96  1.65
Confidence limits
Confidence limits for the normal distribution with mean 0 and variance 1:

Pr[X ≥ z]   z
0.1%        3.09
0.5%        2.58
1%          2.33
5%          1.65
10%         1.28
20%         0.84
40%         0.25
Examples
f = 75%, N = 1000, c = 80% (so that z = 1.28):
    p ∈ [0.732, 0.767]
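The example can be checked with a short sketch of the interval formula above (the `wilson_interval` name is mine):

```python
import math

def wilson_interval(f, n, z):
    """Confidence interval for the true success rate p, given the
    observed rate f on n trials and the normal quantile z."""
    center = f + z * z / (2 * n)
    margin = z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (center - margin) / denom, (center + margin) / denom

# f = 75%, N = 1000, c = 80% (z = 1.28), as in the example above.
lo, hi = wilson_interval(0.75, 1000, 1.28)
print(round(lo, 3), round(hi, 3))  # 0.732 0.767
```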
Implications
First, the more test data the better: for a fixed confidence level, a larger N gives a narrower interval around f.
Cross-validation: hold aside one group (fold) for testing and use the rest to build the model; repeat so that each fold serves as the test set in one iteration.
Confidence
A 2% error rate measured on 100 tests is far less trustworthy than a 2% error rate measured on 10,000 tests.
Tradeoff in choosing the number of folds:
    # of folds = # of data points N (leave-one-out CV): each trained model is very close to the final model, but each test set (a single instance) is very biased.
    # of folds = 2: each trained model is quite unlike the final model, but the test data distribution is close to the training distribution.
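A minimal sketch of the fold bookkeeping (index splitting only; shuffling and model fitting are omitted, and the function name is mine):

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

# Each iteration holds one fold aside for testing, trains on the rest.
folds = kfold_indices(10, 5)
for test_idx in folds:
    train_idx = [j for f in folds if f is not test_idx for j in f]
    # ... build the model on train_idx, evaluate on test_idx ...
```

Setting k = n here gives leave-one-out CV; k = 2 gives the other extreme of the tradeoff.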
An ROC curve plots the TP rate (on the y-axis) against the FP rate (on the x-axis).
The performance of each classifier is represented as a point on the ROC curve: changing the algorithm's threshold, the sample distribution, or the cost matrix changes the location of the point.
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    a (TP)      b (FN)
CLASS    Class=No     c (FP)      d (TN)

Widely-used metric:
    Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
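Reading the counts off the matrix, accuracy is one line (the counts below are made-up):

```python
def accuracy(tp, fn, fp, tn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical confusion-matrix counts.
print(accuracy(tp=50, fn=10, fp=5, tn=35))  # 0.85
```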
Constructing an ROC curve: the classifier outputs a posterior probability P(+|A) for each test instance; the instances are sorted by decreasing P(+|A) (here 0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25); the threshold is then swept across these values, and at each threshold >= t the counts TP, FP, TN, FN and the resulting TP rate and FP rate are recorded. Plotting the (FPR, TPR) pairs yields the ROC curve.
[Table and figure: TP, FP, TN, FN, TPR, and FPR at each threshold, and the resulting ROC curve]
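The threshold sweep can be sketched as follows; the labels and scores here are made-up (not the slide's), and the sketch assumes distinct scores with positives labeled 1:

```python
def roc_points(labels, scores):
    """Sweep the threshold down the sorted scores, tracking (FPR, TPR).
    Assumes scores are distinct; labels are 1 (positive) or 0 (negative)."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]  # threshold above the highest score
    tp = fp = 0
    for _, y in ranked:
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

# Hypothetical scores: a perfectly separable case.
pts = roc_points([1, 1, 0, 0], [0.9, 0.8, 0.7, 0.6])
print(pts)  # [(0.0, 0.0), (0.0, 0.5), (0.0, 1.0), (0.5, 1.0), (1.0, 1.0)]
```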
Comparing two models on one ROC plot:
No model consistently outperforms the other:
    M1 is better for small FPR
    M2 is better for large FPR
Ideal classifier: area under the curve = 1
Random guessing: area under the curve = 0.5