Classification vs. Prediction

Classification
- predicts categorical class labels (discrete or nominal)
- classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data

Prediction
- models continuous-valued functions, i.e., predicts unknown or missing values

Typical applications
- Credit approval
- Target marketing
- Medical diagnosis
- Fraud detection
Classification: A Two-Step Process

Step 1: Model construction

Training Data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classifier (Model):
IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
Step 2: Using the model in classification

Testing Data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4) → Tenured?

May 23, 2016    Data Mining: Concepts and Techniques
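As an illustrative sketch, the learned rule can be applied to the testing and unseen data (the function name and data layout are mine, not from the slides):

```python
# Hypothetical encoding of the slide's learned model:
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.
def predict_tenured(rank, years):
    """Apply the rule induced from the training data."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data from the slide: (name, rank, years, actual label)
testing = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]
correct = sum(predict_tenured(r, y) == label for _, r, y, label in testing)
print(correct, "of", len(testing), "test tuples classified correctly")  # 3 of 4 (Merlisa is misclassified)

# Unseen data: (Jeff, Professor, 4)
print("Jeff:", predict_tenured("Professor", 4))  # prints "Jeff: yes"
```

Note that the estimated accuracy on the test set (3/4 here) is what decides whether the model is acceptable before classifying unseen tuples like Jeff's.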
Issues: Data Preparation
- Data cleaning
- Data transformation

Issues: Evaluating Classification Methods
- Accuracy
  - classifier accuracy: predicting class label
  - predictor accuracy: guessing value of predicted attributes
- Speed
  - time to construct the model (training time)
  - time to use the model (classification/prediction time)
- Robustness: handling noise and missing values
- Scalability: efficiency in disk-resident databases
- Interpretability: understanding and insight provided by the model
- Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
This follows an example of Quinlan's ID3 (Playing Tennis).
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no
Output: a decision tree for buys_computer

age?
├── <=30 → student?
│     ├── no  → no
│     └── yes → yes
├── 31..40 → yes
└── >40 → credit_rating?
      ├── excellent → no
      └── fair → yes
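The tree can be encoded and applied directly; this nested-dict encoding is an illustrative sketch, not from the slides:

```python
# The decision tree above, transcribed as a nested dict:
# an inner node maps an attribute name to its branches; a leaf is a label.
tree = {
    "age": {
        "<=30": {"student": {"no": "no", "yes": "yes"}},
        "31..40": "yes",
        ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}

def classify(node, tuple_):
    """Walk the tree until a leaf (a class-label string) is reached."""
    if not isinstance(node, dict):
        return node
    attr = next(iter(node))            # attribute tested at this node
    branch = node[attr][tuple_[attr]]  # follow the branch for the tuple's value
    return classify(branch, tuple_)

x = {"age": "<=30", "student": "yes", "credit_rating": "fair"}
print(classify(tree, x))  # prints "yes": a young student buys a computer
```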
Information needed (after using A to split D into v partitions) to classify D:

    Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × I(D_j)

Information gained by branching on attribute A:

    Gain(A) = Info(D) − Info_A(D)

Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples):

    Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

    Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

(5/14) I(2,3) means age <=30 has 5 out of 14 samples, with 2 yes's and 3 no's. Hence

    Gain(age) = Info(D) − Info_age(D) = 0.246
Similarly,

    Gain(income) = 0.029
    Gain(student) = 0.151
    Gain(credit_rating) = 0.048
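The entropy and gain computations above can be reproduced with a short sketch (the per-partition class counts are taken from the training table earlier in this section):

```python
# Sketch: compute Info(D), Info_age(D) and Gain(age) for the
# 14-tuple buys_computer table (9 "yes", 5 "no").
from math import log2

def info(*counts):
    """Entropy I(c1, c2, ...) of a class distribution."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

info_D = info(9, 5)                                   # I(9,5) ≈ 0.940
# age partitions: <=30 -> (2 yes, 3 no), 31..40 -> (4, 0), >40 -> (3, 2)
info_age = (5/14) * info(2, 3) + (4/14) * info(4, 0) + (5/14) * info(3, 2)
gain_age = info_D - info_age                          # ≈ 0.246
print(round(info_D, 3), round(info_age, 3), round(gain_age, 3))
```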
Gain ratio (C4.5): a normalization to information gain:

    SplitInfo_A(D) = −Σ_{j=1..v} (|D_j| / |D|) × log2(|D_j| / |D|)

    GainRatio(A) = Gain(A) / SplitInfo_A(D)

Ex. income splits D into partitions of size 4, 6, and 4:

    SplitInfo_income(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557

    GainRatio(income) = 0.029 / 1.557 = 0.019

The attribute with the maximum gain ratio is selected as the splitting attribute.
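A sketch of the split-information calculation for income (partition sizes 4, 6, 4 as above; Gain(income) = 0.029 from the earlier slide):

```python
# Sketch: SplitInfo_income(D) for the 14-tuple table, where income
# splits D into partitions of size 4 (high), 6 (medium), 4 (low).
from math import log2

def split_info(sizes):
    """Split information of a partition into the given subset sizes."""
    total = sum(sizes)
    return -sum(s / total * log2(s / total) for s in sizes if s)

si_income = split_info([4, 6, 4])          # ≈ 1.557
gain_ratio_income = 0.029 / si_income      # ≈ 0.019
print(round(si_income, 3), round(gain_ratio_income, 3))
```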
Gini index (CART):

    gini(D) = 1 − Σ_j p_j²

If D is split on A into two subsets D1 and D2:

    gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)

Reduction in impurity:

    Δgini(A) = gini(D) − gini_A(D)

Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no":

    gini(D) = 1 − (9/14)² − (5/14)² = 0.459

Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}:

    gini_{income ∈ {low,medium}}(D) = (10/14) gini(D1) + (4/14) gini(D2)

The gini values of the other candidate binary splits, {low, high} and {medium, high}, are computed the same way; the split giving the lowest gini_A(D) (equivalently, the largest reduction in impurity) is selected.

May need other tools, e.g., clustering, to get the possible split values.
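The gini values can be checked with a short sketch; the per-partition class counts used here (7 yes / 3 no for {low, medium}, 2 yes / 2 no for {high}) are tallied from the training table as reconstructed above:

```python
# Sketch: Gini index for D (9 "yes", 5 "no") and for the binary
# income split D1 = {low, medium} (10 tuples) vs D2 = {high} (4 tuples).
def gini(*counts):
    """Gini impurity 1 - sum(p_j^2) of a class distribution."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

gini_D = gini(9, 5)                                    # ≈ 0.459
gini_income = (10/14) * gini(7, 3) + (4/14) * gini(2, 2)
print(round(gini_D, 3), round(gini_income, 3))
```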
Comparing attribute selection measures:
- Information gain: biased towards multivalued attributes
- Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others
- Gini index: biased towards multivalued attributes; has difficulty when the number of classes is large

Other attribute selection measures:
- C-SEP: performs better than information gain and gini index in certain cases
- MDL (Minimal Description Length): the best tree is the one that requires the fewest # of bits to both (1) encode the tree, and (2) encode the exceptions to the tree
- CART: finds multivariate splits based on a linear combination of attributes
Attribute construction
Bayesian Classification: Why?
Bayesian Theorem:

    P(H|X) = P(X|H) P(H) / P(X)
    P(Ci|X) = P(X|Ci) P(Ci) / P(X)
If A_k is continuous-valued, P(x_k|Ci) is usually computed based on a Gaussian distribution with a mean μ and standard deviation σ:

    g(x, μ, σ) = (1 / (√(2π) σ)) e^(−(x − μ)² / (2σ²))

and P(x_k|Ci) is

    P(x_k|Ci) = g(x_k, μ_Ci, σ_Ci)
Naïve Bayesian Classifier: an example

Class:
  C1: buys_computer = "yes"
  C2: buys_computer = "no"

Data sample to classify:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

P(Ci):
  P(buys_computer = "yes") = 9/14 = 0.643
  P(buys_computer = "no") = 5/14 = 0.357
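The example can be carried through numerically; the conditional probabilities below are counted per class from the 14-tuple training table shown earlier:

```python
# Sketch: naive Bayesian classification of
# X = (age <= 30, income = medium, student = yes, credit_rating = fair).
p_yes, p_no = 9/14, 5/14                   # prior P(Ci)

# P(attribute value | class), counted from the training table
px_yes = (2/9) * (4/9) * (6/9) * (6/9)     # age<=30, medium, student, fair | yes
px_no = (3/5) * (2/5) * (1/5) * (2/5)      # same attribute values | no

score_yes, score_no = px_yes * p_yes, px_no * p_no   # ≈ 0.028 vs ≈ 0.007
print("yes" if score_yes > score_no else "no")        # prints "yes"
```

Since P(X|C1)P(C1) > P(X|C2)P(C2), the classifier predicts buys_computer = "yes" for X.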
Advantages
- Easy to implement
- Good results obtained in most of the cases

Disadvantages
- Assumption: class conditional independence, therefore loss of accuracy
- Practically, dependencies exist among variables
  - E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
  - Dependencies among these cannot be modeled by a Naïve Bayesian Classifier
- Size ordering: assign the highest priority to the triggering rule that has the toughest requirement (i.e., with the most attribute tests)
- Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
Example: rule extraction from the buys_computer decision tree

IF age = "<=30" AND student = "no"              THEN buys_computer = "no"
IF age = "<=30" AND student = "yes"             THEN buys_computer = "yes"
IF age = "31..40"                               THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent"  THEN buys_computer = "no"
IF age = ">40" AND credit_rating = "fair"       THEN buys_computer = "yes"
Rules are learned sequentially; each rule for a given class Ci will cover many tuples of Ci but none (or few) of the tuples of other classes.

Steps:
- Each time a rule is learned, the tuples covered by the rule are removed
Discriminative Classifiers

Advantages
- prediction accuracy is generally high

Criticism
- long training time
- difficult to understand the learned function (weights)
Classification by Backpropagation
Weakness
Strength
Applications:
SVM: General Philosophy

[Figure: separating hyperplanes with a small margin vs. a large margin; the support vectors lie closest to the decision boundary]
SVM vs. Neural Network

SVM
- Deterministic algorithm
- Nice generalization properties

Neural Network
- Relatively old
- Nondeterministic algorithm
- Generalizes well but doesn't have a strong mathematical foundation
- Can easily be learned in incremental fashion
- To learn complex functions, use multilayer perceptron (not that trivial)
Associative Classification
Associative classification
Why effective?
- CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM'01)
- CPAR (Classification based on Predictive Association Rules: Yin & Han, SDM'03)
- RCBT (Mining top-k covering rule groups for gene expression data: Cong et al., SIGMOD'05)
[Figure: k-nearest-neighbor classification of a query point x_q among "+" and "−" training examples]
Methodology
Challenges
Fuzzy Set Approaches
What Is Prediction?
Linear Regression

Method of least squares: estimates the best-fitting straight line y = w0 + w1 x:

    w1 = Σ_{i=1..|D|} (x_i − x̄)(y_i − ȳ) / Σ_{i=1..|D|} (x_i − x̄)²

    w0 = ȳ − w1 x̄

Multiple linear regression: involves more than one predictor variable
- Training data is of the form (X1, y1), (X2, y2), …, (X|D|, y|D|)
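A minimal sketch of the least-squares estimates; the (x, y) pairs below are invented for illustration:

```python
# Sketch: least-squares estimates for y = w0 + w1*x on a tiny
# made-up data set (values chosen so the true line is roughly y = 2x).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.1, 8.0]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
# w1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
     / sum((x - x_bar) ** 2 for x in xs)
w0 = y_bar - w1 * x_bar
print(round(w1, 3), round(w0, 3))  # slope ≈ 1.99, intercept ≈ 0.05
```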
Nonlinear Regression
Minimal generalization

Prediction
- Determine the major factors which influence the prediction
  - Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.
- Multi-level prediction: drill-down and roll-up analysis
Confusion Matrix

                  Predicted Positive     Predicted Negative
Known Positive    True Positive (TP)     False Negative (FN)
Known Negative    False Positive (FP)    True Negative (TN)

Measure               Formula                          Intuitive Meaning
Precision             TP / (TP + FP)                   The percentage of positive predictions that are correct.
Recall / Sensitivity  TP / (TP + FN)                   The percentage of positive-labeled instances that were predicted as positive.
Specificity           TN / (TN + FP)                   The percentage of negative-labeled instances that were predicted as negative.
Accuracy              (TP + TN) / (TP + TN + FP + FN)  The percentage of predictions that are correct.
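As a quick sketch, the four measures can be applied to the buy_computer evaluation counts used in this deck (TP = 6954, FN = 46, FP = 412, TN = 2588):

```python
# Sketch: confusion-matrix measures on illustrative counts.
tp, fn, fp, tn = 6954, 46, 412, 2588

precision = tp / (tp + fp)
recall = tp / (tp + fn)                    # sensitivity
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 4), round(recall, 4),
      round(specificity, 4), round(accuracy, 4))
# recall ≈ 0.9934 and specificity ≈ 0.8627 match the per-class
# recognition rates (99.34% and 86.27%) in the accuracy table.
```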
Classifier Accuracy Measures

classes              buy_computer = yes   buy_computer = no   total   recognition (%)
buy_computer = yes   6954                 46                  7000    99.34
buy_computer = no    412                  2588                3000    86.27
total                7366                 2634                10000   95.42

Accuracy of a classifier M, acc(M): percentage of test set tuples that are correctly classified by the model M
- Error rate (misclassification rate) of M = 1 − acc(M)
- Given m classes, CM_{i,j}, an entry in a confusion matrix, indicates # of tuples in class i that are labeled by the classifier as class j

Alternative accuracy measures (e.g., for cancer diagnosis):
- sensitivity = t-pos/pos        /* true positive recognition rate */
- specificity = t-neg/neg        /* true negative recognition rate */
- precision = t-pos/(t-pos + f-pos)
- accuracy = sensitivity × pos/(pos + neg) + specificity × neg/(pos + neg)

This model can also be used for cost-benefit analysis.
Loss function: measures the error between y_i and the predicted value y_i':
- Absolute error: |y_i − y_i'|
- Squared error: (y_i − y_i')²

Test error (generalization error): the average loss over the test set:
- Mean absolute error: Σ_{i=1..d} |y_i − y_i'| / d
- Mean squared error: Σ_{i=1..d} (y_i − y_i')² / d
- Relative absolute error: Σ_{i=1..d} |y_i − y_i'| / Σ_{i=1..d} |y_i − ȳ|
- Relative squared error: Σ_{i=1..d} (y_i − y_i')² / Σ_{i=1..d} (y_i − ȳ)²
Holdout method
- Given data is randomly partitioned into two independent sets
  - Training set (e.g., 2/3) for model construction
  - Test set (e.g., 1/3) for accuracy estimation
Bootstrap
- Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data will end up in the bootstrap, and the remaining 36.8% will form the test set (since (1 − 1/d)^d ≈ e^(−1) = 0.368)
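The 63.2%/36.8% figures can be checked numerically:

```python
# Sketch: the chance a given tuple is never picked in d draws
# with replacement is (1 - 1/d)^d, which tends to e^-1 as d grows.
import math

d = 1000
p_never = (1 - 1 / d) ** d
print(round(p_never, 3), round(math.exp(-1), 3))  # both ≈ 0.368
```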
Ensemble methods
- Use a combination of models to increase accuracy
- Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M*

Popular ensemble methods
- Bagging: averaging the prediction over a collection of classifiers
- Boosting: weighted vote with a collection of classifiers
- Ensemble: combining a set of heterogeneous classifiers
Bagging: Bootstrap Aggregation
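A minimal bagging sketch under assumed interfaces (learn(sample) returns a model, model(x) returns a label; neither name comes from the slides):

```python
# Sketch of bagging: train k classifiers on bootstrap samples of the
# training data and combine their predictions by majority vote.
import random

def bagging_predict(train, learn, x, k=11, seed=0):
    """Return the majority-vote prediction of k bootstrap-trained models."""
    rng = random.Random(seed)
    votes = {}
    for _ in range(k):
        # bootstrap sample: draw len(train) tuples with replacement
        sample = [rng.choice(train) for _ in train]
        label = learn(sample)(x)
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

# Toy base learner: predict the majority class of its training sample,
# ignoring x entirely (just enough to exercise the ensemble loop).
def majority(sample):
    return lambda x: max(sorted(set(sample)), key=sample.count)

print(bagging_predict(["yes", "yes", "no", "yes"], majority, x=None))
```

In practice the base learner would be a real classifier (e.g., a decision tree); the ensemble loop itself is unchanged.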
Boosting
The weight of classifier Mi's vote is

    log( (1 − error(Mi)) / error(Mi) )
Summary (I)
Summary (II)
Significance tests and ROC curves are useful for model selection
No single method has been found to be superior over all others for
all data sets