Classification and Prediction

Chapter Outline
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian classification
Rule-based classification
Classification by back propagation
Support Vector Machines (SVM)
Associative classification
Lazy learners (or learning from your neighbors)
Other classification methods
Prediction
Accuracy and error measures
Ensemble methods
Model selection
Summary
Classification
predicts categorical class labels (discrete or nominal)
constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
Prediction
models continuous-valued functions, i.e., predicts
unknown or missing values
Typical applications
Credit approval
Target marketing
Medical diagnosis
Fraud detection
Classification: A Two-Step Process

Step 1: Model construction. Training data:

NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no

Classifier (Model):
IF rank = 'Professor' OR years > 6
THEN tenured = 'yes'
Step 2: Model usage. The classifier's accuracy is first estimated on testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

The model is then used to classify unseen data, e.g., (Jeff, Professor, 4) -> Tenured?
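As a minimal sketch (plain Python, hand-coding the rule the slide's learner produced rather than using any particular library), the two-step process on the toy faculty data looks like this:

train = [  # (name, rank, years, tenured)
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

def model(rank, years):
    # IF rank = 'Professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "Professor" or years > 6 else "no"

test = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Step 2a: estimate accuracy on the independent test set
correct = sum(model(r, y) == t for _, r, y, t in test)
print(f"test accuracy: {correct}/{len(test)}")   # 3/4 (the rule misclassifies Merlisa)

# Step 2b: classify unseen data
print("Jeff ->", model("Professor", 4))          # yes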
Issues Regarding Classification and Prediction

Data preparation:
Data cleaning
Data transformation
Evaluating classification methods:
Accuracy
  classifier accuracy: predicting the class label
  predictor accuracy: guessing the value of predicted attributes
Speed
  time to construct the model (training time)
  time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability: understanding and insight provided by the model
Other measures: e.g., goodness of rules, such as decision tree size or compactness of classification rules
Classification by Decision Tree Induction
Training dataset: this follows an example of Quinlan's ID3 (Playing Tennis).
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no
Output: a decision tree for buys_computer

age?
  <=30: student?
    no  -> no
    yes -> yes
  31..40 -> yes
  >40: credit_rating?
    excellent -> no
    fair      -> yes
Attribute Selection: Information Gain

Class P: buys_computer = 'yes' (9 tuples); class N: buys_computer = 'no' (5 tuples)

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Here (5/14) I(2,3) means age <=30 has 5 out of 14 samples, with 2 yes'es and 3 no's. Hence

Gain(age) = Info(D) - Info_age(D) = 0.246

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
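A minimal Python sketch of this computation, using only the age and buys_computer columns of the training table above:

import math
from collections import Counter

# (age, buys_computer) pairs from the training data
D = [("<=30","no"),("<=30","no"),("31..40","yes"),(">40","yes"),
     (">40","yes"),(">40","no"),("31..40","yes"),("<=30","no"),
     ("<=30","yes"),(">40","yes"),("<=30","yes"),("31..40","yes"),
     ("31..40","yes"),(">40","no")]

def info(labels):
    """Expected information (entropy) of a list of class labels."""
    n = len(labels)
    return -sum(c/n * math.log2(c/n) for c in Counter(labels).values())

info_D = info([y for _, y in D])            # I(9,5) = 0.940

# Expected information after splitting on age
groups = {}
for a, y in D:
    groups.setdefault(a, []).append(y)
info_age = sum(len(g)/len(D) * info(g) for g in groups.values())  # 0.694

print(info_D - info_age)                    # Gain(age) = 0.246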
Gain Ratio for Attribute Selection (C4.5)

C4.5 normalizes information gain using split information:

SplitInfo_A(D) = -sum_{j=1..v} (|D_j|/|D|) log2(|D_j|/|D|)

GainRatio(A) = Gain(A) / SplitInfo_A(D)

Ex. income splits D into partitions of size 4, 6, and 4:

SplitInfo_income(D) = -(4/14) log2(4/14) - (6/14) log2(6/14) - (4/14) log2(4/14) = 1.557

so GainRatio(income) = 0.029/1.557 = 0.019. The attribute with the maximum gain ratio is selected as the splitting attribute.
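Continuing the sketch above (income's split sizes of 4, 6, and 4 are taken from the training table):

import math

split_info = -sum(n/14 * math.log2(n/14) for n in (4, 6, 4))  # 1.557
gain_ratio_income = 0.029 / split_info                        # ~0.019
print(split_info, gain_ratio_income)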
Gini Index (CART, IBM IntelligentMiner)

If a data set D contains examples from n classes, the gini index is defined as

gini(D) = 1 - sum_j p_j^2

where p_j is the relative frequency of class j in D. If D is split on A into D1 and D2:

gini_A(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2)

Reduction in impurity: delta_gini(A) = gini(D) - gini_A(D)
Ex. D has 9 tuples in buys_computer = 'yes' and 5 in 'no':

gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459

Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}:

gini_{income in {low,medium}}(D) = (10/14) gini(D1) + (4/14) gini(D2) = 0.443

This is the lowest gini index among the binary splits on income, so {low, medium} (and {high}) is the best split for income.
May need other tools, e.g., clustering, to get the possible split values
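A sketch of the same arithmetic in Python; the class counts per income partition (7 yes / 3 no in D1, 2 yes / 2 no in D2) are read off the training table above:

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

gini_D = gini([9, 5])                                    # 0.459
gini_split = 10/14 * gini([7, 3]) + 4/14 * gini([2, 2])  # 0.443
print(gini_D - gini_split)                               # reduction in impurity ~0.016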
Comparing Attribute Selection Measures

The three measures, in general, return good results, but:
Information gain: biased towards multivalued attributes
Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others
Gini index: biased to multivalued attributes; has difficulty when the number of classes is large
Other Attribute Selection Measures

C-SEP: performs better than information gain and gini index in certain cases
MDL (Minimal Description Length) principle: the best tree is the one that requires the fewest number of bits to both (1) encode the tree and (2) encode the exceptions to the tree
CART: finds multivariate splits based on a linear combination of attributes
Enhancements to basic decision tree induction include attribute construction: creating new attributes based on existing ones that are sparsely represented.
AVC-sets (Attribute-Value, Class label) for the buys_computer training data:

AVC-set on age:
age     yes  no
<=30    2    3
31..40  4    0
>40     3    2

AVC-set on income:
income  yes  no
high    2    2
medium  4    2
low     3    1

AVC-set on student:
student  yes  no
yes      6    1
no       3    4

AVC-set on credit_rating:
credit_rating  yes  no
fair           6    2
excellent      3    3
Bayesian Classification:
Why?
A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities. Foundation: Bayes' theorem. A simple Bayesian classifier, the naive Bayesian classifier, has performance comparable with decision tree and selected neural network classifiers.
Bayesian Theorem

Given training data X, the posterior probability of a hypothesis H follows Bayes' theorem:

P(H|X) = P(X|H) P(H) / P(X)
The classifier predicts the class Ci that maximizes the posterior

P(Ci|X) = P(X|Ci) P(Ci) / P(X)

Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized.
For a continuous-valued attribute, P(x_k|Ci) is usually computed from a Gaussian distribution with mean mu and standard deviation sigma:

g(x, mu, sigma) = (1 / (sqrt(2 pi) sigma)) e^{-(x - mu)^2 / (2 sigma^2)}

and P(x_k|Ci) = g(x_k, mu_Ci, sigma_Ci)
Naive Bayesian Classifier: Example

Class:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'

Data sample to classify:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(Ci):
P(buys_computer = 'yes') = 9/14 = 0.643
P(buys_computer = 'no') = 5/14 = 0.357
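A sketch of the full computation in Python; each conditional probability is a count from the training table (e.g., 2 of the 9 'yes' tuples have age <= 30):

# P(Ci) * product of P(x_k | Ci) for
# X = (age<=30, income=medium, student=yes, credit_rating=fair)
p_yes = 9/14 * (2/9) * (4/9) * (6/9) * (6/9)   # ~0.028
p_no  = 5/14 * (3/5) * (2/5) * (1/5) * (2/5)   # ~0.007
print(p_yes, p_no)   # p_yes wins -> predict buys_computer = 'yes'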
Naive Bayesian Classifier: Comments

Advantages
Easy to implement
Good results obtained in most of the cases

Disadvantages
Assumes class conditional independence, which causes a loss of accuracy because, in practice, dependencies exist among variables
E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
Dependencies among these cannot be modeled by a naive Bayesian classifier
Bayesian Belief Networks

A Bayesian belief network is a directed acyclic graph; for example, with nodes X, Y, Z, and P:
X and Y are the parents of Z, and Y is the parent of P
There is no dependency between Z and P
The graph has no loops or cycles
Bayesian Belief Network: An Example

Nodes include Smoker, LungCancer, Emphysema, and Dyspnea. The conditional probability table (CPT) for the variable LungCancer (LC) gives P(LC) for each combination of its parents, FamilyHistory (FH) and Smoker (S):

      (FH,S)  (FH,~S)  (~FH,S)  (~FH,~S)
LC    0.8     0.5      0.7      0.1
~LC   0.2     0.5      0.3      0.9

Derivation of the probability of a particular combination of values of X, from the CPTs:

P(x_1, ..., x_n) = prod_{i=1..n} P(x_i | Parents(Y_i))
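A minimal sketch of evaluating the factored joint probability; the 0.5 priors for FamilyHistory and Smoker are hypothetical, since the slide gives only the LungCancer CPT:

p_fh, p_s = 0.5, 0.5            # hypothetical priors, not given on the slide
p_lc_given = {("FH","S"): 0.8, ("FH","~S"): 0.5,
              ("~FH","S"): 0.7, ("~FH","~S"): 0.1}

# P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S)
print(p_fh * p_s * p_lc_given[("FH","S")])   # 0.2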
Training Bayesian Networks: several scenarios

Given both the network structure and all variables observable: learn only the CPTs
Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning
Network structure unknown, all variables observable: search through the model space to reconstruct network topology
Unknown structure, all hidden variables: no good algorithms known for this purpose

Ref.: D. Heckerman, Bayesian Networks for Data Mining
Rule-Based Classification
Conflict resolution when more than one rule is triggered:
Size ordering: assign the highest priority to the triggering rule that has the toughest requirement (i.e., with the most attribute tests)
Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
Rule Extraction from a Decision Tree

One rule is created for each path from the root to a leaf. From the buys_computer tree above:

IF age = '<=30' AND student = 'no'              THEN buys_computer = 'no'
IF age = '<=30' AND student = 'yes'             THEN buys_computer = 'yes'
IF age = '31..40'                               THEN buys_computer = 'yes'
IF age = '>40' AND credit_rating = 'excellent'  THEN buys_computer = 'no'
IF age = '>40' AND credit_rating = 'fair'       THEN buys_computer = 'yes'
Rules are learned sequentially: each rule for a given class Ci will cover many tuples of Ci but none (or few) of the tuples of other classes.
Steps: rules are learned one at a time; each time a rule is learned, the tuples covered by the rule are removed.
How to Learn One Rule?

FOIL favors rules that have high accuracy and cover many positive tuples:

FOIL_Gain = pos' * (log2(pos' / (pos' + neg')) - log2(pos / (pos + neg)))

where pos/neg are the numbers of positive/negative tuples covered by rule R, and pos'/neg' those covered by the extended rule. Rule pruning uses

FOIL_Prune(R) = (pos - neg) / (pos + neg)
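A sketch of both measures in Python; the counts passed in are illustrative:

import math

def foil_gain(pos, neg, pos2, neg2):
    """pos/neg: tuples covered before extending R; pos2/neg2: after."""
    return pos2 * (math.log2(pos2 / (pos2 + neg2)) - math.log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    return (pos - neg) / (pos + neg)

print(foil_gain(100, 400, 80, 20))   # 160.0: the extension concentrates positives
print(foil_prune(80, 20))            # 0.6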
Classification: A Mathematical Mapping

Classification: predicts categorical class labels
E.g., personal homepage classification: x_i = (x_1, x_2, x_3, ...), y_i = +1 or -1
x_1: # of occurrences of the word 'homepage'
x_2: # of occurrences of the word 'welcome'
Mathematically: x in X = R^n, y in Y = {+1, -1}; we want a function f: X -> Y
Linear Classification

[Figure: points of class x above a red separating line, points of class o below it.]

Binary classification problem: the data above the red line belongs to class x; the data below the red line belongs to class o.
Examples: SVM, Perceptron, Probabilistic Classifiers
Discriminative Classifiers

Advantages: prediction accuracy is generally high
Criticism: long training time; difficult to understand the learned function (weights)
Perceptron and Winnow

[Figure: training points in the (x1, x2) plane separated by a linear boundary.]

Input: {(x1, y1), ...}
Perceptron: update w additively; Winnow: update w multiplicatively
Classification by Backpropagation

Neural network as a classifier:
Weakness: long training time; a number of parameters (e.g., the network topology) must typically be determined empirically; poor interpretability
Strength: high tolerance to noisy data; well suited for continuous-valued inputs and outputs
A Neuron (= a Perceptron)

[Figure: inputs x0..xn with weight vector w feed a weighted sum, a bias -mu_k, and an activation function that emits output y.]

The n-dimensional input vector x is mapped to output y via the scalar product with the weight vector w and a nonlinear activation function:

y = sign(sum_{i=0..n} w_i x_i - mu_k)
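A sketch of this neuron in Python with illustrative weights and bias:

def neuron(x, w, mu_k):
    s = sum(wi * xi for wi, xi in zip(w, x)) - mu_k   # weighted sum minus bias
    return 1 if s >= 0 else -1                        # sign activation

print(neuron([1.0, 2.0], [0.5, -0.25], 0.1))          # -1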
Backpropagation formulas:

Output layer error: Err_j = O_j (1 - O_j)(T_j - O_j)
Hidden layer error: Err_j = O_j (1 - O_j) sum_k Err_k w_jk
Weight update: w_ij = w_ij + (l) Err_j O_i
Bias update: theta_j = theta_j + (l) Err_j
Unit output: O_j = 1 / (1 + e^{-I_j}), with input I_j = sum_i w_ij O_i + theta_j
The input layer receives the input vector X.
How a Multi-Layer Neural Network Works

Inputs are fed simultaneously into the units making up the input layer. The weighted outputs of the last hidden layer are input to the units making up the output layer, which emits the network's prediction.
Backpropagation

Steps:
Initialize weights (to small random numbers) and biases in the network
Propagate the inputs forward (by applying the activation function)
Backpropagate the error (by updating weights and biases)
Terminating condition (when error is very small, etc.)
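A minimal sketch of these steps for one hidden layer, following the update formulas above (l is the learning rate, theta the biases; XOR is just an illustrative target):

import math, random

random.seed(0)
n_in, n_hid = 2, 2
w1 = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hid)]
w2 = [random.uniform(-0.5, 0.5) for _ in range(n_hid)]
theta1 = [0.0] * n_hid
theta2 = 0.0
l = 0.5
sig = lambda z: 1 / (1 + math.exp(-z))

for _ in range(5000):
    for x, t in [([0,0],0), ([0,1],1), ([1,0],1), ([1,1],0)]:
        # propagate the inputs forward
        h = [sig(sum(w * xi for w, xi in zip(ws, x)) + th)
             for ws, th in zip(w1, theta1)]
        o = sig(sum(w * hi for w, hi in zip(w2, h)) + theta2)
        # backpropagate the error
        err_o = o * (1 - o) * (t - o)                   # output layer
        err_h = [hi * (1 - hi) * err_o * w              # hidden layer
                 for hi, w in zip(h, w2)]
        # update weights and biases
        w2 = [w + l * err_o * hi for w, hi in zip(w2, h)]
        theta2 += l * err_o
        for j in range(n_hid):
            w1[j] = [w + l * err_h[j] * xi for w, xi in zip(w1[j], x)]
            theta1[j] += l * err_h[j]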
Backpropagation and Interpretability
Support Vector Machines (SVM)
Applications: classification and numeric prediction, e.g., handwritten digit recognition, object recognition, and speaker identification
SVM: General Philosophy

[Figure: two separating hyperplanes, one with a small margin and one with a large margin; the circled training tuples on the margin boundaries are the support vectors.]
Let data D be (X1, y1), ..., (X|D|, y|D|), where Xi is the set of training tuples with associated class labels yi.
There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one: the one that minimizes classification error on unseen data.
SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH).
SVM: Linearly Separable

The sides defining the margin satisfy:
H1: w0 + w1 x1 + w2 x2 >= 1 for yi = +1
H2: w0 + w1 x1 + w2 x2 <= -1 for yi = -1
Finding the MMH is a constrained (convex) quadratic optimization problem, solvable by Quadratic Programming (QP).
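A hedged sketch using scikit-learn's SVC (an assumed dependency; any QP-based trainer would do) on a toy linearly separable set:

from sklearn.svm import SVC

X = [[1, 1], [2, 2], [2, 0],       # class +1
     [-1, -1], [-2, -2], [-2, 0]]  # class -1
y = [1, 1, 1, -1, -1, -1]

clf = SVC(kernel="linear", C=1e6)  # very large C approximates a hard margin
clf.fit(X, y)
print(clf.support_vectors_)        # the tuples that define the MMH
print(clf.predict([[0.5, 0.5]]))   # [1]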
SVM: Linearly Inseparable

Transform the original input data into a higher dimensional space, then search for a linear separating hyperplane in the new space.
SVM: Kernel Functions
Selective de-clustering to ensure high accuracy
SVM vs. Neural Network

SVM:
Deterministic algorithm
Nice generalization properties

Neural Network:
Relatively old
Nondeterministic algorithm
Generalizes well but doesn't have a strong mathematical foundation
Can easily be learned in incremental fashion
To learn complex functions, use a multilayer perceptron (not that trivial)
SVM website: http://www.kernel-machines.org/
Representative implementations: LIBSVM, SVM-light, SVM-torch
SVM: Introductory Literature

C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998.
More accessible than Vapnik's book, but still hard reading for an introduction, and the examples are not intuitive.
Associative Classification

Associative classification: association rules are generated and analyzed for use in classification; classification is based on strong associations between frequent patterns and class labels.
Why effective? It explores highly confident associations among multiple attributes, which may overcome constraints introduced by decision tree induction, which considers only one attribute at a time.
Typical associative classification methods:
CMAR (Classification based on Multiple Association Rules: Li, Han & Pei, ICDM'01)
CPAR (Classification based on Predictive Association Rules: Yin & Han, SDM'03)
RCBT (Mining top-k covering rule groups for gene expression data: Cong et al., SIGMOD'05)
CMAR (Classification based on Multiple Association Rules: Li, Han & Pei, ICDM'01)

Efficiency: uses an enhanced FP-tree that maintains the distribution of class labels among tuples satisfying each frequent itemset
Rule pruning whenever a rule is inserted into the tree:
  Given two rules R1 and R2, if the antecedent of R1 is more general than that of R2 and conf(R1) >= conf(R2), then R2 is pruned
  Prunes rules for which the rule antecedent and class are not positively correlated, based on a chi-squared test of statistical significance
Classification based on generated/pruned rules:
  If only one rule satisfies tuple X, assign the class label of the rule
  If a rule set S satisfies X, CMAR divides S into groups by class label and assigns X the class label of the strongest group (using a weighted chi-squared measure)
Lazy Learners (Learning from Your Neighbors)
Instance-based learning: store training examples and delay the processing (lazy evaluation) until a new instance must be classified.
Typical approach: the k-nearest neighbor approach.
The k-Nearest Neighbor Algorithm

[Figure: query point x_q surrounded by '+' and '-' training points; the k nearest neighbors vote on its class.]
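A minimal sketch of k-NN classification with squared Euclidean distance; the training points and k are illustrative:

from collections import Counter

def knn_predict(train, xq, k=3):
    """train: list of (point, label); xq: query point."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = sorted(train, key=lambda p: dist(p[0], xq))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((0, 0), "-"), ((0, 1), "-"), ((2, 2), "+"), ((3, 2), "+"), ((2, 3), "+")]
print(knn_predict(train, (2.5, 2.5)))   # '+'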
Methodology
Challenges
Other Classification Methods
Fuzzy Set Approaches
What Is Prediction?
Linear Regression

Linear model: y = w0 + w1 x. The regression coefficients are estimated by the method of least squares:

w1 = sum_{i=1..|D|} (x_i - xbar)(y_i - ybar) / sum_{i=1..|D|} (x_i - xbar)^2
w0 = ybar - w1 xbar

Multiple linear regression: involves more than one predictor variable; training data is of the form (X1, y1), (X2, y2), ..., (X|D|, y|D|).
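A sketch of the closed-form fit; the (x_i, y_i) pairs are illustrative:

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
     / sum((x - x_bar) ** 2 for x in xs)
w0 = y_bar - w1 * x_bar
print(w0, w1)   # ~0.11 and ~1.97, i.e., y = 0.11 + 1.97 x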
Nonlinear Regression
Predictive modeling steps:
Minimal generalization
Determine the major factors which influence the prediction: data relevance analysis (uncertainty measurement, entropy analysis, expert judgement, etc.)
Multi-level prediction: drill-down and roll-up analysis
Classifier Accuracy Measures

Confusion matrix for a two-class problem:

            predicted C1     predicted C2
actual C1   true positive    false negative
actual C2   false positive   true negative

Example:

classes              buy_computer = yes  buy_computer = no  total  recognition(%)
buy_computer = yes   6954                46                 7000   99.34
buy_computer = no    412                 2588               3000   86.27
total                7366                2634               10000  95.42

Accuracy of a classifier M, acc(M): percentage of test set tuples that are correctly classified by the model M
Error rate (misclassification rate) of M = 1 - acc(M)
Given m classes, CM_{i,j}, an entry in a confusion matrix, indicates the # of tuples in class i that are labeled by the classifier as class j

Alternative accuracy measures (e.g., for cancer diagnosis):
sensitivity = t-pos/pos          /* true positive recognition rate */
specificity = t-neg/neg          /* true negative recognition rate */
precision = t-pos/(t-pos + f-pos)
accuracy = sensitivity * pos/(pos + neg) + specificity * neg/(pos + neg)
This model can also be used for cost-benefit analysis
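A sketch computing these measures from the confusion matrix above:

t_pos, f_neg = 6954, 46     # actual buy_computer = yes (pos = 7000)
f_pos, t_neg = 412, 2588    # actual buy_computer = no  (neg = 3000)
pos, neg = t_pos + f_neg, f_pos + t_neg

accuracy    = (t_pos + t_neg) / (pos + neg)    # 0.9542
sensitivity = t_pos / pos                      # 0.9934
specificity = t_neg / neg                      # 0.8627
precision   = t_pos / (t_pos + f_pos)          # 0.9441
print(accuracy, sensitivity, specificity, precision)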
Predictor Error Measures

Loss functions measure the error between y_i and the predicted value y_i':
Absolute error: |y_i - y_i'|
Squared error: (y_i - y_i')^2

Test error (generalization error): the average loss over the test set:
Mean absolute error: (1/d) sum_{i=1..d} |y_i - y_i'|
Mean squared error: (1/d) sum_{i=1..d} (y_i - y_i')^2
Relative absolute error: sum_{i=1..d} |y_i - y_i'| / sum_{i=1..d} |y_i - ybar|
Relative squared error: sum_{i=1..d} (y_i - y_i')^2 / sum_{i=1..d} (y_i - ybar)^2
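A sketch of the absolute and squared test-error measures (plus the commonly reported square root of the mean squared error) on illustrative predictions:

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
d = len(y_true)

mae  = sum(abs(a - b) for a, b in zip(y_true, y_pred)) / d   # 0.75
mse  = sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / d # 0.875
rmse = mse ** 0.5                                            # ~0.94
print(mae, mse, rmse)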
Holdout method
Given data is randomly partitioned into two independent sets:
Training set (e.g., 2/3) for model construction
Test set (e.g., 1/3) for accuracy estimation
Bootstrap
Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data ends up in the bootstrap sample, and the remaining 36.8% forms the test set (since (1 - 1/d)^d ~ e^{-1} = 0.368 for large d).
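A sketch of the sampling step; the ~0.632 fraction emerges empirically:

import random

random.seed(1)
d = 1000
data = list(range(d))
train = [random.choice(data) for _ in range(d)]   # d draws with replacement
test = set(data) - set(train)                     # tuples never drawn
print(len(set(train)) / d)                        # ~0.632 distinct tuples in training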
Ensemble Methods

Use a combination of models to increase accuracy: combine a series of k learned models, M1, M2, ..., Mk, with the aim of creating an improved model M*.

Popular ensemble methods:
Bagging: averaging the prediction over a collection of classifiers
Boosting: weighted vote with a collection of classifiers
Ensemble: combining a set of heterogeneous classifiers
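A minimal sketch of bagging by majority vote; base_learner stands for any train(D) -> model function (e.g., a decision-tree learner) and is an assumption here:

import random
from collections import Counter

def bagging(D, base_learner, k=10):
    models = []
    for _ in range(k):
        sample = [random.choice(D) for _ in range(len(D))]  # bootstrap sample D_i
        models.append(base_learner(sample))
    def M_star(x):
        # each model votes; return the majority class
        return Counter(m(x) for m in models).most_common(1)[0][0]
    return M_star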
Bagging: Bootstrap Aggregation
Boosting
The weight of classifier Mi's vote is

log((1 - error(Mi)) / error(Mi))
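A sketch of the vote-weight formula (natural log here; the slide does not fix the base):

import math

def vote_weight(error):
    return math.log((1 - error) / error)

print(vote_weight(0.1))   # ~2.20: low-error classifiers get larger votes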
Model Selection: ROC Curves
The vertical axis represents the true positive rate; the horizontal axis represents the false positive rate. The plot also shows a diagonal line. A model with perfect accuracy will have an area of 1.0 under the ROC curve.
Summary (I)
Classification and prediction are two forms of data analysis used to extract models describing important data classes or to predict future data trends. Effective and scalable methods have been developed, including decision tree induction, Bayesian classification, rule-based classification, backpropagation, Support Vector Machines, associative classification, and nearest neighbor classifiers.
Summary (II)
Significance tests and ROC curves are useful for model selection
No single method has been found to be superior over all others for
all data sets