Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber, All rights reserved
Typical applications: target marketing, medical diagnosis, fraud detection (e.g., predicting whether a fraudulent transaction will occur)
If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
Process (1): Model Construction
[Figure: Training Data are fed to Classification Algorithms, which produce a Classifier (the model)]

Process (2): Using the Model in Prediction

[Figure: the Classifier is applied first to Testing Data and then to Unseen Data, e.g., the tuple (Jeff, Professor, 4) with the question Tenured?]

Testing data:

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes
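As a minimal sketch of the two-step process in Python (scikit-learn is an assumption here; the slides do not prescribe a library), a classifier can be constructed from the labeled tuples above and then used on the unseen tuple:

# Hypothetical sketch of model construction and model usage.
from sklearn.tree import DecisionTreeClassifier

# Encode rank as an integer feature: (rank, years) -> tenured.
rank_codes = {"Assistant Prof": 0, "Associate Prof": 1, "Professor": 2}
X_train = [[rank_codes["Assistant Prof"], 2],
           [rank_codes["Associate Prof"], 7],
           [rank_codes["Professor"], 5],
           [rank_codes["Assistant Prof"], 7]]
y_train = ["no", "no", "yes", "yes"]

# Step 1: model construction -- learn a classifier from labeled tuples.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage -- classify the unseen tuple (Jeff, Professor, 4).
print(model.predict([[rank_codes["Professor"], 4]]))  # e.g., ['yes']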
Supervised vs. Unsupervised Learning
Issues: Data Preparation
Data cleaning: preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection): remove the irrelevant or redundant attributes
Data transformation: generalize and/or normalize data
Accuracy — classifier accuracy: how well the model predicts the class label
[Figure: decision tree for buys_computer; the root tests age? with branches <=30, 31..40, and >40 leading toward leaves no, yes, and yes]
Basic algorithm (a greedy algorithm): the tree is constructed in a top-down recursive divide-and-conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized in advance)
Examples are partitioned recursively based on selected attributes
Attribute selection by information gain:
Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i), \quad Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j), \quad Gain(A) = Info(D) - Info_A(D)
but gini_{income ∈ {medium,high}}(D) is 0.30 and is thus the best since it is the lowest
All attributes are assumed continuous-valued
May need other tools, e.g., clustering, to get the possible split values
Can be modified for categorical attributes
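A small Python sketch of these attribute-selection measures (the function names are illustrative, not from the slides); the printed numbers reproduce the buys_computer example used below, where D has 9 "yes" and 5 "no" tuples:

# Sketch: computing Info(D), Info_A(D), and gini(D) from class counts.
from math import log2

def info(counts):
    # Expected information (entropy): Info(D) = -sum p_i log2(p_i).
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def info_after_split(partitions):
    # Info_A(D) = sum over partitions of |Dj|/|D| * Info(Dj).
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(p) for p in partitions)

def gini(counts):
    # gini(D) = 1 - sum p_j^2.
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(info([9, 5]))    # ~0.940
# Splitting on age: <=30 -> [2 yes, 3 no], 31..40 -> [4, 0], >40 -> [3, 2].
print(info_after_split([[2, 3], [4, 0], [3, 2]]))  # ~0.694, so Gain(age) ~ 0.246
print(gini([9, 5]))    # ~0.459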
For a continuous-valued attribute, P(x_k|C_i) is usually computed based on a Gaussian distribution with mean \mu and standard deviation \sigma:
g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad P(x_k \mid C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})
Naïve Bayesian Classifier: Training Dataset
Classes: C1: buys_computer = 'yes'; C2: buys_computer = 'no'
Data sample to classify: X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Naïve Bayesian Classifier: An Example
P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
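The remaining conditional probabilities can be read off the table by counting; a short Python sketch of the full computation for X (the counts are taken directly from the 14 tuples above):

# Sketch: naive Bayesian classification of X = (age<=30, income=medium,
# student=yes, credit_rating=fair) by direct counting from the table.
p_yes = 9 / 14   # P(buys_computer = "yes")
p_no = 5 / 14    # P(buys_computer = "no")

# P(x_k | Ci), one factor per attribute value of X:
px_yes = (2/9) * (4/9) * (6/9) * (6/9)  # age<=30, medium, student, fair | yes
px_no  = (3/5) * (2/5) * (1/5) * (2/5)  # the same attribute values | no

print(px_yes * p_yes)  # ~0.028
print(px_no * p_no)    # ~0.007  -> predict buys_computer = "yes"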
Advantages: easy to implement, and good results are obtained in most cases, often comparable to more sophisticated counterparts
Disadvantages
Assumption: class conditional independence, therefore loss of accuracy
Practically, dependencies exist among variables, and these cannot be modeled by a naïve Bayesian classifier
How to deal with these dependencies? Bayesian belief networks
Bayesian Belief Networks
Classification: predicts categorical class labels
Example features: x1 = the number of occurrences of the word "homepage", x2 = the number of occurrences of the word "welcome"
Mathematically: x ∈ X = ℝⁿ, y ∈ Y = {+1, −1}, and we want a function f: X → Y
Advantages: prediction accuracy is generally high
Two classic linear learners differ in their weight updates (sketched below):
Perceptron: update W additively
Winnow: update W multiplicatively
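A minimal sketch of the two update rules (illustrative Python with numpy; the function names and learning parameters are assumptions, x is a feature vector and y ∈ {+1, −1} its label):

import numpy as np

def perceptron_update(w, x, y, lr=0.1):
    # Perceptron: on a mistake, update W additively: w <- w + lr * y * x.
    if y * np.sign(w @ x) <= 0:
        w = w + lr * y * x
    return w

def winnow_update(w, x, y, alpha=2.0):
    # Winnow: on a mistake, update W multiplicatively (x binary, w > 0):
    # promote (multiply by alpha) or demote (divide) the active features.
    if y * np.sign(w @ x - len(x) / 2) <= 0:
        w = w * alpha ** (y * x)
    return w

The additive rule nudges w toward misclassified positives; the multiplicative rule rescales only the weights of active features, which is known to converge much faster when most features are irrelevant.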
[Figure: a neuron (perceptron): an n-dimensional input vector x = (x0, …, xn) is combined with a weight vector w = (w0, …, wn) and a bias −μk into a weighted sum, which is mapped through the activation function f onto the output y]

For example: y = sign(\sum_{i=0}^{n} w_i x_i - \mu_k)
[Figure: a multi-layer feed-forward network with an input layer (input vector X), a hidden layer, and an output layer (output vector)]

Net input and output of unit j:
I_j = \sum_i w_{ij} O_i + \theta_j, \quad O_j = \frac{1}{1 + e^{-I_j}}

Error of a unit j in the output layer:
Err_j = O_j (1 - O_j)(T_j - O_j)

Error of a unit j in a hidden layer, backpropagated from the units k of the next layer:
Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk}

Weight and bias updates, with learning rate l:
w_{ij} = w_{ij} + (l)\, Err_j\, O_i, \quad \theta_j = \theta_j + (l)\, Err_j
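A sketch of one backpropagation step implementing these formulas with numpy (the network shapes, function names, and learning rate are illustrative assumptions):

import numpy as np

def sigmoid(I):
    return 1.0 / (1.0 + np.exp(-I))          # O_j = 1 / (1 + e^{-I_j})

def backprop_step(X, T, W1, b1, W2, b2, l=0.5):
    # Forward pass: I_j = sum_i w_ij O_i + theta_j, then O_j = sigmoid(I_j).
    O_hidden = sigmoid(X @ W1 + b1)           # X: (n_in,), W1: (n_in, n_hid)
    O_out = sigmoid(O_hidden @ W2 + b2)       # W2: (n_hid, n_out)
    # Output-layer error: Err_j = O_j (1 - O_j)(T_j - O_j).
    err_out = O_out * (1 - O_out) * (T - O_out)
    # Hidden-layer error: Err_j = O_j (1 - O_j) sum_k Err_k w_jk.
    err_hidden = O_hidden * (1 - O_hidden) * (err_out @ W2.T)
    # Updates: w_ij += l * Err_j * O_i and theta_j += l * Err_j.
    W2 += l * np.outer(O_hidden, err_out)
    b2 += l * err_out
    W1 += l * np.outer(X, err_hidden)
    b1 += l * err_hidden
    return O_out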
How Does a Multi-Layer Neural Network Work?
Let the data D be (X1, y1), …, (X|D|, y|D|), where each Xi is a training tuple and yi is its associated class label
There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., maximum
marginal hyperplane (MMH)
SVM—Linearly Inseparable
Transform the original input data into a higher-dimensional space, then search for a linear separating hyperplane in the new space
SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional user parameters)
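A hedged usage sketch with scikit-learn's SVC (the library choice and the toy data are assumptions, not from the slides); here the RBF kernel plays the role of the implicit mapping into a higher-dimensional space:

from sklearn.svm import SVC

X = [[0, 0], [1, 1], [1, 0], [0, 1]]   # toy XOR data, linearly inseparable
y = [-1, -1, +1, +1]

# The RBF kernel implicitly maps the data into a higher-dimensional space
# where a linear separating hyperplane can be found.
clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print(clf.predict([[0.9, 0.1]]))       # e.g., [1]

SVC also handles multi-class problems internally (via one-vs-one voting), matching the note above about classifying more than two classes.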
SVM vs. Neural Network
SVM: deterministic algorithm; nice generalization properties; hard to learn – learned in batch mode using quadratic programming techniques
Neural network: nondeterministic algorithm; generalizes well but doesn't have a strong mathematical foundation; can easily be learned in incremental fashion
SVM Website
http://www.kernel-machines.org/
Representative implementations
CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM’01)
Efficiency: Uses an enhanced FP-tree that maintains the distribution of
class labels among tuples satisfying each frequent itemset
Rule pruning whenever a rule is inserted into the tree: given two rules R1 and R2, if the antecedent of R1 is more general than that of R2 and conf(R1) ≥ conf(R2), then R2 is pruned
Instance-based learning:
Store training examples and delay the processing
(“lazy evaluation”) until a new instance must be
classified
Typical approaches
k-nearest neighbor approach: instances represented as points in a Euclidean space
Locally weighted regression: constructs a local approximation
Case-based reasoning: uses symbolic representations and knowledge-based inference
The k-Nearest Neighbor Algorithm
All instances correspond to points in the n-D space
The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
The target function could be discrete- or real-valued
For discrete-valued, k-NN returns the most common
value among the k training examples nearest to xq
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
[Figure: a query point xq surrounded by positive (+) and negative (−) training examples, illustrating the decision surface induced by 1-NN]
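A minimal pure-Python sketch of the algorithm (function names and the toy data are illustrative):

# Sketch: k-NN classification by Euclidean distance.
from collections import Counter
from math import dist   # Euclidean distance, Python 3.8+

def knn_classify(xq, training, k=3):
    # Return the most common label among the k examples nearest to xq.
    neighbors = sorted(training, key=lambda ex: dist(xq, ex[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [((1, 1), "+"), ((2, 1), "+"), ((5, 5), "-"), ((6, 5), "-"), ((5, 6), "-")]
print(knn_classify((1.5, 1.0), train, k=3))   # "+"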
Discussion on the k-NN Algorithm
Linear regression: y = w_0 + w_1 x, with the coefficients estimated by the method of least squares:

w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}, \quad w_0 = \bar{y} - w_1 \bar{x}

Non-linear regression: polynomial models, e.g., y = w_0 + w_1 x + w_2 x^2 + w_3 x^3, can be transformed into linear form and solved the same way
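A short sketch of the least-squares formulas in Python (the data values are made up for illustration):

# Sketch: least-squares estimates of w0 and w1 for y = w0 + w1 * x.
def linear_regression(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    # w1 = sum (xi - x_bar)(yi - y_bar) / sum (xi - x_bar)^2
    w1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    w0 = y_bar - w1 * x_bar   # w0 = y_bar - w1 * x_bar
    return w0, w1

print(linear_regression([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]))  # ~ (0.15, 1.94)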
Random subsampling: a variation of holdout; repeat the holdout k times, and take accuracy = avg. of the accuracies obtained
Cross-validation (k-fold, where k = 10 is most popular)
Randomly partition the data into k mutually exclusive subsets, each of approximately equal size; at the i-th iteration, use D_i as the test set and the remaining subsets as the training set
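A sketch of k-fold cross-validation in plain Python (the train_and_test callable is a hypothetical stand-in for any classifier's train-and-evaluate step):

import random

def k_fold_accuracy(data, train_and_test, k=10, seed=0):
    # Partition data into k mutually exclusive subsets; at the i-th
    # iteration use fold D_i as the test set and the rest for training.
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    accs = []
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        accs.append(train_and_test(train, test))   # accuracy on fold i
    return sum(accs) / k                           # overall accuracy estimate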
Ensemble methods
Use a combination of models to increase accuracy
Bagging: averaging the prediction over a collection of classifiers
Boosting: weighted vote with a collection of classifiers
The bagged classifier M* counts the votes and assigns the class with the most votes to X
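A sketch of bagging with bootstrap samples and majority voting (scikit-learn's DecisionTreeClassifier is an assumed base learner; names are illustrative):

import random
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagged_predict(X_train, y_train, x, rounds=25, seed=0):
    rng = random.Random(seed)
    n = len(X_train)
    votes = []
    for _ in range(rounds):
        idx = [rng.randrange(n) for _ in range(n)]     # bootstrap sample D_i
        model = DecisionTreeClassifier().fit(
            [X_train[i] for i in idx], [y_train[i] for i in idx])
        votes.append(model.predict([x])[0])            # each model M_i votes
    return Counter(votes).most_common(1)[0][0]         # M* takes the majority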