
Classification and Prediction

What is classification? What is prediction?


Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification
Prediction
Linear Regression
Non-linear regression

Classification vs. Prediction

Classification:
predicts categorical class labels (discrete or nominal)
constructs a model from the training set and the class labels of a classifying attribute, then uses it to classify new data
Prediction:
models continuous-valued functions, i.e., predicts
unknown or missing values
Typical Applications
credit approval
target marketing
medical diagnosis
treatment effectiveness analysis

Classification: A Two-Step Process

Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
The set of tuples used for model construction is the training set
The model is represented as classification rules, decision trees, or mathematical formulae

Model usage: classifying future or unknown objects
Estimate the accuracy of the model
The known label of each test sample is compared with the classified result from the model
Accuracy rate is the percentage of test set samples that are correctly classified by the model
The test set is independent of the training set; otherwise overfitting will occur
If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
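A minimal sketch of this two-step process in Python with scikit-learn (the synthetic dataset and the choice of a decision tree classifier are illustrative assumptions, not part of the slides):

# Step 1: construct a model from a training set; Step 2: estimate its
# accuracy on an independent test set before classifying unknown tuples.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=4, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)        # model construction
accuracy = accuracy_score(y_test, model.predict(X_test))      # model usage: accuracy estimate
print(f"test accuracy: {accuracy:.2f}")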

Classification Process (1): Model Construction

Training data:

NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

A classification algorithm is run over the training data to produce the classifier (model), e.g.:

IF rank = 'Professor' OR years > 6
THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction

Testing data:

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

The classifier is first evaluated on the testing data, then applied to unseen data, e.g. (Jeff, Professor, 4): tenured? The learned rule predicts 'yes'.
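The learned classifier from the previous slide can be written directly as a rule; a tiny illustrative sketch applying it to the unseen tuple (Jeff, Professor, 4):

def tenured(rank, years):
    # IF rank = 'Professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "Professor" or years > 6 else "no"

print(tenured("Professor", 4))   # Jeff -> 'yes'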

Supervised vs. Unsupervised Learning

Supervised learning (classification)
Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
New data is classified based on the training set

Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data


Issues Regarding Classification and Prediction (1): Data Preparation

Data cleaning
Preprocess the data in order to reduce noise and handle missing values

Relevance analysis (feature selection)
Remove irrelevant or redundant attributes

Data transformation
Generalize and/or normalize data
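A brief preprocessing sketch with pandas and scikit-learn (the column names and the median fill strategy are illustrative assumptions):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"age": [25, None, 47, 52], "income": [30000, 42000, None, 81000]})

# Data cleaning: fill missing values (here, with the column median)
df = df.fillna(df.median(numeric_only=True))

# Data transformation: normalize numeric attributes to [0, 1]
df[["age", "income"]] = MinMaxScaler().fit_transform(df[["age", "income"]])
print(df)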


Issues Regarding Classification and Prediction (2): Evaluating Classification Methods

Predictive accuracy
Speed and scalability
time to construct the model
time to use the model
Robustness
handling noise and missing values
Scalability
efficiency in disk-resident databases
Interpretability:
understanding and insight provided by the model
Goodness of rules
decision tree size
compactness of classification rules


Training Dataset

This follows an example from Quinlan's ID3.

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31..40   high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31..40   low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31..40   medium   no        excellent       yes
31..40   high     yes       fair            yes
>40      medium   no        excellent       no

Output: A Decision Tree for buys_computer

age?
  <=30    -> student?
               no  -> no
               yes -> yes
  31..40  -> yes
  >40     -> credit_rating?
               excellent -> no
               fair      -> yes
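For reference, a hedged sketch that fits a decision tree to the buys_computer table above using scikit-learn. scikit-learn grows CART-style binary trees over one-hot encoded attributes, so the resulting tree may differ in shape from the ID3 tree shown, even though it is trained on the same data:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rows = [
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
df = pd.DataFrame(rows, columns=["age", "income", "student", "credit_rating", "buys_computer"])
X = pd.get_dummies(df.drop(columns="buys_computer"))   # one-hot encode categorical attributes
y = df["buys_computer"]
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(tree.score(X, y))   # accuracy on the training data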

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm; see the sketch after this list)
The tree is constructed in a top-down, recursive, divide-and-conquer manner
At the start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized in advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning: majority voting is employed for classifying the leaf
There are no samples left
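A compact, illustrative sketch of this greedy top-down loop (not the exact ID3/C4.5 implementation): each call selects the attribute with the highest information gain, partitions the examples, and recurses until a stopping condition holds.

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def build_tree(data, attributes, target):
    labels = [row[target] for row in data]
    if len(set(labels)) == 1:                 # all samples belong to the same class
        return labels[0]
    if not attributes:                        # no attributes left: majority voting
        return Counter(labels).most_common(1)[0][0]

    def gain(a):                              # information gain of splitting on a
        groups = Counter(row[a] for row in data)
        remainder = sum(count / len(data) *
                        entropy([r[target] for r in data if r[a] == value])
                        for value, count in groups.items())
        return entropy(labels) - remainder

    best = max(attributes, key=gain)          # heuristic attribute selection
    rest = [a for a in attributes if a != best]
    return {(best, value): build_tree([r for r in data if r[best] == value], rest, target)
            for value in {row[best] for row in data}}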


Decision Trees

Rules for classifying data using attributes.
The tree consists of decision nodes and leaf nodes.
A decision node has two or more branches, each representing values for the attribute tested.
A leaf node produces a homogeneous result (all samples in one class), which requires no additional classification testing.

Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain.

Let S contain s_i tuples of class C_i, for i = 1, ..., m. The expected information needed to classify an arbitrary tuple is

I(s_1, s_2, ..., s_m) = - \sum_{i=1}^{m} (s_i / s) \log_2(s_i / s)

The entropy of attribute A with values {a_1, a_2, ..., a_v} is

E(A) = \sum_{j=1}^{v} ((s_{1j} + ... + s_{mj}) / s) \cdot I(s_{1j}, ..., s_{mj})

The information gained by branching on attribute A is

Gain(A) = I(s_1, s_2, ..., s_m) - E(A)
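The same measures written as small Python helpers in the slide's notation, where counts is the list of class counts [s_1, ..., s_m] and splits maps each value a_j of attribute A to its class counts [s_1j, ..., s_mj] (an illustrative sketch):

from math import log2

def I(counts):                       # expected information I(s1, ..., sm)
    s = sum(counts)
    return -sum(si / s * log2(si / s) for si in counts if si > 0)

def E(splits, s):                    # entropy of attribute A
    return sum(sum(sj) / s * I(sj) for sj in splits.values())

def gain(counts, splits):            # Gain(A) = I(s1, ..., sm) - E(A)
    return I(counts) - E(splits, sum(counts))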

Entropy

In thermodynamics, entropy is a measure of how ordered or disordered a system is.
In information theory, entropy is a measure of how certain or uncertain the value of a random variable is (or will be).
The degree of randomness depends on the number of possible values and the total size of the set.

Information Gain

IG calculates the effective change in entropy after making a decision based on the value of an attribute.
For decision trees, it is ideal to base decisions on the attribute that provides the largest change in entropy, i.e., the attribute with the highest gain.

Attribute Selection by Information Gain Computation

Class P: buys_computer = 'yes' (p = 9)
Class N: buys_computer = 'no' (n = 5)
I(p, n) = I(9, 5) = 0.940

Compute the entropy for age (using the training dataset above):

age      p_i   n_i   I(p_i, n_i)
<=30     2     3     0.971
31..40   4     0     0
>40      3     2     0.971

E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Here (5/14) I(2,3) means that age <=30 covers 5 of the 14 samples, with 2 yes's and 3 no's. Hence

Gain(age) = I(p, n) - E(age) = 0.246

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
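Plugging the numbers from this example into the I/E/gain helpers sketched earlier reproduces the slide's values:

s = [9, 5]                                             # 9 'yes', 5 'no'
age_splits = {"<=30": [2, 3], "31..40": [4, 0], ">40": [3, 2]}
print(round(I(s), 3))                                  # 0.940
print(round(E(age_splits, 14), 3))                     # 0.694
print(round(gain(s, age_splits), 3))                   # 0.246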

Generating Classification Rules

Disadvantages of using ID3 (Iterative Dichotomiser 3)

Data may be over-fitted or over-classified if a small sample is tested.
Only one attribute at a time is tested for making a decision.
Classifying continuous data may be computationally expensive, as many trees must be generated to see where to break the continuum.

Other Attribute Selection Measures

Gini index (CART, IBM IntelligentMiner)
All attributes are assumed continuous-valued
Assume there exist several possible split values for each attribute
May need other tools, such as clustering, to get the possible split values
Can be modified for categorical attributes


Gini Index (IBM IntelligentMiner)

If a data set T contains examples from n classes, the gini index gini(T) is defined as

gini(T) = 1 - \sum_{j=1}^{n} p_j^2

where p_j is the relative frequency of class j in T.

If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as

gini_split(T) = (N1 / N) gini(T1) + (N2 / N) gini(T2)

The attribute that provides the smallest gini_split(T) is chosen to split the node (one needs to enumerate all possible splitting points for each attribute).
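A small illustrative sketch of these definitions, using the buys_computer classes split on age <= 30 versus the rest (a binary split in the CART style):

def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(left, right):
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

left = ["no", "no", "no", "yes", "yes"]      # age <= 30: 2 yes, 3 no
right = ["yes"] * 7 + ["no"] * 2             # age 31..40 and > 40: 7 yes, 2 no
print(round(gini_split(left, right), 3))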

Extracting Classification Rules from Trees

Represent the knowledge in the form of IF-THEN rules
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction
The leaf node holds the class prediction
Rules are easier for humans to understand

Example (matching the decision tree above)
IF age = '<=30' AND student = 'no' THEN buys_computer = 'no'
IF age = '<=30' AND student = 'yes' THEN buys_computer = 'yes'
IF age = '31..40' THEN buys_computer = 'yes'
IF age = '>40' AND credit_rating = 'excellent' THEN buys_computer = 'no'
IF age = '>40' AND credit_rating = 'fair' THEN buys_computer = 'yes'
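Assuming the tree and X from the scikit-learn sketch earlier, the fitted tree's root-to-leaf paths can be printed in a rule-like form with export_text (each indented branch corresponds to one conjunction):

from sklearn.tree import export_text
print(export_text(tree, feature_names=list(X.columns)))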

Avoid Overfitting in Classification

Overfitting: an induced tree may overfit the training data
Too many branches, some of which may reflect anomalies due to noise or outliers
Poor accuracy for unseen samples

Two approaches to avoid overfitting
Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
Difficult to choose an appropriate threshold
Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees
Use a set of data different from the training data to decide which is the best pruned tree (see the sketch below)
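A hedged sketch of postpruning using scikit-learn's cost-complexity pruning: grow the full tree, then keep the pruned version (ccp_alpha) that scores best on data held out from training (the synthetic data is an illustrative assumption):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best = max((DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
            for a in path.ccp_alphas),
           key=lambda t: t.score(X_val, y_val))
print(best.get_depth(), round(best.score(X_val, y_val), 3))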

Approaches to Determine the Final Tree Size

Separate training (2/3) and testing (1/3) sets
Use cross validation, e.g., 10-fold cross validation (see the sketch below)
Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
Use the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized
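For example, 10-fold cross validation with scikit-learn (synthetic data as an illustrative stand-in for the real training set):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(round(scores.mean(), 3), round(scores.std(), 3))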


Enhancements to basic decision tree induction

Allow for continuous-valued attributes
Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals

Handle missing attribute values
Assign the most common value of the attribute
Assign a probability to each of the possible values

Attribute construction
Create new attributes based on existing ones that are sparsely represented
This reduces fragmentation, repetition, and replication

Classification in Large Databases

Classification: a classical problem extensively studied by statisticians and machine learning researchers
Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
Why decision tree induction in data mining?
relatively fast learning speed (compared with other classification methods)
convertible to simple, easy-to-understand classification rules
can use SQL queries for accessing databases
comparable classification accuracy with other methods

Classification and Prediction

What is classification? What is prediction?


Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification
Classification by Neural Networks
Classification by Support Vector Machines (SVM)
Classification based on concepts from association
rule mining
Other Classification Methods
Prediction
Classification accuracy
Summary


Bayesian Classification: Why?

Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities
Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured


Bayesian Theorem: Basics

Let X be a data sample whose class label is unknown
Let H be a hypothesis that X belongs to class C
For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X
P(H): prior probability of hypothesis H (i.e., the initial probability before we observe any data; reflects background knowledge)
P(X): probability that the sample data is observed
P(X|H): probability of observing the sample X, given that the hypothesis holds


Bayesian Theorem

Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

P(H|X) = P(X|H) P(H) / P(X)

Informally, this can be written as
posterior = likelihood x prior / evidence

MAP (maximum a posteriori) hypothesis:

h_MAP = argmax_{h in H} P(h|D) = argmax_{h in H} P(D|h) P(h)

Practical difficulty: requires initial knowledge of many probabilities; significant computational cost


Naïve Bayes Classifier

A simplifying assumption: attributes are conditionally independent:

P(X | C_i) = \prod_{k=1}^{n} P(x_k | C_i)

The probability of observing, say, two elements y1 and y2 together, given the class C, is the product of the probabilities of each element taken separately given the same class: P([y1, y2] | C) = P(y1 | C) * P(y2 | C)
No dependence relation between attributes
Greatly reduces the computation cost: only count the class distribution
Once the probability P(X | C_i) is known, assign X to the class with the maximum P(X | C_i) * P(C_i)


Training dataset

Classes:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'

Data sample to classify:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

The training data is the same buys_computer table shown earlier.

Naïve Bayesian Classifier: Example

Compute P(X|Ci) for each class:

P(age<=30 | buys_computer=yes) = 2/9 = 0.222
P(age<=30 | buys_computer=no) = 3/5 = 0.6
P(income=medium | buys_computer=yes) = 4/9 = 0.444
P(income=medium | buys_computer=no) = 2/5 = 0.4
P(student=yes | buys_computer=yes) = 6/9 = 0.667
P(student=yes | buys_computer=no) = 1/5 = 0.2
P(credit_rating=fair | buys_computer=yes) = 6/9 = 0.667
P(credit_rating=fair | buys_computer=no) = 2/5 = 0.4

X = (age<=30, income=medium, student=yes, credit_rating=fair)

P(X|Ci):
P(X|buys_computer=yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer=no) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

P(X|Ci) * P(Ci):
P(X|buys_computer=yes) * P(buys_computer=yes) = 0.028
P(X|buys_computer=no) * P(buys_computer=no) = 0.007

Therefore, X belongs to class buys_computer = 'yes'
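The arithmetic above, written out as a short check (the conditional probabilities are the ones read off the training table):

p_yes, p_no = 9 / 14, 5 / 14
px_yes = 0.222 * 0.444 * 0.667 * 0.667          # P(X | buys_computer = yes)
px_no = 0.600 * 0.400 * 0.200 * 0.400           # P(X | buys_computer = no)
print(round(px_yes, 3), round(px_no, 3))                       # 0.044  0.019
print(round(px_yes * p_yes, 3), round(px_no * p_no, 3))        # 0.028  0.007
# The larger product wins, so X is assigned to buys_computer = 'yes'.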

Naïve Bayesian Classifier: Comments

Advantages:
Easy to implement
Good results obtained in most of the cases

Disadvantages:
Assumption of class conditional independence, therefore loss of accuracy
Practically, dependencies exist among variables
E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
Dependencies among these cannot be modeled by the Naïve Bayesian Classifier

How to deal with these dependencies? Bayesian Belief Networks


The k-Nearest Neighbor Algorithm

All instances correspond to points in the n-dimensional space.
The nearest neighbors are defined in terms of Euclidean distance.
The target function can be discrete- or real-valued.
For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to x_q.
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.

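A minimal k-NN sketch: Euclidean distance, majority vote among the k nearest training points (the tiny training set is an illustrative assumption):

from collections import Counter
from math import dist

train = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((3.0, 3.2), "-"), ((3.1, 2.9), "-")]

def knn_predict(xq, k=3):
    nearest = sorted(train, key=lambda point: dist(point[0], xq))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_predict((1.1, 1.0)))   # -> '+'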

Discussion on the k-NN Algorithm

The k-NN algorithm for continuous-valued target functions
Calculate the mean of the values of the k nearest neighbors

Distance-weighted nearest neighbor algorithm
Weight the contribution of each of the k neighbors according to its distance to the query point x_q, giving greater weight to closer neighbors, e.g. w = 1 / d(x_q, x_i)^2
Similarly for real-valued target functions

Robust to noisy data by averaging the k nearest neighbors
Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes
To overcome it, stretch the axes or eliminate the least relevant attributes


Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:

Actual class \ Predicted class    C1                     ¬C1
C1                                True Positives (TP)    False Negatives (FN)
¬C1                               False Positives (FP)   True Negatives (TN)

Example of a Confusion Matrix:

Actual class \ Predicted class   buys_computer = yes   buys_computer = no   Total
buys_computer = yes              6954                  46                   7000
buys_computer = no               412                   2588                 3000
Total                            7366                  2634                 10000

Given m classes, an entry CM(i, j) in a confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j
May have extra rows/columns to provide totals

Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity
Actual \ Predicted   C    ¬C
C                    TP   FN    (P = TP + FN)
¬C                   FP   TN    (N = FP + TN)

Classifier accuracy, or recognition rate: the percentage of test set tuples that are correctly classified
Accuracy = (TP + TN) / All
Error rate: 1 - accuracy, or Error rate = (FP + FN) / All

Class imbalance problem:
One class may be rare, e.g., fraud or HIV-positive
Significant majority of the negative class and minority of the positive class
Sensitivity: true positive recognition rate; Sensitivity = TP / P
Specificity: true negative recognition rate; Specificity = TN / N
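Computing these metrics from the buys_computer confusion matrix above:

TP, FN, FP, TN = 6954, 46, 412, 2588
P, N = TP + FN, FP + TN          # actual positives and negatives
ALL = P + N
print("accuracy   ", (TP + TN) / ALL)   # 0.9542
print("error rate ", (FP + FN) / ALL)   # 0.0458
print("sensitivity", TP / P)            # ~0.993
print("specificity", TN / N)            # ~0.863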

What Is Prediction?

Prediction is similar to classification

First, construct a model

Second, use model to predict unknown value

Major method for prediction is regression

Linear and multiple regression

Non-linear regression

Prediction is different from classification
Classification refers to predicting a categorical class label
Prediction models continuous-valued functions


Predictive Modeling in Databases

Predictive modeling: predict data values or construct generalized linear models based on the database data
One can only predict value ranges or category distributions

Method outline:
Minimal generalization
Attribute relevance analysis
Generalized linear model construction
Prediction

Determine the major factors which influence the prediction
Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.
Multi-level prediction: drill-down and roll-up analysis


Regression Analysis and Log-Linear Models in Prediction

Linear regression: Y = α + β X
The two parameters α and β specify the line and are estimated from the data at hand, using the least squares criterion on the known values Y1, Y2, ... and X1, X2, ...
Multiple regression: Y = b0 + b1 X1 + b2 X2
Many nonlinear functions can be transformed into the above

Log-linear models:
The multi-way table of joint probabilities is approximated by a product of lower-order tables
Probability: p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd
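A brief least-squares sketch for the simple linear case Y = α + βX (the data points are an illustrative assumption):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

beta = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
alpha = y.mean() - beta * x.mean()
print(alpha, beta)            # estimated intercept and slope
print(alpha + beta * 6.0)     # predicted Y for a new X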

