
Classification and Prediction

The Course
[Architecture diagram: data sources (DS) feed a staging database (DP) and a data warehouse (DW), which supports OLAP and data mining (DM) tasks such as association, classification, and clustering.]
DS = Data source
DW = Data warehouse
DM = Data Mining
DP = Staging Database
Chapter Objectives
Learn basic techniques for data classification
and prediction.
Understand the difference between the following:
supervised classification
prediction
unsupervised classification
Chapter Outline
What is classification and prediction of data?
How do we classify data by decision tree induction?
What are neural networks and how can they classify?
What is Bayesian classification?
Are there other classification techniques?
How do we predict continuous values?
What is Classification?
The goal of data classification is to organize and
categorize data in distinct classes.
A model is first created based on the data
distribution.
The model is then used to classify new data.
Given the model, a class can be predicted for new
data.
Classification = prediction for discrete and nominal
values
What is Prediction?
The goal of prediction is to forecast or deduce the value of an
attribute based on the values of other attributes.
A model is first created based on the data distribution.
The model is then used to predict future or unknown values.
In Data Mining:
If forecasting a discrete value: Classification
If forecasting a continuous value: Prediction
Supervised and Unsupervised
Supervised Classification = Classification
We know the class labels and the number of
classes
Unsupervised Classification = Clustering
We do not know the class labels and may not
know the number of classes
Preparing Data Before
Classification
Data transformation:
Discretization of continuous data
Normalization to [-1..1] or [0..1]
Data Cleaning:
Smoothing to reduce noise
Relevance Analysis:
Feature selection to eliminate irrelevant attributes
Applications
Credit approval
Target marketing
Medical diagnosis
Defective parts identification in manufacturing
Crime zoning
Treatment effectiveness analysis
Etc.
Supervised learning process: 3 steps
1. Training data + classification method → classification model
2. Test data + classification model → accuracy
3. New data + classification model → class
Classification is a 3-step process
1. Model construction (Learning):
Each tuple is assumed to belong to a predefined class, as
determined by one of the attributes, called the class label.
The set of all tuples used for construction of the model is
called the training set.
The model is represented in one of the following forms:
Classification rules (IF-THEN statements)
Decision trees
Mathematical formulae
1. Classification Process (Learning)
Training Data:
Name    Income   Age        Credit rating
Samir   Low      <30        bad
Ahmed   Medium   [30..40]   good
Salah   High     <30        good
Ali     Medium   >40        good
Sami    Low      [30..40]   good
Emad    Medium   <30        bad
Classification Method → Classification Model, for example:
IF Income = High OR Age > 30 THEN Class = Good
OR a decision tree
OR a mathematical formula
Classification is a 3-step process
2. Model Evaluation (Accuracy):
Estimate accuracy rate of the model based on a test set.
The known label of test sample is compared with the
classified result from the model.
Accuracy rate is the percentage of test set samples that are
correctly classified by the model.
The test set is independent of the training set, otherwise over-fitting
will occur.
2. Classification Process (Accuracy Evaluation)
Test Data:
Name    Income   Age        Credit rating (class)   Model prediction
Naser   Low      <30        Bad                     Bad
Lutfi   Medium   <30        Bad                     good
Adel    High     >40        good                    good
Fahd    Medium   [30..40]   good                    good
Accuracy = 75%
Classification is a three-step process
3. Model Use (Classification):
The model is used to classify unseen objects.
Give a class label to a new tuple
Predict the value of an actual attribute
3. Classification Process (Use)
New Data:
Name    Income   Age
Adham   Low      <30
Classification Model → Credit rating = ?
Classification Methods
Classification Method
Decision Tree Induction
Neural Networks
Bayesian Classification
Association-Based Classification
K-Nearest Neighbour
Case-Based Reasoning
Genetic Algorithms
Rough Set Theory
Fuzzy Sets
Etc.
Evaluating Classification Methods
Predictive accuracy
Ability of the model to correctly predict the class label
Speed and scalability
Time to construct the model
Time to use the model
Robustness
Handling noise and missing values
Scalability
Efficiency in large databases (not memory resident data)
Interpretability:
The level of understanding and insight provided by the
model
Chapter Outline
What is classification and prediction of data?
How do we classify data by decision tree induction?
What are neural networks and how can they
classify?
What is Bayesian classification?
Are there other classification techniques?
How do we predict continuous values?
Decision Tree
What is a Decision Tree?
A decision tree is a flow-chart-like tree structure.
Internal node denotes a test on an attribute
Branch represents an outcome of the test
All tuples in branch have the same value for the tested
attribute.
Leaf node represents class label or class label
distribution
Sample Decision Tree
[Figure: customers plotted by Age (20-80) vs. Income (2,000-10,000), labeled excellent vs. fair.]
Decision tree: split on Income at 6K; Income >= 6K → YES, Income < 6K → No.
Sample Decision Tree
[Figure: the same customers plotted by Age vs. Income, with a two-level split.]
Decision tree: Income < 6k → NO; Income >= 6k → Age < 50 → Yes, Age >= 50 → NO.
Sample Decision Tree
Outlook    Temp   Humidity   Windy   Play?
sunny      hot    high       FALSE   No
sunny      hot    high       TRUE    No
overcast   hot    high       FALSE   Yes
rainy      mild   high       FALSE   Yes
rainy      cool   normal     FALSE   Yes
rainy      cool   normal     TRUE    No
overcast   cool   normal     TRUE    Yes
sunny      mild   high       FALSE   No
sunny      cool   normal     FALSE   Yes
rainy      mild   normal     FALSE   Yes
sunny      mild   normal     TRUE    Yes
overcast   mild   high       TRUE    Yes
overcast   hot    normal     FALSE   Yes
rainy      mild   high       TRUE    No
http://www-lmmb.ncifcrf.gov/~toms/paper/primer/latex/index.html
http://directory.google.com/Top/Science/Math/Applications/Information_Theory/Papers/
Decision-Tree Classification Methods
The basic top-down decision tree generation
approach usually consists of two phases:
1. Tree construction
At the start, all the training examples are at the root.
Examples are partitioned recursively based on selected
attributes.
2. Tree pruning
Aims at removing tree branches that may reflect noise
in the training data and lead to errors when classifying
test data, thereby improving classification accuracy.
How to Specify Test Condition?
Depends on attribute types
Nominal
Ordinal
Continuous
Depends on number of ways to split
2-way split
Multi-way split
Splitting Based on Nominal Attributes
Multi-way split: Use as many partitions as distinct values.
  e.g., CarType → {Family}, {Sports}, {Luxury}
Binary split: Divides values into two subsets;
need to find the optimal partitioning.
  e.g., CarType → {Sports, Luxury} vs. {Family}, OR {Family, Luxury} vs. {Sports}
Splitting Based on Ordinal Attributes
Multi-way split: Use as many partitions as distinct values.
  e.g., Size → {Small}, {Medium}, {Large}
Binary split: Divides values into two subsets;
need to find the optimal partitioning.
  e.g., Size → {Medium, Large} vs. {Small}, OR {Small, Medium} vs. {Large}
What about this split? Size → {Small, Large} vs. {Medium}
(it groups non-adjacent values and so ignores the order of the attribute)
Splitting Based on Continuous Attributes
Different ways of handling
Discretization to form an ordinal categorical
attribute
Static discretize once at the beginning
Dynamic ranges can be found by equal
interval bucketing, equal frequency bucketing
(percentiles), or clustering.
Binary decision: (A < v) or (A >= v)
consider all possible splits and find the best cut
can be more compute intensive
Tree Induction
Greedy strategy.
Split the records based on an attribute test that
optimizes a certain criterion.
Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
How to determine the Best Split
[Figure: customers (good vs. fair) can be split either by Income (>=10k / <10k) or by Age (young / old); which split is better?]
How to determine the Best Split
Greedy approach:
Nodes with homogeneous class distribution are preferred
Need a measure of node impurity:
50% red, 50% green: high degree of impurity
75% red, 25% green: low degree of impurity
100% red, 0% green: pure
Measures of Node Impurity
Information gain: uses entropy
Gain ratio: uses information gain and SplitInfo
Gini index: used only for binary splits
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-conquer
manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized
in advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning: majority
voting is employed for classifying the leaf
There are no samples left
Classification Algorithms
ID3
Uses information gain
C4.5
Uses Gain Ratio
CART
Uses Gini
Entropy: Used by ID3
Entropy measures the impurity of S
S is a set of examples
p is the proportion of positive examples
q is the proportion of negative examples
Entropy(S) = -p·log2(p) - q·log2(q)
ID3
play / don't play
p_no = 5/14
p_yes = 9/14
Impurity = -p_yes·log2(p_yes) - p_no·log2(p_no)
         = -(9/14)·log2(9/14) - (5/14)·log2(5/14)
         = 0.94 bits
outlook    temperature   humidity   windy   play
sunny      hot           high       FALSE   no
sunny      hot           high       TRUE    no
overcast   hot           high       FALSE   yes
rainy      mild          high       FALSE   yes
rainy      cool          normal     FALSE   yes
rainy      cool          normal     TRUE    no
overcast   cool          normal     TRUE    yes
sunny      mild          high       FALSE   no
sunny      cool          normal     FALSE   yes
rainy      mild          normal     FALSE   yes
sunny      mild          normal     TRUE    yes
overcast   mild          high       TRUE    yes
overcast   hot           normal     FALSE   yes
rainy      mild          high       TRUE    no
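As a minimal sketch (not part of the original slides), the 0.94-bit impurity above can be reproduced with a small entropy helper:

```python
import math

def entropy(counts):
    """Entropy (in bits) of a class distribution given as a list of class counts."""
    total = sum(counts)
    ent = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            ent -= p * math.log2(p)
    return ent

# Weather data: 9 "yes" (play) and 5 "no" (don't play) examples
print(entropy([9, 5]))  # ~0.940 bits
```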
ID3 (play / don't play)
Amount of information required to specify the class of an example, given that it reaches a node:

Frequency tables (play / don't play):
outlook:      sunny 2/3,  overcast 4/0,  rainy 3/2
temperature:  hot 2/2,    mild 4/2,      cool 3/1
humidity:     high 3/4,   normal 6/1
windy:        FALSE 6/2,  TRUE 3/3

Entropy before the split: 0.94 bits

outlook:     5/14 * 0.97 + 4/14 * 0.0  + 5/14 * 0.97 = 0.69 bits   gain: 0.25 bits
humidity:    7/14 * 0.98 + 7/14 * 0.59               = 0.79 bits   gain: 0.15 bits
temperature: 4/14 * 1.0  + 6/14 * 0.92 + 4/14 * 0.81 = 0.91 bits   gain: 0.03 bits
windy:       8/14 * 0.81 + 6/14 * 1.0                = 0.89 bits   gain: 0.05 bits

Outlook gives the maximal information gain, so it is chosen for the root split.
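A short Python sketch (not from the slides) that reproduces these gains from the frequency tables above, reusing the entropy() helper defined earlier:

```python
def info_gain(parent_counts, child_counts):
    """Information gain of a split.
    parent_counts: class counts at the node, e.g. [9, 5] for play / don't play.
    child_counts: one list of class counts per branch of the split."""
    total = sum(parent_counts)
    remainder = sum(sum(c) / total * entropy(c) for c in child_counts)
    return entropy(parent_counts) - remainder

# Gains on the weather data (play / don't play)
print(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))  # outlook  ~0.247
print(info_gain([9, 5], [[3, 4], [6, 1]]))          # humidity ~0.152
```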
ID3: splitting the "sunny" branch (play / don't play)
Training examples with outlook = sunny:
outlook   temperature   humidity   windy   play
sunny     hot           high       FALSE   no
sunny     hot           high       TRUE    no
sunny     mild          high       FALSE   no
sunny     cool          normal     FALSE   yes
sunny     mild          normal     TRUE    yes

Entropy of the sunny subset: 0.97 bits

humidity:    3/5 * 0.0  + 2/5 * 0.0             = 0.0 bits    gain: 0.97 bits
temperature: 2/5 * 0.0  + 2/5 * 1.0 + 1/5 * 0.0 = 0.40 bits   gain: 0.57 bits
windy:       3/5 * 0.92 + 2/5 * 1.0             = 0.95 bits   gain: 0.02 bits

Humidity gives the maximal information gain, so the sunny branch is split on humidity.
ID3: splitting the "rainy" branch (play / don't play)
Training examples with outlook = rainy:
outlook   temperature   humidity   windy   play
rainy     mild          high       FALSE   yes
rainy     cool          normal     FALSE   yes
rainy     cool          normal     TRUE    no
rainy     mild          normal     FALSE   yes
rainy     mild          high       TRUE    no

Entropy of the rainy subset: 0.97 bits

temperature: 3/5 * 0.92 + 2/5 * 1.0  = 0.95 bits   gain: 0.02 bits
humidity:    2/5 * 1.0  + 3/5 * 0.92 = 0.95 bits   gain: 0.02 bits
windy:       3/5 * 0.0  + 2/5 * 0.0  = 0.0 bits    gain: 0.97 bits

Windy gives the maximal information gain, so the rainy branch is split on windy.
ID3: the final decision tree (play / don't play)
outlook = sunny    → humidity = high → No;  humidity = normal → Yes
outlook = overcast → Yes
outlook = rainy    → windy = FALSE → Yes;  windy = TRUE → No
C4.5
Information gain measure is biased towards attributes with a large
number of values
C4.5 (a successor of ID3) uses gain ratio to overcome the problem
(normalization to information gain)
GainRatio(A) = Gain(A)/SplitInfo(A)
Ex.
gain_ratio(income) = 0.029/0.926 = 0.031
The attribute with the maximum gain ratio is selected as the
splitting attribute
SplitInfo_A(D) = - Σ_{j=1..v} (|D_j| / |D|) × log2(|D_j| / |D|)

SplitInfo_income(D) = -(5/14)·log2(5/14) - (4/14)·log2(4/14) - (5/14)·log2(5/14) = 0.926
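A minimal sketch (not from the slides) of SplitInfo and GainRatio, reusing entropy() and info_gain() from the earlier sketches:

```python
import math

def split_info(child_counts):
    """SplitInfo of a partition; only the branch sizes matter."""
    sizes = [sum(c) for c in child_counts]
    total = sum(sizes)
    return -sum(s / total * math.log2(s / total) for s in sizes if s > 0)

def gain_ratio(parent_counts, child_counts):
    return info_gain(parent_counts, child_counts) / split_info(child_counts)
```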
CART
If a data set D contains examples from n classes, the gini index,
gini(D), is defined as
    gini(D) = 1 - Σ_{j=1..n} p_j²
where p_j is the relative frequency of class j in D.
If a data set D is split on A into two subsets D_1 and D_2, the gini
index gini_A(D) is defined as
    gini_A(D) = (|D_1| / |D|)·gini(D_1) + (|D_2| / |D|)·gini(D_2)
Reduction in impurity:
    Δgini(A) = gini(D) - gini_A(D)
The attribute that provides the smallest gini_A(D) (or the largest
reduction in impurity) is chosen to split the node (we need to
enumerate all the possible splitting points for each attribute).
CART
Ex. D has 9 tuples in buys_computer = yes and 5 in no:
    gini(D) = 1 - (9/14)² - (5/14)² = 0.459
Suppose the attribute income partitions D into 10 tuples in D_1: {low,
medium} and 4 tuples in D_2:
    gini_{income ∈ {low,medium}}(D) = (10/14)·gini(D_1) + (4/14)·gini(D_2)
but gini_{income ∈ {medium,high}} is 0.30 and is thus the best, since it is the lowest.
All attributes are assumed continuous-valued
May need other tools, e.g., clustering, to get the possible split values
Can be modified for categorical attributes
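A minimal sketch (not from the slides) of the gini computations above:

```python
def gini(counts):
    """Gini index of a class distribution given as class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(child_counts):
    """Weighted gini of a partition (one list of class counts per subset)."""
    total = sum(sum(c) for c in child_counts)
    return sum(sum(c) / total * gini(c) for c in child_counts)

print(gini([9, 5]))  # 0.459, as in the buys_computer example
```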
Comparing Attribute Selection Measures
The three measures, in general, return good results, but:
Information gain:
biased towards multivalued attributes
Gain ratio:
tends to prefer unbalanced splits in which one partition is
much smaller than the others
Gini index:
biased towards multivalued attributes
has difficulty when the number of classes is large
tends to favor tests that result in equal-sized partitions
and purity in both partitions
Other Attribute Selection Measures
CHAID: a popular decision tree algorithm; measure based on the χ² test
for independence
C-SEP: performs better than information gain and the gini index in certain cases
G-statistic: has a close approximation to the χ² distribution
MDL (Minimal Description Length) principle (i.e., the simplest solution
is preferred):
The best tree as the one that requires the fewest # of bits to both
(1) encode the tree, and (2) encode the exceptions to the tree
Multivariate splits (partition based on multiple variable combinations)
CART: finds multivariate splits based on a linear comb. of attrs.
Which attribute selection measure is the best?
Most give good results; none is significantly superior to the others
Underfitting and Overfitting
Underfitting: when the model is too simple, both training
and test errors are large
Overfitting: when the model is too complex, the training error keeps decreasing while the test error grows
Overfitting due to Noise
The decision boundary is distorted by a noise point
Underfitting due to Insufficient Examples
Lack of data points in the lower half of the diagram makes it
difficult to predict correctly the class labels of that region
- Insufficient number of training records in the region causes the
decision tree to predict the test examples using other training
records that are irrelevant to the classification task
Two approaches to avoid Overfitting
Prepruning:
Halt tree construction early: do not split a node if this would
result in the goodness measure falling below a threshold
Difficult to choose an appropriate threshold
Postpruning:
Remove branches from a fully grown tree: get a sequence of
progressively pruned trees
Use a set of data different from the training data to decide
which is the best pruned tree
Scalable Decision Tree Induction Methods
ID3, C4.5, and CART are not efficient when the training set
doesn't fit in the available memory. Instead, the following
algorithms are used:
SLIQ
Builds an index for each attribute and only class list and
the current attribute list reside in memory
SPRINT
Constructs an attribute list data structure
RainForest
Builds an AVC-list (attribute, value, class label)
BOAT
Uses bootstrapping to create several small samples
BOAT
BOAT (Bootstrapped Optimistic Algorithm for Tree
Construction)
Use a statistical technique called bootstrapping to create several
smaller samples (subsets), each of which fits in memory
Each subset is used to create a tree, resulting in several trees
These trees are examined and used to construct a new tree T
It turns out that T is very close to the tree that would be
generated using the whole data set together
Adv: requires only two scans of DB, an incremental alg.
Why decision tree induction in data mining?
Relatively faster learning speed (than other
classification methods)
Convertible to simple and easy to understand
classification rules
Comparable classification accuracy with other
methods
Converting Tree to Rules
Decision tree: Outlook → Sunny: Humidity (High → No, Normal → Yes);
Overcast → Yes; Rain: Wind (Strong → No, Weak → Yes)

R1: IF (Outlook=Sunny) AND (Humidity=High) THEN Play=No
R2: IF (Outlook=Sunny) AND (Humidity=Normal) THEN Play=Yes
R3: IF (Outlook=Overcast) THEN Play=Yes
R4: IF (Outlook=Rain) AND (Wind=Strong) THEN Play=No
R5: IF (Outlook=Rain) AND (Wind=Weak) THEN Play=Yes
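A small sketch (not from the slides) showing how the five extracted rules could be applied to a new tuple; the dictionary keys are hypothetical attribute names:

```python
def classify(example):
    """Apply rules R1-R5 from the tree above to one example."""
    outlook, humidity, wind = example["Outlook"], example["Humidity"], example["Wind"]
    if outlook == "Sunny":
        return "No" if humidity == "High" else "Yes"    # R1 / R2
    if outlook == "Overcast":
        return "Yes"                                    # R3
    if outlook == "Rain":
        return "No" if wind == "Strong" else "Yes"      # R4 / R5

print(classify({"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Weak"}))  # Yes
```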
Decision trees: The Weka tool

@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no

http://www.cs.waikato.ac.nz/ml/weka/
Bayesian Classifier
Thomas Bayes (1702-1761)
Basic Statistics
Assume: D = all students, X = ICS students, C = SWE students
[Venn diagram: |X ∩ C| = 4, X only = 6, C only = 16, outside both = 74]
|X| = 10, |C| = 20, |D| = 100
P(X) = 10/100
P(C) = 20/100
P(X,C) = 4/100
P(X|C) = P(X,C)/P(C) = 4/20
P(C|X) = P(X,C)/P(X) = 4/10
P(X,C) = P(C|X)·P(X) = P(X|C)·P(C)
Bayesian Classifier: Basic Equation
P(X,C) = P(C|X)·P(X) = P(X|C)·P(C)

P(C|X) = P(C)·P(X|C) / P(X)
where
  P(C|X) is the class posterior probability
  P(C)   is the class prior probability
  P(X|C) is the descriptor posterior probability
  P(X)   is the descriptor prior probability
Naive Bayesian Classifier
P(C|X) = P(C)·P(X|C) / P(X)

Independence assumption about the descriptors x_1, ..., x_n:
P(C_1|X) = P(C_1)·P(x_1|C_1)·P(x_2|C_1)·P(x_3|C_1)·...·P(x_n|C_1) / P(X)
P(C_2|X) = P(C_2)·P(x_1|C_2)·P(x_2|C_2)·P(x_3|C_2)·...·P(x_n|C_2) / P(X)
...
P(C_m|X) = P(C_m)·P(x_1|C_m)·P(x_2|C_m)·P(x_3|C_m)·...·P(x_n|C_m) / P(X)
Training Data
Outlook    Temp   Humidity   Windy   Play?
sunny      hot    high       FALSE   No
sunny      hot    high       TRUE    No
overcast   hot    high       FALSE   Yes
rainy      mild   high       FALSE   Yes
rainy      cool   normal     FALSE   Yes
rainy      cool   normal     TRUE    No
overcast   cool   normal     TRUE    Yes
sunny      mild   high       FALSE   No
sunny      cool   normal     FALSE   Yes
rainy      mild   normal     FALSE   Yes
sunny      mild   normal     TRUE    Yes
overcast   mild   high       TRUE    Yes
overcast   hot    normal     FALSE   Yes
rainy      mild   high       TRUE    No
P(yes) = 9/14
P(no) = 5/14
Bayesian Classifier: Probabilities for the weather data
Frequency Tables
Outlook | No Yes
----------------------------------
Sunny | 3 2
----------------------------------
Overcast | 0 4
----------------------------------
Rainy | 2 3
Temp. | No Yes
----------------------------------
Hot | 2 2
----------------------------------
Mild | 2 4
----------------------------------
Cool | 1 3
Humidity | No Yes
----------------------------------
High | 4 3
----------------------------------
Normal | 1 6
Windy | No Yes
----------------------------------
False | 2 6
----------------------------------
True | 3 3
Outlook | No Yes
----------------------------------
Sunny | 3/5 2/9
----------------------------------
Overcast | 0/5 4/9
----------------------------------
Rainy | 2/5 3/9
Temp. | No Yes
----------------------------------
Hot | 2/5 2/9
----------------------------------
Mild | 2/5 4/9
----------------------------------
Cool | 1/5 3/9
Humidity | No Yes
----------------------------------
High | 4/5 3/9
----------------------------------
Normal | 1/5 6/9
Windy | No Yes
----------------------------------
False | 2/5 6/9
----------------------------------
True | 3/5 3/9
Likelihood Tables
Bayesian Classifier: Predicting a new day
Outlook   Temp.   Humidity   Windy   Play
sunny     cool    high       true    ?      (the new tuple X; which class?)
P(yes|X) = p(sunny|yes) x p(cool|yes) x p(high|yes) x p(true|yes) x p(yes)
= 2/9 x 3/9 x 3/9 x 3/9 x 9/14 = 0.0053 => 0.0053/(0.0053+0.0206) = 0.205
P(no|X) = p(sunny|no) x p(cool|no) x p(high|no) x p(true|no) x p(no)
= 3/5 x 1/5 x 4/5 x 3/5 x 5/14 = 0.0206 => 0.0206/(0.0053+0.0206) = 0.795
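A minimal sketch (not from the slides) that reproduces this prediction from the likelihood tables and class priors above:

```python
likelihood = {
    "yes": {"sunny": 2/9, "cool": 3/9, "high": 3/9, "true": 3/9},
    "no":  {"sunny": 3/5, "cool": 1/5, "high": 4/5, "true": 3/5},
}
prior = {"yes": 9/14, "no": 5/14}

x = ["sunny", "cool", "high", "true"]          # the new day
score = {c: prior[c] for c in prior}
for c in score:
    for value in x:
        score[c] *= likelihood[c][value]       # naive independence assumption

total = sum(score.values())
for c in score:
    print(c, round(score[c] / total, 3))       # yes ~0.205, no ~0.795
```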
Bayesian Classifier: the zero-frequency problem
What if a descriptor value doesn't occur with every class value?
P(outlook=overcast|No) = 0
Remedy: add 1 to the count for every descriptor-class combination
(Laplace Estimator)
Outlook | No Yes
----------------------------------
Sunny | 3+1 2+1
----------------------------------
Overcast | 0+1 4+1
----------------------------------
Rainy | 2+1 3+1
Temp. | No Yes
----------------------------------
Hot | 2+1 2+1
----------------------------------
Mild | 2+1 4+1
----------------------------------
Cool | 1+1 3+1
Humidity | No Yes
----------------------------------
High | 4+1 3+1
----------------------------------
Normal | 1+1 6+1
Windy | No Yes
----------------------------------
False | 2+1 6+1
----------------------------------
True | 3+1 3+1
Bayesian Classifier: General Equation
P(C_k|X) = P(C_k)·P(X|C_k) / P(X)

Likelihood: P(X|C_k)

Continuous variable (normal distribution):
P(x|C) = 1 / (2πσ²)^(1/2) · exp( -(x - μ)² / (2σ²) )
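A minimal sketch (not from the slides) of the Gaussian likelihood used for a continuous descriptor:

```python
import math

def gaussian_likelihood(x, mu, sigma):
    """P(x|C) under a normal model with class mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)
```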
Bayesian Classifier: Dealing with numeric attributes
Naïve Bayesian Classifier: Comments
Advantages
Easy to implement
Good results obtained in most of the cases
Disadvantages
Assumption: class conditional independence, therefore loss of
accuracy
Practically, dependencies exist among variables
E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer,
diabetes, etc.
Dependencies among these cannot be modeled by a Naïve
Bayesian Classifier
How to deal with these dependencies?
Bayesian Belief Networks
Bayesian Belief Networks
A Bayesian belief network allows a subset of the variables to be
conditionally independent
A graphical model of causal relationships
Represents dependency among the variables
Gives a specification of joint probability distribution
X
Y
Z
P
Nodes: random variables
Links: dependency
X and Y are the parents of Z, and Y is
the parent of P
No dependency between Z and P
Has no loops or cycles
Bayesian Belief Network: An Example
[Network over FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea; FamilyHistory and Smoker are the parents of LungCancer.]
The conditional probability table (CPT) for the variable LungCancer:
        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC      0.8       0.5        0.7        0.1
~LC     0.2       0.5        0.3        0.9
The CPT shows the conditional probability for each possible combination of values of its parents.
Derivation of the probability of a particular combination of values of X from the CPT:
P(x_1, ..., x_n) = Π_{i=1..n} P(x_i | Parents(Y_i))
Bayesian Belief Networks
Training Bayesian Networks
Several scenarios:
Given both the network structure and all variables
observable: learn only the CPTs
Network structure known, some hidden variables: gradient
descent (greedy hill-climbing) method, analogous to neural
network learning
Network structure unknown, all variables observable:
search through the model space to reconstruct network
topology
Unknown structure, all hidden variables: No good
algorithms known for this purpose.
Support Vector Machines
Sabic
Email Mohammed S. Al-Shahrani
shahranims@sabic.com
Support Vector Machines
Find a linear hyperplane (decision boundary) that will separate the
data
Support Vector Machines
One Possible Solution
Support Vector Machines
Another possible solution
Support Vector Machines
Other possible solutions
Support Vector Machines
Which one is better? B1 or B2?
How do you define better?
Support Vector Machines
Find a hyperplane that maximizes the margin => B1 is better than B2
The training points that lie on the margin are the support vectors.
Support Vector Machines
w·x + b = 0          (decision boundary)
w·x + b = +1  and  w·x + b = -1          (margin hyperplanes)

f(x) = +1 if w·x + b ≥ +1
       -1 if w·x + b ≤ -1

Margin = 2 / ||w||²
Finding the Decision Boundary
Let {x_1, ..., x_n} be our data set and let y_i ∈ {+1, -1} be the class
label of x_i.
The decision boundary should classify all points correctly.
The decision boundary can be found by solving the following
constrained optimization problem.
This is a constrained optimization problem; solving it is beyond
the scope of this course.
Support Vector Machines
We want to maximize:
    Margin = 2 / ||w||²
which is equivalent to minimizing:
    L(w) = ||w||² / 2
but subject to the following constraints:
    f(x_i) = +1 if w·x_i + b ≥ +1
             -1 if w·x_i + b ≤ -1
This is a constrained optimization problem.
Numerical approaches exist to solve it (e.g., quadratic programming).
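A minimal sketch (assuming scikit-learn is available; not part of the slides) that fits a linear SVM numerically and reads off the learned w, b, and support vectors:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.0], [8.0, 8.0], [9.0, 9.0]])  # toy data
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.coef_, clf.intercept_)   # w and b of the boundary w.x + b = 0
print(clf.support_vectors_)        # the support vectors
```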
Classifying new Tuples
The decision boundary is determined only by the support vectors.
Let t_j (j = 1, ..., s) be the indices of the s support vectors.
For testing with new data z:
compute w·z + b (a sum that involves only the support vectors), and
classify z as class 1 if the sum is positive, and class 2
otherwise.
Support Vector Machines
What if the training set is not linearly separable?
Slack variables ξ_i can be added to allow misclassification of
difficult or noisy examples; the resulting margin is called soft.
Support Vector Machines
What if the problem is not linearly separable?
Introduce slack variables:
Need to minimize:
    L(w) = ||w||² / 2 + C · Σ_{i=1..N} ξ_i^k
Subject to:
    f(x_i) = +1 if w·x_i + b ≥ 1 - ξ_i
             -1 if w·x_i + b ≤ -1 + ξ_i
Nonlinear Support Vector Machines
What if the decision boundary is not linear?
Non-linear SVMs
Datasets that are linearly separable with some noise work out great.
But what are we going to do if the dataset is just too hard?
How about mapping the data to a higher-dimensional space, e.g., x → (x, x²)?
Non-linear SVMs: Feature spaces
General idea: the original feature space can always be mapped to
some higher-dimensional feature space where the training set is
separable:
Φ: x → φ(x)
Prediction
Linear Regression
What Is Prediction?
(Numerical) prediction is similar to classification
construct a model
use model to predict continuous or ordered value for a given
input
Prediction is different from classification:
Classification refers to predicting a categorical class label
Prediction models continuous-valued functions
Major method for prediction: regression
model the relationship between one or more predictor
variables and a response variable
Prediction
[Figure: training data plotted with the predictor attribute (X) on the x-axis and the response attribute (Y) on the y-axis.]
Types of Correlation
Positive correlation Negative correlation No correlation
Regression Analysis
Simple Linear regression
multiple regression
Non-linear regression
Other regression methods:
generalized linear model,
Poisson regression,
log-linear models,
regression trees
Simple Linear Regression
describes the linear relationship between a predictor variable,
plotted on the x-axis, and a response variable, plotted on the
y-axis
X
Y
Simple Linear Regression
Y = β_o + β_1·X
[Figure: the regression line; β_o is the intercept and β_1 is the slope, the change in Y per unit increase in X.]
[Figure: data points scattered around the fitted line; the vertical deviations are the residuals ε.]
Simple Linear Regression
Fitting data to a linear model:
Y_i = β_o + β_1·X_i + ε_i
where β_o is the intercept, β_1 is the slope, and ε_i are the residuals.
Simple Linear Regression
How to fit data to a linear model? The least squares method.

Least Squares Regression
Model line: Ŷ = β_0 + β_1·X
Residual (ε) = Y - Ŷ
Sum of squares of residuals = Σ (Y - Ŷ)²
We must find values of β_o and β_1 that minimise Σ (Y - Ŷ)²
Linear Regression
A model line y = w_0 + w_1·x is acquired by using the
method of least squares; the best-fitting straight line has:
w_1 = Σ_{i=1..|D|} (x_i - x̄)(y_i - ȳ) / Σ_{i=1..|D|} (x_i - x̄)²
w_0 = ȳ - w_1·x̄
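A minimal sketch (not from the slides) of these least-squares estimates:

```python
def fit_line(xs, ys):
    """Return (w0, w1) for the best-fitting line y = w0 + w1*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
         sum((x - x_bar) ** 2 for x in xs)
    w0 = y_bar - w1 * x_bar
    return w0, w1

print(fit_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]))  # roughly (0.15, 1.94)
```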
Multiple Linear Regression
Multiple linear regression: involves more than one predictor
variable
The linear model with a single predictor variable X can easily
be extended to two or more predictor variables
Solvable by extension of least square method or using SAS,
S-Plus
Y = β_o + β_1·X_1 + β_2·X_2 + ... + β_p·X_p + ε
Nonlinear Regression
Some nonlinear models can be modeled by a polynomial
function
A polynomial regression model can be transformed into linear
regression model. For example,
y = w_0 + w_1·x + w_2·x² + w_3·x³
is convertible to linear with the new variables x_2 = x², x_3 = x³:
y = w_0 + w_1·x + w_2·x_2 + w_3·x_3
Other functions, such as the power function, can also be transformed
to a linear model
Some models are intractably nonlinear
it is still possible to obtain least squares estimates through extensive
calculation on more complex formulae
Artificial Neural Networks
(ANN)
What is an ANN?
An ANN is a data structure that supposedly simulates
the behavior of neurons in a biological brain.
An ANN is composed of layers of interconnected units.
Messages are passed along the connections from
one unit to the other.
Messages can change based on the weight of the
connection and the value in the node.
General Structure of ANN
[Figure: a single unit; the inputs x_0, x_1, ..., x_n are multiplied by weights w_0, w_1, ..., w_n, summed, compared against a threshold, and passed through an activation function f to give the output.]
ANN
X1  X2  X3  Y
1   0   0   0
1   0   1   1
1   1   0   1
1   1   1   1
0   0   1   0
0   1   0   0
0   1   1   1
0   0   0   0
Output Y is 1 if at least two of the three inputs are equal to 1.
ANN
A single unit can compute this function:
Y = I(0.3·X1 + 0.3·X2 + 0.3·X3 - 0.4 > 0)
where I(z) = 1 if z is true, and 0 otherwise.
Artificial Neural Networks
The model is an assembly of inter-connected nodes and
weighted links.
The output node sums up each of its input values according to
the weights of its links.
Compare the output node against some threshold t.
Perceptron Model:
Y = I( Σ_i w_i·X_i - t )   or   Y = sign( Σ_i w_i·X_i - t )
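A minimal perceptron sketch (not from the slides) implementing the earlier example Y = I(0.3·X1 + 0.3·X2 + 0.3·X3 - 0.4 > 0):

```python
def perceptron(x, weights=(0.3, 0.3, 0.3), t=0.4):
    """Threshold unit: output 1 if the weighted sum exceeds the threshold t."""
    s = sum(w * xi for w, xi in zip(weights, x)) - t
    return 1 if s > 0 else 0

for x in [(1, 0, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1)]:
    print(x, perceptron(x))   # 0, 1, 1, 1
```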
Neural Networks
Advantages
prediction accuracy is generally high.
robust, works when training examples contain errors.
output may be discrete, real-valued, or a vector of several
discrete or real-valued attributes.
fast evaluation of the learned target function.
Criticism
long training time.
difficult to understand the learned function (weights).
not easy to incorporate domain knowledge.
Learning Algorithms
Back propagation for classification
Kohonen feature maps for clustering
Recurrent back propagation for classification
Radial basis function for classification
Adaptive resonance theory
Probabilistic neural networks
Major Steps for Back Propagation
Network
Constructing a network
input data representation
selection of number of layers, number of nodes in
each layer.
Training the network using training data
Pruning the network
Interpret the results
A Multi-Layer Feed-Forward Neural Network
Each hidden or output unit j computes its net input and output as:
I_j = Σ_i w_ij·O_i + θ_j
O_j = 1 / (1 + e^(-I_j))
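A minimal sketch (not from the slides) of one layer's forward computation using these formulas; the weights and biases are made-up numbers:

```python
import math

def forward(inputs, weights, biases):
    """I_j = sum_i w_ij * O_i + theta_j, then O_j = 1 / (1 + exp(-I_j))."""
    outputs = []
    for w_j, theta_j in zip(weights, biases):
        i_j = sum(w * o for w, o in zip(w_j, inputs)) + theta_j
        outputs.append(1.0 / (1.0 + math.exp(-i_j)))
    return outputs

print(forward([0.5, 0.1], weights=[[0.2, -0.3], [0.4, 0.1]], biases=[0.0, -0.2]))
```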
How A Multi-Layer Neural Network Works?
The inputs to the network correspond to the attributes measured for
each training tuple
Inputs are fed simultaneously into the units making up the input layer
They are then weighted and fed simultaneously to a hidden layer
The number of hidden layers is arbitrary, although usually only one
The weighted outputs of the last hidden layer are input to units making
up the output layer, which emits the network's prediction
The network is feed-forward in that none of the weights cycles back to
an input unit or to an output unit of a previous layer
From a statistical point of view, networks perform nonlinear
regression: Given enough hidden units and enough training samples,
they can closely approximate any function
Defining a Network Topology
First decide the network topology: # of units in the input layer, #
of hidden layers (if > 1), # of units in each hidden layer, and # of
units in the output layer
Normalize the input values for each attribute measured in the
training tuples to [0.0, 1.0]
One input unit per domain value
Output: for classification with more than two classes, one
output unit per class is used
Once a network has been trained, if its accuracy is
unacceptable, repeat the training process with a different
network topology or a different set of initial weights
Backpropagation
Iteratively process a set of training tuples & compare the network's
prediction with the actual known target value
For each training tuple, the weights are modified to minimize the
mean squared error between the network's prediction and the
actual target value
Modifications are made in the backwards direction: from the
output layer, through each hidden layer down to the first hidden
layer, hence backpropagation
Steps
Initialize weights (to small random #s) and biases in the network
Propagate the inputs forward (by applying activation function)
Backpropagate the error (by updating weights and biases)
Terminating condition (when error is very small, etc.)
Backpropagation
For an output unit j (O_j = generated value, T_j = correct value):
Err_j = O_j (1 - O_j)(T_j - O_j)
For a hidden unit j:
Err_j = O_j (1 - O_j) Σ_k Err_k w_jk
Weight and bias updates (l = learning rate):
w_ij = w_ij + (l) Err_j O_i
θ_j = θ_j + (l) Err_j
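A minimal sketch (not from the slides) of one backpropagation update for an output unit, following the formulas above; l is the learning rate:

```python
def update_output_unit(o_j, t_j, inputs, w_j, theta_j, l=0.1):
    """Return the error term and the updated weights and bias of output unit j."""
    err_j = o_j * (1 - o_j) * (t_j - o_j)
    w_j = [w + l * err_j * o_i for w, o_i in zip(w_j, inputs)]
    theta_j = theta_j + l * err_j
    return err_j, w_j, theta_j
```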
Network Pruning
Fully connected network will be hard to articulate
n input nodes, h hidden nodes and m output nodes
lead to h(m+n) links (weights)
Pruning: Remove some of the links without affecting
classification accuracy of the network.
Other Classification Methods
Associative classification: Association rule based; condSet → class
Genetic algorithm: An initial population of encoded rules is
changed by mutation and cross-over, based on survival of the
most accurate ones (survival of the fittest).
K-nearest neighbor classifier: Learning by analogy.
Case-based reasoning: Similarity with other cases.
Rough set theory: Approximation to equivalence classes.
Fuzzy sets: Based on fuzzy logic (truth values between 0..1).
Lazy Learners
Lazy vs. Eager Learning
Lazy vs. eager learning
Lazy learning (e.g., instance-based learning): Simply
stores training data (or only minor processing) and waits
until it is given a test tuple
Eager learning (the methods discussed above): Given a
training set, constructs a classification model
before receiving new (e.g., test) data to classify
Lazy: less time in training but more time in predicting
Lazy Learner: Instance-Based Methods
Instance-based learning:
Store training examples and delay the processing (lazy
evaluation) until a new instance must be classified
Typical approaches
k-nearest neighbor approach
Instances represented as points in a Euclidean
space.
Case-based reasoning
Uses symbolic representations and knowledge-
based inference
Nearest Neighbor Classifiers
Basic idea:
If it walks like a duck and quacks like a duck, then it's
probably a duck.
[Figure: for a test record, compute the distance to the training records and choose the k nearest ones.]
Instance-Based Classifiers
[Figure: a set of stored cases (Atr1, ..., AtrN, Class) and an unseen case (Atr1, ..., AtrN).]
Store the training records.
Use the training records to predict the class label of unseen cases.
Definition of Nearest Neighbor
[Figure: (a) the 1-nearest neighbor, (b) 2-nearest neighbors, (c) 3-nearest neighbors of a record x.]
The k-nearest neighbors of a record x are the data points
that have the k smallest distances to x.
The k-Nearest Neighbor Algorithm
All instances correspond to points in the n-D space.
The nearest neighbors are defined in terms of Euclidean
distance, dist(X_1, X_2).
The target function could be discrete- or real-valued.
For discrete-valued targets, k-NN returns the most common value
among the k training examples nearest to x_q.
Voronoi diagram: the decision surface induced by 1-NN for a
typical set of training examples.
[Figure: positive (+) and negative (-) training points around a query point x_q, with the Voronoi cells induced by 1-NN.]
Nearest-Neighbor Classifiers
Requires three things
The set of stored records
Distance Metric to compute
distance between records
The value of k, the number of
nearest neighbors to retrieve
To classify an unknown record:
Compute distance to other training
records
Identify k nearest neighbors
Use class labels of nearest
neighbors to determine the class
label of unknown record (e.g., by
taking majority vote)
Unknown record
Nearest Neighbor Classification
Compute the distance between two points:
Euclidean distance: d(p, q) = sqrt( Σ_i (p_i - q_i)² )
Determine the class from the nearest neighbor list:
take the majority vote of class labels among the k-nearest neighbors
Weigh the vote according to distance:
weight factor w = 1/d²
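A minimal k-NN sketch (not from the slides): Euclidean distance plus a majority vote among the k nearest training points:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """train: list of (point, label) pairs; point and query are numeric tuples."""
    def dist(p, q):
        return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
    neighbors = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((8, 8), "B"), ((9, 8), "B")]
print(knn_classify(train, (2, 2), k=3))  # "A"
```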
Nearest Neighbor Classification
Scaling issues
Attributes may have to be scaled to prevent
distance measures from being dominated by one of
the attributes
Example:
height of a person may vary from 1.5m to 1.8m
weight of a person may vary from 90lb to 300lb
income of a person may vary from $10K to $1M
Nearest Neighbor Classification
Choosing the value of k:
If k is too small, sensitive to noise points
If k is too large, neighborhood may include points from other
classes
Metrics for Performance Evaluation
Focus on the predictive capability of a model
Rather than how fast it takes to classify or build models,
scalability, etc.
Confusion Matrix:
                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL   Class=Yes       a (TP)      b (FN)
CLASS    Class=No        c (FP)      d (TN)
a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
Metrics for Performance Evaluation
Most widely-used metric:
Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Error Rate = 1 - Accuracy
Limitation of Accuracy
Consider a 2-class problem
Number of Class 0 examples = 9990
Number of Class 1 examples = 10
If model predicts everything to be class 0, accuracy is
9990/10000 = 99.9 %
Accuracy is misleading because model does not
detect any class 1 example
Alternative Classifier Accuracy Measures
accuracy = sensitivity * pos/(pos + neg) + specificity * neg/(pos + neg)
sensitivity = tp/pos /* true positive recognition rate */
specificity = tn/neg /* true negative recognition rate */
precision = tp/(tp + fp)
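A minimal sketch (not from the slides) computing these measures from confusion-matrix counts:

```python
def metrics(tp, fn, fp, tn):
    pos, neg = tp + fn, tn + fp
    return {
        "accuracy":    (tp + tn) / (pos + neg),
        "sensitivity": tp / pos,            # true positive recognition rate
        "specificity": tn / neg,            # true negative recognition rate
        "precision":   tp / (tp + fp),
    }

print(metrics(tp=90, fn=10, fp=30, tn=870))
```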
Predictor Error Measures
Test error (generalization error): the average loss over the test set
(y_i is the actual value and y'_i the predicted value):
Mean absolute error:     Σ_{i=1..d} |y_i - y'_i| / d
Mean squared error:      Σ_{i=1..d} (y_i - y'_i)² / d
Relative absolute error: Σ_{i=1..d} |y_i - y'_i| / Σ_{i=1..d} |y_i - ȳ|
Relative squared error:  Σ_{i=1..d} (y_i - y'_i)² / Σ_{i=1..d} (y_i - ȳ)²
The mean squared error exaggerates the presence of outliers.
Popularly used: the (square) root mean squared error and, similarly, the root
relative squared error.
Evaluating Accuracy
Holdout method
Given data is randomly partitioned into two independent sets
Training set (e.g., 2/3) for model construction
Test set (e.g., 1/3) for accuracy estimation
Random sampling: a variation of holdout
Repeat holdout k times, accuracy = avg. of the
accuracies obtained
Cross-validation (k-fold, where k = 10 is most popular)
Randomly partition the data into k mutually exclusive
subsets, each approximately equal size
At the i-th iteration, use D_i as the test set and the others as the training set
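A minimal k-fold cross-validation sketch (not from the slides); train_and_score is a hypothetical function that builds a model on one set and returns its accuracy on another:

```python
def k_fold_accuracy(data, k, train_and_score):
    folds = [data[i::k] for i in range(k)]              # k roughly equal partitions
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        accuracies.append(train_and_score(train, test))
    return sum(accuracies) / k
```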
Evaluating Accuracy
Bootstrap
Works well with small data sets
Samples the given training tuples uniformly with replacement
Several bootstrap methods; a common one is the .632 bootstrap:
Suppose we are given a data set of d tuples. The data set is sampled
d times, with replacement, resulting in a training set of d samples. The
data tuples that did not make it into the training set end up forming the
test set. About 63.2% of the original data will end up in the bootstrap sample,
and the remaining 36.8% will form the test set (since (1 - 1/d)^d ≈ e^(-1) = 0.368).
Repeat the sampling procedure k times; the overall accuracy of the model is:
acc(M) = Σ_{i=1..k} ( 0.632 × acc(M_i)_test_set + 0.368 × acc(M_i)_train_set )
Ensemble Methods
Construct a set of classifiers from the training data.
Predict the class label of previously unseen records by
aggregating the predictions made by multiple classifiers.
Use a combination of models to increase accuracy:
combine a series of k learned models, M_1, M_2, ..., M_k, with the
aim of creating an improved model M*.
Popular ensemble methods:
Bagging: averaging the prediction over a collection of classifiers
Boosting: weighted vote with a collection of classifiers
General Idea
[Figure: the training set is resampled to build multiple classifiers whose predictions are combined into a single prediction.]
Bagging: Bootstrap Aggregation
Analogy: Diagnosis based on multiple doctors' majority vote
Training:
Given a set D of d tuples, at each iteration i a training set D_i of d
tuples is sampled with replacement from D (i.e., a bootstrap sample)
A classifier model M_i is learned for each training set D_i
Classification: classify an unknown sample X
Each classifier M_i returns its class prediction
The bagged classifier M* counts the votes and assigns the class
with the most votes to X
Prediction: can be applied to the prediction of continuous values
by taking the average value of each prediction for a given test
tuple
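A minimal bagging sketch (not from the slides); learn is a hypothetical function that returns a classifier, and each classifier is assumed to be callable on a tuple x:

```python
import random
from collections import Counter

def bagging(D, k, learn):
    models = []
    for _ in range(k):
        D_i = [random.choice(D) for _ in range(len(D))]  # bootstrap sample of size d
        models.append(learn(D_i))

    def predict(x):                                      # majority vote of the k models
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict
```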
Bagging: Bootstrap Aggregation
Accuracy
Often significantly better than a single classifier derived
from D
For noisy data: not considerably worse, more robust
Proven improved accuracy in prediction
Boosting
Analogy: Consult several doctors, based on a combination of
weighted diagnoses; the weights are assigned based on the previous
diagnosis accuracy
How does boosting work?
Weights are assigned to each training tuple
A series of k classifiers is iteratively learned
After a classifier M_i is learned, the weights are updated to
allow the subsequent classifier, M_{i+1}, to pay more attention to
the training tuples that were misclassified by M_i
The final M* combines the votes of each individual classifier,
where the weight of each classifier's vote is a function of its
accuracy
Boosting
The boosting algorithm can be extended for the
prediction of continuous values
Comparing with bagging: boosting tends to achieve
greater accuracy, but it also risks overfitting the
model to misclassified data
Boosting: AdaBoost
Given a set of d class-labeled tuples, (X_1, y_1), ..., (X_d, y_d)
Initially, all the weights of the tuples are set the same (1/d)
Generate k classifiers in k rounds. At round i:
Tuples from D are sampled (with replacement) to form a training set
D_i of the same size
Each tuple's chance of being selected is based on its weight
A classification model M_i is derived from D_i
Its error rate is calculated using D_i as a test set
If a tuple is misclassified, its weight is increased; otherwise it is
decreased
Error rate: err(X_j) is the misclassification error of tuple X_j. Classifier
M_i's error rate is the sum of the weights of the misclassified tuples:
error(M_i) = Σ_j w_j × err(X_j)
The weight of classifier M_i's vote is
log( (1 - error(M_i)) / error(M_i) )
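A minimal sketch (not from the slides) of the classifier vote weight defined above:

```python
import math

def classifier_vote_weight(error_rate):
    """Weight of classifier M_i's vote: log((1 - error) / error)."""
    return math.log((1 - error_rate) / error_rate)

print(classifier_vote_weight(0.25))  # ~1.10: more accurate classifiers get larger votes
```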
Summary
Classification vs. prediction
Eager learners
Decision tree
Bayesian
Support vector Machines (SVM)
Neural Networks
Linear regression
Lazy learners
K-Nearest Neighbor (KNN)
Performance (Accuracy) Evaluation
Holdout
Cross validation
Bootstrap
Ensemble Methods
Bagging
Boosting
END