Data Mining
Classification
Jingpeng Li


What is Classification?
Assigning an object to a certain class
based on its similarity to previous
examples of other objects
Can be done with reference to original
data or based on a model of that data
E.g. Me: It's round, green, and edible
You: It's an apple!

Usual Examples
Classifying transactions as genuine or
fraudulent, e.g. credit card usage,
insurance claims, cell phone calls
Classifying prospects as good or bad
customers
Classifying engine faults by their
symptoms


Certainty
As with most data mining solutions, a
classification usually comes with a degree
of certainty.
It might be the probability of the object
belonging to the class or it might be some
other measure of how closely the object
resembles other examples from that class.


Techniques
Non-parametric, e.g. K-nearest neighbour
Mathematical models, e.g. neural
networks
Rule-based models, e.g. decision trees


Predictive / Definitive
Classification may indicate a propensity to
act in a certain way, e.g. a prospect is
likely to become a customer. This is
predictive.
Classification may indicate similarity to
objects that are definitely members of a
given class, e.g. small, round, green =
apple


Simple Worked Example


Risk of making a claim on a motor
insurance policy
This is a predictive classification: they haven't made the claim yet, but do they look like other people who have?
To keep it simple, let's look at just age and gender


The Data

Age  Gender  Claim?
30   Female  No
31   Male    No
27   Male    No
20   Male    Yes
29   Female  No
32   Male    No
46   Male    No
45   Male    No
33   Male    No
25   Female  No
38   Female  No
21   Female  No
38   Female  No
42   Male    No
29   Male    No
37   Male    No
40   Female  No

[Scatter plot of the data: Age against Gender, with claim / no-claim points marked]


K-Nearest Neighbour
Performed on raw data
Count the number of other examples that are close
Winner is the most common class among them

[Scatter plot: Age against Gender, existing examples plus the new person to classify]

K-Nearest Neighbour
Should the test sample
(green circle) be the 1st
class (blue squares) or the
2nd class (red triangles)?
If k = 3 (solid line circle), it is
assigned to the 2nd class
because there are 2 triangles
and only 1 square inside the
inner circle.
If k = 5 (dashed line circle), it is
assigned to the 1st class
(3 squares vs. 2 triangles).
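
A minimal sketch of this idea in Python (the toy points and the function name are illustrative, not from the slides):

import math
from collections import Counter

def knn_classify(sample, examples, k):
    # examples is a list of ((x, y), label) pairs; distance is Euclidean
    nearest = sorted(examples, key=lambda e: math.dist(sample, e[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# toy points loosely mirroring the squares-vs-triangles picture above
examples = [((1, 1), 'square'), ((2, 1), 'square'), ((1, 2), 'square'),
            ((4, 4), 'triangle'), ((5, 4), 'triangle'), ((4, 5), 'triangle')]

print(knn_classify((3.6, 3.6), examples, k=3))   # -> 'triangle'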



Rule Based
If Gender = Male and Age < 30 then Claim
If Gender = Male and Age > 30 then No Claim
Etc.

[Scatter plot: Age against Gender, with the new person to classify marked]

Decision Trees
A good automatic rule discovery technique
is the decision tree
Produces a set of branching decisions that
end in a classification
Works best on nominal attributes;
numeric ones need to be split into bins


A Decision Tree
Note: not all attributes are used in all decisions

Legs?
  4 → Size?
        Med   → Cat
        Small → Mouse
  0 → Swims?
        Y → Fish
        N → Snake
  2 → Bird


Making a Classification
Each node represents a single variable
Each branch represents a value that variable
can take
To classify a single example, start at the top of
the tree and see which variable it represents
Follow the branch that corresponds to the value
that variable takes in your example
Keep going until you reach a leaf, where your
object is classified!
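For example, a small animal with 4 legs: follow the 4 branch at Legs, then the Small branch at Size, and it is classified as a Mouse.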


Tree Structure
There are lots of ways to arrange a
decision tree
Does it matter which variables go where?
Yes:
You need to optimise the number of correct
classifications
You want to make the classification process
as fast as possible


A Tree Building Algorithm


Divide and Conquer:
Choose the variable that is at the top of the tree
Create a branch for each possible value
For each branch, repeat the process until there
are no more branches to make (i.e. stop when
all the instances at the current branch are in
the same class)
But how do you choose which variable to split?


The ID3 Algorithm


Split on the variable that gives the greatest
information gain
Information can be thought of as a
measure of uncertainty
Information is a measure based on the
probability of something happening


Information Example
If I pick a random card from a deck and
you have to guess what it is, which would
you rather be told:
It is red (which has a probability of 0.5), or
it is a picture card (which has a probability
of 4/13 = 0.31)


Calculating Information
The information associated with a single
event:
I(e) = -log(p_e)
where p_e is the probability of event e occurring, and log is the base 2 log
I(Red) = -log(0.5) = 1
I(Picture card) = -log(0.31) = 1.7


Average Information
The weighted average information across all possible values of a variable is called Entropy.
It is calculated as the sum of the probability of each possible event times its information value:

H(X) = Σ P(x_i) I(x_i) = -Σ P(x_i) log(P(x_i))

where log is the base 2 log.

Entropy of IsPicture?
I(Picture) = -log(4/13) = 1.7
I(Not Picture) = -log(9/13) = 0.53
H = (4/13)*1.7 + (9/13)*0.53 = 0.89
Entropy H(X) is a measure of uncertainty
in variable X
The more even the distribution of X
becomes, the higher the entropy gets
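
These calculations are easy to check with a few lines of Python (a sketch; the function names are just for illustration):

import math

def information(p):
    # information (surprisal) of an event with probability p, in bits
    return -math.log2(p)

def entropy(probs):
    # entropy of a distribution given as a list of probabilities
    return sum(p * information(p) for p in probs if p > 0)

print(information(0.5))           # I(Red) = 1.0
print(information(4 / 13))        # I(Picture) ~ 1.70
print(information(9 / 13))        # I(Not Picture) ~ 0.53
print(entropy([4 / 13, 9 / 13]))  # H(IsPicture?) ~ 0.89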


Unfair Coin Entropy


The more even the distribution of X
becomes, the higher the entropy gets
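For example, a fair coin has H = -(0.5 log(0.5) + 0.5 log(0.5)) = 1 bit, a coin with P(heads) = 0.9 has H = -(0.9 log(0.9) + 0.1 log(0.1)) ≈ 0.47 bits, and a coin that always lands heads has H = 0.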


Conditional Entropy
We now introduce conditional entropy:

H(Outcome | Known)

The uncertainty about the outcome, given that we know the value of Known
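It can be computed as the weighted average of the entropy within each group defined by the known variable:
H(Outcome | Input) = Σ P(Input = v) × H(Outcome | Input = v), summed over the values v of Input
This is the weighted child-set entropy used in the worked example below.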


Information Gain
If we know H(Outcome)
And we know H(Outcome | Input)
We can calculate how much Input tells us
about Outcome simply as:

H(Outcome) - H(Outcome | Input)


This is the information gain of Input


Picking the Top Node


ID3 picks the top node of the tree by calculating the information gain of the output class for each input variable, and picks the one that removes the most uncertainty
It creates a branch for each value the
chosen variable can take


Adding Branches
Branches are added by making the same
information gain calculation for data
defined by the location on the tree of the
current branch
If all objects at the current leaf are in the
same class, no more branching is needed
The algorithm also stops when all the data
has been accounted for


Person   Hair Length   Weight   Age   Class
Homer          0          250     36     M
Marge         10          150     34     F
Bart           2           90     10     M
Lisa           6           78      8     F
Maggie         4           20      1     F
Abe            1          170     70     M
Selma          8          160     41     F
Otto          10          180     38     M
Krusty         6          200     45     M

Comic          8          290     38     ?

Entropy(S) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

Entropy(4F, 5M) = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911

Let us try splitting on Hair Length
[Split: Hair Length <= 5?  yes / no]

Gain(A) = E(current set) - Σ E(all child sets)

Gain(Hair Length <= 5) = 0.9911 - (4/9 * 0.8113 + 5/9 * 0.9710) = 0.0911


Entropy(S) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

Entropy(4F, 5M) = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911

Let us try splitting on Weight
[Split: Weight <= 160?  yes / no]

Gain(A) = E(current set) - Σ E(all child sets)

Gain(Weight <= 160) = 0.9911 - (5/9 * 0.7219 + 4/9 * 0) = 0.5900

Entropy(S) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

Entropy(4F, 5M) = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911

Let us try splitting on Age
[Split: Age <= 40?  yes / no]

Gain(A) = E(current set) - Σ E(all child sets)

Gain(Age <= 40) = 0.9911 - (6/9 * 1 + 3/9 * 0.9183) = 0.0183
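
These three gains can be reproduced with a short Python sketch (the data is transcribed from the table above; the function names are just for illustration):

import math

def entropy(labels):
    # entropy (in bits) of a list of class labels
    n = len(labels)
    return sum(-c / n * math.log2(c / n)
               for c in (labels.count(x) for x in set(labels)))

def gain(rows, test, labels):
    # information gain of a boolean test over the rows
    yes = [l for r, l in zip(rows, labels) if test(r)]
    no = [l for r, l in zip(rows, labels) if not test(r)]
    n = len(labels)
    return entropy(labels) - (len(yes) / n * entropy(yes) + len(no) / n * entropy(no))

# (hair length, weight, age) and class for the nine labelled people
rows = [(0, 250, 36), (10, 150, 34), (2, 90, 10), (6, 78, 8), (4, 20, 1),
        (1, 170, 70), (8, 160, 41), (10, 180, 38), (6, 200, 45)]
labels = ['M', 'F', 'M', 'F', 'F', 'M', 'F', 'M', 'M']

print(gain(rows, lambda r: r[0] <= 5, labels))    # Hair Length <= 5  -> ~0.0911
print(gain(rows, lambda r: r[1] <= 160, labels))  # Weight <= 160     -> ~0.5900
print(gain(rows, lambda r: r[2] <= 40, labels))   # Age <= 40         -> ~0.0183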


Of the 3 features we had, Weight was best. But while people who weigh over 160 are perfectly classified (as males), the under-160 people are not perfectly classified. So we simply recurse!

[Split: Weight <= 160?  yes / no]

This time we find that we can split on Hair Length, and we are done!

[Split: Hair Length <= 2?  yes / no]

We don't need to keep the data around, just the test conditions:

Weight <= 160?
  no  → Male
  yes → Hair Length <= 2?
          yes → Male
          no  → Female

How would these people be classified?


It is trivial to convert Decision Trees to rules:

Weight <= 160?
  no  → Male
  yes → Hair Length <= 2?
          yes → Male
          no  → Female

Rules to Classify Males/Females
If Weight greater than 160, classify as Male
Else if Hair Length less than or equal to 2, classify as Male
Else classify as Female
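
Written directly in Python, the same rules might look like this (a sketch; the function name is illustrative):

def classify(weight, hair_length):
    # rules read straight off the decision tree above
    if weight > 160:
        return 'Male'
    elif hair_length <= 2:
        return 'Male'
    else:
        return 'Female'

print(classify(290, 8))   # the unlabelled 'Comic' row -> 'Male'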

Other Classification Methods


You will meet a certain type of neural network in a later lecture; these too are good at classification
There are many, many, many other
methods for building classification systems
