Data Mining
Classification
Jingpeng Li


What is Classification?
Assigning an object to a certain class
based on its similarity to previous
examples of other objects
Can be done with reference to original
data or based on a model of that data
E.g. Me: It's round, green, and edible
You: It's an apple!

Usual Examples
Classifying transactions as genuine or
fraudulent, e.g. credit card usage,
insurance claims, cell phone calls
Classifying prospects as good or bad
customers
Classifying engine faults by their
symptoms


Certainty
As with most data mining solutions, a
classification usually comes with a degree
of certainty.
It might be the probability of the object
belonging to the class or it might be some
other measure of how closely the object
resembles other examples from that class.


Techniques
Non-parametric, e.g. K-nearest neighbour
Mathematical models, e.g. neural
networks
Rule-based models, e.g. decision trees


Predictive / Definitive
Classification may indicate a propensity to
act in a certain way, e.g. a prospect is
likely to become a customer. This is
predictive.
Classification may indicate similarity to
objects that are definitely members of a
given class, e.g. small, round, green =
apple


Simple Worked Example


Risk of making a claim on a motor
insurance policy
This is a predictive classification: they haven't made the claim yet, but do they look like other people who have?
To keep it simple, let's look at just age and gender


The Data

Age  Gender  Claim?
30   Female  No
31   Male    No
27   Male    No
20   Male    Yes
29   Female  No
32   Male    No
46   Male    No
45   Male    No
33   Male    No
25   Female  No
38   Female  No
21   Female  No
38   Female  No
42   Male    No
29   Male    No
37   Male    No
40   Female  No

[Scatter plot of the data: Age against Gender, with claim / no-claim points marked]


K-Nearest Neighbour
Performed on raw data
Count the number of other examples that are close
Winner is the most common class among them

[Scatter plot: Age against Gender, existing examples plus the new person to classify]

K-Nearest Neighbour
Should the test sample
(green circle) be the 1st
class (blue squares) or the
2nd class (red triangles)?
If k = 3 (solid line circle), it is
assigned to the 2nd class
because there are 2 triangles
and only 1 square inside the
inner circle.
If k = 5 (dashed line circle), it is
assigned to the 1st class
(3 squares vs. 2 triangles).
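
A minimal sketch of this idea in Python (the toy points and the function name are illustrative, not from the slides):

import math
from collections import Counter

def knn_classify(sample, examples, k):
    # examples is a list of ((x, y), label) pairs; distance is Euclidean
    nearest = sorted(examples, key=lambda e: math.dist(sample, e[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# toy points loosely mirroring the squares-vs-triangles picture above
examples = [((1, 1), 'square'), ((2, 1), 'square'), ((1, 2), 'square'),
            ((4, 4), 'triangle'), ((5, 4), 'triangle'), ((4, 5), 'triangle')]

print(knn_classify((3.6, 3.6), examples, k=3))   # -> 'triangle'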



Rule Based
If Gender = Male and Age < 30 then Claim
If Gender = Male and Age > 30 then No Claim
Etc.

[Scatter plot: Age against Gender, with the new person to classify marked]

Decision Trees
A good automatic rule discovery technique
is the decision tree
Produces a set of branching decisions that
end in a classification
Works best on nominal attributes;
numeric ones need to be split into bins


A Decision Tree
Note: not all attributes are used in all decisions

Legs?
  4 → Size?
        Med   → Cat
        Small → Mouse
  0 → Swims?
        Y → Fish
        N → Snake
  2 → Bird


Making a Classification
Each node represents a single variable
Each branch represents a value that variable
can take
To classify a single example, start at the top of
the tree and see which variable it represents
Follow the branch that corresponds to the value
that variable takes in your example
Keep going until you reach a leaf, where your
object is classified!
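For example, a small animal with 4 legs: follow the 4 branch at Legs, then the Small branch at Size, and it is classified as a Mouse.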


Tree Structure
There are lots of ways to arrange a
decision tree
Does it matter which variables go where?
Yes:
You need to optimise the number of correct
classifications
You want to make the classification process
as fast as possible


A Tree Building Algorithm


Divide and Conquer:
Choose the variable that is at the top of the tree
Create a branch for each possible value
For each branch, repeat the process until there
are no more branches to make (i.e. stop when
all the instances at the current branch are in
the same class)
But how do you choose which variable to split?


The ID3 Algorithm


Split on the variable that gives the greatest
information gain
Information can be thought of as a
measure of uncertainty
Information is a measure based on the
probability of something happening


Information Example
If I pick a random card from a deck and
you have to guess what it is, which would
you rather be told:
It is red (which has a probability of 0.5), or
it is a picture card (which has a probability
of 4/13 = 0.31)


Calculating Information
The information associated with a single
event:
I(e) = -log(p_e)
where p_e is the probability of event e occurring, and log is the base 2 log
I(Red) = -log(0.5) = 1
I(Picture card) = -log(0.31) = 1.7


Average Information
The weighted average information across all possible values of a variable is called Entropy.
It is calculated as the sum of the probability of each possible event times its information value:

H(X) = Σ P(x_i) I(x_i) = -Σ P(x_i) log(P(x_i))

where log is the base 2 log.

Entropy of IsPicture?
I(Picture) = -log(4/13) = 1.7
I(Not Picture) = -log(9/13) = 0.53
H = (4/13)*1.7 + (9/13)*0.53 = 0.89
Entropy H(X) is a measure of uncertainty
in variable X
The more even the distribution of X
becomes, the higher the entropy gets
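
These calculations are easy to check with a few lines of Python (a sketch; the function names are just for illustration):

import math

def information(p):
    # information (surprisal) of an event with probability p, in bits
    return -math.log2(p)

def entropy(probs):
    # entropy of a distribution given as a list of probabilities
    return sum(p * information(p) for p in probs if p > 0)

print(information(0.5))           # I(Red) = 1.0
print(information(4 / 13))        # I(Picture) ~ 1.70
print(information(9 / 13))        # I(Not Picture) ~ 0.53
print(entropy([4 / 13, 9 / 13]))  # H(IsPicture?) ~ 0.89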


Unfair Coin Entropy


The more even the distribution of X
becomes, the higher the entropy gets
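For example, a fair coin has H = -(0.5 log(0.5) + 0.5 log(0.5)) = 1 bit, a coin with P(heads) = 0.9 has H = -(0.9 log(0.9) + 0.1 log(0.1)) ≈ 0.47 bits, and a coin that always lands heads has H = 0.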


Conditional Entropy
We now introduce conditional entropy:

H(Outcome | Known)

The uncertainty about the outcome, given that we know the value of Known
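It can be computed as the weighted average of the entropy within each group defined by the known variable:
H(Outcome | Input) = Σ P(Input = v) × H(Outcome | Input = v), summed over the values v of Input
This is the weighted child-set entropy used in the worked example below.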


Information Gain
If we know H(Outcome)
And we know H(Outcome | Input)
We can calculate how much Input tells us
about Outcome simply as:

H(Outcome) - H(Outcome | Input)


This is the information gain of Input


Picking the Top Node


ID3 picks the top node of the tree by calculating the information gain of the output class for each input variable, and picks the one that removes the most uncertainty
It creates a branch for each value the
chosen variable can take


Adding Branches
Branches are added by making the same
information gain calculation for data
defined by the location on the tree of the
current branch
If all objects at the current leaf are in the
same class, no more branching is needed
The algorithm also stops when all the data
has been accounted for


Person   Hair Length   Weight   Age   Class
Homer          0          250     36     M
Marge         10          150     34     F
Bart           2           90     10     M
Lisa           6           78      8     F
Maggie         4           20      1     F
Abe            1          170     70     M
Selma          8          160     41     F
Otto          10          180     38     M
Krusty         6          200     45     M

Comic          8          290     38     ?

Entropy(S) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

Entropy(4F, 5M) = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911

Let us try splitting on Hair Length
[Split: Hair Length <= 5?  yes / no]

Gain(A) = E(current set) - Σ E(all child sets)

Gain(Hair Length <= 5) = 0.9911 - (4/9 * 0.8113 + 5/9 * 0.9710) = 0.0911


Entropy(S) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

Entropy(4F, 5M) = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911

Let us try splitting on Weight
[Split: Weight <= 160?  yes / no]

Gain(A) = E(current set) - Σ E(all child sets)

Gain(Weight <= 160) = 0.9911 - (5/9 * 0.7219 + 4/9 * 0) = 0.5900

Entropy(S) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

Entropy(4F, 5M) = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911

Let us try splitting on Age
[Split: Age <= 40?  yes / no]

Gain(A) = E(current set) - Σ E(all child sets)

Gain(Age <= 40) = 0.9911 - (6/9 * 1 + 3/9 * 0.9183) = 0.0183
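
These three gains can be reproduced with a short Python sketch (the data is transcribed from the table above; the function names are just for illustration):

import math

def entropy(labels):
    # entropy (in bits) of a list of class labels
    n = len(labels)
    return sum(-c / n * math.log2(c / n)
               for c in (labels.count(x) for x in set(labels)))

def gain(rows, test, labels):
    # information gain of a boolean test over the rows
    yes = [l for r, l in zip(rows, labels) if test(r)]
    no = [l for r, l in zip(rows, labels) if not test(r)]
    n = len(labels)
    return entropy(labels) - (len(yes) / n * entropy(yes) + len(no) / n * entropy(no))

# (hair length, weight, age) and class for the nine labelled people
rows = [(0, 250, 36), (10, 150, 34), (2, 90, 10), (6, 78, 8), (4, 20, 1),
        (1, 170, 70), (8, 160, 41), (10, 180, 38), (6, 200, 45)]
labels = ['M', 'F', 'M', 'F', 'F', 'M', 'F', 'M', 'M']

print(gain(rows, lambda r: r[0] <= 5, labels))    # Hair Length <= 5  -> ~0.0911
print(gain(rows, lambda r: r[1] <= 160, labels))  # Weight <= 160     -> ~0.5900
print(gain(rows, lambda r: r[2] <= 40, labels))   # Age <= 40         -> ~0.0183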


Of the 3 features we had, Weight was best. But while people who weigh over 160 are perfectly classified (as males), the under-160 people are not perfectly classified. So we simply recurse!

[Split: Weight <= 160?  yes / no]

This time we find that we can split on Hair Length, and we are done!

[Split: Hair Length <= 2?  yes / no]

We don't need to keep the data around, just the test conditions:

Weight <= 160?
  no  → Male
  yes → Hair Length <= 2?
          yes → Male
          no  → Female

How would these people be classified?


It is trivial to convert Decision Trees to rules:

Weight <= 160?
  no  → Male
  yes → Hair Length <= 2?
          yes → Male
          no  → Female

Rules to Classify Males/Females
If Weight greater than 160, classify as Male
Else if Hair Length less than or equal to 2, classify as Male
Else classify as Female
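
Written directly in Python, the same rules might look like this (a sketch; the function name is illustrative):

def classify(weight, hair_length):
    # rules read straight off the decision tree above
    if weight > 160:
        return 'Male'
    elif hair_length <= 2:
        return 'Male'
    else:
        return 'Female'

print(classify(290, 8))   # the unlabelled 'Comic' row -> 'Male'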

Other Classification Methods


You will meet a certain type of neural network in a later lecture; these too are good at classification
There are many, many, many other
methods for building classification systems
