

Pattern
Classification

All materials in these slides were taken from
Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork,
John Wiley & Sons, 2000, with the permission of the authors and the publisher

• Posterior, likelihood, evidence


• P(ωj | x) = P(x | ωj) P(ωj) / P(x)
• where, in the case of two categories:

P(x) = ∑_{j=1}^{2} P(x | ωj) P(ωj)

• Posterior = (Likelihood × Prior) / Evidence
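As a quick illustration, here is a minimal Python sketch of this formula for a finite set of categories; the likelihood and prior values are made up for the example, not taken from the slides:

```python
# Bayes formula: posterior = likelihood * prior / evidence.
# The likelihoods and priors below are illustrative numbers only.

def posteriors(likelihoods, priors):
    """Return P(w_j | x) for every class j given P(x | w_j) and P(w_j)."""
    evidence = sum(l * p for l, p in zip(likelihoods, priors))   # P(x)
    return [l * p / evidence for l, p in zip(likelihoods, priors)]

# Example: P(x | w1) = 0.6, P(x | w2) = 0.2, with priors 0.3 and 0.7.
print(posteriors([0.6, 0.2], [0.3, 0.7]))   # -> [0.5625, 0.4375]
```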



• Decision given the posterior probabilities

x is an observation for which:

if P(ω1 | x) > P(ω2 | x) True state of nature = ω1


if P(ω1 | x) < P(ω2 | x) True state of nature = ω2

Therefore:
whenever we observe a particular x, the probability of
error is :
P(error | x) = P(ω1 | x) if we decide ω2
P(error | x) = P(ω2 | x) if we decide ω1

• Minimizing the probability of error


• Decide ω1 if P(ω1 | x) > P(ω2 | x);
otherwise decide ω2

Therefore:
P(error | x) = min [P(ω1 | x), P(ω2 | x)]
(Bayes decision)
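A small sketch of the resulting decision rule, continuing the illustrative posteriors from the previous example:

```python
# Bayes decision for two categories: pick the class with the larger posterior;
# the conditional probability of error is then the smaller posterior.

def bayes_decide(post_w1, post_w2):
    decision = 1 if post_w1 > post_w2 else 2          # decide w1 or w2
    p_error = min(post_w1, post_w2)                   # P(error | x)
    return decision, p_error

print(bayes_decide(0.5625, 0.4375))   # -> (1, 0.4375)
```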


Let {ω1, ω2,…, ωc} be the set of c states of nature


(or “categories”)

Let {α1, α2,…, αa} be the set of possible actions

Let λ(αi | ωj) be the loss incurred for taking

action αi when the state of nature is ωj

Overall risk
R = expected value of R(α(x) | x) over all x, for a given decision rule α(x)

Conditional risk

R(αi | x) = ∑_{j=1}^{c} λ(αi | ωj) P(ωj | x)      for i = 1,…, a

Minimizing R ⇔ for every x, taking the action αi that minimizes R(αi | x)
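A sketch of how the conditional risk drives the choice of action; the loss matrix and posterior values below are illustrative, not taken from the slides:

```python
# Conditional risk for action a_i:  R(a_i | x) = sum_j loss[i][j] * P(w_j | x).
# Minimizing the overall risk means picking, for each x, the action with the
# smallest conditional risk. The loss matrix and posteriors are made up here.

def conditional_risks(loss, posteriors):
    return [sum(loss[i][j] * posteriors[j] for j in range(len(posteriors)))
            for i in range(len(loss))]

loss = [[0.0, 1.0],    # lambda(a_1 | w_1), lambda(a_1 | w_2)
        [2.0, 0.0]]    # lambda(a_2 | w_1), lambda(a_2 | w_2)
risks = conditional_risks(loss, [0.7, 0.3])
best_action = min(range(len(risks)), key=risks.__getitem__)  # 0-based index
print(risks, best_action)   # -> [0.3, 1.4] 0  (take action a_1)
```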

• Two-category classification
α1 : deciding ω1
α2 : deciding ω2
λij = λ(αi | ωj)
loss incurred for deciding ωi when the true state of nature is ωj

Conditional risk:

R(α1 | x) = λ11P(ω1 | x) + λ12P(ω2 | x)


R(α2 | x) = λ21P(ω1 | x) + λ22P(ω2 | x)


Our rule is the following:


if R(α1 | x) < R(α2 | x)
action α1: “decide ω1” is taken

This results in the equivalent rule:


decide ω1 if:

(λ21 − λ11) P(x | ω1) P(ω1) > (λ12 − λ22) P(x | ω2) P(ω2)

and decide ω2 otherwise



Likelihood ratio:

The preceding rule is equivalent to the following rule:

if  P(x | ω1) / P(x | ω2)  >  [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)]

Then take action α1 (decide ω1)


Otherwise take action α2 (decide ω2)
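A minimal sketch of this likelihood-ratio test; all likelihoods, priors and losses are illustrative values:

```python
# Likelihood-ratio test for two categories with a general loss matrix:
# decide w1 if  p(x|w1)/p(x|w2) > ((l12 - l22)/(l21 - l11)) * P(w2)/P(w1).
# All numbers below are illustrative.

def decide_by_likelihood_ratio(px_w1, px_w2, prior1, prior2, loss):
    (l11, l12), (l21, l22) = loss
    threshold = (l12 - l22) / (l21 - l11) * (prior2 / prior1)
    return "w1" if px_w1 / px_w2 > threshold else "w2"

loss = [[0.0, 1.0],
        [1.0, 0.0]]               # zero-one loss -> threshold is P(w2)/P(w1)
print(decide_by_likelihood_ratio(0.6, 0.2, 0.3, 0.7, loss))   # -> w1
```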


Optimal decision property

“If the likelihood ratio exceeds a threshold value


independent of the input pattern x, we can take
optimal actions”


Minimum-Error-Rate Classification

• Actions are decisions on classes


If action αi is taken and the true state of nature is ωj then:
the decision is correct if i = j and in error if i ≠ j

• Seek a decision rule that minimizes the probability of error


which is the error rate


• Introduction of the zero-one loss function:

λ(αi, ωj) = 0 if i = j,  1 if i ≠ j        (i, j = 1,…, c)
Therefore, the conditional risk is:

R(αi | x) = ∑_{j=1}^{c} λ(αi | ωj) P(ωj | x)
          = ∑_{j≠i} P(ωj | x)
          = 1 − P(ωi | x)

“The risk corresponding to this loss function is the average probability of error”


• Minimizing the risk requires maximizing P(ωi | x)


(since R(αi | x) = 1 – P(ωi | x))

• For minimum error rate:


• Decide ωi if P (ωi | x) > P(ωj | x) ∀j ≠ i
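A short sketch of the minimum-error-rate rule as an argmax over posteriors; the posterior values are illustrative:

```python
# Minimum-error-rate rule: decide the class whose posterior P(w_i | x) is largest.
# The posterior values are illustrative.

def min_error_rate_decision(posteriors):
    i = max(range(len(posteriors)), key=posteriors.__getitem__)
    return i + 1, 1.0 - posteriors[i]     # class label (1-based), R(a_i | x)

print(min_error_rate_decision([0.2, 0.5, 0.3]))   # -> (2, 0.5)
```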


• Regions of decision and zero-one loss function,


therefore:

Let θλ = [(λ12 − λ22) / (λ21 − λ11)] · P(ω2) / P(ω1);
then decide ω1 if:  P(x | ω1) / P(x | ω2) > θλ

• If λ is the zero-one loss function, which means:

  λ = | 0  1 |
      | 1  0 |

  then θλ = P(ω2) / P(ω1) = θa

  If instead λ = | 0  2 |
                 | 1  0 |

  then θλ = 2 P(ω2) / P(ω1) = θb


Classifiers, Discriminant Functions and Decision Surfaces

• The multi-category case


• Set of discriminant functions gi(x), i = 1,…, c
• The classifier assigns a feature vector x to class ωi
if:
gi(x) > gj(x) ∀j ≠ i
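A minimal sketch of such a classifier; the two linear discriminant functions used here are hypothetical placeholders, not functions from the slides:

```python
# Multi-category classifier built from discriminant functions g_i(x):
# assign x to class w_i when g_i(x) > g_j(x) for all j != i.
# The two discriminants below are placeholders for illustration.

def classify(x, discriminants):
    scores = [g(x) for g in discriminants]
    return scores.index(max(scores)) + 1          # 1-based class label

g1 = lambda x: 2.0 * x[0] + 1.0 * x[1] - 1.0      # hypothetical g_1
g2 = lambda x: -1.0 * x[0] + 0.5 * x[1] + 0.2     # hypothetical g_2
print(classify([1.0, 0.5], [g1, g2]))             # -> 1
```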

• Let gi(x) = - R(αi | x)
(max. discriminant corresponds to min. risk!)

• For the minimum error rate, we take


gi(x) = P(ωi | x)

(max. discriminant corresponds to max. posterior!)
gi(x) ≡ P(x | ωi) P(ωi)

gi(x) = ln P(x | ωi) + ln P(ωi)


(ln: natural logarithm!)


• Feature space divided into c decision regions


if gi(x) > gj(x) ∀j ≠ i then x is in Ri
(Ri means assign x to ωi)

• The two-category case


• A classifier is a “dichotomizer” that has two discriminant
functions g1 and g2
Let g(x) ≡ g1(x) – g2(x)

Decide ω1 if g(x) > 0 ; Otherwise decide ω2


• The computation of g(x)


g(x) = P(ω1 | x) − P(ω2 | x)

or, equivalently (the two forms have the same sign, so they yield the same decision):

g(x) = ln [P(x | ω1) / P(x | ω2)] + ln [P(ω1) / P(ω2)]
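A small sketch of the log-ratio form of g(x), with illustrative likelihoods and priors:

```python
# Dichotomizer in log form: g(x) = ln p(x|w1)/p(x|w2) + ln P(w1)/P(w2);
# decide w1 when g(x) > 0. Likelihoods and priors are illustrative.
import math

def dichotomizer(px_w1, px_w2, prior1, prior2):
    g = math.log(px_w1 / px_w2) + math.log(prior1 / prior2)
    return g, ("w1" if g > 0 else "w2")

print(dichotomizer(0.6, 0.2, 0.3, 0.7))   # g ~ 0.251 -> decide w1
```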

Discriminant Functions for the Normal Density

• We saw that the minimum error-rate classification


can be achieved by the discriminant function

gi(x) = ln P(x | ωi) + ln P(ωi)

• Case of multivariate normal


gi(x) = −(1/2) (x − µi)^t Σi^{-1} (x − µi) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(ωi)
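A sketch of this discriminant with NumPy, assuming the mean µi, covariance Σi and prior P(ωi) of each class are known; the numerical values are illustrative:

```python
# General Gaussian discriminant:
# g_i(x) = -1/2 (x-mu_i)^T Sigma_i^{-1} (x-mu_i) - d/2 ln 2pi - 1/2 ln|Sigma_i| + ln P(w_i)
import numpy as np

def gaussian_discriminant(x, mu, sigma, prior):
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(sigma) @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(sigma))
            + np.log(prior))

x = np.array([1.0, 2.0])
mu = np.array([0.0, 1.0])
sigma = np.array([[2.0, 0.3], [0.3, 1.0]])        # illustrative parameters
print(gaussian_discriminant(x, mu, sigma, prior=0.5))
```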


• Case Σi = σ²I  (I stands for the identity matrix)

gi(x) = wi^t x + wi0        (linear discriminant function)

where:

wi = µi / σ² ;    wi0 = − (1 / (2σ²)) µi^t µi + ln P(ωi)

(wi0 is called the threshold for the i-th category!)
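A sketch of the resulting linear machine for two classes; the means, variance and priors are illustrative values:

```python
# Linear discriminant for the case Sigma_i = sigma^2 * I:
# w_i = mu_i / sigma^2,  w_i0 = -mu_i^T mu_i / (2 sigma^2) + ln P(w_i).
# Means, variance and priors below are illustrative.
import numpy as np

def linear_discriminant(mu, sigma2, prior):
    w = mu / sigma2
    w0 = -mu @ mu / (2.0 * sigma2) + np.log(prior)
    return lambda x: w @ x + w0

g1 = linear_discriminant(np.array([0.0, 0.0]), sigma2=1.0, prior=0.5)
g2 = linear_discriminant(np.array([2.0, 2.0]), sigma2=1.0, prior=0.5)
x = np.array([0.4, 0.3])
print("w1" if g1(x) > g2(x) else "w2")   # -> w1 (x is closer to the first mean)
```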


• A classifier that uses linear discriminant functions


is called “a linear machine”

• The decision surfaces for a linear machine are


pieces of hyperplanes defined by:

gi(x) = gj(x)


• The hyperplane separating Ri and Rj


x0 = (1/2) (µi + µj) − [σ² / ‖µi − µj‖²] ln [P(ωi) / P(ωj)] (µi − µj)

always orthogonal to the line linking the means!

if P(ωi) = P(ωj) then x0 = (1/2) (µi + µj)
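A small sketch that evaluates x0 for this case; the means, variance and priors are illustrative:

```python
# Point x0 on the separating hyperplane for the Sigma_i = sigma^2 * I case:
# x0 = 1/2 (mu_i + mu_j) - sigma^2 / ||mu_i - mu_j||^2 * ln(P(w_i)/P(w_j)) * (mu_i - mu_j)
# Values are illustrative.
import numpy as np

def boundary_point(mu_i, mu_j, sigma2, prior_i, prior_j):
    diff = mu_i - mu_j
    shift = sigma2 / (diff @ diff) * np.log(prior_i / prior_j)
    return 0.5 * (mu_i + mu_j) - shift * diff

mu_i, mu_j = np.array([0.0, 0.0]), np.array([2.0, 2.0])
print(boundary_point(mu_i, mu_j, sigma2=1.0, prior_i=0.5, prior_j=0.5))
# -> [1. 1.]  (equal priors: x0 is the midpoint of the means)
```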


• Case Σi = Σ (the covariance matrices of all classes are identical but arbitrary!)

• Hyperplane separating Ri and Rj

x0 = (1/2) (µi + µj) − [ln (P(ωi) / P(ωj)) / ((µi − µj)^t Σ^{-1} (µi − µj))] (µi − µj)

(the hyperplane separating Ri and Rj is generally


not orthogonal to the line between the means!)
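A sketch of the same boundary-point computation for a shared, arbitrary covariance matrix; the parameters are illustrative:

```python
# Boundary point for Sigma_i = Sigma (shared but arbitrary covariance):
# x0 = 1/2 (mu_i + mu_j)
#      - ln(P(w_i)/P(w_j)) / ((mu_i - mu_j)^T Sigma^{-1} (mu_i - mu_j)) * (mu_i - mu_j)
# Parameters below are illustrative.
import numpy as np

def boundary_point_shared_cov(mu_i, mu_j, sigma, prior_i, prior_j):
    diff = mu_i - mu_j
    mahalanobis_sq = diff @ np.linalg.inv(sigma) @ diff
    return 0.5 * (mu_i + mu_j) - (np.log(prior_i / prior_j) / mahalanobis_sq) * diff

sigma = np.array([[1.0, 0.5], [0.5, 2.0]])
print(boundary_point_shared_cov(np.array([0.0, 0.0]), np.array([2.0, 1.0]),
                                sigma, prior_i=0.7, prior_j=0.3))
```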

• Case Σi = arbitrary

• The covariance matrices are different for each category

gi(x) = x^t Wi x + wi^t x + wi0

where:

Wi = − (1/2) Σi^{-1}

wi = Σi^{-1} µi

wi0 = − (1/2) µi^t Σi^{-1} µi − (1/2) ln |Σi| + ln P(ωi)

(The decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes,
hyperspheres, hyperellipsoids, hyperparaboloids, hyperhyperboloids)
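A sketch of the quadratic discriminant built from these definitions, again with illustrative Gaussian parameters:

```python
# Quadratic (arbitrary-covariance) discriminant:
# g_i(x) = x^T W_i x + w_i^T x + w_i0 with
# W_i = -1/2 Sigma_i^{-1},  w_i = Sigma_i^{-1} mu_i,
# w_i0 = -1/2 mu_i^T Sigma_i^{-1} mu_i - 1/2 ln|Sigma_i| + ln P(w_i).
# The Gaussian parameters below are illustrative.
import numpy as np

def quadratic_discriminant(mu, sigma, prior):
    sigma_inv = np.linalg.inv(sigma)
    W = -0.5 * sigma_inv
    w = sigma_inv @ mu
    w0 = -0.5 * mu @ sigma_inv @ mu - 0.5 * np.log(np.linalg.det(sigma)) + np.log(prior)
    return lambda x: x @ W @ x + w @ x + w0

g1 = quadratic_discriminant(np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5)
g2 = quadratic_discriminant(np.array([2.0, 0.0]), np.array([[3.0, 0.0], [0.0, 0.5]]), 0.5)
x = np.array([1.0, 0.2])
print("w1" if g1(x) > g2(x) else "w2")   # -> w2 for this x
```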

Bayes Decision Theory – Discrete Features
• Components of x are binary or integer valued; x can take only one of m discrete values v1, v2, …, vm

• Case of independent binary features in a 2-category problem


Let x = [x1, x2, …, xd ]t where each xi is either 0 or 1, with
probabilities:
pi = P(xi = 1 | ω1)
qi = P(xi = 1 | ω2)


• The discriminant function in this case is:


g(x) = ∑_{i=1}^{d} wi xi + w0

where:

wi = ln [pi (1 − qi) / (qi (1 − pi))] ,      i = 1,…, d

and:

w0 = ∑_{i=1}^{d} ln [(1 − pi) / (1 − qi)] + ln [P(ω1) / P(ω2)]

decide ω1 if g(x) > 0 and ω2 if g(x) ≤ 0
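A minimal sketch of this discriminant for a few binary features; the probabilities pi, qi and the priors are illustrative:

```python
# Discriminant for independent binary features:
# w_i = ln[ p_i (1 - q_i) / (q_i (1 - p_i)) ],
# w_0 = sum_i ln[(1 - p_i)/(1 - q_i)] + ln[P(w1)/P(w2)];  decide w1 if g(x) > 0.
# The probabilities p_i, q_i and priors are illustrative.
import math

def binary_feature_discriminant(x, p, q, prior1, prior2):
    w = [math.log(pi * (1 - qi) / (qi * (1 - pi))) for pi, qi in zip(p, q)]
    w0 = (sum(math.log((1 - pi) / (1 - qi)) for pi, qi in zip(p, q))
          + math.log(prior1 / prior2))
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

g = binary_feature_discriminant(x=[1, 0, 1], p=[0.8, 0.5, 0.7],
                                q=[0.3, 0.5, 0.2], prior1=0.5, prior2=0.5)
print(g, "-> decide w1" if g > 0 else "-> decide w2")
```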