
3. Bayes and Normal Models


Aleix M. Martinez
aleix@ece.osu.edu
Handouts for ECE 874 Sp 2007
Why Bayesian?
If all our research (in PR) was to disappear
and you could only save one theory, which
one would you save?
Bayesian theory is probably the most
important one you should keep.
It's simple, intuitive, and optimal.
Reverend Bayes (1763) and Laplace (1812) set the foundations of what we now know as Bayes theory.
State of nature: class
A sample (usually) corresponds to a state of nature; e.g., salmon and sea bass.
The state of nature usually corresponds to a set of discrete categories (classes). Note that the continuous case also exists.
Priors: some classes might occur more often or might be more important, $P(w_i)$.
Decision rule
We need a decision rule to help us determine
to which class a testing vector belongs.
Simplest (useless): $C = \arg\max_i P(w_i)$.
Posterior probability: $p(w_i \mid \mathbf{x})$, where $\mathbf{x}$ is the (observed) data.
Obviously, we do not have $p(w_i \mid \mathbf{x})$.
But we can estimate: $p(\mathbf{x} \mid w_i)$ & $P(w_i)$.
Bayes Theorem (yes, the famous
one)
Bayes Theorem: $p(w_i \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid w_i)\,P(w_i)}{p(\mathbf{x})}$, where $p(\mathbf{x}) = \sum_{j=1}^{c} p(\mathbf{x} \mid w_j)\,P(w_j)$.
Bayes decision rule: $\arg\max_i\, p(w_i \mid \mathbf{x})$.
The normalization by $p(\mathbf{x})$ guarantees $0 \le p(w_i \mid \mathbf{x}) \le 1$.
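As a small illustration (not part of the original handout), the posterior computation can be sketched in MATLAB; the likelihood values and priors below are made-up numbers.

  % Posterior via Bayes theorem for c classes (illustrative values).
  px_given_w = [0.20 0.05 0.10];      % p(x|w_i), assumed known
  Pw         = [0.50 0.30 0.20];      % priors P(w_i)
  px         = sum(px_given_w .* Pw); % evidence p(x)
  Pw_given_x = (px_given_w .* Pw) / px;
  [~, C] = max(Pw_given_x);           % Bayes decision: arg max_i p(w_i|x)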
Rev. Thomas Bayes (1702-1761)
During his lifetime, Bayes was a defender of Isaac
Newton's calculus, and developed several important
results, of which the Bayes Theorem is his best known and, arguably, most elegant. This theorem and the
subsequent development of Bayesian theory are among
the most relevant topics in pattern recognition and have
found applications in almost every corner of the
scientific world. Bayes himself did not, however,
provide the derivation of the Bayes Theorem as it is
now known to us. Bayes developed the method for
uniform priors. This result was later extended by Laplace
and contemporaries. Nonetheless, Bayes is generally
acknowledged as the first to have established a
mathematical basis for probability inference.
Multiple random variables
To be mathematically precise, one should write $p_X(\mathbf{x} \mid w_i)$ instead of $p(\mathbf{x} \mid w_i)$, because this probability density function depends on a single random variable X.
In general there is no need for this distinction (e.g., $p_X$ & $p_Y$). Should the need arise, we will use the above notation.
(see Appendix A.4)
Loss function & decision risk
States exactly how costly each action is, and
is used to convert a probability
determination into a decision.
classes: $\{w_1, \ldots, w_c\}$; actions: $\{\alpha_1, \ldots, \alpha_a\}$;
loss function: $\lambda(\alpha_i \mid w_j)$, the cost (risk) of taking action $\alpha_i$ when the true state of nature is $w_j$.
Conditional risk: $R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid w_j)\, p(w_j \mid \mathbf{x})$.
Bayes decision rule
The resulting minimum overall risk is called the Bayes risk.
Conditional risk: $R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid w_j)\, p(w_j \mid \mathbf{x})$.
Bayes decision rule (Bayes risk): $\arg\min_i R(\alpha_i \mid \mathbf{x})$.
A simple example
Two-class classifier:
$R(\alpha_1 \mid \mathbf{x}) = \lambda_{11}\, p(w_1 \mid \mathbf{x}) + \lambda_{12}\, p(w_2 \mid \mathbf{x})$
$R(\alpha_2 \mid \mathbf{x}) = \lambda_{21}\, p(w_1 \mid \mathbf{x}) + \lambda_{22}\, p(w_2 \mid \mathbf{x})$
Notation: $\lambda_{ij} = \lambda(\alpha_i \mid w_j)$.
Decision rule: decide $w_1$ (or $\alpha_1$) if $R(\alpha_1 \mid \mathbf{x}) < R(\alpha_2 \mid \mathbf{x})$, i.e., if
$(\lambda_{21} - \lambda_{11})\, p(w_1 \mid \mathbf{x}) > (\lambda_{12} - \lambda_{22})\, p(w_2 \mid \mathbf{x})$.
Applying Bayes:
$\dfrac{p(\mathbf{x} \mid w_1)}{p(\mathbf{x} \mid w_2)} > \dfrac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \dfrac{P(w_2)}{P(w_1)}$  (a likelihood-ratio threshold).
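A hedged MATLAB sketch of this likelihood-ratio test (losses, priors, and likelihood values are all illustrative):

  % Two-class likelihood-ratio test against the loss/prior threshold.
  lam = [0 2; 1 0];                 % lam(i,j) = lambda_ij (illustrative losses)
  Pw  = [0.6 0.4];                  % priors P(w_1), P(w_2)
  px1 = 0.30;  px2 = 0.10;          % p(x|w_1), p(x|w_2) at the observed x
  thr = (lam(1,2) - lam(2,2)) / (lam(2,1) - lam(1,1)) * Pw(2) / Pw(1);
  if px1 / px2 > thr
      decision = 1;                 % decide w_1
  else
      decision = 2;                 % decide w_2
  end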
Feature Space: Geometry
When $\mathbf{x} \in \mathbb{R}^d$, we have our d-dimensional feature space.
Sometimes, this feature space is considered to be a Euclidean space; but, as we'll see, many other alternatives exist.
This allows for the study of PR problems from a geometric point of view. This is key to many algorithms.
Discriminant functions
We can construct a set of discriminant functions: $g_i(\mathbf{x})$, $i = 1, \ldots, c$.
We classify a feature vector as $w_i$ if: $g_i(\mathbf{x}) > g_j(\mathbf{x})$, $\forall j \ne i$.
The Bayes classifier is: $g_i(\mathbf{x}) = -R(\alpha_i \mid \mathbf{x})$.
If errors are to be minimized, one needs to minimize the probability of error (minimum-error-rate, i.e., zero-one loss): $\lambda(\alpha_i \mid w_j) = 0$ if $i = j$, and $1$ if $i \ne j$.
If we use Bayes & minimum-error-rate classification, we get:
$g_i(\mathbf{x}) = p(w_i \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid w_i)\,P(w_i)}{\sum_{j=1}^{c} p(\mathbf{x} \mid w_j)\,P(w_j)}$  (the denominator is constant across classes).
Sometimes we find it more convenient to write this equation as:
$g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid w_i) + \ln P(w_i)$.
Geometry (key point): the goal is to divide the feature space into c decision regions, $R_1, \ldots, R_c$.
Classification is also known as hypothesis testing.
Key: the effect of any decision rule is to divide the feature space into decision regions.
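As a sketch (not from the handout), the log form can be used directly with any class-conditional densities; here the densities are illustrative anonymous functions for a 1-D feature, and the winning $g_i$ names the decision region containing x.

  % Minimum-error-rate classification with g_i(x) = ln p(x|w_i) + ln P(w_i).
  gauss1d = @(x, mu, s) exp(-(x - mu).^2 / (2*s^2)) / (sqrt(2*pi) * s);
  pdfs = {@(x) gauss1d(x, 0, 1), @(x) gauss1d(x, 3, 2)};  % p(x|w_1), p(x|w_2) (illustrative)
  Pw   = [0.5 0.5];                                       % priors
  x = 1.2;                                                % observed feature
  g = zeros(1, numel(pdfs));
  for i = 1:numel(pdfs)
      g(i) = log(pdfs{i}(x)) + log(Pw(i));                % discriminant g_i(x)
  end
  [~, C] = max(g);                                        % class label for x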
Symbolic representation
Other criteria
In some applications the priors are not
known.
In this case, we usually attempt to minimize
the worst overall risk.
Two approaches for that are: the minimax
and the Neyman-Pearson criteria.
Normal Distributions & Bayes
So far we have used $p(\mathbf{x} \mid w_i)$ and $P(w_i)$ to specify the decision boundaries of a Bayes classifier.
The Normal distribution is the most typical PDF for $p(\mathbf{x} \mid w_i)$.
Recall the central limit theorem.
Central Limit theorem (simplified)
Assume that the random variables $X_1, \ldots, X_n$ are iid, each with finite mean $\mu$ and variance $\sigma^2$.
When $n \to \infty$, the standardized random variable $Z_n = \dfrac{\sum_{i=1}^{n} X_i - n\mu}{\sigma\sqrt{n}}$ converges (in distribution) to a normal distribution.
(see Stark & Woods pp. 225-230)
Univariate case
The Gaussian distribution is:
$p(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\dfrac{1}{2}\left(\dfrac{x-\mu}{\sigma}\right)^2\right]$, i.e., $x \sim N(\mu, \sigma^2)$,
with mean $\mu = E(x) = \int x\, p(x)\, dx$ and variance $\sigma^2 = E\big((x-\mu)^2\big) = \int (x-\mu)^2\, p(x)\, dx$.
Multivariate case (d>1)
$p(\mathbf{x}) = \dfrac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left[-\dfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right]$, i.e., $\mathbf{x} \sim N(\boldsymbol{\mu}, \Sigma)$.
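A minimal MATLAB sketch of this density, written out directly so no toolbox is assumed; the mean, covariance, and query point are illustrative.

  % Evaluate the multivariate normal density at a point x.
  mu    = [0; 0];
  Sigma = [2 0.5; 0.5 1];           % must be symmetric positive definite
  x     = [1; -0.5];
  d     = numel(mu);
  dx    = x - mu;
  px = exp(-0.5 * (dx' / Sigma) * dx) / ((2*pi)^(d/2) * sqrt(det(Sigma)));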
Distances
The general (squared) distance in a space is given by:
$d^2 = (\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})$,
where $\Sigma$ is the covariance matrix of the distribution (or data).
If $\Sigma = I$, then the above equation becomes the Euclidean (norm 2) distance.
If the distribution is Normal, this distance is called the Mahalanobis distance.
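A small MATLAB sketch of this distance (the sample data and query point are illustrative):

  % Squared Mahalanobis distance of x from the mean of some data X (rows = samples).
  X  = randn(200, 2) * [1.5 0.4; 0 0.8];   % illustrative 2-D data
  mu = mean(X, 1)';
  Sigma = cov(X);
  x  = [2; 1];
  d2 = (x - mu)' / Sigma * (x - mu);       % (x-mu)' * inv(Sigma) * (x-mu)
  d2_euclid = (x - mu)' * (x - mu);        % Sigma = I recovers the squared Euclidean distance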
Example (2D Normals)
(Figure: two 2-D Normal examples, one heteroscedastic and one homoscedastic.)
Moments of the estimates
In statistics the estimates are generally
known as the moments of the data.
The first moment is the sample mean.
The second, the sample autocorrelation
matrix:
$S = \dfrac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i \mathbf{x}_i^T$.
Central moments
The variance and the covariance matrix are
special cases, because they depend on the
mean of the data which is unknown.
Usually we solve that by using the sample
mean:
$\hat{\Sigma} = \dfrac{1}{n} \sum_{i=1}^{n} (\mathbf{x}_i - \hat{\boldsymbol{\mu}})(\mathbf{x}_i - \hat{\boldsymbol{\mu}})^T$.
This is the sample covariance matrix.
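A short MATLAB sketch computing these sample moments; X holds one sample per row and the data are illustrative.

  % Sample mean, sample autocorrelation matrix, and sample covariance matrix.
  X  = randn(500, 3);                       % illustrative data, n = 500, d = 3
  n  = size(X, 1);
  mu = mean(X, 1)';                         % sample mean (first moment)
  S  = (X' * X) / n;                        % sample autocorrelation matrix
  Xc = X - repmat(mu', n, 1);               % center the data
  Sigma_hat = (Xc' * Xc) / n;               % sample covariance (central second moment)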
Whitening Transformation
Recall, it is sometimes convenient to represent the data in a space where its sample covariance matrix equals the identity matrix, I; i.e., to find a transformation A such that $A^T \Sigma A = I$.
Linear transformations
An n-dimensional vector X can be transformed linearly to another, Y, as: $Y = A^T X$.
The mean is then: $M_Y = E(Y) = A^T M_X$.
The covariance: $\Sigma_Y = A^T \Sigma_X A$.
The order of the distances in the transformed space is identical to the one in the original space.
Orthonormal transformation
Eigenanalysis: $\Sigma_X \Phi = \Phi \Lambda$.
Eigenvectors: $\Phi = [\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_p]$.
Eigenvalues: $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$.
The transformation is then: $Y = \Phi^T X$, with $\Sigma_Y = \Phi^T \Sigma_X \Phi = \Lambda$.
(Recall, $\Phi^T \Phi = I$ and $\Phi^{-1} = \Phi^T$.)
Whitening
To obtain a covariance matrix equal to the identity matrix, we can apply the orthogonal transformation first and then normalize the result with $\Lambda^{-1/2}$:
$Y = \Lambda^{-1/2} \Phi^T X$,
$\Sigma_Y = \Lambda^{-1/2} \Phi^T \Sigma_X \Phi\, \Lambda^{-1/2} = \Lambda^{-1/2} \Lambda\, \Lambda^{-1/2} = I$.
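A minimal MATLAB sketch of this whitening step (the data are illustrative; eig performs the eigenanalysis above):

  % Whitening: Y = Lambda^(-1/2) * Phi' * X, so that cov(Y) = I.
  X  = randn(1000, 2) * [2 0.6; 0 0.5];      % illustrative data, rows are samples
  Xc = X - repmat(mean(X, 1), size(X, 1), 1);
  SigmaX = (Xc' * Xc) / size(X, 1);
  [Phi, Lambda] = eig(SigmaX);               % SigmaX * Phi = Phi * Lambda
  W = diag(1 ./ sqrt(diag(Lambda))) * Phi';  % whitening matrix Lambda^(-1/2) Phi'
  Y = (W * Xc')';                            % whitened samples (rows)
  % The sample covariance of Y (with 1/n) is approximately the identity matrix.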
Properties
Whitening transformations are not
orthogonal transformations.
Therefore, Euclidean distances are not
preserved.
After whitening, the covariance matrix is
invariant to any orthogonal transformation:
$\Psi^T I\, \Psi = \Psi^T \Psi = I$, for any orthonormal matrix $\Psi$.
Simultaneous diagonalization
It is often the case that two or more covariance matrices need to be diagonalized simultaneously.
Assume $\Sigma_1$ and $\Sigma_2$ are two covariance matrices.
Our goal is to find A such that: $A^T \Sigma_1 A = I$ and $A^T \Sigma_2 A = \Lambda_2$.
Homework: find the algorithm.
Some advantages
Algorithms usually become much simpler after diagonalization or whitening.
The general distance becomes a simple Euclidean distance.
Whitened data is invariant to other orthogonal transformations.
Some algorithms require whitening to have certain properties (we'll see this later in the course).
Discriminant Functions for
Normal PDFs
The discriminant function, $g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid w_i) + \ln P(w_i)$, for the Normal density, $N(\boldsymbol{\mu}_i, \Sigma_i)$, is:
$g_i(\mathbf{x}) = -\dfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T \Sigma_i^{-1} (\mathbf{x}-\boldsymbol{\mu}_i) - \dfrac{d}{2}\ln 2\pi - \dfrac{1}{2}\ln|\Sigma_i| + \ln P(w_i)$.
Possible scenarios (or assumptions):
Sometimes, we might be able to assume $\Sigma_i = \sigma^2 I$.
A more general case is when all covariance matrices are identical, $\Sigma_i = \Sigma$; i.e., homoscedastic.
The most complex case is when $\Sigma_i$ is arbitrary; that is, heteroscedastic.
Case $\Sigma_i = \sigma^2 I$:
$g_i(\mathbf{x}) = -\dfrac{\|\mathbf{x}-\boldsymbol{\mu}_i\|^2}{2\sigma^2} + \ln P(w_i)$.
With the same priors, the Bayes boundary is a (d-1)-dimensional hyperplane perpendicular to the line that passes through both means.
Homoscedastic case, $\Sigma_i = \Sigma$:
$g_i(\mathbf{x}) = -\dfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu}_i) + \ln P(w_i)$  (a Mahalanobis distance).
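A hedged MATLAB sketch of the general (heteroscedastic) discriminant for a set of Gaussian classes; the means, covariances, and priors below are illustrative, and the largest $g_i$ gives the class label.

  % Quadratic (Gaussian) discriminant g_i(x) for each class; pick the maximum.
  mus    = {[0; 0], [3; 1]};                 % class means (illustrative)
  Sigmas = {[1 0; 0 1], [2 0.5; 0.5 1]};     % class covariances (illustrative)
  Pw     = [0.5 0.5];                        % priors
  x = [1.5; 0.2];
  d = numel(x);
  g = zeros(1, numel(mus));
  for i = 1:numel(mus)
      df = x - mus{i};
      g(i) = -0.5 * (df' / Sigmas{i}) * df - (d/2)*log(2*pi) ...
             - 0.5 * log(det(Sigmas{i})) + log(Pw(i));
  end
  [~, C] = max(g);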
Heteroscedastic case, $\Sigma_i$ arbitrary:
In the 2-class case, the decision surface is a hyperquadric (e.g., hyperplanes, hyperspheres, hyperhyperboloids, etc.).
These decision boundaries may not be connected.
Any hyperquadric can be given (represented) by two Gaussian distributions.
Project #1
1. Implement these three cases using Matlab
(see pp. 36-41 for details). 2D and/or 3D
plots.
2. Generalize the algorithm to more than two classes (Gaussians).
3. Simulate different Gaussians and distinct
priors.
Bayes Is Optimal
If our goal is to minimize the classification
error, then Bayes is optimal (you cannot do
better than Bayes ever).
In general, if $p(\mathbf{x} \mid w_1)P(w_1) > p(\mathbf{x} \mid w_2)P(w_2)$, it is preferable to classify $\mathbf{x}$ in $w_1$ so that the smaller integral contributes to the error (see next slide) => This is what Bayes does. There is no possible smaller error.
$P(error) = P(\mathbf{x} \in R_2, w_1) + P(\mathbf{x} \in R_1, w_2) = \int_{R_2} p(\mathbf{x} \mid w_1)P(w_1)\, d\mathbf{x} + \int_{R_1} p(\mathbf{x} \mid w_2)P(w_2)\, d\mathbf{x}$.
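As an aside (not from the handout), for 1-D Gaussian class conditionals this error integral can be approximated numerically; everything below is an illustrative sketch using a simple grid integration.

  % Numerically approximate the two-class Bayes error for 1-D Gaussians.
  gauss1d = @(x, mu, s) exp(-(x - mu).^2 / (2*s^2)) / (sqrt(2*pi) * s);
  Pw = [0.5 0.5];                                % priors (illustrative)
  x  = linspace(-10, 10, 2001);                  % integration grid
  j1 = gauss1d(x, 0, 1) * Pw(1);                 % p(x|w_1) P(w_1)
  j2 = gauss1d(x, 2, 1) * Pw(2);                 % p(x|w_2) P(w_2)
  P_error = trapz(x, min(j1, j2));               % integral of the pointwise minimum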
The multiclass case:
$P(correct) = \sum_{i=1}^{C} P(\mathbf{x} \in R_i, w_i) = \sum_{i=1}^{C} \int_{R_i} p(\mathbf{x} \mid w_i)\,P(w_i)\, d\mathbf{x}$.
Bayes yields the smallest error. But what is the actual error? The above equation cannot be readily computed, because the regions $R_i$ may be very complex.
Error Bounds: How to calculate
the error?
Several approximations are easier to
compute (usually upper bounds):
Chernoff bound.
Bhattacharyya bound (assumes the pdfs are homoscedastic).
These bounds can only be applied to the 2-class case.
Chernoff Bound
For this, we need an integral equation that we can solve. For example,
$P(error) = \int P(error \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}$, where
$P(error \mid \mathbf{x}) = \begin{cases} P(w_1 \mid \mathbf{x}) & \text{if we decide } w_2, \\ P(w_2 \mid \mathbf{x}) & \text{if we decide } w_1. \end{cases}$
Or, we can also write:
$P(error) = \int \min\big(p(\mathbf{x} \mid w_1)P(w_1),\; p(\mathbf{x} \mid w_2)P(w_2)\big)\, d\mathbf{x}$.
Since it is known that $\min(a,b) \le a^s\, b^{1-s}$ for $a, b \ge 0$ and $0 \le s \le 1$, we can now write:
$P(error) \le P(w_1)^s\, P(w_2)^{1-s} \int p(\mathbf{x} \mid w_1)^s\, p(\mathbf{x} \mid w_2)^{1-s}\, d\mathbf{x}$.
If the conditional probabilities are normal, we can solve this analytically:
$\int p(\mathbf{x} \mid w_1)^s\, p(\mathbf{x} \mid w_2)^{1-s}\, d\mathbf{x} = e^{-k(s)}$, where
$k(s) = \dfrac{s(1-s)}{2}\,(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^T \big[s\Sigma_1 + (1-s)\Sigma_2\big]^{-1} (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1) + \dfrac{1}{2}\ln\dfrac{|s\Sigma_1 + (1-s)\Sigma_2|}{|\Sigma_1|^s\,|\Sigma_2|^{1-s}}$.
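A minimal MATLAB sketch that evaluates k(s) and the resulting Chernoff bound over a grid of s values; the Gaussian parameters are illustrative, and s = 1/2 gives the Bhattacharyya bound discussed next.

  % Chernoff bound on the two-class Bayes error for Gaussian class conditionals.
  mu1 = [0; 0];  Sigma1 = [1 0; 0 1];
  mu2 = [2; 1];  Sigma2 = [2 0.5; 0.5 1];
  Pw  = [0.5 0.5];
  s   = linspace(0.01, 0.99, 99);
  k   = zeros(size(s));
  for n = 1:numel(s)
      Ss   = s(n)*Sigma1 + (1 - s(n))*Sigma2;
      dm   = mu2 - mu1;
      k(n) = s(n)*(1 - s(n))/2 * (dm' / Ss) * dm ...
             + 0.5 * log(det(Ss) / (det(Sigma1)^s(n) * det(Sigma2)^(1 - s(n))));
  end
  bound = Pw(1).^s .* Pw(2).^(1 - s) .* exp(-k);
  chernoff = min(bound);              % tightest bound over s; s = 1/2 is the Bhattacharyya bound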
Bhattacharyya Bound
When the data is homoscedastic, $\Sigma_1 = \Sigma_2$, the optimal solution is s = 1/2.
This is the Bhattacharyya bound.
A tighter bound is the asymptotic nearest neighbor error, which is derived from:
$p(error) \le \displaystyle\int \dfrac{2\, p(\mathbf{x} \mid w_1)P(w_1)\, p(\mathbf{x} \mid w_2)P(w_2)}{p(\mathbf{x})}\, d\mathbf{x} \le \int \sqrt{p(\mathbf{x} \mid w_1)P(w_1)\, p(\mathbf{x} \mid w_2)P(w_2)}\, d\mathbf{x}$.
Closing Notes
Bayes is important because it minimizes the probability of error. In that sense we say it's optimal.
Unfortunately, Bayes assumes that the conditional densities and priors are known (or can be estimated), which is not necessarily true.
In general, not even the form of these probabilities
is known.
Most PR approaches attempt to solve these
shortcomings. This is, in fact, what most of PR is
all about.
On the + side: a simple example
We want to predict whether a student will pass a test or not.
Y = 1 denotes pass; Y = 0, failure.
The observation is a single random variable X which specifies the hours of study.
Let $P(Y=1 \mid X=x) = \dfrac{x}{x+c}$. Then:
$g(x) = \begin{cases} 1 & \text{if } P(Y=1 \mid X=x) \ge 1/2, \text{ i.e., } x \ge c, \\ 0 & \text{otherwise.} \end{cases}$
Optional homework
Using Matlab generate n observations of
P(Y=1|X=x) and P(Y=0|X=x).
Approximate each using a Gaussian
distribution.
Calculate the Bayes decision boundary and
classification error.
Select several arbitrary values for c and see
how well you can approximate them.
Hints
Error = min [P(Y=1|X=x), P(Y=0|X=x)].
Plot the original distribution to help you.