Lecture 9: Bayesian Learning
basic assumptions:
quantities of interest are governed by probability distributions
optimal decisions can be made by reasoning about these probabilities together with observed training data
practical difficulties:
initial knowledge of many probabilities is required
significant computational cost may be required
Bayes Theorem:
P(h|D) = P(D|h)P(h) / P(D)
maximum a posteriori (MAP) hypothesis:
h_MAP = argmax_{h∈H} P(h|D)
      = argmax_{h∈H} P(D|h)P(h) / P(D)
      = argmax_{h∈H} P(D|h)P(h)
maximum likelihood (ML) hypothesis:
h_ML = argmax_{h∈H} P(D|h)
note that in this case P(h) can be dropped, because it is equal for each h ∈ H
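As a quick illustration, a minimal Python sketch (with made-up priors and likelihoods for three hypotheses) of how h_MAP and h_ML are picked from a finite hypothesis space:

```python
# Hypothetical numbers: three hypotheses with priors P(h) and likelihoods P(D|h).
priors = {"h1": 0.6, "h2": 0.2, "h3": 0.2}        # P(h), assumed
likelihoods = {"h1": 0.3, "h2": 0.7, "h3": 0.5}   # P(D|h), assumed

# h_MAP = argmax_h P(D|h) P(h); P(D) is the same for every h and is dropped
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])

# h_ML = argmax_h P(D|h); coincides with h_MAP when the prior is uniform
h_ml = max(likelihoods, key=likelihoods.get)

print(h_map, h_ml)   # h1 h2: a non-uniform prior can overrule the likelihood
```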
the available data is from a particular laboratory test with two possible outcomes:
⊕ (positive) and ⊖ (negative)
suppose a new patient is observed for whom the lab test returns a positive (⊕) result
P(cancer|⊕) = .0078 / (.0078 + .0298) = .21
P(¬cancer|⊕) = .0298 / (.0078 + .0298) = .79
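The two joint terms follow from Bayes' theorem; a short sketch reproducing them (the underlying rates P(cancer) = .008, P(⊕|cancer) = .98, P(⊕|¬cancer) = .03 are the ones that yield the .0078 and .0298 above):

```python
p_cancer = 0.008             # prior P(cancer)
p_pos_given_cancer = 0.98    # test sensitivity P(⊕|cancer)
p_pos_given_healthy = 0.03   # false-positive rate P(⊕|¬cancer)

joint_cancer = p_pos_given_cancer * p_cancer            # ≈ .0078
joint_healthy = p_pos_given_healthy * (1 - p_cancer)    # ≈ .0298

p_cancer_given_pos = joint_cancer / (joint_cancer + joint_healthy)
print(round(p_cancer_given_pos, 2))   # 0.21, so h_MAP = ¬cancer despite the ⊕ result
```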
P(h|D) = P(D|h)P(h) / P(D)
h_MAP = argmax_{h∈H} P(h|D)
assumptions
1. training data D is noise free (i.e., d_i = c(x_i))
2. target concept c is contained in H (i.e., (∃h ∈ H)[(∀x ∈ X)[h(x) = c(x)]])
3. no reason to believe that any hypothesis is more probable than any other
⇒ P(h) = 1/|H| for all h ∈ H
⇒ P(D|h) = 1 if d_i = h(x_i) for all d_i ∈ D, and 0 otherwise
if h is inconsistent with training data D:
P(h|D) = (0 · P(h)) / P(D) = 0
if h is consistent with training data D:
P(h|D) = (1 · 1/|H|) / P(D) = (1 · 1/|H|) / (|VS_{H,D}| / |H|) = 1 / |VS_{H,D}|
⇒ this analysis implies that, under these assumptions, each consistent hypothesis is a MAP hypothesis, because for each consistent hypothesis P(h|D) = 1/|VS_{H,D}|
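A minimal sketch of this brute-force MAP learner: enumerate H, keep the version space, and share the posterior equally among its members (the toy hypothesis space below is hypothetical):

```python
def posteriors(H, D):
    # version space: hypotheses consistent with every training example
    version_space = [h for h in H if all(h(x) == d for x, d in D)]
    # assumption 2 (c ∈ H) guarantees the version space is non-empty
    return [1 / len(version_space) if h in version_space else 0.0 for h in H]

H = [lambda x: True,          # always positive
     lambda x: False,         # always negative
     lambda x: x > 0]         # positive iff x > 0
D = [(1, True), (2, True)]    # noise-free labels (assumption 1)

print(posteriors(H, D))       # [0.5, 0.0, 0.5]: |VS_{H,D}| = 2
```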
evolution of probabilities
(a) all hypotheses start with the same prior probability
(b) + (c) as training data accumulates, the posterior probability of inconsistent hypotheses becomes zero, while the total probability, which sums to 1, is shared equally among the remaining consistent hypotheses
FIND-S
outputs a consistent hypothesis and therefore a MAP hypothesis under the probability distributions P(h) and P(D|h) defined above
i.e., for every prior P(h) that favors more specific hypotheses, FIND-S outputs a MAP hypothesis
under certain assumptions, any learning algorithm that minimizes the squared error between the output hypothesis and the training data will output an ML hypothesis
problem setting:
(∀h ∈ H)[h : X → ℝ] and training examples of the form ⟨x_i, d_i⟩
unknown target function f : X → ℝ
m training examples, where the target value of each example is corrupted by random noise drawn according to a Normal probability distribution with zero mean (d_i = f(x_i) + e_i)
h_ML
h_ML = argmax_{h∈H} p(D|h)
     = argmax_{h∈H} ∏_{i=1}^m p(d_i|h)   (assuming the training examples are mutually independent given h)
Given that the noise e_i obeys a Normal distribution with zero mean and unknown variance σ², each d_i must also obey a Normal distribution centered around the true target value f(x_i). Because we are writing the expression for p(D|h), we assume h is the correct description of f. Hence, µ = f(x_i) = h(x_i).
h_ML = argmax_{h∈H} ∏_{i=1}^m (1/√(2πσ²)) · e^(−(1/(2σ²))(d_i − h(x_i))²)
It is common to maximize the less complicated logarithm instead, which is justified by the monotonicity of the logarithm.
h_ML = argmax_{h∈H} ∑_{i=1}^m [ ln(1/√(2πσ²)) − (1/(2σ²))(d_i − h(x_i))² ]
The first term in this expression is a constant independent of h and can therefore be
discarded.
h_ML = argmax_{h∈H} ∑_{i=1}^m −(1/(2σ²))(d_i − h(x_i))²
Maximizing this negative term is equivalent to minimizing the corresponding positive term.
h_ML = argmin_{h∈H} ∑_{i=1}^m (1/(2σ²))(d_i − h(x_i))²
Finally, the constant factor 1/(2σ²) is independent of h and can also be discarded:
h_ML = argmin_{h∈H} ∑_{i=1}^m (d_i − h(x_i))²
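A sketch under these assumptions: data generated from a hypothetical target f(x) = 2x + 1 with Gaussian noise, and the ML hypothesis recovered as the least-squares line:

```python
import random

random.seed(0)
# d_i = f(x_i) + e_i with e_i ~ Normal(0, 0.5); target f is made up
data = [(x, 2 * x + 1 + random.gauss(0, 0.5)) for x in range(20)]

n = len(data)
sx = sum(x for x, _ in data)
sd = sum(d for _, d in data)
sxx = sum(x * x for x, _ in data)
sxd = sum(x * d for x, d in data)

# closed-form minimizer of sum (d_i - (w*x_i + b))^2 over lines h(x) = w*x + b
w = (n * sxd - sx * sd) / (n * sxx - sx * sx)
b = (sd - w * sx) / n
print(w, b)   # close to the true 2 and 1
```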
L_{C_H}(h) = −log₂ P(h), where C_H is the optimal code for hypothesis space H
L_{C_{D|h}}(D|h) = −log₂ P(D|h), where C_{D|h} is the optimal code for describing data D, assuming that both the sender and receiver know hypothesis h
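For intuition, a tiny sketch of these code lengths in bits, with made-up probabilities:

```python
from math import log2

p_h = 1 / 8          # hypothetical P(h)
p_d_given_h = 1 / 4  # hypothetical P(D|h)

len_h = -log2(p_h)             # L_{C_H}(h) = 3 bits
len_d = -log2(p_d_given_h)     # L_{C_{D|h}}(D|h) = 2 bits
# minimizing len_h + len_d is the same as maximizing P(h)P(D|h), i.e. h_MAP
print(len_h + len_d)           # 5.0
```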
simply applying h_MAP is not the best solution (as one might wrongly assume)
example:
H = {h1, h2, h3} with posteriors P(h1|D) = .4, P(h2|D) = P(h3|D) = .3
⇒ h_MAP = h1
consider a new instance x, which is classified positive by h1 but negative by h2 and h3
taking all hypotheses into account,
the probability that x is positive is .4 and
the probability that x is negative is .6
⇒ most probable classification ≠ classification of h_MAP
Bayes Optimal Classifier
the most probable classification is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities
P(v_j|D) = ∑_{h_i∈H} P(v_j|h_i) P(h_i|D)
Bayes optimal classification:
argmax_{v_j∈V} ∑_{h_i∈H} P(v_j|h_i) P(h_i|D)
therefore
∑_{h_i∈H} P(⊕|h_i) P(h_i|D) = .4
∑_{h_i∈H} P(⊖|h_i) P(h_i|D) = .6
and
argmax_{v_j∈{⊕,⊖}} ∑_{h_i∈H} P(v_j|h_i) P(h_i|D) = ⊖
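The same computation as a sketch, where P(v|h_i) is 1 for the value h_i predicts and 0 otherwise, as in this example:

```python
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}         # P(h_i|D) from the example
predictions = {"h1": "pos", "h2": "neg", "h3": "neg"}  # pos = ⊕, neg = ⊖

# weighted vote: sum posterior mass over hypotheses predicting each value
scores = {v: sum(p for h, p in posteriors.items() if predictions[h] == v)
          for v in ("pos", "neg")}
print(max(scores, key=scores.get), scores)   # 'neg' with {'pos': 0.4, 'neg': 0.6}
```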
Naive Bayes Classifier
applies to learning tasks where each instance x is described by a conjunction of attribute values and where the target function f(x) can take on any value from some finite set V
training examples are described by ⟨a_1, a_2, ..., a_n⟩
Bayesian approach:
v_NB = argmax_{v_j∈V} P(v_j) ∏_i P(a_i|v_j)
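A minimal Naive Bayes sketch of this rule, estimating P(v_j) and P(a_i|v_j) by relative frequencies over the training set (function and variable names are mine):

```python
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (attribute_tuple, label) pairs."""
    label_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)          # (i, label) -> Counter over values
    for attrs, label in examples:
        for i, a in enumerate(attrs):
            value_counts[(i, label)][a] += 1
    priors = {v: c / len(examples) for v, c in label_counts.items()}  # P(v_j)
    return priors, value_counts, label_counts

def classify(attrs, priors, value_counts, label_counts):
    def score(v):
        s = priors[v]
        for i, a in enumerate(attrs):
            s *= value_counts[(i, v)][a] / label_counts[v]   # estimate of P(a_i|v)
        return s
    return max(priors, key=score)   # v_NB
```

Note that an attribute value never seen with some label gets probability 0 here, zeroing the whole product; the m-estimate discussed below addresses exactly this.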
Illustrative Example
example day:
⟨Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong⟩
where
∏_i P(a_i|v_j) = P(Outlook = sunny|v_j) · P(Temperature = cool|v_j) · P(Humidity = high|v_j) · P(Wind = strong|v_j)
estimation of probabilities:
P(PlayTennis = yes) = 9/14 = .64
P(PlayTennis = no) = 5/14 = .36
normalization:
.0206 / (.0206 + .0053) = .795
m-estimate of probability:
(n_c + m·p) / (n + m)
where p is a prior estimate of the probability we wish to determine, n_c out of n observed examples support it, and m is a constant called the equivalent sample size, which determines how heavily to weight p relative to the observed data
if m = 0, the m-estimate is equivalent to the simple fraction n_c/n
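A one-line sketch of the m-estimate, with hypothetical counts:

```python
def m_estimate(n_c, n, p, m):
    # blend the observed fraction n_c/n with prior p, weighted by m
    return (n_c + m * p) / (n + m)

# e.g. 3 of 9 observed, uniform prior p = 1/2, equivalent sample size m = 4:
print(m_estimate(3, 9, 0.5, 4))   # 5/13 ≈ 0.385; with m = 0 it is just 3/9
```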
[Figure: Bayesian network over the variables Storm, BusTourGroup, Campfire, Thunder, and ForestFire]
P(y_1, y_2, ..., y_n) = ∏_{i=1}^n P(y_i | Parents(Y_i))
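A sketch of this factored joint for a simplified two-node fragment (Storm → Campfire; the structure and CPT numbers are hypothetical, with names borrowed from the figure):

```python
def joint_probability(assignment, parents, cpt):
    """assignment: {var: bool}; parents: {var: tuple of parent names};
    cpt: {var: {parent_values_tuple: P(var=True | parents)}}."""
    p = 1.0
    for var, value in assignment.items():
        key = tuple(assignment[q] for q in parents[var])
        p_true = cpt[var][key]
        p *= p_true if value else 1.0 - p_true   # P(y_i | Parents(Y_i))
    return p

parents = {"Storm": (), "Campfire": ("Storm",)}
cpt = {"Storm": {(): 0.2},
       "Campfire": {(True,): 0.1, (False,): 0.4}}
print(joint_probability({"Storm": True, "Campfire": False}, parents, cpt))
# = P(Storm) * (1 - P(Campfire|Storm)) = 0.2 * 0.9 = 0.18
```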
if values are known for all other variables in the network, the
inference is straightforward
in the more general case, values are only known for a subset of the
network variables
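For the general case, one exact (but exponential-time) approach is inference by enumeration: sum the factored joint over all assignments of the unobserved variables. A sketch for a hypothetical boolean chain A → B → C:

```python
# Hypothetical chain network A -> B -> C with made-up conditional probabilities
p_a = lambda a: 0.3 if a else 0.7                                     # P(A=a)
p_b = lambda b, a: (0.9 if b else 0.1) if a else (0.2 if b else 0.8)  # P(B=b|A=a)
p_c = lambda c, b: (0.8 if c else 0.2) if b else (0.1 if c else 0.9)  # P(C=c|B=b)

def joint(a, b, c):
    return p_a(a) * p_b(b, a) * p_c(c, b)

# Query P(A | C = True): B is unobserved, so it is summed out
scores = {a: sum(joint(a, b, True) for b in (True, False)) for a in (True, False)}
z = sum(scores.values())
print({a: s / z for a, s in scores.items()})   # P(A=True | C=True) ≈ .57
```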