
Hidden Markov Models Fundamentals

Daniel Ramage

CS229 Section Notes

December 1, 2007

Abstract
How can we apply machine learning to data that is represented as a
sequence of observations over time? For instance, we might be interested
in discovering the sequence of words that someone spoke based on an
audio recording of their speech. Or we might be interested in annotating
a sequence of words with their part-of-speech tags. These notes provide a
thorough mathematical introduction to the concept of Markov Models,
a formalism for reasoning about states over time, and Hidden Markov
Models, where we wish to recover a series of states from a series of
observations. The final section includes some pointers to resources that
present this material from other perspectives.
1 Markov Models
Given a set of states S = {s_1, s_2, ..., s_|S|} we can observe a series over time
~z ∈ S^T. For example, we might have the states from a weather system S =
{s_sun, s_cloud, s_rain} with |S| = 3 and observe the weather over a few days
{z_1 = s_sun, z_2 = s_cloud, z_3 = s_cloud, z_4 = s_rain, z_5 = s_cloud} with T = 5.
The observed states of our weather example represent the output of a random
process over time. Without some further assumptions, state s_j at time t could
be a function of any number of variables, including all the states from times 1
to t-1 and possibly many others that we don't even model. However, we will
make two Markov assumptions that will allow us to tractably reason about
time series.
The limited horizon assumption is that the probability of being in a
state at time t depends only on the state at time t-1. The intuition underlying
this assumption is that the state at time t represents enough summary of the
past to reasonably predict the future. Formally:

P(z_t | z_{t-1}, z_{t-2}, ..., z_1) = P(z_t | z_{t-1})

The stationary process assumption is that the conditional distribution
over next state given current state does not change over time. Formally:

P(z_t | z_{t-1}) = P(z_2 | z_1),  t ∈ 2..T

As a convention, we will also assume that there is an initial state and initial
observation z_0 ≡ s_0, where s_0 represents the initial probability distribution over
states at time 0. This notational convenience allows us to encode our belief
about the prior probability of seeing the first real state z_1 as P(z_1 | z_0). Note
that P(z_t | z_{t-1}, ..., z_1) = P(z_t | z_{t-1}, ..., z_1, z_0) because we've defined z_0 = s_0 for
any state sequence. (Other presentations of HMMs sometimes represent these
prior beliefs with a vector π ∈ R^|S|.)
We parametrize these transitions by defining a state transition matrix A ∈
R^((|S|+1)×(|S|+1)). The value A_ij is the probability of transitioning from state i
to state j at any time t. For our sun and rain example, we might have the
following transition matrix:

s0 ssun scloud srain


s0 0 .33 .33 .33
A= ssun 0 .8 .1 .1
scloud 0 .2 .6 .2
srain 0 .1 .2 .7

Note that these numbers (which I made up) represent the intuition that the
weather is self-correlated: if it's sunny it will tend to stay sunny, cloudy will
stay cloudy, etc. This pattern is common in many Markov models and can
be observed as a strong diagonal in the transition matrix. Note that in this
example, our initial state s_0 shows uniform probability of transitioning to each
of the three states in our weather system.

1.1 Two questions of a Markov Model

Combining the Markov assumptions with our state transition parametrization
A, we can answer two basic questions about a sequence of states in a Markov
chain. What is the probability of a particular sequence of states ~z? And how
do we estimate the parameters of our model A so as to maximize the likelihood
of an observed sequence ~z?

1.1.1 Probability of a state sequence

We can compute the probability of a particular series of states ~z by use of the
chain rule of probability:

P(~z) = P(z_T, z_{T-1}, ..., z_1; A)
      = P(z_T, z_{T-1}, ..., z_1, z_0; A)
      = P(z_T | z_{T-1}, z_{T-2}, ..., z_1; A) P(z_{T-1} | z_{T-2}, ..., z_1; A) ... P(z_1 | z_0; A)
      = P(z_T | z_{T-1}; A) P(z_{T-1} | z_{T-2}; A) ... P(z_2 | z_1; A) P(z_1 | z_0; A)
      = \prod_{t=1}^{T} P(z_t | z_{t-1}; A)
      = \prod_{t=1}^{T} A_{z_{t-1} z_t}

In the second line we introduce z_0 into our joint probability, which is allowed
by the definition of z_0 above. The third line is true of any joint distribution
by the chain rule of probabilities or repeated application of Bayes rule. The
fourth line follows from the Markov assumptions and the last line represents
these terms as their elements in our transition matrix A.
Let's compute the probability of our example time sequence from earlier. We
want P(z_1 = s_sun, z_2 = s_cloud, z_3 = s_cloud, z_4 = s_rain, z_5 = s_cloud), which can be
factored as P(s_sun | s_0) P(s_cloud | s_sun) P(s_cloud | s_cloud) P(s_rain | s_cloud) P(s_cloud | s_rain) =
.33 × .1 × .6 × .2 × .2.
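As a sanity check, this computation is easy to reproduce in code. Below is a
minimal sketch in Python; the matrix values come from the example transition
matrix above, while the encoding of states as integer indices (with 0 reserved
for s_0) is our own convention, not something from the notes.

import numpy as np

# Transition matrix A from the weather example; row/column 0 is the
# initial state s_0, followed by sun, cloud, rain.
A = np.array([
    [0.0, 0.33, 0.33, 0.33],   # from s_0
    [0.0, 0.80, 0.10, 0.10],   # from s_sun
    [0.0, 0.20, 0.60, 0.20],   # from s_cloud
    [0.0, 0.10, 0.20, 0.70],   # from s_rain
])

def sequence_probability(z, A):
    """P(z) = prod_{t=1}^{T} A[z_{t-1}, z_t], with z_0 = s_0 prepended."""
    z = [0] + list(z)
    return np.prod([A[z[t - 1], z[t]] for t in range(1, len(z))])

SUN, CLOUD, RAIN = 1, 2, 3
print(sequence_probability([SUN, CLOUD, CLOUD, RAIN, CLOUD], A))
# .33 * .1 * .6 * .2 * .2 = 0.000792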

1.1.2 Maximum likelihood parameter assignment

From a learning perspective, we could seek to find the parameters A that
maximize the log-likelihood of a sequence of observations ~z. This corresponds
to finding the likelihoods of transitioning from sunny to cloudy versus sunny to
sunny, etc., that make a set of observations most likely. Let's define the
log-likelihood of a Markov model.

l(A) = log P(~z; A)
     = log \prod_{t=1}^{T} A_{z_{t-1} z_t}
     = \sum_{t=1}^{T} log A_{z_{t-1} z_t}
     = \sum_{i=1}^{|S|} \sum_{j=1}^{|S|} \sum_{t=1}^{T} 1{z_{t-1} = s_i ∧ z_t = s_j} log A_ij

In the last line, we use an indicator function whose value is one when the
condition holds and zero otherwise to select the observed transition at each
time step. When solving this optimization problem, it's important to ensure
that the solved parameters A still make a valid transition matrix. In particular, we
need to enforce that the outgoing probability distribution from state i always
sums to 1 and all elements of A are non-negative. We can solve this optimization
problem using the method of Lagrange multipliers.

max_A  l(A)
s.t.   \sum_{j=1}^{|S|} A_ij = 1,  i = 1..|S|
       A_ij ≥ 0,  i, j = 1..|S|

This constrained optimization problem can be solved in closed form using the
method of Lagrange multipliers. We'll introduce the equality constraint into the
Lagrangian, but the inequality constraint can safely be ignored: the optimal
solution will produce positive values for A_ij anyway. Therefore we construct
the Lagrangian as:

L(A, α) = \sum_{i=1}^{|S|} \sum_{j=1}^{|S|} \sum_{t=1}^{T} 1{z_{t-1} = s_i ∧ z_t = s_j} log A_ij + \sum_{i=1}^{|S|} α_i (1 - \sum_{j=1}^{|S|} A_ij)

Taking partial derivatives and setting them equal to zero we get:

∂L(A, α)/∂A_ij = ∂/∂A_ij (\sum_{t=1}^{T} 1{z_{t-1} = s_i ∧ z_t = s_j} log A_ij) + ∂/∂A_ij α_i (1 - \sum_{j=1}^{|S|} A_ij)
               = (1/A_ij) \sum_{t=1}^{T} 1{z_{t-1} = s_i ∧ z_t = s_j} - α_i ≡ 0

A_ij = (1/α_i) \sum_{t=1}^{T} 1{z_{t-1} = s_i ∧ z_t = s_j}

Substituting back in and setting the partial with respect to α equal to zero:

∂L(A, α)/∂α_i = 1 - \sum_{j=1}^{|S|} A_ij
              = 1 - \sum_{j=1}^{|S|} (1/α_i) \sum_{t=1}^{T} 1{z_{t-1} = s_i ∧ z_t = s_j} ≡ 0

α_i = \sum_{j=1}^{|S|} \sum_{t=1}^{T} 1{z_{t-1} = s_i ∧ z_t = s_j}
    = \sum_{t=1}^{T} 1{z_{t-1} = s_i}

Substituting in this value for α_i into the expression we derived for A_ij, we
obtain our final maximum likelihood parameter value Â_ij:

Â_ij = (\sum_{t=1}^{T} 1{z_{t-1} = s_i ∧ z_t = s_j}) / (\sum_{t=1}^{T} 1{z_{t-1} = s_i})

This formula encodes a simple intuition: the maximum likelihood probability
of transitioning from state i to state j is just the number of times we transition
from i to j divided by the total number of times we are in i. In other words, the
maximum likelihood parameter corresponds to the fraction of the time when we
were in state i that we transitioned to j.
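In code, the estimator reduces to normalized transition counts. Here is a
minimal sketch in Python; the integer state encoding (with 0 for s_0) follows
the earlier snippet, and the guard against never-visited states is our own
addition, not part of the derivation.

import numpy as np

def mle_transitions(z, num_states):
    """Estimate A_ij as count(i -> j) / count(in state i), with z_0 = s_0 = 0
    prepended. States are integers 0..num_states-1, where 0 is s_0."""
    counts = np.zeros((num_states, num_states))
    z = [0] + list(z)
    for t in range(1, len(z)):
        counts[z[t - 1], z[t]] += 1.0
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows for states we never visited are left as all zeros.
    return counts / np.where(row_sums == 0, 1.0, row_sums)

# The weather sequence from earlier: sun, cloud, cloud, rain, cloud.
A_hat = mle_transitions([1, 2, 2, 3, 2], num_states=4)
print(A_hat[2])  # MLE outgoing distribution from "cloud": [0, 0, .5, .5]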

2 Hidden Markov Models


Markov Models are a powerful abstraction for time series data, but fail to
capture a very common scenario. How can we reason about a series of states
if we cannot observe the states themselves, but rather only some probabilistic
function of those states? This is the scenario for part-of-speech tagging where the
words are observed but the parts-of-speech tags aren't, and for speech recognition
where the sound sequence is observed but not the words that generated it.
For a simple example, let's borrow the setup proposed by Jason Eisner in 2002
[1], Ice Cream Climatology.

The situation: You are a climatologist in the year 2799, studying
the history of global warming. You can't find any records of Baltimore
weather, but you do find my (Jason Eisner's) diary, in which I
assiduously recorded how much ice cream I ate each day. What can
you figure out from this about the weather that summer?
A Hidden Markov Model (HMM) can be used to explore this scenario. We
don't get to observe the actual sequence of states (the weather on each day).
Rather, we can only observe some outcome generated by each state (how many
ice creams were eaten that day).
Formally, an HMM is a Markov model for which we have a series of observed
outputs ~x = {x_1, x_2, ..., x_T} drawn from an output alphabet V = {v_1, v_2, ..., v_|V|},
i.e. x_t ∈ V, t = 1..T. As in the previous section, we also posit the existence of a
series of states ~z = {z_1, z_2, ..., z_T} drawn from a state alphabet S = {s_1, s_2, ..., s_|S|},
z_t ∈ S, t = 1..T, but in this scenario the values of the states are unobserved. The
transition between states i and j will again be represented by the corresponding
value in our state transition matrix A_ij.
We also model the probability of generating an output observation as a
function of our hidden state. To do so, we make the output independence
assumption and define P(x_t = v_k | z_t = s_j) = P(x_t = v_k | x_1, ..., x_{t-1}, x_{t+1}, ..., x_T, z_1, ..., z_T) = B_jk.
The matrix B encodes the probability of our hidden state generating
output v_k given that the state at the corresponding time was s_j.
Returning to the weather example, imagine that you have logs of ice cream
consumption over a four day period: ~x = {x_1 = v_3, x_2 = v_2, x_3 = v_1, x_4 = v_2},
where our alphabet just encodes the number of ice creams consumed, i.e. V =
{v_1 = 1 ice cream, v_2 = 2 ice creams, v_3 = 3 ice creams}. What questions can
an HMM let us answer?

2.1 Three questions of a Hidden Markov Model

There are three fundamental questions we might ask of an HMM. What is the
probability of an observed sequence (how likely were we to see 3, 2, 1, 2 ice creams
consumed)? What is the most likely series of states to generate the observations
(what was the weather for those four days)? And how can we learn values for
the HMM's parameters A and B given some data?

2.2 Probability of an observed sequence: Forward procedure

In an HMM, we assume that our data was generated by the following process:
posit the existence of a series of states ~z over the length of our time series.
This state sequence is generated by a Markov model parametrized by a state
transition matrix A. At each time step t, we select an output x_t as a function of
the state z_t. Therefore, to get the probability of a sequence of observations, we
need to add up the likelihood of the data ~x given every possible series of states.

P(~x; A, B) = \sum_{~z} P(~x, ~z; A, B)
            = \sum_{~z} P(~x | ~z; A, B) P(~z; A, B)

The formulas above are true for any probability distribution. However, the
HMM assumptions allow us to simplify the expression further:

P(~x; A, B) = \sum_{~z} P(~x | ~z; A, B) P(~z; A, B)
            = \sum_{~z} (\prod_{t=1}^{T} P(x_t | z_t; B)) (\prod_{t=1}^{T} P(z_t | z_{t-1}; A))
            = \sum_{~z} (\prod_{t=1}^{T} B_{z_t x_t}) (\prod_{t=1}^{T} A_{z_{t-1} z_t})

The good news is that this is a simple expression in terms of our parameters.
The derivation follows the HMM assumptions: the output independence
assumption, Markov assumption, and stationary process assumption are all used
to derive the second line. The bad news is that the sum is over every possible
assignment to ~z. Because z_t can take one of |S| possible values at each time
step, evaluating this sum directly will require O(|S|^T) operations.

Algorithm 1 Forward Procedure for computing α_i(t)

1. Base case: α_i(1) = A_{0i} B_{i x_1}, i = 1..|S|
2. Recursion: α_j(t) = \sum_{i=1}^{|S|} α_i(t-1) A_ij B_{j x_t}, j = 1..|S|, t = 2..T

Fortunately, a faster means of computing P(~x; A, B) is possible via a
dynamic programming algorithm called the Forward Procedure. First, let's
define a quantity α_i(t) = P(x_1, x_2, ..., x_t, z_t = s_i; A, B). α_i(t) represents the
total probability of all the observations up through time t (by any state
assignment) and that we are in state s_i at time t. If we had such a quantity, the
probability of our full set of observations P(~x) could be represented as:

P(~x; A, B) = P(x_1, x_2, ..., x_T; A, B)
            = \sum_{i=1}^{|S|} P(x_1, x_2, ..., x_T, z_T = s_i; A, B)
            = \sum_{i=1}^{|S|} α_i(T)

Algorithm 1 presents an efficient way to compute α_i(t). At each time step
we must do only O(|S|) operations for each of the |S| states, resulting in a final
algorithm complexity of O(|S|^2 · T) to compute the total probability of an
observed sequence P(~x; A, B).
A similar algorithm known as the Backward Procedure can be used to
compute an analogous probability β_i(t) = P(x_{t+1}, x_{t+2}, ..., x_T | z_t = s_i; A, B).
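Both procedures are a few lines of Python. The sketch below follows the
conventions of the earlier snippets: observations are 0-based integer indices
into V, row/column 0 of A is the initial state s_0, and B is an (|S|+1) × |V|
emission matrix whose row 0 (for s_0) is all zeros. Array row t-1 stores the
quantities for time t; the function names are ours, not the notes'.

import numpy as np

def forward(x, A, B):
    """alpha[t-1, i] = alpha_i(t) = P(x_1, ..., x_t, z_t = s_i) for t = 1..T."""
    T, S = len(x), A.shape[0]
    alpha = np.zeros((T, S))
    alpha[0] = A[0] * B[:, x[0]]                    # base case: A_{0i} B_{i x_1}
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]  # sum_i alpha_i(t-1) A_ij B_{j x_t}
    return alpha

def backward(x, A, B):
    """beta[t-1, i] = beta_i(t) = P(x_{t+1}, ..., x_T | z_t = s_i) for t = 1..T."""
    T, S = len(x), A.shape[0]
    beta = np.ones((T, S))                          # base case: beta_i(T) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
    return beta

def likelihood(x, A, B):
    """P(x; A, B) = sum_i alpha_i(T)."""
    return forward(x, A, B)[-1].sum()

For long sequences these products underflow; practical implementations rescale
α and β at each step or work in log space.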

2.3 Maximum Likelihood State Assignment: The Viterbi Algorithm

One of the most common queries of a Hidden Markov Model is to ask what
was the most likely series of states ~z ∈ S^T given an observed series of outputs
~x ∈ V^T. Formally, we seek:

arg max_{~z} P(~z | ~x; A, B) = arg max_{~z} P(~x, ~z; A, B) / (\sum_{~z} P(~x, ~z; A, B))
                             = arg max_{~z} P(~x, ~z; A, B)

The first simplification follows from Bayes rule and the second from the
observation that the denominator does not directly depend on ~z. Naively, we
might try every possible assignment to ~z and take the one with the highest
joint probability assigned by our model. However, this would require O(|S|^T)
operations just to enumerate the set of possible assignments. At this point, you
might think a dynamic programming solution like the Forward Algorithm might
save the day, and you'd be right. Notice that if you replaced the arg max_{~z} with
\sum_{~z}, our current task is exactly analogous to the expression which motivated
the forward procedure.

Algorithm 2 Naive application of EM to HMMs

Repeat until convergence {

(E-Step) For every possible labeling ~z ∈ S^T, set

    Q(~z) := P(~z | ~x; A, B)

(M-Step) Set

    A, B := arg max_{A,B} \sum_{~z} Q(~z) log (P(~x, ~z; A, B) / Q(~z))

    s.t. \sum_{j=1}^{|S|} A_ij = 1, i = 1..|S|;  A_ij ≥ 0, i, j = 1..|S|
         \sum_{k=1}^{|V|} B_ik = 1, i = 1..|S|;  B_ik ≥ 0, i = 1..|S|, k = 1..|V|

}

The Viterbi Algorithm is just like the forward procedure except that
instead of tracking the total probability of generating the observations seen so
far, we need only track the maximum probability and record its corresponding
state sequence.
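Concretely, the sketch below mirrors the forward() function above, with the
sum replaced by a max and a table of backpointers; the names and conventions
are carried over from the earlier snippets rather than taken from the notes.

import numpy as np

def viterbi(x, A, B):
    """Return the most likely state sequence arg max_z P(x, z; A, B) as a list
    of state indices z_1..z_T (same conventions as forward() above)."""
    T, S = len(x), A.shape[0]
    delta = np.zeros((T, S))           # delta[t, j]: best prob of a path ending in j
    psi = np.zeros((T, S), dtype=int)  # psi[t, j]: best predecessor of j at time t
    delta[0] = A[0] * B[:, x[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A         # scores[i, j] = delta_i(t-1) A_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, x[t]]
    # Trace the best path back from the most probable final state.
    z = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        z.append(int(psi[t, z[-1]]))
    return z[::-1]

As with the forward procedure, a production version would use log probabilities
so that long sequences do not underflow; the max and argmax carry over unchanged.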

2.4 Parameter Learning: EM for HMMs

The final question to ask of an HMM is: given a set of observations, what
are the values of the state transition probabilities A and the output emission
probabilities B that make the data most likely? For example, solving for the
maximum likelihood parameters based on a speech recognition dataset will allow
us to effectively train the HMM before asking for the maximum likelihood state
assignment of a candidate speech signal.
In this section, we present a derivation of the Expectation Maximization
algorithm for Hidden Markov Models. This proof follows from the general
formulation of EM presented in the CS229 lecture notes. Algorithm 2 shows the
basic EM algorithm. Notice that the optimization problem in the M-Step is now
constrained such that A and B contain valid probabilities. Like the maximum
likelihood solution we found for (non-Hidden) Markov models, we'll be able to
solve this optimization problem with Lagrange multipliers. Notice also that the
E-Step and M-Step both require enumerating all |S|^T possible labellings of ~z.
We'll make use of the Forward and Backward algorithms mentioned earlier to
compute a set of sufficient statistics for our E-Step and M-Step tractably.
First, let's rewrite the objective function using our Markov assumptions.

A, B = arg max_{A,B} \sum_{~z} Q(~z) log (P(~x, ~z; A, B) / Q(~z))
     = arg max_{A,B} \sum_{~z} Q(~z) log P(~x, ~z; A, B)
     = arg max_{A,B} \sum_{~z} Q(~z) log ((\prod_{t=1}^{T} P(x_t | z_t; B)) (\prod_{t=1}^{T} P(z_t | z_{t-1}; A)))
     = arg max_{A,B} \sum_{~z} Q(~z) \sum_{t=1}^{T} (log B_{z_t x_t} + log A_{z_{t-1} z_t})
     = arg max_{A,B} \sum_{~z} Q(~z) \sum_{i=1}^{|S|} \sum_{j=1}^{|S|} \sum_{k=1}^{|V|} \sum_{t=1}^{T} (1{z_t = s_j ∧ x_t = v_k} log B_jk + 1{z_{t-1} = s_i ∧ z_t = s_j} log A_ij)

In the first line we split the log division into a subtraction and note that
the denominator's term does not depend on the parameters A, B. The Markov
assumptions are applied in line 3. Line 5 uses indicator functions to index A
and B by state.
Just as for the maximum likelihood parameters for a visible Markov model,
it is safe to ignore the inequality constraints because the solution form naturally
results in only positive solutions. Constructing the Lagrangian:

L(A, B, δ, ε) = \sum_{~z} Q(~z) \sum_{i=1}^{|S|} \sum_{j=1}^{|S|} \sum_{k=1}^{|V|} \sum_{t=1}^{T} (1{z_t = s_j ∧ x_t = v_k} log B_jk + 1{z_{t-1} = s_i ∧ z_t = s_j} log A_ij)
              + \sum_{j=1}^{|S|} ε_j (1 - \sum_{k=1}^{|V|} B_jk) + \sum_{i=1}^{|S|} δ_i (1 - \sum_{j=1}^{|S|} A_ij)

Taking partial derivatives and setting them equal to zero:

∂L(A, B, δ, ε)/∂A_ij = \sum_{~z} Q(~z) (1/A_ij) \sum_{t=1}^{T} 1{z_{t-1} = s_i ∧ z_t = s_j} - δ_i ≡ 0

A_ij = (1/δ_i) \sum_{~z} Q(~z) \sum_{t=1}^{T} 1{z_{t-1} = s_i ∧ z_t = s_j}

∂L(A, B, δ, ε)/∂B_jk = \sum_{~z} Q(~z) (1/B_jk) \sum_{t=1}^{T} 1{z_t = s_j ∧ x_t = v_k} - ε_j ≡ 0

B_jk = (1/ε_j) \sum_{~z} Q(~z) \sum_{t=1}^{T} 1{z_t = s_j ∧ x_t = v_k}

Taking partial derivatives with respect to the Lagrange multipliers and
substituting our values of A_ij and B_jk above:

∂L(A, B, δ, ε)/∂δ_i = 1 - \sum_{j=1}^{|S|} A_ij
                    = 1 - \sum_{j=1}^{|S|} (1/δ_i) \sum_{~z} Q(~z) \sum_{t=1}^{T} 1{z_{t-1} = s_i ∧ z_t = s_j} ≡ 0

δ_i = \sum_{j=1}^{|S|} \sum_{~z} Q(~z) \sum_{t=1}^{T} 1{z_{t-1} = s_i ∧ z_t = s_j}
    = \sum_{~z} Q(~z) \sum_{t=1}^{T} 1{z_{t-1} = s_i}

∂L(A, B, δ, ε)/∂ε_j = 1 - \sum_{k=1}^{|V|} B_jk
                    = 1 - \sum_{k=1}^{|V|} (1/ε_j) \sum_{~z} Q(~z) \sum_{t=1}^{T} 1{z_t = s_j ∧ x_t = v_k} ≡ 0

ε_j = \sum_{k=1}^{|V|} \sum_{~z} Q(~z) \sum_{t=1}^{T} 1{z_t = s_j ∧ x_t = v_k}
    = \sum_{~z} Q(~z) \sum_{t=1}^{T} 1{z_t = s_j}

Substituting back into our expressions above, we find that the parameters Â and
B̂ that maximize our predicted counts with respect to the dataset are:

Â_ij = (\sum_{~z} Q(~z) \sum_{t=1}^{T} 1{z_{t-1} = s_i ∧ z_t = s_j}) / (\sum_{~z} Q(~z) \sum_{t=1}^{T} 1{z_{t-1} = s_i})

B̂_jk = (\sum_{~z} Q(~z) \sum_{t=1}^{T} 1{z_t = s_j ∧ x_t = v_k}) / (\sum_{~z} Q(~z) \sum_{t=1}^{T} 1{z_t = s_j})

Unfortunately, each of these sums is over all possible labellings ~z ∈ S^T. But
recall that Q(~z) was defined in the E-step as P(~z | ~x; A, B) for the parameters A
and B from the previous iteration. Let's consider how to represent first the
numerator of Â_ij in terms of our forward and backward probabilities, α_i(t) and β_j(t).

\sum_{~z} Q(~z) \sum_{t=1}^{T} 1{z_{t-1} = s_i ∧ z_t = s_j}
= \sum_{t=1}^{T} \sum_{~z} 1{z_{t-1} = s_i ∧ z_t = s_j} Q(~z)
= \sum_{t=1}^{T} \sum_{~z} 1{z_{t-1} = s_i ∧ z_t = s_j} P(~z | ~x; A, B)
= (1 / P(~x; A, B)) \sum_{t=1}^{T} \sum_{~z} 1{z_{t-1} = s_i ∧ z_t = s_j} P(~z, ~x; A, B)
= (1 / P(~x; A, B)) \sum_{t=1}^{T} α_i(t) A_ij B_{j x_t} β_j(t+1)

In the first two steps we rearrange terms and substitute in for our definition
of Q. Then we use Bayes rule in deriving line four, followed by the definitions
of α, β, A, and B, in line five. Similarly, the denominator can be represented
by summing out over j the value of the numerator.

\sum_{~z} Q(~z) \sum_{t=1}^{T} 1{z_{t-1} = s_i}
= \sum_{j=1}^{|S|} \sum_{~z} Q(~z) \sum_{t=1}^{T} 1{z_{t-1} = s_i ∧ z_t = s_j}
= (1 / P(~x; A, B)) \sum_{j=1}^{|S|} \sum_{t=1}^{T} α_i(t) A_ij B_{j x_t} β_j(t+1)

Combining these expressions, we can fully characterize our maximum likelihood
state transitions Â_ij without needing to enumerate all possible labellings as:

Â_ij = (\sum_{t=1}^{T} α_i(t) A_ij B_{j x_t} β_j(t+1)) / (\sum_{j=1}^{|S|} \sum_{t=1}^{T} α_i(t) A_ij B_{j x_t} β_j(t+1))

Similarly, we can represent the numerator for B̂_jk as:

\sum_{~z} Q(~z) \sum_{t=1}^{T} 1{z_t = s_j ∧ x_t = v_k}
= (1 / P(~x; A, B)) \sum_{t=1}^{T} \sum_{~z} 1{z_t = s_j ∧ x_t = v_k} P(~z, ~x; A, B)
= (1 / P(~x; A, B)) \sum_{i=1}^{|S|} \sum_{t=1}^{T} \sum_{~z} 1{z_{t-1} = s_i ∧ z_t = s_j ∧ x_t = v_k} P(~z, ~x; A, B)
= (1 / P(~x; A, B)) \sum_{i=1}^{|S|} \sum_{t=1}^{T} 1{x_t = v_k} α_i(t) A_ij B_{j x_t} β_j(t+1)

Algorithm 3 Forward-Backward algorithm for HMM parameter learning

Initialization: Set A and B as random valid probability matrices
where A_{i0} = 0 and B_{0k} = 0 for i = 1..|S| and k = 1..|V|.

Repeat until convergence {

(E-Step) Run the Forward and Backward algorithms to compute α_i and β_i for
i = 1..|S|. Then set:

    γ_t(i, j) := α_i(t) A_ij B_{j x_t} β_j(t+1)

(M-Step) Re-estimate the maximum likelihood parameters as:

    A_ij := (\sum_{t=1}^{T} γ_t(i, j)) / (\sum_{j=1}^{|S|} \sum_{t=1}^{T} γ_t(i, j))

    B_jk := (\sum_{i=1}^{|S|} \sum_{t=1}^{T} 1{x_t = v_k} γ_t(i, j)) / (\sum_{i=1}^{|S|} \sum_{t=1}^{T} γ_t(i, j))

}

And the denominator of B̂_jk as:

\sum_{~z} Q(~z) \sum_{t=1}^{T} 1{z_t = s_j}
= (1 / P(~x; A, B)) \sum_{i=1}^{|S|} \sum_{t=1}^{T} \sum_{~z} 1{z_{t-1} = s_i ∧ z_t = s_j} P(~z, ~x; A, B)
= (1 / P(~x; A, B)) \sum_{i=1}^{|S|} \sum_{t=1}^{T} α_i(t) A_ij B_{j x_t} β_j(t+1)

Combining these expressions, we have the following form for our maximum
likelihood emission probabilities:

B̂_jk = (\sum_{i=1}^{|S|} \sum_{t=1}^{T} 1{x_t = v_k} α_i(t) A_ij B_{j x_t} β_j(t+1)) / (\sum_{i=1}^{|S|} \sum_{t=1}^{T} α_i(t) A_ij B_{j x_t} β_j(t+1))

Algorithm 3 shows a variant of the Forward-Backward Algorithm,
or the Baum-Welch Algorithm, for parameter learning in HMMs. In the
E-Step, rather than explicitly evaluating Q(~z) for all ~z ∈ S^T, we compute
a sufficient statistic γ_t(i, j) = α_i(t) A_ij B_{j x_t} β_j(t+1) that is proportional to
the probability of transitioning between state s_i and s_j at time t given all of
our observations ~x. The derived expressions for A_ij and B_jk are intuitively
appealing. A_ij is computed as the expected number of transitions from s_i to
s_j divided by the expected number of appearances of s_i. Similarly, B_jk is
computed as the expected number of emissions of v_k from s_j divided by the
expected number of appearances of s_j.
Like many applications of EM, parameter learning for HMMs is a non-convex
problem with many local maxima. EM will converge to a maximum based on
its initial parameters, so multiple runs might be in order. Also, it is often
important to smooth the probability distributions represented by A and B so
that no transition or emission is assigned 0 probability.
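To make Algorithm 3 concrete, here is a sketch of a single EM iteration in
Python, reusing the hypothetical forward() and backward() functions from
Section 2.2. It is a teaching sketch under several assumptions rather than a
reference implementation: it works with unscaled α and β (real code rescales
or uses log space), it keeps the initial-state row of A fixed rather than
re-estimating it, and the emission update uses the state posterior α_j(t)β_j(t),
which is γ_t summed over the incoming state i.

import numpy as np

def baum_welch_step(x, A, B):
    """One Forward-Backward (Baum-Welch) iteration; returns re-estimated (A, B)."""
    T = len(x)
    S, V = B.shape
    alpha, beta = forward(x, A, B), backward(x, A, B)
    # gamma[t, i, j] is proportional to P(z_t = s_i, z_{t+1} = s_j | x),
    # one slice per consecutive pair of hidden states.
    gamma = np.zeros((T - 1, S, S))
    for t in range(T - 1):
        gamma[t] = alpha[t][:, None] * A * (B[:, x[t + 1]] * beta[t + 1])[None, :]
    # A_ij: expected transitions i -> j over expected visits to i.
    A_new = gamma.sum(axis=0)
    rows = A_new.sum(axis=1, keepdims=True)
    A_new = A_new / np.where(rows == 0, 1.0, rows)
    A_new[0] = A[0]  # keep the initial-state distribution fixed (not re-estimated here)
    # B_jk: expected emissions of v_k from s_j over expected visits to s_j,
    # via the state posterior alpha_j(t) beta_j(t); the 1/P(x) factor cancels.
    posterior = alpha * beta
    B_new = np.zeros((S, V))
    for t in range(T):
        B_new[:, x[t]] += posterior[t]
    rows = B_new.sum(axis=1, keepdims=True)
    return A_new, B_new / np.where(rows == 0, 1.0, rows)

Iterating baum_welch_step until the likelihood from the forward procedure stops
improving, and then smoothing the resulting matrices, completes the learning
loop described above.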

2.5 Further reading

There are many good sources for learning about Hidden Markov Models. For
applications in NLP, I recommend consulting Jurafsky & Martin's draft second
edition of Speech and Language Processing^1 or Manning & Schütze's Foundations
of Statistical Natural Language Processing. Also, Eisner's HMM-in-a-spreadsheet
[1] is a light-weight interactive way to play with an HMM that requires only a
spreadsheet application.

References

[1] Jason Eisner. An interactive spreadsheet for teaching the forward-backward
algorithm. In Dragomir Radev and Chris Brew, editors, Proceedings of the
ACL Workshop on Effective Tools and Methodologies for Teaching NLP and
CL, pages 10-18, 2002.

^1 http://www.cs.colorado.edu/~martin/slp2.html

