Probabilistic inference:
P(Q | E = e)
Algorithms:
• Manual: conditional independence + variable elimination
• Forward-backward: HMMs, exact
• Gibbs sampling, particle filtering: general, approximate
Supervised learning
Laplace smoothing
Unsupervised learning with EM
Example: one variable
[Bayesian network: a single node R]
P(R = r) = p(r)
Parameters: θ (local conditional probabilities)
Training data:
Dtrain = {1, 3, 4, 4, 4, 4, 4, 5, 5, 5}

Parameters θ:
r   p(r)
1   0.1
2   0.0
3   0.1
4   0.5
5   0.3
• Given the data, which consists of a set of ratings (the order doesn’t matter here), the natural thing to do
is to set each parameter p(r) to be the empirical fraction of times that r occurs in Dtrain .
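As a concrete illustration, here is a minimal Python sketch of this count-and-normalize estimate (the variable names are ours, not the slides'):

from collections import Counter

# Maximum likelihood for one variable: p(r) is the empirical
# fraction of times r appears in the training data.
D_train = [1, 3, 4, 4, 4, 4, 4, 5, 5, 5]
counts = Counter(D_train)
p = {r: counts[r] / len(D_train) for r in range(1, 6)}
print(p)  # {1: 0.1, 2: 0.0, 3: 0.1, 4: 0.5, 5: 0.3}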
Example: two variables
Variables:
• Genre G ∈ {drama, comedy}
• Rating R ∈ {1, 2, 3, 4, 5}
[Bayesian network: G → R]
P(G = g, R = r) = pG(g) pR(r | g)
Dtrain = {(d, 4), (d, 4), (d, 5), (c, 1), (c, 5)}
Parameters: θ = (pG , pR )
Intuitive strategy:
• Estimate each local conditional distribution (pG and pR ) separately
• For each value of the conditioned variable (e.g., g), estimate the distribution over values of the unconditioned variable (e.g., r); see the code sketch after the tables below
θ:
g   pG(g)
d   3/5
c   2/5

g   r   pR(r | g)
d   4   2/3
d   5   1/3
c   1   1/2
c   5   1/2
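A short Python sketch of the same intuitive strategy, estimating pG and pR(· | g) separately by counting (the dict-based representation is illustrative):

from collections import Counter

# Estimate pG and pR(. | g) separately by counting, mirroring the tables above.
D_train = [('d', 4), ('d', 4), ('d', 5), ('c', 1), ('c', 5)]
genre_counts = Counter(g for g, r in D_train)
pair_counts = Counter(D_train)

p_G = {g: c / len(D_train) for g, c in genre_counts.items()}
# Condition on g: normalize each joint count by the count of its genre.
p_R = {(g, r): c / genre_counts[g] for (g, r), c in pair_counts.items()}

print(p_G)            # {'d': 0.6, 'c': 0.4}
print(p_R[('d', 4)])  # 0.666... = 2/3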
Example: v-structure
Variables: Genre G ∈ {d, c}, a second binary parent A ∈ {0, 1}, Rating R ∈ {1, 2, 3, 4, 5}
[Bayesian network: G → R ← A]
P(G = g, A = a, R = r) = pG(g) pA(a) pR(r | g, a)
Dtrain = {(d, 0, 3), (d, 1, 5), (c, 0, 1), (c, 0, 5), (c, 1, 4)}
[whiteboard]
• While probabilistic inference with V-structures was quite subtle, learning parameters for V-structures is
really the same. We just need to remember that the parameters include the conditional probabilities for
each joint assignment to both parents.
• In this case, there are roughly 2 + 2 + (2 · 2 · 5) = 24 parameters to set (or 18 if you drop the redundant ones, since each local distribution must sum to 1). Given only five data points, though, most of these parameters will be zero (without smoothing, which we'll talk about later).
Example: three variables
Variables:
• Genre G ∈ {drama, comedy}
• Jim’s rating R1 ∈ {1, 2, 3, 4, 5}
• Martha’s rating R2 ∈ {1, 2, 3, 4, 5}
[Bayesian network: G → R1, G → R2]
P(G = g, R1 = r1, R2 = r2) = pG(g) pR(r1 | g) pR(r2 | g)
Dtrain = {(d, 4, 5), (d, 4, 4), (d, 5, 3), (c, 1, 2), (c, 5, 4)}
θ (pR is shared between R1 and R2, so the drama rows normalize over 6 ratings and the comedy rows over 4):

g   pG(g)
c   2/5
d   3/5

g   r   pR(r | g)
c   1   1/4
c   2   1/4
c   3   0/4
c   4   1/4
c   5   1/4
d   1   0/6
d   2   0/6
d   3   1/6
d   4   3/6
d   5   2/6
[Bayesian network (naive Bayes): Y → W1, W2, ..., WL]
P(Y = y, W1 = w1, ..., WL = wL) = pgenre(y) ∏_{j=1}^{L} pword(wj | y)
[Bayesian network (HMM): H1 → H2 → ... → Hn, with each Hi → Ei]
P(H = h, E = e) = ∏_{i=1}^{n} ptrans(hi | hi−1) pemit(ei | hi)
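A minimal sketch of evaluating this joint probability, assuming ptrans and pemit are given as plain lookup tables (the dict-based representation and the start symbol are our assumptions, not the slides'):

# Evaluate the HMM joint P(H = h, E = e) as the product of transition and
# emission probabilities. p_trans and p_emit are illustrative lookup tables;
# a special start symbol stands in for h_0 so ptrans(h_1 | h_0) is defined.
START = '-'

def hmm_joint(h, e, p_trans, p_emit):
    prob, prev = 1.0, START
    for h_i, e_i in zip(h, e):
        prob *= p_trans[(h_i, prev)] * p_emit[(e_i, h_i)]
        prev = h_i
    return prob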
P(X1 = x1, ..., Xn = xn) = ∏_{i=1}^{n} pdi(xi | xParents(i))
Count:
  For each x ∈ Dtrain:
    For each variable xi:
      Increment countdi(xParents(i), xi)
Normalize:
  For each d and local assignment xParents(i):
    Set pd(xi | xParents(i)) ∝ countd(xParents(i), xi)
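The count-and-normalize procedure translates directly into code. A sketch under illustrative assumptions: each data point is a tuple, parents[i] lists the parent indices of variable i, and dist_of[i] = di names the local distribution so that variables can share parameters:

from collections import defaultdict

def learn_parameters(data, parents, dist_of):
    counts = defaultdict(float)
    for x in data:
        for i in range(len(x)):
            pa = tuple(x[j] for j in parents[i])
            counts[(dist_of[i], pa, x[i])] += 1.0  # count_d(x_Parents(i), x_i)
    totals = defaultdict(float)
    for (d, pa, v), c in counts.items():
        totals[(d, pa)] += c
    # p_d(x_i | x_Parents(i)) is proportional to count_d(x_Parents(i), x_i)
    return {key: c / totals[key[:2]] for key, c in counts.items()}

# The three-variable example, with pR shared between R1 and R2:
data = [('d', 4, 5), ('d', 4, 4), ('d', 5, 3), ('c', 1, 2), ('c', 5, 4)]
theta = learn_parameters(data, {0: [], 1: [0], 2: [0]}, {0: 'pG', 1: 'pR', 2: 'pR'})
print(theta[('pR', ('d',), 4)])  # 0.5, i.e. 3/6 as in the table above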
Maximum likelihood objective:
max_θ ∏_{x ∈ Dtrain} P(X = x; θ)
Count-and-normalize is the closed-form solution to this objective.
Supervised learning
Laplace smoothing
Unsupervised learning with EM
Example: flip a coin once and observe H.
Maximum likelihood: p(H) = 1/1 = 1, p(T) = 0/1 = 0
When we have less data, maximum likelihood overfits; we want a more reasonable estimate.
Laplace smoothing: add 1 to each count before normalizing:
p(H) = (1 + 1)/(1 + 2) = 2/3, p(T) = (0 + 1)/(1 + 2) = 1/3
With more data, the smoothing washes out: one observed H gives p(H) = (1 + 1)/(1 + 2) = 2/3, but 998 observed H's give p(H) = (998 + 1)/(998 + 2) = 0.999.
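A small sketch of Laplace smoothing that reproduces the numbers above (the function name and the pseudocount parameter lam are illustrative):

from collections import Counter

# Laplace smoothing: add a pseudocount lam to every value before normalizing.
def smoothed_estimate(data, values, lam=1.0):
    counts = Counter(data)
    total = len(data) + lam * len(values)
    return {v: (counts[v] + lam) / total for v in values}

print(smoothed_estimate(['H'], ['H', 'T']))        # {'H': 2/3, 'T': 1/3}
print(smoothed_estimate(['H'] * 998, ['H', 'T']))  # {'H': 0.999, 'T': 0.001}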
Supervised learning
Laplace smoothing
Unsupervised learning with EM
Same model as before, but now the genre is unobserved (marked ?):
[Bayesian network: G → R1, G → R2, with G hidden]
Dtrain = {(?, 4, 5), (?, 4, 4), (?, 5, 3), (?, 1, 2), (?, 5, 4)}
E-step:
• Compute q(h) = P(H = h | E = e; θ) for each h (use any probabilistic inference algorithm)
• Create weighted points: (h, e) with weight q(h)
M-step:
• Compute maximum likelihood (just count and normalize) to get θ
Repeat until convergence.
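A compact sketch of one EM iteration for the hidden-genre model above, with q(g) computed by Bayes' rule in the E-step (the dict-based parameter representation is our assumption):

from collections import defaultdict

# One EM iteration for the model G -> (R1, R2) with G unobserved. p_G maps
# each genre g to a probability; p_R maps (g, r) pairs; data points are (r1, r2).
def em_step(data, p_G, p_R, genres=('c', 'd')):
    genre_counts = defaultdict(float)
    rating_counts = defaultdict(float)
    for r1, r2 in data:
        # E-step: q(g) = P(G = g | R1 = r1, R2 = r2; theta) by Bayes' rule.
        joint = {g: p_G[g] * p_R[(g, r1)] * p_R[(g, r2)] for g in genres}
        Z = sum(joint.values())
        for g in genres:
            q = joint[g] / Z
            genre_counts[g] += q          # weighted count for pG
            rating_counts[(g, r1)] += q   # each point contributes two ratings
            rating_counts[(g, r2)] += q
    # M-step: maximum likelihood on the weighted counts (count and normalize).
    new_p_G = {g: genre_counts[g] / len(data) for g in genres}
    new_p_R = {gr: c / (2 * genre_counts[gr[0]]) for gr, c in rating_counts.items()}
    return new_p_G, new_p_R

Repeating em_step until the parameters stop changing implements the loop above; the weighted counts in the M-step are exactly the fractional counts shown in the next table.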
M-step:
θ:
g   count         pG(g)
c   0.69 + 0.5    0.59
d   0.31 + 0.5    0.41

g   r   count                pR(r | g)
c   1   0.5                  0.21
c   2   0.5 + 0.69 + 0.69    0.79
d   1   0.5                  0.31
d   2   0.5 + 0.31 + 0.31    0.69
Plain: abcdefghijklmnopqrstuvwxyz
Cipher: plokmijnuhbygvtfcrdxeszaqw
[Bayesian network (HMM): hidden plaintext characters H1 → ... → Hn, each emitting an observed ciphertext character Ei]
P(H = h, E = e) = ∏_{i=1}^{n} ptrans(hi | hi−1) pemit(ei | hi)
Parameters:
• ptrans : character-level bigram model
• pemit : substitution table
Strategy:
• Learn ptrans on tons of English text
• E-step: use forward-backward algorithm to guess P(H | E = e)
• M-step: estimate pemit
Intuition: rely on the language model (ptrans) to favor plaintexts h that look like English.
[live solution]
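As a sketch of the first step of this strategy, here is one way to estimate a character-level bigram model ptrans from English text, with Laplace smoothing so unseen bigrams don't get probability zero (the function name, smoothing constant, and sample text are illustrative):

from collections import defaultdict

def train_bigram(text, alphabet, lam=0.1):
    counts = defaultdict(float)
    for prev, cur in zip(text, text[1:]):
        counts[(cur, prev)] += 1.0        # count of cur following prev
    totals = defaultdict(float)
    for (cur, prev), c in counts.items():
        totals[prev] += c
    # ptrans(c2 | c1) with add-lam smoothing over the alphabet.
    return {(c2, c1): (counts[(c2, c1)] + lam) / (totals[c1] + lam * len(alphabet))
            for c1 in alphabet for c2 in alphabet}

alphabet = 'abcdefghijklmnopqrstuvwxyz '
p_trans = train_bigram('the quick brown fox jumps over the lazy dog', alphabet)

In practice ptrans would be trained once on a large corpus and held fixed, while EM only re-estimates the substitution table pemit.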
Summary: a query Q | E, together with the parameters θ of the Bayesian network ⇒ probabilistic inference ⇒ P(Q | E; θ)