Parameter learning
Emily Fox & Carlos Guestrin
Machine Learning Specialization
University of Washington
©2015-2016 Emily Fox & Carlos Guestrin, Machine Learning Specialization
Learn a probabilistic classification model

Many classifiers report not just an output label for an input sentence, but a degree of certainty, P(y|x). For example:

P(y=+1 | x = "The sushi & everything else were awesome!") = 0.99
P(y=+1 | x = "The sushi was good, the service was OK") = 0.55

This degree of certainty is extremely useful in practice.
A (linear) classifier

• Will use training data to learn a weight or coefficient for each word

Word                     Coefficient      Value
(intercept)              ŵ0               -2.0
good                     ŵ1                1.0
great                    ŵ2                1.5
awesome                  ŵ3                2.7
bad                      ŵ4               -1.0
terrible                 ŵ5               -2.1
awful                    ŵ6               -3.3
restaurant, the, we, …   ŵ7, ŵ8, ŵ9, …     0.0
…                        …
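The table above can be sketched as a simple scoring function in Python (the dictionary layout and the names `coefficients` and `score` are my choices, not the lecture's; words missing from the table get coefficient 0.0):

```python
# Coefficients taken from the table above; w0 is the intercept.
coefficients = {
    "good": 1.0, "great": 1.5, "awesome": 2.7,
    "bad": -1.0, "terrible": -2.1, "awful": -3.3,
}
INTERCEPT = -2.0  # w0

def score(sentence):
    """Intercept plus the coefficient of every word (0.0 for unlisted words)."""
    total = INTERCEPT
    for word in sentence.lower().split():
        total += coefficients.get(word, 0.0)
    return total

print(score("the sushi was awesome"))  # -2.0 + 2.7 = 0.7
```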
Logistic regression model

Score(xi) = ŵTh(xi) takes values in (-∞, +∞). The sigmoid squashes the score into a probability:

  P(y=+1|x,ŵ) = 1 / (1 + e^(-ŵTh(x)))

A score of -∞ maps to probability 0.0, a score of 0 maps to 0.5, and a score of +∞ maps to 1.0.
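A minimal Python sketch of this link function, assuming a precomputed score ŵTh(x) (the name `sigmoid` is my label):

```python
import math

def sigmoid(score):
    # Squashes a score in (-inf, +inf) to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-score))

print(sigmoid(0.0))    # 0.5: a score of 0 means maximal uncertainty
print(sigmoid(5.0))    # close to 1.0
print(sigmoid(-5.0))   # close to 0.0
```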
[Block diagram: Training Data → Feature extraction h(x) → ML model P(y=+1|x,ŵ) = 1/(1 + e^(-ŵTh(x))) → ŵ, with a Quality metric guiding the ML algorithm.]
Learning problem

Training data: N observations (xi, yi). Optimize the quality metric on the training data to obtain ŵ.

x[1] = #awesome   x[2] = #awful   y = sentiment
2                 1               +1
0                 2               -1
3                 3               -1
4                 1               +1
1                 1               +1
2                 4               -1
0                 3               -1
0                 1               -1
2                 1               +1
…
Best model: highest likelihood ℓ(w)

  ŵ = (w0=1, w1=0.5, w2=-1.5), with ℓ(w0=1, w1=0.5, w2=-1.5) = 10^-4

[Plot: training data over #awesome and #awful with this model's decision boundary.]
Data likelihood

Point    x[1] = #awesome   x[2] = #awful   y = sentiment
x1,y1    2                 1               +1
x2,y2    0                 2               -1
x3,y3    3                 3               -1
x4,y4    4                 1               +1
x5,y5    1                 1               +1
x6,y6    2                 4               -1
x7,y7    0                 3               -1
x8,y8    0                 1               -1
x9,y9    2                 1               +1
Learn logistic regression model with maximum likelihood estimation (MLE)

Choose w to maximize the probability of the observed data points (x[1], x[2], y):

  ℓ(w) = P(y1|x1,w) · P(y2|x2,w) · P(y3|x3,w) · P(y4|x4,w) · … = ∏_{i=1}^{N} P(yi | xi, w)
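As a small sketch, the data likelihood is just a product of per-point probabilities (the probability values below are made-up placeholders, not from the lecture's data):

```python
def likelihood(probs):
    """Product of the per-point probabilities P(yi | xi, w)."""
    result = 1.0
    for p in probs:
        result *= p
    return result

# Products of probabilities shrink quickly toward 0, which is one reason
# the course later switches to the log-likelihood.
print(likelihood([0.9, 0.8, 0.95]))  # about 0.684
```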
Maximizing likelihood

Algorithm (gradient ascent): while not converged, update

  w(t+1) ← w(t) + η ∇ℓ(w(t))

In practice, stop when the gradient magnitude ‖∇ℓ(w)‖ falls below a small tolerance ε.

Each partial derivative of the log-likelihood is

  ∂ℓ(w)/∂wj = Σ_{i=1}^{N} hj(xi) ( 1[yi = +1] − P(y = +1 | xi, w) )

Indicator function: 1[yi = +1] equals 1 when yi = +1, and 0 otherwise.
Example: derivative for w1 at w0(t) = 0, w1(t) = 1, w2(t) = -2:

x[1]   x[2]   y     P(y=+1|xi,w)   Contribution to derivative for w1
2      1      +1    0.5            2 · (1 − 0.5)  =  1.0
0      2      -1    0.02           0 · (0 − 0.02) =  0.0
3      3      -1    0.05           3 · (0 − 0.05) = −0.15
4      1      +1    0.88           4 · (1 − 0.88) =  0.48

Total derivative: 1.0 + 0.0 − 0.15 + 0.48 = 1.33

Since the features here are counts (nonnegative), the yi = +1 points push the derivative up and the yi = -1 points push it down.
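The derivative and the ascent updates can be sketched in Python on the four rows above; at w(t) = (0, 1, −2) it reproduces the per-point probabilities shown (0.5, 0.02, 0.05, 0.88). Function names and the step size are my choices, not the lecture's:

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# The four data points from the table above; h(x) = [1, #awesome, #awful].
data = [([1, 2, 1], +1), ([1, 0, 2], -1), ([1, 3, 3], -1), ([1, 4, 1], +1)]

def gradient(w):
    """dl(w)/dwj = sum_i hj(xi) * (1[yi=+1] - P(y=+1|xi,w))."""
    g = [0.0] * len(w)
    for h, y in data:
        p = sigmoid(sum(wj * hj for wj, hj in zip(w, h)))
        for j, hj in enumerate(h):
            g[j] += hj * ((1.0 if y == +1 else 0.0) - p)
    return g

def gradient_ascent(w, step_size=0.1, num_iters=500):
    # Repeated updates w <- w + eta * grad(l(w)).
    for _ in range(num_iters):
        w = [wj + step_size * gj for wj, gj in zip(w, gradient(w))]
    return w

print(round(gradient([0.0, 1.0, -2.0])[1], 2))  # 1.33, the w1 derivative
```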
• Advanced tip: can also try a step size that decreases with the number of iterations, e.g., a schedule like η_t = η_0 / t
VERY OPTIONAL
Using log to turn products into sums

• The log of the product of likelihoods becomes the sum of the logs:

  ℓℓ(w) = ln ∏_{i=1}^{N} P(yi | xi, w) = Σ_{i=1}^{N} ln P(yi | xi, w)
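A tiny illustration, assuming hypothetical per-point probabilities, that the sum of logs recovers the log of the product:

```python
import math

def log_likelihood(probs):
    """Sum of logs == log of the product of the probabilities."""
    return sum(math.log(p) for p in probs)

probs = [0.9, 0.8, 0.95]
# Exponentiating the sum of logs recovers the direct product.
print(math.exp(log_likelihood(probs)))  # about 0.684
```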
Logistic regression model: P(y=-1|x,w)

• Probability the model predicts y=+1:
  P(y=+1|x,w) = 1 / (1 + e^(-wTh(x)))
• The probabilities of the two labels must sum to 1, so:
  P(y=-1|x,w) = 1 − P(y=+1|x,w)
Plugging in logistic function for 1 data point

  P(y = +1 | x, w) = 1 / (1 + e^(-wTh(x)))
  P(y = −1 | x, w) = e^(-wTh(x)) / (1 + e^(-wTh(x)))
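A small sketch of the two probabilities, confirming they sum to 1 for any score (function names are mine):

```python
import math

def p_pos(w_dot_h):
    # P(y=+1 | x, w) = 1 / (1 + e^(-w.h(x)))
    return 1.0 / (1.0 + math.exp(-w_dot_h))

def p_neg(w_dot_h):
    # P(y=-1 | x, w) = e^(-w.h(x)) / (1 + e^(-w.h(x)))
    return math.exp(-w_dot_h) / (1.0 + math.exp(-w_dot_h))

for z in (-3.0, 0.0, 1.7):
    print(z, p_pos(z) + p_neg(z))  # the two probabilities always sum to 1
```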
Gradient for 1 data point

Combining the two cases above into one expression for the log of P(yi | xi, w):

  ℓℓi(w) = −(1 − 1[yi = +1]) wTh(xi) − ln(1 + e^(-wTh(xi)))

Differentiating with respect to wj gives the single-point contribution hj(xi)(1[yi = +1] − P(y = +1 | xi, w)).
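A quick numerical sanity check of that derivative, comparing the closed form against a central finite difference; the particular w, h(x), and y values here are made up for illustration:

```python
import math

def log_p(w, h, y):
    """ln P(y|x,w) = -(1 - 1[y=+1]) * w.h(x) - ln(1 + exp(-w.h(x)))."""
    z = sum(wj * hj for wj, hj in zip(w, h))
    ind = 1.0 if y == +1 else 0.0
    return -(1.0 - ind) * z - math.log(1.0 + math.exp(-z))

def grad_j(w, h, y, j):
    """Closed-form derivative: hj(x) * (1[y=+1] - P(y=+1|x,w))."""
    z = sum(wj * hj for wj, hj in zip(w, h))
    p = 1.0 / (1.0 + math.exp(-z))
    ind = 1.0 if y == +1 else 0.0
    return h[j] * (ind - p)

w, h, y = [0.5, -1.0, 2.0], [1.0, 2.0, 1.0], -1
eps = 1e-6
for j in range(3):
    w_hi = list(w); w_hi[j] += eps
    w_lo = list(w); w_lo[j] -= eps
    numeric = (log_p(w_hi, h, y) - log_p(w_lo, h, y)) / (2 * eps)
    print(j, abs(numeric - grad_j(w, h, y, j)))  # each difference is tiny
```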
The sign(z) classifier maps a score z in (-∞, +∞) directly to a label:

  sign(z) = +1 if z ≥ 0, −1 otherwise

But sign(z) only outputs -1 or +1, no probabilities in between.
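A side-by-side sketch of the two output types, hard labels versus degrees of certainty (plain Python, names mine):

```python
import math

def sign(z):
    # Hard decision: outputs only the label, +1 or -1.
    return +1 if z >= 0 else -1

def sigmoid(z):
    # Soft decision: outputs a degree of certainty in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

print(sign(0.25), sigmoid(0.25))  # +1 vs a probability only slightly above 0.5
```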
Finding best coefficients

x[1] = #awesome   x[2] = #awful   y = sentiment
0                 2               -1
3                 3               -1
2                 4               -1
0                 3               -1
0                 1               -1
2                 1               +1
4                 1               +1
1                 1               +1
2                 1               +1

Score(xi) = ŵTh(xi) takes values in (-∞, +∞).
Quality metric: probability of data

  P(y=+1|x,ŵ) = 1 / (1 + e^(-ŵTh(x)))

Point    x[1] = #awesome   x[2] = #awful   y = sentiment
x1,y1    2                 1               +1
x2,y2    0                 2               -1
x3,y3    3                 3               -1
x4,y4    4                 1               +1
x5,y5    1                 1               +1
x6,y6    2                 4               -1
x7,y7    0                 3               -1
x8,y8    0                 1               -1
x9,y9    2                 1               +1
Learn logistic regression model with maximum likelihood estimation (MLE)

  Choose ŵ to maximize ℓ(w) = ∏_{i=1}^{N} P(yi | xi, w)
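As a closing sketch, MLE by gradient ascent on a hypothetical four-point data set; the check at the end confirms that the updates raised the log-likelihood (and hence the likelihood). All names and data here are illustrative, not from the lecture:

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical tiny data set; h(x) = [1, #awesome, #awful].
data = [([1, 2, 1], +1), ([1, 0, 2], -1), ([1, 3, 3], -1), ([1, 1, 1], +1)]

def log_likelihood(w):
    total = 0.0
    for h, y in data:
        z = sum(wj * hj for wj, hj in zip(w, h))
        # P(y=+1|x,w) = sigmoid(z); P(y=-1|x,w) = sigmoid(-z).
        total += math.log(sigmoid(z if y == +1 else -z))
    return total

def ascent_step(w, step_size=0.05):
    g = [0.0] * len(w)
    for h, y in data:
        p = sigmoid(sum(wj * hj for wj, hj in zip(w, h)))
        for j, hj in enumerate(h):
            g[j] += hj * ((1.0 if y == +1 else 0.0) - p)
    return [wj + step_size * gj for wj, gj in zip(w, g)]

w = [0.0, 0.0, 0.0]
start = log_likelihood(w)
for _ in range(200):
    w = ascent_step(w)
print(log_likelihood(w) > start)  # True: ascent climbs the likelihood
```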