
Linear classifiers:

Parameter learning
Emily Fox & Carlos Guestrin
Machine Learning Specialization
University of Washington
Learn a probabilistic classification model
“The sushi & everything else were awesome!”   →  Definite
“The sushi was good, the service was OK”      →  Not sure

P(y=+1 | x = “The sushi & everything else were awesome!”) = 0.99
P(y=+1 | x = “The sushi was good, the service was OK”) = 0.55

Many classifiers provide a degree of certainty:
P(y|x), the probability of the output label given the input sentence.
Extremely useful in practice.
A (linear) classifier
•  Will use training data to learn a weight
or coefficient for each word
Word                     Coefficient       Value
(intercept)              ŵ0                -2.0
good                     ŵ1                 1.0
great                    ŵ2                 1.5
awesome                  ŵ3                 2.7
bad                      ŵ4                -1.0
terrible                 ŵ5                -2.1
awful                    ŵ6                -3.3
restaurant, the, we, …   ŵ7, ŵ8, ŵ9, …      0.0
…                        …                  …
Logistic regression model
Score(xi) = ŵᵀh(xi) ranges from -∞ to +∞.
The sigmoid (logistic) function squashes the score into [0.0, 1.0]:

     P(y=+1 | x, ŵ) = 1 / (1 + e^(-ŵᵀh(x)))

A score of 0.0 maps to probability 0.5; large positive scores give probabilities near 1.0, and large negative scores give probabilities near 0.0.
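
As a concrete illustration (not part of the original slides), here is a minimal Python sketch of this model, assuming h(x) is already available as a NumPy feature vector:

    import numpy as np

    def predict_probability(w, h_x):
        """P(y=+1 | x, w) for a logistic regression model.

        w   : coefficient vector (same length as the feature vector)
        h_x : feature vector h(x) for one input x
        """
        score = np.dot(w, h_x)                 # Score(x) = w^T h(x), anywhere in (-inf, +inf)
        return 1.0 / (1.0 + np.exp(-score))    # sigmoid squashes the score into (0, 1)

    # Example with h(x) = [1, #awesome, #awful] and the coefficients from a later slide:
    print(predict_probability(np.array([1.0, 0.5, -1.5]), np.array([1.0, 2.0, 1.0])))  # ≈ 0.62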



Quality metric for logistic regression:
Maximum likelihood estimation




P(y=+1 | x, ŵ) = 1 / (1 + e^(-ŵᵀh(x)))

[Block diagram: the Training Data provides inputs x and labels y; feature extraction turns x into h(x); the ML model uses coefficients ŵ to make predictions; the ML algorithm chooses ŵ by optimizing the quality metric.]
Learning problem
Training data:
N observations (xi,yi)
x[1] = #awesome   x[2] = #awful   y = sentiment
2                 1               +1
0                 2               -1
3                 3               -1
4                 1               +1
1                 1               +1
2                 4               -1
0                 3               -1
0                 1               -1
2                 1               +1

Optimize the quality metric on the training data to obtain ŵ.





Finding best coefficients

x[1] = #awesome   x[2] = #awful   y = sentiment
2                 1               +1
0                 2               -1
3                 3               -1
4                 1               +1
1                 1               +1
2                 4               -1
0                 3               -1
0                 1               -1
2                 1               +1



Finding best coefficients
Negative data points (y = -1):
x[1] = #awesome   x[2] = #awful   y = sentiment
0                 2               -1
3                 3               -1
2                 4               -1
0                 3               -1
0                 1               -1

Positive data points (y = +1):
x[1] = #awesome   x[2] = #awful   y = sentiment
2                 1               +1
4                 1               +1
1                 1               +1
2                 1               +1



Finding best coefficients
Negative data points (y = -1):
x[1] = #awesome   x[2] = #awful   y = sentiment
0                 2               -1
3                 3               -1
2                 4               -1
0                 3               -1
0                 1               -1

Positive data points (y = +1):
x[1] = #awesome   x[2] = #awful   y = sentiment
2                 1               +1
4                 1               +1
1                 1               +1
2                 1               +1

Want P(y=+1|xi,w) ≈ 0.0 for the negative data points and P(y=+1|xi,w) ≈ 1.0 for the positive data points.
Pick the ŵ that makes these probabilities as close to 0.0 and 1.0 as possible.


Quality metric = Likelihood function

For negative data points, want P(y=+1|xi,w) ≈ 0.0; for positive data points, want P(y=+1|xi,w) ≈ 1.0.


No ŵ achieves perfect predictions (usually)
Likelihood ℓ(w): Measures quality of
fit for model with coefficients w
Find “best” classifier
Maximize likelihood over all possible w0,w1,w2

ℓ(w0=0, w1=1,   w2=-1.5) = 10^-6
ℓ(w0=1, w1=1,   w2=-1.5) = 10^-5
ℓ(w0=1, w1=0.5, w2=-1.5) = 10^-4

Best model: highest likelihood ℓ(w)
ŵ = (w0=1, w1=0.5, w2=-1.5)

[Plot of the training data in the #awesome vs. #awful plane comparing the candidate classifiers.]
Data likelihood



Quality metric: probability of data

Positive example: x[1] = 2, x[2] = 1, y = +1
If the model is good, it should predict ŷ = +1.
Pick w to maximize: P(y=+1 | x[1]=2, x[2]=1, w)

Negative example: x[1] = 0, x[2] = 2, y = -1
If the model is good, it should predict ŷ = -1.
Pick w to maximize: P(y=-1 | x[1]=0, x[2]=2, w)


Maximizing likelihood
(probability of data)
Data point   x[1]   x[2]   y
x1,y1        2      1      +1
x2,y2        0      2      -1
x3,y3        3      3      -1
x4,y4        4      1      +1
x5,y5        1      1      +1
x6,y6        2      4      -1
x7,y7        0      3      -1
x8,y8        0      1      -1
x9,y9        2      1      +1

Choose w to maximize the probability of each observation.
Must combine these into a single measure of quality.
Learn logistic regression model with
maximum likelihood estimation (MLE)
Data point   x[1]   x[2]   y     Choose w to maximize
x1,y1        2      1      +1    P(y=+1 | x[1]=2, x[2]=1, w)
x2,y2        0      2      -1    P(y=-1 | x[1]=0, x[2]=2, w)
x3,y3        3      3      -1    P(y=-1 | x[1]=3, x[2]=3, w)
x4,y4        4      1      +1    P(y=+1 | x[1]=4, x[2]=1, w)
…

ℓ(w) = P(y1|x1,w) · P(y2|x2,w) · P(y3|x3,w) · P(y4|x4,w) · … = ∏_{i=1}^{N} P(yi | xi, w)
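
A minimal sketch (not from the slides) of computing this likelihood on the first four data points, with h(x) = [1, #awesome, #awful]:

    import numpy as np

    def data_likelihood(w, features, labels):
        """Product over data points of P(yi | xi, w) under the logistic regression model."""
        likelihood = 1.0
        for h_x, y in zip(features, labels):
            p_plus = 1.0 / (1.0 + np.exp(-np.dot(w, h_x)))   # P(y=+1 | xi, w)
            likelihood *= p_plus if y == +1 else (1.0 - p_plus)
        return likelihood

    features = np.array([[1, 2, 1], [1, 0, 2], [1, 3, 3], [1, 4, 1]], dtype=float)
    labels = np.array([+1, -1, -1, +1])
    print(data_likelihood(np.array([1.0, 0.5, -1.5]), features, labels))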


Finding best linear classifier
with gradient ascent




P(y=+1 | x, ŵ) = 1 / (1 + e^(-ŵᵀh(x)))

[Recap of the block diagram: Training Data → feature extraction h(x) → ML model with coefficients ŵ; the ML algorithm chooses ŵ by optimizing the quality metric.]


Find “best” classifier
Maximize likelihood over all possible w0,w1,w2
ℓ(w) = ∏_{i=1}^{N} P(yi | xi, w)

ℓ(w0=0, w1=1,   w2=-1.5) = 10^-6
ℓ(w0=1, w1=1,   w2=-1.5) = 10^-5
ℓ(w0=1, w1=0.5, w2=-1.5) = 10^-4

Best model: highest likelihood ℓ(w)
ŵ = (w0=1, w1=0.5, w2=-1.5)

[Plot of the training data in the #awesome vs. #awful plane comparing the candidate classifiers.]
Maximizing likelihood

Maximize the likelihood function over all possible w0, w1, w2:

     max_{w0,w1,w2}  ∏_{i=1}^{N} P(yi | xi, w)

ℓ(w0,w1,w2) is a function of 3 variables.
No closed-form solution ⇒ use gradient ascent.


Review of gradient ascent





Finding the max
via hill climbing

Algorithm:

  while not converged:
      w(t+1) ← w(t) + η · dℓ/dw |_(w(t))
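
As an illustration (not from the slides), a minimal 1-D hill-climbing sketch in Python, using a made-up concave objective:

    def gradient_ascent_1d(dl_dw, w0=0.0, eta=0.1, tol=1e-6, max_iter=10000):
        """Hill climbing on a 1-D objective: w <- w + eta * dl/dw until the derivative is tiny."""
        w = w0
        for _ in range(max_iter):
            grad = dl_dw(w)
            if abs(grad) < tol:      # converged: derivative is (nearly) zero
                break
            w = w + eta * grad       # step uphill
        return w

    # Example: maximize l(w) = -(w - 3)^2, whose derivative is -2(w - 3); maximum at w = 3.
    print(gradient_ascent_1d(lambda w: -2.0 * (w - 3.0)))   # ≈ 3.0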



Convergence criteria
For convex functions, the optimum occurs when the derivative dℓ/dw = 0.
In practice, stop when |dℓ/dw| < ε for a chosen tolerance ε.

Algorithm:

  while not converged:
      w(t+1) ← w(t) + η · dℓ/dw |_(w(t))



Moving to multiple dimensions:
Gradients

∇ℓ(w) = the vector of partial derivatives ( ∂ℓ/∂w0, ∂ℓ/∂w1, …, ∂ℓ/∂wD )



Contour plots



Gradient ascent

Algorithm:

  while not converged:
      w(t+1) ← w(t) + η ∇ℓ(w(t))





Learning algorithm for
logistic regression





Derivative of (log-)likelihood

     ∂ℓ(w)/∂wj = Σ_{i=1}^{N} hj(xi) · ( 1[yi = +1] - P(y=+1 | xi, w) )

The sum runs over the data points, hj(xi) is the feature value, and the term in parentheses is the difference between the truth and the prediction.

Indicator function: 1[yi = +1] equals 1 if yi = +1 and 0 otherwise.



Computing derivative

     ∂ℓ(w(t))/∂wj = Σ_{i=1}^{N} hj(xi) · ( 1[yi = +1] - P(y=+1 | xi, w(t)) )

Current coefficients: w0(t) = 0, w1(t) = 1, w2(t) = -2

x[1]   x[2]   y     P(y=+1|xi,w(t))   Contribution to derivative for w1
2      1      +1    0.5
0      2      -1    0.02
3      3      -1    0.05
4      1      +1    0.88

Total derivative:


Derivative of (log-)likelihood: Interpretation

     ∂ℓ(w)/∂wj = Σ_{i=1}^{N} hj(xi) · ( 1[yi = +1] - P(y=+1 | xi, w) )

The sum runs over the data points, hj(xi) is the feature value, and the term in parentheses is the difference between the truth and the prediction.

If hj(xi) = 1:
- yi = +1 and P(y=+1|xi,w) ≈ 1: contribution ≈ 0, no change needed
- yi = +1 and P(y=+1|xi,w) ≈ 0: contribution ≈ +1, push wj up to increase P(y=+1|xi,w)
- yi = -1 and P(y=+1|xi,w) ≈ 1: contribution ≈ -1, push wj down to decrease P(y=+1|xi,w)
- yi = -1 and P(y=+1|xi,w) ≈ 0: contribution ≈ 0, no change needed



Summary of gradient ascent
for logistic regression

init w(1) = 0 (or randomly, or smartly), t = 1

while ||∇ℓ(w(t))|| > ε:
    for j = 0, …, D:
        partial[j] = Σ_{i=1}^{N} hj(xi) · ( 1[yi = +1] - P(y=+1 | xi, w(t)) )
        wj(t+1) ← wj(t) + η · partial[j]
    t ← t + 1
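
A compact, runnable Python (NumPy) sketch of this algorithm, written for this summary rather than copied from the course materials; H holds one feature vector h(xi) = [1, #awesome, #awful] per row:

    import numpy as np

    def logistic_regression_gradient_ascent(H, y, eta=0.1, epsilon=1e-4, max_iter=1000):
        """Fit logistic regression coefficients by gradient ascent on the log-likelihood.

        H : (N, D+1) array, row i is the feature vector h(xi)
        y : length-N array of labels in {+1, -1}
        """
        w = np.zeros(H.shape[1])                     # init w = 0
        indicator = (y == +1).astype(float)          # 1[yi = +1]
        for _ in range(max_iter):
            p_plus = 1.0 / (1.0 + np.exp(-(H @ w)))  # P(y=+1 | xi, w) for every i
            gradient = H.T @ (indicator - p_plus)    # partial[j] for every j at once
            if np.linalg.norm(gradient) <= epsilon:  # convergence check
                break
            w = w + eta * gradient                   # gradient ascent step
        return w

    # Toy data from the slides; it is linearly separable, so the coefficients would keep
    # growing with more iterations and max_iter caps the run.
    H = np.array([[1, 2, 1], [1, 0, 2], [1, 3, 3], [1, 4, 1], [1, 1, 1],
                  [1, 2, 4], [1, 0, 3], [1, 0, 1], [1, 2, 1]], dtype=float)
    y = np.array([+1, -1, -1, +1, +1, -1, -1, -1, +1])
    print(logistic_regression_gradient_ascent(H, y))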





Choosing the step size η





Learning curve:
Plot quality (likelihood) over iterations



If step size is too small, can take a
long time to converge



Compare convergence with
different step sizes



Careful with step sizes that are too large



Very large step sizes can even
cause divergence or wild oscillations



Simple rule of thumb for picking step size η
•  Unfortunately, picking the step size requires a lot of trial and error
•  Try several values, exponentially spaced
   - Goal: plot learning curves to
     •  find one η that is too small (smooth but moving too slowly)
     •  find one η that is too large (oscillation or divergence)
   - Try values in between to find the “best” η
•  Advanced tip: can also try a step size that decreases with
   iterations, e.g., a decaying schedule (see the sketch below)
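
A small sketch (not from the slides) of both ideas: exponentially spaced candidate step sizes, and one common decreasing schedule, η_t = η_0 / t (the specific schedule intended on the slide is not shown, so this is only an example):

    # Exponentially spaced candidates to compare via learning curves.
    candidate_etas = [10.0 ** k for k in range(-4, 2)]      # 1e-4, 1e-3, ..., 10
    print(candidate_etas)

    def decreasing_eta(eta0, t):
        """One common decreasing schedule: eta_t = eta0 / t for t = 1, 2, 3, ..."""
        return eta0 / t

    print([decreasing_eta(0.5, t) for t in range(1, 6)])    # 0.5, 0.25, 0.167, 0.125, 0.1 (approximately)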





Deriving the gradient for
logistic regression

(VERY OPTIONAL)


Log-likelihood function
•  Goal: choose coefficients w maximizing the likelihood:

     ℓ(w) = ∏_{i=1}^{N} P(yi | xi, w)

•  Math is simplified by using the log-likelihood, i.e., taking the (natural) log:

     ℓℓ(w) = ln ∏_{i=1}^{N} P(yi | xi, w)



The log trick, often used in ML…
•  Products become sums: ln(a · b) = ln(a) + ln(b), so the log of the likelihood product becomes a sum of log-probabilities.
•  Doesn’t change the maximum, because ln is monotonically increasing:
   - If ŵ maximizes f(w): f(ŵ) ≥ f(w) for all w
   - Then ŵ maximizes ln(f(w)): ln(f(ŵ)) ≥ ln(f(w)) for all w




Expressing the log-likelihood

(VERY OPTIONAL)
Using log to turn products into sums

•  The log of the product of likelihoods becomes the sum of the logs:

     ℓℓ(w) = ln ∏_{i=1}^{N} P(yi | xi, w) = Σ_{i=1}^{N} ln P(yi | xi, w)



Rewriting log-likelihood
•  For simpler math, we’ll rewrite the likelihood with indicators:

     ℓℓ(w) = Σ_{i=1}^{N} ln P(yi | xi, w)
           = Σ_{i=1}^{N} [ 1[yi = +1] ln P(y=+1 | xi, w) + 1[yi = -1] ln P(y=-1 | xi, w) ]
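
A small Python sketch (not from the course) of this indicator form; for clarity it computes the probabilities directly rather than using a numerically safer formulation:

    import numpy as np

    def log_likelihood(w, H, y):
        """Sum of 1[yi=+1]*ln P(y=+1|xi,w) + 1[yi=-1]*ln P(y=-1|xi,w) over the data."""
        p_plus = 1.0 / (1.0 + np.exp(-(H @ w)))   # P(y=+1 | xi, w)
        indicator = (y == +1).astype(float)       # 1[yi = +1]
        return np.sum(indicator * np.log(p_plus) + (1.0 - indicator) * np.log(1.0 - p_plus))

    H = np.array([[1, 2, 1], [1, 0, 2], [1, 3, 3], [1, 4, 1]], dtype=float)
    y = np.array([+1, -1, -1, +1])
    print(log_likelihood(np.array([0.0, 1.0, -2.0]), H, y))   # roughly -0.89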





Deriving probability that y=-1 given x

(VERY OPTIONAL)
Logistic regression model: P(y=-1|x,w)
•  Probability model predicts y=+1:
     P(y=+1 | x, w) = 1 / (1 + e^(-wᵀh(x)))

•  Probability model predicts y=-1:
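
The slide leaves this as a worked derivation; since the two probabilities must sum to one, a short derivation (matching the expression quoted on the later “Plugging in logistic function” slide) is:

     P(y=-1 | x, w) = 1 - P(y=+1 | x, w)
                    = 1 - 1 / (1 + e^(-wᵀh(x)))
                    = e^(-wᵀh(x)) / (1 + e^(-wᵀh(x)))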





Rewriting the log-likelihood

(VERY OPTIONAL)
Plugging in logistic function for 1 data point
     P(y=+1 | x, w) = 1 / (1 + e^(-wᵀh(x)))          P(y=-1 | x, w) = e^(-wᵀh(x)) / (1 + e^(-wᵀh(x)))

     For a single data point (xi, yi):

     ℓℓ(w) = 1[yi = +1] ln P(y=+1 | xi, w) + 1[yi = -1] ln P(y=-1 | xi, w)
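
The slide works out this algebra by hand; substituting the two probabilities above and simplifying (the result matches the “Gradient for 1 data point” slide below) gives:

     ℓℓ(w) = 1[yi=+1] · ( -ln(1 + e^(-wᵀh(xi))) ) + (1 - 1[yi=+1]) · ( -wᵀh(xi) - ln(1 + e^(-wᵀh(xi))) )
           = -(1 - 1[yi = +1]) wᵀh(xi) - ln(1 + e^(-wᵀh(xi)))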





Deriving gradient of log-likelihood

(VERY OPTIONAL)
Gradient for 1 data point
     ℓℓ(w) = -(1 - 1[yi = +1]) wᵀh(xi) - ln(1 + e^(-wᵀh(xi)))



Finally, gradient for all data points
•  Gradient for one data point:

     hj(xi) · ( 1[yi = +1] - P(y=+1 | xi, w) )

•  Adding over data points:

     ∂ℓℓ(w)/∂wj = Σ_{i=1}^{N} hj(xi) · ( 1[yi = +1] - P(y=+1 | xi, w) )




Summary of logistic
regression classifier






P(y=+1 | x, ŵ) = 1 / (1 + e^(-ŵᵀh(x)))

[Recap of the block diagram: Training Data → feature extraction h(x) → ML model with coefficients ŵ; the ML algorithm chooses ŵ by optimizing the quality metric.]


What you can do now…
•  Measure quality of a classifier using the likelihood function
•  Interpret the likelihood function as the probability of the observed data
•  Learn a logistic regression model with gradient ascent
•  (Optional) Derive the gradient ascent update rule for logistic regression



Simplest link function: sign(z)

z ranges over (-∞, +∞).

     sign(z) = +1 if z ≥ 0
               -1 otherwise

But sign(z) only outputs -1 or +1, no probabilities in between.
Finding best coefficients

Negative data points (y = -1):
x[1] = #awesome   x[2] = #awful   y = sentiment
0                 2               -1
3                 3               -1
2                 4               -1
0                 3               -1
0                 1               -1

Positive data points (y = +1):
x[1] = #awesome   x[2] = #awful   y = sentiment
2                 1               +1
4                 1               +1
1                 1               +1
2                 1               +1

Score(xi) = ŵᵀh(xi) ranges from -∞ to +∞, and P(y=+1|xi,ŵ) ranges from 0.0 to 1.0.
Quality metric: probability of data

P(y=+1 | x, ŵ) = 1 / (1 + e^(-ŵᵀh(x)))

Positive example: x[1] = 2, x[2] = 1, y = +1
If the model is good, it should predict ŷ = +1.
Increase the probability that y=+1 when the observed label is +1:
choose w to make P(y=+1 | x[1]=2, x[2]=1, w) large.

Negative example: x[1] = 0, x[2] = 2, y = -1
If the model is good, it should predict ŷ = -1.
Increase the probability that y=-1 when the observed label is -1:
choose w to make P(y=-1 | x[1]=0, x[2]=2, w) large.


Maximizing likelihood
(probability of data)
Data point   x[1]   x[2]   y
x1,y1        2      1      +1
x2,y2        0      2      -1
x3,y3        3      3      -1
x4,y4        4      1      +1
x5,y5        1      1      +1
x6,y6        2      4      -1
x7,y7        0      3      -1
x8,y8        0      1      -1
x9,y9        2      1      +1

Choose w to maximize the probability of each observation.
Must combine these into a single measure of quality.
Learn logistic regression model with
maximum likelihood estimation (MLE)

•  Choose coefficients w that maximize the likelihood:

     ℓ(w) = ∏_{i=1}^{N} P(yi | xi, w)

•  No closed-form solution ⇒ use gradient ascent

