Ronan Collobert
ronan@collobert.com
Introduction:
[Diagram: x → W1 → tanh() → W2 → score]
Biological Neuron
Perceptron: Rosenblatt (1957)

[Figure: separating hyperplane w·x + b = 0]

• Input: retina x ∈ R^n
• Associative area: any kind of (fixed) function Φ(x) ∈ R^d
• Decision function:
  f(x) = 1 if w·Φ(x) > 0, −1 otherwise
Training: given examples (x^t, y^t) ∈ R^d × {−1, 1}, minimize
  Σ_t max(0, −y^t w·Φ(x^t))

Stochastic update:
  w^{t+1} = w^t + y^t Φ(x^t)   if y^t w^t·Φ(x^t) ≤ 0
  w^{t+1} = w^t                otherwise
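As a concrete illustration, here is a minimal sketch of this update rule in Python/NumPy. The feature map `phi`, the toy data and the number of epochs are assumptions of the example, not part of the slides:

```python
import numpy as np

def train_perceptron(X, y, phi=lambda x: x, epochs=10):
    """Perceptron: w <- w + y^t phi(x^t) whenever y^t w . phi(x^t) <= 0."""
    w = np.zeros_like(phi(X[0]), dtype=float)
    for _ in range(epochs):
        for xt, yt in zip(X, y):
            if yt * np.dot(w, phi(xt)) <= 0:   # mistake (or on the boundary)
                w = w + yt * phi(xt)
    return w

# toy separable data, labels in {-1, +1}
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = train_perceptron(X, y)
print(w, np.sign(X @ w))
```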
Perceptron: Convergence

Assuming the classes are separable, u defines the maximum-margin separating hyperplane, with margin ρ_max = 2/||u||.

[Figure: maximum-margin hyperplane defined by u]

When we make a mistake:
  u·w^t = u·w^{t−1} + y^t u·x^t ≥ u·w^{t−1} + 1
  ||w^t||² ≤ ||w^{t−1}||² + R²,   with R = max_t ||x^t||

Since u·w^t ≤ ||u|| ||w^t||, the number of updates t is bounded:
  t ≤ 4 R² / ρ_max²
Adaline:

The Perceptron has two shortcomings:
• Separable case: it does not find a hyperplane equidistant from the two classes
• Non-separable case: it does not converge

Adaline (Widrow & Hoff, 1960) minimizes
  (1/2) Σ_t (y^t − w·x^t)²

Delta rule (stochastic gradient, learning rate λ):
  w^{t+1} = w^t + λ (y^t − w^t·x^t) x^t
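A corresponding sketch of the delta rule, to contrast with the Perceptron code above (the learning rate `lr` and epoch count are assumed hyper-parameters):

```python
import numpy as np

def train_adaline(X, y, lr=0.01, epochs=100):
    """Adaline delta rule: w <- w + lr * (y^t - w . x^t) x^t."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xt, yt in zip(X, y):
            w = w + lr * (yt - np.dot(w, xt)) * xt   # squared-error gradient step
    return w
```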
Perceptron: Margin

The Perceptron stops as soon as every example is well classified; the margin of its solution w^T is not controlled. To enforce one, minimize instead
  Σ_t max(0, 1 − y^t w·Φ(x^t))

with the update
  w^{t+1} = w^t + y^t Φ(x^t)   if y^t w^t·Φ(x^t) ≤ 1
  w^{t+1} = w^t                otherwise

Finite number of updates:
  t ≤ 2 (1 + R²) / ρ_max²
Control on the margin:
  margin ≥ ρ_max / (2 (1 + R²))
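Relative to the Perceptron sketch above, only the update condition changes; a minimal self-contained version (same assumed `phi` and epoch count):

```python
import numpy as np

def train_margin_perceptron(X, y, phi=lambda x: x, epochs=10):
    """Same as the Perceptron, but keep updating while the margin is below 1."""
    w = np.zeros_like(phi(X[0]), dtype=float)
    for _ in range(epochs):
        for xt, yt in zip(X, y):
            if yt * np.dot(w, phi(xt)) <= 1:   # margin condition instead of <= 0
                w = w + yt * phi(xt)
    return w
```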
Perceptron: In Practice

[Figure: decision boundary of the original Perceptron after 10/40/60 iterations]
Regularization
In many machine-learning algorithms (including SVMs!), early stopping (on a validation set) is a good idea.
Going Non-Linear: Φ (1/2)

[Figure: data in R² mapped to R³ by Φ(x) = (x1², √2 x1 x2, x2²)]
Going Non-Linear: Φ (2/2)

E.g., for Φ(x) = Φ(x1, x2) = (x1², √2 x1 x2, x2²), a possible kernel is
  K(x, x') = (x·x')²

Mercer's condition: K(·, ·) is a kernel if, for any g such that ∫ g(x)² dx < ∞,
  ∫∫ K(x, x') g(x) g(x') dx dx' ≥ 0
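A quick numerical check (a sketch, not from the slides) that this kernel computes the dot product in feature space, i.e. K(x, x') = Φ(x)·Φ(x'):

```python
import numpy as np

def phi(x):
    """Feature map Phi(x1, x2) = (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def K(x, xp):
    """Polynomial kernel K(x, x') = (x . x')^2."""
    return np.dot(x, xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(x), phi(xp)), K(x, xp))  # both equal (x . x')^2 = 1.0
```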
Going Non-Linear: Adding Layers (1/2)
Going Non-Linear: Adding Layers (2/2)

Multi-Layer Perceptron:
  [Diagram: x → W1 → tanh() → W2 → score]

Universal approximation: any continuous function g : R^d → R can be approximated (on a compact) by a two-layer neural network
  [Diagram: x → W1 → tanh() → W2 → score]

Cybenko (1989) used sigmoid transfer functions.
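A minimal sketch of the two-layer network above (the hidden size and the random initialization are assumptions of the example):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_hidden = 3, 10                      # input and hidden dimensions (assumed)
W1 = rng.normal(scale=0.1, size=(n_hidden, d))
W2 = rng.normal(scale=0.1, size=(1, n_hidden))

def score(x):
    """Two-layer MLP: x -> W1 -> tanh -> W2 -> score."""
    return W2 @ np.tanh(W1 @ x)

print(score(np.ones(d)))
```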
Gradient Descent (1/4)

Training cost over T examples:
  C(θ) = Σ_{t=1}^T c(f(x^t), y^t)

Gradient descent follows ∂C(θ)/∂θ; the stochastic version follows ∂c(f(x^t), y^t)/∂θ, one example at a time.
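A sketch of the stochastic version; `grad_c` and the learning rate `lr` are placeholders (assumptions of this example) for the per-example gradient and the step size:

```python
import numpy as np

def sgd(theta, data, grad_c, lr=0.01, epochs=10):
    """Stochastic gradient descent on C(theta) = sum_t c(f(x^t), y^t)."""
    for _ in range(epochs):
        for xt, yt in data:
            theta = theta - lr * grad_c(theta, xt, yt)   # one example at a time
    return theta

# example: linear model with squared loss c = 1/2 (y - theta . x)^2
grad_c = lambda theta, x, y: -(y - theta @ x) * x
data = [(np.array([1.0, 0.0]), 1.0), (np.array([0.0, 1.0]), -1.0)]
print(sgd(np.zeros(2), data, grad_c, lr=0.1, epochs=50))
```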
Gradient Descent: Learning Rate (2/4)
Gradient Descent: Caveats (3/4)

[Diagram: x → w1 → tanh() → w2 → logistic loss log(1 + e^{−y·score})]
Gradient Descent (4/4)

Second-order Taylor expansion of the cost around θ:
  C(θ + ε) ≈ C(θ) + εᵀ ∂C(θ)/∂θ + (1/2) εᵀ H(θ) ε

where H(θ) is the Hessian. The curvature terms ∂²C/∂θ_k² indicate how large a step can safely be taken along each parameter θ_k.
Gradient Backpropagation (1/2)

[Diagram: stack of modules f1() → f2() → f3() → f4()]

How to compute ∂f_l/∂w ?

Example with two modules:
• f1(x) = w1 x
• f2(f1) = (1/2) (y − f1)²

Chain rule:
  ∂f2/∂w1 = (∂f2/∂f1) · (∂f1/∂w1) = −(y − f1) · x
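A numerical sanity check of that chain-rule computation (the values are arbitrary; finite differences are used only to verify the analytic gradient):

```python
import numpy as np

x, y, w1 = 2.0, 0.5, 0.3

def f2(w):
    f1 = w * x                      # f1(x) = w1 x
    return 0.5 * (y - f1) ** 2      # f2(f1) = 1/2 (y - f1)^2

analytic = -(y - w1 * x) * x        # chain rule: df2/dw1 = -(y - f1) * x
eps = 1e-6
numeric = (f2(w1 + eps) - f2(w1 - eps)) / (2 * eps)
print(analytic, numeric)            # both ~ 0.2
```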
Gradient Backpropagation (2/2)

[Diagram: x → f1() → f2() → f3() → f4()]

Brutal way: for every weight w_l, expand the full chain
  ∂f/∂w_l = (∂f/∂f_L) (∂f_L/∂f_{L−1}) ⋯ (∂f_{l+1}/∂f_l) (∂f_l/∂w_l)

Backpropagation: reuse the gradients already computed for the layer above
  ∂f/∂f_{l−1} = (∂f/∂f_l) (∂f_l/∂f_{l−1})
  ∂f/∂w_l = (∂f/∂f_l) (∂f_l/∂w_l)

Often, gradients are efficiently computed using the outputs of the module: do a forward pass before each backward pass.
Examples Of Modules

For simplicity, write ∂C/∂y for the gradient of the cost with respect to the module output y, coming from the module above.

Module     | Forward              | Backward w.r.t. input      | Backward w.r.t. parameters
Linear     | y = W x              | Wᵀ (∂C/∂y)                 | (∂C/∂y) xᵀ
MSE Loss   | y = (1/2)(x − z)²    | (x − z)                    |
Tanh       | y = tanh(x)          | (1 − y²) · (∂C/∂y)         |
Sigmoid    | y = 1/(1 + e^{−x})   | y (1 − y) · (∂C/∂y)        |
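A sketch of these modules as forward/backward pairs in NumPy; the module interface itself is an assumption of the example, the gradients follow the table:

```python
import numpy as np

class Linear:
    def __init__(self, W):
        self.W = W
    def forward(self, x):
        self.x = x                          # kept for the backward pass
        return self.W @ x
    def backward(self, dy):                 # dy = dC/dy
        self.dW = np.outer(dy, self.x)      # dC/dW = dy x^T
        return self.W.T @ dy                # dC/dx = W^T dy

class Tanh:
    def forward(self, x):
        self.y = np.tanh(x)
        return self.y
    def backward(self, dy):
        return (1 - self.y ** 2) * dy       # dC/dx = (1 - y^2) dy

class MSELoss:
    def forward(self, x, z):
        self.diff = x - z
        return 0.5 * np.sum(self.diff ** 2)
    def backward(self):
        return self.diff                    # dC/dx = (x - z)
```

Storing x and y during forward is exactly why a forward pass is done before each backward pass, as noted above.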
Log-Likelihood (1/2)

Maximize the likelihood of the training data, i.e., minimize
  −log Π_{t=1}^T p(y^t | x^t) = −Σ_{t=1}^T log p(y^t | x^t)

Interpret the network scores f_i(x) through a softmax:
  p(y | x) = e^{f_y(x)} / Σ_i e^{f_i(x)}

so that
  log p(y | x) = f_y(x) − log Σ_i e^{f_i(x)}
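A sketch of this criterion with the usual log-sum-exp trick for numerical stability (the trick is an assumption of the example, not stated on the slide):

```python
import numpy as np

def log_softmax(f):
    """log p(i|x) = f_i(x) - log sum_j exp(f_j(x)), computed stably."""
    m = np.max(f)
    return f - (m + np.log(np.sum(np.exp(f - m))))

def nll(f, y):
    """Negative log-likelihood -log p(y|x) for scores f and true class y."""
    return -log_softmax(f)[y]

print(nll(np.array([1.0, 2.0, -1.0]), y=1))
```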
Log-Likelihood (2/2)

Two-class case (y ∈ {−1, 1}):
  p(1 | x) = e^{f_1(x)} / (e^{f_1(x)} + e^{f_{−1}(x)}),   p(−1 | x) = e^{f_{−1}(x)} / (e^{f_1(x)} + e^{f_{−1}(x)})

so that
  −log p(y | x) = log(1 + e^{−y (f_1(x) − f_{−1}(x))})

[Figure: the logistic loss z ↦ log(1 + e^{−z})]
Assume y | x ∼ N(f(x), σ²). In this case,
  −log p(y | x) = (1/(2σ²)) ||y − f(x)||² + const

i.e., maximizing the likelihood amounts to minimizing the mean squared error.
Unsupervised Training (1/2)

[Diagram: x → W1 → tanh() → W2 → tanh() → W3]

Caveats:
Unsupervised Training (2/2)

[Diagram: x → W1 → tanh() → W2 → tanh() → W3]

Possible improvements:
• No separate W3 layer: tie the weights, W3 = [W1]ᵀ (Bengio et al., 2006)
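A minimal sketch of unsupervised training with tied weights, assuming the network above is used to reconstruct its input; the single hidden layer, the MSE reconstruction objective, the biases and the dimensions are all assumptions of this example:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 20, 5
W1 = rng.normal(scale=0.1, size=(h, d))   # encoder weights
b1, b2 = np.zeros(h), np.zeros(d)

def reconstruction_loss(x, W1, b1, b2):
    """Encode with W1, decode with the tied weights W1^T, compare to x."""
    code = np.tanh(W1 @ x + b1)
    recon = W1.T @ code + b2               # tied decoder: W3 = W1^T
    return 0.5 * np.sum((recon - x) ** 2)

x = rng.normal(size=d)
print(reconstruction_loss(x, W1, b1, b2))
```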
Specialized Layers: RBF

[Diagram: x → RBF(W1) → W2 → score]

  f_{1,i}(x) = exp( −||x − W1_{·,i}||² / (2σ²) )

Gradient is zero if the W1 columns are far from the training examples.
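A sketch of the RBF layer's forward pass (`sigma` is an assumed hyper-parameter of the example):

```python
import numpy as np

def rbf_forward(x, W1, sigma=1.0):
    """f_{1,i}(x) = exp( -||x - W1[:, i]||^2 / (2 sigma^2) ) for each column i."""
    sq_dist = np.sum((W1 - x[:, None]) ** 2, axis=0)   # ||x - W1[:, i]||^2
    return np.exp(-sq_dist / (2 * sigma ** 2))

print(rbf_forward(np.array([1.0, 0.0, 0.0]), np.eye(3)))
```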
Specialized Layers: 1D Convolutions

[Diagram: x → Conv 1D → tanh() → Subsampl. 1D → Conv 1D → tanh() → W2]

  X = (X^1, X^2, …)                                  input (matrix, one column per time step)
  W·[X^1; X^2], W·[X^2; X^3], W·[X^3; X^4], …        convolution (local embedding for each input column)

Robustness to time shifts:
• Apply sub-sampling (as a convolution, but each W_{·,i} contains a single value)

Also called Time Delay Neural Networks (TDNNs)
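A sketch of a 1D convolution over the columns of X, followed by sub-sampling; the window size of 2 and the use of plain averaging as a stand-in for the slide's sub-sampling are assumptions of this example:

```python
import numpy as np

def conv1d(X, W):
    """Apply the same W to each pair of consecutive columns of X."""
    d, T = X.shape
    windows = [np.concatenate([X[:, t], X[:, t + 1]]) for t in range(T - 1)]
    return np.stack([W @ w for w in windows], axis=1)   # one embedding per position

def subsample1d(Y, size=2):
    """Sub-sampling over time: here, average groups of `size` consecutive columns."""
    T = Y.shape[1] // size * size
    return Y[:, :T].reshape(Y.shape[0], -1, size).mean(axis=2)
```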
Specialized Layers: 2D Convolutions

[Diagram: 2D convolutional network with layers W1, W2, W3]
Specialized Training: Non-Linear CRF (1/2)

[Diagram: trellis over the input columns x_{·,1} … x_{·,6}, with one score per class (1, 2, 3, 4, …) at each position]

Sentence score for a class label path [i]_1^T:
  s([x]_1^T, [i]_1^T, θ) = Σ_{t=1}^T ( A_{[i]_{t−1},[i]_t} + f([x]_1^T, [i]_t, t, θ) )

where A contains the transition scores between classes. Sentence-level log-likelihood:
  log p([y]_1^T | [x]_1^T, θ) = s([x]_1^T, [y]_1^T, θ) − logadd_{[j]_1^T} s([x]_1^T, [j]_1^T, θ)
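A sketch of the sentence score for one label path, assuming a matrix `f` of per-position network scores (shape T × n_classes) and a transition matrix `A`; the handling of the initial position (no start transition) is a simplification of this example:

```python
import numpy as np

def path_score(f, A, path):
    """s([x], [i], theta) = sum_t ( A[i_{t-1}, i_t] + f[t, i_t] )."""
    s = f[0, path[0]]                         # no start transition in this sketch
    for t in range(1, len(path)):
        s += A[path[t - 1], path[t]] + f[t, path[t]]
    return s
```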
Specialized Training: Non-Linear CRF (2/2)

The logadd over all paths is computed in O(T) with the recursion
  δ_t(j) = logadd_i ( δ_{t−1}(i) + A_{i,j} ) + f([x]_1^T, j, t, θ)

Termination:
  logadd_{[j]_1^T} s([x]_1^T, [j]_1^T, θ) = logadd_i δ_T(i)
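And a sketch of that recursion, using the same assumed `f` and `A` as above; `np.logaddexp` implements the logadd stably:

```python
import numpy as np

def log_add_all_paths(f, A):
    """logadd over all label paths of s([x], [j], theta), via
    delta_t(j) = logadd_i( delta_{t-1}(i) + A[i, j] ) + f[t, j]."""
    delta = f[0].copy()                              # initialization (no start transition)
    for t in range(1, f.shape[0]):
        delta = np.logaddexp.reduce(delta[:, None] + A, axis=0) + f[t]
    return np.logaddexp.reduce(delta)                # termination: logadd_i delta_T(i)
```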