
Lecture 12

EE531 Statistical Learning Theory

Contents
  Introduction
    Artificial neuron model
    Feed-forward, recurrent network
  Neural network: Activation function
  Example
    Linear model for regression/classification
  Learning
    Back-propagation
    General comments
  Restricted Boltzmann Machine (RBM)
  Summary


Introduction(1)
A neural network is a non-linear model for classification/regression. One strategy is to build networks of units, i.e., simple parameterized functions. This approach was inspired by the brain, but it has been fairly clearly demonstrated not to be what the brain actually does.


Introduction(2)
Feed-forward network: data flows from input to output with no cycles.

[Figure: a feed-forward network with an input layer, two hidden layers, and an output layer producing the outputs]

Recurrent network: allows backward, forward, and even within-layer links. Such a network has memory and may be revisited later.
We will consider only feed-forward networks.


Introduction(3)
Sometimes the activation function for output units is the same as for hidden units.
What if we let the activation g be the identity? Then we have a linear combination of linear combinations, which remains linear.
Could use binary threshold units: g(a) = 1 if a >= 0, and 0 otherwise.
If inputs are also in {0,1}, then you can implement any Boolean function this way: neurons act as logic gates (see the sketch below).
If inputs are real-valued, you can make a 3-layer network that gets arbitrarily close to any decision boundary.
The problem with threshold units is that no one knows a good way to set the weights based on data; in fact, it is NP-hard.
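To make the "neurons as logic gates" point concrete, here is a minimal Python sketch (not from the slides; the weights and biases are hand-picked for illustration) of a binary threshold unit computing AND and OR:

    import numpy as np

    def threshold_unit(x, w, b):
        # Binary threshold unit: outputs 1 when w . x + b >= 0, else 0.
        return int(np.dot(w, x) + b >= 0)

    # AND gate: fires only when both inputs are 1.
    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print("AND", x, threshold_unit(np.array(x), w=np.array([1.0, 1.0]), b=-1.5))

    # OR gate: fires when at least one input is 1.
    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print("OR ", x, threshold_unit(np.array(x), w=np.array([1.0, 1.0]), b=-0.5))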

Neural Network: Activation function(1)
Activation function: Sigmoidal Units

[Figure: plots of g(a) versus a for two activation functions; the log-sigmoid saturates at 0 and 1, and the hyperbolic tangent sigmoid saturates at -1 and 1]

The hyperbolic tangent sigmoid, g(a) = tanh(a), is sometimes better to use than the log-sigmoid, g(a) = 1 / (1 + e^{-a}), because it can be trained more efficiently.
Use a linear output unit for regression, to get the full output range; a sigmoid output can be used for classification.
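For reference, a minimal Python/NumPy sketch (added here, not part of the slides) of the two sigmoidal activations and the derivatives that back-propagation will need later:

    import numpy as np

    def log_sigmoid(a):
        # Log-sigmoid: maps any real a into (0, 1).
        return 1.0 / (1.0 + np.exp(-a))

    def log_sigmoid_grad(a):
        s = log_sigmoid(a)
        return s * (1.0 - s)

    def tanh_sigmoid(a):
        # Hyperbolic tangent sigmoid: maps any real a into (-1, 1).
        return np.tanh(a)

    def tanh_sigmoid_grad(a):
        return 1.0 - np.tanh(a) ** 2

    a = np.linspace(-5, 5, 11)
    print(log_sigmoid(a))
    print(tanh_sigmoid(a))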

Neural Network: Activation function(2)
A sigmoid unit is approximately linear for small weights and approximately a threshold unit for large ones.
Two-layer sigmoid networks can approximate an arbitrary function (but may require lots of units!).
Much more complex topologies can be made; they are mostly useful if they help you build in bias.


Example(1)
The linear model for regression/classification is based on linear combinations of fixed nonlinear basis functions \phi_j(x) and takes the form

    y(x, w) = f\Big( \sum_{j=1}^{M} w_j \phi_j(x) \Big),

where f(\cdot) is a nonlinear activation function for classification and the identity for regression.
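A minimal Python/NumPy sketch (not from the slides) of such a model for regression, assuming Gaussian basis functions; since f is the identity here, the weights can be fit by least squares:

    import numpy as np

    def gaussian_basis(x, centers, width=1.0):
        # Fixed nonlinear basis functions phi_j(x): one Gaussian bump per center.
        return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * width ** 2))

    rng = np.random.default_rng(0)
    x = np.linspace(-3, 3, 50)
    t = np.sin(x) + 0.1 * rng.standard_normal(50)    # noisy regression targets
    Phi = gaussian_basis(x, np.linspace(-3, 3, 9))   # design matrix, shape (50, 9)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)      # least-squares weights
    y = Phi @ w                                      # y(x, w) = sum_j w_j phi_j(x)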


Example(2)
For standard regression problems, the activation function of the output unit is the identity, so that y_k = a_k.
For multiple binary classification problems, it is the logistic sigmoid, y_k = \sigma(a_k) = 1 / (1 + e^{-a_k}).
For multiclass problems, it is the softmax activation function, y_k = \exp(a_k) / \sum_j \exp(a_j).
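As a small illustration (added, not from the slides), a numerically stable softmax for the multiclass case:

    import numpy as np

    def softmax(a):
        # Softmax output activation: y_k = exp(a_k) / sum_j exp(a_j).
        a = a - np.max(a)      # shift for numerical stability; the result is unchanged
        e = np.exp(a)
        return e / e.sum()

    print(softmax(np.array([2.0, 1.0, 0.1])))   # sums to 1; largest logit gets largest probability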


Learning: Back propagation


Sigmoids are differentiable, so the whole network represents a differentiable function of all the weights. Set the weights to minimize an error function on the data, using gradient descent (or fancier gradient-based searches, such as conjugate gradient).
Error back-propagation
Compute gradient updates locally in the network.
Assume every unit has a fixed extra input with +1 activation (to absorb the bias).
General unit: a_j = \sum_i w_{ji} z_i, with output z_j = g(a_j).
The z's can be inputs, hidden units, or output units.


Learning: Back propagation(1)

[Figure: the feed-forward network from before, with input layer, hidden layers 1 and 2, and output layer]
Step 1: Apply an input vector x to the network and forward propagate through the network, using a_j = \sum_i w_{ji} z_i and z_j = g(a_j), to find the activations of all the hidden and output units.


Learning: Back propagation(2)

Step 2: Evaluate the errors \delta_k = y_k - t_k for all the output units.


Learning: Back propagation(3)

Step 3: Backpropagate the \delta's to the hidden units using \delta_j = g'(a_j) \sum_k w_{kj} \delta_k.


Learning: Back propagation(4)

Step 4: Evaluate the derivative of the error with respect to a weight: \partial E_n / \partial w_{ji} = \delta_j z_i.


Learning: Back propagation(5)

Step 5: Update the weights with gradient descent: w_{ji} \leftarrow w_{ji} - \eta \, \partial E_n / \partial w_{ji}. (A sketch combining Steps 1-5 follows below.)
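Putting the five steps together, here is a minimal NumPy sketch (illustrative sizes and names, not from the slides) of one stochastic-gradient step for a network with one hidden layer of tanh units, linear outputs, and a sum-of-squares error:

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy network: 3 inputs -> 5 tanh hidden units -> 2 linear outputs.
    W1 = 0.1 * rng.standard_normal((5, 3)); b1 = np.zeros(5)
    W2 = 0.1 * rng.standard_normal((2, 5)); b2 = np.zeros(2)
    eta = 0.1                                   # learning rate

    x = rng.standard_normal(3)                  # one training input
    t = rng.standard_normal(2)                  # its target

    # Step 1: forward propagate (a_j = sum_i w_ji z_i, z_j = g(a_j)).
    a1 = W1 @ x + b1
    z1 = np.tanh(a1)
    y = W2 @ z1 + b2                            # linear output units

    # Step 2: output errors, delta_k = y_k - t_k (sum-of-squares error).
    delta2 = y - t

    # Step 3: backpropagate, delta_j = g'(a_j) * sum_k w_kj delta_k.
    delta1 = (1.0 - np.tanh(a1) ** 2) * (W2.T @ delta2)

    # Step 4: derivatives, dE/dw_ji = delta_j * z_i.
    dW2 = np.outer(delta2, z1); db2 = delta2
    dW1 = np.outer(delta1, x);  db1 = delta1

    # Step 5: gradient-descent update.
    W2 -= eta * dW2; b2 -= eta * db2
    W1 -= eta * dW1; b1 -= eta * db1

In practice the same step is repeated over many examples (or mini-batches) and many epochs.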


General comments
Pragmatics
Training can be very sensitive to initial conditions:
  starting at zero: the network never moves;
  big weights: saturation, so the network never moves;
  therefore, start with small random weights.
Overfitting avoidance:
  stop training early (use a validation set to decide when); this keeps the weights from getting too big;
  weight decay: penalize the sum of squared weights by adding (\lambda/2) \sum_i w_i^2 to the error function.
Standardize inputs (this makes it easier to pick reasonable starting weights); see the sketch below.
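A small Python sketch (added for illustration; the sizes and the value of lam are arbitrary) of these pragmatics: standardizing inputs, starting from small random weights, and adding a weight-decay term to the gradient:

    import numpy as np

    rng = np.random.default_rng(0)

    # Standardize inputs: zero mean, unit variance per feature.
    X = rng.standard_normal((100, 3)) * 5.0 + 2.0        # raw inputs (illustrative)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # Start with small random weights, so sigmoid units begin in their near-linear region.
    W = 0.01 * rng.standard_normal((5, 3))

    # Weight decay: adding (lam/2) * sum(W**2) to the error adds lam * W to its gradient.
    lam = 1e-3
    grad_E = rng.standard_normal(W.shape)                # stand-in for the back-propagated gradient
    grad_total = grad_E + lam * W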


General comments
Don't use 0 and 1 as targets if you have a sigmoid output unit: reaching them exactly requires the weights to go to infinity.
Local optima: there is no guarantee that any given run will converge to anything, let alone the global optimum. Start multiple times from different initial conditions and use the apparent best, or vote the outcomes.
Learning rate: it should decrease over time, and there are methods for adapting the learning rate (see the sketch below). Still, back-propagation is slow. Conjugate gradient is usually better (but hairier to implement).
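One simple example (an assumption for illustration, not from the slides) of a learning rate that decreases over time is a 1/t schedule:

    # A simple 1/t learning-rate schedule (eta0 and tau are illustrative values).
    eta0, tau = 0.5, 100.0

    def learning_rate(t):
        # Learning rate that decreases over time: eta_t = eta0 / (1 + t / tau).
        return eta0 / (1.0 + t / tau)

    print([round(learning_rate(t), 3) for t in (0, 10, 100, 1000)])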


BB-RBM Structure
Undirected bipartite graphical model
Generative model
Special case of a product of experts
v in {0,1}^D : visible units
h in {0,1}^F : hidden units
W : connection weights between v and h
Model parameters: \theta = {W, b, c}, where b and c are the biases of the visible and hidden layers
In this case the energy and the joint distribution are

    E(v, h) = -b^\top v - c^\top h - v^\top W h,
    p(v, h) = \exp(-E(v, h)) / Z,   with   Z = \sum_{v, h} \exp(-E(v, h)).


BB-RBM Inference
Given the visible layer v, all the hidden units are conditionally independent, and vice versa: e.g.,

    p(h_j = 1 | v) = \sigma( c_j + \sum_i W_{ij} v_i ),
    p(v_i = 1 | h) = \sigma( b_i + \sum_j W_{ij} h_j ),

where \sigma is the logistic sigmoid. The hidden activations p(h | v) can be used as features.
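A minimal Python/NumPy sketch (not from the slides) of these conditionals for a BB-RBM, assuming the notation above with W of size D x F, visible bias b, and hidden bias c:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def p_h_given_v(v, W, c):
        # BB-RBM: p(h_j = 1 | v) = sigmoid(c_j + sum_i W_ij v_i), for all j at once.
        return sigmoid(c + v @ W)

    def p_v_given_h(h, W, b):
        # BB-RBM: p(v_i = 1 | h) = sigmoid(b_i + sum_j W_ij h_j).
        return sigmoid(b + h @ W.T)

    rng = np.random.default_rng(0)
    D, F = 6, 4                                   # visible / hidden sizes (illustrative)
    W = 0.1 * rng.standard_normal((D, F))
    b, c = np.zeros(D), np.zeros(F)
    v = rng.integers(0, 2, size=D).astype(float)  # a binary visible vector
    features = p_h_given_v(v, W, c)               # hidden activations used as features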


GB-RBM Structure
Undirected bipartite graphical model
v in R^D : real-valued (Gaussian) visible units
h in {0,1}^F : hidden units
W : connection weights between v and h
Model parameters: \theta = {W, b, c}, where b and c are the biases of the visible and hidden layers
In this case (assuming unit-variance visible units) the energy and joint distribution are

    E(v, h) = \frac{1}{2} \| v - b \|^2 - c^\top h - v^\top W h,
    p(v, h) = \exp(-E(v, h)) / Z.


GB-RBM Inference
Given the visible layer v, all the hidden units are conditionally independent, and vice versa:

    p(h_j = 1 | v) = \sigma( c_j + \sum_i W_{ij} v_i ),
    p(v_i | h) = \mathcal{N}( v_i ; b_i + \sum_j W_{ij} h_j, 1 ).

The hidden activations p(h | v) can be used as features.



RBM Training
Maximum likelihood learning
Objective: maximize the log-likelihood of the given training data,

    L(\theta) = \sum_n \log p(v^{(n)}; \theta),   where   p(v; \theta) = \frac{1}{Z} \sum_h \exp(-E(v, h)).

Use a gradient method to update the parameters.
Computing the exact gradient is intractable, hence it is common to use a stochastic gradient method with a sampling-based approximation, e.g., contrastive divergence.
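For reference, the standard form of this gradient for the weights of a binary-binary RBM (a well-known result stated here for completeness; the slide's own expression is not recoverable) is

    \frac{\partial \log p(v)}{\partial W_{ij}}
      = \mathbb{E}_{\text{data}}[ v_i h_j ] - \mathbb{E}_{\text{model}}[ v_i h_j ],

where the data expectation uses p(h | v) on the training examples (tractable) and the model expectation is over the joint p(v, h) (intractable), which is the term the sampling-based approximation targets.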


Summary
  Introduction
    Artificial neuron model
    Feed-forward, recurrent network
  Neural network: Activation function
  Example
    Linear model for regression/classification
  Learning
    Back-propagation
    General comments


Appendix : RBM Training


Model distribution: p(v, h; \theta) = \exp(-E(v, h)) / Z.
Find the parameters \theta that maximize \sum_n \log p(v^{(n)}; \theta).
Need to compute expectations under the data distribution (the posterior of h given v) and under the model distribution p(v, h), and the gradient w.r.t. the parameters, e.g.,

    \partial \log p(v) / \partial W_{ij} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}.


Appendix : RBM Training


However, computing \langle v_i h_j \rangle_{\text{data}} given the training data is tractable, but \langle v_i h_j \rangle_{\text{model}} is computationally intractable.
Replace the average over all possible inputs by samples.
Run a Markov chain Monte Carlo method (Gibbs sampling) to approximate the second term.


Appendix : RBM Training


If we run the Gibbs chain to equilibrium (k \to \infty), we can generate samples from the model distribution, but this is also computationally intractable.
[Figure: Gibbs chain alternating between v and h, converging to the equilibrium distribution]
[G. Hinton et al., 2006]


Appendix : Contrastive Divergence


Equivalence between ML and minimizing KL
If k \to \infty, the samples from the Markov chain converge to samples from the model distribution, and the bias goes away.
The objective is to learn the parameters of the model (RBM) so that it represents the distribution of the given data well.
Learning rule revisited:
Maximizing the log-probability of the training data is equivalent to minimizing the KL divergence between the empirical distribution, p_0, and the model distribution, p_\infty.
[G. Hinton et al., 2006]



Appendix : Contrastive Divergence


The Kullback-Leibler divergence between p_0 (the data) and p_\infty (the distribution produced by prolonged Gibbs sampling from the generative model) is

    KL(p_0 \| p_\infty) = \sum_v p_0(v) \log \frac{p_0(v)}{p_\infty(v)}.

Therefore, maximizing the log-likelihood can be accomplished by minimizing this KL divergence.
[G. Hinton et al., 2006]



Appendix : Contrastive Divergence


When using k-step Gibbs sampling, the model distribution is approximated by p_k, which causes an error given by the difference between the true model distribution and the approximation.
So CD actually does not minimize KL(p_0 \| p_\infty).
Instead it minimizes the so-called contrastive divergence, given by

    CD_k = KL(p_0 \| p_\infty) - KL(p_k \| p_\infty).

Therefore the updates can be written as

    \Delta W_{ij} \propto \langle v_i h_j \rangle_{p_0} - \langle v_i h_j \rangle_{p_k}.

[G. Hinton et al., 2006]



Appendix : Contrastive Divergence


Contrastive Divergence (CD-1) procedure (see the sketch below):
  Start with a training vector on the visible units (the "data").
  Update all the hidden units in parallel.
  Update all the visible units in parallel to get a "reconstruction".
  Update the hidden units again.
[Figure: one step of the Gibbs chain, from the data to the reconstruction]
[G. Hinton et al., 2006]
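A minimal NumPy sketch (illustrative sizes and names, not from the slides) of one CD-1 weight update for a binary-binary RBM, following the four steps above:

    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

    D, F, eta = 6, 4, 0.1                          # visible size, hidden size, learning rate
    W = 0.01 * rng.standard_normal((D, F))
    b, c = np.zeros(D), np.zeros(F)

    v0 = rng.integers(0, 2, size=D).astype(float)  # start with a training vector (the data)

    # Update all the hidden units in parallel.
    ph0 = sigmoid(c + v0 @ W)
    h0 = (rng.random(F) < ph0).astype(float)

    # Update all the visible units in parallel to get a reconstruction.
    pv1 = sigmoid(b + h0 @ W.T)
    v1 = (rng.random(D) < pv1).astype(float)

    # Update the hidden units again (probabilities suffice here).
    ph1 = sigmoid(c + v1 @ W)

    # CD-1 update: <v h>_data - <v h>_reconstruction.
    W += eta * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += eta * (v0 - v1)
    c += eta * (ph0 - ph1)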



Appendix : RBM Training


Training via stochastic gradient
Note that

    \Delta W_{ij} = \eta \big( \langle v_i h_j \rangle_{\text{data}} - \tilde{v}_i \tilde{h}_j \big),

where (\tilde{v}, \tilde{h}) is a sample from k-step CD.
Similar update rules can be derived for the biases b and c.
A mini-batch strategy (around 100 samples) is used to reduce the variance of the gradient estimate.
It can be implemented in ~25 lines of MATLAB code.
[G. Hinton et al., 2006]



Appendix : RBM Training


The RBM itself is a powerful model for learning features, leading to state-of-the-art performance in many tasks.
RBMs are used for constructing deep models, such as Deep Belief Networks (DBNs), to build more powerful generative models.
Deep models can represent hierarchical features: bases learned in a higher hidden layer are combinations of the bases learned in the previous hidden layer.

[Slide Credit: H. Lee]


