
Lecture 12

EE531 Statistical Learning Theory

Contents
  Introduction
    Artificial neuron model
    Feed-forward, recurrent network
  Neural network: Activation function
  Example
    Linear model for regression/classification
  Learning
    Back-propagation
    General comments
  Restricted Boltzmann Machine (RBM)
  Summary


Introduction(1)
A neural network is a non-linear model for classification/regression. One strategy is to build networks of units, i.e., simple parameterized functions. This approach was inspired by the brain, but it has been fairly clearly demonstrated not to be what the brain actually does.


Introduction(2)
Feed-forward network: data flows from input to output with no cycles.

[Figure: a feed-forward network with an input layer, two hidden layers, and an output layer producing the outputs]

Recurrent network: allows backward, forward, and even within-layer links. Such a network has memory and may be revisited later.
We will consider only feed-forward networks.


Introduction(3)
Sometimes the activation function for output units is the same as for hidden units.
What if we let the activation g be the identity? Then we have a linear combination of linear combinations, which remains linear.
Could use binary threshold units: g(a) = 1 if a >= 0, and 0 otherwise.
If inputs are also in {0,1}, then you can implement any Boolean function this way: neurons act as logic gates (see the sketch below).
If inputs are real-valued, you can make a 3-layer network that gets arbitrarily close to any decision boundary.
The problem with threshold units is that no one knows a good way to set the weights based on data; in fact, it is NP-hard.
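To make the "neurons as logic gates" point concrete, here is a minimal Python sketch (not from the slides; the weights and biases are hand-picked for illustration) of a binary threshold unit computing AND and OR:

    import numpy as np

    def threshold_unit(x, w, b):
        # Binary threshold unit: outputs 1 when w . x + b >= 0, else 0.
        return int(np.dot(w, x) + b >= 0)

    # AND gate: fires only when both inputs are 1.
    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print("AND", x, threshold_unit(np.array(x), w=np.array([1.0, 1.0]), b=-1.5))

    # OR gate: fires when at least one input is 1.
    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print("OR ", x, threshold_unit(np.array(x), w=np.array([1.0, 1.0]), b=-0.5))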

Neural Network: Activation function(1)
Activation function: Sigmoidal Units

[Figure: plots of g(a) versus a for two activation functions; the log-sigmoid saturates at 0 and 1, and the hyperbolic tangent sigmoid saturates at -1 and 1]

The hyperbolic tangent sigmoid, g(a) = tanh(a), is sometimes better to use than the log-sigmoid, g(a) = 1 / (1 + e^{-a}), because it can be trained more efficiently.
Use a linear output unit for regression, to get the full output range; a sigmoid output can be used for classification.
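For reference, a minimal Python/NumPy sketch (added here, not part of the slides) of the two sigmoidal activations and the derivatives that back-propagation will need later:

    import numpy as np

    def log_sigmoid(a):
        # Log-sigmoid: maps any real a into (0, 1).
        return 1.0 / (1.0 + np.exp(-a))

    def log_sigmoid_grad(a):
        s = log_sigmoid(a)
        return s * (1.0 - s)

    def tanh_sigmoid(a):
        # Hyperbolic tangent sigmoid: maps any real a into (-1, 1).
        return np.tanh(a)

    def tanh_sigmoid_grad(a):
        return 1.0 - np.tanh(a) ** 2

    a = np.linspace(-5, 5, 11)
    print(log_sigmoid(a))
    print(tanh_sigmoid(a))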

Neural Network: Activation function(2)
A sigmoid unit is approximately linear for small weights and approximately a threshold unit for large ones.
Two-layer sigmoid networks can approximate an arbitrary function (but may require lots of units!).
Much more complex topologies can be made; they are mostly useful if they help you build in bias.


Example(1)
The linear model for regression/classification is based on linear combinations of fixed nonlinear basis functions \phi_j(x) and takes the form

    y(x, w) = f\Big( \sum_{j=1}^{M} w_j \phi_j(x) \Big),

where f(\cdot) is a nonlinear activation function for classification and the identity for regression.
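A minimal Python/NumPy sketch (not from the slides) of such a model for regression, assuming Gaussian basis functions; since f is the identity here, the weights can be fit by least squares:

    import numpy as np

    def gaussian_basis(x, centers, width=1.0):
        # Fixed nonlinear basis functions phi_j(x): one Gaussian bump per center.
        return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * width ** 2))

    rng = np.random.default_rng(0)
    x = np.linspace(-3, 3, 50)
    t = np.sin(x) + 0.1 * rng.standard_normal(50)    # noisy regression targets
    Phi = gaussian_basis(x, np.linspace(-3, 3, 9))   # design matrix, shape (50, 9)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)      # least-squares weights
    y = Phi @ w                                      # y(x, w) = sum_j w_j phi_j(x)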


Example(2)
For standard regression problems, the activation function of the output unit is the identity, so that y_k = a_k.
For multiple binary classification problems, it is the logistic sigmoid, y_k = \sigma(a_k) = 1 / (1 + e^{-a_k}).
For multiclass problems, it is the softmax activation function, y_k = \exp(a_k) / \sum_j \exp(a_j).
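As a small illustration (added, not from the slides), a numerically stable softmax for the multiclass case:

    import numpy as np

    def softmax(a):
        # Softmax output activation: y_k = exp(a_k) / sum_j exp(a_j).
        a = a - np.max(a)      # shift for numerical stability; the result is unchanged
        e = np.exp(a)
        return e / e.sum()

    print(softmax(np.array([2.0, 1.0, 0.1])))   # sums to 1; largest logit gets largest probability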


Learning: Back propagation


Sigmoids are differentiable, so the whole network represents a differentiable function of all the weights. Set the weights to minimize an error function on the data, using gradient descent (or fancier gradient-based searches, such as conjugate gradient).
Error back-propagation
Compute gradient updates locally in the network.
Assume every unit has a fixed extra input with +1 activation (to absorb the bias).
General unit: a_j = \sum_i w_{ji} z_i, with output z_j = g(a_j).
The z's can be inputs, hidden units, or output units.


Learning: Back propagation(1)

[Figure: the feed-forward network from before, with input layer, hidden layers 1 and 2, and output layer]
Step 1: Apply an input vector x to the network and forward propagate through the network, using a_j = \sum_i w_{ji} z_i and z_j = g(a_j), to find the activations of all the hidden and output units.


Learning: Back propagation(2)

Step 2: Evaluate the errors \delta_k = y_k - t_k for all the output units.


Learning: Back propagation(3)

Step 3: Backpropagate the \delta's to the hidden units using \delta_j = g'(a_j) \sum_k w_{kj} \delta_k.


Learning: Back propagation(4)

Step 4: Evaluate the derivative of the error with respect to a weight: \partial E_n / \partial w_{ji} = \delta_j z_i.


Learning: Back propagation(5)

Step 5: Update the weights with gradient descent: w_{ji} \leftarrow w_{ji} - \eta \, \partial E_n / \partial w_{ji}. (A sketch combining Steps 1-5 follows below.)
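Putting the five steps together, here is a minimal NumPy sketch (illustrative sizes and names, not from the slides) of one stochastic-gradient step for a network with one hidden layer of tanh units, linear outputs, and a sum-of-squares error:

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy network: 3 inputs -> 5 tanh hidden units -> 2 linear outputs.
    W1 = 0.1 * rng.standard_normal((5, 3)); b1 = np.zeros(5)
    W2 = 0.1 * rng.standard_normal((2, 5)); b2 = np.zeros(2)
    eta = 0.1                                   # learning rate

    x = rng.standard_normal(3)                  # one training input
    t = rng.standard_normal(2)                  # its target

    # Step 1: forward propagate (a_j = sum_i w_ji z_i, z_j = g(a_j)).
    a1 = W1 @ x + b1
    z1 = np.tanh(a1)
    y = W2 @ z1 + b2                            # linear output units

    # Step 2: output errors, delta_k = y_k - t_k (sum-of-squares error).
    delta2 = y - t

    # Step 3: backpropagate, delta_j = g'(a_j) * sum_k w_kj delta_k.
    delta1 = (1.0 - np.tanh(a1) ** 2) * (W2.T @ delta2)

    # Step 4: derivatives, dE/dw_ji = delta_j * z_i.
    dW2 = np.outer(delta2, z1); db2 = delta2
    dW1 = np.outer(delta1, x);  db1 = delta1

    # Step 5: gradient-descent update.
    W2 -= eta * dW2; b2 -= eta * db2
    W1 -= eta * dW1; b1 -= eta * db1

In practice the same step is repeated over many examples (or mini-batches) and many epochs.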


General comments
Pragmatics
Training can be very sensitive to initial conditions:
  starting at zero: the network never moves;
  big weights: saturation, so the network never moves;
  therefore, start with small random weights.
Overfitting avoidance:
  stop training early (use a validation set to decide when); this keeps the weights from getting too big;
  weight decay: penalize the sum of squared weights by adding (\lambda/2) \sum_i w_i^2 to the error function.
Standardize inputs (this makes it easier to pick reasonable starting weights); see the sketch below.
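A small Python sketch (added for illustration; the sizes and the value of lam are arbitrary) of these pragmatics: standardizing inputs, starting from small random weights, and adding a weight-decay term to the gradient:

    import numpy as np

    rng = np.random.default_rng(0)

    # Standardize inputs: zero mean, unit variance per feature.
    X = rng.standard_normal((100, 3)) * 5.0 + 2.0        # raw inputs (illustrative)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # Start with small random weights, so sigmoid units begin in their near-linear region.
    W = 0.01 * rng.standard_normal((5, 3))

    # Weight decay: adding (lam/2) * sum(W**2) to the error adds lam * W to its gradient.
    lam = 1e-3
    grad_E = rng.standard_normal(W.shape)                # stand-in for the back-propagated gradient
    grad_total = grad_E + lam * W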


General comments
Don't use 0 and 1 as targets if you have a sigmoid output unit: reaching them exactly requires the weights to go to infinity.
Local optima: there is no guarantee that any given run will converge to anything, let alone the global optimum. Start multiple times from different initial conditions and use the apparent best, or vote the outcomes.
Learning rate: it should decrease over time, and there are methods for adapting the learning rate (see the sketch below). Still, back-propagation is slow. Conjugate gradient is usually better (but hairier to implement).
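One simple example (an assumption for illustration, not from the slides) of a learning rate that decreases over time is a 1/t schedule:

    # A simple 1/t learning-rate schedule (eta0 and tau are illustrative values).
    eta0, tau = 0.5, 100.0

    def learning_rate(t):
        # Learning rate that decreases over time: eta_t = eta0 / (1 + t / tau).
        return eta0 / (1.0 + t / tau)

    print([round(learning_rate(t), 3) for t in (0, 10, 100, 1000)])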


BB-RBM Structure
Undirected bipartite graphical model
Generative model
Special case of a product of experts
v in {0,1}^D : visible units
h in {0,1}^F : hidden units
W : connection weights between v and h
Model parameters: \theta = {W, b, c}, where b and c are the biases of the visible and hidden layers
In this case the energy and the joint distribution are

    E(v, h) = -b^\top v - c^\top h - v^\top W h,
    p(v, h) = \exp(-E(v, h)) / Z,   with   Z = \sum_{v, h} \exp(-E(v, h)).


BB-RBM Inference
Given the visible layer v, all the hidden units are conditionally independent, and vice versa: e.g.,

    p(h_j = 1 | v) = \sigma( c_j + \sum_i W_{ij} v_i ),
    p(v_i = 1 | h) = \sigma( b_i + \sum_j W_{ij} h_j ),

where \sigma is the logistic sigmoid. The hidden activations p(h | v) can be used as features.
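A minimal Python/NumPy sketch (not from the slides) of these conditionals for a BB-RBM, assuming the notation above with W of size D x F, visible bias b, and hidden bias c:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def p_h_given_v(v, W, c):
        # BB-RBM: p(h_j = 1 | v) = sigmoid(c_j + sum_i W_ij v_i), for all j at once.
        return sigmoid(c + v @ W)

    def p_v_given_h(h, W, b):
        # BB-RBM: p(v_i = 1 | h) = sigmoid(b_i + sum_j W_ij h_j).
        return sigmoid(b + h @ W.T)

    rng = np.random.default_rng(0)
    D, F = 6, 4                                   # visible / hidden sizes (illustrative)
    W = 0.1 * rng.standard_normal((D, F))
    b, c = np.zeros(D), np.zeros(F)
    v = rng.integers(0, 2, size=D).astype(float)  # a binary visible vector
    features = p_h_given_v(v, W, c)               # hidden activations used as features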


GB-RBM Structure
Undirected bipartite graphical model
v in R^D : real-valued (Gaussian) visible units
h in {0,1}^F : hidden units
W : connection weights between v and h
Model parameters: \theta = {W, b, c}, where b and c are the biases of the visible and hidden layers
In this case (assuming unit-variance visible units) the energy and joint distribution are

    E(v, h) = \frac{1}{2} \| v - b \|^2 - c^\top h - v^\top W h,
    p(v, h) = \exp(-E(v, h)) / Z.


GB-RBM Inference
Given the visible layer v, all the hidden units are conditionally independent, and vice versa:

    p(h_j = 1 | v) = \sigma( c_j + \sum_i W_{ij} v_i ),
    p(v_i | h) = \mathcal{N}( v_i ; b_i + \sum_j W_{ij} h_j, 1 ).

The hidden activations p(h | v) can be used as features.



RBM Training
Maximum likelihood learning
Objective: maximize the log-likelihood of the given training data,

    L(\theta) = \sum_n \log p(v^{(n)}; \theta),   where   p(v; \theta) = \frac{1}{Z} \sum_h \exp(-E(v, h)).

Use a gradient method to update the parameters.
Computing the exact gradient is intractable, hence it is common to use a stochastic gradient method with a sampling-based approximation, e.g., contrastive divergence.
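For reference, the standard form of this gradient for the weights of a binary-binary RBM (a well-known result stated here for completeness; the slide's own expression is not recoverable) is

    \frac{\partial \log p(v)}{\partial W_{ij}}
      = \mathbb{E}_{\text{data}}[ v_i h_j ] - \mathbb{E}_{\text{model}}[ v_i h_j ],

where the data expectation uses p(h | v) on the training examples (tractable) and the model expectation is over the joint p(v, h) (intractable), which is the term the sampling-based approximation targets.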


Summary
  Introduction
    Artificial neuron model
    Feed-forward, recurrent network
  Neural network: Activation function
  Example
    Linear model for regression/classification
  Learning
    Back-propagation
    General comments


Appendix : RBM Training


Model distribution: p(v, h; \theta) = \exp(-E(v, h)) / Z.
Find the parameters \theta that maximize \sum_n \log p(v^{(n)}; \theta).
Need to compute expectations under the data distribution (the posterior of h given v) and under the model distribution p(v, h), and the gradient w.r.t. the parameters, e.g.,

    \partial \log p(v) / \partial W_{ij} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}.


Appendix : RBM Training


However, computing \langle v_i h_j \rangle_{\text{data}} given the training data is tractable, but \langle v_i h_j \rangle_{\text{model}} is computationally intractable.
Replace the average over all possible inputs by samples.
Run a Markov chain Monte Carlo method (Gibbs sampling) to approximate the second term.


Appendix : RBM Training


If we run the Gibbs chain to equilibrium (k \to \infty), we can generate samples from the model distribution, but this is also computationally intractable.
[Figure: Gibbs chain alternating between v and h, converging to the equilibrium distribution]
[G. Hinton et al., 2006]


Appendix : Contrastive Divergence


Equivalence between ML and minimizing KL
If k \to \infty, the samples from the Markov chain converge to samples from the model distribution, and the bias goes away.
The objective is to learn the parameters of the model (RBM) so that it represents the distribution of the given data well.
Learning rule revisited:
Maximizing the log-probability of the training data is equivalent to minimizing the KL divergence between the empirical distribution, p_0, and the model distribution, p_\infty.
[G. Hinton et al., 2006]



Appendix : Contrastive Divergence


The Kullback-Leibler divergence between p_0 (the data) and p_\infty (the distribution produced by prolonged Gibbs sampling from the generative model) is

    KL(p_0 \| p_\infty) = \sum_v p_0(v) \log \frac{p_0(v)}{p_\infty(v)}.

Therefore, maximizing the log-likelihood can be accomplished by minimizing this KL divergence.
[G. Hinton et al., 2006]



Appendix : Contrastive Divergence


When using k-step Gibbs sampling, the model distribution is approximated by p_k, which causes an error given by the difference between the true model distribution and the approximation.
So CD actually does not minimize KL(p_0 \| p_\infty).
Instead it minimizes the so-called contrastive divergence, given by

    CD_k = KL(p_0 \| p_\infty) - KL(p_k \| p_\infty).

Therefore the updates can be written as

    \Delta W_{ij} \propto \langle v_i h_j \rangle_{p_0} - \langle v_i h_j \rangle_{p_k}.

[G. Hinton et al., 2006]



Appendix : Contrastive Divergence


Contrastive Divergence (CD-1) procedure (see the sketch below):
  Start with a training vector on the visible units (the "data").
  Update all the hidden units in parallel.
  Update all the visible units in parallel to get a "reconstruction".
  Update the hidden units again.
[Figure: one step of the Gibbs chain, from the data to the reconstruction]
[G. Hinton et al., 2006]
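A minimal NumPy sketch (illustrative sizes and names, not from the slides) of one CD-1 weight update for a binary-binary RBM, following the four steps above:

    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

    D, F, eta = 6, 4, 0.1                          # visible size, hidden size, learning rate
    W = 0.01 * rng.standard_normal((D, F))
    b, c = np.zeros(D), np.zeros(F)

    v0 = rng.integers(0, 2, size=D).astype(float)  # start with a training vector (the data)

    # Update all the hidden units in parallel.
    ph0 = sigmoid(c + v0 @ W)
    h0 = (rng.random(F) < ph0).astype(float)

    # Update all the visible units in parallel to get a reconstruction.
    pv1 = sigmoid(b + h0 @ W.T)
    v1 = (rng.random(D) < pv1).astype(float)

    # Update the hidden units again (probabilities suffice here).
    ph1 = sigmoid(c + v1 @ W)

    # CD-1 update: <v h>_data - <v h>_reconstruction.
    W += eta * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += eta * (v0 - v1)
    c += eta * (ph0 - ph1)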



Appendix : RBM Training


Training via stochastic gradient
Note that

    \Delta W_{ij} = \eta \big( \langle v_i h_j \rangle_{\text{data}} - \tilde{v}_i \tilde{h}_j \big),

where (\tilde{v}, \tilde{h}) is a sample from k-step CD.
Similar update rules can be derived for the biases b and c.
A mini-batch strategy (around 100 samples) is used to reduce the variance of the gradient estimate.
It can be implemented in ~25 lines of MATLAB code.
[G. Hinton et al., 2006]



Appendix : RBM Training


The RBM itself is a powerful model for learning features, leading to state-of-the-art performance in many tasks.
RBMs are used for constructing deep models, such as Deep Belief Networks (DBNs), to build more powerful generative models.
Deep models can represent hierarchical features: bases learned in a higher hidden layer are combinations of the bases learned in the previous hidden layer.

[Slide Credit: H. Lee]


