Practical Recommendations for Gradient-Based Training of Deep
Architectures
arXiv:1206.5533v2 [cs.LG] 16 Sep 2012

Yoshua Bengio
Version 2, Sept. 16th, 2012

Abstract

Learning algorithms related to artificial neural networks and in particular for Deep Learning may seem to involve many bells and whistles, called hyper-parameters. This chapter is meant as a practical guide with recommendations for some of the most commonly used hyper-parameters, in particular in the context of learning algorithms based on back-propagated gradient and gradient-based optimization. It also discusses how to deal with the fact that more interesting results can be obtained when allowing one to adjust many hyper-parameters. Overall, it describes elements of the practice used to successfully and efficiently train and debug large-scale and often deep multi-layer neural networks. It closes with open questions about the training difficulties observed with deeper architectures.

1 Introduction

Following a decade of lower activity, research in artificial neural networks was revived after a 2006 breakthrough (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007) in the area of Deep Learning, based on greedy layer-wise unsupervised pre-training of each layer of features. See (Bengio, 2009) for a review. Many of the practical recommendations that justified the previous edition of this book are still valid, and new elements were added, while some survived longer by virtue of the practical advantages they provided. The panorama presented in this chapter regards some of these surviving or novel elements of practice, focusing on learning algorithms aiming at training deep neural networks, but leaving most of the material specific to the Boltzmann machine family to another chapter (Hinton, 2013).

Although such recommendations come out of a living practice that emerged from years of experimentation and to some extent mathematical justification, they should be challenged. They constitute a good starting point for the experimenter and user of learning algorithms, but very often have not been formally validated, leaving open many questions that can be answered either by theoretical analysis or by solid comparative experimental work (ideally by both). A good indication of the need for such validation is that different researchers and research groups do not always agree on the practice of training neural networks.

Several of the recommendations presented here can be found implemented in the Deep Learning Tutorials[1] and in the related Pylearn2 library[2], all based on the Theano library (discussed below) written in the Python programming language.

The 2006 Deep Learning breakthrough (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007) centered on the use of unsupervised representation learning to help learning internal representations[3] by providing a local training signal at each level of a hierarchy of features[4].

[1] http://deeplearning.net/tutorial/
[2] http://deeplearning.net/software/pylearn2
[3] A neural network computes a sequence of data transformations, each step encoding the raw input into an intermediate or internal representation, in principle to make the prediction or modeling task of interest easier.
Unsupervised representation learning algorithms can be applied several times to learn different layers of a deep model. Several unsupervised representation learning algorithms have been proposed since then. Those covered in this chapter (such as auto-encoder variants) retain many of the properties of artificial multi-layer neural networks, relying on the back-propagation algorithm to estimate stochastic gradients. Deep Learning algorithms such as those based on the Boltzmann machine and those based on auto-encoder or sparse coding variants often include a supervised fine-tuning stage. This supervised fine-tuning, as well as the gradient descent performed with auto-encoder variants, also involves the back-propagation algorithm, just like when training deterministic feedforward or recurrent artificial neural networks. Hence this chapter also includes recommendations for training ordinary supervised deterministic neural networks or, more generally, most machine learning algorithms relying on iterative gradient-based optimization of a parametrized learner with respect to an explicit training criterion.

This chapter assumes that the reader already understands the standard algorithms for training supervised multi-layer neural networks, with the loss gradient computed thanks to the back-propagation algorithm (Rumelhart et al., 1986). It starts by explaining basic concepts behind Deep Learning and the greedy layer-wise pretraining strategy (Section 1.1), and recent unsupervised pre-training algorithms (denoising and contractive auto-encoders) that are closely related in the way they are trained to standard multi-layer neural networks (Section 1.2). It then reviews in Section 2 basic concepts in iterative gradient-based optimization and in particular the stochastic gradient method, gradient computation with a flow graph, and automatic differentiation. The main section of this chapter is Section 3, which explains hyper-parameters in general, their optimization, and specifically covers the main hyper-parameters of neural networks. Section 4 briefly describes simple ideas and methods to debug and visualize neural networks, while Section 5 covers parallelism, sparse high-dimensional inputs, symbolic inputs and embeddings, and multi-relational learning. The chapter closes (Section 6) with open questions on the difficulty of training deep architectures and improving the optimization methods for neural networks.

1.1 Deep Learning and Greedy Layer-Wise Pretraining

The notion of reuse, which explains the power of distributed representations (Bengio, 2009), is also at the heart of the theoretical advantages behind Deep Learning. The complexity theory of circuits, e.g. (Håstad, 1986; Håstad and Goldmann, 1991) (which includes neural networks as special cases), long predates the recent research on deep learning. The depth of a circuit is the length of the longest path from an input node of the circuit to an output node of the circuit. Formally, one can change the depth of a given circuit by changing the definition of what each node can compute, but only by a constant factor (Bengio, 2009). The typical computations we allow in each node include: weighted sum, product, artificial neuron model (such as a monotone non-linearity on top of an affine transformation), computation of a kernel, or logic gates. Theoretical results (Håstad, 1986; Håstad and Goldmann, 1991; Bengio et al., 2006b; Bengio and LeCun, 2007; Bengio and Delalleau, 2011) clearly identify families of functions where a deep representation can be exponentially more efficient than one that is insufficiently deep. If the same set of functions can be represented from within a family of architectures associated with a smaller VC-dimension (e.g. fewer hidden units[5]), learning theory would suggest that it can be learned with fewer examples, yielding improvements in both computational efficiency and statistical efficiency.

[4] In standard multi-layer neural networks trained using back-propagated gradients, the only signal that drives parameter updates is provided at the output of the network (and then propagated backwards). Some unsupervised learning algorithms provide a local source of guidance for the parameter update in each layer, based only on the inputs and outputs of that layer.
[5] Note that in our experiments, deep architectures tend to generalize very well even when they have quite large numbers of parameters.
Another important motivation for feature learning and Deep Learning is that they can be done with unlabeled examples, so long as the factors (unobserved random variables explaining the data) relevant to the questions we will ask later (e.g. classes to be predicted) are somehow salient in the input distribution itself. This is true under the manifold hypothesis, which states that natural classes and other high-level concepts in which humans are interested are associated with low-dimensional regions in input space (manifolds) near which the distribution concentrates, and that different class manifolds are well-separated by regions of very low density. It means that a small semantic change around a particular example can be captured by changing only a few numbers in a high-level abstract representation space. As a consequence, feature learning and Deep Learning are intimately related to principles of unsupervised learning, and they can work in the semi-supervised setting (where only a few examples are labeled), as well as in the transfer learning and multi-task settings (where we aim to generalize to new classes or tasks). The underlying hypothesis is that many of the underlying factors are shared across classes or tasks. Since representation learning aims to extract and isolate these factors, representations can be shared across classes and tasks.

One of the most commonly used approaches for training deep neural networks is based on greedy layer-wise pre-training (Bengio et al., 2007). The idea, first introduced in Hinton et al. (2006), is to train one layer of a deep architecture at a time using unsupervised representation learning. Each level takes as input the representation learned at the previous level and learns a new representation. The learned representation(s) can then be used as input to predict variables of interest, for example to classify objects. After unsupervised pre-training, one can also perform supervised fine-tuning of the whole system[6], i.e., optimize not just the classifier but also the lower levels of the feature hierarchy with respect to some objective of interest. Combining unsupervised pre-training and supervised fine-tuning usually gives better generalization than pure supervised learning from a purely random initialization. The unsupervised representation learning algorithms for pre-training proposed in 2006 were the Restricted Boltzmann Machine or RBM (Hinton et al., 2006), the auto-encoder (Bengio et al., 2007) and a sparsifying form of auto-encoder similar to sparse coding (Ranzato et al., 2007).

1.2 Denoising and Contractive Auto-Encoders

An auto-encoder has two parts: an encoder function f that maps the input x to a representation h = f(x), and a decoder function g that maps h back into the space of x in order to reconstruct x. In the regular auto-encoder the reconstruction function r(·) = g(f(·)) is trained to minimize the average value of a reconstruction loss on the training examples. Note that the reconstruction loss should be high for most other input configurations[7]. The regularization mechanism makes sure that reconstruction cannot be perfect everywhere, while minimizing the reconstruction loss at training examples digs a hole in reconstruction error where the density of training examples is large. Examples of reconstruction loss functions include ||x − r(x)||² (for real-valued inputs) and −∑ᵢ [xᵢ log rᵢ(x) + (1 − xᵢ) log(1 − rᵢ(x))] (when interpreting xᵢ as a bit or a probability of a binary event). Auto-encoders capture the input distribution by learning to better reconstruct more likely input configurations. The difference between the reconstruction vector and the input vector can be shown to be related to the log-density gradient as estimated by the learner (Vincent, 2011; Bengio et al., 2012), and the Jacobian matrix of the reconstruction with respect to the input gives information about the second derivative of the density, i.e., in which direction the density remains high when you are on a high-density manifold (Rifai et al., 2011a; Bengio et al., 2012).

[6] The whole system composes the computation of the representation with the computation of the predictor's output.
[7] Different regularization mechanisms have been proposed to push reconstruction error up in low density areas: denoising criterion, contractive criterion, and code sparsity. It has been argued that such constraints play a role similar to the partition function for Boltzmann machines (Ranzato et al., 2008a).
In the Denoising Auto-Encoder (DAE) and the Contractive Auto-Encoder (CAE), the training procedure also introduces robustness (insensitivity to small variations), respectively in the reconstruction r(x) or in the representation f(x). In the DAE (Vincent et al., 2008, 2010), this is achieved by training with stochastically corrupted inputs, but trying to reconstruct the uncorrupted inputs. In the CAE (Rifai et al., 2011a), this is achieved by adding an explicit regularizing term to the training criterion, proportional to the norm of the Jacobian of the encoder, ||∂f(x)/∂x||². But the CAE and the DAE are very related (Bengio et al., 2012): when the noise is Gaussian and small, the denoising error minimized by the DAE is equivalent to minimizing the norm of the Jacobian of the reconstruction function r(·) = g(f(·)), whereas the CAE minimizes the norm of the Jacobian of the encoder f(·). Besides Gaussian noise, another interesting form of corruption has been very successful with DAEs: it is called the masking corruption and consists in randomly zeroing out a large fraction (like 20% or even 50%) of the inputs, where the zeroed-out subset is randomly selected for each example. In addition to the contractive effect, it forces the learned encoder to be able to rely only on an arbitrary subset of the input features.

Another way to prevent the auto-encoder from perfectly reconstructing everywhere is to introduce a sparsity penalty on h, discussed below (Section 3.1).
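As a concrete illustration of the above (a minimal sketch, not code from the paper: the layer sizes, learning rate, corruption fraction, and the tied-weights choice are all illustrative assumptions), one stochastic gradient step of a denoising auto-encoder with sigmoid units, masking corruption, and the cross-entropy reconstruction loss of Section 1.2 could look like this:

    import numpy as np

    rng = np.random.RandomState(0)
    n_in, n_hid, lr, corruption = 784, 500, 0.1, 0.3

    # Tied-weight auto-encoder: h = f(x) = sigmoid(W x + b), r = g(h) = sigmoid(W^T h + c).
    W = rng.uniform(-0.05, 0.05, (n_hid, n_in))
    b, c = np.zeros(n_hid), np.zeros(n_in)

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    x = rng.binomial(1, 0.5, n_in).astype(float)  # stand-in for one binary training example

    # Masking corruption: randomly zero out a fraction of the input components.
    x_tilde = x * rng.binomial(1, 1.0 - corruption, n_in)

    h = sigmoid(W.dot(x_tilde) + b)     # representation of the corrupted input
    r = sigmoid(W.T.dot(h) + c)         # reconstruction

    # Cross-entropy reconstruction loss against the UNCORRUPTED input x.
    loss = -np.sum(x * np.log(r) + (1 - x) * np.log(1 - r))

    # Back-propagation (sigmoid + cross-entropy gives d loss / d pre-activation = r - x).
    d_r = r - x
    d_h = W.dot(d_r) * h * (1 - h)

    # One stochastic gradient step; tied weights collect both encoder and decoder terms.
    W -= lr * (np.outer(d_h, x_tilde) + np.outer(h, d_r))
    b -= lr * d_h
    c -= lr * d_r

Note how the loss compares r to the uncorrupted x even though the encoder only sees x_tilde; that mismatch is what produces the denoising training signal.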
1.3 Online Learning and Optimization of Generalization Error

The objective of learning is not to minimize training error or even the training criterion. The latter is a surrogate for generalization error, i.e., performance on new (out-of-sample) examples, and there are no hard guarantees that minimizing the training criterion will yield good generalization error: it depends on the appropriateness of the parametrization and training criterion (with the corresponding prior they imply) for the task at hand.

Many learning tasks of interest will require huge quantities of data (most of which will be unlabeled), and as the number of examples increases, so long as capacity is limited (the number of parameters is small compared to the number of examples), training error and generalization error approach each other. In the regime of such large datasets, we can consider that the learner sees an unending stream of examples (e.g., think about a process that harvests text and images from the web and feeds them to a machine learning algorithm). In that context, it is most efficient to simply update the parameters of the model after each example or few examples, as they arrive. This is the ideal online learning scenario, and in a simplified setting, we can even consider each new example z as being sampled i.i.d. from an unknown generating distribution with probability density p(z). More realistically, examples in online learning do not arrive i.i.d. but instead from an unknown stochastic process which exhibits serial correlation and other temporal dependencies. Many learning algorithms rely on gradient-based numerical optimization of a training criterion. Let L(z, θ) be the loss incurred on example z when the parameter vector takes value θ. The gradient vector for the loss associated with a single example is ∂L(z, θ)/∂θ.

If we consider the simplified case of i.i.d. data, there is an interesting observation to be made: the online learner is performing stochastic gradient descent on its generalization error. Indeed, the generalization error C of a learner with parameters θ and loss function L is

    C = E[L(z, θ)] = ∫ p(z) L(z, θ) dz

while the stochastic gradient from sample z is

    ĝ = ∂L(z, θ)/∂θ

with z a random variable sampled from p. The gradient of generalization error is

    ∂C/∂θ = (∂/∂θ) ∫ p(z) L(z, θ) dz = ∫ p(z) ∂L(z, θ)/∂θ dz = E[ĝ]

showing that the online gradient ĝ is an unbiased estimator of the generalization error gradient ∂C/∂θ. It means that online learners, when given a stream of non-repetitive training data, really optimize (maybe not in the optimal way, i.e., using a first-order gradient technique) what we really care about: generalization error.
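Since ∂C/∂θ = E[ĝ], averaging per-example gradients over a long stream should match the generalization-error gradient. A quick numerical check of this (an illustrative sketch; the quadratic loss and Gaussian p(z) are chosen so that ∂C/∂θ = θ − μ is known in closed form):

    import numpy as np

    rng = np.random.RandomState(0)
    theta, mu = 2.0, 5.0

    # Per-example loss L(z, theta) = (theta - z)^2 / 2, so dL/dtheta = theta - z.
    samples = rng.normal(mu, 1.0, size=100000)   # stream of z ~ p(z)
    g_hat = theta - samples                      # per-example online gradients

    print(g_hat.mean())   # ~ -3.0, matching dC/dtheta = theta - mu = 2 - 5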
2 Gradients

2.1 Gradient Descent and Learning Rate

The gradient, or an estimator of the gradient, is used as the core part of the computation of parameter updates for gradient-based numerical optimization algorithms. For example, simple online (or stochastic) gradient descent (Robbins and Monro, 1951; Bottou and LeCun, 2004) updates the parameters after each example is seen, according to

    θ(t) ← θ(t−1) − εt ∂L(zt, θ)/∂θ

where zt is an example sampled at iteration t and where εt is a hyper-parameter that is called the learning rate and whose choice is crucial. If the learning rate is too large[8], the average loss will increase. The optimal learning rate is usually close to (within a factor of 2 of) the largest learning rate that does not cause divergence of the training criterion, an observation that can guide heuristics for setting the learning rate (Bengio, 2011), e.g., start with a large learning rate and, if the training criterion diverges, try again with a learning rate 3 times smaller, etc., until no divergence is observed.

See Bottou (2013) for a deeper treatment of stochastic gradient descent, including suggestions to set the learning rate schedule and to improve the asymptotic convergence through averaging.

In practice, we use mini-batch updates based on an average of the gradients[9] inside each block of B examples:

    θ(t) ← θ(t−1) − εt (1/B) ∑_{t′=Bt+1}^{B(t+1)} ∂L(zt′, θ)/∂θ        (1)

With B = 1 we are back to ordinary online gradient descent, while with B equal to the training set size, this is standard (also called "batch") gradient descent. With intermediate values of B there is generally a sweet spot. When B increases we can get more multiply-add operations per second by taking advantage of parallelism or efficient matrix-matrix multiplications (instead of separate matrix-vector multiplications), often gaining a factor of 2 in practice in overall training time. On the other hand, as B increases, the number of updates per computation done decreases, which slows down convergence (in terms of error vs number of multiply-add operations performed) because fewer updates can be done in the same computing time. Combining these two opposing effects yields a typical U-curve with a sweet spot at an intermediate value of B.

Keep in mind that even the true gradient direction (averaging over the whole training set) is only the steepest descent direction locally, and may not point in the right direction when considering larger steps. In particular, because the training criterion is not quadratic in the parameters, as one moves in parameter space the optimal descent direction keeps changing. Because the gradient direction is not quite the right direction of descent, there is no point in spending a lot of computation to estimate it precisely for gradient descent. Instead, doing more updates more frequently helps to explore more and faster, especially with large learning rates. In addition, smaller values of B may benefit from more exploration in parameter space and from a form of regularization, both due to the "noise" injected in the gradient estimator, which may explain the better test results sometimes observed with smaller B.

When the training set is finite, training proceeds by sweeps through the training set, each called an epoch, and full training usually requires many epochs (iterations through the training set). Note that stochastic gradient (either one example at a time or with mini-batches) is different from ordinary gradient descent, sometimes called "batch gradient descent", which corresponds to the case where B equals the training set size, i.e., there is one parameter update per epoch.

[8] above a value which is approximately twice the inverse of the largest eigenvalue of the average loss Hessian matrix
[9] Compared to a sum, an average makes a small change in B have only a small effect on the optimal learning rate, with an increase in B generally allowing a small increase in the learning rate because of the reduced variance of the gradient.
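In code, Eq. (1) amounts to the following loop (a minimal sketch, assuming a per-example gradient function; the data, loss, B and ε values are arbitrary illustrative choices):

    import numpy as np

    def sgd(theta, data, loss_grad, eps=0.01, B=32, n_epochs=10):
        n = len(data)
        for epoch in range(n_epochs):
            for start in range(0, n, B):
                batch = data[start:start + B]
                # Average the per-example gradients inside the block of B examples (Eq. (1)).
                g = np.mean([loss_grad(theta, z) for z in batch], axis=0)
                theta = theta - eps * g
        return theta

    # Example use: fit theta to the mean of scalar data, with L(z, theta) = (theta - z)^2 / 2.
    data = np.random.RandomState(0).normal(3.0, 1.0, size=1000)
    theta = sgd(np.zeros(1), data, lambda th, z: th - z)

Setting B = 1 recovers online gradient descent and B = len(data) recovers batch gradient descent, as in the text.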
The great advantage of stochastic gradient descent and other online or minibatch update methods is that their convergence does not depend on the size of the training set, only on the number of updates and the richness of the training distribution. In the limit of a large or infinite training set, a batch method (which updates only after seeing all the examples) is hopeless. In fact, even for ordinary datasets of tens or hundreds of thousands of examples (or more!), stochastic gradient descent converges much faster than ordinary (batch) gradient descent, and beyond some dataset sizes the speed-up is almost linear (i.e., doubling the size almost doubles the gain)[10]. It is really important to use the stochastic version in order to get reasonable clock-time convergence speeds.

As for any stochastic gradient descent method (including the mini-batch case), it is important for the efficiency of the estimator that each example or mini-batch be sampled approximately independently. Because random access to memory (or even worse, to disk) is expensive, a good approximation, called incremental gradient (Bertsekas, 2010), is to visit the examples (or mini-batches) in a fixed order corresponding to their order in memory or on disk (repeating the examples in the same order on a second epoch, if we are not in the pure online case where each example is visited only once). In this context, it is safer if the examples or mini-batches are first put in a random order (to make sure this is the case, it could be useful to first shuffle the examples). Faster convergence has been observed if the order in which the mini-batches are visited is changed for each epoch, which can be reasonably efficient if the training set fits in computer memory.

2.2 Gradient Computation and Automatic Differentiation

The gradient can be computed either manually or through automatic differentiation. Either way, it helps to structure this computation as a flow graph, in order to prevent mathematical mistakes and to make sure an implementation is computationally efficient. The computation of the loss L(z, θ) as a function of θ is laid out in a graph whose nodes correspond to elementary operations such as addition, multiplication, and non-linear operations such as a neural network's activation function (e.g., sigmoid or hyperbolic tangent), possibly at the level of vectors, matrices or tensors. The flow graph is directed and acyclic and has three types of nodes: input nodes, internal nodes, and output nodes. Each of its nodes is associated with a numerical output which is the result of the application of that computation (none in the case of input nodes), taking as input the outputs of previous nodes in a directed acyclic graph. Example z and parameter vector θ (or their elements) are the input nodes of the graph (i.e., they do not have inputs themselves) and L(z, θ) is a scalar output of the graph. Note that here, in the supervised case, z can include an input part x (e.g. an image) and a target part y (e.g. a target class associated with an object in the image). In the unsupervised case z = x. In a semi-supervised case, there is a mix of labeled and unlabeled examples, and z includes y on the labeled examples but not on the unlabeled ones.

In addition to associating a numerical output oa to each node a of the flow graph, we can associate a gradient ga = ∂L(z, θ)/∂oa. The gradient will be defined and computed recursively in the graph, in the opposite direction of the computation of the nodes' outputs, i.e., whereas oa is computed using the outputs op of predecessor nodes p of a, ga will be computed using the gradients gs of successor nodes s of a. More precisely, the chain rule dictates

    ga = ∑s gs ∂os/∂oa

where the sum is over the immediate successors of a. Only output nodes have no successor, and in particular for the output node that computes L, the gradient is set to 1 since ∂L/∂L = 1, thus initializing the recursion.

[10] On the other hand, batch methods can be parallelized easily, which becomes an important advantage with currently available forms of computing power.
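The recursion above is easy to state in code; the following is a minimal illustrative sketch of such a flow graph (scalar nodes only, with the backward order written by hand rather than obtained by a topological sort):

    import math

    class Node:
        def __init__(self, value, parents=()):
            self.o = value        # numerical output o_a of this node
            self.g = 0.0          # accumulated gradient g_a = dL/do_a
            self.parents = parents
            self.bprop = None     # propagates g_a to the parents' gradients

    def mul(a, b):
        n = Node(a.o * b.o, (a, b))
        def bprop():
            a.g += n.g * b.o      # chain rule: g_a += g_n * do_n/do_a
            b.g += n.g * a.o
        n.bprop = bprop
        return n

    def sigmoid(a):
        s = 1.0 / (1.0 + math.exp(-a.o))
        n = Node(s, (a,))
        def bprop():
            a.g += n.g * s * (1.0 - s)
        n.bprop = bprop
        return n

    # Forward pass (fprop): a tiny graph computing L = sigmoid(w * x).
    w, x = Node(0.5), Node(2.0)
    L = sigmoid(mul(w, x))

    # Backward pass (bprop) in reverse order, initialized with dL/dL = 1.
    L.g = 1.0
    for node in (L, L.parents[0]):    # hand-ordered here; real tools sort the DAG
        node.bprop()
    print(w.g, x.g)                   # dL/dw = x*s*(1-s), dL/dx = w*s*(1-s)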
Manual or automatic differentiation then only requires defining the partial derivative associated with each type of operation performed by any node of the graph. When implementing gradient descent algorithms with manual differentiation, the result tends to be verbose, brittle code that lacks modularity – all bad things in terms of software engineering. A better approach is to express the flow graph in terms of objects that modularize how to compute outputs from inputs as well as how to compute the partial derivatives necessary for gradient descent. One can pre-define the operations of these objects (in a "forward propagation" or fprop method) and their partial derivatives (in a "backward propagation" or bprop method) and encapsulate these computations in an object that knows how to compute its output given its inputs, and how to compute the gradient with respect to its inputs given the gradient with respect to its output. This is the strategy adopted in the Theano library[11] with its Op objects (Bergstra et al., 2010), as well as in libraries such as Torch[12] (Collobert et al., 2011b) and Lush[13].

Compared to Torch and Lush, Theano adds an interesting ingredient which makes it a full-fledged automatic differentiation tool: symbolic computation. The flow graph itself (without the numerical values attached) can be viewed as a symbolic representation (in a data structure) of a numerical computation. In Theano, the gradient computation is first performed symbolically, i.e., each Op object knows how to create other Ops corresponding to the computation of the partial derivatives associated with that Op. Hence the symbolic differentiation of the output of a flow graph with respect to any or all of its input nodes can be performed easily in most cases, yielding another flow graph which specifies how to compute these gradients, given the input of the original graph. Since the gradient graph typically contains the original graph (mapping parameters to loss) as a sub-graph, in order to make computations efficient it is important to automate (as done in Theano) a number of simplifications, which are graph transformations preserving the semantics of the output (given the input) but yielding smaller (or more numerically stable or more efficiently computed) graphs (e.g., removing redundant computations). To take advantage of the fact that computing the loss gradient includes as a first step computing the loss itself, it is advantageous to structure the code so that both the loss and its gradient are computed at once, with a single graph having multiple outputs. The advantages of performing gradient computations symbolically are numerous. First of all, one can readily compute gradients over gradients, i.e., second derivatives, which are useful for some learning algorithms. Second, one can define algorithms or training criteria involving gradients themselves, as required for example in the Contractive Auto-Encoder (which uses the norm of a Jacobian matrix in its training criterion, i.e., really requires second derivatives, which here are cheap to compute). Third, it makes it easy to implement other useful graph transformations such as graph simplifications or numerical optimizations and transformations that help make the numerical results more robust and the computation more efficient (such as working in the domain of logarithms of probabilities rather than in the domain of probabilities directly). Other potential beneficial applications of such symbolic manipulations include parallelization and additional differential operators (such as the R-operator, recently implemented in Theano, which is very useful to compute the product of a Jacobian matrix ∂f(x)/∂x or Hessian matrix ∂²L(x, θ)/∂θ² with a vector without ever having to actually compute and store the matrix itself (Pearlmutter, 1994)).

[11] http://deeplearning.net/software/theano/
[12] http://www.torch.ch
[13] http://lush.sourceforge.net
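For example, this symbolic style looks as follows in Theano (an illustrative sketch; the logistic-regression loss, dimensions, and learning rate are arbitrary choices): T.grad symbolically builds the gradient flow graph, and a single compiled function returns the loss while applying the updates.

    import numpy as np
    import theano
    import theano.tensor as T

    x = T.dvector('x')                      # input
    y = T.dscalar('y')                      # binary target
    w = theano.shared(np.zeros(3), name='w')
    b = theano.shared(0.0, name='b')

    p = T.nnet.sigmoid(T.dot(w, x) + b)     # symbolic flow graph for the prediction
    loss = -y * T.log(p) - (1 - y) * T.log(1 - p)

    # Symbolic differentiation: builds new graphs computing dloss/dw and dloss/db.
    gw, gb = T.grad(loss, [w, b])

    # One compiled function computes the loss and applies the SGD updates at once.
    step = theano.function([x, y], loss,
                           updates=[(w, w - 0.1 * gw), (b, b - 0.1 * gb)])
    print(step(np.array([1.0, 0.5, -0.3]), 1.0))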
3 Hyper-Parameters

A pure learning algorithm can be seen as a function taking training data as input and producing as output a function (e.g. a predictor) or model (i.e. a bunch of functions). However, in practice, many learning algorithms involve hyper-parameters, i.e., annoying knobs to be adjusted. In many algorithms, such as Deep Learning algorithms, the number of hyper-parameters (ten or more!) can make the idea of having to adjust all of them unappealing.
In addition, it has been shown that the use of computer clusters for hyper-parameter selection can have an important effect on results (Pinto et al., 2009). Choosing hyper-parameter values is formally equivalent to the question of model selection, i.e., given a family or set of learning algorithms, how do we pick the most appropriate one inside the set? We define a hyper-parameter for a learning algorithm A as a variable to be set prior to the actual application of A to the data, one that is not directly selected by the learning algorithm itself. It is basically an outside control knob. It can be discrete (as in model selection) or continuous (such as the learning rate discussed above). Of course, one can hide these hyper-parameters by wrapping another learning algorithm, say B, around A, to select A's hyper-parameters (e.g. to minimize validation set error). We can then call B a hyper-learner, and if B has no hyper-parameters itself then the composition of B over A could be a "pure" learning algorithm, with no hyper-parameter. In the end, to apply a learner to training data, one has to have a pure learning algorithm. The hyper-parameters can be fixed by hand or tuned by an algorithm, but their value has to be selected. The value of some hyper-parameters can be selected based on the performance of A on its training data, but most cannot. For any hyper-parameter that has an impact on the effective capacity of a learner, it makes more sense to select its value based on out-of-sample data (outside the training set), e.g., validation set performance, online error, or cross-validation error. Note that some learning algorithms (in particular unsupervised learning algorithms such as algorithms for training RBMs by approximate maximum likelihood) are problematic in this respect because we cannot directly measure the quantity that is to be optimized (e.g. the likelihood), because it is intractable. On the other hand, the expected denoising reconstruction error is easy to estimate (by just averaging the denoising error over a validation set).
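Such a hyper-learner B can be as simple as a loop over candidate values, selected on a validation set; in the sketch below (illustrative only), train and validation_error are toy stand-ins for learner A and its out-of-sample evaluation:

    import random

    def train(lr, training_set):
        # Toy stand-in for learner A: a one-parameter model fit by SGD.
        theta = 0.0
        for z in training_set:
            theta -= lr * (theta - z)     # gradient of the loss (theta - z)^2 / 2
        return theta

    def validation_error(theta, validation_set):
        return sum((theta - z) ** 2 for z in validation_set) / len(validation_set)

    def hyper_learner(training_set, validation_set,
                      candidates=(1.0, 0.3, 0.1, 0.03, 0.01)):
        # B wraps A: it sets A's hyper-parameter using out-of-sample data, so the
        # composition of B over A behaves like a "pure" learning algorithm.
        best_err, best_model = float('inf'), None
        for lr in candidates:
            model = train(lr, training_set)
            err = validation_error(model, validation_set)
            if err < best_err:
                best_err, best_model = err, model
        return best_model

    data = [random.gauss(3.0, 1.0) for _ in range(200)]
    model = hyper_learner(data[:150], data[150:])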
Once some out-of-sample data has been used for selecting hyper-parameter values, it cannot be used anymore to obtain an unbiased estimator of generalization performance, so one typically uses a test set (or double cross-validation[14], in the case of small datasets) to estimate the generalization error of the pure learning algorithm (with hyper-parameter selection hidden inside).

[14] Double cross-validation applies recursively the idea of cross-validation, using an outer loop cross-validation to evaluate generalization error and then applying an inner loop cross-validation inside each outer loop split's training subset (i.e., splitting it again into training and validation folds) in order to select hyper-parameters for that split.

3.1 Neural Network Hyper-Parameters

Different learning algorithms involve different sets of hyper-parameters, and it is useful to get a sense of the kinds of choices that practitioners have to make in choosing their values. We focus here mostly on those relevant to neural networks and Deep Learning algorithms.

3.1.1 Hyper-Parameters of the Approximate Optimization

First of all, several learning algorithms can be viewed as the combination of two elements: a training criterion and a model (e.g., a family of functions, a parametrization) on the one hand, and on the other hand, a particular procedure for approximately optimizing this criterion. Correspondingly, one should distinguish hyper-parameters associated with the optimizer from hyper-parameters associated with the model itself, i.e., typically the function class, regularizer and loss function. We have already mentioned above some of the hyper-parameters typically associated with gradient-based optimization. Here is a more extensive descriptive list, focusing on those used in stochastic (mini-batch) gradient descent (although the number of training iterations is used for all iterative optimization algorithms).

• The initial learning rate (ε0 below, Eq. (2)). This is often the single most important hyper-parameter and one should always make sure that it has been tuned (up to approximately a factor of 2). Typical values for a neural network with standardized inputs (or inputs mapped to the (0,1) interval) are less than 1 and greater than 10⁻⁶, but these should not be taken as strict ranges: they greatly depend on the parametrization of the model.
A default value of 0.01 typically works for standard multi-layer neural networks, but it would be foolish to rely exclusively on this default value. If there is only time to optimize one hyper-parameter and one uses stochastic gradient descent, then this is the hyper-parameter that is worth tuning.

• The choice of strategy for decreasing or adapting the learning rate schedule (with hyper-parameters such as the time constant τ in Eq. (2) below). The default value τ → ∞ means that the learning rate is constant over training iterations. In many cases the benefit of choosing other than this default value is small. An example of an O(1/t) learning rate schedule, used in Bergstra and Bengio (2012), is

    εt = ε0 τ / max(t, τ)        (2)

which keeps the learning rate constant for the first τ steps and then decreases it in O(1/t^α), with traditional recommendations (based on asymptotic analysis of the convex case) suggesting α = 1. See Bach and Moulines (2011) for a recent analysis of the rate of convergence for the general case of α ≤ 1, suggesting that smaller values of α should be used in the non-convex case, especially when using a gradient averaging or momentum technique (see below). An adaptive and heuristic way of automatically setting τ above is to keep εt constant until the training criterion stops decreasing significantly (by more than some relative improvement threshold) from epoch to epoch. That threshold is a less sensitive hyper-parameter than τ itself. An alternative to a fixed schedule with a couple of (global) free hyper-parameters like in the above formula is the use of an adaptive learning rate heuristic, e.g., the simple procedure proposed in Bottou (2013): at regular intervals during training, using a fixed small subset of the training set (what matters is only the number of examples used, not what fraction of the whole training set it represents), continue training with N different choices of learning rate (all in parallel), and keep the value that gave the best results until the next re-estimation of the optimal learning rate. Other examples of adaptive learning rate strategies are discussed below (Sec. 6.2).
(except for a momentum hyper-parameter, if one
general case of α ≤ 1, suggesting that smaller
is used).
values of α should be used in the non-convex
case, especially when using a gradient averaging • Number of training iterations T (measured
or momentum technique (see below). An adap- in mini-batch updates). This hyper-parameter
tive and heuristic way of automatically setting is particular in that it can be optimized almost
τ above is to keep ǫt constant until the training for free using the principle of early stopping: by
criterion stops decreasing significantly (by more keeping track of the out-of-sample error (as for
than some relative improvement threshold) from example estimated on a validation set) as train-
epoch to epoch. That threshold is a less sensi- ing progresses (every N updates), one can decide
tive hyper-parameter than τ itself. An alterna- how long to train for any given setting of all the
tive to a fixed schedule with a couple of (global) other hyper-parameters. Early stopping is an
free hyper-parameters like in the above formula inexpensive way to avoid strong overfitting, i.e.,
is the use of an adaptive learning rate heuristic, even if the other hyper-parameters would yield
e.g., the simple procedure proposed in Bottou to overfitting, early stopping will considerably
(2013): at regular intervals during training, us- reduce the overfitting damage that would other-
ing a fixed small subset of the training set (what wise ensue. It also means that it hides the over-
matters is only the number of examples used, fitting effect of other hyper-parameters, possibly
not what fraction of the whole training set it obscuring the analysis that one may want to do
represents), continue training with N different when trying to figure out the effect of individual

9
For this reason, it might be useful to turn early stopping off when analyzing the effect of individual hyper-parameters. Now let us turn to implementation details. Practically, one needs to continue training beyond the selected number of training iterations T̂ (which should be the point of lowest validation error in the training run) in order to ascertain that validation error is unlikely to go lower than at the selected point. A heuristic introduced in the Deep Learning Tutorials[15] is based on the idea of patience (set initially to 10000 examples in the MLP tutorial), which is a minimum number of training examples to see after the candidate selected point T̂ before deciding to stop training (i.e. before accepting this candidate as the final answer). As training proceeds and new candidate selected points T̂ (new minima of the validation error) are observed, the patience parameter is increased, either multiplicatively or additively on top of the last T̂ found. Hence, if we find a new minimum[16] at t, we save the current best model, update T̂ ← t, and we increase our patience up to t + constant or t × constant. Note that validation error should not be estimated after each training update (that would be really wasteful) but after every N examples, where N is at least as large as the validation set (ideally several times larger so that the early stopping overhead remains small)[17].

[15] http://deeplearning.net/tutorial/
[16] Ideally, we should use a statistical test of significance and accept a new minimum (over a longer training period) only if the improvement is statistically significant, based on the size and variance estimates one can compute for the validation set.
[17] When an extra processor on the same machine is available, validation error can conveniently be recomputed by a processor different from the one performing the training updates, allowing more frequent computation of validation error.
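The patience heuristic can be written as a small training-loop skeleton; this is an illustrative sketch (train_some, validation_error and save_best are placeholders for the actual training, evaluation, and checkpointing routines, and multiplicative patience growth is the variant chosen here):

    def fit_with_patience(model, initial_patience=10000, check_every=1000,
                          patience_factor=2.0):
        # patience and t are counted in training examples seen.
        patience, t = initial_patience, 0
        best_err, t_hat = float('inf'), 0
        while t < patience:
            train_some(model, n_examples=check_every)   # placeholder training routine
            t += check_every
            err = validation_error(model)               # placeholder validation estimate
            if err < best_err:                          # new minimum of validation error
                best_err, t_hat = err, t
                save_best(model)                        # placeholder checkpointing
                patience = max(patience, int(t * patience_factor))
        return t_hat   # the selected point T-hat, in examples seen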
stochastic gradient descent with O(1/t) decrease
• Momentum β. It has long been advocated (Hinton, 1978, 2010) to temporally smooth out the stochastic gradient samples obtained during the stochastic gradient descent. For example, a moving average of the past gradients can be computed with ḡ ← (1 − β)ḡ + βg, where g is the instantaneous gradient ∂L(zt, θ)/∂θ or a mini-batch average, and β is a small positive coefficient that controls how fast the old examples get downweighted in the moving average. The simplest momentum trick is to make the updates proportional to this smoothed gradient estimator ḡ instead of the instantaneous gradient g. The idea is that it removes some of the noise and oscillations that gradient descent has, in particular in the directions of high curvature of the loss function[18]. A default value of β = 1 (no momentum) works well in many cases, but in some cases momentum seems to make a positive difference. Polyak averaging (Polyak and Juditsky, 1992) is a related form of parameter averaging[19] that has theoretical advantages and has been advocated and shown to bring improvements on some unsupervised learning procedures such as RBMs (Swersky et al., 2010). More recently, several mathematically motivated algorithms (Nesterov, 2009; Le Roux et al., 2012) have been proposed that incorporate some form of momentum and that also ensure much faster convergence (linear rather than sublinear) compared to stochastic gradient descent, at least for convex optimization problems. See also Bottou (2013) for an example of averaged SGD with successful empirical speedups in the convex case. Note however that in the pure online case (stream of examples) and under some assumptions, the sublinear rate of convergence of stochastic gradient descent with O(1/t) decrease of the learning rate is an optimal rate, at least for convex problems (Nemirovski and Yudin, 1983). That would suggest that for really large training sets it may not be possible to obtain better rates than ordinary stochastic gradient descent, albeit the constants in front (which depend on the condition number of the Hessian) may still be greatly reduced by using second-order information online (Bottou and LeCun, 2004; Bottou and Bousquet, 2008).

[18] Think about a ball coming down a valley. Since it has not started from the bottom of the valley it will oscillate between its sides as it settles deeper, forcing the learning rate to be small to avoid large oscillations that would kick it out of the valley. Averaging out the local gradients along the way will cancel the opposing forces from each side of the valley.
[19] Polyak averaging uses for predictions a moving average of the parameters found in the trajectory of stochastic gradient descent.
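In code, the moving-average form of momentum above adds two lines to the update loop (a minimal sketch; grad is a placeholder for the instantaneous or mini-batch gradient function and the constants are illustrative):

    import numpy as np

    def sgd_with_momentum(grad, theta0, lr=0.01, beta=0.1, n_steps=1000):
        theta = np.array(theta0, dtype=float)
        g_bar = np.zeros_like(theta)
        for t in range(n_steps):
            g = grad(theta, t)                        # instantaneous (or mini-batch) gradient
            g_bar = (1 - beta) * g_bar + beta * g     # moving average of past gradients
            theta -= lr * g_bar                       # update follows the smoothed estimator
        return theta

With beta = 1 the moving average reduces to the instantaneous gradient, recovering plain SGD, as noted above.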
• Layer-specific optimization hyper-parameters: although rarely done, it is possible to use different values of optimization hyper-parameters (such as the learning rate) on different layers of a multi-layer network. This is especially appropriate (and easier to do) in the context of layer-wise unsupervised pre-training, since each layer is trained separately (while the layers below are kept fixed). This would be particularly useful when the number of units per layer varies a lot from layer to layer. See the paragraph below entitled Layer-wise optimization of hyper-parameters (Sec. 3.3.4). Some researchers also advocate the use of different learning rates for the different types of parameters one finds in the model, such as biases and weights in the standard multi-layer network, but the issue becomes more important when parameters such as precision or variance are included in the lot (Courville et al., 2011).

Up to now we have only discussed the hyper-parameters in the setup where one trains a neural network by stochastic gradient descent. With other optimization algorithms, some hyper-parameters are typically different. For example, Conjugate Gradient (CG) algorithms typically have a number of line search steps (which is a hyper-parameter) and a tolerance for stopping each line search (another hyper-parameter). An optimization algorithm like L-BFGS (limited-memory Broyden–Fletcher–Goldfarb–Shanno) also has a hyper-parameter controlling the memory usage of the algorithm: the rank of the Hessian approximation kept in memory, which also has an influence on the efficiency of each step. Both CG and L-BFGS are iterative (e.g., one line search per iteration), and the number of iterations can be optimized as described above for stochastic gradient descent, with early stopping.

3.2 Hyper-Parameters of the Model and Training Criterion

Let us now turn to "model" and "criterion" hyper-parameters typically found in neural networks, especially deep neural networks.

• Number of hidden units nh. Each layer in a multi-layer neural network typically has a size that we are free to set and that controls capacity. Because of early stopping and possibly other regularizers (e.g., weight decay, discussed below), it is mostly important to choose nh large enough. Larger-than-optimal values typically do not hurt generalization performance much, but of course they require proportionally more computation (in O(nh²) if scaling all the layers at the same time in a fully connected architecture). Like for many other hyper-parameters, there is the option of allowing a different value of nh for each hidden layer[20] of a deep architecture. See the paragraph below entitled Layer-wise optimization of hyper-parameters (Sec. 3.3.4). In a large comparative study (Larochelle et al., 2009), we found that using the same size for all layers generally worked better than or as well as using a decreasing size (pyramid-like) or an increasing size (upside-down pyramid), but of course this may be data-dependent. For most tasks that we worked on, we found that an overcomplete[21] first hidden layer works better than an undercomplete one. Another, even more often validated, empirical observation is that the optimal nh is much larger when using unsupervised pre-training in a supervised neural network, e.g., going from hundreds of units to thousands of units. A plausible explanation is that after unsupervised pre-training many of the hidden units carry information that is irrelevant to the specific supervised task of interest. In order to make sure that the information relevant to the task is captured, larger hidden layers are therefore necessary when using unsupervised pre-training.

[20] A hidden layer is a group of units that is neither an input layer nor an output layer.
[21] larger than the input vector
• Weight decay regularization coefficient λ. A way to reduce overfitting is to add a regularization term to the training criterion, which limits the capacity of the learner. The parameters of machine learning models can be regularized by pushing them towards a prior value, which is typically 0. L2 regularization adds a term λ∑ᵢ θᵢ² to the training criterion, while L1 regularization adds a term λ∑ᵢ |θᵢ|. Both types of terms can be included. There is a clean Bayesian justification for such a regularization term: it is the negative log-prior − log P(θ) on the parameters θ. The training criterion then corresponds to the negative joint likelihood of data and parameters, − log P(data, θ) = − log P(data|θ) − log P(θ), with the loss function L(z, θ) being interpreted as − log P(z|θ) and − log P(data|θ) = ∑_{t=1}^{T} L(zt, θ) if the data consists of T i.i.d. examples zt. This detail is important to note because when one is doing stochastic gradient-based learning, it makes sense to use an unbiased estimator of the gradient of the total training criterion (including both the total loss and the regularizer), but one only considers a single mini-batch or example at a time. How should the regularizer be weighted in this sum, which is different from the sum of the regularizer and the total loss on all examples? On each mini-batch update, the gradient of the regularization penalty should be multiplied not just by λ but also by B/T, i.e., one over the number of updates needed to go once through the training set. When the training set size is not a multiple of B, the last mini-batch will have size B′ < B and the contribution of the regularizer to the mini-batch gradient should therefore be modified accordingly (i.e. scaled by B′/B compared to the other mini-batches). In the pure online setting (where there is no training set size fixed in advance, and no iterating again over the examples), it would then make sense to use B/t at example t, i.e., one over the number of updates to date. L2 regularization penalizes large values more strongly and corresponds to a Gaussian prior ∝ exp(−½ ||θ||²/σ²) with prior variance σ² = 1/(2λ). Note that there is a connection between early stopping (see above, choosing the number of training iterations) and L2 regularization (Collobert and Bengio, 2004a), with one basically playing the same role as the other (but early stopping allows a much more efficient selection of the hyper-parameter value, which suggests dropping L2 regularization altogether when early stopping is used). However, L1 regularization behaves differently and can sometimes be useful, acting as a form of feature selection. L1 regularization makes sure that parameters that are not really very useful are driven to zero (i.e. encouraging sparsity of the parameter values), and corresponds to a Laplace density prior ∝ e^{−|θ|/s} with scale parameter s = 1/λ. L1 regularization often helps to make the input filters[22] cleaner (more spatially localized) and easier to interpret. Stochastic gradient descent will not yield actual zeros but values hovering around zero. If both L1 and L2 regularization are used, a different coefficient (i.e. a different hyper-parameter) should be considered for each, and one may also use a different coefficient for different layers. In particular, the input weights and the output weights may be treated differently.

One reason for treating output weights differently (i.e., not relying only on early stopping) is that we know that it is sufficient to regularize only the output weights in order to constrain capacity: in the limit case of the number of hidden units going to infinity, L2 regularization corresponds to Support Vector Machines (SVM) while L1 regularization corresponds to boosting (Bengio et al., 2006a). Another reason for treating inputs and outputs differently from hidden units is that they may be sparse. For example, some input features may be 0 most of the time while others are non-zero frequently. In that case, there are fewer examples that inform the model about that rarely active input feature, and the corresponding parameters (weights outgoing from the corresponding input units) should be more regularized than the parameters associated with frequently observed inputs.

[22] The input weights of a 1st layer neuron are often called "filters" because of analogies with signal processing techniques such as convolutions.
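The B/T scaling just described can be made explicit in code (an illustrative sketch for the L2 penalty, with loss_grad a placeholder per-example gradient; here the mini-batch loss gradient is summed, which is the convention under which the penalty's share is B/T):

    import numpy as np

    def regularized_batch_grad(loss_grad, theta, batch, T, lam):
        # Gradient of one mini-batch's share of the total training criterion
        # sum_t L(z_t, theta) + lam * ||theta||^2 (L2 weight decay).
        B = len(batch)
        g_loss = np.sum([loss_grad(theta, z) for z in batch], axis=0)
        # The penalty's gradient 2 * lam * theta is split across the ~T/B updates
        # of one epoch: multiply by B/T (and by B'/T for a short final batch).
        return g_loss + (B / T) * 2 * lam * theta

Summing this quantity over all the mini-batches of one epoch recovers exactly the gradient of the total criterion, which is the unbiasedness property mentioned above.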
A similar situation may occur with target variables that are sparse (e.g., trying to predict rarely observed events). In both cases, the effective number of meaningful updates seen by these parameters is less than the actual number of updates. This suggests scaling the regularization coefficient of these parameters by one over the effective number of updates seen by the parameter. A related formula turns up in Bayesian probit regression applied to sparse inputs (Graepel et al., 2010).

Some practitioners also choose to penalize only the weights w and not the biases b associated with the hidden unit activations w′z + b for a unit taking the vector of values z as input. This guarantees that even with strong regularization, the predictor would converge to the optimal constant predictor, rather than the one corresponding to 0 activation. For example, with the mean-square loss and the cross-entropy loss, the optimal constant predictor is the output average.

• Sparsity of activation regularization coefficient α. A common practice in the Deep Learning literature (Ranzato et al., 2007, 2008b; Lee et al., 2008, 2009; Bagnell and Bradley, 2009; Glorot et al., 2011a; Coates and Ng, 2011; Goodfellow et al., 2011) consists in adding a penalty term to the training criterion that encourages the hidden units to be sparse, i.e., with values at or near 0. Although the L1 penalty (discussed above in the case of weights) can also be applied to hidden unit activations, this is mathematically very different from the L1 regularization term on parameters. Whereas the latter corresponds to a prior on the parameters, the former does not, because it involves the training distribution (since we are looking at data-dependent hidden unit outputs). Although we will not discuss this much here, the inspiration for a sparse representation in Deep Learning comes from the earlier work on sparse coding (Olshausen and Field, 1997). As discussed in Goodfellow et al. (2009), sparse representations may be advantageous because they encourage representations that disentangle the underlying factors of representation. A sparsity-inducing penalty is also a way to regularize (in the sense of reducing the number of examples that the learner can learn by heart) (Ranzato et al., 2008b), which means that the sparsity coefficient is likely to interact with the many other hyper-parameters which influence capacity. In general, increased sparsity can be compensated by a larger number of hidden units.

Several approaches have been proposed to induce a sparse representation (or one with more hidden units whose activation is closer to 0). One approach (Ranzato et al., 2008b; Le et al., 2011; Zou et al., 2011) is simply to penalize the L1 norm of the representation or another function of the hidden units' activation (such as the Student-t log-prior). This typically makes sense for non-linearities such as the sigmoid which have a saturating output around 0, but not for the hyperbolic tangent non-linearity (whose saturation is near the -1 and 1 interval borders rather than near the origin). Another option is to penalize the biases of the hidden units, to make them more negative (Ranzato et al., 2007; Lee et al., 2008; Goodfellow et al., 2009; Larochelle and Bengio, 2008). Note that penalizing the bias runs the danger that the weights could compensate for the bias[23], which could hurt the numerical optimization of parameters. When directly penalizing the hidden unit outputs, several variants can be found in the literature, but no clear comparative analysis has been published to evaluate which one works better. Although the L1 penalty (i.e., simply α times the sum of output elements hj in the case of sigmoid non-linearity) would seem the most natural (because of its use in sparse coding), it is used in few papers involving sparse auto-encoders. A close cousin of the L1 penalty is the Student-t penalty, log(1 + hj²), originally proposed for sparse coding (Olshausen and Field, 1997).

[23] because the input to the layer generally has a non-zero average, which, when multiplied by the weights, acts like a bias
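For concreteness, the activation penalties discussed here and just below can be sketched as follows (illustrative only, not from the text; alpha and the target ρ = 0.05 are the hyper-parameters in question):

    import numpy as np

    def sparsity_penalties(h, alpha=0.01, rho=0.05):
        # h: (batch_size, n_hidden) matrix of sigmoid activations in (0, 1).
        l1 = alpha * np.sum(np.abs(h))                   # L1 penalty on activations
        student_t = alpha * np.sum(np.log(1 + h ** 2))   # Student-t penalty
        h_bar = h.mean(axis=0)                           # average output per unit
        # KL divergence to the binomial with probability rho (constant included).
        kl = alpha * np.sum(rho * np.log(rho / h_bar)
                            + (1 - rho) * np.log((1 - rho) / (1 - h_bar)))
        return l1, student_t, kl

    h = 1.0 / (1.0 + np.exp(-np.random.RandomState(0).randn(32, 100)))
    print(sparsity_penalties(h))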
eral researchers penalize the average output h̄j pre-training, but works well for auto-encoder
(e.g. over a mini-batch), and instead of pushing variants24 . For output (or reconstruction) units,
it to 0, encourage it to approach a fixed target ρ. hard neuron non-linearities like the rectifier do
This can be donePthrough a mean-square error not make sense because when the unit is satu-
2
penalty such as j (ρ − h̄j ) , or maybe more rated (e.g. a < 0 for the rectifier) and associ-
sensibly (because hj behaves like a probabil- ated with a loss, no gradient is propagated in-
ity), a Kullback-Liebler divergence with respect side the network, i.e., there is no chance to cor-
to the binomial distribution with probability ρ, rect the error25 . In the case of hidden layers the
−ρ log h̄j − (1 − ρ) log(1 − h̄j )+constant, e.g., gradient manages to go through a subset of the
with ρ = 0.05, as in (Hinton, 2010). In addition hidden units, even if the others are saturated.
to the regularization penalty itself, the choice For output units a good trick is to obtain the
of activation function can have a strong impact output non-linearity and the loss by considering
on the sparsity obtained. In particular, rectify- the associated negative log-likelihood and choos-
ing non-linearities (such as max(0, x), instead of ing an appropriate (conditional) output proba-
a sigmoid) have been very successful in several bility model, usually in the exponential family.
instances (Jarrett et al., 2009; Nair and Hinton, For example, one can typically take squared er-
2010; Glorot et al., 2011a; Mesnil et al., 2011; ror and linear outputs to correspond to a Gaus-
Glorot et al., 2011b). The rectifier also re- sian output model, cross-entropy and sigmoids
lates to the hard tanh (Collobert and Bengio, to correspond to a binomial output model, and
2004b), whose derivatives are also 0 or 1. − log output[target class] with softmax outputs
In sparse coding and sparse predictive cod- to correspond to multinomial output variables.
ing (Kavukcuoglu et al., 2009) the activations For reasons yet to be elucidated, having a sig-
are directly optimized and actual zeros are the moidal non-linearity on the output (reconstruc-
expected result of the optimization. In that tion) units (along with target inputs normalized
case, ordinary stochastic gradient is not guaran- in the (0,1) interval) seems to be helpful when
teed to find these zeros (it will oscillate around) training the contractive auto-encoder.
and other methods such as proximal gradient are
more appropriate (Bertsekas, 2010). • Weights initialization scaling coefficient.
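As a concrete illustration, here is a minimal NumPy sketch of the two average-activation penalties just mentioned (the function name and the fake activations are ours, for illustration only):

    import numpy as np

    def sparsity_penalties(H, rho=0.05):
        # H: (mini-batch size, num hidden) matrix of sigmoid activations in (0, 1)
        h_bar = H.mean(axis=0)                  # average output of each hidden unit
        mse = np.sum((rho - h_bar) ** 2)        # sum_j (rho - h_bar_j)^2
        # KL divergence to a binomial with probability rho (up to a constant):
        kl = np.sum(-rho * np.log(h_bar) - (1 - rho) * np.log(1 - h_bar))
        return mse, kl

    H = 1 / (1 + np.exp(-np.random.randn(32, 100)))   # fake sigmoid activations
    mse_penalty, kl_penalty = sparsity_penalties(H)   # add e.g. alpha * kl_penalty to the loss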
• Neuron non-linearity. The typical neuron output is s(a) = s(w′x + b), where x is the vector of inputs into the neuron, w the vector of weights and b the offset or bias parameter, while s is a scalar non-linear function. Several non-linearities have been proposed and some choices of non-linearities have been shown to be more successful (Jarrett et al., 2009; Glorot and Bengio, 2010; Glorot et al., 2011a). The most commonly used by the author, for hidden units, are the sigmoid 1/(1 + e^−a), the hyperbolic tangent (e^a − e^−a)/(e^a + e^−a), the rectifier max(0, a) and the hard tanh (Collobert and Bengio, 2004b). Note that the sigmoid was shown to yield serious optimization difficulties when used as the top hidden layer of a deep supervised network (Glorot and Bengio, 2010) without unsupervised pre-training, but works well for auto-encoder variants24. For output (or reconstruction) units, hard neuron non-linearities like the rectifier do not make sense because when the unit is saturated (e.g. a < 0 for the rectifier) and associated with a loss, no gradient is propagated inside the network, i.e., there is no chance to correct the error25. In the case of hidden layers the gradient manages to go through a subset of the hidden units, even if the others are saturated. For output units a good trick is to obtain the output non-linearity and the loss by considering the associated negative log-likelihood and choosing an appropriate (conditional) output probability model, usually in the exponential family. For example, one can typically take squared error and linear outputs to correspond to a Gaussian output model, cross-entropy and sigmoids to correspond to a binomial output model, and −log output[target class] with softmax outputs to correspond to multinomial output variables. For reasons yet to be elucidated, having a sigmoidal non-linearity on the output (reconstruction) units (along with target inputs normalized in the (0,1) interval) seems to be helpful when training the contractive auto-encoder.

24 The author hypothesizes that this discrepancy is due to the fact that the weight matrix W of an auto-encoder of the form r(x) = W^T sigmoid(W x) is pulled towards being orthonormal, since this would make the auto-encoder closer to the identity function, because W^T W x ≈ x when W is orthonormal and x is in the span of the rows of W.

25 A hard non-linearity for the output units is very different from a hard non-linearity in the loss function, such as the hinge loss. In the latter case the derivative is 0 only when there is no error.
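The pairing of output non-linearity and loss through a conditional likelihood can be made concrete with a short sketch; the following NumPy functions (our own illustration, with hypothetical names) implement the three negative log-likelihoods mentioned above:

    import numpy as np

    def gaussian_nll(a, target):
        # linear output units + squared error <-> Gaussian output model
        return 0.5 * np.sum((a - target) ** 2)

    def binomial_nll(a, target):
        # sigmoid output units + cross-entropy <-> binomial output model
        p = 1 / (1 + np.exp(-a))
        return -np.sum(target * np.log(p) + (1 - target) * np.log(1 - p))

    def multinomial_nll(a, target_class):
        # softmax output units + negative log-likelihood of the target class
        e = np.exp(a - a.max())                 # shift by max for numerical stability
        return -np.log(e[target_class] / e.sum())

    loss = multinomial_nll(np.array([1.0, -2.0, 0.5]), target_class=0)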
• Weights initialization scaling coefficient. Biases can generally be initialized to zero but weights need to be initialized carefully to break the symmetry between hidden units of the same layer26. Because different output units receive different gradient signals, this symmetry breaking issue does not concern the output weights (into the output units), which can therefore also be set to zero. Although several tricks (LeCun et al., 1998a; Glorot and Bengio, 2010) for initializing the weights into hidden layers have been proposed (i.e. a hyper-parameter is the discrete choice between them), Bergstra and Bengio (2012) also inserted as an extra hyper-parameter a scaling coefficient for the initialization range. These tricks are based on the idea that units with more inputs (the fan-in of the unit) should have smaller weights. Both LeCun et al. (1998a) and Glorot and Bengio (2010) recommend scaling by the inverse of the square root of the fan-in, although Glorot and Bengio (2010) and the Deep Learning Tutorials use a combination of the fan-in and fan-out, e.g., sample a Uniform(−r, r) with r = √(6/(fan-in + fan-out)) for hyperbolic tangent units and r = 4√(6/(fan-in + fan-out)) for sigmoid units (see the short sketch at the end of this item). We have found that we could avoid any hyper-parameter related to initialization using these formulas (and the derivation in Glorot and Bengio (2010) can be used to derive the formula for other settings). Note however that in the case of RBMs, a zero-mean Gaussian with a small standard deviation around 0.1 or 0.01 works well (Hinton, 2010) to initialize the weights, while visible biases are typically set to their optimal value if the weights were 0, i.e., log(x̄/(1 − x̄)) in the case of a binomial visible unit whose corresponding binary input feature has empirical mean x̄ in the training set.

26 By symmetry, if hidden units of the same layer share the same input and output weights, they will compute the same output and receive the same gradient, hence performing the same update and remaining identical, thus wasting capacity.

An important choice is whether one should use unsupervised pre-training (and which unsupervised feature learning algorithm to use) in order to initialize parameters. In most settings we have found unsupervised pre-training to help and very rarely to hurt, but of course that implies additional training time and additional hyper-parameters.
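A minimal sketch of the uniform initialization recipe mentioned above (our own code, assuming a fully-connected layer):

    import numpy as np
    rng = np.random.RandomState(1234)

    def init_weights(fan_in, fan_out, sigmoid=False):
        # Uniform(-r, r) with r = sqrt(6 / (fan_in + fan_out)) for tanh units,
        # and 4 times larger for sigmoid units; biases start at zero.
        r = np.sqrt(6.0 / (fan_in + fan_out))
        if sigmoid:
            r *= 4.0
        W = rng.uniform(-r, r, size=(fan_in, fan_out))
        b = np.zeros(fan_out)
        return W, b

    W, b = init_weights(784, 500)   # e.g., a tanh layer on 28x28 pixel inputs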
• Random seeds. There are often several sources of randomness in the training of neural networks and deep learners (such as for random initialization, sampling examples, sampling hidden units in stochastic models such as RBMs, or sampling corruption noise in denoising auto-encoders). Some random seeds could therefore yield better results than others. Because of the presence of local minima in the training criterion of neural networks (except in the linear case or with fixed lower layers), parameter initialization matters. See Erhan et al. (2010b) for an example of histograms of test errors for hundreds of different random seeds. Typically, the choice of random seed only has a slight effect on the result and can mostly be ignored in general or for most of the hyper-parameter search process. If computing power is available, then a final set of jobs with different random seeds (5 to 10) for a small set of best choices of hyper-parameter values can squeeze a bit more performance. Another way to exploit computing power to push performance a bit is model averaging, as in Bagging (Breiman, 1994) and Bayesian methods. After training them, the outputs of different networks (or in general different learning algorithms) can be averaged. For example, the difference between the neural networks being averaged into a committee may come from the different seeds used for parameter initialization, or the use of different subsets of input variables, or different subsets of training examples (the latter being called Bagging).

• Preprocessing. Many preprocessing steps have been proposed to massage raw data into appropriate inputs for neural networks, and model selection must also choose among them. In addition to element-wise standardization (subtract the mean and divide by the standard deviation), Principal Components Analysis (PCA) has often been advocated (LeCun et al., 1998a; Bergstra and Bengio, 2012) and also allows dimensionality reduction, at the price of an extra hyper-parameter (the number of principal components retained, or the proportion of variance explained). A convenient non-linear preprocessing is the uniformization (Mesnil et al., 2011) of each feature, which estimates the cumulative distribution F_i of the feature and then transforms each value x_i into F_i(x_i), i.e., returns an approximate normalized rank or quantile for the value x_i. A simpler to compute transform that may help reduce the tails of input features is a non-linearity such as the logarithm or the square root, in an attempt to make them more Gaussian-like.
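As an illustration, a small NumPy sketch of the element-wise standardization and of the rank-based uniformization described above (a simple empirical-CDF estimate; the function names are ours):

    import numpy as np

    def standardize(X, eps=1e-8):
        # element-wise: subtract the mean, divide by the standard deviation
        return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

    def uniformize(X):
        # replace each value by its normalized rank within its feature,
        # i.e., an estimate of the cumulative distribution F_i(x_i) in (0, 1)
        ranks = X.argsort(axis=0).argsort(axis=0)
        return (ranks + 1.0) / (X.shape[0] + 1.0)

    X = np.random.randn(1000, 20) ** 3    # heavy-tailed fake training data
    X_std, X_unif = standardize(X), uniformize(X)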
In addition to the above somewhat generic choices, more choices arise with different architectures and learning algorithms. For example, the denoising auto-encoder has a hyper-parameter scaling the amount of input corruption and the contractive auto-encoder has as hyper-parameter a coefficient scaling the norm of the Jacobian of the encoder, i.e., controlling the importance of the contraction penalty. The latter seems to be a rather sensitive hyper-parameter that must be tuned carefully. The contractive auto-encoder's success also seems sensitive to the weight tying constraint used in many auto-encoder architectures: the decoder's weight matrix is equal to the transpose of the encoder's weight matrix. The specific architecture used in the contractive auto-encoder (with tied weights, sigmoid non-linearities on hidden and reconstruction units, along with squared loss or cross-entropy loss) works quite well but other related variants do not always train well, for reasons that remain to be understood.

There are also many architectural choices that are relevant in the case of convolutional architectures (e.g. for modeling images, time-series or sound) (LeCun et al., 1989, 1998b; Le et al., 2010) in which hidden units have local receptive fields. Their discussion is postponed to another chapter (LeCun, 2013).

3.3 Manual Search and Grid Search

Many of the hyper-parameters or model choices described above can be ignored by picking a standard trick suggested here or in some other paper. Still, one remains with a substantial number of choices to be made, which may give the impression of neural network training as an art. With modern computing facilities based on large computer clusters, it is however possible to make the optimization of hyper-parameters a more reproducible and automated process, using techniques such as grid search or better, random search, or even hyper-parameter optimization, discussed below.

3.3.1 General guidance for the exploration of hyper-parameters

First of all, let us consider recommendations for exploring hyper-parameter settings, whether with manual search, with an automated procedure, or with a combination of both. We call a numerical hyper-parameter one that involves choosing a real number or an integer (where order matters), as opposed to making a discrete symbolic choice from an unordered set. Examples of numerical hyper-parameters are regularization coefficients, number of hidden units, number of training iterations, etc. One has to think of hyper-parameter selection as a difficult form of learning: there is both an optimization problem (looking for hyper-parameter configurations that yield low validation error) and a generalization problem: there is uncertainty about the expected generalization after optimizing validation performance, and it is possible to overfit the validation error and get optimistically biased estimators of performance when comparing many hyper-parameter configurations. The training criterion for this learning is typically the validation set error, which is a proxy for generalization error. Unfortunately, the relation between hyper-parameters and validation error can be complicated. Although to a first approximation we expect a kind of U-shaped curve (when considering only a single hyper-parameter, the others being fixed), this curve can also have noisy variations, in part due to the use of finite data sets.

• Best value on the border. When considering the validation error obtained for different values of a numerical hyper-parameter, one should pay attention to whether or not the best value found is near the border of the investigated interval. If it is near the border, then this suggests that better values can be found beyond the border: it is recommended in that case to explore further, beyond that border. Because the relation between a hyper-parameter and validation error can be noisy, it is generally not enough to try very few values. For instance, trying only 3 values for a numerical hyper-parameter is insufficient, even if the best value found is the middle one.

• Scale of values considered. Exploring values of a numerical hyper-parameter entails choosing a starting interval to be searched, which is therefore a kind of hyper-hyper-parameter. By choosing the interval large enough to start with, but based on previous experience with this hyper-parameter, we ensure that we do not get completely wrong results. Now instead of choosing the intermediate values linearly in the chosen interval, it often makes much more sense to consider a linear or uniform sampling in the log-domain (in the space of the logarithm of the hyper-parameter). For example, the results obtained with a learning rate of 0.01 are likely to be very similar to the results with 0.011, while results with 0.001 could be quite different from results with 0.002, even though the absolute difference is the same in both cases. The ratio between different values is often a better guide of the expected impact of the change. That is why exploring uniformly or regularly-spaced values in the space of the logarithm of the numerical hyper-parameter is typically preferred for positive-valued numerical hyper-parameters, as in the short sketch below.
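For instance, with NumPy (the interval bounds here are only an example):

    import numpy as np

    # 6 candidate learning rates between 10^-4 and 1, regularly spaced in the
    # log-domain: consecutive values differ by a constant factor (about 6.3x),
    # not by a constant difference.
    learning_rates = np.logspace(-4, 0, num=6)
    # approximately [1.0e-4, 6.3e-4, 4.0e-3, 2.5e-2, 1.6e-1, 1.0]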
• Computational considerations. Validation error is actually not the only measure to consider in selecting hyper-parameters. Often, one has to consider computational cost, either of training or prediction. Computing resources for training and prediction are limited and generally condition the choice of intervals of considered values: for example, increasing the number of hidden units or the number of training iterations also scales up computation. An interesting idea is to use computationally cheap estimators of validation error to select some hyper-parameters. For example, Saxe et al. (2011) showed that the architecture hyper-parameters of convolutional networks could be selected using random weights in the lower layers of the network (filters of the convolution). While this yields a noisy and biased (pessimistic) estimator of the validation error which would otherwise be obtained with full training, this cheap estimator appears to be correlated with the expensive validation error. Hence this cheap estimator is enough for selecting some hyper-parameters (or for keeping under consideration for further and more expensive evaluation only the few best choices found). Even without cheap estimators of generalization error, high-throughput computing (e.g., on clusters, GPUs, or clusters of GPUs) can be exploited to run not just hundreds but thousands of training jobs, something not conceivable only a few years ago, with each job taking on the order of hours or days for larger datasets. With computationally cheap surrogates, some researchers have run on the order of ten thousand trials, and we can expect future advances in parallelized computing power to boost these numbers.

3.3.2 Coordinate Descent and Multi-Resolution Search

When performing a manual search and with access to only a single computer, a reasonable strategy is coordinate descent: change only one hyper-parameter at a time, always making a change from the best configuration of hyper-parameters found up to now. Instead of a standard coordinate descent (which systematically cycles through all the variables to be optimized) one can make sure to regularly fine-tune the most sensitive variables, such as the learning rate.

Another important idea is that there is no point in exploring the effect of fine changes before one or more reasonably good settings have been found. The idea of multi-resolution search is to start the search by considering only a few values of the numerical hyper-parameters (over a large range), or considering large changes each time a new value is tried. One can then start from the one or few best configurations found and explore more locally around them with smaller variations around these values.

3.3.3 Automated and Semi-automated Grid Search

Once some interval or set of values has been selected for each hyper-parameter (thus defining a search space), a simple strategy that exploits parallel computing is the grid search. One first needs to convert the numerical intervals into lists of values (e.g., K regularly-spaced values in the log-domain of the hyper-parameter). The grid search is simply an exhaustive search through all the combinations of these values. The cross-product of these lists contains a number of elements that is unfortunately exponential in the number of hyper-parameters (e.g., with 5 hyper-parameters, each allowed to take 6 different values, one gets 6^5 = 7776 configurations). In section 3.4 below we consider an approach that works more efficiently than the grid search when the number of hyper-parameters increases beyond 2 or 3.
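In code, the grid is just a cross-product of the per-hyper-parameter lists; a minimal sketch (the particular hyper-parameters and values are our own illustrative choices):

    import itertools

    # a 5-hyper-parameter grid with 6 values each: 6^5 = 7776 jobs
    grid = {
        'learning_rate': [1e-4, 1e-3, 1e-2, 1e-1, 0.5, 1.0],
        'n_hidden':      [64, 128, 256, 512, 1024, 2048],
        'l2_penalty':    [0.0, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2],
        'batch_size':    [1, 8, 32, 64, 128, 256],
        'n_layers':      [1, 2, 3, 4, 5, 6],
    }
    names = sorted(grid)
    configs = [dict(zip(names, values))
               for values in itertools.product(*(grid[n] for n in names))]
    assert len(configs) == 6 ** 5   # 7776 configurations to launch in parallel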
The advantage of the grid search, compared to many other optimization strategies (such as coordinate descent), is that it is fully parallelizable. If a large computer cluster is available, it is tempting to choose a model selection strategy that can take advantage of parallelization. One practical disadvantage of grid search (especially against random search, Sec. 3.4), with a parallelized set of jobs on a cluster, is that if only one of the jobs fails27 then one has to launch another volley of jobs to complete the grid (and yet a third one if any of these fails, etc.), thus multiplying the overall computing time.

27 For all kinds of hardware and software reasons, a job failing is very common.

Typically, a single grid search is not enough and practitioners tend to proceed with a sequence of grid searches, each time adjusting the ranges of values considered based on the previous results obtained. Although this can be done manually, this procedure can also be automated by considering the idea of multi-resolution search to guide this outer loop. Different, more local, grid searches can be launched in the neighborhood of the best solutions found previously. In addition, the idea of coordinate descent can also be thrown in, by making each grid search focus on only a few of the hyper-parameters. For example, it is common practice to start by exploring the initial learning rate while keeping fixed (and initially constant) the learning rate descent schedule. Once the shape of the schedule has been chosen, it may be possible to further refine the learning rate, but in a smaller interval around the best value found.

Humans can get very good at performing hyper-parameter search, and having a human in the loop also has the advantage that it can help detect bugs or unwanted or unexpected behavior of a learning algorithm. However, for the sake of reproducibility, machine learning researchers should strive to use procedures that do not involve human decisions in the middle, only at the outset (e.g., setting hyper-parameter ranges, which can be specified in a paper describing the experiments).

3.3.4 Layer-wise optimization of hyper-parameters

In the case of Deep Learning with unsupervised pre-training there is an opportunity for combining coordinate descent and cheap relative validation set performance evaluation associated with some hyper-parameter choices. The idea, described by Mesnil et al. (2011); Bengio (2011), is to perform greedy choices for the hyper-parameters associated with lower layers (near the input) before training the higher layers. One first trains (unsupervised) the first layer with different hyper-parameter values and somehow estimates the relative validation error that would be obtained from these different configurations if the final network only had this single layer as internal representation. In the common case where the ultimate task is supervised, it means training a simple supervised predictor (e.g. a linear classifier) on top of the learned representation. In the case of a linear predictor (e.g. regression or logistic regression) this can even be done on the fly while unsupervised training of the representation progresses (i.e. it can be used for early stopping as well), as in (Larochelle et al., 2009). Once a set of apparently good (according to this greedy evaluation) hyper-parameter values has been found (or possibly using only the best one found), these good values can be used as a starting point to train (and hyper-optimize) a second layer in the same way, etc. The completely greedy approach is to keep only the best configuration up to now (for the lower layers), but keeping the K best configurations overall only multiplies the computational cost of hyper-parameter selection by K for layers beyond the first one, because we would still keep only the best K configurations from all the 1st layer and 2nd layer hyper-parameters as starting points for exploring 3rd layer hyper-parameters, etc. This procedure is formalized in Algorithm 1 below. Since greedy layer-wise pre-training does not modify the lower layers when pre-training the upper layers, this is also very efficient computationally. This procedure allows one to set the hyper-parameters associated with the unsupervised pre-training stage, and then there remain hyper-parameters to be selected for the supervised fine-tuning stage, if one is desired. A final supervised fine-tuning stage is strongly suggested, especially when there are many labeled examples (Lamblin and Bengio, 2010).

Algorithm 1: Greedy layer-wise hyper-parameter optimization.

input K: number of best configurations to keep at each level.
input NLEVELS: number of levels of the deep architecture
input LEVELSETTINGS: list of hyper-parameter settings to be considered for unsupervised pre-training of a level
input SFTSETTINGS: list of hyper-parameter settings to be considered for supervised fine-tuning

Initialize set of best configurations S = ∅
for L = 1 to NLEVELS do
    for C in LEVELSETTINGS do
        for H in (S or {∅}) do
            * Pretrain level L using hyper-parameter setting C for level L and the parameters obtained with setting H for lower levels.
            * Evaluate target task performance ℒ using this depth-L pre-trained architecture (e.g. train a linear classifier on top of these layers and estimate validation error).
            * Push the pair (C ∪ H, ℒ) into S if it is among the K best performing of S.
        end for
    end for
end for
for C in SFTSETTINGS do
    for H in S do
        * Supervised fine-tuning of the pre-trained architecture associated with H, using supervised fine-tuning hyper-parameter setting C.
        * Evaluate target task performance ℒ of this fine-tuned predictor (e.g. validation error).
        * Push the pair (C ∪ H, ℒ) into S if it is among the K best performing of S.
    end for
end for
output S: the set of K best-performing models with their settings and validation performance.

3.4 Random Sampling of Hyper-Parameters

A serious problem with the grid search approach to find good hyper-parameter configurations is that it scales exponentially badly with the number of hyper-parameters considered. In the above sections we have discussed numerous hyper-parameters and if all of them were to be explored at the same time it would be impossible to use only a grid search to do so. One may think that there are no other options simply because this is an instance of the curse of dimensionality. But as we have found in our work on Deep Learning (Bengio, 2009), if there is some structure in a target function we are trying to discover, then there is a chance to find good solutions without paying an exponential price. It turns out that in many practical cases we have encountered, there is a kind of structure that random sampling can exploit (Bergstra and Bengio, 2012). The idea of random sampling is to replace the regular grid by a random (typically uniform) sampling. Each tested hyper-parameter configuration is selected by independently sampling each hyper-parameter from a prior distribution (typically uniform in the log-domain, inside the interval of interest). For a discrete hyper-parameter, a multinomial distribution can be defined according to our prior beliefs on the likely good values. At worst, i.e., with no prior preference at all, this would be a uniform distribution across the allowed values. In fact, we can use our prior knowledge to make this prior distribution quite sophisticated. For example, we can readily include knowledge that some values of some hyper-parameters only make sense in the context of other particular values of hyper-parameters. This is a practical consideration for example when considering layer-specific hyper-parameters when the number of layers itself is a hyper-parameter.
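A minimal sketch of such a sampler (the hyper-parameter names, intervals and prior probabilities below are illustrative assumptions, not prescriptions):

    import numpy as np
    rng = np.random.RandomState(0)

    def sample_config():
        # each hyper-parameter is drawn independently from its prior:
        # log-uniform for positive numerical ones, multinomial for discrete ones
        return {
            'learning_rate': np.exp(rng.uniform(np.log(1e-4), np.log(1.0))),
            'l2_penalty':    np.exp(rng.uniform(np.log(1e-7), np.log(1e-2))),
            'n_hidden':      int(2 ** rng.uniform(5, 11)),      # roughly 32 to 2048
            'activation':    rng.choice(['tanh', 'sigmoid', 'rectifier'],
                                        p=[0.4, 0.2, 0.4]),     # prior beliefs
        }

    trials = [sample_config() for _ in range(100)]   # 100 independent jobs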
The experiments performed (Bergstra and Bengio, 2012) show that random sampling can be many times more efficient than grid search as soon as the number of hyper-parameters goes beyond the 2 or 3 typically seen with SVMs and vanilla neural networks. The main reason why faster convergence is observed is that random sampling allows one to explore more values for each hyper-parameter, whereas in grid search the same value of a hyper-parameter is repeated in exponentially many configurations (of all the other hyper-parameters). In particular, if only a small subset of the hyper-parameters really matters, then this procedure can be shown to be exponentially more efficient. What we found is that for different datasets and architectures, the subset of hyper-parameters that mattered most was different, but it was often the case that a few hyper-parameters made a big difference (and the learning rate is always one of them!). When marginalizing (by averaging or minimizing) the validation performance to visualize the effect of one or two hyper-parameters, we get a noisier picture using a random search compared to a grid search, because of the random variations of the other hyper-parameters, but one with much more resolution, because so many more different values have been considered. Practically, one can plot the curves of best validation error as the number of random trials28 is increased (with mean and standard deviation, obtained by considering, for each choice of number of trials, all possible same-size subsets of trials), and this curve tells us that we are approaching a plateau, i.e., it tells us whether it is worth it or not to continue launching jobs, i.e., we can perform a kind of early stopping in the outer optimization over hyper-parameters. Note that one should distinguish the curve of the "best trial in first N trials" from the curve of the mean (and standard deviation) of the "best in a subset of size N". The latter is a better statistical representative of the improvements we should expect if we increase the number of trials. Even if the former has a plateau, the latter may still be on the increase, pointing to the need for more hyper-parameter configuration samples, i.e., more trials (Bergstra and Bengio, 2012). Comparing these curves with the equivalent obtained from grid search, we see faster convergence with random search. On the other hand, note that one advantage of grid search compared to random sampling is that the qualitative analysis of results is easier because one can consider variations of a single hyper-parameter with all the other hyper-parameters being fixed. It may remain a valid option to do a small grid search around the best solutions found by random search, considering only the hyper-parameters that were found to matter or which concern a scientific question of interest29.

28 Each random trial corresponding to a training job with a particular choice of hyper-parameter values.

29 This is often the case in machine learning research, e.g., does depth of architecture matter? Then we need to control accurately for the effect of depth, with all other hyper-parameters optimized for each value of depth.
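The "best in a subset of size N" statistic can be estimated by resampling; a rough sketch (our own code, not from a specific library):

    import numpy as np
    rng = np.random.RandomState(0)

    def best_in_subset_curve(valid_errors, n_resamples=1000):
        # for each subset size N, estimate the mean and standard deviation of
        # min(validation errors of N trials drawn without replacement)
        errors = np.asarray(valid_errors)
        T = len(errors)
        means, stds = [], []
        for N in range(1, T + 1):
            best = [rng.permutation(errors)[:N].min() for _ in range(n_resamples)]
            means.append(np.mean(best))
            stds.append(np.std(best))
        return np.array(means), np.array(stds)

    # valid_errors[i] = validation error of the i-th random trial
    means, stds = best_in_subset_curve(rng.uniform(0.1, 0.5, size=50))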
Random search maintains the advantage of easy parallelization provided by grid search and improves on it. Indeed, a practical advantage of random search compared to grid search is that if one of the jobs fails then there is no need to re-launch that job. It also means that if one has launched 100 random search jobs, and finds that the convergence curve still has an interesting slope, one can launch another 50 or 100 without wasting the first 100. It is not that simple to combine the results of two grid searches because they are not always compatible (i.e., one is not a subset of the other).

Finally, although random search is a useful addition to the toolbox of the practitioner, semi-automatic exploration is still helpful and one will often iterate between launching a new volley of jobs and analysis of the results obtained with the previous volley in order to guide model design and research. What we need is more, and more efficient, automation of hyper-parameter optimization. There are some interesting steps in this direction (Hutter, 2009; Bergstra et al., 2011; Hutter et al., 2011; Srinivasan and Ramakrishnan, 2011) but much more needs to be done.

4 Debugging and Analysis

4.1 Gradient Checking and Controlled Overfitting

A very useful debugging step consists in verifying that the implementation of the gradient ∂L/∂θ is compatible with the computation of L as a function of θ. If the analytically computed gradient does not match the one obtained by a finite difference approximation, this signals that a bug is probably present somewhere. First of all, looking at for which i one gets an important relative change between ∂L/∂θ_i and its finite difference approximation, we can get hints as to where the problem may be. An error in sign is particularly troubling, of course. A good next step is then to verify in the same way intermediate gradients ∂L/∂a, where a stands for quantities that depend on the faulty θ, such as intervening neuron activations.

As many researchers know, the gradient can be approximated by a finite difference approximation obtained from the first-order Taylor expansion of a scalar function f with respect to a scalar argument x:

    ∂f(x)/∂x = (f(x + ε) − f(x))/ε + O(ε)

But a less known fact is that a second-order approximation can be achieved by considering the following alternative formula:

    ∂f(x)/∂x = (f(x + ε) − f(x − ε))/(2ε) + O(ε²).

The second-order terms of the Taylor expansions of f(x + ε) and f(x − ε) cancel each other because they are even, leaving only 3rd or higher order terms, i.e., an O(ε²) error after dividing the difference by 2ε. Hence this formula is twice as expensive (not a big deal while debugging) but provides quadratically more precision.

Note that because of finite precision in the computation, there will be a difference between the analytic (even correct) and finite difference gradient. Contrary to naive expectations, the relative difference may grow if we choose an ε that is too small, i.e., the error should first decrease as ε is decreased, and then may worsen when numerical precision kicks in, due to non-linearities. We have often used a value of ε = 10^−4 in neural networks, a value that is sufficiently small to detect most bugs.
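A minimal gradient checker based on the centered formula (our own sketch, for a loss f of a flat parameter vector):

    import numpy as np

    def check_gradient(f, grad_f, theta, eps=1e-4):
        # compare the analytic gradient grad_f(theta) with the centered
        # difference (f(theta + eps*e_i) - f(theta - eps*e_i)) / (2*eps) and
        # report the worst relative discrepancy over all parameters i
        g = grad_f(theta)
        g_num = np.zeros_like(theta)
        for i in range(theta.size):
            e = np.zeros_like(theta)
            e[i] = eps
            g_num[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
        rel = np.abs(g - g_num) / np.maximum(np.abs(g) + np.abs(g_num), 1e-12)
        return rel.max()

    # example with L(theta) = sum(theta^2), whose gradient is 2*theta:
    theta = np.random.randn(10)
    assert check_gradient(lambda t: np.sum(t**2), lambda t: 2*t, theta) < 1e-6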
Once the gradient is known to be well computed, another sanity check is that gradient descent (or any other gradient-based optimization) should be able to overfit on a small training set30. In particular, to factor out effects of SGD hyper-parameters, a good sanity check for the code (and the other hyper-parameters) is to verify that one can overfit on a small training set using a powerful second-order method such as L-BFGS. For any optimizer, though, as the number of examples is increased, the degradation of training error should be gradual while validation error should improve. And one typically sees the advantages of SGD over batch second-order methods like L-BFGS increase as the training set size increases. The break-even point may depend on the task, parallelization (multi-core or GPU, see Sec. 5 below), and architecture (number of computations compared to number of parameters, per example).

30 In principle, bad local minima could prevent that, but in the overfitting regime, e.g., with more hidden units than examples, the global minimum of the training error can generally be reached almost surely from random initialization, presumably because the training criterion becomes convex in the parameters that suffice to get the training error to zero (Bengio et al., 2006a), i.e., the output weights of the neural network.

Of course, the real goal of learning is to achieve good generalization error, and the latter can be estimated by measuring performance on an independent test set. When test error is considered too high, the first question to ask is whether it is because of a difficulty in optimizing the training criterion or because of overfitting. Comparing training error and test error (and how they change as we change hyper-parameters that influence capacity, such as the number of training iterations) helps to answer that question. Depending on the answer, of course, the appropriate ways to improve test error are different. Optimization difficulties can be fixed by looking for bugs in the training code, inappropriate values of optimization hyper-parameters, or simply insufficient capacity (e.g. not enough degrees of freedom, hidden units, embedding sizes, etc.). Overfitting difficulties can be addressed by collecting more training data, introducing more or better regularization terms, multi-task training, unsupervised pre-training, an unsupervised term in the training criterion, or considering different function families (or neural network architectures). In a multi-layer neural network, both problems can be simultaneously present. For example, as discussed in Bengio et al. (2007); Bengio (2009), it is possible to have zero training error with a large top-level hidden layer that allows the output layer to overfit, while the lower layers are not doing a good job of extracting useful features because they were not properly optimized.

Unless using a framework such as Theano which automatically handles the efficient allocation of buffers for intermediate results, it is important to pay attention to such buffers in the design of the code. The first objective is to avoid memory allocation in the middle of the training loop, i.e., all memory buffers should be allocated once and for all. Careless reuse of the same memory buffers for different uses can however lead to bugs, which can be checked, in the debugging phase, by initializing buffers to the NaN (Not-A-Number) value, which propagates into downstream computation (making it easy to detect that uninitialized values were used)31.

31 Personal communication from David Warde-Farley, who learned this trick from Sam Roweis.
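In NumPy terms, the trick amounts to the following (a toy sketch):

    import numpy as np

    # debugging mode: pre-allocate buffers filled with NaN instead of zeros,
    # so that reading a never-written element propagates NaN downstream
    buf = np.full((32, 100), np.nan)   # instead of np.zeros((32, 100))
    buf[:16] = 1.0                     # suppose only half was actually written
    out = buf.sum()                    # NaN here reveals the bug immediately
    assert np.isnan(out)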
4.2 Visualizations and Statistics

The most basic statistics that should be measured during training are error statistics. The average loss on the training set and the validation set and their evolution during training are very useful to monitor progress and differentiate overfitting from poor optimization. To make comparisons easier, it may be useful to compare neural networks during training in terms of their "age" (number of updates made times mini-batch size B, i.e., number of examples visited) rather than in terms of number of epochs (which is very sensitive to the training set size).

When using unsupervised training to learn the first few layers of a deep architecture, a very common debugging and analysis tool is the visualization of filters, i.e., of the weight vectors associated with individual hidden units. This is simplest in the case of the first layer and where the inputs are images (or image patches), time-series, or spectrograms (all of which are visually interpretable). Several recipes have been proposed to extend this idea to visualize the preferred input of hidden units in layers that follow the first one (Lee et al., 2008; Erhan et al., 2010a). In the case of the first layer, since one often obtains Gabor filters, a parametric fit of these filters to the weight vector can be done so as to visualize the distribution of orientations, positions and scales of the learned filters. An interesting special case of visualizing first-layer weights is the visualization of word embeddings (see Section 5.3 below) using a dimensionality reduction technique such as t-SNE (van der Maaten and Hinton, 2008).
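A minimal matplotlib sketch of such a filter display, assuming a first-layer weight matrix on 28×28 image inputs (all names and sizes are our illustrative assumptions):

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_filters(W, patch_shape=(28, 28), n_rows=8, n_cols=8):
        # W: (num inputs, num hidden) first-layer weight matrix on image
        # inputs; each column is displayed as one patch_shape image
        fig, axes = plt.subplots(n_rows, n_cols, figsize=(8, 8))
        for j, ax in enumerate(axes.flat):
            ax.imshow(W[:, j].reshape(patch_shape), cmap='gray')
            ax.set_xticks([]); ax.set_yticks([])
        plt.show()

    plot_filters(np.random.randn(784, 100))   # random weights, for illustration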
An extension of the idea of visualizing filters (which can apply to non-linear or deeper features) is that of visualizing local (around the given test point) leading tangent vectors, i.e., the main directions in input space to which the representation (at a given layer) is most sensitive (Rifai et al., 2011b).

In the case where the inputs are not images or easily visualizable, or to get a sense of the weight values in different hidden units, Hinton diagrams (Hinton, 1989) are also very useful, using small squares whose color (black or white) indicates a weight's sign and whose area represents its magnitude.

Another way to visualize what has been learned by an unsupervised (or joint label-input) model is to look at samples from the model. Sampling procedures have been defined at the outset for RBMs, Deep Belief Nets, and Deep Boltzmann Machines, for example based on Gibbs sampling. When weights become larger, mixing between modes can become very slow with Gibbs sampling. An interesting alternative is rates-FPCD (Tieleman and Hinton, 2009; Breuleux et al., 2011) which appears to be more robust to this problem and generally mixes faster, but at the cost of losing theoretical guarantees.

In the case of auto-encoder variants, it was not clear until recently whether they were really capturing the underlying density (since they are not optimized with respect to the maximum likelihood principle or an approximation of it). It was therefore even less clear if there existed appropriate sampling algorithms for auto-encoders, but a recent proposal for sampling from contractive auto-encoders appears to be working very well (Rifai et al., 2012), based on arguments about the geometric interpretation of the first derivative of the encoder (Bengio et al., 2012), showing that denoising and contractive auto-encoders capture local moments (first and second) of the training density.

To get a sense of what individual hidden units represent, it has also been proposed to vary only one unit while keeping the others fixed, e.g., to the value obtained by finding the hidden units representation associated with a particular input example.

Another interesting technique is the visualization of the learning trajectory in function space (Erhan et al., 2010b). The idea is to associate the function (as opposed to simply the parameters) computed by a neural network with a low-dimensional (2-D or 3-D) representation, e.g., with the t-SNE (van der Maaten and Hinton, 2008) or Isomap (Tenenbaum et al., 2000) algorithms, and then plot the evolution of this function during training, or the population of such trajectories for different initializations. This provides a visualization of effective local minima32 and shows that no two different random initializations ended up in the same effective local minimum.

32 It is difficult to know for sure if it is a true local minimum or if it appears like one because the optimization algorithm is stuck.

Finally, another useful type of visualization is to display statistics (e.g., histogram, mean and standard deviation) of activations (inputs and outputs of the non-linearities at each layer), activation gradients, parameters and parameter gradients, by groups (e.g. different layers, biases vs weights) and across training iterations. See Glorot and Bengio (2010) for a practical example. A particularly interesting quantity to monitor is the discriminative ability of the representations learnt at each layer, as discussed in (Montavon et al., 2012), ultimately leading to an analysis of the disentangled factors captured by the different layers as we consider deeper architectures.

5 Other Recommendations

5.1 Multi-core machines, BLAS and GPUs

Matrix operations are the most time-consuming in efficient implementations of many machine learning algorithms and this is particularly true of neural networks and deep architectures. The basic operations are matrix-vector products (forward propagation and back-propagation) and vector-vector outer products (resulting in a matrix of weight gradients). Matrix-matrix multiplications can be done substantially faster than the equivalent sequence of matrix-vector products for two reasons: thanks to smart caching mechanisms such as implemented in the BLAS library (which is called from many higher-level environments such as python's numpy and Theano, Matlab, Torch or Lush), and thanks to parallelism. Appropriate versions of BLAS can take advantage of multi-core machines to distribute these computations across cores. The speed-up is however generally a fraction of the total speed-up one can hope for (e.g. 4× on a 4-core machine), because of communication overheads and because not all computation is parallelized. Parallelism becomes more efficient when the sizes of these matrices are increased, which is why mini-batch updates can be computationally advantageous, and more so when more cores are present.

The extreme multi-core machines are the GPUs (Graphics Processing Units), with hundreds of cores. Unfortunately, they also come with constraints and specialized compilers which make it more difficult to fully take advantage of their potential. On 512-core machines, we are routinely able to get speed-ups of 4× to 40× for large neural networks. To make the use of GPUs practical, it really helps to use existing libraries that efficiently implement computations on GPUs. See Bergstra et al. (2010) for a comparative study of the Theano library (which compiles numpy-like code for GPUs). One practical issue is that only the GPU-compiled operations will typically be done on the GPU, and that transfers between the GPU and CPU considerably slow things down. It is important to use a profiler to find out what is done on the GPU and how efficient these operations are, in order to quickly invest one's time where needed to make an implementation GPU-efficient and keep most operations on the GPU card.

5.2 Sparse High-Dimensional Inputs

Sparse high-dimensional inputs can be efficiently handled by traditional supervised neural networks by using a sparse matrix multiplication. Typically, the input is a sparse vector while the weights are in a dense matrix, and one should use an efficient implementation made for just this case in order to optimally take advantage of sparsity. There is still going to be an overhead on the order of 2× or more (on the multiply-add operations, not the others) compared to a dense implementation of the matrix-vector product.
unfortunately a difficulty. The computation for these age loss is optimized as well as in the deterministic
learning algorithms usually involves some kind of re- full-computation case.
construction of the input (like for all auto-encoder
variants, but also for RBMs and sparse coding vari-
ants), as if the inputs were in the output space of 5.3 Symbolic Variables, Embeddings,
the learner. Two exceptions to this problem are Multi-Task Learning and Multi-
semi-supervised embedding (Weston et al., 2008) and Relational Learning
Slow Feature Analysis (Wiskott and Sejnowski, 2002;
Berkes and Wiskott, 2002). The former pulls the rep- Parameter sharing (Lang and Hinton, 1988; LeCun,
resentation of nearby examples near each other and 1989; Lang and Hinton, 1988; Caruana, 1993; Baxter,
pushes dissimilar points apart, while also tuning the 1995, 1997) is an old neural network technique for in-
representation for a supervised learning task. The creasing statistical power: if a parameter is used in N
latter maximizes the learned features’ variance while times more contexts (different tasks, different parts of
minimizing their covariance and maximizing their the input, etc.) then it may be as if we had N times
temporal auto-correlation. more training examples for tuning its value. More
For algorithms that do need a form of input re- examples to estimate a parameter reduces its vari-
construction, an efficient approach based on sam- ance (with respect to sampling of training examples),
pled reconstruction (Dauphin et al., 2011) has been which is directly influencing generalization error: for
proposed, successfully implemented and evaluated example the generalization mean squared error can

24
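The following sketch conveys only the flavor of this estimator, in a simplified setting (it is our own loose rendering, not the exact algorithm of Dauphin et al. (2011)): all non-zero input indices are kept, an equal number of zero indices are sampled, and each sampled squared error on a zero index is re-weighted by the inverse of its sampling probability so that the estimator stays unbiased.

    import numpy as np
    rng = np.random.RandomState(0)

    def sampled_reconstruction_loss(x, x_rec, n_zero_samples=None):
        nz = np.flatnonzero(x)                 # always keep non-zero inputs
        zeros = np.flatnonzero(x == 0)
        k = len(nz) if n_zero_samples is None else n_zero_samples
        zs = rng.choice(zeros, size=k, replace=False)
        # a zero index is sampled with probability k / len(zeros), hence the
        # importance weight len(zeros) / k on its loss term:
        w_zero = len(zeros) / float(k)
        loss = np.sum((x[nz] - x_rec[nz]) ** 2)
        loss += w_zero * np.sum((x[zs] - x_rec[zs]) ** 2)
        return loss

    x = np.zeros(10000); x[rng.choice(10000, 50, replace=False)] = 1.0
    loss = sampled_reconstruction_loss(x, x_rec=np.full(10000, 0.01))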
5.3 Symbolic Variables, Embeddings, Multi-Task Learning and Multi-Relational Learning

Parameter sharing (Lang and Hinton, 1988; LeCun, 1989; Caruana, 1993; Baxter, 1995, 1997) is an old neural network technique for increasing statistical power: if a parameter is used in N times more contexts (different tasks, different parts of the input, etc.) then it may be as if we had N times more training examples for tuning its value. More examples to estimate a parameter reduce its variance (with respect to sampling of training examples), which directly influences generalization error: for example, the generalization mean squared error can be decomposed as the sum of a bias term and a variance term (Geman et al., 1992). The reuse idea was first exploited by applying the same parameter to different parts of the input, as in convolutional neural networks (Lang and Hinton, 1988; LeCun, 1989). Reuse was also exploited by sharing the lower layers of a network (and the representation of the input that they capture) across multiple tasks associated with different outputs of the network (Caruana, 1993; Baxter, 1995, 1997). This idea is also one of the key motivations behind Deep Learning (Bengio, 2009) because one can think of the intermediate features computed in higher (deeper) layers as different tasks that can share the sub-features computed in lower layers (nearer the input). This very basic notion of reuse is key to improving generalization in many settings, guiding the design of neural network architectures in practical applications as well.

An interesting special case of these ideas is in the context of learning with symbolic data. If some input variables are symbolic, taking value in a finite alphabet, they can be represented as neural network inputs by a one-hot subvector of the input vector (with a 0 everywhere except at the position associated with the particular symbol). Now, sometimes different input variables refer to different instances of the same type of symbol. A patent example is with neural language models (Bengio et al., 2003; Bengio, 2008), where the input is a sequence of words. In these models, the same input layer weights are reused for words at different positions in the input sequence (as in convolutional networks). The product of a one-hot sub-vector with this shared weight matrix is a generally dense vector, and this associates each symbol in the alphabet with a point in a vector space33, which we call its embedding. The idea of vector space representations for words and symbols is older (Deerwester et al., 1990) and is a particular case of the notion of distributed representation (Hinton, 1986, 1989) central to the connectionist approaches. Learned embeddings of symbols (or other objects) can be conveniently visualized using a dimensionality reduction algorithm such as t-SNE (van der Maaten and Hinton, 2008).

33 The result of the matrix multiplication, which equals one of the columns of the matrix.
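A two-line check makes this equivalence concrete (the sizes are our own illustration): with column-vector conventions, multiplying the shared matrix by a one-hot vector just selects one of its columns, so in practice the embedding is implemented as a table look-up.

    import numpy as np

    dim, vocab_size = 50, 10000
    W = 0.01 * np.random.randn(dim, vocab_size)   # one embedding per column

    word_index = 4242
    one_hot = np.zeros(vocab_size)
    one_hot[word_index] = 1.0

    # the product with a one-hot vector selects a single column of W:
    assert np.allclose(W.dot(one_hot), W[:, word_index])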
ergy. In addition, by considering relation types them-
33 the result of the matrix multiplication, which equals one selves as particular symbolic objects, the model can
of the columns of the matrix reason about relations themselves and have relations

25
between relation types. For example, ‘To be’ can act vised tasks — training a neural network for clas-
as a relation type (in subject-attribute relations) but sification (Hinton et al., 2006; Bengio et al., 2007;
in the statement “ ‘To be’ is a verb” it appears both Ranzato et al., 2007) — and unsupervised tasks —
as a relation type and as an object of the relation. training a Deep Boltzmann Machine to model the
Such multi-relational learning opens the door to data distribution (Salakhutdinov and Hinton, 2009).
the application of neural networks outside of their The learning trajectories visualizations
traditional applications, which was based on a single of Erhan et al. (2010b) have shown that even
homogeneous source of data, often seen as a matrix when starting from nearby configurations in function
with one row per example and one column (or group space, different initializations seem to always fall in
of columns) per random variable. Instead, one often a different effective local minimum. Furthermore,
has multiple heterogeneous sources of data (typically the same study showed that the minima found when
providing examples seen as a tuple of values), each in- using unsupervised pre-training were far in function
volving different random variables. So long as these space from those found from random initialization,
different sources share some variables, then the above in addition to giving better generalization error.
multi-relational multi-task learning approaches can Both of these findings highlight the importance of
be applied. Each variable can be associated with its initialization, hence of local minima effects, in deep
embedding function (that maps the value of a vari- networks. Finally, it has been shown that these
able to a generic representation space that is valid effects were both increased when considering deeper
across tasks and data sources). This framework can architectures (Erhan et al., 2010b).
be applied not only to symbolic data but to mixed There are also results showing that specific ways
symbolic/numeric data if the mapping from object of setting the initial distribution and ordering of
to embedding is generalized from a table look-up to examples (“curriculum learning”) can yield bet-
a parametrized function (the simplest being a linear ter solutions (Elman, 1993; Bengio et al., 2009;
mapping) from its raw attributes (e.g., image fea- Krueger and Dayan, 2009). This also suggest that
tures) to its embedding. This has been exploited very particular ways of initializing parameters, very
successfully to design image search systems in which different from uniformly sampled, can have a strong
images and queries are mapped to the same semantic impact on the solutions found by gradient descent.
space (Weston et al., 2011). The hypothesis proposed in (Bengio et al., 2009) is
that curriculum learning can act similarly to a con-
tinuation method, i.e., starting from an easier opti-
6 Open Questions mization task (e.g. convex) and tracking the local
minimum as the learning task is gradually made more
6.1 On the Added Difficulty of Train- difficult and closer to the real task of interest.
ing Deeper Architectures Why would training deeper networks be more dif-
ficult? This is clearly still an open question. A
There are experimental results which provide some plausible partial answer is that deeper networks are
evidence that, at least in some circumstances, deeper also more non-linear (since each layer composes more
neural networks are more difficult to train than non-linearity on top of the previous ones), making
shallow ones, in the sense that there is a greater gradient-based methods less efficient. It may also be
chance of missing out on better minima when start- that the number and structure of local minima both
ing from random initialization. This is borne out change qualitatively as we increase depth. Theoreti-
by all the experiments where we find that some cal arguments support a potentially exponential gain
initialization scheme can drastically improve per- in expressive power of deeper architectures (Bengio,
formance. In the Deep Learning literature this 2009; Bengio and Delalleau, 2011) and it would be
has been shown with the use of unsupervised pre- plausible that with this added expressive power com-
training (supervised or not), both applied to super- ing from the combinatorics of composed reuse of sub-

26
functions could come a corresponding increase in the (s(a) = max(0, a), see also (Nair and Hinton,
number (and possibly quality) of local minima. But 2010)) actually worked very well (but should not
the best ones could then also be more difficult to find. be used for output units), in spite of the prior be-
On the practical side, several experimental results lief that the fact that when hidden units are sat-
point to factors that may help training deep architec- urated, gradients would not flow well into lower
tures: layers. In fact gradients flow very well, but on
selected paths, possibly making the credit as-
• A local training signal. What many success- signment (which parameters should change to
ful procedures for training deep networks have handle the current error) sharper and the Hes-
in common is that they involve a local training sian condition number better. A recent heuris-
signal that helps each layer decide what to do tic that is related to the difficulty of gradient
without requiring the back-propagation of gradi- propagation through neural net non-linearities is
ents through many non-linearities. This includes the idea of “centering” the non-linear operation
of course the many variants of greedy layer-wise such that each hidden unit has zero average out-
pre-training but also the less well-known semi- put and zero average slope (Schraudolph, 1998;
supervised embedding algorithm (Weston et al., Raiko et al., 2012).
2008).
6.2 Adaptive Learning Rates and Second-Order Methods

To improve convergence and remove learning rates from the list of hyper-parameters, many authors have advocated exploring adaptive learning rate methods, either for a global learning rate (Cho et al., 2011), a layer-wise learning rate, a neuron-wise learning rate, or a parameter-wise learning rate (Bordes et al., 2009) (which then starts to look like a diagonal Newton method). LeCun (1987) and LeCun et al. (1998a) advocate the use of a second-order diagonal Newton (always positive) approximation, with one learning rate per parameter (associated with the approximated inverse second derivative of the loss with respect to the parameter). Hinton (2010) proposes scaling learning rates so that the average weight update is on the order of 1/1000th of the weight magnitude. LeCun et al. (1998a) also propose a simple power method in order to estimate the largest eigenvalue of the Hessian (the inverse of which gives the optimal learning rate). An interesting alternative to variants of Newton's method are variants of the natural gradient method (Amari, 1998), but like the basic Newton method it is computationally too expensive, requiring operations on a too-large square matrix (number of parameters by number of parameters). Diagonal and low-rank online approximations of natural gradient (Le Roux et al., 2008; Le Roux et al., 2011) have been proposed and shown to speed up training in some contexts. Several adaptive learning rate procedures have been proposed recently and merit more attention and evaluation in the neural network context, such as adagrad (Duchi et al., 2011) and the adaptive learning rate method from Schaul et al. (2012), which claims to remove completely the need for a learning rate hyper-parameter. A minimal sketch of the per-parameter idea is given below.
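As a concrete (if simplified) example of a parameter-wise scheme, the following NumPy sketch implements an adagrad-style update; the learning rate and damping constant are illustrative, not recommendations:

    import numpy as np

    def adagrad_update(theta, grad, accum, lr=0.01, eps=1e-8):
        # Accumulate squared gradients, then give each parameter its own
        # effective learning rate, lr / sqrt(accumulated squared gradient).
        accum += grad ** 2
        theta -= lr * grad / (np.sqrt(accum) + eps)
        return theta, accum

Parameters that receive large or frequent gradients thus see their step size shrink faster than rarely updated ones.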
Whereas stochastic gradient descent converges very quickly initially, it is generally slower than second-order methods for the final convergence, and this may be important in some applications. As a consequence, batch training algorithms (performing only one update after seeing the whole training set) such as the Conjugate Gradient method (a second-order method) have dominated stochastic gradient descent for not-too-large datasets (e.g. less than thousands or tens of thousands of examples). Furthermore, it has recently been proposed and successfully applied to use second-order methods over large mini-batches (Le et al., 2011; Martens, 2010). The idea is to do just a few iterations of the second-order method on each mini-batch and then move on to the next mini-batch, starting from the best previous point found. A useful twist is to start training with one or more epochs of SGD, since SGD remains the fastest optimizer early on in training.

At this point in time, however, although the second-order and natural gradient methods are appealing conceptually, have demonstrably helped in the studied cases, and may in the end prove to be very important, they have not yet become a standard for neural network optimization and need to be validated and maybe improved by other researchers, before displacing simple (mini-batch) stochastic gradient descent variants.

6.3 Conclusion

In spite of decades of experimental and theoretical work on artificial neural networks, and with all the impressive progress made since the first edition of this book, in particular in the area of Deep Learning, there is still much to be done to better train neural networks and better understand the underlying issues that can make the training task difficult. As stated in the introduction, the wisdom distilled here should be taken as a guideline, to be tried and challenged, not as a practice set in stone. The practice summarized here, coupled with the increase in available computing power, now allows researchers to train neural networks on a scale that is far beyond what was possible at the time of the first edition of this book, helping to move us closer to artificial intelligence.

Acknowledgements

The author is grateful for the comments and feedback provided by Nicolas Le Roux, Ian Goodfellow, James Bergstra, Guillaume Desjardins, Razvan Pascanu, David Warde-Farley, Eric Larsen, Frederic Bastien, and Sina Honari, as well as for the financial support of NSERC, FQRNT, CIFAR, and the Canada Research Chairs.

References

Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276.

Bach, F. and Moulines, E. (2011). Non-asymptotic analysis of stochastic approximation algorithms. In NIPS'2011.

Bagnell, J. A. and Bradley, D. M. (2009). Differentiable sparse coding. In NIPS'2009, pages 113–120.

Baxter, J. (1995). Learning internal representations. In COLT'95, pages 311–320.

Baxter, J. (1997). A Bayesian/information theoretic model of learning via multiple task sampling. Machine Learning, 28, 7–40.

Bengio, Y. (2008). Neural net language models. Scholarpedia, 3(1), 3881.

Bengio, Y. (2009). Learning deep architectures for AI. Now Publishers.

Bengio, Y. (2011). Deep learning of representations for unsupervised and transfer learning. In JMLR W&CP: Proc. Unsupervised and Transfer Learning.
Bengio, Y. and Delalleau, O. (2011). On the expressive power of deep architectures. In ALT'2011.

Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms towards AI. In Large Scale Kernel Machines.

Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic language model. JMLR, 3, 1137–1155.

Bengio, Y., Le Roux, N., Vincent, P., Delalleau, O., and Marcotte, P. (2006a). Convex neural networks. In NIPS'2005, pages 123–130.

Bengio, Y., Delalleau, O., and Le Roux, N. (2006b). The curse of highly variable functions for local kernel machines. In NIPS'2005, pages 107–114.

Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of deep networks. In NIPS'2006.

Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. In ICML'09.

Bengio, Y., Alain, G., and Rifai, S. (2012). Implicit density estimation by local moment matching to sample from auto-encoders. Technical report, arXiv:1207.0057.

Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. J. Machine Learning Res., 13, 281–305.

Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proc. Python for Scientific Comp. Conf. (SciPy).

Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. (2011). Algorithms for hyper-parameter optimization. In NIPS'2011.

Berkes, P. and Wiskott, L. (2002). Applying slow feature analysis to image sequences yields a rich repertoire of complex cell properties. In ICANN'02, pages 81–86.

Bertsekas, D. P. (2010). Incremental gradient, subgradient, and proximal methods for convex optimization: a survey. Technical Report 2848, LIDS.

Bordes, A., Bottou, L., and Gallinari, P. (2009). SGD-QN: Careful quasi-Newton stochastic gradient descent. Journal of Machine Learning Research, 10, 1737–1754.

Bordes, A., Weston, J., Collobert, R., and Bengio, Y. (2011). Learning structured embeddings of knowledge bases. In AAAI 2011.

Bordes, A., Glorot, X., Weston, J., and Bengio, Y. (2012). Joint learning of words and meaning representations for open-text semantic parsing. AISTATS'2012.

Bottou, L. (2011). From machine learning to machine reasoning. Technical report, arXiv:1102.1808.

Bottou, L. (2013). Large-scale learning with stochastic gradient descent. In K.-R. Müller, G. Montavon, and G. B. Orr, editors, Neural Networks: Tricks of the Trade, Reloaded. Springer.

Bottou, L. and Bousquet, O. (2008). The tradeoffs of large scale learning. In NIPS'2008.

Bottou, L. and LeCun, Y. (2004). Large-scale on-line learning. In NIPS'2003.

Breiman, L. (1994). Bagging predictors. Machine Learning, 24(2), 123–140.

Breuleux, O., Bengio, Y., and Vincent, P. (2011). Quickly generating representative samples from an RBM-derived process. Neural Computation, 23(8), 2053–2073.

Caruana, R. (1993). Multitask connectionist learning. In Proceedings of the 1993 Connectionist Models Summer School, pages 372–379.

Cho, K., Raiko, T., and Ilin, A. (2011). Enhanced gradient and adaptive learning rate for training restricted Boltzmann machines. In ICML'2011, pages 105–112.

Coates, A. and Ng, A. Y. (2011). The importance of encoding versus training with sparse coding and vector quantization. In ICML'2011.

Collobert, R. and Bengio, S. (2004a). Links between perceptrons, MLPs and SVMs. In ICML'2004.

Collobert, R. and Bengio, S. (2004b). Links between perceptrons, MLPs and SVMs. In International Conference on Machine Learning, ICML.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011a). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12, 2493–2537.

Collobert, R., Kavukcuoglu, K., and Farabet, C. (2011b). Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop.

Courville, A., Bergstra, J., and Bengio, Y. (2011). Unsupervised models of images by spike-and-slab RBMs. In ICML'2011.

Dauphin, Y., Glorot, X., and Bengio, Y. (2011). Sampled reconstruction for large-scale learning of embeddings. In Proc. ICML'2011.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. J. Am. Soc. Information Science, 41(6), 391–407.

Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12, 2121–2159.

Elman, J. L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48, 781–799.

Erhan, D., Courville, A., and Bengio, Y. (2010a). Understanding representations learned in deep architectures. Technical Report 1355, Université de Montréal/DIRO.

Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., and Bengio, S. (2010b). Why does unsupervised pre-training help deep learning? J. Machine Learning Res., 11, 625–660.

Frasconi, P., Gori, M., and Sperduti, A. (1998). A general framework for adaptive processing of data structures. IEEE Transactions on Neural Networks, 9(5), 768–786.

Geman, S., Bienenstock, E., and Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4(1), 1–58.

Getoor, L. and Taskar, B. (2006). Introduction to Statistical Relational Learning. MIT Press.

Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In AISTATS'2010, pages 249–256.

Glorot, X., Bordes, A., and Bengio, Y. (2011a). Deep sparse rectifier neural networks. In AISTATS'2011.

Glorot, X., Bordes, A., and Bengio, Y. (2011b). Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML'2011.

Goodfellow, I., Le, Q., Saxe, A., and Ng, A. (2009). Measuring invariances in deep networks. In NIPS'2009, pages 646–654.

Goodfellow, I., Courville, A., and Bengio, Y. (2011). Spike-and-slab sparse coding for unsupervised feature discovery. In NIPS Workshop on Challenges in Learning Hierarchical Models.

Graepel, T., Candela, J. Q., Borchert, T., and Herbrich, R. (2010). Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In ICML'2010.

Håstad, J. (1986). Almost optimal lower bounds for small depth circuits. In STOC'86, pages 6–20.

Håstad, J. and Goldmann, M. (1991). On the power of small-depth threshold circuits. Computational Complexity, 1, 113–129.

Hinton, G. E. (1978). Relaxation and its role in vision. Ph.D. thesis, University of Edinburgh.

Hinton, G. E. (1986). Learning distributed representations of concepts. In Proc. 8th Annual Conf. Cog. Sc. Society, pages 1–12.

Hinton, G. E. (1989). Connectionist learning procedures. Artificial Intelligence, 40, 185–234.

Hinton, G. E. (2010). A practical guide to training restricted Boltzmann machines. Technical Report UTML TR 2010-003, Department of Computer Science, University of Toronto.

Hinton, G. E. (2013). A practical guide to training restricted Boltzmann machines. In K.-R. Müller, G. Montavon, and G. B. Orr, editors, Neural Networks: Tricks of the Trade, Reloaded. Springer.

Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.

Hutter, F. (2009). Automated Configuration of Algorithms for Solving Hard Computational Problems. Ph.D. thesis, University of British Columbia.

Hutter, F., Hoos, H., and Leyton-Brown, K. (2011). Sequential model-based optimization for general algorithm configuration. In LION-5.

Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? In ICCV'09.

Kavukcuoglu, K., Ranzato, M.-A., Fergus, R., and LeCun, Y. (2009). Learning invariant features through topographic filter maps. In CVPR'2009.

Krueger, K. A. and Dayan, P. (2009). Flexible shaping: how learning in small steps helps. Cognition, 110, 380–394.

Lamblin, P. and Bengio, Y. (2010). Important gains from supervised fine-tuning of deep architectures on large labeled sets. NIPS*2010 Deep Learning and Unsupervised Feature Learning Workshop.

Lang, K. J. and Hinton, G. E. (1988). The development of the time-delay neural network architecture for speech recognition. Technical Report CMU-CS-88-152, Carnegie-Mellon University.

Larochelle, H. and Bengio, Y. (2008). Classification using discriminative restricted Boltzmann machines. In ICML'2008.

Larochelle, H., Bengio, Y., Louradour, J., and Lamblin, P. (2009). Exploring strategies for training deep neural networks. J. Machine Learning Res., 10, 1–40.

Le, Q., Ngiam, J., Chen, Z., hao Chia, D. J., Koh, P. W., and Ng, A. (2010). Tiled convolutional neural networks. In NIPS'2010.

Le, Q., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., and Ng, A. (2011). On optimization methods for deep learning. In ICML'2011.

Le Roux, N., Manzagol, P.-A., and Bengio, Y. (2008). Topmoumoute online natural gradient algorithm. In NIPS'07.

Le Roux, N., Bengio, Y., and Fitzgibbon, A. (2011). Improving first and second-order methods by modeling uncertainty. In Optimization for Machine Learning. MIT Press.

Le Roux, N., Schmidt, M., and Bach, F. (2012). A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. Technical report, arXiv:1202.6258.

LeCun, Y. (1987). Modèles connexionistes de l'apprentissage. Ph.D. thesis, Université de Paris VI.

LeCun, Y. (1989). Generalization and network design strategies. Technical Report CRG-TR-89-4, University of Toronto.

LeCun, Y. (2013). to appear. In K.-R. Müller, G. Montavon, and G. B. Orr, editors, Neural Networks: Tricks of the Trade, Reloaded. Springer.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 541–551.

LeCun, Y., Bottou, L., Orr, G. B., and Müller, K. (1998a). Efficient backprop. In Neural Networks, Tricks of the Trade.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998b). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.

Lee, H., Ekanadham, C., and Ng, A. (2008). Sparse deep belief net model for visual area V2. In NIPS'07.

Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML'2009.

Martens, J. (2010). Deep learning via Hessian-free optimization. In ICML'2010, pages 735–742.

Mesnil, G., Dauphin, Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow, I., Lavoie, E., Muller, X., Desjardins, G., Warde-Farley, D., Vincent, P., Courville, A., and Bergstra, J. (2011). Unsupervised and transfer learning challenge: a deep learning approach. In JMLR W&CP: Proc. Unsupervised and Transfer Learning, volume 7.

Montavon, G., Braun, M. L., and Müller, K.-R. (2012). Deep Boltzmann machines as feed-forward hierarchies. In AISTATS'2012.

Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In ICML'2010.

Nemirovski, A. and Yudin, D. (1983). Problem complexity and method efficiency in optimization. Wiley.

Nesterov, Y. (2009). Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1), 221–259.

Olshausen, B. A. and Field, D. J. (1997). Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research, 37, 3311–3325.

Pearlmutter, B. (1994). Fast exact multiplication by the Hessian. Neural Computation, 6(1), 147–160.

Pinto, N., Doukhan, D., DiCarlo, J. J., and Cox, D. D. (2009). A high-throughput screening approach to discovering good forms of biologically inspired visual representation. PLoS Comput Biol, 5(11), e1000579.

Pollack, J. B. (1990). Recursive distributed representations. Artificial Intelligence, 46(1), 77–105.

Polyak, B. and Juditsky, A. (1992). Acceleration of stochastic approximation by averaging. SIAM J. Control and Optimization, 30(4), 838–855.

Raiko, T., Valpola, H., and LeCun, Y. (2012). Deep learning made easier by linear transformations in perceptrons. In AISTATS'2012.

Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007). Efficient learning of sparse representations with an energy-based model. In NIPS'06.

Ranzato, M., Boureau, Y.-L., and LeCun, Y. (2008a). Sparse feature learning for deep belief networks. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20 (NIPS'07), pages 1185–1192, Cambridge, MA. MIT Press.

Ranzato, M., Boureau, Y., and LeCun, Y. (2008b). Sparse feature learning for deep belief networks. In NIPS'2007.

Richardson, M. and Domingos, P. (2006). Markov logic networks. Machine Learning, 62, 107–136.

Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. (2011a). Contractive auto-encoders: Explicit invariance during feature extraction. In ICML'2011.

Rifai, S., Dauphin, Y., Vincent, P., Bengio, Y., and Muller, X. (2011b). The manifold tangent classifier. In NIPS'2011.

Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). A generative process for sampling contractive auto-encoders. In ICML'2012.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22, 400–407.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.

Salakhutdinov, R. and Hinton, G. (2009). Deep Boltzmann machines. In AISTATS'2009.

Saxe, A. M., Koh, P. W., Chen, Z., Bhand, M., Suresh, B., and Ng, A. (2011). On random weights and unsupervised feature learning. In ICML'2011.

Schaul, T., Zhang, S., and LeCun, Y. (2012). No more pesky learning rates. Technical report, arXiv:1206.1106.

Schraudolph, N. N. (1998). Centering neural network gradient factors. In G. B. Orr and K.-R. Müller, editors, Neural Networks: Tricks of the Trade. Springer.

Socher, R., Manning, C., and Ng, A. Y. (2011). Parsing natural scenes and natural language with recursive neural networks. In ICML'2011.

Srinivasan, A. and Ramakrishnan, G. (2011). Parameter screening and optimisation for ILP using designed experiments. Journal of Machine Learning Research, 12, 627–662.

Swersky, K., Chen, B., Marlin, B., and de Freitas, N. (2010). A tutorial on stochastic approximation algorithms for training restricted Boltzmann machines and deep belief nets. In Information Theory and Applications Workshop.

Tenenbaum, J., de Silva, V., and Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323.

Tieleman, T. and Hinton, G. (2009). Using fast weights to improve persistent contrastive divergence. In ICML'2009.

van der Maaten, L. and Hinton, G. E. (2008). Visualizing data using t-SNE. J. Machine Learning Res., 9.

Vincent, P. (2011). A connection between score matching and denoising autoencoders. Neural Computation, 23(7).

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In ICML 2008.

Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Machine Learning Res., 11.

Weston, J., Ratle, F., and Collobert, R. (2008). Deep learning via semi-supervised embedding. In ICML 2008.

Weston, J., Bengio, S., and Usunier, N. (2011). WSABIE: Scaling up to large vocabulary image annotation. In Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI.

Wiskott, L. and Sejnowski, T. J. (2002). Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4), 715–770.

Zou, W. Y., Ng, A. Y., and Yu, K. (2011). Unsupervised learning of visual invariance with temporal coherence. In NIPS 2011 Workshop on Deep Learning and Unsupervised Feature Learning.
Published as a conference paper at ICLR 2015

ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION

Diederik P. Kingma*                        Jimmy Lei Ba*
University of Amsterdam                    University of Toronto
dpkingma@uva.nl                            jimmy@psi.utoronto.ca

arXiv:1412.6980v8 [cs.LG] 23 Jul 2015

ABSTRACT

We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has low memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.

1 INTRODUCTION
Stochastic gradient-based optimization is of core practical importance in many fields of science and
engineering. Many problems in these fields can be cast as the optimization of some scalar parameter-
ized objective function requiring maximization or minimization with respect to its parameters. If the
function is differentiable w.r.t. its parameters, gradient descent is a relatively efficient optimization
method, since the computation of first-order partial derivatives w.r.t. all the parameters is of the same
computational complexity as just evaluating the function. Often, objective functions are stochastic.
For example, many objective functions are composed of a sum of subfunctions evaluated at different
subsamples of data; in this case optimization can be made more efficient by taking gradient steps
w.r.t. individual subfunctions, i.e. stochastic gradient descent (SGD) or ascent. SGD proved itself
as an efficient and effective optimization method that was central in many machine learning success
stories, such as recent advances in deep learning (Deng et al., 2013; Krizhevsky et al., 2012; Hinton
& Salakhutdinov, 2006; Hinton et al., 2012a; Graves et al., 2013). Objectives may also have other
sources of noise than data subsampling, such as dropout (Hinton et al., 2012b) regularization. For
all such noisy objectives, efficient stochastic optimization techniques are required. The focus of this
paper is on the optimization of stochastic objectives with high-dimensional parameter spaces. In
these cases, higher-order optimization methods are ill-suited, and discussion in this paper will be
restricted to first-order methods.
We propose Adam, a method for efficient stochastic optimization that only requires first-order gra-
dients with little memory requirement. The method computes individual adaptive learning rates for
different parameters from estimates of first and second moments of the gradients; the name Adam
is derived from adaptive moment estimation. Our method is designed to combine the advantages
of two recently popular methods: AdaGrad (Duchi et al., 2011), which works well with sparse gra-
dients, and RMSProp (Tieleman & Hinton, 2012), which works well in on-line and non-stationary
settings; important connections to these and other stochastic optimization methods are clarified in
section 5. Some of Adam’s advantages are that the magnitudes of parameter updates are invariant to
rescaling of the gradient, its stepsizes are approximately bounded by the stepsize hyperparameter,
it does not require a stationary objective, it works with sparse gradients, and it naturally performs a
form of step size annealing.

* Equal contribution. Author ordering determined by coin flip over a Google Hangout.


Algorithm 1: Adam, our proposed algorithm for stochastic optimization. See section 2 for details, and for a slightly more efficient (but less clear) order of computation. g_t² indicates the elementwise square g_t ⊙ g_t. Good default settings for the tested machine learning problems are α = 0.001, β1 = 0.9, β2 = 0.999 and ε = 10^−8. All operations on vectors are element-wise. With β1^t and β2^t we denote β1 and β2 to the power t.

Require: α: Stepsize
Require: β1, β2 ∈ [0, 1): Exponential decay rates for the moment estimates
Require: f(θ): Stochastic objective function with parameters θ
Require: θ0: Initial parameter vector
  m0 ← 0 (Initialize 1st moment vector)
  v0 ← 0 (Initialize 2nd moment vector)
  t ← 0 (Initialize timestep)
  while θ_t not converged do
    t ← t + 1
    g_t ← ∇_θ f_t(θ_{t−1}) (Get gradients w.r.t. stochastic objective at timestep t)
    m_t ← β1 · m_{t−1} + (1 − β1) · g_t (Update biased first moment estimate)
    v_t ← β2 · v_{t−1} + (1 − β2) · g_t² (Update biased second raw moment estimate)
    m̂_t ← m_t/(1 − β1^t) (Compute bias-corrected first moment estimate)
    v̂_t ← v_t/(1 − β2^t) (Compute bias-corrected second raw moment estimate)
    θ_t ← θ_{t−1} − α · m̂_t/(√v̂_t + ε) (Update parameters)
  end while
  return θ_t (Resulting parameters)
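For concreteness, one step of Algorithm 1 translates directly into NumPy. This is a minimal functional sketch, using the defaults listed above; a real implementation would carry m, v and t across minibatches:

    import numpy as np

    def adam_update(theta, grad, m, v, t, alpha=0.001,
                    beta1=0.9, beta2=0.999, eps=1e-8):
        # t is the 1-based timestep of this update.
        m = beta1 * m + (1 - beta1) * grad          # biased 1st moment estimate
        v = beta2 * v + (1 - beta2) * grad ** 2     # biased 2nd raw moment estimate
        m_hat = m / (1 - beta1 ** t)                # bias-corrected 1st moment
        v_hat = v / (1 - beta2 ** t)                # bias-corrected 2nd raw moment
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v

A training loop would initialize m and v to zero vectors and call this once per minibatch, incrementing t each time.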

In section 2 we describe the algorithm and the properties of its update rule. Section 3 explains
our initialization bias correction technique, and section 4 provides a theoretical analysis of Adam’s
convergence in online convex programming. Empirically, our method consistently outperforms other
methods for a variety of models and datasets, as shown in section 6. Overall, we show that Adam is
a versatile algorithm that scales to large-scale high-dimensional machine learning problems.

2 ALGORITHM

See algorithm 1 for pseudo-code of our proposed algorithm Adam. Let f(θ) be a noisy objective function: a stochastic scalar function that is differentiable w.r.t. parameters θ. We are interested in minimizing the expected value of this function, E[f(θ)], w.r.t. its parameters θ. With f_1(θ), ..., f_T(θ) we denote the realizations of the stochastic function at subsequent timesteps 1, ..., T. The stochasticity might come from the evaluation at random subsamples (minibatches) of datapoints, or arise from inherent function noise. With g_t = ∇_θ f_t(θ) we denote the gradient, i.e. the vector of partial derivatives of f_t w.r.t. θ, evaluated at timestep t.

The algorithm updates exponential moving averages of the gradient (m_t) and the squared gradient (v_t), where the hyper-parameters β1, β2 ∈ [0, 1) control the exponential decay rates of these moving averages. The moving averages themselves are estimates of the 1st moment (the mean) and the 2nd raw moment (the uncentered variance) of the gradient. However, these moving averages are initialized as (vectors of) 0's, leading to moment estimates that are biased towards zero, especially during the initial timesteps, and especially when the decay rates are small (i.e. the βs are close to 1). The good news is that this initialization bias can be easily counteracted, resulting in bias-corrected estimates m̂_t and v̂_t. See section 3 for more details.
Note that the efficiency of algorithm 1 can, at the expense of clarity, be improved upon by changing the order of computation, e.g. by replacing the last three lines in the loop with the following lines: α_t = α · √(1 − β2^t)/(1 − β1^t) and θ_t ← θ_{t−1} − α_t · m_t/(√v_t + ε̂).

2.1 ADAM'S UPDATE RULE

An important property of Adam's update rule is its careful choice of stepsizes. Assuming ε = 0, the effective step taken in parameter space at timestep t is ∆_t = α · m̂_t/√v̂_t. The effective stepsize has two upper bounds: |∆_t| ≤ α · (1 − β1)/√(1 − β2) in the case (1 − β1) > √(1 − β2), and |∆_t| ≤ α otherwise. The first case only happens in the most severe case of sparsity: when a gradient has been zero at all timesteps except at the current timestep. For less sparse cases, the effective stepsize will be smaller. When (1 − β1) = √(1 − β2) we have that |m̂_t/√v̂_t| < 1 and therefore |∆_t| < α. In more common scenarios, we will have that m̂_t/√v̂_t ≈ ±1 since |E[g]/√E[g²]| ≤ 1. The effective magnitude of the steps taken in parameter space at each timestep is thus approximately bounded by the stepsize setting α, i.e., |∆_t| ≲ α. This can be understood as establishing a trust region around the current parameter value, beyond which the current gradient estimate does not provide sufficient information. This typically makes it relatively easy to know the right scale of α in advance. For many machine learning models, for instance, we often know in advance that good optima are with high probability within some set region in parameter space; it is not uncommon, for example, to have a prior distribution over the parameters. Since α sets (an upper bound of) the magnitude of steps in parameter space, we can often deduce the right order of magnitude of α such that optima can be reached from θ0 within some number of iterations. With a slight abuse of terminology, we will call the ratio m̂_t/√v̂_t the signal-to-noise ratio (SNR). With a smaller SNR the effective stepsize ∆_t will be closer to zero. This is a desirable property, since a smaller SNR means that there is greater uncertainty about whether the direction of m̂_t corresponds to the direction of the true gradient. For example, the SNR value typically becomes closer to 0 towards an optimum, leading to smaller effective steps in parameter space: a form of automatic annealing. The effective stepsize ∆_t is also invariant to the scale of the gradients: rescaling the gradients g with a factor c will scale m̂_t with a factor c and v̂_t with a factor c², which cancel out: (c · m̂_t)/(√(c² · v̂_t)) = m̂_t/√v̂_t. The snippet below checks this invariance numerically for the first update step.

3 INITIALIZATION BIAS CORRECTION

As explained in section 2, Adam utilizes initialization bias correction terms. We will here derive the term for the second moment estimate; the derivation for the first moment estimate is completely analogous. Let g be the gradient of the stochastic objective f, and we wish to estimate its second raw moment (uncentered variance) using an exponential moving average of the squared gradient, with decay rate β2. Let g_1, ..., g_T be the gradients at subsequent timesteps, each a draw from an underlying gradient distribution g_t ∼ p(g_t). Let us initialize the exponential moving average as v0 = 0 (a vector of zeros). First note that the update at timestep t of the exponential moving average v_t = β2 · v_{t−1} + (1 − β2) · g_t² (where g_t² indicates the elementwise square g_t ⊙ g_t) can be written as a function of the gradients at all previous timesteps:

    v_t = (1 − β2) Σ_{i=1}^{t} β2^{t−i} · g_i²    (1)

We wish to know how E[v_t], the expected value of the exponential moving average at timestep t, relates to the true second moment E[g_t²], so we can correct for the discrepancy between the two. Taking expectations of the left-hand and right-hand sides of eq. (1):

    E[v_t] = E[(1 − β2) Σ_{i=1}^{t} β2^{t−i} · g_i²]    (2)
           = E[g_t²] · (1 − β2) Σ_{i=1}^{t} β2^{t−i} + ζ    (3)
           = E[g_t²] · (1 − β2^t) + ζ    (4)

where ζ = 0 if the true second moment E[g_i²] is stationary; otherwise ζ can be kept small since the exponential decay rate β2 can (and should) be chosen such that the exponential moving average assigns small weights to gradients too far in the past. What is left is the term (1 − β2^t), which is caused by initializing the running average with zeros. In algorithm 1 we therefore divide by this term to correct the initialization bias.

In case of sparse gradients, a reliable estimate of the second moment requires averaging over many gradients, i.e. choosing β2 close to 1 (a small value of 1 − β2); however it is exactly this case of slow decay where a lack of initialization bias correction would lead to initial steps that are much larger. The small numeric check below illustrates the size of the bias.
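A tiny numeric illustration of eq. (4) in the stationary case (constant squared gradient g² = 1, so E[g²] = 1 and ζ = 0):

    beta2, g2, v = 0.999, 1.0, 0.0
    for t in range(1, 6):
        v = beta2 * v + (1 - beta2) * g2
        print(t, v, v / (1 - beta2 ** t))  # raw v is only ~0.001*t; corrected value is 1.0

Without the correction, the raw estimate v understates the true second moment by a factor 1/(1 − β2^t), e.g. roughly 1000x at t = 1 for β2 = 0.999.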


4 CONVERGENCE ANALYSIS

We analyze the convergence of Adam using the online learning framework proposed in (Zinkevich, 2003). Given an arbitrary, unknown sequence of convex cost functions f_1(θ), f_2(θ), ..., f_T(θ), at each time t our goal is to predict the parameter θ_t and evaluate it on a previously unknown cost function f_t. Since the nature of the sequence is unknown in advance, we evaluate our algorithm using the regret, that is, the sum over all previous steps of the difference between the online prediction f_t(θ_t) and the best fixed point parameter f_t(θ*) from a feasible set X. Concretely, the regret is defined as:

    R(T) = Σ_{t=1}^{T} [f_t(θ_t) − f_t(θ*)]    (5)

where θ* = arg min_{θ∈X} Σ_{t=1}^{T} f_t(θ). We show Adam has an O(√T) regret bound and a proof is given in the appendix. Our result is comparable to the best known bound for this general convex online learning problem. We also use some definitions to simplify our notation, where g_t ≜ ∇f_t(θ_t) and g_{t,i} is its ith element. We define g_{1:t,i} ∈ R^t as the vector that contains the ith dimension of the gradients over all iterations up to t, g_{1:t,i} = [g_{1,i}, g_{2,i}, ..., g_{t,i}]. Also, we define γ ≜ β1²/√β2. The following theorem holds when the learning rate α_t decays at a rate of t^{−1/2} and the first-moment running-average coefficient β_{1,t} decays exponentially with λ, a value typically close to 1, e.g. 1 − 10^{−8}.

Theorem 4.1. Assume that the function f_t has bounded gradients, ‖∇f_t(θ)‖₂ ≤ G, ‖∇f_t(θ)‖∞ ≤ G∞ for all θ ∈ R^d, and the distance between any θ_t generated by Adam is bounded, ‖θ_n − θ_m‖₂ ≤ D, ‖θ_m − θ_n‖∞ ≤ D∞ for any m, n ∈ {1, ..., T}, and β1, β2 ∈ [0, 1) satisfy β1²/√β2 < 1. Let α_t = α/√t and β_{1,t} = β1 λ^{t−1}, λ ∈ (0, 1). Adam achieves the following guarantee, for all T ≥ 1:

    R(T) ≤ [D²/(2α(1 − β1))] Σ_{i=1}^{d} √(T·v̂_{T,i})
         + [α(1 + β1)G∞/((1 − β1)√(1 − β2)(1 − γ)²)] Σ_{i=1}^{d} ‖g_{1:T,i}‖₂
         + Σ_{i=1}^{d} [D∞² G∞ √(1 − β2)/(2α(1 − β1)(1 − λ)²)]

Our Theorem 4.1 implies that when the data features are sparse and the gradients are bounded, the summation terms can be much smaller than their upper bounds, Σ_{i=1}^{d} ‖g_{1:T,i}‖₂ << dG∞√T and Σ_{i=1}^{d} √(T·v̂_{T,i}) << dG∞√T, in particular if the class of functions and data features are of the form of section 1.2 in (Duchi et al., 2011). Their results for the expected value E[Σ_{i=1}^{d} ‖g_{1:T,i}‖₂] also apply to Adam. In particular, adaptive methods such as Adam and AdaGrad can achieve O(log d · √T), an improvement over the O(√(dT)) of non-adaptive methods. Decaying β_{1,t} towards zero is important in our theoretical analysis and also matches previous empirical findings, e.g. (Sutskever et al., 2013) suggests that reducing the momentum coefficient at the end of training can improve convergence.

Finally, we can show that the average regret of Adam converges:

Corollary 4.2. Assume that the function f_t has bounded gradients, ‖∇f_t(θ)‖₂ ≤ G, ‖∇f_t(θ)‖∞ ≤ G∞ for all θ ∈ R^d, and the distance between any θ_t generated by Adam is bounded, ‖θ_n − θ_m‖₂ ≤ D, ‖θ_m − θ_n‖∞ ≤ D∞ for any m, n ∈ {1, ..., T}. Adam achieves the following guarantee, for all T ≥ 1:

    R(T)/T = O(1/√T)

This result can be obtained by using Theorem 4.1 and Σ_{i=1}^{d} ‖g_{1:T,i}‖₂ ≤ dG∞√T. Thus, lim_{T→∞} R(T)/T = 0.

5 RELATED WORK

Optimization methods bearing a direct relation to Adam are RMSProp (Tieleman & Hinton, 2012; Graves, 2013) and AdaGrad (Duchi et al., 2011); these relationships are discussed below. Other stochastic optimization methods include vSGD (Schaul et al., 2012), AdaDelta (Zeiler, 2012) and the natural Newton method from Roux & Fitzgibbon (2010), all setting stepsizes by estimating curvature from first-order information. The Sum-of-Functions Optimizer (SFO) (Sohl-Dickstein et al., 2014) is a quasi-Newton method based on minibatches, but (unlike Adam) has memory requirements linear in the number of minibatch partitions of a dataset, which is often infeasible on memory-constrained systems such as a GPU. Like natural gradient descent (NGD) (Amari, 1998), Adam employs a preconditioner that adapts to the geometry of the data, since v̂_t is an approximation to the diagonal of the Fisher information matrix (Pascanu & Bengio, 2013); however, Adam's preconditioner (like AdaGrad's) is more conservative in its adaptation than vanilla NGD, preconditioning with the square root of the inverse of the diagonal Fisher information matrix approximation.

RMSProp: An optimization method closely related to Adam is RMSProp (Tieleman & Hinton, 2012). A version with momentum has sometimes been used (Graves, 2013). There are a few important differences between RMSProp with momentum and Adam: RMSProp with momentum generates its parameter updates using momentum on the rescaled gradient, whereas Adam updates are directly estimated using running averages of the first and second moments of the gradient. RMSProp also lacks a bias-correction term; this matters most in case of a value of β2 close to 1 (required in case of sparse gradients), since in that case not correcting the bias leads to very large stepsizes and often divergence, as we also empirically demonstrate in section 6.4.

AdaGrad: An algorithm that works well for sparse gradients is AdaGrad (Duchi et al., 2011). Its basic version updates parameters as θ_{t+1} = θ_t − α · g_t/√(Σ_{i=1}^{t} g_i²). Note that if we choose β2 to be infinitesimally close to 1 from below, then lim_{β2→1} v̂_t = t^{−1} · Σ_{i=1}^{t} g_i². AdaGrad corresponds to a version of Adam with β1 = 0, infinitesimal (1 − β2) and a replacement of α by an annealed version α_t = α · t^{−1/2}, namely θ_t − α · t^{−1/2} · m̂_t/√(lim_{β2→1} v̂_t) = θ_t − α · t^{−1/2} · g_t/√(t^{−1} · Σ_{i=1}^{t} g_i²) = θ_t − α · g_t/√(Σ_{i=1}^{t} g_i²). Note that this direct correspondence between Adam and AdaGrad does not hold when removing the bias-correction terms; without bias correction, like in RMSProp, a β2 infinitesimally close to 1 would lead to infinitely large bias, and infinitely large parameter updates. The correspondence can be checked numerically, as in the snippet below.
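A small sketch with a toy one-dimensional gradient history (β1 = 0, so m̂_t = g_t) verifying the algebra above:

    import numpy as np

    g = np.array([0.5, -0.2, 0.9, 0.1])                  # toy gradient history g_1..g_t
    t = len(g)
    adagrad_step = g[-1] / np.sqrt(np.sum(g ** 2))       # g_t / sqrt(sum_i g_i^2)
    vhat_limit = np.mean(g ** 2)                         # lim_{beta2 -> 1} of v_hat_t
    adam_step = t ** -0.5 * g[-1] / np.sqrt(vhat_limit)  # annealed-Adam step
    print(adagrad_step, adam_step)                       # the two values coincide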

6 EXPERIMENTS

To empirically evaluate the proposed method, we investigated different popular machine learning models, including logistic regression, multilayer fully connected neural networks and deep convolutional neural networks. Using large models and datasets, we demonstrate that Adam can efficiently solve practical deep learning problems.

We use the same parameter initialization when comparing different optimization algorithms. The hyper-parameters, such as learning rate and momentum, are searched over a dense grid and the results are reported using the best hyper-parameter setting.

6.1 EXPERIMENT: LOGISTIC REGRESSION

We evaluate our proposed method on L2-regularized multi-class logistic regression using the MNIST dataset. Logistic regression has a well-studied convex objective, making it suitable for comparison of different optimizers without worrying about local minimum issues. The stepsize α in our logistic regression experiments is adjusted by a 1/√t decay, namely α_t = α/√t, which matches our theoretical prediction from section 4. The logistic regression classifies the class label directly on the 784-dimension image vectors. We compare Adam to accelerated SGD with Nesterov momentum and AdaGrad using a minibatch size of 128. According to Figure 1, Adam yields similar convergence to SGD with momentum, and both converge faster than AdaGrad.

Figure 1: Logistic regression training negative log likelihood on MNIST images and IMDB movie reviews with 10,000 bag-of-words (BoW) feature vectors.

As discussed in (Duchi et al., 2011), AdaGrad can efficiently deal with sparse features and gradients, as one of its main theoretical results, whereas SGD is slow at learning rare features. Adam with 1/√t decay on its stepsize should theoretically match the performance of AdaGrad. We examine the sparse feature problem using the IMDB movie review dataset from (Maas et al., 2011). We pre-process the IMDB movie reviews into bag-of-words (BoW) feature vectors including the first 10,000 most frequent words. The 10,000-dimension BoW feature vector for each review is highly sparse. As suggested in (Wang & Manning, 2013), 50% dropout noise can be applied to the BoW features during training to prevent over-fitting. In Figure 1, AdaGrad outperforms SGD with Nesterov momentum by a large margin both with and without dropout noise. Adam converges as fast as AdaGrad. The empirical performance of Adam is consistent with our theoretical findings in sections 2 and 4. Similar to AdaGrad, Adam can take advantage of sparse features and obtain faster convergence than normal SGD with momentum.

6.2 EXPERIMENT: MULTI-LAYER NEURAL NETWORKS

Multi-layer neural networks are powerful models with non-convex objective functions. Although our convergence analysis does not apply to non-convex problems, we empirically found that Adam often outperforms other methods in such cases. In our experiments, we made model choices that are consistent with previous publications in the area; a neural network model with two fully connected hidden layers with 1000 hidden units each and ReLU activation is used for this experiment, with a minibatch size of 128.

First, we study different optimizers using the standard deterministic cross-entropy objective function with L2 weight decay on the parameters to prevent over-fitting. The sum-of-functions (SFO) method (Sohl-Dickstein et al., 2014) is a recently proposed quasi-Newton method that works with minibatches of data and has shown good performance on optimization of multi-layer neural networks. We used their implementation and compared it with Adam to train such models. Figure 2 shows that Adam makes faster progress in terms of both the number of iterations and wall-clock time. Due to the cost of updating curvature information, SFO is 5-10x slower per iteration compared to Adam, and has a memory requirement that is linear in the number of minibatches.

Stochastic regularization methods, such as dropout, are an effective way to prevent over-fitting and are often used in practice due to their simplicity. SFO assumes deterministic subfunctions, and indeed failed to converge on cost functions with stochastic regularization. We compare the effectiveness of Adam to other stochastic first-order methods on multi-layer neural networks trained with dropout noise. Figure 2 shows our results; Adam shows better convergence than the other methods.

6.3 EXPERIMENT: CONVOLUTIONAL NEURAL NETWORKS

Convolutional neural networks (CNNs) with several layers of convolution, pooling and non-linear units have shown considerable success in computer vision tasks. Unlike most fully connected neural nets, weight sharing in CNNs results in vastly different gradients in different layers. A smaller learning rate for the convolution layers is often used in practice when applying SGD. We show the effectiveness of Adam in deep CNNs. Our CNN architecture has three alternating stages of 5x5 convolution filters and 3x3 max pooling with stride of 2 that are followed by a fully connected layer of 1000 rectified linear hidden units (ReLUs). The input images are pre-processed by whitening, and dropout noise is applied to the input layer and fully connected layer.

Figure 2: Training of multilayer neural networks on MNIST images. (a) Neural networks using dropout stochastic regularization. (b) Neural networks with deterministic cost function. We compare with the sum-of-functions (SFO) optimizer (Sohl-Dickstein et al., 2014).

Figure 3: Convolutional neural networks training cost. (left) Training cost for the first three epochs. (right) Training cost over 45 epochs. CIFAR-10 with c64-c64-c128-1000 architecture.

The minibatch size is also set to 128, as in the previous experiments.
Interestingly, although both Adam and AdaGrad make rapid progress lowering the cost in the initial stage of training, as shown in Figure 3 (left), Adam and SGD eventually converge considerably faster than AdaGrad for CNNs, as shown in Figure 3 (right). We notice the second moment estimate v̂_t vanishes to zero after a few epochs and is dominated by the ε in algorithm 1. The second moment estimate is therefore a poor approximation to the geometry of the cost function in CNNs compared to the fully connected network from Section 6.2. In contrast, reducing the minibatch variance through the first moment is more important in CNNs and contributes to the speed-up. As a result, AdaGrad converges much slower than the others in this particular experiment. Though Adam shows only marginal improvement over SGD with momentum, it adapts the learning rate scale for the different layers instead of requiring it to be hand-picked, as in SGD.


Figure 4: Effect of bias-correction terms (red line) versus no bias correction terms (green line) after 10 epochs (left) and 100 epochs (right) on the loss (y-axes) when learning a Variational Auto-Encoder (VAE) (Kingma & Welling, 2013), for different settings of stepsize α (x-axes) and hyper-parameters β1 and β2.

6.4 EXPERIMENT: BIAS-CORRECTION TERM

We also empirically evaluate the effect of the bias correction terms explained in sections 2 and 3. As discussed in section 5, removal of the bias correction terms results in a version of RMSProp (Tieleman & Hinton, 2012) with momentum. We vary β1 and β2 when training a variational auto-encoder (VAE) with the same architecture as in (Kingma & Welling, 2013), with a single hidden layer with 500 hidden units with softplus nonlinearities and a 50-dimensional spherical Gaussian latent variable. We iterated over a broad range of hyper-parameter choices, i.e. β1 ∈ [0, 0.9], β2 ∈ {0.99, 0.999, 0.9999}, and log10(α) ∈ {−5, ..., −1}. Values of β2 close to 1, required for robustness to sparse gradients, result in larger initialization bias; we therefore expect the bias correction term to be important in such cases of slow decay, preventing an adverse effect on optimization.

In Figure 4, values of β2 close to 1 indeed lead to instabilities in training when no bias correction term was present, especially in the first few epochs of training. The best results were achieved with small values of (1 − β2) and bias correction; this was more apparent towards the end of optimization, when gradients tend to become sparser as hidden units specialize to specific patterns. In summary, Adam performed equal to or better than RMSProp, regardless of hyper-parameter setting.

7 EXTENSIONS

7.1 ADAMAX

In Adam, the update rule for individual weights is to scale their gradients inversely proportional to a (scaled) L² norm of their individual current and past gradients. We can generalize the L² norm based update rule to an Lᵖ norm based update rule. Such variants become numerically unstable for large p. However, in the special case where we let p → ∞, a surprisingly simple and stable algorithm emerges; see algorithm 2. We'll now derive the algorithm. Let, in case of the Lᵖ norm, the stepsize at time t be inversely proportional to v_t^{1/p}, where:

    v_t = β2^p · v_{t−1} + (1 − β2^p)|g_t|^p    (6)
        = (1 − β2^p) Σ_{i=1}^{t} β2^{p(t−i)} · |g_i|^p    (7)

Algorithm 2: AdaMax, a variant of Adam based on the infinity norm. See section 7.1 for details. Good default settings for the tested machine learning problems are α = 0.002, β1 = 0.9 and β2 = 0.999. With β1^t we denote β1 to the power t. Here, (α/(1 − β1^t)) is the learning rate with the bias-correction term for the first moment. All operations on vectors are element-wise.

Require: α: Stepsize
Require: β1, β2 ∈ [0, 1): Exponential decay rates
Require: f(θ): Stochastic objective function with parameters θ
Require: θ0: Initial parameter vector
  m0 ← 0 (Initialize 1st moment vector)
  u0 ← 0 (Initialize the exponentially weighted infinity norm)
  t ← 0 (Initialize timestep)
  while θ_t not converged do
    t ← t + 1
    g_t ← ∇_θ f_t(θ_{t−1}) (Get gradients w.r.t. stochastic objective at timestep t)
    m_t ← β1 · m_{t−1} + (1 − β1) · g_t (Update biased first moment estimate)
    u_t ← max(β2 · u_{t−1}, |g_t|) (Update the exponentially weighted infinity norm)
    θ_t ← θ_{t−1} − (α/(1 − β1^t)) · m_t/u_t (Update parameters)
  end while
  return θ_t (Resulting parameters)
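As with Algorithm 1, one step of AdaMax translates directly into NumPy; a minimal sketch with the defaults above (a real implementation would carry m, u and t across minibatches):

    import numpy as np

    def adamax_update(theta, grad, m, u, t, alpha=0.002,
                      beta1=0.9, beta2=0.999):
        # t is the 1-based timestep; assumes a nonzero first gradient,
        # since u starts at 0 and appears in the denominator.
        m = beta1 * m + (1 - beta1) * grad        # biased 1st moment estimate
        u = np.maximum(beta2 * u, np.abs(grad))   # exponentially weighted inf-norm
        theta = theta - (alpha / (1 - beta1 ** t)) * m / u
        return theta, m, u

Note that u needs no bias correction; only the first moment estimate is corrected, through the α/(1 − β1^t) factor in the update.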

Note that the decay term is here equivalently parameterised as β2^p instead of β2. Now let p → ∞, and define u_t = lim_{p→∞} (v_t)^{1/p}; then:

    u_t = lim_{p→∞} (v_t)^{1/p} = lim_{p→∞} ((1 − β2^p) Σ_{i=1}^{t} β2^{p(t−i)} · |g_i|^p)^{1/p}    (8)
        = lim_{p→∞} (1 − β2^p)^{1/p} (Σ_{i=1}^{t} β2^{p(t−i)} · |g_i|^p)^{1/p}    (9)
        = lim_{p→∞} (Σ_{i=1}^{t} (β2^{t−i} · |g_i|)^p)^{1/p}    (10)
        = max(β2^{t−1}|g_1|, β2^{t−2}|g_2|, ..., β2|g_{t−1}|, |g_t|)    (11)

which corresponds to the remarkably simple recursive formula:

    u_t = max(β2 · u_{t−1}, |g_t|)    (12)

with initial value u0 = 0. Note that, conveniently enough, we don't need to correct for initialization bias in this case. Also note that the magnitude of parameter updates has a simpler bound with AdaMax than with Adam, namely: |∆_t| ≤ α.

7.2 TEMPORAL AVERAGING

Since the last iterate is noisy due to stochastic approximation, better generalization performance is often achieved by averaging. Previously in Moulines & Bach (2011), Polyak-Ruppert averaging (Polyak & Juditsky, 1992; Ruppert, 1988) has been shown to improve the convergence of standard SGD, where θ̄_t = (1/t) Σ_{k=1}^{t} θ_k. Alternatively, an exponential moving average over the parameters can be used, giving higher weight to more recent parameter values. This can be trivially implemented by adding one line to the inner loop of algorithms 1 and 2: θ̄_t ← β2 · θ̄_{t−1} + (1 − β2)·θ_t, with θ̄_0 = 0. Initialization bias can again be corrected by the estimator θ̂_t = θ̄_t/(1 − β2^t).
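A sketch of this averaging (the constant list of iterates below is just a stand-in for the parameter vectors produced by Adam):

    import numpy as np

    thetas = [np.array([1.0, 2.0])] * 100       # stand-in for the iterates of Adam
    beta2, theta_avg = 0.999, 0.0
    for t, theta in enumerate(thetas, start=1):
        theta_avg = beta2 * theta_avg + (1 - beta2) * theta
    theta_hat = theta_avg / (1 - beta2 ** t)    # bias-corrected; recovers [1., 2.] here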

8 CONCLUSION

We have introduced a simple and computationally efficient algorithm for gradient-based optimization of stochastic objective functions. Our method is aimed at machine learning problems with large datasets and/or high-dimensional parameter spaces. The method combines the advantages of two recently popular optimization methods: the ability of AdaGrad to deal with sparse gradients, and the ability of RMSProp to deal with non-stationary objectives. The method is straightforward to implement and requires little memory. The experiments confirm the analysis on the rate of convergence in convex problems. Overall, we found Adam to be robust and well-suited to a wide range of non-convex optimization problems in the field of machine learning.

9 ACKNOWLEDGMENTS
This paper would probably not have existed without the support of Google Deepmind. We would
like to give special thanks to Ivo Danihelka, and Tom Schaul for coining the name Adam. Thanks to
Kai Fan from Duke University for spotting an error in the original AdaMax derivation. Experiments
in this work were partly carried out on the Dutch national e-infrastructure with the support of SURF
Foundation. Diederik Kingma is supported by the Google European Doctorate Fellowship in Deep
Learning.

REFERENCES
Amari, Shun-Ichi. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.

Deng, Li, Li, Jinyu, Huang, Jui-Ting, Yao, Kaisheng, Yu, Dong, Seide, Frank, Seltzer, Michael, Zweig, Geoff,
He, Xiaodong, Williams, Jason, et al. Recent advances in deep learning for speech research at microsoft.
ICASSP 2013, 2013.

Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic
optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.

Graves, Alex. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

Graves, Alex, Mohamed, Abdel-rahman, and Hinton, Geoffrey. Speech recognition with deep recurrent neural
networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on,
pp. 6645–6649. IEEE, 2013.

Hinton, G.E. and Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science, 313
(5786):504–507, 2006.

Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George E, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior,
Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara N, et al. Deep neural networks for acoustic
modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine,
IEEE, 29(6):82–97, 2012a.

Hinton, Geoffrey E, Srivastava, Nitish, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan R. Im-
proving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580,
2012b.

Kingma, Diederik P and Welling, Max. Auto-Encoding Variational Bayes. In The 2nd International Conference
on Learning Representations (ICLR), 2013.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional
neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.

Maas, Andrew L, Daly, Raymond E, Pham, Peter T, Huang, Dan, Ng, Andrew Y, and Potts, Christopher.
Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association
for Computational Linguistics: Human Language Technologies-Volume 1, pp. 142–150. Association for
Computational Linguistics, 2011.

Moulines, Eric and Bach, Francis R. Non-asymptotic analysis of stochastic approximation algorithms for
machine learning. In Advances in Neural Information Processing Systems, pp. 451–459, 2011.

Pascanu, Razvan and Bengio, Yoshua. Revisiting natural gradient for deep networks. arXiv preprint
arXiv:1301.3584, 2013.

Polyak, Boris T and Juditsky, Anatoli B. Acceleration of stochastic approximation by averaging. SIAM Journal
on Control and Optimization, 30(4):838–855, 1992.


Roux, Nicolas L and Fitzgibbon, Andrew W. A fast natural newton method. In Proceedings of the 27th
International Conference on Machine Learning (ICML-10), pp. 623–630, 2010.

Ruppert, David. Efficient estimations from a slowly convergent robbins-monro process. Technical report,
Cornell University Operations Research and Industrial Engineering, 1988.

Schaul, Tom, Zhang, Sixin, and LeCun, Yann. No more pesky learning rates. arXiv preprint arXiv:1206.1106,
2012.

Sohl-Dickstein, Jascha, Poole, Ben, and Ganguli, Surya. Fast large-scale optimization by unifying stochas-
tic gradient and quasi-newton methods. In Proceedings of the 31st International Conference on Machine
Learning (ICML-14), pp. 604–612, 2014.

Sutskever, Ilya, Martens, James, Dahl, George, and Hinton, Geoffrey. On the importance of initialization and
momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning
(ICML-13), pp. 1139–1147, 2013.

Tieleman, T. and Hinton, G. Lecture 6.5 - RMSProp, COURSERA: Neural Networks for Machine Learning.
Technical report, 2012.

Wang, Sida and Manning, Christopher. Fast dropout training. In Proceedings of the 30th International Confer-
ence on Machine Learning (ICML-13), pp. 118–126, 2013.
Zeiler, Matthew D. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Zinkevich, Martin. Online convex programming and generalized infinitesimal gradient ascent. 2003.

11
Published as a conference paper at ICLR 2015

10 Appendix

10.1 Convergence Proof

Definition 10.1. A function $f : \mathbb{R}^d \to \mathbb{R}$ is convex if for all $x, y \in \mathbb{R}^d$, for all $\lambda \in [0, 1]$,
$$\lambda f(x) + (1-\lambda) f(y) \geq f(\lambda x + (1-\lambda) y).$$

Also, notice that a convex function can be lower bounded by a hyperplane at its tangent.

Lemma 10.2. If a function $f : \mathbb{R}^d \to \mathbb{R}$ is convex, then for all $x, y \in \mathbb{R}^d$,
$$f(y) \geq f(x) + \nabla f(x)^T (y - x).$$

The above lemma can be used to upper bound the regret, and our proof for the main theorem is
constructed by substituting the hyperplane with the Adam update rules. The following two lemmas
are used to support our main theorem. We also use some definitions to simplify our notation:
$g_t \triangleq \nabla f_t(\theta_t)$, with $g_{t,i}$ its $i$-th element. We define $g_{1:t,i} \in \mathbb{R}^t$ as the vector
that contains the $i$-th dimension of the gradients over all iterations up to $t$: $g_{1:t,i} = [g_{1,i}, g_{2,i}, \cdots, g_{t,i}]$.
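For reference, $R(T)$ in Theorem 10.5 below denotes the standard online-learning regret (this definition is given in the main text of the paper, outside this excerpt):
$$R(T) = \sum_{t=1}^{T} \left[ f_t(\theta_t) - f_t(\theta^*) \right], \qquad \theta^* = \arg\min_{\theta} \sum_{t=1}^{T} f_t(\theta).$$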
Lemma 10.3. Let $g_t = \nabla f_t(\theta_t)$ and $g_{1:t}$ be defined as above and bounded, $\|g_t\|_2 \le G$, $\|g_t\|_\infty \le G_\infty$. Then,
$$\sum_{t=1}^{T} \sqrt{\frac{g_{t,i}^2}{t}} \le 2 G_\infty \|g_{1:T,i}\|_2.$$

Proof. We will prove the inequality using induction over $T$.

For the base case $T = 1$, we have $\sqrt{g_{1,i}^2} \le 2 G_\infty \|g_{1,i}\|_2$.

For the inductive step,
$$\sum_{t=1}^{T} \sqrt{\frac{g_{t,i}^2}{t}} = \sum_{t=1}^{T-1} \sqrt{\frac{g_{t,i}^2}{t}} + \sqrt{\frac{g_{T,i}^2}{T}}
\le 2 G_\infty \|g_{1:T-1,i}\|_2 + \sqrt{\frac{g_{T,i}^2}{T}}
= 2 G_\infty \sqrt{\|g_{1:T,i}\|_2^2 - g_{T,i}^2} + \sqrt{\frac{g_{T,i}^2}{T}}.$$

From $\|g_{1:T,i}\|_2^2 - g_{T,i}^2 + \frac{g_{T,i}^4}{4 \|g_{1:T,i}\|_2^2} \ge \|g_{1:T,i}\|_2^2 - g_{T,i}^2$, we can take the square root of both sides and obtain
$$\sqrt{\|g_{1:T,i}\|_2^2 - g_{T,i}^2} \le \|g_{1:T,i}\|_2 - \frac{g_{T,i}^2}{2 \|g_{1:T,i}\|_2}
\le \|g_{1:T,i}\|_2 - \frac{g_{T,i}^2}{2 \sqrt{T G_\infty^2}}.$$

Rearranging the inequality and substituting for the $\sqrt{\|g_{1:T,i}\|_2^2 - g_{T,i}^2}$ term,
$$2 G_\infty \sqrt{\|g_{1:T,i}\|_2^2 - g_{T,i}^2} + \sqrt{\frac{g_{T,i}^2}{T}} \le 2 G_\infty \|g_{1:T,i}\|_2. \qquad \square$$


Lemma 10.4. Let $\gamma \triangleq \frac{\beta_1^2}{\sqrt{\beta_2}}$. For $\beta_1, \beta_2 \in [0,1)$ that satisfy $\frac{\beta_1^2}{\sqrt{\beta_2}} < 1$ and bounded $g_t$, $\|g_t\|_2 \le G$, $\|g_t\|_\infty \le G_\infty$, the following inequality holds:
$$\sum_{t=1}^{T} \frac{\hat m_{t,i}^2}{\sqrt{t \hat v_{t,i}}} \le \frac{2}{1-\gamma} \frac{1}{\sqrt{1-\beta_2}} \|g_{1:T,i}\|_2.$$


Proof. Under the assumption, $\frac{\sqrt{1-\beta_2^t}}{(1-\beta_1^t)^2} \le \frac{1}{(1-\beta_1)^2}$. We can expand the last term in the summation using the update rules in Algorithm 1:
$$\sum_{t=1}^{T} \frac{\hat m_{t,i}^2}{\sqrt{t \hat v_{t,i}}} = \sum_{t=1}^{T-1} \frac{\hat m_{t,i}^2}{\sqrt{t \hat v_{t,i}}} + \frac{\sqrt{1-\beta_2^T}}{(1-\beta_1^T)^2} \frac{\left(\sum_{k=1}^{T} (1-\beta_1)\beta_1^{T-k} g_{k,i}\right)^2}{\sqrt{T \sum_{j=1}^{T} (1-\beta_2)\beta_2^{T-j} g_{j,i}^2}}$$
$$\le \sum_{t=1}^{T-1} \frac{\hat m_{t,i}^2}{\sqrt{t \hat v_{t,i}}} + \frac{\sqrt{1-\beta_2^T}}{(1-\beta_1^T)^2} \sum_{k=1}^{T} \frac{T \left((1-\beta_1)\beta_1^{T-k} g_{k,i}\right)^2}{\sqrt{T \sum_{j=1}^{T} (1-\beta_2)\beta_2^{T-j} g_{j,i}^2}}$$
$$\le \sum_{t=1}^{T-1} \frac{\hat m_{t,i}^2}{\sqrt{t \hat v_{t,i}}} + \frac{\sqrt{1-\beta_2^T}}{(1-\beta_1^T)^2} \sum_{k=1}^{T} \frac{T \left((1-\beta_1)\beta_1^{T-k} g_{k,i}\right)^2}{\sqrt{T (1-\beta_2)\beta_2^{T-k} g_{k,i}^2}}$$
$$\le \sum_{t=1}^{T-1} \frac{\hat m_{t,i}^2}{\sqrt{t \hat v_{t,i}}} + \frac{(1-\beta_1)^2 \sqrt{1-\beta_2^T}}{(1-\beta_1^T)^2 \sqrt{T(1-\beta_2)}} \sum_{k=1}^{T} T \left(\frac{\beta_1^2}{\sqrt{\beta_2}}\right)^{T-k} \|g_{k,i}\|_2$$
$$\le \sum_{t=1}^{T-1} \frac{\hat m_{t,i}^2}{\sqrt{t \hat v_{t,i}}} + \frac{T}{\sqrt{T(1-\beta_2)}} \sum_{k=1}^{T} \gamma^{T-k} \|g_{k,i}\|_2.$$

Similarly, we can upper bound the rest of the terms in the summation:
$$\sum_{t=1}^{T} \frac{\hat m_{t,i}^2}{\sqrt{t \hat v_{t,i}}} \le \sum_{t=1}^{T} \frac{\|g_{t,i}\|_2}{\sqrt{t(1-\beta_2)}} \sum_{j=0}^{T-t} t\gamma^j
\le \sum_{t=1}^{T} \frac{\|g_{t,i}\|_2}{\sqrt{t(1-\beta_2)}} \sum_{j=0}^{T} t\gamma^j.$$

For $\gamma < 1$, using the upper bound on the arithmetic-geometric series, $\sum_t t\gamma^t < \frac{1}{(1-\gamma)^2}$:
$$\sum_{t=1}^{T} \frac{\|g_{t,i}\|_2}{\sqrt{t(1-\beta_2)}} \sum_{j=0}^{T} t\gamma^j \le \frac{1}{(1-\gamma)^2 \sqrt{1-\beta_2}} \sum_{t=1}^{T} \frac{\|g_{t,i}\|_2}{\sqrt{t}}.$$

Applying Lemma 10.3,
$$\sum_{t=1}^{T} \frac{\hat m_{t,i}^2}{\sqrt{t \hat v_{t,i}}} \le \frac{2 G_\infty}{(1-\gamma)^2 \sqrt{1-\beta_2}} \|g_{1:T,i}\|_2. \qquad \square$$

To simplify the notation, we define $\gamma \triangleq \frac{\beta_1^2}{\sqrt{\beta_2}}$. Intuitively, our following theorem holds when the
learning rate $\alpha_t$ is decaying at a rate of $t^{-\frac{1}{2}}$ and the first moment running average coefficient $\beta_{1,t}$ decays
exponentially with $\lambda$, a value typically close to 1, e.g. $1 - 10^{-8}$.
Theorem 10.5. Assume that the function $f_t$ has bounded gradients, $\|\nabla f_t(\theta)\|_2 \le G$, $\|\nabla f_t(\theta)\|_\infty \le G_\infty$ for all $\theta \in \mathbb{R}^d$, that the distance between any $\theta_t$ generated by Adam is bounded, $\|\theta_n - \theta_m\|_2 \le D$, $\|\theta_m - \theta_n\|_\infty \le D_\infty$ for any $m, n \in \{1, \ldots, T\}$, and that $\beta_1, \beta_2 \in [0,1)$ satisfy $\frac{\beta_1^2}{\sqrt{\beta_2}} < 1$. Let $\alpha_t = \frac{\alpha}{\sqrt{t}}$ and $\beta_{1,t} = \beta_1 \lambda^{t-1}$, $\lambda \in (0,1)$. Adam achieves the following guarantee, for all $T \ge 1$:
$$R(T) \le \frac{D^2}{2\alpha(1-\beta_1)} \sum_{i=1}^{d} \sqrt{T \hat v_{T,i}} + \frac{\alpha(\beta_1+1) G_\infty}{(1-\beta_1)\sqrt{1-\beta_2}(1-\gamma)^2} \sum_{i=1}^{d} \|g_{1:T,i}\|_2 + \sum_{i=1}^{d} \frac{D_\infty^2 G_\infty \sqrt{1-\beta_2}}{2\alpha(1-\beta_1)(1-\lambda)^2}.$$

Proof. Using Lemma 10.2, we have
$$f_t(\theta_t) - f_t(\theta^*) \le g_t^T (\theta_t - \theta^*) = \sum_{i=1}^{d} g_{t,i} (\theta_{t,i} - \theta^*_{,i}).$$

From the update rules presented in Algorithm 1,
$$\theta_{t+1} = \theta_t - \alpha_t \hat m_t / \sqrt{\hat v_t}
= \theta_t - \frac{\alpha_t}{1-\beta_1^t} \left( \frac{\beta_{1,t}}{\sqrt{\hat v_t}} m_{t-1} + \frac{(1-\beta_{1,t})}{\sqrt{\hat v_t}} g_t \right).$$

We focus on the $i$th dimension of the parameter vector $\theta_t \in \mathbb{R}^d$. Subtracting the scalar $\theta^*_{,i}$ and squaring both sides of the above update rule, we have
$$(\theta_{t+1,i} - \theta^*_{,i})^2 = (\theta_{t,i} - \theta^*_{,i})^2 - \frac{2\alpha_t}{1-\beta_1^t}\left(\frac{\beta_{1,t}}{\sqrt{\hat v_{t,i}}} m_{t-1,i} + \frac{(1-\beta_{1,t})}{\sqrt{\hat v_{t,i}}} g_{t,i}\right)(\theta_{t,i} - \theta^*_{,i}) + \alpha_t^2 \left(\frac{\hat m_{t,i}}{\sqrt{\hat v_{t,i}}}\right)^2.$$

We can rearrange the above equation and use Young's inequality, $ab \le a^2/2 + b^2/2$. Also, it can be shown that $\sqrt{\hat v_{t,i}} = \sqrt{\sum_{j=1}^{t}(1-\beta_2)\beta_2^{t-j} g_{j,i}^2} \,\big/\, \sqrt{1-\beta_2^t} \le \|g_{1:t,i}\|_2$ and $\beta_{1,t} \le \beta_1$. Then
$$g_{t,i}(\theta_{t,i} - \theta^*_{,i}) = \frac{(1-\beta_1^t)\sqrt{\hat v_{t,i}}}{2\alpha_t (1-\beta_{1,t})} \left( (\theta_{t,i} - \theta^*_{,i})^2 - (\theta_{t+1,i} - \theta^*_{,i})^2 \right) + \frac{\beta_{1,t}}{1-\beta_{1,t}} \frac{\hat v_{t-1,i}^{1/4}}{\sqrt{\alpha_{t-1}}} (\theta^*_{,i} - \theta_{t,i}) \cdot \sqrt{\alpha_{t-1}}\, \frac{m_{t-1,i}}{\hat v_{t-1,i}^{1/4}} + \frac{\alpha_t (1-\beta_1^t) \sqrt{\hat v_{t,i}}}{2(1-\beta_{1,t})} \left( \frac{\hat m_{t,i}}{\sqrt{\hat v_{t,i}}} \right)^2$$
$$\le \frac{\sqrt{\hat v_{t,i}}}{2\alpha_t(1-\beta_1)} \left( (\theta_{t,i} - \theta^*_{,i})^2 - (\theta_{t+1,i} - \theta^*_{,i})^2 \right) + \frac{\beta_{1,t}}{2\alpha_{t-1}(1-\beta_{1,t})} (\theta^*_{,i} - \theta_{t,i})^2 \sqrt{\hat v_{t-1,i}} + \frac{\beta_1 \alpha_{t-1}}{2(1-\beta_1)} \frac{m_{t-1,i}^2}{\sqrt{\hat v_{t-1,i}}} + \frac{\alpha_t}{2(1-\beta_1)} \frac{\hat m_{t,i}^2}{\sqrt{\hat v_{t,i}}}.$$

We apply Lemma 10.4 to the above inequality and derive the regret bound by summing across all the dimensions $i \in \{1, \ldots, d\}$ in the upper bound of $f_t(\theta_t) - f_t(\theta^*)$ and over the sequence of convex functions for $t \in \{1, \ldots, T\}$:
$$R(T) \le \sum_{i=1}^{d} \frac{1}{2\alpha_1(1-\beta_1)} (\theta_{1,i} - \theta^*_{,i})^2 \sqrt{\hat v_{1,i}} + \sum_{i=1}^{d} \sum_{t=2}^{T} \frac{1}{2(1-\beta_1)} (\theta_{t,i} - \theta^*_{,i})^2 \left( \frac{\sqrt{\hat v_{t,i}}}{\alpha_t} - \frac{\sqrt{\hat v_{t-1,i}}}{\alpha_{t-1}} \right) + \frac{\beta_1 \alpha G_\infty}{(1-\beta_1)\sqrt{1-\beta_2}(1-\gamma)^2} \sum_{i=1}^{d} \|g_{1:T,i}\|_2 + \frac{\alpha G_\infty}{(1-\beta_1)\sqrt{1-\beta_2}(1-\gamma)^2} \sum_{i=1}^{d} \|g_{1:T,i}\|_2 + \sum_{i=1}^{d} \sum_{t=1}^{T} \frac{\beta_{1,t}}{2\alpha_t(1-\beta_{1,t})} (\theta^*_{,i} - \theta_{t,i})^2 \sqrt{\hat v_{t,i}}.$$


From the assumption, $\|\theta_t - \theta^*\|_2 \le D$, $\|\theta_m - \theta_n\|_\infty \le D_\infty$, we have:
$$R(T) \le \frac{D^2}{2\alpha(1-\beta_1)} \sum_{i=1}^{d} \sqrt{T \hat v_{T,i}} + \frac{\alpha(1+\beta_1) G_\infty}{(1-\beta_1)\sqrt{1-\beta_2}(1-\gamma)^2} \sum_{i=1}^{d} \|g_{1:T,i}\|_2 + \frac{D_\infty^2}{2\alpha} \sum_{i=1}^{d} \sum_{t=1}^{T} \frac{\beta_{1,t}}{1-\beta_{1,t}} \sqrt{t \hat v_{t,i}}$$
$$\le \frac{D^2}{2\alpha(1-\beta_1)} \sum_{i=1}^{d} \sqrt{T \hat v_{T,i}} + \frac{\alpha(1+\beta_1) G_\infty}{(1-\beta_1)\sqrt{1-\beta_2}(1-\gamma)^2} \sum_{i=1}^{d} \|g_{1:T,i}\|_2 + \frac{D_\infty^2 G_\infty \sqrt{1-\beta_2}}{2\alpha} \sum_{i=1}^{d} \sum_{t=1}^{T} \frac{\beta_{1,t}}{1-\beta_{1,t}} \sqrt{t}.$$

We can use the arithmetic-geometric series upper bound for the last term:
$$\sum_{t=1}^{T} \frac{\beta_{1,t}}{1-\beta_{1,t}} \sqrt{t} \le \sum_{t=1}^{T} \frac{1}{1-\beta_1} \lambda^{t-1} \sqrt{t} \le \sum_{t=1}^{T} \frac{1}{1-\beta_1} \lambda^{t-1} t \le \frac{1}{(1-\beta_1)(1-\lambda)^2}.$$

Therefore, we have the following regret bound:
$$R(T) \le \frac{D^2}{2\alpha(1-\beta_1)} \sum_{i=1}^{d} \sqrt{T \hat v_{T,i}} + \frac{\alpha(1+\beta_1) G_\infty}{(1-\beta_1)\sqrt{1-\beta_2}(1-\gamma)^2} \sum_{i=1}^{d} \|g_{1:T,i}\|_2 + \sum_{i=1}^{d} \frac{D_\infty^2 G_\infty \sqrt{1-\beta_2}}{2\alpha(1-\beta_1)(1-\lambda)^2}. \qquad \square$$
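As a quick numeric sanity check of the arithmetic-geometric series bound used twice above (a standalone snippet of our own, not part of the original paper), recall that $\sum_{t \ge 1} t\lambda^{t-1} = 1/(1-\lambda)^2$ for $|\lambda| < 1$, so every partial sum stays strictly below this limit:

```python
# Numerically verify sum_{t=1}^{N} t * lam**(t-1) < 1 / (1 - lam)**2,
# the arithmetic-geometric series bound invoked in the proofs above.
for lam in (0.5, 0.9, 0.999):
    partial = sum(t * lam ** (t - 1) for t in range(1, 100_000))
    bound = 1.0 / (1.0 - lam) ** 2
    assert partial < bound
    print(f"lambda={lam}: partial sum={partial:.6f}, bound={bound:.6f}")
```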

Deep Learning with Limited Numerical Precision

Suyog Gupta suyog@us.ibm.com


Ankur Agrawal ankuragr@us.ibm.com
Kailash Gopalakrishnan kailash@us.ibm.com
IBM T. J. Watson Research Center, Yorktown Heights, NY 10598
Pritish Narayanan pnaraya@us.ibm.com
IBM Almaden Research Center, San Jose, CA 95120
arXiv:1502.02551v1 [cs.LG] 9 Feb 2015

Abstract

Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.

1. Introduction

To a large extent, the success of deep learning techniques is contingent upon the underlying hardware platform's ability to perform fast, supervised training of complex networks using large quantities of labeled data. Such a capability enables rapid evaluation of different network architectures and a thorough search over the space of model hyperparameters. It should therefore come as no surprise that recent years have seen a resurgence of interest in deploying large-scale computing infrastructure designed specifically for training deep neural networks. Some notable efforts in this direction include distributed computing infrastructure using thousands of CPU cores (Dean et al., 2012; Chilimbi et al., 2014), or high-end graphics processors (GPUs) (Krizhevsky & Hinton, 2009), or a combination of CPUs and GPUs scaled-up to multiple nodes (Coates et al., 2013; Wu et al., 2015).

At the same time, the natural error resiliency of neural network architectures and learning algorithms is well-documented, setting them apart from more traditional workloads that typically require precise computations and number representations with high dynamic range. It is well appreciated that in the presence of statistical approximation and estimation errors, high-precision computation in the context of learning is rather unnecessary (Bottou & Bousquet, 2007). Moreover, the addition of noise during training has been shown to improve the neural network's performance (Murray & Edwards, 1994; Bishop, 1995; Audhkhasi et al., 2013). With the exception of employing the asynchronous version of the stochastic gradient descent algorithm (Recht et al., 2011) to reduce network traffic, the state-of-the-art large-scale deep learning systems fail to adequately capitalize on the error-resiliency of their workloads. These systems are built by assembling general-purpose computing hardware designed to cater to the needs of more traditional workloads, incurring high and often unnecessary overhead in the required computational resources.

The work presented in this paper owes its inception to the thinking that it may be possible to leverage algorithm-level noise-tolerance to relax certain constraints on the underlying hardware, leading to a hardware-software co-optimized system that achieves significant improvement in computational performance and energy efficiency. Allowing the low-level hardware components to perform approximate, possibly non-deterministic computations and exposing these hardware-generated errors up to the algorithm level of the computing stack forms a key ingredient in developing such systems. Additionally, the low-level hardware changes need to be introduced in a manner that preserves the programming model so that the benefits can be readily absorbed at the application-level without incurring significant software redevelopment costs.

As a first step towards achieving this cross-layer co-design, we explore the use of low-precision fixed-point arithmetic for deep neural network training with a special focus on the rounding mode adopted while performing operations on fixed-point numbers. The motivation to move to fixed-point arithmetic (from the conventional floating-point computations) is two-fold. Firstly, fixed-point compute units are typically faster and consume far less hardware resources and power than floating-point engines. The smaller logic footprint of the fixed-point arithmetic circuits would allow for the instantiation of many more such units for a given area and power budget. Secondly, low-precision data representation reduces the memory footprint, enabling larger models to fit within the given memory capacity. Cumulatively, this could provide dramatically improved data-level parallelism.

The key finding of our exploration is that deep neural networks can be trained using low-precision fixed-point arithmetic, provided that the stochastic rounding scheme is applied while operating on fixed-point numbers. We test the validity of the proposed approach by training deep neural networks for the MNIST and CIFAR10 image classification tasks. Deep networks trained using 16-bit wide fixed-point and stochastic rounding achieve nearly the same performance as that obtained when trained using 32-bit floating-point computations. Furthermore, we present a hardware accelerator design, prototyped on an FPGA, that achieves high throughput and low power using a large number of fixed-point arithmetic units, a dataflow architecture, and compact stochastic rounding modules.

2. Related Work

Determining the precision of the data representation and the compute units is a critical design choice in the hardware (analog or digital) implementation of artificial neural networks. Not surprisingly, a rich body of literature exists that aims to quantify the effect of this choice on the network's performance. However, a disproportionately large majority of these studies are focused primarily on implementing just the feed-forward (inference) stage, assuming that the network is trained offline using high precision computations. Some recent studies that embrace this approach have relied on the processor's vector instructions to perform multiple 8-bit operations in parallel (Vanhoucke et al., 2011), or employ reconfigurable hardware (FPGAs) for high-throughput, energy-efficient inference (Farabet et al., 2011; Gokhale et al., 2014), or take the route of custom hardware implementations (Kim et al., 2014; Merolla et al., 2014).

Previous studies have also investigated neural network training using different number representations. Iwata et al. (Iwata et al., 1989) implement the back-propagation algorithm using 24-bit floating-point processing units. Hammerstrom (Hammerstrom, 1990) presents a framework for on-chip learning using 8 to 16 bit fixed-point arithmetic. In (Holt & Hwang, 1993), the authors perform theoretical analysis to understand a neural network's ability to learn when trained in a limited precision setting. Results from empirical evaluation of simple networks indicate that in most cases, 8-16 bits of precision is sufficient for back-propagation learning. In (Höhfeld & Fahlman, 1992), probabilistic rounding of weight updates is used to further reduce (< 8 bits) the precision requirements in gradient-based learning techniques. While these studies provide valuable insights into the behavior of limited precision training of neural networks, the networks considered are often limited to variants of the classical multilayer perceptron containing a single hidden layer and only a few hidden units. Extrapolating these results to the state-of-the-art deep neural networks that can easily contain millions of trainable parameters is non-trivial. Consequently, there is a need to reassess the impact of limited precision computations within the context of more contemporary deep neural network architectures, datasets, and training procedures.

A recent work (Chen et al., 2014) presents a hardware accelerator for deep neural network training that employs fixed-point computation units, but finds it necessary to use 32-bit fixed-point representation to achieve convergence while training a convolutional neural network on the MNIST dataset. In contrast, our results show that it is possible to train these networks using only 16-bit fixed-point numbers, so long as stochastic rounding is used during fixed-point computations. To our knowledge, this work represents the first study of the application of stochastic rounding while training deep neural networks using low-precision fixed-point arithmetic.

3. Limited Precision Arithmetic

Standard implementations of deep neural network training via the back-propagation algorithm typically use 32-bit floating-point (float) representation of real numbers for data storage and manipulation. Instead, consider the generalized fixed-point number representation [QI.QF], where QI and QF correspond to the integer and the fractional part of the number, respectively. The number of integer bits (IL) plus the number of fractional bits (FL) yields the total number of bits used to represent the number.


The sum IL + FL is referred to as the word length WL. In this paper, we use the notation ⟨IL, FL⟩ to denote a fixed-point representation in which IL (FL) corresponds to the length of the integer (fractional) part of the number. We also employ ε to denote the smallest positive number that may be represented in the given fixed-point format. Therefore, the ⟨IL, FL⟩ fixed-point format limits the precision to FL bits, sets the range to $\left[-2^{IL-1}, 2^{IL-1} - 2^{-FL}\right]$, and defines ε to be equal to $2^{-FL}$.

3.1. Rounding Modes

As will be evident in the sections to follow, the rounding mode adopted while converting a number (presumably represented using the float or a higher precision¹ fixed-point format) into a lower precision fixed-point representation turns out to be a matter of important consideration while performing computations on fixed-point numbers. Given a number x and the target fixed-point representation ⟨IL, FL⟩, we define ⌊x⌋ as the largest integer multiple of ε (= $2^{-FL}$) less than or equal to x and consider the following rounding schemes:

• Round-to-nearest:
$$\mathrm{Round}(x, \langle IL, FL \rangle) = \begin{cases} \lfloor x \rfloor & \text{if } \lfloor x \rfloor \le x \le \lfloor x \rfloor + \frac{\epsilon}{2} \\ \lfloor x \rfloor + \epsilon & \text{if } \lfloor x \rfloor + \frac{\epsilon}{2} < x \le \lfloor x \rfloor + \epsilon \end{cases}$$

• Stochastic rounding: the probability of rounding x to ⌊x⌋ is proportional to the proximity of x to ⌊x⌋:
$$\mathrm{Round}(x, \langle IL, FL \rangle) = \begin{cases} \lfloor x \rfloor & \text{w.p. } 1 - \frac{x - \lfloor x \rfloor}{\epsilon} \\ \lfloor x \rfloor + \epsilon & \text{w.p. } \frac{x - \lfloor x \rfloor}{\epsilon} \end{cases}$$
Stochastic rounding is an unbiased rounding scheme and possesses the desirable property that the expected rounding error is zero, i.e. $\mathrm{E}\left(\mathrm{Round}(x, \langle IL, FL \rangle)\right) = x$.

Irrespective of the rounding mode used, if x lies outside the range of ⟨IL, FL⟩, we saturate the result to either the lower or the upper limit of ⟨IL, FL⟩:
$$\mathrm{Convert}(x, \langle IL, FL \rangle) = \begin{cases} -2^{IL-1} & \text{if } x \le -2^{IL-1} \\ 2^{IL-1} - 2^{-FL} & \text{if } x \ge 2^{IL-1} - 2^{-FL} \\ \mathrm{Round}(x, \langle IL, FL \rangle) & \text{otherwise} \end{cases} \qquad (1)$$

3.2. Multiply and accumulate (MACC) operation

Consider two d-dimensional vectors a and b such that each component is represented in the fixed-point format ⟨IL, FL⟩, and define $c_0 = a \cdot b$ as the inner product of a and b. $c_0$ is also represented in some fixed-point format $\langle \widetilde{IL}, \widetilde{FL} \rangle$. We split the computation of $c_0$ into the following two steps:

1. Compute $z = \sum_{i=1}^{d} a_i b_i$. The product of $a_i$ and $b_i$ produces a fixed-point number in the ⟨2·IL, 2·FL⟩ format. z can be thought of as a temporary fixed-point register with enough width (number of bits) to prevent saturation/overflow and avoid any loss of precision while accumulating the sum over all products $a_i b_i$. The requirement on the width of z is $\log_2 d + 2\,\mathrm{WL}$ in the worst case. Note that the worst case is extremely rare and occurs when all $a_i$ and $b_i$ are saturated to either the lower or the upper limit of ⟨IL, FL⟩.

2. Convert: $c_0 = \mathrm{Convert}(z, \langle \widetilde{IL}, \widetilde{FL} \rangle)$. This step invokes the Convert() function defined previously in eq. 1, resulting in either clipping the value in z to the limits set by $\langle \widetilde{IL}, \widetilde{FL} \rangle$ or rounding to $\widetilde{FL}$ bits of fractional precision using the specified rounding mode.

Adopting this two-step approach has several advantages. Firstly, it closely mimics the behavior of the hardware implementation of vector inner product using the hardware DSP² units in FPGAs. These DSP units accept 18-bit inputs and accumulate the results of the MACC operation in a 48-bit wide register. Secondly, by invoking the rounding mode only after the accumulation of all the sums, we significantly reduce the hardware overhead in implementing the stochastic rounding scheme. Lastly, the adoption of this approach allows us to efficiently simulate fixed-point computations using CPUs/GPUs and vendor-supplied BLAS³ libraries. For instance, matrix multiplication of two fixed-point matrices A and B can be simulated by first converting them into float matrices, calling the hardware-optimized SGEMM routine, and applying the Convert() function to each element of the resulting float matrix.

¹ We call ⟨IL₁, FL₁⟩ a higher precision representation than ⟨IL₂, FL₂⟩ iff FL₁ > FL₂.
² Digital Signal Processing units are hardware units in the FPGA fabric that implement fixed-point multiplication and addition.
³ Basic Linear Algebra Subprograms.


Figure 1. MNIST dataset using fully connected DNNs: Training error (a, c) and the test error (b, d) for training using fixed-point number representation and rounding mode set to either "Round to nearest" (top) or "Stochastic rounding" (bottom). The word length for fixed-point numbers WL is kept fixed at 16 bits and results are shown for three different fractional (integer) lengths: 8(8), 10(6), and 14(2) bits. Results using float are also shown for comparison.

4. Training Deep Networks

In this section, we present the results of our investigation into the effect of employing limited precision data representation during the training of deep neural networks. We consider both fully connected deep neural networks (DNN) as well as convolutional neural networks (CNN) and present results for the MNIST (LeCun & Cortes) and the CIFAR10 (Krizhevsky & Hinton, 2009) datasets. As a baseline for comparison, we first evaluate the network performance (in terms of the rate of reduction of both the training error and the error on the test set) using the conventional 32-bit floating-point arithmetic. Subsequently, we constrain the neural network parameters (weights $W^l$, biases $B^l$), as well as the other intermediate variables generated during the back-propagation algorithm (layer outputs $Y^l$, back-propagated error $\delta^l$, weight updates $\Delta W^l$, bias updates $\Delta B^l$), to be represented in the fixed-point format and train the network again starting from random initialization of the parameters. While training using fixed-point, the different model hyperparameters such as weight initialization, regularization parameters, learning rates, etc. are kept unchanged from the ones used during the baseline evaluation. The word length WL for the fixed-point format is set to 16 bits, i.e. the number of bits allocated to represent the integer and the fractional parts add up to 16.

This fairly restrictive choice of number representation has some important implications. From the perspective of neural network training, an aggressive reduction of the precision with which the parameter updates are computed and stored may result in the loss of gradient information if the updates are significantly smaller than the ε for the given fixed-point format. As a consequence, this may impede the progress of the gradient descent algorithm, or worse, introduce instabilities during the training procedure. Note that in the round-to-nearest scheme, any parameter update in the range $\left(-\frac{\epsilon}{2}, \frac{\epsilon}{2}\right)$ is always rounded to zero, as opposed to the stochastic rounding scheme, which maintains a non-zero probability that small parameter updates round to ±ε. Secondly, since the fixed-point format offers only a limited range, outputs of the ReLU activation function may get clipped to the upper limit set by ⟨IL, FL⟩. From a hardware perspective, the use of 16 bits for data storage (instead of float) corresponds to a factor 2 reduction in the amount of memory needed for training a given network.


Figure 2. MNIST dataset using CNNs: Training error (a) and the test error (b) for training using fixed-point number representation and rounding mode set to either "Round to nearest" or "Stochastic rounding". The word length for fixed-point numbers WL is kept fixed at 16 bits and results are shown for different fractional (integer) lengths for weights and weight updates: 12(4) and 14(2) bits. Layer outputs use the ⟨6, 10⟩ format in all cases. Results using float are also shown for comparison.

Moreover, the use of the same word length for all network variables carries with it the added advantage of simplifying the hardware implementation.

4.1. MNIST

4.1.1. Fully connected DNN

In the first set of experiments, we construct a fully connected neural network with 2 hidden layers, each containing 1000 units with the ReLU activation function, and train this network to recognize the handwritten digits from the MNIST dataset. This dataset comprises 60,000 training images and 10,000 test images; each image is 28 x 28 pixels containing a digit from 0 to 9. The pixel values are normalized to lie in the [0, 1] range. No other form of data pre-processing or augmentation is performed. The weights in each layer are initialized by sampling random values from N(0, 0.01) while the bias vectors are initialized to 0. The network is trained using minibatch stochastic gradient descent (SGD) with a minibatch size of 100 to minimize the cross-entropy objective function. The float baseline achieves a test error of 1.4%.

Next, we retrain the network using fixed-point computations and set WL to 16 bits. Figure 1 shows the results for the two rounding modes: round-to-nearest and stochastic rounding. In both cases, allocating 14 bits to the fractional part⁴ produces no noticeable degradation in either the convergence rate or the classification accuracy. A reduction in the precision below 14 bits begins to negatively impact the network's ability to learn when the round-to-nearest scheme is adopted. This is primarily because at reduced fractional precision, most of the parameter updates are rounded down to zero. In contrast, stochastic rounding preserves the gradient information, at least statistically, and the network is able to learn with as few as 8 bits of precision without any significant loss in performance. Note, however, that at a precision lower than 8 bits, even the stochastic rounding scheme is unable to fully prevent the loss of gradient information.

⁴ Using up 14 bits for the fractional part leaves only 2 bits (including the sign bit) for representing the integer portion of the number. This does not seem to adversely affect the network performance.

4.1.2. CNN

Using the MNIST dataset, we also evaluate a CNN with an architecture similar to LeNet-5 (LeCun et al., 1998). It comprises 2 convolutional layers with 5x5 filters and the ReLU activation function. The first layer has 8 feature maps while the second convolutional layer produces 16 feature maps. Each convolutional layer is followed by a pooling/subsampling layer. The pooling layers implement the max pooling function over non-overlapping pooling windows of size 2x2. The output of the second pooling layer feeds into a fully connected layer consisting of 128 ReLU neurons, which is then connected into a 10-way softmax output layer. For training this network, we adopt an exponentially decreasing learning rate, scaling it by a factor of 0.95 after every epoch of training. The learning rate for the first epoch is set to 0.1. Momentum (p = 0.9) is used to speed up SGD convergence. The weight decay parameter is set to 0.0005 for all layers.
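As an illustration of where the Convert() calls sit in training (a sketch under our own naming, reusing the convert() helper from Sec. 3.1; not the authors' code), one fixed-point SGD step for a single fully connected layer might look like:

```python
import numpy as np

def fixed_point_sgd_step(W, b, x, delta, lr, fmt=(2, 14)):
    """One SGD step for a fully connected ReLU layer, with every stored
    tensor constrained to the <IL, FL> format, mirroring the constraints
    placed on W^l, B^l, Y^l, dW^l and dB^l in the text.

    x:     (m, n_in)  quantized layer input
    delta: (m, n_out) back-propagated error for the pre-activation,
           computed elsewhere.
    Assumes the convert() helper sketched in Sec. 3.1 (stochastic rounding).
    """
    il, fl = fmt
    y = convert(np.maximum(x @ W + b, 0.0), il, fl)              # layer output Y^l
    dW = convert(lr / x.shape[0] * (x.T @ delta), il, fl)        # weight update
    dB = convert(lr / x.shape[0] * delta.sum(axis=0), il, fl)    # bias update
    W = convert(W - dW, il, fl)
    b = convert(b - dB, il, fl)
    return W, b, y
```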


Figure 3. CIFAR10 dataset using CNNs: Training error (a) and the test error (b) for training using fixed-point number representation and rounding mode set to either "Round to nearest" or "Stochastic rounding". The word length for fixed-point numbers WL is kept fixed at 16 bits and results are shown for different fractional (integer) lengths for weights and weight updates: 12(4) and 14(2) bits. The black arrows indicate the epoch after which the training is carried out using WL = 20 bits. Results using float are also shown for comparison.

When trained using float, the network achieves a test error of 0.77%. As was done previously for the DNNs, we retrain the network using fixed-point computations with WL set to 16 bits. However, in this case, saturating the output of the convolutional layers to a low integer value created some difficulty in jump-starting the training procedure. As a result, we increase the number of bits allocated for the integer part at the expense of reducing the precision and choose the ⟨6, 10⟩ format for representing the layer outputs. Figure 2 compiles the results obtained using the two different rounding modes. Unlike in the case of DNNs, when the round-to-nearest scheme is adopted during fixed-point computations, the training procedure fails to converge. When stochastic rounding is used, we achieve a test error of 0.83% and 0.90% for 14-bit and 12-bit precision, respectively – corresponding to only a slight degradation from the float baseline.

4.2. CIFAR10

To further test the validity of the stochastic rounding approach, we consider another commonly used image classification benchmark: CIFAR10. The training set consists of 50,000 RGB images of size 32x32 pixels. The images are divided into 10 classes, each containing 5,000 images. The test set has 10,000 images. We scale the image RGB values to the [0, 1] range and do not perform any other form of data pre-processing or augmentation. For this dataset, we construct a CNN with 3 convolutional layers, each followed by a subsampling/pooling layer. The convolutional layers consist of 64 5x5 filters and the subsampling layers implement the max pooling function over a window of size 3x3 using a stride of 2. The 3rd pooling layer connects to a 10-way softmax output layer. This architecture is similar to the one introduced in (Hinton et al., 2012) with the exception that it does not implement local response normalization or dropout layers.

The network training starts off with a learning rate of 0.01, which is reduced by a factor of 2 after 50, 75, and 100 epochs. Using 32-bit floating point numbers for training, this network configuration misclassifies approximately 24.6% of the images in the test set. This serves as the baseline for comparing the results obtained while training the network using fixed-point computations. Similar to the earlier experiments, we set the WL for fixed-point numbers to 16 and test the different rounding modes and fractional precisions. The layer outputs are represented in the ⟨4, 12⟩ format. As observed previously and as shown in Figure 3, training using fixed-point with the round-to-nearest scheme begins to collapse after only a few epochs. On the contrary, the stochastic rounding scheme appears to bestow upon the training procedure a significantly higher degree of stability. For 14 bits of fractional precision and the stochastic rounding scheme, the network's behavior is quite similar to that observed during the baseline evaluation and achieves a test error of 25.4%. If the precision is reduced further (to 12 bits), the convergence rate degrades as the learning proceeds and, after a point, SGD stops making progress. This is expected since at reduced precision, the parameter updates tend to become sparser (despite stochastic rounding) due to the perilous combination of smaller gradients and diminished learning rates. The network's performance suffers as a result and the minimum achievable test error saturates at 28.8%. Fortunately, this damage is reversible, as shown in Figure 3.
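The two schedules just described (exponential decay for the MNIST CNN, step decay for CIFAR10) are simple enough to state in a few lines; the helpers below are our own paraphrase of the text, not released code:

```python
def mnist_cnn_lr(epoch, base_lr=0.1, decay=0.95):
    # Exponential schedule: scale by 0.95 after every epoch (Sec. 4.1.2).
    return base_lr * decay ** epoch

def cifar10_lr(epoch, base_lr=0.01):
    # Step schedule: halve after epochs 50, 75 and 100 (Sec. 4.2).
    return base_lr * 0.5 ** sum(epoch >= e for e in (50, 75, 100))
```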


After training for 100 epochs using the ⟨4, 12⟩ format, we relax the constraint on WL slightly and increase WL by 4 bits to 20 bits. This increases the fractional precision to 16 bits (⟨4, 16⟩ format) and subsequent training results in a rapid improvement in the network's performance. After an additional 15-20 epochs of training using the higher precision representation, the test error approaches that obtained using float.

This result reveals a promising (and possibly more robust) strategy for deep neural network training in which the network is first trained using low-precision fixed-point arithmetic and stochastic rounding. At the point where learning shows stagnation, the network can be "fine-tuned" using only a few epochs of higher-precision fixed-point computations. Such a concept of employing mixed-precision computations has been explored previously in the context of floating point arithmetic (Baboulin et al., 2009), motivated largely by the fact that most modern processors achieve a factor 2 to 4 higher computational throughput for single-precision (32-bit) floating-point as compared with double-precision (64-bit) floating-point. Similar concepts, in conjunction with stochastic rounding, can be extended to perform mixed-precision fixed-point arithmetic.⁵

5. Hardware Prototyping

The execution time of the mini-batch stochastic gradient descent algorithm is dominated by a series of GEMM operations in the feed-forward, error back-propagation and weight update calculation steps⁶. As a result, an improvement in the computational throughput of the GEMM operation translates into an improvement in the training time. GPUs, offering a large number of parallel vector processors and high memory bandwidth, have therefore been very effective in accelerating these workloads.

In this section we describe an FPGA-based hardware accelerator for matrix-matrix multiplication. Our choice of FPGAs as the hardware substrate is motivated by two factors. Firstly, FPGAs enable fast hardware development times and significantly lower costs when compared to ASICs⁷. Secondly, modern FPGAs have a large number of hard-wired fixed-point DSP units that are well-suited to implementing the fixed-point arithmetic described in the earlier sections, and can potentially yield gains in performance and power efficiency. However, limited memory bandwidth must still be carefully managed through various design choices.

⁵ While preparing this paper, we became aware of a very recent work (Courbariaux et al., 2014) that shares our motivations but adopts an orthogonal approach. The authors propose the use of dynamic fixed-point (a hybrid of the fixed-point and the conventional floating-point arithmetic) for training deep neural networks. However, the hardware implications of this approach are not immediately obvious.
⁶ Convolution may also be rewritten as a GEMM operation.
⁷ Application Specific Integrated Circuits.

Figure 4. Block diagram of the FPGA-based fixed-point matrix multiplier.

Our prototype is implemented on an off-the-shelf FPGA card featuring a Xilinx Kintex-7 325T FPGA and 8 GB of DDR3 memory, and communicating with the host PC over a PCIe bus. This FPGA has 840 DSP multiply-accumulate units and almost 2 MB of on-chip block RAM. The data bandwidth between the off-chip DDR3 memory and the FPGA is 6.4 GB/s. The typical dimensions of the input matrices preclude storing entire matrices in on-chip RAM. Thus, these matrices are stored in the DDR3 memory and parts of the matrices are brought into the FPGA for performing the computations. The off-chip communication bandwidth limitation necessitates that we reuse the on-chip data to the highest extent possible to make the achievable throughput, measured in giga-operations/second (G-ops/s), compute-bound.

5.1. System Description

Figure 4 presents a block diagram of our fixed-point matrix multiplier. The DSP units within the FPGA are organized as a massively parallel 2-dimensional systolic array (SA) (Kung, 1982) of size n such that n² < 840. This forms the core of the multiplier and will be described in greater detail in the next subsection. Most of the block RAM on the FPGA is designated as the L2 cache, where a fraction of the input matrices is stored. The READ logic sends data requests to the DDR3 memory and organizes the incoming data into the L2 cache. The WRITE logic sends computed results back to the external memory. The L2-to-SA circuit moves relevant rows and columns from the L2 cache to the array.


The TOP controller coordinates the entire process. The FPGA also contains Xilinx-supplied IP blocks that interface to the DDR3 memory.

The operation sequence of the multiplier is as follows. Assume the first input matrix A has dimensions l x k and the second input matrix B has dimensions k x m. Initially, n columns of matrix B and pn rows of matrix A, where p is the largest integer we can choose based on on-chip memory capacity constraints, are brought into the FPGA to compute pn² elements of the result matrix. The next n columns of matrix B are then brought in and processed. This continues until all m columns of matrix B have been multiplied with the first pn rows of matrix A. This entire sequence is repeated l/pn times to process all rows of matrix A. Double buffering is employed to hide the latency of bringing new subsets of the matrices into the chip. This sequence of operation ensures that elements of matrix A are reused m times once brought into the FPGA, while those of matrix B are reused pn times. This reuse allows efficient use of the bandwidth between the FPGA and the DDR3 memory.

5.2. Systolic Array Architecture

Figure 5. Schematic of the systolic core for matrix multiplication.

Figure 5 shows the logical organization of the systolic array. Each node of the systolic array (DSP MACC) has a DSP unit that implements two operations (multiply and accumulate) in every clock cycle. Elements of the input matrices A and B brought in from the L2 cache are staged in local block RAM units configured as FIFO (First In, First Out) queues. Each FIFO contains elements from either a row of A or a column of B. In each clock cycle, one element is read out from the FIFO. Elements from earlier cycles are cascaded right (for A) or down (for B), and the corresponding partial products are accumulated at the DSP units. After accumulation of all partial products, output data is cascaded out to stochastic rounding units (DSP ROUND) that are also implemented with DSP units. Rounded results are stored in output FIFOs (one per column) before final readout to external memory. The throughput of the array depends on the number of DSPs available and the maximum operating frequency at which the system can be operated without timing errors. This is an example of a wavefront-type systolic array, where all connections are local, i.e. only between neighboring DSPs and edge FIFOs, which limits interconnect delays and improves the maximum operating frequency.

Figure 6. Wavefront systolic array operation.

In a wavefront array, as depicted in Figure 6, at the end of k cycles, where k corresponds to the inner dimension of the matrix multiplication, MACC unit "11" has accumulated all of its partial products. At this point, the accumulated result is transferred to a local register and the DSP is reset. This frees it up to receive data from the next matrix multiplication operation, even before other elements have completed. This achieves high throughput for the systolic array so long as the pipeline is fed with new incoming data. At the end of (k + 2n − 2) cycles, the matrix multiplication is complete, and data from the last DSP unit can be read out. Output paths from local registers to the edge of the array are also cascaded.

Word lengths of the result elements after MACC operations are much larger (typically 48 bits when using 7-series DSPs) than the word lengths of the inputs (typically 18 bits or less). Before transferring to output FIFOs, result elements must be trimmed through stochastic rounding of the least significant bits (LSBs) and truncation of excess MSB bits (after detection of overflow/underflow).


Both operations can be efficiently achieved using a single DSP unit per output. At each column, a linear feedback shift register (LFSR) is used to generate a random number whose width is equal to the number of LSB bits being rounded off. The DSP unit adds the random number to the incoming result and drops the rounded-off LSB bits. Pattern-detect capabilities built into the DSP are used to determine whether the excess MSB bits are identical (all "0"s or all "1"s). If not, an overflow/underflow condition is detected, and the result values are saturated to the max/min 2's complement values⁸. The result is then transferred to the output column FIFOs awaiting writeback to external memory. The overhead of stochastic rounding is thus the logic occupied by the DSP ROUND units, which in our case is 28 DSP units – corresponding to less than 4% overhead in hardware resources.

⁸ A more direct stochastic rounding approach is a multi-bit magnitude comparison of the result LSBs vs. a random number, followed by a conditional addition and examination of the excess MSBs. The approach in this section achieves the same result but removes the first full multi-bit comparison, enabling a compact implementation on a single DSP unit.

5.3. Results

For a 28x28 systolic array implemented on the Kintex-7 325T FPGA, Xilinx's Vivado synthesis and place-and-route tool estimated a maximum circuit operation frequency of 166 MHz and a power consumption of 7 W. This translates to a throughput of 260 G-ops/s at a power efficiency of 37 G-ops/s/W. This compares very favorably against the Intel i7-3720QM CPU, the NVIDIA GT650m and the GTX780 GPUs, all of which achieve power efficiencies in the range of 1-5 G-ops/s/W (Gokhale et al., 2014). Table 1 presents a summary of the utilization of various resources in the FPGA. Throughput numbers can benefit from migration to newer Xilinx FPGAs, such as the UltraScale series, which have a much higher number of DSP units and can potentially operate at higher frequencies.

Table 1. FPGA resource utilization.

    Resource     Usage     Available on XC7K325T     Utilization Ratio
    LUTs         62922     203800                    31%
    Flip-flops   146510    407600                    36%
    DSP          812       840                       97%
    Block RAM    334       445                       75%

6. Conclusion

In this paper, we embrace a top-down approach, exploiting the noise-tolerance of deep neural networks and their training algorithms to influence the design of low-level compute units. Specifically, the substitution of floating-point units with fixed-point arithmetic circuits comes with significant gains in energy efficiency and computational throughput, while potentially risking the neural network's performance. For low-precision fixed-point computations, where conventional rounding schemes fail, adopting stochastic rounding during deep neural network training delivers results nearly identical to 32-bit floating-point computations. Additionally, we implement a high-throughput, energy-efficient architecture for matrix multiplication that incorporates stochastic rounding with very little overhead. Extrapolating, we envision the emergence of hardware-software co-designed systems for large-scale machine learning based on relaxed, inexact models of computing running on non-deterministic components all across the stack, right down to low-level hardware circuitry.

References

Audhkhasi, Kartik, Osoba, Osonde, and Kosko, Bart. Noise benefits in backpropagation and deep bidirectional pre-training. In Neural Networks (IJCNN), The 2013 International Joint Conference on, pp. 1–8. IEEE, 2013.

Baboulin, Marc, Buttari, Alfredo, Dongarra, Jack, Kurzak, Jakub, Langou, Julie, Langou, Julien, Luszczek, Piotr, and Tomov, Stanimire. Accelerating scientific computations with mixed precision algorithms. Computer Physics Communications, 180(12):2526–2533, 2009.

Bishop, Chris M. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116, 1995.

Bottou, Léon and Bousquet, Olivier. The tradeoffs of large scale learning. In NIPS, volume 4, pp. 2, 2007.

Chen, Yunji, Luo, Tao, Liu, Shaoli, Zhang, Shijin, He, Liqiang, Wang, Jia, Li, Ling, Chen, Tianshi, Xu, Zhiwei, Sun, Ninghui, et al. DaDianNao: A machine-learning supercomputer. In Microarchitecture (MICRO), 2014 47th Annual IEEE/ACM International Symposium on, pp. 609–622. IEEE, 2014.


Chilimbi, Trishul, Suzue, Yutaka, Apacible, Johnson, and Kalyanaraman, Karthik. Project Adam: Building an efficient and scalable deep learning training system. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pp. 571–582, Broomfield, CO, October 2014.

Coates, Adam, Huval, Brody, Wang, Tao, Wu, David, Catanzaro, Bryan, and Ng, Andrew. Deep learning with COTS HPC systems. In Proceedings of The 30th International Conference on Machine Learning, pp. 1337–1345, 2013.

Courbariaux, Matthieu, Bengio, Yoshua, and David, Jean-Pierre. Low precision arithmetic for deep learning. arXiv preprint arXiv:1412.7024, 2014.

Dean, Jeffrey, Corrado, Greg, Monga, Rajat, Chen, Kai, Devin, Matthieu, Mao, Mark, Senior, Andrew, Tucker, Paul, Yang, Ke, Le, Quoc V, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pp. 1223–1231, 2012.

Farabet, Clément, Martini, Berin, Corda, Benoit, Akselrod, Polina, Culurciello, Eugenio, and LeCun, Yann. NeuFlow: A runtime reconfigurable dataflow processor for vision. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on, pp. 109–116. IEEE, 2011.

Gokhale, Vinayak, Jin, Jonghoon, Dundar, Aysegul, Martini, Berin, and Culurciello, Eugenio. A 240 G-ops/s mobile coprocessor for deep neural networks. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on, pp. 696–701. IEEE, 2014.

Hammerstrom, Dan. A VLSI architecture for high-performance, low-cost, on-chip learning. In Neural Networks, 1990 IJCNN International Joint Conference on, pp. 537–544. IEEE, 1990.

Hinton, Geoffrey E, Srivastava, Nitish, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

Höhfeld, Markus and Fahlman, Scott E. Probabilistic rounding in neural network learning with limited precision. Neurocomputing, 4(6):291–299, 1992.

Holt, JL and Hwang, Jenq-Neng. Finite precision error analysis of neural network hardware implementations. Computers, IEEE Transactions on, 42(3):281–290, 1993.

Iwata, Akira, Yoshida, Yukio, Matsuda, Satoshi, Sato, Yukimasa, and Suzumura, Nobuo. An artificial neural network accelerator using general purpose 24 bit floating point digital signal processors. In Neural Networks, 1989. IJCNN., International Joint Conference on, pp. 171–175. IEEE, 1989.

Kim, Jonghong, Hwang, Kyuyeon, and Sung, Wonyong. X1000 real-time phoneme recognition VLSI using feed-forward deep neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 7510–7514. IEEE, 2014.

Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Rep, 1(4):7, 2009.

Kung, H.T. Why systolic architectures? Computer, 15(1):37–46, Jan 1982. doi: 10.1109/MC.1982.1653825.

LeCun, Yann and Cortes, Corinna. The MNIST database of handwritten digits. URL http://yann.lecun.com/exdb/mnist/.

LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Merolla, Paul A, Arthur, John V, Alvarez-Icaza, Rodrigo, Cassidy, Andrew S, Sawada, Jun, Akopyan, Filipp, Jackson, Bryan L, Imam, Nabil, Guo, Chen, Nakamura, Yutaka, et al. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science, 345(6197):668–673, 2014.

Murray, Alan F and Edwards, Peter J. Enhanced MLP performance and fault tolerance resulting from synaptic weight noise during training. Neural Networks, IEEE Transactions on, 5(5):792–802, 1994.

Recht, Benjamin, Re, Christopher, Wright, Stephen, and Niu, Feng. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 693–701, 2011.

Vanhoucke, Vincent, Senior, Andrew, and Mao, Mark Z. Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, 2011.

Wu, Ren, Yan, Shengen, Shan, Yi, Dang, Qingqing, and Sun, Gang. Deep Image: Scaling up image recognition. arXiv preprint arXiv:1501.02876, 2015.

Batch Normalization: Accelerating Deep Network Training by
Reducing Internal Covariate Shift
Sergey Ioffe, Google Inc., sioffe@google.com
Christian Szegedy, Google Inc., szegedy@google.com

arXiv:1502.03167v3 [cs.LG] 2 Mar 2015

Abstract

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.

1 Introduction

Deep learning has dramatically advanced the state of the art in vision, speech, and many other areas. Stochastic gradient descent (SGD) has proved to be an effective way of training deep networks, and SGD variants such as momentum (Sutskever et al., 2013) and Adagrad (Duchi et al., 2011) have been used to achieve state of the art performance. SGD optimizes the parameters Θ of the network, so as to minimize the loss
$$\Theta = \arg\min_{\Theta} \frac{1}{N} \sum_{i=1}^{N} \ell(x_i, \Theta)$$
where $x_{1 \ldots N}$ is the training data set. With SGD, the training proceeds in steps, and at each step we consider a mini-batch $x_{1 \ldots m}$ of size m. The mini-batch is used to approximate the gradient of the loss function with respect to the parameters, by computing
$$\frac{1}{m} \sum_{i=1}^{m} \frac{\partial \ell(x_i, \Theta)}{\partial \Theta}.$$

Using mini-batches of examples, as opposed to one example at a time, is helpful in several ways. First, the gradient of the loss over a mini-batch is an estimate of the gradient over the training set, whose quality improves as the batch size increases. Second, computation over a batch can be much more efficient than m computations for individual examples, due to the parallelism afforded by modern computing platforms.

While stochastic gradient is simple and effective, it requires careful tuning of the model hyper-parameters, specifically the learning rate used in optimization, as well as the initial values for the model parameters. The training is complicated by the fact that the inputs to each layer are affected by the parameters of all preceding layers – so that small changes to the network parameters amplify as the network becomes deeper.

The change in the distributions of layers' inputs presents a problem because the layers need to continuously adapt to the new distribution. When the input distribution to a learning system changes, it is said to experience covariate shift (Shimodaira, 2000). This is typically handled via domain adaptation (Jiang, 2008). However, the notion of covariate shift can be extended beyond the learning system as a whole, to apply to its parts, such as a sub-network or a layer. Consider a network computing
$$\ell = F_2(F_1(u, \Theta_1), \Theta_2)$$
where $F_1$ and $F_2$ are arbitrary transformations, and the parameters $\Theta_1, \Theta_2$ are to be learned so as to minimize the loss ℓ. Learning $\Theta_2$ can be viewed as if the inputs $x = F_1(u, \Theta_1)$ are fed into the sub-network
$$\ell = F_2(x, \Theta_2).$$
For example, a gradient descent step
$$\Theta_2 \leftarrow \Theta_2 - \frac{\alpha}{m} \sum_{i=1}^{m} \frac{\partial F_2(x_i, \Theta_2)}{\partial \Theta_2}$$
(for batch size m and learning rate α) is exactly equivalent to that for a stand-alone network $F_2$ with input x. Therefore, the input distribution properties that make training more efficient – such as having the same distribution between the training and test data – apply to training the sub-network as well. As such, it is advantageous for the distribution of x to remain fixed over time.

Then, $\Theta_2$ does not have to readjust to compensate for the change in the distribution of x.

A fixed distribution of inputs to a sub-network would have positive consequences for the layers outside the sub-network, as well. Consider a layer with a sigmoid activation function $z = g(Wu + b)$ where u is the layer input, the weight matrix W and bias vector b are the layer parameters to be learned, and $g(x) = \frac{1}{1+\exp(-x)}$. As $|x|$ increases, $g'(x)$ tends to zero. This means that for all dimensions of $x = Wu + b$ except those with small absolute values, the gradient flowing down to u will vanish and the model will train slowly. However, since x is affected by W, b and the parameters of all the layers below, changes to those parameters during training will likely move many dimensions of x into the saturated regime of the nonlinearity and slow down the convergence. This effect is amplified as the network depth increases. In practice, the saturation problem and the resulting vanishing gradients are usually addressed by using Rectified Linear Units (Nair & Hinton, 2010), $\mathrm{ReLU}(x) = \max(x, 0)$, careful initialization (Bengio & Glorot, 2010; Saxe et al., 2013), and small learning rates. If, however, we could ensure that the distribution of nonlinearity inputs remains more stable as the network trains, then the optimizer would be less likely to get stuck in the saturated regime, and the training would accelerate.

We refer to the change in the distributions of internal nodes of a deep network, in the course of training, as Internal Covariate Shift. Eliminating it offers a promise of faster training. We propose a new mechanism, which we call Batch Normalization, that takes a step towards reducing internal covariate shift, and in doing so dramatically accelerates the training of deep neural nets. It accomplishes this via a normalization step that fixes the means and variances of layer inputs. Batch Normalization also has a beneficial effect on the gradient flow through the network, by reducing the dependence of gradients on the scale of the parameters or of their initial values. This allows us to use much higher learning rates without the risk of divergence. Furthermore, batch normalization regularizes the model and reduces the need for Dropout (Srivastava et al., 2014). Finally, Batch Normalization makes it possible to use saturating nonlinearities by preventing the network from getting stuck in the saturated modes.

In Sec. 4.2, we apply Batch Normalization to the best-performing ImageNet classification network, and show that we can match its performance using only 7% of the training steps, and can further exceed its accuracy by a substantial margin. Using an ensemble of such networks trained with Batch Normalization, we achieve a top-5 error rate that improves upon the best known results on ImageNet classification.

2 Towards Reducing Internal Covariate Shift

We define Internal Covariate Shift as the change in the distribution of network activations due to the change in network parameters during training. To improve the training, we seek to reduce the internal covariate shift. By fixing the distribution of the layer inputs x as the training progresses, we expect to improve the training speed. It has been long known (LeCun et al., 1998b; Wiesler & Ney, 2011) that the network training converges faster if its inputs are whitened – i.e., linearly transformed to have zero means and unit variances, and decorrelated. As each layer observes the inputs produced by the layers below, it would be advantageous to achieve the same whitening of the inputs of each layer. By whitening the inputs to each layer, we would take a step towards achieving the fixed distributions of inputs that would remove the ill effects of the internal covariate shift.

We could consider whitening activations at every training step or at some interval, either by modifying the network directly or by changing the parameters of the optimization algorithm to depend on the network activation values (Wiesler et al., 2014; Raiko et al., 2012; Povey et al., 2014; Desjardins & Kavukcuoglu). However, if these modifications are interspersed with the optimization steps, then the gradient descent step may attempt to update the parameters in a way that requires the normalization to be updated, which reduces the effect of the gradient step. For example, consider a layer with the input u that adds the learned bias b, and normalizes the result by subtracting the mean of the activation computed over the training data: $\hat x = x - \mathrm{E}[x]$ where $x = u + b$, $\mathcal{X} = \{x_{1 \ldots N}\}$ is the set of values of x over the training set, and $\mathrm{E}[x] = \frac{1}{N}\sum_{i=1}^{N} x_i$. If a gradient descent step ignores the dependence of $\mathrm{E}[x]$ on b, then it will update $b \leftarrow b + \Delta b$, where $\Delta b \propto -\partial\ell/\partial\hat x$. Then $u + (b+\Delta b) - \mathrm{E}[u + (b+\Delta b)] = u + b - \mathrm{E}[u+b]$. Thus, the combination of the update to b and the subsequent change in normalization led to no change in the output of the layer nor, consequently, the loss. As the training continues, b will grow indefinitely while the loss remains fixed. This problem can get worse if the normalization not only centers but also scales the activations. We have observed this empirically in initial experiments, where the model blows up when the normalization parameters are computed outside the gradient descent step.

The issue with the above approach is that the gradient descent optimization does not take into account the fact that the normalization takes place. To address this issue, we would like to ensure that, for any parameter values, the network always produces activations with the desired distribution. Doing so would allow the gradient of the loss with respect to the model parameters to account for the normalization, and for its dependence on the model parameters Θ.

2
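This failure mode is easy to reproduce numerically. The following minimal numpy sketch (our illustration, not code from the paper; the data and step size are arbitrary) shows that once the mean is subtracted outside the gradient computation, any update to the bias b leaves the layer output, and hence the loss, unchanged, so b can drift without bound:

    import numpy as np

    # Toy reproduction of the bias-growth example: a layer computes x = u + b
    # and is then centered by subtracting the mean of x over the data.
    rng = np.random.default_rng(0)
    u = rng.normal(size=1000)

    def centered_output(b):
        x = u + b
        return x - x.mean()   # the centering cancels any constant shift

    b = 0.0
    for step in range(5):
        b += 0.1  # a gradient step that ignores the dependence of E[x] on b
        assert np.allclose(centered_output(b), centered_output(0.0))
        # b keeps growing, yet the output (and the loss) never changes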
The issue with the above approach is that the gradient descent optimization does not take into account the fact that the normalization takes place. To address this issue, we would like to ensure that, for any parameter values, the network always produces activations with the desired distribution. Doing so would allow the gradient of the loss with respect to the model parameters to account for the normalization, and for its dependence on the model parameters Θ. Let again x be a layer input, treated as a vector, and X be the set of these inputs over the training data set. The normalization can then be written as a transformation

    x̂ = Norm(x, X)

which depends not only on the given training example x but on all examples X – each of which depends on Θ if x is generated by another layer. For backpropagation, we would need to compute the Jacobians

    ∂Norm(x, X)/∂x  and  ∂Norm(x, X)/∂X;

ignoring the latter term would lead to the explosion described above. Within this framework, whitening the layer inputs is expensive, as it requires computing the covariance matrix Cov[x] = E_{x∈X}[xxᵀ] − E[x]E[x]ᵀ and its inverse square root, to produce the whitened activations Cov[x]^(−1/2)(x − E[x]), as well as the derivatives of these transforms for backpropagation. This motivates us to seek an alternative that performs input normalization in a way that is differentiable and does not require the analysis of the entire training set after every parameter update.

Some of the previous approaches (e.g. (Lyu & Simoncelli, 2008)) use statistics computed over a single training example, or, in the case of image networks, over different feature maps at a given location. However, this changes the representation ability of a network by discarding the absolute scale of activations. We want to preserve the information in the network, by normalizing the activations in a training example relative to the statistics of the entire training data.

3 Normalization via Mini-Batch Statistics

Since the full whitening of each layer's inputs is costly and not everywhere differentiable, we make two necessary simplifications. The first is that instead of whitening the features in layer inputs and outputs jointly, we will normalize each scalar feature independently, by making it have a mean of zero and a variance of 1. For a layer with d-dimensional input x = (x^(1) ... x^(d)), we will normalize each dimension

    x̂^(k) = (x^(k) − E[x^(k)]) / √(Var[x^(k)])

where the expectation and variance are computed over the training data set. As shown in (LeCun et al., 1998b), such normalization speeds up convergence, even when the features are not decorrelated.

Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity. To address this, we make sure that the transformation inserted in the network can represent the identity transform. To accomplish this, we introduce, for each activation x^(k), a pair of parameters γ^(k), β^(k), which scale and shift the normalized value:

    y^(k) = γ^(k) x̂^(k) + β^(k).

These parameters are learned along with the original model parameters, and restore the representation power of the network. Indeed, by setting γ^(k) = √(Var[x^(k)]) and β^(k) = E[x^(k)], we could recover the original activations, if that were the optimal thing to do.

In the batch setting where each training step is based on the entire training set, we would use the whole set to normalize activations. However, this is impractical when using stochastic optimization. Therefore, we make the second simplification: since we use mini-batches in stochastic gradient training, each mini-batch produces estimates of the mean and variance of each activation. This way, the statistics used for normalization can fully participate in the gradient backpropagation. Note that the use of mini-batches is enabled by the computation of per-dimension variances rather than joint covariances; in the joint case, regularization would be required since the mini-batch size is likely to be smaller than the number of activations being whitened, resulting in singular covariance matrices.

Consider a mini-batch B of size m. Since the normalization is applied to each activation independently, let us focus on a particular activation x^(k) and omit k for clarity. We have m values of this activation in the mini-batch,

    B = {x1...m}.

Let the normalized values be x̂1...m, and their linear transformations be y1...m. We refer to the transform

    BN_γ,β : x1...m → y1...m

as the Batch Normalizing Transform. We present the BN Transform in Algorithm 1. In the algorithm, ε is a constant added to the mini-batch variance for numerical stability.

Input: Values of x over a mini-batch: B = {x1...m};
       Parameters to be learned: γ, β
Output: {yᵢ = BN_γ,β(xᵢ)}

    μ_B ← (1/m) Σᵢ xᵢ                      // mini-batch mean
    σ²_B ← (1/m) Σᵢ (xᵢ − μ_B)²            // mini-batch variance
    x̂ᵢ ← (xᵢ − μ_B) / √(σ²_B + ε)          // normalize
    yᵢ ← γ x̂ᵢ + β ≡ BN_γ,β(xᵢ)             // scale and shift

Algorithm 1: Batch Normalizing Transform, applied to activation x over a mini-batch.
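For concreteness, here is a minimal numpy sketch of the transform in Algorithm 1, for a mini-batch of shape (m, d) (our illustration; the function name and interface are not from the paper):

    import numpy as np

    def bn_forward(x, gamma, beta, eps=1e-5):
        """Algorithm 1 over a mini-batch x of shape (m, d), per dimension."""
        mu = x.mean(axis=0)                      # mini-batch mean
        var = x.var(axis=0)                      # mini-batch variance
        x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
        y = gamma * x_hat + beta                 # scale and shift
        return y, (x_hat, mu, var)

The cached values (x_hat, mu, var) are exactly what the backward pass described below needs.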
The BN transform can be added to a network to manipulate any activation. In the notation y = BN_γ,β(x), we indicate that the parameters γ and β are to be learned, but it should be noted that the BN transform does not independently process the activation in each training example. Rather, BN_γ,β(x) depends both on the training example and on the other examples in the mini-batch. The scaled and shifted values y are passed to other network layers. The normalized activations x̂ are internal to our transformation, but their presence is crucial. The distribution of values of any x̂ has the expected value of 0 and the variance of 1, as long as the elements of each mini-batch are sampled from the same distribution, and if we neglect ε. This can be seen by observing that Σᵢ x̂ᵢ = 0 and (1/m) Σᵢ x̂ᵢ² = 1, and taking expectations. Each normalized activation x̂^(k) can be viewed as an input to a sub-network composed of the linear transform y^(k) = γ^(k) x̂^(k) + β^(k), followed by the other processing done by the original network. These sub-network inputs all have fixed means and variances, and although the joint distribution of these normalized x̂^(k) can change over the course of training, we expect that the introduction of normalized inputs accelerates the training of the sub-network and, consequently, of the network as a whole.

During training we need to backpropagate the gradient of the loss ℓ through this transformation, as well as compute the gradients with respect to the parameters of the BN transform. We use the chain rule, as follows (before simplification):

    ∂ℓ/∂x̂ᵢ = ∂ℓ/∂yᵢ · γ
    ∂ℓ/∂σ²_B = Σᵢ ∂ℓ/∂x̂ᵢ · (xᵢ − μ_B) · (−1/2)(σ²_B + ε)^(−3/2)
    ∂ℓ/∂μ_B = (Σᵢ ∂ℓ/∂x̂ᵢ · (−1)/√(σ²_B + ε)) + ∂ℓ/∂σ²_B · (Σᵢ −2(xᵢ − μ_B))/m
    ∂ℓ/∂xᵢ = ∂ℓ/∂x̂ᵢ · 1/√(σ²_B + ε) + ∂ℓ/∂σ²_B · 2(xᵢ − μ_B)/m + ∂ℓ/∂μ_B · (1/m)
    ∂ℓ/∂γ = Σᵢ ∂ℓ/∂yᵢ · x̂ᵢ
    ∂ℓ/∂β = Σᵢ ∂ℓ/∂yᵢ

Thus, the BN transform is a differentiable transformation that introduces normalized activations into the network. This ensures that as the model is training, layers can continue learning on input distributions that exhibit less internal covariate shift, thus accelerating the training. Furthermore, the learned affine transform applied to these normalized activations allows the BN transform to represent the identity transformation and preserves the network capacity.
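The derivatives above translate directly into code. A numpy sketch (ours, matching the equations term by term; not an official implementation):

    import numpy as np

    def bn_backward(dy, x, x_hat, mu, var, gamma, eps=1e-5):
        """Backpropagation through the BN transform, one term per equation."""
        m = x.shape[0]
        dx_hat = dy * gamma
        dvar = np.sum(dx_hat * (x - mu), axis=0) * -0.5 * (var + eps) ** -1.5
        dmu = (np.sum(dx_hat * -1.0 / np.sqrt(var + eps), axis=0)
               + dvar * np.sum(-2.0 * (x - mu), axis=0) / m)
        dx = (dx_hat / np.sqrt(var + eps)
              + dvar * 2.0 * (x - mu) / m
              + dmu / m)
        dgamma = np.sum(dy * x_hat, axis=0)
        dbeta = np.sum(dy, axis=0)
        return dx, dgamma, dbeta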
3.1 Training and Inference with Batch-Normalized Networks

To Batch-Normalize a network, we specify a subset of activations and insert the BN transform for each of them, according to Alg. 1. Any layer that previously received x as the input now receives BN(x). A model employing Batch Normalization can be trained using batch gradient descent, or Stochastic Gradient Descent with a mini-batch size m > 1, or with any of its variants such as Adagrad (Duchi et al., 2011). The normalization of activations that depends on the mini-batch allows efficient training, but is neither necessary nor desirable during inference; we want the output to depend only on the input, deterministically. For this, once the network has been trained, we use the normalization

    x̂ = (x − E[x]) / √(Var[x] + ε)

using the population, rather than mini-batch, statistics. Neglecting ε, these normalized activations have the same mean 0 and variance 1 as during training. We use the unbiased variance estimate Var[x] = (m/(m−1)) · E_B[σ²_B], where the expectation is over training mini-batches of size m and σ²_B are their sample variances. Using moving averages instead, we can track the accuracy of a model as it trains. Since the means and variances are fixed during inference, the normalization is simply a linear transform applied to each activation. It may further be composed with the scaling by γ and shift by β, to yield a single linear transform that replaces BN(x). Algorithm 2 summarizes the procedure for training batch-normalized networks.

Input: Network N with trainable parameters Θ;
       subset of activations {x^(k)}, k = 1...K
Output: Batch-normalized network for inference, N^inf_BN

 1: N^tr_BN ← N                             // Training BN network
 2: for k = 1...K do
 3:   Add the transformation y^(k) = BN_γ^(k),β^(k)(x^(k)) to N^tr_BN (Alg. 1)
 4:   Modify each layer in N^tr_BN with input x^(k) to take y^(k) instead
 5: end for
 6: Train N^tr_BN to optimize the parameters Θ ∪ {γ^(k), β^(k)}, k = 1...K
 7: N^inf_BN ← N^tr_BN                      // Inference BN network with frozen parameters
 8: for k = 1...K do
 9:   // For clarity, x ≡ x^(k), γ ≡ γ^(k), μ_B ≡ μ_B^(k), etc.
10:   Process multiple training mini-batches B, each of size m, and average over them:
          E[x] ← E_B[μ_B]
          Var[x] ← (m/(m−1)) E_B[σ²_B]
11:   In N^inf_BN, replace the transform y = BN_γ,β(x) with
          y = γ/√(Var[x] + ε) · x + (β − γ E[x]/√(Var[x] + ε))
12: end for

Algorithm 2: Training a Batch-Normalized Network.
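Step 11 of Alg. 2, collapsing inference-time BN into one linear transform, can be sketched as follows (our illustration; in deployed models the resulting constants are commonly folded further into the preceding layer's weights and bias, though the paper stops at the per-activation linear transform):

    import numpy as np

    def fold_bn(gamma, beta, pop_mean, pop_var, eps=1e-5):
        """With frozen population statistics, BN(x) = scale * x + shift."""
        scale = gamma / np.sqrt(pop_var + eps)
        shift = beta - scale * pop_mean
        return scale, shift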
3.2 Batch-Normalized Convolutional Networks

Batch Normalization can be applied to any set of activations in the network. Here, we focus on transforms that consist of an affine transformation followed by an element-wise nonlinearity:

    z = g(Wu + b)

where W and b are learned parameters of the model, and g(·) is the nonlinearity such as sigmoid or ReLU. This formulation covers both fully-connected and convolutional layers. We add the BN transform immediately before the nonlinearity, by normalizing x = Wu + b. We could have also normalized the layer inputs u, but since u is likely the output of another nonlinearity, the shape of its distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift. In contrast, Wu + b is more likely to have a symmetric, non-sparse distribution, that is "more Gaussian" (Hyvärinen & Oja, 2000); normalizing it is likely to produce activations with a stable distribution.

Note that, since we normalize Wu + b, the bias b can be ignored since its effect will be canceled by the subsequent mean subtraction (the role of the bias is subsumed by β in Alg. 1). Thus, z = g(Wu + b) is replaced with

    z = g(BN(Wu))

where the BN transform is applied independently to each dimension of x = Wu, with a separate pair of learned parameters γ^(k), β^(k) per dimension.

For convolutional layers, we additionally want the normalization to obey the convolutional property – so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, we jointly normalize all the activations in a mini-batch, over all locations. In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations – so for a mini-batch of size m and feature maps of size p × q, we use the effective mini-batch of size m′ = |B| = m · pq. We learn a pair of parameters γ^(k) and β^(k) per feature map, rather than per activation. Alg. 2 is modified similarly, so that during inference the BN transform applies the same linear transformation to each activation in a given feature map.
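A numpy sketch of this convolutional variant (our illustration, not from the paper): statistics are shared across the mini-batch and all spatial locations, with one (γ, β) pair per feature map:

    import numpy as np

    def bn_conv_forward(x, gamma, beta, eps=1e-5):
        """x has shape (m, c, p, q); the effective mini-batch size is m*p*q."""
        mu = x.mean(axis=(0, 2, 3), keepdims=True)
        var = x.var(axis=(0, 2, 3), keepdims=True)
        x_hat = (x - mu) / np.sqrt(var + eps)
        g = gamma.reshape(1, -1, 1, 1)   # one learned pair per feature map
        b = beta.reshape(1, -1, 1, 1)
        return g * x_hat + b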
3.3 Batch Normalization enables higher learning rates

In traditional deep networks, a too-high learning rate may result in gradients that explode or vanish, as well as in getting stuck in poor local minima. Batch Normalization helps address these issues. By normalizing activations throughout the network, it prevents small changes to the parameters from amplifying into larger and suboptimal changes in activations and gradients; for instance, it prevents the training from getting stuck in the saturated regimes of nonlinearities.

Batch Normalization also makes training more resilient to the parameter scale. Normally, large learning rates may increase the scale of layer parameters, which then amplify the gradient during backpropagation and lead to the model explosion. However, with Batch Normalization, backpropagation through a layer is unaffected by the scale of its parameters. Indeed, for a scalar a,

    BN(Wu) = BN((aW)u)

and we can show that

    ∂BN((aW)u)/∂u = ∂BN(Wu)/∂u
    ∂BN((aW)u)/∂(aW) = (1/a) · ∂BN(Wu)/∂W

The scale does not affect the layer Jacobian nor, consequently, the gradient propagation. Moreover, larger weights lead to smaller gradients, and Batch Normalization will stabilize the parameter growth.

We further conjecture that Batch Normalization may lead the layer Jacobians to have singular values close to 1, which is known to be beneficial for training (Saxe et al., 2013). Consider two consecutive layers with normalized inputs, and the transformation between these normalized vectors: ẑ = F(x̂). If we assume that x̂ and ẑ are Gaussian and uncorrelated, and that F(x̂) ≈ Jx̂ is a linear transformation for the given model parameters, then both x̂ and ẑ have unit covariances, and I = Cov[ẑ] = J Cov[x̂] Jᵀ = JJᵀ. Thus, JJᵀ = I, and so all singular values of J are equal to 1, which preserves the gradient magnitudes during backpropagation. In reality, the transformation is not linear, and the normalized values are not guaranteed to be Gaussian nor independent, but we nevertheless expect Batch Normalization to help make gradient propagation better behaved. The precise effect of Batch Normalization on gradient propagation remains an area of further study.

3.4 Batch Normalization regularizes the model

When training with Batch Normalization, a training example is seen in conjunction with other examples in the mini-batch, and the training network no longer produces deterministic values for a given training example. In our experiments, we found this effect to be advantageous to the generalization of the network. Whereas Dropout (Srivastava et al., 2014) is typically used to reduce overfitting, in a batch-normalized network we found that it can be either removed or reduced in strength.

4 Experiments

4.1 Activations over time

To verify the effects of internal covariate shift on training, and the ability of Batch Normalization to combat it, we considered the problem of predicting the digit class on the MNIST dataset (LeCun et al., 1998a). We used a very simple network, with a 28x28 binary image as input, and 3 fully-connected hidden layers with 100 activations each. Each hidden layer computes y = g(Wu + b) with sigmoid nonlinearity, and the weights W initialized to small random Gaussian values. The last hidden layer is followed by a fully-connected layer with 10 activations (one per class) and cross-entropy loss. We trained the network for 50000 steps, with 60 examples per mini-batch. We added Batch Normalization to each hidden layer of the network, as in Sec. 3.1. We were interested in the comparison between the baseline and batch-normalized networks, rather than in achieving state-of-the-art performance on MNIST (which the described architecture does not).

Figure 1(a) shows the fraction of correct predictions by the two networks on held-out test data, as training progresses. The batch-normalized network enjoys the higher test accuracy. To investigate why, we studied inputs to the sigmoid, in the original network N and the batch-normalized network N^tr_BN (Alg. 2) over the course of training. In Fig. 1(b,c) we show, for one typical activation from the last hidden layer of each network, how its distribution evolves. The distributions in the original network change significantly over time, both in their mean and variance, which complicates the training of the subsequent layers. In contrast, the distributions in the batch-normalized network are much more stable as training progresses, which aids the training.

[Figure 1: (a) The test accuracy of the MNIST network trained with and without Batch Normalization, vs. the number of training steps. Batch Normalization helps the network train faster and achieve higher accuracy. (b, c) The evolution of input distributions to a typical sigmoid, over the course of training, shown as {15, 50, 85}th percentiles. Batch Normalization makes the distribution more stable and reduces the internal covariate shift.]

4.2 ImageNet classification

We applied Batch Normalization to a new variant of the Inception network (Szegedy et al., 2014), trained on the ImageNet classification task (Russakovsky et al., 2014). The network has a large number of convolutional and pooling layers, with a softmax layer to predict the image class, out of 1000 possibilities. Convolutional layers use ReLU as the nonlinearity. The main difference to the network described in (Szegedy et al., 2014) is that the 5 × 5 convolutional layers are replaced by two consecutive layers of 3 × 3 convolutions with up to 128 filters. The network contains 13.6 · 10⁶ parameters, and, other than the top softmax layer, has no fully-connected layers. More details are given in the Appendix. We refer to this model as Inception in the rest of the text. The model was trained using a version of Stochastic Gradient Descent with momentum (Sutskever et al., 2013), using the mini-batch size of 32. The training was performed using a large-scale, distributed architecture (similar to (Dean et al., 2012)). All networks are evaluated as training progresses by computing the validation accuracy @1, i.e. the probability of predicting the correct label out of 1000 possibilities, on a held-out set, using a single crop per image.

In our experiments, we evaluated several modifications of Inception with Batch Normalization. In all cases, Batch Normalization was applied to the input of each nonlinearity, in a convolutional way, as described in section 3.2, while keeping the rest of the architecture constant.

4.2.1 Accelerating BN Networks

Simply adding Batch Normalization to a network does not take full advantage of our method. To do so, we further changed the network and its training parameters, as follows:

Increase learning rate. In a batch-normalized model, we have been able to achieve a training speedup from higher learning rates, with no ill side effects (Sec. 3.3).

Remove Dropout. As described in Sec. 3.4, Batch Normalization fulfills some of the same goals as Dropout. Removing Dropout from Modified BN-Inception speeds up training, without increasing overfitting.

Reduce the L2 weight regularization. While in Inception an L2 loss on the model parameters controls overfitting, in Modified BN-Inception the weight of this loss is reduced by a factor of 5. We find that this improves the accuracy on the held-out validation data.

Accelerate the learning rate decay. In training Inception, the learning rate was decayed exponentially. Because our network trains faster than Inception, we lower the learning rate 6 times faster.

Remove Local Response Normalization. While Inception and other networks (Srivastava et al., 2014) benefit from it, we found that with Batch Normalization it is not necessary.

Shuffle training examples more thoroughly. We enabled within-shard shuffling of the training data, which prevents the same examples from always appearing in a mini-batch together. This led to about 1% improvement in the validation accuracy, which is consistent with the view of Batch Normalization as a regularizer (Sec. 3.4): the randomization inherent in our method should be most beneficial when it affects an example differently each time it is seen.

Reduce the photometric distortions. Because batch-normalized networks train faster and observe each training example fewer times, we let the trainer focus on more "real" images by distorting them less.

4.2.2 Single-Network Classification

We evaluated the following networks, all trained on the LSVRC2012 training data, and tested on the validation data:

Inception: the network described at the beginning of Section 4.2, trained with the initial learning rate of 0.0015.

BN-Baseline: Same as Inception with Batch Normalization before each nonlinearity.

BN-x5: Inception with Batch Normalization and the modifications in Sec. 4.2.1. The initial learning rate was increased by a factor of 5, to 0.0075. The same learning rate increase with the original Inception caused the model parameters to reach machine infinity.

BN-x30: Like BN-x5, but with the initial learning rate 0.045 (30 times that of Inception).

BN-x5-Sigmoid: Like BN-x5, but with sigmoid nonlinearity g(x) = 1/(1 + exp(−x)) instead of ReLU. We also attempted to train the original Inception with sigmoid, but the model remained at the accuracy equivalent to chance.

In Figure 2, we show the validation accuracy of the networks, as a function of the number of training steps. Inception reached the accuracy of 72.2% after 31 · 10⁶ training steps. Figure 3 shows, for each network, the number of training steps required to reach the same 72.2% accuracy, as well as the maximum validation accuracy reached by the network and the number of steps to reach it.

[Figure 2: Single crop validation accuracy of Inception and its batch-normalized variants, vs. the number of training steps.]

Model            Steps to 72.2%   Max accuracy
Inception        31.0 · 10⁶       72.2%
BN-Baseline      13.3 · 10⁶       72.7%
BN-x5             2.1 · 10⁶       73.0%
BN-x30            2.7 · 10⁶       74.8%
BN-x5-Sigmoid         -           69.8%

Figure 3: For Inception and the batch-normalized variants, the number of training steps required to reach the maximum accuracy of Inception (72.2%), and the maximum accuracy achieved by the network.

By only using Batch Normalization (BN-Baseline), we match the accuracy of Inception in less than half the number of training steps. By applying the modifications in Sec. 4.2.1, we significantly increase the training speed of the network. BN-x5 needs 14 times fewer steps than Inception to reach the 72.2% accuracy. Interestingly, increasing the learning rate further (BN-x30) causes the model to train somewhat slower initially, but allows it to reach a higher final accuracy. It reaches 74.8% after 6 · 10⁶ steps, i.e. 5 times fewer steps than required by Inception to reach 72.2%.

We also verified that the reduction in internal covariate shift allows deep networks with Batch Normalization to be trained when sigmoid is used as the nonlinearity, despite the well-known difficulty of training such networks. Indeed, BN-x5-Sigmoid achieves the accuracy of 69.8%. Without Batch Normalization, Inception with sigmoid never achieves better than 1/1000 accuracy.

4.2.3 Ensemble Classification

The current reported best results on the ImageNet Large Scale Visual Recognition Competition are reached by the Deep Image ensemble of traditional models (Wu et al., 2015) and the ensemble model of (He et al., 2015). The latter reports the top-5 error of 4.94%, as evaluated by the ILSVRC server. Here we report a top-5 validation error of 4.9%, and test error of 4.82% (according to the ILSVRC server). This improves upon the previous best result, and exceeds the estimated accuracy of human raters according to (Russakovsky et al., 2014).

For our ensemble, we used 6 networks. Each was based on BN-x30, modified via some of the following: increased initial weights in the convolutional layers; using Dropout (with the Dropout probability of 5% or 10%, vs. 40% for the original Inception); and using non-convolutional, per-activation Batch Normalization in the last hidden layers of the model. Each network achieved its maximum accuracy after about 6 · 10⁶ training steps. The ensemble prediction was based on the arithmetic average of class probabilities predicted by the constituent networks. The details of ensemble and multicrop inference are similar to (Szegedy et al., 2014).

We demonstrate in Fig. 4 that batch normalization allows us to set a new state of the art by a healthy margin on the ImageNet classification challenge benchmarks.
Model Resolution Crops Models Top-1 error Top-5 error
GoogLeNet ensemble 224 144 7 - 6.67%
Deep Image low-res 256 - 1 - 7.96%
Deep Image high-res 512 - 1 24.88% 7.42%
Deep Image ensemble variable - - - 5.98%
BN-Inception single crop 224 1 1 25.2% 7.82%
BN-Inception multicrop 224 144 1 21.99% 5.82%
BN-Inception ensemble 224 144 6 20.1% 4.9%*

Figure 4: Batch-Normalized Inception comparison with previous state of the art on the provided validation set com-
prising 50000 images. *BN-Inception ensemble has reached 4.82% top-5 error on the 100000 images of the test set of
the ImageNet as reported by the test server.
5 Conclusion

We have presented a novel mechanism for dramatically accelerating the training of deep networks. It is based on the premise that covariate shift, which is known to complicate the training of machine learning systems, also applies to sub-networks and layers, and removing it from internal activations of the network may aid in training. Our proposed method draws its power from normalizing activations, and from incorporating this normalization in the network architecture itself. This ensures that the normalization is appropriately handled by any optimization method that is being used to train the network. To enable stochastic optimization methods commonly used in deep network training, we perform the normalization for each mini-batch, and backpropagate the gradients through the normalization parameters. Batch Normalization adds only two extra parameters per activation, and in doing so preserves the representation ability of the network. We presented an algorithm for constructing, training, and performing inference with batch-normalized networks. The resulting networks can be trained with saturating nonlinearities, are more tolerant to increased training rates, and often do not require Dropout for regularization.

Merely adding Batch Normalization to a state-of-the-art image classification model yields a substantial speedup in training. By further increasing the learning rates, removing Dropout, and applying other modifications afforded by Batch Normalization, we reach the previous state of the art with only a small fraction of training steps – and then beat the state of the art in single-network image classification. Furthermore, by combining multiple models trained with Batch Normalization, we perform better than the best known system on ImageNet, by a significant margin.

Interestingly, our method bears similarity to the standardization layer of (Gülçehre & Bengio, 2013), though the two methods stem from very different goals, and perform different tasks. The goal of Batch Normalization is to achieve a stable distribution of activation values throughout training, and in our experiments we apply it before the nonlinearity since that is where matching the first and second moments is more likely to result in a stable distribution. On the contrary, (Gülçehre & Bengio, 2013) apply the standardization layer to the output of the nonlinearity, which results in sparser activations. In our large-scale image classification experiments, we have not observed the nonlinearity inputs to be sparse, neither with nor without Batch Normalization. Other notable differentiating characteristics of Batch Normalization include the learned scale and shift that allow the BN transform to represent identity (the standardization layer did not require this since it was followed by the learned linear transform that, conceptually, absorbs the necessary scale and shift), handling of convolutional layers, deterministic inference that does not depend on the mini-batch, and batch-normalizing each convolutional layer in the network.

In this work, we have not explored the full range of possibilities that Batch Normalization potentially enables. Our future work includes applications of our method to Recurrent Neural Networks (Pascanu et al., 2013), where the internal covariate shift and the vanishing or exploding gradients may be especially severe, and which would allow us to more thoroughly test the hypothesis that normalization improves gradient propagation (Sec. 3.3). We plan to investigate whether Batch Normalization can help with domain adaptation, in its traditional sense – i.e. whether the normalization performed by the network would allow it to more easily generalize to new data distributions, perhaps with just a recomputation of the population means and variances (Alg. 2). Finally, we believe that further theoretical analysis of the algorithm would allow still more improvements and applications.

References

Bengio, Yoshua and Glorot, Xavier. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of AISTATS 2010, volume 9, pp. 249–256, May 2010.

Dean, Jeffrey, Corrado, Greg S., Monga, Rajat, Chen, Kai, Devin, Matthieu, Le, Quoc V., Mao, Mark Z., Ranzato, Marc'Aurelio, Senior, Andrew, Tucker, Paul, Yang, Ke, and Ng, Andrew Y. Large scale distributed deep networks. In NIPS, 2012.

Desjardins, Guillaume and Kavukcuoglu, Koray. Natural neural networks. (unpublished).
Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, July 2011. ISSN 1532-4435.

Gülçehre, Çaglar and Bengio, Yoshua. Knowledge matters: Importance of prior information for optimization. CoRR, abs/1301.4083, 2013.

He, K., Zhang, X., Ren, S., and Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ArXiv e-prints, February 2015.

Hyvärinen, A. and Oja, E. Independent component analysis: Algorithms and applications. Neural Netw., 13(4-5):411–430, May 2000.

Jiang, Jing. A literature survey on domain adaptation of statistical classifiers, 2008.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998a.

LeCun, Y., Bottou, L., Orr, G., and Muller, K. Efficient backprop. In Orr, G. and Muller, K. (eds.), Neural Networks: Tricks of the trade. Springer, 1998b.

Lyu, S and Simoncelli, E P. Nonlinear image representation using divisive normalization. In Proc. Computer Vision and Pattern Recognition, pp. 1–8. IEEE Computer Society, Jun 23-28 2008. doi: 10.1109/CVPR.2008.4587821.

Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted boltzmann machines. In ICML, pp. 807–814. Omnipress, 2010.

Pascanu, Razvan, Mikolov, Tomas, and Bengio, Yoshua. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pp. 1310–1318, 2013.

Povey, Daniel, Zhang, Xiaohui, and Khudanpur, Sanjeev. Parallel training of deep neural networks with natural gradient and parameter averaging. CoRR, abs/1410.7455, 2014.

Raiko, Tapani, Valpola, Harri, and LeCun, Yann. Deep learning made easier by linear transformations in perceptrons. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 924–932, 2012.

Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., and Fei-Fei, Li. ImageNet Large Scale Visual Recognition Challenge, 2014.

Saxe, Andrew M., McClelland, James L., and Ganguli, Surya. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. CoRR, abs/1312.6120, 2013.

Shimodaira, Hidetoshi. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, October 2000.

Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, January 2014.

Sutskever, Ilya, Martens, James, Dahl, George E., and Hinton, Geoffrey E. On the importance of initialization and momentum in deep learning. In ICML (3), volume 28 of JMLR Proceedings, pp. 1139–1147. JMLR.org, 2013.

Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.

Wiesler, Simon and Ney, Hermann. A convergence analysis of log-linear training. In Shawe-Taylor, J., Zemel, R.S., Bartlett, P., Pereira, F.C.N., and Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 657–665, Granada, Spain, December 2011.

Wiesler, Simon, Richard, Alexander, Schlüter, Ralf, and Ney, Hermann. Mean-normalized stochastic gradient for large-scale deep learning. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 180–184, Florence, Italy, May 2014.

Wu, Ren, Yan, Shengen, Shan, Yi, Dang, Qingqing, and Sun, Gang. Deep image: Scaling up image recognition, 2015.

Appendix

Variant of the Inception Model Used

Figure 5 documents the changes that were performed compared to the GoogLeNet architecture. For the interpretation of this table, please consult (Szegedy et al., 2014). The notable architecture changes compared to the GoogLeNet model include:
• The 5×5 convolutional layers are replaced by two consecutive 3×3 convolutional layers. This increases the maximum depth of the network by 9 weight layers. It also increases the number of parameters by 25%, and the computational cost is increased by about 30%.

• The number of 28×28 inception modules is increased from 2 to 3.

• Inside the modules, sometimes average-pooling and sometimes maximum-pooling is employed. This is indicated in the entries corresponding to the pooling layers of the table.

• There are no across-the-board pooling layers between any two Inception modules, but stride-2 convolution/pooling layers are employed before the filter concatenation in modules 3c, 4e.

Our model employed separable convolution with depth multiplier 8 on the first convolutional layer. This reduces the computational cost while increasing the memory consumption at training time.
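To see why the separable first layer is cheaper, one can count multiplications. A small sketch (ours; the 112×112×64 output and 7×7 kernel are taken from the table in Figure 5, while the cost formulas are standard estimates for depthwise-separable convolution, not something specified in this paper):

    def conv_mults(h, w, c_in, c_out, k):
        """Multiplications in a standard k x k convolution with h x w output."""
        return h * w * c_in * c_out * k * k

    def separable_conv_mults(h, w, c_in, c_out, k, depth_mult):
        """Depthwise k x k step (depth_mult filters per input channel),
        followed by a 1 x 1 pointwise combination."""
        depthwise = h * w * c_in * depth_mult * k * k
        pointwise = h * w * c_in * depth_mult * c_out
        return depthwise + pointwise

    print(conv_mults(112, 112, 3, 64, 7))               # ~118M multiplies
    print(separable_conv_mults(112, 112, 3, 64, 7, 8))  # ~34M multiplies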
type            patch size/stride  output size   depth  #1×1  #3×3 reduce  #3×3  double #3×3 reduce  double #3×3  Pool +proj
convolution*    7×7/2              112×112×64    1
max pool        3×3/2              56×56×64      0
convolution     3×3/1              56×56×192     1            64           192
max pool        3×3/2              28×28×192     0
inception (3a)                     28×28×256     3      64    64           64    64                  96           avg + 32
inception (3b)                     28×28×320     3      64    64           96    64                  96           avg + 64
inception (3c)  stride 2           28×28×576     3      0     128          160   64                  96           max + pass through
inception (4a)                     14×14×576     3      224   64           96    96                  128          avg + 128
inception (4b)                     14×14×576     3      192   96           128   96                  128          avg + 128
inception (4c)                     14×14×576     3      160   128          160   128                 160          avg + 128
inception (4d)                     14×14×576     3      96    128          192   160                 192          avg + 128
inception (4e)  stride 2           14×14×1024    3      0     128          192   192                 256          max + pass through
inception (5a)                     7×7×1024      3      352   192          320   160                 224          avg + 128
inception (5b)                     7×7×1024      3      352   192          320   192                 224          max + 128
avg pool        7×7/1              1×1×1024      0

Figure 5: Inception architecture
Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or −1

arXiv:1602.02830v3 [cs.LG] 17 Mar 2016

Matthieu Courbariaux*1    matthieu.courbariaux@gmail.com
Itay Hubara*2             itayhubara@gmail.com
Daniel Soudry3            daniel.soudry@gmail.com
Ran El-Yaniv2             rani@cs.technion.ac.il
Yoshua Bengio1,4          yoshua.umontreal@gmail.com

1 Université de Montréal
2 Technion - Israel Institute of Technology
3 Columbia University
4 CIFAR Senior Fellow
*Indicates equal contribution. Ordering determined by coin flip.
Abstract

We introduce a method to train Binarized Neural Networks (BNNs) – neural networks with binary weights and activations at run-time. At training-time the binary weights and activations are used for computing the parameters gradients. During the forward pass, BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which is expected to substantially improve power-efficiency. To validate the effectiveness of BNNs, we conduct two sets of experiments on the Torch7 and Theano frameworks. On both, BNNs achieved nearly state-of-the-art results over the MNIST, CIFAR-10 and SVHN datasets. Last but not least, we wrote a binary matrix multiplication GPU kernel with which it is possible to run our MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The code for training and running our BNNs is available on-line.

Introduction

Deep Neural Networks (DNNs) have substantially pushed Artificial Intelligence (AI) limits in a wide range of tasks, including but not limited to object recognition from images (Krizhevsky et al., 2012; Szegedy et al., 2014), speech recognition (Hinton et al., 2012; Sainath et al., 2013), statistical machine translation (Devlin et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015), Atari and Go games (Mnih et al., 2015; Silver et al., 2016), and even abstract art (Mordvintsev et al., 2015).

Today, DNNs are almost exclusively trained on one or many very fast and power-hungry Graphic Processing Units (GPUs) (Coates et al., 2013). As a result, it is often a challenge to run DNNs on target low-power devices, and substantial research efforts are invested in speeding up DNNs at run-time on both general-purpose (Vanhoucke et al., 2011; Gong et al., 2014; Romero et al., 2014; Han et al., 2015) and specialized computer hardware (Farabet et al., 2011a;b; Pham et al., 2012; Chen et al., 2014a;b; Esser et al., 2015).

This paper makes the following contributions:

• We introduce a method to train Binarized-Neural-Networks (BNNs), neural networks with binary weights and activations, at run-time, and when computing the parameters gradients at train-time (see Section 1).

• We conduct two sets of experiments, each implemented on a different framework, namely Torch7 (Collobert et al., 2011) and Theano (Bergstra et al., 2010; Bastien et al., 2012), which show that it is possible to train BNNs on MNIST, CIFAR-10 and SVHN and achieve nearly state-of-the-art results (see Section 2).
• We show that during the forward pass (both at run-time and train-time), BNNs drastically reduce memory consumption (size and number of accesses), and replace most arithmetic operations with bit-wise operations, which potentially leads to a substantial increase in power-efficiency (see Section 3). Moreover, a binarized CNN can lead to binary convolution kernel repetitions; we argue that dedicated hardware could reduce the time complexity by 60%.

• Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is possible to run our MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy (see Section 4).

• The code for training and running our BNNs is available on-line, in both the Theano (https://github.com/MatthieuCourbariaux/BinaryNet) and Torch (https://github.com/itayhubara/BinaryNet) frameworks.

1. Binarized Neural Networks

In this section, we detail our binarization function, show how we use it to compute the parameters gradients, and how we backpropagate through it.

1.1. Deterministic vs Stochastic Binarization

When training a BNN, we constrain both the weights and the activations to either +1 or −1. Those two values are very advantageous from a hardware perspective, as we explain in Section 4. In order to transform the real-valued variables into those two values, we use two different binarization functions, as in (Courbariaux et al., 2015). Our first binarization function is deterministic:

    x^b = Sign(x) = { +1 if x ≥ 0; −1 otherwise },    (1)

where x^b is the binarized variable (weight or activation) and x the real-valued variable. It is very straightforward to implement and works quite well in practice. Our second binarization function is stochastic:

    x^b = { +1 with probability p = σ(x); −1 with probability 1 − p },    (2)

where σ is the "hard sigmoid" function:

    σ(x) = clip((x + 1)/2, 0, 1) = max(0, min(1, (x + 1)/2)).    (3)

The stochastic binarization is more appealing than the sign function, but harder to implement as it requires the hardware to generate random bits when quantizing. As a result, we mostly use the deterministic binarization function (i.e., the sign function), with the exception of activations at train-time in some of our experiments.

1.2. Gradient Computation and Accumulation

Although our BNN training method uses binary weights and activations to compute the parameters gradients, the real-valued gradients of the weights are accumulated in real-valued variables, as per Algorithm 1. Real-valued weights are likely required for Stochastic Gradient Descent (SGD) to work at all. SGD explores the space of parameters in small and noisy steps, and that noise is averaged out by the stochastic gradient contributions accumulated in each weight. Therefore, it is important to keep sufficient resolution for these accumulators, which at first glance suggests that high precision is absolutely required.

Moreover, adding noise to weights and activations when computing the parameters gradients provides a form of regularization that can help to generalize better, as previously shown with variational weight noise (Graves, 2011), Dropout (Srivastava, 2013; Srivastava et al., 2014) and DropConnect (Wan et al., 2013). Our method of training BNNs can be seen as a variant of Dropout, in which instead of randomly setting half of the activations to zero when computing the parameters gradients, we binarize both the activations and the weights.

1.3. Propagating Gradients Through Discretization

The derivative of the sign function is zero almost everywhere, making it apparently incompatible with backpropagation, since the exact gradient of the cost with respect to the quantities before the discretization (pre-activations or weights) would be zero. Note that this remains true even if stochastic quantization is used. Bengio (2013) studied the question of estimating or propagating gradients through stochastic discrete neurons. They found in their experiments that the fastest training was obtained when using the "straight-through estimator," previously introduced in Hinton (2012)'s lectures.

We follow a similar approach, but use the version of the straight-through estimator that takes into account the saturation effect, and does use deterministic rather than stochastic sampling of the bit. Consider the sign function quantization

    q = Sign(r),

and assume that an estimator g_q of the gradient ∂C/∂q has been obtained (with the straight-through estimator when needed). Then, our straight-through estimator of ∂C/∂r is simply

    g_r = g_q 1_{|r|≤1}.    (4)

Note that this preserves the gradient's information and cancels the gradient when r is too large. Not cancelling the gradient when r is too large significantly worsens the performance. The use of this straight-through estimator is illustrated in Algorithm 1.
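Eqs. (1)–(4) are small enough to state in code. A numpy sketch (ours; the names are illustrative, not from the released implementation):

    import numpy as np

    def hard_sigmoid(x):
        """Eq. (3): sigma(x) = clip((x + 1) / 2, 0, 1)."""
        return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

    def binarize(x, stochastic=False, rng=None):
        """Eq. (1) deterministic sign, or Eq. (2) stochastic binarization."""
        if stochastic:
            rng = rng or np.random.default_rng()
            return np.where(rng.random(x.shape) < hard_sigmoid(x), 1.0, -1.0)
        return np.where(x >= 0, 1.0, -1.0)

    def ste_backward(g_q, r):
        """Eq. (4): straight-through estimator g_r = g_q * 1_{|r| <= 1}."""
        return g_q * (np.abs(r) <= 1.0)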
The derivative 1_{|r|≤1} can also be seen as propagating the gradient through hard tanh, which is the following piece-wise linear activation function:

    Htanh(x) = Clip(x, −1, 1) = max(−1, min(1, x)).    (5)

For hidden units, we use the sign function non-linearity to obtain binary activations, and for weights we combine two ingredients:

• Constrain each real-valued weight between -1 and 1, by projecting w^r to -1 or 1 when the weight update brings w^r outside of [−1, 1], i.e., clipping the weights during training, as per Algorithm 1. The real-valued weights would otherwise grow very large without any impact on the binary weights.

• When using a weight w^r, quantize it using w^b = Sign(w^r).

This is consistent with the gradient canceling when |w^r| > 1, according to Eq. 4.

Algorithm 1 Training a BNN. C is the cost function for the minibatch, λ the learning rate decay factor, and L the number of layers. ◦ indicates element-wise multiplication. The function Binarize() specifies how to (stochastically or deterministically) binarize the activations and weights, and Clip() how to clip the weights. BatchNorm() specifies how to batch-normalize the activations, using either batch normalization (Ioffe & Szegedy, 2015) or the shift-based variant we describe in Algorithm 3. BackBatchNorm() specifies how to backpropagate through the normalization. Update() specifies how to update the parameters when their gradients are known, using either ADAM (Kingma & Ba, 2014) or the shift-based AdaMax we describe in Algorithm 4.

Require: a minibatch of inputs and targets (a0, a*), previous weights W, previous BatchNorm parameters θ, weights initialization coefficients from (Glorot & Bengio, 2010) γ, and previous learning rate η.
Ensure: updated weights W^(t+1), updated BatchNorm parameters θ^(t+1) and updated learning rate η^(t+1).
    {1. Computing the parameters gradients:}
    {1.1. Forward propagation:}
    for k = 1 to L do
        W^b_k ← Binarize(W_k)
        s_k ← a^b_(k−1) W^b_k
        a_k ← BatchNorm(s_k, θ_k)
        if k < L then
            a^b_k ← Binarize(a_k)
        end if
    end for
    {1.2. Backward propagation:}
    {Please note that the gradients are not binary.}
    Compute g_(a_L) = ∂C/∂a_L knowing a_L and a*
    for k = L to 1 do
        if k < L then
            g_(a_k) ← g_(a^b_k) ◦ 1_{|a_k|≤1}
        end if
        (g_(s_k), g_(θ_k)) ← BackBatchNorm(g_(a_k), s_k, θ_k)
        g_(a^b_(k−1)) ← g_(s_k) W^b_k
        g_(W^b_k) ← g_(s_k)ᵀ a^b_(k−1)
    end for
    {2. Accumulating the parameters gradients:}
    for k = 1 to L do
        θ^(t+1)_k ← Update(θ_k, η, g_(θ_k))
        W^(t+1)_k ← Clip(Update(W_k, γ_k η, g_(W^b_k)), −1, 1)
        η^(t+1) ← λη
    end for

Algorithm 3 Shift based Batch Normalizing Transform, applied to activation x over a mini-batch. AP2(x) = sign(x) × 2^round(log2|x|) is the approximate power-of-2³, and ≪≫ stands for both left and right binary shift.

Require: Values of x over a mini-batch: B = {x1...m};
         Parameters to be learned: γ, β
Ensure: {yᵢ = BN(xᵢ, γ, β)}
    μ_B ← (1/m) Σᵢ xᵢ                            {mini-batch mean}
    C(xᵢ) ← (xᵢ − μ_B)                            {centered input}
    σ²_B ← (1/m) Σᵢ (C(xᵢ) ≪≫ AP2(C(xᵢ)))        {apx variance}
    x̂ᵢ ← C(xᵢ) ≪≫ AP2((√(σ²_B + ε))⁻¹)           {normalize}
    yᵢ ← AP2(γ) ≪≫ x̂ᵢ                            {scale and shift}

³ Hardware implementation of AP2 is as simple as extracting the index of the most significant bit from the number's binary representation.
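A numpy sketch of AP2 and of Algorithm 3 (ours; the binary shifts are emulated here with floating-point multiplications by powers of two, which is what a hardware shift would implement):

    import numpy as np

    def ap2(x):
        """AP2(x) = sign(x) * 2**round(log2 |x|)."""
        out = np.zeros_like(x, dtype=float)
        nz = x != 0
        out[nz] = np.sign(x[nz]) * 2.0 ** np.round(np.log2(np.abs(x[nz])))
        return out

    def shift_based_batch_norm(x, gamma, beta, eps=1e-4):
        """Algorithm 3: every multiplication of vanilla BN is replaced by a
        power-of-2 scaling, i.e. a binary shift in hardware."""
        mu = x.mean(axis=0)
        c = x - mu                                  # centered input
        var = (c * ap2(c)).mean(axis=0)             # approximate variance
        x_hat = c * ap2(1.0 / np.sqrt(var + eps))   # normalize
        return ap2(gamma) * x_hat + beta            # scale and shift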
1.4. Shift based Batch Normalization

Batch Normalization (BN) (Ioffe & Szegedy, 2015) accelerates the training and also seems to reduce the overall impact of the weights' scale. The normalization noise may also help to regularize the model. However, at train-time, BN requires many multiplications (calculating the standard deviation and dividing by it), namely, dividing by the running variance (the weighted mean of the training set activation variance). Although the number of scaling calculations is the same as the number of neurons, in the case of ConvNets this number is quite large. For example, in the CIFAR-10 dataset (using our architecture), the first convolution layer, consisting of only 128 × 3 × 3 filter masks, converts an image of size 3 × 32 × 32 to size 3 × 128 × 28 × 28, which is two orders of magnitude larger than the number of weights. To achieve the results that BN would obtain, we use a shift-based batch normalization (SBN) technique, detailed in Algorithm 3. SBN approximates BN almost without multiplications. In the experiments we conducted, we did not observe accuracy loss when using the shift-based BN algorithm instead of the vanilla BN algorithm.

1.5. Shift based AdaMax

The ADAM learning rule (Kingma & Ba, 2014) also seems to reduce the impact of the weight scale. Since ADAM requires many multiplications, we suggest using instead the shift-based AdaMax we detail in Algorithm 4. In the experiments we conducted, we did not observe accuracy loss when using the shift-based AdaMax algorithm instead of the vanilla ADAM algorithm.

Algorithm 4 Shift based AdaMax learning rule (Kingma & Ba, 2014). g²_t indicates the element-wise square g_t ◦ g_t. Good default settings are α = 2⁻¹⁰, 1 − β1 = 2⁻³, 1 − β2 = 2⁻¹⁰. All operations on vectors are element-wise. With β1^t and β2^t we denote β1 and β2 to the power t.

Require: Previous parameters θ_(t−1) and their gradient g_t, and learning rate α.
Ensure: Updated parameters θ_t
    {Biased 1st and 2nd raw moment estimates:}
    m_t ← β1 · m_(t−1) + (1 − β1) · g_t
    v_t ← max(β2 · v_(t−1), |g_t|)
    {Updated parameters:}
    θ_t ← θ_(t−1) − (α ≪≫ (1 − β1)) · m̂_t ≪≫ v_t⁻¹

1.6. First Layer

In a BNN, only the binarized values of the weights and activations are used in all calculations. As the output of one layer is the input of the next, all the layers inputs are binary, with the exception of the first layer. However, we do not believe this to be a major issue. First, in computer vision, the input representation typically has much fewer channels (e.g., Red, Green and Blue) than internal representations (e.g., 512). As a result, the first layer of a ConvNet is often the smallest convolution layer, both in terms of parameters and computations (Szegedy et al., 2014). Second, it is relatively easy to handle continuous-valued inputs as fixed point numbers, with m bits of precision. For example, in the common case of 8-bit fixed point inputs:

    s = x · w^b    (6)
    s = Σ_(n=1)^8 2^(n−1) (x^n · w^b),    (7)

where x is a vector of 1024 8-bit inputs, x^8_1 is the most significant bit of the first input, w^b is a vector of 1024 1-bit weights, and s is the resulting weighted sum. This trick is used in Algorithm 5.

Algorithm 5 Running a BNN. L is the number of layers.

Require: a vector of 8-bit inputs a0, the binary weights W^b, and the BatchNorm parameters θ.
Ensure: the MLP output a_L.
    {1. First layer:}
    a1 ← 0
    for n = 1 to 8 do
        a1 ← a1 + 2^(n−1) × XnorDotProduct(a^n_0, W^b_1)
    end for
    a^b_1 ← Sign(BatchNorm(a1, θ1))
    {2. Remaining hidden layers:}
    for k = 2 to L − 1 do
        a_k ← XnorDotProduct(a^b_(k−1), W^b_k)
        a^b_k ← Sign(BatchNorm(a_k, θ_k))
    end for
    {3. Output layer:}
    a_L ← XnorDotProduct(a^b_(L−1), W^b_L)
    a_L ← BatchNorm(a_L, θ_L)
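The XnorDotProduct and the first-layer trick of Eq. (7) can be sketched in numpy as follows (our illustration; we encode +1 as bit 1 and −1 as bit 0, and the function names are assumptions rather than the paper's kernel API):

    import numpy as np

    def xnor_dot(a_bits, w_bits):
        """Binary dot product: for +/-1 vectors stored as {0, 1} bits,
        a . w = 2 * popcount(xnor(a, w)) - n."""
        agree = (a_bits == w_bits)        # XNOR: 1 where the bits agree
        return 2 * int(agree.sum()) - a_bits.size

    def first_layer_dot(x_uint8, w_pm1):
        """Eq. (7): accumulate the 8 bit-planes of the fixed-point input,
        each weighted by its power of two."""
        s = 0.0
        for n in range(8):
            bit = (x_uint8 >> n) & 1      # n-th bit-plane of the inputs
            s += 2.0 ** n * np.dot(bit, w_pm1)
        return s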
2. Benchmark Results

We conduct two sets of experiments, each based on a different framework, namely Torch7 (Collobert et al., 2011) and Theano (Bergstra et al., 2010; Bastien et al., 2012). Other than the framework, the two sets of experiments are very similar:

• In both sets of experiments, we obtain near state-of-the-art results with BNNs on the MNIST, CIFAR-10 and SVHN benchmark datasets.

• In our Torch7 experiments, the activations are stochastically binarized at train-time, whereas in our Theano experiments they are deterministically binarized.

• In our Torch7 experiments, we use the shift-based BN and AdaMax variants, which are detailed in Algorithms 3 and 4, whereas in our Theano experiments, we use vanilla BN and ADAM.

Table 1. Classification test error rates of DNNs trained on MNIST (MLP architecture without unsupervised pretraining), CIFAR-10 (without data augmentation) and SVHN.

Data set                                             MNIST       SVHN    CIFAR-10
Binarized activations+weights, during training and test
BNN (Torch7)                                         1.40%       2.53%   10.15%
BNN (Theano)                                         0.96%       2.80%   11.40%
Committee Machines' Array (Baldassi et al., 2015)    1.35%       -       -
Binarized weights, during training and test
BinaryConnect (Courbariaux et al., 2015)             1.29±0.08%  2.30%   9.90%
Binarized activations+weights, during test
EBP (Cheng et al., 2015)                             2.2±0.1%    -       -
Bitwise DNNs (Kim & Smaragdis, 2016)                 1.33%       -       -
Ternary weights, binary activations, during test
(Hwang & Sung, 2014)                                 1.45%       -       -
No binarization (standard results)
Maxout Networks (Goodfellow et al.)                  0.94%       2.47%   11.68%
Network in Network (Lin et al.)                      -           2.35%   10.41%
Gated pooling (Lee et al., 2015)                     -           1.69%   7.62%

Figure 1. Training curves of a ConvNet on CIFAR-10 depending on the method. The dotted lines represent the training costs (square hinge losses) and the continuous lines the corresponding validation error rates. Although BNNs are slower to train, they are nearly as accurate as 32-bit float DNNs.

Figure 2. Binary weight filters, sampled from the first convolution layer. Since we have only 2^(k²) unique 2D filters (where k is the filter size), filter replication is very common. For instance, on our CIFAR-10 ConvNet, only 42% of the filters are unique.

2.1. MLP on MNIST (Theano)

MNIST is an image classification benchmark dataset (LeCun et al., 1998). It consists of a training set of 60K and a test set of 10K 28 × 28 gray-scale images representing digits ranging from 0 to 9. In order for this benchmark to remain a challenge, we did not use any convolution, data-augmentation, preprocessing or unsupervised learning. The MLP we train on MNIST consists of 3 hidden layers of 4096 binary units (see Section 1) and an L2-SVM output layer; L2-SVM has been shown to perform better than Softmax on several classification benchmarks (Tang, 2013; Lee et al., 2014). We regularize the model with Dropout (Srivastava, 2013; Srivastava et al., 2014). The square hinge loss is minimized with the ADAM adaptive learning rate method (Kingma & Ba, 2014). We use an exponentially decaying global learning rate, as per Algorithm 1, and also scale the learning rates of the weights with their initialization coefficients from (Glorot & Bengio, 2010), as suggested by Courbariaux et al. (2015). We use Batch Normalization with a minibatch of size 100 to speed up the training. As is typical, we use the last 10K samples of the training set as a validation set for early stopping and model selection. We report the test error rate associated with the best validation error rate after 1000 epochs (we do not retrain on the validation set). The results are reported in Table 1.
Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or −1

and BN (with a minibatch of size 100) instead of the vanilla


Table 2. Energy consumption of multiply-accumulations
implementations, to reduce the number of multiplications.
(Horowitz, 2014)
Likewise, we decay the learning rate by using a 1-bit right Operation MUL ADD
shift every 10 epochs. The results are presented in Table 1. 8bit Integer 0.2pJ 0.03pJ
32bit Integer 3.1pJ 0.1pJ
16bit Floating Point 1.1pJ 0.4pJ
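The 1-bit right-shift decay is literal: if the learning rate is kept in a fixed-point integer encoding, halving it needs no multiplication. A minimal sketch (ours, assuming a Q16 fixed-point encoding; the authors' implementation is in Torch7):

    def shifted_lr(lr0_fixed, epoch, shift_every=10):
        # One right shift (i.e., a halving) every `shift_every` epochs.
        return lr0_fixed >> (epoch // shift_every)

    lr0 = 1 << 16                             # 1.0 in Q16 fixed point
    assert shifted_lr(lr0, 25) == lr0 >> 2    # halved twice by epoch 25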
Table 2. Energy consumption of multiply-accumulations (Horowitz, 2014)

  Operation                MUL     ADD
  8-bit Integer            0.2pJ   0.03pJ
  32-bit Integer           3.1pJ   0.1pJ
  16-bit Floating Point    1.1pJ   0.4pJ
  32-bit Floating Point    3.7pJ   0.9pJ

2.3. ConvNet on CIFAR-10 (Theano)

CIFAR-10 is an image classification benchmark dataset. It consists of a training set of size 50K and a test set of size 10K, where instances are 32 × 32 color images representing airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships and trucks. We do not use any preprocessing or data augmentation (which can really be a game changer for this dataset (Graham, 2014)). The architecture of our ConvNet is the same as Courbariaux et al. (2015)'s, except for the binarization of the activations; that architecture is itself mainly inspired by VGG (Simonyan & Zisserman, 2015). The square hinge loss is minimized with ADAM. We use an exponentially decaying learning rate, as we did for MNIST. We scale the learning rates of the weights with their initialization coefficients from (Glorot & Bengio, 2010). We use Batch Normalization with a minibatch of size 50 to speed up the training. We use the last 5000 samples of the training set as a validation set. We report the test error rate associated with the best validation error rate after 500 training epochs (we do not retrain on the validation set). The results are presented in Table 1 and Figure 1.

Table 3. Energy consumption of memory accesses (Horowitz, 2014)

  Memory size    64-bit memory access
  8K             10pJ
  32K            20pJ
  1M             100pJ
  DRAM           1.3-2.6nJ

2.4. ConvNet on CIFAR-10 (Torch7)

We use the same architecture as in our Theano experiments. We apply shift-based AdaMax and BN (with a minibatch of size 200) instead of the vanilla implementations to reduce the number of multiplications. Likewise, we decay the learning rate by using a 1-bit right shift every 50 epochs. The results are presented in Table 1 and Figure 1.

2.5. ConvNet on SVHN

SVHN is also an image classification benchmark dataset. It consists of a training set of size 604K examples and a test set of size 26K, where instances are 32 × 32 color images representing digits ranging from 0 to 9. In both sets of experiments, we follow the same procedure used for the CIFAR-10 experiments, with a few notable exceptions: we use half the number of units in the convolution layers, and we train for 200 epochs instead of 500 (because SVHN is a much larger dataset than CIFAR-10). The results are given in Table 1.

3. Very Power Efficient in Forward Pass

Computer hardware, be it general-purpose or specialized, is composed of memories, arithmetic operators and control logic. During the forward pass (both at run-time and train-time), BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which might lead to a great increase in power efficiency. Moreover, a binarized CNN can lead to binary convolution kernel repetitions, and we argue that dedicated hardware could reduce the time complexity by 60%.

3.1. Memory Size and Accesses

Improving computing performance has always been and remains a challenge. Over the last decade, power has been the main constraint on performance (Horowitz, 2014). This is why much research effort has been devoted to reducing the energy consumption of neural networks. Horowitz (2014) provides rough numbers for the energy consumption of computations (the given numbers are for 45nm technology), summarized in Tables 2 and 3. Importantly, we can see that memory accesses typically consume more energy than arithmetic operations, and that the cost of a memory access grows with memory size. In comparison with 32-bit DNNs, BNNs require 32 times smaller memory size and 32 times fewer memory accesses. This is expected to reduce energy consumption drastically (i.e., by more than a factor of 32).

3.2. XNOR-Count

Applying a DNN mainly consists of convolutions and matrix multiplications. The key arithmetic operation of deep learning is thus the multiply-accumulate operation. Artificial neurons are basically multiply-accumulators computing weighted sums of their inputs. In BNNs, both the activations and the weights are constrained to either −1 or +1. As a result, most of the 32-bit floating-point multiply-accumulations are replaced by 1-bit XNOR-count operations. This could have a big impact on dedicated deep learning hardware. For instance, a 32-bit floating-point multiplier costs about 200 Xilinx FPGA slices (Govindu et al., 2004; Beauchamp et al., 2006), whereas a 1-bit XNOR gate only costs a single slice.
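As a back-of-the-envelope check on these claims, the figures in Tables 2 and 3 can be combined directly (our own simplification; it ignores control logic and access patterns):

    # Rough 45nm energies from Tables 2 and 3 (Horowitz, 2014), in picojoules.
    mul_fp32, add_fp32 = 3.7, 0.9
    mac_fp32 = mul_fp32 + add_fp32   # ~4.6 pJ per 32-bit floating-point MAC
    dram_access = 1.3e3              # one 64-bit DRAM access, lower bound (1.3 nJ)

    print(f"one DRAM access costs about {dram_access / mac_fp32:.0f} 32-bit MACs")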


Figure 3. The first three columns represent the time it takes to perform a 8192 × 8192 × 8192 (binary) matrix multiplication on a GTX750 Nvidia GPU, depending on which kernel is used. We can see that our XNOR kernel is 23 times faster than our baseline kernel and 3.4 times faster than cuBLAS. The next three columns represent the time it takes to run the MLP from Section 2 on the full MNIST test set. As MNIST's images are not binary, the first layer's computations are always performed by the baseline kernel. The last three columns show that the MLP accuracy does not depend on which kernel is used.

3.3. Exploiting Filter Repetitions

When using a ConvNet architecture with binary weights, the number of unique filters is bounded by the filter size. For example, in our implementation we use filters of size 3 × 3, so the maximum number of unique 2D filters is 2^9 = 512. However, this should not prevent expanding the number of feature maps beyond this number, since the actual filter is a 3D matrix. Assuming we have M_ℓ filters in the ℓ-th convolutional layer, we have to store a 4D weight matrix of size M_ℓ × M_{ℓ−1} × k × k. Consequently, the number of unique filters is 2^(k^2) · M_{ℓ−1}. When necessary, we apply each filter on the map and perform the required multiply-accumulate (MAC) operations (in our case, using XNOR and popcount operations). Since we now have binary filters, many 2D filters of size k × k repeat themselves. By using dedicated hardware/software, we can apply only the unique 2D filters on each feature map and sum the results appropriately to obtain each 3D filter's convolutional result. Note that an inverse filter (i.e., [-1,1,-1] is the inverse of [1,-1,1]) can also be treated as a repetition; it is merely a multiplication of the original filter by −1. For example, in our ConvNet architecture trained on the CIFAR-10 benchmark, there are only 42% unique filters per layer on average. Hence we can reduce the number of XNOR-popcount operations by a factor of 3.
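The repetition statistic is easy to measure offline. This sketch (ours) counts the distinct 2D slices of a binarized 4D weight tensor, optionally treating a filter and its negation as one class, as discussed above:

    import numpy as np

    def unique_filter_fraction(w, merge_negations=True):
        # w: binarized weights of shape (M_l, M_l-1, k, k), entries in {-1, +1}.
        flat = w.reshape(-1, w.shape[-2] * w.shape[-1])
        if merge_negations:
            # Flip each filter so its first entry is +1; f and -f then collide.
            flat = flat * flat[:, :1]
        unique = {tuple(f) for f in flat}
        return len(unique) / len(flat)

    w = np.sign(np.random.randn(128, 128, 3, 3))  # random stand-in for trained weights
    print(unique_filter_fraction(w))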

4. Seven Times Faster on GPU at Run-Time

It is possible to speed up GPU implementations of BNNs by using a method sometimes called SIMD (single instruction, multiple data) within a register (SWAR). The basic idea of SWAR is to concatenate groups of 32 binary variables into 32-bit registers, and thus obtain a 32-times speed-up on bitwise operations (e.g., XNOR). Using SWAR, it is possible to evaluate 32 connections with only 3 instructions:

    a_1 += popcount(xnor(a_0^(32b), w_1^(32b))),    (8)

where a_1 is the resulting weighted sum, and a_0^(32b) and w_1^(32b) are the concatenated inputs and weights. Those 3 instructions (accumulation, popcount, xnor) take 1 + 4 + 1 = 6 clock cycles on recent Nvidia GPUs (and if they were to become a fused instruction, it would only take a single clock cycle). Consequently, we obtain a theoretical Nvidia GPU speed-up of a factor of 32/6 ≈ 5.3. In practice, this speed-up is quite easy to obtain, as the memory bandwidth to computation ratio is also increased by 6 times.

In order to validate those theoretical results, we programmed two GPU kernels:

• The first kernel (baseline) is a quite unoptimized matrix multiplication kernel.

• The second kernel (XNOR) is nearly identical to the baseline kernel, except that it uses the SWAR method, as in Equation (8).

The two GPU kernels return identical outputs when their inputs are constrained to −1 or +1 (but not otherwise). The XNOR kernel is about 23 times faster than the baseline kernel and 3.4 times faster than cuBLAS, as shown in Figure 3. Last but not least, the MLP from Section 2 runs 7 times faster with the XNOR kernel than with the baseline kernel, without suffering any loss in classification accuracy (see Figure 3).
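The same trick can be prototyped outside a GPU kernel. The sketch below (ours — the paper's kernels are CUDA) packs 32 values in {−1, +1} into one machine word and evaluates the weighted sum with an XNOR and a popcount; a ±1 dot product over n bits is 2·popcount(xnor) − n:

    def pack_bits(values):
        # Map 32 values in {-1, +1} to one 32-bit word: +1 -> bit set, -1 -> clear.
        word = 0
        for i, v in enumerate(values):
            if v > 0:
                word |= 1 << i
        return word

    def xnor_dot(a_word, w_word, n=32):
        # XNOR marks positions where activation and weight agree; each agreement
        # contributes +1 and each disagreement -1 to the weighted sum.
        xnor = ~(a_word ^ w_word) & ((1 << n) - 1)
        return 2 * bin(xnor).count("1") - n

    a = [1, -1] * 16
    w = [1, 1] * 16
    assert xnor_dot(pack_bits(a), pack_bits(w)) == sum(x * y for x, y in zip(a, w))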
5. Discussion and Related Work

Until recently, the use of extremely low-precision networks (binary in the extreme case) was believed to be highly destructive to the network performance (Courbariaux et al., 2014). Soudry et al. (2014) and Cheng et al. (2015) showed the contrary, demonstrating that good performance could be achieved even if all neurons and weights are binarized to ±1. This was done using Expectation BackPropagation (EBP), a variational Bayesian approach, which infers networks with binary weights and neurons by updating the posterior distributions over the weights. These distributions are updated by differentiating their parameters (e.g., mean values) via the back-propagation (BP) algorithm. Esser et al. (2015) implemented a fully binary network at run time using a very similar approach to EBP, showing significant improvement in energy efficiency. The drawback of EBP is that the binarized parameters were only used during inference.

The probabilistic idea behind EBP was extended in the BinaryConnect algorithm of Courbariaux et al. (2015). In BinaryConnect, the real-valued version of the weights is saved and used as a key reference for the binarization process. The binarization noise is independent between different weights, either by construction (by using stochastic quantization) or by assumption (a common simplification; see Spang (1962)). The noise would have little effect on the next neuron's input because the input is a summation over many weighted neurons. Thus, the real-valued version can be updated by the back-propagated error by simply ignoring the binarization noise in the update. Using this method, Courbariaux et al. (2015) were the first to binarize weights in CNNs and achieved near state-of-the-art performance on several datasets. They also argued that noisy weights provide a form of regularization, which could help to improve generalization, as previously shown in (Wan et al., 2013). This method binarized weights while still maintaining full-precision neurons.

Lin et al. (2015) carried over the work of Courbariaux et al. (2015) to the back-propagation process by quantizing the representations at each layer of the network, converting some of the remaining multiplications into binary shifts by restricting the neurons' values to power-of-two integers. Lin et al. (2015)'s work and ours seem to share similar characteristics. However, their approach continues to use full-precision weights during the test phase. Moreover, Lin et al. (2015) quantize the neurons only during the back-propagation process, and not during forward propagation.

Other research (Baldassi et al., 2015) showed that fully binary training and testing is possible in an array of committee machines with randomized input, where only one weight layer is being adjusted. Judd et al. and Gong et al. aimed to compress a fully trained high-precision network by using quantization or matrix factorization methods. These methods required training the network with full-precision weights and neurons, thus requiring the numerous MAC operations the proposed BNN algorithm avoids. Hwang & Sung (2014) focused on a fixed-point neural network design and achieved performance almost identical to that of the floating-point architecture. Kim et al. (2014) provided evidence that DNNs with ternary weights, used on a dedicated circuit, consume very low power and can be operated with only on-chip memory, at run time. Sung et al. also indicated satisfactory empirical performance of neural networks with 8-bit precision. Kim and Smaragdis (2015) retrained neural networks with binary weights and activations.

So far, to the best of our knowledge, no work has succeeded in binarizing weights and neurons, at both the inference phase and the entire training phase of a deep network. This was achieved in the present work. We relied on the idea that binarization can be done stochastically, or be approximated as random noise. This was previously done for the weights by Courbariaux et al. (2015), but our BNNs extend this to the activations. Note that the binary activations are especially important for ConvNets, where there are typically many more neurons than free weights. This allows highly efficient operation of the binarized DNN at run time, and at the forward-propagation phase during training. Moreover, our training method has almost no multiplications, and therefore might be implemented efficiently in dedicated hardware. However, we have to save the value of the full-precision weights. This is a remaining computational bottleneck during training, since it requires relatively high energy resources. Novel memory devices might be used to alleviate this issue in the future; see e.g. (Soudry et al.).

Conclusion

We have introduced BNNs, DNNs with binary weights and activations at run-time and when computing the parameters' gradients at train-time (see Section 1). We have conducted two sets of experiments on two different frameworks, Torch7 and Theano, which show that it is possible to train BNNs on MNIST, CIFAR-10 and SVHN and achieve nearly state-of-the-art results (see Section 2). Moreover, during the forward pass (both at run-time and train-time), BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which might lead to a great increase in power efficiency (see Section 3). Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is possible to run our MNIST MLP 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy (see Section 4). Future work should explore how to extend the speed-up to train-time (e.g., by binarizing some gradients), and also extend benchmark results to other models (e.g., RNNs) and datasets (e.g., ImageNet).

Acknowledgments

We would like to express our appreciation to Elad Hoffer for his technical assistance and constructive comments. We thank our fellow MILA lab members who took the time to read the article and give us some feedback. We thank the developers of Torch (Collobert et al., 2011), a Lua-based environment, and Theano (Bergstra et al., 2010; Bastien et al., 2012), a Python library which allowed us to easily develop fast and optimized code for GPUs. We also thank the developers of Pylearn2 (Goodfellow et al., 2013) and Lasagne (Dieleman et al., 2015), two Deep Learning libraries built on top of Theano. We thank Yuxin Wu for helping us compare our GPU kernels with cuBLAS. We are also grateful for funding from CIFAR, NSERC, IBM, Samsung, and the Israel Science Foundation (ISF).

References


Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. In ICLR'2015, arXiv:1409.0473, 2015.

Baldassi, Carlo, Ingrosso, Alessandro, Lucibello, Carlo, Saglietti, Luca, and Zecchina, Riccardo. Subdominant dense clusters allow for simple learning and high computational performance in neural networks with discrete synapses. Physical Review Letters, 115(12):1–5, 2015.

Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Bergstra, James, Goodfellow, Ian J., Bergeron, Arnaud, Bouchard, Nicolas, and Bengio, Yoshua. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.

Beauchamp, Michael J, Hauck, Scott, Underwood, Keith D, and Hemmert, K Scott. Embedded floating-point units in FPGAs. In Proceedings of the 2006 ACM/SIGDA 14th international symposium on Field programmable gate arrays, pp. 12–20. ACM, 2006.

Bengio, Yoshua. Estimating or propagating gradients through stochastic neurons. Technical Report arXiv:1305.2982, Universite de Montreal, 2013.

Bergstra, James, Breuleux, Olivier, Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Desjardins, Guillaume, Turian, Joseph, Warde-Farley, David, and Bengio, Yoshua. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral Presentation.

Chen, Tianshi, Du, Zidong, Sun, Ninghui, Wang, Jia, Wu, Chengyong, Chen, Yunji, and Temam, Olivier. Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proceedings of the 19th international conference on Architectural support for programming languages and operating systems, pp. 269–284. ACM, 2014a.

Chen, Yunji, Luo, Tao, Liu, Shaoli, Zhang, Shijin, He, Liqiang, Wang, Jia, Li, Ling, Chen, Tianshi, Xu, Zhiwei, Sun, Ninghui, et al. Dadiannao: A machine-learning supercomputer. In Microarchitecture (MICRO), 2014 47th Annual IEEE/ACM International Symposium on, pp. 609–622. IEEE, 2014b.

Cheng, Zhiyong, Soudry, Daniel, Mao, Zexi, and Lan, Zhenzhong. Training binary multilayer neural networks for image classification using expectation backpropagation. arXiv preprint arXiv:1503.03562, 2015.

Coates, Adam, Huval, Brody, Wang, Tao, Wu, David, Catanzaro, Bryan, and Ng, Andrew. Deep learning with COTS HPC systems. In Proceedings of the 30th international conference on machine learning, pp. 1337–1345, 2013.

Collobert, Ronan, Kavukcuoglu, Koray, and Farabet, Clément. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.

Courbariaux, Matthieu, Bengio, Yoshua, and David, Jean-Pierre. Training deep neural networks with low precision multiplications. ArXiv e-prints, abs/1412.7024, December 2014.

Courbariaux, Matthieu, Bengio, Yoshua, and David, Jean-Pierre. BinaryConnect: Training deep neural networks with binary weights during propagations. ArXiv e-prints, abs/1511.00363, November 2015.

Devlin, Jacob, Zbib, Rabih, Huang, Zhongqiang, Lamar, Thomas, Schwartz, Richard, and Makhoul, John. Fast and robust neural network joint models for statistical machine translation. In Proc. ACL'2014, 2014.

Dieleman, Sander, Schlüter, Jan, Raffel, Colin, Olson, Eben, Sønderby, Søren Kaae, Nouri, Daniel, Maturana, Daniel, Thoma, Martin, Battenberg, Eric, Kelly, Jack, Fauw, Jeffrey De, Heilman, Michael, diogo149, McFee, Brian, Weideman, Hendrik, takacsg84, peterderivaz, Jon, instagibbs, Rasul, Dr. Kashif, CongLiu, Britefury, and Degrave, Jonas. Lasagne: First release, August 2015.

Esser, Steve K, Appuswamy, Rathinakumar, Merolla, Paul, Arthur, John V, and Modha, Dharmendra S. Backpropagation for energy-efficient neuromorphic computing. In Advances in Neural Information Processing Systems, pp. 1117–1125, 2015.

Farabet, Clément, LeCun, Yann, Kavukcuoglu, Koray, Culurciello, Eugenio, Martini, Berin, Akselrod, Polina, and Talay, Selcuk. Large-scale FPGA-based convolutional networks. Machine Learning on Very Large Data Sets, 1, 2011a.

Farabet, Clément, Martini, Berin, Corda, Benoit, Akselrod, Polina, Culurciello, Eugenio, and LeCun, Yann. Neuflow: A runtime reconfigurable dataflow processor for vision. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on, pp. 109–116. IEEE, 2011b.

Glorot, Xavier and Bengio, Yoshua. Understanding the difficulty of training deep feedforward neural networks. In AISTATS'2010, 2010.

Gong, Yunchao, Liu, Liu, Yang, Ming, and Bourdev, Lubomir. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.

Goodfellow, Ian J., Warde-Farley, David, Mirza, Mehdi, Courville, Aaron, and Bengio, Yoshua. Maxout networks. In ICML'2013, pp. 1319–1327, 2013.

Goodfellow, Ian J., Warde-Farley, David, Lamblin, Pascal, Dumoulin, Vincent, Mirza, Mehdi, Pascanu, Razvan, Bergstra, James, Bastien, Frédéric, and Bengio, Yoshua. Pylearn2: a machine learning research library. arXiv preprint arXiv:1308.4214, 2013.

Govindu, Gokul, Zhuo, Ling, Choi, Seonil, and Prasanna, Viktor. Analysis of high-performance floating-point arithmetic on FPGAs. In Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International, pp. 149. IEEE, 2004.

Graham, Benjamin. Spatially-sparse convolutional neural networks. arXiv preprint arXiv:1409.6070, 2014.

Graves, Alex. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pp. 2348–2356, 2011.

Han, Song, Pool, Jeff, Tran, John, and Dally, William. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143, 2015.

Hinton, Geoffrey. Neural networks for machine learning. Coursera, video lectures, 2012.

Hinton, Geoffrey, Deng, Li, Dahl, George E., Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara, and Kingsbury, Brian. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29(6):82–97, Nov. 2012.

Horowitz, Mark. Computing's energy problem (and what we can do about it). IEEE International Solid State Circuits Conference, pp. 10–14, 2014.

Hwang, Kyuyeon and Sung, Wonyong. Fixed-point feedforward deep neural network design using weights +1, 0, and −1. In Signal Processing Systems (SiPS), 2014 IEEE Workshop on, pp. 1–6. IEEE, 2014.

Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. 2015.

Judd, Patrick, Albericio, Jorge, Hetherington, Tayler, Aamodt, Tor, Jerger, Natalie Enright, Urtasun, Raquel, and Moshovos, Andreas. Reduced-precision strategies for bounded memory in deep neural nets. pp. 12.

Kim, Jonghong, Hwang, Kyuyeon, and Sung, Wonyong. X1000 real-time phoneme recognition VLSI using feed-forward deep neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 7510–7514. IEEE, 2014.

Kim, M. and Smaragdis, P. Bitwise neural networks. ArXiv e-prints, January 2016.

Kim, Minje and Smaragdis, Paris. Bitwise neural networks. ICML Workshop on Resource-Efficient Machine Learning, 37, 2015.

Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Krizhevsky, A., Sutskever, I., and Hinton, G. ImageNet classification with deep convolutional neural networks. In NIPS'2012, 2012.

LeCun, Yann, Bottou, Leon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.

Lee, Chen-Yu, Xie, Saining, Gallagher, Patrick, Zhang, Zhengyou, and Tu, Zhuowen. Deeply-supervised nets. arXiv preprint arXiv:1409.5185, 2014.

Lee, Chen-Yu, Gallagher, Patrick W, and Tu, Zhuowen. Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. arXiv preprint arXiv:1509.08985, 2015.

Lin, Min, Chen, Qiang, and Yan, Shuicheng. Network in network. arXiv preprint arXiv:1312.4400, 2013.

Lin, Zhouhan, Courbariaux, Matthieu, Memisevic, Roland, and Bengio, Yoshua. Neural networks with few multiplications. ArXiv e-prints, abs/1510.03009, October 2015.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A., Veness, Joel, Bellemare, Marc G., Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K., Ostrovski, Georg, Petersen, Stig, Beattie, Charles, Sadik, Amir, Antonoglou, Ioannis, King, Helen, Kumaran, Dharshan, Wierstra, Daan, Legg, Shane, and Hassabis, Demis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.

Mordvintsev, Alexander, Olah, Christopher, and Tyka, Mike. Inceptionism: Going deeper into neural networks, 2015. Accessed: 2015-06-30.

Pham, Phi-Hung, Jelaca, Darko, Farabet, Clement, Martini, Berin, LeCun, Yann, and Culurciello, Eugenio. Neuflow: dataflow vision processing system-on-a-chip. In Circuits and Systems (MWSCAS), 2012 IEEE 55th International Midwest Symposium on, pp. 1044–1047. IEEE, 2012.

Romero, Adriana, Ballas, Nicolas, Kahou, Samira Ebrahimi, Chassang, Antoine, Gatta, Carlo, and Bengio, Yoshua. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.

Sainath, Tara, Mohamed, Abdel-rahman, Kingsbury, Brian, and Ramabhadran, Bhuvana. Deep convolutional neural networks for LVCSR. In ICASSP 2013, 2013.

Silver, David, Huang, Aja, Maddison, Chris J., Guez, Arthur, Sifre, Laurent, van den Driessche, George, Schrittwieser, Julian, Antonoglou, Ioannis, Panneershelvam, Veda, Lanctot, Marc, Dieleman, Sander, Grewe, Dominik, Nham, John, Kalchbrenner, Nal, Sutskever, Ilya, Lillicrap, Timothy, Leach, Madeleine, Kavukcuoglu, Koray, Graepel, Thore, and Hassabis, Demis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, Jan 2016.

Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

Soudry, Daniel, Di Castro, Dotan, Gal, Asaf, Kolodny, Avinoam, and Kvatinsky, Shahar. Memristor-based multilayer neural networks with online gradient descent training. IEEE Transactions on Neural Networks and Learning Systems, (10):2408–2421.

Soudry, Daniel, Hubara, Itay, and Meir, Ron. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In NIPS'2014, 2014.

Srivastava, Nitish. Improving neural networks with dropout. Master's thesis, U. Toronto, 2013.

Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.

Sung, Wonyong, Shin, Sungho, and Hwang, Kyuyeon. Resiliency of deep neural networks under quantization. pp. 1–9, 2014.

Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural networks. In NIPS'2014, 2014.

Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. Technical report, arXiv:1409.4842, 2014.

Tang, Yichuan. Deep learning using linear support vector machines. Workshop on Challenges in Representation Learning, ICML, 2013.

Vanhoucke, Vincent, Senior, Andrew, and Mao, Mark Z. Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, 2011.

Wan, Li, Zeiler, Matthew, Zhang, Sixin, LeCun, Yann, and Fergus, Rob. Regularization of neural networks using dropconnect. In ICML'2013, 2013.
Journal of Machine Learning Research 13 (2012) 281-305 Submitted 3/11; Revised 9/11; Published 2/12

Random Search for Hyper-Parameter Optimization

James Bergstra JAMES.BERGSTRA@UMONTREAL.CA
Yoshua Bengio YOSHUA.BENGIO@UMONTREAL.CA
Département d’Informatique et de recherche opérationnelle
Université de Montréal
Montréal, QC, H3C 3J7, Canada

Editor: Leon Bottou

Abstract
Grid search and manual search are the most widely used strategies for hyper-parameter optimiza-
tion. This paper shows empirically and theoretically that randomly chosen trials are more efficient
for hyper-parameter optimization than trials on a grid. Empirical evidence comes from a compar-
ison with a large previous study that used grid search and manual search to configure neural net-
works and deep belief networks. Compared with neural networks configured by a pure grid search,
we find that random search over the same domain is able to find models that are as good or better
within a small fraction of the computation time. Granting random search the same computational
budget, random search finds better models by effectively searching a larger, less promising con-
figuration space. Compared with deep belief networks configured by a thoughtful combination of
manual search and grid search, purely random search over the same 32-dimensional configuration
space found statistically equal performance on four of seven data sets, and superior performance
on one of seven. A Gaussian process analysis of the function from hyper-parameters to validation
set performance reveals that for most data sets only a few of the hyper-parameters really matter,
but that different hyper-parameters are important on different data sets. This phenomenon makes
grid search a poor choice for configuring algorithms for new data sets. Our analysis casts some
light on why recent “High Throughput” methods achieve surprising success—they appear to search
through a large number of hyper-parameters because most hyper-parameters do not matter much.
We anticipate that growing interest in large hierarchical models will place an increasing burden on
techniques for hyper-parameter optimization; this work shows that random search is a natural base-
line against which to judge progress in the development of adaptive (sequential) hyper-parameter
optimization algorithms.
Keywords: global optimization, model selection, neural networks, deep learning, response surface
modeling

1. Introduction

The ultimate objective of a typical learning algorithm A is to find a function f that minimizes some
expected loss L(x; f) over i.i.d. samples x from a natural (ground truth) distribution G_x. A learning
algorithm A is a functional that maps a data set X (train) (a finite set of samples from Gx ) to a function
f . Very often a learning algorithm produces f through the optimization of a training criterion with
respect to a set of parameters θ. However, the learning algorithm itself often has bells and whistles
called hyper-parameters λ, and the actual learning algorithm is the one obtained after choosing
λ, which can be denoted Aλ , and f = Aλ (X (train) ) for a training set X (train) . For example, with a


Gaussian kernel SVM, one has to select a regularization penalty C for the training criterion (which
controls the margin) and the bandwidth σ of the Gaussian kernel, that is, λ = (C, σ).
What we really need in practice is a way to choose λ so as to minimize generalization error E_{x∼G_x}[L(x; A_λ(X^(train)))]. Note that the computation performed by A itself often involves an inner
optimization problem, which is usually iterative and approximate. The problem of identifying a
good value for hyper-parameters λ is called the problem of hyper-parameter optimization. This
paper takes a look at algorithms for this difficult outer-loop optimization problem, which is of great
practical importance in empirical machine learning work:
 
    λ^(∗) = argmin_{λ∈Λ} E_{x∼G_x}[ L(x; A_λ(X^(train))) ].    (1)

In general, we do not have efficient algorithms for performing the optimization implied by Equa-
tion 1. Furthermore, we cannot even evaluate the expectation over the unknown natural distribution
Gx , the value we wish to optimize. Nevertheless, we must carry out this optimization as best we
can. With regards to the expectation over Gx , we will employ the widely used technique of cross-
validation to estimate it. Cross-validation is the technique of replacing the expectation with a mean
over a validation set X^(valid) whose elements are drawn i.i.d. x ∼ G_x. Cross-validation is unbiased
as long as X (valid) is independent of any data used by Aλ (see Bishop, 1995, pp. 32-33). We see in
Equations 2-4 the hyper-parameter optimization problem as it is addressed in practice:
 
    λ^(∗) ≈ argmin_{λ∈Λ} mean_{x∈X^(valid)} L(x; A_λ(X^(train)))    (2)

         ≡ argmin_{λ∈Λ} Ψ(λ)    (3)

         ≈ argmin_{λ∈{λ^(1),...,λ^(S)}} Ψ(λ) ≡ λ̂    (4)

Equation 3 expresses the hyper-parameter optimization problem in terms of a hyper-parameter response function, Ψ. Hyper-parameter optimization is the minimization of Ψ(λ) over λ ∈ Λ. This
function is sometimes called the response surface in the experiment design literature. Different data
sets, tasks, and learning algorithm families give rise to different sets Λ and functions Ψ. Knowing
in general very little about the response surface Ψ or the search space Λ, the dominant strategy for
finding a good λ is to choose some number (S) of trial points {λ(1) ...λ(S) }, to evaluate Ψ(λ) for each
one, and return the λ(i) that worked the best as λ̂. This strategy is made explicit by Equation 4.
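Read operationally, Equations 2–4 are just a loop. The sketch below (ours, with a toy quadratic standing in for the response function Ψ) draws S trial points, scores each on the validation set, and returns the empirical minimizer λ̂:

    import random

    def best_trial(sample_lambda, psi, S=64):
        # Equations 2-4: draw S trial points, evaluate Psi on each, keep the best.
        trials = [sample_lambda() for _ in range(S)]
        scores = [psi(lam) for lam in trials]
        best = min(range(S), key=scores.__getitem__)
        return trials[best], scores[best]

    # Toy stand-in: two hyper-parameters, of which only the first matters.
    sample = lambda: (random.uniform(0.0, 1.0), random.uniform(0.0, 1.0))
    psi = lambda lam: (lam[0] - 0.3) ** 2      # pretend validation loss
    lam_hat, psi_hat = best_trial(sample, psi)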
The critical step in hyper-parameter optimization is to choose the set of trials {λ(1) ...λ(S) }.
The most widely used strategy is a combination of grid search and manual search (e.g., LeCun
et al., 1998b; Larochelle et al., 2007; Hinton, 2010), as well as machine learning software packages
such as libsvm (Chang and Lin, 2001) and scikits.learn.1 If Λ is a set indexed by K configuration
variables (e.g., for neural networks it would be the learning rate, the number of hidden units, the
strength of weight regularization, etc.), then grid search requires that we choose a set of values for
each variable (L^(1), ..., L^(K)). In grid search the set of trials is formed by assembling every possible
combination of values, so the number of trials in a grid search is S = ∏_{k=1}^{K} |L^(k)|. This
product over K sets makes grid search suffer from the curse of dimensionality because the number
of joint values grows exponentially with the number of hyper-parameters (Bellman, 1961). Manual
1. scikits.learn: Machine Learning in Python can be found at http://scikit-learn.sourceforge.net.


search is used to identify regions in Λ that are promising and to develop the intuition necessary to
choose the sets L(k) . A major drawback of manual search is the difficulty in reproducing results.
This is important both for the progress of scientific research in machine learning as well as for ease
of application of learning algorithms by non-expert users. On the other hand, grid search alone does
very poorly in practice (as discussed here). We propose random search as a substitute and baseline
that is both reasonably efficient (roughly equivalent to or better than combining manual search
and grid search, in our experiments) and that keeps the advantages of implementation simplicity and
reproducibility of pure grid search. Random search is actually more practical than grid search
because it can be applied even when using a cluster of computers that can fail, and allows the
experimenter to change the “resolution” on the fly: adding new trials to the set or ignoring failed
trials are both feasible because the trials are i.i.d., which is not the case for a grid search. Of course,
random search can probably be improved by automating what manual search does, i.e., a sequential
optimization, but this is left to future work.
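For illustration, here is how the two trial-generation strategies differ in code (our own toy example; the paper draws from continuous densities over the same ranges a grid would span):

    import itertools, random

    grid_values = {
        "lr":       [0.001, 0.01, 0.1, 1.0],
        "n_hidden": [32, 128, 512],
        "l2":       [0.0, 1e-5, 1e-4],
    }

    # Grid search: S = 4 * 3 * 3 = 36 trials, fixed in advance.
    grid_trials = list(itertools.product(*grid_values.values()))

    # Random search over the same space: any budget; trials are i.i.d., so failed
    # trials can be dropped and new ones added without redesigning the experiment.
    random_trials = [{"lr": 10 ** random.uniform(-3, 0),
                      "n_hidden": random.randint(32, 512),
                      "l2": random.choice([0.0, 1e-5, 1e-4])}
                     for _ in range(36)]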
There are several reasons why manual search and grid search prevail as the state of the art despite
decades of research into global optimization (e.g., Nelder and Mead, 1965; Kirkpatrick et al., 1983;
Powell, 1994; Weise, 2009) and the publishing of several hyper-parameter optimization algorithms
(e.g., Nareyek, 2003; Czogiel et al., 2005; Hutter, 2009):
• Manual optimization gives researchers some degree of insight into Ψ;
• There is no technical overhead or barrier to manual optimization;
• Grid search is simple to implement and parallelization is trivial;
• Grid search (with access to a compute cluster) typically finds a better λ̂ than purely manual
sequential optimization (in the same amount of time);
• Grid search is reliable in low dimensional spaces (e.g., 1-d, 2-d).
We will come back to the use of global optimization algorithms for hyper-parameter selection
in our discussion of future work (Section 6). In this paper, we focus on random search, that is, inde-
pendent draws from a uniform density from the same configuration space as would be spanned by a
regular grid, as an alternative strategy for producing a trial set {λ(1) ...λ(S) }. We show that random
search has all the practical advantages of grid search (conceptual simplicity, ease of implementation,
trivial parallelism) and trades a small reduction in efficiency in low-dimensional spaces for a large
improvement in efficiency in high-dimensional search spaces.
In this work we show that random search is more efficient than grid search in high-dimensional
spaces because functions Ψ of interest have a low effective dimensionality; essentially, Ψ of interest
are more sensitive to changes in some dimensions than others (Caflisch et al., 1997). In particular, if
a function f of two variables could be approximated by another function of one variable (f(x1, x2) ≈ g(x1)), we could say that f has a low effective dimension. Figure 1 illustrates how point grids
and uniformly random point sets differ in how they cope with low effective dimensionality, as in
the above example with f . A grid of points gives even coverage in the original 2-d space, but
projections onto either the x1 or x2 subspace produces an inefficient coverage of the subspace. In
contrast, random points are slightly less evenly distributed in the original space, but far more evenly
distributed in the subspaces.
If the researcher could know ahead of time which subspaces would be important, then he or she
could design an appropriate grid. However, we show the failings of this strategy in Section 2. For a


[Figure 1 panels: Grid Layout vs. Random Layout; in each panel the horizontal axis is the important parameter and the vertical axis the unimportant parameter.]

Figure 1: Grid and random search of nine trials for optimizing a function f(x, y) = g(x) + h(y) ≈
g(x) with low effective dimensionality. Above each square g(x) is shown in green, and
left of each square h(y) is shown in yellow. With grid search, nine trials only test g(x)
in three distinct places. With random search, all nine trials explore distinct values of
g. This failure of grid search is the rule rather than the exception in high dimensional
hyper-parameter optimization.

given learning algorithm, looking at several relatively similar data sets (from different distributions)
reveals that on different data sets, different subspaces are important, and to different degrees. A grid
with sufficient granularity to optimize hyper-parameters for all data sets must consequently be
inefficient for each individual data set because of the curse of dimensionality: the number of wasted
grid search trials is exponential in the number of search dimensions that turn out to be irrelevant for
a particular data set. In contrast, random search thrives on low effective dimensionality. Random
search has the same efficiency in the relevant subspace as if it had been used to search only the
relevant dimensions.

This paper is organized as follows. Section 2 looks at the efficiency of random search in practice
vs. grid search as a method for optimizing neural network hyper-parameters. We take the grid search
experiments of Larochelle et al. (2007) as a point of comparison, and repeat similar experiments
using random search. Section 3 uses Gaussian process regression (GPR) to analyze the results of
the neural network trials. The GPR lets us characterize what Ψ looks like for various data sets,
and establish an empirical link between the low effective dimensionality of Ψ and the efficiency
of random search. Section 4 compares random search and grid search with more sophisticated
point sets developed for Quasi Monte-Carlo numerical integration, and argues that in the regime of
interest for hyper-parameter selection grid search is inappropriate and more sophisticated methods
bring little advantage over random search. Section 5 compares random search with the expert-
guided manual sequential optimization employed in Larochelle et al. (2007) to optimize Deep Belief
Networks. Section 6 comments on the role of global optimization algorithms in future work. We
conclude in Section 7 that random search is generally superior to grid search for optimizing hyper-
parameters.


2. Random vs. Grid for Optimizing Neural Networks


In this section we take a second look at several of the experiments of Larochelle et al. (2007) us-
ing random search, to compare with the grid searches done in that work. We begin with a look
at hyper-parameter optimization in neural networks, and then move on to hyper-parameter opti-
mization in Deep Belief Networks (DBNs). To characterize the efficiency of random search, we
present two techniques in preliminary sections: Section 2.1 explains how we estimate the general-
ization performance of the best model from a set of candidates, taking into account our uncertainty
in which model is actually best; Section 2.2 explains the random experiment efficiency curve that
we use to characterize the performance of random search experiments. With these preliminaries
out of the way, Section 2.3 describes the data sets from Larochelle et al. (2007) that we use in our
work. Section 2.4 presents our results optimizing neural networks, and Section 5 presents our results
optimizing DBNs.

2.1 Estimating Generalization


Because of finite data sets, test error is not monotone in validation error, and depending on the set
of particular hyper-parameter values λ evaluated, the test error of the best-validation error configu-
ration may vary. When reporting performance of learning algorithms, it can be useful to take into
account the uncertainty due to the choice of hyper-parameters values. This section describes our
procedure for estimating test set accuracy, which takes into account any uncertainty in the choice
of which trial is actually the best-performing one. To explain this procedure, we must distinguish
between estimates of performance Ψ(valid) = Ψ and Ψ(test) based on the validation and test sets
respectively:
 
    Ψ^(valid)(λ) = mean_{x∈X^(valid)} L(x; A_λ(X^(train))),
    Ψ^(test)(λ) = mean_{x∈X^(test)} L(x; A_λ(X^(train))).

Likewise, we must define the estimated variance V about these means on the validation and test sets, for example, for the zero-one loss (Bernoulli variance):

    V^(valid)(λ) = Ψ^(valid)(λ) (1 − Ψ^(valid)(λ)) / (|X^(valid)| − 1), and

    V^(test)(λ) = Ψ^(test)(λ) (1 − Ψ^(test)(λ)) / (|X^(test)| − 1).

With other loss functions the estimator of variance will generally be different.
The standard practice for evaluating a model found by cross-validation is to report Ψ(test) (λ(s) )
for the λ(s) that minimizes Ψ(valid) (λ(s) ). However, when different trials have nearly optimal val-
idation means, then it is not clear which test score to report, and a slightly different choice of λ
could have yielded a different test error. To resolve the difficulty of choosing a winner, we report a
weighted average of all the test set scores, in which each one is weighted by the probability that its
particular λ(s) is in fact the best. In this view, the uncertainty arising from X (valid) being a finite sam-
ple of Gx makes the test-set score of the best model among λ(1) , ..., λ(S) a random variable, z. This
score z is modeled by a Gaussian mixture model whose S components have means µ_s = Ψ^(test)(λ^(s)),


variances σ_s² = V^(test)(λ^(s)), and weights w_s defined by

    w_s = P( Z^(s) < Z^(s′), ∀ s′ ≠ s ), where
    Z^(i) ∼ N( Ψ^(valid)(λ^(i)), V^(valid)(λ^(i)) ).

To summarize, the performance z of the best model in an experiment of S trials has mean µ_z and variance σ_z²,

    µ_z = Σ_{s=1}^{S} w_s µ_s, and    (5)

    σ_z² = Σ_{s=1}^{S} w_s (µ_s² + σ_s²) − µ_z².    (6)

It is simple and practical to estimate weights ws by simulation. The procedure for doing so is to
repeatedly draw hypothetical validation scores Z (s) from Normal distributions whose means are the
Ψ(valid) (λ(s) ) and whose variances are the squared standard errors V(valid) (λ(s) ), and to count how
often each trial generates a winning score. Since the test scores of the best validation scores are
typically relatively close, ws need not be estimated very precisely and a few tens of hypothetical
draws suffice.
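The simulation fits in a few lines. This sketch (ours) draws hypothetical validation scores from the stated Normal distributions and counts how often each trial wins; since Ψ here is a loss, the lowest draw wins:

    import numpy as np

    def estimate_weights(valid_means, valid_vars, n_draws=50):
        # valid_means[s] = Psi_valid(lambda_s); valid_vars[s] = its squared
        # standard error. Returns w_s = fraction of simulations in which trial s
        # produced the best (lowest) hypothetical validation score.
        means = np.asarray(valid_means)
        stds = np.sqrt(np.asarray(valid_vars))
        draws = np.random.normal(means, stds, size=(n_draws, means.size))
        wins = np.bincount(draws.argmin(axis=1), minlength=means.size)
        return wins / n_draws

    # Equation 5 then becomes: mu_z = (estimate_weights(vm, vv) * test_means).sum()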
In expectation, this technique for estimating generalization gives a higher estimate than the
traditional technique of reporting the test set error of the best model in validation. The difference is
related to the variance V^(valid) and the density of validation set scores Ψ(λ^(i)) near the best value. To
the extent that V^(valid) casts doubt on which model was best, this technique averages the performance
of the best model together with the performance of models which were not the best. The next section
(Random Experiment Efficiency Curve) illustrates this phenomenon and discusses it in more detail.

2.2 Random Experiment Efficiency Curve


Figure 2 illustrates the results of a random experiment: an experiment of 256 trials training neural
networks to classify the rectangles data set. Since the trials of a random experiment are indepen-
dently identically distributed (i.i.d.), a random search experiment involving S i.i.d. trials can also
be interpreted as N independent experiments of s trials, as long as sN ≤ S. This interpretation al-
lows us to estimate statistics such as the minimum, maximum, median, and quantiles of any random
experiment of size s, where s is a divisor of S.
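This reinterpretation is cheap to compute. A sketch along these lines (ours; it uses the raw best score per sub-experiment rather than the weighted estimate of Equation 5, which the figure uses) recovers the statistics plotted in an efficiency curve:

    import numpy as np

    def efficiency_curve(scores, sizes=(1, 2, 4, 8, 16, 32, 64, 128)):
        # scores: the S i.i.d. trial outcomes of one random experiment.
        scores = np.asarray(scores)
        curve = {}
        for s in sizes:
            n = len(scores) // s                  # N experiments of s trials each
            best = scores[: n * s].reshape(n, s).max(axis=1)
            curve[s] = (best.min(), float(np.median(best)), best.max())
        return curve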
There are two general trends in random experiment efficiency curves, such as the one in Figure 2:
a sharp upward slope of the lower extremes as experiments grow, and a gentle downward slope of
the upper extremes. The sharp upward slope occurs because when we take the maximum over
larger subsets of the S trials, trials with poor performance are rarely the best within their subset. It
is natural that larger experiments find trials with better scores. The shape of this curve indicates
the frequency of good models under random search, and quantifies the relative volumes (in search
space) of the various levels of performance.
The gentle downward slope occurs because as we take the maximum over larger subsets of trials
(in Equation 6), we are less sure about which trial is actually the best. Large experiments average
together good validation trials with unusually high test scores with other good validation trials with
unusually low test scores to arrive at a more accurate estimate of generalization. For example,


[Figure 2 plot: test accuracy (vertical axis, 0.45–0.80) versus experiment size in # trials (horizontal axis, 1–128), for the rectangles images data set.]

Figure 2: A random experiment efficiency curve. The trials of a random experiment are i.i.d, so
an experiment of many trials (here, 256 trials optimizing a neural network to classify the
rectangles basic data set, Section 2.3) can be interpreted as several independent smaller
experiments. For example, at horizontal axis position 8, we consider our 256 trials to
be 32 experiments of 8 trials each. The vertical axis shows the test accuracy of the best
trial(s) from experiments of a given size, as determined by Equation 5. When there are
sufficiently many experiments of a given size (i.e., 10), the distribution of performance
is illustrated by a box plot whose boxed section spans the lower and upper quartiles and
includes a line at the median. The whiskers above and below each boxed section show
the position of the most extreme data point within 1.5 times the inter-quartile range of the
nearest quartile. Data points beyond the whiskers are plotted with ’+’ symbols. When
there are not enough experiments to support a box plot, as occurs here for experiments of
32 trials or more, the best generalization score of each experiment is shown by a scatter
plot. The two thin black lines across the top of the figure mark the upper and lower
boundaries of a 95% confidence interval on the generalization of the best trial overall
(Equation 6).

consider what Figure 2 would look like if the experiment had included a lucky trial whose validation
score were around 77% as usual, but whose test score were 80%. In the box plot for trials of size
1, we would see the top performer scoring 80%. In larger experiments, we would average that 80%
performance together with other test set performances because 77% is not clearly the best validation
score; this averaging would make the upper envelope of the efficiency curve slope downward from
80% to a point very close to the current test set estimate of 76%.
Figure 2 characterizes the range of performance that is to be expected from experiments of vari-
ous sizes, which is valuable information to anyone trying to reproduce these results. For example, if
we try to repeat the experiment and our first four random trials fail to find a score better than 70%,
then the problem is likely not in hyper-parameter selection.


Figure 3: From top to bottom, samples from the mnist rotated, mnist background random, mnist
background images, mnist rotated background images data sets. In all data sets the
task is to identify the digit (0 - 9) and ignore the various distracting factors of variation.

2.3 Data Sets


Following the work of Larochelle et al. (2007) and Vincent et al. (2008), we use a variety of classi-
fication data sets that include many factors of variation.2
The mnist basic data set is a subset of the well-known MNIST handwritten digit data set (LeCun
et al., 1998a). This data set has 28x28 pixel grey-scale images of digits, each belonging to one of ten
classes. We chose a different train/test/validation splitting in order to have faster experiments and see
learning performance differences more clearly. We shuffled the original splits randomly, and used
10 000 training examples, 2000 validation examples, and 50 000 testing examples. These images
are presented as white (1.0-valued) foreground digits against a black (0.0-valued) background.
The mnist background images data set is a variation on mnist basic in which the white fore-
ground digit has been composited on top of a 28x28 natural image patch. Technically this was done
by taking the maximum of the original MNIST image and the patch. Natural image patches with
very low pixel variance were rejected. As with mnist basic there are 10 classes, 10 000 training
examples, 2000 validation examples, and 50 000 test examples.
The mnist background random data set is a similar variation on mnist basic in which the
white foreground digit has been composited on top of random uniform (0,1) pixel values. As with
mnist basic there are 10 classes, 10 000 training examples, 2000 validation examples, and 50 000
test examples.
The mnist rotated data set is a variation on mnist basic in which the images have been rotated
by an amount chosen randomly between 0 and 2π radians. This data set included 10 000 training
examples, 2000 validation examples, and 50 000 test examples.
2. Data sets can be found at http://www.iro.umontreal.ca/˜lisa/twiki/bin/view.cgi/Public/
DeepVsShallowComparisonICML2007.


Figure 4: Top: Samples from the rectangles data set. Middle: Samples from the rectangles images
data set. Bottom: Samples from the convex data set. In rectangles data sets, the image is
formed by overlaying a small rectangle on a background. The task is to label the small
rectangle as being either tall or wide. In convex, the task is to identify whether the set of
white pixels is convex (images 1 and 4) or not convex (images 2 and 3).

The mnist rotated background images data set is a variation on mnist rotated in which the
images have been rotated by an amount chosen randomly between 0 and 2π radians, and then sub-
sequently composited onto natural image patch backgrounds. This data set included 10 000 training
examples, 2000 validation examples, and 50 000 test examples.
The rectangles data set (Figure 4, top) is a simple synthetic data set of outlines of rectangles.
The images are 28x28, the outlines are white (1-valued) and the backgrounds are black (0-valued).
The height and width of the rectangles were sampled uniformly, but when their difference was
smaller than 3 pixels the samples were rejected. The top left corner of the rectangles was also
sampled uniformly, with the constraint that the whole rectangle fits in the image. Each image is
labelled as one of two classes: tall or wide. This task was easier than the MNIST digit classification,
so we only used 1000 training examples, and 200 validation examples, but we still used 50 000
testing examples.
The rectangles images data set (Figure 4, middle) is a variation on rectangles in which the
foreground rectangles were filled with one natural image patch, and composited on top of a different
background natural image patch. The process for sampling rectangle shapes was similar to the one
used for rectangles, except a) the area covered by the rectangles was constrained to be between
25% and 75% of the total image, b) the length and width of the rectangles were forced to be at
least 10 pixels, and c) their difference was forced to be at least 5 pixels. This task was harder
than rectangles, so we used 10000 training examples, 2000 validation examples, and 50 000 testing
examples.
The convex data set (Figure 4, bottom) is a binary image classification task. Each 28x28 image
consists entirely of 1-valued and 0-valued pixels. If the 1-valued pixels form a convex region in
image space, then the image is labelled as being convex, otherwise it is labelled as non-convex. The
convex sets consist of a single convex region with pixels of value 1.0. Candidate convex images
were constructed by taking the intersection of a number of half-planes whose location and orienta-


tion were chosen uniformly at random. The number of intersecting half-planes was also sampled
randomly according to a geometric distribution with parameter 0.195. A candidate convex image
was rejected if there were less than 19 pixels in the convex region. Candidate non-convex images
were constructed by taking the union of a random number of convex sets generated as above, but
with the number of half-planes sampled from a geometric distribution with parameter 0.07 and with
a minimum number of 10 pixels. The number of convex sets was sampled uniformly from 2 to
4. The candidate non-convex images were then tested by checking a convexity condition for every
pair of pixels in the non-convex set. Those sets that failed the convexity test were added to the data
set. The parameters for generating the convex and non-convex sets were balanced to ensure that the
conditional overall pixel mean is the same for both classes.

2.4 Case Study: Neural Networks


In Larochelle et al. (2007), the hyper-parameters of the neural network were optimized by search
over a grid of trials. We describe the hyper-parameter configuration space of our neural network
learning algorithm in terms of the distribution that we will use to randomly sample from that con-
figuration space. The first hyper-parameter in our configuration is the type of data preprocessing:
with equal probability, one of (a) none, (b) normalize (center each feature dimension and divide by
its standard deviation), or (c) PCA (after removing dimension-wise means, examples are projected
onto principal components of the data whose norms have been divided by their eigenvalues). Part
of PCA preprocessing is choosing how many components to keep. We choose a fraction of variance
to keep with a uniform distribution between 0.5 and 1.0. There have been several suggestions for
how the random weights of a neural network should be initialized (we will look at unsupervised
learning pretraining algorithms later in Section 5). We experimented with two distributions and two
scaling heuristics. The possible distributions were (a) uniform on (−1, 1), and (b) unit normal. The
two scaling heuristics were (a) a hyper-parameter multiplier between 0.1 and 10.0 divided by the
square root of the number of inputs (LeCun et al., 1998b), and (b) the square root of 6 divided by
the square root of the number of inputs plus hidden units (Bengio and Glorot, 2010). The weights
themselves were chosen using one of three random seeds to the Mersenne Twister pseudo-random
number generator. In the case of the first heuristic, we chose a multiplier uniformly from the range
(0.2, 2.0). The number of hidden units was drawn geometrically3 from 18 to 1024. We selected
either a sigmoidal or tanh nonlinearity with equal probability. The output weights from hidden units
to prediction units were initialized to zero. The cost function was the mean error over minibatches
of either 20 or 100 (with equal probability) examples at a time: in expectation these give the same
gradient directions, but with more or less variance. The optimization algorithm was stochastic gra-
dient descent with [initial] learning rate ε0 drawn geometrically from 0.001 to 10.0. We offered the
possibility of an annealed learning rate via a time point t0 drawn geometrically from 300 to 30000.
The effective learning rate εt after t minibatch iterations was
    εt = t0 ε0 / max(t, t0) .    (7)
We permitted a minimum of 100 and a maximum of 1000 iterations over the training data, stopping
if ever, at iteration t, the best validation performance was observed before iteration t/2. With 50%
3. We will use the phrase drawn geometrically from A to B for 0 < A < B to mean drawing uniformly in the log domain
between log(A) and log(B), exponentiating to get a number between A and B, and then rounding to the nearest integer.
The phrase drawn exponentially means the same thing but without rounding.


probability, an ℓ2 regularization penalty was applied, whose strength was drawn exponentially from
3.1 × 10^−7 to 3.1 × 10^−5. This sampling process covers roughly the same domain with the same
density as the grid used in Larochelle et al. (2007), except for the optional preprocessing steps. The
grid optimization of Larochelle et al. (2007) did not consider normalizing or keeping only leading
PCA dimensions of the inputs; we compare to random sampling with and without these restrictions.4
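Because this distribution is fully specified, one trial is easy to draw programmatically. Below is a minimal Python sketch (ours, for illustration; the dictionary keys and the concrete seed values are assumptions) that implements the geometric/exponential draws of footnote 3 and the annealed learning rate of Equation 7.

```python
import numpy as np

def geometric_draw(rng, low, high):
    # "Drawn geometrically from A to B": uniform in the log domain, then rounded.
    return int(round(np.exp(rng.uniform(np.log(low), np.log(high)))))

def exponential_draw(rng, low, high):
    # Same, but without rounding.
    return float(np.exp(rng.uniform(np.log(low), np.log(high))))

def sample_configuration(rng):
    return {
        "preprocessing": rng.choice(["none", "normalize", "pca"]),
        "pca_variance_kept": rng.uniform(0.5, 1.0),
        "init_distribution": rng.choice(["uniform", "normal"]),
        "init_multiplier": rng.uniform(0.2, 2.0),   # first scaling heuristic only
        "seed": int(rng.choice([1, 2, 3])),         # one of three seeds (values assumed)
        "n_hidden": geometric_draw(rng, 18, 1024),
        "nonlinearity": rng.choice(["sigmoid", "tanh"]),
        "minibatch_size": int(rng.choice([20, 100])),
        "lr0": exponential_draw(rng, 0.001, 10.0),
        "anneal_t0": geometric_draw(rng, 300, 30000),
        "l2_penalty": 0.0 if rng.random() < 0.5
        else exponential_draw(rng, 3.1e-7, 3.1e-5),
    }

def learning_rate(t, lr0, t0):
    # Equation 7: eps_t = t0 * eps0 / max(t, t0).
    return t0 * lr0 / max(t, t0)
```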
We formed experiments for each data set by drawing S = 256 trials from this distribution. The
results of these experiments are illustrated in Figures 5 and 6. Random sampling of trials is surpris-
ingly effective in these settings. Figure 5 shows that even among the fraction of jobs (71/256) that
used no preprocessing, the random search with 8 trials is better than the grid search employed in
Larochelle et al. (2007).
Typically, the extent of a grid search is determined by a computational budget. Figure 6 shows
what is possible if we use random search in a larger space that requires more trials to explore. The
larger search space includes the possibility of normalizing the input or applying PCA preprocessing.
In the larger space, 32 trials were necessary to consistently outperform grid search rather than 8,
indicating that there are many harmful ways to preprocess the data. However, when we allowed
larger experiments of 64 trials or more, random search found superior results to those found more
quickly within the more restricted search. This tradeoff between exploration and exploitation is
central to the design of an effective random search.
The efficiency curves in Figures 5 and 6 reveal that different data sets give rise to functions Ψ
with different shapes. The mnist basic results converge very rapidly toward what appears to be a
global maximum. The fact that experiments of just 4 or 8 trials often have the same maximum as
much larger experiments indicates that the region of Λ that gives rise to the best performance is
approximately a quarter or an eighth respectively of the entire configuration space. Assuming that
the random search has not missed a tiny region of significantly better performance, we can say that
random search has solved this problem in 4 or 8 guesses. It is hard to imagine any optimization
algorithm doing much better on a non-trivial 7-dimensional function. In contrast the mnist rotated
background images and convex curves show that even with 16 or 32 random trials, there is consid-
erable variation in the generalization of the reportedly best model. This indicates that the Ψ function
in these cases is more peaked, with small regions of good performance.

3. The Low Effective Dimension of Ψ


Section 2 showed that random sampling is more efficient than grid sampling for optimizing func-
tions Ψ corresponding to several neural network families and classification tasks. In this section
we show that indeed Ψ has a low effective dimension, which explains why randomly sampled trials
found better values. One simple way to characterize the shape of a high-dimensional function is
to look at how much it varies in each dimension. Gaussian process regression gives us the statis-
tical machinery to look at Ψ and measure its effective dimensionality (Neal, 1998; Rasmussen and
Williams, 2006).
We estimated the sensitivity of Ψ to each hyper-parameter by fitting a Gaussian process (GP)
with squared exponential kernels to predict Ψ(λ) from λ. The squared exponential kernel (or
Gaussian kernel) measures similarity between two real-valued hyper-parameter values a and b by
exp(−((a − b)/l)²). The positive-valued l governs the sensitivity of the GP to change in this hyper-

4. Source code for the simulations is available at https://github.com/jaberg/hyperopt.


[Figure 5 panels: mnist basic, mnist background images, mnist background random, mnist rotated, mnist rotated background images, convex, rectangles, rectangles images. Vertical axes: accuracy (0.3–1.0); horizontal axes: experiment size (1–32 trials).]
Figure 5: Neural network performance without preprocessing. Random experiment efficiency
curves of a single-layer neural network for eight of the data sets used in Larochelle et al.
(2007), looking only at trials with no preprocessing (7 hyper-parameters to optimize).
The vertical axis is test-set accuracy of the best model by cross-validation, the horizontal
axis is the experiment size (the number of models compared in cross-validation). The
dashed blue line represents grid search accuracy for neural network models based on a
selection by grids averaging 100 trials (Larochelle et al., 2007). Random searches of 8
trials match or outperform grid searches of (on average) 100 trials.

parameter. The kernels defined for each hyper-parameter were combined by multiplication (joint
Gaussian kernel). We fit a GP to samples of Ψ by finding the length scale (l) for each hyper-
parameter that maximized the marginal likelihood. To ensure relevance could be compared between
hyper-parameters, we shifted and scaled each one to the unit interval. For hyper-parameters that
were drawn geometrically or exponentially (e.g., learning rate, number of hidden units), kernel
calculations were based on the logarithm of the effective value.
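For readers who want to reproduce this analysis, a sketch using scikit-learn follows. Two caveats: scikit-learn's RBF kernel is exp(−(a − b)²/(2l²)), which differs from the kernel above only by a constant rescaling of the length scales, and the wrapper below is our own illustration, not the authors' code.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def ard_relevance(X, y, rng):
    # X: (n_trials, n_hyperparams), each column rescaled to [0, 1] (log domain
    # for geometrically/exponentially drawn hyper-parameters); y: Psi(lambda).
    d = X.shape[1]
    init = rng.uniform(0.1, 2.0, size=d)  # random length-scale initialization
    kernel = RBF(length_scale=init, length_scale_bounds=(1e-3, 1e3))
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(X, y)  # maximizes the marginal likelihood over the length scales
    return 1.0 / gp.kernel_.length_scale  # relevance = 1 / length scale
```

To mimic the boxplots of Figure 7, one would call this 50 times on random 80% subsets of the trials and collect the returned relevance vectors.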


[Figure 6 panels: same eight data sets as Figure 5. Vertical axes: accuracy (0.3–1.0); horizontal axes: experiment size (1–64 trials).]

Figure 6: Neural network performance when standard preprocessing algorithms are considered (9
hyper-parameters). Dashed blue line represents grid search accuracy using (on average)
100 trials (Larochelle et al., 2007), in which no preprocessing was done. Often the extent
of a search is determined by a computational budget, and with random search 64 trials are
enough to find better models in a larger, less promising space. Exploring just four PCA
variance levels by grid search would have required 5 times as many (average 500) trials
per data set.

Figure 7 shows the relevance of each component of Λ in modelling Ψ(λ). Finding the length
scales that maximize marginal likelihood is not a convex problem and many local minima exist. To
get a sense of what length scales were supported by the data, we fit each set of samples from Ψ
50 times, resampling different subsets of 80% of the observations every time, and reinitializing the
length scale estimates randomly between 0.1 and 2. Figure 7 reveals two important properties of Ψ
for neural networks that suggest why grid search performs so poorly relative to random experiments:
1. a small fraction of hyper-parameters matter for any one data set, but

2. different hyper-parameters matter on different data sets.

[Figure 7 panels: one per data set, showing boxplots of relevance (1 / length scale) for each hyper-parameter. Legend: h.u. = n. hidden units, a.f. = activation function, w.a. = initial W algo., w.n. = initial W norm, w.p. = weight penalty, l.r. = learning rate, l.a. = learn rate anneal.]

Figure 7: Automatic Relevance Determination (ARD) applied to hyper-parameters of neural net-
work experiments (with raw preprocessing). For each data set, a small number of hyper-
parameters dominate performance, but the relative importance of each hyper-parameter
varies from each data set to the next. Section 2.4 describes the seven hyper-parameters in
each panel. Boxplots are obtained by randomizing the subset of data used to fit the length
scales, and randomizing the length scale initialization. (Best viewed in color.)



Even in this simple 7-d problem, Ψ has a much lower effective dimension of between 1 and 4,
depending on the data set. It would be impossible to cover just these few dimensions with a reli-
able grid however, because different data sets call for grids on different dimensions. The learning
rate is always important, but sometimes the learning rate annealing rate was important (rectangles
images), sometimes the ℓ2 -penalty was important (convex, mnist rotated), sometimes the number
of hidden units was important (rectangles), and so on. While random search optimized these Ψ
functions with 8 to 16 trials, a grid with, say, four values in each of these axes would already require
256 trials, and yet provide no guarantee that Ψ for a new data set would be well optimized.
Figure 7 also allows us to establish a correlation between effective dimensionality and ease of
optimization. The data sets for which the effective dimensionality was lowest (1 or 2) were mnist
basic, mnist background images, mnist background random, and rectangles images. Looking
back at the corresponding efficiency curves (Figure 5) we find that these are also the data sets
whose curves plateau most sharply, indicating that these functions are the easiest to optimize. They
are often optimized reasonably well by just 2 random trials. Looking in Figure 7 at the data sets with
the largest effective dimensionality (3 or 4), we identify convex, mnist rotated, and rectangles. Looking
at their efficiency curves in Figure 5 reveals that they consistently required at least 8 random trials.
This correlation offers another piece of evidence that the effective dimensionality of Ψ is playing a
strong role in determining the difficulty of hyper-parameter optimization.

4. Grid Search and Sets with Low Effective Dimensionality


It is an interesting mathematical challenge to choose a set of trials for sampling functions of un-
known, but low effective dimensionality. We would like it to be true that no matter which dimen-
sions turn out to be important, our trials sample the important dimensions evenly. Sets of points with
this property are well studied in the literature of Quasi-Random methods for numerical integration,
where they are known as low-discrepancy sets because they try to match (minimize discrepancy
with) the uniform distribution. Although there are several formal definitions of low discrepancy,
they all capture the intuition that the points should be roughly equidistant from one another, in order
that there be no “clumps” or “holes” in the point set.
Several procedures for constructing low-discrepancy point sets in multiple dimensions also try
to ensure as much as possible that subspace projections remain low-discrepancy sets in the subspace.
For example, the Sobol (Antonov and Saleev, 1979), Halton (Halton, 1960), and Niederreiter (Brat-
ley et al., 1992) sequences, as well as latin hypercube sampling (McKay et al., 1979) are all more
or less deterministic schemes for getting point sets that are more representative of random uniform
draws than actual random uniform draws. In quasi-Monte Carlo integration, such point sets are
shown to asymptotically minimize the variance of finite integrals faster than true random uniform
samples, but in this section, we will look at these point sets in the setting of relatively small sample
sizes, to see if they can be used for more efficient search than random draws.
Rather than repeat the very computationally expensive experiments conducted in Section 2,
we used an artificial simulation to compare the efficiency of grids, random draws, and the four
low-discrepancy point sets mentioned in the previous paragraph. The artificial search problem was
to find a uniformly randomly placed multi-dimensional target interval, which occupies 1% of the
volume of the unit hyper-cube. We looked at four variants of the search problem, in which the target
was


1. a cube in a 3-dimensional space,

2. a hyper-rectangle in a 3-dimensional space,

3. a hyper-cube in a 5-dimensional space,

4. a hyper-rectangle in a 5-dimensional space.

The shape of the target rectangle in variants (2) and (4) was determined by sampling side lengths
uniformly from the unit interval, and then scaling the rectangle to have a volume of 1%. This
process gave the rectangles a shape that was often wide or tall, much longer along some axes than
others. The position of the target was drawn uniformly among the positions totally inside the unit
hyper-cube. In the case of tall or wide targets (2) and (4), the indicator function [of the target] had
a lower effective dimension than the dimensionality of the overall space because the dimensions in
which the target is elongated can be almost ignored.
The simulation experiment began with the generation of 100 random search problems. Then for
each experiment design method (random, Sobol, latin hypercube, grid) we created experiments of
1, 2, 3, and so on up to 512 trials.5 The Sobol, Niederreiter, and Halton sequences yielded similar
results, so we used the Sobol sequence to represent the performance of these low-discrepancy set
construction methods. There are many possible grid experiments of any size in multiple dimensions
(at least for non-prime experiment sizes). We did not test every possible grid; instead, we tested
every grid with a monotonic resolution. For example, for experiments of size 16 in 5 dimensions
we tried the five grids with resolutions (1, 1, 1, 1, 16), (1, 1, 1, 2, 8), (1, 1, 2, 2, 4), (1, 1, 1, 4,
4), (1, 2, 2, 2, 2); for experiments of some prime size P in 3 dimensions we tried one grid with
resolution (1, 1, P). Since the target intervals were generated in such a way that rectangles identical
up to a permutation of side lengths have equal probability, grids with monotonic resolution are
representative of all grids. The score of an experiment design method for each experiment size was
the fraction of the 100 targets that it found.
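A compact version of this simulation can be written with scipy.stats.qmc; the authors used the GNU Scientific Library, so scipy is our substitute here, and the target generation below covers only the hyper-cube variant (3).

```python
import numpy as np
from scipy.stats import qmc

def hit_fraction(points, lows, highs):
    # Fraction of target boxes [lows, highs] hit by at least one trial point.
    inside = (points[None, :, :] >= lows[:, None, :]) & \
             (points[None, :, :] <= highs[:, None, :])
    return inside.all(axis=2).any(axis=1).mean()

d, n_trials, n_problems = 5, 256, 100
rng = np.random.default_rng(0)

# 100 random hyper-cube targets, each occupying 1% of the unit hyper-cube.
side = 0.01 ** (1.0 / d)
lows = rng.random((n_problems, d)) * (1.0 - side)
highs = lows + side

point_sets = {
    "sobol": qmc.Sobol(d=d, scramble=False).random(n_trials),
    "latin hypercube": qmc.LatinHypercube(d=d, seed=0).random(n_trials),
    "pseudo-random": rng.random((n_trials, d)),
}
for name, pts in point_sets.items():
    print(name, hit_fraction(pts, lows, highs))
```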
To characterize the performance of random search, we used the analytic form of the expectation.
The expected probability of finding the target is 1.0 minus the probability of missing the target
with every single one of T trials in the experiment. If the volume of the target relative to the unit
hypercube is (v/V = 0.01) and there are T trials, then this probability of finding the target is
    1 − (1 − v/V)^T = 1 − 0.99^T .
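This analytic curve is trivial to tabulate; the two-line check below (ours, for illustration) gives the approximate success rates at a few experiment sizes.

```python
import numpy as np

T = np.arange(1, 513)
p_find = 1.0 - 0.99 ** T          # expected success rate of random search, v/V = 0.01
print(p_find[[0, 63, 255, 511]])  # T = 1, 64, 256, 512 -> approx 0.01, 0.47, 0.92, 0.99
```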
Figure 8 illustrates the efficiency of each kind of point set at finding the multidimensional in-
tervals. There were some grids that were best at finding cubes and hyper-cubes in 3-d and 5-d, but
most grids were the worst performers. No grid was competitive with the other methods at finding
the rectangular-shaped intervals, which had low effective dimension (cases 2 and 4; Figure 8, right
panels). Latin hypercubes, commonly used to initialize experiments in Bayesian optimization, were
no more efficient than the expected performance of random search. Interestingly, the Sobol se-
quence was consistently best by a few percentage points. The low-discrepancy property that makes
the Sobol useful in integration helps here, where it has the effect of minimizing the size of holes
where the target might pass undetected. The advantage of the Sobol sequence is most pronounced in
experiments of 100-300 trials, where there are sufficiently many trials for the structure in the Sobol
5. Samples from the Sobol sequence were provided by the GNU Scientific Library (M. Galassi et al., 2009).


Figure 8: The efficiency in simulation of low-discrepancy sequences relative to grid and pseudo-
random experiments. The simulation tested how reliably various experiment design meth-
ods locate a multidimensional interval occupying 1% of a unit hyper-cube. There is one
grey dot in each sub-plot for every grid of every experiment size that has at least two ticks
in each dimension. The black dots indicate near-perfect grids whose finest and coarsest
dimensional resolutions differ by either 0 or 1. Hyper-parameter search is most typi-
cally like the bottom-right scenario. Grid search experiments are inefficient for finding
axis-aligned elongated regions in high dimensions (i.e., bottom-right). Pseudo-random
samples are as efficient as latin hypercube samples, and slightly less efficient than the
Sobol sequence.

sequence to depart significantly from i.i.d. points, but not sufficiently many trials for random search to succeed
with high probability.
A thought experiment gives some intuition for why grid search fails in the case of rectangles.
Long thin rectangles tend to intersect with several points if they intersect with any, reducing the
effective sample size of the search. If the rectangles had been rotated away from the axes used to
build the grid, then depending on the angle the efficiency of grid could approach the efficiency of
random or low-discrepancy trials. More generally, if the target manifold were not systematically
aligned with subsets of trial points, then grid search would be as efficient as the random and quasi-
random searches.


5. Random Search vs. Sequential Manual Optimization


To see how random search compares with a careful combination of grid search and hand-tuning
in the context of a model with many hyper-parameters, we performed experiments with the Deep
Belief Network (DBN) model (Hinton et al., 2006). A DBN is a multi-layer graphical model with
directed and undirected components. It is parameterized similarly to a multilayer neural network for
classification, and it has been argued that pretraining a multilayer neural network by unsupervised
learning as a DBN acts both to regularize the neural network toward better generalization, and to
ease the optimization associated with finetuning the neural network for a classification task (Erhan
et al., 2010).
A DBN classifier has many more hyper-parameters than a neural network. Firstly, there is the
number of units and the parameters of random initialization for each layer. Secondly, there are
hyper-parameters governing the unsupervised pretraining algorithm for each layer. Finally, there
are hyper-parameters governing the global finetuning of the whole model for classification. For the
details of how DBN models are trained (stacking restricted Boltzmann machines trained by con-
trastive divergence), the reader is referred to Larochelle et al. (2007), Hinton et al. (2006) or Bengio
(2009). We evaluated random search by training 1-layer, 2-layer and 3-layer DBNs, sampling from
the following distribution:

• We chose 1, 2, or 3 layers with equal probability.

• For each layer, we chose:

– a number of hidden units (log-uniformly between 128 and 4000),


– a weight initialization heuristic that followed from a distribution (uniform or normal),
a multiplier (uniformly between 0.2 and 2), a decision to divide by the fan-out (true or
false),
– a number of iterations of contrastive divergence to perform for pretraining (log-uniformly
from 1 to 10000),
– whether to treat the real-valued examples used for unsupervised pretraining as Bernoulli
means (from which to draw binary-valued training samples) or as samples themselves
(even though they are not binary),
– an initial learning rate for contrastive divergence (log-uniformly between 0.0001 and
1.0),
– a time point at which to start annealing the contrastive divergence learning rate as in
Equation 7 (log-uniformly from 10 to 10 000).

• There was also the choice of how to preprocess the data. Either we used the raw pixels or
we removed some of the variance using a ZCA transform (in which examples are projected
onto principal components, and then multiplied by the transpose of the principal components
to place them back in the input space).

• If using ZCA preprocessing, we kept an amount of variance drawn uniformly from 0.5 to 1.0.

• We chose to seed our random number generator with one of 2, 3, or 4.

• We chose a learning rate for finetuning of the final classifier log-uniformly from 0.001 to 10.


• We chose an anneal start time for finetuning log-uniformly from 100 to 10000.

• We chose ℓ2 regularization of the weight matrices at each layer during finetuning to be either
0 (with probability 0.5), or log-uniformly from 10^−7 to 10^−4.
This hyper-parameter space includes 8 global hyper-parameters and 8 hyper-parameters for each
layer, for a total of 32 hyper-parameters for 3-layer models.
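As with the neural network case study, this distribution is straightforward to sample programmatically. The sketch below draws one configuration from our reading of the space just described; all key names are illustrative.

```python
import numpy as np

def log_uniform(rng, low, high):
    return float(np.exp(rng.uniform(np.log(low), np.log(high))))

def sample_dbn_configuration(rng):
    config = {
        "n_layers": int(rng.integers(1, 4)),  # 1, 2, or 3 with equal probability
        "preprocessing": rng.choice(["raw", "zca"]),
        "zca_variance_kept": rng.uniform(0.5, 1.0),
        "seed": int(rng.choice([2, 3, 4])),
        "finetune_lr": log_uniform(rng, 0.001, 10.0),
        "finetune_anneal_start": log_uniform(rng, 100, 10000),
        "l2_penalty": 0.0 if rng.random() < 0.5 else log_uniform(rng, 1e-7, 1e-4),
        "layers": [],
    }
    for _ in range(config["n_layers"]):
        config["layers"].append({
            "n_hidden": int(round(log_uniform(rng, 128, 4000))),
            "init_distribution": rng.choice(["uniform", "normal"]),
            "init_multiplier": rng.uniform(0.2, 2.0),
            "divide_by_fan_out": bool(rng.integers(0, 2)),
            "cd_iterations": int(round(log_uniform(rng, 1, 10000))),
            "treat_inputs_as_bernoulli_means": bool(rng.integers(0, 2)),
            "cd_lr": log_uniform(rng, 0.0001, 1.0),
            "cd_anneal_start": log_uniform(rng, 10, 10000),
        })
    return config
```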
A grid search is not practical for the 32-dimensional search problem of DBN model selection,
because even just 2 possible values for each of 32 hyper-parameters would yield more trials than
we could conduct (2^32 > 10^9 trials, and each can take hours). For many of the hyper-parameters,
especially real valued ones, we would really like to try more than two values. The approach taken
in Larochelle et al. (2007) was a combination of manual search, multi-resolution grid search and
coordinate descent. The algorithm (including manual steps) is somewhat elaborate, but sensible,
and we believe that it is representative of how model search is typically done in several research
groups, if not the community at large. Larochelle et al. (2007) describe it as follows:

“The hyper-parameter search procedure we used alternates between fixing a neural net-
work architecture and searching for good optimization hyper-parameters similarly to
coordinate descent. More time would usually be spent on finding good optimization
parameters, given some empirical evidence that we found indicating that the choice of
the optimization hyper-parameters (mostly the learning rates) has much more influence
on the obtained performance than the size of the network. We used the same procedure
to find the hyper-parameters for DBN-1, which are the same as those of DBN-3 except
the second hidden layer and third hidden layer sizes. We also allowed ourselves to
test for much larger first-hidden layer sizes, in order to make the comparison between
DBN-1 and DBN-3 fairer.
“We usually started by testing a relatively small architecture (between 500 and 700
units in the first and second hidden layer, and between 1000 and 2000 hidden units
in the last layer). Given the results obtained on the validation set (compared to those
of NNet for instance) after selecting appropriate optimization parameters, we would
then consider growing the number of units in all layers simultaneously. The biggest
networks we eventually tested had up to 3000, 4000 and 6000 hidden units in the first,
second and third hidden layers respectively.
“As for the optimization hyper-parameters, we would proceed by first trying a few com-
binations of values for the stochastic gradient descent learning rate of the supervised
and unsupervised phases (usually between 0.1 and 0.0001). We then refine the choice of
tested values for these hyper-parameters. The first trials would simply give us a trend on
the validation set error for these parameters (is a change in the hyper-parameter making
things worse or better) and we would then consider that information in selecting ap-
propriate additional trials. One could choose to use learning rate adaptation techniques
(e.g., slowly decreasing the learning rate or using momentum) but we did not find these
techniques to be crucial.”

There was large variation in the number of trials used in Larochelle et al. (2007) to optimize the
DBN-3. One data set (mnist background images) benefited from 102 trials, while another (mnist
background random) needed only 13 because a good result was found more quickly. The average number


[Figure 9 panels: mnist basic, mnist background images, mnist background random, mnist rotated, mnist rotated back. images, convex, rectangles, rectangles images. Vertical axes: accuracy; horizontal axes: experiment size (# trials).]

Figure 9: Deep Belief Network (DBN) performance according to random search. Here random
search is used to explore up to 32 hyper-parameters. Results obtained by grid-assisted
manual search using an average of 41 trials are marked in finely-dashed green (1-layer
DBN) and coarsely-dashed red (3-layer DBN). Random experiments of 128 trials
found an inferior best model for three data sets, a competitive model in four, and a superior
model in one (convex). (Best viewed in color.)

of trials across data sets for the DBN-3 model was 41. In considering the number of trials per data
set, it is important to bear in mind that the experiments on different data sets were not performed
independently. Rather, later experiments benefited from the experience the authors had drawn from
earlier ones. Although grid search was part of the optimization loop, the manual intervention turns
the overall optimization process into something with more resemblance to an adaptive sequential
algorithm.
Random search versions of the DBN experiments from Larochelle et al. (2007) are shown in
Figure 9. In this more challenging optimization problem random search is still effective, but not


superior as it was in the case of neural network optimization. Comparing to the 3-layer DBN
results in Larochelle et al. (2007), random search found a better model than the manual search in
one data set (convex), an equally good model in four (mnist basic, mnist rotated, rectangles, and
rectangles images), and an inferior model in three (mnist background images, mnist background
random, mnist rotated background images). Comparing to the 1-layer DBN results, random
search of the 1-layer, 2-layer and 3-layer configuration space found at least as good a model in all
cases. In comparing these scores, the reader should bear in mind that the scores in the original
experiments were not computed using the same score-averaging technique that we described in
Section 2.1, and our averaging technique is slightly biased toward underestimation. In the DBN
efficiency curves we see that even experiments with larger numbers of trials (64 and larger) feature
significant variability. This indicates that the regions of the search space with the best performance
are small, and randomly chosen i.i.d. trials do not reliably find them.

6. Future Work
Our result on the multidimensional interval task, together with the GP regression characterization of the
shape of Ψ and the computational constraint that hyper-parameter searches typically draw on only a few
hundred trials, suggests that pseudo-random or quasi-random trials are optimal for non-adaptive
hyper-parameter search. There is still work to be done for each model family, to establish how it
should be parametrized for i.i.d. random search to be as reliable as possible, but the most promising
and interesting direction for future work is certainly in adaptive algorithms.
There is a large body of literature on global optimization, a great deal of which bears on the ap-
plication of hyper-parameter optimization. General numeric methods such as simplex optimization
(Nelder and Mead, 1965), constrained optimization by linear approximation (Powell, 1994; Weise,
2009), finite difference stochastic approximation and simultaneous perturbation stochastic approxi-
mation (Kleinman et al., 1999) could be useful, as well as methods for search in discrete spaces
such as simulated annealing (Kirkpatrick et al., 1983) and evolutionary algorithms (Rechenberg,
1973; Hansen et al., 2003). Drew and Homem de Mello (2006) have already proposed an optimization al-
gorithm that identifies effective dimensions, for more efficient search. They present an algorithm
that distinguishes between important and unimportant dimensions: a low-discrepancy point set is
used to choose points in the important dimensions, and unimportant dimensions are “padded” with
thinner coverage and cheaper samples. Their algorithm’s success hinges on the rapid and successful
identification of important dimensions. Sequential model-based optimization methods and partic-
ularly Bayesian optimization methods are perhaps more promising because they offer principled
approaches to weighting the importance of each dimension (Hutter, 2009; Hutter et al., 2011; Srini-
vasan and Ramakrishnan, 2011).
With so many sophisticated algorithms to draw on, it may seem strange that grid search is still
widely used, and, with straight faces, we now suggest using random search instead. We believe the
reason for this state of affairs is a technical one. Manual optimization followed by grid search is
easy to implement: grid search requires very little code infrastructure beyond access to a cluster
of computers. Random search is just as simple to carry out, uses the same tools, and fits in the
same workflow. Adaptive search algorithms on the other hand require more code complexity. They
require client-server architectures in which a master process keeps track of the trials that have com-
pleted, the trials that are in progress, and the trials that were started but failed to complete. Some kind
of shared database and inter-process communication mechanisms are required. Trials in an adaptive


experiment cannot be queued up all at once; the master process must be involved somehow in the
scheduling and timing of jobs on the cluster. These technical hurdles are not easy to jump with the
standard tools of the trade such as MATLAB or Python; significant software engineering is required.
Until that engineering is done and adopted by a community of researchers, progress on the study of
sophisticated hyper-parameter optimization algorithms will be slow.

7. Conclusion
Grid search experiments are common in the literature of empirical machine learning, where they are
used to optimize the hyper-parameters of learning algorithms. It is also common to perform multi-
stage, multi-resolution grid experiments that are more or less automated, because a grid experiment
with a fine-enough resolution for optimization would be prohibitively expensive. We have shown
that random experiments are more efficient than grid experiments for hyper-parameter optimization
in the case of several learning algorithms on several data sets. Our analysis of the hyper-parameter
response surface (Ψ) suggests that random experiments are more efficient because not all hyper-
parameters are equally important to tune. Grid search experiments allocate too many trials to the
exploration of dimensions that do not matter and suffer from poor coverage in dimensions that are
important. Compared with the grid search experiments of Larochelle et al. (2007), random search
found better models in most cases and required less computational time.
Random experiments are also easier to carry out than grid experiments for practical reasons
related to the statistical independence of every trial.

• The experiment can be stopped any time and the trials form a complete experiment.

• If extra computers become available, new trials can be added to an experiment without having
to adjust the grid and commit to a much larger experiment.

• Every trial can be carried out asynchronously.

• If the computer carrying out a trial fails for any reason, its trial can be either abandoned or
restarted without jeopardizing the experiment.

Random search is not incompatible with a controlled experiment. To investigate the effect
of one hyper-parameter of interest X, we recommend random search (instead of grid search) for
optimizing over other hyper-parameters. Choose one set of random values for these remaining
hyper-parameters and use that same set for each value of X.
Random experiments with large numbers of trials also bring attention to the question of how
to measure test error of an experiment when many trials have some claim to being best. When
using a relatively small validation set, the uncertainty involved in selecting the best model by cross-
validation can be larger than the uncertainty in measuring the test set performance of any one model.
It is important to take both of these sources of uncertainty into account when reporting the uncer-
tainty around the best model found by a model search algorithm. This technique is useful to all
experiments (including both random and grid) in which multiple models achieve approximately the
best validation set performance.
Low-discrepancy sequences developed for QMC integration are also good alternatives to grid-
based experiments. In low dimensions (e.g., 1-5) our simulated results suggest that they can hold
some advantage over pseudo-random experiments in terms of search efficiency. However, the trials


of a low-discrepancy experiment are not i.i.d., which makes it inappropriate to analyze performance
with the random efficiency curve. It is also more difficult in practice to conduct a quasi-random
experiment because, as with a grid experiment, the omission of a single point is more damaging than it would be in a random experiment.
Finally, when there are many hyper-parameter dimensions relative to the computational budget for
the experiment, a low-discrepancy trial set is not expected to behave very differently from a pseudo-
random one.
Lastly, the hyper-parameter optimization strategies considered here are non-adaptive: they do
not vary the course of the experiment by considering any results that are already available. Random
search was not generally as good as the sequential combination of manual and grid search from
an expert (Larochelle et al., 2007) in the case of the 32-dimensional search problem of DBN op-
timization, because the efficiency of sequential optimization overcame the inefficiency of the grid
search employed at each step of the procedure. Future work should consider sequential, adaptive
search/optimization algorithms in settings where many hyper-parameters of an expensive function
must be optimized jointly and the effective dimensionality is high. We hope that future work in that
direction will consider random search of the form studied here as a baseline for performance, rather
than grid search.

Acknowledgments

This work was supported by the Natural Sciences and Engineering Research Council of Canada and
Compute Canada, and implemented with Theano (Bergstra et al., 2010).

References
I. A. Antonov and V. M. Saleev. An economic method of computing LPτ -sequences. USSR Compu-
tational Mathematics and Mathematical Physics, 19(1):252–256, 1979.
R. Bellman. Adaptive Control Processes: A Guided Tour. Princeton University Press, New Jersey,
1961.
Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):
1–127, 2009. doi: 10.1561/2200000006.
Y. Bengio and X. Glorot. Understanding the difficulty of training deep feedforward neural networks.
In Y. W. Teh and M. Titterington, editors, Proc. of The Thirteenth International Conference on
Artificial Intelligence and Statistics (AISTATS’10), pages 249–256, 2010.
J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, and Y. Bengio.
Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific
Computing Conference (SciPy), June 2010. Oral.
C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, London, UK, 1995.
P. Bratley, B. L. Fox, and H. Niederreiter. Implementation and tests of low-discrepancy sequences.
Transactions on Modeling and Computer Simulation, (TOMACS), 2(3):195–213, 1992.
R. E. Caflisch, W. Morokoff, and A. Owen. Valuation of mortgage-backed securities using Brownian
bridges to reduce effective dimension, 1997.


C. Chang and C. Lin. LIBSVM: A Library for Support Vector Machines, 2001.

I. Czogiel, K. Luebke, and C. Weihs. Response surface methodology for optimizing hyper parame-
ters. Technical report, Universität Dortmund Fachbereich Statistik, September 2005.

S. S. Drew and T. Homem de Mello. Quasi-Monte Carlo strategies for stochastic optimization. In
Proc. of the 38th Conference on Winter Simulation, pages 774–782, 2006.

D. Erhan, Y. Bengio, A. Courville, P. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised
pre-training help deep learning? Journal of Machine Learning Research, 11:625–660, 2010.

J. H. Halton. On the efficiency of certain quasi-random sequences of points in evaluating multi-
dimensional integrals. Numerische Mathematik, 2:84–90, 1960.

N. Hansen, S. D. Müller, and P. Koumoutsakos. Reducing the time complexity of the derandomized
evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary Computation, 11
(1):1–18, 2003.

G. E. Hinton. A practical guide to training restricted Boltzmann machines. Technical Report 2010-
003, University of Toronto, 2010. version 1.

G. E. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural
Computation, 18:1527–1554, 2006.

F. Hutter. Automated Configuration of Algorithms for Solving Hard Computational Problems. PhD
thesis, University of British Columbia, 2009.

F. Hutter, H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algo-
rithm configuration. In LION-5, 2011. Extended version as UBC Tech report TR-2010-10.

S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science,
220(4598):671–680, 1983.

N. L. Kleinman, J. C. Spall, and D. Q. Naiman. Simulation-based optimization with stochastic ap-
proximation using common random numbers. Management Science, 45(11):1570–1578, Novem-
ber 1999. doi:10.1287/mnsc.45.11.1570.

H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep
architectures on problems with many factors of variation. In Z. Ghahramani, editor, Proceedings
of the Twenty-fourth International Conference on Machine Learning (ICML’07), pages 473–480.
ACM, 2007.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998a.

Y. LeCun, L. Bottou, G. Orr, and K. Muller. Efficient backprop. In G. Orr and K. Muller, editors,
Neural Networks: Tricks of the Trade. Springer, 1998b.

M. Galassi et al. GNU Scientific Library Reference Manual, 3rd edition, 2009.


M. D. McKay, R. J. Beckman, and W. J. Conover. A comparison of three methods for selecting
values of input variables in the analysis of output from a computer code. Technometrics, 21(2):
239–245, May 1979. doi:10.2307/1268522.

A. Nareyek. Choosing search heuristics by non-stationary reinforcement learning. Applied Opti-
mization, 86:523–544, 2003.

R. M. Neal. Assessing relevance determination methods using DELVE. In C. M. Bishop, editor,
Neural Networks and Machine Learning, pages 97–129. Springer-Verlag, 1998.

J. A. Nelder and R. Mead. A simplex method for function minimization. The Computer Journal, 7:
308–313, 1965.

M. J. D. Powell. A direct search optimization method that models the objective and constraint
functions by linear interpolation. Advances in Optimization and Numerical Analysis, pages 51–
67, 1994.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press,
2006.

Ingo Rechenberg. Evolutionsstrategie - Optimierung technischer Systeme nach Prinzipien der biol-
ogischen Evolution. Fommann-Holzboog, Stuttgart, 1973.

A. Srinivasan and G. Ramakrishnan. Parameter screening and optimisation for ILP using designed
experiments. Journal of Machine Learning Research, 12:627–662, February 2011.

P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol. Extracting and composing robust features
with denoising autoencoders. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, Pro-
ceedings of the Twenty-fifth International Conference on Machine Learning (ICML’08), pages
1096–1103. ACM, 2008.

T. Weise. Global Optimization Algorithms - Theory and Application. Self-Published, second edi-
tion, 2009. Online available at http://www.it-weise.de/.

Journal of Machine Learning Research 12 (2011) 2121-2159 Submitted 3/10; Revised 3/11; Published 7/11

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization∗

John Duchi (jduchi@cs.berkeley.edu)
Computer Science Division, University of California, Berkeley, Berkeley, CA 94720 USA

Elad Hazan (ehazan@ie.technion.ac.il)
Technion - Israel Institute of Technology, Technion City, Haifa 32000, Israel

Yoram Singer (singer@google.com)
Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043 USA

Editor: Tong Zhang

Abstract
We present a new family of subgradient methods that dynamically incorporate knowledge of the
geometry of the data observed in earlier iterations to perform more informative gradient-based
learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very
predictive but rarely seen features. Our paradigm stems from recent advances in stochastic op-
timization and online learning which employ proximal functions to control the gradient steps of
the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal func-
tion, which significantly simplifies setting a learning rate and results in regret guarantees that are
provably as good as the best proximal function that can be chosen in hindsight. We give several
efficient algorithms for empirical risk minimization problems with common and important regu-
larization functions and domain constraints. We experimentally study our theoretical analysis and
show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient
algorithms.
Keywords: subgradient methods, adaptivity, online learning, stochastic convex optimization

1. Introduction
In many applications of online and stochastic learning, the input instances are of very high di-
mension, yet within any particular instance only a few features are non-zero. It is often the case,
however, that infrequently occurring features are highly informative and discriminative. The infor-
mativeness of rare features has led practitioners to craft domain-specific feature weightings, such as
TF-IDF (Salton and Buckley, 1988), which pre-emphasize infrequently occurring features. We use
this old idea as a motivation for applying modern learning-theoretic techniques to the problem of
online and stochastic learning, focusing concretely on (sub)gradient methods.

∗. A preliminary version of this work was published in COLT 2010.


Standard stochastic subgradient methods largely follow a predetermined procedural scheme that
is oblivious to the characteristics of the data being observed. In contrast, our algorithms dynamically
incorporate knowledge of the geometry of the data observed in earlier iterations to perform more
informative gradient-based learning. Informally, our procedures give frequently occurring features
very low learning rates and infrequent features high learning rates, where the intuition is that each
time an infrequent feature is seen, the learner should “take notice.” Thus, the adaptation facilitates
finding and identifying very predictive but comparatively rare features.

1.1 The Adaptive Gradient Algorithm


Before introducing our adaptive gradient algorithm, which we term ADAGRAD, we establish no-
tation. Vectors and scalars are lower case italic letters, such as x ∈ X . We denote a sequence of
vectors by subscripts, that is, xt , xt+1 , . . ., and entries of each vector by an additional subscript, for
example, xt, j . The subdifferential set of a function f evaluated at x is denoted ∂ f (x), and a partic-
ular vector in the subdifferential set is denoted by f ′ (x) ∈ ∂ f (x) or gt ∈ ∂ ft (xt ). When a function
is differentiable, we write ∇f(x). We use ⟨x, y⟩ to denote the inner product between x and y. The
Bregman divergence associated with a strongly convex and differentiable function ψ is

    B_ψ(x, y) = ψ(x) − ψ(y) − ⟨∇ψ(y), x − y⟩ .

We also make frequent use of the following two matrices. Let g1:t = [g1 · · · gt ] denote the matrix
obtained by concatenating the subgradient sequence. We denote the ith row of this matrix, which
amounts to the concatenation of the ith component of each subgradient we observe, by g1:t,i . We
also define the outer product matrix G_t = Σ_{τ=1}^{t} g_τ g_τᵀ.
Online learning and stochastic optimization are closely related and basically interchangeable
(Cesa-Bianchi et al., 2004). In order to keep our presentation simple, we confine our discussion and
algorithmic descriptions to the online setting with the regret bound model. In online learning, the
learner repeatedly predicts a point xt ∈ X ⊆ Rd , which often represents a weight vector assigning
importance values to various features. The learner’s goal is to achieve low regret with respect to a
static predictor x∗ in the (closed) convex set X ⊆ Rd (possibly X = Rd ) on a sequence of functions
ft (x), measured as
    R(T) = Σ_{t=1}^{T} f_t(x_t) − inf_{x∈X} Σ_{t=1}^{T} f_t(x) .

At every timestep t, the learner receives the (sub)gradient information gt ∈ ∂ ft (xt ). Standard sub-
gradient algorithms then move the predictor xt in the opposite direction of gt while maintaining
xt+1 ∈ X via the projected gradient update (e.g., Zinkevich, 2003)

    x_{t+1} = Π_X(x_t − η g_t) = argmin_{x∈X} ‖x − (x_t − η g_t)‖₂² .

In contrast, let the Mahalanobis norm ‖·‖_A = √⟨·, A·⟩ and denote the projection of a point y onto X
according to A by Π_X^A(y) = argmin_{x∈X} ‖x − y‖_A = argmin_{x∈X} ⟨x − y, A(x − y)⟩. Using this notation,
our generalization of standard gradient descent employs the update

    x_{t+1} = Π_X^{G_t^{1/2}} ( x_t − η G_t^{−1/2} g_t ) .


The above algorithm is computationally impractical in high dimensions since it requires computa-
tion of the root of the matrix Gt , the outer product matrix. Thus we specialize the update to
    x_{t+1} = Π_X^{diag(G_t)^{1/2}} ( x_t − η diag(G_t)^{−1/2} g_t ) .    (1)

Both the inverse and root of diag(Gt ) can be computed in linear time. Moreover, as we discuss later,
when the gradient vectors are sparse the update above can often be performed in time proportional
to the support of the gradient. We now elaborate and give a more formal discussion of our setting.
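For concreteness, a minimal numpy sketch of the diagonal update (1) follows; this is our illustration, not the authors' code. For an axis-aligned box X the projection under a diagonal Mahalanobis norm decomposes per coordinate and reduces to clipping.

```python
import numpy as np

def adagrad_diagonal_step(x, g, sum_sq, eta, delta=0.0, box=None):
    # sum_sq accumulates the per-coordinate squared gradients, i.e., diag(G_t).
    sum_sq = sum_sq + g * g
    denom = delta + np.sqrt(sum_sq)
    # Coordinates never touched by a gradient are left unchanged (the
    # pseudo-inverse convention when delta = 0).
    step = np.divide(eta * g, denom, out=np.zeros_like(x), where=denom > 0)
    x = x - step
    if box is not None:
        # Projection onto X = {x : |x_i| <= box}: under a diagonal Mahalanobis
        # norm and an axis-aligned box this decomposes into per-coordinate clipping.
        x = np.clip(x, -box, box)
    return x, sum_sq
```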
In this paper we consider several different online learning algorithms and their stochastic convex
optimization counterparts. Formally, we consider online learning with a sequence of composite
functions φt . Each function is of the form φt (x) = ft (x) + ϕ(x) where ft and ϕ are (closed) convex
functions. In the learning settings we study, ft is either an instantaneous loss or a stochastic estimate
of the objective function in an optimization task. The function ϕ serves as a fixed regularization
function and is typically used to control the complexity of x. At each round the algorithm makes a
prediction xt ∈ X and then receives the function ft . We define the regret with respect to the fixed
(optimal) predictor x∗ as
    R_φ(T) ≜ Σ_{t=1}^{T} [φ_t(x_t) − φ_t(x*)] = Σ_{t=1}^{T} [f_t(x_t) + ϕ(x_t) − f_t(x*) − ϕ(x*)] .    (2)

Our goal is to devise algorithms which are guaranteed to suffer asymptotically sub-linear regret,
namely, Rφ (T ) = o(T ).
Our analysis applies to related, yet different, methods for minimizing the regret (2). The first
is Nesterov’s primal-dual subgradient method (2009), and in particular Xiao’s (2010) extension,
regularized dual averaging, and the follow-the-regularized-leader (FTRL) family of algorithms (see
for instance Kalai and Vempala, 2003; Hazan et al., 2006). In the primal-dual subgradient method
the algorithm makes a prediction x_t on round t using the average gradient ḡ_t = (1/t) Σ_{τ=1}^{t} g_τ. The update
encompasses a trade-off between a gradient-dependent linear term, the regularizer ϕ, and a strongly-
convex term ψt for well-conditioned predictions. Here ψt is the proximal term. The update amounts
to solving

    x_{t+1} = argmin_{x∈X} { η ⟨ḡ_t, x⟩ + η ϕ(x) + (1/t) ψ_t(x) } ,    (3)
where η is a fixed step-size and x1 = argminx∈X ϕ(x). The second method similarly has numer-
ous names, including proximal gradient, forward-backward splitting, and composite mirror descent
(Tseng, 2008; Duchi et al., 2010). We use the term composite mirror descent. The composite mirror
descent method employs a more immediate trade-off between the current gradient gt , ϕ, and staying
close to xt using the proximal function ψ,

    x_{t+1} = argmin_{x∈X} { η ⟨g_t, x⟩ + η ϕ(x) + B_{ψ_t}(x, x_t) } .    (4)

Our work focuses on temporal adaptation of the proximal function in a data-driven way, while
previous work simply sets ψ_t ≡ ψ, ψ_t(·) = √t ψ(·), or ψ_t(·) = t ψ(·) for some fixed ψ.
We provide formal analyses equally applicable to the above two updates and show how to au-
tomatically choose the function ψt so as to achieve asymptotically small regret. We describe and
analyze two algorithms. Both algorithms use squared Mahalanobis norms as their proximal func-
tions, setting ψ_t(x) = ⟨x, H_t x⟩ for a symmetric matrix H_t ⪰ 0. The first uses diagonal matrices while


the second constructs full dimensional matrices. Concretely, for some small fixed δ ≥ 0 (specified
later, though in practice δ can be set to 0) we set
    H_t = δI + diag(G_t)^{1/2}  (Diagonal)    and    H_t = δI + G_t^{1/2}  (Full) .    (5)
Plugging the appropriate matrix from the above equation into ψt in (3) or (4) gives rise to our
ADAGRAD family of algorithms. Informally, we obtain algorithms which are similar to second-
order gradient descent by constructing approximations to the Hessian of the functions ft , though we
use roots of the matrices.
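A short numpy sketch of the two choices in (5) follows, with the full-matrix square root taken via an eigendecomposition of the positive semidefinite matrix G_t; the helper is ours, for illustration.

```python
import numpy as np

def proximal_matrices(grads, delta=0.0):
    # G_t = sum of outer products of the observed subgradients g_1 .. g_t.
    G = sum(np.outer(g, g) for g in grads)
    d = G.shape[0]
    # Diagonal variant: H_t = delta*I + diag(G_t)^{1/2}.
    H_diag = delta * np.eye(d) + np.diag(np.sqrt(np.diag(G)))
    # Full variant: H_t = delta*I + G_t^{1/2}, via eigendecomposition (G_t is PSD).
    w, V = np.linalg.eigh(G)
    H_full = delta * np.eye(d) + (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T
    return H_diag, H_full
```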

1.2 Outline of Results


We now outline our results, deferring formal statements of the theorems to later sections. Recall the
definitions of g1:t as the matrix of concatenated subgradients and Gt as the outer product matrix in
the prequel. The ADAGRAD algorithm with full matrix divergences entertains bounds of the form
   
    R_φ(T) = O( ‖x*‖₂ tr(G_T^{1/2}) )    and    R_φ(T) = O( max_{t≤T} ‖x_t − x*‖₂ tr(G_T^{1/2}) ) .

We further show that


    tr(G_T^{1/2}) = d^{1/2} √( inf_S { Σ_{t=1}^{T} ⟨g_t, S^{−1} g_t⟩ : S ⪰ 0, tr(S) ≤ d } ) .

These results are formally given in Theorem 7 and its corollaries. When our proximal function
ψ_t(x) = ⟨x, diag(G_t)^{1/2} x⟩ we have bounds attainable in time at most linear in the dimension d of
our problems of the form

    R_φ(T) = O( ‖x*‖_∞ Σ_{i=1}^{d} ‖g_{1:T,i}‖₂ )    and    R_φ(T) = O( max_{t≤T} ‖x_t − x*‖_∞ Σ_{i=1}^{d} ‖g_{1:T,i}‖₂ ) .

Similar to the above, we will show that


    Σ_{i=1}^{d} ‖g_{1:T,i}‖₂ = d^{1/2} √( inf_s { Σ_{t=1}^{T} ⟨g_t, diag(s)^{−1} g_t⟩ : s ⪰ 0, ⟨1, s⟩ ≤ d } ) .

We formally state the above two regret bounds in Theorem 5 and its corollaries.
Following are a simple example and corollary to Theorem 5 to illustrate one regime in which
we expect substantial improvements (see also the next subsection). Let ϕ ≡ 0 and consider Zinke-
vich’s online gradient descent algorithm. Given a compact convex set X ⊆ Rd and sequence
of convex functions f_t, Zinkevich's algorithm makes the sequence of predictions x₁, . . . , x_T with
x_{t+1} = Π_X(x_t − (η/√t) g_t). If the diameter of X is bounded, thus sup_{x,y∈X} ‖x − y‖₂ ≤ D₂, then on-
line gradient descent, with the optimal choice in hindsight for the stepsize η (see the bound (7) in
Section 1.4), achieves a regret bound of
    Σ_{t=1}^{T} f_t(x_t) − inf_{x∈X} Σ_{t=1}^{T} f_t(x) ≤ √2 D₂ √( Σ_{t=1}^{T} ‖g_t‖₂² ) .    (6)

When X is bounded via sup_{x,y∈X} ‖x − y‖_∞ ≤ D_∞, the following corollary is a simple consequence of
our Theorem 5.


Corollary 1 Let the sequence {x_t} ⊂ R^d be generated by the update (4) and assume that
max_t ‖x* − x_t‖_∞ ≤ D_∞. Using stepsize η = D_∞/√2, for any x*, the following bound holds.

    R_φ(T) ≤ √(2d) D_∞ √( inf_{s⪰0, ⟨1,s⟩≤d} Σ_{t=1}^{T} ‖g_t‖²_{diag(s)^{−1}} ) = √2 D_∞ Σ_{i=1}^{d} ‖g_{1:T,i}‖₂ .

The important feature of the bound above is the infimum under the square root, which allows us to
perform better than simply using the identity matrix, and the fact that the stepsize is easy to set a
priori. For example, if the set X = {x : ‖x‖_∞ ≤ 1}, then D₂ = 2√d while D_∞ = 2, which suggests that
if we are learning a dense predictor over a box, the adaptive method should perform well. Indeed,
in this case we are guaranteed that the bound in Corollary 1 is better than (6) as the identity matrix
belongs to the set over which we take the infimum.
To conclude the outline of results, we would like to point to two relevant research papers. First,
Zinkevich’s regret bound is tight and cannot be improved in a minimax sense (Abernethy et al.,
2008). Therefore, improving the regret bound requires further reasonable assumptions on the input
space. Second, in an independent work, performed concurrently with the research presented in this
paper, McMahan and Streeter (2010) study competitive ratios, showing guaranteed improvements
of the above bounds relative to families of online algorithms.

1.3 Improvements and Motivating Example


As mentioned in the prequel, we expect our adaptive methods to outperform standard online learning
methods when the gradient vectors are sparse. We give empirical evidence supporting the improved
performance of the adaptive methods in Section 6. Here we give a few abstract examples that show
that for sparse data (input sequences where gt has many zeros) the adaptive methods herein have
better performance than non-adaptive methods. In our examples we use the hinge loss, that is,

  f_t(x) = [1 − y_t ⟨z_t, x⟩]_+ ,

where yt is the label of example t and zt ∈ Rd is the data vector.


For our first example, which was also given by McMahan and Streeter (2010), consider the
following sparse random data scenario, where the vectors z_t ∈ {−1, 0, 1}^d. Assume that in each
round t, feature i appears with probability p_i = min{1, c i^{−α}} for some α ∈ (1, ∞) and a dimension-independent
constant c. Then taking the expectation of the gradient terms in the bound in Corollary 1, we have

  E[ Σ_{i=1}^d ‖g_{1:T,i}‖_2 ] = Σ_{i=1}^d E[ √(|{t : |g_{t,i}| = 1}|) ] ≤ Σ_{i=1}^d √( E|{t : |g_{t,i}| = 1}| ) = Σ_{i=1}^d √(p_i T)

by Jensen's inequality. In the rightmost sum, we have c Σ_{i=1}^d i^{−α/2} = O(log d) for α ≥ 2, and
Σ_{i=1}^d i^{−α/2} = O(d^{1−α/2}) for α ∈ (1, 2). If the domain X is a hypercube, say X = {x : ‖x‖_∞ ≤ 1}, then
in Corollary 1 D_∞ = 2, and the regret of ADAGRAD is O(max{log d, d^{1−α/2}} √T). For contrast, the
standard regret bound (6) for online gradient descent has D_2 = 2√d and ‖g_t‖_2² ≥ 1, yielding best-case
regret O(√(dT)). So we see that in this sparse yet heavy-tailed feature setting, ADAGRAD's regret
guarantee can be exponentially smaller in the dimension d than the non-adaptive regret bound.
Our remaining examples construct a sparse sequence for which there is a perfect predictor that
the adaptive methods learn after d iterations, while standard online gradient descent (Zinkevich,
2003) suffers significantly higher loss. We assume the domain X is compact, so that for online
gradient descent we set η_t = η/√t, which gives the optimal O(√T) regret (the setting of η does not
matter to the adversary we construct).

1.3.1 Diagonal Adaptation


Consider the diagonal version of our proposed update (4) with X = {x : ‖x‖_∞ ≤ 1}. Evidently,
we can take D_∞ = 2, and this choice simply results in the update x_{t+1} = x_t − √2 diag(G_t)^{−1/2} g_t
followed by projection (1) onto X for ADAGRAD (we use a pseudo-inverse if the inverse does not
exist). Let e_i denote the ith unit basis vector, and assume that for each t, z_t = ±e_i for some i. Also
let y_t = sign(⟨1, z_t⟩) so that there exists a perfect classifier x* = 1 ∈ X ⊂ R^d. We initialize x_1 to be
the zero vector. Fix some ε > 0, and on rounds t = 1, …, η²/ε², set z_t = e_1. After these
rounds, simply choose z_t = ±e_i for an index i ∈ {2, …, d} chosen at random. It is clear that the update
to parameter x_i at these iterations differs between the two algorithms, and amounts to

  x_{t+1} = x_t + e_i   (ADAGRAD)      x_{t+1} = [ x_t + (η/√t) e_i ]_{[−1,1]^d}   (Gradient Descent) .

(Here [·]_{[−1,1]^d} denotes the truncation of the vector to [−1, 1]^d.) In particular, after suffering d − 1
more losses, ADAGRAD has a perfect classifier. However, on the remaining iterations gradient
descent has η/√t ≤ ε and thus evidently suffers loss at least d/(2ε). Of course, for small ε, we
have d/(2ε) ≫ d. In short, ADAGRAD achieves constant regret per dimension while online gradient
descent can suffer arbitrary loss (for unbounded t). It seems quite silly, then, to use a global learning
rate rather than one for each feature.
Full Matrix Adaptation. We use a similar construction to the diagonal case to show a situation
in which the full matrix update from (5) gives substantially lower regret than stochastic gradient
descent. For full divergences we set X = {x : ‖x‖_2 ≤ √d}. Let V = [v_1 ⋯ v_d] ∈ R^{d×d} be an
orthonormal matrix. Instead of having z_t cycle through the unit vectors, we make z_t cycle through
the v_i so that z_t = ±v_i. We let the label y_t = sign(⟨1, V^⊤ z_t⟩) = sign(Σ_{i=1}^d ⟨v_i, z_t⟩). We provide an
elaborated explanation in Appendix A. Intuitively, with ψ_t(x) = ⟨x, H_t x⟩ and H_t set to be the full
matrix from (5), ADAGRAD again needs to observe each orthonormal vector v_i only once while
stochastic gradient descent's loss can be made Ω(d/ε) for any ε > 0.

1.4 Related Work


Many successful algorithms have been developed over the past few years to minimize regret in
the online learning setting. A modern view of these algorithms casts the problem as the task of
following the (regularized) leader (see Rakhlin, 2009, and the references therein), or FTRL for short.
Informally, FTRL methods choose the best decision in hindsight at every iteration. Verbatim usage
of the FTRL approach fails to achieve low regret; however, adding a proximal¹ term to the past
predictions leads to numerous low-regret algorithms (Kalai and Vempala, 2003; Hazan and Kale,
2008; Rakhlin, 2009). The proximal term strongly affects the performance of the learning algorithm.
Therefore, adapting the proximal function to the characteristics of the problem at hand is desirable.
Our approach is thus motivated by two goals. The first is to generalize the agnostic online learning
paradigm to the meta-task of specializing an algorithm to fit a particular data set. Specifically,
1. The proximal term is also referred to as regularization in the online learning literature. We use the phrase proximal
term in order to avoid confusion with the statistical regularization function ϕ.


we change the proximal function to achieve performance guarantees which are competitive with the
best proximal term found in hindsight. The second, as alluded to earlier, is to automatically adjust
the learning rates for online learning and stochastic gradient descent on a per-feature basis. The
latter can be very useful when our gradient vectors gt are sparse, for example, in a classification
setting where examples may have only a small number of non-zero features. As we demonstrated
in the examples above, it is rather deficient to employ exactly the same learning rate for a feature
seen hundreds of times and for a feature seen only once or twice.
Our techniques stem from a variety of research directions, and as a byproduct we also extend a
few well-known algorithms. In particular, we consider variants of the follow-the-regularized-leader
(FTRL) algorithms mentioned above, which are kin to Zinkevich's lazy projection algorithm. We
use Xiao's recently analyzed regularized dual averaging (RDA) algorithm (2010), which builds upon
Nesterov's (2009) primal-dual subgradient method. We also consider forward-backward splitting
(FOBOS) (Duchi and Singer, 2009) and its composite mirror-descent (proximal gradient) generalizations
(Tseng, 2008; Duchi et al., 2010), which in turn include as special cases projected gradients
(Zinkevich, 2003) and mirror descent (Nemirovski and Yudin, 1983; Beck and Teboulle, 2003). Recent
work by several authors (Nemirovski et al., 2009; Juditsky et al., 2008; Lan, 2010; Xiao, 2010)
considered efficient and robust methods for stochastic optimization, especially in the case when the
expected objective f is smooth. It may be interesting to investigate adaptive metric approaches in
smooth stochastic optimization.
The idea of adapting first order optimization methods is by no means new and can be traced
back at least to the 1970s with the work on space dilation methods of Shor (1972) and variable
metric methods, such as the BFGS family of algorithms (e.g., Fletcher, 1970). This prior work
often assumed that the function to be minimized was differentiable and, to our knowledge, did not
consider stochastic, online, or composite optimization. In her thesis, Nedić (2002) studied variable
metric subgradient methods, though it seems difficult to derive explicit rates of convergence from the
results there, and the algorithms apply only when the constraint set X = Rd . More recently, Bordes
et al. (2009) proposed a Quasi-Newton stochastic gradient-descent procedure, which is similar in
spirit to our methods. However, their convergence results assume a smooth objective with positive
definite Hessian bounded away from 0. Our results apply more generally.
Prior to the analysis presented in this paper for online and stochastic optimization, the strongly
convex function ψ in the update equations (3) and (4) either remained intact or was simply multiplied
by a time-dependent scalar throughout the run of the algorithm. Zinkevich's projected gradient,
for example, uses ψ_t(x) = ‖x‖_2², while RDA (Xiao, 2010) employs ψ_t(x) = √t ψ(x) where ψ is a
strongly convex function. The bounds for both types of algorithms are similar, and both rely on the
norm ‖·‖ (and its associated dual ‖·‖_*) with respect to which ψ is strongly convex. Mirror-descent-type
first order algorithms, such as projected gradient methods, attain regret bounds of the form
(Zinkevich, 2003; Bartlett et al., 2007; Duchi et al., 2010)

  R_φ(T) ≤ (1/η) B_ψ(x*, x_1) + (η/2) Σ_{t=1}^T ‖f′_t(x_t)‖_*² .   (7)

Choosing η ∝ 1/√T gives R_φ(T) = O(√T). When B_ψ(x, x*) is bounded for all x ∈ X, we choose
step sizes η_t ∝ 1/√t, which is equivalent to setting ψ_t(x) = √t ψ(x); therefore, no assumption on
the time horizon is necessary. For RDA and follow-the-leader algorithms, the bounds are similar


(Xiao, 2010, Theorem 3):

  R_φ(T) ≤ √T ψ(x*) + (1/(2√T)) Σ_{t=1}^T ‖f′_t(x_t)‖_*² .   (8)

The problem of adapting to data and obtaining tighter data-dependent bounds for algorithms
such as those above is a natural one and has been studied in the mistake-bound setting for online
learning in the past. A framework that is somewhat related to ours is the confidence weighted
learning scheme by Crammer et al. (2008) and the adaptive regularization of weights algorithm
(AROW) of Crammer et al. (2009). These papers provide mistake-bound analyses for second-
order algorithms, which in turn are similar in spirit to the second-order Perceptron algorithm (Cesa-
Bianchi et al., 2005). The analyses by Crammer and colleagues, however, yield mistake bounds
dependent on the runs of the individual algorithms and are thus difficult to compare with our regret
bounds.
AROW maintains a mean prediction vector µ_t ∈ R^d and a covariance matrix Σ_t ∈ R^{d×d} over µ_t
as well. At every step of the algorithm, the learner receives a pair (z_t, y_t) where z_t ∈ R^d is the tth
example and y_t ∈ {−1, +1} is the label. Whenever the predictor µ_t attains a margin value smaller
than 1, AROW performs the update

  β_t = 1/(⟨z_t, Σ_t z_t⟩ + λ) ,   α_t = [1 − y_t ⟨z_t, µ_t⟩]_+ ,
  µ_{t+1} = µ_t + α_t Σ_t y_t z_t ,   Σ_{t+1} = Σ_t − β_t Σ_t z_t z_t^⊤ Σ_t .   (9)

In the above scheme, one can force Σ_t to be diagonal, which reduces the run-time and storage
requirements of the algorithm but still gives good performance (Crammer et al., 2009). In contrast
to AROW, the ADAGRAD algorithm uses the root of the inverse covariance matrix, a consequence of
our formal analysis. Crammer et al.'s algorithm and our algorithms have similar run times, generally
linear in the dimension d, when using diagonal matrices. However, when using full matrices the
runtime of the AROW algorithm is O(d²), which is faster than ours, as our algorithm requires
computing the root of a matrix.
In concurrent work, McMahan and Streeter (2010) propose and analyze an algorithm which
is very similar to some of the algorithms presented in this paper. Our analysis builds on recent
advances in online learning and stochastic optimization (Duchi et al., 2010; Xiao, 2010), whereas
McMahan and Streeter work from first principles to derive their regret bounds. As a consequence of our
approach, we are able to apply our analysis to algorithms for composite minimization with a known
additional objective term ϕ. We are also able to generalize and analyze both the mirror descent and
dual-averaging families of algorithms. McMahan and Streeter focus on what they term the competitive
ratio, which is the ratio of the worst case regret of the adaptive algorithm to the worst case
regret of a non-adaptive algorithm with the best proximal term ψ chosen in hindsight. We touch on
this issue briefly in the sequel, but refer the interested reader to McMahan and Streeter (2010) for
this alternative, elegant perspective. We believe that the two analyses shed light on the problems
studied in this paper and complement each other.
There are also other lines of work on adaptive gradient methods that are not directly related to
our work but nonetheless relevant. Tighter regret bounds using the variation of the cost functions ft
were proposed by Cesa-Bianchi et al. (2007) and derived by Hazan and Kale (2008). Bartlett et al.
(2007) explore another adaptation technique for ηt where they adapt the step size to accommodate


both strongly and weakly convex functions. Our approach differs from previous approaches as it
does not focus on a particular loss function or mistake bound. Instead, we view the problem of
adapting the proximal function as a meta-learning problem. We then obtain a bound comparable to
the bound obtained using the best proximal function chosen in hindsight.

2. Adaptive Proximal Functions


Examining the bounds (7) and (8), we see that most of the regret depends on dual norms of f′_t(x_t),
and the dual norms in turn depend on the choice of ψ. This naturally leads to the question of whether
we can modify the proximal term ψ along the run of the algorithm in order to lower the contribution
of the aforementioned norms. We achieve this goal by keeping second order information about the
sequence f_t and allowing ψ to vary on each round of the algorithms.
We begin by providing two propositions based on previous work that give the regret of our base
algorithms when the proximal function ψ_t is allowed to change. These propositions are used in
the sequel in our regret analysis. We assume that ψ_t is monotonically non-decreasing, that is,
ψ_{t+1}(x) ≥ ψ_t(x). We also assume that ψ_t is 1-strongly convex with respect to a time-dependent
semi-norm ‖·‖_{ψ_t}. Formally, ψ is 1-strongly convex with respect to ‖·‖_ψ if

  ψ(y) ≥ ψ(x) + ⟨∇ψ(x), y − x⟩ + ½ ‖x − y‖_ψ² .

Strong convexity is guaranteed if and only if B_{ψ_t}(x, y) ≥ ½ ‖x − y‖²_{ψ_t}. We also denote the dual norm
of ‖·‖_{ψ_t} by ‖·‖_{ψ_t*}. For completeness, we provide the proofs of the following two results in Appendix F,
as they build straightforwardly on work by Duchi et al. (2010) and Xiao (2010). For the primal-dual
subgradient update, the following bound holds.

Proposition 2 Let the sequence {x_t} be defined by the update (3). For any x* ∈ X,

  Σ_{t=1}^T [ f_t(x_t) + ϕ(x_t) − f_t(x*) − ϕ(x*) ] ≤ (1/η) ψ_T(x*) + (η/2) Σ_{t=1}^T ‖f′_t(x_t)‖²_{ψ*_{t−1}} .   (10)

For composite mirror descent algorithms a similar result holds.

Proposition 3 Let the sequence {x_t} be defined by the update (4). Assume w.l.o.g. that ϕ(x_1) = 0.
For any x* ∈ X,

  Σ_{t=1}^T [ f_t(x_t) + ϕ(x_t) − f_t(x*) − ϕ(x*) ]
    ≤ (1/η) B_{ψ_1}(x*, x_1) + (1/η) Σ_{t=1}^{T−1} [ B_{ψ_{t+1}}(x*, x_{t+1}) − B_{ψ_t}(x*, x_{t+1}) ] + (η/2) Σ_{t=1}^T ‖f′_t(x_t)‖²_{ψ*_t} .   (11)

The above propositions allow us to prove regret bounds for a family of algorithms that iteratively
modify the proximal functions ψ_t in an attempt to lower the regret bounds.


INPUT: η > 0, δ ≥ 0
VARIABLES: s ∈ R^d, H ∈ R^{d×d}, g_{1:t,i} ∈ R^t for i ∈ {1, …, d}
INITIALIZE: x_1 = 0, g_{1:0} = []
FOR t = 1 to T:
  Suffer loss f_t(x_t)
  Receive subgradient g_t ∈ ∂f_t(x_t) of f_t at x_t
  UPDATE g_{1:t} = [g_{1:t−1} g_t], s_{t,i} = ‖g_{1:t,i}‖_2
  SET H_t = δI + diag(s_t), ψ_t(x) = ½⟨x, H_t x⟩
  Primal-Dual Subgradient Update (3):
    x_{t+1} = argmin_{x∈X} { η⟨(1/t) Σ_{τ=1}^t g_τ, x⟩ + ηϕ(x) + (1/t) ψ_t(x) }
  Composite Mirror Descent Update (4):
    x_{t+1} = argmin_{x∈X} { η⟨g_t, x⟩ + ηϕ(x) + B_{ψ_t}(x, x_t) }

Figure 1: ADAGRAD with diagonal matrices
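For concreteness, the following minimal NumPy sketch implements the composite mirror-descent
branch of Algorithm 1 in the special case X = R^d and ϕ ≡ 0, where the argmin has the closed form
x_{t+1} = x_t − η H_t^{−1} g_t (function and variable names are ours; this is an illustration, not the
authors' code):

    import numpy as np

    def adagrad_diagonal(subgradient, x0, eta=1.0, delta=1e-8, T=1000):
        """Diagonal ADAGRAD, composite mirror-descent update of Algorithm 1,
        specialized to X = R^d and phi = 0 so the update is in closed form."""
        x = np.asarray(x0, dtype=float).copy()
        gsq = np.zeros_like(x)                  # ||g_{1:t,i}||_2^2 per coordinate
        for t in range(1, T + 1):
            g = subgradient(x, t)               # g_t in the subdifferential of f_t
            gsq += g * g
            H = delta + np.sqrt(gsq)            # diagonal of H_t = delta*I + diag(s_t)
            x -= eta * g / H                    # x_{t+1} = x_t - eta * H_t^{-1} g_t
        return x

    # usage sketch: stochastic least squares, f_t(x) = 0.5 (<a_t, x> - b_t)^2
    rng = np.random.default_rng(0)
    x_star = rng.normal(size=10)
    def sg(x, t):
        a = rng.normal(size=10)
        return (a @ x - a @ x_star) * a
    x_hat = adagrad_diagonal(sg, np.zeros(10), eta=0.5, T=20000)

Constrained domains or a non-zero ϕ replace the last line of the loop with the corresponding
proximal step from Section 5.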

3. Diagonal Matrix Proximal Functions


We begin by restricting ourselves to using diagonal matrices to define matrix proximal functions
and (semi)norms. This restriction serves a two-fold purpose. First, the analysis for the general case
is somewhat complicated and thus the analysis of the diagonal restriction serves as a proxy for better
understanding. Second, in problems with high dimension where we expect this type of modification
to help, maintaining more complicated proximal functions is likely to be prohibitively expensive.
Whereas earlier analysis requires a learning rate to slow changes between predictors xt and xt+1 , we
will instead automatically grow the proximal function we use to achieve asymptotically low regret.
To remind the reader, g_{1:t,i} is the ith row of the matrix obtained by concatenating the subgradients
from iteration 1 through t in the online algorithm.
To provide some intuition for the algorithm we show in Algorithm 1, let us examine the problem

  min_s Σ_{t=1}^T Σ_{i=1}^d g²_{t,i}/s_i   s.t. s ⪰ 0, ⟨1, s⟩ ≤ c .

This problem is solved by setting s_i = ‖g_{1:T,i}‖_2 and scaling s so that ⟨s, 1⟩ = c. To see this, we can
write the Lagrangian of the minimization problem by introducing multipliers λ ⪰ 0 and θ ≥ 0 to get

  L(s, λ, θ) = Σ_{i=1}^d ‖g_{1:T,i}‖_2²/s_i − ⟨λ, s⟩ + θ(⟨1, s⟩ − c) .

Taking partial derivatives to find the infimum of L, we see that −‖g_{1:T,i}‖_2²/s_i² − λ_i + θ = 0, and
complementarity conditions on λ_i s_i (Boyd and Vandenberghe, 2004) imply that λ_i = 0. Thus we have
s_i = θ^{−1/2} ‖g_{1:T,i}‖_2, and normalizing appropriately using θ gives s_i = c ‖g_{1:T,i}‖_2 / Σ_{j=1}^d ‖g_{1:T,j}‖_2 .


As a final note, we can plug s_i into the objective above to see

  inf_s { Σ_{t=1}^T Σ_{i=1}^d g²_{t,i}/s_i : s ⪰ 0, ⟨1, s⟩ ≤ c } = (1/c) ( Σ_{i=1}^d ‖g_{1:T,i}‖_2 )² .   (12)

Let diag(v) denote the diagonal matrix with diagonal v. It is natural to suspect that for s achieving
the infimum in Equation (12), if we use a proximal function similar to ψ(x) = ⟨x, diag(s)x⟩ with
associated squared dual norm ‖x‖²_{ψ*} = ⟨x, diag(s)^{−1} x⟩, we should do well lowering the gradient
terms in the regret bounds (10) and (11).
To prove a regret bound for our Algorithm 1, we note that both types of updates suffer losses that
include a term depending solely on the gradients obtained along their run. The following lemma
is applicable to both updates, and was originally proved by Auer and Gentile (2000), though we
provide a proof in Appendix C. McMahan and Streeter (2010) also give an identical lemma.

Lemma 4 Let g_t = f′_t(x_t) and g_{1:t} and s_t be defined as in Algorithm 1. Then

  Σ_{t=1}^T ⟨g_t, diag(s_t)^{−1} g_t⟩ ≤ 2 Σ_{i=1}^d ‖g_{1:T,i}‖_2 .

To obtain a regret bound, we need to consider the terms consisting of the dual norm of the subgradient
in the regret bounds (10) and (11), which is ‖f′_t(x_t)‖²_{ψ*_t}. When ψ_t(x) = ⟨x, (δI + diag(s_t))x⟩,
it is easy to see that the associated dual norm is

  ‖g‖²_{ψ*_t} = ⟨g, (δI + diag(s_t))^{−1} g⟩ .

From the definition of s_t in Algorithm 1, we clearly have ‖f′_t(x_t)‖²_{ψ*_t} ≤ ⟨g_t, diag(s_t)^{−1} g_t⟩. Note that
if s_{t,i} = 0 then g_{t,i} = 0 by definition of s_{t,i}. Thus, for any δ ≥ 0, Lemma 4 implies

  Σ_{t=1}^T ‖f′_t(x_t)‖²_{ψ*_t} ≤ 2 Σ_{i=1}^d ‖g_{1:T,i}‖_2 .   (13)

To obtain a bound for a primal-dual subgradient method, we set δ ≥ max_t ‖g_t‖_∞, in which case
‖g_t‖²_{ψ*_{t−1}} ≤ ⟨g_t, diag(s_t)^{−1} g_t⟩, and we follow the same lines of reasoning to achieve the
inequality (13).
It remains to bound the various Bregman divergence terms for Proposition 3 and the term ψ_T(x*)
for Proposition 2. We focus first on the composite mirror-descent update. Examining the bound (11)
and Algorithm 1, we notice that

  B_{ψ_{t+1}}(x*, x_{t+1}) − B_{ψ_t}(x*, x_{t+1}) = ½ ⟨x* − x_{t+1}, diag(s_{t+1} − s_t)(x* − x_{t+1})⟩
    ≤ ½ max_i (x*_i − x_{t+1,i})² ‖s_{t+1} − s_t‖_1 .

Since ‖s_{t+1} − s_t‖_1 = ⟨s_{t+1} − s_t, 1⟩ and ⟨s_T, 1⟩ = Σ_{i=1}^d ‖g_{1:T,i}‖_2, we have

  Σ_{t=1}^{T−1} [ B_{ψ_{t+1}}(x*, x_{t+1}) − B_{ψ_t}(x*, x_{t+1}) ] ≤ ½ Σ_{t=1}^{T−1} ‖x* − x_{t+1}‖²_∞ ⟨s_{t+1} − s_t, 1⟩
    ≤ ½ max_{t≤T} ‖x* − x_t‖²_∞ Σ_{i=1}^d ‖g_{1:T,i}‖_2 − ½ ‖x* − x_1‖²_∞ ⟨s_1, 1⟩ .   (14)


We also have

  ψ_T(x*) = δ‖x*‖_2² + ⟨x*, diag(s_T) x*⟩ ≤ δ‖x*‖_2² + ‖x*‖²_∞ Σ_{i=1}^d ‖g_{1:T,i}‖_2 .

Combining the above arguments with Propositions 2 and 3, and using (14) with the fact that
B_{ψ_1}(x*, x_1) ≤ ½ ‖x* − x_1‖²_∞ ⟨1, s_1⟩, we have proved the following theorem.

Theorem 5 Let the sequence {x_t} be defined by Algorithm 1. For x_t generated using the primal-dual
subgradient update (3) with δ ≥ max_t ‖g_t‖_∞, for any x* ∈ X,

  R_φ(T) ≤ (δ/η) ‖x*‖_2² + (1/η) ‖x*‖²_∞ Σ_{i=1}^d ‖g_{1:T,i}‖_2 + η Σ_{i=1}^d ‖g_{1:T,i}‖_2 .

For x_t generated using the composite mirror-descent update (4), for any x* ∈ X,

  R_φ(T) ≤ (1/(2η)) max_{t≤T} ‖x* − x_t‖²_∞ Σ_{i=1}^d ‖g_{1:T,i}‖_2 + η Σ_{i=1}^d ‖g_{1:T,i}‖_2 .

The above theorem is a bit unwieldy. We thus perform a few algebraic simplifications to get the
next corollary, which has a more intuitive form. Let us assume that X is compact and set D_∞ =
sup_{x∈X} ‖x − x*‖_∞. Furthermore, define

  γ_T ≜ Σ_{i=1}^d ‖g_{1:T,i}‖_2 = inf_s { Σ_{t=1}^T ⟨g_t, diag(s)^{−1} g_t⟩ : ⟨1, s⟩ ≤ Σ_{i=1}^d ‖g_{1:T,i}‖_2 , s ⪰ 0 } .

Also w.l.o.g. let 0 ∈ X. The following corollary is immediate (it is equivalent to Corollary 1,
though we have moved the √d term from the earlier bound).

Corollary 6 Assume that D_∞ and γ_T are defined as above. For {x_t} generated by Algorithm 1 using
the primal-dual subgradient update (3) with η = ‖x*‖_∞, for any x* ∈ X we have

  R_φ(T) ≤ 2 ‖x*‖_∞ γ_T + δ ‖x*‖_2²/‖x*‖_∞ ≤ 2 ‖x*‖_∞ γ_T + δ ‖x*‖_1 .

Using the composite mirror descent update (4) to generate {x_t} and setting η = D_∞/√2, we have

  R_φ(T) ≤ √2 D_∞ Σ_{i=1}^d ‖g_{1:T,i}‖_2 = √2 D_∞ γ_T .

We now give a short derivation of Corollary 1 from the introduction: use Theorem 5, Corollary 6,
and the fact that

  inf_s { Σ_{t=1}^T Σ_{i=1}^d g²_{t,i}/s_i : s ⪰ 0, ⟨1, s⟩ ≤ d } = (1/d) ( Σ_{i=1}^d ‖g_{1:T,i}‖_2 )² ,

as in (12) at the beginning of Section 3. Plugging the γ_T term in from Corollary 6 and multiplying
D_∞ by √d completes the proof of the corollary.


As discussed in the introduction, Algorithm 1 should have lower regret than non-adaptive algorithms
on sparse data, though this depends on the geometry of the underlying optimization space
X. For example, suppose that our learning problem is a logistic regression with 0/1-valued features.
Then the gradient terms are likewise based on 0/1-valued features and sparse, so the gradient terms
in the bound, Σ_{i=1}^d ‖g_{1:T,i}‖_2, should all be much smaller than √T. If some features appear much more
frequently than others, then the infimal representation of γ_T and the infimal equality in Corollary 1
show that we have significantly lower regret by using higher learning rates for infrequent features
and lower learning rates on commonly appearing features. Further, if the optimal predictor is
relatively dense, as is often the case in prediction problems with sparse inputs, then ‖x*‖_∞ is the best
p-norm we can have in the regret.
More precisely, McMahan and Streeter (2010) show that if X is contained within an ℓ∞ ball
of radius R and contains an ℓ∞ ball of radius r, then the bound in the above corollary is within a
factor of √2 R/r of the regret of the best diagonal proximal matrix, chosen in hindsight. So, for
example, if X = {x ∈ R^d : ‖x‖_p ≤ C}, then R/r = d^{1/p}, which shows that the domain X does affect
the guarantees we can give on the optimality of ADAGRAD.

4. Full Matrix Proximal Functions


In this section we derive and analyze new updates when we estimate a full matrix for the divergence
ψ_t instead of a diagonal one. In this generalized case, we use the root of the matrix of outer products
of the gradients that we have observed to update our parameters. As in the diagonal case, we build
on intuition garnered from an optimization problem, and in particular, we seek a matrix S which is
the solution to the following minimization problem:

  min_S Σ_{t=1}^T ⟨g_t, S^{−1} g_t⟩   s.t. S ⪰ 0, tr(S) ≤ c .   (15)

The solution is obtained by defining G_t = Σ_{τ=1}^t g_τ g_τ^⊤ and setting S to be a normalized version of
the root of G_T, that is, S = c G_T^{1/2} / tr(G_T^{1/2}). For a proof, see Lemma 15 in Appendix E, which also
shows that when G_T is not full rank we can instead use its pseudo-inverse. If we iteratively use
divergences of the form ψ_t(x) = ⟨x, G_t^{1/2} x⟩, we might expect as in the diagonal case to attain low
regret by collecting gradient information. We achieve our low regret goal by employing a similar
doubling lemma to Lemma 4 and bounding the gradient norm terms. The resulting algorithm is
given in Algorithm 2, and the next theorem provides a quantitative analysis of the brief motivation
above.

Theorem 7 Let G_t be the outer product matrix defined above and the sequence {x_t} be defined by
Algorithm 2. For x_t generated using the primal-dual subgradient update of (3) and δ ≥ max_t ‖g_t‖_2,
for any x* ∈ X,

  R_φ(T) ≤ (δ/η) ‖x*‖_2² + (1/η) ‖x*‖_2² tr(G_T^{1/2}) + η tr(G_T^{1/2}) .

For x_t generated with the composite mirror-descent update of (4), if x* ∈ X and δ ≥ 0,

  R_φ(T) ≤ (δ/η) ‖x*‖_2² + (1/(2η)) max_{t≤T} ‖x* − x_t‖_2² tr(G_T^{1/2}) + η tr(G_T^{1/2}) .


INPUT: η > 0, δ ≥ 0
VARIABLES: S_t ∈ R^{d×d}, H_t ∈ R^{d×d}, G_t ∈ R^{d×d}
INITIALIZE: x_1 = 0, S_0 = 0, H_0 = 0, G_0 = 0
FOR t = 1 to T:
  Suffer loss f_t(x_t)
  Receive subgradient g_t ∈ ∂f_t(x_t) of f_t at x_t
  UPDATE G_t = G_{t−1} + g_t g_t^⊤, S_t = G_t^{1/2}
  SET H_t = δI + S_t, ψ_t(x) = ½⟨x, H_t x⟩
  Primal-Dual Subgradient Update (3):
    x_{t+1} = argmin_{x∈X} { η⟨(1/t) Σ_{τ=1}^t g_τ, x⟩ + ηϕ(x) + (1/t) ψ_t(x) }
  Composite Mirror Descent Update (4):
    x_{t+1} = argmin_{x∈X} { η⟨g_t, x⟩ + ηϕ(x) + B_{ψ_t}(x, x_t) }

Figure 2: ADAGRAD with full matrices
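A corresponding sketch of the full-matrix update (again ours, for X = R^d and ϕ ≡ 0) makes the
extra cost explicit: each round requires a matrix square root of the d × d matrix G_t, computed here
via an eigendecomposition:

    import numpy as np

    def inv_root(G, delta):
        """Return (delta*I + G^{1/2})^{-1} for PSD G via eigendecomposition."""
        w, V = np.linalg.eigh(G)
        w = np.maximum(w, 0.0)                  # clip tiny negative eigenvalues
        return (V / (delta + np.sqrt(w))) @ V.T

    def adagrad_full(subgradient, x0, eta=1.0, delta=1e-8, T=1000):
        """Full-matrix ADAGRAD, composite mirror-descent update of Algorithm 2,
        specialized to X = R^d and phi = 0."""
        x = np.asarray(x0, dtype=float).copy()
        G = np.zeros((x.size, x.size))          # G_t = sum_tau g_tau g_tau^T
        for t in range(1, T + 1):
            g = subgradient(x, t)
            G += np.outer(g, g)
            x -= eta * inv_root(G, delta) @ g   # x_{t+1} = x_t - eta H_t^{-1} g_t
        return x

Here δ > 0 stands in for the pseudo-inverse used in the paper; since every g_t lies in the range of
G_t, the near-null directions of G_t receive essentially no update.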

Proof To begin, we consider the difference between the divergence terms at time t + 1 and time t
from the regret bound (11) in Proposition 3. Let λ_max(M) denote the largest eigenvalue of a matrix M. We
have

  B_{ψ_{t+1}}(x*, x_{t+1}) − B_{ψ_t}(x*, x_{t+1}) = ½ ⟨x* − x_{t+1}, (G_{t+1}^{1/2} − G_t^{1/2})(x* − x_{t+1})⟩
    ≤ ½ ‖x* − x_{t+1}‖_2² λ_max(G_{t+1}^{1/2} − G_t^{1/2}) ≤ ½ ‖x* − x_{t+1}‖_2² tr(G_{t+1}^{1/2} − G_t^{1/2}) .

For the last inequality we used the fact that the trace of a matrix is equal to the sum of its eigenvalues
along with the property G_{t+1}^{1/2} − G_t^{1/2} ⪰ 0 (see Lemma 13 in Appendix B), and therefore
tr(G_{t+1}^{1/2} − G_t^{1/2}) ≥ λ_max(G_{t+1}^{1/2} − G_t^{1/2}). Thus, we get

  Σ_{t=1}^{T−1} [ B_{ψ_{t+1}}(x*, x_{t+1}) − B_{ψ_t}(x*, x_{t+1}) ] ≤ ½ Σ_{t=1}^{T−1} ‖x* − x_{t+1}‖_2² ( tr(G_{t+1}^{1/2}) − tr(G_t^{1/2}) ) .

Now we use the fact that G_1 is a rank-1 PSD matrix with non-negative trace to see that

  Σ_{t=1}^{T−1} ‖x* − x_{t+1}‖_2² ( tr(G_{t+1}^{1/2}) − tr(G_t^{1/2}) )
    ≤ max_{t≤T} ‖x* − x_t‖_2² tr(G_T^{1/2}) − ‖x* − x_1‖_2² tr(G_1^{1/2}) .   (16)

It remains to bound the gradient terms common to all our bounds. We use the following three
lemmas, which are essentially directly applicable. We prove the first two in Appendix D.

Lemma 8 Let B ⪰ 0 and B^{−1/2} denote the root of the inverse of B when B ≻ 0 and the root of the
pseudo-inverse of B otherwise. For any ν such that B − νgg^⊤ ⪰ 0, the following inequality holds:

  2 tr((B − νgg^⊤)^{1/2}) ≤ 2 tr(B^{1/2}) − ν tr(B^{−1/2} gg^⊤) .


Lemma 9 Let δ ≥ ‖g‖_2 and A ⪰ 0. Then ⟨g, (δI + A^{1/2})^{−1} g⟩ ≤ ⟨g, ((A + gg^⊤)†)^{1/2} g⟩.

Lemma 10 Let S_t = G_t^{1/2} be as defined in Algorithm 2 and A† denote the pseudo-inverse of A.
Then

  Σ_{t=1}^T ⟨g_t, S_t† g_t⟩ ≤ 2 Σ_{t=1}^T ⟨g_t, S_T† g_t⟩ = 2 tr(G_T^{1/2}) .

Proof We prove the lemma by induction. The base case is immediate, since we have

  ⟨g_1, (G_1†)^{1/2} g_1⟩ = ⟨g_1, g_1⟩/‖g_1‖_2 = ‖g_1‖_2 ≤ 2 ‖g_1‖_2 .

Now, assume the lemma is true for T − 1, so from the inductive assumption we get

  Σ_{t=1}^T ⟨g_t, S_t† g_t⟩ ≤ 2 Σ_{t=1}^{T−1} ⟨g_t, S_{T−1}† g_t⟩ + ⟨g_T, S_T† g_T⟩ .

Since S_{T−1} does not depend on t we can rewrite Σ_{t=1}^{T−1} ⟨g_t, S_{T−1}† g_t⟩ as

  tr( S_{T−1}† Σ_{t=1}^{T−1} g_t g_t^⊤ ) = tr((G_{T−1}†)^{1/2} G_{T−1}) ,

where the right-most equality follows from the definitions of S_t and G_t. Therefore, we get

  Σ_{t=1}^T ⟨g_t, S_t† g_t⟩ ≤ 2 tr((G_{T−1}†)^{1/2} G_{T−1}) + ⟨g_T, (G_T†)^{1/2} g_T⟩
    = 2 tr(G_{T−1}^{1/2}) + ⟨g_T, (G_T†)^{1/2} g_T⟩ .

Using Lemma 8 with the substitution B = G_T, ν = 1, and g = g_T lets us exploit the concavity of the
function tr(A^{1/2}) to bound the above sum by 2 tr(G_T^{1/2}). ∎

We can now finalize our proof of the theorem. As in the diagonal case, we have that the squared
dual norm (seminorm when δ = 0) associated with ψ_t is

  ‖x‖²_{ψ*_t} = ⟨x, (δI + S_t)^{−1} x⟩ .

Thus it is clear that ‖g_t‖²_{ψ*_t} ≤ ⟨g_t, S_t† g_t⟩. For the dual-averaging algorithms, we use Lemma 9 above
to show that ‖g_t‖²_{ψ*_{t−1}} ≤ ⟨g_t, S_t† g_t⟩ so long as δ ≥ ‖g_t‖_2. Lemma 10's doubling inequality then implies
that

  Σ_{t=1}^T ‖f′_t(x_t)‖²_{ψ*_t} ≤ 2 tr(G_T^{1/2})   and   Σ_{t=1}^T ‖f′_t(x_t)‖²_{ψ*_{t−1}} ≤ 2 tr(G_T^{1/2})   (17)

for the mirror-descent and primal-dual subgradient algorithms, respectively.
To finish the proof, note that B_{ψ_1}(x*, x_1) ≤ ½ ‖x* − x_1‖_2² tr(G_1^{1/2}) when δ = 0. By combining this
with the first of the bounds (17) and the bound (16) on Σ_{t=1}^{T−1} [ B_{ψ_{t+1}}(x*, x_{t+1}) − B_{ψ_t}(x*, x_{t+1}) ],
Proposition 3 gives the theorem's statement for the mirror-descent family of algorithms. Combining the
fact that Σ_{t=1}^T ‖f′_t(x_t)‖²_{ψ*_{t−1}} ≤ 2 tr(G_T^{1/2}) and the bound (16) with Proposition 2 gives the desired bound
on R_φ(T) for the primal-dual subgradient algorithms, which completes the proof of the theorem. ∎
As before, we can give a corollary that simplifies the bound implied by Theorem 7. The infimal
equality in the corollary uses Lemma 15 in Appendix E. The corollary underscores that for learning
problems in which there is a rotation U of the space for which the gradient vectors g_t have
small inner products ⟨g_t, U g_t⟩ (essentially a sparse basis for the g_t), using full-matrix proximal
functions can attain significantly lower regret.

Corollary 11 Assume that ϕ(x_1) = 0. Then the regret of the sequence {x_t} generated by Algorithm 2
when using the primal-dual subgradient update with η = ‖x*‖_2 is

  R_φ(T) ≤ 2 ‖x*‖_2 tr(G_T^{1/2}) + δ ‖x*‖_2 .

Let X be a compact set so that sup_{x∈X} ‖x − x*‖_2 ≤ D. Taking η = D/√2 and using the composite
mirror descent update with δ = 0, we have

  R_φ(T) ≤ √2 D tr(G_T^{1/2}) = √(2d) D √( inf_S { Σ_{t=1}^T g_t^⊤ S^{−1} g_t : S ⪰ 0, tr(S) ≤ d } ) .

5. Derived Algorithms
In this section, we derive updates using concrete regularization functions ϕ and settings of the
domain X for the ADAGRAD framework. We focus on showing how to solve Equations (3) and (4)
with the diagonal matrix version of the algorithms we have presented. We focus on the diagonal
case for two reasons. First, the updates often take closed form in this case and carry some intuition.
Second, the diagonal case is feasible to implement in very high dimensions, whereas the full matrix
version is likely to be confined to a few thousand dimensions. We also discuss how to efficiently
compute the updates when the gradient vectors are sparse.
We begin by noting a simple but useful fact. Let G_t denote either the outer product matrix of
gradients or its diagonal counterpart and let H_t = δI + G_t^{1/2}, as usual. Simple algebraic manipulations
yield that each of the updates (3) and (4) in the prequel can be written in the following form
(omitting the stepsize η):

  x_{t+1} = argmin_{x∈X} { ⟨u, x⟩ + ϕ(x) + ½ ⟨x, H_t x⟩ } .   (18)

In particular, at time t for the RDA update, we have u = ηt ḡ_t. For the composite gradient update (4),

  η⟨g_t, x⟩ + ½⟨x − x_t, H_t(x − x_t)⟩ = ⟨ηg_t − H_t x_t, x⟩ + ½⟨x, H_t x⟩ + ½⟨x_t, H_t x_t⟩ ,

so that u = ηg_t − H_t x_t. We now derive algorithms for solving the general update (18). Since most
of the derivations are known, we generally provide only the closed-form solutions or algorithms for
the solutions in the remainder of the subsection, deferring detailed derivations to Appendix G for
the interested reader.


5.1 ℓ1-regularization
We begin by considering how to solve the minimization problems necessary for Algorithm 1 with
diagonal matrix divergences and ϕ(x) = λ‖x‖_1. We consider the two updates we proposed and
denote the ith diagonal element of the matrix H_t = δI + diag(s_t) from Algorithm 1 by H_{t,ii} = δ +
‖g_{1:t,i}‖_2. For the primal-dual subgradient update, the solution to (3) amounts to the following simple
update for x_{t+1,i}:

  x_{t+1,i} = sign(−ḡ_{t,i}) (ηt/H_{t,ii}) [ |ḡ_{t,i}| − λ ]_+ .   (19)
Comparing the update (19) to the standard dual averaging update (Xiao, 2010), which is

  x_{t+1,i} = sign(−ḡ_{t,i}) η√t [ |ḡ_{t,i}| − λ ]_+ ,

it is clear that the difference distills to the step size employed for each coordinate. Our generalization
of RDA yields a dedicated step size for each coordinate, inversely proportional to the time-based
norm of the coordinate in the sequence of gradients. Because of the normalization by this term, the step
size scales linearly with t, so when H_{t,ii} is small, gradient information on coordinate i is quickly
incorporated.
The composite mirror-descent update (4) has a similar form that essentially amounts to iterative
shrinkage and thresholding, where the shrinkage differs per coordinate:

  x_{t+1,i} = sign( x_{t,i} − (η/H_{t,ii}) g_{t,i} ) [ | x_{t,i} − (η/H_{t,ii}) g_{t,i} | − λη/H_{t,ii} ]_+ .

We compare the actual performance of the newly derived algorithms to previously studied versions
in the next section.
For both updates it is clear that we can perform "lazy" computation when the gradient vectors
are sparse, a frequently occurring setting when learning, for instance, from text corpora. Suppose
that from time step t_0 through t, the ith component of the gradient is 0. Then we can evaluate the
above updates on demand since H_{t,ii} remains intact. For composite mirror-descent, at time t when
x_{t,i} is needed, we update

  x_{t,i} = sign(x_{t_0,i}) [ |x_{t_0,i}| − (t − t_0) λη/H_{t_0,ii} ]_+ .

Even simpler just-in-time evaluation can be performed for the primal-dual subgradient update.
Here we need to keep an unnormalized version of the average ḡ_t. Concretely, we keep track of
u_t = t ḡ_t = Σ_{τ=1}^t g_τ = u_{t−1} + g_t, then use the update (19):

  x_{t,i} = sign(−u_{t,i}) (ηt/H_{t,ii}) [ |u_{t,i}|/t − λ ]_+ ,

where H_t can clearly be updated lazily in a similar fashion.
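A sketch of the lazy primal-dual variant in Python follows (ours; it assumes the diagonal algorithm
with ϕ(x) = λ‖x‖_1 and X = R^d, and the class and attribute names are made up for illustration):

    import numpy as np

    class LazyL1AdaRDA:
        """Diagonal ADAGRAD-RDA with l1 regularization and lazy sparse updates.
        Stores u_t = sum of gradients and squared gradient sums per coordinate;
        weights are materialized on demand via the update (19)."""
        def __init__(self, eta, lam, delta=1e-8):
            self.eta, self.lam, self.delta = eta, lam, delta
            self.u = {}     # coordinate -> u_{t,i} = sum_{tau<=t} g_{tau,i}
            self.sq = {}    # coordinate -> ||g_{1:t,i}||_2^2
            self.t = 0

        def coef(self, i):
            """Just-in-time value of x_{t,i}."""
            if self.t == 0 or i not in self.u:
                return 0.0
            h = self.delta + np.sqrt(self.sq[i])            # H_{t,ii}
            shrunk = max(abs(self.u[i]) / self.t - self.lam, 0.0)
            return -np.sign(self.u[i]) * (self.eta * self.t / h) * shrunk

        def step(self, grad):
            """grad maps coordinate -> g_{t,i}; only non-zeros are touched."""
            self.t += 1
            for i, g in grad.items():
                self.u[i] = self.u.get(i, 0.0) + g
                self.sq[i] = self.sq.get(i, 0.0) + g * g

Because only the coordinates present in grad are visited, the cost of a round is proportional to the
number of non-zero gradient entries rather than to d.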

5.2 ℓ1-ball Projections


We next consider the setting in which ϕ ≡ 0 and X = {x : ‖x‖_1 ≤ c}, for which it is straightforward
to adapt efficient solutions to continuous quadratic knapsack problems (Brucker, 1984).


INPUT: v ⪰ 0, a ≻ 0, c ≥ 0.
IF Σ_i v_i ≤ c RETURN z* = v
SORT the ratios v_i/a_i into µ = ⟨v_{i_j}/a_{i_j}⟩ such that v_{i_j}/a_{i_j} ≥ v_{i_{j+1}}/a_{i_{j+1}}
SET ρ := max{ ρ : Σ_{j=1}^ρ a_{i_j} v_{i_j} − (v_{i_ρ}/a_{i_ρ}) Σ_{j=1}^ρ a²_{i_j} < c }
SET θ = ( Σ_{j=1}^ρ a_{i_j} v_{i_j} − c ) / Σ_{j=1}^ρ a²_{i_j}
RETURN z* where z*_i = [v_i − θ a_i]_+

Figure 3: Project v ⪰ 0 to {z : ⟨a, z⟩ ≤ c, z ⪰ 0}.
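A direct Python transcription of Figure 3 (a sketch of ours; it assumes c > 0 and elementwise
positive a):

    import numpy as np

    def scaled_l1_project(v, a, c):
        """Project v >= 0 onto {z : <a, z> <= c, z >= 0}, following Figure 3;
        O(d log d) because of the sort."""
        if a @ v <= c:
            return v.copy()
        order = np.argsort(-(v / a))          # indices with v_i/a_i decreasing
        av = (a * v)[order]
        aa = (a * a)[order]
        csum_av, csum_aa = np.cumsum(av), np.cumsum(aa)
        ratios = (v / a)[order]
        # largest rho with sum_{j<=rho} a_j v_j - (v_rho/a_rho) sum_{j<=rho} a_j^2 < c
        rho = int(np.max(np.nonzero(csum_av - ratios * csum_aa < c)[0]))
        theta = (csum_av[rho] - c) / csum_aa[rho]
        return np.maximum(v - theta * a, 0.0)

With a = 1 this reduces to the familiar projection onto the set {z ⪰ 0 : ⟨1, z⟩ ≤ c}.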

We use the matrix H_t = δI + diag(G_t)^{1/2} from Algorithm 1. We provide a brief derivation sketch and
an O(d log d) algorithm in this section. First, we convert the problem (18) into a projection problem
onto a scaled ℓ1-ball. By making the substitutions z = H^{1/2}x and A = H^{−1/2}, it is clear that
problem (18) is equivalent to

  min_z ‖z + H^{−1/2} u‖_2²   s.t. ‖Az‖_1 ≤ c .

Now, by the appropriate choice of v, namely v = −H^{−1/2}u = −ηt H_t^{−1/2} ḡ_t for the primal-dual update (3) and
v = H_t^{1/2} x_t − η H_t^{−1/2} g_t for the mirror-descent update (4), we arrive at the problem

  min_z ½ ‖z − v‖_2²   s.t. Σ_{i=1}^d a_i |z_i| ≤ c .   (20)

We can clearly recover x_{t+1} from the solution z* to the projection (20) via x_{t+1} = H_t^{−1/2} z*.
By the symmetry of the objective (20), we can assume without loss of generality that v ⪰ 0 and
constrain z ⪰ 0, and a bit of manipulation with the Lagrangian (see Appendix G) for the problem
shows that the solution z* has the form

  z*_i = { v_i − θ* a_i   if v_i ≥ θ* a_i
         { 0              otherwise

for some θ* ≥ 0. The algorithm in Figure 3 constructs the optimal θ and returns z*.

5.3 ℓ2 Regularization
We now turn to the case where ϕ(x) = λ‖x‖_2 while X = R^d. This type of regularization is useful
for zeroing multiple weights in a group, for example in multi-task or multiclass learning (Obozinski
et al., 2007). Recalling the general proximal step (18), we must solve

  min_x ⟨u, x⟩ + ½⟨x, Hx⟩ + λ‖x‖_2 .   (21)

There is no closed form solution for this problem, but we give an efficient bisection-based procedure
for solving (21). We start by deriving the dual. Introducing a variable z = x, we get the equivalent
problem of minimizing ⟨u, x⟩ + ½⟨x, Hx⟩ + λ‖z‖_2 subject to x = z. With Lagrange multipliers α for
the equality constraint, we obtain the Lagrangian

  L(x, z, α) = ⟨u, x⟩ + ½⟨x, Hx⟩ + λ‖z‖_2 + ⟨α, x − z⟩ .


INPUT: u ∈ R^d, H ≻ 0, λ > 0.
IF ‖u‖_2 ≤ λ RETURN x = 0
SET v = H^{−1}u, θ_max = ‖v‖_2/λ − 1/σ_max(H), θ_min = ‖v‖_2/λ − 1/σ_min(H)
WHILE θ_max − θ_min > ε
  SET θ = (θ_max + θ_min)/2, α(θ) = −(H^{−1} + θI)^{−1} v
  IF ‖α(θ)‖_2 > λ SET θ_min = θ
  ELSE SET θ_max = θ
RETURN x = −H^{−1}(u + α(θ))

Figure 4: Minimize ⟨u, x⟩ + ½⟨x, Hx⟩ + λ‖x‖_2

Taking the infimum of L with respect to the primal variables x and z, we see that the infimum is
attained at x = −H^{−1}(u + α). Coupled with the fact that inf_z λ‖z‖_2 − ⟨α, z⟩ = −∞ unless ‖α‖_2 ≤ λ,
in which case the infimum is 0, we arrive at the dual form

  inf_{x,z} L(x, z, α) = { −½⟨u + α, H^{−1}(u + α)⟩   if ‖α‖_2 ≤ λ
                         { −∞                         otherwise.

Setting v = H^{−1}u, we further distill the dual to

  min_α ⟨v, α⟩ + ½⟨α, H^{−1}α⟩   s.t. ‖α‖_2 ≤ λ .   (22)

We can solve problem (22) efficiently using a bisection search over its equivalent representation in
Lagrange form,

  min_α ⟨v, α⟩ + ½⟨α, H^{−1}α⟩ + (θ/2)‖α‖_2² ,

where θ > 0 is an unknown scalar. The solution to the latter as a function of θ is clearly α(θ) =
−(H^{−1} + θI)^{−1}v = −(H^{−1} + θI)^{−1}H^{−1}u. Since ‖α(θ)‖_2 is monotonically decreasing in θ (consider
the eigen-decomposition of the positive definite H^{−1}), we can simply perform a bisection search
over θ, checking at each point whether ‖α(θ)‖_2 ≷ λ.
To find initial upper and lower bounds on θ, we note that

  (1/σ_min(H) + θ)^{−1} ‖v‖_2 ≤ ‖α(θ)‖_2 ≤ (1/σ_max(H) + θ)^{−1} ‖v‖_2 ,

where σ_max(H) denotes the maximum singular value of H and σ_min(H) the minimum. To guarantee
‖α(θ_max)‖_2 ≤ λ, we thus set θ_max = ‖v‖_2/λ − 1/σ_max(H). Similarly, for θ_min we see that so long as
θ ≤ ‖v‖_2/λ − 1/σ_min(H) we have ‖α(θ)‖_2 ≥ λ. The fact that ∂‖x‖_2 = {z : ‖z‖_2 ≤ 1} when x = 0
implies that the solution of the original problem (21) is x = 0 if and only if ‖u‖_2 ≤ λ. We provide
pseudocode for solving (21) in Figure 4.
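In code, the bisection of Figure 4 is only a few lines. The sketch below is ours (tol is the bisection
tolerance ε; the max(0, ·) on θ_min is a small safeguard, valid because ‖α(0)‖_2 = ‖u‖_2 > λ
whenever the loop is reached):

    import numpy as np

    def prox_l2(u, H, lam, tol=1e-10):
        """Solve min_x <u, x> + 0.5 <x, Hx> + lam ||x||_2 (problem (21))
        for positive definite H by bisection on theta, as in Figure 4."""
        if np.linalg.norm(u) <= lam:
            return np.zeros_like(u)             # 0 is optimal iff ||u||_2 <= lam
        d = u.size
        Hinv = np.linalg.inv(H)
        v = Hinv @ u
        sig = np.linalg.eigvalsh(H)             # ascending eigenvalues of H
        theta_max = np.linalg.norm(v) / lam - 1.0 / sig[-1]
        theta_min = max(0.0, np.linalg.norm(v) / lam - 1.0 / sig[0])
        while theta_max - theta_min > tol:
            theta = 0.5 * (theta_max + theta_min)
            alpha = -np.linalg.solve(Hinv + theta * np.eye(d), v)
            if np.linalg.norm(alpha) > lam:     # ||alpha(theta)|| decreasing in theta
                theta_min = theta
            else:
                theta_max = theta
        alpha = -np.linalg.solve(Hinv + theta_max * np.eye(d), v)
        return -Hinv @ (u + alpha)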


5.4 ℓ∞ Regularization
We again let X = R^d but now choose ϕ(x) = λ‖x‖_∞. This type of update, similarly to ℓ2, zeroes
groups of variables, which is handy in finding structurally sparse solutions for multitask or multiclass
problems. Solving the ℓ∞ regularized problem amounts to

  min_x ⟨u, x⟩ + ½⟨x, Hx⟩ + λ‖x‖_∞ .   (23)

The dual of this problem is a modified ℓ1-projection problem. As in the case of ℓ2 regularization,
we introduce an equality constrained variable z = x with associated Lagrange multipliers α ∈ R^d to
obtain

  L(x, z, α) = ⟨u, x⟩ + ½⟨x, Hx⟩ + λ‖z‖_∞ + ⟨α, x − z⟩ .

Performing manipulations identical to the ℓ2 case, we take derivatives and get that x = −H^{−1}(u + α)
and, similarly, that inf_z L(x, z, α) = −∞ unless ‖α‖_1 ≤ λ. Thus the dual problem for (23) is

  max_α −½ ⟨u + α, H^{−1}(u + α)⟩   s.t. ‖α‖_1 ≤ λ .

When H is diagonal we can find the optimal α* using the generalized ℓ1-projection in Figure 3,
then reconstruct the optimal x via x = −H^{−1}(u + α*).
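For diagonal H the whole ℓ∞ proximal step therefore reduces to one call of the Figure 3 projection
on the dual. A sketch (ours; it reuses scaled_l1_project from the sketch after Figure 3 and assumes
h is elementwise positive):

    import numpy as np

    def prox_linf_diag(u, h, lam):
        """Solve min_x <u, x> + 0.5 <x, diag(h) x> + lam ||x||_inf (problem (23)).
        With beta = -alpha and w = beta / sqrt(h), the dual becomes an instance
        of problem (20) with v = |u|/sqrt(h), a = sqrt(h), c = lam."""
        rh = np.sqrt(h)
        z = scaled_l1_project(np.abs(u) / rh, rh, lam)
        beta = np.sign(u) * rh * z              # optimal beta shares the sign of u
        return -(u - beta) / h                  # x = -H^{-1}(u + alpha), alpha = -beta

As a sanity check, when λ ≥ ‖u‖_1 the projection returns z = |u|/√h, so β = u and x = 0, matching
the optimality condition for (23).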

5.5 Mixed-norm Regularization


Finally, we combine the above results to show how to solve problems with matrix-valued inputs
X ∈ R^{d×k}, where X = [x_1 ⋯ x_d]^⊤. We consider mixed-norm regularization, which is very useful
for encouraging sparsity across several tasks (Obozinski et al., 2007). Now ϕ is an ℓ1/ℓ_p norm, that
is, ϕ(X) = λ Σ_{i=1}^d ‖x_i‖_p. By imposing an ℓ1-norm over the p-norms of the rows of X, entire rows are
nulled at once.
When p ∈ {2, ∞} and the proximal H in (18) is diagonal, the previous algorithms can be readily
used to solve the mixed norm problems. We simply maintain diagonal matrix information for each
of the rows x_i of X separately, then solve one of the previous updates for each row independently.
We use this form of regularization in our experiments with multiclass prediction problems in the
next section.

6. Experiments
We performed experiments with several real world data sets with different characteristics: the
ImageNet image database (Deng et al., 2009), the Reuters RCV1 text classification data set (Lewis
et al., 2004), the MNIST multiclass digit recognition problem, and the census income data set from
the UCI repository (Asuncion and Newman, 2007). For uniformity across experiments, we focus on
the completely online (fully stochastic) optimization setting, in which at each iteration the learning
algorithm receives a single example. We measure performance using two metrics: the online loss
or error, and the test set performance of the predictor the learning algorithm outputs at the end of a
single pass through the training data. We also give some results that show how imposing sparsity
constraints (in the form of ℓ1 and mixed-norm regularization) affects the learning algorithm's
performance. One benefit of the ADAGRAD framework is its ability to straightforwardly generalize to


         RDA           FB            ADAGRAD-RDA   ADAGRAD-FB    PA     AROW
ECAT     .051 (.099)   .058 (.194)   .044 (.086)   .044 (.238)   .059   .049
CCAT     .064 (.123)   .111 (.226)   .053 (.105)   .053 (.276)   .107   .061
GCAT     .046 (.092)   .056 (.183)   .040 (.080)   .040 (.225)   .066   .044
MCAT     .037 (.074)   .056 (.146)   .035 (.063)   .034 (.176)   .053   .039

Table 1: Test set error rates and proportion of non-zeros (in parentheses) on Reuters RCV1.

domain constraints X ≠ R^d and arbitrary regularization functions ϕ, in contrast to previous adaptive
online algorithms.
We experiment with RDA (Xiao, 2010), FOBOS (Duchi and Singer, 2009), adaptive RDA, adaptive
FOBOS, the Passive-Aggressive (PA) algorithm (Crammer et al., 2006), and AROW (Crammer
et al., 2009). To remind the reader, PA is an online learning procedure with the update

  x_{t+1} = argmin_x [1 − y_t ⟨z_t, x⟩]_+ + (λ/2) ‖x − x_t‖_2² ,

where λ is a regularization parameter. PA's update is similar to the update employed by AROW
(see (9)), but the latter maintains second order information on x. By using a representer theorem
it is also possible to derive efficient updates for PA and AROW when the loss is the logistic loss,
log(1 + exp(−y_t ⟨z_t, x_t⟩)). We thus compare the above six algorithms using both the hinge and
logistic losses.

6.1 Text Classification


The Reuters RCV1 data set consists of a collection of approximately 800,000 text articles, each
of which is assigned multiple labels. There are 4 high-level categories, Economics, Commerce,
Medical, and Government (ECAT, CCAT, MCAT, GCAT), and multiple more specific categories.
We focus on training binary classifiers for each of the four major categories. The input features
we use are 0/1 bigram features, which, post word stemming, give data of approximately 2 million
dimensions. The feature vectors are very sparse, however, and most examples have fewer than 5000
non-zero features.
We compare the twelve different algorithms mentioned in the prequel (six methods, each trained
with the hinge and logistic losses) as well as variants of FOBOS and RDA with ℓ1-regularization.
We summarize the results of the ℓ1-regularized runs as well as AROW and PA in Table 1. The
results for the hinge and logistic losses are qualitatively and quantitatively very similar, so we
report results only for training with the hinge loss in Table 1. Each row in the table represents the
average of four different experiments in which we hold out 25% of the data for a test set and perform
an online pass on the remaining 75% of the data. For RDA and FOBOS, we cross-validate the
stepsize parameter η by simply running multiple passes and then choosing the output of the learner
that had the fewest mistakes during training. For PA and AROW we choose λ using the same
approach. We use the same regularization multiplier on the ℓ1 term for RDA and FOBOS, selected
so that RDA achieved approximately 10% non-zero predictors.
It is evident from the results presented in Table 1 that the adaptive algorithms (AROW and ADAGRAD)
are far superior to non-adaptive algorithms in terms of error rate on test data. The ADAGRAD
algorithms naturally incorporate sparsity as well, since they are run with ℓ1-regularization,
though RDA has significantly higher sparsity levels (PA and AROW do not have any sparsity).
Furthermore, although omitted from the table to avoid clutter, in every test with the RCV1 corpus, the

Alg.          Avg. Prec.   P@1      P@3      P@5      P@10     Prop. nonzero
ADAGRAD RDA   0.6022       0.8502   0.8307   0.8130   0.7811   0.7267
AROW          0.5813       0.8597   0.8369   0.8165   0.7816   1.0000
PA            0.5581       0.8455   0.8184   0.7957   0.7576   1.0000
RDA           0.5042       0.7496   0.7185   0.6950   0.6545   0.8996

Table 2: Test set precision for ImageNet.

adaptive algorithms outperformed the non-adaptive algorithms. Moreover, both ADAGRAD-RDA
and ADAGRAD-FOBOS outperform AROW on all the classification tasks. Unregularized RDA and
FOBOS attained results similar to those of their ℓ1-regularized variants (of course without sparsity), but
we omit these results to avoid clutter and because they do not give much more understanding.

6.2 Image Ranking


ImageNet (Deng et al., 2009) consists of images organized according to the nouns in the WordNet
hierarchy, where each noun is associated on average with more than 500 images collected from
the web. We selected 15,000 important nouns from the hierarchy and conducted a large scale image
ranking task for each noun. This approach is identical to the task tackled by Grangier and
Bengio (2008) using the Passive-Aggressive algorithm. To solve this problem, we train 15,000
ranking machines using Grangier and Bengio's visterms features, which represent patches in an image
with 79-dimensional sparse vectors. There are approximately 120 patches per image, resulting
in a 10,000-dimensional feature space.
Based on the results in the previous section, we focus on four algorithms for solving this task:
AROW, ADAGRAD with RDA updates and ℓ1-regularization, vanilla RDA with ℓ1, and Passive-Aggressive.
We use the ranking hinge loss, which is [1 − ⟨x, z_1 − z_2⟩]_+ when z_1 is ranked above
z_2. We train a ranker x_c for each of the image classes individually, cross-validating the choice of
initial stepsize for each algorithm on a small held-out set. To train an individual ranker for class
c, at each step of the algorithm we randomly sample a positive image z_1 for the category c and
an image z_2 from the training set (which with high probability is a negative example for class c)
and perform an update on the example z_1 − z_2. We let each algorithm take 100,000 such steps for
each image category, we train four sets of rankers with each algorithm, and the training set includes
approximately 2 million images.
For evaluation, we use a distinct test set of approximately 1 million images. To evaluate a set of
rankers, we iterate through all 15,000 classes in the data set. For each class we take all the positive
image examples in the test set and sample 10 times as many negative image examples. Following
Grangier and Bengio, we then rank the set of positive and negative images and compute precision-at-k
for k ∈ {1, …, 10} and the average precision for each category. The precision-at-k is defined
as the proportion of examples ranked in the top k for a category c that actually belong to c, and
the average precision is the average of the precisions at each position in which a relevant picture
appears. Letting Pos(c) denote the positive examples for category c and p(i) denote the position of
the ith returned picture in the list of images sorted by inner product with x_c, the average precision is

  (1/|Pos(c)|) Σ_{i=1}^{|Pos(c)|} i/p(i) .


[Figure 5: Learning curves on MNIST — cumulative mistakes as a function of the number of
examples seen, for PA, Ada-RDA, RDA, Ada-RDA ℓ1/ℓ2, and RDA ℓ1/ℓ2.]

We compute the mean of each measurement across all classes, performing this twelve times for
each of the sets of rankers trained. Table 2 summarizes our results. We do not report the variance,
as it was on the order of 10⁻⁵ for each algorithm. One apparent characteristic to note from the
table is that ADAGRAD-RDA achieves higher levels of sparsity than the other algorithms: using
only 73% of the input features, it achieves very high performance. Moreover, it outperforms all the
algorithms in average precision. AROW has better results than the other algorithms in terms of
precision-at-k for k ≤ 10, though ADAGRAD's performance catches up to and eventually surpasses
AROW's as k grows.

6.3 Multiclass Optical Character Recognition


In the well-known MNIST multiclass classification data set, we are given 28 × 28 pixel images a_i,
and the learner's task is to classify each image as a digit in {0, …, 9}. Linear classifiers do not
work well on a simple pixel-based representation. Thus we learn classifiers built on top of a kernel
machine with Gaussian kernels, as do Duchi and Singer (2009), which gives a different (and non-sparse)
structure to the feature space in contrast to our previous experiments. In particular, for the
ith example and jth feature, the feature value is

  z_{ij} = K(a_i, a_j) ≜ exp( −‖a_i − a_j‖_2² / (2σ²) ) .

We use a support set of approximately 3000 images to compute the kernels and train multiclass
predictors, which consist of one vector x_c ∈ R^3000 for each class c, giving a 30,000 dimensional
problem.
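For reference, the kernel feature construction just described amounts to the following sketch (ours;
images, support, and sigma are placeholders for the flattened pixel arrays and the bandwidth, which
the text does not specify):

    import numpy as np

    def gaussian_kernel_features(images, support, sigma):
        """z_{ij} = exp(-||a_i - a_j||_2^2 / (2 sigma^2)) against a fixed
        support set; each row is one example's dense feature vector."""
        sq = ((images[:, None, :] - support[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq / (2.0 * sigma ** 2))

Each example thus becomes a dense vector of length equal to the support-set size (about 3000 here),
and the multiclass predictor stacks one such weight vector per digit.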
There is no known multiclass AROW algorithm. We therefore compare adaptive RDA with and
without mixed-norm ℓ1/ℓ2 and ℓ1/ℓ∞ regularization (see Section 5.5), RDA, and multiclass Passive-Aggressive
to one another using the multiclass hinge loss (Crammer et al., 2006). For each algorithm
we used the first 5000 of the 60,000 training examples to choose the stepsize η (for RDA) and λ (for
PA).
In Figure 5, we plot the learning curves (cumulative mistakes made) of multiclass PA, RDA,
RDA with ℓ1 /ℓ2 regularization, adaptive RDA, and adaptive RDA with ℓ1 /ℓ2 regularization (ℓ1 /ℓ∞


                       Test error rate   Prop. nonzero
PA                     0.062             1.000
Ada-RDA                0.066             1.000
RDA                    0.108             1.000
Ada-RDA λ = 5·10⁻⁴     0.100             0.569
RDA λ = 5·10⁻⁴         0.138             0.878
Ada-RDA λ = 10⁻³       0.137             0.144
RDA λ = 10⁻³           0.192             0.532

Table 3: Test set error rates and sparsity proportions on MNIST. The scalar λ is the multiplier on
the ℓ1/ℓ2 regularization term.

is similar). From the curves, we see that adaptive RDA has performance similar to PA's,
and the adaptive versions of RDA are vastly superior to their non-adaptive counterparts. Table 3
further supports this: the adaptive RDA algorithms outperform their non-adaptive counterparts both
in terms of sparsity (the proportion of non-zero rows) and test set error rates.

6.4 Income Prediction


The KDD census income data set from the UCI repository (Asuncion and Newman, 2007) contains
census data extracted from 1994 and 1995 population surveys conducted by the U.S. Census Bureau.
The data consists of 40 demographic and employment related variables which are used to predict
whether a respondent has income above or below $50,000. We quantize each feature into bins (5
per feature for continuous features) and take products of features to give a 4001 dimensional feature
space with 0/1 features. The data is divided into a training set of 199,523 instances and test set of
99,762 test instances.
As in the prequel, we compare AROW, PA, RDA, and adaptive RDA with and without ℓ1-regularization
on this data set. We use the first 10,000 examples of the training set to select the
step size parameters λ for AROW and PA and η for RDA. We perform ten experiments on random
shuffles of the training data. Each experiment consists of a training pass through some proportion
of the data (.05, .1, .25, .5, or the entire training set) and computing the test set error rate of the
learned predictor. Table 4 and Figure 6 summarize the results of these experiments. The variance
of the test error rates is on the order of 10⁻⁶, so we do not report it. As earlier, the table and figure
make it clear that the adaptive methods (AROW and ADAGRAD-RDA) give better performance
than non-adaptive methods. Further, as detailed in the table, the ADAGRAD methods can give
extremely sparse predictors that still give excellent test set performance. This is consistent with
the experiments we have seen to this point, where ADAGRAD gives sparse but highly accurate
predictors.

6.5 Experiments with Sparsity-Accuracy Tradeoffs


In our final set of experiments, we investigate the tradeoff between the level of sparsity and the
classification accuracy for the ADAGRAD-RDA algorithms. Using the same experimental setup
as for the initial text classification experiments described in Section 6.1, we record the average
test-set performance of ADAGRAD-RDA versus the proportion of features that are non-zero in the
predictor ADAGRAD outputs after a single pass through the training data. To achieve this, we run


[Figure 6: Test set error rates as a function of the proportion of training data seen on the Census
Income data set, for AROW, PA, RDA, and Ada-RDA.]

Prop. Train   0.05            0.10            0.25            0.50            1.00
AROW          0.049           0.048           0.046           0.045           0.044
PA            0.055           0.052           0.050           0.049           0.048
RDA           0.055           0.054           0.052           0.051           0.050
Ada-RDA       0.053           0.051           0.049           0.048           0.047
ℓ1 RDA        0.056 (0.075)   0.054 (0.066)   0.053 (0.058)   0.052 (0.053)   0.051 (0.050)
ℓ1 Ada-RDA    0.052 (0.062)   0.051 (0.053)   0.050 (0.044)   0.050 (0.040)   0.049 (0.037)

Table 4: Test set error rates as a function of the proportion of training data seen (proportion of
non-zeros in parentheses where appropriate) on the Census Income data set.

ADAGRAD with ℓ1-regularization and sweep the regularization multiplier λ from 10⁻⁸ to 10⁻¹.
These values result in predictors ranging from a completely dense predictor to an all-zeros predictor,
respectively.
We summarize our results in Figure 7, which shows the test set performance of ADAGRAD
for each of the four categories ECAT, CCAT, GCAT, and MCAT. Within each plot, the horizontal
black line labeled AROW designates the baseline performance of AROW on the text classification
task, though we would like to note that AROW generates fully dense predictors. The plots all
portray a similar story. With high regularization values, ADAGRAD exhibits, as expected, poor
performance, as it retains no predictive information from the learning task. Put another way, when
the regularization value is high, ADAGRAD is confined to an overly sparse predictor which exhibits
poor generalization. However, as the regularization multiplier λ decreases, the learned predictor
becomes less sparse and eventually the accuracy of ADAGRAD exceeds AROW's accuracy. It is
interesting to note that for these experiments, as soon as the predictor resulting from a single pass

[Figure 7: Test set error rates as a function of the proportion of non-zeros in the predictor x output
by ADAGRAD, for each of ECAT, CCAT, GCAT, and MCAT (AROW plotted for reference).]

through the data has more than 1% non-zero coefficients, ADAGRAD's performance matches that of
AROW. We also note that the variance in the test-set error rates for these experiments is on the
order of 10⁻⁶, and we thus do not draw error bars in the graphs. The performance of ADAGRAD
as a function of regularization on other sparse data sets, especially in relation to that of AROW,
was qualitatively similar to this experiment.

7. Conclusions
We presented a paradigm that adapts subgradient methods to the geometry of the problem at hand.
The adaptation allows us to derive strong regret guarantees, which for some natural data distributions
achieve better performance guarantees than previous algorithms. Our online regret bounds can be
naturally converted into rate-of-convergence and generalization bounds (Cesa-Bianchi et al., 2004).
Our experiments show that adaptive methods, specifically ADAGRAD-FOBOS, ADAGRAD-RDA,
and AROW, clearly outperform their non-adaptive counterparts. Furthermore, the ADAGRAD


family of algorithms naturally incorporates regularization and gives very sparse solutions with
performance similar to that of dense solutions. Our experiments with adaptive methods use a diagonal
approximation to the matrix obtained by taking outer products of subgradients computed along the
run of the algorithm. It remains to be tested whether using the full outer product matrix can further
improve performance.
To conclude we would like to underscore a possible elegant generalization that interpolates
between full-matrix proximal functions and diagonal approximations using block diagonal matrices.
Specifically, for $v \in \mathbb{R}^d$ let $v = [v_{[1]}^\top \cdots v_{[k]}^\top]^\top$, where $v_{[i]} \in \mathbb{R}^{d_i}$ are subvectors of $v$ with $\sum_{i=1}^k d_i = d$.

We can define the associated block-diagonal approximation to the outer product matrix $\sum_{\tau=1}^t g_\tau g_\tau^\top$ by
$$G_t = \sum_{\tau=1}^t \begin{bmatrix} g_{\tau,[1]} g_{\tau,[1]}^\top & 0 & \cdots & 0 \\ 0 & g_{\tau,[2]} g_{\tau,[2]}^\top & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & g_{\tau,[k]} g_{\tau,[k]}^\top \end{bmatrix}.$$

In this case, a combination of Theorems 5 and 7 gives the next corollary.

Corollary 12 Let $G_t$ be the block-diagonal outer product matrix defined above and the sequence $\{x_t\}$ be defined by the RDA update of (3) with $\psi_t(x) = \langle x, G_t^{1/2} x \rangle$. Then, for any $x^* \in \mathcal{X}$,
$$R_\phi(T) \le \frac{1}{\eta} \max_i \big\| x^*_{[i]} \big\|_2^2 \,\mathrm{tr}(G_T^{1/2}) + \eta \,\mathrm{tr}(G_T^{1/2}).$$

A similar bound holds for composite mirror-descent updates, and it is straightforward to get infimal
equalities similar to those in Corollary 11 with the infimum taken over block-diagonal matrices.
Such an algorithm can interpolate between the computational simplicity of the diagonal proximal
functions and the ability of full matrices to capture correlation in the gradient vectors.
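To make the block-diagonal construction concrete, the following NumPy sketch accumulates the matrix $G_t$ above from a stream of gradients. The function name, the blocking scheme, and the example dimensions are ours, chosen for illustration; the paper itself gives no implementation.

    import numpy as np

    def block_diag_outer_products(grads, blocks):
        # Accumulate G_t = sum_tau blockdiag(g_{tau,[1]} g_{tau,[1]}^T, ...,
        # g_{tau,[k]} g_{tau,[k]}^T) from a list of gradient vectors.
        # grads:  list of arrays of shape (d,)
        # blocks: list of index arrays partitioning {0, ..., d-1} into k blocks
        d = grads[0].shape[0]
        G = np.zeros((d, d))
        for g in grads:
            for idx in blocks:
                gb = g[idx]
                G[np.ix_(idx, idx)] += np.outer(gb, gb)  # off-block entries stay zero
        return G

    # Example: d = 4 split into two blocks of size 2.  The proximal function
    # needs G_t^{1/2}, which can then be computed block by block -- cheaper
    # than a full d x d matrix square root when the blocks are small.
    rng = np.random.default_rng(0)
    grads = [rng.normal(size=4) for _ in range(10)]
    G = block_diag_outer_products(grads, [np.arange(0, 2), np.arange(2, 4)])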
A few open questions stem from this line of research. The first is whether we can efficiently
use full matrices in the proximal functions, as in Section 4. A second open issue is whether non-
Euclidean proximal functions, such as the relative entropy, can be used. We also think that the
strongly convex case—when ft or ϕ is strongly convex—presents interesting challenges that we
have not completely resolved. We hope to investigate both empirical and formal extensions of this
work in the near future.

Acknowledgments

There are many people to whom we owe our sincere thanks for this research. Fernando Pereira
helped push us in the direction of working on adaptive online methods and has been a constant
source of discussion and helpful feedback. Samy Bengio provided us with a processed version of
the ImageNet data set and was instrumental in helping to get our experiments running, and Adam
Sadovsky gave many indispensable coding suggestions. The anonymous reviewers also gave several
suggestions that improved the quality of the paper. Lastly, Sam Roweis was a sounding board for
some of our earlier ideas on the subject, and we will miss him dearly.


Appendix A. Full Matrix Motivating Example


As in the diagonal case, as the adversary we choose $\varepsilon > 0$ and on rounds $t = 1, \ldots, \eta^2/\varepsilon^2$ play the vector $\pm v_1$. After the first $\eta^2/\varepsilon^2$ rounds, the adversary simply cycles through the vectors $v_2, \ldots, v_d$. Thus, for Zinkevich's projected gradient, we have $x_t = \alpha_{t,1} v_1$ for some multiplier $\alpha_{t,1} > 0$ when $t \le \eta^2/\varepsilon^2$. After the first $\eta^2/\varepsilon^2$ rounds, we perform the updates
$$x_{t+1} = \Pi_{\|x\|_2 \le \sqrt{d}}\left( x_t + \frac{\eta}{\sqrt{t}}\, v_i \right)$$
for some index $i$, but as in the diagonal case, $\eta/\sqrt{t} \le \varepsilon$, and by orthogonality of $v_i, v_j$, we have $x_t = V\alpha_t$ for some $\alpha_t \succeq 0$, and the projection step can only shrink the multiplier $\alpha_{t,i}$ for index $i$. Thus, each coordinate incurs loss at least $1/(2\varepsilon)$, and projected gradient descent suffers losses $\Omega(d/\varepsilon)$.
On the other hand, AdaGrad suffers loss at most $d$. Indeed, since $g_1 = v_1$ and $\|v_1\|_2 = 1$, we have $G_1^2 = v_1 v_1^\top v_1 v_1^\top = v_1 v_1^\top = G_1$, so $G_1 = G_1^{1/2} = G_1^\dagger$, and
$$x_2 = x_1 + G_1^\dagger g_1 = x_1 + v_1 v_1^\top v_1 = x_1 + v_1.$$
Since $\langle x_2, v_1 \rangle = 1$, we see that AdaGrad suffers no loss (and $G_t = G_1$) until a vector $z_t = \pm v_i$ for $i \ne 1$ is played by the adversary. However, an identical argument shows that $G_t$ is simply updated to $v_1 v_1^\top + v_i v_i^\top$, in which case $x_t = v_1 + v_i$. Indeed, an inductive argument shows that until all the vectors $v_i$ are seen, we have $\|x_t\|_2 < \sqrt{d}$ by orthogonality, and eventually we have
$$x_t = \sum_{i=1}^d v_i \quad \text{and} \quad \|x_t\|_2 = \sqrt{\sum_{i=1}^d \|v_i\|_2^2} = \sqrt{d},$$
so that $x_t \in \mathcal{X} = \{x : \|x\|_2 \le \sqrt{d}\}$ for AdaGrad for all $t$. All future predictions thus achieve margin 1 and suffer no loss.

Appendix B. Technical Lemmas


Lemma 13 Let A  B  0 be symmetric d × d PSD matrices. Then A1/2  B1/2 .

Proof This is Example 3 of Davis (1963). We include a proof for convenience of the reader.
Let $\lambda$ be any eigenvalue (with corresponding eigenvector $x$) of $A^{1/2} - B^{1/2}$; we show that $\lambda \ge 0$. Clearly $A^{1/2}x - \lambda x = B^{1/2}x$. Taking the inner product of both sides with $A^{1/2}x$, we have $\|A^{1/2}x\|_2^2 - \lambda\langle A^{1/2}x, x\rangle = \langle A^{1/2}x, B^{1/2}x\rangle$. We use the Cauchy-Schwarz inequality:
$$\|A^{1/2}x\|_2^2 - \lambda\langle A^{1/2}x, x\rangle \le \|A^{1/2}x\|_2 \|B^{1/2}x\|_2 = \sqrt{\langle Ax, x\rangle\langle Bx, x\rangle} \le \langle Ax, x\rangle = \|A^{1/2}x\|_2^2,$$
where the last inequality follows from the assumption that $A \succeq B$. Thus we must have $\lambda\langle A^{1/2}x, x\rangle \ge 0$, which implies $\lambda \ge 0$.
The gradient of the function $\mathrm{tr}(X^p)$ is easy to compute for integer values of $p$. However, when $p$ is real we need the following lemma. The lemma tacitly uses the fact that there is a unique positive semidefinite $X^p$ when $X \succeq 0$ (Horn and Johnson, 1985, Theorem 7.2.6).


Lemma 14 Let $p \in \mathbb{R}$ and $X \succ 0$. Then $\nabla_X \mathrm{tr}(X^p) = pX^{p-1}$.

Proof We do a first order expansion of $(X + A)^p$ when $X \succ 0$ and $A$ is symmetric. Let $X = U\Lambda U^\top$ be the symmetric eigen-decomposition of $X$ and $VDV^\top$ be the decomposition of $\Lambda^{-1/2} U^\top A U \Lambda^{-1/2}$. Then
$$(X + A)^p = (U\Lambda U^\top + A)^p = U(\Lambda + U^\top A U)^p U^\top = U\Lambda^{p/2}(I + \Lambda^{-1/2} U^\top A U \Lambda^{-1/2})^p \Lambda^{p/2} U^\top$$
$$= U\Lambda^{p/2} V^\top (I + D)^p V \Lambda^{p/2} U^\top = U\Lambda^{p/2} V^\top (I + pD + o(D)) V \Lambda^{p/2} U^\top$$
$$= U\Lambda^p U^\top + p\, U\Lambda^{p/2} V^\top D V \Lambda^{p/2} U^\top + o(U\Lambda^{p/2} V^\top D V \Lambda^{p/2} U^\top)$$
$$= X^p + p\, U\Lambda^{(p-1)/2} U^\top A U \Lambda^{(p-1)/2} U^\top + o(A) = X^p + pX^{(p-1)/2} A X^{(p-1)/2} + o(A).$$
In the above, $o(A)$ is a matrix that goes to zero faster than $A \to 0$, and the second line follows via a first-order Taylor expansion of $(1 + d_i)^p$. From the above, we immediately have
$$\mathrm{tr}((X + A)^p) = \mathrm{tr}(X^p) + p\,\mathrm{tr}(X^{p-1} A) + o(\mathrm{tr}\, A),$$
which completes the proof.

Appendix C. Proof of Lemma 4


We prove the lemma by considering an arbitrary real-valued sequence $\{a_i\}$ and its vector representation $a_{1:i} = [a_1 \cdots a_i]$. We are next going to show that
$$\sum_{t=1}^T \frac{a_t^2}{\|a_{1:t}\|_2} \le 2\|a_{1:T}\|_2, \qquad (24)$$
where we define $\frac{0}{0} = 0$. We use induction on $T$ to prove inequality (24). For $T = 1$, the inequality trivially holds. Assume the bound (24) holds true for $T - 1$, in which case
$$\sum_{t=1}^T \frac{a_t^2}{\|a_{1:t}\|_2} = \sum_{t=1}^{T-1} \frac{a_t^2}{\|a_{1:t}\|_2} + \frac{a_T^2}{\|a_{1:T}\|_2} \le 2\|a_{1:T-1}\|_2 + \frac{a_T^2}{\|a_{1:T}\|_2},$$
where the inequality follows from the inductive hypothesis. We define $b_T = \sum_{t=1}^T a_t^2$ and use concavity to obtain that $\sqrt{b_T - a_T^2} \le \sqrt{b_T} - a_T^2 \frac{1}{2\sqrt{b_T}}$ so long as $b_T - a_T^2 \ge 0$.² Thus,
$$2\|a_{1:T-1}\|_2 + \frac{a_T^2}{\|a_{1:T}\|_2} = 2\sqrt{b_T - a_T^2} + \frac{a_T^2}{\sqrt{b_T}} \le 2\sqrt{b_T} = 2\|a_{1:T}\|_2.$$
Having proved the bound (24), we note that by construction $s_{t,i} = \|g_{1:t,i}\|_2$, so
$$\sum_{t=1}^T \left\langle g_t, \mathrm{diag}(s_t)^{-1} g_t \right\rangle = \sum_{t=1}^T \sum_{i=1}^d \frac{g_{t,i}^2}{\|g_{1:t,i}\|_2} \le 2\sum_{i=1}^d \|g_{1:T,i}\|_2.$$

2. We note that we use an identical technique in the full-matrix case. See Lemma 8.
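Inequality (24) is also easy to check numerically; the snippet below is a sanity check of ours, not part of the original argument:

    import numpy as np

    # Check sum_t a_t^2 / ||a_{1:t}||_2 <= 2 ||a_{1:T}||_2 (with 0/0 := 0).
    rng = np.random.default_rng(1)
    a = rng.normal(size=1000)
    norms = np.sqrt(np.cumsum(a ** 2))  # ||a_{1:t}||_2 for t = 1, ..., T
    lhs = np.sum(np.divide(a ** 2, norms, out=np.zeros_like(a), where=norms > 0))
    assert lhs <= 2 * norms[-1] + 1e-12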


Appendix D. Proof of Lemmas 8 and 9


We begin with the more difficult proof of Lemma 8.
Proof of Lemma 8 The core of the proof is based on the concavity of the function $\mathrm{tr}(A^{1/2})$. However, careful analysis is required as $A$ might not be strictly positive definite. We also use the previous lemma, which implies that the gradient of $\mathrm{tr}(A^{1/2})$ is $\frac{1}{2}A^{-1/2}$ when $A \succ 0$.
First, $A^p$ is matrix-concave for $A \succ 0$ and $0 \le p \le 1$ (see, for example, Corollary 4.1 in Ando, 1979 or Theorem 16.1 in Bondar, 1994). That is, for $A, B \succ 0$ and $\alpha \in [0, 1]$ we have
$$(\alpha A + (1 - \alpha)B)^p \succeq \alpha A^p + (1 - \alpha)B^p. \qquad (25)$$

Now suppose simply $A, B \succeq 0$ (but neither is necessarily strict). Then for any $\delta > 0$, we have $A + \delta I \succ 0$ and $B + \delta I \succ 0$ and therefore
$$(\alpha(A + \delta I) + (1 - \alpha)(B + \delta I))^p \succeq \alpha(A + \delta I)^p + (1 - \alpha)(B + \delta I)^p \succeq \alpha A^p + (1 - \alpha)B^p,$$
where we used Lemma 13 for the second matrix inequality. Moreover, $\alpha A + (1 - \alpha)B + \delta I \to \alpha A + (1 - \alpha)B$ as $\delta \to 0$. Since $A^p$ is continuous (when we use the unique PSD root), this line of reasoning proves that (25) holds for $A, B \succeq 0$. Thus, we proved that
$$\mathrm{tr}((\alpha A + (1 - \alpha)B)^p) \ge \alpha\,\mathrm{tr}(A^p) + (1 - \alpha)\,\mathrm{tr}(B^p) \quad \text{for } 0 \le p \le 1.$$

Recall now that Lemma 14 implies that the gradient of $\mathrm{tr}(A^{1/2})$ is $\frac{1}{2}A^{-1/2}$ when $A \succ 0$. Therefore, from the concavity of $A^{1/2}$ and the form of its gradient, we can use the standard first-order inequality for concave functions so that for any $A, B \succ 0$,
$$\mathrm{tr}(A^{1/2}) \le \mathrm{tr}(B^{1/2}) + \frac{1}{2}\mathrm{tr}(B^{-1/2}(A - B)). \qquad (26)$$
Let $A = B - \nu gg^\top \succeq 0$ and suppose only that $B \succeq 0$. We must take some care since $B^{-1/2}$ may not necessarily exist, and the above inequality does not hold true in the pseudo-inverse sense when $B \not\succ 0$. However, for any $\delta > 0$ we know that $2\nabla_B \mathrm{tr}((B + \delta I)^{1/2}) = (B + \delta I)^{-1/2}$, and $A - B = -\nu gg^\top$. From (26) and Lemma 13, we have
$$2\,\mathrm{tr}((B - \nu gg^\top)^{1/2}) = 2\,\mathrm{tr}(A^{1/2}) \le 2\,\mathrm{tr}((A + \delta I)^{1/2}) \le 2\,\mathrm{tr}((B + \delta I)^{1/2}) - \nu\,\mathrm{tr}((B + \delta I)^{-1/2} gg^\top). \qquad (27)$$

Note that $g \in \mathrm{Range}(B)$, because if it were not, we could choose some $u$ with $Bu = 0$ and $\langle g, u\rangle \ne 0$, which would give $\langle u, (B - \nu gg^\top)u\rangle = -\nu\langle g, u\rangle^2 < 0$, a contradiction. Now let $B = V\,\mathrm{diag}(\lambda)V^\top$ be the eigen-decomposition of $B$. Since $g \in \mathrm{Range}(B)$,
$$g^\top(B + \delta I)^{-1/2}g = g^\top V\,\mathrm{diag}\left(1/\sqrt{\lambda_i + \delta}\right)V^\top g = \sum_{i:\lambda_i > 0} \frac{1}{\sqrt{\lambda_i + \delta}}(g^\top v_i)^2 \xrightarrow[\delta \downarrow 0]{} \sum_{i:\lambda_i > 0} \lambda_i^{-1/2}(g^\top v_i)^2 = g^\top(B^\dagger)^{1/2}g.$$

Thus, by taking δ ↓ 0 in (27), and since both tr(B + δI)1/2 and tr((B + δI)−1/2 gg⊤ ) are evidently
continuous in δ, we complete the proof.


Proof of Lemma 9 We begin by noting that $\delta^2 I \succeq gg^\top$, so from Lemma 13 we get $(A + gg^\top)^{1/2} \preceq (A + \delta^2 I)^{1/2}$. Since $A$ and $I$ are simultaneously diagonalizable, we can generalize the inequality $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$, which holds for $a, b \ge 0$, to positive semi-definite matrices; thus,
$$(A + \delta^2 I)^{1/2} \preceq A^{1/2} + \delta I.$$

Therefore, if $A + gg^\top$ is of full rank, we have $(A + gg^\top)^{-1/2} \succeq (A^{1/2} + \delta I)^{-1}$ (Horn and Johnson, 1985, Corollary 7.7.4(a)). Since $g \in \mathrm{Range}((A + gg^\top)^{1/2})$, we can apply an analogous limiting argument to the one used in the proof of Lemma 8 and discard all zero eigenvalues of $A + gg^\top$, which completes the lemma.

Appendix E. Solution to Problem (15)


We prove here a technical lemma that is useful in characterizing the solution of the optimization problem below. Note that the second part of the lemma implies that we can treat the inverse of the solution matrix $S^{-1}$ as $S^\dagger$. We consider solving
$$\min_S \mathrm{tr}(S^{-1}A) \quad \text{subject to} \quad S \succeq 0,\ \mathrm{tr}(S) \le c, \quad \text{where } A \succeq 0. \qquad (28)$$

Lemma 15 If $A$ is of full rank, then the minimizer of (28) is $S = cA^{\frac{1}{2}}/\mathrm{tr}(A^{\frac{1}{2}})$. If $A$ is not of full rank, then setting $S = cA^{\frac{1}{2}}/\mathrm{tr}(A^{\frac{1}{2}})$ gives
$$\mathrm{tr}(S^\dagger A) = \inf_S\left\{\mathrm{tr}(S^{-1}A) : S \succeq 0,\ \mathrm{tr}(S) \le c\right\}.$$
In either case, $\mathrm{tr}(S^\dagger A) = \mathrm{tr}(A^{\frac{1}{2}})^2/c$.

Proof Both proofs rely on constructing the Lagrangian for (28). We introduce θ ∈ R+ for the trace
constraint and Z  0 for the positive semidefinite constraint on S. In this case, the Lagrangian is

L (S, θ, Z) = tr(S−1 A) + θ(tr(S) − c) − tr(SZ).

The derivative of L with respect to S is

−S−1 AS−1 + θI − Z. (29)

If $S$ is full rank, then to satisfy the generalized complementarity conditions for the problem (Boyd and Vandenberghe, 2004), we must have $Z = 0$. Therefore, we get $S^{-1}AS^{-1} = \theta I$. We can now multiply by $S$ on the right and the left to get that $A = \theta S^2$, which implies that $S \propto A^{\frac{1}{2}}$. If $A$ is of full rank, the optimal solution for $S \succ 0$ forces $\theta$ to be positive so that $\mathrm{tr}(S) = c$. This yields the solution $S = cA^{\frac{1}{2}}/\mathrm{tr}(A^{\frac{1}{2}})$. In order to verify optimality of this solution, we set $Z = 0$ and $\theta = c^{-2}\,\mathrm{tr}(A^{1/2})^2$, which gives $\nabla_S \mathcal{L}(S, \theta, Z) = 0$, as is indeed required.
Suppose now that $A$ is not full rank and that
$$A = Q\begin{bmatrix}\Lambda & 0\\ 0 & 0\end{bmatrix}Q^\top$$


is the eigen-decomposition of $A$. Let $n$ be the dimension of the null-space of $A$ (so the rank of $A$ is $d - n$). Define the variables
$$Z(\theta) = Q\begin{bmatrix}0 & 0\\ 0 & \theta I\end{bmatrix}Q^\top, \qquad S(\theta, \delta) = \frac{1}{\sqrt{\theta}}\,Q\begin{bmatrix}\Lambda^{\frac{1}{2}} & 0\\ 0 & \delta I\end{bmatrix}Q^\top, \qquad S(\delta) = \frac{c}{\mathrm{tr}(A^{\frac{1}{2}}) + \delta n}\,Q\begin{bmatrix}\Lambda^{\frac{1}{2}} & 0\\ 0 & \delta I\end{bmatrix}Q^\top.$$
It is easy to see that $\mathrm{tr}\,S(\delta) = c$, and
$$\lim_{\delta \to 0} \mathrm{tr}(S(\delta)^{-1}A) = \mathrm{tr}(S(0)^\dagger A) = \mathrm{tr}(A^{\frac{1}{2}})\,\mathrm{tr}(\Lambda^{\frac{1}{2}})/c = \mathrm{tr}(A^{\frac{1}{2}})^2/c.$$

Further, let $g(\theta) = \inf_S \mathcal{L}(S, \theta, Z(\theta))$ be the dual of (28). From the above analysis and (29), it is evident that
$$-S(\theta, \delta)^{-1}AS(\theta, \delta)^{-1} + \theta I - Z(\theta) = -\theta Q\begin{bmatrix}\Lambda^{-\frac{1}{2}}\Lambda\Lambda^{-\frac{1}{2}} & 0\\ 0 & \delta^{-2}I \cdot 0\end{bmatrix}Q^\top + \theta I - Q\begin{bmatrix}0 & 0\\ 0 & \theta I\end{bmatrix}Q^\top = 0.$$

So $S(\theta, \delta)$ achieves the infimum in the dual for any $\delta > 0$, $\mathrm{tr}(S(0)Z(\theta)) = 0$, and
$$g(\theta) = \sqrt{\theta}\,\mathrm{tr}(\Lambda^{\frac{1}{2}}) + \sqrt{\theta}\,\mathrm{tr}(\Lambda^{\frac{1}{2}}) + \sqrt{\theta}\,\delta n - \theta c.$$
Setting $\theta = \mathrm{tr}(\Lambda^{\frac{1}{2}})^2/c^2$ gives $g(\theta) = \mathrm{tr}(\Lambda^{\frac{1}{2}})^2/c - \delta n\,\mathrm{tr}(\Lambda^{\frac{1}{2}})/c$. Taking $\delta \to 0$ gives $g(\theta) = \mathrm{tr}(A^{\frac{1}{2}})^2/c$, which means that $\lim_{\delta \to 0}\mathrm{tr}(S(\delta)^{-1}A) = \mathrm{tr}(A^{\frac{1}{2}})^2/c = g(\theta)$. Thus the duality gap for the original problem is 0, so $S(0)$ is the limiting solution.
The last statement of the lemma follows by simply plugging $S^\dagger = (A^\dagger)^{\frac{1}{2}}\,\mathrm{tr}(A^{\frac{1}{2}})/c$ into the objective being minimized.
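Lemma 15 is likewise easy to verify numerically for a random full-rank $A$. The check below (an illustration of ours) confirms that the closed form $S = cA^{1/2}/\mathrm{tr}(A^{1/2})$ is feasible and attains $\mathrm{tr}(A^{1/2})^2/c$:

    import numpy as np

    rng = np.random.default_rng(2)
    B = rng.normal(size=(5, 5))
    A = B @ B.T + 0.1 * np.eye(5)      # a random positive definite A
    w, V = np.linalg.eigh(A)
    A_half = (V * np.sqrt(w)) @ V.T    # the unique PSD square root of A
    c = 3.0
    S = c * A_half / np.trace(A_half)
    assert abs(np.trace(S) - c) < 1e-10                 # feasibility
    val = np.trace(np.linalg.solve(S, A))               # tr(S^{-1} A)
    assert abs(val - np.trace(A_half) ** 2 / c) < 1e-8  # optimal value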

Appendix F. Proofs of Propositions 2 and 3


We begin with the proof of Proposition 2. The proof essentially builds upon Xiao (2010) and
Nesterov (2009), with some modification to deal with the indexing of ψt . We include the proof for
completeness.
Proof of Proposition 2 Define $\psi_t^*$ to be the conjugate dual of $t\varphi(x) + \psi_t(x)/\eta$:
$$\psi_t^*(g) = \sup_{x \in \mathcal{X}}\left\{\langle g, x\rangle - t\varphi(x) - \frac{1}{\eta}\psi_t(x)\right\}.$$

Since $\psi_t/\eta$ is $1/\eta$-strongly convex with respect to the norm $\|\cdot\|_{\psi_t}$, the function $\psi_t^*$ has $\eta$-Lipschitz continuous gradients with respect to $\|\cdot\|_{\psi_t^*}$:
$$\|\nabla\psi_t^*(g_1) - \nabla\psi_t^*(g_2)\|_{\psi_t} \le \eta\,\|g_1 - g_2\|_{\psi_t^*} \qquad (30)$$
for any $g_1, g_2$ (see, e.g., Nesterov, 2005, Theorem 1 or Hiriart-Urruty and Lemaréchal, 1996, Chapter X). Further, a simple argument with the fundamental theorem of calculus gives that if $f$ has $L$-Lipschitz gradients, $f(y) \le f(x) + \langle\nabla f(x), y - x\rangle + (L/2)\|y - x\|^2$, and
$$\nabla\psi_t^*(g) = \operatorname*{argmin}_{x \in \mathcal{X}}\left\{-\langle g, x\rangle + t\varphi(x) + \frac{1}{\eta}\psi_t(x)\right\}. \qquad (31)$$


Using the bound (30) and identity (31), we can give the proof of the corollary. Indeed, letting $g_t \in \partial f_t(x_t)$ and defining $z_t = \sum_{\tau=1}^t g_\tau$, we have
$$\sum_{t=1}^T f_t(x_t) + \varphi(x_t) - f_t(x^*) - \varphi(x^*) \le \sum_{t=1}^T \left[\langle g_t, x_t - x^*\rangle - \varphi(x^*) + \varphi(x_t)\right]$$
$$\le \sum_{t=1}^T \left[\langle g_t, x_t\rangle + \varphi(x_t)\right] + \sup_{x \in \mathcal{X}}\left\{-\sum_{t=1}^T \langle g_t, x\rangle - T\varphi(x) - \frac{1}{\eta}\psi_T(x)\right\} + \frac{1}{\eta}\psi_T(x^*)$$
$$= \frac{1}{\eta}\psi_T(x^*) + \sum_{t=1}^T \left[\langle g_t, x_t\rangle + \varphi(x_t)\right] + \psi_T^*(-z_T).$$

Since $\psi_{t+1} \ge \psi_t$, it is clear that
$$\psi_T^*(-z_T) = -\sum_{t=1}^T \langle g_t, x_{T+1}\rangle - T\varphi(x_{T+1}) - \frac{1}{\eta}\psi_T(x_{T+1})$$
$$\le -\sum_{t=1}^T \langle g_t, x_{T+1}\rangle - (T-1)\varphi(x_{T+1}) - \varphi(x_{T+1}) - \frac{1}{\eta}\psi_{T-1}(x_{T+1})$$
$$\le \sup_{x \in \mathcal{X}}\left\{-\langle z_T, x\rangle - (T-1)\varphi(x) - \frac{1}{\eta}\psi_{T-1}(x)\right\} - \varphi(x_{T+1}) = \psi_{T-1}^*(-z_T) - \varphi(x_{T+1}).$$

The Lipschitz continuity of $\nabla\psi_t^*$, the identity (31), and the fact that $z_T - z_{T-1} = -g_T$ give
$$\sum_{t=1}^T f_t(x_t) + \varphi(x_{t+1}) - f_t(x^*) - \varphi(x^*)$$
$$\le \frac{1}{\eta}\psi_T(x^*) + \sum_{t=1}^T \left[\langle g_t, x_t\rangle + \varphi(x_{t+1})\right] + \psi_{T-1}^*(-z_T) - \varphi(x_{T+1})$$
$$\le \frac{1}{\eta}\psi_T(x^*) + \sum_{t=1}^T \left[\langle g_t, x_t\rangle + \varphi(x_{t+1})\right] - \varphi(x_{T+1}) + \psi_{T-1}^*(-z_{T-1}) - \left\langle\nabla\psi_{T-1}^*(-z_{T-1}), g_T\right\rangle + \frac{\eta}{2}\|g_T\|_{\psi_{T-1}^*}^2$$
$$= \frac{1}{\eta}\psi_T(x^*) + \sum_{t=1}^{T-1} \left[\langle g_t, x_t\rangle + \varphi(x_{t+1})\right] + \psi_{T-1}^*(-z_{T-1}) + \frac{\eta}{2}\|g_T\|_{\psi_{T-1}^*}^2.$$

We can repeat the same sequence of steps that gave the last equality to see that
$$\sum_{t=1}^T f_t(x_t) + \varphi(x_{t+1}) - f_t(x^*) - \varphi(x^*) \le \frac{1}{\eta}\psi_T(x^*) + \frac{\eta}{2}\sum_{t=1}^T \|g_t\|_{\psi_{t-1}^*}^2 + \psi_0^*(-z_0).$$

Recalling that x1 = argminx∈X {ϕ(x)} and that ψ∗0 (0) = 0 completes the proof.

We now turn to the proof of Proposition 3. We begin by stating and fully proving an (essentially)
immediate corollary to Lemma 2.3 of Duchi et al. (2010).


Lemma 16 Let $\{x_t\}$ be the sequence defined by the update (4) and assume that $B_{\psi_t}(\cdot, \cdot)$ is strongly convex with respect to a norm $\|\cdot\|_{\psi_t}$. Let $\|\cdot\|_{\psi_t^*}$ be the associated dual norm. Then for any $x^*$,
$$\eta\left(f_t(x_t) - f_t(x^*)\right) + \eta\left(\varphi(x_{t+1}) - \varphi(x^*)\right) \le B_{\psi_t}(x^*, x_t) - B_{\psi_t}(x^*, x_{t+1}) + \frac{\eta^2}{2}\|f_t'(x_t)\|_{\psi_t^*}^2.$$

Proof The optimality of $x_{t+1}$ for (4) implies for all $x \in \mathcal{X}$ and $\varphi'(x_{t+1}) \in \partial\varphi(x_{t+1})$ that
$$\left\langle x - x_{t+1},\ \eta f_t'(x_t) + \nabla\psi_t(x_{t+1}) - \nabla\psi_t(x_t) + \eta\varphi'(x_{t+1})\right\rangle \ge 0. \qquad (32)$$

In particular, this obtains for $x = x^*$. From the subgradient inequality for convex functions, we have $f_t(x^*) \ge f_t(x_t) + \langle f_t'(x_t), x^* - x_t\rangle$, or $f_t(x_t) - f_t(x^*) \le \langle f_t'(x_t), x_t - x^*\rangle$, and likewise for $\varphi(x_{t+1})$. We thus have
$$\eta\left[f_t(x_t) + \varphi(x_{t+1}) - f_t(x^*) - \varphi(x^*)\right]$$
$$\le \eta\left\langle x_t - x^*, f_t'(x_t)\right\rangle + \eta\left\langle x_{t+1} - x^*, \varphi'(x_{t+1})\right\rangle$$
$$= \eta\left\langle x_{t+1} - x^*, f_t'(x_t)\right\rangle + \eta\left\langle x_{t+1} - x^*, \varphi'(x_{t+1})\right\rangle + \eta\left\langle x_t - x_{t+1}, f_t'(x_t)\right\rangle$$
$$= \left\langle x^* - x_{t+1},\ \nabla\psi_t(x_t) - \nabla\psi_t(x_{t+1}) - \eta f_t'(x_t) - \eta\varphi'(x_{t+1})\right\rangle$$
$$\quad + \left\langle x^* - x_{t+1},\ \nabla\psi_t(x_{t+1}) - \nabla\psi_t(x_t)\right\rangle + \eta\left\langle x_t - x_{t+1}, f_t'(x_t)\right\rangle.$$



Now, by (32), the first term in the last expression is non-positive. Thus we have that
$$\eta\left[f_t(x_t) + \varphi(x_{t+1}) - f_t(x^*) - \varphi(x^*)\right]$$
$$\le \left\langle x^* - x_{t+1}, \nabla\psi_t(x_{t+1}) - \nabla\psi_t(x_t)\right\rangle + \eta\left\langle x_t - x_{t+1}, f_t'(x_t)\right\rangle$$
$$= B_{\psi_t}(x^*, x_t) - B_{\psi_t}(x_{t+1}, x_t) - B_{\psi_t}(x^*, x_{t+1}) + \left\langle x_t - x_{t+1},\ \eta f_t'(x_t)\right\rangle$$
$$\le B_{\psi_t}(x^*, x_t) - B_{\psi_t}(x_{t+1}, x_t) - B_{\psi_t}(x^*, x_{t+1}) + \frac{1}{2}\|x_t - x_{t+1}\|_{\psi_t}^2 + \frac{\eta^2}{2}\|f_t'(x_t)\|_{\psi_t^*}^2$$
$$\le B_{\psi_t}(x^*, x_t) - B_{\psi_t}(x^*, x_{t+1}) + \frac{\eta^2}{2}\|f_t'(x_t)\|_{\psi_t^*}^2.$$

In the above, the first equality follows from simple algebra with Bregman divergences, the second-to-last inequality follows from Fenchel's inequality applied to the conjugate functions $\frac{1}{2}\|\cdot\|_{\psi_t}^2$ and $\frac{1}{2}\|\cdot\|_{\psi_t^*}^2$ (Boyd and Vandenberghe, 2004, Example 3.27), and the last inequality follows from the assumed strong convexity of $B_{\psi_t}$ with respect to the norm $\|\cdot\|_{\psi_t}$.

Proof of Proposition 3 Sum the inequality in the conclusion of Lemma 16 over $t = 1, \ldots, T$ and divide by $\eta$.

Appendix G. Derivations of Algorithms


In this appendix, we give the formal derivations of the solution to the AdaGrad update for ℓ1-regularization and projection to an ℓ1-ball, as described originally in Section 5.


G.1 ℓ1 -regularization
We give the derivation for the primal-dual subgradient update, as composite mirror-descent is entirely similar. We need to solve update (3), which amounts to
$$\min_x\ \eta\langle \bar{g}_t, x\rangle + \frac{1}{2t}\,\delta\|x\|_2^2 + \frac{1}{2t}\langle x, \mathrm{diag}(s_t)\,x\rangle + \eta\lambda\|x\|_1.$$
Let $\hat{x}$ denote the optimal solution of the above optimization problem. Standard subgradient calculus implies that when $|\bar{g}_{t,i}| \le \lambda$ the solution is $\hat{x}_i = 0$. Similarly, when $\bar{g}_{t,i} < -\lambda$, then $\hat{x}_i > 0$, the objective is differentiable, and the solution is obtained by setting the gradient to zero:
$$\eta\bar{g}_{t,i} + \frac{H_{t,ii}}{t}\hat{x}_i + \eta\lambda = 0, \quad \text{so that} \quad \hat{x}_i = \frac{\eta t}{H_{t,ii}}\left(-\bar{g}_{t,i} - \lambda\right).$$
Likewise, when $\bar{g}_{t,i} > \lambda$ then $\hat{x}_i < 0$, and the solution is $\hat{x}_i = \frac{\eta t}{H_{t,ii}}(-\bar{g}_{t,i} + \lambda)$. Combining the three cases, we obtain the simple update (19) for $x_{t+1,i}$.
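In vectorized form the three cases collapse into a single soft-thresholding step. The following sketch computes the update (19); the function name and the vectorized presentation are ours:

    import numpy as np

    def adagrad_rda_l1_step(g_bar, H_diag, t, eta, lam):
        # Closed-form primal-dual (RDA) AdaGrad step with l1-regularization:
        #   x_{t+1,i} = sign(-gbar_{t,i}) * (eta*t / H_{t,ii}) * max(|gbar_{t,i}| - lam, 0)
        # g_bar:  time-averaged gradient (1/t) sum_tau g_tau, shape (d,)
        # H_diag: diagonal of H_t = delta*I + diag(s_t), strictly positive
        shrink = np.maximum(np.abs(g_bar) - lam, 0.0)  # soft threshold at lam
        return np.sign(-g_bar) * (eta * t / H_diag) * shrink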

G.2 ℓ1 -ball projections


The derivation we give is somewhat terse, and we refer the interested reader to Brucker (1984) or
Pardalos and Rosen (1990) for more depth. Recall that our original problem (20) is symmetric in its
objective and constraints, so we assume without loss of generality that v  0 (otherwise, we reverse
the sign of each negative component in v, then flip the sign of the corresponding component in the
solution vector). This gives
$$\min_z\ \frac{1}{2}\|z - v\|_2^2 \quad \text{s.t.} \quad \langle a, z\rangle \le c,\ z \succeq 0.$$
Clearly, if ha, vi ≤ c the optimal z∗ = v, hence we assume that ha, vi > c. We also assume without loss
of generality that vi /ai ≥ vi+1 /ai+1 for simplicity of our derivation. (We revisit this assumption at
the end of the derivation.) Introducing Lagrange multipliers θ ∈ R+ for the constraint that ha, zi ≤ c
and α ∈ Rd+ for the positivity constraint on z, we get
$$\mathcal{L}(z, \alpha, \theta) = \frac{1}{2}\|z - v\|_2^2 + \theta(\langle a, z\rangle - c) - \langle\alpha, z\rangle.$$
Computing the gradient of L , we have ∇z L (z, α, θ) = z − v + θa − α. Suppose that we knew the
optimal θ∗ ≥ 0. Using the complementarity conditions on z and α for optimality of z (Boyd and
Vandenberghe, 2004), we see that the solution z∗i satisfies

$$z_i^* = \begin{cases} v_i - \theta^* a_i & \text{if } v_i \ge \theta^* a_i \\ 0 & \text{otherwise.} \end{cases}$$

Analogously, the complementarity conditions on $\langle a, z\rangle \le c$ show that given $\theta^*$, we have
$$\sum_{i=1}^d a_i\left[v_i - \theta^* a_i\right]_+ = c \quad \text{or} \quad \sum_{i=1}^d a_i^2\left[\frac{v_i}{a_i} - \theta^*\right]_+ = c.$$

Conversely, had we obtained a value θ ≥ 0 satisfying the above equation, then θ would evidently
induce the optimal z∗ through the equation zi = [vi − θai ]+ .


Now, let $\rho$ be the largest index in $\{1, \ldots, d\}$ such that $v_i - \theta^* a_i > 0$ for $i \le \rho$ and $v_i - \theta^* a_i \le 0$ for $i > \rho$. From the assumption that $v_i/a_i \ge v_{i+1}/a_{i+1}$, we have $v_{\rho+1}/a_{\rho+1} \le \theta^* < v_\rho/a_\rho$. Thus, had we known the last non-zero index $\rho$, we would have obtained
$$\sum_{i=1}^\rho a_i v_i - \frac{v_\rho}{a_\rho}\sum_{i=1}^\rho a_i^2 = \sum_{i=1}^\rho a_i^2\left(\frac{v_i}{a_i} - \frac{v_\rho}{a_\rho}\right) < c,$$
$$\sum_{i=1}^\rho a_i v_i - \frac{v_{\rho+1}}{a_{\rho+1}}\sum_{i=1}^\rho a_i^2 = \sum_{i=1}^{\rho+1} a_i^2\left(\frac{v_i}{a_i} - \frac{v_{\rho+1}}{a_{\rho+1}}\right) \ge c.$$

Given $\rho$ satisfying the above inequalities, we can reconstruct the optimal $\theta^*$ by noting that the latter inequality should equal $c$ exactly when we replace $v_\rho/a_\rho$ with $\theta$, that is,
$$\theta^* = \frac{\sum_{i=1}^\rho a_i v_i - c}{\sum_{i=1}^\rho a_i^2}. \qquad (33)$$

The above derivation results in the following procedure (when $\langle a, v\rangle > c$). We sort $v$ in descending order of $v_i/a_i$ and find the largest index $\rho$ such that $\sum_{i=1}^\rho a_i v_i - (v_\rho/a_\rho)\sum_{i=1}^{\rho-1} a_i^2 < c$. We then reconstruct $\theta^*$ using equality (33) and return the soft-thresholded values of $v_i$ (see Algorithm 3). It is easy to verify that the algorithm can be implemented in $O(d \log d)$ time. A randomized search with bookkeeping (Pardalos and Rosen, 1990) can be straightforwardly used to derive a linear time algorithm.
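A minimal Python sketch of this $O(d \log d)$ procedure, assuming $a \succ 0$, $v \succeq 0$ (after the sign flips described above), and $\langle a, v\rangle > c$; the function name and the exact bookkeeping are ours rather than a transcription of the paper's Algorithm 3:

    import numpy as np

    def project_weighted_l1(v, a, c):
        # Solve min_z 0.5 * ||z - v||_2^2  s.t.  <a, z> <= c, z >= 0.
        if a @ v <= c:
            return v.copy()                    # already feasible
        order = np.argsort(-(v / a))           # sort by v_i / a_i, descending
        av = (a * v)[order]
        a2 = (a ** 2)[order]
        thetas = (np.cumsum(av) - c) / np.cumsum(a2)  # theta candidate per rho, cf. (33)
        # rho is the largest index whose coordinate stays positive, i.e.,
        # v_rho / a_rho > theta(rho); rho = 1 always qualifies since c > 0.
        rho = np.nonzero((v / a)[order] > thetas)[0][-1]
        return np.maximum(v - thetas[rho] * a, 0.0)   # soft-threshold with theta*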

References
J. Abernethy, P. L. Bartlett, A. Rakhlin, and A. Tewari. Optimal strategies and minimax lower
bounds for online convex games. In Proceedings of the Twenty First Annual Conference on
Computational Learning Theory, 2008.

T. Ando. Concavity of certain maps on positive definite matrices and applications to Hadamard
products. Linear Algebra and its Applications, 26:203–241, 1979.

A. Asuncion and D. J. Newman. UCI machine learning repository, 2007. URL http://www.ics.uci.edu/~mlearn/MLRepository.html.

P. Auer and C. Gentile. Adaptive and self-confident online learning algorithms. In Proceedings of
the Thirteenth Annual Conference on Computational Learning Theory, 2000.

P. L. Bartlett, E. Hazan, and A. Rakhlin. Adaptive online gradient descent. In Advances in Neural
Information Processing Systems 20, 2007.

A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex
optimization. Operations Research Letters, 31:167–175, 2003.

J. V. Bondar. Comments on and complements to Inequalities: Theory of Majorization and Its


Applications. Linear Algebra and its Applications, 199:115–129, 1994.

A. Bordes, L. Bottou, and P. Gallinari. SGD-QN: careful quasi-Newton stochastic gradient descent.
Journal of Machine Learning Research, 10:1737–1754, 2009.


S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

P. Brucker. An O(n) algorithm for quadratic knapsack problems. Operations Research Letters, 3
(3):163–166, 1984.

N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning


algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, September 2004.

N. Cesa-Bianchi, A. Conconi, and C. Gentile. A second-order perceptron algorithm. SIAM Journal


on Computing, 34(3):640–668, 2005.

N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction with
expert advice. Machine Learning, 66:321–352, 2007.

K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive aggressive


algorithms. Journal of Machine Learning Research, 7:551–585, 2006.

K. Crammer, M. Dredze, and F. Pereira. Exact convex confidence-weighted learning. In Advances


in Neural Information Processing Systems 22, 2008.

K. Crammer, M. Dredze, and A. Kulesza. Adaptive regularization of weight vectors. In Advances


in Neural Information Processing Systems 23, 2009.

C. Davis. Notions generalizing convexity for functions defined on spaces of matrices. In Proceed-
ings of the Symposia in Pure Mathematics, volume 7, pages 187–201. American Mathematical
Society, 1963.

J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: a large-scale hierarchi-
cal image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2009.

J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal
of Machine Learning Research, 10:2873–2908, 2009.

J. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari. Composite objective mirror descent. In


Proceedings of the Twenty Third Annual Conference on Computational Learning Theory, 2010.

R. Fletcher. A new approach to variable metric algorithms. Computer Journal, 13:317–322, 1970.

D. Grangier and S. Bengio. A discriminative kernel-based model to rank images from text queries.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(8):1371–1384, 2008.

E. Hazan and S. Kale. Extracting certainty from uncertainty: regret bounded by variation in costs.
In Proceedings of the Twenty First Annual Conference on Computational Learning Theory, 2008.

E. Hazan, A. Kalai, S. Kale, and A. Agarwal. Logarithmic regret algorithms for online convex
optimization. In Proceedings of the Nineteenth Annual Conference on Computational Learning
Theory, 2006.

J. B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms II. Springer-
Verlag, 1996.


R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985.


A. Juditsky, A. Nemirovski, and C. Tauvel. Solving variational inequalities with the stochastic
mirror-prox algorithm. http://arxiv.org/abs/0809.0815, 2008.
A. Kalai and S. Vempala. Efficient algorithms for online decision problems. Journal of Computer
and System Sciences, 71(3):291–307, 2003.
G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming
Series A, 2010. Online first; to appear.
D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization
research. Journal of Machine Learning Research, 5:361–397, 2004.
H. B. McMahan and M. Streeter. Adaptive bound optimization for online convex optimization. In
Proceedings of the Twenty Third Annual Conference on Computational Learning Theory, 2010.
A. Nedić. Subgradient Methods for Convex Minimization. PhD thesis, Massachusetts Institute of
Technology, 2002.
A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to
stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
A. S. Nemirovski and D. B. Yudin. Problem Complexity and Efficiency in Optimization. John Wiley
and Sons, 1983.
Y. Nesterov. Smooth minimization of nonsmooth functions. Mathematical Programming, 103:
127–152, 2005.
Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming,
120(1):221–259, 2009.
G. Obozinski, B. Taskar, and M. Jordan. Joint covariate selection for grouped classification. Tech-
nical Report 743, Dept. of Statistics, University of California Berkeley, 2007.
P. M. Pardalos and J. B. Rosen. An algorithm for a singly constrained class of quadratic programs
subject to upper and lower bounds. Mathematical Programming, 46:321–328, 1990.
A. Rakhlin. Lecture notes on online learning. For the Statistical Machine Learning Course at
University of California, Berkeley, 2009.
G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval. Information
Processing and Management, 24(5), 1988.
N. Z. Shor. Utilization of the operation of space dilation in the minimization of convex functions.
Cybernetics and Systems Analysis, 6(1):7–15, 1972. Translated from Kibernetika.
P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. Technical
report, Department of Mathematics, University of Washington, 2008.
L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Tech-
nical Report MSR-TR-2010-23, Microsoft Research, 2010.


M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Pro-
ceedings of the Twentieth International Conference on Machine Learning, 2003.

Journal of Machine Learning Research 15 (2014) 1929-1958 Submitted 11/13; Published 6/14

Dropout: A Simple Way to Prevent Neural Networks from


Overfitting

Nitish Srivastava nitish@cs.toronto.edu


Geoffrey Hinton hinton@cs.toronto.edu
Alex Krizhevsky kriz@cs.toronto.edu
Ilya Sutskever ilya@cs.toronto.edu
Ruslan Salakhutdinov rsalakhu@cs.toronto.edu
Department of Computer Science
University of Toronto
10 Kings College Road, Rm 3302
Toronto, Ontario, M5S 3G4, Canada.

Editor: Yoshua Bengio

Abstract
Deep neural nets with a large number of parameters are very powerful machine learning
systems. However, overfitting is a serious problem in such networks. Large networks are also
slow to use, making it difficult to deal with overfitting by combining the predictions of many
different large neural nets at test time. Dropout is a technique for addressing this problem.
The key idea is to randomly drop units (along with their connections) from the neural
network during training. This prevents units from co-adapting too much. During training,
dropout samples from an exponential number of different “thinned” networks. At test time,
it is easy to approximate the effect of averaging the predictions of all these thinned networks
by simply using a single unthinned network that has smaller weights. This significantly
reduces overfitting and gives major improvements over other regularization methods. We
show that dropout improves the performance of neural networks on supervised learning
tasks in vision, speech recognition, document classification and computational biology,
obtaining state-of-the-art results on many benchmark data sets.
Keywords: neural networks, regularization, model combination, deep learning

1. Introduction
Deep neural networks contain multiple non-linear hidden layers and this makes them very
expressive models that can learn very complicated relationships between their inputs and
outputs. With limited training data, however, many of these complicated relationships
will be the result of sampling noise, so they will exist in the training set but not in real
test data even if it is drawn from the same distribution. This leads to overfitting and many
methods have been developed for reducing it. These include stopping the training as soon as
performance on a validation set starts to get worse, introducing weight penalties of various
kinds such as L1 and L2 regularization and soft weight sharing (Nowlan and Hinton, 1992).
With unlimited computation, the best way to “regularize” a fixed-sized model is to
average the predictions of all possible settings of the parameters, weighting each setting by

© 2014 Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever and Ruslan Salakhutdinov.

(a) Standard Neural Net (b) After applying dropout.

Figure 1: Dropout Neural Net Model. Left: A standard neural net with 2 hidden layers. Right:
An example of a thinned net produced by applying dropout to the network on the left.
Crossed units have been dropped.

its posterior probability given the training data. This can sometimes be approximated quite
well for simple or small models (Xiong et al., 2011; Salakhutdinov and Mnih, 2008), but we
would like to approach the performance of the Bayesian gold standard using considerably
less computation. We propose to do this by approximating an equally weighted geometric
mean of the predictions of an exponential number of learned models that share parameters.
Model combination nearly always improves the performance of machine learning meth-
ods. With large neural networks, however, the obvious idea of averaging the outputs of
many separately trained nets is prohibitively expensive. Combining several models is most
helpful when the individual models are different from each other and in order to make
neural net models different, they should either have different architectures or be trained
on different data. Training many different architectures is hard because finding optimal
hyperparameters for each architecture is a daunting task and training each large network
requires a lot of computation. Moreover, large networks normally require large amounts of
training data and there may not be enough data available to train different networks on
different subsets of the data. Even if one was able to train many different large networks,
using them all at test time is infeasible in applications where it is important to respond
quickly.
Dropout is a technique that addresses both these issues. It prevents overfitting and
provides a way of approximately combining exponentially many different neural network
architectures efficiently. The term “dropout” refers to dropping out units (hidden and
visible) in a neural network. By dropping a unit out, we mean temporarily removing it from
the network, along with all its incoming and outgoing connections, as shown in Figure 1.
The choice of which units to drop is random. In the simplest case, each unit is retained with
a fixed probability p independent of other units, where p can be chosen using a validation
set or can simply be set at 0.5, which seems to be close to optimal for a wide range of
networks and tasks. For the input units, however, the optimal probability of retention is
usually closer to 1 than to 0.5.


(a) At training time (b) At test time

Figure 2: Left: A unit at training time that is present with probability p and is connected to units
in the next layer with weights w. Right: At test time, the unit is always present and
the weights are multiplied by p. The output at test time is the same as the expected output
at training time.

Applying dropout to a neural network amounts to sampling a “thinned” network from


it. The thinned network consists of all the units that survived dropout (Figure 1b). A
neural net with n units can be seen as a collection of 2^n possible thinned neural networks.
These networks all share weights so that the total number of parameters is still O(n^2), or
less. For each presentation of each training case, a new thinned network is sampled and
trained. So training a neural network with dropout can be seen as training a collection of 2^n
thinned networks with extensive weight sharing, where each thinned network gets trained
very rarely, if at all.
At test time, it is not feasible to explicitly average the predictions from exponentially
many thinned models. However, a very simple approximate averaging method works well in
practice. The idea is to use a single neural net at test time without dropout. The weights
of this network are scaled-down versions of the trained weights. If a unit is retained with
probability p during training, the outgoing weights of that unit are multiplied by p at test
time as shown in Figure 2. This ensures that for any hidden unit the expected output (under
the distribution used to drop units at training time) is the same as the actual output at
test time. By doing this scaling, 2^n networks with shared weights can be combined into
a single neural network to be used at test time. We found that training a network with
dropout and using this approximate averaging method at test time leads to significantly
lower generalization error on a wide variety of classification problems compared to training
with other regularization methods.
The idea of dropout is not limited to feed-forward neural nets. It can be more generally
applied to graphical models such as Boltzmann Machines. In this paper, we introduce
the dropout Restricted Boltzmann Machine model and compare it to standard Restricted
Boltzmann Machines (RBM). Our experiments show that dropout RBMs are better than
standard RBMs in certain respects.
This paper is structured as follows. Section 2 describes the motivation for this idea.
Section 3 describes relevant previous work. Section 4 formally describes the dropout model.
Section 5 gives an algorithm for training dropout networks. In Section 6, we present our
experimental results where we apply dropout to problems in different domains and compare
it with other forms of regularization and model combination. Section 7 analyzes the effect of
dropout on different properties of a neural network and describes how dropout interacts with
the network’s hyperparameters. Section 8 describes the Dropout RBM model. In Section 9
we explore the idea of marginalizing dropout. In Appendix A we present a practical guide


for training dropout nets. This includes a detailed analysis of the practical considerations
involved in choosing hyperparameters when training dropout networks.

2. Motivation
A motivation for dropout comes from a theory of the role of sex in evolution (Livnat et al.,
2010). Sexual reproduction involves taking half the genes of one parent and half of the
other, adding a very small amount of random mutation, and combining them to produce an
offspring. The asexual alternative is to create an offspring with a slightly mutated copy of
the parent’s genes. It seems plausible that asexual reproduction should be a better way to
optimize individual fitness because a good set of genes that have come to work well together
can be passed on directly to the offspring. On the other hand, sexual reproduction is likely
to break up these co-adapted sets of genes, especially if these sets are large and, intuitively,
this should decrease the fitness of organisms that have already evolved complicated co-
adaptations. However, sexual reproduction is the way most advanced organisms evolved.
One possible explanation for the superiority of sexual reproduction is that, over the long
term, the criterion for natural selection may not be individual fitness but rather mix-ability
of genes. The ability of a set of genes to be able to work well with another random set of
genes makes them more robust. Since a gene cannot rely on a large set of partners to be
present at all times, it must learn to do something useful on its own or in collaboration with
a small number of other genes. According to this theory, the role of sexual reproduction
is not just to allow useful new genes to spread throughout the population, but also to
facilitate this process by reducing complex co-adaptations that would reduce the chance of
a new gene improving the fitness of an individual. Similarly, each hidden unit in a neural
network trained with dropout must learn to work with a randomly chosen sample of other
units. This should make each hidden unit more robust and drive it towards creating useful
features on its own without relying on other hidden units to correct its mistakes. However,
the hidden units within a layer will still learn to do different things from each other. One
might imagine that the net would become robust against dropout by making many copies
of each hidden unit, but this is a poor solution for exactly the same reason as replica codes
are a poor way to deal with a noisy channel.
A closely related, but slightly different motivation for dropout comes from thinking
about successful conspiracies. Ten conspiracies each involving five people is probably a
better way to create havoc than one big conspiracy that requires fifty people to all play
their parts correctly. If conditions do not change and there is plenty of time for rehearsal, a
big conspiracy can work well, but with non-stationary conditions, the smaller the conspiracy
the greater its chance of still working. Complex co-adaptations can be trained to work well
on a training set, but on novel test data they are far more likely to fail than multiple simpler
co-adaptations that achieve the same thing.

3. Related Work
Dropout can be interpreted as a way of regularizing a neural network by adding noise to
its hidden units. The idea of adding noise to the states of units has previously been used in
the context of Denoising Autoencoders (DAEs) by Vincent et al. (2008, 2010) where noise


is added to the input units of an autoencoder and the network is trained to reconstruct the
noise-free input. Our work extends this idea by showing that dropout can be effectively
applied in the hidden layers as well and that it can be interpreted as a form of model
averaging. We also show that adding noise is not only useful for unsupervised feature
learning but can also be extended to supervised learning problems. In fact, our method can
be applied to other neuron-based architectures, for example, Boltzmann Machines. While
5% noise typically works best for DAEs, we found that our weight scaling procedure applied
at test time enables us to use much higher noise levels. Dropping out 20% of the input units
and 50% of the hidden units was often found to be optimal.
Since dropout can be seen as a stochastic regularization technique, it is natural to
consider its deterministic counterpart which is obtained by marginalizing out the noise. In
this paper, we show that, in simple cases, dropout can be analytically marginalized out
to obtain deterministic regularization methods. Recently, van der Maaten et al. (2013)
also explored deterministic regularizers corresponding to different exponential-family noise
distributions, including dropout (which they refer to as “blankout noise”). However, they
apply noise to the inputs and only explore models with no hidden layers. Wang and Manning
(2013) proposed a method for speeding up dropout by marginalizing dropout noise. Chen
et al. (2012) explored marginalization in the context of denoising autoencoders.
In dropout, we minimize the loss function stochastically under a noise distribution.
This can be seen as minimizing an expected loss function. Previous work of Globerson and
Roweis (2006); Dekel et al. (2010) explored an alternate setting where the loss is minimized
when an adversary gets to pick which units to drop. Here, instead of a noise distribution,
the maximum number of units that can be dropped is fixed. However, this work also does
not explore models with hidden units.

4. Model Description
This section describes the dropout neural network model. Consider a neural network with
L hidden layers. Let l ∈ {1, . . . , L} index the hidden layers of the network. Let z(l) denote
the vector of inputs into layer l, y(l) denote the vector of outputs from layer l (y(0) = x is
the input). W (l) and b(l) are the weights and biases at layer l. The feed-forward operation
of a standard neural network (Figure 3a) can be described as (for l ∈ {0, . . . , L − 1} and
any hidden unit i)

$$z_i^{(l+1)} = \mathbf{w}_i^{(l+1)}\,\mathbf{y}^{(l)} + b_i^{(l+1)},$$
$$y_i^{(l+1)} = f(z_i^{(l+1)}),$$
where $f$ is any activation function, for example, $f(x) = 1/(1 + \exp(-x))$.


With dropout, the feed-forward operation becomes (Figure 3b)
$$r_j^{(l)} \sim \mathrm{Bernoulli}(p),$$
$$\tilde{\mathbf{y}}^{(l)} = \mathbf{r}^{(l)} * \mathbf{y}^{(l)},$$
$$z_i^{(l+1)} = \mathbf{w}_i^{(l+1)}\,\tilde{\mathbf{y}}^{(l)} + b_i^{(l+1)},$$
$$y_i^{(l+1)} = f(z_i^{(l+1)}).$$



(a) Standard network (b) Dropout network


Figure 3: Comparison of the basic operations of a standard and dropout network.
Here $*$ denotes an element-wise product. For any layer $l$, $\mathbf{r}^{(l)}$ is a vector of independent Bernoulli random variables each of which has probability $p$ of being 1. This vector is sampled and multiplied element-wise with the outputs of that layer, $\mathbf{y}^{(l)}$, to create the thinned outputs $\tilde{\mathbf{y}}^{(l)}$. The thinned outputs are then used as input to the next layer. This process is applied at each layer. This amounts to sampling a sub-network from a larger network. For learning, the derivatives of the loss function are backpropagated through the sub-network. At test time, the weights are scaled as $W_{\mathrm{test}}^{(l)} = pW^{(l)}$ as shown in Figure 2. The resulting neural network is used without dropout.
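As a concrete illustration, here is a minimal NumPy sketch of the forward pass just described. The function name and shapes are ours; at test time we scale the activations by p rather than the weights, which is equivalent, for the next layer's pre-activations, to the weight scaling of Figure 2:

    import numpy as np

    def dropout_forward(y, p, train):
        # Apply dropout to the outputs y of a layer.
        # train=True : sample r ~ Bernoulli(p) per unit, return thinned outputs r * y.
        # train=False: return the expected output p * y.
        if train:
            r = (np.random.rand(*y.shape) < p).astype(y.dtype)
            return r * y
        return p * y

    # One hidden layer with the logistic activation f used in the example above.
    f = lambda z: 1.0 / (1.0 + np.exp(-z))
    W, b = 0.01 * np.random.randn(128, 64), np.zeros(128)
    y_prev = np.random.randn(64)
    y_train = dropout_forward(f(W @ y_prev + b), p=0.5, train=True)   # training pass
    y_test = dropout_forward(f(W @ y_prev + b), p=0.5, train=False)   # test pass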

5. Learning Dropout Nets


This section describes a procedure for training dropout neural nets.

5.1 Backpropagation
Dropout neural networks can be trained using stochastic gradient descent in a manner simi-
lar to standard neural nets. The only difference is that for each training case in a mini-batch,
we sample a thinned network by dropping out units. Forward and backpropagation for that
training case are done only on this thinned network. The gradients for each parameter are
averaged over the training cases in each mini-batch. Any training case which does not use a
parameter contributes a gradient of zero for that parameter. Many methods have been used
to improve stochastic gradient descent such as momentum, annealed learning rates and L2
weight decay. Those were found to be useful for dropout neural networks as well.
One particular form of regularization was found to be especially useful for dropout—
constraining the norm of the incoming weight vector at each hidden unit to be upper
bounded by a fixed constant c. In other words, if w represents the vector of weights incident
on any hidden unit, the neural network was optimized under the constraint ||w||2 ≤ c. This
constraint was imposed during optimization by projecting w onto the surface of a ball of
radius c, whenever w went out of it. This is also called max-norm regularization since it
implies that the maximum value that the norm of any weight can take is c. The constant


c is a tunable hyperparameter, which is determined using a validation set. Max-norm


regularization has been previously used in the context of collaborative filtering (Srebro and
Shraibman, 2005). It typically improves the performance of stochastic gradient descent
training of deep neural nets, even when no dropout is used.
Although dropout alone gives significant improvements, using dropout along with max-
norm regularization, large decaying learning rates and high momentum provides a significant
boost over just using dropout. A possible justification is that constraining weight vectors
to lie inside a ball of fixed radius makes it possible to use a huge learning rate without the
possibility of weights blowing up. The noise provided by dropout then allows the optimiza-
tion process to explore different regions of the weight space that would have otherwise been
difficult to reach. As the learning rate decays, the optimization takes shorter steps, thereby
doing less exploration and eventually settles into a minimum.
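A sketch of the max-norm projection described above, applied after each gradient step (the function name and the value c = 3.0 in the usage comment are illustrative; the paper tunes c on a validation set):

    import numpy as np

    def max_norm_project(W, c):
        # Project each hidden unit's incoming weight vector onto the ball
        # ||w||_2 <= c.  W has shape (n_units, n_inputs): row i holds the
        # weights incident on unit i.  Rows already inside the ball are untouched.
        norms = np.linalg.norm(W, axis=1, keepdims=True)
        scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
        return W * scale

    # Typical use after each SGD update:
    #   W = max_norm_project(W - learning_rate * grad_W, c=3.0)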

5.2 Unsupervised Pretraining


Neural networks can be pretrained using stacks of RBMs (Hinton and Salakhutdinov, 2006),
autoencoders (Vincent et al., 2010) or Deep Boltzmann Machines (Salakhutdinov and Hin-
ton, 2009). Pretraining is an effective way of making use of unlabeled data. Pretraining
followed by finetuning with backpropagation has been shown to give significant performance
boosts over finetuning from random initializations in certain cases.
Dropout can be applied to finetune nets that have been pretrained using these tech-
niques. The pretraining procedure stays the same. The weights obtained from pretraining
should be scaled up by a factor of 1/p. This makes sure that for each unit, the expected
output from it under random dropout will be the same as the output during pretraining.
We were initially concerned that the stochastic nature of dropout might wipe out the in-
formation in the pretrained weights. This did happen when the learning rates used during
finetuning were comparable to the best learning rates for randomly initialized nets. How-
ever, when the learning rates were chosen to be smaller, the information in the pretrained
weights seemed to be retained and we were able to get improvements in terms of the final
generalization error compared to not using dropout when finetuning.
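A short sketch of the rescaling just described (the variable names are illustrative: one weight matrix and one retention probability per layer):

    import numpy as np

    weights = [np.random.randn(100, 50), np.random.randn(10, 100)]  # pretrained weights
    retain_p = [0.8, 0.5]  # dropout retention probability below each layer

    # Scale pretrained weights up by 1/p so each unit's expected input under
    # random dropout matches its input during pretraining.
    for l in range(len(weights)):
        weights[l] *= 1.0 / retain_p[l]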

6. Experimental Results
We trained dropout neural networks for classification problems on data sets in different
domains. We found that dropout improved generalization performance on all data sets
compared to neural networks that did not use dropout. Table 1 gives a brief description of
the data sets. The data sets are
• MNIST : A standard toy data set of handwritten digits.
• TIMIT : A standard speech benchmark for clean speech recognition.
• CIFAR-10 and CIFAR-100 : Tiny natural images (Krizhevsky, 2009).
• Street View House Numbers data set (SVHN) : Images of house numbers collected by
Google Street View (Netzer et al., 2011).
• ImageNet : A large collection of natural images.
• Reuters-RCV1 : A collection of Reuters newswire articles.


• Alternative Splicing data set: RNA features for predicting alternative gene splicing
(Xiong et al., 2011).

We chose a diverse set of data sets to demonstrate that dropout is a general technique
for improving neural nets and is not specific to any particular application domain. In this
section, we present some key results that show the effectiveness of dropout. A more detailed
description of all the experiments and data sets is provided in Appendix B.

Data Set Domain Dimensionality Training Set Test Set


MNIST Vision 784 (28 × 28 grayscale) 60K 10K
SVHN Vision 3072 (32 × 32 color) 600K 26K
CIFAR-10/100 Vision 3072 (32 × 32 color) 60K 10K
ImageNet (ILSVRC-2012) Vision 65536 (256 × 256 color) 1.2M 150K
TIMIT Speech 2520 (120-dim, 21 frames) 1.1M frames 58K frames
Reuters-RCV1 Text 2000 200K 200K
Alternative Splicing Genetics 1014 2932 733

Table 1: Overview of the data sets used in this paper.

6.1 Results on Image Data Sets


We used five image data sets to evaluate dropout—MNIST, SVHN, CIFAR-10, CIFAR-100
and ImageNet. These data sets include different image types and training set sizes. Models
which achieve state-of-the-art results on all of these data sets use dropout.

6.1.1 MNIST

Method  |  Unit Type  |  Architecture  |  Error %
Standard Neural Net (Simard et al., 2003)  |  Logistic  |  2 layers, 800 units  |  1.60
SVM Gaussian kernel  |  NA  |  NA  |  1.40
Dropout NN  |  Logistic  |  3 layers, 1024 units  |  1.35
Dropout NN  |  ReLU  |  3 layers, 1024 units  |  1.25
Dropout NN + max-norm constraint  |  ReLU  |  3 layers, 1024 units  |  1.06
Dropout NN + max-norm constraint  |  ReLU  |  3 layers, 2048 units  |  1.04
Dropout NN + max-norm constraint  |  ReLU  |  2 layers, 4096 units  |  1.01
Dropout NN + max-norm constraint  |  ReLU  |  2 layers, 8192 units  |  0.95
Dropout NN + max-norm constraint (Goodfellow et al., 2013)  |  Maxout  |  2 layers, (5 × 240) units  |  0.94
DBN + finetuning (Hinton and Salakhutdinov, 2006)  |  Logistic  |  500-500-2000  |  1.18
DBM + finetuning (Salakhutdinov and Hinton, 2009)  |  Logistic  |  500-500-2000  |  0.96
DBN + dropout finetuning  |  Logistic  |  500-500-2000  |  0.92
DBM + dropout finetuning  |  Logistic  |  500-500-2000  |  0.79

Table 2: Comparison of different models on MNIST.


The MNIST data set consists of 28 × 28 pixel handwritten digit images. The task is
to classify the images into 10 digit classes. Table 2 compares the performance of dropout
with other techniques. The best performing neural networks for the permutation invariant


setting that do not use dropout or unsupervised pretraining achieve an error of about
1.60% (Simard et al., 2003). With dropout the error reduces to 1.35%. Replacing logistic
units with rectified linear units (ReLUs) (Jarrett et al., 2009) further reduces the error to
1.25%. Adding max-norm regularization again reduces it to 1.06%. Increasing the size of
the network leads to better results. A neural net with 2 layers and 8192 units per layer
gets down to 0.95% error. Note that this network has more than 65 million parameters and
is being trained on a data set of size 60,000. Training a network of this size to give good
generalization error is very hard with standard regularization methods and early stopping.
Dropout, on the other hand, prevents overfitting, even in this case. It does not even need
early stopping. Goodfellow et al. (2013) showed that results can be further improved to
0.94% by replacing ReLU units with maxout units. All dropout nets use p = 0.5 for hidden
units and p = 0.8 for input units. More experimental details can be found in Appendix B.1.
Dropout nets pretrained with stacks of RBMs and Deep Boltzmann Machines also give
improvements as shown in Table 2. DBM-pretrained dropout nets achieve a test error of
0.79%, which is the best performance ever reported for the permutation invariant setting.
We note that it is possible to obtain better results by using 2-D spatial information and
augmenting the training set with distorted versions of images from the standard training
set. We demonstrate the effectiveness of dropout in that setting on more interesting data
sets.
In order to test the robustness of dropout, classification experiments were done with networks of many different architectures keeping all hyperparameters, including p, fixed. Figure 4 shows the test error rates obtained for these different architectures as training progresses. The same architectures trained with and without dropout have drastically different test errors, as seen by the two separate clusters of trajectories. Dropout gives a huge improvement across all architectures, without using hyperparameters that were tuned specifically for each architecture.

Figure 4: Test error for different architectures with and without dropout. The networks have 2 to 4 hidden layers each with 1024 to 2048 units.

6.1.2 Street View House Numbers

The Street View House Numbers (SVHN) Data Set (Netzer et al., 2011) consists of color images of house numbers collected by
Google Street View. Figure 5a shows some examples of images from this data set. The
part of the data set that we use in our experiments consists of 32 × 32 color images roughly
centered on a digit in a house number. The task is to identify that digit.
For this data set, we applied dropout to convolutional neural networks (LeCun et al.,
1989). The best architecture that we found has three convolutional layers followed by 2
fully connected hidden layers. All hidden units were ReLUs. Each convolutional layer was


Method Error %
Binary Features (WDCH) (Netzer et al., 2011) 36.7
HOG (Netzer et al., 2011) 15.0
Stacked Sparse Autoencoders (Netzer et al., 2011) 10.3
KMeans (Netzer et al., 2011) 9.4
Multi-stage Conv Net with average pooling (Sermanet et al., 2012) 9.06
Multi-stage Conv Net + L2 pooling (Sermanet et al., 2012) 5.36
Multi-stage Conv Net + L4 pooling + padding (Sermanet et al., 2012) 4.90
Conv Net + max-pooling 3.95
Conv Net + max pooling + dropout in fully connected layers 3.02
Conv Net + stochastic pooling (Zeiler and Fergus, 2013) 2.80
Conv Net + max pooling + dropout in all layers 2.55
Conv Net + maxout (Goodfellow et al., 2013) 2.47
Human Performance 2.0

Table 3: Results on the Street View House Numbers data set.

followed by a max-pooling layer. Appendix B.2 describes the architecture in more detail.
Dropout was applied to all the layers of the network with the probability of retaining a hid-
den unit being p = (0.9, 0.75, 0.75, 0.5, 0.5, 0.5) for the different layers of the network (going
from input to convolutional layers to fully connected layers). Max-norm regularization was
used for weights in both convolutional and fully connected layers. Table 3 compares the
results obtained by different methods. We find that convolutional nets outperform other
methods. The best performing convolutional nets that do not use dropout achieve an error
rate of 3.95%. Adding dropout only to the fully connected layers reduces the error to 3.02%.
Adding dropout to the convolutional layers as well further reduces the error to 2.55%. Even
more gains can be obtained by using maxout units.
The additional gain in performance obtained by adding dropout in the convolutional
layers (3.02% to 2.55%) is worth noting. One may have presumed that since the convo-
lutional layers don’t have a lot of parameters, overfitting is not a problem and therefore
dropout would not have much effect. However, dropout in the lower layers still helps be-
cause it provides noisy inputs for the higher fully connected layers which prevents them
from overfitting.

6.1.3 CIFAR-10 and CIFAR-100

The CIFAR-10 and CIFAR-100 data sets consist of 32 × 32 color images drawn from 10
and 100 categories respectively. Figure 5b shows some examples of images from this data
set. A detailed description of the data sets, input preprocessing, network architectures and
other experimental details is given in Appendix B.3. Table 4 shows the error rate obtained
by different methods on these data sets. Without any data augmentation, Snoek et al.
(2012) used Bayesian hyperparameter optimization to obtain an error rate of 14.98% on
CIFAR-10. Using dropout in the fully connected layers reduces that to 14.32% and adding
dropout in every layer further reduces the error to 12.61%. Goodfellow et al. (2013) showed
that the error is further reduced to 11.68% by replacing ReLU units with maxout units. On
CIFAR-100, dropout reduces the error from 43.48% to 37.20% which is a huge improvement.
No data augmentation was used for either data set (apart from the input dropout).


(a) Street View House Numbers (SVHN) (b) CIFAR-10

Figure 5: Samples from image data sets. Each row corresponds to a different category.

Method CIFAR-10 CIFAR-100


Conv Net + max pooling (hand tuned) 15.60 43.48
Conv Net + stochastic pooling (Zeiler and Fergus, 2013) 15.13 42.51
Conv Net + max pooling (Snoek et al., 2012) 14.98 -
Conv Net + max pooling + dropout fully connected layers 14.32 41.26
Conv Net + max pooling + dropout in all layers 12.61 37.20
Conv Net + maxout (Goodfellow et al., 2013) 11.68 38.57

Table 4: Error rates on CIFAR-10 and CIFAR-100.

6.1.4 ImageNet

ImageNet is a data set of over 15 million labeled high-resolution images belonging to roughly
22,000 categories. Starting in 2010, as part of the Pascal Visual Object Challenge, an annual
competition called the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has
been held. A subset of ImageNet with roughly 1000 images in each of 1000 categories is
used in this challenge. Since the number of categories is rather large, it is conventional to
report two error rates: top-1 and top-5, where the top-5 error rate is the fraction of test
images for which the correct label is not among the five labels considered most probable by
the model. Figure 6 shows some predictions made by our model on a few test images.

ILSVRC-2010 is the only version of ILSVRC for which the test set labels are available, so
most of our experiments were performed on this data set. Table 5 compares the performance
of different methods. Convolutional nets with dropout outperform other methods by a large
margin. The architecture and implementation details are described in detail in Krizhevsky
et al. (2012).


Figure 6: Some ImageNet test cases with the 4 most probable labels as predicted by our model.
The length of the horizontal bars is proportional to the probability assigned to the labels
by the model. Pink indicates ground truth.

Model Top-1 Top-5


Sparse Coding (Lin et al., 2010) 47.1 28.2
SIFT + Fisher Vectors (Sanchez and Perronnin, 2011) 45.7 25.7
Conv Net + dropout (Krizhevsky et al., 2012) 37.5 17.0

Table 5: Results on the ILSVRC-2010 test set.


Model Top-1 (val) Top-5 (val) Top-5 (test)
SVM on Fisher Vectors of Dense SIFT and Color Statistics - - 27.3
Avg of classifiers over FVs of SIFT, LBP, GIST and CSIFT - - 26.2
Conv Net + dropout (Krizhevsky et al., 2012) 40.7 18.2 -
Avg of 5 Conv Nets + dropout (Krizhevsky et al., 2012) 38.1 16.4 16.4

Table 6: Results on the ILSVRC-2012 validation/test set.


Our model based on convolutional nets and dropout won the ILSVRC-2012 competition.
Since the labels for the test set are not available, we report our results on the test set for
the final submission and include the validation set results for different variations of our
model. Table 6 shows the results from the competition. While the best methods based on
standard vision features achieve a top-5 error rate of about 26%, convolutional nets with
dropout achieve a test error of about 16% which is a staggering difference. Figure 6 shows
some examples of predictions made by our model. We can see that the model makes very
reasonable predictions, even when its best guess is not correct.

6.2 Results on TIMIT


Next, we applied dropout to a speech recognition task. We use the TIMIT data set which
consists of recordings from 680 speakers covering 8 major dialects of American English
reading ten phonetically-rich sentences in a controlled noise-free environment. Dropout
neural networks were trained on windows of 21 log-filter bank frames to predict the label
of the central frame. No speaker dependent operations were performed. Appendix B.4
describes the data preprocessing and training details. Table 7 compares dropout neural


nets with other models. A 6-layer net gives a phone error rate of 23.4%. Dropout further
improves it to 21.8%. We also trained dropout nets starting from pretrained weights. A
4-layer net pretrained with a stack of RBMs gets a phone error rate of 22.7%. With dropout,
this reduces to 19.7%. Similarly, for an 8-layer net the error reduces from 20.5% to 19.7%.

Method Phone Error Rate%


NN (6 layers) (Mohamed et al., 2010) 23.4
Dropout NN (6 layers) 21.8
DBN-pretrained NN (4 layers) 22.7
DBN-pretrained NN (6 layers) (Mohamed et al., 2010) 22.4
DBN-pretrained NN (8 layers) (Mohamed et al., 2010) 20.7
mcRBM-DBN-pretrained NN (5 layers) (Dahl et al., 2010) 20.5
DBN-pretrained NN (4 layers) + dropout 19.7
DBN-pretrained NN (8 layers) + dropout 19.7

Table 7: Phone error rate on the TIMIT core test set.

6.3 Results on a Text Data Set


To test the usefulness of dropout in the text domain, we used dropout networks to train a
document classifier. We used a subset of the Reuters-RCV1 data set which is a collection of
over 800,000 newswire articles from Reuters. These articles cover a variety of topics. The
task is to take a bag of words representation of a document and classify it into 50 disjoint
topics. Appendix B.5 describes the setup in more detail. Our best neural net which did
not use dropout obtained an error rate of 31.05%. Adding dropout reduced the error to
29.62%. We found that the improvement was much smaller compared to that for the vision
and speech data sets.

6.4 Comparison with Bayesian Neural Networks


Dropout can be seen as a way of doing an equally-weighted averaging of exponentially many
models with shared weights. On the other hand, Bayesian neural networks (Neal, 1996) are
the proper way of doing model averaging over the space of neural network structures and
parameters. In dropout, each model is weighted equally, whereas in a Bayesian neural
network each model is weighted taking into account the prior and how well the model fits
the data, which is the more correct approach. Bayesian neural nets are extremely useful for
solving problems in domains where data is scarce such as medical diagnosis, genetics, drug
discovery and other computational biology applications. However, Bayesian neural nets are
slow to train and difficult to scale to very large network sizes. Besides, it is expensive to
get predictions from many large nets at test time. On the other hand, dropout neural nets
are much faster to train and use at test time. In this section, we report experiments that
compare Bayesian neural nets with dropout neural nets on a small data set where Bayesian
neural networks are known to perform well and obtain state-of-the-art results. The aim is
to analyze how much dropout loses compared to Bayesian neural nets.
The data set that we use (Xiong et al., 2011) comes from the domain of genetics. The
task is to predict the occurrence of alternative splicing based on RNA features. Alternative
splicing is a significant cause of cellular diversity in mammalian tissues. Predicting the


Method Code Quality (bits)


Neural Network (early stopping) (Xiong et al., 2011) 440
Regression, PCA (Xiong et al., 2011) 463
SVM, PCA (Xiong et al., 2011) 487
Neural Network with dropout 567
Bayesian Neural Network (Xiong et al., 2011) 623

Table 8: Results on the Alternative Splicing Data Set.


occurrence of alternative splicing in certain tissues under different conditions is important for
understanding many human diseases. Given the RNA features, the task is to predict the
probability of three splicing related events that biologists care about. The evaluation metric
is Code Quality, which is a measure of the negative KL divergence between the target and
the predicted probability distributions (higher is better). Appendix B.6 includes a detailed
description of the data set and this performance metric.
Table 8 summarizes the performance of different models on this data set. Xiong et al.
(2011) used Bayesian neural nets for this task. As expected, we found that Bayesian neural
nets perform better than dropout. However, we see that dropout improves significantly
upon the performance of standard neural nets and outperforms all other methods. The
challenge in this data set is to prevent overfitting since the size of the training set is small.
One way to prevent overfitting is to reduce the input dimensionality using PCA. Thereafter,
standard techniques such as SVMs or logistic regression can be used. However, with dropout
we were able to prevent overfitting without the need to do dimensionality reduction. The
dropout nets are very large (1000s of hidden units) compared to a few tens of units in the
Bayesian network. This shows that dropout has a strong regularizing effect.

6.5 Comparison with Standard Regularizers


Several regularization methods have been proposed for preventing overfitting in neural net-
works. These include L2 weight decay (more generally Tikhonov regularization (Tikhonov,
1943)), lasso (Tibshirani, 1996), KL-sparsity and max-norm regularization. Dropout can
be seen as another way of regularizing neural networks. In this section we compare dropout
with some of these regularization methods using the MNIST data set.
The same network architecture (784-1024-1024-2048-10) with ReLUs was trained us-
ing stochastic gradient descent with different regularizations. Table 9 shows the results.
The values of different hyperparameters associated with each kind of regularization (decay
constants, target sparsity, dropout rate, max-norm upper bound) were obtained using a
validation set. We found that dropout combined with max-norm regularization gives the
lowest generalization error.

7. Salient Features
The experiments described in the previous section provide strong evidence that dropout
is a useful technique for improving neural networks. In this section, we closely examine
how dropout affects a neural network. We analyze the effect of dropout on the quality of
features produced. We see how dropout affects the sparsity of hidden unit activations. We


Method Test Classification error %


L2 1.62
L2 + L1 applied towards the end of training 1.60
L2 + KL-sparsity 1.55
Max-norm 1.35
Dropout + L2 1.25
Dropout + Max-norm 1.05

Table 9: Comparison of different regularization methods on MNIST.

also see how the advantages obtained from dropout vary with the probability of retaining
units, size of the network and the size of the training set. These observations give some
insight into why dropout works so well.

7.1 Effect on Features

(a) Without dropout (b) Dropout with p = 0.5.

Figure 7: Features learned on MNIST with one hidden layer autoencoders having 256 rectified
linear units.

In a standard neural network, the derivative received by each parameter tells it how it
should change so the final loss function is reduced, given what all other units are doing.
Therefore, units may change in a way that they fix up the mistakes of the other units.
This may lead to complex co-adaptations. This in turn leads to overfitting because these
co-adaptations do not generalize to unseen data. We hypothesize that for each hidden unit,
dropout prevents co-adaptation by making the presence of other hidden units unreliable.
Therefore, a hidden unit cannot rely on other specific units to correct its mistakes. It must
perform well in a wide variety of different contexts provided by the other hidden units. To
observe this effect directly, we look at the first level features learned by neural networks
trained on visual tasks with and without dropout.


Figure 7a shows features learned by an autoencoder on MNIST with a single hidden
layer of 256 rectified linear units without dropout. Figure 7b shows the features learned by
an identical autoencoder which used dropout in the hidden layer with p = 0.5. Both au-
toencoders had similar test reconstruction errors. However, it is apparent that the features
shown in Figure 7a have co-adapted in order to produce good reconstructions. Each hidden
unit on its own does not seem to be detecting a meaningful feature. On the other hand, in
Figure 7b, the hidden units seem to detect edges, strokes and spots in different parts of the
image. This shows that dropout does break up co-adaptations, which is probably the main
reason why it leads to lower generalization errors.

7.2 Effect on Sparsity

(a) Without dropout (b) Dropout with p = 0.5.

Figure 8: Effect of dropout on sparsity. ReLUs were used for both models. Left: The histogram
of mean activations shows that most units have a mean activation of about 2.0. The
histogram of activations shows a huge mode away from zero. Clearly, a large fraction of
units have high activation. Right: The histogram of mean activations shows that most
units have a smaller mean activation of about 0.7. The histogram of activations
shows a sharp peak at zero. Very few units have high activation.

We found that as a side-effect of doing dropout, the activations of the hidden units
become sparse, even when no sparsity inducing regularizers are present. Thus, dropout au-
tomatically leads to sparse representations. To observe this effect, we take the autoencoders
trained in the previous section and look at the sparsity of hidden unit activations on a ran-
dom mini-batch taken from the test set. Figure 8a and Figure 8b compare the sparsity for
the two models. In a good sparse model, there should only be a few highly activated units
for any data case. Moreover, the average activation of any unit across data cases should
be low. To assess both of these qualities, we plot two histograms for each model. For each
model, the histogram on the left shows the distribution of mean activations of hidden units
across the minibatch. The histogram on the right shows the distribution of activations of
the hidden units.
Comparing the histograms of activations we can see that fewer hidden units have high
activations in Figure 8b compared to Figure 8a, as seen by the significant mass away from


zero for the net that does not use dropout. The mean activations are also smaller for the
dropout net. The overall mean activation of hidden units is close to 2.0 for the autoencoder
without dropout but drops to around 0.7 when dropout is used.
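The sparsity statistics in Figure 8 are straightforward to reproduce for any trained model. The following minimal numpy sketch (the function name and binning are ours, not part of the paper's code) computes both histograms from a matrix of hidden activations on a mini-batch:

import numpy as np

def sparsity_histograms(h, bins=20):
    """Summarize hidden-unit sparsity as in Figure 8.

    h: array of shape (num_cases, num_units) holding the (e.g. ReLU)
    activations of one hidden layer on a test mini-batch.
    """
    mean_per_unit = h.mean(axis=0)                 # one mean per hidden unit
    mean_hist = np.histogram(mean_per_unit, bins=bins)
    act_hist = np.histogram(h.ravel(), bins=bins)  # all individual activations
    return mean_hist, act_hist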

7.3 Effect of Dropout Rate


Dropout has a tunable hyperparameter p (the probability of retaining a unit in the network).
In this section, we explore the effect of varying this hyperparameter. The comparison is
done in two situations.

1. The number of hidden units is held constant.


2. The number of hidden units is changed so that the expected number of hidden units
that will be retained after dropout is held constant.

In the first case, we train the same network architecture with different amounts of
dropout. We use a 784-2048-2048-2048-10 architecture. No input dropout was used. Fig-
ure 9a shows the test error obtained as a function of p. If the architecture is held constant,
having a small p means very few units will turn on during training. It can be seen that this
has led to underfitting since the training error is also high. We see that as p increases, the
error goes down. It becomes flat when 0.4 ≤ p ≤ 0.8 and then increases as p becomes close
to 1.
Figure 9: Effect of changing dropout rates on MNIST. (a) Keeping n fixed. (b) Keeping pn fixed. Both panels plot test and training classification error (%) against the probability of retaining a unit (p).


Another interesting setting is the second case in which the quantity pn is held constant
where n is the number of hidden units in any particular layer. This means that networks
that have small p will have a large number of hidden units. Therefore, after applying
dropout, the expected number of units that are present will be the same across different
architectures. However, the test networks will be of different sizes. In our experiments,
we set pn = 256 for the first two hidden layers and pn = 512 for the last hidden layer.
Figure 9b shows the test error obtained as a function of p. We notice that the errors for
small values of p are much smaller than in Figure 9a (for p = 0.1 the error fell
from 2.7% to 1.7%). Values of p that are close to 0.6 seem to perform best for this choice
of pn but our usual default value of 0.5 is close to optimal.


7.4 Effect of Data Set Size


One test of a good regularizer is that it should make it possible to get good generalization
error from models with a large number of parameters trained on small data sets. This
section explores the effect of changing the data set size when dropout is used with feed-
forward networks. Huge neural networks trained in the standard way overfit massively on
small data sets. To see if dropout can help, we run classification experiments on MNIST
and vary the amount of data given to the network.
The results of these experiments are shown in Figure 10. The network was given
data sets of size 100, 500, 1K, 5K, 10K and 50K chosen randomly from the MNIST
training set. The same network architecture (784-1024-1024-2048-10) was used for
all data sets. Dropout with p = 0.5 was performed at all the hidden layers and p = 0.8
at the input layer. It can be observed that for extremely small data sets (100, 500)
dropout does not give any improvements. The model has enough parameters that it
can overfit on the training data, even with all the noise coming from dropout. As the
size of the data set is increased, the gain from doing dropout increases up to a point and then declines. This suggests that for any
given architecture and dropout rate, there is a “sweet spot” corresponding to some amount
of data that is large enough to not be memorized in spite of the noise but not so large that
overfitting is not a problem anyways.

Figure 10: Effect of varying data set size. Classification error (%) with and without dropout as a function of data set size (100 to 50K).

7.5 Monte-Carlo Model Averaging vs. Weight Scaling


The efficient test time procedure that we propose is to do an approximate model combination by scaling down the weights of the
trained neural network. An expensive but more correct way of averaging the models
is to sample k neural nets using dropout for each test case and average their predictions.
As k → ∞, this Monte-Carlo model average
gets close to the true model average. It is interesting to see empirically how many samples k are needed to match the performance
of the approximate averaging method. By computing the error for different values of k
we can see how quickly the error rate of the finite-sample average approaches the error
rate of the true model average.

Figure 11: Monte-Carlo model averaging vs. weight scaling. Test classification error (%) as a function of the number of samples k used for Monte-Carlo averaging, compared with approximate averaging by weight scaling.


We again use the MNIST data set and do classification by averaging the predictions
of k randomly sampled neural networks. Figure 11 shows the test error rate obtained for
different values of k. This is compared with the error obtained using the weight scaling
method (shown as a horizontal line). It can be seen that around k = 50, the Monte-Carlo
method becomes as good as the approximate method. Thereafter, the Monte-Carlo method
is slightly better than the approximate method but well within one standard deviation of
it. This suggests that the weight scaling method is a fairly good approximation of the true
model average.
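To make the comparison concrete, here is a minimal numpy sketch of the two prediction rules for a one-hidden-layer softmax network; the weights and function names are illustrative assumptions, not the code used in the paper:

import numpy as np

rng = np.random.default_rng(0)

def forward(x, W1, b1, W2, b2, p=0.5, mode="scale"):
    # One-hidden-layer softmax net with dropout on the hidden layer.
    # mode="sample": multiply hidden units by a fresh Bernoulli(p) mask.
    # mode="scale":  keep all units but multiply outgoing weights by p.
    h = np.maximum(0.0, x @ W1 + b1)
    if mode == "sample":
        h = h * (rng.random(h.shape) < p)
    logits = h @ (W2 if mode == "sample" else p * W2) + b2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def monte_carlo_predict(x, params, p=0.5, k=50):
    # Average the predictive distributions of k sampled sub-networks.
    return np.mean([forward(x, *params, p=p, mode="sample")
                    for _ in range(k)], axis=0)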

8. Dropout Restricted Boltzmann Machines


Besides feed-forward neural networks, dropout can also be applied to Restricted Boltzmann
Machines (RBM). In this section, we formally describe this model and show some results
to illustrate its key properties.

8.1 Model Description


Consider an RBM with visible units v ∈ {0, 1}^D and hidden units h ∈ {0, 1}^F. It defines
the following probability distribution:

P(h, v; \theta) = \frac{1}{Z(\theta)} \exp(v^\top W h + a^\top h + b^\top v),

where \theta = \{W, a, b\} represents the model parameters and Z is the partition function.
Dropout RBMs are RBMs augmented with a vector of binary random variables r ∈
{0, 1}^F. Each random variable r_j takes the value 1 with probability p, independently of
the others. If r_j takes the value 1, the hidden unit h_j is retained, otherwise it is dropped from
the model. The joint distribution defined by a Dropout RBM can be expressed as

P(r, h, v; p, \theta) = P(r; p)\, P(h, v \mid r; \theta),

P(r; p) = \prod_{j=1}^{F} p^{r_j} (1 - p)^{1 - r_j},

P(h, v \mid r; \theta) = \frac{1}{Z'(\theta, r)} \exp(v^\top W h + a^\top h + b^\top v) \prod_{j=1}^{F} g(h_j, r_j),

g(h_j, r_j) = \mathbb{1}(r_j = 1) + \mathbb{1}(r_j = 0)\,\mathbb{1}(h_j = 0).

Z'(\theta, r) is the normalization constant. g(h_j, r_j) imposes the constraint that if r_j = 0,
h_j must be 0. The distribution over h, conditioned on v and r, is factorial:

P(h \mid r, v) = \prod_{j=1}^{F} P(h_j \mid r_j, v),

P(h_j = 1 \mid r_j, v) = \mathbb{1}(r_j = 1)\, \sigma\Big(a_j + \sum_i W_{ij} v_i\Big).


(a) Without dropout (b) Dropout with p = 0.5.

Figure 12: Features learned on MNIST by 256 hidden unit RBMs. The features are ordered by L2
norm.

The distribution over v conditioned on h is the same as that of an RBM:

P(v \mid h) = \prod_{i=1}^{D} P(v_i \mid h),

P(v_i = 1 \mid h) = \sigma\Big(b_i + \sum_j W_{ij} h_j\Big).

Conditioned on r, the distribution over {v, h} is the same as the distribution that an RBM
would impose, except that the units for which r_j = 0 are dropped from h. Therefore, the
Dropout RBM model can be seen as a mixture of exponentially many RBMs with shared
weights, each using a different subset of h.

8.2 Learning Dropout RBMs


Learning algorithms developed for RBMs such as Contrastive Divergence (Hinton et al.,
2006) can be directly applied for learning Dropout RBMs. The only difference is that r is
first sampled and only the hidden units that are retained are used for training. Similar to
dropout neural networks, a different r is sampled for each training case in every minibatch.
In our experiments, we use CD-1 for training dropout RBMs.
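As a concrete illustration, the following numpy sketch performs one CD-1 update for a dropout RBM with binary units; it assumes W is stored as a D × F matrix with hidden biases a and visible biases b, and the helper names are ours:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_dropout_rbm_step(v0, W, a, b, p=0.5, lr=0.01):
    # Draw a fresh retention mask r for every case in the mini-batch;
    # dropped hidden units are excluded from both phases of CD-1.
    r = rng.random((v0.shape[0], W.shape[1])) < p
    ph0 = sigmoid(v0 @ W + a) * r                 # P(h=1|v0,r); zero if dropped
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T + b)                   # one reconstruction step
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + a) * r
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / n       # positive minus negative statistics
    a += lr * (ph0 - ph1).mean(axis=0)
    b += lr * (v0 - v1).mean(axis=0)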

8.3 Effect on Features


Dropout in feed-forward networks improved the quality of features by reducing co-adaptations.
This section explores whether this effect transfers to Dropout RBMs as well.
Figure 12a shows features learned by a binary RBM with 256 hidden units. Figure 12b
shows features learned by a dropout RBM with the same number of hidden units. Features


(a) Without dropout (b) Dropout with p = 0.5.

Figure 13: Effect of dropout on sparsity. Left: The activation histogram shows that a large num-
ber of units have activations away from zero. Right: A large number of units have
activations close to zero and very few units have high activation.

learned by the dropout RBM appear qualitatively different in the sense that they seem to
capture features that are coarser compared to the sharply defined stroke-like features in the
standard RBM. There seem to be very few dead units in the dropout RBM relative to the
standard RBM.

8.4 Effect on Sparsity

Next, we investigate the effect of dropout RBM training on sparsity of the hidden unit
activations. Figure 13a shows the histograms of hidden unit activations and their means on
a test mini-batch after training an RBM. Figure 13b shows the same for dropout RBMs.
The histograms clearly indicate that the dropout RBMs learn much sparser representations
than standard RBMs even when no additional sparsity inducing regularizer is present.

9. Marginalizing Dropout

Dropout can be seen as a way of adding noise to the states of hidden units in a neural
network. In this section, we explore the class of models that arise as a result of marginalizing
this noise. These models can be seen as deterministic versions of dropout. In contrast to
standard (“Monte-Carlo”) dropout, these models do not need random bits and it is possible
to get gradients for the marginalized loss functions.
Deterministic algorithms have been proposed that try to learn models that are robust to
feature deletion at test time (Globerson and Roweis, 2006). Marginalization in the context
of denoising autoencoders has been explored previously (Chen et al., 2012). The marginal-
ization of dropout noise in the context of linear regression was discussed in Srivastava (2013).
Wang and Manning (2013) further explored the idea of marginalizing dropout to speed-up
training. van der Maaten et al. (2013) investigated different input noise distributions and


the regularizers obtained by marginalizing this noise. Wager et al. (2013) describe how
dropout can be seen as an adaptive regularizer.

9.1 Linear Regression


First we explore a very simple case of applying dropout to the classical problem of linear
regression. Let X ∈ R^{N×D} be a data matrix of N data points and y ∈ R^N be a vector of
targets. Linear regression tries to find a w ∈ R^D that minimizes

\|y - Xw\|^2.

When the input X is dropped out such that any input dimension is retained with
probability p, the input can be expressed as R ∗ X where R ∈ {0, 1}N ×D is a random matrix
with Rij ∼ Bernoulli(p) and ∗ denotes an element-wise product. Marginalizing the noise,
the objective function becomes

\text{minimize}_w \;\; \mathbb{E}_{R \sim \text{Bernoulli}(p)} \left[ \|y - (R * X)w\|^2 \right].

This reduces to

\text{minimize}_w \;\; \|y - pXw\|^2 + p(1-p)\|\Gamma w\|^2,

where \Gamma = (\text{diag}(X^\top X))^{1/2}. Therefore, dropout with linear regression is equivalent, in
expectation, to ridge regression with a particular form for Γ. This form of Γ essentially
scales the weight cost for weight wi by the standard deviation of the ith dimension of the
data. If a particular data dimension varies a lot, the regularizer tries to squeeze its weight
more.
Another interesting way to look at this objective is to absorb the factor of p into w.
This leads to the following form:

\text{minimize}_{\tilde{w}} \;\; \|y - X\tilde{w}\|^2 + \frac{1-p}{p}\,\|\Gamma \tilde{w}\|^2,

where \tilde{w} = pw. This makes the dependence of the regularization constant on p explicit.
For p close to 1, all the inputs are retained and the regularization constant is small. As
more dropout is done (by decreasing p), the regularization constant grows larger.
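The equivalence above is easy to verify numerically. A quick numpy check under arbitrary data (the sizes and names below are ours):

import numpy as np

rng = np.random.default_rng(0)
N, D, p = 200, 5, 0.8
X, y, w = rng.normal(size=(N, D)), rng.normal(size=N), rng.normal(size=D)

# Monte-Carlo estimate of E_R ||y - (R*X)w||^2 with R_ij ~ Bernoulli(p).
mc = np.mean([np.sum((y - ((rng.random(X.shape) < p) * X) @ w) ** 2)
              for _ in range(20000)])

# Closed form: ||y - pXw||^2 + p(1-p)||Gamma w||^2, Gamma = diag(X^T X)^(1/2).
Gamma = np.sqrt(np.diag(X.T @ X))
closed = np.sum((y - p * X @ w) ** 2) + p * (1 - p) * np.sum((Gamma * w) ** 2)

print(mc, closed)  # the two values agree up to Monte-Carlo error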

9.2 Logistic Regression and Deep Networks


For logistic regression and deep neural nets, it is hard to obtain a closed form marginalized
model. However, Wang and Manning (2013) showed that in the context of dropout applied
to logistic regression, the corresponding marginalized model can be trained approximately.
Under reasonable assumptions, the distributions over the inputs to the logistic unit and over
the gradients of the marginalized model are Gaussian. Their means and variances can be
computed efficiently. This approximate marginalization outperforms Monte-Carlo dropout
in terms of training time and generalization performance.
However, the assumptions involved in this technique become successively weaker as more
layers are added. Therefore, the results are not directly applicable to deep networks.
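To give a flavor of the moment-matching idea, here is a sketch of the key step for a single logistic unit; this is not Wang and Manning's full training algorithm, and the probit-style correction to the Gaussian integral of a sigmoid is a standard approximation we assume here:

import numpy as np

def expected_sigmoid_under_dropout(w, x, p=0.5):
    # The pre-activation s = sum_i r_i w_i x_i with r_i ~ Bernoulli(p)
    # is treated as Gaussian with matching mean and variance.
    mu = p * np.dot(w, x)
    s2 = p * (1 - p) * np.sum((w * x) ** 2)
    # E[sigmoid(s)] ~ sigmoid(mu / sqrt(1 + pi * s2 / 8)) for s ~ N(mu, s2).
    return 1.0 / (1.0 + np.exp(-mu / np.sqrt(1.0 + np.pi * s2 / 8.0)))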


Data Set Architecture Bernoulli dropout Gaussian dropout


MNIST 2 layers, 1024 units each 1.08 ± 0.04 0.95 ± 0.04
CIFAR-10 3 conv + 2 fully connected layers 12.6 ± 0.1 12.5 ± 0.1

Table 10: Comparison of classification error % with Bernoulli and Gaussian dropout. For MNIST,
the Bernoulli model uses p = 0.5 for the hidden units and p = 0.8 for the input units.
For CIFAR-10, we use p = (0.9, 0.75, 0.75, 0.5, 0.5, 0.5) going from the input layer to the
top. The value of σ for the Gaussian dropout models was set to \sqrt{(1-p)/p}. Results were
averaged over 10 different random seeds.

10. Multiplicative Gaussian Noise


Dropout involves multiplying hidden activations by Bernoulli distributed random variables
which take the value 1 with probability p and 0 otherwise. This idea can be generalized
by multiplying the activations with random variables drawn from other distributions. We
recently discovered that multiplying by a random variable drawn from N (1, 1) works just
as well as, or perhaps better than, using Bernoulli noise. This new form of dropout amounts
to adding a Gaussian distributed random variable with zero mean and standard deviation
equal to the activation of the unit. That is, each hidden activation hi is perturbed to
hi + hi r where r ∼ N (0, 1), or equivalently hi r0 where r0 ∼ N (1, 1). We can generalize
this to r0 ∼ N (1, σ 2 ) where σ becomes an additional hyperparameter to tune, just like p
was in the standard (Bernoulli) dropout. The expected value of the activations remains
unchanged, therefore no weight scaling is required at test time.
In this paper, we described dropout as a method where we retain units with probability p
at training time and scale down the weights by multiplying them by a factor of p at test time.
Another way to achieve the same effect is to scale up the retained activations by multiplying
by 1/p at training time and not modifying the weights at test time. These methods are
equivalent with appropriate scaling of the learning rate and weight initializations at each
layer.
Therefore, dropout can be seen as multiplying h_i by a Bernoulli random variable r_b that
takes the value 1/p with probability p and 0 otherwise. Then E[r_b] = 1 and Var[r_b] = (1 - p)/p.
For the Gaussian multiplicative noise, if we set σ² = (1 - p)/p, we end up multiplying
h_i by a random variable r_g, where E[r_g] = 1 and Var[r_g] = (1 - p)/p. Therefore, both
forms of dropout can be set up so that the random variable being multiplied has the
same mean and variance. However, given these first and second order moments, r_g has the
highest entropy and r_b has the lowest. Both these extremes work well, although preliminary
experimental results shown in Table 10 suggest that the high entropy case might work
slightly better. For each layer, the value of σ in the Gaussian model was set to \sqrt{(1-p)/p}
using the p from the corresponding layer in the Bernoulli model.
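A short numpy sketch of the two matched-moment noises described above (function names are ours); with either noise the expected activation is unchanged, so nothing needs to be rescaled at test time:

import numpy as np

rng = np.random.default_rng(0)

def bernoulli_noise(shape, p):
    # r_b is 1/p with probability p and 0 otherwise: E = 1, Var = (1-p)/p.
    return (rng.random(shape) < p) / p

def gaussian_noise(shape, p):
    # r_g ~ N(1, (1-p)/p): same mean and variance as r_b, higher entropy.
    return rng.normal(1.0, np.sqrt((1 - p) / p), size=shape)

p = 0.5
h = rng.normal(size=(4, 8))  # some hidden activations
h_bern = h * bernoulli_noise(h.shape, p)
h_gauss = h * gaussian_noise(h.shape, p)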

11. Conclusion
Dropout is a technique for improving neural networks by reducing overfitting. Standard
backpropagation learning builds up brittle co-adaptations that work for the training data
but do not generalize to unseen data. Random dropout breaks up these co-adaptations by


making the presence of any particular hidden unit unreliable. This technique was found
to improve the performance of neural nets in a wide variety of application domains includ-
ing object classification, digit recognition, speech recognition, document classification and
analysis of computational biology data. This suggests that dropout is a general technique
and is not specific to any domain. Methods that use dropout achieve state-of-the-art re-
sults on SVHN, ImageNet, CIFAR-100 and MNIST. Dropout considerably improved the
performance of standard neural nets on other data sets as well.
This idea can be extended to Restricted Boltzmann Machines and other graphical mod-
els. The central idea of dropout is to take a large model that overfits easily and repeatedly
sample and train smaller sub-models from it. RBMs easily fit into this framework. We de-
veloped Dropout RBMs and empirically showed that they have certain desirable properties.
One of the drawbacks of dropout is that it increases training time. A dropout network
typically takes 2-3 times longer to train than a standard neural network of the same ar-
chitecture. A major cause of this increase is that the parameter updates are very noisy.
Each training case effectively tries to train a different random architecture, so the
gradients that are being computed are not gradients of the final architecture that will be
used at test time. It is therefore not surprising that training takes a long time. However,
it is likely that this stochasticity prevents overfitting. This creates a trade-off between over-
fitting and training time. With more training time, one can use high dropout and suffer less
overfitting. However, one way to obtain some of the benefits of dropout without stochas-
ticity is to marginalize the noise to obtain a regularizer that does the same thing as the
dropout procedure, in expectation. We showed that for linear regression this regularizer is
a modified form of L2 regularization. For more complicated models, it is not obvious how to
obtain an equivalent regularizer. Speeding up dropout is an interesting direction for future
work.

Acknowledgments
This research was supported by OGS, NSERC and an Early Researcher Award.

Appendix A. A Practical Guide for Training Dropout Networks


Neural networks are infamous for requiring extensive hyperparameter tuning. Dropout
networks are no exception. In this section, we describe heuristics that might be useful for
applying dropout.

A.1 Network Size


It is to be expected that dropping units will reduce the capacity of a neural network. If
n is the number of hidden units in any layer and p is the probability of retaining a unit,
then instead of n hidden units, only pn units will be present after dropout, in expectation.
Moreover, this set of pn units will be different each time and the units are not allowed to
build co-adaptations freely. Therefore, if an n-sized layer is optimal for a standard neural
net on any given task, a good dropout net should have at least n/p units. We found this to
be a useful heuristic for setting the number of hidden units in both convolutional and fully
connected networks.
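A trivial helper makes the heuristic explicit (a sketch; the names are ours):

import math

def dropout_layer_sizes(base_sizes, retain_probs):
    # Scale each width n up to n/p so that the expected number of
    # retained units matches the width that worked without dropout.
    return [math.ceil(n / p) for n, p in zip(base_sizes, retain_probs)]

print(dropout_layer_sizes([1024, 1024], [0.5, 0.5]))  # -> [2048, 2048]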


A.2 Learning Rate and Momentum


Dropout introduces a significant amount of noise in the gradients compared to standard
stochastic gradient descent. Therefore, a lot of gradients tend to cancel each other. In
order to make up for this, a dropout net should typically use 10-100 times the learning rate
that was optimal for a standard neural net. Another way to reduce the effect of the noise is
to use a high momentum. While momentum values of 0.9 are common for standard nets,
with dropout we found that values around 0.95 to 0.99 work quite a lot better. Using a high
learning rate and/or momentum significantly speeds up learning.
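As a sketch, the update rule these heuristics plug into is just generic momentum SGD (our own formulation, not code from the paper):

import numpy as np

def sgd_momentum_step(w, v, grad, lr=1.0, mom=0.95):
    # Dropout nets tolerate larger learning rates (10-100x the
    # standard-net value) and high momentum (0.95-0.99) to fight
    # gradient noise; combine with the max-norm constraint below.
    v = mom * v - lr * grad
    return w + v, v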

A.3 Max-norm Regularization


Though large momentum and learning rate speed up learning, they sometimes cause the
network weights to grow very large. To prevent this, we can use max-norm regularization.
This constrains the norm of the vector of incoming weights at each hidden unit to be bounded
by a constant c. Typical values of c range from 3 to 4.
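The constraint is typically enforced by projecting the weights back after each gradient update. A minimal numpy sketch, assuming each column of W holds the incoming weights of one hidden unit:

import numpy as np

def max_norm_project(W, c=3.0):
    # Rescale any column whose L2 norm exceeds c back onto the ball
    # of radius c; columns already inside the ball are left unchanged.
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    return W * np.minimum(1.0, c / np.maximum(norms, 1e-12))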

A.4 Dropout Rate


Dropout introduces an extra hyperparameter—the probability of retaining a unit p. This
hyperparameter controls the intensity of dropout: p = 1 implies no dropout, and low values
of p mean more dropout. Typical values of p for hidden units are in the range 0.5 to 0.8.
For input layers, the choice depends on the kind of input. For real-valued inputs (image
patches or speech frames), a typical value is 0.8. For hidden layers, the choice of p is coupled
with the choice of the number of hidden units n. Smaller p requires larger n, which slows down
training and leads to underfitting. Large p may not produce enough dropout to prevent
overfitting.

Appendix B. Detailed Description of Experiments and Data Sets


This section describes the network architectures and training details for the experimental
results reported in this paper. The code for reproducing these results can be obtained from
http://www.cs.toronto.edu/~nitish/dropout. The implementation is GPU-based. We
used the excellent CUDA libraries—cudamat (Mnih, 2009) and cuda-convnet (Krizhevsky
et al., 2012) to implement our networks.

B.1 MNIST
The MNIST data set consists of 60,000 training and 10,000 test examples each representing
a 28×28 digit image. We held out 10,000 random training images for validation. Hyperpa-
rameters were tuned on the validation set such that the best validation error was produced
after 1 million weight updates. The validation set was then combined with the training set
and training was done for 1 million weight updates. This net was used to evaluate the per-
formance on the test set. This way of using the validation set was chosen because we found
that it was easy to set up hyperparameters so that early stopping was not required at all.
Therefore, once the hyperparameters were fixed, it made sense to combine the validation
and training sets and train for a very long time.


The architectures shown in Figure 4 include all combinations of 2, 3, and 4 layer networks
with 1024 and 2048 units in each layer. Thus, there are six architectures in all. For all the
architectures (including the ones reported in Table 2), we used p = 0.5 in all hidden layers
and p = 0.8 in the input layer. A final momentum of 0.95 and weight constraints with c = 2
were used in all the layers.
To test the limits of dropout’s regularization power, we also experimented with 2 and 3
layer nets having 4096 and 8192 units. Two-layer nets gave improvements as shown in Table 2.
However, the three-layer nets performed slightly worse than two-layer ones with the same
level of dropout. When we increased dropout, performance improved but not enough to
outperform the two-layer nets.

B.2 SVHN
The SVHN data set consists of approximately 600,000 training images and 26,000 test
images. The training set consists of two parts—a standard labeled training set and another
set of easier labeled examples. A validation set was constructed by taking examples
from both parts: two-thirds were taken from the standard set (400 per class) and
one-third from the extra set (200 per class), for a total of 6,000 samples. This same process
was used by Sermanet et al. (2012). The inputs were RGB pixels normalized to have zero
mean and unit variance. Other preprocessing techniques such as global or local contrast
normalization or ZCA whitening did not give any noticeable improvements.
The best architecture that we found uses three convolutional layers each followed by
a max-pooling layer. The convolutional layers have 96, 128 and 256 filters respectively.
Each convolutional layer has a 5 × 5 receptive field applied with a stride of 1 pixel. Each
max pooling layer pools 3 × 3 regions at strides of 2 pixels. The convolutional layers are
followed by two fully connected hidden layers having 2048 units each. All units use the
rectified linear activation function. Dropout was applied to all the layers of the network
with the probability of retaining the unit being p = (0.9, 0.75, 0.75, 0.5, 0.5, 0.5) for the
different layers of the network (going from input to convolutional layers to fully connected
layers). In addition, the max-norm constraint with c = 4 was used for all the weights. A
momentum of 0.95 was used in all the layers. These hyperparameters were tuned using a
validation set. Since the training set was quite large, we did not combine the validation
set with the training set for final training. We report the test error of the model that had
the smallest validation error.

B.3 CIFAR-10 and CIFAR-100


The CIFAR-10 and CIFAR-100 data sets consist of 50,000 training and 10,000 test images
each. They have 10 and 100 image categories respectively. These are 32 × 32 color images.
We used 5,000 of the training images for validation. We followed a procedure similar
to that for MNIST, where we found the best hyperparameters using the validation set and then
combined it with the training set. The images were preprocessed by doing global contrast
normalization in each color channel followed by ZCA whitening. Global contrast normalization
means that for each image and each color channel in that image, we compute the mean
of the pixel intensities and subtract it from the channel. ZCA whitening means that we
mean center the data, rotate it onto its principal components, normalize each component
and then rotate it back. The network architecture and dropout rates are the same as those for
SVHN, except for the learning rates of the input layer, which had to be set to smaller values.
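For reference, a compact numpy sketch of this preprocessing (our own implementation of the two steps described above, with an added epsilon for numerical stability):

import numpy as np

def global_contrast_normalize(X):
    # X: (num_images, height, width, channels); subtract the mean of
    # each color channel of each image from that channel.
    return X - X.mean(axis=(1, 2), keepdims=True)

def zca_whiten(X, eps=1e-5):
    # X: (num_images, num_features). Center, rotate onto the principal
    # components, normalize each component, and rotate back.
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / Xc.shape[0]
    U, S, _ = np.linalg.svd(cov)
    return Xc @ (U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T)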

B.4 TIMIT
The open source Kaldi toolkit (Povey et al., 2011) was used to preprocess the data into log-
filter banks. A monophone system was trained to do a forced alignment and to get labels for
speech frames. Dropout neural networks were trained on windows of 21 consecutive frames
to predict the label of the central frame. No speaker dependent operations were performed.
The inputs were mean centered and normalized to have unit variance.
We used probability of retention p = 0.8 in the input layers and 0.5 in the hidden layers.
Max-norm constraint with c = 4 was used in all the layers. A momentum of 0.95 with a
high learning rate of 0.1 was used. The learning rate was decayed as \epsilon_0 (1 + t/T)^{-1}. For
DBN pretraining, we trained RBMs using CD-1. The variance of each input unit for the
Gaussian RBM was fixed to 1. For finetuning the DBN with dropout, we found that in
order to get the best results it was important to use a smaller learning rate (about 0.01).
Adding max-norm constraints did not give any improvements.

B.5 Reuters
The Reuters RCV1 corpus contains more than 800,000 documents categorized into 103
classes. These classes are arranged in a tree hierarchy. We created a subset of this data set
consisting of 402,738 articles and a vocabulary of 2000 words, comprising 50 categories
in which each document belongs to exactly one class. The data was split into equal sized
training and test sets. We tried many network architectures and found that dropout gave
improvements in classification accuracy over all of them. However, the improvement was
not as significant as that for the image and speech data sets. This might be explained by
the fact that this data set is quite big (more than 200,000 training examples) and overfitting
is not a very serious problem.

B.6 Alternative Splicing


The alternative splicing data set consists of data for 3665 cassette exons, 1014 RNA features
and 4 tissue types derived from 27 mouse tissues. For each input, the target consists of 4
softmax units (one for each tissue type). Each softmax unit has 3 states (inc, exc, nc) which are
of biological importance. For each softmax unit, the aim is to predict a distribution over
these 3 states that matches the observed distribution from wet lab experiments as closely
as possible. The evaluation metric is Code Quality, which is defined as

\sum_{i=1}^{|\text{data points}|} \;\; \sum_{t \in \text{tissue types}} \;\; \sum_{s \in \{inc,\, exc,\, nc\}} p^s_{i,t} \log\!\left(\frac{q^s_t(r_i)}{\bar{p}^s}\right),

where p^s_{i,t} is the target probability for state s and tissue type t in input i, q^s_t(r_i) is the
predicted probability for state s in tissue type t for input r_i, and \bar{p}^s is the average of p^s_{i,t}
over i and t.
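Given dense arrays of target and predicted distributions, the metric is a few lines of numpy (a sketch assuming strictly positive predictions; the array layout is ours):

import numpy as np

def code_quality(p_target, q_pred):
    # p_target, q_pred: (num_inputs, num_tissues, 3) distributions over
    # the states (inc, exc, nc); p_bar averages the targets over i and t.
    p_bar = p_target.mean(axis=(0, 1))
    return np.sum(p_target * np.log(q_pred / p_bar))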
A two layer dropout network with 1024 units in each layer was trained on this data set.
A value of p = 0.5 was used for the hidden layer and p = 0.7 for the input layer. Max-norm
regularization with high decaying learning rates was used. Results were averaged across the
same 5 folds used by Xiong et al. (2011).


References

M. Chen, Z. Xu, K. Weinberger, and F. Sha. Marginalized denoising autoencoders for domain adaptation. In Proceedings of the 29th International Conference on Machine Learning, pages 767–774. ACM, 2012.

G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton. Phone recognition with the mean-covariance restricted Boltzmann machine. In Advances in Neural Information Processing Systems 23, pages 469–477, 2010.

O. Dekel, O. Shamir, and L. Xiao. Learning to classify with missing and corrupted features. Machine Learning, 81(2):149–178, 2010.

A. Globerson and S. Roweis. Nightmare at test time: robust learning by feature deletion. In Proceedings of the 23rd International Conference on Machine Learning, pages 353–360. ACM, 2006.

I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In Proceedings of the 30th International Conference on Machine Learning, pages 1319–1327. ACM, 2013.

G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

G. E. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.

K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In Proceedings of the International Conference on Computer Vision (ICCV'09). IEEE, 2009.

A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.

Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.

Y. Lin, F. Lv, S. Zhu, M. Yang, T. Cour, K. Yu, L. Cao, Z. Li, M.-H. Tsai, X. Zhou, T. Huang, and T. Zhang. ImageNet classification: fast descriptor coding and large-scale SVM training. Large scale visual recognition challenge, 2010.

A. Livnat, C. Papadimitriou, N. Pippenger, and M. W. Feldman. Sex, mixability, and modularity. Proceedings of the National Academy of Sciences, 107(4):1452–1457, 2010.

V. Mnih. CUDAMat: a CUDA-based matrix class for Python. Technical Report UTML TR 2009-004, Department of Computer Science, University of Toronto, November 2009.

A. Mohamed, G. E. Dahl, and G. E. Hinton. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 2010.

R. M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag New York, Inc., 1996.

Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.

S. J. Nowlan and G. E. Hinton. Simplifying neural networks by soft weight-sharing. Neural Computation, 4(4), 1992.

D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.

R. Salakhutdinov and G. Hinton. Deep Boltzmann machines. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 5, pages 448–455, 2009.

R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning. ACM, 2008.

J. Sanchez and F. Perronnin. High-dimensional signature compression for large-scale image classification. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, pages 1665–1672, 2011.

P. Sermanet, S. Chintala, and Y. LeCun. Convolutional neural networks applied to house numbers digit classification. In International Conference on Pattern Recognition (ICPR 2012), 2012.

P. Simard, D. Steinkraus, and J. Platt. Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of the Seventh International Conference on Document Analysis and Recognition, volume 2, pages 958–962, 2003.

J. Snoek, H. Larochelle, and R. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, pages 2960–2968, 2012.

N. Srebro and A. Shraibman. Rank, trace-norm and max-norm. In Proceedings of the 18th Annual Conference on Learning Theory, COLT'05, pages 545–560. Springer-Verlag, 2005.

N. Srivastava. Improving Neural Networks with Dropout. Master's thesis, University of Toronto, January 2013.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.

A. N. Tikhonov. On the stability of inverse problems. Doklady Akademii Nauk SSSR, 39(5):195–198, 1943.

L. van der Maaten, M. Chen, S. Tyree, and K. Q. Weinberger. Learning with marginalized corrupted features. In Proceedings of the 30th International Conference on Machine Learning, pages 410–418. ACM, 2013.

P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103. ACM, 2008.

P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408, 2010.

S. Wager, S. Wang, and P. Liang. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems 26, pages 351–359, 2013.

S. Wang and C. D. Manning. Fast dropout training. In Proceedings of the 30th International Conference on Machine Learning, pages 118–126. ACM, 2013.

H. Y. Xiong, Y. Barash, and B. J. Frey. Bayesian prediction of tissue-regulated splicing using RNA sequence and cellular context. Bioinformatics, 27(18):2554–2562, 2011.

M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. CoRR, abs/1301.3557, 2013.

 
  !#"%$'&)(*,+.-!"/$0132406575701(,+'-98:$;!$<>=?$<1$@2:A9BDCECEF6-*6*G3H@IJ6(*KMLON0 P,$C75RQT(*S IUI?$CWV
X Y[Z]\;^;_a`cbed.fW_Wghghikjl^]m_Wgh_n\befMo p4_Wq>\rbhseZ_Wj.stvuwxuzy9\;{lg|vm)_Egh_n\befMo9}*~n;€ fMo6l‚kƒRp4beik„'_…}
m_n† ‡ˆ\;jl‰Š}1‹RŒ@…… ~E|[…Ž…Ž6}!a€ t
’‘ ik‚k‚U\Z_Eshse_a4j1ik„'_Ebeghi“s[”.}1•…;€6sM\rse_
€6shbe_W_7sn}Š€ \;‚k_WZ–}l—4mT•…;Ž; ~…}>a€1t
˜š™a› pœžY[m4€1u4} m>† drŸˆ_Eb4 ˆo>\lghgh_W_
¡.},~r¢r£'¤;•‡ˆ_Ebe‚kikj9} ™ _7beZ]\;j.”
¥ ”.\;jlj*} ‚¦_Ed…j6{!§¨)be_Wgh_n\befMo9© \rshdsn©befEi¦d…^;Z–ikj>\;}>‚k^;‚“”’d;bhq1b7¨)l{lŸ‚kikikgh‚k‚¦ol\;_WZ†_Eikshj se_…© _n† 9}l‰ ‚¦\lgM¨)ªlbegOsn© ^;Z]†*© †1_
—)bhbn} ™ ©>\j>† ›#l« ‚k‚k_Ebn}Š¬@©v­[‹4_E1bM\;‚*‹4_7s[Ÿ®dbe‰6gW¯1shbeikfM‰ gd°cseol_4shbM\;†1_W± }
€ q1beikj1^…_Ebn}*~n•;•…¤ ©
u®o1_’fWd…j6„'_Ebe^;_WjlfE_ºd°{>\f7‰.|»q1bed;q>\;^…\seikd…j¼‚k_n\rbejlikjl^#ikg\;jl\;‚“” ƒW_n†
gh²]d³c\;´Wg4µW¶rse·>d]¸;_Eµ…½ ¹ q1‚U\ikjºfEd…ZZd…j’q1ol_Wj1d…Z_Wj1d…j–d;{lgh_Ebe„…_n†’{6” q1bM\fEsei“sei¦d;jl_EbegW© › \j6”
1j>†1_EghikbM\{l‚k_]{Š_Wo>\n„6ikd;beg
d;°{>\f7‰6q bed…qšfn\j#{Š_]\n„'d…i¦†1_W†3ŸikseošshbeikfM‰6g
seol\sD\be_
bM\rbe_W‚“”¾_7½1qŠd;gh_n†¾ikjºgh_Ebeikd…1gse_Ef7o1jlikfn\;‚%q1l{l‚kikfn\rsei¦d;jlgW©,u®olikgaq>\qŠ_Eb^…ik„'_Egaghd…Z_
d› °c\sej6ol”šd…gh\;_R shsebeolikfMd;‰6begWgD}>o>\\nj>„'†_dgh¿!1_E^…be^…g_E_EgOse½ _nql†š‚¦\;sejlo>\\sesDikd…ghj1_Wg)fWd;dj>°ž†6Ÿ|Ào.d”]bM†1se_EolbD_Ed…”]q Ÿ®seikdZbe‰Ši¦ƒW© \seikd…jÁZ_Eseo1d6†1g
\rbe_®\…† „…\j.sM\;^…_Ed…lg*°Jdbcjl_W bM\;‚.j1_Es9shbM\ikjlikjl^ ©rYÀsžikgžgho1dnŸjseol\s,Zd…gOs)­[fW‚¦\;ghghikfn\‚k±
gh_WfEd…j>†6|»d;bM†1_7bDZ_7seold6†1gR\be_ikZq1bM\fEseikfn\;‚c°Jdb
‚U\rbe^…_]jl_E1bM\;‚%j1_Es[Ÿˆd;be‰6gW©žt°J_EŸ
Z_Eseo1d6†1g\rbe_aq1bed…qŠd;gh_n†seol\s)†1dDjld;sol\r„…_4seol_Wgh_a‚kikZi“sM\seikd…j1gW©
 ÃlÄŊÆ!ÇaÈDÉ
Ê,Å!ËeÇÄ
241ÌWÍ>Î!C701Î*6Ïl.5E=U01Á=?Kš< $;C7ÐÁÎ,0 Î*(!I? C]!$;(!CE I!$5eÑ)01C7ÍÁI?$; C7*=U!Ï# IUÏ10 CE=¦5EÒ!ÓÔP,$;Ì (*K7$
=¦5
=JK
Ì0 *Ì$Ί5E(*6I?I?Ð3K7=UӖÎ!I?$ -,Ìr01Ó Î*(Š5E657=?0 * IUI?к$ՒÌ=U$;l5;-9 *GšP,$;Ì (*K7$=¦5
0 Ö×57$;ÁÑ)01C7͊K;A
Ø0.Ñ4$< $;C;-'Ï1$r5M5E=U*Ï:=U5570Ñ40 CEÍÑ4$I?IÀ-6 *G–KM01Ó $57=?Ӗ$;Kˆ570@Ñ40 CEÍÙ.5 IUI[- Ì  K7$$;ÓÚӖ0 CE$R06Ö
66C755EÒ*6ºK7Ì=U$;*Ìr$1A1ÛD$;K7=UÏ1!=?!ÏÙ69G–57CW6=?!=?!Ï]]!$5eÑ)01C7Í(*K7=U!ÏÙP* ÌWÍ>Î!CE0 ΖCE$;Ül(!=?C7$…K
Ӗ Í>=U!Ï
Ӗ >Ð@K7$$;Ӗ=U!Ï1IUÐ@6CEP!=U57CW6CEÐDÌWÒ!01=?Ì$;KˆKM(9ÌWÒ] K/57Ò!$)l(*ÓÙP,$C®6*G@5eÐ>Î,$;Kv06Ö!!0ŠGŠ$…K-
I?'Ð1$CWK->I?$;6CE!=?!ϖCE657$…K-Š5ECE =U*=U!ϒ *G¾57$;KM5DK7$r5WK-* *GšKM0–ÖÝ01CM5EÒ%A*ÞRÒ!$…KM$]ÌWÒ*0 =JÌr$;KaÌ;6šP,$
ÌrCE=¦5E=?Ì;6I[-Ð $5/5EÒ!$CE$=JK/!0ÖÝ0l01IUÎ*C70>06֊CE$;Ì=UÎ,$ÖÝ01CvGŠ$…Ìr=JGŠ=U*Ïa57Ò!$;ÓßP9$…Ì (*KM$57Ò!$;Ð@ C7$I? C7Ï1$I?Ð
Î!C701P!I?$Óà *G–G!.5W:G!$Î,$*GŠ$;l5;A Ø
0.Ñ)$;< $C…-…57Ò!$;C7$a C7$RÒ!$;(!C7=JKM57=JÌK69G K70 Ӗ$R(!*G!$CEIUÐ>=?!Ï
57Ò!$;0 CEВ57Ò*65
Ì ¾Ò!$;IUΚÏ1(!=?G!$@ Î!CW Ìr57=U57=?0 !$;CR570–Ó’6Í1$:P,$r5757$;C
ÌWÒ!0 =JÌr$…KA
áhÙ5EÒ!$Râ*CWKe5K7$;Ìn5E=U01–P9$;IU0.ÑãÑ)$R=?15EC70ŠGŠ(9Ìr$aKe5W6*G*6CWGÙP*1ÌWÍ>Î!C701Î*6Ïl.5E=U01]6*G GŠ=JK7Ì(*K7K
’>(!ÓÙP,$CD06ÖK7=UӖÎ!I?$]Ò!$;(!C7=JKM57=JÌK01C5EC7=JÌW͊KaÖÝ0 CD=UӖÎ!CE0.<>=U*ϒ=¦5WKÎ,$C7ÖÝ0 CEӖ *Ìr$1A*äå$Ù!$æ>5
GŠ=?KEÌr(9K7K=JKEKM(!$…K]0 ÖaÌ0 >< $;C7Ï1$*Ì$ Ažäå$’57Ò!$;çGŠ$…K7ÌC7=?P9$#ÖÝ$Ñéè7ÌI?1K7K7=?Ì;6IJê¾K7$;Ì0 *GŠL[01CEGŠ$;C
!0 ŠLOI?=U!$…6C’01Ί57=?Ӗ=Uë….57=?0 T5E$;ÌWÒ!!=JÜl(!$;Kº6*GìK7Ò!0.Ñí5EÒ*.5’5EÒ!$=?C¾ Î!Î!I?=?Ì;.5E=U01ç5E0z!$(*CE I
!$r5eÑ40 CEͼ57CW6=?!=?!ϼ=JK’< $;C7ÐzI?=?Ó =U57$…Gc-RGŠ$;K7Î!=U57$#Ӓ6>ÐãÌI? =UӒK–5E0¼57Ò*$ÁÌr0 l5ECE C7Ðz=?ì57Ò*$
IU=U57$;CE657(!CE$ Aîˆ=U96I?IUÐ1-Ñ)$šÎ!CE$;K7$l5ÁÖÝ$;ÑïK7$;Ì0 *GŠL[01CEGŠ$;CÙӖ$r5EÒ!0ŠG!K 57Ò9.5ºGŠ0¼ Ì;Ìr$;IU$;CE657$
IU$…6CE!=U*Ï =?šÌr$C75E =UšÌ1KM$…KA
ð ñÙò%ó®Æ*Ä
Ë7ÄôõóÄ
È÷öxò/Äò%Æ*ó®ø7ËeùcóˆÅ!ËeÇÄ
ÞRÒ!$CE$6CE$’KM$;< $CW6I Î!Î!CE01 ÌWÒ*$;K
570Á6(!570 Ӓ657=J̖Ӗ1ÌWÒ!=?!$–IU$…6CE!=U*Ï*-cP!(Š5ÓÙ(*ÌWÒú06Ö457Ò*$
KM(*Ì;Ìr$…K7KMÖÝ(!I>6Î!Î!CE011ÌWÒ!$;K/Ì6ÙP9$4Ì.5E$Ï10 CE=Uë;$;G Kˆû6üEý þ.ÿ Wý Wþ Wý.ü 9ÿ lû ;þ nA'ÞRÒ*$  
     

IU$…6CE!=U*ÏÁӒ ÌWÒ!=?!$ -® KÙC7$;Î!CE$;K7$l57$…Gú=Uçîˆ=?Ï (!CE$1-ˆÌ0 ӖÎ!(Š5E$;K šÖÝ(!*Ìr57=?0  



ÑaÒ!$CE$  
=JK/5EÒ!$ !L[57Ò =U*Î!(Š5Î*65M57$;C7/-' *G C7$;Î!CE$;K7$l5EK/57Ò!$Ì0 I?IU$…Ìn5E=U01Ù06Ö, G e(*KM5E P!I?$
Î*6CW6Ӗ$r5E$CWK=?ú57Ò!$3KMЊKM57$ӚA Ìr0lKe5]ÖÝ(**Ìn5E=U01  !"
# $
# %r-/Ӗ$;1KM(*C7$…K
57Ò!$GŠ=JKEÌrCE$Î* *ÌrÐ]P,$r5eÑ4$$;!   -'5EÒ!$’è7Ì0 CEC7$…Ìn5Wê
01C®G!$;K7=UCE$;G 0 (!57Î!(Š5ÖÝ01CÎ*.5757$;C7&   -  *G
57Ò!$:0 (Š5EÎ!(Š5aÎ!CE0ŠGŠ(*Ìr$…G’P>Ж57Ò!$@K7Ð>KM57$;ӚAlÞRÒ!$@'< $;CE Ï $Ì01KM5ÖÝ(!*Ìr57=?0 ' (*) +-,/.0 =?K)57Ò*$
'< $CW6Ï1$
0 Öv5EÒ!$$CEC701CE1K –0.< $CR’K7$r506ֈ=U*Î!(Š-5 2.0 (Š5EÎ!(Š5
Î96=?CEKRÌ IUI?$;G¾5EÒ!$@57CW6=?!=?!ϒKM$5
3 :+
#¾4+ 4
6575/5758:9
;&9<-=1A…áh]5EÒ!$4K7=?Ó Î*IU$…Ke5KM$5M5E=U!Ï9-…57Ò!$4IU$…6CE!=?!Ï
Î!CE0 P*IU$;Ó Ì0 *K7=JKe5WK/=?
â**GŠ=?!ϖ57Ò*$]<.6I?(!$06<Ö 57Ò*65Ó =?!=?Ӗ=Uë;$;:K  (*)#+-,7.  rA*áh#Î!CW Ìr57=JÌr$1->57Ò!$]Î,$C7ÖÝ0 CEӒ6*Ì$
06Ö5EÒ!$–KMЊKM57$Ó 0  º57CW6=?!=?!ϚKM$5@=JK:06ÖI?=U5M57I?$’=?15E$CE$;KM5;AcÞRÒ!$–Ó–0 CE$ÙCE$I?$<.6l5@Ӗ$; K7(!CE$
=?KD57Ò*$$;C7CE0 CDCE657$Ù0 Ö57Ò*$ K7ЊKe5E$Ó =U 57Ò!$ â*$;I?G%-,ÑaÒ*$CE$Ù=U5:Ñ40 (!IJGÁP,$ (*KM$…GÁ=? Î!CW Ìr57=JÌr$1A
ÞRÒ!=?K]Î9$;CMÖÝ01C7Ӓ69Ìr$–=?K]$;KM57=?Ӗ657$…G P>ÐÁӖ$… K7(!C7=?!Ï357Ò*$ ÌÌ(!CW ÌrÐ 0 ú#KM$5]06ÖaK7 Ó Î*IU$…K
GŠ=?>K e0 =?l5)ÖÝCE0 Óõ5EÒ!$:57CW6=?!=?!ÏK7$r5…->Ì;6I?IU$…G’5EÒ!$:57$;KM5aKM$5;A!ÞRÒ!$:Ӗ01KM5RÌ0 ӖӖ0 !I?Ж(*KM$…GºÌ01KM5
ÖÝ(!*Ìn5E=U01š=?KR5EÒ!$]Q#$…6@ ?ŠÜl(*6CE$;BG ACEC701DC C
   E > GF H 
# ; F
(*) +-,7. IJ  
6Kv+
E0, E1,....Ep
Error

COST FUNCTION

Desired
Output
Output
D0, D1,...Dp
Parameters M(Z,W)
W
LEARNING
MACHINE

Input

Z0, Z1,... Zp

L<M7N ¹PO>¹ ™ bM\;†1ik_Wj.sh|»{>\;gh_W† ‚k_n\bej1i¦j1^@Z]\;fMolikj1_…©


This chapter is focused on strategies for improving the process of minimizing the cost function. However, these strategies must be used in conjunction with methods for maximizing the network's ability to generalize, that is, to predict the correct targets for patterns the learning system has not previously seen.

To understand generalization, let us consider how backpropagation works. We start with a set of samples, each of which is an input/output pair of the function to be learned. Since the measurement process is often noisy, there may be errors in the samples. We can imagine that if we collected multiple sets of samples, each set would look a little different because of the noise and because of the different points sampled. Each of these data sets would also result in networks with minima that are slightly different from each other and from the true function. In this chapter, we concentrate on improving the process of finding the minimum for the particular set of examples that we are given. Generalization techniques try to correct for the errors introduced into the network as a result of our choice of dataset. Both are important.
Several theoretical efforts have analyzed the process of learning by minimizing the error on a training set, a process sometimes called Empirical Risk Minimization. Some of those theoretical analyses are based on decomposing the generalization error into two terms: bias and variance. The bias is a measure of how much the network output, averaged over all possible data sets, differs from the desired function. The variance is a measure of how much the network output varies between datasets. Early in training, the bias is large because the network output is far from the desired function, while the variance is very small because the data has had little influence yet. Late in training, the bias is small because the network has learned the underlying function. However, if trained too long, the network will also have learned the noise specific to that dataset; this is referred to as overtraining. In such a case, the variance will be large because the noise varies between datasets. It can be shown that the minimum total error occurs when the sum of bias and variance is minimal.

There are a number of techniques (e.g. early stopping, regularization) for maximizing the generalization ability of a network when using backprop. The idea of this chapter, therefore, is to present minimization strategies (given a cost function) and the tricks associated with increasing the speed and quality of the minimization. It is, however, clear that the choice of the model (model selection), the architecture, and the cost function is crucial for obtaining a network that generalizes well. So keep in mind that if the wrong model class is used and no proper model selection is done, then even a superb minimization will clearly not help very much. In fact, the existence of overtraining has led several authors to suggest that inaccurate minimization algorithms can be better than good ones.

3 Standard Backpropagation
Although the tricks and analyses in this chapter are primarily presented in the context of "classical" multi-layer feed-forward neural networks, many of them also apply to most other gradient-based learning methods.

The simplest form of multilayer learning machine trained with gradient-based learning is simply a stack of modules, each of which implements a function X_n = F_n(W_n, X_{n-1}), where X_n is a vector representing the output of the module, W_n is the vector of tunable parameters in the module (a subset of W), and X_{n-1} is the module's input vector (as well as the previous module's output vector). The input X_0 to the first module is the input pattern Z^p. If the partial derivative of E^p with respect to X_n is known, then the partial derivatives of E^p with respect to W_n and X_{n-1} can be computed using the backward recurrence

    dE^p/dW_n = (dF/dW)(W_n, X_{n-1}) dE^p/dX_n    (2)
    dE^p/dX_{n-1} = (dF/dX)(W_n, X_{n-1}) dE^p/dX_n    (3)

where (dF/dW)(W_n, X_{n-1}) is the Jacobian of F with respect to W evaluated at the point (W_n, X_{n-1}), and (dF/dX)(W_n, X_{n-1}) is the Jacobian of F with respect to X. The Jacobian of a vector function is a matrix containing the partial derivatives of all the outputs with respect to all the inputs. When the above equations are applied to the modules in reverse order, from layer N down to layer 1, all the partial derivatives of the cost function with respect to all the parameters can be computed. This way of computing gradients is known as back-propagation.

Traditional multi-layer neural networks are a special case of the above system where the modules are alternated layers of matrix multiplications (the weights) and component-wise sigmoid functions (the units):

    Y_n = W_n X_{n-1},    X_n = F(Y_n)    (4)

where W_n is a matrix whose number of columns is the dimension of X_{n-1} and whose number of rows is the dimension of X_n, F is a vector function that applies a sigmoid to each component of its input, and Y_n is the vector of weighted sums, or total inputs, to layer n. Applying the chain rule to the equations above, the classical backpropagation equations are obtained, written here in matrix form:

    dE^p/dY_n = F'(Y_n) dE^p/dX_n
    dE^p/dW_n = (dE^p/dY_n) X_{n-1}^T    (5)
    dE^p/dX_{n-1} = W_n^T dE^p/dY_n

The simplest learning (minimization) procedure in such a setting is the gradient descent algorithm, where W is iteratively adjusted as follows:

    W(t) = W(t-1) - η dE/dW.    (6)

In the simplest case, η is a scalar constant. More sophisticated procedures use a variable η. In other methods, η takes the form of a diagonal matrix, or is an estimate of the inverse Hessian matrix of the cost function (the second derivative matrix), as in the Newton and Quasi-Newton methods described later in the chapter. A proper choice of η is important and will be discussed at length later.
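The following sketch (ours, assuming tanh units and a squared-error cost) implements the matrix-form backpropagation equations (5) and one gradient descent step (6) for a small stack of layers:

```python
import numpy as np

def f(y):
    return np.tanh(y)

def f_prime(y):
    return 1.0 - np.tanh(y) ** 2

def forward(Ws, x0):
    # Y_n = W_n X_{n-1}, X_n = F(Y_n); keep all Y_n, X_n for the backward pass.
    xs, ys = [x0], []
    for Wn in Ws:
        ys.append(Wn @ xs[-1])
        xs.append(f(ys[-1]))
    return xs, ys

def backward(Ws, xs, ys, dE_dxN):
    # Classical backprop in matrix form, applied from the last layer down:
    #   dE/dY_n = F'(Y_n) * dE/dX_n
    #   dE/dW_n = (dE/dY_n) X_{n-1}^T
    #   dE/dX_{n-1} = W_n^T dE/dY_n
    grads, dE_dx = [], dE_dxN
    for Wn, xprev, yn in zip(reversed(Ws), reversed(xs[:-1]), reversed(ys)):
        dE_dy = f_prime(yn) * dE_dx
        grads.append(np.outer(dE_dy, xprev))
        dE_dx = Wn.T @ dE_dy
    return list(reversed(grads))

# One gradient descent step W <- W - eta * dE/dW on a toy two-layer net.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
x0, d = rng.normal(size=3), np.array([1.0, -1.0])
xs, ys = forward(Ws, x0)
grads = backward(Ws, xs, ys, xs[-1] - d)   # dE/dX_N for E = 0.5 ||X_N - d||^2
eta = 0.1
Ws = [Wn - eta * g for Wn, g in zip(Ws, grads)]
```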
4 A Few Practical Tricks
Backpropagation can be very slow, particularly for multilayered networks, where the cost surface is typically non-quadratic, non-convex, and high dimensional with many local minima and/or flat regions. There is no formula to guarantee that (1) the network will converge to a good solution, (2) convergence is swift, or (3) convergence even occurs at all. However, in this section we discuss a number of tricks that can greatly improve the chances of finding a good solution while also decreasing the convergence time, often by orders of magnitude. More detailed theoretical justifications will be given in later sections.
4.1 Stochastic versus Batch Learning
At each iteration, equation (6) requires a complete pass through the entire dataset in order to compute the average or true gradient. This is referred to as batch learning, since an entire "batch" of data must be considered before the weights are updated. Alternatively, one can use stochastic (online) learning, where a single example {Z^t, D^t} is chosen (e.g. randomly) from the training set at each iteration t. An estimate of the true gradient is then computed based on the error E^t of that example, and the weights are updated:

    W(t+1) = W(t) - η dE^t/dW.    (7)

Because this estimate of the gradient is noisy, the weights may not move precisely down the gradient at each iteration. As we shall see, this "noise" at each iteration can be advantageous. Stochastic learning is generally the preferred method for basic backpropagation for the following three reasons:
Advantages of Stochastic Learning
1. Stochastic learning is usually much faster than batch learning.
2. Stochastic learning also often results in better solutions.
3. Stochastic learning can be used for tracking changes.


Stochastic learning is most often much faster than batch learning, particularly on large redundant datasets. The reason for this is simple to show. Consider the simple case where a training set of size 1000 is inadvertently composed of 10 identical copies of a set with 100 samples. Averaging the gradient over all 1000 patterns gives the exact same result as computing the gradient based on just the first 100. Thus, batch gradient descent is wasteful because it recomputes the same quantity 10 times before one parameter update. On the other hand, stochastic gradient will see a full epoch as 10 iterations through a 100-long training set. In practice, examples rarely appear more than once in a dataset, but there are usually clusters of patterns that are very similar. For example, in phoneme classification, all of the patterns for the phoneme /ae/ will (hopefully) contain much of the same information. It is this redundancy that can make batch learning much slower than on-line.
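This redundancy argument is easy to verify numerically; a minimal sketch, assuming a linear model and made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
X_small = rng.normal(size=(100, 5))   # 100 distinct samples
y_small = rng.normal(size=100)
X_big = np.tile(X_small, (10, 1))     # 1000 samples: 10 identical copies
y_big = np.tile(y_small, 10)

w = rng.normal(size=5)

def grad(X, y, w):
    # Gradient of the mean squared error of a linear model.
    return X.T @ (X @ w - y) / len(y)

# The batch gradient over 1000 redundant samples equals the one over 100,
# yet costs 10x as much to compute per weight update.
print(np.allclose(grad(X_big, y_big, w), grad(X_small, y_small, w)))  # True
```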
Stochastic learning also often results in better solutions because of the noise in the updates. Nonlinear networks usually have multiple local minima of differing depths. The goal of training is to locate one of these minima. Batch learning will discover the minimum of whatever basin the weights are initially placed in. In stochastic learning, the noise present in the updates can result in the weights jumping into the basin of another, possibly deeper, local minimum. This has been demonstrated in certain simplified cases.

Stochastic learning is also useful when the function being modeled is changing over time, a quite common scenario in industrial applications where the data distribution changes gradually over time (e.g. due to wear and tear of the machines). If the learning machine does not detect and follow the change, it is impossible to learn the data properly and large generalization errors will result. With batch learning, changes go undetected and we obtain rather bad results since we are likely to average over several rules, whereas on-line learning, if operated properly (see below), will track the changes and yield good approximation results.
Despite the advantages of stochastic learning, there are still reasons why one might consider using batch learning:
Advantages of Batch Learning
1. Conditions of convergence are well understood.
2. Many acceleration techniques (e.g. conjugate gradient) only operate in batch learning.
3. Theoretical analysis of the weight dynamics and convergence rates is simpler.
These advantages stem from the same noise that makes stochastic learning advantageous. This noise, which is so critical for finding better local minima, also prevents full convergence to the minimum. Instead of converging to the exact minimum, the convergence stalls out due to the weight fluctuations. The size of the fluctuations depends on the degree of noise of the stochastic updates: the variance of the fluctuations around the local minimum is proportional to the learning rate η. So in order to reduce the fluctuations we can either decrease (anneal) the learning rate or use an adaptive batch size. In theory, it is shown that the optimal annealing schedule of the learning rate is of the form

    η ~ c/t    (8)

where t is the number of patterns presented and c is a constant. In practice, this may be too fast.

Another method to remove the noise is to use "mini-batches", that is, to start with a small batch size and increase the size as training proceeds. Møller discusses one method for doing this, and Orr discusses it for linear problems. However, deciding the rate at which to increase the batch size and which inputs to include in the small batches is as difficult as determining the proper learning rate; effectively, the size of the learning rate in stochastic learning corresponds to the respective size of the mini-batch.

Note also that the problem of removing the noise in the data may be less critical than one thinks, because of generalization: overtraining may occur long before the noise regime is even reached.

Another advantage of batch training is that one is able to use second order methods to speed the learning process. Second order methods speed learning by estimating not just the gradient but also the curvature of the cost surface. Given the curvature, one can estimate the approximate location of the actual minimum.

Despite the advantages of batch updates, stochastic learning is still often the preferred method, particularly when dealing with very large data sets, because it is simply much faster.
 
4.2 Shuffling the Examples
Networks learn the fastest from the most unexpected sample. Therefore, it is advisable to choose at each iteration a sample that is the most unfamiliar to the system. Note that this applies only to stochastic learning, since the order of input presentation is irrelevant for batch learning (see the footnote below). Of course, there is no simple way to know which inputs are information rich; however, a very simple trick that crudely implements this idea is to choose successive examples that are from different classes, since training examples belonging to the same class will most likely contain similar information.

Another heuristic for judging how much new information a training example contains is to examine the error between the network output and the target value when this input is presented. A large error indicates that this input has not been learned by the network and so contains a lot of new information. Therefore, it makes sense to present this input more frequently. Of course, by "large" we mean relative to all of the other training examples. As the network trains, these relative errors will change, and so should the frequency of presentation for a particular input pattern. A method that modifies the probability of appearance of each pattern is called an emphasizing scheme; a small sketch of the class-alternation heuristic is given after the box below.
  

 

Choose Examples with Maximum Information Content
1. Shuffle the training set so that successive training examples never (rarely) belong to the same class.
2. Present input examples that produce a large error more frequently than examples that produce a small error.
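A minimal sketch of the first heuristic (the function name and toy labels are ours): draw the next example from a different class than the previous one whenever possible.

```python
import numpy as np

def class_alternating_order(labels, rng):
    # Build a presentation order in which successive examples come from
    # different classes whenever possible (a crude shuffling heuristic).
    pools = {c: list(rng.permutation(np.flatnonzero(labels == c)))
             for c in np.unique(labels)}
    order, prev = [], None
    while any(pools.values()):
        candidates = [c for c, idx in pools.items() if idx and c != prev]
        c = rng.choice(candidates) if candidates else prev
        order.append(pools[c].pop())
        prev = c
    return order

rng = np.random.default_rng(0)
labels = np.array([0, 0, 0, 1, 1, 2, 2, 2])
print(class_alternating_order(labels, rng))
```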
Footnote: The order in which the gradients are summed in batch mode may be affected by roundoff error if there is a significant range of gradient values.
However, one must be careful when perturbing the normal frequencies of input examples, because this changes the relative importance that the network places on different examples. This may or may not be desirable. For example, this technique applied to data containing outliers can be disastrous, because outliers can produce large errors yet should not be presented frequently. On the other hand, this technique can be particularly beneficial for boosting the performance on infrequently occurring inputs, e.g. /z/ in phoneme recognition.
4.3 Normalizing the Inputs

Convergence is usually faster if the average of each input variable over the training set is close to zero. To see this, consider the extreme case where all the inputs are positive. Weights to a particular node in the first weight layer are updated by an amount proportional to δx, where δ is the (scalar) error at that node and x is the input vector (see equations (5) and (6)). When all of the components of an input vector are positive, all of the updates of weights that feed into a node will have the same sign (i.e. sign(δ)). As a result, these weights can only all decrease or all increase together for a given input pattern. Thus, if a weight vector must change direction, it can only do so by zigzagging, which is inefficient and thus very slow.

In the above example, the inputs were all positive. However, in general, any shift of the average input away from zero will bias the updates in a particular direction and thus slow down learning. Therefore, it is good to shift the inputs so that the average over the training set is close to zero. This heuristic should be applied at all layers, which means that we want the average of the outputs of a node to be close to zero, because these outputs are the inputs to the next layer. This problem can be addressed by coordinating how the inputs are transformed with the choice of sigmoidal activation function. Here we discuss the input transformation; the discussion of the sigmoid follows.

Convergence is faster not only if the inputs are shifted as described above, but also if they are scaled so that all have about the same covariance C_i, where

    C_i = (1/P) Σ_{p=1..P} (z_i^p)^2.    (9)

Here, P is the number of training examples, C_i is the covariance of the i-th input variable, and z_i^p is the i-th component of the p-th training example. Scaling speeds learning because it helps to balance out the rate at which the weights connected to the input nodes learn. The value of the covariance should be matched with that of the sigmoid used. For the sigmoid given below, a covariance of 1 is a good choice.

The exception to scaling all covariances to the same value occurs when it is known that some inputs are of less significance than others. In such a case, it can be beneficial to scale the less significant inputs down so that they are "less visible" to the learning process.
Transforming the Inputs
1. The average of each input variable over the training set should be close to zero.
2. Scale input variables so that their covariances are about the same.
3. Input variables should be uncorrelated if possible.
The above two tricks of shifting and scaling the inputs are quite simple to implement. Another trick that is quite effective but more difficult to implement is to decorrelate the inputs. Consider the simple network in Figure 2. If the inputs are uncorrelated, then it is possible to solve for the value of ω1 that minimizes the error without any concern for ω2, and vice versa. In other words, the two variables are independent (the system of equations is diagonal). With correlated inputs, one must solve for both simultaneously, which is a much harder problem. Principal component analysis (also known as the Karhunen-Loeve expansion) can be used to remove linear correlations in the inputs.

Inputs that are linearly dependent (the extreme case of correlation) may also produce degeneracies which may slow learning. Consider the case where one input is always twice the other input (z2 = 2 z1). The network output is constant along lines ω2 = v - (1/2) ω1, where v is a constant. Thus, the gradient is zero along these directions (see Figure 2). Moving along these lines has absolutely no effect on learning: we are trying to solve in 2-D what is effectively only a 1-D problem. Ideally we want to remove one of the inputs, which will also decrease the size of the network.

Figure 3 shows the entire process of transforming inputs. The steps are: (1) shift the inputs so the mean is zero, (2) decorrelate the inputs, and (3) equalize the covariances.
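A sketch of this three-step pipeline, assuming PCA via an eigendecomposition of the input covariance (the function name and toy data are ours):

```python
import numpy as np

def transform_inputs(X):
    # The three steps of Figure 3: (1) mean cancellation, (2) KL-expansion /
    # PCA to decorrelate, (3) covariance equalization to unit variance.
    X = X - X.mean(axis=0)                  # shift: zero mean
    C = np.cov(X, rowvar=False)             # input covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)
    X = X @ eigvecs                         # rotate: decorrelate
    X = X / np.sqrt(eigvals + 1e-12)        # scale: equalize covariances
    return X

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 2))
X = rng.normal(size=(500, 2)) @ A           # correlated toy inputs
Xt = transform_inputs(X)
print(np.round(np.cov(Xt, rowvar=False), 3))   # approximately the identity
```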

Fig. 2. Linearly dependent inputs: with z2 = 2 z1, the error E is constant along lines in the (ω1, ω2) weight plane. [figure omitted]


Fig. 3. Transformation of inputs: mean cancellation, then KL-expansion, then covariance equalization. [figure omitted]


4.4 The Sigmoid
Nonlinear activation functions are what give neural networks their nonlinear capabilities. One of the most common forms of activation function is the sigmoid, a monotonically increasing function that asymptotes at some finite value as plus or minus infinity is approached. The most common examples are the standard logistic function f(x) = 1/(1 + e^(-x)) and the hyperbolic tangent f(x) = tanh(x), shown in Figure 4. Sigmoids that are symmetric about the origin (e.g. Figure 4b) are preferred for the same reason that inputs should be normalized: they are more likely to produce outputs (which are the inputs to the next layer) that are on average close to zero. This is in contrast, say, to the logistic function, whose outputs are always positive and so must have a mean that is positive.
"04
&" 9
1A?ŠÐlӖӖ$r5EC7=J̺K7=?Ï Ó–0 =JG!K]K7(*ÌWÒz KÙÒ>ÐlÎ,$CEP,0 I?=?̖5W6!Ï1$l5Ù0 Ö×57$ãÌr0 ><1$CEÏ $ Ö» KM57$;C
5EÒ*635EÒ!$]Ke5W6*G*6CWGI?0 Ï =JKM57=JÌ
ÖÝ(**Ìn5E=U01%A
E A  CE$;Ìr01Ó–Ó $;*GŠ$…GzKM=?Ï Ó–01=?G  - )=J6K C      X5! ( "- 5W6! Ò
FV  <A ?>=?*Ì$57Ò*$
5W6!ÒÁÖÝ(!*Ìn5E=U01Á=JKDK70 Ӗ$r5E=UӖ$…K
Ìr01ӖÎ!(Š5E657=?0 * IUI?Ð3$ræŠÎ9$;*K7=U<1$ -ž6Á6Î*Î!C70'æŠ=?Ӓ.L
5E=U01š06Öv=U5P>к–CW.57=?0 06Ö®Î901IUÐ>!01Ó =J6IJKRÌ 3P,$(*K7$;G¾=U9Ke5E$; G%A
*A ?Š0 Ӗ$r5E=UӖ$;Kˆ=U5ˆ=JKˆÒ!$I?ΊÖÝ(!I>5E0:1G!GÙ:K7Ӗ IUIlI?=U!$…6Cˆ57$;C7Ӛ-'$ A Ï*A    x5E6*0Ò   7
 K70’ K4570’'<10 =JG 965
KMÎ,065WKA
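A direct transcription of the recommended sigmoid and its variants (the derivative and the value of the twisting coefficient a are our additions):

```python
import numpy as np

def f(x):
    # The recommended sigmoid: f(x) = 1.7159 * tanh(2x/3).
    # With unit-variance inputs, f(+-1) = +-1 and the effective gain is ~1.
    return 1.7159 * np.tanh(2.0 * x / 3.0)

def f_prime(x):
    return 1.7159 * (2.0 / 3.0) * (1.0 - np.tanh(2.0 * x / 3.0) ** 2)

def f_twisted(x, a=0.01):
    # Variant with a small linear ("twisting") term to avoid flat spots;
    # the value of a is an illustrative choice, not prescribed by the text.
    return f(x) + a * x

print(f(1.0), f(-1.0))   # ~ +1, -1
```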
Fig. 4. (a) Not recommended: the standard logistic function, f(x) = 1/(1 + e^(-x)). (b) Hyperbolic tangent, f(x) = 1.7159 tanh((2/3)x). [figure omitted]

The constants in the recommended sigmoid given above have been chosen so that, when used with transformed inputs (see the previous discussion), the variance of the outputs will also be close to 1, because the effective gain of the sigmoid is roughly 1 over its useful range. In particular, this sigmoid has the properties: (a) f(+1) = 1 and f(-1) = -1, (b) the second derivative is a maximum at x = 1, and (c) the effective gain is close to 1.

One of the potential problems with using symmetric sigmoids is that the error surface can be very flat near the origin. For this reason it is good to avoid initializing with very small weights. Because of the saturation of the sigmoids, the error surface is also flat far from the origin. Adding a small linear term to the sigmoid can sometimes help avoid these flat regions.

4.5 Choosing Target Values

In classification problems, target values are typically binary (e.g. {-1, +1}). Common wisdom might seem to suggest that the target values be set at the value of the sigmoid's asymptotes. However, this has several drawbacks.

First, instabilities can result. The training process will try to drive the outputs as close as possible to the target values, which can only be achieved asymptotically. As a result, the weights (output and even hidden) are driven to larger and larger values where the sigmoid derivative is close to zero. The very large weights increase the gradients; however, these gradients are then multiplied by an exponentially small sigmoid derivative (except when a twisting term is added to the sigmoid), producing a weight update close to zero. As a result, the weights may become stuck.

Second, when the outputs saturate, the network gives no indication of its confidence level. When an input pattern falls near a decision boundary, the output class is uncertain. Ideally, this should be reflected in the network by an output value that is in between the two possible target values, i.e. not near either asymptote. However, large weights tend to force all outputs to the tails of the sigmoid regardless of the uncertainty, so the network may predict a wrong class without giving any indication of its low confidence in the result. Large weights that saturate the nodes make it impossible to differentiate between typical and nontypical examples.

A solution to these problems is to set the target values to be within the range of the sigmoid, rather than at the asymptotic values. Care must be taken, however, to insure that the node is not restricted to only the linear part of the sigmoid. Setting the target values to the points of maximum second derivative on the sigmoid is the best way to take advantage of the nonlinearity without saturating the sigmoid. This is another reason the sigmoid of Figure 4b is a good choice: it has maximum second derivative at +1 and -1, which correspond to the binary target values typical in classification problems.
Targets
Choose target values at the points of maximum second derivative on the sigmoid so as to avoid saturating the output units.
Footnote: A twisting term is a small linear term added to the node output, e.g. y = tanh(x) + ax.
4.6 Initializing the Weights


The starting values of the weights can have a significant effect on the training process. Weights should be chosen randomly, but in such a way that the sigmoid is primarily activated in its linear region. If the weights are all very large, the sigmoids will saturate, resulting in small gradients that make learning slow. If the weights are very small, the gradients will also be very small. Intermediate weights that range over the sigmoid's linear region have the advantage that (1) the gradients are large enough that learning can proceed and (2) the network will learn the linear part of the mapping before the more difficult nonlinear part.

ÌWÒ!=?$<>=?!Ï]5EÒ!=?K4CE$;Ül(!=?C7$…K)Ì0>0 CWGŠ=U9.57=?0 ºP9$5eÑ)$;$5EÒ!$D57CW6=?!=U*ÏK7$r5a!0 CEӒ6I?=Uë….5E=U01%-
57Ò!$
ÌWÒ*0 =JÌr$0 ÖcK7=?Ï Ó–0 =JGc-  *G 57Ò!$DÌWÒ!0 =JÌr$0 ÖcÑ)$;=UÏ1Ò15=?!=U57=J6I?=Uë….5E=U01%A äå$
KM5E6C75P>ÐÙCE$;Ül(!=?C7L
=U!ϖ5EÒ*.5a5EÒ!$]GŠ=JKe5EC7=?P!(Š5E=U01š06Ö/5EÒ!$0 (!57Î!(Š5WKa06Ö®$;1ÌWÒ¾*0>G!$Ò*'< $@’Ke5W6*G*6CWG¾GŠ$;<>=?657=?0 
 D06ÖR6Î!Î!CE0'æŠ=UӒ657$I?Ð AcÞRÒ!=JK@=?K@ ÌWÒ*=U$;< $;G .5:57Ò!$–=?!Î!(Š5IJ'Ð $C:P>Ð#!0 CEӒ6I?=Uë;=U*Ï57Ò*$
57CW6=?!=U*Ï KM$5 KGŠ$;KEÌrCE=?P9$…G¼$; C7I?=U$;C;AvÞ/0Á01PŠ5E =Uz Ke5W6*G! CEG¼G!$<>=?657=?0 zÌIU0lKM$’5E0 º65


57Ò!$@0 (!57Î!(Š5a0 Ö%57Ò!$:â*CWKM54Ò!=JG!GŠ$;¾IJ'Ð $;C)Ñ4$ e(*Ke5a!$;$;G570 (*KM$:5EÒ!$6P,0.< $


CE$;Ì0 ӖӖ$*G!$;G
KM=?Ï Ó–0 =JG 5701Ï $57Ò!$;CÑa=U57Ғ5EÒ!$
CE$;Ül(!=?C7$;Ӗ$l557Ò9.55EÒ!$
=?!Î!(!5)5E057Ò!$:KM=?Ï Ó–01=?G’ I?K70]Ò*'<1$
Ke5W6*G! CEGšGŠ$;<>=?657=?0   A 
KEKM(!Ӗ=?!ϒ5EÒ!$Ù=?!Î!(Š5WK-  , -*5E0º’(!!=U5: C7$(**Ìr01C7CE$IJ.5E$;G
Ña=¦5EÒ3<. C7=J6*Ì$ ->57Ò*$]Ke5W6*G! CEG¾GŠ$<>=J.5E=U01¾0 Öv5EÒ!$(!!=U5EKaÑ4$=?Ï Òl5E$;G3KM(!Ó Ña=UI?IcP,$


+ EF
 F



   J ,! 5 %  
ÞRÒl(9K-'5E0:=?*K7(!CE$5EÒ*.5®57Ò*$  C7$R6Î!Î*C70'æŠ=?Ӗ657$;IUÐ 457Ò!$RÑ4$=?Ï Òl5EK®K7Ò!0 (*I?GP9$RCW6*G!0 ӖIUÐ
GŠCE'Ña¾ÖÝCE0 Óí–GŠ=JKe5EC7=?P!(Š5E=U01šÑa=¦5EÒ3Ӗ$…63ë$CE0’6*G3–Ke5W6*G*6CWG¾GŠ$;<>=?657=?0 3Ï =?< $3P>Ð
 

    + 7F
 %" 
 ÑaÒ!$CE$ Ô=?K45EÒ!$>(!ÓÙP,$Ca0 ֈ=?!Î!(!5EKR5E0–57Ò!$(!*=¦5…A
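A sketch of this initialization rule; the choice of a uniform distribution follows the box below, and the scaling of its half-width is our own derivation:

```python
import numpy as np

def init_weights(fan_in, fan_out, rng):
    # Draw weights from a zero-mean distribution with standard deviation
    # sigma_w = m^(-1/2), where m is the fan-in of the receiving node.
    # A uniform distribution on [-a, a] has std a/sqrt(3), so we pick
    # a = sqrt(3/m) to obtain the desired sigma_w.
    a = np.sqrt(3.0 / fan_in)
    return rng.uniform(-a, a, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
W = init_weights(fan_in=100, fan_out=50, rng=rng)
print(W.std(), 1.0 / np.sqrt(100))   # both ~ 0.1
```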
Initializing the Weights
Assuming that (1) the training set has been normalized and (2) the sigmoid from Figure 4b is used, the weights should be randomly drawn from a distribution (e.g. uniform) with mean zero and standard deviation σ_w = m^(-1/2), where m is the fan-in (the number of connections feeding into the node).

4.7 Choosing Learning Rates

There is at least one well-principled method, described later in this chapter, for estimating the ideal learning rate η. Many other schemes (most of them rather empirical) have been proposed in the literature to automatically adjust the learning rate. Most of those schemes decrease the learning rate when the weight vector "oscillates" and increase it when the weight vector follows a relatively steady direction. The main problem with these methods is that they are not appropriate for stochastic gradient or on-line learning, because the weight vector fluctuates all the time.

Beyond choosing a single global learning rate, it is clear that picking a different learning rate η_i for each weight can improve the convergence. A well-principled way of doing this, based on computing second derivatives, is described in a later section. The main philosophy is to make sure that all the weights in the network converge at roughly the same speed. Depending upon the curvature of the error surface, some weights may require a small learning rate in order to avoid divergence, while others may require a large learning rate in order to converge at a reasonable speed. Because of this, learning rates in the lower layers should generally be larger than in the higher layers. This corrects for the fact that, in most neural net architectures, the second derivative of the cost function with respect to weights in the lower layers is generally smaller than that with respect to weights in the higher layers. The rationale for these heuristics, along with suggestions for how to choose the actual values of the learning rates, will be discussed in more detail in later sections.

If shared weights are used, as in time-delay neural networks (TDNNs) or convolutional networks, the learning rate of a shared weight should be proportional to the square root of the number of connections sharing that weight, because the gradient of such a weight is a sum of more-or-less independent terms.

Equalize the Learning Speeds
- Give each weight its own learning rate.
- Learning rates should be proportional to the square root of the number of inputs to the unit.
- Learning rates for weights in lower layers should typically be larger than in the higher layers.

Other tricks for improving the convergence include:

Momentum. The momentum update

    Δw(t+1) = η dE(t+1)/dw + μ Δw(t)    (12)

can increase the speed of convergence when the cost surface is highly nonspherical, because it damps the size of the steps along directions of high curvature, thus yielding a larger effective learning rate along the directions of low curvature (μ denotes the strength of the momentum term). It has been claimed that momentum generally helps more in batch mode than in stochastic mode, but no systematic study of this is known to the authors.
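A minimal sketch of the momentum update on an ill-conditioned quadratic; the sign convention of Eq. (12) is folded into the step, and the toy curvatures are ours:

```python
import numpy as np

def momentum_step(w, delta_w_prev, grad, eta=0.1, mu=0.9):
    # delta_w(t+1) = -eta * dE/dw + mu * delta_w(t).
    # Steps along high-curvature directions tend to alternate in sign and
    # cancel; steps along low-curvature directions accumulate.
    delta_w = -eta * grad + mu * delta_w_prev
    return w + delta_w, delta_w

# Toy quadratic E(w) = 0.5 * w^T diag(1, 50) w: an ill-conditioned surface.
curv = np.array([1.0, 50.0])
w, dw = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(100):
    w, dw = momentum_step(w, dw, curv * w, eta=0.02, mu=0.9)
print(w)   # close to the minimum at the origin
```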
Adaptive learning rates. Many authors, including Sompolinsky et al., Darken and Moody, Sutton, and Murata et al., have proposed rules for automatically adapting the learning rates. These rules control the speed of convergence by increasing or decreasing the learning rate based on the error.
We assume the following facts for a learning rate adaptation algorithm: (1) the smallest eigenvalue of the Hessian (see Eq. (23) below) is sufficiently smaller than the second smallest eigenvalue, and (2) therefore, after a large number of iterations, the parameter vector w(t) will approach the minimum from the direction of the minimum eigenvector of the Hessian (see Figure 5).

Fig. 5. Convergence of the flow. During the final stage of learning, the average flow is approximately one-dimensional towards the minimum, and it is a good approximation of the minimum-eigenvalue direction of the Hessian. [figure omitted]

Under these conditions, the evolution of the estimated parameter can be thought of as a one-dimensional process, and the minimum eigenvector of the Hessian can be approximated (for a large number of iterations) by

    v ≈ (w(t) - w*) / ||w(t) - w*||

where ||.|| denotes the L2 norm. Hence we can adopt the projection

    ξ = v^T (w(t) - w*)

onto the approximated minimum eigenvector as a one-dimensional measure of the distance to the minimum. This distance can be used to control the learning rate:

    w(t+1) = w(t) - η(t) dE(w(t), z(t+1))/dw    (13)
    r(t+1) = (1 - δ) r(t) + δ dE(w(t), z(t+1))/dw    (14)
    η(t+1) = η(t) + α η(t) (β ||r(t+1)|| - η(t))    (15)

where δ controls the leak size of the average, α and β are constants, and r is an auxiliary variable holding a leaky average of the gradient dE/dw. Note that this set of rules is easy to compute and straightforward to implement: we simply have to keep track of one additional vector, the averaged gradient r of Eq. (14), whose norm then controls the size of the learning rate (see Eq. (15)). The algorithm follows a simple intuition: far away from the minimum (where the distance, and hence ||r||, is large), it proceeds in big steps, and close to the minimum it anneals the learning rate.
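A sketch of the three adaptation rules (13)-(15) on a one-dimensional quadratic; the constants α, β, δ are illustrative choices, not prescribed values:

```python
import numpy as np

def adaptive_lr_step(w, r, eta, grad, alpha=0.1, beta=1.0, delta=0.1):
    # The learning rate adaptation rules sketched above:
    #   w(t+1)   = w(t) - eta(t) * grad
    #   r(t+1)   = (1 - delta) * r(t) + delta * grad   (leaky average)
    #   eta(t+1) = eta(t) + alpha * eta(t) * (beta * ||r(t+1)|| - eta(t))
    w = w - eta * grad
    r = (1.0 - delta) * r + delta * grad
    eta = eta + alpha * eta * (beta * np.linalg.norm(r) - eta)
    return w, r, eta

# Toy 1-D quadratic E(w) = 0.5 w^2: far from the minimum ||r|| is large and
# eta grows; near the minimum the gradients shrink and eta is annealed.
w, r, eta = np.array([5.0]), np.zeros(1), 0.05
for _ in range(200):
    w, r, eta = adaptive_lr_step(w, r, eta, grad=w)
print(w, eta)
```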
4.8 Radial Basis Functions vs Sigmoid Units
Although most systems use nodes based on dot products and sigmoids, many other types of units (or layers) can be used. A common alternative is the radial basis function (RBF) network. In RBF networks, the dot product of the weight and input vector is replaced by a Euclidean distance between the input and the weight, and the sigmoid is replaced by an exponential. The output activity is computed, e.g. for one output, as

    g(x) = Σ_i w_i exp( -||x - ν_i||^2 / (2 σ_i^2) )

where ν_i (σ_i) is the mean (standard deviation) of the i-th Gaussian. These units can replace or coexist with the standard units, and they are usually trained by a combination of gradient descent (for the output units) and unsupervised clustering for determining the means and widths of the RBF units.

Unlike sigmoidal units, which can cover the entire space, a single RBF unit covers only a small local region of the input space. This can be an advantage because learning can be faster. RBF units may also form a better set of basis functions to model the input space than sigmoid units, although this is highly problem dependent. On the negative side, the locality property of RBFs may be a disadvantage, particularly in high dimensional spaces, because many units are needed to cover the space. RBFs are more appropriate in (low dimensional) upper layers and sigmoids in (high dimensional) lower layers.
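A sketch of a single RBF output unit as defined above (the centers, widths, and weights are made up):

```python
import numpy as np

def rbf_output(x, w, centers, sigmas):
    # g(x) = sum_i w_i * exp(-||x - nu_i||^2 / (2 sigma_i^2)).
    # A Euclidean distance to each center replaces the dot product, and an
    # exponential replaces the sigmoid; each unit responds only near nu_i.
    d2 = np.sum((centers - x) ** 2, axis=1)
    return w @ np.exp(-d2 / (2.0 * sigmas ** 2))

centers = np.array([[0.0, 0.0], [1.0, 1.0]])  # means nu_i (e.g. from clustering)
sigmas = np.array([0.5, 0.5])                 # widths sigma_i
w = np.array([1.0, -1.0])                     # output weights (gradient-trained)
print(rbf_output(np.array([0.1, 0.0]), w, centers, sigmas))
```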
5 Convergence of Gradient Descent

5.1 A Little Theory
In this section we examine some of the theory behind the tricks presented earlier. We begin in one dimension, where the update equation for gradient descent can be written as

    w(t+1) = w(t) - η dE(w)/dw.    (16)

We would like to know how the value of η affects convergence and the learning speed. Figure 6 illustrates the learning behavior for several different sizes of η when the weight starts out in the vicinity of a local minimum. In one dimension, it is easy to define the optimal learning rate η_opt as the learning rate that will move the weight to the minimum, w_min, in precisely one step (see Figure 6(i)b). If η is smaller than η_opt, then the stepsize will be smaller and convergence will take multiple timesteps. If η is between η_opt and 2η_opt, then the weight will oscillate around w_min but will eventually converge (Figure 6(i)c). If η is more than twice the size of η_opt (Figure 6(i)d), then the stepsize is so large that the weight ends up farther from w_min than before. Divergence results.
[Figure omitted: panels (i)(a)-(d) plot E(ω) versus ω for η < ηopt, η = ηopt, ηopt < η < 2ηopt, and η > 2ηopt; panel (ii) plots dE/dω against ω, with the slope taken between the current weight ωc and ωmin.]
Fig. 6. Gradient descent for different learning rates.
What is the optimal value of the learning rate, η_opt? Let us first consider the case in one dimension. Assuming that E can be approximated by a quadratic function, η_opt can be derived by first expanding E in a Taylor series about the current weight w_c:

    E(w) = E(w_c) + (w - w_c) dE(w_c)/dw + (1/2)(w - w_c)^2 d2E(w_c)/dw2 + ...    (17)

where we use the shorthand dE(w_c)/dw for dE/dw evaluated at w = w_c. If E is quadratic, the second order derivative is constant and the higher order terms vanish. Differentiating both sides with respect to w gives

    dE(w)/dw = dE(w_c)/dw + (w - w_c) d2E(w_c)/dw2.    (18)

Setting w = w_min and noting that dE(w_min)/dw = 0, we are left after rearranging with

    w_min = w_c - (d2E(w_c)/dw2)^(-1) dE(w_c)/dw.    (19)

Comparing this with the update equation (16), we find that we can reach a minimum in one step if

    η_opt = (d2E(w_c)/dw2)^(-1).    (20)

Perhaps an easier way to obtain this same result is illustrated in Figure 6(ii). The bottom graph plots the gradient of E as a function of w. Since E is quadratic, the gradient is simply a straight line with value zero at the minimum and dE(w_c)/dw at the current weight w_c. The second derivative d2E/dw2 is simply the slope of this line and is computed using the standard slope formula

    d2E/dw2 = (dE(w_c)/dw - 0) / (w_c - w_min).    (21)

Solving for w_min then gives equation (19).

While the learning rate that gives fastest convergence is η_opt, the largest learning rate that can be used without causing divergence is (see also Figure 6(i)d)

    η_max = 2 η_opt.    (22)

If E is not exactly quadratic, then the higher order terms in equation (17) are not precisely zero and (19) is only an approximation. In such a case, it may take multiple iterations to locate the minimum even when using η_opt; however, convergence can still be quite fast.
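The one-step property of η_opt (Eq. (20)) and the divergence beyond 2η_opt (Eq. (22)) are easy to check numerically (the curvature and starting point are arbitrary):

```python
import numpy as np

# One-dimensional quadratic E(w) = 0.5 * h * (w - w_min)^2 with curvature h:
# eta_opt = 1/h reaches w_min in exactly one step; eta > 2/h diverges.
h, w_min = 4.0, 3.0
dE = lambda w: h * (w - w_min)

for eta in [0.5 / h, 1.0 / h, 1.5 / h, 2.5 / h]:
    w = 0.0
    for _ in range(10):
        w = w - eta * dE(w)
    print(f"eta = {eta:.3f}: w after 10 steps = {w:.3f}")
```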
In multiple dimensions, determining η_opt is a bit more difficult, because the right-hand side of Eq. (20) becomes a matrix H^(-1), where H is called the Hessian, whose components are given by

    H_ij = d2E / (dw_i dw_j)    (23)

with 1 <= i, j <= N, and N equal to the total number of weights. H is a measure of the curvature of E. In two dimensions, the lines of constant E for a quadratic cost are oval in shape, as shown in Figure 7. The eigenvectors of H point in the directions of the major and minor axes. The eigenvalues measure the steepness of E along the corresponding eigendirection.

Fig. 7. Lines of constant E. [figure omitted: the eigenvectors ν1, ν2 of H point along the principal axes of the elliptical contours in the (ω1, ω2) plane; an arrow at a point P shows the gradient direction.]

Example: In the least mean square (LMS) algorithm, we have a single layer linear network with error function

    E(W) = (1/(2P)) Σ_{p=1..P} Σ_i (d_i^p - y_i^p)^2    (24)

where P is the number of training vectors. The Hessian in this case turns out to be the same as the covariance matrix of the inputs,

    H = (1/P) Σ_{p=1..P} x^p (x^p)^T.    (25)

Thus, each eigenvalue of H is also a measure of the covariance or spread of the inputs along the corresponding eigendirection, as shown in Figure 8.

Fig. 8. For the LMS algorithm, the eigenvectors and eigenvalues of H measure the spread of the inputs (x1, x2) in input space. [figure omitted]
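For the LMS case, the safe learning rate can therefore be read off the input data before training starts; a sketch using Eq. (25) and the bound η < 2/λ_max derived below (the mixing matrix used to correlate the toy inputs is ours):

```python
import numpy as np

# For the LMS cost, the Hessian equals the input covariance matrix
# H = (1/P) * sum_p x^p (x^p)^T, so its eigenvalues are input statistics.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2)) @ np.array([[1.0, 0.0], [0.8, 0.2]])

H = X.T @ X / len(X)
eigvals = np.linalg.eigvalsh(H)
print("eigenvalues:", eigvals)
print("eta_max = 2 / lambda_max =", 2.0 / eigvals.max())
```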

Using a scalar learning rate is problematic in multiple dimensions. We want η to be large so that convergence is fast along the shallow directions of E (small eigenvalues of H); however, if η is too large, the weights will diverge along the steep directions (large eigenvalues of H). To see this more specifically, let us again expand E, but this time about a minimum:

    E(W) = E(W_min) + (1/2)(W - W_min)^T H (W - W_min).    (26)

Differentiating (26) and using the result in the update equation (16) gives

    W(t+1) = W(t) - η grad E(W(t)) = W(t) - η H (W(t) - W_min).    (27)

Subtracting W_min from both sides gives

    (W(t+1) - W_min) = (I - ηH)(W(t) - W_min).    (28)

If the prefactor (I - ηH) is a matrix transformation that always shrinks a vector (i.e. its eigenvalues all have magnitude less than 1), then the update equation will converge.

How does this help with choosing the learning rates? Ideally, we want different learning rates along the different eigendirections. This is simple if the eigendirections are lined up with the coordinate axes of the weights. In such a case, the weights are uncoupled and we can assign each weight its own learning rate based on the corresponding eigenvalue. However, if the weights are coupled, then we must first rotate H so that it becomes diagonal, i.e. so that the coordinate axes line up with the eigendirections (see Figure 7). This is the purpose of diagonalizing the Hessian discussed earlier.

Let Θ be the rotation matrix such that

    Λ = Θ H Θ^T    (29)

where Λ is diagonal and Θ^T Θ = I. The cost function can then be written as

    E(W) = E(W_min) + (1/2)(W - W_min)^T Θ^T Λ Θ (W - W_min).    (30)

Making a change of coordinates to v = Θ(W - W_min) simplifies the above equation to

    E(v) = E(W_min) + (1/2) v^T Λ v    (31)

and the transformed update equation becomes

    v(t+1) = (I - ηΛ) v(t).    (32)
Note that (I - ηΛ) is diagonal with diagonal components 1 - ηλ_i. This equation will converge if |1 - ηλ_i| < 1, i.e. if η < 2/λ_i for all i. If constrained to have a single scalar learning rate for all weights, we must require

    η < 2 / λ_max    (33)

in order to avoid divergence, where λ_max is the largest eigenvalue of H. For fastest convergence we have η_opt = 1/λ_max. If λ_min is a lot smaller than λ_max, then convergence will be very slow along the corresponding shallow eigendirection. In fact, the convergence time is proportional to the condition number λ_max/λ_min, so it is desirable to have as small an eigenvalue spread as possible.

However, since we have rotated H to be aligned with the coordinate axes, (32) actually consists of N independent one-dimensional equations. Therefore, we can choose a learning rate for each weight independently of the others. We see that the optimal rate for the i-th weight v_i is η_opt,i = 1/λ_i.
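A numerical illustration of Eqs. (32) and (33); the eigenvalue spread is an arbitrary choice:

```python
import numpy as np

# In the rotated coordinates v(t+1) = (I - eta * Lambda) v(t): a single
# scalar eta must satisfy eta < 2/lambda_max, and it converges slowly along
# shallow directions, while per-direction rates eta_i = 1/lambda_i converge
# in one step along every eigendirection.
lam = np.array([0.1, 10.0])           # eigenvalue spread (condition number 100)
v = np.array([1.0, 1.0])

v_scalar = v.copy()
eta = 1.0 / lam.max()                 # safe scalar rate, slow along lambda = 0.1
for _ in range(50):
    v_scalar = (1.0 - eta * lam) * v_scalar

v_perdir = (1.0 - (1.0 / lam) * lam) * v   # one step with eta_i = 1/lambda_i
print(v_scalar)   # still far from 0 along the shallow direction
print(v_perdir)   # exactly 0 in both directions
```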






  
5.2 Examples
Example: Linear Network. Figure 9 displays a set of 100 examples drawn from two Gaussian-distributed classes centered at (-0.4, -0.8) and (0.4, 0.8). The eigenvalues of the input covariance matrix are 0.84 and 0.036. We train a single layer linear network with 2 inputs, 1 output, 2 weights, and 1 bias (see Figure 10) using the LMS algorithm in batch mode. Figure 11 displays the weight trajectory and error during learning when using learning rates of η = 1.5 and η = 2.5. Note that the maximal learning rate (see Eq. (33)) is η_max = 2/λ_max = 2/0.84 = 2.38, so divergence results for η = 2.5, as is evident in the figure.
Fig. 9. Two classes drawn from Gaussian distributions centered at (-0.4, -0.8) and (0.4, 0.8). [figure omitted]

Fig. 10. Simple linear network: inputs χ0 and χ1, weights ω0 and ω1, bias ω2, output y. [figure omitted]
Figure 12 shows the same example using stochastic instead of batch mode learning. Here, a learning rate of η = 0.2 is used. One can see that the trajectory is much noisier than in batch mode, since only an estimate of the gradient is used at each iteration. The cost is plotted as a function of epoch.
Fig. 11. Weight trajectory and error curve during learning for (a) η = 1.5 and (b) η = 2.5. [figure omitted: weight-space trajectories and log MSE (dB) versus epochs]
 

An epoch here is simply defined as 100 input presentations, which, for stochastic learning, corresponds to 100 weight updates. In batch mode, an epoch corresponds to one weight update.

Fig. 12. Weight trajectory and error curve during stochastic learning for η = 0.2. [figure omitted]

Fig. 14. Weight trajectories and errors for the minimal multilayer network trained using stochastic learning with η = 0.2. [figure omitted]
Example: Multilayer Network. Figure 13 shows the architecture of a very simple multilayer network. It has 1 input, 1 hidden, and 1 output node, with 2 weights and 2 biases. The activation function is f(x) = 1.7159 tanh((2/3)x). The training set contains 100 examples from each of 2 classes. Both classes are Gaussian distributed with standard deviation 0.4; class 1 has a mean of -1 and class 2 has a mean of +1. Target values are -1 for class 1 and +1 for class 2. Figure 14 shows the stochastic trajectory for this example.
Fig. 13. The minimal multilayer network (weights and biases ω0 through ω3, output y). [figure omitted]

5.3 Input Transformations and Error Surface Transformations Revisited
ä $#Ì ç(*K7$¾57Ò!$šCE$;K7(!I¦5WK 06Ö5EÒ!$šÎ!C7$;<>=U01(*KK7$;Ìr57=?0 ç570Be (*KM57=UÖÝÐçKM$;< $;CE I)0 Ö57Ò!$35EC7=JÌW͊K
GŠ=?KEÌr(9K7K7$;G¾$; C7I?=U$;C;A
*  ')! =%
%1!3 >')
% "3 * #!2')" .%1
ÞRÒ!$DCE$;1KM01’ÖÝ0 CR5EÒ!$6P,0.< $
57CE=?ÌW͒=JK457Ò*65Ù!01!ë$;C70Ӗ$;63=?º5EÒ!$@=U!Î*(Š5a<' C7=J6P*IU$…K
rÌ CE$;.5E$;K$5 ü Uý.ü[û
$=?Ï $;l<. IU(!$1A*ÞRÒ!=JKӖ$; *KR5EÒ!$ÙÌr01*GŠ=U57=?0 #>(!ÓP9$;CÑa=UI?I/P,$]IJ6CEÏ $1-
=ÀA $ A.5EÒ!$
Ì01KM5K7(!C7Ö» Ì$aÑa=UI?I!P,$Ke5E$$Β=?’KM01Ӗ$RG!=UCE$;Ìr57=?0 *K *G–KMÒ96I?IU0.Ñã=U–065EÒ!$CWKK70@57Ò*65


Ìr0 ><1$CEÏ $9Ìr$ÙÑa=?I?IP9$’<1$CEÐÁKMI?0.ÑAcÞRÒ!$’KM01IU(Š5E=U01 =JK:570#K7=UӖÎ!I?ÐÁÎ!CE$Î*C70ŠÌr$…K7KD5EÒ!$–=U!Î*(Š5EK


PlоKM(!P!57CW Ìn5E=U*Ï5EÒ!$=?CӖ$; *K;A
î!0 C–K7=U*Ï I?$:I?=?!$; Ca!$(!CE0 /->57Ò!$$;=UÏ1$>< $…Ìn5701CEK40 Ö/57Ò*$]Ø$;KEK7=?  »Ña=¦5EÒšÓ $…6*KaK7(!PŠL
57CW Ìn5E$;G Î901=Ul5 IU01!Ϛ57Ò!$¾Î!CE=U*Ì=UÎ96I6æ>$…K06ÖR5EÒ!$¾ÌIU01(*Gú06ÖR57CW6=?!=?!Ï#< $…Ìn5E0 CW&K ÝCE$;Ì;6I?I
îv=?Ï (*C7$ * nA>áh!Î*(Š5EK45EÒ*.5RÒ*'< $:ÙIJ6CEÏ $
<' C7=J.5E=U01=?3K7Î!CE$; Gº6I?0 !Ï GŠW= Vž$CE$l5aGŠ=?C7$…Ìn57=?0 9K
06Ö%5EÒ!$D=?!Î!(!5RK7Î* Ì$
Ña=?IUIžÒ*'<1$
]IJ6CEÏ $DÌr0 9GŠ=¦5E=U01º>(!ÓP9$;C4 *GK7IU0.ÑxI?$; C7!=?!Ï9A 9G¾K70
Ñ)$@CE$;Ìr01Ó–Ó $;*G C
A&'
!.0" % %$#!2')"0!3 %1=<>C%A"3 * #!2')" .%1 
 

áOÖ:5EÒ!$Á=?!Î!(Š53<.6CE=? P!IU$…K’6CE$ÁÌr01C7CE$IJ.5E$;Gc-®5EÒ!=?KºÑa=UI?I
*065ºÓ’6Í1$š57Ò!$ $CEC701CKM(!C7Ö» Ì$
KMÎ!Ò*$CE=?Ì;6I[-ŠP!(Š5=U5Ña=UI?I%Î,01KEKM=?P!I?ÐC7$…GŠ(*Ì$:=U5EK$;ÌÌ$l57CE=JÌr=U5eÐ A
&)0 CEC7$;I?657$;G=U!Î*(Š5a<' C7=J6P*IU$…K)(9KM(* IUI?ÐÌ (*K7$D5EÒ!$@$=?Ï $;l<1$;Ìr570 CWK0 Ö 570–P,$:CE065W.5E$;G
'Ñ4'Ð:ÖÝCE0 Ó 57Ò*$Ìr0>0 CWGŠ=?*.5E$).æŠ$…1K Àîˆ=UÏ1(!C7$ (6
<1$CWKM(*K ('P %5EÒ>(*K®Ñ)$;=UÏ1Òl5ˆ(!ΞG!657$;K C7$4!0 5
GŠ$;Ì0 (!Î!I?$;G%A*Û
$…Ìr01(!Î!I?$;G3Ñ)$;=UÏ1Òl5EKRӒ6Í1$D5EÒ!$Áè70 !$@I?$;6CE!=?!ϖCE657$@Î,$CÑ4$=?Ï Òl5EêӖ$r5EÒ!0ŠG
0 Ί5E=UӒ6I[-Š5EÒl(9K-ŠÑ4$@Ò*'< $:57Ò*$@ÖÝ0 I?IU0.Ña=?!Ï57CE=?ÌWSÍ C
%( :'-')%1.02%A% "3 * #!2')" .%1 
0.ÑõKM(!Î*Î90lKM$’5EÒ*.557Ò!$º=U!Î*(Š5<.6CE=J6P!I?$;K0 Ö#!$(*C701¼Ò*'< $’P,$$;çGŠ$;Ì0 CEC7$;I?657$…Gc-


57Ò!$]Ø
$;KEKM=J63ÖÝ0 CR5EÒ!=JK!$;(!C701¾=JKR5EÒ!$#GŠ=J6Ï10 * I%6*G3=U5EKa$;=UÏ1$><.6I?(!$;KRÎ,0 =?l5
6I?0 !Ï 57Ò*$
Ìr0>0 CWGŠ=?*.5E$º.æŠ$;K;A/áhTKM(9ÌWÒç Ì; K7$’57Ò!$3Ï CW G!=U$;15=?K!06557Ò!$3P,$;KM5 GŠ$…K7Ì$l5 GŠ=?CE$;Ìn5E=U01
 K Ì6¼P,$šKM$;$z=?çîˆ=UÏ ('P%A 45Ù5EÒ!$3Î901=Ul5 ®-®6ã6CEC70.ÑõKMÒ*0.ÑK57Ò9.5Ï1CE1GŠ=?$l5GŠ0>$;K
!065@Î901=Ul5@570.ÑR6CWG!K57Ò!$–Ó–=U*=UÓ(!ӚA%Ø0.Ñ4$< $;C;-,=UÖÑ)$ =?*Ke5E$;1G 1K7K7=UÏ1Á$;1ÌWÒÁÑ4$=?Ï Òl5:=U5EK
0.Ña I?$;6CE!=?!Ï3CE657'$ Ý$;Ül(* Iˆ57Ò!$–=?l<1$CWKM$ 06Ö57Ò!$’Ìr01C7CE$;K7Î,0 *GŠ=?!Ͼ$=?Ï $><.6I?(!$ a5EÒ!$å57Ò*$
GŠ$;KEÌr$;15GŠ=?CE$;Ìn5E=U01¾Ña=?I?IcP9$=?º5EÒ!$]GŠ=?C7$…Ìn5E=U01º06ֈ57Ò*$:0 57Ò!$;C6CEC70.Ñì5EÒ*.5aÎ,0 =?l5EKaGŠ=?C7$…Ìn5EIUÐ
570.ÑR6CWG!K)57Ò!$Ӗ=?!=UÓ(!Ó'C
# %  % !'-!% .%12')3C"3C4 ')2% > &' %1!- %1"4&; 
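As a concrete illustration of the first three of these tricks (centering, variance normalization, and decorrelation), a minimal sketch in Python/NumPy might look as follows; the function name and the use of NumPy are our own illustrative assumptions, not part of the original recipe.

    import numpy as np

    def preprocess_inputs(X, eps=1e-12):
        """Center, normalize and decorrelate an (n_samples, n_inputs) matrix."""
        X = X - X.mean(axis=0)              # subtract the means
        X = X / (X.std(axis=0) + eps)       # normalize the variances
        cov = np.cov(X, rowvar=False)       # input covariance matrix
        _, eigvecs = np.linalg.eigh(cov)    # principal axes of the input cloud
        return X @ eigvecs                  # rotate onto them: decorrelated inputs

The fourth trick, one learning rate per weight, is taken up again below in the stochastic diagonal Levenberg Marquardt method.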
Classical second order optimization methods

In the following we briefly introduce the Newton, conjugate gradient, Gauss-Newton, Levenberg Marquardt and Quasi-Newton (BFGS) methods (see also [...]).

Newton algorithm

To get an understanding of the Newton method, let us recapitulate the earlier results. Assuming a quadratic loss function $E$, we can compute the weight update as

$$\Delta\omega = \eta \left(\frac{\partial^2 E}{\partial \omega^2}\right)^{-1} \frac{\partial E}{\partial \omega},$$

where $\eta$ must be chosen in the range $0 < \eta \le 1$, since $E$ is in practice not perfectly quadratic. In this equation the information about the Hessian $H$ is taken into account. If the error function were exactly quadratic, one step would be sufficient to converge.

Usually the energy surface around the minimum is rather ellipsoidal, or in the extreme like a taco shell, depending on the conditioning of the Hessian. A whitening transform, well known from the signal processing literature [...], changes this ellipsoidal shape to a spherical one: with the eigendecomposition $H = U \Lambda U^\top$, the transformed coordinates are $\theta = \Lambda^{1/2} U^\top \omega$ (see the sketch below). So the inverse Hessian in the Newton update basically spheres out the error surface locally. The following two approaches can be shown to be equivalent: (a) use the Newton algorithm in the untransformed weight space, or (b) do ordinary gradient descent in the whitened coordinate system [...].
?>(!ӖӒ6CE=Uë;=U!Ï9->57Ò!$
$ÑR5E0 #6I?Ï 01C7=U57Ò!ÓÌ0 >< $;C7Ï1$;K)=Uš01!$]Ke5E$Κ=UÖv5EÒ!$$CEC701C4ÖÝ(!*ÌnL
57=?0 ç=JK’Ül(* G!CE657=J̺69G Ý(*!IU=?Í $šÏ1CE1GŠ=U$;l5 GŠ$…K7Ì$l 5 Ù=¦5’=JK =U><.6CE=? l5Ña=U57ÒTC7$…KMÎ,$;Ìr55E0
IU=?!$; C 57CW6*KMÖÝ0 CEӒ.57=?0 9KÙ0 Ö57Ò!$š=?!Î!(Š5<1$;Ìr570 CWK;AˆÞRÒ!=JK Ӗ$;69K 57Ò*655EÒ!$#Ìr0 ><1$CEÏ $9Ìr$
57=?Ó $#=JK–!0 5’Q V,$…Ìn5E$;GTP>ÐçKMÒ*=¦Ö×5WK-KEÌ IU=?!ϼ6*GãCE065E657=?0 ã06ÖD=?!Î!(!5< $;Ìr5701CEK;AØ0.Ñ4$< $;C
0 !$¾06Öa57Ò*$ºÓ’6=?çGŠCW'ÑaP* ÌW͊K=JKÙ57Ò*65 6   Ø
$;KEKM=J6zӒ657CE=¦æúÓÙ(*KM5P,$šKe5E0 CE$;G
6*Gš=?>< $C757$…Gc-!ÑaÒ!=JÌWҚ5E Í $…K ÁV aÎ,$C=U57$;CE657=?0 *K
6*Gš=JKa5EÒ!$CE$rÖÝ01C7$=?ӖÎ!CE1Ìn5E=?Ì;6IcÖÝ01C


Ó 01C7$@5EÒ*6# ÖÝ$;Ñ <. C7=J6P!I?$;K;A ?>=?*Ìr$]57Ò!$]$;C7CE0 C4ÖÝ(!9Ìn57=?0 #=JK=?#Ï $;!$CW6Ic*0 ŠLhÜl(* GŠCW.5E=?Ì -

57Ò!$;C7$’=JK!0šÏ (96CW6l57$;$ 0 Ö4Ì0 >< $;C7Ï1$*Ì$ AžáOÖ)5EÒ!$ºØ


$;KEKM=J6å=?K]!065]Î,01K7=¦5E=U<1$GŠ$râ9!=¦5E$ Ý=UÖ
=¦5:Ò* KDKM01Ó $Ùë$CE001C
$<1$#!$;Ï1657=?< $ A=?Ï $;><' IU(*$;KaÑaÒ!$;C7$]57Ò!$$CEC701C
KM(*CMÖ»1Ìr$]=JK 965D01C
KM01Ó $:GŠ=?C7$…Ìn57=?0 9K4 C7$DÌr(!CE< $…GºGŠ0.Ña>ÑR6CWG n- 57Ò!$;57Ò*$
$ÑR5701¾ IUÏ10 CE=¦5EÒ!ÓõÑa=UI?IcGŠ=U<1$CEÏ $1-
KM0 57Ò*$]Ø$;KEK7=?   RÎ,01K7=¦5E=U<1$GŠ$râ**=¦5E$ AžB
ÖÌr01(!CWKM$D57Ò!$]Ø
$;KEKM=J6šÓ’.5EC7=Uæº0 ֈÓ(!I¦5E=¦L
I?'Ð1$C
*$r5eÑ40 CEÍ>K
=?K:=? Ï1$!$;CE I/!0 5:Î,01K7=¦5E=U<1$ G!$râ*!=U57$ $<1$CEÐ>ÑaÒ!$CE$ A,î!0 CD57Ò*$;K7$CE$;1KM01*K

 

57Ò!,$ $;ÑR5701 6I?Ï 01C7=U57Ò*Ó=U =¦5WK@0 CE=UÏ1=U* I/ÖÝ01C7Ó=JK:!065@(*KE6P!I?$ÖÝ0 C@Ï $;!$CW6Iˆ!$(*CE Iˆ*$r5ML
Ñ)01C7͒I?$; C7!=?!Ï9AlØ
0.Ñ)$;< $C4=U5Ï =?< $;K4Ï10l0ŠGº=U*K7=?Ï Òl5EK)ÖÝ0 CG!$< $;IU01Î!=?!ÏÙӖ0 CE$KM01Î!Ò!=JKe5E=?Ì;.5E$;G
6I?Ï 0 CE=U57Ò!ӒK;->1KaGŠ=JK7Ì(*K7K7$;G¾=?¾5EÒ!$@ÖÝ0 I?I?0.Ña=U!Ï9A
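To make the update concrete, here is a minimal sketch of one damped Newton step, assuming the caller supplies an explicit gradient and Hessian (names are illustrative):

    import numpy as np

    def newton_step(w, grad, hessian, eta=1.0):
        """One Newton update: w - eta * H^{-1} grad.

        Solving the linear system avoids forming H^{-1} explicitly.
        For an exactly quadratic error and eta = 1 this reaches the
        minimum in a single step; in practice 0 < eta <= 1 is used
        because E is not perfectly quadratic.
        """
        return w - eta * np.linalg.solve(hessian, grad)

Note that np.linalg.solve fails (or gives a misleading answer) exactly in the problematic case discussed above, when the Hessian is singular or not positive definite.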
[Figure: Sketch of the whitening properties of the Newton algorithm: the Newton algorithm in the original weight space is like gradient descent in the whitened coordinate system.]

Conjugate gradient

There are several important properties of conjugate gradient optimization: (1) it is an $O(N)$ method, (2) it does not use the Hessian explicitly, (3) it attempts to find descent directions that minimally spoil the result achieved in the previous iterations, (4) it uses a line search, and, most importantly, (5) it works only for batch learning.
The third property is illustrated in the sketches below. Assume we pick a descent direction, e.g. the gradient; we then minimize along a line in this direction (line search). Subsequently we should try to find a direction along which the gradient does not change its direction, but merely its length (a conjugate direction), because moving along such a direction will not spoil the result of the previous iteration. The evolution of the descent direction $\rho_k$ at iteration $k$ is given by

$$\rho_k = -\nabla E(\omega_k) + \beta_k\, \rho_{k-1},$$

where the choice of $\beta_k$ can be made either according to Fletcher and Reeves [...],

$$\beta_k = \frac{\nabla E(\omega_k)^\top \nabla E(\omega_k)}{\nabla E(\omega_{k-1})^\top \nabla E(\omega_{k-1})},$$

or Polak and Ribiere,

$$\beta_k = \frac{\big(\nabla E(\omega_k) - \nabla E(\omega_{k-1})\big)^\top \nabla E(\omega_k)}{\nabla E(\omega_{k-1})^\top \nabla E(\omega_{k-1})}.$$

[Figure: Sketch of conjugate gradient directions in a 2D error surface: the first descent direction, the gradients, and the conjugate direction.]

Two directions $\rho_k$ and $\rho_{k-1}$ are defined as conjugate if

$$\rho_k^\top H\, \rho_{k-1} = 0,$$

i.e. conjugate directions are orthogonal directions in the space of an identity Hessian matrix. Very important for convergence under both choices is a good line search procedure.

[Figure: Sketch of conjugate gradient directions in a 2D error surface, showing $\omega_k$ and the directions $\rho_{k-1}$ and $\rho_k$.]

For a perfectly quadratic function with $N$ variables, convergence within $N$ steps can be proved. For non-quadratic functions Polak and Ribiere's choice seems more robust. Conjugate gradient [...] can also be viewed as a smart choice of the momentum term known from neural network training. It has been applied with large success to multi-layer network training on problems that are moderately sized with rather low redundancy in the data. Typical applications range from function approximation and robotic control [...] to time-series prediction and other real-valued problems where high accuracy is wanted. Clearly, on large and redundant (classification) problems stochastic backpropagation is faster. Although attempts have been made to define mini-batches [...], the main disadvantage of conjugate gradient methods remains that they are batch methods (partly due to the precision requirements of the line search procedure).
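For illustration, a minimal batch sketch of the Polak-Ribiere variant follows; loss(w) and grad(w) are assumed batch helpers, and a crude backtracking search stands in for the careful line search the method actually requires:

    import numpy as np

    def conjugate_gradient(w, loss, grad, n_iters=100):
        """Minimal Polak-Ribiere conjugate gradient sketch (batch only)."""
        g = grad(w)
        rho = -g                                   # first direction: steepest descent
        for _ in range(n_iters):
            step = 1.0                             # naive backtracking line search
            while loss(w + step * rho) > loss(w) and step > 1e-10:
                step *= 0.5
            w = w + step * rho
            g_new = grad(w)
            # Polak-Ribiere: beta = (g_new - g)^T g_new / (g^T g);
            # clipping at zero is the common restart safeguard.
            beta = max(0.0, (g_new - g) @ g_new / (g @ g))
            rho = -g_new + beta * rho              # next (approximately conjugate) direction
            g = g_new
        return w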
Quasi-Newton (BFGS)

The Quasi-Newton (BFGS) method (1) iteratively computes an estimate of the inverse Hessian, (2) is an $O(N^2)$ algorithm, (3) requires a line search, and (4) works only for batch learning.

The positive definite estimate of the inverse Hessian is built up directly, without requiring a matrix inversion and using only gradient information. Algorithmically this can be described as follows: (1) first a positive definite matrix $M$ is chosen, e.g. $M = I$; (2) then the search direction is set to

$$\rho(t) = M(t)\, \nabla E(\omega(t));$$

(3) a line search is performed along $\rho$, which gives the update for the parameters at time $t$,

$$\omega(t) = \omega(t-1) - \eta(t)\, \rho(t);$$

and finally (4) the estimate of the inverse Hessian is updated. Compared to the Newton algorithm, the Quasi-Newton approach only needs gradient information.
The most successful Quasi-Newton algorithm is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method. The update rule for the estimate of the inverse Hessian is

$$M(t) = M(t-1) + \left(1 + \frac{\varphi^\top M \varphi}{\delta^\top \varphi}\right) \frac{\delta \delta^\top}{\delta^\top \varphi} - \frac{\delta \varphi^\top M + M \varphi \delta^\top}{\delta^\top \varphi},$$

where $M$ on the right-hand side is evaluated at $t-1$ and the following abbreviations have been used for two $N \times 1$ vectors:

$$\varphi = \nabla E(\omega(t)) - \nabla E(\omega(t-1)), \qquad \delta = \omega(t) - \omega(t-1).$$
Although, as mentioned above, the complexity is only $O(N^2)$, we are still required to store an $N \times N$ matrix, so the algorithm is only practical for small networks with non-redundant training sets. Recently some variants have appeared that aim to reduce the storage requirements (see e.g. [...]).
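The update rule translates directly into code; the following sketch applies one BFGS update of the inverse-Hessian estimate M, with delta and phi as defined above (variable names are our own):

    import numpy as np

    def bfgs_update(M, delta, phi):
        """One BFGS update of the inverse-Hessian estimate M.

        delta = w(t) - w(t-1), phi = grad E(w(t)) - grad E(w(t-1)).
        """
        dp = delta @ phi                # curvature term delta^T phi
        Mp = M @ phi                    # M is symmetric, so phi^T M = Mp^T
        return (M
                + (1.0 + phi @ Mp / dp) * np.outer(delta, delta) / dp
                - (np.outer(delta, Mp) + np.outer(Mp, delta)) / dp)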


Gauss-Newton and Levenberg Marquardt

Gauss-Newton and Levenberg Marquardt algorithms (1) use the square Jacobian approximation, (2) are mainly designed for batch learning, (3) have a complexity of $O(N^3)$, and (4), most important, they work only for mean squared error loss functions. The Gauss-Newton algorithm is like the Newton algorithm, however the Hessian is approximated by the square of the Jacobian (see also the section on computing the Hessian information for further discussion):

$$\Delta\omega = \Big(\sum_p J_p^\top J_p\Big)^{-1} \frac{\partial E}{\partial \omega},$$

where $J_p$ denotes the Jacobian of the network output with respect to the parameters for pattern $p$.
The Levenberg Marquardt method is like the Gauss-Newton method above, but it has a regularization parameter $\mu$ that prevents the update from blowing up if some eigenvalues are small:

$$\Delta\omega = \Big(\sum_p J_p^\top J_p + \mu I\Big)^{-1} \frac{\partial E}{\partial \omega},$$

where $I$ denotes the unit matrix. The Gauss-Newton method is valid for quadratic cost functions; however, a similar procedure also works with the Kullback-Leibler cost and is then called Natural Gradient (see e.g. [...]).
Tricks to compute the Hessian information in multilayer networks
We will now discuss several techniques aimed at computing full or partial Hessian information: (a) the finite difference method, (b) the square Jacobian approximation (for the Gauss-Newton and Levenberg-Marquardt algorithms), (c) the computation of the diagonal of the Hessian, and (d) obtaining the product of the Hessian and a vector without computing the Hessian itself. Other semi-analytical techniques that allow the computation of the full Hessian are omitted here because they are rather complicated and also require many forward/backward propagation steps [...].

Finite difference
We can write the $k$-th row of the Hessian as

$$\frac{\partial}{\partial \omega_k}\!\left(\frac{\partial E}{\partial \omega}\right) \approx \frac{1}{\epsilon}\left[\frac{\partial E}{\partial \omega}(\omega + \epsilon\,\varphi_k) - \frac{\partial E}{\partial \omega}(\omega)\right],$$

where $\varphi_k = (0, \dots, 0, 1, 0, \dots, 0)^\top$ is a vector of zeros with a single one at the $k$-th position. This can be implemented with a simple recipe: (1) compute the total gradient by multiple forward and backward propagation steps, (2) add $\epsilon$ to the $k$-th parameter and compute the gradient again, and finally (3) subtract both results and divide by $\epsilon$. Due to numerical errors in this computation scheme, the resulting Hessian might not be perfectly symmetric; in this case it should be symmetrized, i.e. replaced by $\frac{1}{2}(H + H^\top)$.
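The three-step recipe reads almost verbatim in code; grad_fn, an assumed helper returning the full-batch gradient, does the forward/backward propagation:

    import numpy as np

    def finite_difference_hessian(w, grad_fn, eps=1e-5):
        """Hessian via finite differences of the gradient, then symmetrized."""
        n = len(w)
        g0 = grad_fn(w)                           # (1) total gradient
        H = np.empty((n, n))
        for k in range(n):
            w_pert = w.copy()
            w_pert[k] += eps                      # (2) perturb the k-th parameter
            H[k] = (grad_fn(w_pert) - g0) / eps   # (3) difference quotient
        return 0.5 * (H + H.T)                    # compensate numerical asymmetry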


 
Square Jacobian approximation for the Gauss-Newton and Levenberg-Marquardt algorithms


Assuming a mean squared cost function

$$E = \frac{1}{2} \sum_p \big(y_p - f(\omega, x_p)\big)^2,$$

the gradient is

$$\frac{\partial E}{\partial \omega} = -\sum_p \frac{\partial f}{\partial \omega}\,\big(y_p - f(\omega, x_p)\big),$$

and the Hessian follows as

$$\frac{\partial^2 E}{\partial \omega^2} = \sum_p \frac{\partial f}{\partial \omega} \frac{\partial f}{\partial \omega}^{\!\top} - \sum_p \frac{\partial^2 f}{\partial \omega^2}\,\big(y_p - f(\omega, x_p)\big).$$

A simplifying approximation of the Hessian is the square of the Jacobian, a positive semi-definite matrix of dimension $N \times N$:

$$\frac{\partial^2 E}{\partial \omega^2} \approx \sum_p \frac{\partial f}{\partial \omega} \frac{\partial f}{\partial \omega}^{\!\top},$$

where the second term above was dropped. This is equivalent to assuming that the network is a linear function of the parameters $\omega$. Again this is readily implemented. For the $k$-th row of the Jacobian, and for all training patterns: (1) we forward propagate, then (2) set the "activity" of the output units to 0 and only the $k$-th output to 1, and (3) a backpropagation step is taken and the gradient is accumulated.
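Both the Jacobian accumulation just described and the resulting Gauss-Newton / Levenberg Marquardt step of the previous section can be sketched compactly; forward and backprop_to_params are assumed helpers that run the network on a pattern and backpropagate an arbitrary seed vector from the outputs to the parameters:

    import numpy as np

    def square_jacobian_hessian(patterns, forward, backprop_to_params,
                                n_outputs, n_params):
        """Gauss-Newton approximation sum_p J_p^T J_p of the Hessian."""
        H = np.zeros((n_params, n_params))
        for x in patterns:
            forward(x)                            # (1) forward propagate
            for k in range(n_outputs):
                seed = np.zeros(n_outputs)
                seed[k] = 1.0                     # (2) all outputs 0, k-th output 1
                J_row = backprop_to_params(seed)  # (3) backprop: one Jacobian row
                H += np.outer(J_row, J_row)       # accumulate the outer products
        return H

    def levenberg_marquardt_step(w, H_approx, grad, mu=1e-3):
        """One (batch) Levenberg Marquardt update; mu = 0 gives Gauss-Newton."""
        return w - np.linalg.solve(H_approx + mu * np.eye(len(w)), grad)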


Backpropagating second derivatives

Let us consider a multi-layer system built from functional blocks with $N_{in}$ inputs, $N_{out}$ outputs and $N$ parameters, each of the form $y = f(\omega, x)$. Now assume we knew $\partial^2 E / \partial y^2$, which is an $N_{out} \times N_{out}$ matrix. Then it is straightforward to compute the corresponding matrix for the inputs:

$$\frac{\partial^2 E}{\partial x^2} = \left(\frac{\partial f}{\partial x}\right)^{\!\top} \frac{\partial^2 E}{\partial y^2}\, \frac{\partial f}{\partial x} + \frac{\partial E}{\partial y}\, \frac{\partial^2 f}{\partial x^2}.$$

We can drop the second term, and the resulting estimate of the Hessian is positive semi-definite. A further reduction is achieved if we ignore all but the diagonal terms of $\partial^2 E / \partial y^2$:

$$\frac{\partial^2 E}{\partial x_i^2} \approx \sum_k \left(\frac{\partial y_k}{\partial x_i}\right)^{\!2} \frac{\partial^2 E}{\partial y_k^2}.$$

A similar derivation can be done to obtain the $N \times N$ matrix $\partial^2 E / \partial \omega^2$.

Backpropagating the diagonal Hessian in neural nets

Backpropagation procedures for computing the diagonal Hessian are well known [...]. It is assumed that each unit in the network has the functional form $y_i = f(a_i)$ with $a_i = \sum_j \omega_{ij} x_j$ (see the left panel of the figure below for the sigmoidal network). Using the Gauss-Newton approximation (dropping the term that contains $f''$), we obtain

$$\frac{\partial^2 E}{\partial a_i^2} \approx f'(a_i)^2\, \frac{\partial^2 E}{\partial y_i^2}, \qquad \frac{\partial^2 E}{\partial \omega_{ij}^2} \approx \frac{\partial^2 E}{\partial a_i^2}\, x_j^2, \qquad \frac{\partial^2 E}{\partial x_j^2} \approx \sum_i \omega_{ij}^2\, \frac{\partial^2 E}{\partial a_i^2}.$$

With a Gaussian nonlinearity and $a = \frac{1}{2}\lVert \omega - x \rVert^2$, as in the RBF networks of the right panel, we obtain analogously

$$\frac{\partial^2 E}{\partial \omega_j^2} \approx \frac{\partial^2 E}{\partial a^2}\,(\omega_j - x_j)^2, \qquad \frac{\partial^2 E}{\partial x_j^2} \approx \frac{\partial^2 E}{\partial a^2}\,(\omega_j - x_j)^2.$$

The cost of computing the diagonal second derivatives by running these equations from the last layer down to the first is essentially the same as that of the regular backpropagation pass used for the gradient, except that the squares of the weights are used in the weighted sums. This technique is applied in the "optimal brain damage" pruning procedure (see [21]).

[Figure: Backpropagating the diagonal Hessian: sigmoidal units $y = f(\sum_j \omega_j x_j)$ (left) and RBF units with $a = \frac{1}{2}\lVert \omega - x \rVert^2$ (right).]

  
Computing the product of the Hessian and a vector

In many methods that make use of the Hessian, the Hessian appears exclusively in products with a vector. Interestingly, there is a way of computing such products without going through the trouble of computing the Hessian itself. The finite difference method can fulfill this task for an arbitrary vector $\Psi$:

$$H\,\Psi \approx \frac{1}{\alpha}\left[\frac{\partial E}{\partial \omega}(\omega + \alpha\Psi) - \frac{\partial E}{\partial \omega}(\omega)\right],$$

using only two gradient computations (at the points $\omega$ and $\omega + \alpha\Psi$ respectively), which can be readily done with backprop ($\alpha$ is a small constant). This method can be applied to compute the principal eigenvector and eigenvalue of $H$ by the power method: iterating

$$\Psi(t+1) = \frac{H\,\Psi(t)}{\lVert \Psi(t) \rVert},$$

the vector $\Psi(t)$ will converge to the largest eigenvector of $H$ and $\lVert \Psi(t) \rVert$ to the corresponding eigenvalue [...]. See also [...] for an even more accurate method that (1) does not use finite differences and (2) has similar complexity.
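Both the finite-difference product and the power iteration are easy to sketch, again with an assumed full-batch gradient helper grad_fn:

    import numpy as np

    def hessian_vector_product(w, psi, grad_fn, alpha=1e-4):
        """H psi from two gradient evaluations, without forming H."""
        return (grad_fn(w + alpha * psi) - grad_fn(w)) / alpha

    def principal_eigenpair(w, grad_fn, n_params, n_iters=100):
        """Power method on top of the finite-difference product."""
        psi = np.random.randn(n_params)
        for _ in range(n_iters):
            # Psi(t+1) = H Psi(t) / ||Psi(t)||
            psi = hessian_vector_product(w, psi / np.linalg.norm(psi), grad_fn)
        return psi / np.linalg.norm(psi), np.linalg.norm(psi)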
Analysis of the Hessian in multi-layer networks
It is interesting to understand how some of the tricks shown previously influence the Hessian, i.e. how the Hessian changes with the architecture and the details of the implementation. Typically, the eigenvalue distribution of the Hessian looks like the one sketched in the histogram figure below: a few small eigenvalues, many medium ones, and a few very large ones. We will now argue that the large eigenvalues (the "big killers") cause the trouble in the training process, because they arise from [...]:

- non-zero mean inputs or neuron states [22],
- wide variations of the second derivatives from layer to layer,
- correlations between state variables.

To exemplify this, we show the eigenvalue distribution of a network trained on OCR data (see the spectrum figure below). Clearly, there is a wide spread of eigenvalues, and we observe that the ratio between e.g. the first and the eleventh eigenvalue is about 8. The long tail of the eigenvalue distribution is rather painful, because the ratio between the largest and the smallest eigenvalue gives the conditioning of the learning problem. A large ratio corresponds to a big difference between the axes of the ellipsoidally shaped error function: the larger the ratio, the more we find taco-shell shaped minima, which are extremely steep towards the small axis and very flat along the long axis.
Another general characteristic of the Hessian in multi-layer networks is the spread between the layers. A rough sketch (see the figure on the multilayered architecture below) shows how the shape of the Hessian varies from being rather flat in the first layer to being quite steep in the last layer.

[Figure: Eigenvalue spectrum (log10 of the eigenvalues versus eigenvalue order) of a 4-layer shared-weights network trained on 320 handwritten digits; the ratio between the 1st and the 11th eigenvalues is 8.]

This affects the learning speed and provides one ingredient to explain the slow learning in the lower layers and the fast (sometimes oscillating) learning in the last layer. A trick to compensate for this different scale of learning is to use the inverse diagonal Hessian to control the per-weight learning rates (see also the stochastic diagonal Levenberg Marquardt method below).
Applying second order methods to multilayer networks
Before we concentrate in this section on how to tailor second order techniques for training large networks, let us first repeat some rather pessimistic facts about applying classical second order methods. Techniques using full Hessian information (Gauss-Newton, Levenberg-Marquardt and BFGS) can only be applied to very small networks trained in batch mode, but those small networks are not the ones that need speeding up the most. Most second order methods (conjugate gradient, BFGS, ...) require a line search and can therefore not be used in the stochastic mode. Many of the tricks discussed previously apply only to batch learning. From our experience we know that a carefully tuned stochastic gradient descent is hard to beat on large classification problems. For smaller problems that require accurate real-valued outputs, like function approximation or control problems, we find that conjugate gradient (with the Polak-Ribiere choice above) offers the best combination of speed, reliability and simplicity. Several attempts using "mini batches" in applying conjugate gradient to large and redundant problems have been made recently [...]. A variant of conjugate gradient optimization (called scaled CG) also seems interesting: there the line search procedure is replaced by a one-dimensional Levenberg Marquardt type algorithm [...].

[Figure: Sketch of a typical eigenvalue distribution of the Hessian (number of eigenvalues versus eigenvalue magnitude): a few small eigenvalues, many medium ones, and a few very large ones, the "big killers".]

[Figure: Multilayered architecture: the second derivative is often smaller in lower layers.]
A stochastic diagonal Levenberg Marquardt method
To obtain a stochastic version of the Levenberg Marquardt algorithm, the idea is to compute the diagonal Hessian through a running estimate of the second derivative with respect to each parameter. The instantaneous second derivatives can be obtained via backpropagation, as shown in the formulas of the previous section. As soon as we have those running estimates, we use them to compute individual learning rates for each parameter,

$$\eta_{ki} = \frac{\eta}{\big\langle \partial^2 E / \partial \omega_{ki}^2 \big\rangle + \mu},$$

where $\eta$ denotes the global learning rate and $\langle \partial^2 E / \partial \omega_{ki}^2 \rangle$ is a running estimate of the diagonal second derivative with respect to $\omega_{ki}$. The parameter $\mu$ prevents $\eta_{ki}$ from blowing up in case the second derivative is small, i.e. when the optimization moves in flat parts of the error function. The running estimate is computed as

$$\Big\langle \frac{\partial^2 E}{\partial \omega_{ki}^2} \Big\rangle_{\mathrm{new}} = (1 - \gamma)\Big\langle \frac{\partial^2 E}{\partial \omega_{ki}^2} \Big\rangle_{\mathrm{old}} + \gamma\, \frac{\partial^2 E_p}{\partial \omega_{ki}^2},$$

where $\gamma$ is a small constant that determines the amount of memory being used. The second derivatives can be computed prior to training, over e.g. a subset of the training set. Since they change only very slowly, they need to be reestimated only every few epochs. Note that the additional cost over regular backpropagation is negligible, and convergence is, as a rule of thumb, about three times faster than with a carefully tuned stochastic gradient algorithm.
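A sketch of the running estimate and the resulting per-parameter rates; the instantaneous diagonal second derivatives d2E_inst (obtained by the backpropagation formulas above) are assumed to be given:

    import numpy as np

    def update_running_hessian(h_avg, d2E_inst, gamma=0.01):
        """Leaky average: <h>_new = (1 - gamma) <h>_old + gamma * h_inst."""
        return (1.0 - gamma) * h_avg + gamma * d2E_inst

    def per_parameter_rates(h_avg, eta=0.1, mu=0.01):
        """eta_i = eta / (<d2E/dw_i^2> + mu); mu keeps the rate bounded
        where the second derivative is small (flat regions)."""
        return eta / (h_avg + mu)

    # a weight update then reads, for the gradient g of the current pattern:
    #     w -= per_parameter_rates(h_avg) * g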
In the two figures below we see the convergence of the stochastic diagonal Levenberg Marquardt method for a toy example with two different sets of learning rates. Obviously the experiment shown in the first figure contains fewer fluctuations than the one in the second, due to the smaller learning rates.
  
Computing the principal eigenvalue/vector of the Hessian

In the following we give three tricks for computing the principal eigenvalue and eigenvector of the Hessian without having to compute the Hessian itself. Remember that earlier we also introduced a method to approximate the smallest eigenvector of the Hessian (again without having to compute the Hessian) through averaging (see also [...]).
[Figure: Stochastic diagonal Levenberg Marquardt algorithm with learning rates η0 = 0.12, η1 = 0.03, η2 = 0.02; largest Hessian eigenvalue λmax = 0.84, maximum admissible batch learning rate ηmax = 2.38. Dataset drawn from 2 Gaussians with 100 examples; the network has one linear unit, 2 inputs and 1 output, i.e. three parameters (2 weights, 1 bias).]
[Figure: Stochastic diagonal Levenberg Marquardt algorithm with learning rates η0 = 0.76, η1 = 0.18, η2 = 0.12; largest Hessian eigenvalue λmax = 0.84, maximum admissible batch learning rate ηmax = 2.38. Same dataset and network as in the previous figure.]
Power method. We repeat the result of our earlier discussion: starting from a random initial vector $\Psi$, the iteration

$$\Psi(t+1) = \frac{H\,\Psi(t)}{\lVert \Psi(t) \rVert}$$

will eventually converge to the principal eigenvector (or a vector in the principal eigenspace), and $\lVert \Psi(t) \rVert$ will converge to the corresponding eigenvalue [...].
Taylor expansion. Another method makes use of the fact that small perturbations of the gradient also lead to the principal eigenvector:

$$\Psi(t+1) = \frac{1}{\alpha}\left[\frac{\partial E}{\partial \omega}\!\left(\omega + \alpha\,\frac{\Psi(t)}{\lVert \Psi(t) \rVert}\right) - \frac{\partial E}{\partial \omega}(\omega)\right],$$

where $\alpha$ is a small constant. One iteration of this procedure requires two forward and two backward propagation steps for each pattern in the training set.
Online computation. The following rule makes use of a running average to obtain the largest eigenvalue of the average Hessian very fast:

$$\Psi(t+1) = (1 - \gamma)\,\Psi(t) + \frac{\gamma}{\alpha}\left[\frac{\partial E_p}{\partial \omega}\!\left(\omega + \alpha\,\frac{\Psi(t)}{\lVert \Psi(t) \rVert}\right) - \frac{\partial E_p}{\partial \omega}(\omega)\right].$$
To summarize, the eigenvalue/vector computation proceeds as follows:

1. a random vector $\Psi$ is chosen for the initialization,
2. an input pattern is presented with its desired output, a forward and backward propagation step is performed, and the gradients $G(\omega)$ are stored,
3. $\alpha\,\Psi / \lVert \Psi \rVert$ is added to the current weight vector $\omega$,
4. a forward and backward propagation step is performed with the perturbed weight vector, and the gradients $G'(\omega)$ are stored,
5. the difference $\frac{1}{\alpha}\big(G'(\omega) - G(\omega)\big)$ is computed and the running average of the eigenvector is updated,
6. we loop from (2) to (5) until a reasonably stable result is obtained for $\Psi$,
7. the optimal learning rate is then given as $\eta_{\mathrm{opt}} = 1 / \lVert \Psi \rVert$.

The sketch below spells these steps out in code.
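Here grad_on_pattern(w, p), one forward plus one backward pass for a single pattern, is an assumed helper:

    import numpy as np

    def online_optimal_rate(w, patterns, grad_on_pattern,
                            alpha=1e-2, gamma=0.05, n_passes=1):
        """Online estimate of the largest eigenvalue of the average Hessian;
        returns the predicted optimal learning rate 1/|psi|."""
        psi = np.random.randn(len(w))                           # step 1
        for _ in range(n_passes):                               # step 6: loop
            for p in patterns:
                g = grad_on_pattern(w, p)                       # step 2
                w_pert = w + alpha * psi / np.linalg.norm(psi)  # step 3
                g_pert = grad_on_pattern(w_pert, p)             # step 4
                # step 5: running average of the eigenvector
                psi = (1.0 - gamma) * psi + (gamma / alpha) * (g_pert - g)
        return 1.0 / np.linalg.norm(psi)                        # step 7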
In the figure below we see the evolution of the eigenvalue estimate as a function of the number of pattern presentations for a neural network in a handwritten character recognition task. In practice we adapt the leak size $\gamma$ of the running average in order to get fewer fluctuations (as also indicated in the figure). One can see that after only a few hundred pattern presentations the correct order of magnitude for the eigenvalue, i.e. for the learning rate, is reached. From the experiments we also observe that the fluctuations of the average Hessian over training are small.
[Figure: Evolution of the eigenvalue estimate as a function of the number of pattern presentations, for leak sizes γ = 0.1, 0.03, 0.01, 0.003, in a shared-weights network with 5 layers, 64,638 connections and 1,278 free parameters. The training set consists of 1000 handwritten digits.]
In the two figures below we start from the same initial conditions and perform a fixed number of epochs with learning rates obtained by multiplying the predicted optimal learning rate by a predefined constant. Choosing a constant of 1 (i.e. using the predicted optimal rate itself) always gives residual errors which are very close to the error achieved by the best choice of the constant. In other words, the "predicted optimal rate" is optimal enough.

2.5

2
MEAN SQUARED ERROR

1.5

1 epoch

2 epochs
1
3 epochs
4 epochs
5 epochs
0.5

0
0 0.250.50.75 1 1.251.51.75 2 2.252.52.75 3 3.253.53.75 4

LEARNING RATE
PREDICTED OPTIMAL LEARNING RATE
› _W\;jzg .>\be_W†z_7bhbed;b’\g’\š°J1jlfEseikd…jçd;°seo1_ºbM\rsei¦d {Š_EsÀŸ®_W_Wjz‚k_n\rbejlikjl^ÁbM\se_¾\;jl†
qL<beM7_nN †1¹ik fEse*¹_W† d;q1seikZ]\;‚‚k_n\rbejlikjl^ºbM\se_ °?d;bÙ\’°Jl‚k‚“”ÁfWd;jljl_EfEse_n†Ájl_EsÀŸ®d;be‰  ¤;£ #Ž;  ~n 7©ˆu®o1_
shbM\ikjlikjl^@gh_7s)fEd…jlghikgOsegd;°/Ž;…:ol\;j>†6Ÿbei“shse_Wj’†1ik^;iksegW©
[Figure: Mean squared error after 1 to 5 epochs as a function of the ratio between the learning rate and the predicted optimal learning rate, for a shared-weights network with 5 layers (1024 x 1568 x 392 x 400 x 100 x 10), 64,638 (local) connections and 1,278 free parameters (shared weights). The training set consists of 1000 handwritten digits.]
Discussion and conclusion


According to the recommendations mentioned above, a practitioner facing a multi-layer neural net training problem would go through the following steps:

- shuffle the examples,
- center the input variables by subtracting the mean,
- normalize the input variables to a standard deviation of 1,
- if possible, decorrelate the input variables,
- pick a network with the sigmoid function discussed earlier,
- set the target values within the range of the sigmoid, typically +1 and -1,
- initialize the weights to random values as prescribed earlier.
The preferred method for training the network should be picked as follows:

- if the training set is large (more than a few hundred samples) and redundant, and if the task is classification, use stochastic gradient with careful tuning, or use the stochastic diagonal Levenberg Marquardt method;
- if the training set is not too large, or if the task is regression, use conjugate gradient.
Classical second-order methods are impractical in almost all useful cases. The non-linear dynamics of stochastic gradient descent in multi-layer neural networks, particularly as it pertains to generalization, is still far from being well understood. More theoretical work and more systematic experimental work are needed.
Acknowledgement

G. B. O. and K.-R. M. gratefully acknowledge mutual exchange grants from DAAD and NSF.
References

1. S. Amari. Neural learning in structured parameter spaces - natural Riemannian gradient. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems, volume 9, page 127. The MIT Press, 1997.
2. S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276, 1998.
3. R. Battiti. First- and second-order methods for learning: Between steepest descent and Newton's method. Neural Computation, 4:141-166, 1992.
4. S. Becker and Y. LeCun. Improving the convergence of back-propagation learning with second order methods. In D. Touretzky, G. Hinton, and T. Sejnowski, editors, Proceedings of the 1988 Connectionist Models Summer School, pages 29-37. Lawrence Erlbaum Associates, 1989.
5. C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.
6. L. Bottou. Online algorithms and stochastic approximations. In D. Saad, editor, Online Learning in Neural Networks (1997 Workshop at the Newton Institute). The Newton Institute Series, Cambridge University Press, 1998.
7. D. S. Broomhead and D. Lowe. Multivariable function interpolation and adaptive networks. Complex Systems, 2:321-355, 1988.
8. W. L. Buntine and A. S. Weigend. Computing second order derivatives in feed-forward networks: A review. IEEE Transactions on Neural Networks, 1993.
9. C. Darken and J. Moody. Note on learning rate schedules for stochastic optimization. In R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 832-838. Morgan Kaufmann, 1991.
10. K. I. Diamantaras and S. Y. Kung. Principal Component Neural Networks. Wiley, New York, 1996.
11. R. Fletcher. Practical Methods of Optimization, chapter 8.7: Polynomial time algorithms, pages 183-188. John Wiley & Sons, New York, 1987.
12. S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1-58, 1992.
13. L. Goldstein. Mean square optimality in the continuous time Robbins Monro procedure. Technical Report DRB-306, Dept. of Mathematics, University of Southern California, Los Angeles, 1987.
14. G. H. Golub and C. F. Van Loan. Matrix Computations, 2nd edition. Johns Hopkins University Press, Baltimore, 1989.
15. T. M. Heskes and B. Kappen. On-line learning processes in artificial neural networks. In J. G. Taylor, editor, Mathematical Approaches to Neural Networks, pages 199-233. Elsevier, 1993.
16. R. A. Jacobs. Increased rates of convergence through learning rate adaptation. Neural Networks, 1:295-307, 1988.
17. A. H. Kramer and A. Sangiovanni-Vincentelli. Efficient parallel learning algorithms for neural networks. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 1, pages 40-48. Morgan Kaufmann, San Mateo, 1989.
18. Y. LeCun. Modèles connexionnistes de l'apprentissage. PhD thesis, Université Pierre et Marie Curie, Paris, 1987.
19. Y. LeCun. Generalization and network design strategies. In R. Pfeifer, Z. Schreter, F. Fogelman, and L. Steels, editors, Connectionism in Perspective, Amsterdam, 1989. Elsevier. Proceedings of the International Conference Connectionism in Perspective, University of Zürich, 10-13 October 1988.
20. Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2. Morgan Kaufmann, 1990.
21. Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain damage. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 598-605. Morgan Kaufmann, 1990.
22. Y. LeCun, I. Kanter, and S. A. Solla. Second order properties of error surfaces. In Advances in Neural Information Processing Systems 3. San Mateo, CA, 1991.
23. Y. LeCun, P. Y. Simard, and B. Pearlmutter. Automatic learning rate maximization by on-line estimation of the Hessian's eigenvectors. In Giles, Hanson, and Cowan, editors, Advances in Neural Information Processing Systems 5. Morgan Kaufmann, San Mateo, CA, 1993.
24. M. Møller. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6:525-533, 1993.
25. M. Møller. Supervised learning on large redundant training sets. International Journal of Neural Systems, 4(1):15-25, 1993.
26. J. Moody and C. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1:281-294, 1989.
27. N. Murata. PhD thesis, University of Tokyo, 1992 (in Japanese).
28. N. Murata, K.-R. Müller, A. Ziehe, and S. Amari. Adaptive on-line learning in changing environments. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems, volume 9, page 599. The MIT Press, 1997.
29. A. V. Oppenheim and R. W. Schafer. Digital Signal Processing. Prentice Hall, Englewood Cliffs, 1975.
30. G. B. Orr. Dynamics and Algorithms for Stochastic Learning. PhD thesis, Oregon Graduate Institute, 1995.
31. G. B. Orr. Removing noise in on-line search using adaptive batch sizes. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems, volume 9, page 232. The MIT Press, 1997.
32. M. J. L. Orr. Regularization in the selection of radial basis function centers. Neural Computation, 7(3):606-623, 1995.
33. B. A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6:147-160, 1994.
34. W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes in C: The Art of Scientific Programming. Cambridge University Press, Cambridge, England, 1988.
35. D. Saad, editor. Online Learning in Neural Networks (1997 Workshop at the Newton Institute). The Newton Institute Series, Cambridge University Press, Cambridge, 1998.
36. D. Saad and S. A. Solla. Exact solution for on-line learning in multilayer neural networks. Physical Review Letters, 74:4337-4340, 1995.
37. H. Sompolinsky, N. Barkai, and H. S. Seung. On-line learning of dichotomies: algorithms and learning curves. In J.-H. Oh, C. Kwon, and S. Cho, editors, Neural Networks: The Statistical Mechanics Perspective, pages 105-130. World Scientific, Singapore, 1995.
38. R. S. Sutton. Adapting bias by gradient descent: An incremental version of delta-bar-delta. In W. Swartout, editor, Proceedings of the 10th National Conference on Artificial Intelligence, pages 171-176, San Jose, CA, July 1992. MIT Press.
39. P. van der Smagt. Minimisation methods for training feed-forward networks. Neural Networks, 7(1):1-11, 1994.
40. V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.
41. V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
42. A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-37:328-339, 1989.
43. W. Wiegerinck, A. Komoda, and T. Heskes. Stochastic dynamics of learning with momentum in neural networks. Journal of Physics A, 27:4425-4437, 1994.
44. H. H. Yang and S. Amari. The efficiency and the robustness of natural gradient descent learning rule. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, Advances in Neural Information Processing Systems, volume 10. The MIT Press, 1998.
